Parallel Processing and Multithreading

Almost everyone today has a computer with more than 1 core.  Those of us who run time-intensive applications, such as simulations, are frustrated in having to wait for results that should be gotten much sooner had all the cores been fully engaged on the task.  Many of us have had the experience of looking at Windows Task Manager and seeing just one processor working at 100% and all the other processors doing pretty much nothing.  For years I have been wondering whether I would be able to speed up QRS II to use all of those wasted processors.

About 3 years ago I embarked on a project to do just that.  I knew that Windows wasn’t really designed for parallel processing, but I also knew that it could run parallel “threads”.  My compiler, Delphi, has a standard object called TThread that could be used to accomplish a parallel task.  TThread is not the easiest programming feature to implement, and the most adventurous people have used it for an occasional background task or two.  Could I get TThread to do parallel processing?

TThread is rather simple in theory.  As soon as a TThread object is created (I literally use the “create” method in TThread to do this), an object of TThread runs a user-defined method, “execute”, which contains a task.  When the task is done, the TThread object is told to destroy itself.  Code within “execute” can share data arrays and subroutines with the main program, which has considerable advantages.  Here is the TThread “execute” that I use to calculate a gradients (direction of steepest ascent; a gradient is made up of a bunch of partial derivatives) for whole table LS static refinement.

procedure TTripsQGradThread.Execute;

var

  found: boolean;

  i, j: integer;

begin

  for i:= rb to re do

   for j:= rb to re do

     begin

      ThreadCriticalSection.Acquire;

      found:= TripsToDoList[i,j];

      if found then

        begin

          TripsToDoList[i, j]:= false;

          ThreadCriticalSection.Release;

          tripsgradhold[i, j]:= tgradientq(i, j);

        end

        else

          ThreadCriticalSection.Release;

    end;

  Terminate;

end;

 

TTripsQGradThread is a descendent (that is, as customized variation) of the stock TThread.  In this code, rb is the lowest zone number and re is the highest zone number.  The thread appears to be moving through every possible OD pair and storing the calculated partial derivatives in an array called tripsgradhold.  If this were the only thread running, it would calculate all the needed partial derivatives.  However, it is not the only thread running.  This thread is acting greedily in competition with all the other threads that are running.  There is an array, TripsToDoList, that keeps track of which derivatives have been claimed by the various threads.  The thread registers its claim before it starts the calculation so that no other thread can come along and attempt to do the same thing.

All code running in a thread must be “thread safe”.  That is, the thread must not change any values that may be needed at the same time by any other thread or the main program.   This rule obviously applies to “execute”, but it also applies to any subroutine that “execute” might initiate, such as tgradientq and its subroutines.  Threads do not need to be entirely self-contained, but they cannot share any variables or working arrays with other threads.  Unfortunately, the TripsToDoList is a working array that must be shared.  Microsoft anticipated this problem and gave us a work-around in the object TThreadCriticalSection.  An instance of TThreadCriticalSection does one and only one thing:  it freezes execution of all threads until the current thread gets done with its critical task, in this case updating the TripsToDoList.

You might ask:  Why did you decide not to divide the tasks evenly across all the threads ahead of time?  There are two reasons.  First, each thread will not necessarily run at the same rate and tasks may vary hugely in how much work is required.  If I create exactly as many threads as there are processors, some of those processers may be tied up with other things the computer needs to do and would not run as efficiently.  We really don’t want that already busy processor holding up the whole simulation until it finishes with its pre-assigned tasks.

The threads are like a bunch of pigeons eating a trough of grain.  I raised pigeons as a child and I learned that each pigeon has a distinct personality.  Some pigeons are more aggressive, some are slower and some are always hungrier.  If I didn’t provide enough grain they would consume the whole trough rather quickly, but not every pigeon would consume the same amount.  All the birds will remain at the trough until it was clear they would get no more grain. 

When writing threaded code it is important keep tabs on anything that might be inefficient.  For example, the number of lines between acquiring and releasing a TThreadCriticalSection must be as small as possible.  We don’t want two threads to write to the same file, because we would need to stop the execution of all but one thread at a time to do so, and file writes take a comparatively long time.  We certainly do not want many more threads than processors.  And we don’t want to create and destroy threads too frequently, because each creation also takes a comparatively long time.

Had I known all of this in 1986 when I first started writing QRS II, making the whole program run faster would have been a heck of a lot easier.  However, there are now sections of QRS II that would be very difficult to untangle without introducing the possibility new bugs in routines that have been working wonderfully for years.   Thus, I focused on those tasks that have become most time consuming in recent years, particularly path building and OD table refinement.   These parts of the software represent between 75% and 95% of the computational effort.  The remaining parts of QRS II (with the exception of area calculations for area-spread traffic assignments) run in a single thread.  Thus, computation time is approximately:

Time = a + b/(number of processors)

So in an admittedly best case scenario, a run that originally took 100 minutes on a single processor computer would take just:

Time = 5 + 95/8 = 16.9 minutes

on a computer with 8 processors.  Theoretically, this relationship should break down as the number of processors increase to the point when thread management overhead becomes noticeable, but I have not seen this point yet in computers with up to 12 processors.

So what counts as a processor?  Intel’s Core i5 and i7 CPUs come with “hypertheading”.  That is, they are capable of running two threads per core.  My laptop computer has an i5 chip with 2 cores and 2 threads per core.  Is this two or four processors?  Windows 7 counts this chip as having four processors and my empirical tests confirm it.  Multithreaded sections of code in QRS II 8 do indeed run four times faster than the same single-threaded portions of QRS II 7.

This is all very nice, but there are three downsides to multithreading.  First, QRS II will monopolize all the processing power of its computer.  Without some fiddling, you won’t be able to do even the simplest of other jobs, such as browsing the net, reading e-mail or writing a memo.  Even Solitaire will be sticky.  Second, QRS II can cause your computer to run very hot.  You may not want to run QRS II in your office on a hot summer day if you don’t have great air conditioning.  Third, separate working arrays for each thread can gobble up memory.

I have three suggestions if the computer becomes too hot or two unmanageable.  One, keep your simulations on one computer and do all of your other tasks on a second computer.  If this cannot be done, then (two) downgrade the number of threads QRS II is allowed to spawn.  You can find this setting on the Run Controls dialog box.  Third, decrease QRS II’s priority in Windows’ Task Manager.  You can safely ignore any warning about terrible things happening if you change QRS II’s priority.

For your next computer, buy the fastest multicore, multithreaded, Windows-based computer you can find.  Don’t settle for some lame computer that has been preselected for you by your IT department.  I would look seriously at computers built specifically for gamers, such as Alienware.  And, you did not hear it from me, some computers permit overclocking that can boost speed even further.  However, keep in mind that overclocking can void your warrantee and possibly fry a $1200 CPU chip if you are too aggressive with your settings.  As for the amount of RAM, QRS II is still a 32-bit application and cannot use more than 4 GB under Windows 7.  So at this time 6 GB should be enough to handle any QRS II run and a reasonable number of background tasks.  Theoretically, a memory bottleneck can form that can slow down QRS II (I have not actually seen this yet), so I suggest installing the fastest RAM available.

When running QRS II 8 you will notice some subtle and not so subtle differences from QRS II 7.  QRS II 8 creates many more temporary files because of the need to keep data from all threads separate, although the total amount of hard drive space is similar.  When QRS II shows which path it is building, it skips some centroid numbers.  QRS II skips numbers because specifically-created threads cannot update QRS II’s main window, so updates only occur when the main thread (which is also doing its own path building) gets a chance to perform the update.  The centroid number that is displayed at any moment is the highest that any thread is currently working on.     

Alan Horowitz, March 27, 2011