C# Parallel For Loop

Parallel vs Regular For Loop

Run Time Comparison between Sequential For Loop and Parallel For Loop

Parallel For Loop vs Regular For Loop

"This video demonstrates the C# Parallel For loop is three times faster than a Regular For loop on the task of counting to 1 Billion, 10 Times. The Parallel For loop saturates all four cores for 100% total CPU usage as compared to 33% CPU usage for the Regular For loop."

Run times were recorded for the Regular For loop, followed by the Parallel For loop. The C# code used in this demonstration is listed below.


The Experiment

A Regular (sequential) For loop and a Parallel For loop were both coded in C# using Visual Studio 2012. The Stopwatch class was used to measure the elapsed times. Both For loops performed the task of counting to 1 Billion in the inner loop, with the outer loop executing 10 times. The performance of the processor and its individual cores was shown using the All CPU Meter gadget from AddGadgets.com along with the Windows 7 resource monitor. I also used the CPUID utility to obtain information about the CPU and to monitor the temperatures of the CPU cores. The screen was captured and encoded with Microsoft Expression Encoder 4 Pro into an .mp4 format. The final editing of the video was performed with Microsoft Movie Maker.

The Parallel.For method is part of the Task Parallel Library (TPL), which supports data parallelism. Data parallelism partitions the source data so that multiple threads can operate on different segments concurrently. In this experiment the Parallel For was used to parallelize the outer loop, which executed 10 times. The inner loop incremented a variable from 1 to 1 billion.
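As a minimal sketch of the partitioning described above (not the code from the video), the simplest Parallel.For overload can record which managed thread ran each of the 10 outer iterations. The class name PartitionDemo and the RecordThreads helper are illustrative assumptions, and the number of distinct threads reported will vary from machine to machine and run to run.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class PartitionDemo
{
    // Runs `iterations` loop bodies in parallel and returns, for each index,
    // the managed thread ID that executed it.
    public static int[] RecordThreads(int iterations)
    {
        int[] threadIds = new int[iterations];

        // Each iteration writes only to its own slot, so no locking is needed.
        Parallel.For(0, iterations, i =>
        {
            threadIds[i] = Thread.CurrentThread.ManagedThreadId;
        });

        return threadIds;
    }

    static void Main()
    {
        int[] ids = RecordThreads(10);
        for (int i = 0; i < ids.Length; i++)
            Console.WriteLine("Iteration " + i + " ran on thread " + ids[i]);
        Console.WriteLine("Distinct threads used: " + ids.Distinct().Count());
    }
}
```

On a four-core machine the 10 iterations would typically be spread over several threads, which is what lets the outer loop keep more than one core busy.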

The Parallel For loop in this experiment used a thread-local variable for each task. When the tasks were complete, the values from the thread-local variables were used to create the final result. A generic version of the Parallel For was used with a type parameter of <long>. The first two parameters of the Parallel For are the beginning (inclusive) and ending (exclusive) iteration values. The third parameter initializes the local state; in this code it initialized the thread-local variable to zero. The fourth parameter used a lambda expression to define the loop logic. The fifth parameter defined a method that is called once for each task, after that task has finished all of its iterations, and receives that task's final local value. Note that the Interlocked class was used so that multiple threads could safely add their subtotals to the work variable. In this experiment, ordinary unsynchronized addition on the work variable resulted in a value of only 2,500,000,000, or 1/4 of the true 1E10 work value. The Interlocked.Add method was required to add the values from each thread into the final total.

Parallel.For<long>(0, 10, () => 0, (myInt, loop, subtotal) =>
{
    for (long myLong = 0; myLong < 1E9; myLong++) subtotal++;
    return subtotal;
},
(x) => Interlocked.Add(ref work, x));
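To make the roles of the five parameters easier to see, here is a scaled-down sketch of the same pattern written with named arguments. The argument names (fromInclusive, toExclusive, localInit, body, localFinally) are the parameter names of the Parallel.For<TLocal> overload; the class name ThreadLocalSum, the Run helper, and the reduced inner count are my own assumptions for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ThreadLocalSum
{
    // Counts to `innerCount` in each of `outerCount` parallel iterations
    // and returns the combined total.
    public static long Run(int outerCount, long innerCount)
    {
        long work = 0;

        Parallel.For<long>(
            fromInclusive: 0,
            toExclusive: outerCount,
            // Called once per task to create that task's private subtotal.
            localInit: () => 0L,
            // Called once per iteration; the subtotal is passed in and returned.
            body: (i, loopState, subtotal) =>
            {
                for (long n = 0; n < innerCount; n++) subtotal++;
                return subtotal;
            },
            // Called once per task; atomically merges its subtotal into work.
            localFinally: subtotal => Interlocked.Add(ref work, subtotal));

        return work;
    }

    static void Main()
    {
        // 10 iterations x 1,000,000 counts
        Console.WriteLine(Run(10, 1000000)); // prints 10000000
    }
}
```

Because each task increments only its own subtotal, the hot inner loop needs no synchronization; the only shared write is the single Interlocked.Add per task at the end.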

This experiment was performed more than ten times with nearly identical results. When the workload was increased from 1E10 to 1E11, the Parallel For loop was over 3.5 times faster than the Regular For loop. It would be interesting to vary the work type, the workload, and the Parallel For parameter values to determine the degree of processing improvement in different scenarios.
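One possible starting point for such a follow-up, offered only as a sketch and not something measured in this experiment, is to cap the number of concurrent tasks with ParallelOptions.MaxDegreeOfParallelism and time the same counting workload at each setting. The class name DegreeSweep, the Count helper, and the reduced inner count of 10,000,000 are assumptions chosen so the sweep finishes quickly.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class DegreeSweep
{
    // Runs the counting workload with at most `maxThreads` concurrent tasks.
    public static long Count(int outerCount, long innerCount, int maxThreads)
    {
        long work = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        Parallel.For<long>(0, outerCount, options, () => 0L,
            (i, loopState, subtotal) =>
            {
                for (long n = 0; n < innerCount; n++) subtotal++;
                return subtotal;
            },
            subtotal => Interlocked.Add(ref work, subtotal));

        return work;
    }

    static void Main()
    {
        // Workload reduced from the article's 1E9 per outer iteration.
        const long innerCount = 10000000;

        for (int threads = 1; threads <= Environment.ProcessorCount; threads++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            long work = Count(10, innerCount, threads);
            sw.Stop();
            Console.WriteLine("threads=" + threads + " work=" + work +
                " elapsed=" + sw.Elapsed.TotalSeconds.ToString("F2") + " s");
        }
    }
}
```

The work value should be the same at every setting; only the elapsed time is expected to change as more cores are allowed to participate.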

C# Code Used in Demonstration Video

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace ParallelTests
{
    class Program
    {
        static void Main(string[] args)
        {
            long work = 0;

            // Start timing before the loop under test
            Stopwatch stopWatch = new Stopwatch();
            stopWatch.Start();

            #region Regular For Loop Region
            // Uncomment this region (and comment out the parallel region) to time the sequential loop
            //Console.WriteLine("Regular for Loop is running ...\n");
            //for (int myInt = 0; myInt < 10; myInt++)
            //    for (long myLong = 0; myLong < 1E9; myLong++) { work++; }
            #endregion

            #region Parallel For Loop Region
            Console.WriteLine("Parallel for Loop is running ...\n");
            Parallel.For<long>(0, 10, () => 0, (myInt, loop, subtotal) =>
                {
                    // Each task counts into its own thread-local subtotal
                    for (long myLong = 0; myLong < 1E9; myLong++) subtotal++;
                    return subtotal;
                },
                // Atomically add each task's subtotal to the shared total
                (x) => Interlocked.Add(ref work, x)
            );
            #endregion

            stopWatch.Stop();

            // Calculate and Display Execution Time
            TimeSpan ts = stopWatch.Elapsed;
            string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                ts.Hours, ts.Minutes, ts.Seconds,
                ts.Milliseconds / 10);
            Console.WriteLine("RunTime " + elapsedTime + "\n");

            // Display Work Value
            Console.WriteLine("Work is: " + work.ToString("E", CultureInfo.InvariantCulture));
        }
    }
}
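The listing above times one loop per run, with the other commented out. As a hedged alternative, the following sketch times both loops in a single run and prints the ratio, which is how a figure such as "three times faster" could be computed. The class name SpeedupDemo, the two helper methods, and the reduced inner count are assumptions and are not the code used in the video.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class SpeedupDemo
{
    // Sequential version: nested for loops incrementing one counter.
    public static long Sequential(int outerCount, long innerCount)
    {
        long work = 0;
        for (int i = 0; i < outerCount; i++)
            for (long n = 0; n < innerCount; n++) work++;
        return work;
    }

    // Parallel version: thread-local subtotals merged with Interlocked.Add.
    public static long Parallelized(int outerCount, long innerCount)
    {
        long work = 0;
        Parallel.For<long>(0, outerCount, () => 0L,
            (i, loopState, subtotal) =>
            {
                for (long n = 0; n < innerCount; n++) subtotal++;
                return subtotal;
            },
            subtotal => Interlocked.Add(ref work, subtotal));
        return work;
    }

    static void Main()
    {
        // Workload reduced from the article's 1E9 per outer iteration.
        const long innerCount = 100000000;

        Stopwatch sw = Stopwatch.StartNew();
        long seqWork = Sequential(10, innerCount);
        double seqSeconds = sw.Elapsed.TotalSeconds;

        sw.Restart();
        long parWork = Parallelized(10, innerCount);
        double parSeconds = sw.Elapsed.TotalSeconds;

        Console.WriteLine("Sequential: work=" + seqWork + " time=" + seqSeconds.ToString("F2") + " s");
        Console.WriteLine("Parallel:   work=" + parWork + " time=" + parSeconds.ToString("F2") + " s");
        Console.WriteLine("Speedup:    " + (seqSeconds / parSeconds).ToString("F2") + "x");
    }
}
```

Both work values should match, and the speedup printed will depend on the core count, the build configuration, and whatever else the machine is doing at the time.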

The Run Time Environment

Core Temperatures Displaying in Task Bar

CoreTemp Running with Individual Core Temperatures Displayed in Task Bar

Saturating the cores for a prolonged period of time can increase the core temperatures, especially if the CPU is overclocked. The test computer contains an AMD Phenom II 940 BE overclocked from 3.0 to 3.5 GHz. I use the CoreTemp utility to display the individual core temperatures in the task bar and also have its overheat alarm set to 70 °C. I monitor the core temperatures when the computer is under a heavy load.

Before starting the experiment I cleaned the radiator on the CPU cooler for the first time in about three years. Afterward, I noticed the core temperatures had dropped a few degrees, especially under a heavy load. Wiki How to Do Anything has a good article, "How to Build a Powerful Quiet Computer", that covers the principles of computer cooling.

CPU Cooler

Provide Adequate CPU Cooling as Saturated Cores Run Hot