reduce.reduce

Parallel reduce on a random access range. Except as otherwise noted, usage is similar to std.algorithm.iteration._reduce. There is also fold which does the same thing with a different parameter order.

This function works by splitting the range to be reduced into work units, which are slices to be reduced in parallel. Once the results from all work units are computed, a final serial reduction is performed on these results to compute the final answer. Therefore, care must be taken to choose the seed value appropriately.

Because the reduction is being performed in parallel, functions must be associative. For notational simplicity, let # be an infix operator representing functions. Then, (a # b) # c must equal a # (b # c). Floating point addition is not associative even though addition in exact arithmetic is. Summing floating point numbers using this function may give different results than summing serially. However, for many practical purposes floating point addition can be treated as associative.

Note that, since functions are assumed to be associative, additional optimizations are made to the serial portion of the reduction algorithm. These take advantage of the instruction level parallelism of modern CPUs, in addition to the thread-level parallelism that the rest of this module exploits. This can lead to better than linear speedups relative to std.algorithm.iteration._reduce, especially for fine-grained benchmarks like dot products.

An explicit seed may be provided as the first argument. If provided, it is used as the seed for all work units and for the final reduction of results from all work units. Therefore, if it is not the identity value for the operation being performed, results may differ from those generated by std.algorithm.iteration._reduce or depending on how many work units are used. The next argument must be the range to be reduced.

// Find the sum of squares of a range in parallel, using
// an explicit seed.
//
// Timings on an Athlon 64 X2 dual core machine:
//
// Parallel reduce:                     72 milliseconds
// Using std.algorithm.reduce instead:  181 milliseconds
auto nums = iota(10_000_000.0f);
auto sumSquares = taskPool.reduce!"a + b"(
    0.0, std.algorithm.map!"a * a"(nums)
);

If no explicit seed is provided, the first element of each work unit is used as a seed. For the final reduction, the result from the first work unit is used as the seed.

// Find the sum of a range in parallel, using the first
// element of each work unit as the seed.
auto sum = taskPool.reduce!"a + b"(nums);

An explicit work unit size may be specified as the last argument. Specifying too small a work unit size will effectively serialize the reduction, as the final reduction of the result of each work unit will dominate computation time. If TaskPool.size for this instance is zero, this parameter is ignored and one work unit is used.

// Use a work unit size of 100.
auto sum2 = taskPool.reduce!"a + b"(nums, 100);

// Work unit size of 100 and explicit seed.
auto sum3 = taskPool.reduce!"a + b"(0.0, nums, 100);

Parallel reduce supports multiple functions, like std.algorithm.reduce.

// Find both the min and max of nums.
auto minMax = taskPool.reduce!(min, max)(nums);
assert(minMax[0] == reduce!min(nums));
assert(minMax[1] == reduce!max(nums));

Exception Handling:

After this function is finished executing, any exceptions thrown are chained together via Throwable.next and rethrown. The chaining order is non-deterministic.

template reduce(functions...)
reduce
(
Args...
)
(
Args args
)

See Also

fold is functionally equivalent to _reduce except the range parameter comes first and there is no need to use `tuple` for multiple seeds.

Meta