Parallel reduce on a random access range. Except as otherwise noted,
usage is similar to std.algorithm.iteration._reduce. There is
also fold which does the same thing with a different parameter
order.
This function works by splitting the range to be reduced into work
units, which are slices to be reduced in parallel. Once the results
from all work units are computed, a final serial reduction is performed
on these results to compute the final answer. Therefore, care must be
taken to choose the seed value appropriately.
Because the reduction is being performed in parallel, functions
must be associative. For notational simplicity, let # be an
infix operator representing functions. Then, (a # b) # c must equal
a # (b # c). Floating point addition is not associative
even though addition in exact arithmetic is. Summing floating
point numbers using this function may give different results than summing
serially. However, for many practical purposes floating point addition
can be treated as associative.
Note that, since functions are assumed to be associative,
additional optimizations are made to the serial portion of the reduction
algorithm. These take advantage of the instruction level parallelism of
modern CPUs, in addition to the thread-level parallelism that the rest
of this module exploits. This can lead to better than linear speedups
relative to std.algorithm.iteration._reduce, especially for
fine-grained benchmarks like dot products.
An explicit seed may be provided as the first argument. If
provided, it is used as the seed for all work units and for the final
reduction of results from all work units. Therefore, if it is not the
identity value for the operation being performed, results may differ
from those generated by std.algorithm.iteration._reduce or
depending on how many work units are used. The next argument must be
the range to be reduced.
// Find the sum of squares of a range in parallel, using// an explicit seed.//// Timings on an Athlon 64 X2 dual core machine://// Parallel reduce: 72 milliseconds// Using std.algorithm.reduce instead: 181 millisecondsautonums = iota(10_000_000.0f);
autosumSquares = taskPool.reduce!"a + b"(
0.0, std.algorithm.map!"a * a"(nums)
);
If no explicit seed is provided, the first element of each work unit
is used as a seed. For the final reduction, the result from the first
work unit is used as the seed.
// Find the sum of a range in parallel, using the first// element of each work unit as the seed.autosum = taskPool.reduce!"a + b"(nums);
An explicit work unit size may be specified as the last argument.
Specifying too small a work unit size will effectively serialize the
reduction, as the final reduction of the result of each work unit will
dominate computation time. If TaskPool.size for this instance
is zero, this parameter is ignored and one work unit is used.
// Use a work unit size of 100.autosum2 = taskPool.reduce!"a + b"(nums, 100);
// Work unit size of 100 and explicit seed.autosum3 = taskPool.reduce!"a + b"(0.0, nums, 100);
Parallel reduce supports multiple functions, like
std.algorithm.reduce.
// Find both the min and max of nums.autominMax = taskPool.reduce!(min, max)(nums);
assert(minMax[0] == reduce!min(nums));
assert(minMax[1] == reduce!max(nums));
Exception Handling:
After this function is finished executing, any exceptions thrown
are chained together via Throwable.next and rethrown. The chaining
order is non-deterministic.
Parallel reduce on a random access range. Except as otherwise noted, usage is similar to std.algorithm.iteration._reduce. There is also fold which does the same thing with a different parameter order.
This function works by splitting the range to be reduced into work units, which are slices to be reduced in parallel. Once the results from all work units are computed, a final serial reduction is performed on these results to compute the final answer. Therefore, care must be taken to choose the seed value appropriately.
Because the reduction is being performed in parallel, functions must be associative. For notational simplicity, let # be an infix operator representing functions. Then, (a # b) # c must equal a # (b # c). Floating point addition is not associative even though addition in exact arithmetic is. Summing floating point numbers using this function may give different results than summing serially. However, for many practical purposes floating point addition can be treated as associative.
Note that, since functions are assumed to be associative, additional optimizations are made to the serial portion of the reduction algorithm. These take advantage of the instruction level parallelism of modern CPUs, in addition to the thread-level parallelism that the rest of this module exploits. This can lead to better than linear speedups relative to std.algorithm.iteration._reduce, especially for fine-grained benchmarks like dot products.
An explicit seed may be provided as the first argument. If provided, it is used as the seed for all work units and for the final reduction of results from all work units. Therefore, if it is not the identity value for the operation being performed, results may differ from those generated by std.algorithm.iteration._reduce or depending on how many work units are used. The next argument must be the range to be reduced.
If no explicit seed is provided, the first element of each work unit is used as a seed. For the final reduction, the result from the first work unit is used as the seed.
An explicit work unit size may be specified as the last argument. Specifying too small a work unit size will effectively serialize the reduction, as the final reduction of the result of each work unit will dominate computation time. If TaskPool.size for this instance is zero, this parameter is ignored and one work unit is used.
Parallel reduce supports multiple functions, like std.algorithm.reduce.
Exception Handling:
After this function is finished executing, any exceptions thrown are chained together via Throwable.next and rethrown. The chaining order is non-deterministic.