Summary

The work library is intended to simplify the use of multithreading in the context of our software ecosystem.

This library is intended as a thin abstraction layer on top of a multithreading subsystem. The abstraction serves two purposes:

To simplify the use of common constructs like "Parallel For"
To centralize our dependency on a particular multithreading subsystem (e.g., TBB, etc.).

Because of the way multithreading subsystems work and because of the way they need to interact with each other in managing system resources, it is not generally practical for each client to use whatever threading system they like (e.g., TBB for one client, OpenMP for another).

Initializing and Limiting Multithreading

The library defaults to maximum concurrency, i.e. it will attempt to use as many threads as available on the system. The default concurrency limit is established at static initialization time. If granular thread limits is enabled on the libwork backend (as described in Providing an Alternate Work Implementation) then the PXR_WORK_THREAD_LIMIT environment variable can be set to further limit concurrency, such as for example in a farm environment. PXR_WORK_THREAD_LIMIT must be set to an integer N, and it is up to your implementation to interpret this value given its thread limiting granularity. In the default TBB-based backend, setting PXR_WORK_THREAD_LIMIT to N denotes one of the following:

0 - maximum concurrency (default if unset)
1 - single-threaded mode
positive N - limit to N threads (clamped to number of hardware threads available)
negative N - limit to all but N hardware threads (clamped to 1)

If granular thread limits is not enabled then the PXR_WORK_THREAD_LIMIT environment variable can be set to allow maximum concurrency or run the program serially. PXR_WORK_THREAD_LIMIT must be set to an integer N, denoting one of the following:

0 - maximum concurrency (default if unset)
1 - single-threaded mode
positive N - maximum concurrency
negative N - maximum concurrency (if absolute value of N equals or exceeds max physical concurrency, clamped to 1)

The concurrency limit can be set programmatically, using for example:

WorkSetConcurrencyLimitArgument(N);

WorkSetConcurrencyLimitArgument

WORK_API void WorkSetConcurrencyLimitArgument(int n)

Sanitize n as described below and set the concurrency limit accordingly.

or

WorkSetMaximumConcurrencyLimit();

WorkSetMaximumConcurrencyLimit

WORK_API void WorkSetMaximumConcurrencyLimit()

Set the concurrency limit to be the maximum recommended for the hardware on which it's running.

It is preferable to use WorkSetMaximumConcurrencyLimit() when the desire to use the hardware to its fullest rather than specify the maximum concurrency limit manually.

Simple "Parallel For" Example

Once you've initialized the library, you can now harness the awesome power of your multi-core machine. Here's a simple example of a Parallel For.

static void _DoubleTheValues(size_t begin, size_t end, std::vector<int> *v)
{
    for (size_t i = begin; i < end; ++i)
        (*v)[i] *= 2;
}
 
 
static void DoubleInParallel(std::vector<int> *v)
{
    WorkParallelForN(v->size(), std::bind(&_DoubleTheValues, _1, _2, v));
}

You can avoid the std::bind and provide your own functor object as well.

Providing an Alternate Work Implementation

You can provide your own work backend that uses your preferred dispatching system, instead of TBB's task/task_group API, by building your own library that implements the APIs described in the following sections:

Note: In each of the subsections below we list the required API and also outline specific behaviors that the implementations must follow. For general requirements please refer to the API docs under pxr/base/work. For more information on building and linking an alternate work backend to USD please refer to BUILDING.md.

Parallel Algorithms API

template <typename C>
void WorkImpl_ParallelSort(C* container);
 
template <typename C, typename Compare>
void WorkImpl_ParallelSort(C* container, const Compare& comp);
 
template <typename Fn>
void WorkImpl_ParallelForN(size_t n, Fn &&callback, size_t grainSize);
 
template <typename RangeType, typename Fn>
void WorkImpl_ParallelForTBBRange(const RangeType &range, Fn &&callback);
 
template <typename InputIterator, typename Fn>
void WorkImpl_ParallelForEach(InputIterator first, InputIterator last, Fn &&fn);
 
template <typename Fn, typename Rn, typename V>
V WorkImpl_ParallelReduceN(const V &identity,
                           size_t n,
                           Fn &&loopCallback,
                           Rn &&reductionCallback);

Note that the work abstraction contains a serial implementation of these functions so you only need to provide a concurrent implementation. We do not require that your work backend must include TBB to interoperate with TBB range types, and so we have provided a WorkParallelForTBBRange implementation built on the WorkDispatcher. If you do wish to supply your own WorkImpl_ParallelForTBBRange then you must also define WORK_IMPL_HAS_PARALLEL_FOR_TBB_RANGE in your implementation.

Concurrency Limiting API

unsigned WorkImpl_GetPhysicalConcurrencyLimit();
 
void WorkImpl_InitializeThreading(int threadLimit);
 
bool WorkImpl_SupportsGranularThreadLimits();
 
unsigned WorkImpl_GetConcurrencyLimit();
 
unsigned WorkImpl_SetConcurrencyLimit();

If the implementation can support granular thread limits (limiting the concurrency to any value other than 1 and maximum concurrency), you must set WorkImpl_SupportsGranularThreadLimits accordingly. If granular thread limits are supported, it is up to you to define what "granular" entails.
testWorkThreadLimits only checks if the implementation supports some level of granularity i.e. checking if the thread limit has been set to some value less than or equal to whichever is greater: the physical concurrency limit given by WorkImpl_GetPhysicalConcurrencyLimit or the requested thread limit. You are responsible for further testing the granularity of your implementation's thread limiting.

In the non granular case, the implementation should respect the behavior outlined in Initializing and Limiting Multithreading.

You can implement WorkImpl_InitializeThreading if any thread limiting object needs to be eagerly initialized if PXR_WORK_THREAD_LIMIT is set.

Dispatching Tasks API

template <class Fn>
void WorkImpl_RunDetachedTask(Fn &&fn);
 
template <class Fn>
auto WorkImpl_WithScopedParallelism(Fn &&fn);
 
class WorkImpl_Dispatcher {
    WorkImpl_Dispatcher();
 
    ~WorkImpl_Dispatcher() noexcept;
 
    WorkImpl_Dispatcher(WorkImpl_Dispatcher const &) = delete;
    WorkImpl_Dispatcher &operator=(WorkImpl_Dispatcher const &) = delete;
 
    template <class Callable>
    void Run(Callable &&c);
 
    void Reset();
    void Wait();
    void Cancel();
}

When running a detached task, you must ensure that the program does not end while a detached task could still be running. An example of how this could be done is in /pxr/extras/usd/examples/workTaskflowExample/detachedTask.cpp.

When executing with scoped parallelism, the callable fn must be executed in the same calling thread as WorkImpl_WithScopedParallelism.

On top of the documented requirements for WorkDispatcher, you must ensure that WorkImpl_Dispatcher can execute serially and that Run(Callable &&c) will still return immediately when single threaded. It is common for tasks to spawn more tasks for the same dispatcher, so if the implemention executes the callable in place, then the program runs a risk of causing a stack overflow due to the now recursive nature of these nested calls.

Caveats of an Alternate Work Backend.

Currently OpenUSD and especially OpenExec take advantage of the low level control over work stealing and scheduling that TBB provides to optimize our code; however not all work dispatching systems may provide that same functionality. If an alternate backend is not able to implement the WorkImpl_IsolatingDispatcher, the abstraction will default to the WorkImpl_Dispatcher, and if so it is possible that the performance of OpenUSD and OpenExec may suffer.