loop_tool Environment Reference
===============================

``loop_tool`` is an experimental intermediate representation of N-dimensional
data computation. It defines a lightweight, highly portable, and restrictive IR
that can be lowered to both GPU and CPU backends.

CompilerGym exposes the ``loop_tool`` IR for reinforcement learning through a
:class:`LoopToolEnv` environment.

.. contents:: Overview:
    :local:

.. _Installation:

Installation
------------

It is highly recommended that you install CUDA to use ``loop_tool``. The package
works with most recent versions of CUDA. See NVIDIA's CUDA Toolkit installation
page for details.

Datasets
--------

We provide a program that performs a simple point-wise addition. Two datasets
are provided, one for CPU and one for GPU. The benchmarks within these datasets
are generated by selecting an arbitrary size between 1 and 8x2^20 (~8M). To set
a specific size, use the trailing term in the benchmark URI. For GPU with size
1024:

.. code-block:: python

    >>> import gym
    >>> import compiler_gym  # registers the CompilerGym environments with gym
    >>> env = gym.make("loop_tool-v0")
    >>> env.reset(benchmark="benchmark://loop_tool-cuda-v0/1024")

The CPU variant can be created with:

.. code-block:: python

    >>> env.reset(benchmark="benchmark://loop_tool-cpu-v0/1024")

Note that although the action spaces are the same, the underlying hardware is
dramatically different. GPU threads can (and should) be used heavily, whereas
CPU threading often shows little to no benefit at sizes greater than the number
of available virtual cores. Further, the CPU backend should be viewed as an
interpreted fallback and, unlike the GPU backend, cannot reach near-peak
performance.

Observation Spaces
------------------

Three observation spaces are provided for tuning the compiled result.

+--------------------------+-------------+
| Observation space        | Type        |
+==========================+=============+
| flops                    | ``Float64`` |
+--------------------------+-------------+
| loop_tree                | ``String``  |
+--------------------------+-------------+
| action_state             | ``Int64[]`` |
+--------------------------+-------------+

``flops`` benchmarks the program and returns the average achieved gigaflops per
run. Computing this observation includes program compilation and benchmark
warmup, so it may take some time.

``loop_tree`` returns a string representation of the current loop tree
associated with the program. The string represents the computation in a
platform-agnostic way.

``action_state`` returns the cursor information for the current environment.
The cursor information is composed of three values, in order: current loop
index, current loop size, current loop tail.

For example, given an ``action_state`` of [0, 341, 1] and the ``loop_tree``
below, the cursor points to the first (outermost) loop:

.. code-block:: none

    for a in 341 r 1 : L0 [thread]  <-- cursor
     for a' in 3 : L1
      for a'' in 1 : L2
       %0[a] <- read()
      for a'' in 1 : L4
       %1[a] <- read()
      for a'' in 1 : L6
       %2[a] <- add(%0, %1)
      for a'' in 1 : L8
       %3[a] <- write(%2)

If the ``action_state`` were [1, 3, 0], we would know the cursor is pointing to
the child of the first loop:

.. code-block:: none

    for a in 341 r 1 : L0 [thread]
     for a' in 3 : L1  <-- cursor
      for a'' in 1 : L2
       %0[a] <- read()
      for a'' in 1 : L4
       %1[a] <- read()
      for a'' in 1 : L6
       %2[a] <- add(%0, %1)
      for a'' in 1 : L8
       %3[a] <- write(%2)

When ``action_state`` is [2, 1, 0], the cursor simultaneously points to all
innermost loops. This is an artifact of the innermost loop always being unrolled
when the ``loop_tree`` is generated:

.. code-block:: none

    for a in 341 r 1 : L0 [thread]
     for a' in 3 : L1
      for a'' in 1 : L2  <-- cursor
       %0[a] <- read()      |
      for a'' in 1 : L4  <--+
       %1[a] <- read()      |
      for a'' in 1 : L6  <--+
       %2[a] <- add(%0, %1) |
      for a'' in 1 : L8  <--+
       %3[a] <- write(%2)

Action Spaces
-------------

Currently, only a "simple" action space is available. It can be understood as
control over a cursor that has two modes: either the cursor moves between loops,
or it is frozen in place and changes the size of the selected loop.

+-------------------+----------------------------------------------------------------------------------+
| Action            | Description                                                                      |
+===================+==================================================================================+
| ``toggle_mode``   | Switches between moving the cursor and resizing the loop selected by the cursor  |
+-------------------+----------------------------------------------------------------------------------+
| ``up``            | Either shifts the cursor inward or increases the size of the selected loop by 1  |
+-------------------+----------------------------------------------------------------------------------+
| ``down``          | Either shifts the cursor outward or decreases the size of the selected loop by 1 |
+-------------------+----------------------------------------------------------------------------------+
| ``toggle_thread`` | Toggles the threading parameter of the selected loop                             |
+-------------------+----------------------------------------------------------------------------------+

The default state for the benchmark we have been looking at is:

.. code-block:: none

    for a in 1024 : L0 [thread]
     for a' in 1 : L1
      for a'' in 1 : L2
       %0[a] <- read()
      for a'' in 1 : L4
       %1[a] <- read()
      for a'' in 1 : L6
       %2[a] <- add(%0, %1)
      for a'' in 1 : L8
       %3[a] <- write(%2)

Now we will disable threading on the outer loop, enable threading on the first
inner loop, and then increase its size. The cursor starts in size-shifting mode
on the outermost loop, so we can run the ``toggle_thread`` action right away:

.. code-block:: none

    for a in 1024 : L0
     for a' in 1 : L1
      for a'' in 1 : L2
       %0[a] <- read()
      for a'' in 1 : L4
       %1[a] <- read()
      for a'' in 1 : L6
       %2[a] <- add(%0, %1)
      for a'' in 1 : L8
       %3[a] <- write(%2)

Next we swap the mode and move the cursor inward with ``toggle_mode`` and then
``up``. This does not change the ``loop_tree`` output, but ``action_state`` is
updated to [1, 1, 0]. With the right loop selected, we can thread it with
``toggle_thread``:

.. code-block:: none

    for a in 1024 : L0
     for a' in 1 : L1 [thread]
      for a'' in 1 : L2
       %0[a] <- read()
      for a'' in 1 : L4
       %1[a] <- read()
      for a'' in 1 : L6
       %2[a] <- add(%0, %1)
      for a'' in 1 : L8
       %3[a] <- write(%2)

After this we toggle back to size shifting and increase the size to 3 with
``toggle_mode``, ``up``, ``up``:

.. code-block:: none

    for a in 341 r 1 : L0
     for a' in 3 : L1 [thread]
      for a'' in 1 : L2
       %0[a] <- read()
      for a'' in 1 : L4
       %1[a] <- read()
      for a'' in 1 : L6
       %2[a] <- add(%0, %1)
      for a'' in 1 : L8
       %3[a] <- write(%2)

The new ``r 1`` on the first line denotes a tail iteration (of size 1): 1024
elements split into inner loops of size 3 give 341 full outer iterations plus a
remainder of 1. The compiler automatically injects tail logic to preserve the
functionality of the code. ``up`` always "steals" iterations from the nearest
outer loop, so the tail always appears on an outer loop.
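
The walkthrough above can be replayed with a small pure-Python toy model. This is only a sketch of the cursor semantics as described in this document, not the ``loop_tool`` implementation: the ``replay`` function and the ``WALKTHROUGH`` list are hypothetical names introduced for illustration, only the two outer loops (``L0`` and ``L1``) are modeled, and resizing ``L0`` is left out because its behavior is not described here.

```python
from typing import List, Tuple

# Hypothetical toy model of the "simple" action space (an assumption based on
# the walkthrough in this document, not the loop_tool implementation).
def replay(total: int, actions: List[str]) -> Tuple[List[int], List[Tuple[int, int, bool]]]:
    sizes = [total, 1]        # sizes of L0 (outer) and L1 (inner)
    tails = [0, 0]            # tail iterations of L0 and L1
    threaded = [True, False]  # default state: only L0 is threaded
    cursor = 0                # the cursor starts on the outermost loop
    resizing = True           # ... in size-shifting mode

    for action in actions:
        if action == "toggle_mode":
            resizing = not resizing
        elif action == "toggle_thread":
            threaded[cursor] = not threaded[cursor]
        elif action in ("up", "down"):
            step = 1 if action == "up" else -1
            if not resizing:
                # Cursor mode: "up" moves inward, "down" moves outward.
                cursor = min(max(cursor + step, 0), len(sizes) - 1)
            elif cursor == 1 and sizes[1] + step >= 1:
                # Size mode on the inner loop: iterations are taken from the
                # outer loop, and any remainder becomes the outer loop's tail.
                sizes[1] += step
                sizes[0], tails[0] = divmod(total, sizes[1])
            # Resizing L0 is deliberately not modeled (see the note above).
        else:
            raise ValueError(f"unknown action: {action}")

    action_state = [cursor, sizes[cursor], tails[cursor]]
    loops = list(zip(sizes, tails, threaded))
    return action_state, loops


WALKTHROUGH = [
    "toggle_thread",            # disable threading on L0
    "toggle_mode", "up",        # switch to cursor mode, move inward to L1
    "toggle_thread",            # enable threading on L1
    "toggle_mode", "up", "up",  # back to size mode, grow L1 from 1 to 3
]

action_state, loops = replay(1024, WALKTHROUGH)
print(action_state)  # [1, 3, 0]
print(loops)         # [(341, 1, False), (3, 0, True)]
```

The final state matches the last loop tree: ``L0`` has size 341 with a tail of 1 and no threading, while ``L1`` has size 3 and is threaded.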