loop_tool Environment Reference
loop_tool is an experimental intermediate representation (IR) for N-dimensional data computation. It defines a lightweight, highly portable, and restrictive IR that can be lowered to both GPU and CPU backends.
CompilerGym exposes the loop_tool IR for reinforcement learning through a LoopToolEnv environment.
Installation
It is highly recommended that you install CUDA to use loop_tool. The package will work with most newer versions of CUDA. Please see NVIDIA’s CUDA Toolkit installation page for details.
Datasets
We provide a program that performs a simple point-wise addition. Two datasets (one for CPU and one for GPU) are provided. The benchmarks within these datasets are generated by selecting an arbitrary size between 1 and 8x2^20 (~8M).
To set a specific size, use the trailing term in the benchmark URI. For GPU with size 1024:
>>> import gym
>>> import compiler_gym  # importing registers the loop_tool-v0 environment
>>> env = gym.make("loop_tool-v0")
>>> env.reset(benchmark="benchmark://loop_tool-cuda-v0/1024")
The CPU variant can be created with:
>>> env.reset(benchmark="benchmark://loop_tool-cpu-v0/1024")
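Because the size is just the trailing term of the URI, benchmark names can be built and parsed with plain string handling. The sketch below is only an illustration: `loop_tool_benchmark_uri` and `parse_size` are hypothetical helpers, not part of CompilerGym, and they assume the `loop_tool-cuda-v0` and `loop_tool-cpu-v0` dataset names shown above.

```python
def loop_tool_benchmark_uri(backend: str, size: int) -> str:
    """Build a benchmark URI (hypothetical helper, not a CompilerGym API)."""
    if backend not in ("cuda", "cpu"):
        raise ValueError(f"unknown backend: {backend!r}")
    if size < 1:
        raise ValueError("size must be a positive integer")
    return f"benchmark://loop_tool-{backend}-v0/{size}"


def parse_size(uri: str) -> int:
    """Recover the problem size from the trailing term of the URI."""
    return int(uri.rsplit("/", 1)[1])


uri = loop_tool_benchmark_uri("cuda", 1024)
print(uri)              # benchmark://loop_tool-cuda-v0/1024
print(parse_size(uri))  # 1024
```

The resulting string can be passed as the `benchmark` argument of `env.reset()` in the same way as the literal URIs above.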
Note that although the action spaces are the same, the underlying hardware is dramatically different. GPU threads can (and should) be used heavily, whereas CPU threading often shows little to no benefit once the size of a threaded loop exceeds the number of virtual cores available. Further, the CPU backend should be viewed as an interpreted fallback; unlike the GPU backend, it cannot come close to peak performance.
Observation Spaces
Three observation spaces are provided for tuning the compiled result.
Observation space | Type |
---|---|
flops | |
loop_tree | |
action_state | |
flops will benchmark the program and return the average achieved gigaflops per run. This includes program compilation and benchmark warmup, which may take some time.
loop_tree will return a string representation of the current loop tree associated with the program. The string represents the computation in a platform-agnostic way.
action_state returns the cursor information for the current environment. The cursor information is composed of three values in order: current loop index, current loop size, current loop tail.
For example, given an action_state of [0, 341, 1] and the following loop_tree, the cursor points to the first (outermost) loop:
for a in 341 r 1 : L0 [thread]  <-- cursor
 for a' in 3 : L1
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)
If the action_state were [1, 3, 0], we’d know the cursor is pointing to the child of the first loop:
for a in 341 r 1 : L0 [thread]
 for a' in 3 : L1  <-- cursor
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)
In the case of action_state being [2, 1, 0], the cursor simultaneously points to all innermost loops. This is an artifact of the innermost loop always being unrolled when the loop_tree is generated:
for a in 341 r 1 : L0 [thread]
 for a' in 3 : L1
  for a'' in 1 : L2      <-- cursor
   %0[a] <- read()          |
  for a'' in 1 : L4      <--+
   %1[a] <- read()          |
  for a'' in 1 : L6      <--+
   %2[a] <- add(%0, %1)     |
  for a'' in 1 : L8      <--+
   %3[a] <- write(%2)
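The relationship between action_state and loop_tree in the three examples above can be sketched in a few lines of Python. `annotate_cursor` is a hypothetical helper, not part of CompilerGym. It assumes that the depth of a loop equals the number of primes on its loop variable (`a`, `a'`, `a''`), which holds for the single-variable program shown here but is not a documented guarantee of the loop_tree format.

```python
import re

# Matches loop headers such as "for a' in 3 : L1"; leading whitespace is ignored.
LOOP_HEADER = re.compile(r"^\s*for \w+('*) in \d+")

LOOP_TREE = """\
for a in 341 r 1 : L0 [thread]
 for a' in 3 : L1
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)"""


def annotate_cursor(loop_tree, action_state):
    """Append a marker to every loop header at the cursor's depth."""
    index, _size, _tail = action_state
    lines = []
    for line in loop_tree.splitlines():
        match = LOOP_HEADER.match(line)
        # Assumption: the number of primes on the loop variable is the depth.
        if match and len(match.group(1)) == index:
            line += "  <-- cursor"
        lines.append(line)
    return "\n".join(lines)


print(annotate_cursor(LOOP_TREE, [2, 1, 0]))
```

With an index of 2 the helper marks all four innermost loops, mirroring the last example above; with an index of 0 it marks only the first line.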
Action Spaces
Currently, only a “simple” action space is available. This can be understood as control over a cursor that has two different modes. Either the cursor is moving between loops or it is frozen in place and can be used to change the sizes of loops.
Action | Description |
---|---|
toggle_mode | Swaps between shifting the cursor location and shifting the size of the loop selected by the cursor |
up | Either shifts the cursor inward or increases the size of the selected loop by 1 |
down | Either shifts the cursor outward or decreases the size of the selected loop by 1 |
toggle_thread | Toggles the threading parameter of the selected loop |
The default state for the benchmark we’ve been looking at is:
for a in 1024 : L0 [thread]
 for a' in 1 : L1
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)
Now we will disable threading on the outer loop, enable threading on the first inner loop and then increase its size.
The cursor mode starts with shifting sizes on the outermost loop.
This means we can first run the toggle_thread action:
for a in 1024 : L0
 for a' in 1 : L1
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)
Now we have to swap the mode and move the cursor inward with toggle_mode and then up. This won’t change the loop_tree output, but action_state will be updated to [1, 1, 0].
Now that we have the right loop selected, we can thread it with toggle_thread:
for a in 1024 : L0
 for a' in 1 : L1 [thread]
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)
After this we toggle back to size shifting with toggle_mode and increase the size to 3 with up, up:
for a in 341 r 1 : L0
 for a' in 3 : L1 [thread]
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)
The new r 1 we see on the first line denotes a tail iteration (of size 1): 341 x 3 + 1 = 1024. The compiler will automatically inject tail logic to preserve the functionality of the code. up will always “steal” iterations from the nearest outer loop, so the tail will always be on outer loops.
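The walk-through above can be replayed against a small toy model of the cursor. This is only a sketch for building intuition, not the real LoopToolEnv: `ToyCursor` is a hypothetical class, it only models resizing the loop directly below the outermost one, and it assumes that cursor movement is clamped at the outermost and innermost loops.

```python
class ToyCursor:
    """Toy model of the "simple" action space (not the real environment)."""

    def __init__(self, total):
        self.total = total                    # elements in the point-wise program
        self.sizes = [total, 1, 1]            # default state: outer loop plus two unit loops
        self.threaded = [True, False, False]  # only the outer loop starts threaded
        self.index = 0                        # cursor starts on the outermost loop
        self.resizing = True                  # cursor starts in size-shifting mode

    def toggle_mode(self):
        self.resizing = not self.resizing

    def toggle_thread(self):
        self.threaded[self.index] = not self.threaded[self.index]

    def up(self):
        self._shift(+1)

    def down(self):
        self._shift(-1)

    def _shift(self, delta):
        if not self.resizing:
            # Movement mode. Assumption: the cursor is clamped at both ends.
            self.index = min(max(self.index + delta, 0), len(self.sizes) - 1)
        elif self.index == 1:
            # Size mode. Assumption: only loop 1 is resizable in this toy,
            # and its size never drops below 1.
            self.sizes[1] = max(1, self.sizes[1] + delta)
            # The outer loop gives up iterations; the remainder becomes its tail.
            self.sizes[0] = self.total // self.sizes[1]

    def action_state(self):
        # In this toy, only the outermost loop can carry a tail.
        tail = self.total % self.sizes[1] if self.index == 0 else 0
        return [self.index, self.sizes[self.index], tail]


cursor = ToyCursor(1024)
for action in ["toggle_thread", "toggle_mode", "up", "toggle_thread",
               "toggle_mode", "up", "up"]:
    getattr(cursor, action)()

print(cursor.action_state())  # [1, 3, 0]
print(cursor.sizes)           # [341, 3, 1]
print(cursor.threaded)        # [False, True, False]
```

Moving the toy cursor back out with toggle_mode and down yields [0, 341, 1], the tail of 1 being the remainder of 1024 divided by 3.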