loop_tool Environment Reference

loop_tool is an experimental intermediate representation (IR) for N-dimensional data computation. It defines a lightweight, highly portable, and restrictive IR that can be lowered to both GPU and CPU backends.

CompilerGym exposes the loop_tool IR for reinforcement learning through a LoopToolEnv environment.

Installation

We strongly recommend installing CUDA to use loop_tool. The package works with most recent versions of CUDA. See NVIDIA’s CUDA Toolkit installation page for details.

Datasets

We provide a program that performs a simple point-wise addition. Two datasets are provided, one for CPU and one for GPU. The benchmarks within these datasets are generated by selecting an arbitrary size between 1 and 8x2^20 (~8M).

To set a specific size, use the trailing term in the benchmark URI. For GPU with size 1024:

>>> import gym
>>> import compiler_gym  # registers the loop_tool-v0 environment with gym
>>> env = gym.make("loop_tool-v0")
>>> env.reset(benchmark="benchmark://loop_tool-cuda-v0/1024")

The CPU variant can be created with:

>>> env.reset(benchmark="benchmark://loop_tool-cpu-v0/1024")
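As a minimal sketch, the URI pattern used in the two calls above can be wrapped in a small helper. The function name loop_tool_benchmark_uri and its validation are illustrative assumptions, not part of CompilerGym; only the two dataset names and the trailing size term come from the examples above.

```python
def loop_tool_benchmark_uri(size: int, device: str = "cuda") -> str:
    """Build a benchmark URI for the point-wise addition program.

    Hypothetical helper: only the URI pattern is taken from the text.
    """
    if device not in ("cuda", "cpu"):
        raise ValueError(f"unknown device: {device!r}")
    if size < 1:
        raise ValueError("size must be at least 1")
    return f"benchmark://loop_tool-{device}-v0/{size}"


print(loop_tool_benchmark_uri(1024))         # GPU dataset
print(loop_tool_benchmark_uri(1024, "cpu"))  # CPU dataset
```

The returned string can be passed as the benchmark argument of env.reset().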

Note that although the action spaces are the same, the underlying hardware is dramatically different. GPU threads can (and should) be used heavily, whereas CPU threading often shows little to no benefit once the threaded loop size exceeds the number of virtual cores available. Further, the CPU backend should be viewed as an interpreted fallback; unlike the GPU backend, it cannot reach near-peak performance.

Observation Spaces

Three observation spaces are provided for tuning the compiled result.

Observation space    Type
flops                Float64
loop_tree            String
action_state         Int64[]

flops will benchmark the program and return the average achieved gigaflops per run. Benchmarking includes program compilation and warmup, so this observation may take some time to compute.

loop_tree will return a string representation of the current loop tree associated with the program. The string represents the computation in a platform-agnostic way.

action_state returns the cursor information for the current environment. The cursor information is composed of three values in order: current loop index, current loop size, current loop tail.
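To make the ordering of the three values concrete, here is a small sketch that gives each one a name. CursorState and decode_action_state are hypothetical names introduced for illustration; the environment itself only returns the integer array.

```python
from typing import NamedTuple, Sequence


class CursorState(NamedTuple):
    """Named view of action_state (hypothetical, for illustration)."""
    loop_index: int  # which loop the cursor currently selects
    loop_size: int   # size of that loop
    loop_tail: int   # tail of that loop


def decode_action_state(action_state: Sequence[int]) -> CursorState:
    """Unpack [index, size, tail] into named fields."""
    if len(action_state) != 3:
        raise ValueError("expected three values: index, size, tail")
    return CursorState(*(int(v) for v in action_state))


cursor = decode_action_state([0, 341, 1])
print(cursor.loop_index, cursor.loop_size, cursor.loop_tail)
```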

For example, given an action_state of [0, 341, 1] and the loop_tree below, the cursor points to the first (outermost) loop:

for a in 341 r 1 : L0 [thread] <-- cursor
 for a' in 3 : L1
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)

If the action_state were [1, 3, 0], we’d know the cursor is pointing to the child of the first loop:

for a in 341 r 1 : L0 [thread]
 for a' in 3 : L1              <-- cursor
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)

In the case of action_state being [2, 1, 0], the cursor simultaneously points to all innermost loops. This is an artifact of the innermost loop always being unrolled when the loop_tree is generated:

for a in 341 r 1 : L0 [thread]
 for a' in 3 : L1
  for a'' in 1 : L2            <-- cursor
   %0[a] <- read()                |
  for a'' in 1 : L4            <--+
   %1[a] <- read()                |
  for a'' in 1 : L6            <--+
   %2[a] <- add(%0, %1)           |
  for a'' in 1 : L8            <--+
   %3[a] <- write(%2)
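The three cases above suggest a simple rule: the cursor selects every loop header whose nesting depth equals the loop index. The sketch below applies that rule to a loop_tree string. It assumes one leading space per nesting level and that the loop index equals the nesting depth, which holds for the examples shown here but is not a documented guarantee; mark_cursor is a hypothetical helper.

```python
def mark_cursor(loop_tree: str, action_state) -> str:
    """Append '<-- cursor' to every loop header the cursor selects.

    Assumes one leading space per nesting level and that the loop index
    in action_state equals that nesting depth, as in the examples above.
    """
    depth = int(action_state[0])
    marked = []
    for line in loop_tree.splitlines():
        body = line.lstrip(" ")
        indent = len(line) - len(body)
        if body.startswith("for ") and indent == depth:
            line = f"{line}  <-- cursor"
        marked.append(line)
    return "\n".join(marked)


tree = """\
for a in 341 r 1 : L0 [thread]
 for a' in 3 : L1
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)"""

print(mark_cursor(tree, [2, 1, 0]))
```

With a loop index of 2, all four innermost loop headers are marked, mirroring the last example above.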

Action Spaces

Currently, only a “simple” action space is available. This can be understood as control over a cursor that has two different modes. Either the cursor is moving between loops or it is frozen in place and can be used to change the sizes of loops.

Action           Description
toggle_mode      Swaps between shifting the cursor location and shifting the size of the loop selected by the cursor
up               Either shifts the cursor inward or increases the size of the selected loop by 1
down             Either shifts the cursor outward or decreases the size of the selected loop by 1
toggle_thread    Toggles the threading parameter of the selected loop

The default state for the benchmark we’ve been looking at is:

for a in 1024 : L0 [thread]
 for a' in 1 : L1
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)

Now we will disable threading on the outer loop, enable threading on the first inner loop and then increase its size.

The cursor starts on the outermost loop in size-shifting mode, so we can run the toggle_thread action right away:

for a in 1024 : L0
 for a' in 1 : L1
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)

Now we have to swap the mode and move the cursor inward with toggle_mode and then up. This won’t change the loop_tree output, but action_state will be updated to [1, 1, 0]. Now that we have the right loop selected, we can thread it with toggle_thread:

for a in 1024 : L0
 for a' in 1 : L1 [thread]
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)

After this, we toggle back to size shifting with toggle_mode and increase the size to 3 with up, up:

for a in 341 r 1 : L0
 for a' in 3 : L1 [thread]
  for a'' in 1 : L2
   %0[a] <- read()
  for a'' in 1 : L4
   %1[a] <- read()
  for a'' in 1 : L6
   %2[a] <- add(%0, %1)
  for a'' in 1 : L8
   %3[a] <- write(%2)
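The walkthrough can be replayed against a toy model of the two-mode cursor. This is a sketch for building intuition only, not the environment's implementation: the ToyCursor class, the clamping of the cursor to three loop levels, and the restriction to resizing loop L1 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCursor:
    """Toy model of the two-mode cursor; not the real environment."""
    loop_index: int = 0   # 0 is the outermost loop
    moving: bool = False  # False: up/down resize, True: up/down move
    inner_size: int = 1   # size of loop L1, the only loop resized here
    threaded: List[bool] = field(default_factory=lambda: [True, False, False])

    def step(self, action: str) -> None:
        if action == "toggle_mode":
            self.moving = not self.moving
        elif action == "toggle_thread":
            self.threaded[self.loop_index] = not self.threaded[self.loop_index]
        elif action in ("up", "down"):
            delta = 1 if action == "up" else -1
            if self.moving:
                # Assumption: clamp the cursor to the three loop levels.
                self.loop_index = min(max(self.loop_index + delta, 0), 2)
            elif self.loop_index == 1:
                self.inner_size = max(self.inner_size + delta, 1)
            # Resizing other loops is deliberately left out of this sketch.
        else:
            raise ValueError(f"unknown action: {action!r}")


cursor = ToyCursor()
walkthrough = ["toggle_thread", "toggle_mode", "up", "toggle_thread",
               "toggle_mode", "up", "up"]
for name in walkthrough:
    cursor.step(name)
print(cursor.loop_index, cursor.inner_size, cursor.threaded)
# prints: 1 3 [False, True, False]
```

The final state matches the last loop tree: the cursor sits on loop L1, which has size 3 and is the only threaded loop.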

The new r 1 on the first line denotes a tail iteration (of size 1). The compiler automatically injects tail logic to preserve the functionality of the code. up always “steals” iterations from the nearest outer loop, so the tail always ends up on an outer loop.
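The sizes in this example are consistent with plain integer division: 1024 = 341 x 3 + 1. The sketch below assumes the tail is simply the remainder left after dividing the total size by the inner loop size; split_outer and outer_header are hypothetical helpers, and the real compiler may handle deeper nests differently.

```python
def split_outer(total: int, inner: int):
    """Return (outer_size, tail) so that outer_size * inner + tail == total."""
    if total < 1 or inner < 1:
        raise ValueError("sizes must be positive")
    return divmod(total, inner)


def outer_header(total: int, inner: int) -> str:
    """Format the outermost loop line in the style of loop_tree."""
    outer, tail = split_outer(total, inner)
    suffix = f" r {tail}" if tail else ""
    return f"for a in {outer}{suffix} : L0"


for inner in (1, 2, 3):
    print(outer_header(1024, inner))
# prints:
# for a in 1024 : L0
# for a in 512 : L0
# for a in 341 r 1 : L0
```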