loop_tool Environment Reference
===============================

`loop_tool <https://github.com/facebookresearch/loop_tool>`_ is an experimental
intermediate representation (IR) for N-dimensional data computation.
It defines a lightweight, highly portable, and restrictive IR that can be lowered
to both GPU and CPU backends.

CompilerGym exposes the ``loop_tool`` IR for reinforcement learning through a
:class:`LoopToolEnv <compiler_gym.envs.LoopToolEnv>` environment.

.. contents:: Overview:
    :local:

.. _Installation:

Installation
------------

We highly recommend installing CUDA to use ``loop_tool``.
The package works with most recent versions of CUDA.
See NVIDIA's `CUDA Toolkit installation page <https://developer.nvidia.com/cuda-downloads>`_
for details.

Datasets
--------

We provide a program that performs a simple point-wise addition. Two datasets
(one for CPU and one for GPU) are provided. The benchmarks within these datasets
are generated by selecting an arbitrary size between 1 and 8x2^20 (~8M).

To set a specific size, use the trailing term of the benchmark URI.
For GPU with size 1024:

.. code-block:: python

    >>> import gym
    >>> import compiler_gym  # registers the CompilerGym environments
    >>> env = gym.make("loop_tool-v0")
    >>> env.reset(benchmark="benchmark://loop_tool-cuda-v0/1024")

The CPU variant can be created with:

.. code-block:: python

    >>> env.reset(benchmark="benchmark://loop_tool-cpu-v0/1024")
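
A URI for either backend can also be assembled programmatically. The helper
below is a minimal sketch: ``loop_tool_benchmark_uri`` is a hypothetical name
that is not part of CompilerGym, and it assumes only the two dataset names
shown above.

.. code-block:: python

    def loop_tool_benchmark_uri(size: int, backend: str = "cuda") -> str:
        """Build a benchmark URI for a point-wise addition of ``size`` elements."""
        if backend not in ("cuda", "cpu"):
            raise ValueError(f"unknown backend: {backend!r}")
        if size < 1:
            raise ValueError("size must be at least 1")
        # The trailing term of the URI selects the problem size.
        return f"benchmark://loop_tool-{backend}-v0/{size}"


    print(loop_tool_benchmark_uri(1024))         # benchmark://loop_tool-cuda-v0/1024
    print(loop_tool_benchmark_uri(1024, "cpu"))  # benchmark://loop_tool-cpu-v0/1024

The returned string can be passed as the ``benchmark`` argument of ``env.reset()``.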

Note that although the action spaces are the same, the underlying hardware is dramatically different.
GPU threads can (and should) be used heavily, whereas CPU threading often shows little to no benefit
once the number of threads exceeds the number of virtual cores available.
Further, the CPU backend should be viewed as an interpreted fallback; unlike the GPU backend,
it cannot reach near-peak performance.

Observation Spaces
------------------

Three observation spaces are provided for tuning the compiled result.

+--------------------------+------------+
| Observation space        | Type       |
+==========================+============+
| flops                    | ``Float64``|
+--------------------------+------------+
| loop_tree                | ``String`` |
+--------------------------+------------+
| action_state             | ``Int64[]``|
+--------------------------+------------+


``flops`` will benchmark the program and return the average achieved gigaflops per run.
This includes program compilation and benchmark warmup, which may take some time.

``loop_tree`` will return a string representation of the current loop tree associated with
the program.  The string represents the computation in a platform-agnostic way.

``action_state`` returns the cursor information for the current environment.
The cursor information is composed of three values in order:
current loop index, current loop size, current loop tail.
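
To make the ordering explicit, the three values can be unpacked into named
fields. This is a minimal sketch: ``Cursor`` and ``decode_action_state`` are
hypothetical names introduced for illustration, not part of CompilerGym.

.. code-block:: python

    from typing import NamedTuple, Sequence


    class Cursor(NamedTuple):
        loop_index: int  # which loop the cursor selects
        loop_size: int   # size of the selected loop
        loop_tail: int   # tail (remainder) of the selected loop


    def decode_action_state(action_state: Sequence[int]) -> Cursor:
        """Name the three values of an ``action_state`` observation."""
        if len(action_state) != 3:
            raise ValueError("action_state must hold exactly three values")
        return Cursor(*(int(value) for value in action_state))


    print(decode_action_state([0, 341, 1]))
    # Cursor(loop_index=0, loop_size=341, loop_tail=1)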

For example, given an ``action_state`` of [0, 341, 1] and the ``loop_tree`` below,
the cursor points to the first (outermost) loop:

.. code-block:: none

  for a in 341 r 1 : L0 [thread] <-- cursor
   for a' in 3 : L1
    for a'' in 1 : L2
     %0[a] <- read()
    for a'' in 1 : L4
     %1[a] <- read()
    for a'' in 1 : L6
     %2[a] <- add(%0, %1)
    for a'' in 1 : L8
     %3[a] <- write(%2)

If the ``action_state`` were [1, 3, 0], we'd know the cursor is pointing to the
child of the first loop:

.. code-block:: none

  for a in 341 r 1 : L0 [thread]
   for a' in 3 : L1              <-- cursor
    for a'' in 1 : L2
     %0[a] <- read()
    for a'' in 1 : L4
     %1[a] <- read()
    for a'' in 1 : L6
     %2[a] <- add(%0, %1)
    for a'' in 1 : L8
     %3[a] <- write(%2)

In the case of ``action_state`` being [2, 1, 0], the cursor simultaneously points to
all innermost loops.  This is an artifact of the innermost loop always being
unrolled when the ``loop_tree`` is generated:

.. code-block:: none

  for a in 341 r 1 : L0 [thread]
   for a' in 3 : L1
    for a'' in 1 : L2            <-- cursor
     %0[a] <- read()                |
    for a'' in 1 : L4            <--+
     %1[a] <- read()                |
    for a'' in 1 : L6            <--+
     %2[a] <- add(%0, %1)           |
    for a'' in 1 : L8            <--+
     %3[a] <- write(%2)
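
The cursor annotations in the listings above can be reproduced with a short
script. This is a sketch under one assumption drawn from the three examples:
the loop index in ``action_state`` equals the nesting depth of the selected
loop, with depth inferred from indentation. ``mark_cursor`` is a hypothetical
helper, not part of CompilerGym or ``loop_tool``.

.. code-block:: python

    from typing import List, Sequence


    def mark_cursor(loop_tree: str, action_state: Sequence[int]) -> List[str]:
        """Return the lines of ``loop_tree`` with the cursor loops marked."""
        cursor_depth = action_state[0]
        indents: List[int] = []  # indentation widths of the enclosing loops
        marked = []
        for line in loop_tree.splitlines():
            stripped = line.lstrip()
            if not stripped.startswith("for "):
                marked.append(line)  # not a loop header, keep unchanged
                continue
            indent = len(line) - len(stripped)
            # Drop loops that are not ancestors of the current line.
            while indents and indents[-1] >= indent:
                indents.pop()
            depth = len(indents)
            indents.append(indent)
            marked.append((line + "  <-- cursor") if depth == cursor_depth else line)
        return marked


    tree = "\n".join([
        "for a in 341 r 1 : L0 [thread]",
        " for a' in 3 : L1",
        "  for a'' in 1 : L2",
        "   %0[a] <- read()",
        "  for a'' in 1 : L4",
        "   %1[a] <- read()",
    ])
    print("\n".join(mark_cursor(tree, [2, 1, 0])))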


Action Spaces
-------------


Currently, only a "simple" action space is available. This can be understood as control over a cursor
that has two different modes.  Either the cursor is moving between loops or it is frozen in place and
can be used to change the sizes of loops.

+---------------------+-----------------------------------------------------------------------------------------------------+
| Action              | Description                                                                                         |
+=====================+=====================================================================================================+
| ``toggle_mode``     | Swaps between shifting the cursor location and shifting the size of the loop selected by the cursor |
+---------------------+-----------------------------------------------------------------------------------------------------+
| ``up``              | Either shifts the cursor inward or increases the size of the selected loop by 1                     |
+---------------------+-----------------------------------------------------------------------------------------------------+
| ``down``            | Either shifts the cursor outward or decreases the size of the selected loop by 1                    |
+---------------------+-----------------------------------------------------------------------------------------------------+
| ``toggle_thread``   | Toggles the threading parameter of the selected loop                                                |
+---------------------+-----------------------------------------------------------------------------------------------------+
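
Actions are passed to ``env.step()`` as integer indices. The sketch below shows
one way to issue them by name instead. ``apply_actions`` is a hypothetical
helper, and it assumes that the action space exposes a ``names`` list (as
CompilerGym's named discrete spaces do) and that ``env.step()`` returns the
four-element Gym tuple.

.. code-block:: python

    from typing import Iterable


    def apply_actions(env, action_names: Iterable[str]) -> str:
        """Step ``env`` through the named actions and return the loop tree."""
        for name in action_names:
            # Translate the action name into its integer index.
            action = env.action_space.names.index(name)
            _, _, done, _ = env.step(action)
            if done:
                break
        return env.observation["loop_tree"]

For instance, ``apply_actions(env, ["toggle_mode", "up", "toggle_thread"])``
would switch to cursor movement, move the cursor one loop inward, and toggle
threading on that loop.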

The default state for the benchmark we've been looking at is:

.. code-block:: none

  for a in 1024 : L0 [thread]
   for a' in 1 : L1
    for a'' in 1 : L2
     %0[a] <- read()
    for a'' in 1 : L4
     %1[a] <- read()
    for a'' in 1 : L6
     %2[a] <- add(%0, %1)
    for a'' in 1 : L8
     %3[a] <- write(%2)

Now we will disable threading on the outer loop,
enable threading on the first inner loop, and then increase its size.

The cursor starts in size-shifting mode on the outermost loop.
Because that loop is already selected, we can run the ``toggle_thread`` action first:

.. code-block:: none

  for a in 1024 : L0
   for a' in 1 : L1
    for a'' in 1 : L2
     %0[a] <- read()
    for a'' in 1 : L4
     %1[a] <- read()
    for a'' in 1 : L6
     %2[a] <- add(%0, %1)
    for a'' in 1 : L8
     %3[a] <- write(%2)

Next we swap the mode and move the cursor inward with
``toggle_mode`` and then ``up``.  This doesn't change the
``loop_tree`` output, but ``action_state`` is updated to
[1, 1, 0].
Now that the right loop is selected, we can thread it with
``toggle_thread``:

.. code-block:: none

  for a in 1024 : L0
   for a' in 1 : L1 [thread]
    for a'' in 1 : L2
     %0[a] <- read()
    for a'' in 1 : L4
     %1[a] <- read()
    for a'' in 1 : L6
     %2[a] <- add(%0, %1)
    for a'' in 1 : L8
     %3[a] <- write(%2)


After this, we toggle back to size shifting and increase the size to 3
with ``toggle_mode`` followed by ``up`` twice:


.. code-block:: none

  for a in 341 r 1 : L0
   for a' in 3 : L1 [thread]
    for a'' in 1 : L2
     %0[a] <- read()
    for a'' in 1 : L4
     %1[a] <- read()
    for a'' in 1 : L6
     %2[a] <- add(%0, %1)
    for a'' in 1 : L8
     %3[a] <- write(%2)

The new ``r 1`` on the first line denotes a tail iteration (of size 1).
The compiler automatically injects tail logic to preserve the
functionality of the code.  ``up`` always "steals" iterations from the nearest
outer loop, so the tail is always on an outer loop.
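
The sizes in the final listing follow from integer division: 1024 iterations
split by an inner loop of size 3 give 341 full outer iterations and 1 left
over. The sketch below reproduces that arithmetic. ``split_loop`` is a
hypothetical helper, and it assumes the outer size is the integer quotient and
the tail is the remainder, which matches the example but is not taken from the
``loop_tool`` source.

.. code-block:: python

    from typing import Tuple


    def split_loop(total: int, inner_size: int) -> Tuple[int, int]:
        """Return ``(outer_size, tail)`` for ``total`` iterations."""
        if total < 1 or inner_size < 1:
            raise ValueError("total and inner_size must be positive")
        # total == outer_size * inner_size + tail
        return divmod(total, inner_size)


    outer, tail = split_loop(1024, 3)
    print(f"for a in {outer} r {tail}")  # for a in 341 r 1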