compiler_gym.envs.llvm

The compiler_gym.envs.llvm module contains datasets and API extensions for the LLVM Environments. See LlvmEnv for the class definition.

Constructing Benchmarks

compiler_gym.envs.llvm.make_benchmark(inputs: Union[str, Path, ClangInvocation, List[Union[str, Path, ClangInvocation]]], copt: Optional[List[str]] = None, system_includes: bool = True, timeout: int = 600) → Benchmark[source]

Create a benchmark for use by LLVM environments.

This function takes one or more inputs and uses them to create an LLVM bitcode benchmark that can be passed to compiler_gym.envs.LlvmEnv.reset().

The following input types are supported:

File Suffix            Treated as             Converted using
---------------------  ---------------------  -----------------------------------------------------
.bc                    LLVM IR bitcode        No conversion required.
.ll                    LLVM IR text format    Assembled to bitcode using llvm-as.
.c, .cc, .cpp, .cxx    C / C++ source         Compiled to bitcode using clang and the given copt.

Note

The LLVM IR format has no compatibility guarantees between versions (see LLVM docs). You must ensure that any .bc and .ll files are compatible with the LLVM version used by CompilerGym, which can be reported using env.compiler_version.

For example, for a single-source C/C++ program you can pass the path of the source file:

>>> benchmark = make_benchmark('my_app.c')
>>> env = gym.make("llvm-v0")
>>> env.reset(benchmark=benchmark)

The clang invocation used is roughly equivalent to:

$ clang my_app.c -O0 -c -emit-llvm -o benchmark.bc

Additional compile-time arguments to clang can be provided using the copt argument:

>>> benchmark = make_benchmark('/path/to/my_app.cpp', copt=['-O2'])

If you need more fine-grained control over the options, you can directly construct a ClangInvocation to pass a list of arguments to clang:

>>> benchmark = make_benchmark(
    ClangInvocation(['/path/to/my_app.c'], system_includes=False, timeout=10)
)

For multi-file programs, pass a list of inputs that will be compiled separately and then linked to a single module:

>>> benchmark = make_benchmark([
    'main.c',
    'lib.cpp',
    'lib2.bc',
    'foo/input.bc'
])
Parameters
  • inputs – An input, or list of inputs.

  • copt – A list of command line options to pass to clang when compiling source files.

  • system_includes – Whether to include the system standard libraries during compilation jobs. This requires a system toolchain. See get_system_library_flags().

  • timeout – The maximum number of seconds to allow clang to run before terminating.

Returns

A Benchmark instance.

Raises
  • FileNotFoundError – If any input sources are not found.

  • TypeError – If the inputs are of unsupported types.

  • OSError – If a suitable compiler cannot be found.

  • BenchmarkInitError – If a compilation job fails.

  • TimeoutExpired – If a compilation job exceeds timeout seconds.

class compiler_gym.envs.llvm.BenchmarkFromCommandLine(invocation: GccInvocation, bitcode: bytes, timeout: int)[source]

A benchmark that has been constructed from a command line invocation.

See env.make_benchmark_from_command_line().

compile(env, timeout: int = 60) → None[source]

This completes the compilation and linking of the final executable specified by the original command line.

class compiler_gym.envs.llvm.ClangInvocation(args: List[str], system_includes: bool = True, timeout: int = 600)[source]

Class to represent a single invocation of the clang compiler.

__init__(args: List[str], system_includes: bool = True, timeout: int = 600)[source]

Create a clang invocation.

Parameters
  • args – The list of arguments to pass to clang.

  • system_includes – Whether to include the system standard libraries during compilation jobs. This requires a system toolchain. See get_system_library_flags().

  • timeout – The maximum number of seconds to allow clang to run before terminating.

compiler_gym.envs.llvm.get_system_library_flags(compiler: Optional[str] = None) → List[str][source]

Determine the set of compilation flags needed to use the host system libraries.

This uses the system compiler to determine the search paths for C/C++ system headers, and on macOS, the location of libclang_rt.osx.a. By default, c++ is invoked. This can be overridden by setting os.environ["CXX"] prior to calling this function.

Returns

A list of command line flags for a compiler.

Raises
  • HostCompilerFailure – If the host compiler cannot be determined, or fails to compile a trivial piece of code.

  • UnableToParseHostCompilerOutput – If the output of the compiler cannot be understood.

Datasets

compiler_gym.envs.llvm.datasets.get_llvm_datasets(site_data_base: Optional[Path] = None) → Iterable[Dataset][source]

Instantiate the builtin LLVM datasets.

Parameters

site_data_base – The root of the site data path.

Returns

An iterable sequence of Dataset instances.

class compiler_gym.envs.llvm.datasets.AnghaBenchDataset(site_data_base: Path, sort_order: int = 0, manifest_url: Optional[str] = None, manifest_sha256: Optional[str] = None, deprecated: Optional[str] = None, name: Optional[str] = None)[source]

A dataset of C programs curated from GitHub source code.

The dataset is from:

da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de Souza Magalhaes, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimaraes, and Fernando Magno Quinão Pereira. “ANGHABENCH: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction.” In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 378-390. IEEE, 2021.

And is available at:

Installation

The AnghaBench dataset consists of C functions that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from C to bitcode. This is a one-off cost.

class compiler_gym.envs.llvm.datasets.BlasDataset(site_data_base: Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.CBenchDataset(site_data_base: Path)[source]
class compiler_gym.envs.llvm.datasets.CLgenDataset(site_data_base: Path, sort_order: int = 0)[source]

The CLgen dataset contains 1000 synthetically generated OpenCL kernels.

The dataset is from:

Cummins, Chris, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. “Synthesizing benchmarks for predictive modeling.” In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 86-99. IEEE, 2017.

And is available at:

Installation

The CLgen dataset consists of OpenCL kernels that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from OpenCL to bitcode. This is a one-off cost. Compiling OpenCL to bitcode requires third party headers that are downloaded on the first call to install().

class compiler_gym.envs.llvm.datasets.CsmithDataset(site_data_base: Path, sort_order: int = 0, csmith_bin: Optional[Path] = None, csmith_includes: Optional[Path] = None)[source]

A dataset which uses Csmith to generate programs.

Csmith is a tool that can generate random conformant C99 programs. It is described in the publication:

Yang, Xuejun, Yang Chen, Eric Eide, and John Regehr. “Finding and understanding bugs in C compilers.” In Proceedings of the 32nd ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI), pp. 283-294. 2011.

For up-to-date information about Csmith, see:

Note that Csmith is a tool that is used to find errors in compilers. As such, there is a higher likelihood that the benchmark cannot be used for an environment and that env.reset() will raise BenchmarkInitError.

class compiler_gym.envs.llvm.datasets.GitHubDataset(site_data_base: Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.JotaiBenchDataset(site_data_base: Path)[source]

A dataset of C programs curated from GitHub source code.

The dataset is from:

da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de Souza Magalhaes, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimaraes, and Fernando Magno Quinão Pereira. “ANGHABENCH: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction.” In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 378-390. IEEE, 2021.

And is available at:

Installation

The JotaiBench dataset consists of C functions that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from C to bitcode. This is a one-off cost.

class compiler_gym.envs.llvm.datasets.LinuxDataset(site_data_base: Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.LlvmStressDataset(site_data_base: Path, sort_order: int = 0)[source]

A dataset which uses llvm-stress to generate programs.

llvm-stress is a tool for generating random LLVM-IR files.

This dataset forces reproducible results by setting the input seed to the generator. The benchmark’s URI is the seed, e.g. “generator://llvm-stress-v0/10” is the benchmark generated by llvm-stress using seed 10. The total number of unique seeds is 2^32 - 1.

Note that llvm-stress is a tool that is used to find errors in LLVM. As such, there is a higher likelihood that the benchmark cannot be used for an environment and that env.reset() will raise BenchmarkInitError.

class compiler_gym.envs.llvm.datasets.MibenchDataset(site_data_base: Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.NPBDataset(site_data_base: Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.OpenCVDataset(site_data_base: Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.POJ104Dataset(site_data_base: Path, sort_order: int = 0)[source]

The POJ-104 dataset contains 52000 C++ programs implementing 104 different algorithms with 500 examples of each.

The dataset is from:

Lili Mou, Ge Li, Lu Zhang, Tao Wang, Zhi Jin. “Convolutional neural networks over tree structures for programming language processing.” To appear in Proceedings of 30th AAAI Conference on Artificial Intelligence, 2016.

And is available at:

class compiler_gym.envs.llvm.datasets.TensorFlowDataset(site_data_base: Path, sort_order: int = 0)[source]

Miscellaneous

compiler_gym.envs.llvm.compute_observation(observation_space: ObservationSpaceSpec, bitcode: Path, timeout: float = 300) → ObservationType[source]

Compute an LLVM observation.

This is a utility function that uses a standalone C++ binary to compute an observation from an LLVM bitcode file. It is intended for use cases where you want to compute an observation without the overhead of initializing a full environment.

Example usage:

>>> from pathlib import Path
>>> import compiler_gym
>>> from compiler_gym.envs import llvm
>>> env = compiler_gym.make("llvm-v0")
>>> space = env.observation.spaces["Ir"]
>>> bitcode = Path("bitcode.bc")
>>> observation = llvm.compute_observation(space, bitcode, timeout=30)

Warning

This is not part of the core CompilerGym API and may change in a future release.

Parameters
  • observation_space – The observation that is to be computed.

  • bitcode – The path of an LLVM bitcode file.

  • timeout – The maximum number of seconds to allow the computation to run before timing out.

Raises
  • ValueError – If computing the observation fails.

  • TimeoutError – If computing the observation times out.

  • FileNotFoundError – If the given bitcode does not exist.