compiler_gym.envs.llvm

The compiler_gym.envs.llvm module contains datasets and API extensions for the LLVM environments. See LlvmEnv for the class definition.
Document contents:
Constructing Benchmarks
- compiler_gym.envs.llvm.make_benchmark(inputs: Union[str, Path, ClangInvocation, List[Union[str, Path, ClangInvocation]]], copt: Optional[List[str]] = None, system_includes: bool = True, timeout: int = 600) Benchmark [source]
Create a benchmark for use by LLVM environments.
This function takes one or more inputs and uses them to create an LLVM bitcode benchmark that can be passed to compiler_gym.envs.LlvmEnv.reset().

The following input types are supported:
File Suffix            Treated as           Converted using
.bc                    LLVM IR bitcode      No conversion required.
.ll                    LLVM IR text format  Assembled to bitcode using llvm-as.
.c, .cc, .cpp, .cxx    C / C++ source       Compiled to bitcode using clang and the given copt.
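The suffix dispatch in the table above can be sketched in plain Python. The helper and the description strings below are illustrative only, not part of the CompilerGym API:

```python
from pathlib import Path

# Hypothetical sketch of the suffix dispatch described in the table;
# make_benchmark's real implementation differs.
CONVERSIONS = {
    ".bc": "no conversion required",
    ".ll": "assemble with llvm-as",
    ".c": "compile with clang",
    ".cc": "compile with clang",
    ".cpp": "compile with clang",
    ".cxx": "compile with clang",
}

def conversion_for(path: str) -> str:
    """Return how an input file with this suffix would be treated."""
    suffix = Path(path).suffix
    if suffix not in CONVERSIONS:
        # Mirrors the documented TypeError for unsupported input types.
        raise TypeError(f"Unsupported input type: {path}")
    return CONVERSIONS[suffix]
```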
Note

The LLVM IR format has no compatibility guarantees between versions (see LLVM docs). You must ensure that any .bc and .ll files are compatible with the LLVM version used by CompilerGym, which can be reported using env.compiler_version.

For example, for single-source C/C++ programs, you can pass the path of the source file:
>>> benchmark = make_benchmark('my_app.c')
>>> env = gym.make("llvm-v0")
>>> env.reset(benchmark=benchmark)
The clang invocation used is roughly equivalent to:
$ clang my_app.c -O0 -c -emit-llvm -o benchmark.bc
Additional compile-time arguments to clang can be provided using the copt argument:

>>> benchmark = make_benchmark('/path/to/my_app.cpp', copt=['-O2'])
If you need more fine-grained control over the options, you can directly construct a ClangInvocation to pass a list of arguments to clang:

>>> benchmark = make_benchmark(
...     ClangInvocation(['/path/to/my_app.c'], system_includes=False, timeout=10)
... )
For multi-file programs, pass a list of inputs that will be compiled separately and then linked to a single module:
>>> benchmark = make_benchmark([
...     'main.c',
...     'lib.cpp',
...     'lib2.bc',
...     'foo/input.bc',
... ])
- Parameters
inputs – An input, or list of inputs.
copt – A list of command line options to pass to clang when compiling source files.
system_includes – Whether to include the system standard libraries during compilation jobs. This requires a system toolchain. See get_system_library_flags().
timeout – The maximum number of seconds to allow clang to run before terminating.
- Returns
A Benchmark instance.
- Raises
FileNotFoundError – If any input sources are not found.
TypeError – If the inputs are of unsupported types.
OSError – If a suitable compiler cannot be found.
BenchmarkInitError – If a compilation job fails.
TimeoutExpired – If a compilation job exceeds timeout seconds.
- class compiler_gym.envs.llvm.BenchmarkFromCommandLine(invocation: GccInvocation, bitcode: bytes, timeout: int)[source]
A benchmark that has been constructed from a command line invocation.
- class compiler_gym.envs.llvm.ClangInvocation(args: List[str], system_includes: bool = True, timeout: int = 600)[source]
Class to represent a single invocation of the clang compiler.
- __init__(args: List[str], system_includes: bool = True, timeout: int = 600)[source]
Create a clang invocation.
- Parameters
args – The list of arguments to pass to clang.
system_includes – Whether to include the system standard libraries during compilation jobs. This requires a system toolchain. See get_system_library_flags().
timeout – The maximum number of seconds to allow clang to run before terminating.
- compiler_gym.envs.llvm.get_system_library_flags(compiler: Optional[str] = None) List[str] [source]
Determine the set of compilation flags needed to use the host system libraries.
This uses the system compiler to determine the search paths for C/C++ system headers, and on macOS, the location of libclang_rt.osx.a. By default, c++ is invoked. This can be overridden by setting os.environ["CXX"] prior to calling this function.
- Returns
A list of command line flags for a compiler.
- Raises
HostCompilerFailure – If the host compiler cannot be determined, or fails to compile a trivial piece of code.
UnableToParseHostCompilerOutput – If the output of the compiler cannot be understood.
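The flag discovery described above amounts to parsing the verbose diagnostics of the host compiler. CompilerGym's actual parser is internal; the sketch below is a simplified illustration over a hand-written sample of the kind of `-E -v` output a compiler emits, where the sample text, the function, and the choice of -isystem flags are all assumptions for demonstration:

```python
# Hand-written sample resembling a compiler's `-E -v` diagnostics;
# not real output from any particular toolchain.
SAMPLE_VERBOSE_OUTPUT = """\
ignoring nonexistent directory "/nonexistent"
#include <...> search starts here:
 /usr/lib/llvm-14/include
 /usr/include/c++/11
 /usr/include
End of search list.
"""

def parse_include_dirs(verbose_output: str) -> list:
    """Illustrative parser: collect the header search directories
    listed between 'search starts here:' and 'End of search list',
    and turn each into an -isystem flag."""
    flags = []
    in_search_list = False
    for line in verbose_output.splitlines():
        if line.endswith("search starts here:"):
            in_search_list = True
        elif line.startswith("End of search list"):
            in_search_list = False
        elif in_search_list:
            flags += ["-isystem", line.strip()]
    return flags
```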
Datasets
- compiler_gym.envs.llvm.datasets.get_llvm_datasets(site_data_base: Optional[Path] = None) Iterable[Dataset] [source]
Instantiate the builtin LLVM datasets.
- Parameters
site_data_base – The root of the site data path.
- Returns
An iterable sequence of
Dataset
instances.
- class compiler_gym.envs.llvm.datasets.AnghaBenchDataset(site_data_base: Path, sort_order: int = 0, manifest_url: Optional[str] = None, manifest_sha256: Optional[str] = None, deprecated: Optional[str] = None, name: Optional[str] = None)[source]
A dataset of C programs curated from GitHub source code.
The dataset is from:
da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de Souza Magalhaes, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimaraes, and Fernando Magno Quintão Pereira. “ANGHABENCH: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction.” In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 378-390. IEEE, 2021.
And is available at:
Installation
The AnghaBench dataset consists of C functions that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from C to bitcode. This is a one-off cost.
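This compile-on-demand-and-cache behavior is a generic pattern: build the bitcode the first time a benchmark is requested, then reuse the cached artifact. The helper below is an illustrative sketch of that pattern, not CompilerGym's implementation:

```python
import hashlib
from pathlib import Path

def cached_bitcode(source: str, cache_dir: Path, compile_fn) -> Path:
    """Compile `source` to bitcode via `compile_fn` the first time it
    is seen; subsequent calls reuse the cached file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on the source contents so edits invalidate it.
    key = hashlib.sha256(source.encode()).hexdigest()
    out = cache_dir / f"{key}.bc"
    if not out.is_file():
        out.write_bytes(compile_fn(source))  # the one-off compilation cost
    return out
```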
- class compiler_gym.envs.llvm.datasets.BlasDataset(site_data_base: Path, sort_order: int = 0)[source]
- class compiler_gym.envs.llvm.datasets.CLgenDataset(site_data_base: Path, sort_order: int = 0)[source]
The CLgen dataset contains 1000 synthetically generated OpenCL kernels.
The dataset is from:
Cummins, Chris, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. “Synthesizing benchmarks for predictive modeling.” In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 86-99. IEEE, 2017.
And is available at:
Installation
The CLgen dataset consists of OpenCL kernels that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from OpenCL to bitcode. This is a one-off cost. Compiling OpenCL to bitcode requires third party headers that are downloaded on the first call to install().
- class compiler_gym.envs.llvm.datasets.CsmithDataset(site_data_base: Path, sort_order: int = 0, csmith_bin: Optional[Path] = None, csmith_includes: Optional[Path] = None)[source]
A dataset which uses Csmith to generate programs.
Csmith is a tool that can generate random conformant C99 programs. It is described in the publication:
Yang, Xuejun, Yang Chen, Eric Eide, and John Regehr. “Finding and understanding bugs in C compilers.” In Proceedings of the 32nd ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI), pp. 283-294. 2011.
For up-to-date information about Csmith, see:
Note that Csmith is a tool that is used to find errors in compilers. As such, there is a higher likelihood that the benchmark cannot be used for an environment and that env.reset() will raise BenchmarkInitError.
- class compiler_gym.envs.llvm.datasets.GitHubDataset(site_data_base: Path, sort_order: int = 0)[source]
- class compiler_gym.envs.llvm.datasets.JotaiBenchDataset(site_data_base: Path)[source]
A dataset of C programs curated from GitHub source code.
The dataset is from:
da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de Souza Magalhaes, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimaraes, and Fernando Magno Quintão Pereira. “ANGHABENCH: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction.” In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 378-390. IEEE, 2021.
And is available at:
Installation
The JotaiBench dataset consists of C functions that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from C to bitcode. This is a one-off cost.
- class compiler_gym.envs.llvm.datasets.LinuxDataset(site_data_base: Path, sort_order: int = 0)[source]
- class compiler_gym.envs.llvm.datasets.LlvmStressDataset(site_data_base: Path, sort_order: int = 0)[source]
A dataset which uses llvm-stress to generate programs.
llvm-stress is a tool for generating random LLVM-IR files.
This dataset forces reproducible results by setting the input seed to the generator. The benchmark’s URI is the seed, e.g. “generator://llvm-stress-v0/10” is the benchmark generated by llvm-stress using seed 10. The total number of unique seeds is 2^32 - 1.
Note that llvm-stress is a tool that is used to find errors in LLVM. As such, there is a higher likelihood that the benchmark cannot be used for an environment and that env.reset() will raise BenchmarkInitError.
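Because generated benchmarks may fail to initialize, a common pattern with CsmithDataset and LlvmStressDataset is to catch the initialization error on reset() and skip to the next seed. The sketch below shows the control flow and the seed-based URI scheme using a stand-in environment and exception class, since the real ones require a CompilerGym installation; with a real environment you would catch compiler_gym's BenchmarkInitError instead:

```python
class BenchmarkInitError(Exception):
    """Stand-in for compiler_gym's BenchmarkInitError."""

class FakeEnv:
    """Stand-in environment: here, even seeds 'fail' to initialize,
    standing in for benchmarks the real environment rejects."""
    def reset(self, benchmark: str) -> None:
        seed = int(benchmark.rsplit("/", 1)[1])
        if seed % 2 == 0:
            raise BenchmarkInitError(benchmark)

def usable_seeds(env, seeds):
    """Yield only the seeds whose llvm-stress benchmark initializes
    cleanly, using the documented generator:// URI scheme."""
    for seed in seeds:
        try:
            env.reset(benchmark=f"generator://llvm-stress-v0/{seed}")
        except BenchmarkInitError:
            continue  # skip benchmarks that cannot be used
        yield seed
```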
- class compiler_gym.envs.llvm.datasets.MibenchDataset(site_data_base: Path, sort_order: int = 0)[source]
- class compiler_gym.envs.llvm.datasets.NPBDataset(site_data_base: Path, sort_order: int = 0)[source]
- class compiler_gym.envs.llvm.datasets.OpenCVDataset(site_data_base: Path, sort_order: int = 0)[source]
- class compiler_gym.envs.llvm.datasets.POJ104Dataset(site_data_base: Path, sort_order: int = 0)[source]
The POJ-104 dataset contains 52000 C++ programs implementing 104 different algorithms with 500 examples of each.
The dataset is from:
Lili Mou, Ge Li, Lu Zhang, Tao Wang, Zhi Jin. “Convolutional neural networks over tree structures for programming language processing.” To appear in Proceedings of 30th AAAI Conference on Artificial Intelligence, 2016.
And is available at:
Miscellaneous
- compiler_gym.envs.llvm.compute_observation(observation_space: ObservationSpaceSpec, bitcode: Path, timeout: float = 300) ObservationType [source]
Compute an LLVM observation.
This is a utility function that uses a standalone C++ binary to compute an observation from an LLVM bitcode file. It is intended for use cases where you want to compute an observation without the overhead of initializing a full environment.
Example usage:
>>> env = compiler_gym.make("llvm-v0")
>>> space = env.observation.spaces["Ir"]
>>> bitcode = Path("bitcode.bc")
>>> observation = llvm.compute_observation(space, bitcode, timeout=30)
Warning
This is not part of the core CompilerGym API and may change in a future release.
- Parameters
observation_space – The observation that is to be computed.
bitcode – The path of an LLVM bitcode file.
timeout – The maximum number of seconds to allow the computation to run before timing out.
- Raises
ValueError – If computing the observation fails.
TimeoutError – If computing the observation times out.
FileNotFoundError – If the given bitcode does not exist.