In the previous lab we looked into dynamic and static analysis techniques to find vulnerabilities in the internal behavior of programs. Another essential process of secure software design is software testing, i.e., providing inputs to programs and checking their outputs against those inputs. On top of classical software testing techniques, such as unit testing, various automated testing approaches complement existing software testing efforts by facilitating the writing of tests and improving test coverage.
In this lab we will look at two techniques, fuzzing and symbolic execution, that, both separately and combined, have proven to be valuable additions for finding serious bugs that have been lying in software for a long time (see these examples).
Fuzzing, symbolic execution and concolic execution are currently a hot research topic, and there are many recently proposed tools that combine these techniques (to the point that the nomenclature is often “fuzzy”). The FuzzBench project is an effort to standardize benchmarks of such tools. You may get a visual picture of the ever-growing list of available techniques and tools in this survey.
To install the specific tools that will be used in this lab, run:
cd ses/vm
git pull
sh install-fuzz.sh
Before we start, note that this lab covers, by example, a series of testing techniques and tools. These topics are only broadly introduced in the theoretical lectures and briefly presented in this lab, which together should be sufficient for our experimentation. For more in-depth context or technical detail, you may ask the instructors or check the following references:
Radamsa is a mutation-based fuzzing tool that generates random program inputs by mutating some given input. Radamsa is fully scriptable, and so far has been successful in finding vulnerabilities in various real-world applications.
Radamsa is simply a command-line tool that receives a file with some data to mutate and returns various possible mutations. We can control command-line parameters such as the number of mutations or the seeds (randomness) used for generation of mutations, as in the following example:
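A minimal illustration (the sample file, seed and number of mutations below are arbitrary choices):
$ echo "Hello fuzzing world" > sample    # arbitrary data to mutate
$ radamsa -n 5 -s 123 sample             # print 5 mutations of the sample, generated with seed 123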
Radamsa by itself is not a testing framework. Therefore, in order to run generated test cases against an application, we have to script the testing logic ourselves. We will use a simple example borrowed from this course. Consider an interactive C program wisdom-alt.c that has two modes: storing a secret string or displaying a stored secret string. We have written a Python script fuzz.py that connects the output from radamsa to the input of the wisdom program and reads the initial data from the inputs/1 file; it will mutate the input data using different seeds and, in each run, send the mutated data line by line to the interactive wisdom program. You may run this example as follows:
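For instance (the folder name and the way the target binary is passed to fuzz.py mirror the later maze example and may differ in your checkout):
$ cd path/to/c/misc/wisdom/     # assumed location of the wisdom examples
$ gcc wisdom-alt.c -o wisdom-alt
$ python3 fuzz.py ./wisdom-alt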
The fuzzer will quickly find a bug, i.e., record a crash. Why did the program crash? You can replicate the same behavior by running the program manually. You can also edit fuzz.py to change the input data, the seed or have radamsa generate different inputs.
As you may have noticed, the bug occurs when an invalid menu option (one that is not 1 or 2) is passed to the interactive program. The file wisdom-alt2.c contains an additional guard to ignore invalid options; this change fixes the previous bug. You may run the second wisdom program as before.
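For example, assuming the same folder and compilation flow as above:
$ gcc wisdom-alt2.c -o wisdom-alt2
$ python3 fuzz.py ./wisdom-alt2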
This time, the fuzzer will not find a sequence of interactive inputs that crashes the program. But how certain can we be about the effectiveness of the fuzzer? Since it is essentially generating random inputs, finding a crashing execution may come down to mere chance. As an experiment, fuzz2.py changes the input fuzzing file from inputs/1 to inputs/2. Re-run the fuzzer; it will now find a crashing execution. Why? You will notice that finding a crash depends on the input length.
KLEE is a symbolic execution tool which can significantly beat the coverage of developers’ own hand-written test suites. KLEE is able to automatically generate high-coverage test inputs that outperform manual and random testing approaches. It does so by forking execution on program branches over symbolic variables, making sure that it generates concrete inputs that check every explored program path. In practice, KLEE will not achieve 100% program coverage: evaluating all program executions is a computationally expensive and generally undecidable problem, and hence, like all symbolic execution techniques, KLEE needs to compromise on a maximum path depth.
The KLEE tool runs on LLVM bitcode. Many other symbolic execution tools exist for non-LLVM languages. A few examples for reference:
To try KLEE on the wisdom program, we have to modify it to identify which variables KLEE should treat as symbolic.
Borrowing again from this course, the changes in wisdom-alt-klee.c are the following: it swaps gets for a function sym_gets, with an infinite loop that simulates the generation of a varying-size sequence of symbolic characters determined by their ASCII code; and it reads the initial input using the KLEE-specific klee_make_symbolic function, which returns a fixed-size symbolic array.

We will use a pre-configured docker container. Inside the vm folder, open a bash inside the KLEE container, compile the instrumented wisdom program to LLVM bitcode, and run KLEE.
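A possible sequence of commands, assuming a make run-klee target in the vm Makefile and the same container layout as in the KLEE-taint examples further below (paths and flags may differ in your setup):
$ make run-klee                 # assumed target that opens a bash inside the KLEE container
klee@container# cd /path/to/ses/c/misc/wisdom/
klee@container# clang -I /home/klee/klee_src/include/ -emit-llvm -c -g -O0 wisdom-alt-klee.c
klee@container# klee --libc=uclibc wisdom-alt-klee.bc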
The --libc flag tells KLEE to replace standard libc functions with a symbolically-compiled uclibc implementation; it is normal to get some warnings related to this.
KLEE shall run shortly and discover the error (an overflow related to the size of the array read from gets), printing a stack trace and some information about the current state. It will have created a directory klee-last in the current directory that contains further information about the symbolic execution. If you look in there, you will see that it generated some tests, error reports and some statistics.
The (binary) files ending in .ktest in this directory can be formatted intelligibly by using ktest-tool. Use the following commands to inspect the concrete state that the error occurred in.
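For example (the test number is illustrative; use the one named in the error report):
klee@container# ls klee-last/                           # generated tests, *.err error reports and statistics
klee@container# ktest-tool klee-last/test000001.ktest   # pretty-print the concrete inputs of that test case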
While we can run the test cases generated by KLEE on our program by hand, KLEE provides a convenient replay library, which simply replaces the calls to symbolic functions with assignments of the inputs stored in a .ktest file.
To run our erroneous test, you can recompile the program with the kleeRuntest replay library (linked via -lkleeRuntest).
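A possible replay sequence, along the lines of the official KLEE tutorials (the library path depends on the KLEE installation inside the container, and the test number is illustrative):
klee@container# gcc -I /home/klee/klee_src/include/ -L /path/to/klee/lib wisdom-alt-klee.c -o wisdom-replay -lkleeRuntest   # add the same lib path to LD_LIBRARY_PATH if needed
klee@container# KTEST_FILE=klee-last/test000001.ktest ./wisdom-replay
klee@container# echo $?    # print the exit status of the replayed test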
The last command prints the test’s exit status: it shall be 0 for successful tests and non-zero for failing tests. Our erroneous test shall fail; you can also confirm the error.
In a sense, a symbolic execution tool is exploring a maze defined by the program’s execution space. We can make this analogy a reality by using KLEE to symbolically execute a program that asks its user to solve a maze. Check maze.c, taken from this blog post, which implements such a maze game.
Compile and run the program. We can solve the maze with the following input:
The file maze-klee.c is a slightly modified version: we introduce a KLEE-specific assertion when the maze is solved, to make it easier to distinguish when KLEE finds a solution. Compile the symbolic program for KLEE and run it. This time, we make the input a symbolic string with size 30 using command-line arguments.
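A possible invocation, mirroring the previous KLEE run (--sym-arg 30 asks KLEE’s POSIX runtime for one symbolic command-line argument of up to 30 characters; the exact flags may differ in your setup):
klee@container# clang -I /home/klee/klee_src/include/ -emit-llvm -c -g -O0 maze-klee.c
klee@container# klee --libc=uclibc --posix-runtime maze-klee.bc --sym-arg 30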
KLEE will work for a while and then finish. The maze program generates an assertion failure when a path through the maze has been identified, so the test corresponding to a winning path through the maze is reported as an error. Look for the path that solved the maze by looking for a file that ends in .err inside the klee-last folder.
It turns out there are multiple “solutions” to the maze; you can see them all by passing the argument --emit-all-errors to KLEE.
As it turns out, something funny is going on: the path that was found is not even as long as the maze, and somehow the solution is allowed to walk through walls. Look through the code and find the condition that allows this to happen. What line is it on? Comment it out and try again: KLEE shall now find the single correct solution.
Blab is a small tool for generating data according to grammars, and can be seen as a minimal generation-based fuzzer. It is intended to generate data that has a known context-free structure, usually in order to test programs or to produce interesting sample data for fuzzers. Fuzzers that rely on a specification of the input format to generate test cases are often called generation-based fuzzers.
If we try to use a fuzzer like radamsa to solve our maze, it is unlikely to succeed since it does not know that only sequences of wsad characters are valid inputs. We can however improve on this by encoding such an input grammar using blab; check fuzz.py, which automates the search for a maze solution. You can run it as follows:
$ cd path/to/c/misc/maze/
$ python3 fuzz.py ./maze
Did it find a solution? You may try to improve the grammar, the seed or the number of attempts to make it so.
AFL is a coverage-guided fuzzing tool, often called a grey-box fuzzer, which takes into account the code coverage of tested inputs to understand whether it is making progress and to make informed decisions about which inputs to mutate in order to maximize coverage. In a sense, it is a lightweight form of symbolic execution. The tested program should first be compiled with an AFL utility program to enable control-flow tracking. Any behavioural change in response to an input can then be detected by the fuzzer. If there is no access to the source code, black-box testing is supported as well.
We will try a few examples from this AFL tutorial.
Consider the vulnerable.c program, which simply reads a string from stdin and prints it to stdout.
Compile it for AFL and test it for some input:
$ AFL_HARDEN=1 afl-clang-fast vulnerable.c -o vulnerable
$ ./vulnerable # Press enter to get usage instructions
$ ./vulnerable < inputs/u
Now fuzz it with AFL. The following command receives a folder inputs with input files that will serve as a base for mutations, and produces reports in folder out:
$ afl-fuzz -i inputs -o out ./vulnerable
You will see an AFL menu with progress statistics that will run indefinitely. Wait a few minutes until you see a few reported crashes, and stop the process with CTRL+C. You will then find the inputs that provoked crashes in the folder out/default/crashes. Try them out, and trace the bug back to the code.
We can also see how to detect a real-world bug such as Heartbleed in a complex library such as OpenSSL. Heartbleed is a heap buffer overflow bug in the TLS handshake phase, triggered when a Heartbeat message declares a payload length larger than the data it actually carries. You may read more about the bug in resources such as this one.
Start by pulling a buggy version of OpenSSL 1.0.1f from the official OpenSSL GitHub repository, and compiling it with AFL support; we also compile with Address Sanitizer support to detect memory errors 1. Building will take a while:
$ cd c/misc/heartbleed/
$ git submodule update --init --recursive
$ cd openssl
$ CC=afl-clang-fast CXX=afl-clang-fast++ ./config -d
$ AFL_USE_ASAN=1 make
The next step is to compile the handshake.cc C++ program which performs a TLS handshake.
$ cd ..
$ AFL_USE_ASAN=1 afl-clang-fast++ -g handshake.cc openssl/libssl.a openssl/libcrypto.a -o handshake -I openssl/include -ldl
We can now run the program with AFL to find the bug (note that this is not a working exploit, which would require a cleverly formed, malicious heartbeat message):
$ afl-fuzz -i inputs -o out ./handshake
You may stop as soon as AFL finds a crash. Inspect the produced crash log.
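To reproduce a crash outside of AFL, you can feed one of the recorded inputs back to the binary (file names under the crashes folder are generated by AFL and will differ):
$ ls out/default/crashes/                        # pick one of the crashing inputs
$ ./handshake < out/default/crashes/id:000000*   # AddressSanitizer should report a heap-buffer-overflow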
This may make your VM allocate a lot of memory and lead to random crashes; make sure that you are not running anything else important on the system.
As another example, start by pulling libxml2 version 2.9.2 (with a few known fuzzable vulnerabilities, see here) and compile it with AFL support. Building will take a while:
$ cd c/misc/libxml2/
$ git submodule update --init --recursive
$ cd libxml2
$ CC=afl-clang-fast ./autogen.sh --disable-shared --without-debug --without-ftp --without-http --without-legacy --without-python
$ AFL_USE_ASAN=1 make -j 4
To fuzz XML files, we define a program that reads an XML file and use the W3C XML test suite as a base corpus of inputs. AFL also supports a dictionary with vocabulary to aid in generating mutations of the original inputs; we will use AFL’s XML dictionary.
$ AFL_USE_ASAN=1 afl-clang-fast ./xmlreadAFL.cc -I libxml2/include libxml2/.libs/libxml2.a -lz -lm -o fuzzer
$ wget https://www.w3.org/XML/Test/xmlts20130923.tar.gz -O - | tar -xz
$ mkdir output
$ afl-fuzz -i xmlconf -o output -x xml.dict ./fuzzer @@
AFL will soon find a crash (see here).
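Since the target was invoked with @@ (i.e., it receives its input as a file argument), a crashing input can be replayed by passing the file directly (the file name is illustrative):
$ ls output/default/crashes/
$ ./fuzzer output/default/crashes/id:000000*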
LLVM’s libFuzzer is a coverage-guided fuzzing engine that ships with Clang. Contrary to AFL, which is targeted at fuzzing executables via I/O, libFuzzer is designed for fuzzing libraries via in-memory handlers; a few more technical differences can be consulted in this article.
We can recompile our libxml2 example with libFuzzer support as follows:
$ cd c/misc/libxml2/
$ git submodule update --init --recursive
$ cd libxml2
$ FUZZ_CXXFLAGS="-g -O2 -fsanitize=fuzzer-no-link,address,undefined"
$ CXX="clang++ $FUZZ_CXXFLAGS" CC="clang $FUZZ_CXXFLAGS" CCLD="clang++ $FUZZ_CXXFLAGS" ./autogen.sh --disable-shared --without-debug --without-ftp --without-http --without-legacy --without-python
$ make -j 4
To fuzz XML files, we use the same XML dictionary and input corpus as we did for AFL:
$ clang -g -O2 -fsanitize=fuzzer,address,undefined xmlreadLibFuzzer.cc -I libxml2/include libxml2/.libs/libxml2.a -lz -o fuzzer
$ ./fuzzer xmlconf -dict=xml.dict -max_len=64
The fuzzer will soon find a crash (see here) and exit as soon as the first crash is found.
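libFuzzer writes the offending input to a crash-<hash> file in the current directory; passing that file back to the fuzzer binary re-runs just that input and prints the sanitizer report again:
$ ./fuzzer crash-*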
Note that libFuzzer will add its generated inputs to the xmlconf directory.
There are many other modern fuzzing techniques that can achieve better results for custom programs, often by combining some form of symbolic execution. Many of the associated tools are experimental, possibly quite complex to configure and use, and can become out-of-date quickly.
Remark: For the sake of demonstration, we will look at a few, but be advised that it may be hard to understand their behavior outside of the example programs that we provide.
The list of fuzzers presented here is merely demonstrative. For a slightly more complete picture, from 2021, the UNIFUZZ project has made an effort to benchmark fuzzers on real-world examples that you may alternatively explore.
Driller is a concolic execution tool that explores only the paths that are found interesting by the fuzzer and uses symbolic execution to generate inputs for path branches that a fuzzer cannot satisfy. It uses AFL as a fuzzer and angr, a binary analysis framework, as a symbolic tracer over the executable’s control flow graph.
Consider an example buggy.c program taken from this article that simply reads 6 bytes of input, checks them one by one against a sequence of characters, and crashes if all 6 of them match. This kind of program is typically hard for a fuzzer to crack without insight into the internal program behavior. You may compile and test this program:
$ gcc -o buggy buggy.c
$ echo 123456 | ./buggy
No problem
$ echo 7/42a8 | ./buggy
Segmentation fault (core dumped)
Let’s now run a pre-packaged docker container, which contains a script, shellphuzz, that combines the AFL and Driller runs into a single command:
$ echo 1 | sudo tee /proc/sys/kernel/sched_child_runs_first
$ make run-driller
driller@container# shellphuzz -d 1 -c 1 -w out -C --length-extension 4 ./buggy
This script will run until AFL finds the first crash. When it does, check the out/buggy/sync/fuzzer-master/crashes folder for the crash report.
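You can replay a recorded crash against the binary inside the container (the file name is illustrative):
driller@container# ./buggy < out/buggy/sync/fuzzer-master/crashes/id:000000*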
Fuzzolic is a concolic execution fuzzer with an approach similar to Driller. It uses QEMU to analyze binaries and generate symbolic queries whose results may help in finding new test cases; such queries are discharged by a fuzzy (approximate) SAT solver. Consider an example.c program which performs a conditional check on a particular value. You may easily test this program using a pre-bundled docker container; change into the vm folder:
$ make run-fuzzolic
fuzzolic@container# gcc tests/example/example.c -o tests/example/example
fuzzolic@container# ./fuzzolic/fuzzolic.py -o ./workdir -i tests/example/inputs -- ./tests/example/example [args]
You may find more details about this example in the documentation. You may find more running examples in the GitHub repository.
Fuzzolic also provides automated support for alternating between fuzzolic and AFL, by feeding the test cases generated by fuzzolic to AFL and vice versa. For the example program, you may try:
fuzzolic@container# ./fuzzolic/run_afl_fuzzolic.py --address-reasoning --optimistic-solving --fuzzy -o workdir/ -i tests/example/inputs -- ./tests/example/example
SymCC is a compiler wrapper which embeds symbolic execution into the program during compilation, and an associated run-time support library. In essence, the compiler inserts code that computes symbolic expressions for each value in the program. The actual computation happens through calls to the support library at run time.
You may build a docker container to try SymCC and AFL. Inside the vm folder, just try compiling the buggy.c example with SymCC, and running it:
$ git pull
$ make run-symcc
symcc@container# ./symcc /path/to/ses/c/misc/buggy.c -o /path/to/ses/c/misc/buggy
symcc@container# mkdir results
symcc@container# export SYMCC_OUTPUT_DIR=`pwd`/results
symcc@container# echo 'aaaaaa' | ./path/to/ses/c/misc/buggy
SymCC will only perform one mutation: in the results folder, you will find a slightly mutated input, such as 7aaaaa, that is informed by the conditional performed by the program and is closer to the bug.
You may automate fuzzing by providing a folder with initial inputs and using the following script:
symcc@container# mkdir inputs #fill this folder with input files
symcc@container# ./util/pure_concolic_execution.sh -i inputs -o results /path/to/ses/c/misc/buggy
It will run indefinitely; stop it after a while and inspect the results folder. Can you find an input that causes the bug?
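One way to check, sketched under the assumption that the files in results are raw inputs for the program (alternatively, compile buggy.c with plain gcc for replaying):
symcc@container# for f in results/*; do echo "$f"; /path/to/ses/c/misc/buggy < "$f"; done   # the bug-triggering input crashes with a segmentation fault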
Even though it does not offer an automated script for this, SymCC may also be combined with AFL similarly to fuzzolic; check the documentation.
Angora is a mutation-based fuzzer that seeks to improve AFL’s branch coverage without resorting to symbolic execution. Instead, it relies on dynamic byte-level taint tracking to try to understand how inputs affect branch constraints, and therefore how to mutate inputs to improve branch coverage.
You may build a docker container to try Angora. Inside the vm folder, just try running:
$ echo core | sudo tee /proc/sys/kernel/core_pattern
$ make run-angora
angora@container# cd tests
angora@container# ./test.sh mini
This will compile and run the mini.c example that you may find in tests/mini/mini.c inside the container; this example is very similar to the buggy.c example that we have seen before. You may inspect the output, including the found errors, in the folder tests/output. To fuzz your own program, the simplest way is to place it in a new subfolder under tests and run it as above; since Angora is a mutation-based fuzzer, the above test script will look for input seed files in folder tests/input.
The testing methodologies that we have seen so far focus on traditional safety properties such as undefined behavior or memory errors. These do not allow us to directly capture security properties such as information leakage, as we have seen before in Lab 1.
Even if less common among existing tools, we can naturally combine automated testing with security analysis techniques such as taint analysis 2. One such example is KLEE-taint, a (slightly outdated 3) fork of KLEE that dynamically propagates taint annotations alongside symbolic variables during symbolic execution. You can find more information in the GitHub repository and in the original paper.
Consider the command injection example from before, minimally adapted to run with KLEE-taint in os_cmd_injection_basic-bad-klee.c. The most relevant changes are the two following annotations:
klee_set_taint(1,arg,strlen(arg))
...
klee_assert(klee_get_taint(command,commandLength) != 1)
KLEE-taint taints memory bytes much like we have seen before for Taintgrind. For an address arg containing a symbolic string, the first instruction taints a memory block of strlen(arg) bytes with taint tag 1. The second instruction reads the taint of the commandLength bytes starting at address command, and checks if any of those bytes has a taint tag different from 1. Internally, KLEE-taint keeps a per-byte set of taint tags; by default, non-tainted data has taint tag 0.
Inside the vm folder, we can run the command injection example as follows:
$ git pull
$ make run-klee-taint
klee-taint@container# cd /path/to/ses/c/SARD-testsuite-100/000/149/241/
klee-taint@container# clang -I /home/klee/klee_src/include/ -emit-llvm -c -g -O0 -disable-O0-optnone os_cmd_injection_basic-bad-klee.c
klee-taint@container# klee --libc=uclibc --posix-runtime -taint=direct os_cmd_injection_basic-bad-klee.bc
KLEE-taint allows choosing from three increasingly precise taint propagation modes: direct (taint is only propagated through assignments), indirect (when the condition of a conditional statement depends on a tainted value, the control flow is tainted) or region-based (when the two branches of a conditional statement converge to instructions that are executed independently of the condition, the control flow becomes untainted again); for this example, direct tainting suffices.
The result of running KLEE-taint shall be an assertion violation: if you remember the example, the content of arg will indeed flow into some bytes of the command string.
To be sure that taint is being correctly propagated, we can test whether the fixed command header is also tainted. Program os_cmd_injection_basic-bad2-klee.c only checks if the first catLength bytes of the command string are tainted. If you run this program in the same way, KLEE-taint will not find an assertion violation; indeed, taint is propagated at the byte level as expected.
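Concretely, mirroring the previous invocation with only the source file changed:
klee-taint@container# clang -I /home/klee/klee_src/include/ -emit-llvm -c -g -O0 -disable-O0-optnone os_cmd_injection_basic-bad2-klee.c
klee-taint@container# klee --libc=uclibc --posix-runtime -taint=direct os_cmd_injection_basic-bad2-klee.bc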
Similarly to how TIMECOP operates, KLEE-taint can be easily extended to automatically verify constant-time security properties: if the user marks secret symbolic inputs as tainted, (the symbolic execution of) a program is constant-time if no branch condition is symbolically tainted. KLEE-taint-ct, a simple fork of KLEE-taint, automates the taint checking of all program branch conditions.
Remember the password checking example from before, adapted for KLEE in program pass-loop-bad-klee.c. Inside the vm folder, we can run it as follows:
$ git pull
$ make run-klee-taint-ct
klee-taint-ct@container# cd /path/to/ses/c/misc/
klee-taint-ct@container# clang -I /home/klee/klee_src/include/ -emit-llvm -c -g -O0 pass-loop-bad-klee.c
klee-taint-ct@container# klee -taint=controlflow pass-loop-bad-klee.bc
The controlflow option corresponds to the indirect mode of KLEE-taint. The fact that the control flow of the for loop depends on the secret arguments will be quickly detected by KLEE-taint-ct.
Then try the constant-time program pass-loop-good-klee.c; this time, KLEE-taint-ct shall not find any constant-time violation.
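Mirroring the previous commands:
klee-taint-ct@container# clang -I /home/klee/klee_src/include/ -emit-llvm -c -g -O0 pass-loop-good-klee.c
klee-taint-ct@container# klee -taint=controlflow pass-loop-good-klee.bc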
You may try other constant-time security analysis examples using KLEE-taint-ct in the examples/taint folder of the GitHub repository. Inside the vm folder, you may just run:
$ git clone https://github.com/hpacheco/klee-taint-ct
$ cd examples/taint/path/to/the/folder/of/the/example/
$ make verify
It is possible to reduce the analysis of some security properties to the comparison of two program executions (for slightly different inputs). One simple example is dudect, a minimal C library that tests cryptographic functions with multiple inputs to find timing variations.
Constant-time static analysis tools such as ct-verif consider a self-composed program that simulates two parallel executions of the original program. Using the same rationale, it is possible to use traditional fuzzers to support the automated testing of security properties by comparing the outputs of multiple fuzzed inputs; one such example is ct-fuzz, tailored for the automated testing of constant-time security of cryptographic implementations. You may read the ct-fuzz paper and check the GitHub repository.
We have seen extensions of KLEE (a symbolic execution tool) that combine dynamic taint analysis to support the analysis of security properties, including general information flow and constant-time security. Nonetheless, symbolic execution by itself can be generalized to verify security properties, such as constant-time, that fall under the class of relational properties, that is, properties that can be expressed over multiple executions of the same program. This multiplication of the symbolic state is often more expensive, but also more precise than performing symbolic taint analysis.
BinSec is a binary-level security analysis tool that employs symbolic execution among other techniques. BinSec/Rel is a plugin that provides support for checking constant-time security using relational symbolic execution, that is, symbolically executing two executions of the same program and requiring that their results be the same for different secret symbolic inputs.
Remember the password checking example from before, adapted in program pass-loop-bad-binsec.c. Inside the vm folder, we can compile our program to a binary and run it inside a BinSec docker container as follows:
Note that the pass-loop-binsec.cfg file includes a configuration which declares global C variables to be treated as secret or public symbolic inputs. In this example, BinSec will report that the program is insecure, and present symbolic inputs that testify that the program is not constant-time, i.e., different secret inputs may yield different public outputs.
The found counter-example will be detailed in file pass-loop-bad-binsec.toml, which will include sample instantiations for symbolic variables similar to the ones below:
Then try the constant-time program pass-loop-good-binsec.c; this time, BinSec shall report that the program is secure:
The approaches behind dudect and ct-fuzz have been developed specifically for cryptographic code. A more recent research direction, called differential fuzzing, considers the adaptation of general-purpose fuzzers for side-channel analysis. One such example is DifFuzz, for Java programs, which is built on top of AFL and employs resource-guided heuristics to automatically find inputs that attempt to maximize the difference in time/resource consumption between different program executions.
The goal of this lab is to experiment with the automated testing tools described above. We will detect and fix the vulnerabilities found in example C programs from the c/SARD-testsuite-100 testsuite.
Write your answers in the Lab2.md file.

Address Sanitizer needs a lot of virtual memory, and by default AFL will limit the amount of memory a fuzzed software gets. ↩
Symbolic execution can also be extended for testing that a program satisfies a general security property. ENCoVer is an example of such an academic prototype, developed as an extension to Java Pathfinder. ↩
KLEE-taint appears to be a no longer supported extension to an older version of KLEE; the changes are minimal so, in principle, it would be easily portable to recent versions. More modern similar tools exist, e.g. Jaint for dynamic taint analysis and symbolic execution of Java programs; we could not however find its source code. ↩