Static analysis is a method to extract specific information about the program without actually running it. In this blog post, we will first have a look at different types of information that we can obtain using static analysis. Later, we will have a comprehensive overview of different techniques for static program analysis.
Motivation
There are broadly two categories of information, based on their usage, that we can extract: Proscriptive information and Descriptive information[1]. Proscriptive information is obtained when we statically analyze the program behavior against a set of prescribed rules. For instance, identification, localization, and explanation of all software anomalies like out of array bound access, use-after-free bugs fall into this category. Descriptive information, on the contrary, consists of different sets of information that can be used to describe a given program. For instance, the call graph and program dependency graph fall into this category. Stack or heap memory usage and Worse-case execution time (WCET) determination are also a type of descriptive information. Static analysis can be done at various levels of program granularity, all the way from an abstract representation of a program like a state transition diagram to the machine code generated after compiling the program. Static analysis of source code is widely used by popular coding IDEs like Eclipse, Android Studio to automatically highlight frequent possible coding errors like Null pointer dereference, dead code indication, illegal memory access, etc.
Moreover, static analysis is used by compilers to detect anomalies and perform certain program optimizations. For instance: Live variable analysis is used for detecting data-flow anomalies like accessing a variable outside its scope; Reaching definition analysis is used for detecting uninitialized memory reads. It is also used by compilers to bring loop invariant out of the loop, which reduces the computation time of the loop. Constant propagation analysis is used by the compiler for performing peephole optimization. Static analysis is also used by automatic documentation generators to make a UML model of the program by analyzing source code.
Techniques
There are numerous ways in which static program analysis is performed. Following list comprises of few widely used static program analysis techniques:
- Abstract interpretation
- Model-checking
- Extended type systems
- Code slicing, control/data flow analysis
- The deduction, symbolic execution
An extensive amount of research work has been done on each of these areas, and covering all the techniques is out of the scope of this blog. We will look at a concise working of each of these techniques.
Abstract interpretation
Abstract interpretation[2] is a method to approximate the values of each variable at every program point in the form of a lattice of either all numbers that a variable can take or in the form of intervals of values that a variable can have. A lattice is always finite and has a single starting and ending node. Due to the finite property of the lattice, we can say for sure that an abstract interpretation of the program will always terminate. A lattice is computed by narrowing down the interval of values a variable can take by iteratively analyzing the program until a fixed point is reached. The Galois connection is an essential notion from the theory of abstract interpretation. It connects two or more abstract domains through abstraction and concretization functions. Galois insertion, a special type of Galois connection, is used to find the transfer function used in data flow analysis.
Model-checking
Model checking is used to create run-time ‘states’ of a program without actually running it. In this, a program(usually presented as a state transition diagram) is checked against a prescribed specification using decision procedures. There are different ways in which a program can be modeled, or there are different ‘meanings’ of states in the state transition diagram. One common way of representing programs is Fair Discrete Systems(FDS). Any fairness-free FDS can be represented as a Kripke structure. Here, a state refers to a particular combination of values that a set of variables can take. Specifications are often modeled using LTL(Linear temporal logic), CTL(computation-tree logic), and CTL*(a combination of CTL and LTL). Decision procedures take these Kripke structures(or FDS) and Specification(in LTL/CTL form) as input and use decision procedures to verify that the specification holds against all program paths. If it is not the case, then a model checker usually returns a counterexample. One of the most famous decision procedure is BDD(Binary decision diagrams).
Extended type systems
A type-based analysis is an analysis in which we use data-types of variables, structures, etc. to reason about a given property of a program. There are usually two types of rules: confinement rule and anonymity rules that are used by type systems; Confinement rules apply to classes in the same package and ensures that private data members of a class are not leaked outside it. Anonymity rules apply to methods applied to a confined class to ensure that they do not leak a reference of ‘this’ pointer.
Code slicing, Control, and Data flow analysis
Control flow and data flow analyses are two standard and easiest ways for static program analysis. In the case of control flow analysis, a call-graph and program dependency graph are generated. Data flow analysis focuses on interprocedural, flow-sensitive/insensitive, path sensitive/insensitive, context-sensitive/insensitive analysis of how data in terms of variable and constants are propagated within the program. For example, Taint analysis is used to find what all user inputs or function parameters(called source) are affecting the function output(called sinks). Code slicing refers to extracting the parts of the source code that affect the value of a given variable. It can be done in both static and dynamic manner.
Symbolic execution, Deduction
Symbolic execution refers to executing the program on a set of symbolic inputs, and then to infer how these symbolic values are propagated, updated within the code. Since the input values are symbolic(not concrete), all feasible program paths are evaluated(which hampers the practicality of this approach). It is a method for white-box testing and can only reason about the partial correctness of the program. One can consider symbolic execution to take place on an infinite abstract domain, and at every program point, it stores the path constraint and symbolic values. The automatic deduction, on the other hand, uses theorem provers to deduce the properties of programs and variables.
Reference:
- Laski, Janusz, and William Stanley. Software verification and analysis: An integrated, hands-on approach. Springer Science & Business Media, 2009.
- https://www.youtube.com/watch?v=j2m5YMnHvQQ , https://www.youtube.com/watch?v=1HfmnS1wx4k
- http://cs.tau.ac.il/~msagiv/courses/sp/symbolic.pdf
- http://web.cs.ucla.edu/~palsberg/paper/paste01.pdf
- http://www-verimag.imag.fr/~mounier/Enseignement/Software_Security/ConcolicExecution.pdf
- http://web.cs.ucla.edu/~palsberg/paper/handbook-chapter07.pdf
- http://web.cs.ucla.edu/~palsberg/paper/paste01.pdf
- https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/nbjorner-fose14.pdf