
Understanding Dependency Graphs for Program Analysis

Hi! In this blog post, I’ll briefly cover several types of dependency graphs used for program analysis, transformation, and representation. Here is the table of contents:

  • Prerequisites
    • Reaching definition analysis
    • Pointer analysis
    • Dominator tree
    • Call graphs
    • Control flow graphs
  • Control dependency graphs
  • Data dependency graphs
  • Program dependency graphs
  • System dependency graphs

Prerequisites

Reaching definition analysis (RDA)

RDA aims to determine the set of variable definitions (assignments) that may (or must) reach a given program point. Every program statement has an Entry set and an Exit set associated with it: the entry set contains all variable definitions reaching that statement, and the exit set is essentially the entry set plus the definition made at the current statement. For instance, consider the code in Figure 1a:

The entry set of statement 1 is RD1entry = {<a, ?>, <b, ?>, <c, ?>, <d, ?>}. Here, <Var, Loc> represents a definition of the variable named Var at location Loc, and ‘?’ means the location is unknown, which is the default value used to initialize RDA. The exit set of statement 1 is RD1exit = {<a, 1>, <b, ?>, <c, ?>, <d, ?>}. The entry and exit sets of every statement are given in the following snippet.

Formally, RDA involves two sets of equations. The first set combines the outputs (exit sets) of a statement's predecessors to determine its entry set, for instance RD2entry = RD1exit and RD4entry = UnionOf( RD3exit, RD5exit ). Note that if a statement has only one predecessor, its entry set equals the exit set of that predecessor, as for statement 2 [b = 2]. If it has multiple predecessors, like statement 4, its entry set is the union or intersection of the exit sets of its predecessors. Whether you take the union (may analysis) or the intersection (must analysis) depends on the kind of analysis you are performing; in this example, I use the may analysis.

RD(i)entry = UnionOf( RD(j)exit ), for all j in predecessors(i).

The second set of equations determines the exit set of a statement. If a statement is not an assignment, i.e., no variable is defined there, then its exit set is the same as its entry set. In the case of an assignment statement like a = a + 1, however, all previous definitions of the assigned variable (here, a) are removed from the entry set and the new definition of ‘a’ is added. For instance, at statement 5, since ‘c’ is redefined there, the previous definition of c, i.e. <c, 3>, is removed and the new definition <c, 5> is added.

RD(j)exit = RD(j)entry, if j is not an assignment statement
RD(j)exit = (RD(j)entry \ {<a, *>}) U {<a, j>}, if j is an assignment statement that redefines the variable ‘a’

Figure 1
//RD1entry = {<a, ?>, <b, ?>, <c, ?>, <d, ?>}
//RD1exit = {<a, 1>, <b, ?>, <c, ?>, <d, ?>}
//RD2entry = {<a, 1>, <b, ?>, <c, ?>, <d, ?>}
//RD2exit = {<a, 1>, <b, 2>, <c, ?>, <d, ?>}
//RD3entry = {<a, 1>, <b, 2>, <c, ?>, <d, ?>}
//RD3exit = {<a, 1>, <b, 2>, <c, 3>, <d, ?>}
//RD4entry = {<a, 1>, <a, 4>, <b, 2>, <c, 3>, <c, 5>, <d, ?>}
//RD4exit = {<a, 4>, <b, 2>, <c, 3>, <c, 5>, <d, ?>}
//RD5entry = {<a, 4>, <b, 2>, <c, 3>, <c, 5>, <d, ?>}
//RD5exit = {<a, 4>, <b, 2>, <c, 5>, <d, ?>}
//RD6entry = {<a, 1>, <a, 4>, <b, 2>, <c, 3>, <c, 5>, <d, ?>}
//RD6exit = {<a, 1>, <a, 4>, <b, 2>, <c, 3>, <c, 5>, <d, 6>}

Note:

In reaching definition analysis, we only consider unambiguous assignment statements. Statements like *ptr = 1 or **ptr = 1 are ambiguous and are therefore not considered.
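To make the equations concrete, here is a minimal Python sketch of the iterative may analysis that produces the sets listed in Figure 1. Since the figure itself is an image, the statement numbering, the variables defined at each statement, and the CFG edges (a loop whose body is statements 4 and 5) are my assumptions about its code; only the resulting entry/exit sets are taken from the snippet above.

# Reaching-definitions may analysis, iterated to a fixpoint.
# Assumed shape of the Figure 1 program (statement numbering, CFG edges, and
# defined variables are guesses, since the figure is an image):
#   1: a = ...   2: b = ...   3: c = ...
#   while ...:   4: a = ...   5: c = ...
#   6: d = ...

def_of = {1: "a", 2: "b", 3: "c", 4: "a", 5: "c", 6: "d"}       # variable defined at each statement
preds  = {1: [], 2: [1], 3: [2], 4: [3, 5], 5: [4], 6: [3, 5]}  # CFG predecessors

init  = {(v, "?") for v in "abcd"}    # '?' = unknown location, the initial value
entry = {n: set() for n in def_of}
exit_ = {n: set() for n in def_of}

changed = True
while changed:
    changed = False
    for n in sorted(def_of):
        # RD(n)entry = union of RD(p)exit over all predecessors p (may analysis)
        new_entry = init.copy() if not preds[n] else set()
        for p in preds[n]:
            new_entry |= exit_[p]
        # RD(n)exit: kill the old definitions of the redefined variable, add the new one
        var = def_of[n]
        new_exit = {d for d in new_entry if d[0] != var} | {(var, str(n))}
        if new_entry != entry[n] or new_exit != exit_[n]:
            entry[n], exit_[n], changed = new_entry, new_exit, True

for n in sorted(def_of):
    print(f"RD{n}entry = {sorted(entry[n])}   RD{n}exit = {sorted(exit_[n])}")

The loop converges after a couple of passes, and the printed sets match the ones listed above (modulo ordering).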

Pointer Analysis

Broadly, there are three types of pointer analysis:

  1. Alias analysis: It aims to answer whether two given pointers are aliases of each other.
  2. Points-to analysis: It aims to determine bindings of the form p->x, where p is a pointer that points to the memory location x. As you might have correctly guessed, points-to analysis can also be used to answer alias-related queries.
  3. Storage shape analysis: It aims to determine the shape of data structures (on the heap) through pointer access patterns.

In general, pointer analysis is undecidable, and there are a few approximation algorithms whose worst-case complexity ranges from linear to exponential. The two de facto pointer analysis approaches are Andersen’s and Steensgaard’s. For more information on these topics, please refer to Dr. Le’s lecture notes: here.
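To give a flavour of how points-to analysis works, below is a tiny inclusion-based (Andersen-style) sketch in Python over the four classic constraint forms: address-of (p = &x), copy (p = q), load (p = *q), and store (*p = q). The constraint set is an illustrative assumption of mine, not an example from the lecture notes, and a real implementation would propagate over a constraint graph with a worklist instead of this naive fixpoint.

from collections import defaultdict

# Assumed toy constraints:  p = &x;  q = &y;  r = &p;  p = q;  s = *r;  *r = q
addrs  = [("p", "x"), ("q", "y"), ("r", "p")]   # lhs = &rhs
copies = [("p", "q")]                           # lhs = rhs
loads  = [("s", "r")]                           # lhs = *rhs
stores = [("r", "q")]                           # *lhs = rhs

pts = defaultdict(set)                          # points-to sets
for lhs, rhs in addrs:
    pts[lhs].add(rhs)

def include(dst, src):
    """Make pts(dst) include pts(src); return True if pts(dst) grew."""
    before = len(pts[dst])
    pts[dst] |= pts[src]
    return len(pts[dst]) != before

changed = True
while changed:                                  # naive fixpoint
    changed = False
    for lhs, rhs in copies:                     # pts(lhs) includes pts(rhs)
        changed |= include(lhs, rhs)
    for lhs, rhs in loads:                      # pts(lhs) includes pts(a) for each a in pts(rhs)
        for a in list(pts[rhs]):
            changed |= include(lhs, a)
    for lhs, rhs in stores:                     # pts(a) includes pts(rhs) for each a in pts(lhs)
        for a in list(pts[lhs]):
            changed |= include(a, rhs)

print(dict(pts))   # p -> {x, y} and s -> {x, y}, so p and s may alias

Since pts(p) and pts(s) overlap, an alias query such as “may p and s alias?” would be answered yes, which is how points-to results double as alias information.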

Dominator tree

Figure 2

Pre-Dominator: In a given directed graph G, a node D is said to pre-dominate a node E if every path from the root node to E passes through D.
A node D is called an immediate pre-dominator of E iff D pre-dominates E and all other pre-dominators of E pre-dominate D.

Post-Dominator: In a given directed graph G, a node D is said to post-dominate a node E if every path from E to the end of the function passes through D.
A node D is called an immediate post-dominator of E iff D post-dominates E and all other post-dominators of E post-dominate D. The post-dominator tree is also known as the inverse dominator tree or forward dominator tree.

For instance, consider the code given in Figure 2, along with its control flow graph (CFG). Notice in the CFG that every path from the root node to node 10 passes through nodes 5, 4, 2, and 1. Thus, these nodes are pre-dominators of node 10, and the immediate pre-dominator of 10 is 5. Also notice that all paths from 4 to the exit node pass through 5 and 10; thus, these are the post-dominators of 4. The immediate post-dominator of 4 is 5, and that of 5 is 10.

While constructing the pre-dominator tree, the root is always the first node, here 1. However, while constructing the inverse/forward dominance tree, the root is the exit node of the function/program. Also, note that before constructing the pre-/post-dominator trees, the function/program is brought into SESE (Single Entry, Single Exit) form.
You can learn more about dominator trees from here and here.
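As a small, concrete companion, here is a Python sketch of the classic iterative pre-dominator computation. Because Figure 2 is an image, the CFG below is an assumed one that merely matches the relationships described above (nodes 1, 2, 4, and 5 pre-dominate node 10); it is not the figure's actual graph. Running the same iteration on the reversed edge list, rooted at the exit node, yields the post-dominators.

# Iterative dominator computation: dom(n) = {n} union (intersection of dom(p) over preds p).
edges = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (5, 4), (5, 10)]   # assumed CFG
nodes = {n for e in edges for n in e}
preds = {n: [a for (a, b) in edges if b == n] for n in nodes}
entry = 1

dom = {n: set(nodes) for n in nodes}   # start from "every node dominates every node"
dom[entry] = {entry}                   # ... except the entry, which dominates only itself

changed = True
while changed:
    changed = False
    for n in nodes - {entry}:
        new = {n} | set.intersection(*(dom[p] for p in preds[n]))
        if new != dom[n]:
            dom[n], changed = new, True

print(sorted(dom[10]))   # pre-dominators of 10 (including 10 itself): [1, 2, 4, 5, 10]

The immediate pre-dominator of a node is then the unique strict pre-dominator that is itself pre-dominated by all the others (5 for node 10 in this assumed graph).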

Call Graphs

Figure 3

As you might have correctly guessed, call graphs are directed graphs with functions as nodes and edges of the form <Fun1, Fun2>, where Fun1 calls Fun2. For instance, consider the example in Figure 3. Here, the order of function calls is immaterial, and while constructing call graphs, we may or may not include system function or API calls; in this example, read() and lorem() are system-defined functions. Generating call graphs might seem to be an easy problem, but in general it’s not. In fact, generating call graphs for programs written in object-oriented (OO) and functional languages is an open research problem, mainly because of dynamic dispatch/polymorphism and function pointers, which make it difficult to determine the target of a function call. The following four analysis methods are used for call graph generation of OO programs: (1) Class Hierarchy Analysis (CHA), (2) Rapid Type Analysis (RTA), (3) 0-CFA (context-insensitive analysis), and (4) k-CFA (context-sensitive analysis). k-CFA is the most precise, but as the value of k increases, the tradeoff between precision and scalability becomes noticeable.

Please refer to the following article for more info on call graph generation of OO programs.

Frank Tip and Jens Palsberg. 2000. Scalable propagation-based call graph construction algorithms. SIGPLAN Not. 35, 10 (October 2000), 281–293. DOI: https://doi.org/10.1145/354222.353190
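As a toy illustration of the easy case, the sketch below builds a call graph for a small Python snippet by recording only direct name(...) calls. The snippet is an assumed stand-in for the program in Figure 3 (only lorem() is borrowed from the figure's description), and this naive approach deliberately sidesteps exactly the hard cases mentioned above: dynamic dispatch, function pointers/higher-order calls, and calls made through attributes.

import ast

src = """
def lorem():
    pass

def baz():
    lorem()

def bar():
    baz()

def foo():
    bar()
    baz()

def main():
    foo()
    bar()
"""

tree = ast.parse(src)
call_graph = {}
for fn in tree.body:
    if isinstance(fn, ast.FunctionDef):
        # record every direct call `name(...)` appearing anywhere in the body
        callees = {node.func.id
                   for node in ast.walk(fn)
                   if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
        call_graph[fn.name] = sorted(callees)

for caller, callees in call_graph.items():
    for callee in callees:
        print(f"{caller} -> {callee}")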

Control Flow Graph

Figure 4

A CFG is a directed graph with basic blocks (or individual instructions) as nodes, where an edge <Node1, Node2> between two nodes represents that control can flow from Node1 to Node2. It considers all possible control flow paths. A CFG can be generated for a single procedure or for an entire program. For instance, consider the example in Figure 4:

There can be multiple exit points within a procedure, as shown here in the case of the function foo. While constructing an interprocedural CFG, all reachable exit points map back to the call site (statement 4 in the main() function).
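The sketch below hand-builds such an interprocedural CFG for an assumed toy program in the spirit of Figure 4: main() calls foo() at statement 4, foo() has two return statements, and both reachable exits map back to the return point of the call site. The node labels and edges are my assumptions, since the original figure is an image.

# Interprocedural CFG as an adjacency list (assumed toy program).
cfg = {
    "main:1": ["main:2"],
    "main:2": ["main:3"],
    "main:3": ["main:4"],
    "main:4": ["foo:entry"],              # call edge: control enters the callee
    "foo:entry": ["foo:if"],
    "foo:if": ["foo:ret1", "foo:ret2"],   # two exit points inside foo()
    "foo:ret1": ["main:4-return"],        # every reachable exit maps back ...
    "foo:ret2": ["main:4-return"],        # ... to the call site's return point
    "main:4-return": ["main:exit"],
    "main:exit": [],
}

def paths(node, path=()):
    """Enumerate all control flow paths from `node` (this CFG is acyclic)."""
    path = path + (node,)
    if not cfg[node]:
        yield path
    for succ in cfg[node]:
        yield from paths(succ, path)

for p in paths("main:1"):
    print(" -> ".join(p))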


Control Dependency Graph(CDG)

Figure 5

The CDG is another commonly used representation of control flow information; it combines information from the control flow graph and the forward dominator tree (FDT). If statement X is control dependent on statement Y, then Y decides whether X will be executed or not. Formally, statement X is control dependent on statement Y iff:

  1. X is not a post-dominator of Y, i.e., there exists a path from Y to the function exit that doesn’t contain X.
  2. There exists a path P from Y to X such that every in-between node on P is post-dominated by X.

CDG = CFG + FDT. Consider the example given in Figure 5: 7 is control dependent on 5, since 7 is not a post-dominator of 5 and there is no node in between 5 and 7.

Is 8 control dependent on 5, given that 8 is not a post-dominator of 5?
No, because on the path 5 -> 7 -> 8, 8 is not a post-dominator of 7 (condition 2 fails).

Is 10 control dependent on 7? No, since 10 is a post-dominator of 7 (condition 1 fails).
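Both conditions can be checked mechanically from post-dominator sets. The sketch below does this on an assumed CFG that is merely consistent with the discussion of Figure 5 (5 branches to 6 or 7, 7 branches to 8 or 9, and 10 is the exit; the real figure may differ). It computes post-dominators by running the dominator iteration on the reversed graph and then applies the standard criterion: X is control dependent on Y iff Y has a successor that X post-dominates (reflexively) while X does not strictly post-dominate Y.

succ = {5: [6, 7], 6: [10], 7: [8, 9], 8: [10], 9: [10], 10: []}   # assumed CFG
nodes = set(succ)
exit_node = 10

# Post-dominators = dominators of the reversed CFG, rooted at the exit node.
pdom = {n: set(nodes) for n in nodes}
pdom[exit_node] = {exit_node}
changed = True
while changed:
    changed = False
    for n in nodes - {exit_node}:
        new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
        if new != pdom[n]:
            pdom[n], changed = new, True

# X is control dependent on Y iff some successor S of Y is post-dominated by X
# (condition 2) and X does not strictly post-dominate Y (condition 1).
cdeps = {(x, y)
         for y in nodes for x in nodes
         for s in succ[y]
         if x in pdom[s] and x not in (pdom[y] - {y})}

print(sorted(cdeps))   # [(6, 5), (7, 5), (8, 7), (9, 7)]

On this graph the result agrees with the answers above: (7, 5) is reported, while neither (8, 5) nor (10, 7) is.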

Data Dependency Graph(DDG)

Figure 6

A DDG is a directed graph with respect to a variable ‘Var’: program statements are its nodes, and an edge of the form <A, B> means that in the CFG there is at least one path P from A to B and:

  1. A defines ‘Var’ and B uses it, or
  2. B defines ‘Var’ and A uses it, or
  3. A and B both define ‘Var’,

with no redefinition of ‘Var’ on P. Consider the example in Figure 6. Note that the edge between 1 and 5, labeled with ‘a’, means that 5 uses the definition of ‘a’ from node 1. Since a DDG is always with respect to a variable, I find it necessary to label edges with the variable name. Please refer to this article for more info on this topic.

Apart from DDGs, data usage/dependence relationships between program statements can also be represented by def-use chains, SSA form, and Value Dependence Graphs (VDG).
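Given reaching-definitions results, extracting the data-dependence edges is mostly bookkeeping: for every variable a statement uses, add an edge from each definition of that variable in the statement's entry set. The sketch below does this for the Figure 1-style example from earlier (not for Figure 6 itself); the use sets and the transcribed entry sets are my assumptions.

# Assumed uses per statement (e.g. 4: a = a + b, 5: c = c + a, 6: d = a + c).
uses = {1: set(), 2: set(), 3: set(), 4: {"a", "b"}, 5: {"a", "c"}, 6: {"a", "c"}}

# Entry sets as listed under Figure 1 (<d, ?> pairs omitted; '?' definitions add no edge).
entry = {
    4: {("a", 1), ("a", 4), ("b", 2), ("c", 3), ("c", 5)},
    5: {("a", 4), ("b", 2), ("c", 3), ("c", 5)},
    6: {("a", 1), ("a", 4), ("b", 2), ("c", 3), ("c", 5)},
}

ddg_edges = set()
for n, used_vars in uses.items():
    for var, def_site in entry.get(n, set()):
        if var in used_vars:
            ddg_edges.add((def_site, n, var))   # def_site defines var, n uses it

for src, dst, var in sorted(ddg_edges):
    print(f"{src} --{var}--> {dst}")

Self-edges such as 4 --a--> 4 are loop-carried dependences: the definition from one iteration reaches the use in the next.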

Program Dependency Graphs (PDG)

Figure 7

PDGs are directed graphs with program statements as nodes and two types of edges: one representing control dependences and the other data dependences. A PDG essentially combines the CDG and the DDG in a single graph. PDGs make several compiler optimizations and program transformations much easier. You can refer to the following publication for the detailed advantages of PDGs: The Program Dependence Graph and Its Use in Optimization.

 

Consider the example in figure 7.
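Structurally, a PDG is just the two edge sets kept together in one labeled graph, which is what the short sketch below does. The handful of edges are made-up placeholders for illustration, not the actual contents of Figure 7.

# Control-dependence edges (controller, controlled) and data-dependence edges
# (defining statement, using statement, variable); both sets are illustrative.
control_deps = {(5, 7), (7, 8)}
data_deps    = {(1, 5, "a"), (3, 7, "c")}

pdg_edges = [(y, x, "control") for (y, x) in control_deps] + \
            [(d, u, f"data:{v}") for (d, u, v) in data_deps]

for src, dst, label in sorted(pdg_edges):
    print(f"{src} --[{label}]--> {dst}")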

System Dependency Graphs (SDG)

SDGs are another intermediate representation that captures data and control flow across multiple procedures. They are essentially interprocedural PDGs with extra nodes for inter-procedural parameter passing and extra edges for inter-procedural control transfer.

Figure 8

There are three types of edges: (1) call edges between a call site and the callee’s entry point, (2) edges for parameter passing from caller to callee, and (3) edges for parameter passing from callee to caller. SDGs include both application procedures and system procedures. Consider the example in Figure 8; here, the system procedure read() is not shown. SDGs are context, flow, and object sensitive, and they make inter-procedural analysis much more precise and ‘somewhat’ scalable.
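To make the extra plumbing concrete, here is a hand-built sketch of the SDG nodes and edges around a single call, main() -> foo(x), with one parameter passed in and one value returned. The actual-in/actual-out and formal-in/formal-out node names are my own illustrative conventions, not the notation used in Figure 8.

# SDG fragment for one call site: parameter-in/out nodes plus the three edge kinds.
sdg_edges = [
    ("main.call_foo",      "foo.entry",           "call"),       # (1) call edge
    ("main.actual_in:x",   "foo.formal_in:x",     "param-in"),   # (2) caller -> callee
    ("foo.formal_out:ret", "main.actual_out:ret", "param-out"),  # (3) callee -> caller
    ("main.entry",         "main.call_foo",       "control"),    # ordinary PDG edges ...
    ("main.call_foo",      "main.actual_in:x",    "control"),    # ... inside each procedure
    ("foo.entry",          "foo.formal_in:x",     "control"),
]

for src, dst, kind in sdg_edges:
    print(f"{src} -[{kind}]-> {dst}")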

 
