In the previous post, we looked at different types and design objectives of high-level IRs used by ML compilers. Today, we are going to look at different optimizations that are performed using high-level IRs.These optimizations can be performed either offline or online. In online mode, the optimizations are done before performing the inference, while in offline mode, the runtime saves the optimized graph to disk.

In this post, I have attempted to further extend Li et al.’s [1] classification of high-level IR optimizations into five categories to the best of my knowledge.

## Content

- Overview
- Node-level Optimizations
- Block-level Optimizations
- Data flow-level Optimizations
- Quantization-related Optimizations
- Hardware-specific Optimizations
- References

## Overview

Figure 1 shows the five categories of graph optimizations. **These categories are neither mutually-exclusive nor exhaustive** – some optimizations can belong to two or more categories. For instance, several quantization-based optimizations require block-level analysis and are dependent on the underlying hardware. Nevertheless, I hope this categorization will be sufficient to provide a big-picture view of various concepts and techniques used for optimizing ML models.

The first, and perhaps the most straightforward, type of optimization is Node-level optimizations, where we optimize each node (Tensor operator) in isolation from the rest of the ML model. An example of node-level optimization is the removal of “No-op” nodes like a ‘sum’ node with just one input or with one input as zero. Contrary to the node level, block-level optimizations attempt to optimize a block or a group of adjacent nodes. For instance, Operator fusion – where we merge two or more adjacent operators – is a type of block optimization. The third type of optimization is data-flow optimization, where we analyze and understand the data flow within the ML model, and based on that, we attempt to optimize the model without changing the semantics of the model. An example of data-flow level optimization is dead code elimination, where we analyze the data flow of the ML model to identify and remove dead code (code will not be executed under any circumstance during the model’s runtime). Hardware-specific optimizations, as its name suggests, optimizes the ML model by leveraging additional information about the underlying hardware platform. For example, TensorRT’s Kernel autotuning replaces tensor operators with a version that’s fine-tuned for the underlying hardware.

Different parts of the ML model contain floating-point values with different dynamic ranges. In some parts (for example. Softmax and the normalize layer), the values are between 0 and 1, therefore selecting a single data type can be imprecise as a courser data type can inaccurately describe smaller values, while a finer data type (like float 16) might end up truncating larger values. Moreover, the ML model light contains rescaling nodes that change the dynamic range of tensors for some parts of the ML model. Quantization-related optimizations focus on optimizing the ML model with varying dynamic ranges to make the model more efficient. An example of quantization-related optimization is the folding and addition of rescaling nodes.

### Node Level Optimizations

### Constant Folding

The goal of constant folding is to **evaluate expressions involving constants at compile time** and replace them with their computed values, reducing the need for runtime computations.

Specifically, constant folding involves identifying operations or nodes in the graph where all input values are constants. Instead of executing these operations at runtime, the compiler performs the computations during the graph optimization phase and replaces the operation node with a new constant node containing the computed result.

### Operator Strength Reduction

Operator strength reduction aims to **replace computationally expensive operations** with equivalent, less expensive operations.

**Example**:

Original operation: `y = x / 2`

Strength-reduced operation: `y = x * 0.5`

### No-op Node Elimination

This optimization aims to **identify and eliminate “No-op” tensor operators**. No-op tensor operators refer to ML operators that do not do any meaningful computation, i.e., are identity functions. For example, a scalar multiply operation with the scalar being 1 is an identity function, and this optimization removes these No-op nodes from the computation graph.

### Zero Dimension Tensor Elimination

Zero-dimensional tensor elimination is a graph optimization technique aimed at r**emoving tensors with zero dimensions** from the computation graph. Zero-dimensional tensors represent scalars, and eliminating them simplifies the graph and reduces unnecessary overhead associated with tensor operations.

Example: Original computation graph:

`Input: x (shape: [])`

Operation: y = x * 2

After zero-dimensional tensor elimination:

`Input: x (scalar)`

Operation: y = x * 2

## Block Level Optimizations

### Algebraic Simplification

Block-level algebraic simplification involves applying various mathematical identities, properties, and transformations to simplify operations. The goal is to replace complex expressions with simpler forms that are equivalent in terms of their mathematical results. These optimizations leverage the commutativity, associativity, and distributivity of different tensor operators to generate efficient, semantically-equivalent code.

### Operator Fusion

In Operator fusion, multiple consecutive tensor operations or nodes in a computation graph are combined into a single fused operation. This optimization aims to reduce memory transfers, improve cache utilization, and minimize overhead by executing multiple operations together as a single unit.

Here’s an example:

Original computation graph:`Operation 1: y = x * 2`

`Operation 2: z = y + 3`

```
Operation 3: w = z / 4
```

After operator fusion:

```
Fused Operation: w = (x * 2 + 3) / 4
```

In this example, the consecutive operations `x * 2`

, `y + 3`

, and `z / 4`

are fused into a single operation `(x * 2 + 3) / 4`

. This **reduces the need for intermediate memory storage, minimizes memory transfers, and improves cache locality** by executing these operations together.

Operator fusion can be particularly effective on hardware with limited memory bandwidth, like GPUs, as it reduces the amount of data movement between memory and computation units. It also reduces the overhead of launching multiple kernels for individual operations.

### Operator Sinking

Operator sinking, also called allocation sinking in Glow, involves moving computations closer to their consumers within a group of operations. This optimization aims to improve memory access patterns and reduce intermediate storage, leading to better cache utilization and potentially faster execution.

## Data flow-level Optimizations

### Common Sub-expression Elimination

Common sub-expression elimination (CSE) identifies and eliminates redundant computations. In CSE, if an expression E has been computed before and hasn’t changed since, its value is stored, avoiding recomputation. This value is then used to replace instances of E throughout the computation graph, reducing redundant calculations. For example, if E = `a + b`

, and its value is computed earlier as `x`

, subsequent occurrences of E are replaced by `x`

, preventing duplicate calculations and enhancing overall efficiency of the ML model.

### Dead Code Elimination

Dead code elimination (DCE) **identifies and removes code segments whose outcomes or effects are unused**. DCE aims to improve efficiency by eliminating computations that don’t impact the final result. Dead code is often generated by earlier optimizations and is not typically the programmer’s doing. **DCE, like common sub-expression elimination (CSE), is performed after other graph optimizations.** Furthermore, DCE encompasses optimizations like dead store elimination, which discards unnecessary tensor stores that won’t be utilized. These optimizations collectively enhance the effectiveness and streamline the computation graph by targeting redundant or unused operations.

### Layout Transformations

Layout transformation optimizes data layouts of tensors in the computation graph to enhance memory access and cache utilization. It introduces layout transformation nodes in the graph, but the actual transformation occurs during computation. Different hardware benefits from specific layouts; e.g., NCHW on GPUs. Compilers consider hardware-specific libraries, complex layouts, and heterogenous units. However, layout transformations impact performance and resource usage, demanding careful consideration.

## Quantization-related Optimizations

Various parts of the ML model consist of floating-point values within distinct ranges. In certain segments, numbers usually fall between zero and one, whereas in other segments, the potential range extends into the hundreds. Opting for a uniform conversion scale across the entire network would be ineffective, as a solitary scale value might lack precision for small values and might lead to truncation for larger ones. And that’s why quantization-based optimizations exploit quantization opportunities in the ML model, thus making it consume fewer resources and execute faster.

### Rescale Node folding and addition

A rescale operator is a mathematical operation that adjusts the scale or range of data values. Different frameworks have different names for a rescale operation: in ONNX, it is referred to as QuantizeLinear, while in Glow it is referred to as RescaleQuantized.

Quite often rescale operators can be folded into the previous numerical computation node, which further reduce memory accesses. Moreover, along with folding, additional rescale nodes can be added to the ML model to adapt to the underlying hardware or simplify certain operations. As an illustration, take the case of ‘max’ operations. When we transform both operands of the ‘max’ operation onto a consistent scale, it enables the hardware to execute a straightforward comparison. Through the normalization of both sides of the ‘max’ operation to a uniform scale, we facilitate the utilization of this efficient optimization [2].

### Constant Quantization

This optimization involves replacing the operation “Quantize(Constant)” with a simple updated constant that holds the quantized weights. This means that a constant value is transformed into a quantized version at compile time, allowing for more efficient computation.

### Profile-Guided Quantization

Profile-Guided Optimizations were first described in Glow’s [2] research paper. In this optimization the authors employ profile-guided insights to predict potential numerical ranges for each stage of the neural network. The quantization process operates through a dual-phase approach. Initially, the network is augmented with specialized profiling nodes to track activation ranges, followed by network optimization encompassing these nodes. After inference, the network is recompiled based on the profile data, leading to its transformation into a quantized structure. This allows for static optimization of the quantized graph. The conversion involves isolating segments of the network for integer computation, with the objective of generating outcomes within the range produced by the original floating-point network.

## Hardware-Specific Quantization

## Kernel Auto-Tuning

Kernel auto-tuning involves automatically **searching and selecting the best configuration for a specific computational kernel** in a ML model. It aims to find the optimal configuration that yields the highest performance on a given hardware platform. This optimization is crucial because different hardware platforms, GPUs, TPUs, etc., have different architectures and memory hierarchies. What’s optimal for one platform might not be optimal for another. Auto-tuning ensures that the kernel’s parameters are adjusted to exploit the specific characteristics of the target hardware, achieving the best possible performance.

## Memory Planning

Memory planning optimizations **aims at reducing memory allocation, accesses, and exploit data reuse**. Broadly speaking, this involves memory reuse, alignment, Tensor compression and decompression, exploiting memory hierarchies (like Cache in Streaming multiprocessors), memory pooling and temporary buffer elimination.

Effective memory planning optimizations can lead to significant improvements in the overall performance of machine learning models, especially on hardware with limited memory resources or complex memory hierarchies.These optimizations can be done at both Graph IRs and low-level IRs within ML compilers.

## Multi & Lazy Stream Execution

Multi-stream optimization involves **executing multiple independent computations or streams concurrently on the same hardware** device. Each stream can correspond to a different neural network layer, batch of data, or operation. This optimization takes advantage of the device’s parallelism capabilities to overlap computation and memory transfer operations, reducing idle time and improving overall throughput.

On the other hand, Lazy stream optimization focuses on **deferring the execution of operations until they are actually needed**. It’s particularly useful for operations that might be conditionally executed based on the program’s logic or user input. By postponing the execution of certain operations until the last possible moment, memory resources can be conserved, and unnecessary computations can be avoided.

**References**

**[1]** Li, Mingzhen, et al. “The deep learning compiler: A comprehensive survey.” *IEEE Transactions on Parallel and Distributed Systems* 32.3 (2020): 708-727.

**[2]** Rotem, Nadav, et al. “Glow: Graph lowering compiler techniques for neural networks.” *arXiv preprint arXiv:1805.00907* (2018).

**[3]** Optimizations in Glow: https://github.com/pytorch/glow/blob/master/docs/Optimizations.md

**[4]** Optimizations in Torch: https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/OVERVIEW.md#pre-derivative-optimization

**[5]** Optimizations in nGraph: https://intel.github.io/intel-extension-for-pytorch/latest/tutorials/features/graph_optimization.html

**[6]** ONNX Operators: https://github.com/onnx/onnx/blob/main/docs/Operators.md

**[7]** ONNX Runtime Graph Optimizations: https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html

[8] TVM Optimizations: https://tvm.apache.org/docs/how_to/optimize_operators/index.html

### Share this:

- Click to share on Twitter (Opens in new window)
- Click to share on Facebook (Opens in new window)
- Click to share on Telegram (Opens in new window)
- Click to share on LinkedIn (Opens in new window)
- Click to share on Reddit (Opens in new window)
- Click to share on Pocket (Opens in new window)
- Click to email a link to a friend (Opens in new window)
- Click to share on WhatsApp (Opens in new window)
- Click to print (Opens in new window)