# A primer on Sparsity: What is it and Why should we care about it?

In mathematics, sparsity refers to the property of having relatively few non-zero elements or structures within a larger space or set. The concept applies to many mathematical objects, such as matrices, graphs, and functions, and it is often exploited for efficiency. For instance, in linear algebra, sparse matrices (matrices in which most elements are zero) are stored and processed differently from dense matrices, using the abundance of zeros to save memory and accelerate computation. In graph theory, sparse graphs (graphs with relatively few edges) can be processed more efficiently with algorithms designed for them.

In this blog post, I will discuss sparsity in the context of machine learning (ML): its sources and types, and how we can leverage it to design faster compilers and hardware architectures.

## Content

1. Sources of Sparsity in Machine Learning
2. Types of Sparsity in Machine Learning
3. Opportunities and Challenges in exploiting Sparsity

## Application-Inherent Sparsity

Application-inherent sparsity arises from characteristics of the input data when ML is deployed in a particular domain. The following are three examples of application-inherent sparsity:

### Recommender System

The source of sparsity in recommender systems stems from the nature of user-item interactions and the vastness of the item space. In many recommendation scenarios, a small subset of items receives the majority of interactions, while a large portion of items receives very few interactions. This leads to a long tail distribution, where most items have sparse user interactions.
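To make the long-tail picture concrete, here is a minimal sketch that measures how full a user-item interaction matrix is. The interaction log, the matrix size, and the helper names `interaction_density` and `item_popularity` are made up for illustration; they do not come from any real recommender dataset.

```python
# Hypothetical interaction log of (user_id, item_id) pairs; all values are made up.
interactions = [(0, 2), (0, 5), (1, 2), (2, 2), (2, 7), (3, 0)]
num_users, num_items = 4, 10


def interaction_density(pairs, n_users, n_items):
    """Fraction of user-item cells that hold at least one interaction."""
    observed = set(pairs)  # deduplicate repeated interactions
    return len(observed) / (n_users * n_items)


def item_popularity(pairs, n_items):
    """Number of distinct users per item, which exposes the long tail."""
    counts = [0] * n_items
    for _, item in set(pairs):
        counts[item] += 1
    return counts


density = interaction_density(interactions, num_users, num_items)
popularity = item_popularity(interactions, num_items)
print(f"density = {density:.2f}")
print("users per item:", popularity)
```

In this toy log one item collects half of all interactions while most items have none, so the matrix is mostly zeros.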

### Object Detection

In LIDAR-based object detection systems, sparsity can arise due to the point cloud generated by LIDAR. LIDAR systems generate point cloud data by measuring the distance to objects using laser pulses. However, the density of point cloud data can vary, resulting in sparse areas where fewer laser measurements are captured. LIDAR-based sparsity is very common in autonomous driving systems.

### NLP and Text Processing

Language has a vast vocabulary, but in any given text, only a small subset of words is used. This leads to sparsity in text representations as most words do not appear frequently. Moreover, word frequencies often follow Zipf’s Law, where a few words (high-frequency) are extremely common, and many rare words occur infrequently. This results in a skewed distribution and sparsity.
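As a small illustration, the sketch below builds a bag-of-words count vector for one sentence. The ten-word vocabulary and the helper `bag_of_words` are assumptions chosen to keep the example readable; a realistic vocabulary would be far larger, and the fraction of zeros would be correspondingly higher.

```python
from collections import Counter

# Toy vocabulary and document, both made up for illustration.
vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast", "slow", "tree"]
document = "the cat sat on the mat"


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts.get(word, 0) for word in vocab]


vector = bag_of_words(document, vocabulary)
sparsity = vector.count(0) / len(vector)
print(vector)
print(f"fraction of zeros = {sparsity:.2f}")
```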

## Sparsity due to the architecture of the ML model

Sparsity can arise from certain operators used in the ML model. The following are examples of sparsity arising from ML operators:

### Dilated and Transposed Convolution

In dilated convolutions, filters are expanded by inserting zeros between the weights (thus, adding sparsity), creating a dilated effect. This allows the network to have a larger receptive field without significantly increasing the number of parameters.
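The zero insertion described above can be sketched as follows. The helper `dilate_kernel` is a hypothetical name, and materializing the zeros explicitly is only for illustration; as far as I know, practical implementations usually skip the corresponding input positions instead of storing the zeros.

```python
def dilate_kernel(kernel, dilation):
    """Insert (dilation - 1) zeros between neighboring weights of a 2-D kernel."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (k_rows - 1) * dilation + 1
    out_cols = (k_cols - 1) * dilation + 1
    dilated = [[0] * out_cols for _ in range(out_rows)]
    for r in range(k_rows):
        for c in range(k_cols):
            dilated[r * dilation][c * dilation] = kernel[r][c]
    return dilated


kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
dilated = dilate_kernel(kernel, dilation=2)
for row in dilated:
    print(row)
```

With a dilation of 2, the 3x3 kernel covers a 5x5 window while still holding only nine weights; the remaining 16 positions are zeros.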

Unlike dilated convolution, the main purpose of transposed convolution is to upsample, that is, to increase the spatial resolution of feature maps. It generates a higher-resolution output feature map by inserting zeros between the input values.
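A one-dimensional sketch of this idea, assuming no padding: the input is first stretched by inserting zeros, then convolved with the kernel. The function names are hypothetical, and this is one way to express the operation rather than how any particular framework implements it.

```python
def zero_insert_1d(signal, stride):
    """Place (stride - 1) zeros between consecutive input values."""
    stretched = [0] * ((len(signal) - 1) * stride + 1)
    for i, value in enumerate(signal):
        stretched[i * stride] = value
    return stretched


def full_convolve_1d(signal, kernel):
    """Full convolution; zero inputs are skipped because they add nothing."""
    out = [0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        if s == 0:
            continue
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def transposed_conv_1d(signal, kernel, stride):
    """Upsample by zero insertion, then convolve with the kernel."""
    return full_convolve_1d(zero_insert_1d(signal, stride), kernel)


stretched = zero_insert_1d([1, 2, 3], stride=2)
upsampled = transposed_conv_1d([1, 2, 3], kernel=[1, 1], stride=2)
print(stretched)  # [1, 0, 2, 0, 3]
print(upsampled)  # [1, 1, 2, 2, 3, 3]
```

The `continue` in `full_convolve_1d` marks exactly the work that sparsity-aware hardware or compilers can avoid.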

### Activation Functions

Some activation functions, like ReLU (Rectified Linear Unit), produce zero outputs for negative inputs. Neurons that consistently receive negative inputs during training will become inactive (output zero), leading to sparsity in the network’s activation patterns.
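A minimal sketch of the effect, using made-up pre-activation values:

```python
def relu(values):
    """Rectified Linear Unit: negative inputs become zero."""
    return [max(0.0, v) for v in values]


# Hypothetical pre-activation values for one layer.
pre_activations = [-1.5, 0.3, -0.2, 2.0, -0.7, 0.0, 1.1, -3.0]
activations = relu(pre_activations)
zero_fraction = activations.count(0.0) / len(activations)
print(activations)
print(f"fraction of zeros = {zero_fraction:.3f}")
```

Here five of the eight outputs are zero, so any later multiplication involving them could in principle be skipped.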

### Attention Mechanism

In models with attention mechanisms, only a subset of input elements may receive significant attention weight, leading to sparse or nearly sparse activation patterns.

## Sparsity due to the training dataset

### Imbalanced Dataset

Class imbalance in datasets can lead to sparsity due to the unequal distribution of instances across different classes. When one or more classes have significantly fewer examples compared to other classes, it results in sparse data representations for those minority classes.

### Missing Data Point

Datasets are often represented as matrices or tables, where rows represent instances and columns represent features. When data points are missing for certain instances and features, the corresponding entries in the matrix become empty. This results in a sparse matrix representation.

### Type Of Data (Categorical Variables and Graphs)

Categorical variables with a large number of categories can lead to sparse one-hot encoded features, especially if certain categories have very few instances. Data with irregular patterns or unstructured data, such as social network graphs, can lead to sparse connections or interactions.
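For one-hot encoding the sparsity is easy to quantify: each encoded row has exactly one non-zero entry, so with K categories the density is 1/K. The category list and the helper `one_hot` below are made up for illustration.

```python
def one_hot(value, categories):
    """Encode a categorical value as a vector with a single 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical categorical feature with eight possible values.
categories = ["red", "green", "blue", "cyan", "pink", "gray", "gold", "teal"]
samples = ["blue", "red", "blue", "gold"]

encoded = [one_hot(s, categories) for s in samples]
nonzeros = sum(sum(row) for row in encoded)
density = nonzeros / (len(encoded) * len(categories))
print(f"density = {density}")  # 1 / 8 = 0.125
```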

## Sparsity due to ML optimization techniques

### Dropout Layers

During training, dropout randomly selects a subset of neurons to be deactivated with a certain probability. This means that the output of those neurons is set to zero for that iteration.
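A minimal sketch of this step is shown below. The rescaling of the surviving activations by 1 / (1 - p), often called inverted dropout, is an assumption on my part and is not described above; the function name and the seed are likewise illustrative.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob (training time only).

    Survivors are scaled by 1 / (1 - drop_prob) so the expected value is
    unchanged; this scaling is an assumption, not part of the text above.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_prob)
    return [0.0 if rng.random() < drop_prob else a * keep_scale
            for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
dropped = dropout([1.0] * 1000, drop_prob=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(f"fraction of zeros = {zero_fraction:.2f}")
```

Because a new random mask is drawn on every iteration, the positions of the zeros change from step to step, which makes this kind of sparsity hard to predict ahead of time.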

### Weight Pruning and Regularization

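Weight pruning sets selected weights of the network to zero, thus introducing sparsity in the weights. This differs from dropout, which temporarily zeroes neuron outputs during training rather than the weights themselves.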
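One common way to choose which weights to remove is by magnitude: the weights closest to zero are assumed to matter least. The sketch below uses that criterion as an assumption, since the text above does not name a specific pruning rule, and the weight values are made up.

```python
def magnitude_prune(weights, sparsity):
    """Zero the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.02, 0.6]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.0, -0.9, 0.0, 0.4, 0.0, 0.6]
```

The zeros land wherever the small weights happen to be, so the resulting pattern is unstructured.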

L1 regularization adds a penalty term to the loss function that is proportional to the absolute values of the model’s weights. As the model is trained with L1 regularization, the optimization process tends to drive many weights towards exactly zero, especially for less influential features or connections, thus introducing sparsity.
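To see why L1 produces exact zeros rather than merely small values, consider soft-thresholding, the proximal step associated with an L1 penalty. It is one standard way to handle the penalty in an optimizer, used here as an illustration and not necessarily how a given framework trains with L1. The weights and the threshold are made up.

```python
def soft_threshold(weight, threshold):
    """Shrink a weight toward zero; anything within the threshold becomes 0."""
    if weight > threshold:
        return weight - threshold
    if weight < -threshold:
        return weight + threshold
    return 0.0


# Hypothetical weights; the threshold plays the role of step size times the
# L1 penalty strength.
weights = [0.8, -0.05, 0.3, -0.9, 0.01]
shrunk = [soft_threshold(w, 0.1) for w in weights]
num_zeros = shrunk.count(0.0)
print(shrunk)
print("weights set exactly to zero:", num_zeros)
```

Large weights are only reduced slightly, while weights smaller in magnitude than the threshold are set exactly to zero.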

### Dimensionality Reduction

Dimensionality reduction algorithms like Sparse PCA, Sparse Factor Analysis, and Sparse Autoencoders introduce sparsity in ML models. For instance, sparse autoencoders explicitly add a sparsity constraint to the optimization process. This encourages the model to learn representations where only a subset of features is activated for each instance.

## Types of Sparsity in Machine Learning

### Unstructured Sparsity

Unstructured sparsity occurs when individual elements are zero or negligible without following any structured or organized arrangement. Applying ReLU, dropout, quantization, or fine-grained pruning induces unstructured sparsity in activations or weights.

### Coarse-Grained Block Sparsity

Coarse-grained structured sparsity refers to a type of sparsity pattern where groups or blocks of elements within a dataset, model, or activations are sparse. It involves larger units of sparse data, such as channels or filters.

Pooling layers in CNNs aggregate information from neighboring units, which can lead to coarse-grained block sparsity.

### Fine-Grained Block Sparsity

Fine-grained block sparsity is a form of sparsity where the degree of sparsity (the ratio of zero to non-zero elements) remains constant across the weights or the activations of the tensor operators.
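One way to read this definition is that every small block keeps the same number of non-zero elements. The sketch below keeps the two largest-magnitude values in every block of four; the 2-of-4 choice, the magnitude criterion, and the helper name `prune_n_of_m` are my assumptions for illustration and are not taken from the definition above.

```python
def prune_n_of_m(weights, n_keep, block_size):
    """Keep the n_keep largest-magnitude weights in each block, zero the rest."""
    if len(weights) % block_size != 0:
        raise ValueError("length must be a multiple of block_size")
    pruned = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Positions of the n_keep largest magnitudes within this block.
        keep = sorted(range(block_size), key=lambda i: abs(block[i]),
                      reverse=True)[:n_keep]
        pruned.extend(block[i] if i in keep else 0.0
                      for i in range(block_size))
    return pruned


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.02, 0.6]
pruned = prune_n_of_m(weights, n_keep=2, block_size=4)
print(pruned)  # [0.8, 0.0, 0.0, -0.9, 0.0, 0.4, 0.0, 0.6]
```

Every block ends up with exactly two zeros, so the density is the same everywhere even though the positions of the zeros differ from block to block.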

### Pattern-based Sparsity

Pattern-based structured sparsity refers to a type of sparsity pattern where specific predefined patterns or structures are enforced on the elements or parameters within a model or representation. These patterns can appear due to domain knowledge, dataset characteristics, or algorithmic requirements.

Pattern-based sparsity can arise by applying convolutional kernels with specific patterns, such as edge detection or texture analysis.

## Benefits of exploiting Sparsity

Exploiting sparsity in ML models offers several benefits that can enhance the efficiency, interpretability, and performance of the models. Here are three key reasons why leveraging sparsity is advantageous:

1. Computational Efficiency:

Sparse models require fewer computations compared to dense models, as operations involving zero elements can be skipped. This results in faster training and inference times, making sparse models well-suited for real-time applications and large datasets.

2. Reduced Memory Footprint:

Sparse models occupy less memory due to the presence of fewer non-zero elements. This is particularly valuable when dealing with resource-constrained environments, such as edge devices or mobile applications, where memory usage is critical.

3. Reduced Communication and Bandwidth Overhead:

Sparse models require sending fewer non-zero parameters during communication between distributed nodes. This significantly reduces the amount of data transferred, minimizing communication overhead and improving overall training or inference speed.
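The computational and memory benefits can be illustrated with a compressed sparse row (CSR) layout, a widely used sparse matrix format. This is a plain-Python sketch with hypothetical helper names and a made-up matrix, not an optimized implementation.

```python
def dense_to_csr(matrix):
    """Store only the non-zeros: values, their columns, and row boundaries."""
    values, col_indices, row_ptr = [], [], [0]
    for row in matrix:
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(col)
        row_ptr.append(len(values))  # where the next row starts in `values`
    return values, col_indices, row_ptr


def csr_matvec(values, col_indices, row_ptr, vector):
    """Matrix-vector product that touches only the stored non-zeros."""
    result = []
    for row in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * vector[col_indices[k]]
        result.append(acc)
    return result


matrix = [
    [5, 0, 0, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]
values, col_indices, row_ptr = dense_to_csr(matrix)
product = csr_matvec(values, col_indices, row_ptr, [1, 2, 3, 4])
print(values, col_indices, row_ptr)
print(product)  # [5, 9, 0, 9]
```

The dense product would perform 16 multiplications, while the CSR version performs 4, one per stored non-zero.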

## Challenges in exploiting Sparsity

While exploiting sparsity in machine learning offers many benefits, there are several challenges that need to be addressed to effectively leverage sparsity for improved model performance and efficiency. Some of these challenges include:

1. Sparse Data Handling:

Dealing with sparse data can require specialized algorithms and data structures, as many standard algorithms assume dense data. Sparse data can also introduce issues like increased memory usage due to the storage of indices.

2. Algorithm Compatibility:

Not all ML algorithms are inherently compatible with sparse data or models. Adapting algorithms to handle sparse representations without sacrificing efficiency can be a challenge.

3. Irregular Memory Access:

Sparse operations can lead to irregular memory access patterns, which might negatively impact cache utilization and memory bandwidth, resulting in reduced computational efficiency.
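The index-storage overhead mentioned in the first challenge can be estimated with simple arithmetic. The sketch below compares a dense layout with a CSR-style layout, assuming 4-byte values and 4-byte indices; those sizes, the matrix shape, and the function names are illustrative assumptions.

```python
def dense_bytes(rows, cols, value_bytes=4):
    """Every element is stored, zero or not."""
    return rows * cols * value_bytes


def csr_bytes(rows, nnz, value_bytes=4, index_bytes=4):
    """One value and one column index per non-zero, plus row pointers."""
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes


rows, cols = 1000, 1000
dense = dense_bytes(rows, cols)
for density in (0.01, 0.25, 0.5):
    nnz = int(rows * cols * density)
    sparse = csr_bytes(rows, nnz)
    verdict = "smaller" if sparse < dense else "larger"
    print(f"density {density:.2f}: CSR {sparse} bytes vs dense {dense} bytes "
          f"({verdict})")
```

Under these assumptions the sparse layout pays off only when fewer than roughly half of the entries are non-zero; above that, the indices cost more than the zeros they replace.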

## Conclusion and References

To conclude, sparsity is very common in ML models, and exploiting it can greatly reduce their computation and storage requirements. However, exploiting sparsity poses several challenges, the primary one being the design and use of algorithms to represent and compute on sparse matrices. In the next blog post, I will discuss several compiler-related techniques for exploiting sparsity.

[1] Dave, Shail, et al. “Hardware acceleration of sparse and irregular tensor computations of ML models: A survey and insights.” Proceedings of the IEEE 109.10 (2021): 1706-1752.
