Namaste 🙏,
I’m Udit Agarwal, a software developer with an interest in compilers, machine learning, and program analysis. I hold a Master’s degree in Computer Engineering from the University of British Columbia, Vancouver. Currently, I work as a compiler developer at Intel in Vancouver.
Below, you can find all my publications and open-source projects.
Skills: Compilers, LLVM, Machine Learning, C++, Python, Linux, GPU Programming
~~ Career Timeline ~~
[Masters Thesis] Resilience assessment of machine learning applications under hardware faults
Abstract: Machine learning (ML) applications have been ubiquitously deployed across critical domains such as autonomous vehicles (AVs) and medical diagnosis. Vision-based ML models like ResNet are used for object classification and lane detection, while Large Language Models (LLMs) like ChatGPT are used to enable robust and flexible voice commands in AVs. Deploying ML models in such safety-critical scenarios requires them to be reliable. In the first part of this thesis, we primarily focus on understanding the resilience of ML models against transient hardware faults in CPUs. Towards this end, we present an LLVM IR-level fault injection (FI) tool, LLTFI, which we use to evaluate the effect of transient faults on Deep Neural Networks (DNNs) and LLMs. We found that LLTFI is more precise than TensorFI, an application-level FI tool proposed by prior work. Unlike LLTFI, TensorFI underestimates the resilience of DNNs by implicitly assuming that every injected fault corrupts the outputs of the intermediate layers of the DNN. Using LLTFI, we also evaluated the efficacy of Selective Instruction Duplication to make DNNs more resilient against transient faults. While in the case of DNNs, transient faults cause the model to misclassify or mispredict the object, for LLMs, we found transient faults to cause the model to produce semantically and syntactically incorrect outputs. In the second part of this thesis, we evaluate the effect of permanent stuck-at faults in systolic arrays on DNNs. We present a Register Transfer Level (RTL) FI tool, called SystoliFI, to inject permanent stuck-at faults into the systolic array, which we use to understand how stuck-at faults in systolic arrays manifest in the intermediate layers of DNNs. We found that the manifestation of stuck-at faults varies significantly with the type of operation (convolution vs. matrix multiplication), the operation size, and the systolic array size.
[JPDC’23] Mixed Precision Support in HPC Applications: What About Reliability?
Alessio Netti, Yang Peng, Patrik Omland, Michael Paulitsch, Jorge Parra, Gustavo Espinosa, Udit Agarwal, Abraham Chan, and Karthik Pattabiraman, To appear in the Journal of Parallel and Distributed Computing (JPDC). [ PDF ] (code)
Abstract: In their quest for exascale and beyond, High-Performance Computing (HPC) systems continue becoming ever larger and more complex. Application developers, on the other hand, leverage novel methods to improve the efficiency of their own codes: a recent trend is the use of floating-point mixed precision, or the careful interlocking of single- and double-precision arithmetic, as a tool to improve performance as well as reduce network and memory boundedness. However, while it is known that modern HPC systems suffer hardware faults at daily rates, the impact of reduced precision on application reliability is yet to be explored. In this work we aim to fill this gap: first, we propose a qualitative survey to identify the branches of HPC where mixed precision is most popular. Second, we show the results of instruction-level fault injection experiments on a variety of representative HPC workloads, comparing vulnerability to Silent Data Errors (SDEs) under different numerical configurations. Our experiments indicate that use of single and mixed precision leads to comparatively more frequent and more severe SDEs, with concerning implications regarding their use on extreme-scale, fault-prone HPC platforms.
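As a rough illustration of why reduced precision can make silent data errors more severe, consider flipping the same low-order mantissa bit in a binary32 vs. a binary64 value. This is a minimal sketch of the underlying intuition, not the paper's instruction-level FI methodology:

```python
import struct

def flip_bit_f32(x: float, bit: int) -> float:
    """Flip one bit of x's IEEE-754 binary32 encoding."""
    (i,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", i ^ (1 << bit)))
    return y

def flip_bit_f64(x: float, bit: int) -> float:
    """Flip one bit of x's IEEE-754 binary64 encoding."""
    (i,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", i ^ (1 << bit)))
    return y

x = 1.5
# Flip mantissa bit 10 in each format. binary32 has a 23-bit mantissa and
# binary64 a 52-bit one, so the same physical fault is 2^(52-23) = 2^29
# times larger, relative to the value, in single precision.
err32 = abs(flip_bit_f32(x, 10) - x)   # 2**-13
err64 = abs(flip_bit_f64(x, 10) - x)   # 2**-42
```

A random bit flip in a single-precision value therefore tends to land on a more significant bit than in double precision, which is consistent with the paper's finding that single and mixed precision lead to more frequent and more severe SDEs.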
🏆 [SELSE’23] 🏆 Towards Reliability Assessment of Systolic Arrays against Stuck-at Faults
Udit Kumar Agarwal, Abraham Chan, Ali Asgari, and Karthik Pattabiraman. 19th IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE), 2023. Received Best-of-SELSE award (one of three papers). [ PDF | Presentation ] (Code)
Abstract: Neural networks are ubiquitously used in safety-critical applications such as autonomous vehicles and medical diagnostics. The increasing complexity and compute-intensiveness of deep neural networks (DNNs) have motivated the need for DNN accelerators like Google’s Tensor Processing Unit (TPU) to accelerate convolution and matrix multiplication operations. At its core, a TPU consists of a two-dimensional array of multiply-and-accumulate units, called a systolic array, which is susceptible to both permanent (e.g., stuck-at faults in the data path) and transient (e.g., radiation-induced) hardware faults. We propose an RTL-level fault injection (FI) framework for systolic arrays. Using this framework, we characterize the software-level effects of errors (called Fault Patterns) induced by stuck-at faults within the multiply-and-accumulate units of the systolic array. We further analyze the effect of different dataflow mapping schemes (output-stationary and weight-stationary), operation types (convolution and matrix multiplication), and operation configurations (e.g., input size, convolution kernel size). Through the FI experiments, we categorized the fault patterns for stuck-at faults into well-defined classes based on their spatial patterns.
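To make the fault model concrete, here is a toy simulation (a hypothetical sketch, not SystoliFI itself, which works at the RTL level) of an output-stationary systolic matrix multiply where one MAC unit's accumulator register has a stuck-at-1 bit. Under this mapping, each processing element (PE) owns exactly one output element, so the permanent fault corrupts exactly one entry of the result:

```python
def stuck_at(value: int, bit: int, stuck: int) -> int:
    """Force one bit of an integer to a stuck value (0 or 1)."""
    return (value | (1 << bit)) if stuck else (value & ~(1 << bit))

def systolic_matmul(A, B, faulty_pe=None, bit=0, stuck=1):
    """Output-stationary matmul: PE (i, j) accumulates C[i][j] in place.
    A permanent stuck-at fault in PE (i, j)'s accumulator is re-applied
    after every MAC step, mimicking a faulty register bit."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0
            for t in range(k):
                acc += A[i][t] * B[t][j]
                if faulty_pe == (i, j):
                    acc = stuck_at(acc, bit, stuck)
            C[i][j] = acc
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
golden = systolic_matmul(A, B)                            # fault-free reference
faulty = systolic_matmul(A, B, faulty_pe=(0, 1), bit=4)   # stuck-at-1 in PE (0, 1)
# Output-stationary mapping: the fault corrupts exactly one output element.
diff = [(i, j) for i in range(2) for j in range(2) if golden[i][j] != faulty[i][j]]
```

Under a weight-stationary mapping, the same faulty PE would instead touch every partial sum streamed through it, producing a spatially different fault pattern; that dependence of the pattern on the dataflow is what the paper characterizes.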
[ISSRE’23] Resilience Assessment of Large Language Models under Transient Hardware Faults (PER)
Udit Agarwal, Abraham Chan, and Karthik Pattabiraman, To appear in the Proceedings of the IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023. (Acceptance Rate: 29.5%) [PDF (coming soon)]
Abstract: Large Language Models (LLMs) are transforming the field of natural language processing and revolutionizing the way machines interact with humans. LLMs like ChatGPT and Google’s Bard have already made significant strides in conversational AI, enabling machines to understand natural language and respond in a more human-like manner. In addition to typical applications like sentiment analysis and text generation, LLMs are also used in safety-critical applications such as code generation and speech comprehension in autonomous driving vehicles, where reliability is important.
In this work, we investigate the resilience of LLMs under transient hardware faults. Specifically, we used IR-level fault injection (FI) to assess the reliability of five popular LLMs, including Bert, GPT2, and T5, under transient faults. Moreover, we also investigate how the resilience of LLMs varies with different pre-training and fine-tuning objectives and the number of encoder and decoder blocks. We find that LLMs are quite resilient to transient faults overall. We also find that the behavior of an LLM under transient faults varies significantly with the input, the LLM’s architecture, and the type of task (e.g., translation vs. fill-in-the-blank). Finally, we find that the Silent Data Corruption (SDC) rate varies with different fine-tuning objectives, and for the fill-mask fine-tuning objective, the SDC rate also increases with the model size.
[ISSTA’23] CGuard: Scalable and Precise Object Bounds Protection for C
Piyus Kedia, Rahul Purandare, Udit Kumar Agarwal, Rishabh. International Symposium on Software Testing and Analysis (ISSTA), 2023. [ PDF ]
Abstract: Spatial safety violations are the root cause of many security attacks and unexpected behavior of applications. Existing techniques to enforce spatial safety work broadly at either object or pointer granularity. Object-based approaches tend to incur high CPU overheads, whereas pointer-based approaches incur both high CPU and memory overheads. SGXBounds, an object-based approach, provides precise out-of-bounds protection for objects at a lower overhead compared to other tools with similar precision. However, a major drawback of this approach is that it cannot support address spaces larger than 32 bits.
In this paper, we present CGuard, a tool that provides precise object-bounds protection for C applications with comparable overheads to SGXBounds without restricting the application address space. CGuard stores the bounds information just before the base address of an object and encodes the relative offset of the base address in the spare bits of the virtual address available in the x86_64 architecture. For an object that cannot fit in the spare bits, CGuard uses a custom memory layout that enables it to find the base address of the object in just one memory access. Our study revealed spatial safety violations in the gcc and x264 benchmarks from the SPEC CPU2017 benchmark suite and the string_match benchmark from the Phoenix benchmark suite. The execution time overheads for the SPEC CPU2017 and Phoenix benchmark suites were 42% and 26%, respectively, whereas the reduction in throughput for the Apache webserver when the CPUs were fully saturated was 30%. These results indicate that CGuard can be highly effective while maintaining a reasonable degree of efficiency.
[ISSRE’22] LLTFI: Framework Agnostic Fault Injection for Machine Learning Applications
Udit Agarwal, Abraham Chan, and Karthik Pattabiraman, IEEE International Symposium on Software Reliability Engineering (ISSRE), 2022. (Acceptance Rate: 29%) [ PDF | Talk (video) ] (Code)
Abstract: As machine learning (ML) has become more prevalent across many critical domains, so has the need to understand ML applications’ resilience. While prior work like TensorFI, MindFI, and PyTorchFI has focused on building ML fault injectors for specific ML frameworks, there has been little work on performing fault injection (FI) for ML applications written in multiple frameworks. We present LLTFI, a Framework-Agnostic Fault Injection tool for ML applications, allowing users to run FI experiments on ML applications at the LLVM IR level. LLTFI provides users with finer FI granularity at the level of instructions and a better understanding of how faults manifest and propagate between different ML components. We evaluate LLTFI on six ML programs and compare it with TensorFI. We found significant differences in the Silent Data Corruption (SDC) rates for similar faults between the two tools. Finally, we use LLTFI to evaluate the efficacy of selective instruction duplication – an error mitigation technique – for ML programs.
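The core FI loop can be sketched in miniature. The following toy example (a hypothetical two-class linear "model"; LLTFI itself instruments LLVM IR instructions rather than Python values) flips one bit in a single weight, reruns inference, and checks whether the prediction silently changed:

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of x's IEEE-754 binary32 encoding, mimicking a
    transient hardware fault in a single program value."""
    (i,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", i ^ (1 << bit)))
    return y

def classify(weights, x):
    """Toy linear classifier: scores = W @ x, prediction = argmax."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

W = [[0.9, 0.1], [0.2, 0.8]]
x = [2.0, 1.0]
golden = classify(W, x)              # fault-free reference run

# One FI trial: corrupt a single value, rerun, compare with the golden run.
Wf = [row[:] for row in W]
Wf[1][0] = flip_bit(Wf[1][0], 30)    # flip a high exponent bit
sdc = classify(Wf, x) != golden      # Silent Data Corruption?
```

A real campaign repeats such trials over many randomly chosen instructions and bits and reports the fraction that end in SDCs; injecting below the framework level (as LLTFI does) is what lets faults that never corrupt a layer's output be counted as benign.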
[ASE’21] Nekara: Generalized Consistency Testing
Udit Agarwal, Pantazis Deligiannis, Cheng Huang, Kumseok Jung, Akash Lal, Immad Naseer, Matthew Parkinson, Arun Thangamani, Jyothi Vedurada, Yunpeng Xiao, Proceedings of the ACM/IEEE International Conference on Automated Software Engineering (ASE), 2021. [ PDF | Talk Slides]
Abstract: Testing concurrent systems remains an uncomfortable problem for developers. The common industrial practice is to stress-test a system against large workloads, with the hope of triggering enough corner-case interleavings that reveal bugs. However, stress testing is often inefficient and its ability to get coverage of interleavings is unclear. In reaction, the research community has proposed the idea of systematic testing, where a tool takes over the scheduling of concurrent actions so that it can explore the space of interleavings.
We present an experience paper on the application of systematic testing to several case studies. We separate the algorithmic advancements in prior work (on searching the large space of interleavings) from the engineering of their tools. The latter was unsatisfactory; often the tools were limited to a small domain, hard to maintain, and hard to extend to other domains. We designed Nekara, an open-source cross-platform library for easily building custom systematic testing solutions.
We show that (1) Nekara can effectively encapsulate state-of-the-art exploration algorithms by evaluating on prior benchmarks, and (2) Nekara can be applied to a wide variety of scenarios, including existing open-source systems as well as cloud services of a major IT company. Nekara was easy to use, improved testing, and found multiple new bugs.