Pruning and quantization for efficient deep neural networks

Abstract: In recent years, deep neural networks have been shown to outperform classical methods on several machine learning tasks. Such deep networks make predictions by pattern matching and are trained from experience. By leveraging large amounts of data, they are capable of learning hierarchical representations of raw input data and thus combine feature learning and classification. However, a high initial model capacity as well as floating-point operations are required to successfully train a deep neural network from scratch. As a result, trained models are usually over-parameterized and require powerful processing units.

In contrast, both mobile and embedded devices have limited memory, energy, and computational capacity, which severely restricts the complexity of the neural networks they can run. To make trained models usable on such devices nonetheless, reduction methods are employed to decrease their complexity. On the one hand, quantization methods reduce the bit widths of operands and operations, which immediately decreases the memory requirements; fixed-point quantization methods additionally reduce the computational and energy requirements on dedicated hardware. On the other hand, pruning reduces the number of operands and operations by removing redundant network connections; pruning entire filters and neurons from the network architecture directly reduces the memory, energy, and computational complexity without the need for specialized hardware. A common approach is therefore to first train a large, over-parameterized network and then reduce it using appropriate reduction methods.

However, previous approaches suffer from several problems. First, many of them are either complicated to implement or must be applied outside the standard optimization procedure of deep neural networks. Moreover, quantization approaches often neglect fixed-point constraints. Finally, it is usually not possible to directly specify the critical limitations of the target hardware, which makes iterative reduction procedures necessary.

In this thesis, we address these problems with several contributions to both pruning and quantization. Each of our approaches consists of a reduction loss that can be integrated into the standard training procedure of deep neural networks with little implementation effort. Minimizing the reduction loss during training reduces the model complexity through fixed-point quantization, filter pruning, or a combination of both.
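As an illustration of how such a reduction loss could be attached to the standard optimization loop, the following PyTorch-style sketch adds a weighted reduction term to the task loss; the function and parameter names (e.g. `reduction_loss_fn`, `reduction_weight`) are hypothetical and not taken from the thesis.

```python
def training_step(model, batch, targets, task_criterion, reduction_loss_fn,
                  optimizer, reduction_weight=0.1):
    """One optimization step with an additional reduction loss term.

    reduction_loss_fn stands for any of the losses described below
    (fixed-point quantization, filter pruning, or their combination).
    """
    optimizer.zero_grad()
    outputs = model(batch)
    task_loss = task_criterion(outputs, targets)          # e.g. cross-entropy
    reduction_loss = reduction_loss_fn(model)             # complexity penalty
    total_loss = task_loss + reduction_weight * reduction_loss
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```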

First, we propose a simple and efficient reduction loss to train deep neural networks with multi-modal weight distributions and minimal quantization error. Consequently, the weights can be quantized to fixed-point representations after training with no significant loss in accuracy. The approach is very easy to implement and yields excellent performance even for small bit widths. Furthermore, we extend our approach by taking into account both the batch-normalization layers and the activation functions. In this way, it is possible to train deep neural networks that can be evaluated without any floating-point operations after training.
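One way to read this idea is as a regularizer that pulls every weight toward the nearest value representable on a fixed-point grid, so that rounding after training introduces little error. The sketch below is a minimal illustration under assumed bit widths; the grid definition and the function name `fixed_point_grid_loss` are our own and not the exact formulation of the thesis.

```python
import torch

def fixed_point_grid_loss(model, bits=4, frac_bits=3):
    """Penalize the squared distance of each weight to its nearest value
    in a signed fixed-point grid with `bits` total and `frac_bits`
    fractional bits. Minimizing this term during training drives the
    weights toward a multi-modal distribution centered on the grid."""
    step = 2.0 ** (-frac_bits)              # quantization step size
    min_val = -(2 ** (bits - 1)) * step     # smallest representable value
    max_val = (2 ** (bits - 1) - 1) * step  # largest representable value
    loss = torch.zeros(())
    for param in model.parameters():
        nearest = torch.clamp(torch.round(param / step) * step, min_val, max_val)
        loss = loss + torch.sum((param - nearest) ** 2)
    return loss
```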

Next, we propose a novel filter pruning method that is capable of reducing the number of parameters and multiplications of a deep neural network to a given target size. The user can thus define maximum values for both the number of parameters and the number of multiplications according to the memory and computational resources of the target device. During training, the reduction loss measures the difference between the actual model size and the target size in terms of parameters and required multiplications. This loss is minimized by pruning whole filters and neurons via the channel-wise affine transformation of the batch-normalization layers. In this way, a global selection of filters and neurons can be found that, on the one hand, solves the learning task as well as possible and, on the other hand, fulfills the constraints of the target device.
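A possible reading of this mechanism, sketched below with hypothetical helper names and simplifying assumptions (3x3 convolutions, an RGB input, and a soft gate on the batch-norm scale factors), is to estimate the surviving parameter and multiplication counts from the channel-wise scales and penalize only the amount by which they exceed the user-defined budget:

```python
import torch
import torch.nn as nn

def pruning_budget_loss(model, max_params, max_mults, bn_output_sizes):
    """Penalize the gap between the (soft) model size implied by the
    batch-normalization scale factors and a user-defined target size.

    bn_output_sizes maps each BatchNorm2d module to the spatial size
    (H * W) of the feature map it normalizes; both the gating scheme
    and the 3x3-kernel assumption are illustrative simplifications.
    """
    est_params = torch.zeros(())
    est_mults = torch.zeros(())
    prev_active = 3.0                                   # assume an RGB input
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            # Soft indicator of "this channel survives pruning".
            gates = torch.sigmoid(10.0 * (module.weight.abs() - 1e-2))
            active = gates.sum()
            est_params = est_params + prev_active * active * 9
            est_mults = est_mults + prev_active * active * 9 * bn_output_sizes[module]
            prev_active = active
    # Penalize only exceeding the budget, not staying below it.
    return torch.relu(est_params - max_params) + torch.relu(est_mults - max_mults)
```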

Finally, we propose a novel and highly efficient combination of filter pruning and fixed-point quantization. Here, we define complexity as an aggregation of four essential metrics: the memory requirement, the computational complexity resulting from the number of bit operations, the bandwidth resulting from the communication between the processing unit and the memory, and the maximum storage cost of the activations. Based on these four metrics, the reduction loss measures the difference between the actual model complexity and the resources available on the target device. The reduction loss can be minimized during training using pruning and quantization layers developed specifically for this purpose. The trained model is thus highly efficient: it runs without batch-normalization layers, has all parameters and activations in fixed-point representation, and satisfies the complexity constraints of the target device.
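The aggregation of the four metrics could look roughly like the sketch below; the layer description, bit widths, and the simple sum-of-excesses combination are our own illustrative assumptions rather than the exact definition used in the thesis.

```python
def complexity_gap(layers, budget):
    """Aggregate memory, bit operations, bandwidth, and peak activation
    storage for a list of layer descriptions and return the total excess
    over a per-metric budget (all data structures here are hypothetical).

    Each layer is a dict with: params, mults, weight_bits, act_bits, act_size.
    """
    memory = sum(l["params"] * l["weight_bits"] for l in layers)        # weight storage (bits)
    bit_ops = sum(l["mults"] * l["weight_bits"] * l["act_bits"] for l in layers)
    bandwidth = sum(l["params"] * l["weight_bits"]                      # weights read once
                    + l["act_size"] * l["act_bits"] for l in layers)    # activations moved
    peak_acts = max(l["act_size"] * l["act_bits"] for l in layers)      # largest activation map
    actual = {"memory": memory, "bit_ops": bit_ops,
              "bandwidth": bandwidth, "peak_acts": peak_acts}
    # Sum of how far each metric exceeds its budget (zero if within budget).
    return sum(max(0, actual[key] - budget[key]) for key in budget)
```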

Location
Deutsche Nationalbibliothek Frankfurt am Main
Extent
Online resource
Language
English
Notes
Universität Freiburg, Dissertation, 2022

Keyword
Pruning
Quantization
Fixed-point arithmetic
Deep Learning

Event
Publication
(where)
Freiburg
(who)
Universität
(when)
2022

DOI
10.6094/UNIFR/228535
URN
urn:nbn:de:bsz:25-freidok-2285359
Rights
Open Access; access to the object is unrestricted.

Data provider
Deutsche Nationalbibliothek

Time of origin

  • 2022
