In our previous blog post, we discussed the process of training neural networks (NN) and briefly touched on NN training platforms and related memory bandwidth issues. As we noted, neural network training and inference performance are heavily contingent upon memory bandwidth. This is because the memory system is typically tasked with holding the neural network parameters – weights and biases – along with training data.
Because the compute hardware is optimized for fast arithmetic, neural networks constantly stress the memory system as model parameters and data are fetched from memory. More specifically, moving data to and from the compute engines strains both memory capacity and bandwidth: training sets and model parameters are held in local memory precisely to avoid data transfers over slower PCIe interconnects.
Reduced-Precision Computation for Neural Networks
One popular method of reducing memory bandwidth demand and increasing power-efficiency for neural networks focuses on reduced-precision computation. Instead of computing with 32-bit floating point numbers, many implementations are now using 16, 8, and in some cases fewer bits of precision. This is because neural networks are tolerant of reduced precision and the power saved by avoiding computing on the least significant digits is often considerable.
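To make the bandwidth savings concrete, the NumPy sketch below compares the storage footprint of a hypothetical 4096 x 4096 weight matrix at 32-bit and 16-bit precision; halving the operand width halves the bytes that must move through the memory system:

```python
import numpy as np

# Hypothetical fully connected layer: 4096 x 4096 weights (~16.8M values).
weights_fp32 = np.random.randn(4096, 4096).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

print(weights_fp32.nbytes // 2**20, "MiB at 32-bit")  # 64 MiB
print(weights_fp16.nbytes // 2**20, "MiB at 16-bit")  # 32 MiB
```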
Indeed, NVIDIA has discussed the benefits of using 16-bit operands and 32-bit accumulates for neural network training. Neural network parameters can be reduced in precision to 8 bits after (or during) training. While this approach may reduce inference accuracy, networks can be further trained to regain the classification accuracy lost when reducing the precision of network parameters. Microsoft also discussed the benefits of reduced-precision computation when introducing Brainwave, the company's machine learning infrastructure, at Hot Chips 2017.
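A minimal NumPy sketch of the 16-bit-operand / 32-bit-accumulate pattern is shown below. Real hardware fuses the multiply and accumulate into a single operation, but the numerical idea is the same:

```python
import numpy as np

a = np.random.randn(1024).astype(np.float16)  # 16-bit operands
b = np.random.randn(1024).astype(np.float16)

# Products are formed at 16 bits, but the running sum is kept in a
# 32-bit accumulator so rounding error does not grow with vector length.
acc = np.sum(a * b, dtype=np.float32)
print(acc.dtype)  # float32
```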
In addition to using 16-bit operands and 32-bit accumulates for neural network training, numerous other methods have been discussed in various industry and academic research papers to better utilize memory bandwidth and to improve power-efficiency, including pruning weights that are close to 0 and quantizing weights to reduce storage overhead.
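The sketch below illustrates both ideas in NumPy; the pruning threshold and the symmetric 8-bit quantization scheme are chosen for illustration, not drawn from any particular paper:

```python
import numpy as np

w = np.random.randn(1024).astype(np.float32)

# Magnitude pruning: zero out weights whose magnitude falls below a
# (hypothetical) threshold; the surviving sparse weights can be stored
# and fetched far more compactly.
pruned = np.where(np.abs(w) < 0.05, 0.0, w)

# Simple symmetric 8-bit quantization: map floats onto int8 with a
# single scale factor, cutting storage per weight from 4 bytes to 1.
scale = np.abs(pruned).max() / 127.0
q = np.clip(np.round(pruned / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale  # approximate reconstruction
```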
New Formats for Reduced Precision Computation
Reduced precision computation has garnered significant industry interest over the past few years, with a range of new formats now supported in AI hardware and related software frameworks. For example, bfloat16 is perhaps the most popular of the new formats; it is supported by Intel’s Nervana AI processors, Xeon processors and FPGAs, as well as Google’s TPUs and the TensorFlow framework. Compared to the IEEE 754 single-precision floating point format, bfloat16 has the same number of exponent bits and covers the same numerical range (~1e-38 to ~3e38), but at lower numerical precision. This similarity also allows fast conversion between the two formats.
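Because bfloat16 is essentially the top half of an IEEE 754 single, the conversion amounts to a 16-bit shift. Here is a bit-level NumPy sketch; it truncates for simplicity, whereas hardware converters typically round:

```python
import numpy as np

x = np.array([3.14159], dtype=np.float32)

# bfloat16 keeps fp32's sign bit and 8 exponent bits but only 7
# mantissa bits, so a crude fp32 -> bfloat16 conversion simply drops
# the low 16 bits of the fp32 encoding.
bf16_bits = (x.view(np.uint32) >> 16).astype(np.uint16)

# bfloat16 -> fp32: place the 16 stored bits in the high half.
restored = (bf16_bits.astype(np.uint32) << 16).view(np.float32)
print(x[0], "->", restored[0])  # 3.14159 -> ~3.140625
```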
Extreme Reduced Precision Computing
Xnor.ai is a company exploring extreme reduced precision computation by using single-bit model parameters. Essentially, the company has reworked training and inference algorithms so that inference operations are reduced to single-bit xnor operations. By doing so, these computations can be performed natively in hardware on almost any processor at high speed and with high power-efficiency, including the Raspberry Pi and smartphone CPUs. Among the many demos shown to date, the most impressive is real-time object recognition running locally on an iPhone.
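A minimal Python sketch of the underlying trick appears below; the bit-packing convention is illustrative, not Xnor.ai's actual implementation:

```python
# Sketch of the binarized dot product at the heart of XNOR-style
# inference. Values are constrained to {-1, +1} and bit-packed
# (bit = 1 encodes +1, bit = 0 encodes -1), so a 64-element dot
# product collapses to one XNOR plus a population count.
N = 64
MASK = (1 << N) - 1

def binary_dot(a_bits: int, b_bits: int) -> int:
    matches = bin((~(a_bits ^ b_bits)) & MASK).count("1")  # XNOR + popcount
    return 2 * matches - N  # each match adds +1, each mismatch -1

v = 0x0F0F0F0F0F0F0F0F
print(binary_dot(v, v))          # 64: identical vectors, maximum dot product
print(binary_dot(v, ~v & MASK))  # -64: opposite vectors
```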
Conclusion
Neural network training applications can make more effective use of memory bandwidth by packing more values into each byte transferred, rather than defaulting to full-precision operations. In turn, these savings can be leveraged to implement additional neurons or to reduce total cost of ownership (TCO). With the benefits of reduced precision computation being realized in modern AI systems, the industry must now turn its attention to additional innovations in order to continue fueling improvements in future AI systems.
Interested in reading more about machine learning and neural networks? You can browse our article archive on the subject here.