A processor, whether a CPU or a GPU, does nothing but process data, which means it needs memory to feed it. Unfortunately, the gap between memory speed and processor speed has been growing over time, which has led to techniques such as cache memory. We also cannot forget the latency between processor and memory, which appears when the interface between the RAM and the processor cannot transfer or modify data fast enough.
However, we cannot measure performance in a general way, since each program, or rather each algorithm within each program, has a different computational load. This is where the term arithmetic intensity comes in. Let's see what it is and what it consists of, as well as other elements related to performance in a computer.
What is arithmetic intensity?
Arithmetic intensity is a measure of performance that consists of counting the number of floating-point operations a processor executes in a specific section of code. It is obtained by dividing the number of floating-point operations by the number of bytes the algorithm accesses during execution.
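The definition above can be sketched with a small calculation. This is an illustrative example, not from the original text: a SAXPY-style kernel (`y[i] = a * x[i] + y[i]`) on single-precision data, where the per-element operation and byte counts are assumed from the kernel's shape.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Arithmetic intensity = floating-point operations / bytes transferred."""
    return flops / bytes_moved

# Illustrative kernel: y[i] = a * x[i] + y[i] on float32 data.
# Per element: 2 FLOPs (one multiply, one add).
# Per element: 12 bytes (read x, read y, write y; 4 bytes each).
n = 1_000_000
flops = 2 * n
bytes_moved = 12 * n

print(arithmetic_intensity(flops, bytes_moved))  # ≈ 0.167 FLOP/byte
```

A value this low means the kernel is memory-bound on most hardware: the processor spends more time waiting for bytes than computing with them.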
How useful is it? In fields of computing where very powerful machines are needed for specific tasks, it allows choosing the hardware system best suited to run the algorithms under the best conditions. This model is mainly used in scientific computing, although it is also used to optimize performance in closed systems such as video game consoles.
When using a highly parallelized hardware architecture, a high arithmetic intensity is required, that is, a low ratio between bandwidth and computational capacity, since in these processors the ratio between computational capacity and available memory bandwidth is high. Many applications, and especially graphics, process the same data several times and therefore demand a large computational power compared to the amount of data moved.
Algorithm performance and relation to arithmetic intensity
When writing an algorithm, programmers take into account its performance, which is measured with Big O notation, describing how the number of operations grows with respect to the size of the data. Big O is not measured with a benchmark; programmers work it out by hand to get an indicative idea of the workload of their programs:
- O(1): the algorithm does not depend on the size of the data to be processed. An algorithm with O(1) performance is considered ideal and is unbeatable.
- O(n): the execution time is directly proportional to the size of the data; the performance grows linearly.
- O(log n): it occurs in algorithms that split the problem and solve it by parts, such as a binary search, which halves the search space at each step.
- O(n log n): an evolution of the previous one, typical of divide-and-conquer algorithms such as data sorting, where the problem is split further and each part is solved separately.
- O(n²): there are algorithms that perform several iterations because they have to query the data several times. They tend to be highly repetitive and therefore have a quadratic computational load, which grows rapidly with the data size.
- O(n!): an algorithm with this complexity is a complete failure in terms of performance and needs to be rewritten.
Not all algorithms can reach O(1) complexity, and some perform much better on one type of hardware than on another. That is why accelerators or domain-specific processors have been developed in recent years that speed up one type of algorithm over another. The general idea is to divide algorithms into parts and handle each of them with the processing unit best suited to its arithmetic intensity.
Ratio between communication and computation
The inverse case is the ratio between communication and computation, which is measured inversely to the arithmetic intensity and is therefore obtained by dividing the number of bytes by the number of floating-point operations. It is used to measure the bandwidth required to execute that part of the code. The problem when measuring it is that the data is not always in the same place, so the RAM bandwidth is used as a reference.
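A short sketch of this inverse ratio, using the same assumed per-element counts as before (12 bytes moved per 2 FLOPs, figures chosen for illustration, not from any real device):

```python
def bytes_per_flop(bytes_moved, flops):
    """Communication-to-computation ratio: bytes moved per floating-point op."""
    return bytes_moved / flops

# Assumed kernel profile: 12 bytes moved for every 2 FLOPs executed.
ratio = bytes_per_flop(12, 2)   # 6 bytes per FLOP

# Bandwidth needed to sustain a hypothetical 1 TFLOP/s of compute:
required_bandwidth = ratio * 1e12      # bytes per second
print(required_bandwidth / 1e9)        # 6000 GB/s
```

A ratio of 6 bytes per FLOP would demand far more bandwidth than typical RAM provides at that compute rate, which is exactly the kind of mismatch this metric is meant to expose.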
It must be taken into account that this is not a totally reliable measurement, not only because the cache system brings the data closer to the processor, but also because of latency: each type of RAM has its own advantages and disadvantages, and the result can vary depending on the memory used.
Nowadays, when choosing the memory for a system, not only bandwidth is taken into account but also energy consumption, since the energy cost of moving data is coming to exceed the cost of processing it. That is why certain types of memory are being chosen for certain applications, always within the costs associated with building the system, which are not the same in a supercomputer as in a home PC.