Faster Division With Newton-Raphson Approximation
Many devices, especially embedded ones (microcontrollers and the like), do not come with an FPU or the circuitry required for carrying out integer division. In such cases, one looks toward methods of approximating the result of division and storing it in Fixed Point format.
C has standardized support for this via its stdfix library. The ISO Document describes the data types and functions available in stdfix.h.
This post describes the theory, provides a dependency-free C++ implementation of the core algorithm and discusses optimizations to speed it up even further. In that order.
Theory
The problem at hand is that of division:

$$Q = \frac{N}{D}$$
The first step to solving this is to split the problem in two: reciprocal calculation, followed by multiplication. This is what it looks like:

$$Q = N \times \frac{1}{D}$$
It is assumed that the device has fast multiplication hardware (which is
mostly the case). Thus, multiplication can be carried out by the good ol’ mul
instruction. Now for the trickiest part - calculating the reciprocal - which
is itself a division operation!
As it turns out, this is a known problem and solutions are approximately as old as the Unix epoch.
Newton-Raphson Method
The Wikipedia article sufficiently describes the what and how of the NR Method. I’ll summarize it and add some missing context.
“Newton’s Method is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function.” This is the generic iterative equation according to Newton’s method:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
The idea is to find a function $f$ which is zero exactly when $X = \frac{1}{D}$. One such function is:

$$f(X) = \frac{1}{X} - D$$
(Substitute $X = \frac{1}{D}$ in the above equation and it should result in zero.)
Next, we find $f'(X) = -\frac{1}{X^2}$ and substitute it in the NM equation to give us an equation that allows successive improvements:

$$X_{n+1} = X_n - \frac{\frac{1}{X_n} - D}{-\frac{1}{X_n^2}} = 2X_n - D X_n^2 = X_n(2 - D X_n)$$
Astute readers will notice that the result depends on the previous iteration, i.e., $X_{n+1}$ depends on $X_n$, so $X_1$ depends on $X_0$. How do we calculate $X_0$? This is the problem of initial approximation.
Initial Approximation to the Reciprocal
Lest the curtain be drawn too soon, here is the final equation for calculating $X_0$, provided $D$ has been scaled to be in the range $[0.5, 1]$:

$$X_0 = \frac{48}{17} - \frac{32}{17} D$$
This is a linear, smooth, and non-periodic function. In numerical algorithms like division, the goal is not necessarily the smallest average error, but guaranteeing that the worst-case error ($\max |\epsilon|$) is as small as possible. This predictability is important because the initial error directly determines the number of Newton-Raphson iterations required to reach full machine precision.
We wish to calculate an approximation to the function $\frac{1}{D}$ such that the worst-case error is minimal. The right tool for this job is the Chebyshev Equioscillation Theorem.
Chebyshev approximation is used because it provides the Best Uniform Approximation (or Minimax Approximation). This means that out of all possible polynomials of a given degree, the Chebyshev method yields the one that minimizes the maximum absolute error across the entire target interval.
We start by formulating the error function on which equioscillation will be applied. The error function for approximating the reciprocal using a simple straight line (a linear equation) tells us how far off our guess is from the perfect answer. Because we want the product $D \cdot X_0$ to be near $1$, we make the error function measure the difference between that product and $1$. The formula is $\epsilon(D) = D \cdot X_0(D) - 1$. When we plug in the straight-line guess $X_0(D) = a - bD$, the formula becomes:

$$\epsilon(D) = D(a - bD) - 1 = aD - bD^2 - 1$$
The main goal is to pick the numbers $a$ and $b$ that minimize the absolute value of this error everywhere in the range. This is exactly what the Chebyshev method does.
Before we apply the theorem to the error equation, we need to constrain the values that $D$ can take. Bounding $D$ guarantees that the starting error is small enough for the subsequent iterations to converge quickly and predictably to full fixed-point precision. Without this bound, a much more complex, higher-degree polynomial would be needed, defeating the efficiency goal. We bound $D$ to be in $[0.5, 1]$. In code, this scaling can be achieved through simple bit-shifts. As long as we scale the numerator too, the result remains correct.
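Concretely, the scaling is safe because multiplying both operands by the same power of two leaves the quotient unchanged:

$$\frac{N}{D} = \frac{N \cdot 2^{k}}{D \cdot 2^{k}}$$

where $k$ is chosen (via a leading-zero count on $D$) so that the scaled denominator falls in the target range.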
The theorem states that a polynomial is the best uniform approximation to a continuous function over an interval if and only if the error function alternates between its maximum positive and maximum negative values at least $n + 2$ times, where $n$ is the degree of the polynomial. Since $n = 1$ (linear approximation), we need $3$ alternating extrema: at the two endpoints ($D = 0.5$ and $D = 1$) and the local extremum ($D^*$) between them.
The location of the local extremum is found by setting the derivative to zero:

$$\epsilon'(D) = a - 2bD = 0 \implies D^* = \frac{a}{2b}$$
The first condition of the theorem is that the error magnitude is equal at the endpoints, i.e., $\epsilon(0.5) = \epsilon(1)$.
This simplifies to:

$$\frac{a}{2} - \frac{b}{4} - 1 = a - b - 1 \implies a = \frac{3b}{2}, \text{ i.e., } b = \frac{2a}{3}$$
The second condition states that the error at the endpoints must be the negative of the error at the extremum, i.e., $\epsilon(1) = -\epsilon(D^*)$.
Substituting $D^* = \frac{a}{2b}$ into the error function and simplifying:

$$\epsilon(D^*) = \frac{a^2}{4b} - 1, \qquad a - b - 1 = -\left(\frac{a^2}{4b} - 1\right)$$
Substituting the value of $b = \frac{2a}{3}$ from above:

$$\frac{a}{3} - 1 = -\left(\frac{3a}{8} - 1\right)$$
Solving this linear equation for $a$:

$$\frac{a}{3} + \frac{3a}{8} = 2 \implies \frac{17a}{24} = 2 \implies a = \frac{48}{17}$$
Substituting back into the relation between $a$ and $b$ to find $b$:

$$b = \frac{2a}{3} = \frac{32}{17}$$
The resulting linear approximation is:

$$X_0 = \frac{48}{17} - \frac{32}{17} D \approx 2.8235 - 1.8824\,D$$
This equation gives the optimal initial estimate for the reciprocal that can be refined by iterations of the Newton-Raphson equation.
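To make the recipe concrete before introducing fixed point, here is a minimal floating-point sketch of it (the function name reciprocal_newton, the iteration count, and the sample values are illustrative, not part of the final implementation):

#include <cstdio>

// Reference (floating-point) version: Chebyshev initial guess followed by
// Newton-Raphson refinement. Assumes d has already been scaled into [0.5, 1].
float reciprocal_newton(float d, int iterations) {
    float x = 48.0f / 17.0f - 32.0f / 17.0f * d; // X0: minimax linear guess
    for (int i = 0; i < iterations; ++i) {
        x = x * (2.0f - d * x);                  // X_{n+1} = X_n * (2 - D * X_n)
    }
    return x;
}

int main() {
    float d = 0.75f;                       // denominator, already in [0.5, 1]
    float n = 0.375f;                      // numerator scaled by the same factor
    float q = n * reciprocal_newton(d, 2); // two iterations: error ~ (1/17)^4
    std::printf("%f\n", q);                // ~0.49999, converging to 0.375 / 0.75 = 0.5
    return 0;
}

With $D = 0.75$ the initial guess is $24/17 \approx 1.41$, and two iterations already land within roughly $10^{-5}$ of the true reciprocal $4/3$.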
Implementation
(For convenience, here’s a link to the complete implementation)
The original rationale for this implementation was two-fold:
- Lack of division circuitry.
- Lack of support for floating point types.
Lack of division circuitry is solved by the algorithm design (reciprocal multiplication). Lack of floating point types is dealt with by using Fixed-Point notation. Fixed point stores everything in integers and operates using integer arithmetic and bit-shifts, which are highly cost-efficient on embedded hardware.
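For example, with 8 fractional bits in a 16-bit word (the Fx8_16 layout used below), a real value $x$ is stored as the integer $\operatorname{round}(x \cdot 2^{8})$, and a multiplication needs one extra right-shift to put the binary point back:

$$x_{\text{raw}} = \operatorname{round}(x \cdot 2^{8}), \qquad 1.5 \mapsto 384, \qquad (a \cdot b)_{\text{raw}} = \frac{a_{\text{raw}} \cdot b_{\text{raw}}}{2^{8}}$$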
Here’s an implementation of a fixed-point type in C++:
#include <iostream>
#include <cmath>
#include <cstdint>
#include <type_traits>
#include <limits>
#include <cassert>
#include <numeric>
#include <bit>      // std::countl_zero (C++20)
template <int FRAC_BITS, int TOTAL_BITS>
class FixedPoint {
    // Pick the smallest signed integer type that holds TOTAL_BITS.
    using BaseType = decltype([] {
        if constexpr (TOTAL_BITS <= 8) {
            return int8_t{};
        } else if constexpr (TOTAL_BITS <= 16) {
            return int16_t{};
        } else if constexpr (TOTAL_BITS <= 32) {
            return int32_t{};
        } else {
            return int64_t{};
        }
    }());

    // A type wide enough to hold the intermediate product of two raw values.
    using WideType = typename std::conditional<(sizeof(BaseType) <= 4), int64_t, __int128_t>::type;

    BaseType value;

public:
    // One unit in the integer domain corresponds to 1 / SCALE in the real domain.
    constexpr static BaseType SCALE = static_cast<BaseType>(1ULL << FRAC_BITS);

    FixedPoint() : value(0) {}
    FixedPoint(float f) : value(static_cast<BaseType>(std::round(f * SCALE))) {}

    BaseType raw() const { return value; }

    static FixedPoint fromRaw(BaseType rawVal) {
        FixedPoint fp;
        fp.value = rawVal;
        return fp;
    }

    float toFloat() const { return static_cast<float>(value) / SCALE; }

    FixedPoint operator+(const FixedPoint &other) const {
        return fromRaw(value + other.value);
    }

    FixedPoint operator-(const FixedPoint &other) const {
        return fromRaw(value - other.value);
    }

    // Multiply in the wide type, then shift right to restore the binary point.
    FixedPoint operator*(const FixedPoint &other) const {
        WideType prod = static_cast<WideType>(value) * other.value;
        return fromRaw(static_cast<BaseType>(prod >> FRAC_BITS));
    }

    FixedPoint operator*(float factor) const {
        return FixedPoint(toFloat() * factor);
    }

    friend std::ostream &operator<<(std::ostream &os, const FixedPoint &fp) {
        return os << fp.toFloat();
    }
};
The template parameters specify the fractional and total bit lengths. For
example, FixedPoint<8, 16> uses a 16-bit integer with the scale $2^{8} = 256$. The
central idea is that all operations are performed on the integer directly, and
conversion involves scaling and de-scaling with the constant SCALE value.
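As a quick usage sketch (the demo function and the values are illustrative only):

using Fx8_16 = FixedPoint<8, 16>;

void fixed_point_demo() {
    Fx8_16 a(1.5f);       // stored as round(1.5 * 256) = 384
    Fx8_16 b(0.25f);      // stored as 64
    Fx8_16 sum  = a + b;  // raw 384 + 64 = 448       -> 1.75
    Fx8_16 prod = a * b;  // raw (384 * 64) >> 8 = 96 -> 0.375
    std::cout << sum << " " << prod << "\n";  // prints: 1.75 0.375
}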
The division function below implements the 4-part process: Normalization,
Initial Approximation, NR Iterations, and Multiplication with
Numerator. We use the higher precision FixedPoint<16, 32>
(Fx16_32) for intermediate calculations to minimize approximation
error before truncating to the final Fx8_16 result.
using Fx8_16 = FixedPoint<8, 16>;
using Fx16_32 = FixedPoint<16, 32>;
Fx8_16 fxdiv_corrected(int n, int d) {
    assert(d != 0 && "Divide by zero undefined");
    if (n == 0) return Fx8_16(0.0f);

    bool result_is_negative = ((n < 0) != (d < 0));
    uint32_t nv = static_cast<uint32_t>(std::abs(n));
    uint32_t dv = static_cast<uint32_t>(std::abs(d));

    // 1. Normalization/scaling 'd' to fit between 0.5 and 1.0.
    //    The numerator is shifted by the same amount, so the quotient is unchanged.
    //    Note: this assumes nv << shift still fits in 32 bits (|n| not much larger
    //    than |d|) and that the quotient fits the Fx8_16 range.
    int shift = std::countl_zero<uint32_t>(dv);
    uint32_t scaled_val_raw = dv << shift;
    uint32_t scaled_val_n_raw = nv << shift;

    constexpr int INTERPRETATION_SHIFT = 31 - 16;
    Fx16_32 d_scaled = Fx16_32::fromRaw(static_cast<int32_t>(scaled_val_raw >> INTERPRETATION_SHIFT));
    Fx16_32 n_scaled = Fx16_32::fromRaw(static_cast<int32_t>(scaled_val_n_raw >> INTERPRETATION_SHIFT));

    // 2. Initial approximate calculation (Chebyshev: X0 = 48/17 - 32/17 * D)
    Fx16_32 a {2.8235294f}; // 48/17
    Fx16_32 b {1.8823529f}; // 32/17
    Fx16_32 initial_approx = a - (b * d_scaled);

    // 3. Newton-Raphson iterations: X_{n+1} = X_n * (2 - D * X_n)
    Fx16_32 val = initial_approx;
    val = val * (Fx16_32{2.f} - (d_scaled * val)); // E1: Precision ~8 bits
    val = val * (Fx16_32{2.f} - (d_scaled * val)); // E2: Precision ~16 bits (sufficient for Fx16_32)
    val = val * (Fx16_32{2.f} - (d_scaled * val)); // E3: Conservative overkill
    val = val * (Fx16_32{2.f} - (d_scaled * val)); // E4: Conservative overkill

    // 4. Multiplication with Numerator
    Fx16_32 res_16_32 = n_scaled * val;

    if (result_is_negative) {
        // Simple negation
        res_16_32 = res_16_32 * -1.0f;
    }

    // Truncate from Fx16_32 (FRAC_BITS=16) to Fx8_16 (FRAC_BITS=8):
    // shift the raw 32-bit value right by 8 bits to move the binary point.
    constexpr int TRUNCATION_SHIFT = 16 - 8;
    int32_t raw_final_32 = res_16_32.raw();
    int16_t raw_final_16 = static_cast<int16_t>(raw_final_32 >> TRUNCATION_SHIFT);
    Fx8_16 final_res = Fx8_16::fromRaw(raw_final_16);
    return final_res;
}
The INTERPRETATION_SHIFT and TRUNCATION_SHIFT constants account for correctly aligning the binary point.
- INTERPRETATION_SHIFT (31 - 16 = 15) adjusts the normalized input to the format used for calculation, ensuring the value is interpreted as lying in the range $[0.5, 1)$.
- TRUNCATION_SHIFT (16 - 8 = 8) reduces the result to the final output format, discarding the lower 8 fractional bits.
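As a sanity check on the direction of the shift: a fixed-point raw value is the real value times $2^{\text{FRAC\_BITS}}$, so moving from 16 to 8 fractional bits simply drops the low 8 bits. For instance, a raw Fx16_32 value of $49152$ represents $49152 / 2^{16} = 0.75$, and $49152 \gg 8 = 192$ represents $192 / 2^{8} = 0.75$ in Fx8_16.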
The code is compiled and run with an example:
int main() {
    std::cout.precision(10);
    int n = 3;
    int d = 4;
    Fx8_16 result = fxdiv_corrected(n, d);
    std::cout << "Approximate division of " << n << "/" << d << ": " << result << std::endl;
    std::cout << "Real division of " << n << "/" << d << ": " << static_cast<float>(n) / static_cast<float>(d) << std::endl;
    return 0;
}
$ clang++ fixed_div.cpp -o a -std=c++20
$ ./a
Approximate division of 3/4: 0.74609375
Real division of 3/4: 0.75
The result is produced. The error is $0.75 - 0.74609375 = 0.00390625$, which is precisely $2^{-8}$: one unit in the last place (ULP) of the output format. The smallest step an Fx8_16 value can take is $1/256$, and the truncating (round-toward-zero) fixed-point multiplications leave the computed reciprocal just under its true value, so the final answer lands one step below the exact quotient: we get $191/256 = 0.74609375$ instead of $192/256 = 0.75$. Having the final precision limited by the output format's smallest bit is a key characteristic of fixed-point systems.
Optimizations
Power-of-Two Denominator Shortcut
When the denominator is a power of two ($d = 2^{k}$), the division simplifies to a bit shift. This avoids the computationally expensive Newton-Raphson (NR) iterative loop entirely. The check to detect whether an integer is a power of two can be performed in $O(1)$ time (for example, with `(d & (d - 1)) == 0` for non-zero d), providing a major speed increase for such common cases.
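A sketch of the shortcut (the helper names are illustrative; Fx8_16 comes from the implementation above, and the snippet assumes non-negative operands, that n << 8 fits in 32 bits, and that the quotient fits the Fx8_16 range):

#include <bit> // std::has_single_bit, std::countr_zero (C++20)

bool is_power_of_two(uint32_t d) {
    return std::has_single_bit(d); // O(1): true when exactly one bit is set
}

Fx8_16 fxdiv_pow2(uint32_t n, uint32_t d) {
    int k = std::countr_zero(d);   // log2(d) when d is a power of two
    // (n << 8) is n expressed as an Fx8_16 raw value; shifting right by k divides by d.
    return Fx8_16::fromRaw(static_cast<int16_t>((n << 8) >> k));
}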
Exploiting Quadratic Convergence
One property of NR iteration is Quadratic Convergence. Every iteration approximately doubles the number of correct bits in the result.
The optimization involves determining the required number of iterations based on
the fractional length of the resultant type. The initial Chebyshev guess has a
maximum error of $1/17$, which squares with every iteration: roughly 8 correct
bits after one iteration, 16 after two, and 32 after three. For a required
precision of 16 fractional bits (as in the Fx16_32 intermediate type), only two
iterations are mathematically necessary after the initial approximation.
constexpr if statements
can be used to check the required precision at compile time and eliminate
unnecessary NR iterations, ensuring no runtime performance penalty, as sketched below.
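A sketch of that idea (the helper refine_reciprocal and the explicit FRAC_BITS template parameter are illustrative choices, not the original code):

// Refine an initial reciprocal estimate with only as many NR iterations as the
// target fractional precision requires. The initial guess is good to ~4 bits
// (max error 1/17) and every iteration roughly doubles the number of correct bits.
template <typename Fx, int FRAC_BITS>
Fx refine_reciprocal(Fx d_scaled, Fx initial) {
    Fx two{2.0f};
    Fx val = initial;
    val = val * (two - d_scaled * val);         // ~8 correct bits
    if constexpr (FRAC_BITS > 8) {
        val = val * (two - d_scaled * val);     // ~16 correct bits
    }
    if constexpr (FRAC_BITS > 16) {
        val = val * (two - d_scaled * val);     // ~32 correct bits
    }
    return val;
}

// e.g. val = refine_reciprocal<Fx16_32, 16>(d_scaled, initial_approx);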
Alternative Initial Approximation Methods
While the Chebyshev approximation provides the optimal minimax error for the initial guess, alternative methods exist:
- Remez Algorithm: Can generate even better approximations using a higher-degree polynomial. The trade-off is higher initial calculation complexity for the potential benefit of requiring fewer subsequent NR iterations.
- Look-up Tables (LUTs) and Magic Constants: Recent research shows that pre-computed LUTs combined with constants can speed up the initial reciprocal approximation.
- CORDIC (Coordinate Rotation Digital Computer): An alternative iterative technique that can calculate division and other trigonometric functions using only additions and shifts. It offers competitive performance in some hardware environments.
Conclusion
This article came into being after I spent around a month trying to understand approximation methods and getting code for them to work. The actual code was contributed to the llvm-project as an addition to the stdfix library in the llvm libc project. Here is the PR. Here’s a link to the complete division function implemented with stdfix primitives and the optimizations mentioned above.