
Faster Division With Newton-Raphson Approximation

Many devices, especially embedded ones (micro-controllers and the like), come without an FPU or the circuitry required for carrying out integer division. In such cases, one looks towards methods of approximating the result of a division and storing it in fixed-point format.

C has standardized support for such cases via its stdfix library. The ISO Document describes the data types and functions available in stdfix.h.

This post describes the theory, provides a dependency-free C++ implementation of the core algorithm and discusses optimizations to speed it up even further. In that order.

Theory

The problem at hand is that of division:

$C = \frac{N}{D}$

The first step to solving this is to split the problem in two: Reciprocal calculation, followed by multiplication. This is what it looks like:

$C = N \cdot \left(\frac{1}{D}\right)$

It is assumed that the device has fast multiplication hardware (which is usually the case). Thus, multiplication can be carried out by the good ol’ mul instruction. Now for the trickiest part: calculating the reciprocal, which is itself a division operation!

As it turns out, this is a known problem and solutions are approximately as old as the Unix epoch.

Newton-Raphson Method

The Wikipedia article sufficiently describes the what and how of the NR method. I’ll summarize it and add some missing context.

“Newton’s Method is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function.” This is the generic iterative equation according to Newton’s method:

$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}$

The idea is to find a function $f(x)$ for which $x = \frac{1}{D}$ is a zero. One such function is:

$f(x) = \frac{1}{x} - D$

(Substitute $x = \frac{1}{D}$ in the above equation and it should result in zero.)

Next, we find $f'(x)$ and substitute it into the Newton's Method equation, giving us an equation that allows successive improvements.

$x_{i+1} = x_i - \frac{\frac{1}{x_i} - D}{-\frac{1}{x_i^2}} = x_i \cdot (2 - D x_i)$

Astute readers will notice that the result $x_{i+1}$ depends on $x_i$ (the previous iteration), i.e. $x_2$ depends on $x_1$ and $x_1$ depends on $x_0$. How do we calculate $x_0$? This is the problem of initial approximation.
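
To see the iteration in action, here is a minimal floating-point sketch (illustrative only, not part of the fixed-point implementation below; the divisor value and starting guess are arbitrary):

#include <cstdio>

// Illustrative only: refine a starting guess x0 for 1/D using
// x_{i+1} = x_i * (2 - D * x_i).
int main() {
    double D = 0.6;   // divisor, assumed already scaled into [0.5, 1]
    double x = 1.0;   // some starting guess x0
    for (int i = 0; i < 4; ++i) {
        x = x * (2.0 - D * x);
        std::printf("x_%d = %.10f   (true 1/D = %.10f)\n", i + 1, x, 1.0 / D);
    }
    return 0;
}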

Initial Approximation to the Reciprocal

Lest the curtain be drawn too soon, here is the final equation for calculating $x_0$, provided $D$ has been scaled to be in the range $[0.5, 1]$:

$x_0 = \frac{48}{17} - \frac{32}{17} D$

This equation is a linear, smooth, and non-periodic function. In numerical algorithms like division, the goal is not necessarily the smallest average error, but guaranteeing that the worst-case error ($|\epsilon_0|$) is as small as possible. This predictability is important because the initial error directly determines the number of Newton-Raphson iterations required to reach full machine precision.

We wish to calculate an approximation for the function 1/D such that the worst-case error is minimal. The right tool for this job is the Chebyshev Equioscillation Theorem.

Chebyshev approximation is used because it provides the Best Uniform Approximation (or Minimax Approximation). This means that out of all possible polynomials of a given degree, the Chebyshev method yields the one that minimizes the maximum absolute error across the entire target interval.

We start by formulating the error function on which equioscillation will be applied. The error function for figuring out the reciprocal $x_0 = 1/D$ using a simple straight line $x_0 = T_0 + T_1 \cdot D$ (a linear equation) tells us how far off our guess is from the perfect answer. Because we want the total result $D \cdot x_0$ to be near 1, we make the error function $f(D)$ measure the difference between that product and 1. The formula is $f(D) = 1 - D \cdot x_0$. When we plug in the straight-line guess, the formula becomes:

$f(D) = 1 - T_0 D - T_1 D^2$

The main goal is to pick the numbers $T_0$ and $T_1$ that minimize the absolute value of this error $f(D)$ everywhere in the range. This is exactly what the Chebyshev method does.

Before we apply the theorem to the error equation, we need to constrain the values that $D$ can take. Bounding $D$ guarantees that the starting error is small enough for the subsequent iterations to converge quickly and predictably to full fixed-point precision. Without this bound, a much more complex, higher-degree polynomial would be needed, defeating the efficiency goal. We bound $D$ to $[0.5, 1]$. In code, this scaling can be achieved through simple bit-shifts, as sketched below. As long as we scale the numerator too, the result remains correct.
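
Here is a rough sketch of the scaling idea (variable names are illustrative; floating point is used purely for readability, and the fixed-point version appears in the Implementation section):

#include <bit>
#include <cstdint>
#include <cstdio>

// Illustrative sketch: scale d into [0.5, 1) by a power of two and apply the
// same scaling to n, so the quotient n/d is unchanged.
int main() {
    uint32_t n = 3, d = 7;
    int shift = std::countl_zero(d);  // left-shift that moves d's top bit to bit 31
    double d_scaled = static_cast<double>(d << shift) / 4294967296.0;  // divide by 2^32 -> [0.5, 1)
    double n_scaled = static_cast<double>(n) * static_cast<double>(1u << shift) / 4294967296.0;
    std::printf("d_scaled = %f, n_scaled / d_scaled = %f, n / d = %f\n",
                d_scaled, n_scaled / d_scaled, static_cast<double>(n) / d);
}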

The theorem states that a polynomial is the best uniform approximation to a continuous function over an interval if and only if the error function alternates between its maximum positive and maximum negative values at least $(n+2)$ times, where $n$ is the degree of the polynomial. Since $n = 1$ (linear approximation), we need $1 + 2 = 3$ alternating extrema: at the two endpoints ($D = 1/2$ and $D = 1$) and the local extremum ($D_{min}$) between them.

The location of the local extremum is found by setting the derivative to zero:

$f'(D) = -T_0 - 2 T_1 D = 0$, which gives: $D_{min} = -\frac{T_0}{2 T_1}$

The first condition of the theorem is that the error magnitude is equal at the endpoints, i.e., $f(1/2) = f(1)$.

$1 - \frac{T_0}{2} - \frac{T_1}{4} = 1 - T_0 - T_1$

This simplifies to:

$T_0 = -\frac{3}{2} T_1$

The second condition states that the error at the endpoints must be the negative of the error at the extremum, i.e., $f(1) = -f(D_{min})$.

$1 - T_0 - T_1 = -\left(1 - T_0 D_{min} - T_1 D_{min}^2\right)$

Substituting Dmin and simplifying:

$1 - T_0 - T_1 = -\left(1 + \frac{T_0^2}{4 T_1}\right)$

Substituting the value of T0 from above:

$1 - \left(-\tfrac{3}{2} T_1\right) - T_1 = -1 - \frac{(3/2 \cdot T_1)^2}{4 T_1} \implies 1 + \tfrac{1}{2} T_1 = -1 - \frac{9 T_1}{16}$

Solving this linear equation for T1:

$2 = -\frac{17 T_1}{16} \implies T_1 = -\frac{32}{17}$

Substituting T1 back into the original equation to find T0:

$T_0 = -\frac{3}{2} \cdot \left(-\frac{32}{17}\right) \implies T_0 = \frac{48}{17}$

The resulting linear approximation is $x_0 = T_0 + T_1 D$:

$x_0 = \frac{48}{17} - \frac{32}{17} D$

This equation gives the optimal initial estimate for the reciprocal that can be refined by iterations of the Newton-Raphson equation.
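
As a quick sanity check (not part of the original derivation, but it follows directly from the coefficients above), evaluating the error function at the three extrema gives:

$f(1) = 1 - \tfrac{48}{17} + \tfrac{32}{17} = \tfrac{1}{17}, \qquad f\left(\tfrac{1}{2}\right) = 1 - \tfrac{24}{17} + \tfrac{8}{17} = \tfrac{1}{17}, \qquad f\left(\tfrac{3}{4}\right) = 1 - \tfrac{36}{17} + \tfrac{18}{17} = -\tfrac{1}{17}$

So $|1 - D \cdot x_0| \le \tfrac{1}{17} \approx 0.059$ over the whole interval: the initial guess is already accurate to a little over 4 bits, and each Newton-Raphson iteration roughly doubles that count.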

Implementation

(For convenience, here’s a link to the complete implementation)

The original rationale for this implementation was two-fold:

  1. Lack of division circuitry.
  2. Lack of support for floating point types.

Lack of division circuitry is solved by the algorithm design (reciprocal multiplication). Lack of floating-point types is dealt with by using fixed-point notation. Fixed point allows storing everything in integers and operating on them with integer multiplies and bit-shifts, which are highly cost-efficient on embedded hardware.

Here’s an implementation of a fixed-point type in C++:

#include <iostream>
#include <cmath>
#include <cstdint>
#include <cstdlib>      // std::abs(int)
#include <bit>          // std::countl_zero (used by the division routine below)
#include <type_traits>
#include <limits>
#include <cassert>
#include <numeric>

template <int FRAC_BITS, int TOTAL_BITS>
class FixedPoint {
    // Pick the narrowest signed integer type that can hold TOTAL_BITS
    // (C++20 allows the lambda inside decltype).
    using BaseType = decltype([] {
        if constexpr (TOTAL_BITS <= 8) {
            return int8_t{};
        } else if constexpr (TOTAL_BITS <= 16) {
            return int16_t{};
        } else if constexpr (TOTAL_BITS <= 32) {
            return int32_t{};
        } else {
            return int64_t{};
        }
    }());

    // Wider type for intermediate products so multiplication does not overflow.
    using WideType = typename std::conditional<(sizeof(BaseType) <= 4), int64_t, __int128_t>::type;
    BaseType value;

public:
    constexpr static BaseType SCALE = static_cast<BaseType>(1ULL << FRAC_BITS); // 2^FRAC_BITS
    FixedPoint() : value(0) {}
    FixedPoint(float f) : value(static_cast<BaseType>(std::round(f * SCALE))) {}
    BaseType raw() const { return value; }

    static FixedPoint fromRaw(BaseType rawVal) {
        FixedPoint fp;
        fp.value = rawVal;
        return fp;
    }

    float toFloat() const { return static_cast<float>(value) / SCALE; }

    FixedPoint operator+(const FixedPoint &other) const {
        return fromRaw(value + other.value);
    }
    FixedPoint operator-(const FixedPoint &other) const {
        return fromRaw(value - other.value);
    }
    FixedPoint operator*(const FixedPoint &other) const {
        // Multiply the raw values in the wide type, then shift back down to
        // realign the binary point.
        WideType prod = static_cast<WideType>(value) * other.value;
        return fromRaw(static_cast<BaseType>(prod >> FRAC_BITS));
    }
    FixedPoint operator*(float factor) const {
        return FixedPoint(toFloat() * factor);
    }

    friend std::ostream &operator<<(std::ostream &os, const FixedPoint &fp) {
        return os << fp.toFloat();
    }
};

The template parameters specify the fractional and total bit lengths. For example, FixedPoint<8, 16> uses a 16-bit integer with the scale $2^8$. The central idea is that all operations are performed on the integer directly, and conversion involves scaling and de-scaling with the constant SCALE value.
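
As a quick illustration (a hypothetical usage snippet, not taken from the post's final program), the class can be exercised like this:

int main() {
    FixedPoint<8, 16> a(1.5f);   // stored raw value: 1.5 * 256 = 384
    FixedPoint<8, 16> b(0.25f);  // stored raw value: 0.25 * 256 = 64
    std::cout << (a * b) << "\n";        // prints 0.375
    std::cout << (a + b).raw() << "\n";  // prints 448, i.e. 1.75 * 256
    return 0;
}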

The division function below implements the 4-part process: Normalization, Initial Approximation, NR Iterations, and Multiplication with Numerator. We use the higher precision FixedPoint<16, 32> (Fx16_32) for intermediate calculations to minimize approximation error before truncating to the final Fx8_16 result.

using Fx8_16 = FixedPoint<8, 16>;
using Fx16_32 = FixedPoint<16, 32>;

Fx8_16 fxdiv_corrected(int n, int d) {
    assert(d != 0 && "Divide by zero undefined");

    if (n == 0) return Fx8_16(0.0f);

    bool result_is_negative = ((n < 0) != (d < 0));
    uint32_t nv = static_cast<uint32_t>(std::abs(n));
    uint32_t dv = static_cast<uint32_t>(std::abs(d));

    // 1. Normalization/scaling 'd' to fit between 0.5 and 1.0
    int shift = std::countl_zero<uint32_t>(dv);
    uint32_t scaled_val_raw = dv << shift;
    uint32_t scaled_val_n_raw = nv << shift;
    constexpr int INTERPRETATION_SHIFT = 31 - 16;

    Fx16_32 d_scaled = Fx16_32::fromRaw(static_cast<int32_t>(scaled_val_raw >> INTERPRETATION_SHIFT));
    Fx16_32 n_scaled = Fx16_32::fromRaw(static_cast<int32_t>(scaled_val_n_raw >> INTERPRETATION_SHIFT));

    // 2. Initial approximate calculation (Chebyshev: X0 = 48/17 - 32/17 * D)
    Fx16_32 a {2.8235294f}; // 48/17
    Fx16_32 b {1.8823529f}; // 32/17
    Fx16_32 initial_approx = a - (b * d_scaled);

    // 3. Newton-Raphson iterations
    Fx16_32 val = initial_approx;
    val = val * (Fx16_32{2.f} - (d_scaled * val)); // E1: Precision ~8 bits
    val = val * (Fx16_32{2.f} - (d_scaled * val)); // E2: Precision ~16 bits (sufficient for Fx16_32)
    val = val * (Fx16_32{2.f} - (d_scaled * val)); // E3: Conservative overkill
    val = val * (Fx16_32{2.f} - (d_scaled * val)); // E4: Conservative overkill

    // 4. Multiplication with Numerator
    Fx16_32 res_16_32 = n_scaled * val;

    if (result_is_negative) {
        // Simple negation
        res_16_32 = res_16_32 * -1.0f;
    }

    // Truncate from Fx16_32 (FRAC_BITS=16) to Fx8_16 (FRAC_BITS=8)
    constexpr int TRUNCATION_SHIFT = 16 - 8;

    // Shift the raw 32-bit value right by 8 bits to change the binary point position
    int32_t raw_final_32 = res_16_32.raw();
    int16_t raw_final_16 = static_cast<int16_t>(raw_final_32 >> TRUNCATION_SHIFT);

    Fx8_16 final_res = Fx8_16::fromRaw(raw_final_16);

    return final_res;
}

The INTERPRETATION_SHIFT and TRUNCATION_SHIFT constants account for correctly aligning the binary point.

The code is compiled and run with an example:

int main() {
    std::cout.precision(10);
    int n = 3;
    int d = 4;
    Fx8_16 result = fxdiv_corrected(n, d);
    std::cout << "Approximate division of " << n << "/" << d << ": " << result << std::endl;
    std::cout << "Real division of " << n << "/" << d << ": " << static_cast<float>(3)/static_cast<float>(4) << std::endl;
    return 0;
}
$ clang++ fixed_div.cpp -o a -std=c++20
$ ./a
Approximate division of 3/4: 0.74609375
Real division of 3/4: 0.75

The result 0.74609375 is produced. The error is $|0.75 - 0.74609375| = 0.00390625$. This error is precisely $2^{-8}$, the smallest step of the Q7.8 output format, confirming that the final precision is limited by the output type's least significant bit. In fact, the number 0.00390625 has a name: ULP (unit in the last place). The intermediate Q16.16 result lands just below 0.75 because every multiplication along the way truncates, so the final truncation to Q7.8 yields $0.75 - \text{ULP}$ rather than 0.75 itself.
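
Concretely, in raw Q7.8 terms: $0.75 = \frac{192}{256}$, while the computed result is $\frac{191}{256} = 0.74609375$; the two differ by exactly one ULP, $\frac{1}{256} = 2^{-8}$.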

Optimizations

Power-of-Two Denominator Shortcut

When the denominator $D$ is a power of two ($\pm 2^k$), the division $N/D$ simplifies to a bit shift. This avoids the computationally expensive Newton-Raphson (NR) iterative loop entirely. The check to detect if an integer is a power of 2 can be performed in $\mathcal{O}(1)$ time, providing a major speed increase for such common cases.
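
A minimal sketch of this shortcut (the helper names are illustrative and not part of the implementation above):

#include <bit>
#include <cstdint>

// True if d (assumed non-zero and positive) is a power of two.
bool is_pow2(uint32_t d) {
    return (d & (d - 1)) == 0;   // equivalently: std::has_single_bit(d)
}

// If d == 2^k, n / d reduces to a shift by k.
int32_t div_by_pow2(int32_t n, uint32_t d) {
    int k = std::countr_zero(d); // position of the single set bit
    return n >> k;               // note: rounds toward negative infinity for negative n
}

Note the rounding caveat in the comment: a plain arithmetic shift rounds negative results toward negative infinity, whereas integer division truncates toward zero.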

Exploiting Quadratic Convergence

One property of NR iteration is Quadratic Convergence. Every iteration approximately doubles the number of correct bits in the result.

The optimization involves determining the required number of iterations based on the fractional length of the resultant type. For a required precision of 16 fractional bits (as in the Fx16_32 intermediate type), only two iterations are mathematically necessary after the initial approximation. if constexpr can be used to check the required precision at compile time and eliminate unnecessary NR iterations, ensuring no runtime performance penalty, as sketched below.
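
One possible way to express this, assuming the Chebyshev initial guess is good to roughly 4 bits as derived earlier (the refine helper below is hypothetical and not part of the implementation above):

// Run only as many NR steps as the target fractional precision requires.
// Each step roughly doubles the number of correct bits (4 -> 8 -> 16 -> 32).
template <int FRAC_BITS, int TOTAL_BITS>
FixedPoint<FRAC_BITS, TOTAL_BITS> refine(FixedPoint<FRAC_BITS, TOTAL_BITS> val,
                                         FixedPoint<FRAC_BITS, TOTAL_BITS> d_scaled) {
    using FP = FixedPoint<FRAC_BITS, TOTAL_BITS>;
    auto step = [&](FP v) { return v * (FP{2.f} - (d_scaled * v)); };
    val = step(val);                                    // ~8 correct bits
    if constexpr (FRAC_BITS > 8)  { val = step(val); }  // ~16 correct bits
    if constexpr (FRAC_BITS > 16) { val = step(val); }  // ~32 correct bits
    return val;
}

With FRAC_BITS = 16 this compiles down to exactly the two iterations the precision argument calls for, and the unused steps disappear at compile time.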

Alternative Initial Approximation Methods

While the Chebyshev approximation provides the optimal minimax error for the initial guess, alternative methods exist, such as a small lookup table indexed by the leading bits of D (trading a little memory for a better starting point, and hence potentially one fewer iteration) or cruder linear estimates that are cheaper to compute but start with a larger worst-case error.

Conclusion

This article came into being after I spent around a month trying to understand approximation methods and getting code for them to work. The actual code was contributed to the llvm-project as an addition to the stdfix library in the llvm libc project. Here is the PR. Here’s a link to the complete division function implemented with stdfix primitives and the optimizations mentioned above.