An efficient pair of polynomials for approximating sincos

In CicuitPython, I was looking for an efficient polynomial approximation of sin(x) and cos(x) over the interval [0:π/2].

Being a simple person, I searched enough documentation to find that numpy was happy to make a polynomial approximation of any table of data:

>>> x = np.linspace(0, np.pi/2, 10000)
>>> Polynomial.fit(x, np.cos(x), deg=5)
Polynomial([ 0.70710188, -0.55535829, -0.21798633,  0.05707696,  0.01090002,
       -0.00171961], domain=[0.        , 1.57079633], window=[-1.,  1.], symbol='x')
>>> Polynomial.fit(x, np.sin(x), deg=5)
Polynomial([ 0.70710188,  0.55535829, -0.21798633, -0.05707696,  0.01090002,
        0.00171961], domain=[0.        , 1.57079633], window=[-1.,  1.], symbol='x')

But wait, the coefficients are the same, except for some of the signs? What's going on? That's not how sin and cos are related.

The design of numpy's polynomial fit routine has given us an unexpected gift: the original domain of 0 to π/2 has been changed (offset & scaled) to the window -1 to 1. This happens to make the shifted-sin and shifted-cos routines reflect around the line x=0. So it's no surprise that the even coefficients are the same (as (-x)^2k = x^2k for integers k) and the odd coefficients are negated.

This leads to an implementation of sincos that only has to do fewer multiplications than I expected, and will therefore be a bit quicker.

Oh, and the maximum absolute error for this polynomial seems to be about 5e-6, or little enough that it probably can't be noticed when processing 16-bit audio.

C implementation of the algorithm:

Compiled on godbolt with -Os -mcpu=cortex-m4 -mhard-float -fsingle-precision-constant, it gives some very tidy-looking code.

fast_offset_scaled_sincos(float, sincos_result*):
  vmul.f32 s15, s0, s0
  vldr.32 s12, .L2
  vmul.f32 s14, s15, s15
  vmul.f32 s13, s0, s15
  vmul.f32 s14, s14, s12
  vldr.32 s12, .L2+4
  vfma.f32 s14, s15, s12
  vldr.32 s12, .L2+8
  vmul.f32 s15, s15, s13
  vadd.f32 s14, s14, s12
  vldr.32 s12, .L2+12
  vmul.f32 s15, s15, s12
  vldr.32 s12, .L2+16
  vfma.f32 s15, s13, s12
  vldr.32 s13, .L2+20
  vfma.f32 s15, s0, s13
  vadd.f32 s13, s14, s15
  vsub.f32 s14, s14, s15
; function epilogue and table of constants (L2) omitted

Note: More optimal coefficients can come from better algorithms like Remez, as implemented by py-remezfit. The resulting worst error is a bit higher but the average error should be lower.

$ python3 remezfit.py -d single  -- "lambda x: np.cos((x+1)*np.pi/4)" -1 1 5
p = array([ 0.7071076   , -0.5553769   , -0.21797441  ,  0.05672453  ,
        0.011838001 , -0.0023185865], dtype=float32)
        
prec = 7.189810e-06