This code is rather hairy. When I look at it, I don't like the way it
stares back at me.
First, the rewind business looks fishy. It has to die.
And we don't have to rely on the modulus operation either, not even
for non-power-of-2 buffers. Let's kill that distinction too and make
all buffer sizes always "high performance".
The code now relies entirely on simple ALU operations (add, sub and
compare).
The key assumption: unsigned 32-bit values wrap around once their
maximum range is exceeded. No saturation. All architectures supported
by Zephyr behave that way.
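To illustrate the wraparound assumption (again a hypothetical sketch,
not the actual code): with free-running 32-bit put/get counters, the
number of bytes in the buffer is a plain unsigned subtraction, and it
remains correct across rollover because the subtraction itself is
performed modulo 2^32:

    #include <stdint.h>

    /* Free-running counters; only their difference is meaningful. */
    static inline uint32_t bytes_used(uint32_t put, uint32_t get)
    {
            /* Correct even after 'put' wraps past UINT32_MAX while
             * 'get' has not: e.g. put = 0x00000002 and
             * get = 0xfffffffe yield 4, as long as the true
             * difference never exceeds the buffer size (< 2^32).
             */
            return put - get;
    }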
Some stats:
lib/os/ring_buffer.c: 62 insertions(+), 124 deletions(-)
ring_buffer.c.obj    before  after   diff (bytes)
-------------------------------------------------
frdm_k64f              1224   1136    -88
m2gl025_miv            2485   2079   -406
mps2_an385             1228   1132    -96
mps2_an521             1228   1132    -96
native_posix           1546   1496    -50
native_posix_64        1598   1595     -3
nsim_hs_mpuv6          1252   1192    -60
nsim_hs_smp            1252   1192    -60
nsim_sem               1252   1192    -60
qemu_arc_em            1324   1192   -132
qemu_arc_hs6x          1824   1620   -204
qemu_arc_hs            1252   1192    -60
qemu_cortex_a53_smp    2154   1888   -266
qemu_cortex_a53        2154   1888   -266
qemu_cortex_a9         1938   1792   -146
Before (qemu_cortex_a53):
START - test_ringbuffer_performance
1 byte put-get, avg cycles: 52
4 byte put-get, avg cycles: 47
1 byte put claim-finish, avg cycles: 39
5 byte put claim-finish, avg cycles: 41
5 byte get claim-finish, avg cycles: 52
PASS - test_ringbuffer_performance in 0.8 seconds
After (qemu_cortex_a53):
START - test_ringbuffer_performance
1 byte put-get, avg cycles: 34
4 byte put-get, avg cycles: 41
1 byte put claim-finish, avg cycles: 27
5 byte put claim-finish, avg cycles: 29
5 byte get claim-finish, avg cycles: 29
PASS - test_ringbuffer_performance in 0.4 seconds
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>