Lbench
Lbench is a simple Linux multithread benchmark tool.
It measures various aspects of CPU and OS performance for 1 to 64 parallel threads and shows how performance scales with multiple threads and multiple CPU cores.
Lbench can make the following measurements:
- integer-32 and float-64 arithmetic performance
- performance of common engineering functions (square root, sine, etc.)
- memory throughput for processor cache and main memory (block memory moves, int-32 sequential get/put, int-32 random get/put)
- function call/return overhead
- 2D matrix math performance
- Fibonacci benchmark (recursion method)
- sort millions of random character strings using 4 threads
- classic Whetstone benchmark for type double (64-bit floating point)
- classic Linpack benchmark for type double
- thread switch rate for do-nothing threads
- time required to start and complete a do-nothing process thread
- time required to start and complete a do-nothing sub-process
- time required to acquire and release a global lock
- disk throughput for serial and random I/O using various block sizes (direct to disk without OS caching)
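To make the "sequential get/put" and "random get/put" memory measurements concrete, here is a minimal Python sketch of the idea. It is not Lbench's code: the names `time_get_put` and `compare_access` are hypothetical, and `array("i")` is assumed to hold 4-byte integers (true on typical Linux systems). Interpreter overhead dominates in Python, so the gap between the two access orders will be far smaller than in a compiled benchmark.

```python
import random
import time
from array import array


def time_get_put(buf, order):
    """Get each element in the given index order, add 1, and put it back.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for i in order:
        buf[i] = (buf[i] + 1) & 0x7FFFFFFF  # stay within signed 32-bit range
    return time.perf_counter() - start


def compare_access(n=200_000, seed=1):
    """Time one sequential pass and one random-order pass over the same buffer."""
    buf = array("i", [0]) * n            # n integers, assumed 4 bytes each
    sequential = list(range(n))          # indices 0, 1, 2, ...
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)  # same indices, random order
    t_seq = time_get_put(buf, sequential)
    t_rand = time_get_put(buf, shuffled)
    return t_seq, t_rand, buf


if __name__ == "__main__":
    t_seq, t_rand, _ = compare_access()
    print(f"sequential: {t_seq:.4f} s, random: {t_rand:.4f} s")
```

Both passes touch every element exactly once, so any timing difference comes from the access order alone.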
Lbench has two other functions not related to benchmarking:
- Cooling performance: run multiple CPU-bound threads, monitor processor throughput and temperature, detect if the CPU clock is throttled down due to thermal overload.
- Memory test and burn-in: continuously fill memory with random values, read and compare.
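The fill, read and compare cycle of the memory test can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how Lbench generates or verifies its random values, so the seeded-generator approach and the names `fill_random` and `count_bad_chunks` are hypothetical. The sketch uses `random.Random.randbytes`, which requires Python 3.9 or later.

```python
import random

CHUNK = 1 << 16  # compare in 64 KiB pieces so no second full-size copy is needed


def fill_random(buf, seed):
    """Fill the buffer with pseudo-random bytes derived from the seed."""
    rng = random.Random(seed)
    for off in range(0, len(buf), CHUNK):
        size = min(CHUNK, len(buf) - off)
        buf[off:off + size] = rng.randbytes(size)


def count_bad_chunks(buf, seed):
    """Regenerate the same byte stream and count chunks that no longer match."""
    rng = random.Random(seed)
    bad = 0
    for off in range(0, len(buf), CHUNK):
        size = min(CHUNK, len(buf) - off)
        if buf[off:off + size] != rng.randbytes(size):
            bad += 1
    return bad


if __name__ == "__main__":
    memory = bytearray(8 * 1024 * 1024)   # 8 MiB test region (arbitrary size)
    for cycle in range(3):                # a burn-in would loop indefinitely
        fill_random(memory, seed=cycle)
        print(f"cycle {cycle}: {count_bad_chunks(memory, seed=cycle)} bad chunks")
```

Re-seeding the generator for the compare step avoids keeping a second copy of the data, which lets the test region cover more of the available memory.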
The user guide is available from the [help] button and goes into more detail about each benchmark and configurable parameters.
Some conclusions based on benchmarking the author's PC (July 2025):
- For pure calculations, 4 threads have nearly 4 times the aggregate throughput of 1 thread.
- The cache memory throughput for 4 threads is nearly 4 times that of 1 thread.
- The main memory throughput for 4 threads is not much greater than that of 1 thread.
- Block memory moves are 2-20 times faster than 4-byte sequential get/put moves (depending on the size of the block relative to the size of the CPU cache).
- 4-byte sequential get/put moves are 1-10 times faster than 4-byte random get/put moves.
- 4-byte sequential get/put moves have about the same speed for cache and main memory.
- 4 threads contending for a mutex-controlled resource can swap ownership more than 20 million times/sec.
- The time to create a thread which does nothing but exit is about 28 microseconds.
- The time to create a sub-process which does nothing but exit is about 296 microseconds.
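The last two figures compare the cost of creating a do-nothing thread with that of creating a do-nothing sub-process. A rough way to reproduce that kind of comparison is sketched below. It is not how Lbench itself measures: the function names are hypothetical, the sub-process is assumed to be created with `os.fork` (POSIX only), and Python adds overhead of its own, so the absolute numbers will differ from those above.

```python
import os
import threading
import time


def do_nothing():
    """Thread body that returns immediately."""
    pass


def mean_thread_time(count=200):
    """Mean seconds to start a do-nothing thread and wait for it to finish."""
    start = time.perf_counter()
    for _ in range(count):
        t = threading.Thread(target=do_nothing)
        t.start()
        t.join()
    return (time.perf_counter() - start) / count


def mean_subprocess_time(count=50):
    """Mean seconds to fork a child that exits at once and wait for it."""
    start = time.perf_counter()
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)          # child: exit immediately, skipping cleanup
        os.waitpid(pid, 0)       # parent: wait for the child to complete
    return (time.perf_counter() - start) / count


if __name__ == "__main__":
    print(f"thread:      {mean_thread_time() * 1e6:.0f} microseconds")
    print(f"sub-process: {mean_subprocess_time() * 1e6:.0f} microseconds")
```

Each measurement includes waiting for completion, matching the "start and complete" wording used for these benchmarks.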
Example Output Using 4 Threads
