As before, tmpfs was used, but this time both runs were conducted on the same physical node (btc2 of the Mistral test system) with the same kernel; in one case KPTI was disabled via the debug interface.
The following parameters were used for MD-Real-IO:
-O=1 -I=10000 -D=1 -P=10 -R=10 --process-reports -S=3901 --latency-all -- -D=/dev/shm/test
It was run with either 1 or 10 processes.
The 10 latency files produced by the 10-process run were merged so that the timings of 100k individual I/Os could be assessed.
Note that the analyzed file then contains the measurements of all processes!
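The merge step can be sketched roughly as follows. This is only an illustration: it assumes, hypothetically, that each per-process file holds one latency value in seconds per line, which may not match the actual output format of MD-Real-IO. The file names and the helper `merge_latencies` are made up for this example.

```python
import tempfile
from pathlib import Path
from statistics import mean


def merge_latencies(paths):
    """Concatenate the latency samples of several per-process files.

    Assumption (not from the benchmark docs): one float per line, in seconds.
    """
    samples = []
    for path in paths:
        for line in Path(path).read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):  # skip blanks and comments
                samples.append(float(line))
    return samples


# Tiny self-contained demo with two made-up "process" files.
with tempfile.TemporaryDirectory() as tmp:
    files = []
    for rank, values in enumerate([[2.0e-06, 4.0e-06], [3.0e-06, 3.0e-06]]):
        path = Path(tmp) / f"latency-{rank}.txt"  # hypothetical file name
        path.write_text("\n".join(str(v) for v in values) + "\n")
        files.append(path)
    merged = merge_latencies(files)

print(len(merged), mean(merged))
```

With the real data, the merged list would hold the samples of all 10 processes, so any statistic computed on it describes the node as a whole rather than a single process.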
Understanding latency for 1 process
First, let's look at the mean performance and the relative performance loss with KPTI enabled for 1 process, as this case is expected to show the highest impact:
Operation | Disabled KPTI (s) | With KPTI enabled (s) | Relative speed with KPTI |
Create | 3.84E-06 | 4.33E-06 | 0.89 |
Read | 2.96E-06 | 3.65E-06 | 0.81 |
Delete | 2.47E-06 | 2.73E-06 | 0.91 |
Stat | 1.80E-06 | 1.98E-06 | 0.91 |
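As a quick cross-check, the relative speed column can be recomputed from the rounded means in the table. It is simply the mean latency without KPTI divided by the mean latency with KPTI. The sketch below assumes the values are mean latencies in seconds; the names `means_1proc` and `relative_speed` are chosen for this example only. Since the table values are rounded, the last digit of a recomputed ratio can differ from the table.

```python
# Mean latency per operation in seconds: (KPTI disabled, KPTI enabled),
# copied from the rounded values in the table.
means_1proc = {
    "Create": (3.84e-06, 4.33e-06),
    "Read": (2.96e-06, 3.65e-06),
    "Delete": (2.47e-06, 2.73e-06),
    "Stat": (1.80e-06, 1.98e-06),
}


def relative_speed(disabled, enabled):
    """Speed with KPTI relative to speed without it (inverse latency ratio)."""
    return disabled / enabled


def latency_increase(disabled, enabled):
    """Fractional increase of the mean latency caused by enabling KPTI."""
    return enabled / disabled - 1.0


for op, (off, on) in means_1proc.items():
    print(f"{op:6s} relative speed {relative_speed(off, on):.2f}, "
          f"latency increase {latency_increase(off, on) * 100:.0f}%")
```

Note that the two views differ slightly: a relative speed of 0.81 for reads corresponds to a mean latency that is about 23% higher.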
Indeed, there is some performance loss; in particular, reads are now 19% slower than without KPTI. Still, the degradation is at the microsecond scale. The exact distribution is shown in the density distributions:
[Figures: latency density distributions for 1 process, with and without KPTI]
Understanding latency for 10 processes
The same experiment has been run with 10 processes producing a comparable table:
Operation | Disabled KPTI (s) | With KPTI enabled (s) | Relative speed with KPTI |
Create | 1.31E-05 | 1.33E-05 | 0.99 |
Read | 1.13E-05 | 1.13E-05 | 0.99 |
Delete | 1.09E-05 | 1.06E-05 | 1.03 |
Stat | 8.74E-06 | 8.35E-06 | 1.05 |
Huh, that is surprising, isn't it? While the mean latency of a single process increased with KPTI enabled, with 10 processes the mean latency actually improved by 3% and 5% for delete and stat, respectively.
The exact distribution is shown in the density distributions:
As expected, the density distributions are a bit smoother and wider compared to a single process.
This indeed explains the previously reported, counterintuitive result that performance improved for some IO-500 benchmarks with the KPTI patch enabled.
Conclusions
The KPTI patch has an impact on the latency of a single process on the order of 10-20%, for operations that take about 2-4 microseconds on our system; in absolute terms the increase stays below one microsecond. This is far below the Lustre latency, which is at least on the order of 100 microseconds when running the same benchmark, so it will not influence our operational setup, except for cached cases, but we have a cache issue on our system anyhow. With multiple processes per node, the impact is negligible, and KPTI actually improves overall performance slightly for some operations; the reason should be investigated.
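To put the single-process overhead into perspective, the absolute latency increase per operation can be compared against the Lustre latency. The sketch below uses the rounded single-process means from the table above and takes 100 microseconds as the lower bound for Lustre quoted in the text; the variable names are chosen for this example only.

```python
# Mean latency per operation in seconds for 1 process:
# (KPTI disabled, KPTI enabled), rounded values from the table.
means_1proc = {
    "Create": (3.84e-06, 4.33e-06),
    "Read": (2.96e-06, 3.65e-06),
    "Delete": (2.47e-06, 2.73e-06),
    "Stat": (1.80e-06, 1.98e-06),
}

# Lower bound for the Lustre latency as stated in the text (about 100 us).
lustre_latency = 100e-06

# Absolute extra latency caused by KPTI, per operation.
overhead = {op: on - off for op, (off, on) in means_1proc.items()}

# The same overhead expressed as a fraction of one Lustre operation.
share_of_lustre = {op: extra / lustre_latency for op, extra in overhead.items()}

for op in means_1proc:
    print(f"{op:6s} +{overhead[op] * 1e6:.2f} us "
          f"({share_of_lustre[op] * 100:.2f}% of a 100 us Lustre operation)")
```

Even for reads, the operation with the largest increase, the extra latency is well under 1% of a 100 microsecond Lustre operation.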