Tuesday, January 23, 2018

Impact of KPTI on the IO-500 for DKRZ

In a previous post, a first test with the IO-500 was run on tmpfs to see the impact of the kernel patch on the resulting numbers. That experiment had some limitations, as it was run on two different nodes.
This time, the experiment is repeated on the same node (btc2 of the DKRZ test system) and with the exact same kernel, but in one case KPTI was enabled and in the other it was disabled using the debug interface /sys/kernel/debug/x86/pti_enabled.
Also, the overall benchmark suite was run not only 50 times but more than 500 times in each configuration.

Again, 10 processes have been used.
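
For reference, switching KPTI on and off at runtime worked roughly as sketched below (assuming debugfs is mounted at /sys/kernel/debug and root privileges; this is the RedHat debug interface mentioned above):

  # check the current state: 1 = KPTI enabled, 0 = disabled
  cat /sys/kernel/debug/x86/pti_enabled
  # disable KPTI for one series of runs ...
  echo 0 > /sys/kernel/debug/x86/pti_enabled
  # ... and re-enable it for the other
  echo 1 > /sys/kernel/debug/x86/pti_enabled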

Results

An overview is given in the following table:
Experiment            Relative speed with KPTI
ior_easy_write        0.995
mdtest_easy_write     1.011
ior_hard_write        0.991
mdtest_hard_write     1.000
find                  0.973
ior_easy_read         1.002
mdtest_easy_stat      1.003
ior_hard_read         1.015
mdtest_hard_stat      0.999
mdtest_easy_delete    1.001
mdtest_hard_read      0.969
mdtest_hard_delete    0.993

So overall a few experiments now run about 3% slower (find, mdtest_hard_read), while mdtest_easy_write is about 1% faster.


The following graphs provide boxplots of the individual repeated measurements with KPTI enabled/disabled:
Fig1: IOR measurements

Fig2: Metadata measurements
While there are some outliers in both configurations, the overall picture looks comparable.

Conclusions

The impact of KPTI on I/O benchmarks is negligible on our system, particularly since Lustre is significantly slower than tmpfs. The IO-500 results are also consistent with the more fine-grained latency analysis of individual operations in my previous post.

Fine-grained impact of KPTI on HPC node I/O performance

In the previous post, I investigated the impact of a RedHat kernel patched for Meltdown on the IO-500 results. Since the results were not significant, I now investigate the fine-grained timing behavior of individual operations using the MD-Real-IO benchmark.
Again tmpfs was used, but this time both runs were conducted on the same physical node (btc2 of the Mistral test system) with the same kernel; in one case KPTI was disabled via the debug interface.

The following parameters were used for MD-Real-IO:
-O=1 -I=10000 -D=1 -P=10 -R=10 --process-reports -S=3901 --latency-all -- -D=/dev/shm/test
It was run with either one or 10 processes.
The 10 latency files produced after the run were merged such that timings for 100k individual I/Os could be assessed.
Note that the analyzed file now contains the measurements of all processes!
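
The runs were driven roughly as in the following sketch (the md-real-io binary name, the srun launcher, and the latency file names are assumptions for illustration; the parameters are the ones listed above):

  # launch on one node with 10 MPI processes (or 1 for the single-process case)
  srun -N 1 -n 10 ./md-real-io -O=1 -I=10000 -D=1 -P=10 -R=10 --process-reports \
    -S=3901 --latency-all -- -D=/dev/shm/test
  # merge the per-process latency files so all individual I/Os can be analyzed together
  cat *latency*.csv > latency-merged.csv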

Understanding latency for 1 process


First, let's look at the mean performance and the relative performance loss when KPTI is enabled, for 1 process, as this case is expected to show the highest impact:


Operation   KPTI disabled (s)   KPTI enabled (s)   Relative speed with KPTI
Create      3.84E-06            4.33E-06           0.89
Read        2.96E-06            3.65E-06           0.81
Delete      2.47E-06            2.73E-06           0.91
Stat        1.80E-06            1.98E-06           0.91

It can be seen that there is indeed some performance loss; reads in particular are now 19% slower than without KPTI (2.96 µs vs. 3.65 µs, i.e., a relative speed of 0.81). Still, the performance degradation is on the order of a microsecond or less per operation. The exact distribution is shown in the density distributions:

Fig 1: Without KPTI

Fig 2: With KPTI enabled


Understanding latency for 10 processes

The same experiment was run with 10 processes, producing a comparable table:


Operation   KPTI disabled (s)   KPTI enabled (s)   Relative speed with KPTI
Create      1.31E-05            1.33E-05           0.99
Read        1.13E-05            1.13E-05           0.99
Delete      1.09E-05            1.06E-05           1.03
Stat        8.74E-06            8.35E-06           1.05

Huh, that is surprising, isn't it? While the latency for a single process actually increased with KPTI enabled, with 10 processes the mean latency actually improved by 3% and 5% for delete and stat, respectively.

 The exact distribution is shown in the density distributions:

Fig 3: 10 Processes, without KPTI enabled

Fig 4: 10 Processes, with KPTI enabled

As expected, the density distributions are a bit smoother and wider compared to a single process.
This indeed explains the previously reported and counterintuitive result that, with the KPTI patch enabled, the performance improved for some IO-500 benchmarks.

Conclusions

The KPTI patch has an impact of about 10-20% on the latency of a single process, for operations that take about 2-4 microseconds on our system. This is far from the Lustre latency, which is at least on the order of 100 microseconds when running the same benchmark, and thus will not influence our operational setup -- except for cached cases, but we have a cache issue on our system anyhow. With multiple processes per node, the impact is negligible and KPTI actually improves overall performance slightly -- the reason should be investigated.

Friday, January 19, 2018

Impact of KPTI on HPC storage performance

The patches for the Meltdown and Spectre bugs may be performance critical for certain workloads. This is particularly crucial for data centers and HPC workloads. Researchers have started to investigate the impact on HPC workloads; one study, for example, looked at the performance of IOR and MDTest and showed that MDTest is severely impacted. This suggests that applying the patches may affect performance intercomparison efforts such as the IO-500 list, which aims to track storage performance across file systems and time. For that purpose, a measurement procedure and scripts are provided.

At the German Climate Computing Center (DKRZ), RedHat Enterprise Linux 6 is used, which is also affected by the Meltdown and Spectre CPU bugs. A relevant question was how much the patches for these bugs would affect the IO-500 measurements. However, measuring the impact is not easy in this setup, as all nodes of the test system have been updated with the new kernel, and this update comes along with an update of the Lustre client code that resolves several issues on the DKRZ system.

Note that a new post with a statistically better analysis is available here.

A first measurement of the impact

A first and simple approach that fits into the current situation has been conducted as follows:
  • Set up the IO-500 benchmark on the test system and on the production system with the old kernel
  • Run the IO-500 benchmark on tmpfs (/dev/shm) to exclude the impact of the Lustre system
  • Repeat the measurement 50 times and perform some statistical analysis

Test setup

Both nodes are equipped with the same hardware, in particular two Intel Xeon E5-2680 v3 @ 2.5 GHz processors. The unpatched node uses the kernel 2.6.32-696.16.1.el6 and the node with the bugfix uses the kernel 2.6.32-696.18.7.

For the IO-500 configuration, the following parameters were used:
  io500_ior_easy_params="-t 2048k -b 2g -F" # 2M writes, 2 GB per proc, file per proc 
  io500_mdtest_easy_params="-u -L" # unique dir per thread, files only at leaves
  io500_mdtest_easy_files_per_proc=50000
  io500_ior_hard_writes_per_proc=10000
  io500_mdtest_hard_files_per_proc=20000

12 processes are started using srun, i.e., 12 times the number of files and the data volume listed above are actually used. Each individual benchmark takes about one second to run; since the throughput and metadata rates are very high, main memory fills up quickly, which prevents running the benchmarks with larger settings.
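
The repetitions were driven by a simple loop along the following lines (a sketch; it assumes the io500.sh script of that time, where the working directory and the MPI launcher are configured via the io500_workdir, io500_mpirun and io500_mpiargs variables -- paths and file names are illustrative):

  # inside io500.sh: run from tmpfs and launch 12 processes per benchmark
  io500_workdir=/dev/shm/io500-test
  io500_mpirun="srun"
  io500_mpiargs="-N 1 -n 12"

  # repeat the full benchmark suite 50 times and keep the output for the statistics
  mkdir -p results
  for i in $(seq 1 50); do
    ./io500.sh | tee results/run-$i.txt
  done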

Results

Boxplots for IOR are shown in the following diagram, i.e., the first and third quartiles of the 50 measurements form the box, the median is the vertical line, and whiskers and outliers are shown:

There are quite a few outliers due to the short runtime and the short warm-up phase of IOR, but in general the picture looks similar: there is no significant observable performance degradation for this benchmark.

Results for MDTest, shown in this diagram, generally behave similarly:




Now, for each performance number in the order reported by the IO-500, we compare the mean over the patched runs with the mean over the unpatched runs; in the following ratios, a value above 1 indicates that the patched version is slower (a sketch of this computation is given after the list):

ior_easy_write = 1.062
Thus, the measurement in the patched version is 6% slower.

mdtest_easy_write = 0.962
Thus, the patched kernel is 3.8% faster than the old version.

The other values are similar, with the exception of find and ior_easy_read, where there actually is a 13-14% overhead in the new version.
ior_hard_write = 0.973
mdtest_hard_write = 0.976
find = 1.132    # So this is considered to be 13% slower
ior_easy_read = 1.142  # This is 14 % slower
mdtest_easy_stat = 0.991
ior_hard_read = 1.001
mdtest_hard_stat = 0.981
mdtest_easy_delete = 0.961
mdtest_hard_read = 0.943
mdtest_hard_delete = 0.982
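
For completeness, the ratios above were computed along the lines of the following sketch (the input format -- one measured value per line and per kernel, extracted beforehand from the io500 output -- as well as the file names are assumptions for illustration):

  # mean over the 50 repetitions of one metric, e.g. ior_easy_write in GB/s
  mean() { awk '{ s += $1; n++ } END { printf "%.4f\n", s / n }' "$1"; }

  m_old=$(mean ior_easy_write.unpatched.txt)  # old kernel
  m_new=$(mean ior_easy_write.patched.txt)    # patched kernel
  # following the convention above, a result above 1 means the patched kernel is slower
  echo "$m_old $m_new" | awk '{ printf "ior_easy_write = %.3f\n", $1 / $2 }'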

The values suggest that the patched version is actually often a bit faster, and a t-test confirms this in some cases, but take these numbers with a grain of salt, as this quick measurement is biased:
  • The measurement was not conducted on the exact same hardware; there might be minimal differences even though the same hardware components are used.
  • There are many outliers; therefore, the 50 repetitions are not sufficient.
However, it does show that the kernel update is not too worrisome; the values measured are similar and not up to 40% slower, as indicated by the paper cited above. Since Lustre is significantly slower than tmpfs, the impact on it is anticipated to be low. Certainly, the measurement should be refined and conducted on the same node to remove that bias, and it should be repeated more often.
