How to Tune QEMU L2 Cache Size and QCOW2 Cluster Size
QCOW2 is a modern virtual disk image format natively supported by QEMU and KVM. In IBM Cloud, we use the QCOW2 image format for storing virtual disk images.
The performance of a QCOW2 image can be tuned by choosing appropriate values for the L2 cache size in QEMU and the cluster size of the QCOW2 image file.
About QCOW2 format
Each QCOW2 image has a header, an L1 table, a few L2 tables, and a number of data clusters storing the content of the virtual disk. In the virtual machine, a virtual disk IO is mapped to one of the data clusters, using two levels of references in L1 and L2 tables.
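Let k be the cluster size in bytes and offset be the byte address of the data to be accessed in the virtual disk. Each L2 table occupies one cluster and each reference is 8 bytes long, so an L2 table holds k / 8 references. A virtual disk IO is performed in the following three steps:
- Fetch the reference to the L2 table from the L1 table at entry offset / (k * (k / 8))
- Fetch the reference to the data cluster from the L2 table at entry (offset / k) % (k / 8)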
- Fetch data in the data cluster at offset % k
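The three lookups above can be sketched in a few lines of Python. This is only an illustration of the index arithmetic, not QEMU's implementation: it assumes 8-byte L2 entries, so that one L2 table of k bytes holds k / 8 references (consistent with the 262,144 references per 2 MiB table quoted later in this article). The function name qcow2_indices is hypothetical.

```python
from typing import Tuple

L2_ENTRY_SIZE = 8  # assumption: each L2 entry is a 64-bit (8-byte) reference


def qcow2_indices(offset: int, cluster_size: int) -> Tuple[int, int, int]:
    """Split a virtual disk byte offset into (L1 index, L2 index, offset in cluster)."""
    entries_per_l2 = cluster_size // L2_ENTRY_SIZE  # references held by one L2 table
    cluster_number = offset // cluster_size         # which data cluster is addressed
    l1_index = cluster_number // entries_per_l2     # step 1: entry in the L1 table
    l2_index = cluster_number % entries_per_l2      # step 2: entry in the L2 table
    in_cluster = offset % cluster_size              # step 3: byte within the data cluster
    return l1_index, l2_index, in_cluster


# Example: a byte 12,345 past the 5 GiB mark, with 64 KiB clusters.
print(qcow2_indices(5 * 2**30 + 12345, 64 * 1024))
```

With 64 KiB clusters each L2 table covers 8,192 clusters (512 MiB), so the 5 GiB mark falls at the start of the eleventh L2 table, which is L1 index 10.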
To access a data block in the virtual disk, without any caching, the hypervisor needs to make three IO operations to the QCOW2 image file. This is quite expensive. To improve performance, the hypervisor caches L1 and L2 tables in memory, significantly reducing actual IOs. Since the L1 table is small, it can be easily cached in memory all the time. The L2 tables can be pretty large. Therefore, the L2 cache size needs to be tuned properly to obtain the best performance.
Tuning QCOW2 cluster size and L2 cache size
The performance of QCOW2 can be tuned by setting the cluster size and the L2 cache size. The default L2 cache size is rather small in the version of QEMU we analyzed: it is set to 1 MiB or 8 clusters, whichever is larger. Libvirt currently doesn’t support setting the L2 cache size. Therefore, for KVM virtual machines created through libvirt, we have to tune QCOW2 performance by setting the cluster size.
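The cluster size is fixed when the image is created. As a rough sketch, the helper below assembles a qemu-img command line that uses the cluster_size creation option of the qcow2 format. The helper name, the file name disk.qcow2, and the sizes are placeholders chosen for illustration, not values from our setup.

```python
import shlex
from typing import List


def qemu_img_create_args(path: str, virtual_size: str, cluster_size: str) -> List[str]:
    """Build the argument list for creating a qcow2 image with a given cluster size."""
    return [
        "qemu-img", "create",
        "-f", "qcow2",                        # image format
        "-o", f"cluster_size={cluster_size}", # qcow2 creation option
        path,
        virtual_size,
    ]


# Placeholder values: a 1 TiB image with 2 MiB clusters.
args = qemu_img_create_args("disk.qcow2", "1T", "2M")
print(shlex.join(args))
```

The resulting list can be passed to subprocess.run(args, check=True) on a host where qemu-img is installed.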
The table below shows the effectiveness of the default L2 cache size. For example, if the cluster size is set to 2 MiB, each 2 MiB L2 table can contain 262,144 references to data clusters, so the total storage referenced by each L2 table is 512 GiB. With the default of 8 L2 tables, or 16 MiB, we can cache 2,097,152 data cluster references, which translates to 4 TiB of virtual disk space.
As shown in the following table in the last column, the effectiveness of 16 MiB L2 cache is 4,096 GiB or 4 TiB:
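The arithmetic behind these figures can be reproduced with a short script. It is a sketch under two assumptions taken from the text: the default cache is the larger of 1 MiB and 8 clusters, and each cached reference is 8 bytes. The list of cluster sizes is illustrative, and the function names are hypothetical.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30
L2_ENTRY_SIZE = 8  # assumption: 8 bytes per data cluster reference


def default_l2_cache_size(cluster_size: int) -> int:
    """Default L2 cache in bytes: 1 MiB or 8 clusters, whichever is larger."""
    return max(1 * MIB, 8 * cluster_size)


def l2_cache_coverage(cache_size: int, cluster_size: int) -> int:
    """Virtual disk bytes that a fully populated L2 cache can reference."""
    return (cache_size // L2_ENTRY_SIZE) * cluster_size


print("cluster size   default L2 cache   disk space covered")
for cluster_kib in (64, 128, 256, 512, 1024, 2048):
    cluster = cluster_kib * KIB
    cache = default_l2_cache_size(cluster)
    covered = l2_cache_coverage(cache, cluster)
    print(f"{cluster_kib:>8} KiB   {cache // MIB:>12} MiB   {covered // GIB:>14} GiB")
```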
We measured IO performance in a VM over QCOW2 virtual disks with different L2 cache sizes and cluster sizes. Performance can drop significantly if the L2 cache is not large enough. We created five test cases, using FIO to measure IOs per second and average IO latency. The default L2 cache size is used in all five tests:
- 100 GiB virtual disk with 256 KiB cluster size
- 100 GiB virtual disk with 512 KiB cluster size
- 1 TiB virtual disk with 512 KiB cluster size
- 1 TiB virtual disk with 1 MiB cluster size
- 1 TiB virtual disk with 2 MiB cluster size
From the table shown in the previous section, we can see that for a cluster size of 256 KiB, the effectiveness of the default L2 in-memory cache (2 MiB) is 64 GiB. The virtual disk size in test case 1 is 100 GiB, much larger than 64 GiB. Therefore, random read and write performance is low because the L2 cache thrashes. In QEMU, the L2 cache implementation is coarse grained: it loads or evicts one L2 cluster at a time, so thrashing is quite costly.
If the cluster size is set to 512 KiB instead, the default L2 in-memory cache at 4 MiB (8 clusters of 512 KiB each) can reference up to 256 GiB. Therefore, the result of test case 2 is over 10x faster. Similarly, the L2 cache is too small for a 1 TiB virtual disk with 512 KiB cluster size in test case 3, while it is fine if the cluster size is set to 1 MiB or 2 MiB in test cases 4 and 5.
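The reasoning for the five test cases can be checked with a small script that compares each virtual disk size with the space the default L2 cache can reference. It is a sketch that assumes 8-byte references and the default rule of 1 MiB or 8 clusters, whichever is larger. It predicts only whether the cache covers the disk, not the measured IOPS or latency.

```python
KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40
L2_ENTRY_SIZE = 8  # assumption: 8 bytes per data cluster reference

# (virtual disk size, cluster size) for test cases 1 to 5
TEST_CASES = [
    (100 * GIB, 256 * KIB),
    (100 * GIB, 512 * KIB),
    (1 * TIB, 512 * KIB),
    (1 * TIB, 1 * MIB),
    (1 * TIB, 2 * MIB),
]


def default_cache_covers_disk(disk_size: int, cluster_size: int) -> bool:
    """True if the default L2 cache can reference every cluster of the disk."""
    cache_size = max(1 * MIB, 8 * cluster_size)
    coverage = (cache_size // L2_ENTRY_SIZE) * cluster_size
    return coverage >= disk_size


results = [default_cache_covers_disk(disk, cluster) for disk, cluster in TEST_CASES]
for number, covered in enumerate(results, start=1):
    print(f"test case {number}: {'covered' if covered else 'L2 cache too small'}")
```

Test case 4 sits exactly at the boundary: 8 MiB of cache with 1 MiB clusters references 1 TiB, the full size of the disk.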
The performance of IOs over the QCOW2 file format depends on the effectiveness of the L2 in-memory cache in QEMU. The measured performance difference can be as much as 11x. To achieve good performance, we have to tune the L2 cache size, the cluster size, or both.
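When the L2 cache size itself can be set, a reasonable starting point is the smallest cache that references the whole virtual disk. The sketch below assumes 8-byte references and rounds the result up to a whole number of clusters, since the cache loads and evicts one L2 cluster at a time. The function name is hypothetical.

```python
KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40
L2_ENTRY_SIZE = 8  # assumption: 8 bytes per data cluster reference


def required_l2_cache_size(disk_size: int, cluster_size: int) -> int:
    """Smallest L2 cache, in bytes, that can reference the entire virtual disk."""
    data_clusters = -(-disk_size // cluster_size)    # ceiling division
    table_bytes = data_clusters * L2_ENTRY_SIZE      # one reference per data cluster
    l2_clusters = -(-table_bytes // cluster_size)    # round up to whole L2 clusters
    return l2_clusters * cluster_size


# Test cases 1 and 3, the two that thrash with the default cache.
for disk, cluster in ((100 * GIB, 256 * KIB), (1 * TIB, 512 * KIB)):
    size = required_l2_cache_size(disk, cluster)
    print(f"{disk // GIB} GiB disk, {cluster // KIB} KiB clusters: {size / MIB:.2f} MiB")
```

When QEMU is started directly rather than through libvirt, the qcow2 driver accepts an l2-cache-size drive option that can be set to a value computed this way.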