To evaluate and compare the performance (e.g., latency, throughput, accuracy) of computer systems, computer scientists and engineers have built numerous benchmarks, which have become the foundation of both academic research and industrial competition. To name a few: the TPC series has been widely used to evaluate SQL databases; YCSB has been used to evaluate NoSQL databases; SPEC is a popular benchmark suite for HPC systems; and recently MLPerf was created to evaluate machine-learning-based systems.
Many of these benchmarks have multiple tunable parameters, and their values can have a significant impact on the behavior and results of the benchmark. While such versatility allows a benchmark to cover different scenarios, it creates challenges for those who are not experts in the corresponding benchmark, who often face questions like the following: "I want to test my system with benchmark A. What parameter values should I use?"; "I see system A reports that it can significantly outperform system B under a specific setting. Will it show the same level of improvement under another setting?"; "I have a real application scenario and want to choose a system to support it. Candidate systems are usually evaluated with a benchmark. Which benchmark setting is closest to my real scenario, so that I can tell which system is a better fit?".
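As a concrete illustration of such tunable parameters, below is a sketch of a YCSB workload property file. The property names follow YCSB's CoreWorkload; the specific values are hypothetical and chosen only for illustration:

```
# Number of records to load and number of operations to run
# (hypothetical values)
recordcount=1000000
operationcount=3000000
workload=site.ycsb.workloads.CoreWorkload

# Operation mix: 95% reads, 5% updates
readproportion=0.95
updateproportion=0.05

# Key popularity distribution: zipfian (skewed) vs. uniform
requestdistribution=zipfian
```

Even a single parameter can matter a great deal: changing requestdistribution from zipfian to uniform, for instance, changes how often hot keys are hit, which can substantially alter cache behavior and therefore the measured throughput.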
This article provides an in-depth analysis of how the parameter values of popular benchmarks affect their behavior and results. Hopefully it will help readers understand how to choose parameter values and how to interpret results reported by others.
This article is NOT a replacement for the documentation of the corresponding benchmark; readers are strongly encouraged to read that documentation first. This article also does not answer the question "what parameter values are realistic/right to use in an evaluation?". That is a controversial question, and the answer often depends on your application scenario; it may change over time as well.