5 EVALUATION
We evaluate our ideas by comparing, in an offline setting, our simple ISVM model against a more powerful LSTM model (Section 5.2). We then compare Glider against other online models, i.e., against three of the top policies from the 2nd Cache Replacement Championship (Section 5.3), before discussing the practicality of our solution (Section 5.4).
5.1 Methodology
Simulator. We evaluate our models using the simulation framework released by the 2nd JILP Cache Replacement Championship (CRC2), which is based on ChampSim [1] and models a 4-wide out-of-order processor with an 8-stage pipeline, a 128-entry reorder buffer and a three-level cache hierarchy. Table 1 shows the parameters for our simulated memory hierarchy.
Benchmarks. To evaluate our models, we use the 33 memory-sensitive applications of SPEC CPU2006 [15], SPEC CPU2017, and GAP [3], which we define as the applications that exhibit more than 1 LLC miss per kilo instructions (MPKI > 1). We run the benchmarks using the reference input set, and as with the CRC2, we use SimPoint to generate for each benchmark a single sample of 1 billion instructions. We warm the cache for 200 million instructions and measure the behavior of the next 1 billion instructions.
Multi-Core Workloads. Our multi-core experiments simulate four benchmarks running on 4 cores, choosing 100 mixes from all possible workload mixes. For each mix, we simulate the simultaneous execution of the SimPoint samples of the constituent benchmarks until each benchmark has executed at least 250M instructions. If a benchmark finishes early, it is rewound until every other application in the mix has finished running 250M instructions. Thus, all the benchmarks in the mix run simultaneously throughout the sampled execution. Our multi-core simulation methodology is similar to that of CRC2 [1].
To evaluate performance, we report the weighted speedup normalized to LRU for each benchmark mix. This metric is commonly used to evaluate shared caches [1, 20, 27] because it measures the overall performance of the mix and avoids being dominated by high-IPC benchmarks. The metric is computed as follows. For each program sharing the cache, we compute its IPC in a shared environment (IPC_shared) and its IPC when executing in isolation on the same cache (IPC_single). We then compute the weighted IPC of the mix as the sum of IPC_shared/IPC_single over all benchmarks in the mix, and we normalize this weighted IPC to the weighted IPC under the LRU replacement policy.
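As a concrete illustration of this metric, the short sketch below computes the normalized weighted speedup for one hypothetical 4-benchmark mix; the IPC values and helper name are made up for illustration and are not measurements from our infrastructure.

```python
def weighted_speedup(ipc_shared, ipc_single):
    """Weighted IPC of a mix: sum over benchmarks of IPC_shared / IPC_single."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_single))

# Hypothetical IPC values for a 4-benchmark mix.
ipc_single     = [1.20, 0.80, 1.50, 0.60]  # each benchmark alone on the same cache
ipc_shared_new = [1.00, 0.70, 1.30, 0.50]  # running together under the evaluated policy
ipc_shared_lru = [0.95, 0.65, 1.25, 0.45]  # running together under LRU

# Normalize the weighted IPC of the evaluated policy to that of LRU.
speedup_vs_lru = weighted_speedup(ipc_shared_new, ipc_single) / \
                 weighted_speedup(ipc_shared_lru, ipc_single)
print(f"Normalized weighted speedup over LRU: {speedup_vs_lru:.3f}")
```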
Settings for Offline Evaluation. Since LSTM and SVM are typically trained offline, requiring multiple iterations through the entire dataset, we evaluate these models on traces of LLC accesses, which are generated by running applications through ChampSim. For every LLC access, the trace contains a (PC, optimal decision) tuple, where the optimal decisions are obtained by running an efficient variant of Belady's algorithm [20]. Because these models require significant training time, we run our offline learning models on 250 million instructions for a subset of single-core benchmarks; these benchmarks are summarized in Table 2. For offline evaluation, we use the first 75% of each trace for training and the last 25% for testing. The models evaluated in this section are insensitive to the split ratio as long as at least 50% of the trace is used for training. For offline evaluation, models are trained iteratively until convergence.
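For illustration, the sketch below shows how such a trace of (PC, optimal decision) tuples might be loaded and split 75%/25%; the file name and line format are assumptions for this sketch, not the actual trace layout produced by our ChampSim-based infrastructure.

```python
def load_trace(path):
    """Each line is assumed to hold 'PC decision', where decision is 1 for
    cache-friendly and 0 for cache-averse under Belady's optimal solution."""
    with open(path) as f:
        return [(int(pc, 16), int(label)) for pc, label in (line.split() for line in f)]

trace = load_trace("omnetpp_llc_accesses.txt")  # hypothetical trace file
split = int(0.75 * len(trace))                  # first 75% for training, last 25% for testing
train_set, test_set = trace[:split], trace[split:]
```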
Baseline Replacement Policies. Using the offline settings, we compare the accuracy of the attention-based LSTM and the offline ISVM models to two state-of-the-art hardware caching models, namely, Hawkeye [20] and Perceptron [52].
Hawkeye uses a statistical model that assumes that memory accesses by the same PC have the same caching behavior over a period of time. In particular, Hawkeye uses a table of counters, where each counter is associated with a PC and is incremented or decremented based on optimal decisions for that PC. Perceptron uses a linear perceptron model with a list of features including the PC of the past 3 memory accesses.
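As a point of reference, the per-PC counter scheme described above can be sketched as follows; the counter width and prediction threshold are illustrative choices for this sketch, not Hawkeye's exact parameters.

```python
from collections import defaultdict

class CounterPredictor:
    """Per-PC counters trained on Belady's optimal decisions (Hawkeye-style sketch)."""
    def __init__(self, max_val=7):
        self.counters = defaultdict(int)   # one counter per load PC
        self.max_val = max_val

    def train(self, pc, opt_friendly):
        # Increment when OPT would have cached the line, decrement otherwise.
        if opt_friendly:
            self.counters[pc] = min(self.counters[pc] + 1, self.max_val)
        else:
            self.counters[pc] = max(self.counters[pc] - 1, 0)

    def predict(self, pc):
        # Predict cache-friendly when the counter is in its upper half.
        return self.counters[pc] > self.max_val // 2
```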
The Perceptron model respects the order of these PCs, but we find that longer PC histories are less effective when order is preserved than when it is ignored. For a fair comparison that uses the PC history as the feature, we implement an SVM baseline with the same hinge loss that uses the PCs of the past 3 memory accesses, respecting their order, and that learns from Belady's optimal solution.8
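The following sketch shows such a baseline trained with a hinge loss on ordered PC-history features; the hashed per-position weight tables and the step size are our own illustrative assumptions, not the exact configuration of the evaluated implementation.

```python
import numpy as np

class OrderedHistorySVM:
    """Linear SVM with hinge loss over the ordered PCs of the past 3 memory accesses."""
    def __init__(self, history_len=3, table_size=4096, lr=0.01):
        # One weight table per history position, so the same PC receives a different
        # weight depending on where it appears in the (ordered) history.
        self.weights = np.zeros((history_len, table_size))
        self.table_size = table_size
        self.lr = lr

    def _indices(self, pc_history):
        return [hash(pc) % self.table_size for pc in pc_history]

    def predict(self, pc_history):
        idxs = self._indices(pc_history)
        score = sum(self.weights[pos, i] for pos, i in enumerate(idxs))
        return score >= 0   # True => predicted cache-friendly

    def train(self, pc_history, opt_friendly):
        y = 1.0 if opt_friendly else -1.0
        idxs = self._indices(pc_history)
        score = sum(self.weights[pos, i] for pos, i in enumerate(idxs))
        if y * score < 1.0:  # hinge loss: update only when the margin is violated
            for pos, i in enumerate(idxs):
                self.weights[pos, i] += self.lr * y
```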
To evaluate Glider as a practical replacement policy, we compare Glider against Hawkeye [20], SHiP++ [56], and MPPPB [27], which are the first-, second-, and fourth-place finishers in the most recent Cache Replacement Championship (CRC2) [1]. For all techniques, we use the code made publicly available by CRC2. For single-thread benchmarks, we also simulate Belady's optimal replacement policy (MIN) [4].
5.2 Comparison of Offline Models
Figure 9 compares the accuracy of our models when trained offline. We see that (1) our attention-based LSTM improves accuracy by 10.4% over the Hawkeye baseline and (2) with a 9.1% accuracy improvement over Hawkeye, our offline ISVM comes close to the performance of LSTM. These results confirm our insight that we can approximate the powerful attention-based LSTM with a simpler hardware-friendly predictor.
5.3 Comparison of Online Models
We now compare the accuracy and speedup of our practical models when trained online as the program executes, i.e., we compare Glider against Hawkeye, SHiP++, and MPPPB.
Online training accuracy. Figure 10 shows that Glider is more accurate than state-of-the-art online models, including Hawkeye (88.8% vs. 84.9%). On the subset of benchmarks used for training the offline models, the accuracy improves from 73.5% to 82.4%, which is similar to the offline improvements from 72.2% to 81.2%. Thus, Glider is as effective as the offline attention-based LSTM model, and insights from offline training carry over to online predictors.
Single-Core Performance. Figure 11 shows that Glider significantly reduces the LLC miss rate in comparison with the three state-of-the-art replacement policies. In particular, Glider achieves an average miss reduction of 8.9% on the 33 memory-intensive benchmarks, while Hawkeye, MPPPB, and SHiP++ see miss reductions of 7.1%, 6.5%, and 7.5%, respectively. Figure 12 shows that Glider achieves a speedup of 8.1% over LRU. By contrast, Hawkeye, MPPPB, and SHiP++ improve performance over LRU by 5.9%, 7.6%, and 7.1%, respectively. These improvements indicate that even though our insights were derived from an offline attention-based LSTM model, they carry over to the design of practical online cache replacement policies.
Multi-Core Performance. Figure 13 shows that Glider performs well on a 4-core system as it improves performance by 14.7%, compared with the 13.6%, 11.4%, and 13.2% improvements for Hawkeye, SHiP++, and MPPPB, respectively, indicating that our features and insights are applicable to both private and shared caches.
Effective Sequence Length. Figure 14 shows the relationship between history length and offline accuracy, where the sequence length for the attention-based LSTM ranges from 10 to 100, and the number of unique PCs (k value) for offline ISVM and the number of PCs for Perceptron range from 1 to 10. We make three observations. First, the LSTM benefits from a history of 30 PCs, which is significantly larger than the history length of 3 considered by previous solutions [52]. Second, the offline ISVM with only 6 unique PCs approaches the accuracy of the attention-based LSTM; thus, the k-sparse feature representation used by ISVM effectively captures a long history with fewer elements, and this representation works well even with a linear model. Third, the accuracy curve of the perceptron, which uses an ordered PC history with repetition, does not scale as well as our ISVM, and it saturates at a history length of 4, indicating that the linear model does not work well with an ordered history representation.
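To make the contrast between the two representations concrete, the sketch below collapses an ordered PC history with repetition into the unordered set of its k most recently seen unique PCs; this is an illustrative construction of the k-sparse feature, not the exact hardware mechanism used by Glider.

```python
def k_unique_pcs(pc_history, k):
    """Collapse an ordered PC history (most recent first) into the unordered set
    of its k most recently seen unique PCs (the k-sparse representation)."""
    unique = []
    for pc in pc_history:
        if pc not in unique:
            unique.append(pc)
        if len(unique) == k:
            break
    return frozenset(unique)

# A long ordered history with repetition collapses to a handful of unique PCs,
# independent of their order and of how often each one repeats.
history = [0xA, 0xB, 0xA, 0xC, 0xA, 0xB, 0xD, 0xE]
print(sorted(k_unique_pcs(history, k=4)))   # [10, 11, 12, 13], i.e., PCs 0xA-0xD
```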
5.4 Practicality of Glider vs. LSTM
We now compare the practicality of the attention-based LSTM with Glider along two dimensions: (1) hardware budget and (2) training overhead.
Hardware Budget of Glider vs. LSTM. In Glider, we replace the predictor module of Hawkeye with ISVMs, keeping the other modules the same as in Hawkeye. For a 16-way 2MB LLC, Hawkeye's budgets for per-line replacement state, the sampler, and OPTgen are 12KB, 12.7KB, and 4KB, respectively. The main overhead of Glider is the predictor, which replaces Hawkeye's per-PC counters with ISVMs. For each ISVM, we track 16 weights, and each weight is 8 bits wide, so each ISVM consumes 16 bytes. Since we track 2048 PCs, Glider's predictor consumes a total of 32.8KB. The PCHR, which holds the history of the past 5 accesses, requires only 0.1KB. Thus, Glider's total hardware budget is 61.6KB. Note that the attention-based LSTM model is at least 3 orders of magnitude more expensive in terms of both storage and computational cost (see Table 3).
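As a sanity check on these figures, the following tally reproduces the arithmetic; it is a purely illustrative sketch using the KB values quoted above, with 1KB taken as 1000 bytes to match the reported 32.8KB.

```python
# Glider's storage budget (KB), reusing Hawkeye's structures except the predictor.
replacement_state = 12.0                 # per-line replacement state, 16-way 2MB LLC
sampler           = 12.7
optgen            = 4.0
isvm_predictor    = 2048 * 16 / 1000.0   # 2048 tracked PCs x 16 one-byte weights ~= 32.8KB
pchr              = 0.1                  # history of the past 5 accesses

total = replacement_state + sampler + optgen + isvm_predictor + pchr
print(f"Glider total: {total:.1f} KB")   # ~61.6KB
```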
Since the Glider predictor requires only two table lookups to perform both training and prediction, its latency can be easily hidden by the latency of accessing the last-level cache.
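To make the lookup-sum-threshold structure concrete, the following is a minimal sketch of a Glider-style per-PC ISVM predictor. The indexing scheme, the update threshold, and the weight saturation are illustrative assumptions for this sketch rather than Glider's exact hardware design, which is described earlier in the paper.

```python
import numpy as np

class GliderStyleISVM:
    """One small integer SVM per load PC; the features are the PCs in the PCHR."""
    def __init__(self, num_isvms=2048, weights_per_isvm=16, threshold=30):
        # 2048 ISVMs x 16 eight-bit weights, matching the budget above.
        self.weights = np.zeros((num_isvms, weights_per_isvm), dtype=np.int8)
        self.threshold = threshold   # illustrative update threshold

    def _row_cols(self, pc, pchr):
        row = hash(pc) % self.weights.shape[0]                   # select the ISVM for this PC
        cols = [hash(p) % self.weights.shape[1] for p in pchr]   # select weights via the PCHR
        return row, cols

    def predict(self, pc, pchr):
        row, cols = self._row_cols(pc, pchr)
        return int(self.weights[row, cols].sum()) >= 0   # non-negative sum => cache-friendly

    def train(self, pc, pchr, opt_friendly):
        row, cols = self._row_cols(pc, pchr)
        y = 1 if opt_friendly else -1
        margin = y * int(self.weights[row, cols].sum())
        # Integer analog of a hinge-loss update: step the selected weights by +/-1
        # (fixed step size of 1) only while the margin is below the threshold.
        if margin < self.threshold:
            for c in cols:
                w = int(self.weights[row, c]) + y
                self.weights[row, c] = max(-128, min(127, w))
```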
Convergence Rate of Glider vs. LSTM. As discussed, deep learning models such as LSTM typically need to train for multiple iterations to converge. For caching, training over multiple iterations would imply that a trace of LLC accesses would need to be stored across training iterations, which is infeasibly expensive. Instead, for cache replacement, we need the machine learning model to train in an online manner, that is, by making a single pass over the input data. Figure 15 shows that with offline training, our offline ISVM achieves good accuracy in one iteration, while the LSTM takes 10-15 iterations to converge. We also see that online models, such as Perceptron and Hawkeye, converge quickly but have limited accuracy.
On the Practicality of Deep Learning for Caching. The main barriers to the use of deep learning models for hardware prediction are model size, computational cost, and offline training. The model size of our attention-based LSTM is at least 1 megabyte, which significantly exceeds the hardware budget for hardware caches. In addition, LSTMs typically require floating-point operations, while models such as Perceptron and ISVM use only integer operations. Fortunately, recent studies [13, 18] have shown great potential for reducing model size and computational cost by 30× to 50× through model compression techniques, such as quantization, pruning, and integerization/binarization. However, these models need to be pre-trained offline before being compressed and deployed, which is difficult for hardware prediction problems where program behavior varies from benchmark to benchmark and even from one input to another input of the same benchmark. Given their tendency to underfit (poor accuracy in the first 10 iterations), as shown in Figure 15, it is clear that even with further compression techniques, deep learning models are not yet ready for direct use in hardware predictors.
5.5 Learning High-Level Program Semantics
Our attention-based LSTM model is able to learn high-level program semantics to better predict the optimal caching solution. For example, for the omnetpp benchmark that simulates network protocols such as HTTP, the model discovers that certain types of network messages tend to be cache-friendly, while other types of messages tend to be cache-averse. Furthermore, the model discovers this relationship by distinguishing the different control-flow paths for different types of messages.
More specifically, consider the scheduleAt() method, which is frequently called inside omnetpp to schedule incoming messages at a given time t. The scheduleAt() method takes as an argument a message pointer, and it dereferences this pointer, resulting in a memory access to the object residing at the pointer location (see Figure 17). Table 4 shows the model's accuracy for four target load instructions (PCs) that access this object. We see that (1) the attention-based LSTM model significantly improves accuracy for all four target PCs, and (2) all four target PCs share the same anchor PC (the source PC with the highest attention weight).
To understand the accuracy improvement for the target PCs in the scheduleAt() method, consider Figure 16, which shows that scheduleAt() is invoked from various locations in the source code, with each invocation passing it a different message pointer. We find that the anchor PC belongs to one of these calling methods, scheduleEndIFGPeriod(), implying that load instructions in the scheduleAt() method tend to be cache-friendly when scheduleAt() is called from scheduleEndIFGPeriod() with the endIFGMsg pointer, whereas they tend to be cache-averse when scheduleAt() is called from other methods with other message pointers. Thus, by correlating the control-flow histories of load instructions in the scheduleAt() method, our model has discovered that the endIFGMsg object has better cache locality than the endJamSignal and endTxMsg objects.
5.6 Model Specifications
The hyper-parameters for the attention-based LSTM model and Glider are given in Table 5. Here we explain how we identify the key hyper-parameters, namely, the sequence length for the attention-based LSTM model and the number of unique PCs (k) for Glider and Perceptron. For the offline ISVM, we consider step sizes n from 0.0001 to 1 in multiples of 5 (0.0001, 0.0005, 0.001, 0.005, ...), and for the corresponding Glider model we use an update threshold of 1/n with a fixed step size of 1. To avoid floating-point operations, no decay is used.
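For concreteness, the step-size grid and the corresponding Glider update thresholds can be written out as follows; this is simply an illustrative enumeration of the values described above.

```python
# Step sizes n considered for the offline ISVM, and the corresponding integer
# update thresholds 1/n used by Glider (fixed step size of 1, no decay), so that
# online prediction and training need only integer arithmetic.
step_sizes = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
update_thresholds = [round(1 / n) for n in step_sizes]
print(update_thresholds)   # [10000, 2000, 1000, 200, 100, 20, 10, 2, 1]
```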