We then use an SVM with the k-sparse binary feature. Since integer operations are much cheaper in hardware than floating point operations, we use an Integer SVM (ISVM) with an integer margin and a learning rate of 1. While several variations exist, we use the hinge loss as our objective function for optimization, which is defined as

L(w) = max(0, 1 − y · wᵀx)    (4)

where w is the weight vector of the SVM, x is the binary feature vector, and y ∈ {±1} is the expected output.
Fact 1. With binary features, the use of gradient descent with learning rate γ = 1/n for an integer n is equivalent to optimizing the following objective function with learning rate 1:

L(w) = max(0, n − y · wᵀx)    (5)
Suppose we are optimizing (4) with initial weight vector w(0), and learning rate 1/n produces the trace w(0), w(1), . . . , w(m), . . .. Then optimizing (5) with initial weight vector n·w(0) and learning rate 1 produces n·w(0), n·w(1), . . . , n·w(m), . . ., which means that the two give the same prediction on any training sample. Therefore, by setting the learning rate to 1, weight updates remain integral and we avoid floating point operations. Thus, the ISVM trained in an online manner is equivalent to a perceptron [52] that uses a threshold to avoid over-training.
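Concretely, with binary features the hinge-loss subgradient step takes a simple form, which makes the equivalence easy to verify update by update (this restates the argument above rather than adding a new claim): optimizing (4) takes the step w(m+1) = w(m) + (1/n)·y·x whenever y·(w(m))ᵀx < 1 and leaves w unchanged otherwise, while optimizing (5) takes the step W(m+1) = W(m) + y·x whenever y·(W(m))ᵀx < n. If W(m) = n·w(m), the two update conditions coincide, and the updates preserve W(m+1) = n·w(m+1); since scaling the weights by n > 0 does not change the sign of the decision function, both models make the same predictions.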
The ISVM is more amenable to hardware implementation than a vanilla SVM, and because it is a simpler model, it is likely to converge faster than an LSTM model and to achieve good performance in an online manner. In the following experiments, we use k = 5. Thus, our Glider solution consists of the ISVM model and the k-sparse feature.
4.4 Hardware Design
Figure 8 shows the hardware implementation of Glider's predictor, which has two main components: (1) a PC History Register (PCHR) and (2) an ISVM Table. The PCHR maintains an unordered list of the last 5 PCs seen by each core; we model the PCHR as a small LRU cache that tracks the 5 most recent PCs. The ISVM Table tracks the weights of each PC's ISVM; we model it as a direct-mapped cache that is indexed by a hash of the current PC and that returns the ISVM's weights for that PC.
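To make the organization concrete, the following C++ sketch models the two structures in software, roughly as a simulator might; the structure names, the table size, the hash, and the weight width are illustrative assumptions rather than the paper's hardware parameters.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>

// PC History Register (PCHR): an unordered list of the last 5 PCs seen by a
// core, modeled as a small LRU structure.
struct PCHR {
    std::deque<uint64_t> pcs;                   // most recently seen PC at the front
    void access(uint64_t pc) {
        auto it = std::find(pcs.begin(), pcs.end(), pc);
        if (it != pcs.end()) pcs.erase(it);     // already tracked: refresh its position
        pcs.push_front(pc);
        if (pcs.size() > 5) pcs.pop_back();     // keep only the 5 most recent PCs
    }
};

// One ISVM per PC: 16 weights, indexed by a 4-bit hash of the PCs in the PCHR.
// The 16-bit weight width is an assumption; Section 5.4 gives the paper's
// storage analysis.
struct ISVMEntry {
    int16_t weights[16] = {};
};

// ISVM Table: modeled as a direct-mapped structure indexed by a hash of the
// current PC. The entry count and the index hash are illustrative choices.
struct ISVMTable {
    static constexpr unsigned kEntries = 2048;
    ISVMEntry entries[kEntries];
    ISVMEntry& lookup(uint64_t pc) { return entries[(pc >> 2) % kEntries]; }
};
```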
Each PC's ISVM consists of 16 weights for different possible PCs in the history register. To find the 5 weights corresponding to the current contents of the PCHR, we create a 4-bit hash for each element in the PCHR (creating 5 indices in the range 0 to 15), and we retrieve the 5 weights at the corresponding indices. For example, in Figure 8, the PCHR contains PC0, PC2, PC5, PC9, and PC15, and we retrieve weight0, weight2, weight5 (not shown), weight9 (not shown), and weight15 for both training and prediction. A detailed storage and latency analysis is presented in Section 5.4.
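The 5-of-16 weight selection can be sketched as below, continuing the structures from the previous sketch; hash4 is a placeholder 4-bit hash, since the exact hash function is not specified here.

```cpp
#include <vector>

// Map a PC to a 4-bit index in [0, 15]; the specific hash is an assumption.
inline unsigned hash4(uint64_t pc) {
    return static_cast<unsigned>((pc ^ (pc >> 4) ^ (pc >> 8)) & 0xF);
}

// Retrieve the 5 weights of the current PC's ISVM that correspond to the
// 5 PCs currently held in the PCHR.
std::vector<int> selected_weights(const ISVMEntry& isvm, const PCHR& hist) {
    std::vector<int> w;
    for (uint64_t pc : hist.pcs)
        w.push_back(isvm.weights[hash4(pc)]);
    return w;
}
```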
We now discuss the operations of Glider's predictor in more detail. For details on other aspects of the replacement policy, including insertion and eviction, we refer the reader to the Hawkeye policy [20].
Training. Glider is trained based on the behavior of a few sampled sets [20, 40]. On access to a sampled set, Glider retrieves the weights corresponding to the current PC and the PCHR. The weights are incremented by 1 if OPTgen determines that the line should have been cached; they are decremented by 1 otherwise. In keeping with the perceptron update rule [27, 52], the weights are not updated if their sum is above a certain threshold. To find a good threshold, Glider's predictor dynamically selects among a fixed set of thresholds (0, 30, 100, 300, and 3000). While this adaptive threshold provides some benefit for single-core workloads, performance is largely insensitive to the choice of threshold for multi-core workloads.
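A software sketch of this training step, continuing the structures above: OPTgen's decision is passed in as a boolean, the threshold is assumed to be the currently selected value from the set {0, 30, 100, 300, 3000}, and the direction-aware skip test is our reading of the perceptron-style rule rather than a detail given in the text.

```cpp
// Train the current PC's ISVM on an access to a sampled set.
//   opt_caches: true if OPTgen determined the line should have been cached.
//   threshold:  the currently selected training threshold.
void train(ISVMEntry& isvm, const PCHR& hist, bool opt_caches, int threshold) {
    int sum = 0;
    for (uint64_t pc : hist.pcs)
        sum += isvm.weights[hash4(pc)];

    // Perceptron-style rule: once the summed weights are already confidently
    // on the labeled side of the threshold, skip the update.
    if ((opt_caches && sum >= threshold) || (!opt_caches && sum <= -threshold))
        return;

    // Otherwise nudge each selected weight toward the OPTgen label by +/-1.
    int delta = opt_caches ? +1 : -1;
    for (uint64_t pc : hist.pcs)
        isvm.weights[hash4(pc)] += delta;
}
```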
Prediction. To make a prediction, the weights corresponding to the current PC and the PCHR are summed. If the sum is greater than or equal to a threshold (60 in our simulations), we predict that the line is cache-friendly and insert it with high priority (RRPV=0). If the sum is less than 0, we predict that the line is cache-averse and insert it with low priority (RRPV=7). For the remaining cases (sum between 0 and 60), we predict that the line is cache-friendly with low confidence and insert it with medium priority (RRPV=2).
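The prediction step maps the summed weights to an insertion priority; a sketch, again continuing the structures above, with the 0 and 60 cut points taken from the text.

```cpp
// Predict an insertion priority (RRPV) for the line being inserted.
int predict_rrpv(const ISVMEntry& isvm, const PCHR& hist) {
    int sum = 0;
    for (uint64_t pc : hist.pcs)
        sum += isvm.weights[hash4(pc)];

    if (sum >= 60) return 0;   // cache-friendly: insert with high priority
    if (sum < 0)   return 7;   // cache-averse: insert with low priority
    return 2;                  // low-confidence cache-friendly: medium priority
}
```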