Article

LSTM-CRP: Algorithm-Hardware Co-Design and Implementation of Cache Replacement Policy Using Long Short-Term Memory

School of Microelectronics, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2024, 8(10), 140; https://doi.org/10.3390/bdcc8100140
Submission received: 27 August 2024 / Revised: 27 September 2024 / Accepted: 10 October 2024 / Published: 21 October 2024

Abstract

As deep learning has produced dramatic breakthroughs in many areas, it has motivated emerging studies on combining neural networks with cache replacement algorithms. However, deep learning is a poor fit for performing cache replacement directly in hardware, because its neural network models are impractically large and slow. Many studies have tried to use the guidance of the Belady algorithm to speed up cache replacement prediction, but accurately predicting the characteristics of future access addresses remains impractical, which introduces inaccuracy in discriminating complex access patterns. Therefore, this paper presents the LSTM-CRP algorithm, together with its efficient hardware implementation, which employs long short-term memory (LSTM) for access pattern identification at run-time to guide the cache replacement algorithm. LSTM-CRP first converts the address into a novel key according to the access frequency of the address and a virtual capacity of the cache; the key has the advantages of low information redundancy and high timeliness. Using the key as the input of four offline-trained LSTM network-based predictors, LSTM-CRP can accurately classify different access patterns and identify the current cache characteristics in a timely manner via an online set dueling mechanism on sampling caches. For efficient implementation, heterogeneous lightweight LSTM networks are dedicatedly constructed in LSTM-CRP to lower the hardware overhead and inference delay. The experimental results show that LSTM-CRP improves the cache hit rate by an average of 20.10%, 15.35%, 12.11% and 8.49% compared with LRU, RRIP, Hawkeye and Glider, respectively. Implemented on a Xilinx XCVU9P FPGA at the cost of 15,973 LUTs and 1610 flip-flop registers, LSTM-CRP runs at 200 MHz with a power consumption of 2.74 W.

1. Introduction

In current computing architectures, on-chip caching is widely employed as a very effective approach to mitigating the widening gap between the ever-increasing performance of central processing units (CPUs) and the limited speed of off-chip memory access [1,2]. To achieve higher cache utilization and greater speed compensation, issues such as cache replacement [3,4,5,6,7,8,9,10,11,12,13] and data prefetching [14,15,16,17,18,19,20,21] have been heavily studied for many years. A well-designed cache replacement policy can further reduce cache misses, thereby improving cache performance. Cache replacement algorithms arrange data placement in the cache according to the principle of program locality [22], whereby a program tends to access recently used data frequently at run-time, thereby improving the cache hit rate. Traditional cache replacement algorithms [3,4,5,6] focus on common access patterns and use frequency counters or re-reference distance predictors to record pattern information. They are good at processing simple access patterns but perform poorly on complicated access patterns because the counters keep only limited history information. Moreover, they do not classify access patterns at all and treat every pattern with the same replacement mechanism.
As artificial neural networks (ANNs) have risen in recent years, deep learning has produced dramatic breakthroughs in many fields, such as semantic segmentation [23], speech recognition [24] and natural language processing [25]. Naturally, some studies have shown the possibility of innovation by combining neural networks with cache replacement algorithms. Some recently emerging replacement algorithms [7,8,9,10,11,12] take classification into consideration and are called intelligent cache replacement algorithms in this paper. They typically run learning models on historical access information to judge the characteristic of the current access address (whether it is cache-friendly or not) and then arrange the eviction priority accordingly. As a result, they can improve the cache hit rate compared with traditional replacement algorithms. However, the reliability of the characteristic judgement is low because of the limited history information, and they cannot classify different access patterns either. These multi-layer learning models are slow, typically taking several milliseconds to produce a prediction, and they also result in excessive hardware overhead. Consequently, most of these learning models stay at the level of software simulation and are impractical for hardware implementation.
The essence of a cache replacement algorithm is to study the characteristic of the access pattern and assign an eviction priority to each block in the cache according to that characteristic. The key to improving cache replacement algorithms is to detect the exact access pattern together with its timing information. Generally, a recurrent neural network (RNN), such as the widely used long short-term memory (LSTM) [26], has strong nonlinear mapping ability and is good at dealing with various time series problems. Therefore, this paper proposes an LSTM-based Cache Replacement Policy (named LSTM-CRP), which employs LSTM for access pattern identification at run-time to guide the cache replacement algorithm. The novelty of the proposed LSTM-CRP is to classify the current access pattern and use different predictors to judge the characteristic of the current address accordingly, which improves the reliability of the characteristic judgement, thereby keeping the data that will be accessed in the cache and ultimately improving the cache hit rate. Furthermore, low hardware overhead and low inference latency are also of great significance for a practical cache replacement algorithm. Aiming at fast and low-cost implementation, we present heterogeneous lightweight LSTM networks in LSTM-CRP to reduce the hardware overhead as well as speed up the identification of the current access pattern. Overall, the contributions of this paper are as follows:
  • Rather than directly using the address or program counter (PC) for access characteristic judgement, a novel one-bit key is generated in LSTM-CRP. The key has the advantages of information integrity, hardware friendliness and high efficiency. Compared with a direct address input, the proposed key reflects the occupancy information of the cache and improves the training speed of the LSTM networks. Compared with the Belady vector used in [8], the key performs better in terms of timeliness. The key also saves storage resources, because each key occupies only one bit and the whole cache needs only one key queue.
  • Aiming at accurately detecting different access patterns, heterogeneous predictors are designed for LSTM-CRP, each of which is constructed from an offline-trained lightweight LSTM network and dedicated to a certain access pattern. The different predictors evaluate the cache hits of the current accesses under different access patterns, and LSTM-CRP then dynamically picks the predictor with the most hits via a set dueling monitor. The heterogeneous predictors save hardware resources as well as improve the judgement speed.
  • Two kinds of input generators (i.e., the ergodic input generator and the sampling input generator) are designed for the sampling caches in the predictors and the main cache, respectively. This combination significantly saves hardware resources at the cost of a slight accuracy degradation.
  • A highly parallel LSTM network structure is presented, which employs three sigmoid functions, two tanh functions and three multipliers. The LSTM network module outputs the characteristic of the current access address every cycle without a pipeline bubble, thereby substantially improving the speed of judgement.
The remainder of this paper is organized as follows. Section 2 summarizes related works. Section 3 explains the preliminary background. Section 4 provides the motivations. Section 5 describes the proposed LSTM-CRP algorithm. Section 6 presents the hardware implementation of LSTM-CRP. Section 7 details the experimental methodology and presents the experimental results. Section 8 concludes this work.

2. Related Works

Previous research has produced numerous studies on cache replacement algorithms for improving cache hits. Most traditional works on cache replacement focus on capturing the re-reference distance of an address or its access frequency in short-term history, and the replacement priority of a cache block is updated according to the re-reference distance or access frequency. Such an update mechanism is simple and convenient for low-cost hardware implementation. Works [3,4,5,6] are traditional cache replacement algorithms, which focus on common access patterns and use a frequency counter, a re-reference distance predictor or other modules to manage the cache heuristically. In work [4], an improved algorithm, RRIP (Re-Reference Interval Prediction), is proposed on the basis of the LRU algorithm. RRIP predicts the re-reference distance of data blocks, evicts newly inserted data blocks that receive no subsequent access as early as possible and thereby allows data blocks with a larger re-reference distance to hit. Work [6] focuses on evicting data blocks that the program no longer accesses (dead blocks). It designs a lifetime-based predictor to count the lifetime of a data block in the cache; if the data block is not accessed within twice its lifetime, it is regarded as a dead block. In work [5], several cache replacement strategies are integrated and dynamically switched according to the performance of each algorithm, which helps adapt to different access patterns. In addition, it uses dynamic set sampling to reduce the extra resource consumption brought by integrating multiple cache strategies.
In recent years, the potential of deep learning has motivated studies on emerging cache replacement algorithms [7,8,9,10,11,12], and some simple machine learning algorithms have gradually been applied to cache replacement. These intelligent cache replacement algorithms typically transform the cache replacement problem into a binary classification problem: they predict the cache characteristic of the access address, that is, whether it should be cached (cache-friendly) or not (non-cache-friendly), and they can also adjust the replacement strategy according to the running program. In work [7], the impact of cache prefetching is also considered in the design of the cache replacement policy, where prefetch addresses and non-prefetch addresses are distinguished. Compared with a non-prefetch address, the delay of accessing main memory is lower when a prefetch address misses; therefore, under the same conditions, it tends to replace prefetch-address data and retain non-prefetch-address data. Work [8] (Hawkeye) reconstructs the optimal Belady algorithm [9] over the historical access pattern and then predicts the cache characteristic of the same address in the future according to the Belady decision. Its predictor is a counter: the MSB of the counter represents the cache characteristic, while the counter is incremented or decremented depending on the cache behavior decided by a module called OPTgen. Work [10] (Glider) uses the same architecture, but the predictor is an integer support vector machine (ISVM); it trains the ISVM model online and predicts the cache characteristic of the address in real time. There are also works [11,12] that use a perceptron with the program counter (PC) as the input for online or offline training to infer the cache characteristic of the current address. CHROME [27] dynamically adapts cache decisions based on multiple program features and, by leveraging online reinforcement learning, applies a reward to each decision that considers the accuracy of the action and system-level feedback information. In addition, another work [28] proposes a cloud cache replacement framework that automatically learns the relationship between the probability distribution of different replacement policies and the workload distribution by using deep reinforcement learning. Work [29] uses reinforcement learning to learn a cache replacement policy and successfully derives a new policy, Reinforcement Learned Replacement (RLR).

3. Background

3.1. Belady Algorithm

The Belady algorithm [9] is the theoretical basis of intelligent cache replacement algorithms. However, the Belady algorithm is optimal but not realizable, because it needs information about future accesses. It obtains the reuse distance of each access address from such future information and then derives the optimal storage scheme from this analysis. It stores a data block that will be accessed again in the future, releases it after the hit and evicts a data block that will not be accessed again in the future, even if that data block has just missed. Figure 1 shows an execution example of the Belady algorithm under a simple access sequence (A, B, B, C, D, E, A, F, D, E, F, C). Assuming a two-way cache, the re-reference distance of each address is represented by a solid or dotted line. At the fourth access, the cache accesses C. At this time, A and B are stored in the cache; B has just been accessed and has a low priority value, so a conventional policy would replace A and cache C. However, the Belady algorithm knows that the re-reference distance of A is less than that of C and that A will be accessed again soon, so it does not store C. Similarly, when E is accessed at the sixth access, because addresses A and D in the cache will be reused earlier than address E, E is not stored. According to this rule, the solid lines represent cache-friendly addresses, while the dotted lines represent non-cache-friendly addresses. It can be seen that the Belady algorithm can adapt to access patterns with any re-reference distance. Referring to work [8], the performance of the Belady algorithm is better than or equal to that of other algorithms under access patterns with long, medium or short re-reference distances.
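To make the decision rule concrete, the following Python sketch applies the farthest-future-use principle described above (evicting, or bypassing, whichever candidate block is reused latest). It is only an illustration of the principle; the function names are ours, and tie-breaking or bypass details may differ from the exact trace drawn in Figure 1.

```python
# Simplified sketch of the Belady (OPT) decision rule on a fully associative cache.
def next_use(seq, start, addr):
    """Index of the next access to addr after position start (inf if never reused)."""
    for j in range(start + 1, len(seq)):
        if seq[j] == addr:
            return j
    return float("inf")

def belady(seq, ways):
    cache, hits = set(), 0
    for i, addr in enumerate(seq):
        if addr in cache:
            hits += 1
            continue
        if len(cache) < ways:
            cache.add(addr)
            continue
        # Among the resident blocks and the incoming one, the block reused farthest
        # in the future is sacrificed; if that is the incoming block, it is simply
        # not cached (bypass).
        victim = max(cache | {addr}, key=lambda a: next_use(seq, i, a))
        if victim != addr:
            cache.discard(victim)
            cache.add(addr)
    return hits

seq = list("ABBCDEAFDEFC")        # the access sequence used in Figure 1
print(belady(seq, ways=2))       # number of hits under the optimal rule
```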

3.2. Long Short-Term Memory Network

LSTM [26] is good at dealing with time series problems, such as speech recognition, language modeling and machine translation. The main part of an LSTM network is the LSTM layer, whose basic architecture is shown in Figure 2. It uses different gates to selectively retain relevant information and discard irrelevant information, achieving weak sensitivity to the length of the time step. A typical LSTM layer contains a memory unit G, an input gate I, an output gate O and a forget gate F, as well as state information. Each gate is computed according to Equation (1). A typical LSTM network consists of one or more LSTM layers, SoftMax layers, classification layers and so on.
$$
\begin{aligned}
f_t &= \sigma\left(W_{fx} x_t + W_{fh} h_{t-1} + b_f\right)\\
i_t &= \sigma\left(W_{ix} x_t + W_{ih} h_{t-1} + b_i\right)\\
g_t &= \tanh\left(W_{gx} x_t + W_{gh} h_{t-1} + b_g\right)\\
o_t &= \sigma\left(W_{ox} x_t + W_{oh} h_{t-1} + b_o\right)\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t\\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
\tag{1}
$$
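For readers less familiar with Equation (1), the following NumPy sketch computes one LSTM time step. The dictionary-based weight layout and shapes are illustrative assumptions, not the organization used later in the hardware design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Equation (1); W["*x"] is (Nh, Nx), W["*h"] is (Nh, Nh)."""
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])   # forget gate
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])   # input gate
    g_t = np.tanh(W["gx"] @ x_t + W["gh"] @ h_prev + b["g"])   # candidate memory
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])   # output gate
    c_t = f_t * c_prev + i_t * g_t                             # cell state update
    h_t = o_t * np.tanh(c_t)                                   # hidden state output
    return h_t, c_t
```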

4. Motivation

First, comparisons of typical cache replacement algorithms are briefly listed in Table 1. The traditional replacement algorithms, such as LRU and RRIP, have a simple mechanism and are good at processing simple access patterns. The intelligent replacement algorithms (e.g., Hawkeye and Glider) have the ability to judge the characteristic of the access address, which improves the accuracy of arranging the eviction priority of cache blocks.
However, both traditional and intelligent cache replacement algorithms lack the ability to accurately discriminate between different access patterns. Traditional cache replacement algorithms [3,4,5,6] perform no classification of the access pattern at all. Most of them are variations of the LRU and MRU algorithms or a combination of both; they focus on simple access patterns and use a counter with a simple update mechanism to record the re-reference distance of a cache block. They assign a fixed eviction priority value to newly inserted blocks (e.g., LRU, SRRIP [4]) or assign different eviction priority values with fixed probability to newly inserted blocks (e.g., BIP [5]). However, the re-reference distance of a cache block differs greatly across access patterns. Because traditional replacement algorithms cannot distinguish different access patterns, they cannot assign different eviction priority values according to the access pattern, and the assigned eviction priority is thus not accurate enough; consequently, they perform poorly on complex access patterns. In contrast, intelligent cache replacement algorithms [8,10] perform some simple classification of access addresses and can therefore judge the characteristic of the current access address. However, this judgement is less accurate because they still cannot distinguish different access patterns.
Second, it is impractical to accurately predict future access addresses, so the optimal Belady algorithm cannot be implemented in a practical cache design. As a result, some works [7,8,10] tried to use the Belady algorithm to guide the design of the cache replacement algorithm. The typical architecture of these intelligent cache replacement algorithms is shown in Figure 3. They first predict the characteristic of the current address and then arrange the eviction priority of the newly inserted block accordingly. The key part of these replacement algorithms is the classification algorithm, which mainly consists of two parts (i.e., OPTgen and the prediction algorithm in Figure 3). The classification algorithm uses the OPTgen module to imitate the Belady algorithm: OPTgen records a period of historical cache accesses and then judges whether the current access results in a cache hit or miss according to the simulated Belady algorithm. However, OPTgen only judges the cache hit/miss within certain historical access patterns. Based on the historical accesses, OPTgen cooperates with the prediction algorithm to provide a statistical prediction of the cache characteristic for the cache control algorithm. The prediction algorithm generally contains accumulators and comparators, where the accumulators increment or decrement according to the output of OPTgen and are then compared with a threshold by the comparators to judge whether the current cache address is cache-friendly or not. The cache control algorithm assigns a low eviction priority to a cache-friendly address, which means intelligent cache replacement algorithms are inclined to evict non-cache-friendly cache blocks. Because the occupancy value used in OPTgen cannot be determined until the next access to the same block happens, the timeliness of these intelligent cache replacement algorithms is deficient. In order to achieve better timeliness as well as better accuracy than the existing statistical judgement, the classification algorithm designed in the proposed LSTM-CRP does not imitate the Belady algorithm as existing intelligent algorithms do. Instead, an offline-trained LSTM network, together with the key generated by the input generator as its input, can accurately approximate the performance of the Belady algorithm and judge the cache characteristic in real time.
Third, to the best of our knowledge, complex machine learning algorithms, such as the LSTM network, are rarely used in cache replacement algorithms. Some recent works have begun to employ the LSTM network or other complex machine learning methods in cache prefetching [21,30] and cache partitioning [31,32], which shows the potential of the LSTM network in cache management technologies. As shown in Figure 4, Glider [10] tried an LSTM-based cache replacement only at the algorithm level, where an attention mechanism is also added to analyze important rules of cache replacement, thereby enhancing cache hits and achieving performance improvements. However, it cannot distinguish different access patterns either, so when dealing with simple patterns, which have high repeatability and a short re-reference distance, the hit rate degrades due to the complexity and redundancy of the algorithm. Due to the intolerable hardware overhead and excessive delay caused by the LSTM network, the LSTM-based cache replacement algorithm in Glider stays at the level of feasibility analysis and is not put into the hardware implementation of the cache controller. In contrast, we propose the LSTM-CRP algorithm, which accurately classifies different types of access patterns based on multiple offline-trained lightweight LSTM networks and identifies the current cache characteristics in a timely manner via an online set dueling mechanism [33], thereby ensuring that the cache hit rate is maintained at a high level under various types of access patterns.
Finally, the excessive hardware overhead and inference delay introduced by the complex computation of neural networks make efficient hardware implementation very difficult, which is a critical limitation of machine learning-based cache management technologies, e.g., prefetching [21] and replacement [10]. Taking Glider [10] as an example, it employs an offline machine learning method to obtain insights that guide online hardware predictors. The online predictor is implemented with an Integer Support Vector Machine (ISVM) rather than an LSTM network, as shown in Figure 5, because the ISVM-based predictor is much simpler than an LSTM-based one, which is beneficial for decreasing the cost and latency of the replacement algorithm. Thus, the key to improving the practicality of the proposed LSTM-CRP replacement algorithm is to design a lightweight LSTM network and a hardware-friendly architecture. To simplify the LSTM models, two kinds of input generators are presented in LSTM-CRP: one for the sampling caches in the predictors and the other dedicated to the main cache. In these input generators, a complex multi-bit access address is transformed into a binary 1/0, and the generated one-bit key is used instead of the access address for the eviction decision. In this way, the inference of the LSTM network is easily carried out with only 16 or 32 hidden units, which substantially saves hardware resources and speeds up inference. To further reduce the hardware overhead of the LSTM-CRP implementation, a linear saturated quantization method is applied to cut down the bit width of the parameters in the LSTM network, and a piecewise polynomial fitting method is used to implement the two activation functions (i.e., sigmoid and tanh) in the LSTM network.

5. Cache Replacement Policy Using LSTM

This section presents the proposed intelligent LSTM-CRP algorithm, including the algorithm workflow, the main components and improvements for area-efficient design.

5.1. LSTM-CRP Workflow

Compared with the traditional cache replacement algorithm, a classification algorithm is introduced in LSTM-CRP, as shown in Figure 6. The classification algorithm consists of an input generation algorithm and an LSTM-based decision-making algorithm. Once the input generation algorithm receives an address, it will generate a key as the output. The key is used as the input of the decision-making algorithm to generate the cache characteristic of the corresponding address, that is, whether the address is cache-friendly or not. The classification algorithm can accurately identify the type of access pattern, which makes the output cache characteristic more accurate, so the cache control algorithm can accurately manage the eviction priority of the cache data block according to the cache characteristic and ultimately improve the cache hit rate.

5.2. Cache Capacity-Based Input Generation Algorithm

The input generation algorithm generates the input data for the LSTM networks in the decision-making algorithm. To improve the prediction accuracy of the LSTM network, the input data should precisely reflect the characteristic of the access pattern and the cache storage state. There are two commonly used methods for generating input data for an LSTM network, i.e., address-based and Belady vector-based input generation. The address-based input generation algorithm in Glider [10] and work [34] directly adopts the address or PC value as the input data of the LSTM network. Such a strategy is simple but has two drawbacks. First, addresses do not reflect the current storage situation of the cache, which may prevent the LSTM network training from converging for complex access patterns. Second, identifying the characteristics of different access patterns requires enough historical access addresses, which takes up a lot of storage space and is not efficient for hardware implementation. The other method is the Belady vector-based input generation algorithm, which transforms the address sequence into the Belady vector that serves as the input data of the LSTM network. Work [8] uses the OPTgen module to generate the Belady vector, where two storage queues are set per cache set: one queue (which finally becomes the Belady vector) stores the occupancy values (which mark the cache storage state), and the other stores the periodic historical addresses. The main disadvantage of this method is that the occupancy value cannot be determined until the next access to the same block happens, so the timeliness of the Belady vector is deficient. Moreover, the two storage queues required for each cache set introduce considerable hardware overhead.
In view of the defects of the aforementioned two input generation algorithms, we propose a cache capacity-based input generation algorithm, which has the advantages of information integrity, hardware friendliness and high efficiency. Each time an address is accessed, the input generation algorithm generates the corresponding output, called the key, and the follow-up LSTM network uses the key as its input to predict the cache characteristic of the current address. The procedure of the cache capacity-based input generation algorithm is shown in Figure 7. The first step is to set a queue S1 for the entire cache, which stores the addresses accessed over a recent period. When the current address exists in S1, the accumulator adds 1, the virtual capacity T is calculated and the number of '1' elements in queue S2 (marked as N) is counted. When T is greater than N, '1' is inserted into queue S2; otherwise, '0' is inserted. When the current address does not exist in S1, '0' is inserted into queue S2. In the second step, the current address is inserted at the top of queue S1. The third step is to output queue S2 as the key.
The calculation of virtual capacity T is given in Equation (2). Combined with the definition of the virtual capacity T, the inherent meaning of the input generation algorithm is to reflect, in real time, whether the cache is filled with duplicate addresses. When it is filled up (T ≤ N), the inserted element '0' indicates that the subsequent addresses will not be stored. When it is not filled up (T > N), the inserted element '1' indicates that the subsequent addresses will be stored. The accumulator counts the number of addresses in S1 that are the same as the current address, denoted Cnt. Cnt is then multiplied by the cache capacity (W (ways) × S (sets)), which means that repeated addresses will hit and not occupy additional cache space. This makes the cache tend to store duplicate, i.e., cache-friendly, addresses, thereby improving the cache hit rate. Actual access patterns contain a large number of repeated addresses, and capturing repeated addresses as far as possible conforms to the original purpose of placing a cache in the processor storage hierarchy.
$$T = W \times S \times \mathrm{Cnt} \tag{2}$$
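The following Python sketch summarizes the three steps above under assumed queue lengths; the class and variable names are illustrative and not part of the hardware design.

```python
from collections import deque

class InputGenerator:
    """Cache capacity-based input generation (Section 5.2), simplified."""
    def __init__(self, ways, sets, history_len):
        self.ways, self.sets = ways, sets
        self.s1 = deque(maxlen=history_len)   # queue S1: recent addresses
        self.s2 = deque(maxlen=history_len)   # queue S2: generated one-bit keys

    def step(self, addr):
        if addr in self.s1:
            cnt = sum(1 for a in self.s1 if a == addr)     # accumulator Cnt
            t = self.ways * self.sets * cnt                 # virtual capacity, Eq. (2)
            n = sum(self.s2)                                # number of '1' keys, N
            key = 1 if t > n else 0
        else:
            key = 0                                         # address not seen in the window
        self.s1.appendleft(addr)                            # step 2: push address onto S1
        self.s2.appendleft(key)                             # S2 is the key sequence fed to the LSTM
        return key
```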
Compared with the address-based and Belady vector-based input generation algorithms, the cache capacity-based input generation algorithm has three advantages. First, compared with the raw address, the key retains the advantage of the Belady vector in that it reflects the cache storage state. Second, it is friendly for hardware implementation: the proposed input generation algorithm consumes only two queues, whose elements are addresses and keys, and in the hardware design described in Section 6, only one bit is needed to store each key. More importantly, the entire cache needs only one queue for storing keys, whereas storing the Belady vector requires one queue per cache set. Third, unlike the Belady vector, the proposed cache capacity-based input generation algorithm has no delay from receiving the address to generating the key. Therefore, while ensuring a high cache hit rate, the delay remains low and the cache speed is improved.

5.3. Decision-Making Algorithm

The decision-making algorithm integrates LSTM networks to predict the cache characteristic of the address at run-time. The design of the basic LSTM network structure along with the dataset and training scheme are introduced first. Next, we explain how to achieve improvements for area efficiency.
1. Dataset and training scheme
First, the training and tuning processes of this study are briefly introduced. For the training data, we establish four training sets targeting different types of access sequences, each comprising 5000 access sequences, and each type of access sequence is used to independently train an LSTM network. As for the training methodology, this study initially attempted to train a single network to support all four types of access sequences; however, the characteristics of the four access sequences were not uniform, resulting in a network accuracy of less than 60%. Subsequently, four LSTM networks were constructed to form the decision-making algorithm. During the subsequent tuning process, the selection outcome of each predictor was tested individually, revealing that the proportion of the predictor Str being selected was less than 1%; therefore, for the optimization of the hardware implementation, this predictor was eliminated.
2. Basic structure with the LSTM network
The LSTM network used in this study is a classification network, which includes an LSTM layer, a fully connected layer, a SoftMax layer and a classification layer; consequently, it contains four layers in total. The access addresses in general applications can be classified into four types of frequently occurring access patterns [4], as listed in Table 2, namely, the cache-friendly pattern (FRI), blocking pattern (TRA), streaming pattern (STR) and mixed pattern (MIX). Therefore, the decision-making algorithm can concentrate on assessing these four access patterns and pick the winner via a set dueling mechanism [33]. To identify the different types of access patterns, four types of predictors based on LSTM networks are deployed. These four LSTM-based predictors calculate in parallel and output their cache characteristics simultaneously. Then, the selector selects the best predictor according to the past historical hits of each predictor, and the chosen predictor serves as the predictor for a period of time in the future. The period is controlled by a period counter, and its length depends on the size of the main cache, generally set to 1/2 × W × S.
The basic structure of the decision-making algorithm is shown in Figure 8. The decision-making algorithm includes four predictors (i.e., Fri PD, Tra PD, Str PD and Mix PD) based on the LSTM network, which correspond to the cache-friendly pattern, blocking pattern, streaming pattern and mixed pattern, respectively. Each predictor has a sampling cache and a counter (Cnt). The four predictors receive the same key from the input generation algorithm and judge independently to obtain four cache characteristics. Each sampling cache updates its data according to the cache characteristic generated by its predictor, and the counter counts the hits of the sampling cache. The selector receives the values of the four counters, compares them and selects the predictor with the largest counter value. The output of the chosen predictor is used as the real cache characteristic of the current address, and the cache control algorithm updates the data of the main cache accordingly. The cycle counter counts the number of accesses; whenever the threshold value is reached, a clear command is sent to reset the counter, the LSTM network state and the key queue in each predictor.
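A simplified Python sketch of this set dueling selection is given below. The predictor interface (sampling_cache_access, predict) is hypothetical and only illustrates how the counters, the period counter and the selector interact.

```python
class Selector:
    """Set dueling among per-pattern predictors, as in Figure 8 (simplified)."""
    def __init__(self, predictors, period):
        self.predictors = predictors                 # e.g. {"FRI": ..., "TRA": ..., "MIX": ...}
        self.hits = {name: 0 for name in predictors}
        self.period, self.count = period, 0          # period is roughly 1/2 x W x S
        self.best = next(iter(predictors))

    def access(self, key):
        for name, pd in self.predictors.items():
            if pd.sampling_cache_access(key):        # True when the sampling cache hits
                self.hits[name] += 1
        self.count += 1
        if self.count >= self.period:                # period boundary: pick the winner
            self.best = max(self.hits, key=self.hits.get)
            self.hits = {name: 0 for name in self.hits}
            self.count = 0                           # clear command (counters reset)
        return self.predictors[self.best].predict(key)   # characteristic for the main cache
```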
In order to obtain a richer and more flexible combination of datasets for LSTM network training, we use random assortments of these four patterns, called combinations. Such a combination contains patterns of different lengths, cycles and types. As discussed before, we use the key generated by the input generation algorithm as the input data of the decision-making algorithm. As shown in Figure 9, the vector Li in the rightmost column is used as the label of the input data, which is generated by OPTgen's simulation of the Belady algorithm. When the current address is judged as a cache hit, the corresponding element in vector Li is set to '1'; if not, it is set to '0'. So, vector Li reflects whether the current address results in a cache hit.
Above all, the four LSTM networks in Figure 8 are trained using the key as the input data, together with its label. Each LSTM network is mainly constructed of an LSTM layer, followed by a full connection layer, a SoftMax layer and a classification layer. The number of hidden layer units in the LSTM layer is set to 200. For the complex blocking pattern and mixed pattern, the accuracy of the LSTM network can reach more than 97%. Figure 10 presents part of the training loss curve. To mitigate the risk of overfitting, we expand the training sets (four training sets targeting different types of access sequences, each comprising 5000 sequences, with each type used to train an LSTM network independently) and additionally incorporate dropout regularization to prevent the LSTM network from becoming overly reliant on certain hidden layer units.
3. Improvements for area-efficient implementation
When implementing the basic structure of the decision-making algorithm in Figure 8, the major hardware overhead is introduced by the four individual LSTM networks as well as their exclusive sampling caches. Therefore, we focus on reducing this major overhead (including the number of LSTM networks, the capacity of the sampling caches and the number of hidden layer units) to improve area efficiency. First, the LSTM-based predictor with a low utilization rate in the decision-making algorithm is eliminated. Under the testing of the combination access pattern, the chosen output of the selector module is traced and illustrated in Figure 11. It is clear that the selector never chooses the predictor of the streaming pattern (i.e., Str PD). This is because the streaming pattern contains no data reuse, so no replacement algorithm can produce cache hits for it. Thus, the LSTM-based predictor for the streaming pattern can be completely removed from the decision-making algorithm.
Further, it is impractical for the sampling cache to be the same size as the main cache; typically, the main cache is tens or even hundreds of times larger than the sampling cache. Thus, the access characteristics produced by a predictor on the sampling cache cannot be used directly for the main cache. The essential reason is that the decision-making algorithm is based on the Belady algorithm, which is highly related to the cache size. To solve this issue, two kinds of datasets and corresponding labels are designed to train the LSTM networks in the remaining three predictors (i.e., Fri PD, Tra PD and Mix PD): one dataset is designed according to the size of the sampling cache and the other according to the size of the main cache. Under this configuration, we tested the network accuracy with the two kinds of datasets under different types of access patterns. The testing results are shown in Figure 12, where X means the capacity ratio of the main cache to the sampling cache, ranging over 1, 2, 4, …, 128. As X increases, the accuracy of the LSTM network decreases slightly. When X reaches 128, the accuracy for the blocking pattern can still reach over 97%, and the accuracies for the cache-friendly pattern and mixed pattern remain above 98%. In this way, the size of the sampling cache can be reduced by 128 times while maintaining a high accuracy rate.
Lastly, the number of hidden layer units in the LSTM network is a key factor affecting both the network accuracy and area efficiency. Fewer hidden layer units significantly improve area efficiency but negatively impact network accuracy. Figure 13 shows the variation in accuracy with different numbers of hidden layer units. Because the training processes for the remaining three types of access patterns converge differently, the numbers of hidden layer units in their LSTM-based predictors also differ. To achieve higher area efficiency, it is necessary to minimize the number of hidden layer units while maintaining sufficiently high accuracy. Thus, the number of hidden layer units in the LSTM networks for the cache-friendly pattern and blocking pattern predictors is set to 16, and the number for the mixed pattern predictor is set to 32. Under this configuration, the accuracies of all three LSTM-based predictors exceed 97%.
In addition, we conducted an ablation study. Specifically, we evaluated the utilization ratio of the four predictors, as demonstrated in Figure 14. It can be found that the utilization of predictor Str is less than 1%; therefore, this predictor is removed for further area-efficient improvement. After the aforementioned improvements, the area-efficient structure of the decision-making algorithm is shown in Figure 15, which consists of three kinds of heterogeneous predictors. For the cache-friendly pattern and blocking pattern, the sampling cache and main cache share one LSTM-based predictor (i.e., LSTM Fri PD or LSTM Tra PD), which contains 16 hidden layer units. In contrast, two individual predictors (i.e., LSTM Mix PD and LSTM Mix PDx) are allocated for the mixed pattern, dedicated to the sampling cache and the main cache, respectively. LSTM Fri PD and LSTM Tra PD first receive the key from the input generation algorithm and predict the cache characteristic used for the sampling caches (i.e., sampling cache F and sampling cache T), and then they receive key X from input generation algorithm X and predict the cache characteristic used for the main cache. For the mixed pattern, LSTM Mix PD receives the key from the input generation algorithm and predicts the cache characteristic used for sampling cache M; at the same time, LSTM Mix PDx receives key X from input generation algorithm X and predicts the cache characteristic used for the main cache. This configuration is adopted because the number of hidden layer units in the LSTM network for the mixed pattern is 32, double that of the other two access patterns; as a result, the inference latency of the mixed pattern predictor is much longer at the same hardware cost than that of the cache-friendly and blocking pattern predictors. The two predictors for the mixed pattern can run in parallel to increase the inference speed.

5.4. Cache Control Algorithm

LSTM-CRP updates the contents of the cache according to the update mechanism shown in Table 3. The cache control algorithm is based on a 3-bit RRPV (re-reference prediction value) and treats cache-friendly and non-cache-friendly addresses differently. When a cache miss happens, the cache control algorithm selects the cache block with an RRPV of 7 as the eviction block; if no block has an RRPV of 7, the block with the highest RRPV is selected. For the insertion strategy, if the newly inserted block is cache-friendly, its RRPV is set to 0, which means that it has the lowest eviction priority, and the RRPVs of the other blocks whose values are less than 6 are incremented. Otherwise, if the newly inserted block is non-cache-friendly, the cache control algorithm sets its RRPV to 7, which means that it has the highest eviction priority. When a cache hit happens, if the hit block is cache-friendly, its RRPV is set to 0; otherwise, the RRPV is set to 7.
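The per-set behavior of Table 3 can be summarized by the following sketch, which keeps only the RRPV bookkeeping (tags stand in for whole cache blocks, and the LRU compensation described next is omitted).

```python
class RRPVSet:
    """3-bit RRPV control for one cache set, per Table 3 (simplified)."""
    def __init__(self, ways):
        self.tags = [None] * ways
        self.rrpv = [7] * ways

    def access(self, tag, friendly):
        if tag in self.tags:                      # cache hit
            way = self.tags.index(tag)
            self.rrpv[way] = 0 if friendly else 7
            return True
        # cache miss: evict a block with RRPV 7, otherwise the highest RRPV
        victim = self.rrpv.index(max(self.rrpv))
        self.tags[victim] = tag
        if friendly:
            self.rrpv[victim] = 0                 # lowest eviction priority
            for w in range(len(self.rrpv)):
                if w != victim and self.rrpv[w] < 6:
                    self.rrpv[w] += 1             # age the remaining blocks
        else:
            self.rrpv[victim] = 7                 # highest eviction priority
        return False
```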
The hit rate of an intelligent cache algorithm is not as good as that of LRU when dealing with simple access patterns, especially cache-friendly patterns, because LRU can capture all duplicate addresses in cache-friendly patterns. Therefore, an LRU compensation mechanism is proposed in the cache control algorithm of LSTM-CRP. When the access pattern is judged as cache-friendly, the cache control algorithm of LSTM-CRP adopts an LRU replacement policy instead of the update mechanism shown in Table 3. In addition, since the decision-making algorithm does not judge an address on its first access, the cache control algorithm also updates the cache with the LRU replacement policy for first-access addresses. As a result, compared to other intelligent replacement algorithms (e.g., Hawkeye and Glider), LSTM-CRP achieves more comprehensive and robust benefits via the LRU compensation mechanism.

6. Hardware Implementation of LSTM-CRP

6.1. Overview Architecture

Figure 16 shows an overview of the hardware architecture of the LSTM cache replacement algorithm, which corresponds to the algorithm shown in Figure 6. Here, the classifier and cache controller are hardware implementations of the classification algorithm and cache control algorithm, respectively. The classifier consists of two modules, i.e., the input generator and decision maker, which are hardware implementations of the input generation algorithm and decision-making algorithm in Figure 6, respectively.

6.2. Structure of the Input Generator

In order to reduce hardware overhead as well as maintain a highly precise reflection of the current access pattern, two types of input generators, i.e., the ergodic input generator and the sampling input generator, are designed. The ergodic input generator reflects access patterns with high accuracy but consumes large hardware resources; in contrast, the sampling input generator has low hardware overhead but lower accuracy. In the classifier, the sampling cache is small and needs to provide accurate information for the judgement of the selector, so the ergodic input generator is used to provide the key for the sampling caches. In view of the large capacity of the main cache, the sampling input generator saves hardware resources and is fit for providing the key for the main cache.
1. Ergodic input generator
The hardware structure of the ergodic input generator is shown in Figure 17. The length of the address queue is set to 8 × W, where W is the number of cache ways. The address queue is composed of several multi-bit registers. When an address enters, the traversal module performs a parallel comparison (using CMP) between the current address and the historical addresses in the address queue and produces an internal signal with a width of 8 × W, equal to the length of the address queue. When the i-th bit of the internal signal is '1', it means that the i-th address in the address queue is the same as the current address; therefore, the number of '1' elements in the internal signal represents the number of hits of the current address in the address queue. The flag is generated via a bitwise OR operation on the internal signal. Then, the current address is stored in the address queue, and the earliest stored address is discarded. The counter calculates the number of '1' elements in the internal signal, and the virtual capacity is then obtained according to Equation (2).
There is another queue storing the historical keys. Every time the input generator receives a new address, the number of '1' elements in the key queue is counted and sent to the comparator for comparison with the virtual capacity. (i) When the flag signal is '0', it means that the current address is accessed for the first time or has not been accessed within the historical period, and the key of the current address is set to '0'. (ii) When the flag signal is '1', it means that the current address exists in the address queue. In this case, if the number of '1' elements in the key queue is less than the virtual capacity, the key of the current address is set to '1', which indicates that the current address is a duplicate address and the duplicate addresses have not yet filled the cache, so the current address should be cached. Otherwise, if the number of '1' elements in the key queue is greater than the virtual capacity, the key of the current address is set to '0', which indicates that the current address is a duplicate address but the duplicate addresses have already filled the cache, so the current address should not be cached.
2. Sampling input generator
The hardware structure of the sampling input generator is shown in Figure 18. In contrast to the ergodic input generator, the sampling input generator uses "sampling" instead of "traversal", so the address queue and traversal calculation are removed. Here, three pairs of multi-bit registers and counters are used. When the clear signal is at a high level, the values of the counter and register are passed to the next counter and register, respectively; when the clear signal is at a low level, the registers and counters maintain their current values. For example, when the input generator receives address A and the clear signal is at a high level, register 1 saves the current address, register 2 obtains the previous value of register 1 and register 3 obtains the previous value of register 2. In the next cycle, register 1 holds address A while register 2 and register 3 remain empty. When the clear signal is at a low level, no shifting occurs. In this way, the three registers form a shift register whose shifting is controlled by the clear signal. The initial value of the counter paired with each register is 1. The counters continuously compare the current address with their register addresses; if the addresses are the same, the counter increases by 1. When the clear signal is valid (i.e., at a high level), the values of the three counters are also transferred in the manner of a shift register after the increment.
In contrast to an ergodic input generator, the sampling input generator uses much fewer hardware resources to simulate the input generation algorithm. For the access addresses containing a large number of repeated patterns, the ergodic input generator can accurately count the access frequency of a single address, while the sampling input generator counts the access frequency of the repeated patterns. These two kinds of input generators are equally efficient in many cases, such as cache-friendly patterns and blocking patterns. Thus, we only need to use the method of ‘sampling’ to count the access frequency of some addresses in the patterns, which is sufficient to accurately represent the access frequency of all other addresses. As for other complex situations, e.g., the mixed pattern, the access frequency of different addresses in a pattern is different. To deal with this situation, the barycenter value is designed to represent the access frequency of the current pattern.
The calculation of the barycenter value consists of two steps. The first step is to calculate the maximum value, minimum value and average value of the three counters. The second step is to calculate the distance between the average value and the maximum value, as well as the distance between the average value and the minimum value, and to compare them. The key idea of the barycenter value is a voting process: when two counters' values are greater (or less) than the third counter's value, the maximum (or minimum) value is selected accordingly. That is, if the distance between the average value and the minimum value is greater than the distance between the average value and the maximum value, the barycenter value is the maximum value; otherwise, it is the minimum value. Therefore, setting three counters is beneficial to selecting a more representative barycenter value. Then, the virtual capacity T is obtained according to Equation (3). The key of the current address is obtained using the same steps as in the ergodic input generator when the flag signal is '1'. In such a way, the sampling input generator can accurately distinguish two subsequences with different frequencies in a mixed pattern.
$$T = W \times S \times \mathrm{barycenter} \tag{3}$$
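A small numerical sketch of the barycenter voting and Equation (3) is given below; the counter values are made up for illustration.

```python
def barycenter(counters):
    """Pick the extreme value that two of the three counters lean towards."""
    hi, lo = max(counters), min(counters)
    avg = sum(counters) / len(counters)
    return hi if (hi - avg) <= (avg - lo) else lo

def virtual_capacity(ways, sets, counters):
    return ways * sets * barycenter(counters)     # Equation (3)

print(virtual_capacity(8, 16, [5, 6, 1]))         # barycenter is 6, so T = 8 * 16 * 6 = 768
```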

6.3. Structure of Decision Maker

The decision maker module is an LSTM-based predictor for performing a decision-making algorithm. The predictor realizes the inference process of the LSTM network, which is trained offline. Figure 19 shows the overall structure of the predictor, which consists of an LSTM layer, full connection layer and classification layer.
1. LSTM layer
The LSTM layer contains multiple LSTM units, each of which consists of a gate module, network module and memory module. Figure 20 shows the hardware structure of the LSTM unit. The parameters and status of the LSTM network are stored in the memory module. The gate module is responsible for calculating the input gate vector, output gate vector, forget gate vector and memory unit vector in the LSTM network. The network module is responsible for the activation function calculation, network state calculation and output vector calculation in the LSTM network.
In Figure 20, the memory module includes five parts: the cycle weight Wh, input weight Wx, offset weight B, hidden layer state H and network state parameter C. Read-only memory (ROM) is used to store the cycle weight Wh, input weight Wx and offset weight B; their contents do not need to be updated, and the elements of these three matrices are initialized in the read-only memory as soon as the LSTM network training is completed. Random access memory (RAM) is used to store the network state parameter C. A ping-pong mechanism is used to store the hidden layer state H. When the memory module is implemented in the FPGA, the memories for B, Wx, Wh and H consist of LUTs as distributed memory, while the memory for C uses a register cluster, because the network state needs to be reset according to the clear signal generated by the cycle counter and the reset of the register cluster consumes only one cycle.
The gate module consists of a multiplication array and three addition arrays, as shown in Figure 20. The specific computing process of the gate module is shown in Figure 21. The multiplication array computes the product of the cyclic weight matrix Wh (including Wfh, Wih, Wgh and Woh) and the hidden layer state vector H. There are Nh multipliers in the multiplication array, so it computes the product of one row of Wh and H every clock cycle. Addition array 1 sums the point-wise multiplication results of Wh and H; its tree structure improves the calculation speed. At the same time, the number of hidden layer units Nh is chosen as a power of 2, which is convenient for inserting pipeline stages into the addition tree. Addition array 2 adds Wx × X and Wh × H, where X represents the key, i.e., the input vector of the LSTM layer. Since the key takes only the values '1' and '0', the calculation of Wx × X can be replaced by a selector, with X controlling the selector to output '0' or Wx. Addition array 3 adds the bias vector B to the output of addition array 2, and the final output of the gate module is obtained.
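The data path above can be condensed into a few lines of NumPy; the stacked 4·Nh weight layout is an assumption for illustration, and the point is that the Wx × X product degenerates into a selector because X is a single bit.

```python
import numpy as np

def gate_module(Wh, Wx, B, H, x_bit):
    """Pre-activations of the f/i/g/o gates, mirroring Figure 21 (simplified)."""
    acc = Wh @ H                                        # multiplication array + addition array 1
    acc = acc + (Wx if x_bit else np.zeros_like(Wx))    # addition array 2: Wx * X as a selector
    return acc + B                                      # addition array 3: add the bias

Nh = 16
Wh = np.random.randn(4 * Nh, Nh)      # stacked Wfh, Wih, Wgh, Woh
Wx, B, H = np.random.randn(4 * Nh), np.random.randn(4 * Nh), np.random.randn(Nh)
print(gate_module(Wh, Wx, B, H, x_bit=1).shape)   # (64,)
```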
To completely avoid pipeline bubbles, an improved high-parallelism network module is designed, as shown in Figure 22, which employs three sigmoid functions, two tanh functions and three multipliers to connect all the components according to Equation (1). Although the consumption of hardware resources increases compared with a time-multiplexed network module, it fully utilizes the computing resources and completely eliminates bubbles. Thus, when the pipeline is fully loaded, the high-parallelism network module outputs an element every cycle. As shown in the pipeline diagram in Figure 23, only 22 cycles are needed to finish the operation of the high-parallelism network module for an LSTM network with 16 hidden layer units. In addition, when the scale of the multiplication array in the gate module enlarges, the entire high-parallelism network module can be replicated and arranged in parallel to match the data throughput of the gate module. Therefore, we use the high-parallelism network module in LSTM-CRP.
2. Full connection and classification layers
The operation of the full connection layer is similar to that of the gate module in the LSTM layer. It computes the multiplication of matrix F and vector H, where matrix F is the parameter of the full connection layer with a size of 2 × Nh. The hardware structure shown in Figure 24 implements the point-wise multiplication and addition of one row of matrix F and vector H in each cycle. F has two rows in total, so two such hardware structures are needed for parallel computing. The output is vector Fc, a 2 × 1 vector. Because the decision maker only needs to output the cache characteristic and does not need to judge the probability of each cache characteristic, the hardware design does not need to implement the SoftMax layer of the classification LSTM network. The final output signal out is obtained by comparing the two elements of vector Fc in the classification layer: a high level of out indicates that the cache characteristic is cache-friendly, and a low level indicates that it is non-cache-friendly.
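Functionally, the full connection and classification layers reduce to two dot products and a comparison, as in the sketch below; which element of Fc corresponds to the cache-friendly class is an assumption for illustration.

```python
import numpy as np

def classify(F, H):
    """Full connection (2 x Nh) followed by the comparison in the classification layer."""
    fc = F @ H                        # the two rows are computed in parallel in hardware
    return int(fc[0] > fc[1])         # assumed mapping: 1 = cache-friendly, 0 = non-cache-friendly

Nh = 16
print(classify(np.random.randn(2, Nh), np.random.randn(Nh)))
```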
3. Nonlinear activation function
Among the components of the predictor, the activation functions, i.e., sigmoid and tanh, are the most complex parts and consume considerable resources. The low-latency network module uses three sigmoid functions and two tanh functions, as mentioned above, so reducing the hardware overhead of the activation functions is key to lowering the resource cost. In this paper, the piecewise polynomial fitting method is used to realize these two activation functions. This method divides the nonlinear function into several segments, each of which is fitted by a different polynomial. The criterion of segmentation is that the larger the derivative of the original function is, the smaller the segmentation distance is, and vice versa. In this paper, we use a quadratic polynomial to fit the sigmoid and tanh functions, and the segmentation refers to the third derivative of the two activation functions. The piecewise results of the two activation functions are listed in Table 4. Because of symmetry, we only consider half of each activation function and obtain the other half by simple operations such as bitwise inversion. For sigmoid, the piecewise range is [−8, 8]: when the input is greater than 8, the output value is 1, and when the input is less than −8, the output value is 0. For tanh, the piecewise range is [−4, 4]: when the input is greater than 4, the output value is 1, and when the input is less than −4, the output value is −1.
The quadratic polynomial in Equation (4) is used to fit the two activation functions in every segment, and the fitted parameters are stored in memory. When an input value arrives, it is split into three bit fields, and the middle field is used to index the fitted parameters. An intermediate result is then obtained via the multipliers and an adder, and after quantization (described in the next subsection) the activation value is produced. The hardware implementation of each activation function consumes 21 bytes of storage and two multipliers. Compared with an eight-bit-output look-up table scheme, the storage requirement is reduced by about 12 times; compared with the Taylor polynomial expansion scheme (e.g., the work in [35]), the accuracy is improved by two to three orders of magnitude. The total resource consumption of the sigmoid and tanh functions is 102 LUTs and 96 LUTs, respectively.
y = ax² + bx + c    (4)
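As an illustration, the following is a rough software model of the piecewise quadratic approximation for sigmoid. The segment boundaries match the sigmoid column of Table 4 (ignoring the small step offsets), but the coefficients are obtained here by a simple least-squares fit rather than by the fitting procedure used for the hardware parameter memory.

```python
import numpy as np

# Segment boundaries for the positive half of sigmoid (cf. Table 4).
BOUNDS = [0.0, 0.5, 1.0, 1.5, 2.5, 4.0, 5.5, 8.0]

# Fit y = a*x^2 + b*x + c (Equation (4)) on each segment.
COEFFS = []
for lo, hi in zip(BOUNDS[:-1], BOUNDS[1:]):
    xs = np.linspace(lo, hi, 64)
    COEFFS.append(np.polyfit(xs, 1.0 / (1.0 + np.exp(-xs)), 2))

def sigmoid_pw(x: float) -> float:
    """Piecewise-quadratic sigmoid using symmetry for negative inputs."""
    if x < 0:
        return 1.0 - sigmoid_pw(-x)        # symmetry: sigmoid(-x) = 1 - sigmoid(x)
    if x >= BOUNDS[-1]:
        return 1.0                          # saturate beyond the fitted range
    seg = next(i for i, hi in enumerate(BOUNDS[1:]) if x <= hi)
    a, b, c = COEFFS[seg]
    return a * x * x + b * x + c
```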
4.
Quantization of LSTM networks
When the LSTM network is trained in software, 16-bit floating point (FP16) is used to achieve high accuracy. When the LSTM network is implemented in hardware, fixed-point representation is used to save hardware resources and improve computing speed. Quantization is applied to all parts of the LSTM network, such as the multiplier array of the gate module and the sigmoid activation function of the network module. According to the distribution of activation values and weights in LSTM-CRP, we adopt the TensorRT INT8 quantization scheme, originally designed for convolutional neural networks [36], and adapt it to the proposed LSTM network. The quantization of weights is performed offline, while the quantization of activation values is determined online; this procedure is called calibration. The calibration results for each module in Figure 20 and Figure 22 are listed in Table 5.
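The sketch below illustrates the general idea of symmetric fixed-point quantization with a calibration-derived scale. It is a simplified stand-in for the adapted TensorRT INT8 scheme: in particular, the max-absolute-value threshold selection shown here is an assumption, not the calibration search actually used.

```python
import numpy as np

def calibrate_scale(samples: np.ndarray, n_bits: int = 8) -> float:
    """Choose a quantization scale from observed activation samples.

    Simplification: use the maximum absolute value as the saturation
    threshold; TensorRT-style calibration instead searches for the
    threshold that minimizes information loss.
    """
    threshold = float(np.max(np.abs(samples)))
    return threshold / (2 ** (n_bits - 1) - 1)

def quantize(x: np.ndarray, scale: float, n_bits: int = 8) -> np.ndarray:
    """Symmetric fixed-point quantization to signed n_bits integers."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Example: calibrate on activation samples collected from a test trace,
# then apply the same mechanism to quantize the weights offline.
acts = np.random.randn(1024).astype(np.float32)
scale = calibrate_scale(acts)
q_acts = quantize(acts, scale)
```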

7. Experiments and Evaluation

7.1. Experimental Setup

The parameters of the simulated platform and the configuration of the LSTM-CRP hardware are as follows. The main cache has a capacity of 128 KB and is organized as an eight-way set-associative cache; the data block in each cache line is 64 bits. For the hardware implementation of LSTM-CRP, the sampling cache in the LSTM decision-maker is an eight-way set-associative cache with 16 sets; the LSTM networks for the cache-friendly and blocking patterns have 16 hidden layer units each, with a gate module multiplication array of size 64; the LSTM network for the mixed pattern has 32 hidden layer units, with a gate module multiplication array of size 256.
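For clarity, this configuration can be summarized as follows; the structure and field names are ours, introduced only to collect the parameters in one place.

```python
from dataclasses import dataclass

@dataclass
class LstmCrpConfig:
    # Main cache under test
    main_cache_capacity_kb: int = 128      # 128 KB, eight-way set-associative
    main_cache_ways: int = 8
    block_size_bits: int = 64              # data block per cache line

    # Sampling cache inside the LSTM decision-maker
    sampling_cache_ways: int = 8
    sampling_cache_sets: int = 16

    # Predictor sizes (cache-friendly / blocking patterns vs. mixed pattern)
    hidden_units_friendly_blocking: int = 16
    gate_mult_array_friendly_blocking: int = 64
    hidden_units_mixed: int = 32
    gate_mult_array_mixed: int = 256

config = LstmCrpConfig()
```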
Under this configuration, the proposed LSTM-CRP algorithm was compared with four other replacement algorithms: LRU, RRIP [4], Hawkeye [8] and Glider [10]. LRU is the most widely used competitor and serves as the baseline for the competing replacement methods. RRIP improves on the LRU replacement policy by applying re-reference interval prediction on cache misses; its effectiveness is well established, achieving an average of 4–10% improvement over LRU while requiring about half the hardware of LRU. Both Hawkeye and Glider are recently proposed cache replacement approaches based on deep learning. Hawkeye predicts the future cache characteristic of the same address according to the Belady algorithm, using a counter-based predictor. Glider uses a similar architecture but replaces the predictor with an integer support vector machine (ISVM). All the competing replacement schemes were implemented in strict accordance with their respective cited papers. The evaluation covers three aspects. First, we evaluate the effectiveness of the basic LSTM-CRP, which uses the basic decision-making algorithm shown in Figure 8, and compare it with the four existing replacement algorithms (i.e., Glider, Hawkeye, RRIP and LRU). Next, we test the performance degradation of the area-efficient LSTM-CRP, which uses the heterogeneous decision-making algorithm shown in Figure 15, relative to the basic LSTM-CRP. Finally, LSTM-CRP was implemented on the FPGA platform and evaluated in terms of hit rate and hardware overhead.
Frequently occurring access patterns extracted from the access addresses of general applications fall into four types [4], as listed in Table 2 (i.e., FRI, TRA, STR and MIX). Each type represents cache access behavior commonly found in applications. Combinations of the four patterns were therefore chosen as benchmarks to evaluate the effectiveness of cache replacement algorithms in this study. This classification is widely used and cited in many studies [4,37,38] as a benchmark for evaluating system performance and efficiency in place of running a large number of applications. We randomly fuse the four access patterns to obtain five groups of test data, as listed in Table 6 (combinations 1 through 5), each of which mixes the four types of access patterns (i.e., FRI, TRA, STR, MIX); a sketch of such a trace generator is given below.
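The following hypothetical sketch shows how mixed test traces of this kind can be generated from the pattern definitions in Table 2; the burst lengths, address layout and mixing ratios here are placeholders, not the values used to build the combinations in Table 6.

```python
import random

def fri(base: int, k: int) -> list[int]:
    """Friendly pattern: a1..ak..a1, repeated accesses with good locality."""
    seq = list(range(base, base + k))
    return seq + seq[-2::-1]

def tra(base: int, k: int) -> list[int]:
    """Thrashing (blocking) pattern: a looping sequence of k addresses."""
    return list(range(base, base + k))

def stream(base: int, k: int) -> list[int]:
    """Streaming pattern: addresses that are never re-referenced."""
    return list(range(base, base + k))

def make_trace(length: int, ratios: dict[str, float]) -> list[int]:
    """Randomly fuse FRI/TRA/STR bursts according to the given ratios.

    The probabilistic MIX pattern of Table 2 is omitted for brevity.
    """
    gens = {"FRI": fri, "TRA": tra, "STR": stream}
    trace, next_base = [], 0
    while len(trace) < length:
        kind = random.choices(list(ratios), weights=list(ratios.values()))[0]
        burst = gens[kind](next_base, k=16)   # k=16 is a placeholder length
        next_base += 4096                     # keep bursts in disjoint regions
        trace.extend(burst)
    return trace[:length]

# Example: one combination with placeholder proportions (cf. Table 6).
trace = make_trace(10_000, {"FRI": 0.3, "TRA": 0.3, "STR": 0.4})
```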

7.2. Performance Comparison

First, we test the LSTM-CRP replacement algorithm with the basic structure, which uses the decision-making algorithm shown in Figure 8. Because it has not yet been modified for area efficiency, the basic structure-based LSTM-CRP achieves a higher cache hit rate. The experimental results for the cache hit rate are illustrated in Figure 25a. Compared to the intelligent replacement algorithms Hawkeye and Glider, LSTM-CRP is superior, with improvements ranging from about 7.3% to about 18%. Compared with the traditional algorithms LRU and RRIP, LSTM-CRP achieves up to 22.8% improvement and at least about 13.2%. Figure 25b shows the cache hit rates for the four access patterns; each pattern is run 50 times, and the average hit rate is reported. For the MIX pattern, the relative ranking of the algorithms is similar to that for the five combinations. For the STR pattern, none of the tested algorithms can hit the cache. Under the FRI pattern, LRU and RRIP perform better than the existing intelligent replacement algorithms; because LSTM-CRP contains the LRU compensation mechanism, its performance is also better than that of Hawkeye and Glider. Under the TRA pattern, LSTM-CRP performs better than Hawkeye but is slightly inferior to Glider, and all the intelligent replacement algorithms outperform the traditional replacement algorithms, which cannot hit the cache under this blocking pattern. In summary, compared with LRU, RRIP, Hawkeye and Glider, LSTM-CRP improves the average hit rate by 20.10%, 15.35%, 12.11% and 8.49%, respectively. In particular, Glider [10] uses an attention-based LSTM for cache management, and LSTM-CRP still achieves an 8.49% higher hit rate; we therefore consider the comparisons with Hawkeye [8] and Glider [10] sufficient to demonstrate the advancement of the proposed method.
After the improvements aimed at area efficiency, as shown in Figure 15, the hardware resources introduced by the LSTM-CRP algorithm are significantly reduced, but the hit rate of LSTM-CRP must be re-tested to evaluate the performance degradation caused by these improvements. The hit rate degradation of LSTM-CRP after the area-efficient improvements is shown in Figure 26. Compared with the basic structure-based LSTM-CRP algorithm, the hit rate fluctuation of the improved algorithm is less than 0.06% across the five combinations and four access patterns. The overall performance of the improved LSTM-CRP remains better than that of the intelligent cache algorithms (e.g., Hawkeye and Glider) and the traditional cache algorithms (e.g., LRU and RRIP), while also simplifying the hardware implementation.

7.3. Hardware Overhead

The hardware test of the LSTM replacement algorithm covers not only the hit rate but also hardware resource consumption and power dissipation. The proposed LSTM-CRP algorithm was implemented on the FPGA platform, and Figure 27 shows the hit rate degradation of the hardware implementation relative to the algorithm-level simulation; this degradation is introduced by the quantization of the LSTM networks. The hit rate degradation in the hardware implementation is no more than 0.4%, and the hardware-level results are consistent with the algorithm-level evaluation. In general, compared to the other four algorithms, the cache hit rate of LSTM-CRP is significantly improved.
The proposed LSTM-CRP algorithm was implemented on a Xilinx XCVU9P FPGA at a 200 MHz clock frequency. Instead of DSPs, we use LUTs and FF registers to implement the multipliers. The resource consumption of each part of LSTM-CRP is shown in Table 7. The power dissipation is 2.74 W, and the total resource consumption is 15,973 LUTs and 1610 FF registers.

8. Conclusions

In this paper, a cache replacement algorithm based on the LSTM network (i.e., LSTM-CRP) is proposed, which can accurately classify different access patterns and identify current cache characteristics in a timely manner via an online set dueling mechanism. For area-efficient implementation, heterogeneous architecture and lightweight LSTM networks are dedicatedly constructed in LSTM-CRP to lower the hardware overhead and inference delay. Compared with LRU, RRIP, Hawkeye and Glider, the proposed LSTM-CRP replacement algorithm improves the average hit rates by 20.10%, 15.35%, 12.11% and 8.49%, respectively. For LSTM networks applied in cache replacement algorithms, further optimizations can be made from both algorithmic and hardware perspectives. On the algorithmic side, the accuracy of the LSTM network needs to be further enhanced. Currently, the raw input for the network is the address, but it is feasible to consider using Program Counter (PC) values instead, as PCs contain more information and require less storage space. Subsequently, it is necessary to eliminate useless information from the PCs, and an Attention mechanism can be incorporated into the LSTM network to filter out high-value sequence information. On the hardware side, efforts should be made to further reduce the latency of LSTM network hardware, and asynchronous circuit designs can be attempted for implementation. Additionally, to further reduce the bit-width required for LSTM network computations while maintaining accuracy, a non-linear quantization scheme can be employed.

Author Contributions

Conceptualization, C.Y. and Y.W.; methodology, Y.W. and J.W.; software, Y.W. and Y.M.; validation, Y.W. and J.W.; writing—original draft preparation, Y.W. and J.W.; writing—review and editing, J.W. and Y.M.; project administration, C.Y.; funding acquisition, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62176206.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jacob, P.; Zia, A.; Erdogan, O.; Belemjian, P.; Kim, J.-W.; Chu, M.; Kraft, R.P.; McDonald, J.F.; Bernstein, K. Mitigating Memory Wall Effects in High-Clock-Rate and Multicore CMOS 3-D Processor Memory Stacks. Proc. IEEE 2009, 97, 108–122. [Google Scholar] [CrossRef]
  2. Wen, F.; Qin, M.; Gratz, P.; Reddy, A.L. Hardware Memory Management for Future Mobile Hybrid Memory Systems. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 3627–3637. [Google Scholar] [CrossRef]
  3. Duong, N.; Zhao, D.; Kim, T.; Cammarota, R.; Valero, M.; Veidenbaum, A.V. Improving Cache Management Policies Using Dynamic Reuse Distances. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, Vancouver, BC, Canada, 1–5 December 2012; pp. 389–400. [Google Scholar]
  4. Jaleel, A.; Theobald, K.B.; Steely, S.C.; Emer, J. High performance cache replacement using re-reference interval prediction (RRIP). In Proceedings of the 37th Annual International Symposium on Computer Architecture, Saint-Malo, France, 19–23 June 2010; pp. 60–71. [Google Scholar]
  5. Qureshi, M.K.; Jaleel, A.; Patt, Y.N.; Steely, S.C.; Emer, J. Adaptive insertion policies for high performance caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture, San Diego, CA, USA, 9–13 June 2007; pp. 381–391. [Google Scholar]
  6. Khan, S.M.; Tian, Y.; Jimenez, D.A. Sampling Dead Block Prediction for Last-Level Caches. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Atlanta, GA, USA, 4–8 December 2010; pp. 175–186. [Google Scholar]
  7. Jain, A.; Lin, C. Rethinking belady’s algorithm to accommodate prefetching. In Proceedings of the 45th Annual International Symposium on Computer Architecture, Los Angeles, CA, USA, 1–6 June 2018; pp. 110–123. [Google Scholar]
  8. Jain, A.; Lin, C. Back to the future: Leveraging Belady’s algorithm for improved cache replacement. In Proceedings of the 43rd International Symposium on Computer Architecture, Seoul, Republic of Korea, 18–22 June 2016; pp. 78–89. [Google Scholar]
  9. Belady, L.A. A study of replacement algorithms for a virtual-storage computer. IBM Syst. J. 1966, 5, 78–101. [Google Scholar] [CrossRef]
  10. Shi, Z.; Huang, X.; Jain, A.; Lin, C. Applying Deep Learning to the Cache Replacement Problem. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH, USA, 12–16 October 2019; pp. 413–425. [Google Scholar]
  11. Jiménez, D.A.; Teran, E. Multiperspective reuse prediction. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, 14–18 October 2017; pp. 436–448. [Google Scholar]
  12. Teran, E.; Wang, Z.; Jiménez, D.A. Perceptron learning for reuse prediction. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, Taiwan, 15–19 October 2016; p. 2. [Google Scholar]
  13. Liu, E.Z.; Hashemi, M.; Swersky, K.; Ranganathan, P.; Ahn, J. An imitation learning approach for cache replacement. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; p. 579. [Google Scholar]
  14. Fu, J.W.C.; Patel, J.H.; Janssens, B.L. Stride directed prefetching in scalar processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture, Portland, OR, USA, 1–4 December 1992; pp. 102–110. [Google Scholar]
  15. Joseph, D.; Grunwald, D. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, Denver, CO, USA, 2–4 June 1997; pp. 252–263. [Google Scholar]
  16. Kandiraju, G.B.; Sivasubramaniam, A. Going the distance for TLB prefetching: An application-driven study. In Proceedings of the 29th Annual International Symposium on Computer Architecture, Anchorage, AK, USA, 25–29 May 2002; pp. 195–206. [Google Scholar]
  17. Nesbit, K.J.; Smith, J.E. Data Cache Prefetching Using a Global History Buffer. IEEE Micro 2005, 25, 90–97. [Google Scholar] [CrossRef]
  18. Bakhshalipour, M.; Tabaeiaghdaei, S.; Lotfi-Kamran, P.; Sarbazi-Azad, H. Evaluation of Hardware Data Prefetchers on Server Processors. ACM Comput. Surv. 2019, 52, 1–29. [Google Scholar] [CrossRef]
  19. Wu, H.; Nathella, K.; Sunwoo, D.; Jain, A.; Lin, C. Efficient metadata management for irregular data prefetching. In Proceedings of the 46th International Symposium on Computer Architecture, Phoenix, AZ, USA, 22–26 June 2019; pp. 449–461. [Google Scholar]
  20. Zhang, C.; Zeng, Y.; Shalf, J.; Guo, X. RnR: A Software-Assisted Record-and-Replay Hardware Prefetcher. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, 17–21 October 2020; pp. 609–621. [Google Scholar]
  21. Peled, L.; Weiser, U.; Etsion, Y. A Neural Network Prefetcher for Arbitrary Memory Access Patterns. ACM Trans. Archit. Code Optim. 2019, 16, 37. [Google Scholar] [CrossRef]
  22. Denning, P.J. The locality principle. Commun. ACM 2005, 48, 19–24. [Google Scholar] [CrossRef]
  23. Noh, H.; Hong, S.; Han, B. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1520–1528. [Google Scholar]
  24. Deena, S.; Hasan, M.; Doulaty, M.; Saz, O.; Hain, T. Recurrent Neural Network Language Model Adaptation for Multi-Genre Broadcast Speech Recognition and Alignment. IEEE/ACM Trans. Audio Speech Lang. Proc. 2019, 27, 572–582. [Google Scholar] [CrossRef]
  25. Otter, D.W.; Medina, J.R.; Kalita, J.K. A Survey of the Usages of Deep Learning for Natural Language Processing. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 604–624. [Google Scholar] [CrossRef] [PubMed]
  26. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  27. Lu, X.; Najafi, H.; Liu, J.; Sun, X.H. CHROME: Concurrency-Aware Holistic Cache Management Framework with Online Reinforcement Learning. In Proceedings of the 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Edinburgh, UK, 2–6 March 2024; pp. 1154–1167. [Google Scholar]
  28. Zhou, Y.; Wang, F.; Shi, Z.; Feng, D. An Efficient Deep Reinforcement Learning-Based Automatic Cache Replacement Policy in Cloud Block Storage Systems. IEEE Trans. Comput. 2024, 73, 164–177. [Google Scholar] [CrossRef]
  29. Sethumurugan, S.; Yin, J.; Sartori, J. Designing a Cost-Effective Cache Replacement Policy using Machine Learning. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Seoul, Republic of Korea, 27 February–3 March 2021; pp. 291–303. [Google Scholar]
  30. Ganfure, G.O.; Wu, C.F.; Chang, Y.H.; Shih, W.K. DeepPrefetcher: A Deep Learning Framework for Data Prefetching in Flash Storage Devices. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 3311–3322. [Google Scholar] [CrossRef]
  31. Sarwar, S.; Zia-ul-Qayyum, Z.-u.-Q.; Malik, O.A.; Rizvi, B.; Ahmed, H.F.; Takahashi, H. Performance comparison of case retrieval between case based reasoning and neural networks in predictive prefetching. In Proceedings of the 6th International Conference on High Capacity Optical Networks and Enabling Technologies, Alexandria, Egypt, 28–30 December 2010; pp. 57–61. [Google Scholar]
  32. Liu, W.; Cui, J.; Liu, J.; Yang, L.T. MLCache: A space-efficient cache scheme based on reuse distance and machine learning for NVMe SSDs. In Proceedings of the 39th International Conference on Computer-Aided Design, Virtual Event, 24–27 October 2020; p. 58. [Google Scholar]
  33. Qureshi, M.K.; Jaleel, A.; Patt, Y.N., Jr.; Steely, S.C.; Emer, J. Set-Dueling-Controlled Adaptive Insertion for High-Performance Caching. IEEE Micro 2008, 28, 91–98. [Google Scholar] [CrossRef]
  34. Zeng, Y.; Guo, X. Long short term memory based hardware prefetcher: A case study. In Proceedings of the International Symposium on Memory Systems, Alexandria, VA, USA, 2–5 October 2017; pp. 305–311. [Google Scholar]
  35. Chen, K.; Huang, L.; Li, M.; Zeng, X.; Fan, Y. A Compact and Configurable Long Short-Term Memory Neural Network Hardware Architecture. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4168–4172. [Google Scholar]
  36. Migacz, S. 8-Bit Inference with TensorRT [EB/OL]. 2017. Available online: http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf (accessed on 1 March 2024).
  37. Yang, C.; Hou, J.; Wang, Y.; Geng, L. CRP: Context-directed Replacement Policy to Improve Cache Performance for Coarse-Grained Reconfigurable Arrays. In Proceedings of the 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS), Glasgow, UK, 23–25 November 2020; pp. 1–2. [Google Scholar]
  38. Rodriguez, L.V.; Yusuf, F.; Lyons, S.; Paz, E.; Rangaswami, R.; Liu, J.; Zhao, M.; Narasimhan, G. Learning Cache Replacement with CACHEUS. In Proceedings of the 19th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, 23–25 February 2021; pp. 341–354. [Google Scholar]
Figure 1. Execution of the Belady algorithm under a simple access pattern.
Figure 2. The basic architecture of LSTM.
Figure 3. The typical architecture of an intelligent cache replacement algorithm.
Figure 4. The attention-based LSTM network for an offline predictor in Glider.
Figure 5. The ISVM-based online predictor of Glider.
Figure 6. The workflow of the LSTM-CRP replacement algorithm.
Figure 7. The procedure of the cache capacity-based input generation algorithm.
Figure 8. The basic structure of the decision-making algorithm.
Figure 9. The Belady vector generation in OPTgen.
Figure 10. The training loss curve of the proposed LSTM network (part).
Figure 11. The predictor chosen by the selector module during the combination access pattern testing.
Figure 12. Accuracy of LSTM with different X values.
Figure 13. Accuracy with different numbers of hidden layer units.
Figure 14. The utilization of the designed four predictors.
Figure 15. The heterogeneous structure of the decision-making algorithm for area-efficient improvements.
Figure 16. The hardware architecture of LSTM-CRP implementation.
Figure 17. The hardware structure of the ergodic input generator.
Figure 18. The hardware structure of the sampling input generator.
Figure 19. The overall structure of the predictor.
Figure 20. Hardware structure of the LSTM unit.
Figure 21. The specific computing process of the gate module.
Figure 22. Improved structure of the high parallelism network module.
Figure 23. Pipeline diagram of the high parallelism network module.
Figure 24. Calculation processes of the full connection layer and classification layer.
Figure 25. The performance comparison of LSTM-CRP and other algorithms.
Figure 26. Hit rate degradation after improvements in area efficiency.
Figure 27. Hit rate degradation of LSTM-CRP on hardware.
Table 1. Comparison of different cache replacement algorithms.

Method | Advantage | Disadvantage | Type
LRU (heuristic algorithm) | Simple | Cannot distinguish patterns; cannot distinguish cache characteristics | Traditional
RRIP [4] (heuristic algorithm) | Simple; handles scan pattern | Cannot distinguish patterns; cannot distinguish cache characteristics | Traditional
Hawkeye [8] (statistical algorithm) | Predicts cache characteristics | Low prediction accuracy; cannot distinguish patterns | Intelligent
Glider [10] (machine learning, ISVM) | Predicts cache characteristics | Low prediction accuracy; cannot distinguish patterns | Intelligent
Table 2. Common cache access patterns.

Pattern | Cache Access Pattern
FRI | (a1, a2, …, ak−1, ak, ak−1, …, a2, a1)^N    s.t. k, N ∈ ℕ+
TRA | (a1, a2, …, ak−1, ak)^N    s.t. k < W × S, N ∈ ℕ+
STR | (a1, a2, …, ak−1, ak)    s.t. k = ∞
MIX | [(a1, a2, …, ak−1, ak, ak−1, …, a2, a1)^A Pε((a1, a2, …, ak−1, ak, ak−1, …, am))]^N    s.t. k < W × S, m > W × S, 0 < ε < 1
Table 3. Update mechanism of LSTM-CRP.

Cache Characteristic | Cache Hit | Cache Miss
Non-cache-friendly | RRPV = 7 | RRPV = 7
Cache-friendly | RRPV = 0 | RRPV = 0; if RRPV < 6 (RRPV++)
Table 4. The piecewise results of the two activation functions.

Segment Number | Sigmoid | Tanh
1 | [0, 0.5] | [0, 0.5]
2 | [(0.5 + step), 1] | [(0.5 + step), 1]
3 | [(1 + step), 1.5] | [(1 + step), 1.5]
4 | [(1.5 + step), 2.5] | [(1.5 + step), 2]
5 | [(2.5 + step), 4] | [(2 + step), 3]
6 | [(4 + step), 5.5] | [(3 + step), 4]
7 | [(5.5 + step), 8] | —
The step for sigmoid and tanh is 0.0156 and 0.0078, respectively.
Table 5. The calibration results of each module in the LSTM layer.

Module | Width Before Quantization | Width After Quantization | Retained Bits
Multiplication array | 12 | 8 | [8:1]
Addition array 1 | 9 | 8 | [7:0]
Addition array 2 | 9 | 8 | [7:0]
Addition array 3 | 9 | 8 | [8:1]
Sigmoid | 25 | 10 | [11:2]
Tanh 1 | 25 | 10 | [11:2]
Tanh 2 | 25 | 10 | [11:2]
Multiplier 1 | 20 | 10 | [9:0]
Multiplier 2 | 18 | 8 | [7:0]
Multiplier 3 | 18 | 8 | [7:0]
Adder | 18 | 8 | [8:1]
Table 6. Proportion of each access pattern in five combinations.

Combination | MIX | STR | FRI | TRA
Combination 1 | 18% | 26% | 28% | 28%
Combination 2 | 26% | 26% | 25% | 23%
Combination 3 | 22% | 25% | 26% | 27%
Combination 4 | 29% | 21% | 34% | 16%
Combination 5 | 24% | 29% | 25% | 22%
Table 7. Hardware resources of each part in the classifier.

Resource | Key Generator | Key Generator X | FRI Predictor | TRA Predictor | MIX Predictor | MIX Predictor X | Others | Total
LUT | 327 | 306 | 3302 | 3302 | 4552 | 4032 | 152 | 15,973
FF | 213 | 183 | 281 | 281 | 320 | 296 | 36 | 1610
BRAM | 0 | 0 | 8 | 8 | 8 | 8 | 8 | 40
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
