This section covers all information needed to implement Task 4: Instruction & Data Caches. We first designed our cache by answering the four fundamental questions of cache implementation. Then we discuss how we implemented the caches in our design.
A cache allows blocks of memory to be stored in a higher level of the processor architecture. This means that data in the cache can be located and retrieved significantly faster than data stored in main memory. However, copying data from the main memory into a cache adds some overhead, forcing the whole processor to stall. This is why we copy memory over in blocks. Data that relates to one another is typically stored closed to each other and is required by the processor at the same time. Therefore, we can pull this data all out at the same time in a block and avoid reaccessing the main memory as much as possible.
There are four fundamental questions that define the design of a cache:
- Where is a block placed?
- How do we identify each block?
- What happens when the block is not in the cache?
- What happens when data is written to memory in the cache?
We will dicusss the design of our cache by addressing each of these questions separately.
For our project, it was required that we write a 2-way set-associative cache. This means that the cache is separated into specific sets, and each set can hold a specific number of blocks. Another way to describe this is that each a block entering a set has only so many "ways" it can go to be stored. Therefore, a 2-way cache can only store 2 blocks per set.
Though the number of sets in our cache was determined for us, it was up to us to decide our block size and the number of sets in our cache. Our processor is designed to accept instructions and data entires of 32 bits long. Therefore, we decided that our blocks will comprise of 8 data entries. This means we have 256 bits per data block. Given the small size of our programs, we determined arbitrarily that we would have only 4 sets.
Therefore, the basic structure of our cache is shown in the following figure:
The set that a block should be contained in is decided by its address. We perform the following modular arithmetic operation: (address in main memory) % (# of sets). This is equal to: (block #) % (# of set), because our block numbers are determined by their physical addresses. When a request comes to the cache, it must indicate which set to check in the field index. Since we only have 4 sets, our index field is only 2 bits long. Also, for our processor, each block contains a tag field that contains its address in main memory.
This information is actually a little redundant. We could have designed our cache to only require the tag, and then cacluated the set ourselves. However, if we were ever to change our block placement design, both of these fields would likely be needed, so we designed our cache to contain both data fields.
Upon finding the correct set, the cache checks if a valid data block is stored there by checking the valid signal, V. If the stored block is valid (V is set to '1') and request tag matches the cache block's tag, then we know that this is the desired data is contained within this block. Finally, the request should contain an offset field to indicate which of the 8 data entires in the block it is trying to access. With this information, the cache can send out the correct data.
A new block first checks the valid signals of the two block locations in its set. If either of them are set to '0', there is not already a block located there, and the new block can be stored.
However, if both of the valid signals are set to '1', then one of them must be removed to make room for the new block. Therefore, we programmed our cache to contain a timestamp field on each block. This timestamp is updated every time its data is accessed within the cache. To determine which block to replace, we implemented our cache following the least-recently-used (LRU) strategy. The cache takes the difference between the two timestamp fields, and the timestamp with the smallest value is removed.
While our instruction cache cannot have its memory changed, the data cache can. Therefore, we designed our cache to include a "dirty" signal, D, with each block. The dirty signal is set to '0' when the block is first placed in the cache. However, if data is ever written to the block, dirty is set to '1' to indicate that the cache memory now differs from the main memory.
From there, we chose to implement a write-back policy, instead of a write-through policy. In a write-though policy, the main memory would be accessed right away to be updated to match the cache. However, to limit the number of times we access the main memory, we chose the write-back policy, which means that we only update the main memory when the "dirty" block is being removed from the cache. Therefore, we can avoid stalling our processor as much as possible.
<This information was shown in the project demo, but we will also include it here when we are able. Thank you for your patience.>
Regrettably, though we had our cache completely designed, we were unable to connect it to the rest of our processor. We simply procrastinated too much on this project and ran out of time. Therefore, we cannot verify that our design was implemented correctly. :(