Cache Capacity and its Effects on Power Consumption for Tiled Chip Multi-Processors

Shounak Chakraborty, Dipika Deb, Dhantu Buragohain, Hemangee K. Kapoor
Department of Computer Science and Engineering
IIT Guwahati,
Guwahati, India-781039
{c.shounak, d.dipika, dhantu.buragohain, hemangee}@iitg.ernet.in

Abstract—Minimizing the power consumption of Chip Multiprocessors (CMPs) has drawn considerable research attention in recent years. A single chip contains a number of processor cores and correspondingly large caches. Recent research shows that on-chip caches account for the largest share of the total power consumed by the chip. Reducing the on-chip cache size can lower on-chip power consumption, but it may degrade performance. In this paper we present a study that reduces cache capacity and analyzes the effect on power and performance. We reduce the number of available cache banks and observe the resulting change in dynamic and static energy. Experimental evaluation shows that, for most of the benchmarks, we obtain a significant reduction in static energy, which can help control chip temperature. We use CACTI and a full-system simulator for our experiments.

Keywords—Power optimisation; Cache; Chip multiprocessor; Dynamic power; Leakage power;

I. INTRODUCTION

In recent years, the power consumption of Chip Multiprocessors (CMPs) has received considerable attention from researchers. With the rapid growth of IC technology, the number of on-chip elements has increased. Recent chips have multiple processor cores with multilevel on-chip caches to improve performance. This rapid growth in on-chip components increases the overall power consumption of the chip. A recent study [1] of chip power consumption indicates that the largest share of chip power is consumed by the on-chip cache.

The power consumption of a cache can be divided into two major parts: dynamic power and static power. Dynamic power is consumed when the cache is accessed, and static power is generally referred to as the leakage power of the cache. To reduce cache power, both components must be reduced. Increased chip design complexity has increased the power consumption of the chip, which raises its operating temperature. A rise in chip temperature in turn increases leakage power. A high operating temperature can even damage the internal circuits of the chip. To maintain a stable chip temperature, modern systems incur high cooling costs. Chip power minimization has therefore opened a research avenue, within which cache power minimization for CMPs has received special attention.

In modern CMPs, multilevel caches are organized in several ways. In our study, we consider a chip with two levels of cache (i.e., L1 and L2). The L1 cache is private to each core, whereas the L2 cache is shared by all the cores. The L2 access pattern depends on the L2 organization. When every core takes the same amount of time to access a particular data item, the organization is called Uniform Cache Access (UCA). When the access time for a particular data item differs from core to core, the organization is known as Non-Uniform Cache Access (NUCA). In an on-chip L2 NUCA cache, the L2 is generally divided into multiple banks, and cache lines and cache sets are organized inside the banks. According to [1] and references therein, the cache parameters need to be tuned to optimize cache power consumption.

Cache power consumption is the major component of chip power, and cache leakage is its principal part. An effective way to reduce cache power consumption is therefore selective bank shut down. In this strategy, all cache banks are allocated initially. Later, according to the change in Working Set Size (WSS) and cache bank usage, only the required number of banks is kept on and the remaining banks are powered off, which eliminates unnecessary power consumption. If the program later needs to cache more data, banks are turned on as needed. However, frequent shut down and wake up of cache banks may introduce overhead in both power and performance, so shut down and wake up decisions should be taken in a way that does not increase system overhead.

In this work, we first study the cache power consumption when all the cache banks are on throughout the execution of the process. We then reduce the number of cache banks according to the requirement of the process. The bank shut down decision is taken by running benchmark programs in the unchanged cache configuration and in the changed configurations. With improvements in IC technology, the number of on-chip components has increased, and a constant power supply to all of these components raises chip temperature. Powering off some on-chip components therefore also reduces the effective chip temperature, which in turn reduces leakage power consumption. We observe the power consumption for both cases by running the CACTI 6.0 [2] tool with the proper configurations. The configuration requirements are decided by running benchmark programs in the Simics environment.

The paper is organized as follows. Section II reviews the related work. Section III presents our power optimization strategy, and Section IV discusses the experimental setup with results and analysis. Finally, Section V concludes the paper.

II. RELATED WORK

On-chip cache power has become an important constraint for CMP design. Some recent works have studied how cache power can be optimized under a performance constraint. In [1], the authors present a survey on cache tuning from a power perspective. The survey covers state-of-the-art offline static and online dynamic cache tuning techniques and summarizes the pros and cons of each, which opens future research avenues.

Reducing the power consumed by the Last Level Cache (LLC) can be a key factor in limiting the peak power consumption of a CMP chip. To reduce LLC power, some cache banks can be dynamically selected and put into a low-power mode. However, such dynamic cache resizing can increase cache access latency, which increases the number of CPU stall cycles. To address these issues, Wang et al. proposed a cache management strategy to limit the peak power consumption of the LLC in CMPs [3]. Their work can be summarized in three points: 1) an L2 cache management strategy that provides fair or differentiated cache sharing for threads running on a power-constrained CMP; 2) a two-tier feedback control architecture designed to keep peak cache power consumption at the desired level; and 3) the use of advanced feedback control theory to provide stability in the system.

In another work, Brooks et al. studied validation and design strategies for power-performance simulators [4]. This study analyzed the accuracy of the simulators. The authors break accuracy down into two sub-types: relative and absolute accuracy. They also analyzed power-performance errors and their effects on the design choices used in the simulators.

Cache configuration plays an important role in reducing on-chip cache power under a performance constraint. In [5], the authors proposed an on-chip cache management policy named Dynamic Cache Clustering (DCC), which dynamically forms a few clusters among the cache banks to provide a flexible and efficient cache organization for CMPs. A mapping and location strategy was proposed to manage the dynamically resizable cache configuration, especially on tiled CMPs.

In [6], Powell et al. proposed a physical-level technique to reduce the leakage power of caches. The approach, known as Gated-Vdd, gates the supply voltage and reduces leakage in unused SRAM cells. Together with a resizable cache architecture, this technique reduces energy-delay with little impact on performance. Apart from this physical-level approach, Aparna M. et al. proposed an adaptive power optimization of the on-chip SNUCA cache on tiled CMP architectures [7]. In this work, the authors proposed a tagged Bloom filter, with dynamic L2 cache allocation based on the estimated WSS. In addition, a remap policy was proposed to prevent data loss in the L2 cache during dynamic shut down of cache lines.

The Adaptive Mode Control (AMC) technique was proposed to reduce cache leakage power [8]. Each individual cache line is in either Active mode or Sleep mode. Sleep mode consumes less power by putting the data store into a low-power mode. A new kind of cache miss, called a sleep miss, occurs when the data store is accessed during sleep mode. Activation from sleep mode needs a few stall cycles but does not cause significant performance degradation. The authors claimed a significant reduction in cache leakage with respect to prior works by implementing this sleep strategy.

Drowsy cache is another approach to cache power reduction. In [9], the authors proposed a phase-adaptive cache design method that reduces both dynamic and static energy, with a small performance degradation. The whole cache is partitioned into two parts, one faster and the other slower due to drowsy mode. Drowsy mode is a low-power mode that needs a few extra cycles to become active and work normally. Following an MRU policy, the most recently used data is kept in the fast region and the remainder in the slower region, and they are swapped as needed.

A workload-independent cache energy reduction strategy was proposed in [10]. The authors proposed a power reduction technique for D-NUCA caches that adapts the powered-on cache area to the needs of the running workload without relying on application-dependent parameters. Data in the farthest cache way is brought to the nearest possible cache way of the core that currently accesses it. Turning off cache lines saves a large amount of leakage power. This strategy saves 49% of total cache energy consumption in a single-core environment and 44% in a CMP environment.

In an analytical work [11], A. Bardine studied the static and dynamic energy consumption of NUCA caches, comparing the energy consumption of conventional UCA caches with that of SNUCA and DNUCA caches. The results show that NUCA caches are the most energy-efficient architectures and give better performance than conventional UCA caches. The results also show that DNUCA caches have the highest number of bank accesses and the highest amount of data migration, so they consume more dynamic energy than the other configurations. Even so, static energy dominates dynamic energy, which gives strong motivation for future research to concentrate on leakage energy consumption.

III. STUDY METHODOLOGY

We use a tiled CMP architecture as the baseline of our work. The CMP consists of 16 tiles, each of which has a core, a private L1 cache, and a bank of the shared L2 cache. The design of the 16-tile CMP is shown in figure 1. The number written inside each rectangle is the corresponding Tile-id. The energy model used in our experiments is similar to the model used in [11]. To express the output in terms of power, we have modified the energy dissipation formula of [11] as needed. The execution time is taken in seconds. Total power consumption is computed as:

\[ P_{total} = P_{dynamic} + P_{static} + P_{off-cache} \]  

And total energy consumption will be computed as:

\[ E_{total} = E_{dynamic} + E_{static} + E_{off-cache} \]  

where \( P_{dynamic} \) and \( E_{dynamic} \) denote the dynamic power and energy consumed by the banks of the cache, including network elements if present. The dynamic power can be broken down as follows:

\[ P_{dynamic} = \text{no. of bank accesses} \times P_{\text{bank access}} + \text{no. of flit transmissions} \times P_{\text{flit transmission}} + \text{no. of flit traversals} \times P_{\text{flit traversal}} \]

The corresponding dynamic energy is calculated as:

\[ E_{dynamic} = P_{dynamic} \times \text{execution time} \]  

The static power dissipated by the cache banks and network switches is calculated as follows:

\[ P_{static} = \text{no. of banks} \times P_{\text{bank static}} + \text{no. of switches} \times P_{\text{switch static}} \]

The corresponding static energy is calculated as:

\[ E_{static} = P_{static} \times \text{execution time} \]  

All the terms of equations 3 and 4 are described in [11], with one simple change: the term “Energy (E)” has been replaced by “Power (P)”.
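
The model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used in the study: the class names, field names and function names are hypothetical, the per-event and per-component values are placeholders that would in practice come from CACTI and the simulator statistics, and the off-cache term is left out because the text does not break it down.

```python
from dataclasses import dataclass


@dataclass
class CacheActivity:
    """Event counts from one simulation run (hypothetical field names)."""
    bank_accesses: int
    flit_transmissions: int
    flit_traversals: int
    execution_time_s: float


@dataclass
class PowerParams:
    """Per-event and per-component power values (placeholders)."""
    p_bank_access: float
    p_flit_transmission: float
    p_flit_traversal: float
    p_bank_static: float
    p_switch_static: float


def dynamic_power(a: CacheActivity, p: PowerParams) -> float:
    # P_dynamic: sum of (event count x per-event power) over the three event types.
    return (a.bank_accesses * p.p_bank_access
            + a.flit_transmissions * p.p_flit_transmission
            + a.flit_traversals * p.p_flit_traversal)


def static_power(num_banks: int, num_switches: int, p: PowerParams) -> float:
    # P_static: powered-on banks and switches each contribute a fixed amount.
    return num_banks * p.p_bank_static + num_switches * p.p_switch_static


def cache_energy(a: CacheActivity, p: PowerParams,
                 num_banks: int, num_switches: int) -> tuple:
    """Return (E_dynamic, E_static), each computed as power x execution time."""
    e_dynamic = dynamic_power(a, p) * a.execution_time_s
    e_static = static_power(num_banks, num_switches, p) * a.execution_time_s
    return e_dynamic, e_static
```

In this form, powering off banks changes only the `num_banks` argument of the static term, while any change in access counts or execution time feeds through the dynamic term and the multiplication by execution time.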

A. CACTI

In this study, we analyze the cache power consumption for different cache configurations. The equations above are used to compute the total energy dissipation of the cache. We focus on the energy dissipated by the L2 cache. We use CACTI 6.5 to derive the energy parameters for the L2 cache memory banks. This version of CACTI combines the enhancements made in CACTI 5.0 and CACTI 6.0.

We modify the configuration file of CACTI 6.5 as required. The modification details are given in Table I. We keep the L2 bank size fixed and, for different cache sizes, increase or decrease the number of banks in L2.

B. Evaluation Topologies

For our evaluation, we initially take a 16-core tiled CMP system as in figure 1. The tiles are connected to each other through a 2D mesh network-on-chip. After running some benchmarks (from PARSEC [12]) in this unchanged configuration, we collect all the data required for our analysis. We then reduce the number of L2 cache banks to observe the change in power values. We assume that, in the changed configurations, we simply switch off the L2 banks in the tiles. For a concrete analysis, we take different configurations and extract the power values for each one.

Figure 1. Tiled CMP architecture
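
To make the non-uniform access distance in such a mesh concrete, the following sketch computes the minimum number of hops between two tiles. It assumes that the 16 tiles form a 4 x 4 grid, that Tile-ids are assigned in row-major order, and that requests follow a minimal route; none of these details is stated in the text, and the function names are hypothetical.

```python
MESH_WIDTH = 4  # assumption: 16 tiles arranged as a 4 x 4 mesh


def tile_coords(tile_id: int, width: int = MESH_WIDTH) -> tuple:
    """Map a Tile-id to (row, column), assuming row-major numbering."""
    return divmod(tile_id, width)


def hop_count(src: int, dst: int, width: int = MESH_WIDTH) -> int:
    """Minimum number of mesh links between two tiles (Manhattan distance)."""
    src_row, src_col = tile_coords(src, width)
    dst_row, dst_col = tile_coords(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)
```

Under these assumptions, a request from tile 0 to the L2 bank in tile 15 crosses six links, whereas a request to the local bank crosses none, which is the kind of core-dependent access time that characterizes a NUCA organization.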

To reduce cache power, we reduce the number of cache banks. However, shutting down cache banks arbitrarily may increase the miss rate, which degrades overall system performance. To limit this degradation, we redirect the addresses of shut-down banks to the powered-on banks. The address redirection strategy is shown in figure 2. In this figure, four banks (numbered 12 to 15) are powered off, and the corresponding addresses are mapped to the banks numbered 8 to 11. When a new request arrives for bank 13, the redirection mechanism sends the request to its new location in bank 9.

Figure 2. Address redirection strategy when four banks are powered off. Their addresses are redirected to the powered-on banks, as shown by the arrows. The powered-off banks, along with the links they use, are shown by dotted lines.
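
The redirection for the 12-bank case can be written as a small lookup table. The sketch below is a minimal illustration, assuming that each powered-off bank is paired with one powered-on bank in order (12 to 8, 13 to 9, and so on), which matches the bank 13 to bank 9 example; the function and constant names are hypothetical.

```python
def build_redirect_map(powered_off, targets) -> dict:
    """Pair each powered-off bank with the powered-on bank that takes over."""
    powered_off, targets = list(powered_off), list(targets)
    if len(powered_off) != len(targets):
        raise ValueError("each powered-off bank needs exactly one target bank")
    if set(powered_off) & set(targets):
        raise ValueError("a target bank must itself be powered on")
    return dict(zip(powered_off, targets))


def resolve_bank(bank: int, redirect_map: dict) -> int:
    """Return the bank that actually serves a request addressed to `bank`."""
    return redirect_map.get(bank, bank)


# Configuration of figure 2: banks 12 to 15 are off and map to banks 8 to 11.
REDIRECT_12_BANKS = build_redirect_map(range(12, 16), range(8, 12))
```

With this table, `resolve_bank(13, REDIRECT_12_BANKS)` returns 9, and requests to powered-on banks such as bank 5 are left unchanged.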

For the evaluation, we compare the following configurations:

1) 16 banks vs. 12 banks
2) 16 banks vs. 8 banks

Bank shut-down reduces cache power, but it also reduces the cache size. A smaller cache increases the number of capacity and conflict misses, which degrades system performance. Our analysis addresses this trade-off by indicating a suitable cache size. We use the full cache initially and then gradually reduce the cache size by shutting down cache banks. By running selected benchmark programs, we give a detailed analysis of system behavior in terms of power consumption and performance for the different cache configurations.
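
Because the bank size is fixed, the L2 capacity of each compared configuration follows directly from the number of powered-on banks. The short sketch below works this out for the 256 KB bank size of Table I; the function name is hypothetical.

```python
BANK_SIZE_KB = 256  # fixed L2 bank size, as listed in Table I


def l2_capacity_mb(active_banks: int, bank_size_kb: int = BANK_SIZE_KB) -> float:
    """Total L2 capacity in MB when `active_banks` banks are powered on."""
    return active_banks * bank_size_kb / 1024


for banks in (16, 12, 8):
    print(f"{banks} banks: {l2_capacity_mb(banks)} MB")
```

The three configurations therefore correspond to 4 MB, 3 MB and 2 MB of L2 capacity respectively.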

### TABLE I. CACTI CONFIGURATIONS

<table>
<thead>
<tr>
<th>Cache Parameters</th>
<th>Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cache Level</td>
<td>L2</td>
</tr>
<tr>
<td>Size of a L2 Bank</td>
<td>256 KB</td>
</tr>
<tr>
<td>Block Size</td>
<td>64 Bytes</td>
</tr>
<tr>
<td>Technology used</td>
<td>32nm</td>
</tr>
<tr>
<td>Associativity</td>
<td>8</td>
</tr>
<tr>
<td>Cache Model</td>
<td>NUCA</td>
</tr>
<tr>
<td>Operating Temperature</td>
<td>340 K</td>
</tr>
<tr>
<td>Actual Cache Size</td>
<td>4 MB</td>
</tr>
</tbody>
</table>

### TABLE II. SYSTEM PARAMETERS

<table>
<thead>
<tr>
<th>Components</th>
<th>Parameters</th>
</tr>
</thead>
<tbody>
<tr>
<td>No. of Tiles</td>
<td>16</td>
</tr>
<tr>
<td>L1 I/D Cache</td>
<td>UltraSPARCIII+</td>
</tr>
<tr>
<td>L2 Cache bank</td>
<td>64KB, 4-way</td>
</tr>
<tr>
<td>Memory bank</td>
<td>256KB, 4-way/8-way</td>
</tr>
<tr>
<td>CMP-VR/RCMP-VR: reserveways per set(R)</td>
<td>50%</td>
</tr>
</tbody>
</table>

### TABLE III. NETWORK PARAMETERS

<table>
<thead>
<tr>
<th>Network Configurations</th>
<th>Parameters</th>
</tr>
</thead>
<tbody>
<tr>
<td>Flit Size</td>
<td>16 bytes</td>
</tr>
<tr>
<td>Buffer Size</td>
<td>4</td>
</tr>
<tr>
<td>Pipeline Stage</td>
<td>5-stage</td>
</tr>
<tr>
<td>VCs per Virtual Network</td>
<td>4</td>
</tr>
<tr>
<td>Number of Virtual Networks</td>
<td>5</td>
</tr>
</tbody>
</table>

### IV. EXPERIMENTAL EVALUATION

#### A. Experimental setup

For evaluation, simulations are performed by running benchmarks on the multi-core simulator GEMS [13] together with SIMICS [14], a full-system functional simulator. GEMS includes Ruby, a timing simulator of the multiprocessor memory system. The detailed configuration of the processor, cache memory and main memory used for the experiments is given in Table II. The following multi-threaded benchmarks from the PARSEC suite [12] are used for the simulations: vips (vips), blackscholes (black), fluidanimate (fluid) and bodytrack (body). The tiled CMP used in the simulations has 16 tiles, as shown in figure 1. The L2 cache size is 4 MB. The L2 is divided into 16 banks, with one bank located in each tile, so the size of each L2 bank is 4 MB / 16 = 256 KB. The remaining configuration details required for our experiments are also given in Table II. Apart from the system parameters, we need a set of network parameters, which are given in Table III.

For the power calculation, we then use CACTI 6.5. The dynamic and static energy for the L2 cache configurations are computed as described in the previous section.

### B. Result

Following the above experimental setup, we ran our simulations in Simics and CACTI 6.5. The results are normalized to the baseline design and shown in figures 3-5. Figures 3(a) and (b) show that, with 12 L2 cache banks, three benchmark programs give better results than our baseline (i.e., 16 L2 cache banks) in both dynamic and static energy consumption. Among the four benchmark programs, black does not show any improvement. Reducing the number of cache banks decreases the on-chip active area, which reduces cache leakage power consumption. This is evident from the savings in static power, which range from 16% to 34% with an average of 24% for the benchmarks that benefit from our proposal. The benchmark black gives results opposite to our prediction. In this case, the number of on-chip cache accesses increases as the cache size is reduced. The increased on-chip cache accesses raise the chip temperature, which increases the static power consumption of the cache. From this result we conclude that the working set size of this benchmark is much larger, and hence its cache capacity should not be reduced.

Next, we further decrease the cache size to 8 banks. Figure 4(b) shows the total energy consumption. Apart from black, all the remaining benchmarks show significant energy savings. For black, the large amount of data movement is the main reason for the increase in power consumption.
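
The normalization against the baseline, and the percentage reductions derived from it, can be expressed as follows. This is a sketch with hypothetical function names; it assumes that a reduced configuration is always compared with the 16-bank baseline of the same benchmark.

```python
def normalize(value: float, baseline: float) -> float:
    """Express a measurement relative to the 16-bank baseline (baseline = 1.0)."""
    if baseline == 0:
        raise ValueError("baseline measurement must be non-zero")
    return value / baseline


def percent_reduction(value: float, baseline: float) -> float:
    """Percentage saved relative to the baseline; negative means an increase."""
    return (1.0 - normalize(value, baseline)) * 100.0
```

A reduced configuration that consumes three quarters of the baseline energy has a normalized value of 0.75 and a reduction of 25%. A configuration that consumes more than the baseline, as black does, yields a negative reduction.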

### TABLE IV. PERCENTAGE REDUCTION IN POWER AND INCREASE IN CPI

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Comparison between 16 L2 banks and 12 L2 banks</th>
</tr>
<tr>
<th></th>
<th>Dynamic Energy</th>
</tr>
</thead>
<tbody>
<tr>
<td>vips</td>
<td>1.5</td>
</tr>
<tr>
<td>black</td>
<td>-26.78</td>
</tr>
<tr>
<td>fluid</td>
<td>16</td>
</tr>
<tr>
<td>body</td>
<td>1.27</td>
</tr>
<tr>
<td>Average improvement</td>
<td>3.17</td>
</tr>
</tbody>
</table>

**Effect of energy savings on performance:**

As the cache size is reduced, the number of conflict and capacity misses may increase. These additional misses increase memory stall cycles, which degrades system performance. From figure 5, which plots cycles-per-instruction (CPI) for each benchmark, we conclude that reducing the number of L2 cache banks from 16 to 12 results in a slight reduction in performance. As is evident, the benchmarks that benefit from energy savings have to compromise on performance. However, the degradation is not large: on average (cf. Table IV), CPI increases by 2.23%, excluding the program black.

Finally, our conclusions on power consumption and system performance are summarized in Table IV. We average the available results to draw a concrete conclusion. The average values show that reducing the L2 cache to 12 banks is more beneficial than reducing it to 8 banks for all the power values and for performance. Excessive reduction in cache size increases capacity and conflict misses, which results in heavy data movement to and from the cache. This data movement increases the dynamic cache power along with memory stall cycles, which affects system performance. Heavy data movement in the on-chip L2 cache also raises the chip temperature, which increases the static power consumption of the on-chip L2 cache. Using 12 L2 cache banks gives a significant improvement in power consumption over our baseline, with negligible degradation in performance.

ACKNOWLEDGMENT

We wish to acknowledge the Department of Electronics & Information Technology (DeitY), Ministry of Communications & IT, Government of India, for the financial assistance provided for this work.

REFERENCES


general execution-driven multiprocessor simulator (gems) toolset,”