Secondary Bus Performance in Reducing Cache Writeback Latency

by Rakshith Thambehalli Venkatesh

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science
Auburn, Alabama
May 9, 2011

Keywords: Cache Writeback, System Bus, Queuing Delay, Processor Performance
Copyright 2011 by Rakshith Thambehalli Venkatesh

Approved by
Sanjeev Baskiyar, Co-Chair, Associate Professor of Computer Science and Software Engineering
Vishwani D. Agrawal, Co-Chair, James J. Danaher Professor of Electrical and Computer Engineering
Weikuan Yu, Assistant Professor of Computer Science and Software Engineering

Abstract

For single as well as multi core designs, effective strategies to minimize cache access latencies have been proposed by a number of researchers over the last decade. Such designs include Miss Status Holding Registers, Victim Buffers, Eager and Lazy Writebacks, and Cache Prefetching. However, write-buffer stalls remain a bottleneck in real-time memory accesses. To alleviate this problem, the Secondary Bus Architecture was developed at Auburn. The secondary bus connects the write back buffer to the main memory via an independent secondary bus controller to retire dirty cache lines to memory. The write back traffic is only about 25-30% of the total traffic between the last level of cache and memory and is intermittent compared to read requests. Therefore, a narrow 8-bit secondary bus was used in the implementation. The secondary bus controller identifies idle main bus cycles by snooping on the main bus control lines. These idle cycles are used to retire write back buffer entries to the main memory. In this research, we evaluated the effectiveness of the secondary bus in retiring cache write-backs to memory using a series of extensive, rigorous experiments run on the computers of the Alabama Supercomputer Center using Sim-alpha and the SPEC CPU 2006 benchmarks. The Sim-alpha simulator was used for analyzing the architecture since it incorporates a well defined memory hierarchy. The SPEC CPU 2006 programs are both CPU and memory intensive and thus were ideal candidates for our evaluations. The I/O injections used a normal traffic distribution with DMA as well as the newer Direct Cache Injection mechanism. We observed performance improvements of up to 35% over the base architecture (i.e., one without a secondary bus) in the presence of I/O traffic on the main bus and 17% in the absence of any I/O traffic. Furthermore, queuing delays on the main bus were observed to reduce drastically. In comparisons with Eager Writeback, a strategy that is popular in many contemporary cache designs, the secondary bus architecture was found to be much superior in performance.

Acknowledgements

My years at Auburn University as a graduate student have been excellent mainly because of a great academic course structure and the emphasis on research. This setup wouldn't have been possible without the excellent faculty and amazing support staff at Auburn University who encourage innovation and provide exceptional classroom, library, laboratory and also recreational facilities. I would like to thank Auburn's faculty and staff members who have helped me amass a great amount of knowledge, hone my skills and graduate with a Master's degree. I would like to thank Professor Sanjeev Baskiyar for supervising my thesis. I also thank him for providing funding for the master's thesis research for two years with a DARPA/AFRL grant.
His availability for discussion and advice regarding my research work as well as my career choices is highly appreciated. I would also like to thank him for allowing me to go on an internship in the summer of 2010 and gain valuable experience. I appreciate John O'Farrell's help during the course of the simulations using Sim-alpha. Simulations were carried out together with John, and the code changes and the results were unified. He also pointed out the approach of simpoints for speeding up the simulations. Professor Vishwani Agrawal has taught quite a few of my courses and every bit of information I gathered from his classes has been extremely helpful. I would also like to thank him for recommending me for a summer internship at Texas Instruments (TI) in India. The hardware design and test concepts I eventually learnt at TI are invaluable. Professor Nelson's course on EDA tools, Professor Singh's course on VLSI design and testing and Professor Reeves's class on Digital Signal Processing deserve special mention as they were highly informative and helped me identify my interests in electrical engineering.
I am grateful to my friends here at Auburn who made my sojourn an eventful and memorable one. I would like to specially thank my friends Santosh Kulkarni and Pratap Prasad for their advice and support during tough times and for gleefully sharing my good times. I am thankful to Balapradeep Gadamsetti for being a great buddy. I cannot end without thanking Dr. Dave Sree and his family for giving me glimpses of India at Auburn. My experiences at Wipro Technologies Ltd have helped me approach the challenges at graduate school in a more mature fashion. I am thankful to my teammates at Wipro for creating a great team based environment. The internship at TI further enhanced and polished my knowledge in the area of VLSI after my coursework at Auburn laid the foundation. I am grateful to my manager and my mentor at TI for providing an excellent research environment. No amount of gratitude can ever be enough towards my parents Venkatesh T. L. and Sathyavathi G. S. and my brother Rohith, who have been there for me every moment. I wouldn't have achieved anything without their constant love and emotional support.

Table of Contents
Abstract
Acknowledgements
List of Figures
List of Tables
List of Abbreviations
Chapter 1. Introduction
1.1 Caches and Memory Hierarchy
1.2 Bottlenecks and Tradeoffs in Cache Design
1.3 Problem Description
Chapter 2. Background on Processor Architecture
2.1 Uniprocessor and Multiprocessor Architectures
2.2 Processor System Bus
2.3 Input/Output techniques in modern day computers
2.4 Memory mapped I/O
2.5 Interrupt driven I/O
2.6 Direct Memory Access (DMA)
2.7 Direct Cache Access (DCA) and Cache Injection
2.8 Cache Writeback Strategies
Chapter 3. Prior Work on Memory Hierarchy Optimization
3.1 Write Buffers
3.2 Victim Buffers and Victim Caches
3.3 Cache Prefetching
3.4 Miss Status Holding Registers
3.5 Eager Writeback
3.6 Secondary Bus Architecture
Chapter 4. Secondary Bus Architecture
4.1 Design of the Secondary Bus
4.2 Design Issues
Chapter 5. Simulation Setup for Performance Evaluation
5.1 Sim-alpha Simulator
5.2 SPEC Benchmark Programs and Simpoints
5.3 Sim-alpha Architectural Configurations Used in Simulations
Chapter 6. Simulation Results and Observations
6.1 Queuing Delay on the Main System Bus
6.2 Cycles per Instruction
Chapter 7. Future Work
Chapter 8. Conclusion
Bibliography

List of Figures
Figure 1: Memory hierarchy in a computer
Figure 2: Memory access requests per 100 million instructions
Figure 3: A typical bus based computer architecture [10]
Figure 4: A simple multi core processor architecture
Figure 5: DMA flow diagram
Figure 6: Direct cache injection based I/O
Figure 7: Write buffer example with write-back technique in a three level memory hierarchy
Figure 8: Processor Cache Architecture with Write Buffers
Figure 9: Cache Prefetching Architecture [25]
Figure 10: Miss Handling Architecture for multi bank caches
Figure 11: Architecture with the Secondary bus
Figure 12: Microarchitecture of the Alpha 21264 processor [35]
Figure 13: Probability density function used for I/O injection, Mean = 100 cycles and SD = 60 cycles
Figure 14: Percentage reduction in queuing delays across different I/O rates
Figure 15: Total number of queued cycles during an I/O traffic rate of 1.8 GB/Sec
Figure 16: Total number of queued cycles during an I/O traffic rate of 1.2 GB/Sec
Figure 17: Percentage improvement in processor throughput with the secondary bus
Figure 18: Comparison between I/O techniques and Eager Writeback for I/O rate of 1.2 GB/Sec
Figure 19: Comparison between I/O techniques and Eager Writeback for I/O rate of 1.8 GB/Sec
Figure 20: CPI percentage improvement for the GemsFDTD program
Figure 21: Front side bus architecture in Intel's multi core processors
Figure 25: Dedicated FSB for each dual core processor
List of Tables
Table 1: Main trade-offs for a bus design [10]
Table 2: Contemporary I/O bus bandwidths
Table 3: Sim-alpha specifications

List of Abbreviations
1. SRAM - Static Random Access Memory
2. DRAM - Dynamic Random Access Memory
3. CPU - Central Processing Unit
4. L1/L2 - Level 1/Level 2 Cache
5. LRU - Least Recently Used
6. SPEC - Standard Performance Evaluation Corporation
7. I/O - Input/Output
8. DMA - Direct Memory Access
9. GPU - Graphics Processing Unit
10. ILP - Instruction Level Parallelism
11. ALU - Arithmetic and Logic Unit
12. PC - Personal Computer
13. DCA - Direct Cache Access
14. NIC - Network Interface Controller
15. IOC - I/O Controller
16. MC - Memory Controller
17. MSHR - Miss Status Holding Register
18. LLC - Last Level Cache
19. SDRAM - Synchronous DRAM

Chapter 1. Introduction
Computer designs and related technologies have made incredible progress in the last half century. There has been a constant scaling up in speed and scaling down in size every generation. This can be strongly attributed to advances in semiconductor devices and also to innovative designs at the architectural level. It has also given rise to the notion that smaller is faster. Memory and computational logic are analogous to cement and water when it comes to the construction of a computer. There are three digital circuit implementation factors critical to the design of a state-of-the-art computer, and they scale quickly but at different rates relative to each other: integrated circuit logic technologies, semiconductor memories and magnetic disk technologies form the three main components of a computer, in decreasing order of speed. Circuit logic density scaling has followed Moore's law [1], with the transistor count doubling every 1.5 years. Memory modules such as register files, Static Random Access Memories (SRAMs, present on chip) and Dynamic RAMs (present off chip) have also increased in capacity at the same rate due to transistor device scaling. But large interconnect capacitances have resulted in slower access speeds for larger memory units. This explains the speed gap between the memory devices and the logic circuitry. Disk density has been improving by 50% per year, almost quadrupling in three years. Since disks have mechanical parts, they can never match the speeds of the RAMs. Hence they are mainly used for mass storage. As a consequence of this varied rate of scaling, the speed gap between memory devices and computational logic has been widening, thereby creating several performance bottlenecks. Memory hierarchies are used in order to bridge this gap between these three component levels and ease the constraints.
This chapter introduces the typical cache and memory setup in modern day processors and the performance issues associated with them. The chapter concludes with a description of the problem addressed in this work.

1.1 Caches and Memory Hierarchy
A cache is simply a small memory unit that stores data so that future requests for that data can be serviced faster. The keyword here is small, because a smaller memory structure has lower access latencies.
A typical memory hierarchy in present day processors, both single and multi core, starts with the register files within the processor core, gradually moves towards larger but slower memory levels comprising expensive cache memories, and ends in either disk or network storage elements. The main motive behind this arrangement is to bridge the speed gap between the Central Processing Unit (CPU) core and the slower memory devices. This is clearly illustrated in Figure 1, reproduced from [2].
Figure 1: Memory hierarchy in a computer
A successful cache access is termed a hit and a failure is called a miss. This applies to both reads and writes. A read hit occurs when the requested block is present in the cache, and a miss occurs when it isn't, prompting an access to the next level of cache with greater access latency. Similarly, a write miss occurs when the modified data cannot be written to the cache because the corresponding data block is absent from it. As an example, consider a three-level memory hierarchy comprising a level 1 (L1) cache, an L2 cache and the memory. If the probability of a hit in the level-i memory structure is h_i and T_i is the access time in cycles for the corresponding level, the average memory access time in cycles is given by the following expression and provides a good performance measure:

T_Average = T_1 + (1 - h_1) [ T_2 + (1 - h_2) T_3 ]        (1)

Miss rate (1 - h_i) reduction is the primary motive behind all cache based designs. Designs that do not address large miss rates essentially lead to more program stalls and a smaller processor throughput even with pipelined and superscalar architectures. The "temporal" and "spatial" locality of cache blocks is used for mitigating the miss rates in caches. Programs vary widely in terms of workloads, algorithmic complexity and size. Hence, it is hard to design a cache hierarchy that suits every program perfectly. However, we can always design one for optimal performance requirements by analyzing the tradeoffs involved. Memory hierarchy design is simpler for a set of applications that have similar and fixed workloads.

1.2 Bottlenecks and Tradeoffs in Cache Design
Memory hierarchies are very much required for cushioning the impact of access latencies due to slower devices, but a certain amount of tradeoff is required for an optimal design. Reducing the number of cache misses has been the primary goal of most designers as it addresses both miss rate and miss penalty. Some of the basic cache design methodologies are listed below:
1. As seen from equation 1, the miss rate greatly affects the average memory access time. To reduce the miss rate, caches with larger block sizes are used. As a drawback, a larger block size increases the miss penalty beyond a certain optimal value, since it consumes more cycles to transfer a block from the memory.
2. Larger caches certainly help in reducing the miss rate, but the miss penalty increases as it takes more cycles to access a larger memory device. Caches with multiple banks are a good option if the data sets of the programs are large.
3. Using a higher associativity cache also helps in reducing the miss rate, since it reduces the number of conflict misses. But the hardware complexity of the data retrieval circuitry increases because we now have to select between multiple "ways".
4. Processors typically use two levels of caches. By increasing the number of levels to three, we can get some speed-up.
5. Miss penalty can be reduced by giving more priority to reads than to writes. In a setup with write buffers, on a miss we can check the buffer for the requested block. Writing the block to the memory and then reading it back would add to the miss penalty.
The tradeoffs between hardware overhead and speed with caches are quite clear now. In addition to these, handling the write traffic is a major task for the cache controller. In programs involving large workloads, almost every cache miss results in an eviction, as there is rarely spare space in the cache for the incoming block. Write buffers are quintessential to every cache for absorbing the write latency (discussed in later chapters), and they are not foolproof either. Write buffer induced processor stalls can be attributed to the following three reasons [3]:
1. Full stalls occur when the buffer is full. The processor has to retire the entries in the buffer to make space for the replaced cache entry, causing stalls as the requested block has to wait.
2. A read-access stall occurs when a read miss in the L2 cache encounters a delay in reading from the memory because the write-back buffer is currently writing to memory.
3. A read-hazard stall occurs when an L2 read miss finds its data in the write-back buffer. However, this hazard can be avoided if write-back buffer entries and L2 cache entries can be swapped.
There are several strategies for cache write handling. Designs explained in [3], [5] and [6] have shown that write buffers contribute significantly in mitigating stalls. Jouppi in [7] and [8] proposed the victim buffer for handling conflict misses that mainly occur with direct mapped L1 caches. Chu and Gottipati [3] examine various factors to be considered for write buffer performance evaluation in their work. They find that even a single word of buffering yields a substantial gain in performance. Write buffer strategies are analyzed in depth in [9]. Having a deeper buffer provides more write merging opportunities and also reduces conflict misses. A read bypassing strategy, mentioned earlier, helps in holding the write data until the read takes place. An eager writeback strategy helps in balancing the accesses on the main system bus to reduce delays due to bus contention by committing Least Recently Used (LRU) blocks to memory earlier than the expected time. Handling the write traffic in such a way that there is complete concurrency between reads and writes acts as the upper bound on any improvement that can be achieved by addressing the write buffer issues.

1.3 Problem Description
Write data traffic to memory constitutes up to 30% of the total communication traffic to the memory in most modern computer configurations and with many existing software programs. The results shown in Figure 2 convey the same with some of the Standard Performance Evaluation Corporation (SPEC) CPU benchmarks for a typical uniprocessor architecture operating at 3 GHz and having a 2 MB on-chip L2 cache, using Sim-alpha, an Alpha 21264 processor simulator. In Figure 2, our simulations with SPEC benchmarks show that as much as 30% of the traffic between the CPU and memory consists of writes.
Figure 2: Memory access requests per 100 million instructions
Queued cycles accumulate per request on the main system bus as a result of the access conflict between read requests and write commits to the memory whenever the write buffer becomes full.
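To make this metric concrete, the following C sketch shows one simple way a bus model could accumulate queued cycles and report the average per request. The structure and names (bus_model_t, queued_cycles and so on) are illustrative assumptions for this discussion, not code taken from Sim-alpha.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative bus-model counters (names are hypothetical, not Sim-alpha's). */
typedef struct {
    bool     busy;            /* main bus currently servicing a transaction   */
    uint64_t requests;        /* total requests that arrived at the bus       */
    uint64_t queued_cycles;   /* cycles requests spent waiting for the bus    */
} bus_model_t;

/* Record the arrival of a new read/write request at the bus. */
void bus_record_request(bus_model_t *bus)
{
    bus->requests++;
}

/* Called once per simulated cycle with the number of requests still waiting. */
void bus_tick(bus_model_t *bus, unsigned waiting_requests)
{
    if (bus->busy && waiting_requests > 0)
        bus->queued_cycles += waiting_requests;   /* each waiter loses a cycle */
}

double avg_queued_cycles_per_request(const bus_model_t *bus)
{
    return bus->requests ? (double)bus->queued_cycles / (double)bus->requests
                         : 0.0;
}
```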
Write intensive benchmarks have shown that write buffer induced stalls can add significant latency to read misses in the last level cache. Research work in the area of write buffer analysis is minimal, but the works in [4], [5], [6] and [9] agree that write buffers do contribute to processor stalls, a case of the solution itself becoming a problem. The average number of cycles required per instruction execution is significantly lower than the average number of queued cycles per request on the main bus. This does not mean that program execution is entirely blocked because of the queuing delay (design features such as out-of-order execution, speculative execution and non-blocking caches ensure this does not happen), but there is certainly a major impact on the program execution time. This indicates that those instructions that have to endure the penalty of L2 cache misses suffer a large increase in execution time because of the conflict between the incoming reads and the outgoing write traffic.
The above mentioned setbacks with write buffers are addressed by using a hardware enhancement and a write-back strategy to support the main system bus. This architecture, along with I/O techniques such as direct cache injection, memory mapped I/O and interrupt driven I/O, which communicate directly with the CPU cache (as opposed to DMA), can make the best use of the main bus through efficient memory bandwidth utilization. The proposal is to have a dedicated bus, smaller in bandwidth than the main bus, to handle cache writes to the memory. Having a separate bus also allows I/O communications and write-backs to happen in parallel in the case of the above mentioned techniques. The benefit of using a secondary bus to handle all of the cache writes to the memory has been studied in this thesis on some of the latest SPEC benchmark programs. Serial bus speeds are shown to be enough to handle the write traffic, so a serial link can be used as the secondary bus. A secondary bus controller that snoops for main bus traffic and determines the cycles best suited for a writeback to the memory is the hardware addition required to allow the link to function in the presence of I/O traffic and read requests. The main idea is to have write buffer entries retire ahead of time and only during those cycles where the main system bus is free from either a communication with the memory or an I/O device. One such arrangement has been extensively simulated in this work for different I/O data rates and also for Direct Memory Access (DMA) and direct I/O transfer techniques.

Chapter 2. Background on Processor Architecture
Processors are classified into various categories based on their architectural design. They can be categorized based on instruction set complexity, number of cores, internal register length and number of threads per core, to name a few. One thing that is common to all these designs is the caching of data for faster access. Almost every processor has multiple levels of cache, and they use a common system bus to communicate with the memory and other peripheral devices like Graphics Processing Units (GPUs), I/O devices and network interfaces. This chapter sheds light on a typical uniprocessor architecture that is later used for simulations in this work and also provides an insight into the concept of multi core processors.
2.1 Uniprocessor and Multiprocessor Architectures
A typical uniprocessor is made up of only one processing core, which can in turn have a pipelined and/or superscalar architecture that makes use of instruction level parallelism (ILP). The former executes almost one instruction per cycle by having pipeline registers store the control information, while a superscalar architecture incorporates multiple processing resources to enable two or more instructions to execute in parallel. Many present day uniprocessors incorporate both pipelining and superscalar features to extract the best possible performance. Also, the architecture that exploits ILP in the best possible way is the one that guarantees a good instruction throughput. Several hardware-software techniques are used to make this happen. Resources can be register files, ALUs, branch predictors and multipliers, to name a few. Resource redundancy can help us run multiple threads in parallel on the processor, thus resulting in faster program execution. A typical uniprocessor architecture from [10] is shown in Figure 3.
Figure 3: A typical bus based computer architecture [10]
Multi-core architectures are in vogue today because the scalability of uniprocessors has reached its limits and researchers have leveraged the concepts of parallel programming. This has led to the use of several processor cores of reasonable speeds to perform the task and make use of the multiple threads in programs in a more efficient manner. These architectures exploit both instruction level and thread level parallelism. Figure 4 shows a simple multi-core processor block diagram. It is a common design practice for each processor core to have a local L1 cache and a shared L2. An interconnection network between the L1 and L2 caches handles data transfers between the two. A cache coherency protocol checks for inconsistencies between the two levels and the memory. The L2 cache can then be connected to the main memory through a main system bus.
Figure 4: A simple multi core processor architecture

2.2 Processor System Bus
In a computer system, the various subsystems have communication interfaces to each other. For instance, the CPU needs to communicate with the memory and also with the I/O devices, because the executing program comprises both memory and I/O bound instructions. This communication is commonly done using a bus. The bus serves as a shared communication link between the subsystems. The two major advantages of a bus based system are low implementation cost and versatility. By defining a single interconnection scheme, new devices can be added easily and peripherals can even be moved between computer systems that use a common bus. The cost of a bus is low because a single set of wires is shared among multiple devices. One major drawback with a bus is that it creates a communication bottleneck, especially when there is I/O traffic on the bus along with the regular memory traffic. In server systems where I/O is frequent, designing a bus system capable of meeting the demands of the processor is a major challenge.
One of the main challenges designers face with a bus based design is that the maximum bus speed is largely limited by physical factors like the length of the bus and the bus loading (the number of devices on the bus). The desire for high I/O rates and high I/O throughput can also lead to conflicting design requirements. Buses are traditionally grouped into CPU-memory buses (the main system bus) and I/O buses. I/O buses may be lengthy, may have many types of devices connected to them, have a wide range in the data bandwidth of the devices connected to them, and normally follow a bus standard. CPU-memory buses, on the other hand, are shorter and faster. Several bus bridges are used to connect buses of different bandwidth and speed specifications. Ultimately, all of the I/O buses connect to the main system bus as shown in Figure 3. Any communication over the bus happens between a master, who initiates the transaction, and a slave, who services it accordingly. A situation with multiple masters on the bus calls for some kind of arbitration mechanism.
Table 1 illustrates the cost and performance trade-offs that need to be looked into while choosing a bus design. One thing that is clear from the table is that higher performance comes at a cost. The first four rows are self explanatory. The table also covers split transactions and how they aid performance at a higher cost. The idea behind split transactions is to divide bus events into requests and replies, so that the bus bandwidth can be utilized in the time between the request and the reply.
Table 1: Main trade-offs for a bus design [10].
Option | High Performance | Low Cost
Bus width | Separate address and data lines. | Multiplexed address and data lines.
Data width | Wider is faster. | Narrower is cheaper.
Transfer size | Multiple words have less bus overhead. | Single-word transfer is simpler.
Bus masters | Multiple entities. | Single master requires no arbitration.
Split transaction | Yes - separate request and reply packets get higher bandwidth. | No - a continuous connection is cheaper and has lower latency.
Clocking | Synchronous | Asynchronous
The final item in Table 1 is about bus clocking and concerns whether a bus is synchronous or asynchronous. If a bus is synchronous, it includes a clock in the control lines and a fixed protocol for sending address and data relative to the clock. Since little or no logic is needed to decide what to do next, these buses can be both fast and inexpensive. Major disadvantages include clock skew problems, which limit the length of the bus, and the fact that a fixed clock rate means everything on the bus must run at the same pace. Asynchronous buses, on the other hand, are not clocked. Instead, self-timed, handshaking protocols are used between the bus sender and receiver. It is much easier to accommodate a variety of devices and to lengthen the bus without worrying about clock skew. This comes at the cost of additional handshaking traffic on the bus, which can cause large queuing delays for other traffic. It is therefore not surprising to see a synchronous CPU-memory bus alongside asynchronous I/O buses in a computer architecture.

2.3 Input/Output techniques in modern day computers
As explained in the previous chapters, I/O refers to the exchange of data between the CPU and the peripheral devices. Traditional techniques as well as some current techniques under research are discussed in this section. Table 2 lists the peak bandwidths which some of the fastest I/O buses are capable of.
Though these are just the maximum possible numbers and the I/O traffic may not always attain such high rates, they give an insight into the potential traffic that can be associated with I/O. One more point to note is that most of this traffic transits via the main system bus (the CPU-memory bus) before reaching its destination (see Figure 3). This destination, for all practical purposes, is either the CPU or the memory.
Table 2: Contemporary I/O bus bandwidths.
Bus Name | Peak Bandwidth (GB/Sec)
SATA 3.0 [11] | 0.750
Light Peak [12] | 1.25
USB 3.0 [13] | 6.25
PCI Express 2.0 [14] | 2 - 16
AGP [15] | 2.133
QPI [16] | 19.2 - 25.6
HyperTransport [17] | 22.4 - 51.2
10 Gigabit Ethernet (10GBASE-X) [18] | 1.25
40 Gigabit Ethernet (40GBASE-X) | 5
100 Gigabit Ethernet (100GBASE-X) | 12.5
Infiniband (SDR, 12X) [14] | 3
Infiniband (DDR, 12X) | 6
Infiniband (QDR, 12X) | 12

2.4 Memory mapped I/O
In this type of I/O, a peripheral device is connected to the CPU's address and data lines exactly like memory through some mapping, so whenever the CPU reads or writes the address associated with the peripheral device, the CPU transfers data to or from the device. This mechanism has several benefits and only a few disadvantages. The prime advantage of a memory-mapped I/O subsystem is that the CPU can use any instruction that accesses memory to transfer data between the CPU and a memory-mapped I/O device. The MOV instruction is the one most commonly used to send and receive data from a memory-mapped I/O device, but any instruction that reads or writes data in memory is also legal.
A major disadvantage of memory-mapped I/O devices is that they consume addresses in the memory map. Generally, the minimum amount of space that can be allocated to a peripheral (or block of related peripherals) is a four kilobyte page. Therefore, a few independent peripherals can wind up consuming a fair amount of the physical address space. Fortunately, a typical Personal Computer (PC) has only a couple dozen such devices, so this isn't much of a problem. However, some devices, like video cards, consume a large chunk of the address space (e.g., some video cards have 32 megabytes of on-board memory that they map into the memory address space).
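As a concrete illustration of this mechanism, the short C sketch below accesses a hypothetical memory-mapped device register block through an ordinary pointer. The base address, register layout and status bit are invented for illustration and do not correspond to any real device.

```c
#include <stdint.h>

/* Hypothetical device register block mapped at a fixed physical address.
 * On real hardware this address comes from bus enumeration or a datasheet. */
#define UART_BASE 0xFE001000u

typedef struct {
    volatile uint32_t data;     /* write: transmit a byte                   */
    volatile uint32_t status;   /* bit 0 (assumed): transmitter ready       */
} uart_regs_t;

void mmio_putc(char c)
{
    uart_regs_t *uart = (uart_regs_t *)(uintptr_t)UART_BASE;

    while ((uart->status & 0x1u) == 0)   /* ordinary load instruction       */
        ;                                /* spin until the device is ready  */
    uart->data = (uint32_t)c;            /* ordinary store reaches the device */
}
```

Because the registers occupy ordinary addresses, the compiler emits the same load and store (MOV-class) instructions it would for memory; the volatile qualifier only keeps it from caching or reordering those accesses.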
2.5 Interrupt driven I/O
In the case of programmed I/O, the CPU busy-waits for an I/O opportunity and as a result remains tied up with that I/O operation until it is completed. This disadvantage can be overcome by means of interrupt driven I/O. In programmed I/O the CPU itself checks for an I/O opportunity, but here the I/O controller interrupts the execution of the CPU whenever an I/O device wants to initiate a transaction. This way the CPU can perform other computations in the meantime and execute an interrupt service routine only when an I/O operation is required, which is quite an efficient technique. A priority scheme determines what happens in the case of simultaneous interrupts. A fixed priority scheme results in devices getting assigned priorities in a fixed order. This may result in some low priority devices not being serviced enough. A solution to this is to assign priorities in a rotational order. This scheme rotates the highest priority among all devices by shifting the priorities.

2.6 Direct Memory Access (DMA)
DMA technology provides special channels for the CPU and I/O devices to exchange I/O data, and the memory is used for buffering the I/O data. When the CPU wants to handle I/O data, it triggers DMA write operations that transfer the I/O data from the I/O devices to the memory. In the opposite direction, when the CPU writes data to I/O devices, DMA read operations (transferring I/O data from the memory to the I/O devices) are performed.
The data flow diagram for a DMA transaction over different levels of the memory hierarchy is shown in Figure 5 for the DMA produce - CPU consume direction, reproduced from [19].
Figure 5: DMA flow diagram
The processor, the memory and the DMA engine are involved in the interactions during a DMA operation. The interaction requires three data structures, namely the DMA buffer, the descriptor and the destination buffer, all residing in the main memory. To start off, the device driver creates a descriptor for a DMA buffer. The driver allocates a DMA buffer in the memory and initializes the descriptor with the DMA buffer's start address, size and status information. The driver informs the DMA engine of the descriptor's start address. The DMA engine then loads the descriptor's content from the memory. With the DMA buffer's start address and size information extracted from the descriptor, the DMA engine receives the data from the I/O device and writes the data to the DMA buffer. After all I/O data is stored in the DMA buffer, the owner status of the descriptor is modified to be the DMA engine. The DMA engine sends an interrupt to the processor to indicate the completion of the receiving operation. The driver handles the interrupt raised by the DMA engine and copies the received I/O data from the DMA buffer to the destination buffer. Then, it frees the DMA buffer. Processors adopt a snooping-cache scheme for maintaining the coherence of I/O data; accordingly, snoop requests are sent to the processor's data cache to invalidate those cache blocks that pertain to the I/O data under the DMA request. Consequently, when the CPU consumes the I/O data, compulsory misses take place and trigger memory read requests to the memory controller.

2.7 Direct Cache Access (DCA) and Cache Injection
In addition to the traditional techniques discussed above, DCA and cache injection are two other techniques that attempt to ease the memory bottleneck by letting the I/O device directly inject I/O data into the processor's cache. These techniques are producer driven, compared to the previously discussed techniques such as DMA, interrupt driven I/O and programmed I/O, which are consumer driven. Both are well suited for high data rate network I/O over Gigabit Ethernet. DCA [20] is basically a cache coherency optimization that delivers inbound data from a network interface controller (NIC) directly into processor caches, dramatically reducing stalls due to memory accesses of the descriptor, packet header and packet payload data structures. Another technique that is worth mentioning, because it is one of the assumptions in the simulations carried out in our work, is direct cache injection [21]. Cache injection addresses the continuing disparity between processor and memory speeds by placing data into a processor's cache directly from the I/O bus. This disparity adversely affects the performance of memory bound applications including certain scientific computations, encryption, image processing, and some graphics applications. As shown in Figure 6, reproduced from [21], the injection operation is first initiated by the NIC.
Unlike in DMA, where the next step is to write to the memory, step 2 allocates the incoming network data into the cache. If the processor uses this data promptly, there is no need to fetch the data from the memory.
Figure 6: Direct cache injection based I/O.

2.8 Cache Writeback Strategies
Cache writeback, as explained before, is the process of committing data blocks back to the memory via the system bus. Several write buffering and writeback techniques are used to ease the memory access latency after a cache miss. The most basic of them all are the "writeback" and the "writethrough" techniques, which are explained here.
a. Write Back
In this technique, the memory locations written are marked as dirty and are held in the cache until a read request evicts the line as a replacement. More often than not there is traffic towards the memory every time a cache read miss occurs, because some dirty cache line has to make way for the incoming datum. As a result, a read miss in a writeback cache requires two memory accesses: one to retrieve the needed datum, and one to write the replaced data from the cache to the backing store.
b. Write Through
When the system writes to a memory location that is currently held in the cache, it writes the new information both to the appropriate cache line and to the memory location itself at the same time. This type of caching provides worse performance than write-back, but is simpler to implement and has the advantage of internal consistency, because the cache is never out of sync with the memory the way it is with a write-back cache.
Writeback caches are more complex to implement than caches using writethrough. Both techniques, as discussed later, use some sort of buffering to absorb the impact of memory accesses. These methodologies are also used between the L1 and L2 caches. L1 caches usually comprise separate partitions for the instruction and data portions of the cache lines to aid faster instruction fetch rates. Since writes can happen only to data, a buffer is needed only for the data cache of the two.

Chapter 3. Prior Work on Memory Hierarchy Optimization
Cache accesses and the penalties associated with them have been targeted by many researchers over the past two decades. All the proposed cache write-back policies are aimed at minimizing the impact that a write to the next level would have on the processor pipeline. Consequently there have been a good number of innovative solutions such as Write Buffers, Victim Buffers, MSHRs, Eager retirement policies and Cache Prefetching. Most modern processors use a good mix of all of these strategies. This chapter discusses a few of them.

3.1 Write Buffers
As mentioned in the previous chapter, cache write techniques used in uniprocessor architectures involve either the "Write-Through" or the "Write-Back" policy [10]. In a Write-Through technique, the modified data lines are written to the cache as well as to the next lower level in the memory hierarchy. On the other hand, caches employing Write-Back usually mark the data line as "dirty" to imply that the cache line is inconsistent with the next level in the hierarchy, and do a write to the next level only when the line is evicted by another incoming block. The disadvantage of Write-Through is that the processor has to stall since the memory is accessed every time there is a write operation. Write-Back also creates processor stalls whenever a dirty line is evicted from the cache and has to be written to the memory to make space for the required line.
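The essential difference between the two policies lies in the path a store takes, which the simplified C sketch below summarizes for a single cache line. The field names and the memory_write interface are assumed placeholders, and replacement, allocation and coherence are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t tag;
    bool     valid;
    bool     dirty;          /* meaningful only for the write-back policy */
    uint8_t  data[64];
} cache_line_t;

/* Assumed backing-store interface (memory or the next cache level). */
void memory_write(uint64_t tag, unsigned off, uint8_t byte);

/* Write-through: the store updates the line and goes to memory immediately. */
void store_write_through(cache_line_t *line, unsigned off, uint8_t byte)
{
    line->data[off] = byte;
    memory_write(line->tag, off, byte);  /* every store pays the memory latency */
}

/* Write-back: the store only marks the line dirty; memory is updated later,
 * when the line is evicted (or retired early by a policy such as eager
 * writeback).                                                               */
void store_write_back(cache_line_t *line, unsigned off, uint8_t byte)
{
    line->data[off] = byte;
    line->dirty = true;                  /* deferred: written back on eviction */
}
```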
Using a "write buffer" between the caches and the memory, or the next lower level in the hierarchy, helps in reducing this bottleneck. This is one of the earliest solutions proposed for tackling cache coherency and the associated latencies in the memory hierarchy. Cache writes go directly into the buffer rather than into the next level, and since the buffer has access latencies similar to those of the cache, we benefit through reduced stalls. Write buffers can also induce CPU stalls at times. Listed below are a few problems associated with write buffers [3]:
1. Full stalls occur when the buffer is full and the store cannot merge.
2. A read-access stall occurs when a read miss in the L2 cache encounters a delay in reading from memory because the write-back buffer is currently writing to memory.
3. A load-hazard stall occurs when an L2 read miss finds its data in the write-back buffer. However, this hazard can be avoided if write-back buffer entries and L2 cache entries can be swapped.
There are certain occupancy based policies for retiring the buffer entries to the next level in the memory hierarchy. The buffer can retain a suitable number of entries for coalescing purposes, but can retire entries at the maximum possible rate when occupancy rises above a particular mark (number of valid entries in the buffer). Waiting until this mark before retiring means that sequential writes can achieve maximal coalescing. The most recently allocated entry cannot be retired until a new entry is allocated. We call the entry that triggers retirement the high-water mark and name the retirement policy according to this mark. For example, a retire-at-2 policy waits until 2 or more entries are valid in the buffer before starting the process. Read access stalls, on the other hand, can be reduced by using an eager writeback policy, which is discussed later in this chapter. Flushing the write buffer on every load miss solves the load hazard problem, but at substantial cost. Techniques such as flush-full, flush-partial and flush-item-only are alternative solutions to this problem. Flush-full flushes the entire write buffer when the miss hits in the buffer. Flush-partial saves some work by flushing entries in FIFO order only as far as necessary to purge the hit entry. Flush-item-only saves even more work by flushing only the hit entry. If a different entry is already being retired when the load hazard occurs, we assume this transaction completes first. Finally, read-from-WB allows the load miss to read its data directly from the write buffer without altering the buffer's contents, avoiding an access to the next level in the process [3], [9]. Deeper buffers also help in reducing buffer-full stalls by storing more burst writes, which are normally associated with data intensive streaming programs. A deeper buffer also supports the concept of lazy retirement. As mentioned before, more contents held in the buffer provide more write merging opportunities and also improve the chances of a hit in the buffer upon a read miss in the cache. Figure 7 illustrates the writeback policy with a buffer.
Figure 7: Write buffer example with write-back technique in a three level memory hierarchy.
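A high-water-mark retirement policy of the retire-at-N kind described above can be sketched in a few lines of C. This is a deliberately simplified model with assumed names (write_buffer_t, WB_ENTRIES); it ignores coalescing, entry swapping and flush handling, and it simply drains the buffer once the mark is reached.

```c
#include <stdbool.h>

#define WB_ENTRIES      8     /* write buffer depth (assumed)        */
#define HIGH_WATER_MARK 2     /* retire-at-2 policy from the text    */

typedef struct {
    int  valid_entries;       /* number of valid (dirty) entries     */
    bool retiring;            /* a retirement burst is in progress   */
} write_buffer_t;

/* Called each cycle: start draining once occupancy reaches the mark,
 * keep draining while entries remain.                               */
void wb_retire_policy(write_buffer_t *wb)
{
    if (wb->valid_entries >= HIGH_WATER_MARK)
        wb->retiring = true;              /* hit the high-water mark */

    if (wb->retiring && wb->valid_entries > 0)
        wb->valid_entries--;              /* one entry written to the next level */

    if (wb->valid_entries == 0)
        wb->retiring = false;             /* allow coalescing to resume */
}
```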
3.2 Victim Buffers and Victim Caches
In cache terminology, a "victim" cache block is one that is evicted upon a conflict cache miss. Many cache blocks get evicted in direct mapped caches during iterative function calls or context switches in a program. Since the probability of a cache block becoming a victim is much higher in direct mapped caches than in set associative ones, we need a buffering mechanism to hold these blocks, as they may be required again soon in the program. Misses in the cache that hit in the victim cache therefore provide a great chance to reduce miss penalty [7], [22], [23]. Experiments carried out on certain benchmark programs in [7] show that a small victim cache of 5 to 8 entries was enough to reduce the misses in a 1 to 4 KB first level cache by about 80%.
The term "victim buffer", on the other hand, is associated with a buffering mechanism that ensures write merging. Victim buffers are an advanced version of the write buffers discussed previously. A dirty cache line is buffered, and any subsequent modifications to the line during write misses are merged into the entry in the buffer itself. Victim buffers are typically made up of more entries than a victim cache. Another difference between the two is that victim caches catch both "dirty" and "non-dirty" victim lines, whereas a victim buffer is usually meant for modified or "dirty" cache lines [10]. The block diagram of a typical uniprocessor memory hierarchy can be seen in Figure 8. In a two cache arrangement, the L1 cache usually employs either a victim cache for support with a direct mapped cache, or a victim buffer in case a cache with higher associativity is used. The L1 employs the "write through" policy. The L2 employs a write buffer and uses a writeback policy for coherence.
Figure 8: Processor Cache Architecture with Write Buffers.

3.3 Cache Prefetching
CPU cache prefetching involves fetching a block from the main memory into the CPU cache before the block has been referenced, in the expectation that it will be referenced in the near future. Hardware cache prefetching is specifically concerned with prefetching algorithms implemented solely by dedicated hardware without any software support. Two questions have to be answered before prefetching a block: which block to prefetch and when to prefetch. The simplest candidate to prefetch is the next sequential block after the one most recently referenced. This is illustrated in [24] with a technique called always prefetch. With this algorithm, every time there is a reference to block i, the cache is examined for block i + 1 (i.e., the next sequential block in terms of ascending memory addresses). If block i + 1 is absent from the cache, it is prefetched. A variation which requires fewer prefetches and prefetch lookups (i.e., checks of the cache to see if the block is there) is called prefetch on misses, which prefetches the next sequential cache block if and only if the access to the current cache block is a miss. A more complicated scheme, known as tagged prefetch [24], keeps the number of prefetch lookups low while issuing more prefetches than prefetch on misses. In this case, each cache block has a single bit, called the tag, which is set to zero whenever the block does not reside in the cache. When a block is referenced by the processor, its tag is set to one. A block brought into the cache by a prefetch, however, retains its tag of zero.
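In code, the tagged prefetch rule reduces to a small amount of per-reference bookkeeping. The C sketch below follows the usual formulation of the scheme; cache_lookup, demand_fetch and issue_prefetch are assumed helpers of a surrounding cache model and are not defined here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool valid;
    bool tag;      /* 0: not yet demand-referenced, 1: referenced by the CPU */
} block_state_t;

/* Assumed helpers provided by the cache model (not defined here). */
block_state_t *cache_lookup(uint64_t block);
block_state_t *demand_fetch(uint64_t block);     /* miss: bring the block in   */
void           issue_prefetch(uint64_t block);   /* prefetched blocks get tag 0 */

/* Tagged prefetch, in its usual formulation: prefetch block i+1 whenever
 * block i is demand-fetched or referenced for the first time (tag 0 -> 1). */
void reference_block(uint64_t i)
{
    block_state_t *b = cache_lookup(i);

    if (b == NULL) {                 /* demand miss                           */
        b = demand_fetch(i);
        b->tag = true;
        issue_prefetch(i + 1);
    } else if (!b->tag) {            /* first reference to a prefetched block */
        b->tag = true;
        issue_prefetch(i + 1);
    }
    /* tag already 1: an ordinary hit, no prefetch is issued                  */
}
```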
Figure 9 (reproduced from [24]) shows a typical hardware cache prefetch architecture. The two prefetch units, one for each cache, are responsible for issuing new prefetch requests to the main memory. During each clock cycle, each prefetch unit receives information such as cache misses, cache hits, instruction types, and branch target addresses from the processor and the caches. Based on this information, it decides whether or not to issue a new prefetch request. If it does, the prefetch address is looked up in the corresponding cache. The request is issued in the next clock cycle if the data is not found in the cache. Issued requests from both prefetch units are not sent directly to the memory bus, though, but to a prefetch address buffer, each of which is organized as a FIFO queue with 16 entries. The oldest entry is sent to the memory bus only when the bus is free. If the buffer is full when a newly issued request arrives, the oldest entry is discarded from the buffer to make room for the new one. Whenever there is a cache miss, the address of the missing cache block is compared against every entry of the buffer. Any entry which matches the address represents a failed prefetch (because it was issued too late) and is discarded without being issued.
Figure 9: Cache Prefetching Architecture [25]
One of the disadvantages of cache prefetching is the unavoidable increase in memory traffic because of prefetches which are never referenced. The limitations of cache prefetching on a bus-based multiprocessor system are investigated in [26]. Results show that, when bus bandwidth is the bottleneck, prefetching will not improve performance, even when it reduces the demand miss ratio. Another disadvantage of cache prefetching is that useless prefetches may pollute the cache contents by displacing useful cache blocks and, thus, cause new cache misses which would not have happened had there been no prefetching. All of these factors can keep the advantages of cache prefetching for CPU performance minimal.

3.4 Miss Status Holding Registers
Out of order instruction execution is a popular technique in pipelined computers that allows a processor to fetch another instruction whenever there is a miss in the data cache. This means the processor need not wait until the data request is serviced by the next level in memory. A non-blocking cache is used in such a situation to reap the benefits. Extra hardware is needed to store the cache miss information in the form of the requested address, and that is where the Miss Status Holding Register (MSHR) comes into play. These registers hold the address information for the miss that is to be serviced. A hit-under-miss optimization reduces the effective miss penalty by continuing to service processor requests during an outstanding miss instead of ignoring them [10]. Most modern processors, such as the Intel Pentium 4, use multi banked L2 caches to facilitate parallel miss request servicing [27]. This optimization allows multiple misses to be outstanding at once, which means we can have multiple misses in the queue. Conventionally, each cache bank has an MSHR file to facilitate storing of this miss information [28]. Figure 10 shows the block diagram of a typical multi bank cache and the MSHR arrangement.
Figure 10: Miss Handling Architecture for multi bank caches
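The bookkeeping performed by an MSHR file can be sketched as a small lookup structure, as in the C fragment below. The sizes and names are assumptions for illustration; real designs also record per-target information (destination register, word offset) and partition the file per cache bank, as in Figure 10.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 8                  /* outstanding misses supported (assumed) */

typedef struct {
    bool     valid;                  /* entry tracks an in-flight miss         */
    uint64_t block_addr;             /* address of the missing cache block     */
    uint8_t  targets;                /* count of loads/stores waiting on it    */
} mshr_t;

static mshr_t mshr_file[NUM_MSHRS];

/* On a cache miss: merge with an existing entry (secondary miss) or allocate
 * a new one (primary miss). Returns false if the MSHR file is full, in which
 * case the processor must stall until an entry is freed.                     */
bool mshr_handle_miss(uint64_t block_addr)
{
    int free_slot = -1;

    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshr_file[i].valid && mshr_file[i].block_addr == block_addr) {
            mshr_file[i].targets++;          /* secondary miss: just record it */
            return true;
        }
        if (!mshr_file[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;                        /* structural stall: MSHRs full   */

    mshr_file[free_slot] = (mshr_t){ .valid = true,
                                     .block_addr = block_addr,
                                     .targets = 1 };
    return true;                             /* primary miss: request sent on  */
}
```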
3.5 Eager Writeback
The fundamental idea behind eager writeback strategies is to write the dirty cache lines to the next level in the hierarchy and clear the dirty bits earlier than in a conventional writeback. Since the main system bus handles a huge amount of data traffic for data intensive applications, this enhancement can be highly beneficial for performance. The work in [29] explores one such technique by indirectly distributing the traffic on the main bus to make use of the idle bus cycles. The eager writeback technique is a compromise between the writeback and write through techniques. Here the write commits are neither made upon every cache line modification, as in write through, nor does the write buffer wait until it is full. The dirty lines are evicted as and when the main system bus becomes free. However, the system bus carries more traffic than just the writebacks and reads between the cache hierarchy and the memory. Input/Output (I/O) traffic can cause a major bottleneck on the main system bus when network based I/O or other I/O intensive applications are running. The eager writeback technique does not provide much benefit once I/O is considered. In summary, the following two points highlight the shortcomings of the eager writeback strategy:
1. The strategy does not "snoop" the main bus to identify free memory cycles. This can be hazardous in situations where clustered activity is present on the main bus. With eager writeback, the write commits are merely offset to an earlier stage through an early writeback of LRU lines. This may not always solve the problem of bus contention.
2. During high I/O traffic densities on the main bus, the performance of the CPU decreases because of the large queuing for CPU read requests. This situation is amplified by two kinds of traffic: CPU Produce-I/O Consume and I/O Produce-CPU Consume.
A majority of the processors today use an optimal mix of the above techniques to mitigate the memory access penalties that result from a cache miss. Some researchers also call for dynamic optimizations in order to tune the performance for the particular type of application currently running on the processor.

3.6 Secondary Bus Architecture
The concept of a secondary bus connecting the level-2 cache to memory for cache writebacks has been explored in [30] by O'Farrell and Baskiyar. The simulation results on three data intensive micro-benchmarks show that adding an additional bus to support the main system bus helps in achieving significant reductions in queuing delays on the main bus. Such reductions in queuing delays offer superior temporal determinacy in a real-time environment. Their simulation results also compare well with "free writeback" (which models a system in which dirty writebacks do not generate any memory traffic on the bus) and indicate that this bus can indeed parallelize reads and writes. That work also discusses the feasibility of implementing such a bus as a serial or a wireless link.

Chapter 4. Secondary Bus Architecture
The main system bus is a bottleneck in bus based systems, and an efficient use of the bus cycles calls for some kind of access control mechanism. An additional bus can provide support by carrying the write traffic off the main system bus. Just adding a second bus would not by itself ease congestion, as it would only transfer the bottleneck from the L2 cache interface of the main bus to the memory interface.
This write traffic requires a new writeback (retirement) policy to make sure there is a less contentious path between the memory and the processor for reads. A bus controller is required to decide which agent (the write buffer, an I/O device or a read request) gets access to the main memory and when. This chapter explains the design of the secondary bus architecture, which is expected to ease queuing delay and result in improved program execution speed for the CPU. The secondary bus architecture was designed by Wang and Baskiyar [31].

4.1 Design of the Secondary Bus

The motive behind the secondary bus architecture is first to provide a separate path from the write buffer to main memory so that the three main reasons for CPU stalls due to the main bus bottleneck (explained in Chapter 1) are reduced. We refer to direct I/O here as a technique similar to cache injection or DCA (discussed in section 2.3.4), where the I/O data is directly written to or read from the processor cache. Direct I/O gives us an upper bound on the improvements possible with the architecture under discussion. The following two time slices, identified by observing the main system bus, can then be used to commit dirty cache lines from the write buffer to the memory over the secondary bus:

a. In the absence of any data traffic directed towards or from the main memory.

b. During I/O transactions on the main system bus in the case of direct I/O.

The first condition is required because we are assuming a single-ported memory here. A single-ported memory restricts the number of simultaneous accesses to just one, irrespective of whether it is a read or a write operation. Hence, to retire the cache lines in the buffer, choosing intervals when the memory is free avoids adding queuing delays on the system bus, as opposed to the techniques discussed in the previous chapter. Secondly, in the presence of any I/O traffic that results from a communication directly between the processor and the I/O device (as happens during a direct I/O transaction), the memory unit is available for write commits via the secondary bus. The secondary-bus-based architecture of [31] is shown in Figure 11 below.

Figure 11: Architecture with the Secondary Bus

As seen in Figure 11, the main bus is supported by the secondary bus during writebacks and I/O transactions. This is made possible by the secondary bus controller, which snoops on the main bus and identifies the bus cycles during which it is not busy with memory operations. These are the cycles in which the secondary bus takes over and commits the dirty cache lines to the memory, giving a "faucet"-like control. The "control input" to the secondary bus controller is made up of the main bus control lines that give information about the type of transaction happening on the bus. These can be the address strobe, burst ready or I/O control lines [32], which indicate when a transaction starts on the main bus. The control signals shown in Figure 11 are asserted in accordance with the state of the main bus to arbitrate memory accesses between the two buses.

4.2 Design Issues

There are several design issues that need to be addressed while developing a writeback policy for the secondary bus architecture. Some of them are explained here.

a. There can be situations where a read request arrives on the main bus while cache-line retirements to the memory are in progress on the secondary bus.
One way of handling this situation is to queue the main bus request in the MSHRs and continue the write operation until the writeback buffer is empty; the other option is to abort the burst writeback and let the main bus complete its transaction with the memory. In this design, the main bus access is given priority so as not to contribute to the queuing delay on the bus. This makes the secondary bus states dependent on those of the main bus.

b. The addition of the secondary bus and the bus controller consumes some area on the motherboard and the memory controller hub respectively. The secondary bus, though, is designed to be of a smaller width than the main bus, and hence of a smaller bandwidth, since it is used only for the writebacks. This means transactions on the secondary bus take longer than on the main bus. This is not a concern because writes constitute a small percentage of the total memory traffic, as discussed earlier, and a smaller bandwidth suffices. However, the usefulness of the secondary bus depends on the amount of free memory bandwidth. If the main bus is busy with direct I/O operations or is idle for long periods, even a narrow secondary bus is good enough to commit the dirty cache lines. This leads to a performance boost in situations of severe I/O traffic or clustered memory accesses, which can otherwise lead to congestion on the main bus. The design ensures that the performance of a processor running a particular application will not degrade as much as it would on a computer without the secondary bus.

c. In the case of direct I/O operations, as explained before, we can parallelize the writeback and I/O read operations. During DMA, the memory bandwidth is used up most of the time by one of these: I/O reads, I/O writes, CPU reads or CPU writes. This movement of I/O data between the processor cache and the memory adds more queuing delay to the traffic on the main bus, thereby creating a bottleneck. DMA reduces the number of cycles available for writeback and aggravates the write-buffer-induced CPU stalls, which are otherwise one of the major areas of improvement with the secondary bus. So the secondary bus works with DMA without any specific design modifications, but with reduced benefits.

Chapter 5. Simulation Setup for Performance Evaluation

Any new design, or an enhancement to an older design, requires evaluation. Computer architecture research groups around the world use simulators such as SimpleScalar, Sim-alpha, GEMS, SimOS and Simics, to name a few. These simulators model the entire architecture, along with the modification under study, to the desired level of accuracy. Though the simulation data generated can never match data collected from a run on actual hardware for accuracy, simulators are faster and well suited for comparisons at an early design stage. In this work the Sim-alpha simulator [33] is used for this purpose. SPEC CPU 2006 benchmarks are used for the simulations as they are both CPU and memory intensive. The latter, especially, is very important, as a good workout for the memory hierarchy is essential for the secondary bus benefits to be visible. In other words, the secondary bus architecture is best suited to memory- and I/O-intensive programs. The changes made to the simulator encompass the architectural design originally described in [31]. This chapter describes the simulation setups created using Sim-alpha and the SPEC benchmarks.
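As a concrete illustration of how the retirement policy of Chapter 4 might be modeled inside a cycle-level simulator, the C sketch below makes the per-cycle decision: retire a dirty line over the secondary bus whenever the snooped main bus state indicates that memory is free, or that the main bus is occupied only by a direct I/O transfer that bypasses memory. The names and structures are our own illustrative assumptions; they do not correspond to sim-alpha's internals.

    #include <stdbool.h>

    typedef struct {
        bool main_bus_memory_busy;   /* snooped: a memory transaction is on the main bus    */
        bool main_bus_direct_io;     /* snooped: a direct I/O (CPU-device) burst is running  */
        int  wb_buffer_occupancy;    /* dirty lines waiting in the L2 writeback buffer       */
    } bus_state_t;

    /* Memory is available to the secondary bus when the main bus is either idle
     * with respect to memory, or busy only with a direct I/O transfer. */
    static bool memory_free_for_retirement(const bus_state_t *s)
    {
        return !s->main_bus_memory_busy || s->main_bus_direct_io;
    }

    /* Called once per simulated cycle: retire one writeback-buffer entry over
     * the secondary bus whenever the memory port is free. */
    void secondary_bus_tick(bus_state_t *s)
    {
        if (s->wb_buffer_occupancy > 0 && memory_free_for_retirement(s))
            s->wb_buffer_occupancy--;    /* dirty line committed to memory */
    }

A complete model would also spread each retirement over the several cycles needed to push a full cache line across the narrow secondary bus, and would abort a retirement in progress if a main bus request to memory arrived mid-burst, as required by design issue (a) above.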
5.1 Sim-alpha Simulator

Sim-alpha is a validated, execution-driven simulator of the Alpha 21264 processor. It was written by extending the SimpleScalar tool suite [34]. Sim-alpha models the implementation constraints as well as the performance-enhancing features of the Alpha 21264. The simulator settings can be varied by the user to study the influence of parameters such as cache sizes, memory speed, fetch width, issue queue sizes, bus bandwidths and many others. The 21264 is a superscalar processor that can fetch and execute up to four instructions per cycle. It also features out-of-order execution, which allows critical-path instructions to start and complete quickly. It also has a branch prediction unit and executes speculatively. Coupled with high clock speeds, this combination of out-of-order and speculative execution provides exceptional core computational performance [35]. The processor has seven pipeline stages, as shown in Figure 12 (reproduced from [35]). Most present-day complex applications cannot be run at a sustained throughput of four instructions per cycle; some instructions can take more than 1000 cycles to execute due to access bottlenecks at the last level cache (LLC), the system bus and the memory. Though the numbers shown in Table 3 are from the actual design of the 21264 processor, they can be varied through several flags or configuration files in sim-alpha.

Figure 12: Microarchitecture of the Alpha 21264 processor [35]

Sim-alpha incorporates a detailed memory subsystem with support for multi-level cache hierarchies, address translation, bus contention and a Synchronous DRAM (SDRAM) memory model. It acts as a cross-architectural simulator: it runs on an x86 host and simulates binaries compiled for the Alpha 21264 instruction set architecture. The average error of sim-alpha relative to the actual Alpha processor is only about 2%, as evaluated across a handful of micro-benchmarks in [33].

5.2 SPEC Benchmark Programs and Simpoints

The SPEC CPU 2006 benchmark suite [36] consists of integer and floating-point programs that represent a wide range of applications in use today. These benchmarks are highly rated for evaluating several computer design components, chiefly the CPU, the memory subsystem and compilers. Though some of the previous SPEC suites had problems exercising the memory subsystem, either due to small working sets [29] or due to lower application complexity, the 2006 programs run longer, have large working sets and are more complex. Video compression and speech recognition have also been added to the new suite. The 2006 programs have long run times, whereas their predecessors can now finish a run within minutes on existing architectures. Run times of the current benchmarks can range from machine-weeks to months on a cycle-accurate simulator like sim-alpha before one can get access to the results. They also show high variability across several runs on the same data set even with such accurate simulation. To ease this problem, simulation points (simpoints) [37] are used during simulations. This small set of samples (simpoints), when simulated and weighted appropriately, provides an accurate picture of the complete execution of the program with a large reduction in simulation time. Several days of run time are reduced to just a few hours with a slight compromise on accuracy.
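To make the weighting step concrete, the whole-program metric is estimated as a weighted sum over the simulated simpoints. The notation below is ours rather than taken from [37], but it reflects the standard SimPoint estimate:

    % Estimated whole-program CPI from K simulated simpoints, where w_k is the
    % fraction of the program's execution intervals assigned to simpoint k
    % (the weights sum to 1) and CPI_k is the CPI measured over the detailed
    % simulation of simpoint k alone.
    \widehat{\mathrm{CPI}} \;=\; \sum_{k=1}^{K} w_k \, \mathrm{CPI}_k ,
    \qquad \sum_{k=1}^{K} w_k = 1 .

In principle the same weighting can be applied to the other per-program metrics reported in Chapter 6.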
The methods used to extract these points and their weights are discussed in [37], which also provides a list of simpoint files for the CPU 2006 suite for quick use in simulations.

5.3 Sim-alpha Architectural Configurations Used in Simulations

Sim-alpha requires a processor and memory hierarchy configuration list to start a simulation run. We used the data given in Table 3 for the simulations with the secondary bus. The memory hierarchy was modified by adding a write buffer to the last level cache. In the final structure, the L1 cache was supported by a victim buffer and the L2 (LLC) had the writeback buffer to handle write traffic. This setup made up the base architecture against which the secondary bus architecture would be compared later. Changes to the configuration file involved the addition of a new bus connecting the write buffer (added previously) to the memory, bypassing the main bus. It was ensured that the write traffic used only the secondary bus.

Table 3: Sim-alpha specifications
  Processor Speed: 3 GHz
  Level 1 Data Cache: 8-way, 32 KB, virtual index, virtual tag
  Level 1 Instruction Cache: 8-way, 32 KB, virtual index, virtual tag
  Level 2 Cache: 8-way, 2 MB, physical index, physical tag
  Number of MSHRs per Cache: 8
  Write Mechanism for Level 1 Cache: victim buffer, no writeback buffer
  Write Mechanism for Level 2 Cache: writeback buffer, no victim buffer
  Main Bus (Front Side Bus): 600 MHz, 8 B wide, 10 cycles of arbitration latency
  Secondary Bus: 600 MHz, 1 B wide, 10 cycles of arbitration latency

A major benefit of such a secondary bus arises in situations where I/O traffic uses a large part of the bandwidth on the main bus, causing congestion for non-I/O data. Hence, a simulation setup was created to generate I/O traffic at a rate described by a normal distribution. A normal distribution was assumed for the I/O traffic to and from the CPU because I/O is mainly composed of network and disk traffic. Figure 13 gives the probability density function used in our simulations. A dummy I/O device that generates blocks of cache-line size was implemented, and I/O injection frequencies were chosen to create a calculated bandwidth pinch. The I/O data rates tried were 600 MB/Sec, 1.2 GB/Sec and 1.8 GB/Sec; these numbers were chosen based on the I/O bus bandwidth numbers shown in Table 2. The I/O data eventually resides in the LLC, evicting some dirty cache lines to make space for the incoming data. These replacements can lead to conflict misses later in the run and cause more traffic on the main system bus. Comparisons with Eager writeback [29] were made possible with a different set of changes to the simulator. DMA I/O was discussed in section 2.3.3, and performance comparisons with DMA I/O were also required. The simulator changes included implementing a DMA engine that mimics the I/O injection and acts as a bus master. In this case the I/O traffic goes through the memory unit before landing in the LLC, as opposed to direct I/O, thereby reducing the memory bandwidth available for writeback. Simpoints do help speed up the simulations, but fast turnaround was only achievable in a supercomputing environment. The Alabama Supercomputing Authority [38] provided us with a server cluster made up of some of the latest processors. With multiple simpoints running in parallel on the supercomputer nodes, the results for all of the 16 benchmarks were obtained in just a couple of weeks.
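To illustrate how such an injection pattern can be produced, the C sketch below draws the gap (in cycles) between successive injections of a cache-line-sized block from a normal distribution with the parameters shown in Figure 13 (mean 100 cycles, standard deviation 60 cycles). The Box-Muller transform and the clamping of non-positive samples are our own illustrative choices, not details taken from the simulator changes described above.

    #include <math.h>
    #include <stdlib.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define IO_GAP_MEAN 100.0   /* mean cycles between injected blocks (Figure 13) */
    #define IO_GAP_SD    60.0   /* standard deviation in cycles (Figure 13)        */

    /* Draw one sample from N(mean, sd) using the Box-Muller transform. */
    static double gaussian_sample(double mean, double sd)
    {
        double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* uniform in (0,1) */
        double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double z  = sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
        return mean + sd * z;
    }

    /* Cycles until the dummy I/O device injects its next cache-line-sized block.
     * Non-positive draws are clamped to one cycle (an illustrative choice). */
    long next_io_injection_gap(void)
    {
        double gap = gaussian_sample(IO_GAP_MEAN, IO_GAP_SD);
        return (gap < 1.0) ? 1 : (long)gap;
    }

In a cycle-level model, next_io_injection_gap() would be consulted each time the dummy device completes an injection, and the mean gap would presumably be scaled to hit the 600 MB/Sec, 1.2 GB/Sec or 1.8 GB/Sec target rates.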
Separate simulations had to be run for Eager writeback, for each of the I/O injection rates, and for the DMA and direct I/O conditions, each taking about two weeks, because Sim-alpha cannot make use of multiprocessing.

Figure 13: Probability density function used for I/O injection, mean = 100 cycles and SD = 60 cycles.

Chapter 6. Simulation Results and Observations

In this chapter, the simulation results obtained with sim-alpha are presented and analyzed. The results include simulations with I/O traffic and comparisons with the Eager writeback technique at all the different I/O rates. The metrics used for evaluation were the queuing delay on the main bus, the maximum number of cycles taken by any instruction, the average cycles per instruction, and the number of instructions taking more than 1000 cycles to complete. Though all of these metrics are interrelated, the results give an understanding of their dependencies on each other. Reductions in some of the metrics, such as the average queuing delay for an instruction, matter more to certain real-time applications and systems than to personal computers and server-based systems.

6.1 Queuing Delay on the Main System Bus

The main bus in our simulations has a bandwidth of 4.8 GB/Sec (8 B wide at 600 MHz) and was shared by the writeback traffic, the read traffic and, depending on the experiment, I/O traffic at 600 MB/Sec, 1.2 GB/Sec or 1.8 GB/Sec. There are times when the bus is busy servicing a number of clustered requests, and another request that arrives at the same time has to be queued because of the limit on the number of outstanding requests. In real-time systems these queuing delays can become significant, resulting in unexpected latencies and missed hard deadlines. Our simulations show a significant reduction in queuing delays due to the separation of the write traffic from the read traffic. Figure 14 shows the percentage reduction in queuing delay achieved with the secondary bus against the base architecture for each of the SPEC programs at various I/O rates. Almost all of the programs showed a large reduction in queuing delays, with an average reduction of nearly 99% in the absence of I/O traffic on the main bus. With I/O devices communicating with the processor through direct I/O, the benefits reduce, though Figure 14 shows that the percentage reduction is still considerably high. It also indicates that a majority of the data queuing that occurs on the main bus is due to the write traffic in a writeback cache, and the benefit of using a secondary bus on queuing delay is very clear.

Figure 14: Percentage reduction in queuing delays across different I/O rates.

The comparisons with Eager writeback and with other types of I/O, such as DMA, are shown in Figure 15. Direct I/O clearly extracts the most out of the secondary bus architecture when compared to DMA. When DMA is used, the I/O data travels through the memory before reaching the processor. This data behaves much like any other read data from the memory, reducing the memory bandwidth available for writeback over the secondary bus. With direct I/O, a writeback to the memory and an I/O read from the I/O controller can proceed in parallel. These factors lead to reduced gains for DMA with the secondary bus, though it is still quite useful in reducing the delays. Although Eager Writeback results in a good reduction in queued cycles, it cannot match the advantage of a secondary bus.
This is because the writeback policy there does not snoop for free cycles on the main bus and only offsets the writes to an earlier stage. Figures 15 and 16 provide the total queued cycles at I/O rates of 1.8 GB/Sec and 1.2 GB/Sec respectively.

Figure 15: Total number of queued cycles during an I/O traffic rate of 1.8 GB/Sec (Direct I/O vs. DMA vs. Eager Writeback).

Figure 16: Total number of queued cycles during an I/O traffic rate of 1.2 GB/Sec (Direct I/O vs. DMA vs. Eager Writeback).

6.2 Cycles per Instruction

A comparison of the cycles per instruction (CPI) between the secondary bus architecture and the base architecture gives us the speed-up achieved. Figure 17 shows the percentage speed-up achieved across a range of programs from the SPEC CPU 2006 benchmark suite.

Figure 17: Percentage improvement in processor throughput with the secondary bus.

Speed-ups of up to 19% were achieved due to the addition of the secondary bus, as seen in Figure 17, in the absence of I/O traffic on the main bus. With the secondary bus present, a read request issued on the main bus after an L2 cache miss never waited for dirty writeback traffic to be written to the main memory. In the presence of I/O traffic, further improvement was seen, with speed-ups of up to 33%. In other words, the degradation of the base architecture was relatively severe when simulated with I/O traffic. The improvements in CPI are a direct consequence of the reduction in queued cycles seen in the previous section. Programs that have a high writeback rate, or that work on large data sets, benefit the most from this architecture. On average, a 13% speed-up over the base architecture was seen across 10 of the 16 benchmarks we simulated with sim-alpha. The results also show that the speed-up depends on how much the program strains the memory hierarchy. Processors with smaller second-level caches incur more cache misses and hence more writebacks. Thus programs with very large working sets stand to gain more than those with smaller working sets. This can be seen with programs like namd, gromacs, sjeng and h264ref. The program namd did not generate any writebacks to the memory, and hence the architecture was never exercised during its simulation. Programs like bwaves, zeusmp, GemsFDTD and sphinx were highly writeback intensive, with nearly 30% of the traffic on the main bus being writeback traffic. As a result of the speed-up achieved, the number of instructions taking more than 1000 processor cycles was reduced. The results are shown in Figure 18 and Figure 19 for I/O injection rates of 1.2 GB/Sec and 1.8 GB/Sec respectively. Although a majority of the 100 million instructions simulated took only around 100 to 200 cycles to execute due to cache hits, there were instructions which took more than 1000 cycles. Instructions taking more than 1000 cycles are those affected by writeback latencies upon a read miss in L2.
Hence these additional cycles were mainly due to memory accesses and main bus contention. A decrease of more than 90% in the number of such instructions was seen across the benchmark suite, which justifies the speed-ups seen in Figure 17.

Figure 18: Comparison between I/O techniques and Eager Writeback for an I/O rate of 1.2 GB/Sec (number of instructions taking more than 1000 cycles; Direct I/O vs. DMA vs. Eager Writeback).

Figure 19: Comparison between I/O techniques and Eager Writeback for an I/O rate of 1.8 GB/Sec (number of instructions taking more than 1000 cycles; Direct I/O vs. DMA vs. Eager Writeback).

A comparison with the Eager Writeback technique shows that its performance improvement tends to decrease with I/O traffic on the main bus. This can also be seen in the results from [29]. The secondary bus, because of bus redundancy, scales easily with increased I/O rates. This is best explained with the plot shown in Figure 20 for the GemsFDTD benchmark program. GemsFDTD was chosen mainly because it has a good writeback rate compared to other programs and shows improvements in CPI in excess of 35% at high I/O rates. With DMA, the results tend to flatten out at large I/O traffic rates. It is not necessarily true, however, that the CPI keeps improving with increased I/O under the secondary bus. There will be a point where the entire 4.8 GB/Sec bandwidth of the main bus is no longer sufficient, resulting in queuing delays and CPIs going outside the usual range of 0 to 3 cycles per instruction. That problem is due to the main bus bandwidth itself getting exhausted, which happens rarely, since the main bus width is scaled with the number of devices on the bus; an increased number of agents on the bus will lead to a wider main system bus.

Figure 20: CPI percentage improvement for the GemsFDTD program at different I/O injection rates (Direct I/O vs. DMA vs. Eager Writeback).

Chapter 7. Future Work

The benefits of the secondary bus were evident from the simulation results seen in the previous chapter. Though the technique was applied to a uniprocessor system in our analysis, a similar implementation can very much be used with a multi-core processor. Although the front side bus (FSB) architecture is slowly giving way to technologies such as Quickpath [16] and HyperTransport [17], there is still room for improvement on the FSB, as the secondary bus shows. Since the FSB is a simpler design compared to Quickpath or HyperTransport, the secondary bus architecture is worth considering for a dual-core or quad-core processor. As seen in Figure 21 and Figure 22 [16], many multi-core processors still use the FSB as their system bus for communication with the chipset, and hence the secondary bus can be quite handy in easing congestion. Simulations of the secondary bus in a multi-core processor environment can be done, similar to the ones in the previous chapter, using full system simulators such as GEMS, Simics or M5.

Figure 21: Front side bus architecture in Intel's multi-core processors.

Figure 22: Dedicated FSB for each dual-core processor.

Implementation considerations for the secondary bus need to be researched.
We have shown that an 8-bit bus with a bandwidth of 600 MB/Sec is good enough for the secondary bus, but a technology for realizing it in hardware must be identified. Serial buses such as SATA provide sufficient bandwidth for writeback data and can be considered. A wireless link is another option that can come in handy when facing space-related constraints; however, it would require the addition of a wireless transmitter and receiver, which may consume slightly more power than a bus-based link. The I/O injection rates were well tested with the proposed architecture, and future simulations can use benchmarks that actually generate the I/O data as well as consume it. Some web applications are well suited for this purpose. This would call for writing an independent benchmark and compiling it for the Alpha architecture. Modifications would have to be made to the bridges and other controller logic to differentiate between I/O and CPU data moving through the memory hierarchy. As mentioned in Chapter 5, sim-alpha does not have these extensions built into it.

Chapter 8. Conclusion

A simple solution for reducing latencies due to bus contention on the main system bus between the CPU and the chipset has been evaluated in this work. The technique of bus redundancy, combined with a policy for efficient bus bandwidth management through data traffic distribution, has been shown to give significant performance improvements compared to existing architectures. Since write traffic on the main system bus is the largest contributor to the queuing delays, separating the write and read data traffic is beneficial. Queuing delays were considerably reduced due to the sharing of the bus load by the secondary bus. This improvement was verified across a range of the SPEC CPU 2006 benchmarks, which comprise both CPU- and memory-intensive workloads. Performance metrics such as the average queuing delay per instruction and cycles per instruction were used to validate the results and understand the CPU areas directly impacted by this architecture. Comparisons were made against two types of I/O techniques, namely DMA and direct I/O, and we found that this design would be highly advantageous in the presence of I/O traffic on the main bus. I/O devices and processors can be involved in two types of communication that determine the traffic direction on the main system bus:

1. I/O produce, CPU consume (more than 80% of I/O traffic is of this type).

2. CPU produce, I/O consume.

We were unable to analyze the second type of traffic as our benchmarks did not generate any I/O traffic themselves. The I/O-produce, CPU-consume traffic type was simulated and analyzed by creating an I/O device that pumps data at regular intervals either directly to the on-chip cache (direct I/O) or to the memory through DMA. In addition to showing improvements over the base architecture, the secondary bus approach also proved better than Eager writeback for most of the SPEC programs. When larger I/O rates were considered, the gap between Eager writeback and the secondary bus widened. The secondary bus can be implemented in many ways. In our simulations we used an 8-bit wide bus to evaluate the design at different traffic intensities. In the future, other implementations of the secondary bus can be tried. Split-bus or pipelined transactions can be tried on the secondary bus with multiple bit lines.
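As a rough sizing relation for that choice (a back-of-the-envelope estimate of ours, not a result from the simulations), the width of the secondary bus only needs to cover the sustained writeback rate it must absorb:

    % Minimum secondary bus width for a given sustained L2 writeback rate.
    \text{width (bytes)} \;\ge\; \frac{\text{sustained writeback rate (bytes/s)}}{\text{secondary bus clock (Hz)}} ,
    \qquad \text{e.g. } 1~\text{B} \times 600~\text{MHz} = 600~\text{MB/s} .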
The number of bit lines that can be used for the secondary bus depends on the L2 cache writeback rate. The secondary bus provides excellent benefits for single-ported memories at a minimal cost, which consists of a small hardware addition for controlling the bus accesses and a small bus.

Bibliography

[1] G. E. Moore, Cramming More Components onto Integrated Circuits, Electronics, vol. 38, pp. 114-117, 19 Apr. 1965.
[2] Memory Hierarchy. [Online]. Available: http://en.wikipedia.org/wiki/Memory_hierarchy
[3] P. P. Chu and R. Gottipati, Write Buffer Design for On-Chip Cache, Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 311-316, Oct. 1994.
[4] T. E. Anderson, H. M. Levy, B. N. Bershad and E. D. Lazowska, The Interaction of Architecture and Operating System Design, Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 108-120, 1991.
[5] B. Chen, Memory behavior of an X11 window system, Proceedings of the USENIX Winter Technical Conference, Jan. 1994.
[6] D. Nagle, R. Uhlig, T. Mudge and S. Sechrest, Optimal allocation of on-chip memory for multiple-API operating systems, Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 358-369, 18-21 Apr. 1994.
[7] N. P. Jouppi, Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers, Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 364-373, 28-31 May 1990.
[8] N. P. Jouppi, Cache Write Policies and Performance, Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 191-201, 28-31 May 1993.
[9] K. Skadron and D. W. Clark, Design issues and tradeoffs for write buffers, Proceedings of the 3rd International Symposium on High-Performance Computer Architecture, pp. 144-155, 1-5 Feb. 1997.
[10] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers Inc., 3rd edition, 1996.
[11] Serial ATA Revision 3.0 Specification, 27 May 2009. [Online]. Available: http://www.serialata.org/documents/SATA-6Gbs-Fast-Just-Got-Faster.pdf
[12] S. Addagatla, M. Shaw, S. Sinha, P. Chandra, A. S. Varde and M. Grinkrug, Direct Network Prototype Leveraging Light Peak Technology, Proceedings of the IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), pp. 109-112, 18-20 Aug. 2010.
[13] PHY Interface for the PCI Express and USB 3.0 Architectures, 3 Nov. 2009. [Online]. Available: http://download.intel.com/technology/usb/USB_30_PIPE_10_Final_042309.pdf
[14] M. J. Koop, W. Huang, K. Gopalakrishnan and D. K. Panda, Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand, Proceedings of the 16th IEEE Symposium on High Performance Interconnects (HOTI), pp. 85-92, 26-28 Aug. 2008.
[15] AGP V3.0 Interface Specification, Sep. 2002. [Online]. Available: http://download.intel.com/support/motherboards/desktop/sb/agp30.pdf
[16] An Introduction to Intel Quickpath Interconnect, Jan. 2009. [Online]. Available: http://www.intel.com/technology/quickpath/introduction.pdf
[17] HyperTransport I/O Link Specification, Nov. 2006. [Online]. Available: http://www.hypertransport.org/docs/twgdocs/HTC20051222-0046-0017.pdf
[18] Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications, Jun. 2010. [Online]. Available: http://standards.ieee.org/getieee802/download/802.3ba-2010.pdf
[19] D. Tang, Y. Bao, W. Hu and M. Chen, DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance, Proceedings of the 16th IEEE International Symposium on High Performance Computer Architecture, pp. 1-12, 9-14 Jan. 2010.
[20] R. Huggahalli, R. Iyer and S. Tetrick, Direct cache access for high bandwidth network I/O, Proceedings of the 32nd International Symposium on Computer Architecture, pp. 50-59, 4-8 June 2005.
[21] E. A. Leon, K. B. Ferreira and A. B. Maccabe, Reducing the Impact of the Memory Wall for I/O Using Cache Injection, Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects, pp. 143-150, 22-24 Aug. 2007.
[22] C. Zhang and F. Vahid, Using a victim buffer in an application-specific memory hierarchy, Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, vol. 1, pp. 220-225, 16-20 Feb. 2004.
[23] G. Memik, G. Reinman and W. H. Mangione-Smith, Reducing energy and delay using efficient victim caches, Proceedings of the International Symposium on Low Power Electronics and Design, pp. 262-265, 25-27 Aug. 2003.
[24] A. J. Smith, Cache memories, ACM Computing Surveys, vol. 14, no. 3, pp. 473-530, Sep. 1982.
[25] J. Tse and A. J. Smith, CPU cache prefetching: Timing evaluation of hardware implementations, IEEE Transactions on Computers, vol. 47, no. 5, pp. 509-526, May 1998.
[26] D. M. Tullsen and S. J. Eggers, Limitations of Cache Prefetching on a Bus-Based Multiprocessor, Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 278-288, 16-19 May 1993.
[27] The Microarchitecture of the Intel Pentium 4 Processor on 90nm Technology, Intel Technology Journal, vol. 8, issue 1, 18 Feb. 2004.
[28] D. Kroft, Lockup-Free Instruction Fetch/Prefetch Cache Organization, Proceedings of the 8th International Symposium on Computer Architecture, pp. 81-87, 12-14 May 1981.
[29] H. H. S. Lee, G. S. Tyson and M. K. Farrens, Eager writeback: a technique for improving bandwidth utilization, Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 11-21, 2000.
[30] J. O'Farrell and S. Baskiyar, Improved Real-Time Performance Using a Secondary Bus, Proceedings of Computers and Their Applications, Honolulu, HI, March 2010, ISCA Press.
[31] S. Baskiyar and C. Wang, A secondary channel between cache and memory for decreasing queuing delay, US provisional patent application no. 61/003,542, filed Nov. 17, 2007, Auburn University, AL.
[32] Intel Embedded Pentium Processor Family Developer's Manual, December 1998. [Online]. Available: http://www.intel.com/design/intarch/manuals/273204.htm
[33] R. Desikan, D. Burger and S. W. Keckler, Measuring experimental error in microprocessor simulation, Proceedings of the 28th Annual International Symposium on Computer Architecture, pp. 266-277, 2001.
[34] T. Austin, E. Larson and D. Ernst, SimpleScalar: An Infrastructure for Computer System Modeling, IEEE Computer, vol. 35, no. 2, pp. 59-67, Feb. 2002.
[35] R. E. Kessler, E. J. McLellan and D. A. Webb, The Alpha 21264 microprocessor architecture, Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors, pp. 90-95, 5-7 Oct. 1998.
[36] SPEC CPU2006 Benchmark Descriptions, ACM SIGARCH Computer Architecture News, vol. 34, no. 4, September 2006.
[37] K. Ganesan, D. Panwar and L. K. John, Generation, Validation and Analysis of SPEC CPU2006 Simulation Points Based on Branch, Memory and TLB Characteristics, Proceedings of the SPEC Benchmark Workshop on Computer Performance Evaluation and Benchmarking, pp. 121-137, Jan. 2009.
[38] Alabama Supercomputing Authority. [Online]. Available: http://www.asc.edu/