
http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/arch.features.html

Chapter 13 -- Performance features

Architectural Features used to enhance performance
--------------------------------------------------
  (loosely following chapter 13)

What is a "better" computer?  What is the "best" computer?
The factors involved are generally cost and performance.

  COST FACTORS:
                cost of hardware design
                cost of software design (OS, applications)
                cost of manufacture
                cost to end purchaser

  PERFORMANCE FACTORS:
                what programs will be run?
                how frequently will they be run?
                how big are the programs?
                how many users?
                how sophisticated are the users?
                what I/O devices are necessary?

  (this chapter discusses ways of increasing performance)

There are two ways to make computers go faster.

 1. Wait a year.  Implement in a faster/better/newer technology.
    More transistors will fit on a single chip.
    More pins can be placed around the IC.
    The process used will have electronic devices (transistors)
      that switch faster.

 2. New/innovative architectures and architectural features.


MEMORY HIERARCHIES
------------------

Known in current technologies:  the time to access data
from memory is an order of magnitude greater than the time
for a CPU operation.

  For example:  if a 32-bit 2's complement addition takes 1 time unit,
  then a load of a 32-bit word takes about 10 time units.

Since every instruction takes at least one memory access (for
the instruction fetch), the performance of a computer is dominated
by its memory access time.

  (To try to help this difficulty, we have load/store architectures,
   where most instructions take operands only from registers.  We also
   try to have fixed-size, SMALL-size instructions.)

What we really want:
   very fast memory -- of the same speed as the CPU
   very large capacity -- 512 Mbytes
   low cost -- $50

These are mutually incompatible.  The faster the memory,
the more expensive it becomes.  The larger the amount of
memory, the slower it becomes.

What we can do is compromise.  Take advantage of the fact
(a fact, established by looking at many real programs) that memory
accesses are not random.  They tend to exhibit LOCALITY.

  LOCALITY -- nearby.  2 kinds:

  Locality in time (temporal locality)
    If data has been referenced recently, it is likely to
    be referenced again (soon!).

    example:  the instructions within a loop.  The loop is
    likely to be executed more than once.  Therefore, each
    instruction gets referenced repeatedly in a short period
    of time.

    example:  the top of stack is repeatedly referenced within
    a program.

  Locality in space (spatial locality)
    If data has been referenced recently, then data nearby
    (in memory) is likely to be referenced soon.

    example:  array access.  The elements of an array are
    neighbors in memory, and are likely to be referenced
    one after the other.

    example:  instruction streams.  Instructions are located
    in memory next to each other.  Our model for program
    execution says that unless the PC is explicitly changed
    (as by a branch or jump instruction), sequential instructions
    are fetched and executed.
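  (An aside, not from the original notes:  both kinds of locality
   show up in a few lines of C.  The array name and size below are
   made up purely for illustration.)

    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static int a[N];
        int sum = 0;

        /* spatial locality:  a[0], a[1], a[2], ... are neighbors in
           memory, so one fetched block supplies several elements.
           temporal locality:  the loop's instructions, i, and sum
           are re-referenced on every iteration. */
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("%d\n", sum);
        return 0;
    }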
We can use these tendencies to advantage by keeping likely-
to-be-referenced (soon) data in a faster memory than main
memory.  This faster memory is called a CACHE.

        CPU-cache   <----------------> memory

It is located very close to the CPU, and it contains COPIES of
PARTS of memory.

A standard way of accessing memory, for a system with a cache:
 (The programmer doesn't see or know about any of this.)

   An instruction fetch (or load or store) goes to the cache.

   If the data is in the cache, then we have a HIT.  The data is
   handed over to the CPU, and the memory access is completed.

   If the data is not in the cache, then we have a MISS.
   The instruction fetch (or load or store) is then sent on
   to main memory.

On average, the time to do a memory access is

      = cache access time + (% misses  *  memory access time)

This average (mean) access time will change for each program.
It depends on the program, its reference pattern, and how
that pattern interacts with the cache parameters.

The cache is managed by hardware:

        Keep recently-accessed blocks -- exploits temporal locality.
        Break memory into aligned blocks (lines), e.g. 32 bytes
                -- exploits spatial locality.
        Transfer data to/from the cache in blocks.
        Put a block in a "block frame":
                state (e.g. valid)
                address tag
                data

>>>> simple CACHE DIAGRAM here <<<<

   If the tag is present, and the VALID bit is active,
     then there is a HIT, and a portion of the block is returned.

   If the tag is not present, or the VALID bit is not active,
     then there is a MISS, and the block must be loaded from memory.
     The block is placed in the cache (valid bit set, data written)
     AND
     a portion of the block is returned.

Example

        Memory words:
                0x11c   0xe0e0e0e0
                0x120   0xffffffff
                0x124   0x00000001
                0x128   0x00000007
                0x12c   0x00000003
                0x130   0xabababab

        A 16-byte cache block frame:

                state   tag     data (16 bytes == 4 words)
                invalid 0x????  ??????

        lw $4, 0x128

        Is tag 0x120 in the cache?  (block address = 0x128 & 0xfffffff0 = 0x120)
        No, load the block.

        A 16-byte cache block frame:

                state   tag     data (16 bytes == 4 words)
                valid   0x120   0xffffffff, 0x00000001, 0x00000007, 0x00000003

        Return 0x00000007 to the CPU to put in $4.

        lw $5, 0x124

        Is tag 0x120 in the cache?
        Yes, return 0x00000001 to the CPU.

Beyond the scope of this class:

        Blocks and block frames are divided into "sets" (equivalence
          classes) to speed lookup.
        Terms: fully-associative, set-associative, direct-mapped.

Often
        cache:  instruction cache 1 cycle
                data cache 1 cycle
        main memory 20 cycles

Performance for data references w/ miss ratio 0.02 (2% misses):

        mean access time = cache-access + miss-ratio * memory-access
                         =       1     +   0.02     *  20
                         =       1.4

A typical cache size is 64 Kbytes, given a 64 Mbyte memory:

        20 times faster
        1/1000 the capacity
        often contains 98% of the references
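  (An aside, not from the original notes:  the block frame example and
   the mean access time calculation can be written as a small C sketch.
   The struct and function names are made up for illustration; a real
   cache does all of this in hardware.)

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 16                 /* bytes per block, as above   */
    #define WORDS      (BLOCK_SIZE / 4)

    /* One block frame: state (valid bit), address tag, and data. */
    struct frame {
        int      valid;
        uint32_t tag;                     /* block-aligned address       */
        uint32_t data[WORDS];
    };

    /* Look up one word.  The tag is the address with the low 4 bits
       masked off (0x128 & 0xfffffff0 = 0x120); the low bits select
       the word within the block. */
    int lookup(const struct frame *f, uint32_t addr, uint32_t *word)
    {
        uint32_t tag    = addr & ~(uint32_t)(BLOCK_SIZE - 1);
        uint32_t offset = (addr & (BLOCK_SIZE - 1)) / 4;

        if (f->valid && f->tag == tag) {  /* HIT                          */
            *word = f->data[offset];
            return 1;
        }
        return 0;                         /* MISS: load block from memory */
    }

    int main(void)
    {
        /* the frame after the miss on  lw $4, 0x128  loads block 0x120 */
        struct frame f = { 1, 0x120,
                           { 0xffffffff, 0x00000001, 0x00000007, 0x00000003 } };
        uint32_t w;

        if (lookup(&f, 0x128, &w))        /* hit: returns 0x00000007     */
            printf("0x%08x\n", (unsigned)w);

        /* mean access time = cache-access + miss-ratio * memory-access */
        printf("%.1f cycles\n", 1.0 + 0.02 * 20.0);   /* = 1.4          */
        return 0;
    }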
Remember:

  Recently accessed blocks are in the cache (temporal locality).
  The cache is smaller than main memory, so not all blocks are in
    the cache.
  Blocks are larger than 1 word (spatial locality).

This idea of exploiting locality is (can be) done at many
levels.  Implement a hierarchical memory system:

  smallest, fastest, most expensive memory         (registers)
  relatively small, fast, expensive memory         (CACHE)
  large, fast as possible, cheaper memory          (main memory)
  largest, slowest, cheapest (per bit) memory      (disk)

registers are managed/assigned by the compiler or asm. lang. programmer
cache is managed/assigned by hardware, or partially by the OS
main memory is managed/assigned by the OS
disk is managed by the OS

Programmer's model:  one instruction is fetched and executed at
  a time.

Computer architect's model:  the effects of a program's execution are
  given by the programmer's model, but the implementation may be
  different.

To make execution of programs faster, we attempt to exploit
PARALLELISM:  doing more than one thing at one time.

  program level parallelism:  Have one program run parts of itself
    on more than one computer.  The different parts occasionally
    synch up (if needed), but they run at the same time.

  instruction level parallelism (ILP):  Have more than one instruction
    within a single program executing at the same time.


PIPELINING  (ILP)
-----------------

 concept
 -------

   A task is broken down into steps.
   Assume that there are N steps, and each takes the same amount of time.

   (Mark Hill's) EXAMPLE:  car wash

     steps:  P -- prep
             W -- wash
             R -- rinse
             D -- dry
             X -- wax

     assume each step takes 1 time unit

     time to wash 1 car (red) = 5 time units
     time to wash 3 cars (red, green, blue) = 15 time units

     which car      time units
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
       red      P  W  R  D  X
       green                   P  W  R  D  X
       blue                                   P  W  R  D  X

   A PIPELINE overlaps the steps.

     which car      time units
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
       red      P  W  R  D  X
       green       P  W  R  D  X
       blue           P  W  R  D  X
       yellow            P  W  R  D  X
          etc.

         IT STILL TAKES 5 TIME UNITS TO WASH 1 CAR,
         BUT THE RATE OF CAR WASHES GOES UP!

   Pipelining can be done in computer hardware.

 2-stage pipeline
 ----------------

  steps:
    F -- instruction fetch
    E -- instruction execute (everything else)

    which instruction       time units
                1  2  3  4  5  6
        1       F  E
        2          F  E
        3             F  E
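  (An aside, not from the original notes:  the arithmetic behind the
   diagrams, as a small C sketch.  It assumes every stage takes 1 time
   unit and that, when pipelined, a new task starts each time unit.)

    #include <stdio.h>

    /* Time units to finish 'tasks' tasks on a 'stages'-stage pipeline. */
    int unpipelined(int stages, int tasks) { return stages * tasks; }
    int pipelined(int stages, int tasks)   { return stages + (tasks - 1); }

    int main(void)
    {
        /* the car wash: 5 steps (P, W, R, D, X), 3 cars */
        printf("unpipelined: %d time units\n", unpipelined(5, 3)); /* 15 */
        printf("pipelined:   %d time units\n", pipelined(5, 3));   /* 7  */

        /* one car still takes 5 time units either way -- only the
           RATE of car washes (the throughput) improves */
        printf("latency:     %d time units\n", pipelined(5, 1));   /* 5  */

        /* the 2-stage (F, E) pipeline: 3 instructions done at time 4 */
        printf("2-stage:     %d time units\n", pipelined(2, 3));   /* 4  */
        return 0;
    }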
