kn230_copy.s
	addu	a1,NBPW			# dst += 4; BD slot
	bne	a0,a3,1b
	b	rbytecopy

/*
 * For the specific ram buffer layout that this code is written for
 * the ram buffer is the source and it is already aligned.  Plus
 * it only holds data in every other word (32 bit word, or longword).
 * Thus if we mis-align it, it is real tough to copy.  Instead we deal
 * with the mis-aligned destination.  Maybe the cache won't like it
 * but it will work.  - burns
 */
/*
 * dst unaligned, src aligned loop
 * NOTE: if MINCOPY >= 7, will always do 1 loop iteration or more
 * if we get here at all
 */
rpartaligncopy:
	and	a3,a2,~(NBPW-1)		# space in word chunks
	subu	a2,a3			# count after 'by word' loop
#if MINCOPY < 7
	beq	a3,zero,rbytecopy	# less than a word to copy
#endif
					# due to screwy rambuf, source end point
					# is twice as far away as you would expect
	addu	a3,a3			# double the word count to create
	addu	a3,a0			# the source endpoint
1:	lw	v0,0(a0)		# get next word
#ifdef MIPSEB
	swl	v0,0(a1)		# store left half
	swr	v0,3(a1)		# store right half
#endif
#ifdef MIPSEL
	swr	v0,0(a1)		# store right half
	swl	v0,3(a1)		# store left half
#endif
	addu	a0,NBPW*2		# bump source pointer, skipping 'hole'
	addu	a1,NBPW			# bump dest pointer
	bne	a0,a3,1b		# at end of word copy?

/*
 * brute force byte copy loop, for bcount < MINCOPY + tail of unaligned src
 * note that lwl, lwr, swr CANNOT be used for tail, since the lwr might
 * cross page boundary and give spurious address exception
 */
rbytecopy:
	addu	a3,a2,a0		# source endpoint; BDSLOT
	ble	a2,zero,rcopydone	# nothing left to copy, or bad length
1:	lb	v0,0(a0)
	addu	a0,1
	sb	v0,0(a1)
	addu	a1,1			# BDSLOT: incr dst address
	bne	a0,a3,1b
rcopydone:
	j	ra
	END(kn230_rbcopy)
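/*
 * Reference only: a minimal C sketch of what the word loop and byte tail
 * above do, assuming the layout described in the comment (aligned rambuf
 * source holding valid data only in every other 32-bit word, destination
 * of arbitrary alignment).  The function name rambuf_read_copy and the use
 * of memcpy to model the swl/swr pair are illustrative, not part of this
 * file.
 *
 *	#include <stddef.h>
 *	#include <stdint.h>
 *	#include <string.h>
 *
 *	void rambuf_read_copy(const uint32_t *src, char *dst, size_t bcount)
 *	{
 *		// word loop: one data word per 8 source bytes
 *		while (bcount >= sizeof(uint32_t)) {
 *			uint32_t v = *src;
 *			memcpy(dst, &v, sizeof v);	// dst may be unaligned (swl/swr)
 *			src += 2;			// skip the rambuf 'hole' word
 *			dst += sizeof(uint32_t);
 *			bcount -= sizeof(uint32_t);
 *		}
 *		// byte tail, as in rbytecopy
 *		for (const char *s = (const char *)src; bcount != 0; bcount--)
 *			*dst++ = *s++;
 *	}
 */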
/*
 * kn230_wbcopy(src, dst, bcount)
 *
 * NOTE: the optimal copy here is somewhat different than for the user-level
 * equivalents (kn230_wbcopy in 4.2, memcpy in V), because:
 * 1) it frequently acts on uncached data, especially since copying from
 *    (uncached) disk buffers into user pgms is a high runner.
 *    This means one must be careful with lwl/lwr/lb - don't expect cache help.
 * 2) the distribution of usage is very different: there are a large number
 *    of bcopies for small, aligned structures (like for ioctl, for example),
 *    a reasonable number of randomly-sized copies for user I/O, and many
 *    bcopies of large (page-size) blocks for stdio; the latter must be
 *    well-tuned, hence the use of 32-byte loops.
 * 3) this is much more frequently-used code inside the kernel than outside
 *
 * Overall copy-loop speeds, by amount of loop-unrolling; assumptions:
 * a) low icache miss rate (this code gets used a bunch)
 * b) large transfers, especially, will be word-alignable.
 * c) Copying speeds (steady state, 0% I-cache-miss, 100% D-cache Miss):
 * d) 100% D-Cache Miss (but cacheable, so that lwl/lwr/lb work well)
 *
 *	Config	Bytes/	Cycles/	Speed (VAX/780 = 1)
 *		Loop	Word
 *	08V11	1	35	0.71X	(8MHz, VME, 1-Deep WB, 1-way ILV)
 *		4	15	1.67X
 *		8/16	13.5	1.85X
 *		32/up	13.25	1.89X
 *	08MM44	1	26	0.96X	(8MHz, MEM, 4-Deep WB, 4-way ILV)
 *		4	9	2.78X
 *		8	7.5	3.33X
 *		16	6.75	3.70X
 *		32	6.375	3.92X	(diminishing returns thereafter)
 *
 * MINCOPY is the minimum number of bytes for which it is worthwhile to try
 * to align the copy into word transactions.  Calculations below are for
 * 8 bytes:
 *
 * Estimating MINCOPY (C = Cacheable, NC = Noncacheable):
 * Assumes 100% D-cache miss on first reference, then 0% (100%) for C (NC):
 * (Warning: these are gross numbers, and the code has changed slightly):
 *
 *	Case			08V11		08M44
 *	MINCOPY			C	NC	C	NC
 *	9 (1 byte loop)		75	133	57	93
 *	8 (complex logic)
 *	  Aligned		51	51	40	40
 *	  Alignable,
 *	    worst (1+4+3)	69	96	53	80
 *	  Unalignable		66	93	60	72
 *
 * MINCOPY should be lower for lower cache miss rates, lower cache miss
 * penalties, better alignment properties, or if src and dst alias in the
 * cache.  For this particular case, it seems very important to minimize the
 * number of lb/sb pairs: a) frequent non-cacheable references are used,
 * b) when the i-cache miss rate approaches zero, even the 4-deep WB can't
 * put successive sb's together in any useful way, so few references are saved.
 * To summarize, even as low as 8 bytes, avoiding the single-byte loop seems
 * worthwhile; some assumptions are probably optimistic, so there is not quite
 * as much disadvantage.  However, the optimal number is almost certainly in
 * the range 7-12.
 *
 *	a0	src addr
 *	a1	dst addr
 *	a2	length remaining
 */
#define	MINCOPY	4
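/*
 * Reference only: a minimal C sketch of the path selection the comment
 * above reasons about, assuming NBPW == 4 as on this machine.  The name
 * choose_path and the enum are illustrative, not part of this file.
 *
 *	#include <stddef.h>
 *	#include <stdint.h>
 *
 *	#define NBPW	4		// bytes per word
 *	#define MINCOPY	4		// matches the define above
 *
 *	enum copy_path { BYTE_LOOP, UNALIGNABLE, ALIGNABLE };
 *
 *	// Which kn230_wbcopy path would a given src/dst/bcount take?
 *	static enum copy_path choose_path(uintptr_t src, uintptr_t dst, size_t n)
 *	{
 *		if (n < MINCOPY)
 *			return BYTE_LOOP;	// wbytecopy: too short to bother
 *		if ((src ^ dst) & (NBPW - 1))
 *			return UNALIGNABLE;	// wunaligncopy: align dst, lwl/lwr src
 *		return ALIGNABLE;		// align both, then wblkcopy/wwordcopy
 *	}
 */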
LEAF(kn230_wbcopy)
	xor	v0,a0,a1		# bash src & dst for align chk; BDSLOT
	blt	a2,MINCOPY,wbytecopy	# too short, just byte copy
	and	v0,NBPW-1		# low-order bits for align chk
	subu	v1,zero,a0		# -src; BDSLOT
	bne	v0,zero,wunaligncopy	# src and dst not alignable

/*
 * src and dst can be simultaneously word aligned
 */
	and	v1,NBPW-1		# number of bytes til aligned
	subu	a2,v1			# bcount -= alignment
	beq	v1,zero,wblkcopy	# already aligned
#ifdef MIPSEB
	lwl	v0,0(a0)		# copy unaligned portion
	swl	v0,0(a1)
#endif
#ifdef MIPSEL
	lwr	v0,0(a0)
	swr	v0,0(a1)
#endif
	addu	a0,v1			# src += alignment
	addu	a1,v1			# dst += alignment
	add	a1,4
	and	a1,0xfffffff8		# align on an 8 byte boundary

/*
 * 32 byte block, aligned copy loop (for big reads/writes)
 */
#ifdef PROM
wblkcopy:
	li	t1,0x9fffffff		# need mask for caching
	la	v0,1f
	and	v0,v0,t1		# switch to kseg0
	j	v0			# run cached
#else
wblkcopy:
#endif
1:	and	a3,a2,~31		# total space in 32 byte chunks
	subu	a2,a3			# count after by-32 byte loop done
	beq	a3,zero,wwordcopy	# less than 32 bytes to copy
	addu	a3,a0			# source endpoint
2:	lw	v0,0(a0)
	lw	v1,4(a0)
	lw	t0,8(a0)
	lw	t1,12(a0)
	sw	v0,0(a1)		# dst fills every other longword of the rambuf
	sw	v1,0x8(a1)
	sw	t0,0x10(a1)
	sw	t1,0x18(a1)
	addu	a0,32			# src += 32; here to ease loop end
	lw	v0,-16(a0)
	lw	v1,-12(a0)
	lw	t0,-8(a0)
	lw	t1,-4(a0)
	sw	v0,0x20(a1)
	sw	v1,0x28(a1)
	sw	t0,0x30(a1)
	sw	t1,0x38(a1)
	addu	a1,0x40			# dst += 64; fills BD slot
	bne	a0,a3,2b

/*
 * word copy loop
 */
wwordcopy:
	and	a3,a2,~(NBPW-1)		# word chunks
	subu	a2,a3			# count after by word loop
	beq	a3,zero,wbytecopy	# less than a word to copy
	addu	a3,a0			# source endpoint
1:	lw	v0,0(a0)
	addu	a0,NBPW
	sw	v0,0(a1)
	addu	a1,NBPW*2		# dst += 8, skipping 'hole'; BD slot
	bne	a0,a3,1b
	b	wbytecopy

/*
 * deal with simultaneously unalignable copy by aligning dst
 */
wunaligncopy:
	subu	a3,zero,a1		# calc byte cnt to get dst aligned
	and	a3,NBPW-1		# alignment = 0..3
	subu	a2,a3			# bcount -= alignment
	beq	a3,zero,wpartaligncopy	# already aligned
#ifdef MIPSEB
	lwl	v0,0(a0)		# get whole word
	lwr	v0,3(a0)		# for sure
	swl	v0,0(a1)		# store left piece (1-3 bytes)
#endif
#ifdef MIPSEL
	lwr	v0,0(a0)		# get whole word
	lwl	v0,3(a0)		# for sure
	swr	v0,0(a1)		# store right piece (1-3 bytes)
#endif
	addu	a0,a3			# src += alignment (will fill LD slot)
	addu	a1,a3			# dst += alignment
	add	a1,4
	and	a1,0xfffffff8		# JN tremendous hack

/*
 * src unaligned, dst aligned loop
 * NOTE: if MINCOPY >= 7, will always do 1 loop iteration or more
 * if we get here at all
 */
wpartaligncopy:
	and	a3,a2,~(NBPW-1)		# space in word chunks
	subu	a2,a3			# count after by word loop
#if MINCOPY < 7
	beq	a3,zero,wbytecopy	# less than a word to copy
#endif
	addu	a3,a0			# source endpoint
1:
#ifdef MIPSEB
	lwl	v0,0(a0)
	lwr	v0,3(a0)
#endif
#ifdef MIPSEL
	lwr	v0,0(a0)
	lwl	v0,3(a0)
#endif
	addu	a0,NBPW
	sw	v0,0(a1)
	addu	a1,NBPW*2		# dst += 8, skipping the rambuf 'hole'
	bne	a0,a3,1b

/*
 * brute force byte copy loop, for bcount < MINCOPY + tail of unaligned dst
 * note that lwl, lwr, swr CANNOT be used for tail, since the lwr might
 * cross page boundary and give spurious address exception
 */
wbytecopy:
	addu	a3,a2,a0		# source endpoint; BDSLOT
	ble	a2,zero,wcopydone	# nothing left to copy, or bad length
1:	lb	v0,0(a0)
	addu	a0,1
	sb	v0,0(a1)
	addu	a1,1			# BDSLOT: incr dst address
	bne	a0,a3,1b
wcopydone:
	j	ra
	END(kn230_wbcopy)
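/*
 * Reference only: a hedged C-level view of the two entry points in this
 * file, based on the register usage documented above (a0 = src, a1 = dst,
 * a2 = byte count).  The void return type, pointer types, and the example
 * call (kernel_buf, rambuf_dst, len) are illustrative assumptions, not
 * taken from this file.
 *
 *	extern void kn230_rbcopy(const void *src, void *dst, unsigned bcount);
 *	extern void kn230_wbcopy(const void *src, void *dst, unsigned bcount);
 *
 *	// bcopy-style argument order: source, then destination, then count --
 *	// the reverse of memcpy(dst, src, n).
 *	kn230_wbcopy(kernel_buf, rambuf_dst, len);
 */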