lbn8086.asm

来自「包含哈希,对称以及非对称的经典算法包含经典事例」· 汇编代码 · 共 1,039 行 · 第 1/2 页
ASM
1,039 行
;;; Assembly primitives for bignum library, 80x86 family.;;;;;; Copyright (c) 1995, Colin Plumb.;;; For licensing and other legal details, see the file legal.c.;;;;;; Several primitives are included here.  Only lbnMulAdd1 is *really*;;; critical, but once that's written, lnmMul1 and lbnSub1 are quite;;; easy to write as well, so they are included here as well.;;; lbnDiv21 and lbnModQ are so easy to write that they're included, too.;;;;;; All functions here are for large code, large data.;;; All use standard "cdecl" calling convention: arguments pushed on the;;; stack (ss:sp) right to left (the leftmost agrument at the lowest address);;; and popped by the caller, return values in ax or dx:ax, and register;;; usage as follows:;;;;;; Callee-save (preserved by callee if needed):;;;	ss, esp, cs, eip, ds, esi, edi, ebp, high byte of FLAGS except DF,;;;	all other registers (CRx, DRx, TRx, IDT, GDT, LDT, TR, etc.).;;; Caller-save (may be corrupted by callee):;;;	es, eax, ebx, ecx, edx, low byte of flags (SF, ZF, AF, PF, CF);;;;;; The direction flag (DF) is either preserved or cleared.;;; I'm not sure what the calling convention is for fs and gs.  This;;; code never alters them.;; Not all of this code has to be '386 code, but STUPID FUCKING MASM (5.0);; gives an error if you change in the middle of a segment.  Rather than;; fight the thing, just enable '386 instructions everywhere.  (And lose;; the error checking.).386_TEXT   segment para public use16 'CODE'	; 16-byte aligned because '486 cares	assume  cs:_TEXT	public  _lbnMulN1_16	public  _lbnMulAdd1_16	public  _lbnMulSub1_16	public	_lbnDiv21_16	public	_lbnModQ_16	public  _lbnMulN1_32	public  _lbnMulAdd1_32	public  _lbnMulSub1_32	public	_lbnDiv21_32	public	_lbnModQ_32	public	_not386;; Prototype:;; BNWORD16;; lbnMulAdd_16(BNWORD16 *out, BNWORD16 *in, unsigned len, BNWORD16 k);;;; Multiply len words of "in" by k and add to len words of "out";;; return the len+1st word of carry.  All pointers are to the least-;; significant ends of the appropriate arrays.  len is guaraneed > 0.;;;; This 16-bit code is optimized for an 8086/80286.  It will not be run;; on 32-bit processors except for debugging during development.;;;; NOTE that it may be possible to assume that the direction flag is clear;; on entry; this would avoid the need for the cld instructions.  Hoewever,;; the Microsoft C libraries require that the direction flag be clear.;; Thus, lbnModQ_16 clears it before returning.;;;; Stack frame:;; +--------+ bp+18;; |   k    |;; +--------+ bp+16;; |  len   |;; +--------+ bp+14;; |        |;; +-  in  -+;; |        |;; +--------+ bp+10;; |        |;; +- out  -+;; |        |;; +--------+ bp+6;; |        |;; +-return-+;; |        |;; +--------+ bp+2;; | old bp |;; +--------+ bp;;;; Register usage for lbnMul1_16:;; ds:[si]	in;; es:[di]	out;; bp		k;; cx		loop counter (len/4);; dx,ax	high,low parts of product;; bx		carry from previous multiply iteration;;;; Register usage for lbnMulAdd1_16 and lbnMulSub1_16:;; ds:[si]	in;; es:[bx+si]	out;; bp		k;; cx		loop counter (len/4);; dx,ax	high,low parts of product;; di		carry from previous multiply iteration;;;; The reson for the difference is that straight mul can use stosw, but;; the multiply and add or multiply and subtract add the result in, so;; they have to reference es:[di] to add it in.;;;; The options are either "add ax,es:[di]; stosw" or "add es:[di],ax;;; add di,2"; both take 10 cycles on an 80286, 27 on an 8086 and 35 on;; an 8088 although the former is preferred since it's one byte smaller.;; However, using [bx+si] is even faster; "add es:[bx+si],ax" takes;; 7 cycles on an 80286, 25 on an 8086 and 33 on an 8088, as well as;; being the smallest.  (Of course, stosw, at 3 on an 80286, 11 on an;; 8086 amd 15 on an 8088 wins easily in the straight multiply case over;; mov es:[bx+si],ax, which takes 3/18/22 cycles and is larger to boot.);;;; Most of these register assignments are driven by the 8086's instruction;; set.  The only really practical variation would be to put the multiplier;; k into bx or di and use bp for carry, but if someone can make a faster;; Duff's device using a lookup table, bx and di are useful because indexing;; off them is more flexible than bp.;;;; Overview of code:;;;; len is guaranteed to be at least 1, so do the first multiply (with no;; carry in) unconditionally.  Then go to a min loop unrolled 4 times,;; jumping into the middle using a variant of Duff's device.;;;; The loop is constructed using the loop instruction, which does;; "} while (--cnt)".  This means that we have to divide the count;; by 4, and increment it so it doesn't start at 0.  To gain a little;; bit more efficiency, we actually increment the count by 2, so the;; minimum possible value is 3, which will be shifted down to produce 0.;; usually in Duff's device, if the number of iterations is a multiple;; of the unrolling factor, you branch to just before the loop conditional;; and let it handle the case of 0.  Here, we have a special test for 0;; at the head of the loop and fall through into the top of the loop;; if it passes.;;;; Basically, with STEP being a multiply step, it's:;;;; 	STEP;;;	count += 2;;;	mod4 = count % 4;;;	count /= 4;;;	switch(mod4) {;;	  case 3:;;		if (count) {;;	  		do {;;				STEP;;;	  case 2:;;				STEP;;;	  case 1:;;				STEP;;;	  case 0:;;				STEP;;;			} while (--count);;;		};;	};;;; The switch() is actually done by two levels of branch instructions;; rather than a lookup table._lbnMulN1_16	proc	far	push	bp	mov	bp,sp	push	ds	push	si	push	di	cld	les	di,[bp+6]	; out	lds	si,[bp+10]	; in	mov	cx,[bp+14]	; len	mov     bp,[bp+16]	; k;; First multiply step has no carry in	lodsw	mul	bp	stosw;; The switch() for Duff's device starts here;; Note: this *is* faster than a jump table for an 8086 and '286.;; 8086:  jump table: 44 cycles; this: 27/29/31/41;; 80286: jump table: 25 cycles; this: 17/17/20/22	shr	cx,1	jc	SHORT m16_odd	inc	cx	shr	cx,1	jc	SHORT m16_case2	jmp	SHORT m16_case0	nop			; To align loopm16_odd:	inc	cx	shr	cx,1	jnc	SHORT m16_case1	jz	SHORT m16_done	; Avoid entire loop in this casem16_loop:	lodsw	mov	bx,dx		; Remember carry for later	mul	bp	add	ax,bx		; Add carry in from previous word	adc	dx,0	stoswm16_case2:	lodsw	mov	bx,dx		; Remember carry for later	mul	bp	add	ax,bx		; Add carry in from previous word	adc	dx,0	stoswm16_case1:	lodsw	mov	bx,dx		; Remember carry for later	mul	bp	add	ax,bx		; Add carry in from previous word	adc	dx,0	stoswm16_case0:	lodsw	mov	bx,dx		; Remember carry for later	mul	bp	add	ax,bx		; Add carry in from previous word	adc	dx,0	stosw	loop	m16_loopm16_done:	mov	ax,dx	stosw			; Store last word	pop	di	pop	si	pop	ds	pop	bp	ret_lbnMulN1_16	endp	align	2_lbnMulAdd1_16	proc	far	push	bp	mov	bp,sp	push	ds	push	si	push	di	cld	les	bx,[bp+6]	; out	lds	si,[bp+10]	; in	mov	cx,[bp+14]	; len	mov     bp,[bp+16]	; k;; First multiply step has no carry in	lodsw	mul	bp	add	es:[bx],ax	; This time, store in [bx] directly	adc	dx,0	sub	bx,si		; Prepare to use [bx+si].;; The switch() for Duff's device starts here;; Note: this *is* faster than a jump table for an 8086 and '286.;; 8086:  jump table: 44 cycles; this: 27/29/31/41;; 80286: jump table: 25 cycles; this: 17/17/20/22	shr	cx,1	jc	SHORT ma16_odd	inc	cx	shr	cx,1	jc	SHORT ma16_case2	jmp	SHORT ma16_case0ma16_odd:	inc	cx	shr	cx,1	jnc	SHORT ma16_case1	jz	SHORT ma16_done	; Avoid entire loop in this casema16_loop:	lodsw	mov	di,dx		; Remember carry for later	mul	bp	add	ax,di		; Add carry in from previous word	adc	dx,0	add	es:[bx+si],ax	adc	dx,0ma16_case2:	lodsw	mov	di,dx		; Remember carry for later	mul	bp	add	ax,di		; Add carry in from previous word	adc	dx,0	add	es:[bx+si],ax	adc	dx,0ma16_case1:	lodsw	mov	di,dx		; Remember carry for later	mul	bp	add	ax,di		; Add carry in from previous word	adc	dx,0	add	es:[bx+si],ax	adc	dx,0ma16_case0:	lodsw	mov	di,dx		; Remember carry for later	mul	bp	add	ax,di		; Add carry in from previous word	adc	dx,0	add	es:[bx+si],ax	adc	dx,0	loop	ma16_loopma16_done:	mov	ax,dx	pop	di	pop	si	pop	ds	pop	bp	ret_lbnMulAdd1_16	endp	align	2_lbnMulSub1_16	proc	far	push	bp	mov	bp,sp	push	ds	push	si	push	di	cld	les	bx,[bp+6]	; out	lds	si,[bp+10]	; in	mov	cx,[bp+14]	; len	mov     bp,[bp+16]	; k;; First multiply step has no carry in	lodsw	mul	bp	sub	es:[bx],ax	; This time, store in [bx] directly	adc	dx,0	sub	bx,si		; Prepare to use [bx+si].;; The switch() for Duff's device starts here;; Note: this *is* faster than a jump table for an 8086 and '286.;; 8086:  jump table: 44 cycles; this: 27/29/31/41;; 80286: jump table: 25 cycles; this: 17/17/20/22	shr	cx,1	jc	SHORT ms16_odd	inc	cx	shr	cx,1	jc	SHORT ms16_case2	jmp	SHORT ms16_case0ms16_odd:	inc	cx	shr	cx,1	jnc	SHORT ms16_case1	jz	SHORT ms16_done	; Avoid entire loop in this casems16_loop:	lodsw	mov	di,dx		; Remember carry for later	mul	bp	add	ax,di		; Add carry in from previous word	adc	dx,0	sub	es:[bx+si],ax	adc	dx,0ms16_case2:	lodsw	mov	di,dx		; Remember carry for later	mul	bp	add	ax,di		; Add carry in from previous word	adc	dx,0	sub	es:[bx+si],ax	adc	dx,0ms16_case1:	lodsw	mov	di,dx		; Remember carry for later	mul	bp	add	ax,di		; Add carry in from previous word	adc	dx,0	sub	es:[bx+si],ax	adc	dx,0ms16_case0:	lodsw	mov	di,dx		; Remember carry for later	mul	bp	add	ax,di		; Add carry in from previous word	adc	dx,0	sub	es:[bx+si],ax	adc	dx,0	loop	ms16_loopms16_done:	mov	ax,dx	pop	di	pop	si	pop	ds	pop	bp	ret_lbnMulSub1_16	endp;; Two-word by one-word divide.  Stores quotient, returns remainder.;; BNWORD16 lbnDiv21_16(BNWORD16 *q, BNWORD16 nh, BNWORD16 nl, BNWORD16 d);;                      4            8            10           12	align	2_lbnDiv21_16	proc	far	mov	cx,bp		; bp NOT pushed; note change in offsets	mov	bp,sp	mov	dx,[bp+8]	mov	ax,[bp+10]	div	WORD PTR [bp+12]	les	bx,[bp+4]	mov	es:[bx],ax	mov	ax,dx	mov	bp,cx	ret	nop		; To align loop in lbnModQ properly_lbnDiv21_16	endp;; Multi-word by one-word remainder.;; BNWORD16 lbnModQ_16(BNWORD16 *q, unsigned len, unsigned d);;                     6            10            12_lbnModQ_16	proc	far	push	bp	mov	bp,sp	push	ds	mov	bx,si	mov	cx,10[bp]	; load len	lds	si,6[bp]	; load q	std			; loop MSW to LSW	add	si,cx	mov	bp,12[bp]	; load d	add	si,cx	xor	dx,dx		; Set up for first divide	sub	si,2		; Adjust pointer to point to MSW	lodsw			; Load first word	cmp	ax,bp		; See if we can skip first divide	jnc	SHORT modq16_inner	; No such luck	mov	dx,ax		; Yes!  Modulus > input, so remainder = input	dec	cx		; Do loop	jz	SHORT modq16_donemodq16_loop:	lodswmodq16_inner:	div	bp	loop	modq16_loopmodq16_done:	pop	ds	mov	ax,dx	; Return remainder	pop	bp	mov	si,bx	cld		; Microsoft C's libraries assume this	ret_lbnModQ_16	endp;; Similar, but using 32-bit operations.;;;; The differences are that the switch() in Duff's device is done using;; a jump table, and lods is not used because it's slower than load and;; increment.  The pointers are only updated once per loop; offset;; addressing modes are used, since they're no slower.  [di] is used;; instead of [bx+si] because the extra increment of di take only one;; cycle per loop a '486, while [bx+si] takes one extra cycle per multiply.;;;; The register assignments are also slightly different:;;;; es:[si]	in;; ds:[di]	out;; ecx		k;; bp		loop counter (len/4);; edx,eax	high,low parts of product;; ebx		carry word from previous multiply iteration;;;; The use of bp for a loop counter lets all the 32-bit values go;; in caller-save registers, so there's no need to do any 32-bit;; saves and restores.  Using ds:di for the destination saves one;; segment override in the lbnMulN1_32 code, since there's one more;; store to [di] than load from es:[si].;;;; Given the number of 32-bit references that this code uses, optimizing;; it for the Pentium is interesting, because the Pentium has a very;; inefficient implementation of prefix bytes.  Each prefix byte, with;; the exception of 0x0f *>> on conditional branch instructions ONLY <<*;; is a 1-cycle non-pairiable instruction.  Which has the effect of;; forcing the instruction it's on into the U pipe.  But this code uses;; *lots* of prefix bytes, notably the 0x66 operand size override.;;;; For example "add [di],eax" is advised against in Intel's optimization;; papers, because it takes 3 cycles and 2 of them are not pairable.;; But any longer sequence would have a prefix byte on every instruction,;; resulting in even more non-pairable cycles.  Also, only two instructions;; in the multiply kernel can go in the V pipe (the increments of si and;; di), and they're already there, so the pairable cycles would be wasted.;;;; Things would be *quite* different in native 32-bit mode.;;;; All instructions that could go in the V pipe that aren't there are;; marked.;;;; The setup code is quite intricately interleaved to get the best possible
lbn8086.asm - 源码说明

本页面展示了「包含哈希,对称以及非对称的经典算法包含经典事例」中的 lbn8086.asm 源码文件，采用汇编编程语言编写，共 1,039 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。
虫虫下载站收录了大量与对称相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。
⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?