Internals of the Netwide Assembler
==================================

The Netwide Assembler is intended to be a modular, re-usable x86
assembler, which can be embedded in other programs, for example as
the back end to a compiler.

The assembler is composed of modules. The interfaces between them
look like:

                  +--- preproc.c ----+
                  |                  |
                  +---- parser.c ----+
                  |        |         |
                  |     float.c      |
                  |                  |
                  +--- assemble.c ---+
                  |        |         |
        nasm.c ---+     insnsa.c     +--- nasmlib.c
                  |                  |
                  +--- listing.c ----+
                  |                  |
                  +---- labels.c ----+
                  |                  |
                  +--- outform.c ----+
                  |                  |
                  +----- *out.c -----+

In other words, each of `preproc.c', `parser.c', `assemble.c',
`labels.c', `listing.c', `outform.c' and each of the output format
modules `*out.c' are independent modules, which do not directly
inter-communicate except through the main program.

The Netwide *Disassembler* is not intended to be particularly
portable or reusable or anything, however. So I won't bother
documenting it here. :-)

nasmlib.c
---------

This is a library module; it contains simple library routines which
may be referenced by all other modules. Among these are a set of
wrappers around the standard `malloc' routines, which will report a
fatal error if they run out of memory, rather than returning NULL.

preproc.c
---------

This contains a macro preprocessor, which takes a file name as input
and returns a sequence of preprocessed source lines. The only symbol
exported from the module is `nasmpp', which is a data structure of
type `Preproc', declared in nasm.h. This structure contains pointers
to all the functions designed to be callable from outside the
module.

parser.c
--------

This contains a source-line parser. It parses `canonical' assembly
source lines, containing some combination of the `label', `opcode',
`operand' and `comment' fields: it does not process directives or
macros.
It exports two functions: `parse_line' and `cleanup_insn'.

`parse_line' is the main parser function: you pass it a source line
in ASCII text form, and it returns you an `insn' structure
containing all the details of the instruction on that line. The
parameters it requires are:

- The location (segment, offset) where the instruction on this line
  will eventually be placed. This is necessary in order to evaluate
  expressions containing the Here token, `$'.

- A function which can be called to retrieve the value of any
  symbols the source line references.

- Which pass the assembler is on: an undefined symbol only causes an
  error condition on pass two.

- The source line to be parsed.

- A structure to fill with the results of the parse.

- A function which can be called to report errors.

Some instructions (DB, DW, DD for example) can require an arbitrary
amount of storage, and so some of the members of the resulting
`insn' structure will be dynamically allocated. The other function
exported by `parser.c' is `cleanup_insn', which can be called to
deallocate any dynamic storage associated with the results of a
parse.

names.c
-------

This doesn't count as a module - it defines a few arrays which are
shared between NASM and NDISASM, so it's a separate file which is
#included by both parser.c and disasm.c.

float.c
-------

This is essentially a library module: it exports one function,
`float_const', which converts an ASCII representation of a
floating-point number into an x86-compatible binary representation,
without using any built-in floating-point arithmetic (so it will run
on any platform, portably). It calls nothing, and is called only by
`parser.c'. Note that the function `float_const' must be passed an
error reporting routine.

assemble.c
----------

This module contains the code generator: it translates `insn'
structures as returned from the parser module into actual generated
code which can be placed in an output file.
It exports two functions, `assemble' and `insn_size'.

`insn_size' is designed to be called on pass one of assembly: it
takes an `insn' structure as input, and returns the amount of space
that would be taken up if the instruction described in the structure
were to be converted to real machine code. `insn_size' also requires
to be told the location (as a segment/offset pair) where the
instruction would be assembled, the mode of assembly (16/32 bit
default), and a function it can call to report errors.

`assemble' is designed to be called on pass two: it takes all the
parameters that `insn_size' does, but has an extra parameter which
is an output driver. `assemble' actually converts the input
instruction into machine code, and outputs the machine code by means
of calling the `output' function of the driver.

insnsa.c
--------

This is another library module: it exports one very big array of
instruction translations. It is generated automatically from the
insns.dat file by the insns.pl script.

labels.c
--------

This module contains a label manager. It exports six functions:

`init_labels' should be called before any other function in the
module. `cleanup_labels' may be called after all other use of the
module has finished, to deallocate storage.

`define_label' is called to define new labels: you pass it the name
of the label to be defined, and the (segment,offset) pair giving the
value of the label. It is also passed an error-reporting function,
and an output driver structure (so that it can call the output
driver's label-definition function). `define_label' mentally
prepends the name of the most recently defined non-local label to
any label beginning with a period.

`define_label_stub' is designed to be called in pass two, once all
the labels have already been defined: it does nothing except to
update the "most-recently-defined-non-local-label" status, so that
references to local labels in pass two will work correctly.

`declare_as_global' is used to declare that a label should be
global.
It must be called _before_ the label in question is defined.

Finally, `lookup_label' attempts to translate a label name into a
(segment,offset) pair. It returns non-zero on success.

The label manager module is (theoretically :) restartable: after
calling `cleanup_labels', you can call `init_labels' again, and
start a new assembly with a new set of symbols.

listing.c
---------

This file contains the listing file generator. The interface to the
module is through the one symbol it exports, `nasmlist', which is a
structure containing six function pointers. The calling semantics of
these functions isn't terribly well thought out, as yet, but it
works (just about) so it's going to get left alone for now...

outform.c
---------

This small module contains a set of routines to manage a list of
output formats, and select one given a keyword. It contains three
small routines: `ofmt_register' which registers an output driver as
part of the managed list, `ofmt_list' which lists the available
drivers on stdout, and `ofmt_find' which tries to find the driver
corresponding to a given name.

The output modules
------------------

Each of the output modules, `outbin.o', `outelf.o' and so on,
exports only one symbol, which is an output driver data structure
containing pointers to all the functions needed to produce output
files of the appropriate type.

The exception to this is `outcoff.o', which exports _two_ output
driver structures, since COFF and Win32 object file formats are very
similar and most of the code is shared between them.

nasm.c
------

This is the main program: it calls all the functions in the above
modules, and puts them together to form a working assembler. We
hope.
:-)

Segment Mechanism
-----------------

In NASM, the term `segment' is used to separate the different
sections/segments/groups of which an object file is composed.
Essentially, every address NASM is capable of understanding is
expressed as an offset from the beginning of some segment.

The defining property of a segment is that if two symbols are
declared in the same segment, then the distance between them is
fixed at assembly time. Hence every externally-declared variable
must be declared in its own segment, since none of the locations of
these are known, and so no distances may be computed at assembly
time.

The special segment value NO_SEG (-1) is used to denote an absolute
value, e.g. a constant whose value does not depend on relocation,
such as the _size_ of a data object.

Apart from NO_SEG, segment indices all have their least significant
bit clear, if they refer to actual in-memory segments. For each
segment of this type, there is an auxiliary segment value, defined
to be the same number but with the LSB set, which denotes the
segment-base value of that segment, for object formats which support
it (Microsoft .OBJ, for example).

Hence, if `textsym' is declared in a code segment with index 2, then
referencing `SEG textsym' would return zero offset from
segment-index 3. Or, in object formats which don't understand such
references, it would return an error instead.

The next twist is SEG_ABS. Some symbols may be declared with a
segment value of SEG_ABS plus a 16-bit constant: this indicates that
they are far-absolute symbols, such as the BIOS keyboard buffer
under MS-DOS, which always resides at 0040h:001Eh. Far-absolutes are
handled with care in the parser, since they are supposed to evaluate
simply to their offset part within expressions, but applying SEG to
one should yield its segment part.
A far-absolute should never find its way _out_ of the parser, unless
it is enclosed in a WRT clause, in which case Microsoft 16-bit
object formats will want to know about it.

Porting Issues
--------------

We have tried to write NASM in portable ANSI C: we do not assume
little-endianness or any hardware characteristics (in order that
NASM should work as a cross-assembler for x86 platforms, even when
run on other, stranger machines).

Assumptions we _have_ made are:

- We assume that `short' is at least 16 bits, and `long' at least
  32. This really _shouldn't_ be a problem, since Kernighan and
  Ritchie tell us we are entitled to do so.

- We rely on having more than 6 characters of significance on
  externally linked symbols in the NASM sources. This may get fixed
  at some point. We haven't yet come across a linker brain-dead
  enough to get it wrong anyway.

- We assume that `fopen' using the mode "wb" can be used to write
  binary data files. This may be wrong on systems like VMS, with a
  strange file system. Though why you'd want to run NASM on VMS is
  beyond me anyway.

That's it. Subject to those caveats, NASM should be completely
portable. If not, we _really_ want to know about it.

Porting Non-Issues
------------------

The following is _not_ a portability problem, although it looks like
one.

- When compiling with some versions of DJGPP, you may get warnings
  such as `warning: ANSI C forbids braced-groups within
  expressions'. This isn't NASM's fault - the problem seems to be
  that DJGPP's definitions of the <ctype.h> macros include a
  GNU-specific C extension. So when compiling using -ansi and
  -pedantic, DJGPP complains about its own header files. It isn't a
  problem anyway, since it still generates correct code.
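
As a footnote to the Segment Mechanism section, the numbering rules
described there (NO_SEG, the LSB convention for segment-base values,
and SEG_ABS plus a 16-bit constant) can be sketched in a few lines of
C. This is only an illustration, not code from the NASM sources: the
helper names are hypothetical, and the numeric value used for SEG_ABS
is an assumption, since the text above does not state it.

```c
#include <assert.h>

#define NO_SEG  (-1L)         /* absolute value, as stated in the text */
#define SEG_ABS 0x40000000L   /* ASSUMED value: the text gives none */

/* Hypothetical helper: non-zero if `seg' denotes an actual in-memory
 * segment, i.e. it is not NO_SEG, not far-absolute, and has LSB clear. */
static int is_real_segment(long seg)
{
    return seg >= 0 && seg < SEG_ABS && (seg & 1L) == 0;
}

/* Hypothetical helper: the auxiliary segment value denoting the
 * segment base, defined as the same number with the LSB set. */
static long seg_base_of(long seg)
{
    assert(is_real_segment(seg));
    return seg | 1L;
}

/* Hypothetical helper: non-zero if `seg' is SEG_ABS plus a 16-bit
 * constant, i.e. a far-absolute segment value. */
static int is_far_absolute(long seg)
{
    return seg >= SEG_ABS && seg <= SEG_ABS + 0xFFFFL;
}

/* Hypothetical helper: the segment part that applying SEG to a
 * far-absolute symbol should yield. */
static long far_absolute_segment(long seg)
{
    assert(is_far_absolute(seg));
    return seg - SEG_ABS;
}
```

Under these assumptions, seg_base_of(2) gives 3, which matches the
`SEG textsym' example, and far_absolute_segment(SEG_ABS + 0x0040L)
gives 0x0040, the segment part of the keyboard buffer address. The
constants fit comfortably in a 32-bit `long', in keeping with the
porting assumptions listed above.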