A brief introduction to the disassem package.
=============================================
1. How can you break down the byte stream of an executable file
into instructions?
2. Can we use some kind of parsing tool (a compiler compiler)
such as yacc to transform the structure (grammar) into
the parsing code itself?
3. So what is preccx anyway?
4. What do we need to know to understand Windows executable files?
5. Combine the two things and we get the basic disassembler.
6. What is the problem anyway?
7. Can we solve the problem?
8. What's the next step?
9. Some explanation of each file and directory.
--------------------------------------------------------------------------
1. How can you break down the byte stream of an executable file
into instructions?
--------------------------------------------------------------------------
I worked out the following structure for Intel Pentium instructions.
The structure does not reflect any architectural considerations;
it was chosen for decoding purposes only.
<instruction> = <prefixes>* <instruction_body>
% an instruction consists of zero or more prefixes followed by
  an instruction body
<instruction_body> = <one_byte_instruction>
| "0x0F" <two_byte_instruction>
% an instruction body is either a one-byte instruction
  or "0x0F" followed by a two-byte instruction
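
The two rules above can be sketched in Python roughly as follows. This is only an illustration: the prefix set is an assumption (segment overrides, operand-size, address-size, and lock bytes), and the package's own list may differ, for instance in how it treats 0xF2 and 0xF3. The name split_instruction is hypothetical.

```python
# Assumed prefix bytes: segment overrides, operand/address size, lock.
# This set is an illustration, not taken from the disassem sources.
PREFIXES = {0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65, 0x66, 0x67, 0xF0}

def split_instruction(code, pos=0):
    """Return (prefixes, is_two_byte, opcode_pos) for the instruction at pos."""
    start = pos
    # <prefixes>* : consume zero or more prefix bytes
    while pos < len(code) and code[pos] in PREFIXES:
        pos += 1
    if pos >= len(code):
        raise ValueError("ran out of bytes before the opcode")
    prefixes = bytes(code[start:pos])
    # "0x0F" <two_byte_instruction> : 0x0F escapes to the second opcode table
    if code[pos] == 0x0F:
        if pos + 1 >= len(code):
            raise ValueError("0x0F escape without a second opcode byte")
        return prefixes, True, pos + 1
    # <one_byte_instruction>
    return prefixes, False, pos
```

The returned opcode_pos is where the decoder would look up the opcode class next.
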
<one_byte_instruction>
= <opcode0>
| <opcode1> <byte>
| <opcode2> <word>
| <opcode3> <word> <byte>
| <opcode4> <double_word>
| <opcode5> <pointer_word>
| <opcode6> <mod_reg_or_mem>
| <opcode7> <mod_reg_or_mem> <byte>
| <opcode8> <mod_reg_or_mem> <double_word>
| <opcode9> <opcode_extention>
| <opcode10> <opcode_extention> <byte>
| <opcode11> <opcode_extention> <double_word>
| <opcode12> <opcode_extention_group>
| <case_jump_block>
| <test_group>
| <wait_group>
| <repeat_group>
% there are 17 sub-cases for a one-byte instruction.
  Byte, word, and double word are well understood;
  a pointer word consists of 6 bytes.
  The complication comes from the mod_reg_or_mem byte (or bytes),
  which determines the addressing mode of the instruction.
  The opcode extension is the same byte as the mod_reg_or_mem byte,
  but its role is different.
  A case jump block needs special attention because there may be many
  addresses following the instruction itself.
  The test group, wait group, and repeat group also differ
  from the other cases.
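
For the first six alternatives the instruction length follows directly from the opcode class. A minimal Python sketch, assuming 32-bit operand and address sizes: the sample opcodes are picked from the Intel opcode map for illustration and are not the package's own tables, and the names TRAILING_BYTES, SAMPLE_CLASS, and simple_length are hypothetical.

```python
# Extra bytes that follow the opcode for <opcode0>..<opcode5>.
TRAILING_BYTES = {
    "opcode0": 0,  # nothing follows
    "opcode1": 1,  # <byte>
    "opcode2": 2,  # <word>
    "opcode3": 3,  # <word> <byte>
    "opcode4": 4,  # <double_word>
    "opcode5": 6,  # <pointer_word> = <word> <double_word>
}

# Illustrative opcodes only; the class assignment is an assumption.
SAMPLE_CLASS = {
    0x90: "opcode0",  # nop
    0x6A: "opcode1",  # push imm8
    0xC2: "opcode2",  # ret imm16
    0xC8: "opcode3",  # enter imm16, imm8
    0xE8: "opcode4",  # call rel32
    0x9A: "opcode5",  # call far ptr16:32
}

def simple_length(code, pos=0):
    """Length in bytes of a one-byte instruction with a fixed-size operand."""
    cls = SAMPLE_CLASS[code[pos]]
    length = 1 + TRAILING_BYTES[cls]
    if pos + length > len(code):
        raise ValueError("truncated instruction")
    return length
```
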
<two_byte_instruction>
= <opcode20>
| <opcode21> <double_word>
| <opcode22> <mod_reg_or_mem>
| <opcode23> <mod_reg_or_mem> <byte>
| <opcode24> <opcode_extention>
| <opcode25> <opcode_extention> <byte>
% there are 7 sub-cases for a two-byte instruction;
  everything is explained above or in what follows
<mod_reg_or_mem> = <mod1>
| <mod2> <sib_star> <double_word>
| <mod2> <sib_non_star>
| <mod3> <double_word>
| <mod4> <byte>
| <mod5> <sib> <byte>
| <mod6> <double_word>
| <mod7> <sib> <double_word>
| <mod8>
<sib_star> = <byte>
<sib_non_star> = <byte>
<sib> = <byte>
<word> = <byte> <byte>
<double_word> = <byte> <byte> <byte> <byte>
<pointer_word> = <word> <double_word>
<mod*> = <byte>
<opcode*> = <byte>
% mod_reg_or_mem can be just one byte, or it can be
  followed by a sib (scale index base) byte; either form
  can be followed by a byte or a double word, depending on the case.
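
The nine <mod_reg_or_mem> alternatives appear to correspond to the standard 32-bit ModR/M rules, with <sib_star> read here as a SIB byte whose base field is 5. That reading is an assumption on my part. Under it, the number of bytes the operand occupies can be computed with a sketch like this (the function name is hypothetical):

```python
def mod_reg_or_mem_length(code, pos):
    """Bytes used by the ModR/M byte at pos plus any SIB and displacement."""
    modrm = code[pos]
    mod = modrm >> 6          # top two bits
    rm = modrm & 0x07         # bottom three bits
    length = 1                # the ModR/M byte itself
    if mod == 3:
        return length         # register operand, nothing follows
    has_sib = rm == 4
    if has_sib:
        length += 1
    if mod == 0:
        if has_sib:
            base = code[pos + 1] & 0x07
            if base == 5:     # SIB with no base register: disp32 follows
                length += 4
        elif rm == 5:         # plain 32-bit address
            length += 4
    elif mod == 1:
        length += 1           # 8-bit displacement
    else:                     # mod == 2
        length += 4           # 32-bit displacement
    return length
```

The sketch covers 32-bit addressing only; 16-bit addressing (selected by an address-size prefix) uses different rules and is not handled.
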
<case_jump_block> = "0xFF" "0x24" <sib> <label_start_position> <label>*
<label_start_position>
= <label>
<label> = <double_word>
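
A possible way to read a case jump block in Python is sketched below. The grammar does not say where the label list ends, so this sketch stops at the first double word that falls outside a caller-supplied address range [lo, hi). That stop rule, the function name, and the parameters are all assumptions for illustration.

```python
import struct

def read_case_jump_block(code, pos, lo, hi):
    """Parse FF 24 <sib> <label_start_position> followed by 32-bit labels.

    lo and hi bound the addresses accepted as labels (hypothetical stop rule).
    Returns (table_address, labels, next_pos).
    """
    if code[pos:pos + 2] != b"\xFF\x24":
        raise ValueError("not a case jump")
    pos += 3                                        # opcode, 0x24, <sib>
    (table,) = struct.unpack_from("<I", code, pos)  # <label_start_position>
    pos += 4
    labels = []
    while pos + 4 <= len(code):
        (target,) = struct.unpack_from("<I", code, pos)  # little-endian dword
        if not lo <= target < hi:
            break                                   # no longer looks like a label
        labels.append(target)
        pos += 4
    return table, labels, pos
```
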
<test_group> = "0xF6" <reg/0> <byte>
| "0xF6" <reg> <opcode_extention>
| "0xF7" <reg/0> <double_word>
| "0xF7" <opcode_extention>
<reg/0> = <reg> % register part is zero
<reg> = <byte>
<reg/6> = <reg> % register part is 6
<reg/7> = <reg> % register part is 7
<wait_group> = "0x9B" "0xD9" <reg/6>
| "0x9B" "0xD9" <reg/7>
| "0x9B" "0xDB" "0xE2"
| "0x9B" "0xDB" "0xE3"
| "0x9B" "0xDD" <reg/6>
| "0x9B" "0xDD" <reg/7>
| "0x9B" "0xDF" "0xE0"
| "0x9B"
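
The wait group shows why lookahead matters: a lone 0x9B is a complete instruction, but it can also begin a longer one. A minimal Python sketch that tries the longer forms first and falls back to the bare byte; the numbering of alternatives and all names are my own, and the forms that test the register part are matched on bits 5..3 of the third byte only.

```python
# Alternatives are numbered 1..8 in the order the rule lists them.
FIXED_THIRD = {(0xDB, 0xE2): 3, (0xDB, 0xE3): 4, (0xDF, 0xE0): 7}
REG_TESTED = {(0xD9, 6): 1, (0xD9, 7): 2, (0xDD, 6): 5, (0xDD, 7): 6}

def match_wait_group(code, pos=0):
    """Return (alternative, fixed_bytes_matched) for a 0x9B byte at pos."""
    if code[pos] != 0x9B:
        raise ValueError("not a wait-group instruction")
    if pos + 2 < len(code):
        second, third = code[pos + 1], code[pos + 2]
        if (second, third) in FIXED_THIRD:
            return FIXED_THIRD[(second, third)], 3
        reg = (third >> 3) & 0x07          # register part of the third byte
        if (second, reg) in REG_TESTED:
            return REG_TESTED[(second, reg)], 3
    return 8, 1                            # plain 0x9B on its own
```

Any memory operand bytes after the third byte are not measured here.
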
<repeat_group> = "0xF2" <opcode> % for printing purpose
| "0xF3" <opcode>
| "0xF3" <opcode>
Of course, for simplicity, not all of the structures are presented here.
--------------------------------------------------------------
2. Can we use some kind of parsing tool (a compiler compiler)
such as yacc to transform the structure (grammar) into
the parsing code itself?
--------------------------------------------------------------
Of course. We should use a parser generator to finish the project
as early as possible. I chose the parser generator "preccx"
because I had no experience with yacc, and I wanted a more capable
and possibly better tool for my disassembler.
Since preccx generates a top-down parser with unlimited backtracking,
the script itself is easy to construct and easy to modify, and it puts
a lot of power in your hands. Having never used yacc, I cannot compare
the two tools directly, but preccx seems the better one to me.
Learning to use a tool as complicated as preccx or yacc is painful,
and it is not easy. But since I provide the preccx script for the
disassembler, you can skip this part.
Preccx generates C source code which is modular and very compact.
It is incremental, which means you can break a script into parts and
process each part independently. This may not mean much to you yet.
But after you get used to this tool, then ...
To run the generated parser, you need runtime support functions, which are
also included in the preccx package. I had to modify part of the runtime,
though: it accepts input from the keyboard, whereas I need to process data
from disk, where end-of-line handling does not matter.
---------------------------------------------------------------------
3. So what is preccx anyway?
---------------------------------------------------------------------
Preccx is a compiler compiler. It converts preccx-style
context-grammar definition scripts (with a .y extension)
into C source files (with a .c extension). The output code
compiles under ANSI C compilers such as the GNU Project's
gcc(1).
There is an easy-to-use hook for lex(1) tokenisers.
Preccx extends the UNIX yacc(1) utility by allowing:
[0] Contextual definitions. Each grammar definition may be
parameterized with contexts. For example, some languages
determine whether a declaration is local (and to what) or
global in scope by relative indentation, and this can be
encoded in preccx using the number of spaces indentation as
a parameter, n:
@ decl(n) = <' '>*n expr <'\n'> decl(n+1)*
This definition is intended to mean that a "decl" indented
by n spaces consists of n spaces, an expression, and a
newline, optionally followed by one or several "decl"s indented
still further.
[1] Infinite lookahead and backtracking in place of the yacc
1-token lookahead. This means that preccx parsers
distinguish correctly between sentences of the form `foo bah gum'
and `foo bah NAY' on a single pass. If you cannot imagine
why one should want to decide between the two, think about
`if ... then ...' and `if ... then ... else ... '.
[2] Arbitrarily complex expressions. This means that
compound definitions like
explain {{this | that} {several | no} times}+
are legal within preccx definition scripts.
[3] Preccx has the postfix operators `*' (zero or more
times), `*n' (exactly n times), `+' (one or more times), and
`!' (execute accumulated actions now) built in, along with
the `[ ]' (optionally) outfix operator. For example, the
following means `exactly n spaces':
@ space(n) = <' '>*n
The other built-ins are
`?' (any token)
`^' (beginning of line)
`$' (end of line)
`|' (or, placed between alternate phrases of the
grammar)
`{ }' (grouping brackets)
`< >' (around literals)
`> <' (to mean `not a particular literal')
`( )' around the name of a BOOLEAN valued predicate on
tokens, defined as an int 1 or 0 -valued C function
elsewhere in the script, and
`) (' (anti-brackets) round a C expression of BOOLEAN
type, meaning a logical test condition.
`]..[' anti-brackets hide an expression, causing it to
be required but ignored.
`]a[ b' means that input must satisfy both a and b, while `a
]b[' means that b is trailing context.
`$!' is a shorthand for matching end-of-line followed by
execution of pending actions (it also causes the input
buffer to start being written from the beginning again). It
is roughly equivalent to the conjunction `$ !', but more
efficient.
`a b c' (conjunction) is the term denoting an expression
consisting of an `a expression' followed by a `b expression'
followed by a `c expression'. An example of a preccx script
follows in the section USAGE.
[4] Modular output. Parts of a script can be preccx'ed
separately, compiled separately, and then linked together
later, which makes maintenance and version control easy.
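
To give a feel for the operators listed above, here is a minimal Python sketch of a few of them as parser combinators. It is not how preccx is implemented: the choice operator here only rewinds to the position where it started, which is a simplification of preccx's full backtracking, and every name in it (lit, seq, alt, times, many, some, opt, sentence) is hypothetical.

```python
# Each parser takes (tokens, pos) and returns the new position, or None.

def lit(value):                    # <'x'>  a literal token
    return lambda toks, i: i + 1 if i < len(toks) and toks[i] == value else None

def seq(*parsers):                 # a b c  (conjunction)
    def run(toks, i):
        for p in parsers:
            i = p(toks, i)
            if i is None:
                return None
        return i
    return run

def alt(*parsers):                 # a | b
    def run(toks, i):
        for p in parsers:
            j = p(toks, i)         # every alternative restarts at i
            if j is not None:
                return j
        return None
    return run

def times(parser, n):              # a*n  (exactly n times)
    return seq(*[parser] * n)

def many(parser):                  # a*   (zero or more times)
    def run(toks, i):
        while True:
            j = parser(toks, i)
            if j is None or j == i:
                return i
            i = j
    return run

def some(parser):                  # a+   (one or more times)
    return seq(parser, many(parser))

def opt(parser):                   # [ a ]  (optionally)
    return alt(parser, lambda toks, i: i)

# The two sentences from item [1]; they differ only in the last token.
sentence = alt(seq(lit("foo"), lit("bah"), lit("gum")),
               seq(lit("foo"), lit("bah"), lit("NAY")))
```

Calling sentence("foo bah NAY".split(), 0) first fails on the `gum' alternative at the third token, then retries from position 0 with the second alternative and succeeds.
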