A brief introduction to the disassem package.
=============================================

1. How can you break down the byte stream of an executable file
   into instructions?
2. Can we use some kind of parsing tool (a compiler compiler)
   such as yacc to transform the structure (grammar) into
   the parsing code itself?
3. So what is preccx anyway?
4. What do we need to know to understand Windows executable files?
5. Combine the two and we get the basic disassembler.
6. What is the problem anyway?
7. Can we solve the problem?
8. What is the next step?
9. Some explanation of each file and directory.


--------------------------------------------------------------------------
1. How can you break down the byte stream of an executable file
   into instructions?
--------------------------------------------------------------------------

I found the following structure for Intel Pentium instructions.
The structure is not related to any architectural considerations;
it was chosen for decoding purposes only.

   <instruction>       =    <prefixes>* <instruction_body> 
   
 % an instruction consists of zero or more prefixes followed by
   an instruction body

   <instruction_body>  =    <one_byte_instruction>      	
                       |    "0x0F"  
					        <two_byte_instruction>     
 
 % an instruction body is either a one-byte instruction
   or "0x0F" followed by a two-byte instruction
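
The two rules above can be sketched in C as follows. This is a minimal
hand-written sketch, not the parser that preccx generates for the package.
The names is_prefix, split_instruction and struct split are hypothetical.
The prefix set is the usual IA-32 segment override, operand-size,
address-size and LOCK bytes (an assumption about what <prefixes> covers);
0xF2, 0xF3 and 0x9B are left out because the grammar handles them in
<repeat_group> and <wait_group>.

```c
#include <stddef.h>

/* Hypothetical helper: 1 if b is a segment override, operand-size,
   address-size or LOCK prefix, 0 otherwise. */
static int is_prefix(unsigned char b)
{
    switch (b) {
    case 0x26: case 0x2E: case 0x36: case 0x3E:   /* ES, CS, SS, DS */
    case 0x64: case 0x65:                         /* FS, GS */
    case 0x66: case 0x67:                         /* operand/address size */
    case 0xF0:                                    /* LOCK */
        return 1;
    default:
        return 0;
    }
}

/* Result of splitting one instruction into prefixes and body. */
struct split {
    size_t prefix_count;  /* number of prefix bytes skipped */
    int    two_byte;      /* 1 if the body starts with the 0x0F escape */
    size_t opcode_index;  /* index of the opcode byte itself */
    int    ok;            /* 0 if the buffer ended before an opcode byte */
};

static struct split split_instruction(const unsigned char *buf, size_t len)
{
    struct split s = {0, 0, 0, 0};
    size_t i = 0;

    /* <prefixes>* : skip zero or more prefix bytes */
    while (i < len && is_prefix(buf[i]))
        i++;
    s.prefix_count = i;

    /* "0x0F" selects the <two_byte_instruction> alternative */
    if (i < len && buf[i] == 0x0F) {
        s.two_byte = 1;
        i++;
    }

    /* whatever comes next is the opcode byte of the body */
    if (i < len) {
        s.opcode_index = i;
        s.ok = 1;
    }
    return s;
}
```

For the bytes 66 0F AF C3, the sketch reports one prefix, the two-byte
form, and the opcode byte at index 2.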
   
   <one_byte_instruction>   
                       =    <opcode0>        
					   |    <opcode1> <byte>
					   |    <opcode2> <word>
					   |    <opcode3> <word> <byte>
					   |    <opcode4> <double_word>
					   |    <opcode5> <pointer_word>
					   |    <opcode6> <mod_reg_or_mem>
					   |    <opcode7> <mod_reg_or_mem> <byte>
					   |    <opcode8> <mod_reg_or_mem> <double_word>
					   |    <opcode9> <opcode_extention>
					   |    <opcode10> <opcode_extention> <byte>
					   |    <opcode11> <opcode_extention> <double_word>
					   |    <opcode12> <opcode_extention_group>
					   |    <case_jump_block>
					   |    <test_group>
					   |    <wait_group>
					   |    <repeat_group>

  % There are 17 sub cases for a one-byte instruction.
    Byte, word and double word are well understood, and
    a pointer word consists of 6 bytes.
    The complication comes from the mod_reg_or_mem byte (or bytes),
    which determines the addressing mode of the instruction.
    The opcode extension is the same byte as the mod_reg_or_mem byte,
    but its role is different.
    A case jump block needs special attention because many addresses
    may follow the instruction itself.
    The test group, wait group and repeat group also differ
    from the other cases.

   <two_byte_instruction>
                       =    <opcode20>
					   |    <opcode21> <double_word>
					   |    <opcode22> <mod_reg_or_mem>
					   |    <opcode23> <mod_reg_or_mem> <byte>
                       |    <opcode24> <opcode_extention> 
					   |    <opcode25> <opcode_extention> <byte>

 % six sub cases of the two-byte instruction are listed here;
   everything is explained above or in the following

   <mod_reg_or_mem>    =    <mod1>
                       |    <mod2> <sib_star> <double_word>
					   |    <mod2> <sib_non_star>
					   |    <mod3> <double_word>
					   |    <mod4> <byte>
					   |    <mod5> <sib> <byte>
					   |    <mod6> <double_word>
					   |    <mod7> <sib> <double_word>
					   |    <mod8>

   <sib_star>          =    <byte>
   <sib_non_star>      =    <byte>
   <sib>               =    <byte>
   <word>              =    <byte> <byte>
   <double_word>       =    <byte> <byte> <byte> <byte>
   <pointer_word>      =    <word> <double_word>
   <mod*>     		   =    <byte>
   <opcode*>           =    <byte>
  
 % mod_reg_or_mem can be just one byte, or
   it can be followed by a sib (scale index base) byte, and
   either form can be followed by a byte or a double word,
   depending on the case.
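
As a rough illustration of the cases above, the following C sketch computes
how many bytes a <mod_reg_or_mem> occupies under the standard IA-32 32-bit
addressing rules. The function name modrm_length is hypothetical, and the
mapping of its branches onto <mod1> ... <mod8> is my reading of the
grammar, not something the package states.

```c
#include <stddef.h>

/* Number of bytes taken by <mod_reg_or_mem> in 32-bit addressing mode:
   the ModRM byte itself, an optional SIB byte and an optional
   displacement. Returns 0 if the buffer is too short to decide. */
static size_t modrm_length(const unsigned char *p, size_t len)
{
    unsigned mod, rm;
    size_t n = 1;                     /* the ModRM byte */

    if (len < 1)
        return 0;
    mod = p[0] >> 6;                  /* bits 7..6 */
    rm  = p[0] & 7;                   /* bits 2..0 */

    if (mod == 3)
        return n;                     /* register operand, nothing follows */

    if (rm == 4) {                    /* a SIB byte follows */
        if (len < 2)
            return 0;
        n++;
        if (mod == 0 && (p[1] & 7) == 5)
            n += 4;                   /* SIB with base field 5: disp32 */
    } else if (mod == 0 && rm == 5) {
        n += 4;                       /* no base register: disp32 only */
    }

    if (mod == 1)
        n += 1;                       /* 8-bit displacement */
    else if (mod == 2)
        n += 4;                       /* 32-bit displacement */

    return n;
}
```

The SIB case with base field 5 and mod 0 is what the grammar appears to
call <sib_star>: it is the only SIB form that drags a double word along
without the mod bits asking for one.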
   
   <case_jump_block>   =    "0xFF" "0x24" <sib> <label_start_position> <label>*

   <label_start_position>
                       =    <label>

   <label>             =    <double_word>
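
Each <label> is a double word, which on Intel processors is stored least
significant byte first. A minimal sketch of reading the labels of a case
jump block might look like this. The names read_u32_le and read_labels are
hypothetical, and the sketch assumes the caller already knows an upper
bound on the number of labels; how the package itself decides where the
table ends is not described here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble a little-endian double word from four bytes. */
static uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Read up to max labels from buf (len bytes) into out.
   Stops early when fewer than four bytes remain.
   Returns the number of labels actually read. */
static size_t read_labels(const unsigned char *buf, size_t len,
                          uint32_t *out, size_t max)
{
    size_t n = 0;

    while (n < max && (n + 1) * 4 <= len) {
        out[n] = read_u32_le(buf + n * 4);
        n++;
    }
    return n;
}
```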

 
   <test_group>        =    "0xF6" <reg/0> <byte>
                       |    "0xF6" <reg> <opcode_extention>
					   |    "0xF7" <reg/0> <double_word>
					   |    "0xF7" <opcode_extention>

   <reg/0>             =    <reg>	         % register part is zero
   <reg>               =    <byte>
   <reg/6>             =    <reg>            % register part is 6
   <reg/7>             =    <reg>            % register part is 7
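
The "register part" mentioned in these rules is the 3-bit field in bits
5..3 of the byte that follows the opcode. The sketch below extracts it and
applies the <test_group> rule that only the /0 form carries an immediate.
The function names are hypothetical, and a 32-bit operand size is assumed
for 0xF7.

```c
/* Register part (bits 5..3) of the byte following the opcode. */
static unsigned reg_field(unsigned char modrm)
{
    return (modrm >> 3) & 7u;
}

/* Size in bytes of the immediate operand for a <test_group> opcode,
   following the grammar: 0xF6 with register part 0 takes a byte,
   0xF7 with register part 0 takes a double word, and every other
   combination is treated as a plain opcode extension (no immediate). */
static unsigned test_group_imm_size(unsigned char opcode, unsigned char modrm)
{
    if (reg_field(modrm) != 0)
        return 0;
    if (opcode == 0xF6)
        return 1;
    if (opcode == 0xF7)
        return 4;
    return 0;
}
```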

   <wait_group>        =    "0x9B" "0xD9" <op/6>
                       |    "0x9B" "0xD9" <op/7>
					   |    "0x9B" "0xDB" "0xE2"
					   |    "0x9B" "0xDB" "0xE3"
					   |    "0x9B" "0xDD" <op/6>
					   |    "0x9B" "0xDD" <op/7>
					   |    "0x9B" "0xDF" "0xE0"
					   |    "0x9B"



   <repeat_group>      =    "0xF2" <opcode>		   % for printing purpose
                       |    "0xF3" <opcode>
					   |    "0xF3" <opcode>


Of course, for simplicity, not all of the structures are presented here.


--------------------------------------------------------------
2. Can we use some kind of parsing tool (a compiler compiler)
   such as yacc to transform the structure (grammar) into
   the parsing code itself?
--------------------------------------------------------------

Of course we should use a parser generator, so that the project can be
finished as early as possible. I chose the parser generator "preccx"
because I had no experience with yacc, and I wanted a more capable and
possibly better tool for my disassembler.

Since preccx produces top-down parsers with unlimited backtracking,
the script itself is easy to construct and easy to modify, and it puts
a lot of power in your hands. I have no experience with yacc, so I
cannot really compare the two tools, but to me this tool seems better.

Learning a tool as complicated as preccx or yacc is painful and not
easy. But since I provide the preccx script for the disassembler,
you can skip this part.

Preccx generates C source code which is modular and very compact.
It is incremental, which means you can break a script into parts and
process each part independently. Never mind if this does not mean much
to you yet. After you get used to this tool, then  ...

To run the generated parser you need the runtime support functions, which
are also included in the preccx package. I had to modify part of the
runtime, though, since preccx accepts input from the keyboard, while I
need to process data from disk and have no use for end-of-line handling.


---------------------------------------------------------------------
3. So what is preccx anyway?
---------------------------------------------------------------------

     Preccx is a  compiler  compiler.  It  converts  preccx-style
     context-grammar  definition  scripts  (with  a .y extension)
     into C code scripts (with a .c extension). The  output  code
     compiles under ANSI C compilers such as the GNU project's
     gcc(1).

     There is an easy-to-use hook for lex(1) tokenisers.

     Preccx extends the UNIX yacc(1) utility by allowing:

     [0] Contextual definitions. Each grammar definition  may  be
     parameterized  with  contexts.   For example, some languages
     determine whether a declaration is local (and  to  what)  or
     global  in  scope  by  relative indentation, and this can be
     encoded in preccx using the number of spaces indentation  as
     a parameter, n:

           @ decl(n) = <' '>*n expr <'\n'> decl(n+1)*

     This definition is intended to mean that a "decl" indented
     by n spaces consists of n spaces, an expression, and a
     newline, optionally followed by one or several "decl"s
     indented still further.

     [1] Infinite lookahead and backtracking in place of the yacc
     1-token lookahead. This means that preccx parsers distinguish
     correctly between sentences of the form `foo bah gum' and
     `foo bah NAY' on a single pass. If you cannot imagine why one
     should want to decide between the two, think about
     `if ... then ...' and `if ... then ... else ... '.

     [2] Arbitrarily complex expressions. This means that
     compound definitions like

          explain {{this | that} {several | no} times}+

     are legal within preccx definition scripts.

     [3] Preccx has the  postfix  operators  `*'  (zero  or  more
     times), `*n' (exactly n times), `+' (one or more times), and
     `!' (execute accumulated actions now) built in,  along  with
     the  `[ ]'  (optionally)  outfix  operator. For example, the
     following means `exactly n spaces':

           @ space(n) = <' '>*n

     The other built-ins are

          `?' (any token)

          `^' (beginning of line)

          `$' (end of line)

          `|' (or, placed between alternate phrases of the
          grammar)

          `{ }' (grouping brackets)

          `< >' (around literals)

          `> <' (to mean `not a particular literal')

          `( )' around the name of a BOOLEAN valued predicate  on
          tokens,  defined  as  an  int 1 or 0 -valued C function
          elsewhere in the script, and

          `) (' (anti-brackets) round a C expression  of  BOOLEAN
          type, meaning a logical test condition.

          `]..[' anti-brackets hide an expression, causing it  to
          be required but ignored.

     `]a[ b' means that input must satisfy both a and b, while `a
     ]b[' means that b is trailing context.

     `$!'  is a shorthand for matching  end-of-line  followed  by
     execution  of  pending  actions  (it  also  causes the input
     buffer to start being written from the beginning again).  It
     is roughly  equivalent  to the  conjunction '! $',  but more
     efficient.

     `a b c' (conjunction) is the  term  denoting  an  expression
     consisting of an `a expression' followed by a `b expression'
     followed by a `c expression'. An example of a preccx  script
     follows in the section USAGE.

     [4] Modular output.  Parts of  a  script  can  be  preccx'ed
     separately,  compiled  separately,  and then linked together
     later, which makes maintenance and version control easy.
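
To make the backtracking of item [1] concrete, here is a minimal
hand-written C sketch of what "try one alternative, rewind, try the next"
amounts to for the sentences `foo bah gum' and `foo bah NAY'. It is not
code generated by preccx and does not use the preccx runtime; the names
match_seq and classify are hypothetical, and tokens are assumed to be
plain C strings.

```c
#include <stddef.h>
#include <string.h>

/* Match the literals lits[0..n-1] against toks starting at *pos.
   On success advance *pos past them and return 1. On failure restore
   *pos to where it was (the backtracking step) and return 0. */
static int match_seq(const char *const *toks, size_t ntoks, size_t *pos,
                     const char *const *lits, size_t n)
{
    size_t saved = *pos;
    size_t i;

    for (i = 0; i < n; i++) {
        if (*pos >= ntoks || strcmp(toks[*pos], lits[i]) != 0) {
            *pos = saved;             /* rewind the input */
            return 0;
        }
        (*pos)++;
    }
    return 1;
}

/* Return 1 for `foo bah gum', 2 for `foo bah NAY', 0 otherwise. */
static int classify(const char *const *toks, size_t ntoks)
{
    static const char *const alt1[] = {"foo", "bah", "gum"};
    static const char *const alt2[] = {"foo", "bah", "NAY"};
    size_t pos = 0;

    /* First alternative; it only counts if all input is consumed. */
    if (match_seq(toks, ntoks, &pos, alt1, 3) && pos == ntoks)
        return 1;

    /* Start over from the beginning for the second alternative. */
    pos = 0;
    if (match_seq(toks, ntoks, &pos, alt2, 3) && pos == ntoks)
        return 2;

    return 0;
}
```

A parser limited to one token of lookahead has to commit before it sees
the third word; here the first alternative simply fails on `NAY', the
position is restored, and the second alternative is tried from the start.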