{smcl}
{* 14feb2005}{...}
{cmd:help dta}
{hline}

{title:Title}

{p2colset 5 30 32 2}{...}
{p2col :{hi:[P] file formats .dta} {hline 2}}Description of .dta file format{p_end}
{p2colreset}{...}


{title:Description}

{pstd}
The information contained in this highly technical entry probably does not
interest you.  We describe in detail the format of Stata {cmd:.dta} datasets
for those interested in writing programs in C or other languages that read and
write them.


{title:Remarks}

{pstd}
Remarks are presented under the headings:

	{help dta##intro:1.  Introduction}
	{help dta##versions:2.  Versions and flavors of Stata}
	{help dta##strings:3.  Representation of strings}
	{help dta##numbers:4.  Representation of numbers}
	{help dta##definition:5.  Dataset format definition}
	    {help dta##header:5.1  Header}
	    {help dta##descriptors:5.2  Descriptors}
	    {help dta##variable_labels:5.3  Variable labels}
	    {help dta##expansion_fields:5.4  Expansion fields}
	    {help dta##data:5.5  Data}
	    {help dta##value_labels:5.6  Value labels}


{marker intro}{...}
{title:1.  Introduction}

{pstd}
Stata-format datasets record data in a way generalized to work across computers
that do not agree on how data are recorded.  Given a computer, datasets are
divided into two categories:  native-format and foreign-format datasets. Stata
uses the following two rules:

{p 8 13 2}
    R1.  On any computer, Stata knows how to write only native-format
	 datasets.

{p 8 13 2}
    R2.  On all computers, Stata can read foreign-format as well as
	 native-format datasets.

{pstd}
Rules R1 and R2 ensure that Stata users need not be concerned with
dataset formats.  If you are writing a program to read and write Stata
datasets, you will have to determine whether you want to follow the same rules
or instead restrict your program to operate on only native-format datasets.
Since Stata follows rules R1 and R2, such a restriction would not be too
limiting.  If the user had a foreign-format dataset, he or she could enter
Stata, {helpb use} the data, and then {helpb save} it again.


{marker versions}{...}
{title:2.  Versions and flavors of Stata}

{pstd}
Stata is continually being updated, and these updates sometimes require changes
to how Stata records {cmd:.dta} datasets.  This entry describes
what are known as {hi:format-113} datasets, the most modern format.  Stata
itself can read older formats, but whenever it writes a dataset, it writes in
{hi:113} format.

{pstd}
There are currently three flavors of Stata available:  small Stata, regular
Stata, and {help SpecialEdition:Stata/SE}.  The same {hi:113} format is used
by all flavors.  The flavors differ only in capacity:  datasets can be larger
in some flavors, and Stata/SE can write datasets with longer strings.


{marker strings}{...}
{title:3.  Representation of strings}

{phang}
1.  Strings in Stata may be from 1 to 80 bytes long (small and regular Stata)
    or from 1 to 244 bytes long (Stata/SE).

{phang}
2.  Stata records a string with a trailing binary 0 ({cmd:\0}) delimiter if
    the length of the string is less than the maximum declared length.  The
    string is recorded without the delimiter if the string is of the maximum
    length.

{phang}
3.  Leading and trailing blanks are significant.

{phang}
4.  Strings use ASCII encoding.
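
{pstd}
Rules 1 through 4 might be applied as in the following minimal sketch, written
in Python rather than C for brevity.  The names {cmd:decode_str} and
{cmd:encode_str} are hypothetical, and filling the unused bytes with {cmd:\0}
when writing is an assumption; the rules above require only the first
delimiter.

```python
def decode_str(field):
    """Decode one fixed-width string field, stopping at the first \\0 if any."""
    end = field.find(b"\0")
    if end != -1:
        field = field[:end]
    # Leading and trailing blanks are significant, so nothing is stripped.
    return field.decode("ascii")


def encode_str(value, width):
    """Encode value into exactly width bytes for a variable of that length."""
    raw = value.encode("ascii")
    if len(raw) > width:
        raise ValueError("string is longer than the declared length")
    # A string of maximum length gets no delimiter; shorter ones are padded.
    return raw + b"\0" * (width - len(raw))
```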


{marker numbers}{...}
{title:4.  Representation of numbers}

{phang}
1.  Numbers are represented as 1-, 2-, and 4-byte integers and 4- and 8-byte
    floats.  In the case of floats, ANSI/IEEE Standard 754-1985 format is used.

{phang}
2.  Byte ordering varies across machines for all numeric types.  Bytes are
    ordered either least-significant to most-significant, dubbed LOHI,
    or most-significant to least-significant, dubbed HILO. Pentiums, for
    instance, use LOHI encoding.  The HP-RISC and
    Sun SPARC-based computers use HILO encoding.

{phang}
3.  When reading a HILO number on a LOHI machine or a LOHI
    number on a HILO machine, perform the following before interpreting
    the number:

	    byte          no translation necessary
	    2-byte int    swap bytes 0 and 1
	    4-byte int    swap bytes 0 and 3, 1 and 2
	    4-byte float  swap bytes 0 and 3, 1 and 2
	    8-byte float  swap bytes 0 and 7, 1 and 6, 2 and 5, 3 and 4

{phang}
4.  For purposes of written documentation, numbers are written with the most
    significant byte listed first.  Thus, {cmd:0x0001} refers to a
    2-byte integer taking on the logical value 1 on all machines.

{phang}
5.  Stata has five numeric data types.  They are

	    {cmd:byte}          1-byte signed int
	    {cmd:int}           2-byte signed int
	    {cmd:long}          4-byte signed int
	    {cmd:float}         4-byte IEEE float
	    {cmd:double}        8-byte IEEE float

{phang}
6.  Each type allows for 27 {help missing:missing value codes}, known as
    {cmd:.}, {cmd:.a}, {cmd:.b}, ..., {cmd:.z}.
    For each type, the range allowed for nonmissing values and the
    missing-value codes are

	    {cmd:byte}
		minimum nonmissing    -127   (0x81)
		maximum nonmissing    +100   (0x64)
		code for {cmd:.}            +101   (0x65)
		code for {cmd:.a}           +102   (0x66)
		code for {cmd:.b}           +103   (0x67)
		...
		code for {cmd:.z}           +127   (0x7f)

	    {cmd:int}
		minimum nonmissing    -32767 (0x8001)
		maximum nonmissing    +32740 (0x7fe4)
		code for {cmd:.}            +32741 (0x7fe5)
		code for {cmd:.a}           +32742 (0x7fe6)
		code for {cmd:.b}           +32743 (0x7fe7)
		...
		code for {cmd:.z}           +32767 (0x7fff)

	    {cmd:long}
		minimum nonmissing    -2,147,483,647  (0x80000001)
		maximum nonmissing    +2,147,483,620  (0x7fffffe4)
		code for {cmd:.}            +2,147,483,621  (0x7fffffe5)
		code for {cmd:.a}           +2,147,483,622  (0x7fffffe6)
		code for {cmd:.b}           +2,147,483,623  (0x7fffffe7)
		...
		code for {cmd:.z}           +2,147,483,647  (0x7fffffff)

	    {cmd:float}
		minimum nonmissing    -1.701e+38  (-1.fffffeX+7e)  {it:(sic)}
		maximum nonmissing    +1.701e+38  (+1.fffffeX+7e)
		code for {cmd:.}                        (+1.000000X+7f)
		code for {cmd:.a}                       (+1.001000X+7f)
		code for {cmd:.b}                       (+1.002000X+7f)
		...
		code for {cmd:.z}                       (+1.01a000X+7f)

	    {cmd:double}
		minimum nonmissing    -1.798e+308 (-1.fffffffffffffX+3ff)
		maximum nonmissing    +8.988e+307 (+1.fffffffffffffX+3fe)
		code for {cmd:.}                        (+1.0000000000000X+3ff)
		code for {cmd:.a}                       (+1.0010000000000X+3ff)
		code for {cmd:.b}                       (+1.0020000000000X+3ff)
		...
		code for {cmd:.z}                       (+1.01a0000000000X+3ff)

{pmore}
Note that for {cmd:float}, all {it:z}>1.fffffeX+7e, and for {cmd:double}, all
{it:z}>1.fffffffffffffX+3fe are considered to be missing values; only a
subset of those values is labeled {cmd:.}, {cmd:.a}, {cmd:.b},
..., {cmd:.z}.  For example, a value between {cmd:.a} and {cmd:.b} is still
considered to be missing and, in particular, all the values between {cmd:.a}
and {cmd:.b} are known jointly as {cmd:.a_}.  Nevertheless, you should avoid
recording such in-between values.
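
{pmore}
The ranges in the table above can be turned into a missing-value test, as in
this minimal Python sketch.  The names {cmd:MAX_NONMISSING}, {cmd:is_missing},
and {cmd:int_missing_code} are hypothetical; the limits are copied from the
table, and the values passed in are assumed to be already decoded from the
file.

```python
# Largest nonmissing value of each type, taken from the table above.
MAX_NONMISSING = dict(
    byte=100,
    int=32740,
    long=2147483620,
    float=float.fromhex("0x1.fffffep+126"),
    double=float.fromhex("0x1.fffffffffffffp+1022"),
)


def is_missing(value, type_name):
    """True if value lies above the largest nonmissing value of its type."""
    return value > MAX_NONMISSING[type_name]


def int_missing_code(value, type_name):
    """Name the code ('.', '.a', ..., '.z') of a byte, int, or long value.

    Returns None when the value is nonmissing.
    """
    offset = value - MAX_NONMISSING[type_name] - 1
    if offset < 0:
        return None
    if offset == 0:
        return "."
    return "." + chr(ord("a") + offset - 1)
```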

{pmore}
In the table above, we have used the
{c -(}{cmd:+}|{cmd:-}{c )-}{cmd:1.}{it:<digits>}{cmd:X}{c -(}{cmd:+}|{cmd:-}{c )-}{it:<digits>}
notation.  The number to the left of the {cmd:X} is to be interpreted as a
base-16 number (the period is thus the base-16 point) and the number to the
right (also recorded in base 16) is to be interpreted as the power of 2
{it:(sic)}.  For example,

	    1.01aX+3ff = (1.01a) * 2^(3ff)                        (base 16)
		       = (1 + 0/16 + 1/16^2 + 10/16^3) * 2^1023   (base 10)
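
{pmore}
The notation can also be evaluated mechanically.  The following is a minimal
Python sketch; {cmd:parse_x} is a hypothetical helper, and it relies on
Python's {cmd:float.fromhex} accepting a signed base-16 fraction.

```python
import math


def parse_x(text):
    """Convert a string such as '+1.01aX+3ff' to a Python float (a double)."""
    mantissa, exponent = text.split("X")
    # Left of X: base-16 number.  Right of X: power of 2, written in base 16.
    return math.ldexp(float.fromhex(mantissa), int(exponent, 16))
```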

{pmore}
The
{c -(}{cmd:+}|{cmd:-}{c )-}{cmd:1.}{it:<digits>}{cmd:X}{c -(}{cmd:+}|{cmd:-}{c )-}{it:<digits>}
notation easily converts to IEEE 8-byte double:
the {cmd:1} is the hidden bit, the digits to the right of the hexadecimal
point are the mantissa bits, and the exponent is the IEEE exponent in signed
(removal of offset) form.
For instance, pi = 3.1415927... is

					    8-byte IEEE, HILO
					 {hline 23}
	    pi = +1.921fb54442d18X+001 = 40 09 21 fb 54 44 2d 18

				       = 18 2d 44 54 fb 21 09 40
					 {hline 23}
					    8-byte IEEE, LOHI
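
{pmore}
The two byte sequences above can be reproduced with Python's {cmd:struct}
module, where the prefix {cmd:>} requests HILO and {cmd:<} requests LOHI.
This is only an illustrative check, not part of the format definition.

```python
import math
import struct

hilo = struct.pack(">d", math.pi)  # most-significant byte first
lohi = struct.pack("<d", math.pi)  # least-significant byte first

print(hilo.hex(" "))  # prints 40 09 21 fb 54 44 2d 18
print(lohi.hex(" "))  # prints 18 2d 44 54 fb 21 09 40

# Reading a foreign-format double amounts to reversing its 8 bytes.
assert lohi == hilo[::-1]
```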

{pmore}
Converting
{c -(}{cmd:+}|{cmd:-}{c )-}{cmd:1.}{it:<digits>}{cmd:X}{c -(}{cmd:+}|{cmd:-}{c )-}{it:<digits>}
to IEEE 4-byte float is more difficult, but the same rule applies:  the
{cmd:1} is the hidden bit, the digits to the right of the hexadecimal point
are the mantissa bits, and the exponent is the IEEE exponent in signed
(removal of offset) form.  What makes it more difficult is that the
sign-and-exponent in the IEEE 4-byte format occupy 9 bits, which is not
divisible by four, and so everything is shifted one bit.  In float:

				      4-byte IEEE, HILO
					 {hline 11}
	    pi = +1.921fb60000000X+001 = 40 49 0f db

				       = db 0f 49 40
					 {hline 11}
				      4-byte IEEE, LOHI

{pmore}
The easiest way to obtain the above result is to first convert
+1.921fb60000000X+001 to an 8-byte double and then convert the 8-byte double
to a 4-byte float.
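
{pmore}
That two-step route is sketched below in Python, under the assumption that
the language's own double-to-float conversion is acceptable; here
{cmd:struct.pack} with the {cmd:f} code performs it.

```python
import struct

# Step 1: +1.921fb60000000X+001 as an 8-byte double.
as_double = float.fromhex("0x1.921fb6p+1")

# Step 2: narrow the double to a 4-byte float in each byte order.
hilo = struct.pack(">f", as_double)
lohi = struct.pack("<f", as_double)

print(hilo.hex(" "))  # prints 40 49 0f db
print(lohi.hex(" "))  # prints db 0f 49 40
```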

{pmore}
In any case, the relevant numbers are

	    V            value                HILO             LOHI
	    {hline 63}
	    m    -1.fffffffffffffX+3ff   ffefffffffffffff  ffffffffffffefff
	    M    +1.fffffffffffffX+3fe   7fdfffffffffffff  ffffffffffffdf7f
	    {cmd:.}    +1.0000000000000X+3ff   7fe0000000000000  000000000000e07f
	    {cmd:.a}   +1.0010000000000X+3ff   7fe0010000000000  000000000001e07f
	    {cmd:.b}   +1.0020000000000X+3ff   7fe0020000000000  000000000002e07f
	    {cmd:.z}   +1.01a0000000000X+3ff   7fe01a0000000000  00000000001ae07f

	    m    -1.fffffeX+7e           feffffff          fffffffe
	    M    +1.fffffeX+7e           7effffff          ffffff7e
	    {cmd:.}    +1.000000X+7f           7f000000          0000007f
	    {cmd:.a}   +1.001000X+7f           7f000800          0008007f
	    {cmd:.b}   +1.002000X+7f           7f001000          0010007f
	    {cmd:.z}   +1.01a000X+7f           7f00d000          00d0007f
	    {hline 63}
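
{pmore}
The missing-value rows of the table follow a regular pattern:  each code adds
one unit in the third base-16 digit of the mantissa.  The minimal Python
sketch below generates them; {cmd:double_missing} and {cmd:float_missing} are
hypothetical names, with index 0 meaning {cmd:.} and 1 through 26 meaning
{cmd:.a} through {cmd:.z}.

```python
import math
import struct


def double_missing(index):
    """8-byte missing value: +1.0X+3ff plus index steps of 0.001 (base 16)."""
    return math.ldexp(1.0, 1023) + index * math.ldexp(1.0, 1023 - 12)


def float_missing(index):
    """4-byte missing value: +1.0X+7f plus index steps of 0.001 (base 16)."""
    return math.ldexp(1.0, 127) + index * math.ldexp(1.0, 127 - 12)


def hilo_hex(value, code):
    """Hex digits of value packed HILO with struct code 'd' or 'f'."""
    return struct.pack(">" + code, value).hex()
```

{pmore}
For instance, {cmd:hilo_hex(double_missing(1), "d")} yields
{cmd:7fe0010000000000}, the HILO entry for {cmd:.a}.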


{marker definition}{...}
{title:5.  Dataset format definition}

{pstd}
Stata-format datasets contain six components, which are, in
order,

	1.  Header
	2.  Descriptors
	3.  Variable Labels
	4.  Expansion Fields
	5.  Data
	6.  Value Labels


{marker header}{...}
{title:5.1  Header}

{pstd}
The Header is defined as

	Contents            Length    Format    Comments
	{hline}
	{cmd:ds_format}                1    byte      contains 113 = 0x71
	{cmd:byteorder}                1    byte      0x01 -> HILO, 0x02 -> LOHI
	{cmd:filetype}                 1    byte      0x01
	unused                   1    byte      0x00
	{cmd:nvar} (number of vars)    2    int       encoded per {cmd:byteorder}
	{cmd:nobs} (number of obs)     4    int       encoded per {cmd:byteorder}
	{cmd:data_label}              81    char      dataset label, \0 terminated
	{cmd:time_stamp}              18    char      date/time saved, \0 terminated
	{hline}
	Total                  109


{pstd}
{cmd:time_stamp[17]} must be set to binary zero.  When writing a dataset, you
may record the time stamp as blank ({cmd:time_stamp[0]}=\0), but you must still
set {cmd:time_stamp[17]} to binary zero as well.  If you choose to write a
time stamp, its format is

	{it:dd Mon yyyy hh}{cmd::}{it:mm}

{pstd}
{it:dd} and {it:hh} may be written with or without leading zeros, but if
leading zeros are suppressed, a blank must be substituted in their place.
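
{pstd}
Reading the header might look like the following minimal Python sketch.  The
field names and sizes come from the table above; {cmd:parse_header} and
{cmd:c_string} are hypothetical helpers, and treating {cmd:nvar} and
{cmd:nobs} as signed integers is an assumption.

```python
import struct

HEADER_SIZE = 109  # total length from the table above


def c_string(field):
    """Return the ASCII text before the first binary zero in field."""
    return field.split(b"\0", 1)[0].decode("ascii")


def parse_header(buf):
    """Parse the 109-byte header into a dict keyed by the table's names."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("header is shorter than 109 bytes")
    ds_format, byteorder, filetype, _unused = struct.unpack_from("4B", buf, 0)
    if byteorder == 1:
        prefix = ">"  # HILO
    elif byteorder == 2:
        prefix = "<"  # LOHI
    else:
        raise ValueError("unknown byteorder code")
    # Assumption: nvar and nobs are signed 2- and 4-byte integers.
    nvar, nobs = struct.unpack_from(prefix + "hi", buf, 4)
    return dict(
        ds_format=ds_format,
        byteorder=prefix,
        filetype=filetype,
        nvar=nvar,
        nobs=nobs,
        data_label=c_string(buf[10:91]),
        time_stamp=c_string(buf[91:109]),
    )
```

{pstd}
The returned {cmd:byteorder} is the {cmd:struct} prefix, so it can be reused
when decoding every later field that is encoded per {cmd:byteorder}.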


{marker descriptors}{...}
{title:5.2  Descriptors}

{pstd}
The Descriptors are defined as

	Contents            Length    Format       Comments
	{hline}
	{cmd:typlist}               {cmd:nvar}    byte array
	{cmd:varlist}            33*{cmd:nvar}    char array
	{cmd:srtlist}          2*({cmd:nvar}+1)   int array    encoded per {cmd:byteorder}
	{cmd:fmtlist}            12*{cmd:nvar}    char array
	{cmd:lbllist}            33*{cmd:nvar}    char array
	{hline}
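
{pstd}
Because every length in the table depends only on {cmd:nvar}, the position of
each array can be computed up front.  The minimal Python sketch below assumes
that the Descriptors begin immediately after the 109-byte Header;
{cmd:descriptor_offsets} is a hypothetical name.

```python
HEADER_SIZE = 109


def descriptor_offsets(nvar):
    """Return ([(name, offset, length), ...], offset just past lbllist)."""
    sizes = [
        ("typlist", nvar),
        ("varlist", 33 * nvar),
        ("srtlist", 2 * (nvar + 1)),
        ("fmtlist", 12 * nvar),
        ("lbllist", 33 * nvar),
    ]
    offsets = []
    pos = HEADER_SIZE
    for name, size in sizes:
        offsets.append((name, pos, size))
        pos += size
    return offsets, pos
```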


{pstd}
{cmd:typlist} stores the type of each variable, 1, ..., nvar.
The types are encoded:

		type          code
		{hline 20}
		{cmd:str1}        1 = 0x01
		{cmd:str2}        2 = 0x02
		...
		{cmd:str244}    244 = 0xf4
		{cmd:byte}      251 = 0xfb  {it:(sic)}
		{cmd:int}       252 = 0xfc
		{cmd:long}      253 = 0xfd
		{cmd:float}     254 = 0xfe
		{cmd:double}    255 = 0xff
		{hline 20}

{pstd}
Stata stores five numeric types:  {cmd:double}, {cmd:float}, {cmd:long},
{cmd:int}, and {cmd:byte}.  If {cmd:nvar}==4, a {cmd:typlist} of 0xfcfffdfe
indicates that variable 1 is an {cmd:int}, variable 2 a {cmd:double}, variable
3 a {cmd:long}, and variable 4 a {cmd:float}.  Types 0x01 through 0xf4
are used to represent strings.  For example, a string with maximum length 8
would have type {cmd:0x08}.  If {cmd:typlist} is read into the C-array
{cmd:char} {cmd:typlist[]}, then {cmd:typlist[i-1]} indicates the type of
variable {cmd:i}.
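
{pstd}
Decoding the type codes might look like this minimal Python sketch;
{cmd:type_name} and {cmd:type_width} are hypothetical helpers.  Indexing a
Python {cmd:bytes} object yields values from 0 to 255; in C, where plain
{cmd:char} may be signed, declaring the array as {cmd:unsigned char} keeps
codes such as 0xfc from being read as negative numbers.

```python
NUMERIC_NAMES = ("byte", "int", "long", "float", "double")  # codes 251..255
NUMERIC_WIDTHS = (1, 2, 4, 4, 8)


def type_name(code):
    """Map a typlist code to its Stata type name, e.g. 8 -> 'str8'."""
    if 1 <= code <= 244:
        return "str" + str(code)
    if 251 <= code <= 255:
        return NUMERIC_NAMES[code - 251]
    raise ValueError("invalid type code")


def type_width(code):
    """Number of bytes one value of this type occupies."""
    if 1 <= code <= 244:
        return code
    if 251 <= code <= 255:
        return NUMERIC_WIDTHS[code - 251]
    raise ValueError("invalid type code")
```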

{pstd}
{cmd:varlist} contains the names of the Stata variables 1, ..., {cmd:nvar},
each up to 32 characters in length, and each terminated by a binary zero (\0).
For instance, if {cmd:nvar}==4,

	0       33        66          99
	|        |         |           |
	{cmd:vbl1\0...myvar\0...thisvar\0...lstvar\0...}


{pstd}
would indicate that variable 1 is named {cmd:vbl1}, variable 2
{cmd:myvar}, variable 3 {cmd:thisvar}, and variable 4 {cmd:lstvar}.  The byte
positions indicated by periods will contain random numbers (and note that we
have omitted some of the periods).  If {cmd:varlist} is read into the C-array
{cmd:char} {cmd:varlist[]}, then {cmd:&varlist[(i-1)*33]} points to the name
of the {cmd:i}th variable.
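
{pstd}
The same indexing, sketched minimally in Python:  each name occupies a
33-byte slot, and only the bytes before the first {cmd:\0} are meaningful.
{cmd:variable_names} is a hypothetical helper.

```python
def variable_names(varlist, nvar):
    """Extract the nvar variable names from the raw varlist bytes."""
    names = []
    for i in range(nvar):
        slot = varlist[i * 33:(i + 1) * 33]
        # Bytes after the terminating \0 may be junk and are ignored.
        names.append(slot.split(b"\0", 1)[0].decode("ascii"))
    return names
```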

{pstd}
{cmd:srtlist} specifies the sort-order of the dataset and is terminated by an
(int) 0.  Each 2 bytes is a single int and contains either a variable number
or zero.  The zero marks the end of the {cmd:srtlist}, and the array positions
after that contain random junk.  For instance, if the data are not sorted, the
first int will contain a zero and the ints thereafter will contain junk.  If
{cmd:nvar}==4, the record will appear as

	{cmd:0000................}
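
{pstd}
Reading {cmd:srtlist} might be sketched as follows in Python.  The helper
{cmd:sort_variables} is hypothetical, and {cmd:prefix} is assumed to be the
{cmd:struct} byte-order prefix ({cmd:>} for HILO, {cmd:<} for LOHI) chosen
from the header's {cmd:byteorder}.

```python
import struct


def sort_variables(srtlist, nvar, prefix):
    """Return the variable numbers the data are sorted by; [] if unsorted."""
    count = nvar + 1
    values = struct.unpack(prefix + str(count) + "h", srtlist[:2 * count])
    order = []
    for value in values:
        if value == 0:  # terminator; anything after it is junk
            break
        order.append(value)
    return order
```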

{pstd}
If the dataset is sorted by a single variable {cmd:myvar} and if that variable