<teiHeader type="corpus" creator='dominic' status="update"
date.updated="2000-10-17" id="BNC-W">
<fileDesc>
<titleStmt>
<title>
The British National Corpus: World Edition
</title>
<respStmt>
<resp>
Lead partner in consortium
</resp>
<name id="OUP">
Oxford University Press
</name>
</respStmt>
<respStmt>
<resp>
Text selection for miscellaneous and unpublished written
materials
</resp>
<name id="CHAMBERS">
W R Chambers
</name>
<resp>
Text selection, data capture and transcription for
spoken texts and for 14% of published written texts
</resp>
<name id="LONGMAN">
Longman ELT
</name>
<resp>
Text selection for 86% of published written texts
</resp>
<name>
Oxford University Press
</name>
<resp>
Data capture and transcription for all miscellaneous and
unpublished written texts and for 86% of published
written texts
</resp>
<name>
Oxford University Press
</name>
</respStmt>
<respStmt>
<resp>
Encoding, storage and distribution
</resp>
<name id="OUCS">
Oxford University Computing Services
</name>
</respStmt>
<respStmt>
<resp>
Text enrichment
</resp>
<name id="LANCASTER">
Unit for Computer Research into the English
Language, University of Lancaster
</name>
</respStmt>
</titleStmt>
<editionStmt n="2.0">
<edition>First World Edition</edition>
</editionStmt>
<extent>Approximately 100 million words
</extent>
<publicationStmt>
<distributor>
Oxford University Computing Services
</distributor>
<address>
<addrLine>13 Banbury Road, Oxford OX2 6NN U.K.</addrLine>
<addrLine>Telephone: +44 1865 273221</addrLine>
<addrLine>Facsimile: +44 1865 273275</addrLine>
<addrLine>Internet mail: natcorp@oucs.ox.ac.uk</addrLine>
</address>
<idno type="BNC">BNC-W</idno>
<availability>
<para>The British National Corpus is distributed worldwide by Oxford
University Computing Services on a not-for-profit basis, and under the
terms of a standard license agreement. Each copy of the corpus must
include a copy of this corpus header and any redistribution or
republishing of the corpus texts (the "BNC Processed Material") is
strictly forbidden.</para>
<para>For information, the conditions of the Standard License Agreement
are as follows:
<list>
<label>(a)</label>
<item><para>The BNC Consortium grants according to the terms and
conditions set out herein and in consideration of the
payments specified herein a non-exclusive, non-transferable
Licence to the Licensee to use the BNC Processed Material
for the purposes of linguistic research and/or the
development of language products.
</para></item>
<label>(b)</label>
<item><para>Distribution of the BNC Processed Material is
restricted to the Licensee or in the event of the Licensee
being an organisation, to the Licensee's research group.
This group is defined as consisting only of those Licensee's
employees whom the Licensee authorises to perform the work
using the BNC Processed Material for the purposes described
in paragraph (a).
</para></item>
<label>(c)</label>
<item><para>Members of the said research group must not, except as
herein provided, copy, publish or otherwise give to any
third party access to the whole or any part of the BNC
Processed Material. It is the responsibility of the
Licensee to ensure that the members of the said research
group understand and abide by this restriction, and to
supervise their activities with respect to the BNC Processed
Material. Neither the Licensee nor members of the
Licensee's said research group may assign, transfer, lease,
sell, rent, charge or otherwise encumber the BNC Processed
Material.
</para></item>
<label>(d)</label>
<item><para>The BNC Processed Material may be installed at the
place or places of work of the said research group. The
place of work is defined as the computing systems that the
members of a research group normally use to conduct their
research activities. It can include both work and home
computers, and is not restricted to a particular machine or
building.
</para></item>
<label>(e)</label>
<item><para>Copies of the BNC Processed Material may be made for
backup purposes, or for the purposes of making data
available to members of the research group but the Licensee
shall ensure that the BNC Consortium's copyright notice is
reproduced on all copies or parts thereof of the BNC
Processed Material. Any such copies will be deemed to be
part of the BNC Processed Material.
</para></item>
<label>(f)</label>
<item><para>There is no restriction on the use of the Licensee's
Results except that the Licensee may not publish in print or
electronic form or exploit commercially in any form
whatsoever any extracts from the BNC Processed Material
other than as permitted under the provisions of the relevant
copyright laws.
</para></item>
<label>(g)</label>
<item><para>The BNC Consortium does not grant to the Licensee any
rights whatsoever to reproduce the BNC Texts or use all or
any part of the BNC Texts in commercial products or services
in any way other than would be permitted under the fair
dealings provision of copyright law.</para></item></list></para></availability>
<date value="2000-11-1">
31 October 2000
</date>
</publicationStmt>
<sourceDesc>
<para>The British National Corpus has no single source document. For
details of the source or sources used in the creation of each
electronic text, see the individual text headers. The principles and
practices underlying selection and design of the corpus are documented
in the BNC User Reference Guide, a copy of which should be supplied
with it.</para> </sourceDesc>
</fileDesc>
<encodingDesc>
<projectDesc>
<para><list type="gloss">
<label>Goals</label>
<item><para>
The British National Corpus (BNC) Consortium was formed in 1990,
and started work in 1991 on the three-year task of producing a
hundred-million word corpus of modern British English for use in
commercial and academic research. The first edition was published
in 1994.
</para><para>
This second edition was produced between 1998 and 2000. It
contains a thorough revision of the part of speech tagging, several
corrections to the headers, and some minor revision of the SGML
tagging used.</para></item>
<label>The Consortium Participants</label>
<item><para>The BNC is the result of a unique collaboration between
three major U.K. dictionary publishers, two universities, and the
British Library. The dictionary publishers are Chambers Harrap,
Longman, and Oxford University Press; the university partners are the
Unit for Computer Research into the English Language (UCREL) at Lancaster
University and Oxford University Computing Services (OUCS). </para></item>
<label>Funding</label>
<item><para>The development of the BNC was funded by the commercial partners
in the consortium with assistance from the U.K. government's
Department of Trade and Industry (DTI) and Science and Engineering
Research Council (SERC) under the Joint Framework for Information
Technology (JFIT).
</para></item>
<label>Design</label>
<item><para>The British National Corpus is
<list>
<item><para>
A large corpus: its hundred million words are made up of
ninety million from written and ten million from spoken sources.
</para></item>
<item><para>A <hi>sample</hi> corpus: it is composed of text samples, generally of
no more than 40,000 words, rather than of complete works.
</para></item>
<item><para>A <hi>synchronic</hi> corpus: it includes imaginative texts dating from
the 1960s to 1994; informative texts dating from 1975 to 1994;
and spoken texts gathered primarily between 1990 and 1994.
</para></item>
<item><para>A <hi>general</hi> corpus: it is not specifically restricted to any
particular subject field, register or genre. It includes language
from all age and social groups and a broad spread of U.K. regions.
</para></item>
<item><para>A <hi>monolingual</hi> British English Corpus: text samples are
substantially the product of British English speakers. A small
proportion of the words in the corpus are in a foreign language or
non-British English.
</para></item>
<item><para>
A <hi>TEI-conformant</hi> Corpus: texts in the corpus are uniformly
marked up according to the recommendations of the Text Encoding
Initiative (TEI), an international consortium concerned
with the mark-up of texts for use in academic research. These
recommendations are an application of Standardized General Markup
Language (SGML), defined by International Standard IS 8879:1986.
</para></item></list></para></item>
<label>Uses</label>
<item>:<list><item><para>
Lexicography: The corpus provides a body of new data on word
meaning, grammar and usage. It yields empirical data on word
frequencies, word classes and spelling preferences, among other
things. It also reveals hitherto undocumented evidence about the
spoken language, with consequences that go far beyond the
immediate impact on dictionary-writing.
</para></item>
<item><para>Linguistic research: The corpus provides a standard basis for
investigating phenomena and testing competing linguistic theories.
</para></item>
<item><para>Language technology: Statistical techniques, requiring very
large samples of text, are increasingly used in machine
translation, speech recognition, speech synthesizers, spelling and
grammar checkers for word-processing and desk-top publishing,
hand-held electronic books and other developments in information
technology.
</para></item>
<item><para>Teaching: The corpus provides a rich source of examples of
current usage for English Language Teaching, allowing more
frequent patterns of use to be distinguished from less frequent.
In addition, the corpus provides a valuable didactic resource for
use in many areas of higher education.
</para></item>
<item><para>As a model: Future TEI-conformant corpora, in English and other
languages, may base their designs on the experience gained in the
production of the BNC.
</para></item></list></item></list></para></projectDesc>
<samplingDecl>
<para>Different parts of the BNC were constructed using different sampling
policies, as further described in the BNC Design Documentation. The
policies are summarized below. Note that information about which policy
resulted in the selection of a particular text is not available.
<list type="taxonomy">
<item id="SD000"><para>
Published: chosen selectively from candidate population
</para></item>
<item id="SD001"><para>
Published: chosen at random from candidate population
</para></item>
<item id="SD002"><para>
Unpublished: chosen according to relevant design criteria
</para></item>
<item id="SD003"><para>
Spoken: obtained from demographic sample of UK population
</para></item>
<item id="SD004"><para>
Spoken: obtained in context determined by design criteria
</para></item></list></para></samplingDecl>
<editorialDecl>
<para>The following editorial policies were applied in creating
the BNC. The DECLS attribute indicates which policies apply to a
given <text> or <div> element; but not all policies
are necessarily marked. Policies are identified by ID codes as follows:
<list type="gloss">
<label>correction policies</label>
<item>:<list type="taxonomy">
<item id="CN000"><para>Errors tagged with <sic> when seen; no normalization
</para></item>
<item id="CN001"><para>
Errors tagged with <sic> if seen; normalization with <corr>
</para></item>
<item id="CN002"><para>
Normalized to standard British English or control list member
</para></item>
<item id="CN004"><para>
Corrections and normalizations applied silently
</para></item></list></item>
<label>hyphenation policies</label>
<item>:<list type="taxonomy">
<item id="HN000"><para>
Smart elision of line-end hyphens; &rehy used for remainder
</para></item>
<item id="HN001"><para>
Dumb elision of line-end hyphens; true hyphens hand-reinstated
</para></item>
<item id="HN002"><para>
Line-end hyphens removed by hand where appropriate
</para></item>
<item id="HN003"><para>
Source material contains no line-end hyphens
</para></item></list></item>
<label>quotation policies</label>
<item>:<list type="taxonomy">
<item id="QN000"><para>
Open, close quote normalized to &bquo, &equo
</para></item><item id="QN001"><para>
Open and close quote normalized to &quo
</para></item>
<item id="QN002"><para>
Quotation may be represented using <shift>
</para></item>
</list></item>
<label>Segmentation</label>
<item id="SN000"><para>
In this version of the Corpus, all segmentation and word-class marking
was carried out in the same way, using CLAWS5.
</para></item>
<label>Transcription methods</label>
<!-- not marked in this version -->
<item>:<list type="taxonomy">
<item id="TN000"><para>
Copy-typed from hard-copy into OUP format; transduced to CDIF
</para></item><item id="TN001"><para>
Copy-typed from hard-copy into Longman format; transduced to CDIF
</para></item><item id="TN002"><para id="LB">
Scanned from hard-copy into OUP format; transduced to CDIF
</para></item><item id="TN003"><para>
Scanned from hard-copy into Longman format; transduced to CDIF
</para></item><item id="TN004"><para>
Transduced from M-R into OUP format; transduced to CDIF
</para></item><item id="TN005"><para>
Transduced from M-R into Longman format; transduced to CDIF
</para></item><item id="TN006"><para id="AGB">
Recording transcribed into Longman format; transduced to CDIF
</para></item></list></item></list></para>
</editorialDecl>
<tagsDecl>
<tagUsage gi="align">
Alignment map for synchronizing overlapped speech
</tagUsage>
<tagUsage gi="bibl">
Free format bibliographic citation
</tagUsage>
<tagUsage gi="bncDoc">
an individual text in the BNC
</tagUsage>
<tagUsage gi="body">
The body of a written text
</tagUsage>
<tagUsage gi="c">
A single character, typically punctuation
</tagUsage>
<tagUsage gi="caption">
Floating caption in written material
</tagUsage>
<tagUsage gi="corr">
An editorial correction
</tagUsage>
<tagUsage gi="div">
Spoken text division
</tagUsage>
<tagUsage gi="div1">
Written text division, level 1
</tagUsage>
<tagUsage gi="div2">
Written text division, level 2
</tagUsage>
<tagUsage gi="div3">
Written text division, level 3
</tagUsage>
<tagUsage gi="div4">
Written text division, level 4
</tagUsage>
<tagUsage gi="event">
Non-verbal event in spoken text
</tagUsage>
<tagUsage gi="gap">
Point where source material omitted from electronic text
</tagUsage>
<tagUsage gi="head">
Header or headline on written text division
</tagUsage>
<tagUsage gi="hi">
Written text highlight indicator
</tagUsage>
<tagUsage gi="item">
List item
</tagUsage>
<tagUsage gi="l">
Poem or verse line
</tagUsage>
<tagUsage gi="label">
List item's label
</tagUsage>
<tagUsage gi="lb">
Line break indicator
</tagUsage>
<tagUsage gi="lg">
Group of verse lines
</tagUsage>
<tagUsage gi="list">
A list
</tagUsage>
<tagUsage gi="loc">
Anchor indicating synchronization point
</tagUsage>
<tagUsage gi="note">
Editorial or original note pertaining to a text
</tagUsage>
<tagUsage gi="p">
Written text paragraph
</tagUsage>
<tagUsage gi="pause">
Pause indicator in spoken text
</tagUsage>
<tagUsage gi="pb">
Written text page break
</tagUsage>
<tagUsage gi="poem">
Poetic or verse material
</tagUsage>
<tagUsage gi="ptr">
Pointer from one part of a text to another
</tagUsage>
<tagUsage gi="quote">
Written text quoted material indicator
</tagUsage>
<tagUsage gi="reg">
Regularizes questionable or incorrectly-spelled material
</tagUsage>
<tagUsage gi="s">
Text segment
</tagUsage>
<tagUsage gi="salute">
A salutation (as in a letter etc.)
</tagUsage>
<tagUsage gi="shift">
Indicates a change of register etc. in spoken material
</tagUsage>
<tagUsage gi="sic">
Marks questionable spelling or usage
</tagUsage>
<tagUsage gi="sp">
Dramatic written material speech marker
</tagUsage>
<tagUsage gi="spkr">
Dramatic written material speaker indicator
</tagUsage>
<tagUsage gi="stage">
Dramatic written material stage direction
</tagUsage>
<tagUsage gi="stext">
Spoken text
</tagUsage>
<tagUsage gi="text">
Written text
</tagUsage>
<tagUsage gi="trunc">
Indicates truncated word in spoken material
</tagUsage>
<tagUsage gi="u">
Spoken text utterance
</tagUsage>
<tagUsage gi="unclear">
Indicates untranscribable material in spoken text
</tagUsage>
<tagUsage gi="vocal">
Vocalized non-word in spoken material
</tagUsage>
<tagUsage gi="w">
CLAWS-defined word
</tagUsage>
</tagsDecl>
<refsDecl>
<para>Canonical references in the British National Corpus
are to text segment (<s>) elements, and
are constructed by taking the value of the n attribute
of the <bncDoc> element containing the target text,
and concatenating a dot separator, followed by the value
of the n attribute of the target <s> element.
</para>
<para>Segments are numbered sequentially within each text or stext,
starting at 1. There may be gaps in the numeric sequence, as a consequence
of post-segmentation corrections.</para>
</refsDecl>
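<!-- Illustrative note, not part of the original header: a minimal sketch of
     the reference scheme described above. Assuming a hypothetical text whose
     <bncDoc> element has n="XYZ" and which contains a segment <s n="42">,
     the canonical reference for that segment would be XYZ.42. The values
     XYZ and 42 are invented for illustration only. -->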
<classDecl>
<taxonomy id="DLee">
<bibl>David Lee's Register classification as documented at
<xref>http://members.xoom.com/davidlee00/genre_register.zip</xref></bibl>
</taxonomy>
<taxonomy id="COPAC">
<bibl>Keyword classifications as supplied by the UK's COPAC service
at <xref>http://copac.ac.uk</xref></bibl>
</taxonomy>
<taxonomy>
<category id="allava">
<catDesc>
Text availability
</catDesc>
<category id="allava0">
<catDesc>
Ownership has not been claimed
</catDesc>
</category>
<category id="allava1">
<catDesc>
Worldwide rights cleared
</catDesc>
</category>
<category id="allava2">
<catDesc>
Worldwide rights cleared
</catDesc>
</category>
<category id="allava3">
<catDesc>
Not available in North America
</catDesc>
</category>
<category id="allava4">
<catDesc>
Not available in U.S.A.
</catDesc>
</category>
<category id="allava5">
<catDesc>
Not available outside the European Union
</catDesc>
</category>
<category id="allava6">
<catDesc>
Not available in U.S.A. & Philippines
</catDesc>
</category>
<category id="allava7">
<catDesc>
Not available in N America & Philippines
</catDesc>
</category>
</category>
<category id="alltyp">
<catDesc>
Text type
</catDesc>
<category id="alltyp1">
<catDesc>
Spoken demographic
</catDesc>
</category>
<category id="alltyp2">
<catDesc>
Spoken context-governed
</catDesc>
</category>
<category id="alltyp3">
<catDesc>
Written books and periodicals
</catDesc>
</category>
<category id="alltyp4">
<catDesc>
Written-to-be-spoken
</catDesc>
</category>
<category id="alltyp5">
<catDesc>
Written miscellaneous
</catDesc>
</category>
</category>
<category id="alltim">
<catDesc>
Publication date
</catDesc>
<category id="alltim1">
<catDesc>
1960-1974
</catDesc>
</category>
<category id="alltim2">
<catDesc>
1975-1984
</catDesc>
</category>
<category id="alltim3">
<catDesc>
1985-1993
</catDesc>
</category>
<category id="alltim0">
<catDesc>
Unknown
</catDesc>
</category>
</category>
<category id="scgdom">