\documentclass[a4paper]{article}
\usepackage{graphicx}
\usepackage{ucs}
\usepackage{html}

\title{Emdros HAL example (Hyperspace Analogue to Language)}
\author{Ulrik Petersen}

\begin{document}

\newenvironment{precode}
        {\begin{list}{}{
        \setlength{\rightmargin}{\leftmargin}
        \setlength{\listparindent}{0pt}% needed for AMS classes
        \raggedright
        \setlength{\itemsep}{0pt}
        \setlength{\parsep}{0pt}
        \normalfont\ttfamily}%
        \item[]}
        {\end{list}}

\maketitle
\tableofcontents

\section{Introduction}

Welcome to the Emdros HAL example.  Here you will find information on:

\begin{itemize}
  \item What is a HAL space?
  \item What does the example do?
  \item Running the example
\end{itemize}

\section{Part I: Preliminaries}

This part covers:

\begin{itemize}
  \item What is a HAL space?
  \item What does the example do?
\end{itemize}

\subsection{What is a HAL space?}

\subsubsection{Background}

HAL stands for:

\begin{center}
  {\bf H}yperspace {\bf A}nalogue to {\bf L}anguage
\end{center}

It was invented by a research group at the University of California at
Riverside.  It has a homepage; see the Links section.

HAL is a numeric method for analysing text.  It does so by running
a {\em sliding window} of fixed length across a text, calculating a
matrix of word cooccurrence values along the way.

\subsubsection{Informal definition}

A HAL space is a $Q \times Q$ matrix of integers, where $Q$ is the number of
distinct word-forms in the text.
For example, if the text contains 50,000 running words, of which 6,000
are unique forms, then the corresponding HAL space will be a
$6{,}000 \times 6{,}000$ matrix.

Each entry in the matrix is a sum of all values arising from
running the sliding window over the text.

The sliding window is {\em n} words wide, e.g., 8.

Then, for a given word, say, the one at index {\em t}, a value
is added to the matrix for each of the word pairs which arise by
pairing the word at index {\em t} with each of the words at indexes
$t-n$ to $t-1$.  In other words, each of the {\em n} words preceding
index {\em t} is paired with the word at index {\em t}.

For each of these word-pairs, a value is calculated as follows: If
the word before {\em t} is at index {\em j} (e.g., $t-3$), then the
value is
\[ n - |t - j| + 1 \]
For example, with $n = 8$, $t = 10$, and $j = 10 - 3 = 7$:
\[ 8 - |10 - 7| + 1 = 6 \]
This means that words close to {\em t} get a higher
score than words farther away.

If the word-form corresponding to {\em t} has number {\em p}
in the HAL-matrix, and the word-form corresponding to {\em j} has
number {\em q} in the HAL-matrix, then this value is added to the
{\em p}'th row and the {\em q}'th column.

\subsubsection{Mathematical definition}

Assume or define the following:

\begin{itemize}
   \item Assume that the words are numbered 1..M.
   \item The words themselves will be called T(1), T(2), ..., T(M) for
   easy reference.
   \item Assume that there are Q unique forms.  Then the forms will be
   called F(0), F(1), F(2), ..., F(Q-1).
   \item Assume that the function Form(i) maps word-indexes in 1..M to
   form-indexes in 0..Q-1.  For example, if T(2) has the form F(10),
   then Form(2) = 10.
   \item Assume that the window-width is n.
   \item Define the delta function D(a,b) as follows:
   $D(a,b) = n - |a-b| + 1$
   \item Assume that Matrix[0..Q-1][0..Q-1] is the HAL matrix, and that
   all values are initially 0.
\end{itemize}

The HAL space is calculated as follows:

\begin{enumerate}
  \item Initialize all matrix-cells to be 0.
  \item For each word-index {\em i} in the range 2..M:
  \begin{enumerate}
     \item For each word-index {\em i2} in the range
     $i-n$..$i-1$ (the lower bound of this range must be raised to 1
     whenever $i-n < 1$; e.g., if $n=4$ and $i=2$, the range starts
     at 1, not $-2$):
     \begin{enumerate}
         \item Let d = D(i2,i).
         \item Add d to the HAL matrix's Form({\em i})'th row and
            Form({\em i2})'th column.  I.e., add d to the existing
            value of Matrix[Form(i)][Form(i2)].
     \end{enumerate}
  \end{enumerate}
\end{enumerate}

\subsection{What does the example do?}

\subsubsection{Overview}

The example can do these things:

\begin{enumerate}
  \item It can load an Emdros database with a text, initializing the
  database for use with the example.
  \item It can build HAL-spaces of the text in the database.
  \item It can load a previously generated HAL space from the database.
  \item It can emit two kinds of output:
      \begin{enumerate}
          \item A complete HAL-space as a Comma-Separated-Value (CSV)
          text file suitable for loading into a spreadsheet.
          \item Certain parts of a HAL-space, based on certain
          word-forms that are of interest.
      \end{enumerate}
\end{enumerate}

\subsubsection{What output does it give?}

\begin{itemize}
  \item For building the database:
     \begin{itemize}
        \item A loaded Emdros database
        \item A list of word-forms found in the text (the Q word forms
        described in the section ``What is a HAL space?'').
     \end{itemize}
  \item For querying the database:
     \begin{itemize}
        \item Optionally, a Comma-Separated Value (CSV) file
        containing the entire HAL-space, suitable for loading into a
        spreadsheet.
        \item An output file with data for the words in which you are
        especially interested.

        For each word, a list is given of the words with which it
        cooccurs most frequently and closely.

        If the word form you are interested in is w1 and the word
        form with which it cooccurs is w2, then the score is
        calculated as Matrix[w1][w2] + Matrix[w2][w1].

        This score is printed twice: first in a scaled form
        (multiplied by a user-specified factor and divided by the
        text length), and then in its raw form, as it came from the
        matrix.

        The list is sorted, so that the ``heaviest'' words come
        first.  The user can specify how many words to put in the
        list.
     \end{itemize}
\end{itemize}

\subsubsection{What input does it need?}

\begin{enumerate}
  \item For loading the database: Only a text file.
  \item For querying the database: A configuration file with a
  special format.
\end{enumerate}

\subsubsection{What is a configuration?}

A configuration file is a
plain text file which looks like a Unix configuration file, and holds
information necessary for running the HAL example.  See the
section ``Writing the input file'' for its format.

\subsubsection{What information does an input file contain?}

An input file (that is, a configuration file) contains the following:

\begin{itemize}
  \item The {\bf database name} which holds the text to
  analyze.
  \item The {\bf sliding window width}, {\em n}.
  \item The name of the {\bf CSV-file} containing the
  HAL-space in a CSV-format, suitable for reading into a spreadsheet
  (``none'' if you don't want a CSV file).
  \item The name of the {\bf output file} containing the
  output for the words you are interested in.
  \item The {\bf words you are interested in}.
  \item The {\bf maximum number of values} for each word you
  are interested in.
  \item The {\bf factor by which to multiply} each value for
  a given word.  The value is first divided by the number of words in
  the text.  This can come in handy if you wish to compare texts of
  different lengths.
\end{itemize}

\section{Part II: Running}

Running the example involves two steps:

\begin{enumerate}
  \item Building the database
  \item Querying the database, which has two sub-parts:
  \begin{enumerate}
     \item Writing the input file
     \item Running the example
  \end{enumerate}
\end{enumerate}

\subsection{Building the database}

\subsubsection{How}

The database need only be built once.  To do so, run the
\begin{verbatim}
halblddb
\end{verbatim}
command-line program with the right options.

\subsubsection{halblddb usage}

To see the halblddb usage, run the following command:
\begin{verbatim}
halblddb --help
\end{verbatim}
This displays the following output:
\begin{verbatim}
halblddb version <version-number> on <backend-name>
Usage: halblddb [options]
OPTIONS:
   -d , --dbname db     Use this database
   -f textfilename      Use this text input file
   -o wordlistfilename  Use this wordlist output file
   --help               Show this help
   -V , --version       Show version
   -h , --host host     Set hostname to connect to
   -u , --user user     Set database user to connect as (default: 'emdf')
   -p , --password pwd  Set password to use for database user
\end{verbatim}

\subsubsection{Example}

Assume that you want the following:

\begin{itemize}
  \item Build a database called {\bf hal\_test}
  \item using an input-file called {\bf mytext.txt}
  \item writing a word-list to a file called {\bf wordlist.txt}
\end{itemize}

And assume the following (not necessary if using SQLite):

\begin{itemize}
  \item You have set up your backend database with the {\bf user}
  called {\bf emdf}
  \item whose {\bf password} is {\bf changeme}.
\end{itemize}

Then the following command line would do it:
\begin{verbatim}
halblddb -d hal_test -f mytext.txt -o wordlist.txt -p changeme -u emdf
\end{verbatim}

That's it.

\subsection{Querying the database}

Having built the database,
your next step is to query it in interesting ways.

There are two steps to this:

\begin{enumerate}
  \item Writing the input file
  \item Running the example
\end{enumerate}

\subsubsection{Writing the input file}

\paragraph{Overview}

The format of the file is much like a Windows .ini file or a Unix
configuration file:

\begin{itemize}
  \item The file contains a number of {\em key}={\em value} pairs.
  \item Lines beginning with ``\#'' are ignored (i.e., are comments).
\end{itemize}

\paragraph{Self-documenting example}

The following is a self-documenting example.
Please refer back to the section ``What does the example do?'' for a
description of each of the following settings.

\begin{verbatim}
# The db name
database = hca
# The factor by which to multiply each value in the output.
# This is divided by the text length, so for example if your text
# has 70000 words, 100000 would be a good value
factor = 100000
# The window width
n = 5
# The csv output filename
csv = haltest.csv.txt
# The value-vector output filename
output = haltest.out.txt
# The maximum number of values in each value-vector
max_values = 20
# Place the words here, with one 'word' key per line, e.g.:
word = mor
word = barn
word = blomst
\end{verbatim}

\subsubsection{Running the example}

\paragraph{How}

First, open a command-line prompt.
Then follow this schema:
\begin{verbatim}
mqlhal -c myconfigfile.cfg
\end{verbatim}
If on MySQL or PostgreSQL, you may need to use the ``-u username'',
``-p password'', and/or the ``-h hostname'' options.

\paragraph{Example}

This example is for Windows users.

\begin{enumerate}
  \item Open a command-line prompt.
  \item cd to the right directory, e.g., ``C:$\backslash$Emdros$\backslash$''
  \begin{verbatim}
      C:\>C:
      c:\>cd C:\Emdros\
      C:\Emdros>
\end{verbatim}
  \item Run the program.
  Assume that your configuration file is called
  ``C:$\backslash$temp$\backslash$myconfig.cfg''
  \begin{verbatim}
      C:\Emdros>bin\mqlhal.exe -c "C:\temp\myconfig.cfg"
\end{verbatim}
\end{enumerate}

That's it.  Now sit back and watch as the program spits out various
messages and runs to completion.

\paragraph{What happens next?}

Afterwards, it's time to analyze the output.  Perhaps you need to
adjust the configuration file and rerun the program to get new output.
This will likely be an iterative process until you find what you are
looking for, or have tested your hypothesis about the text.

\section{Part III: References}

Here, we have two kinds of references:

\begin{enumerate}
  \item Bibliography
  \item Links
\end{enumerate}

\subsection{Bibliography}

\subsubsection{Articles}

\begin{itemize}
  \item Burgess, C., K. Livesay, and K. Lund (1998). {\em Explorations
  in Context Space: Words, Sentences, Discourse}, Discourse
  Processes, Volume 25, pp. 211--257.
\newline {\em See http://locutus.ucr.edu/abstracts/97-bll-expl.html
  from where you can also download the paper.}
  \item Burgess, C. and K. Lund (1997). {\em Modelling parsing
  constraints with high-dimensional context space}, Language and
  Cognitive Processes, Volume 12, pp. 1--34.
\newline {\em See http://citeseer.nj.nec.com/context/398051/0.}
  \item Lund, Kevin and Curt Burgess (1996). {\em Producing
  high-dimensional semantic spaces from lexical co-occurrence},
  Behavior Research Methods, Instruments and Computers, Volume 28,
  number 2, pp. 203--208.
\newline {\em See http://locutus.ucr.edu/abstracts/96-lb-prod.html
  from where you can also download the paper.}
\end{itemize}

\subsection{Links}

\begin{itemize}
  \item Official HAL Homepage
  \item Emdros homepage
\end{itemize}

\end{document}