This text is intended to give a general outline of how EARS is
implemented. However, the source, as always, is the final reference.

File Conventions
----------------
C++ source has *.cc and *.h endings, where the header file usually
contains the class interfaces. C source goes into *.c files.
The source is contained in several subdirs, plus one .cc file for
each program in the main dir. After compilation, each subdir has
a library with the name of the subdir.

EARS requires the existence of the $(HOME)/.earsrc file, which
specifies where the data resides (default $(HOME)/.ears/).
If the .earsrc file does not exist, a default one will be made.
The .ears directory contains
- word lists (ending *.words)
- recognizer files (ending *.NDTW or so)
- a 'raw' directory for WAV files
- directories for pattern files, depending on method and pattern type
  (e.g. RASTA-V/* is a variable sized Rasta feature file).

EARS and C++
------------
In EARS, I have tried to use some OOP techniques (classes, abstraction)
to make the source more modular and understandable. I would be glad
to have some feedback about the correctness of the result.
In the following, I'll use some terms from the excellent Gamma et al.
'Design Patterns' book. Please refer to it for the ideas.

There are several different layers in the program:

Top layer: the Protocol class called by main() plus some other
subprotocols called by the top protocol. These create the objects
that form the second layer. They can thus be considered to be
mediators.
- Files: train_ears.cc and listen.cc, for the first call;
  ears/pr_train_ears.cc and ears/pr_listen.cc, for the top protocol;
  the other ears/pr_*.cc, for the subprotocols

Second layer: this consists of several classes, most of which are
instantiated only once at a time, so I implemented them as singletons,
most notably 'screen' and 'Sound', which are the I/O objects. 'screen'
also handles keyboard input, which should be separated out in the
future. 'messages' was an early attempt at internationalization that
will be completed at some time. 'config' handles the app configuration.
'words' holds the active word list.
- Files (*.cc and *.h):
  - objects: config, words, sound, screen, messages
  - the GUI bridge is in ui/*, the others are in modules/*

Bottom layer: here are the most elementary objects that are shifted
between the mediators. The first processing phase yields sndblocks,
which are assembled into samples, which in turn become word_patterns
that are saved and later recognized. The pattern objects come from the
RecordingProtocol and are handed to a 'recognizer'; both 'pattern' and
'recognizer' are bridges, and there will be even more in the future.
- Files (*.cc and *.h): sndblock, sample, pattern
- protocols: speechstream

Exceptions
----------
As exception class hierarchies require RTTI, and gcc will not have
RTTI as default under Linux until 2.8.x, there is no real exception
handling in 'ears'. But I have tried to keep the coming changes as
small as possible. For now, a 'throw' invokes the function Throw()
in exception.cc, which shuts down the screen and sound as gracefully
as possible. There is no catching involved inside 'ears'.

Libraries
---------
The following libraries are used:
- libstdc++: for streams, strings and containers. Probably more.
  Compiling with -fno-implicit-templates and providing templates.cc
  reduces code size heavily but gives away the inlines. This will be
  better whenever gcc includes the template repository mechanism.
- libncurses, libpanel: fancy graphics
- libmrasta: feature extraction, source is provided

--------------------------------------------------------------------
Data
----
All files except the RIFF WAV files are written in ASCII.
Config and word files have entries, each on a single line.
All I/O except setting/reading the sound device should be done with
streams.

Options
-------
All options have defaults set on startup. These can be changed inside
the .earsrc file or by giving a command line option. The 'listen'
program has a reduced set of options. Available are:

.earsrc       | command line | description
------------------------------------------------------------------
EARS_PATH     | -p | the directory where all data goes
BASENAME      | -b | file base for .words and .net file [default]
MIN_NUM_WORDS | -m | number of times to speak a word [1]
KEEP_SAMPLES  | -k | write WAV files? [no]
FEATURE       | -f | the feature extractor [MRASTA]
RECOGNIZER    | -r | the recognizer [NDTW]
SOUND_SPEED   | -S | sampling rate in Hz [8000]
SOUND_BITS    | -B | 8- or 16-bit sampling [8]
DEBUG         | -d | output recognizer debug info [no]
NEWLINE       | -n | output newline after words [no] (listen only)

New methods
-----------
For adding new methods/algorithms, esp. a new recognition module,
an understanding of how bridges work is needed. E.g. with recognizer,
the training class accesses the recognizer via an abstract base class
that has a defined interface.
So all you need to do is write a subclass of that ABC and implement
the respective interface (a rough sketch follows the diagram below).
In the case of DTW, most functions are empty since all computation is
done after hearing an unknown word; also, there is no training at all
for DTW. A more complicated use of the interface can be seen with the
BP recognizers.
The provided interfaces should suffice for many purposes, but of
course you can improve that too.

The bridges are:

  screen ----------------->UserInterface
    ^                         ^
    |                         |
  tscreen                  UIraw
  lscreen                  UIncurses

  Sound ------------------->SoundInterface
                              ^
                              |
                           AFSound
                           VoxwareSound
                           OssSound
                           SunSound

  recognizer-------------->RecognizerImplementation
                              ^
                              |
                           DTW
                           NDTW
                           BP
                           BPMT
                           ELMAN1

  pattern----------------->PatternImplementation
                              ^
                              |
                           var_pattern
                           fix_pattern
                           bit_pattern

It is planned to build more abstractions: feature and endpointer.
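To make the subclassing concrete, here is a minimal sketch of a new
recognizer. Only the class names are taken from the diagram above; the
member functions, their signatures and the 'MyRecognizer' class are
made up for illustration --- the real interface is declared in the
recognizer sources:

  #include <string>
  #include <vector>

  class pattern;   // the pattern bridge from the diagram (details omitted)

  // Abstract base class that the training/listening protocols talk to.
  // The member function names below are assumptions, not the actual
  // EARS declarations.
  class RecognizerImplementation {
  public:
      virtual ~RecognizerImplementation() {}
      virtual void train(const std::vector<pattern*>& words) = 0;
      virtual int  recognize(const pattern& unknown) = 0;  // index into word list
      virtual void load(const std::string& basename) = 0;
      virtual void save(const std::string& basename) = 0;
  };

  // A new method only needs to derive from the ABC and fill in the
  // interface.  For a template matcher like DTW most of it can stay
  // empty, because all work happens when an unknown word is heard.
  class MyRecognizer : public RecognizerImplementation {
  public:
      void train(const std::vector<pattern*>&) {}   // DTW-style: no training
      int  recognize(const pattern&) { return 0; }  // compare against stored patterns
      void load(const std::string&) {}              // e.g. read basename.MYREC
      void save(const std::string&) {}              // e.g. write basename.MYREC
  };

Whatever code maps the RECOGNIZER option to a concrete class then has
to know about the new subclass as well.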
--------------------------------------------------------------------
Efficiency of the methods
=========================
Comparisons of speech recognition methods can doubtless be found in
the literature. I'm no expert --- I can only compare from the
engineering point of view. Feedback is gratefully appreciated.
As a first and best start, try: Rabiner, "Fundamentals of Speech
Recognition".

Endpoint detection
------------------
Cutting a word out of a sound stream seems tricky. I tried several
sources but I was not satisfied; maybe I overlooked something. Then
I decided to write it from scratch, and it works surprisingly well,
at least until now.

Let me describe what the program does: at the lowest level, when
reading raw data from the sound device and copying it into a
'sndblock', we already compute a value we call 'energy' (though it
is better described as the average of the derivative) of the sound
data. This energy is high when there is a sound and low when there
is no sound.
When we measure the noise level, we calculate the maximum of all
energies during the measurement and call it 'e_limit', that is, the
limit to the energy of a sndblock that is noise = no sound.
Now, when listening for words, we first let a delay pass, discarding
sound that might still belong to the end of a previous word. Then we
discard all sndblocks that are noise (energy < e_limit) until a
sufficiently long series of non-noise is encountered. From here on we
save all sndblocks until a sufficiently long sequence of noise is
seen. The recorded array of sndblocks is then further processed.
That's it.
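To make this concrete, here is a rough sketch of the idea in C++. The
energy formula (average absolute difference of neighbouring samples is
one plausible reading of "average of the derivative"), the run lengths
and the function names are made up for illustration, and the initial
delay is left out; the real code lives in the sndblock and recording
sources:

  #include <cstdlib>
  #include <vector>

  // 'Energy' of one block: average of the (absolute) derivative.
  double block_energy(const short* samples, int n)
  {
      double sum = 0.0;
      for (int i = 1; i < n; ++i)
          sum += std::abs(samples[i] - samples[i - 1]);
      return (n > 1) ? sum / (n - 1) : 0.0;
  }

  // Gate a stream of per-block energies into one recorded word.
  // e_limit is the maximum energy seen while measuring pure noise.
  // MIN_SPEECH / MIN_SILENCE are assumed run lengths in blocks.
  std::vector<int> cut_word(const std::vector<double>& energies,
                            double e_limit)
  {
      const int MIN_SPEECH  = 3;   // non-noise blocks before we start saving
      const int MIN_SILENCE = 10;  // noise blocks that end the word
      std::vector<int> word;       // indices of the saved blocks
      int run = 0;
      bool in_word = false;

      for (int i = 0; i < (int)energies.size(); ++i) {
          bool is_noise = energies[i] < e_limit;
          if (!in_word) {
              run = is_noise ? 0 : run + 1;   // wait for a long enough burst
              if (run >= MIN_SPEECH) {
                  in_word = true;
                  for (int j = i - run + 1; j <= i; ++j)
                      word.push_back(j);      // keep the burst we just saw
                  run = 0;
              }
          } else {
              word.push_back(i);              // save everything inside the word
              run = is_noise ? run + 1 : 0;
              if (run >= MIN_SILENCE)
                  break;                      // long silence: the word is over
          }
      }
      return word;
  }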
Feature extraction
------------------
Another surprising find is how superior feature extraction with
Rasta-PLP seems. I have the impression that many researchers use pure
LPC, but I can't say that it works for me. There were no differences
between the OGI Rasta function and the Mrasta library, as far as I
can say.

Recognition
-----------
Five words: the surface is only scratched.
I have plans for at least five more recognizers that are substantially
different from the existing ones and from each other. The next one
will surely be a recurrent neural net.

DTW is robust and doesn't require training, but recognition time
increases at least linearly with dictionary size. I simply copied
routines from Dr. Robinson's cookbook.

NDTW is the patched version of DTW with several speed improvements:
matrix allocation is now done only once (with a sufficient size), and
we now search only the possible paths inside the parallelogram. This
is inspired by reading Rabiner and leads to a 2x speedup.
Additionally, after computing a row of global distances, we check
whether the smallest of them is already bigger than the best result
so far; if so, we stop the current comparison.
Also, if the lengths of the two patterns differ too much, we do not
compare them at all.
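For illustration, the three NDTW shortcuts condensed into one sketch.
The frame distance, the band width and the 2:1 length limit are
assumptions, not the actual NDTW parameters:

  #include <algorithm>
  #include <cstdlib>
  #include <limits>
  #include <vector>

  typedef std::vector< std::vector<double> > Feature;  // one vector per frame

  double frame_dist(const std::vector<double>& a, const std::vector<double>& b)
  {
      double d = 0.0;
      for (size_t k = 0; k < a.size(); ++k)
          d += (a[k] - b[k]) * (a[k] - b[k]);
      return d;
  }

  // Returns the global distance, or best_so_far if the pair was pruned.
  double ndtw_distance(const Feature& x, const Feature& y, double best_so_far)
  {
      const int n = (int)x.size(), m = (int)y.size();
      const double INF = std::numeric_limits<double>::max();

      // 1. length pre-check: very different lengths never match anyway
      if (n > 2 * m || m > 2 * n)
          return best_so_far;

      // 2. parallelogram: only cells near the diagonal are considered
      const int band = std::max(std::abs(n - m), std::max(n, m) / 4);

      std::vector<double> prev(m, INF), cur(m, INF);
      for (int i = 0; i < n; ++i) {
          double row_min = INF;
          int j0 = std::max(0, i * m / n - band);
          int j1 = std::min(m - 1, i * m / n + band);
          for (int j = 0; j < m; ++j) cur[j] = INF;
          for (int j = j0; j <= j1; ++j) {
              double best_prev;
              if (i == 0 && j == 0)  best_prev = 0.0;
              else if (i == 0)       best_prev = cur[j - 1];
              else if (j == 0)       best_prev = prev[j];
              else best_prev = std::min(prev[j],
                                   std::min(prev[j - 1], cur[j - 1]));
              if (best_prev == INF) continue;      // cell not reachable
              cur[j] = best_prev + frame_dist(x[i], y[j]);
              if (cur[j] < row_min) row_min = cur[j];
          }
          // 3. abandon: this row can no longer beat the best match so far
          if (row_min > best_so_far)
              return best_so_far;
          prev.swap(cur);
      }
      return prev[m - 1];
  }

In use, the caller keeps best_so_far as the smallest global distance
over the whole dictionary, so later comparisons are abandoned earlier
and earlier.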
BP/BPMT shows that backpropagation isn't the answer to all problems.
Although the code is fast, training time is a major problem. Even
worse, the error rate is high. You would need more instances of each
word, and that slows training even more. Last, the method doesn't
account for time-variability in the data, and generalization is poor.

ELMAN1 doesn't work yet.

--------------------------------------------------------------------
The 'ears' program
------------------
Nothing here yet. But look into the contrib directory; there are some
nice tools that can make the 'ears' program obsolete. And for a start,
I've begun to write down requirements for the program in
doc/ears-requirements.txt