The primary analysis technique used for this purpose is linear prediction of speech, and this is treated in some detail in Chapter 6. It also reduces drastically the data-rate of speech, by a factor of around 50. It is likely that many voice-response systems in the short- and medium-term future will use linear predictive representations for utterance storage.

For maximum flexibility, however, it is preferable to store a textual representation of the utterance. There is an important distinction between speech storage, where an actual human utterance is recorded, perhaps processed to lower the data-rate, and stored for subsequent regeneration when required, and speech synthesis, where the machine produces its own individual utterances which are not based on recordings of a person saying the same thing. The difference is summarized in Figure 1.5. In both cases something is stored: for the first it is a direct representation of an actual human utterance, while for the second it is a typed description of the utterance in terms of the sounds, or phonemes, which constitute it. The accent and tone of voice of the human speaker will be apparent in the stored speech output, while for synthetic speech the accent is the machine's and the tone of voice is determined by the synthesis program.

Probably the most attractive representation of utterances in man-machine systems is ordinary English text, as used by the Kurzweil reading machine. But, as noted above, this poses extraordinarily difficult problems for the synthesis procedure, and these inevitably result in severely degraded speech. Although in the very long term these problems may indeed be solved, most speech output systems can adopt as their representation of an utterance a description which explicitly conveys the difficult features of intonation, rhythm, and even pronunciation. In the kind of applications described above (barring the reading machine), input will be prepared by a programmer as he builds the software system which supports the interactive dialogue. Although it is important that the method of specifying utterances be easily learned, it is not necessary that plain English be used. It should be simple for the programmer to enter new utterances and modify them on-line in cut-and-try attempts to render the man-machine dialogue as natural as possible. A phonetic input can be quite adequate for this, especially if the system allows the programmer to hear immediately the synthesized version of the message he types. Furthermore, markers which indicate rhythm and intonation can be added to the message, so that the system does not have to deduce these features by attempting to "understand" the plain text.
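To give a purely hypothetical flavour of such an input (the notation here is invented for illustration; any real system would define its own scheme), a prompt like "Ready?" might be typed as a string of phoneme codes carrying a stress digit and an intonation marker:

    r e1 d ii /

where "1" marks the stressed syllable and the trailing "/" requests a rising pitch contour, so that the system need not infer either feature from plain text.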
This brings us to another disadvantage of speech storage as compared with speech synthesis. To provide utterances for a voice response system using stored human speech, one must assemble special input hardware, a quiet room, and (probably) a dedicated computer. If the speech is to be heavily encoded, either expensive special hardware is required or the encoding process, if performed by software on a general-purpose computer, will take a considerable length of time (perhaps hundreds of times real-time). In either case, time-consuming editing of the speech will be necessary, with follow-up recordings to clarify sections of speech which turn out to be unsuitable or badly recorded. If at a later date the voice response system needs modification, it will be necessary to recall the same speaker, or re-record the entire utterance set. This discourages the application programmer from adjusting his dialogue in the light of experience. Synthesizing from a textual representation, on the other hand, allows him to change a speech prompt as simply as he could a VDU one, and evaluate its effect immediately.

We will return to methods of digitizing and compacting speech in Chapters 3 and 4, and carry on to consider speech synthesis in subsequent chapters. Firstly, however, it is necessary to take a look at what speech is and how people produce it.

1.8 References

1.9 Further reading

There are remarkably few general books on speech output, although a substantial specialist literature exists for the subject. In addition to the references listed above, I suggest that you look at the following.

Ainsworth, W.A. (1976) Mechanisms of speech recognition. Pergamon.
    A nice, easy-going introduction to speech recognition, this book covers the acoustic structure of the speech signal in a way which makes it useful as background reading for speech synthesis as well. It complements Lea, 1980, cited above, which presents more recent results in greater depth.

Flanagan, J.L. and Rabiner, L.R. (editors) (1973) Speech synthesis. Wiley.
    This is a collection of previously-published research papers on speech synthesis, rather than a unified book. It contains many of the classic papers on the subject from 1940-1972, and is a very useful reference work.

LeBoss, B. (1980) "Speech I/O is making itself heard." Electronics, May 22, pp. 95-105.
    The magazine Electronics is an excellent source of up-to-the-minute news, product announcements, titbits, and rumours in the commercial speech technology world. This particular article discusses the projected size of the voice output market and gives a brief synopsis of the activities of several interested companies.

Witten, I.H. (1980) Communicating with microcomputers. Academic Press, London.
    A recent book on microcomputer technology, this is unusual in that it contains a major section on speech communication with computers (as well as ones on computer buses, interfaces, and graphics).

2 WHAT IS SPEECH?

People speak by using their vocal cords as a sound source, and making rapid gestures of the articulatory organs (tongue, lips, jaw, and so on). The resulting changes in shape of the vocal tract allow production of the different sounds that we know as the vowels and consonants of ordinary language.

What is it necessary to learn about this process for the purposes of speech output from computers? That depends crucially upon how speech is represented in the system. If utterances are stored as time waveforms (and this is what we will be discussing in the next chapter), the structure of speech is not important. If frequency-related parameters of particular natural utterances are stored, then it is advantageous to take into account some of the acoustic properties of the speech waveform.

This point can be brought into focus by contrasting the transmission (or storage) of speech with that of real-life television pictures, as has been proposed for a videophone service. Massive data reductions, of the order of 50:1, can be achieved for speech, using techniques that are described in later chapters. For pictures, data reduction is still an important issue (even more so for the videophone than for the telephone, because of the vastly higher information rates involved). Unfortunately, the potential for data reduction is much smaller: nothing like the 50:1 figure quoted above. This is because speech sounds have definite characteristics, imparted by the fact that they are produced by a human vocal tract, which can be exploited for data reduction. Television pictures have no equivalent generative structure, for they show just those things that the camera points at.
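To make the 50:1 figure concrete, take standard telephone-quality PCM as the uncompressed baseline (an illustrative assumption; digitization rates are treated properly in Chapter 3): 8000 samples per second at 8 bits per sample is 64,000 bit/s, and a 50:1 reduction brings this to roughly 1300 bit/s, which is in the region occupied by low data-rate linear predictive coding.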
Moving up from frequency-related parameters of particular utterances, it is possible to store such parameters in a general form which characterizes the sound segments that appear in spoken language. This immediately raises the issue of classification of sound segments, to form a basis for storing generalized acoustic information and for retrieval of the information needed to synthesize any particular utterance. Speech is by nature continuous, and any synthesis system based upon discrete classification must come to terms with this by tackling the problems of transition from one segment to another, and local modification of sound segments as a function of their context.
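A minimal sketch of what such a generalized store might look like, assuming a toy phoneme inventory keyed to average formant (vocal-tract resonance) frequencies; the values are classic published averages for adult male speakers (Peterson and Barney, 1952), but the structure itself is invented here for illustration, not taken from the book:

```python
# Hypothetical generalized acoustic store: each vowel is characterized
# by its first three formant frequencies in Hz (steady-state targets).
FORMANTS = {
    "ee": (270, 2290, 3010),   # as in "heed"
    "a":  (660, 1720, 2410),   # as in "had"
    "ah": (730, 1090, 2440),   # as in "hod"
    "oo": (300,  870, 2240),   # as in "who'd"
}

def targets_for(phonemes):
    """Retrieve stored acoustic targets for an utterance.

    This returns only steady-state values; a real synthesizer must also
    compute the transitions between segments and adjust each segment to
    its context, which is exactly the problem raised in the text."""
    return [FORMANTS[p] for p in phonemes]

print(targets_for(["ee", "ah", "oo"]))
```

The lookup is trivially easy; the hard part, as the text notes, is everything this sketch leaves out between one segment and the next.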
This brings us to another level of representation. So far we have talked of the acoustic nature of speech, but when we have to cope with transitions between discrete sound segments it may be fruitful to consider articulatory properties as well. Any model of the speech production process is in effect a model of the articulatory process that generates the speech. Some speech research is concerned with modelling the vocal tract directly, rather than modelling the acoustic output from it. One might specify, for example, the position of the tongue and the posture of the jaw and lips for a vowel, instead of giving its frequency-related characteristics. This is a potent tool in linguistic research, for it brings one closer to human production of speech, in particular to the connection between brain and articulators.

Articulatory synthesis holds a promise of high-quality speech, for the transitional effects caused by tongue and jaw inertia can be modelled directly. However, this potential has not yet been realized. Speech from current articulatory models is of much poorer quality than that from acoustically-based synthesis methods. The major problem is in gaining data about articulatory behaviour during running speech: it is much easier to perform acoustic analysis on the resulting sound than it is to examine the vocal organs in action. Because of this, the subject is not treated in this book. We will only look at articulatory properties insofar as they help us to understand, in a qualitative way, the acoustic nature of speech.

Speech, however, is much more than mere articulation. Consider (admittedly a rather extreme and chauvinistic example) the number of ways a girl can say "yes". Breathy voice, slow tempo, low pitch: these are all characteristics which affect the utterance as a whole, rather than being classifiable into individual sound segments. Linguists call them "prosodic" or "suprasegmental" features, for they relate to overall aspects of the utterance, and distinguish them from "segmental" ones, which concern the articulation of individual segments of syllables. The most important prosodic features are pitch, or fundamental frequency of the voice, and rhythm.

This chapter provides a brief introduction to the nature of the speech signal. Depending upon what speech output techniques we use, it may be necessary to understand something of the acoustic nature of the speech signal; the system that generates it (the vocal tract); commonly-used classifications of sound segments; and the prosodic aspects of speech. This material is little used in the early chapters of the book, but becomes increasingly important as the story unfolds. Hence you may skip the remainder of this chapter if you wish, but should return to it later to pick up more background whenever it becomes necessary.

2.1 The anatomy of speech

The so-called "voiced" sounds of speech, like the sound you make when you say "aaah", are produced by passing air up from the lungs through the larynx or voicebox, which is situated just behind the Adam's apple. The vocal tract from the larynx to the lips acts as a resonant cavity, amplifying certain frequencies and attenuating others.

The waveform generated by the larynx, however, is not simply sinusoidal. (If it were, the vocal tract resonances would merely give a sine wave of the same frequency, amplified or attenuated according to how close it was to the nearest resonance.) The larynx contains two folds of skin, the vocal cords, which blow apart and flap together again in each cycle of the pitch period. The pitch of a male voice in speech varies from as low as 50 Hz (cycles per second) to perhaps 250 Hz, with a typical median value of 100 Hz. For a female voice the range is higher, up to about 500 Hz in speech. Singing can go much higher: a top C sung by a soprano has a frequency of just over 1000 Hz, and some opera singers can reach substantially higher than this.

The flapping action of the vocal cords gives a waveform which can be approximated by a triangular pulse (this and other approximations will be discussed in Chapter 5).
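As a minimal sketch of that approximation (the shape parameters below are invented for illustration; Chapter 5 discusses proper models of the glottal waveform), one pitch period can be built as an asymmetric triangle, a slow rise as the cords part and a quicker fall as they snap shut, followed by a closed phase, and then repeated at the fundamental frequency:

```python
import numpy as np

def triangular_glottal_source(f0=100.0, fs=8000, seconds=0.05,
                              rise_frac=0.4, fall_frac=0.2):
    """Crude triangular approximation to the glottal waveform.

    rise_frac and fall_frac (illustrative choices, not the book's
    figures) set the open phase of each pitch period; the remainder
    is the closed phase, where the cords are shut."""
    period = int(round(fs / f0))             # samples per pitch period
    rise = max(1, int(period * rise_frac))   # cords parting
    fall = max(1, int(period * fall_frac))   # cords snapping shut
    pulse = np.concatenate([
        np.linspace(0.0, 1.0, rise, endpoint=False),
        np.linspace(1.0, 0.0, fall, endpoint=False),
        np.zeros(period - rise - fall),      # closed phase
    ])
    return np.tile(pulse, int(seconds * f0))

# A pulse train at 100 Hz, the typical median male pitch quoted above.
excitation = triangular_glottal_source(f0=100.0)
```

Passing an excitation of this kind through resonant filters standing in for the vocal tract is, in outline, how acoustically-based synthesis produces voiced sounds.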