Alternatively, the last number can be stored for future re-dialling, freeing
the phone for other calls.
"Short code
dialling" allows a customer to associate short codes with commonly-dialled
numbers.
Alarm calls can be booked at specified times, and are made automatically
without human intervention.
Incoming calls can be barred, as can outgoing ones. A diversion service
allows all incoming calls to be diverted to another telephone, either
immediately, or if a call to the original number remains unanswered for
a specified period of time, or if the original number is busy.
Three-party calls can be set up automatically, without involving the
operator.
.pp
Making use of these facilities presents the caller with something of a problem.
With conventional telephone exchanges, feedback is provided on what is happening
to a call by the use of four tones \(em the dial tone, the busy tone,
the ringing tone, and the number unavailable tone.
For the more sophisticated interaction which is expected on the advanced
exchange, a much greater variety of status signals is required.
The obvious solution is to use
computer-generated spoken
messages to inform the caller when these services are invoked, and to guide him
through the sequences of actions needed to set up facilities like call
re-direction. For example, the messages used by the exchange when a user
accesses the alarm call
service are
.LB
.NI
Alarm call service.
Dial the time of your alarm call followed by square\u\(dg\d.
.FN 1
\(dg\d"Square" is the term used for the "#" key on the touch-tone telephone.\u
.EF
.NI
You have booked an alarm call for seven thirty hours.
.NI
Alarm call operator.
At the third stroke it will be seven thirty.
.LE
.pp
Because of the rather small vocabulary, the number of messages that can be
stored in their entirety rather than being formed by concatenation of
smaller units, and the short time which was available for development,
System\ X stores speech as a time waveform, slightly compressed by a time-domain
encoding operation (such techniques are described in Chapter 3).
Utterances which contain variable parts, like the time of alarm in the messages
above, are formed by inserting separately-recorded digits in a fixed "carrier"
message. No attempt is made to apply uniform intonation
contours to the synthetic utterances. The resulting speech is of excellent
quality (being a slightly compressed recording of a human voice), but sometimes
exhibits somewhat anomalous pitch contours.
For example, the digits comprising numbers often sound rather jerky and
out-of-context \(em which indeed they are.
.pp
Even more advanced facilities can be expected on telephone exchanges in
the future. A message storage capability is one example. Although
automatic call recording machines have been available for years, a centralized
facility could time and date a message, collect the caller's identity
(using the telephone keypad), and allow the recipient to select messages left
for him through an interactive dialogue so that he could control the order
in which he listens to them. He could choose to leave certain messages to be
dealt with later, or re-route them to a colleague. He may even wish to leave
reminders for himself, to be dialled automatically at specified times (like
alarm calls with user-defined information attached). The sender of a message
could be informed automatically by the system when it is delivered.
None of
this requires speech recognition, but it does need economical speech
.ul
storage,
and also speech
.ul
synthesis
(for time and date tags).
.sh "1.5 Travel consultant"
.pp
Some current research in speech communication with computers is aimed at
embedding the dialogue in a task environment which provides it with an overall
contextual framework.
This in fact was the motivation for the experimental telephone enquiry service
described above, and it was observed that people are able to tolerate rather
low quality speech if they feel that the interaction is worthwhile. One step
further is the use of speech
.ul
input
as well as output, and here the context of the conversation which is provided
by the task environment can be used to improve recognition accuracy
significantly.
A system set up at Bell Telephone Laboratories indicates what we can expect
to find operating commercially in the future (Levinson and Shipley, 1980).
.[
Levinson Shipley 1980
.]
It provides timetable information about airline flights, and allows the user
to make reservations.
.pp
An example dialogue is given in Table 1.7.
.RF
.fi
.nh
.na
.in 0.3i
.nr x0 \w'COMPUTER: '
.nr x1 \w'CALLER: '
.in+\n(x0u
.ti-\n(x0u
COMPUTER: This is the Bell Laboratories flight information system. May I help you?
.ti-\n(x0u
CALLER:\h'\n(x0u-\n(x1u' I would like to go to Chicago.
.ti-\n(x0u
COMPUTER: Please specify the day of your departure.
.ti-\n(x0u
CALLER:\h'\n(x0u-\n(x1u' I want to leave at four p.m.
.ti-\n(x0u
COMPUTER: Please specify the day of your departure.
.ti-\n(x0u
CALLER:\h'\n(x0u-\n(x1u' I want to depart on the two five May.
.ti-\n(x0u
COMPUTER: Flight number nine two leaves New York at four p.m.,
arrives in Chicago at five twenty five p.m.
.ti-\n(x0u
CALLER:\h'\n(x0u-\n(x1u' What plane is on the flight?
.ti-\n(x0u
COMPUTER: The aircraft is a Boeing seven twenty seven.
.ti-\n(x0u
CALLER:\h'\n(x0u-\n(x1u' How much is the fare?
.ti-\n(x0u
COMPUTER: The round trip coach fare is one hundred forty two dollars.
.ti-\n(x0u
CALLER:\h'\n(x0u-\n(x1u' I want a first class seat.
.ti-\n(x0u
COMPUTER: One seat thank you.
.ti-\n(x0u
CALLER:\h'\n(x0u-\n(x1u' What is the fare?
.ti-\n(x0u
COMPUTER: The round trip first class fare is one hundred eighty four dollars.
.ti-\n(x0u
CALLER:\h'\n(x0u-\n(x1u' Please repeat the arrival time.
.ti-\n(x0u
COMPUTER: The arrival time is five twenty five p.m.
.ti-\n(x0u
CALLER:\h'\n(x0u-\n(x1u' I will take the flight.
.ti-\n(x0u
COMPUTER: Your reservation on flight number nine two to Chicago on Friday is confirmed. Thank you.
.in 0
.nf
.MT 2
Table 1.7 A conversation with an airline flight information service
(from Levinson and Shipley, 1980)
.TE
.[
Levinson Shipley 1980
.]
The user dials the system from an
ordinary telephone. The recognition side must be trained by each user, and
accepts isolated words spoken with brief pauses between them.
The voice response unit has a vocabulary of around 200 words, and
synthesizes its answers by slotting words into "templates" evoked by the speech
understanding part in response to a query. For example,
.LB
.NI
This flight makes \(em stops
.NI
Flight number \(em leaves \(em at \(em , arrives in \(em at \(em
.LE
are templates which when called with specific slot fillers could produce the
utterances
.LB
.NI
This flight makes three stops
.NI
Flight number nine two leaves New York at four p.m.,
arrives in Chicago at five twenty-five p.m.
.LE
The chief research interest of the system is in its speech understanding
capabilities, and the method used for speech output is relatively
straightforward.
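.pp
The slot-filling scheme just described can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the Bell Laboratories
implementation: the template names, the slot names, and the functions
slot_names and fill_template are all hypothetical, and plain words stand in for
the recorded segments that a stored-speech system would concatenate.
.nf

```python
from string import Formatter

# Hypothetical templates modelled on the two examples in the text.
# Each {name} marks a slot; the slot names are invented for illustration.
TEMPLATES = {
    "stops": "This flight makes {count} stops",
    "schedule": ("Flight number {flight} leaves {origin} at {depart}, "
                 "arrives in {destination} at {arrive}"),
}


def slot_names(template):
    """List the slot names that appear in a template string."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


def fill_template(name, **fillers):
    """Return the sequence of words making up one utterance.

    In a stored-speech system each word would select a recorded segment;
    here the words themselves stand in for those recordings.
    """
    template = TEMPLATES[name]
    missing = [slot for slot in slot_names(template) if slot not in fillers]
    if missing:
        raise KeyError("unfilled slots: " + ", ".join(missing))
    return template.format(**fillers).split()


if __name__ == "__main__":
    words = fill_template("schedule", flight="nine two", origin="New York",
                          depart="four p.m.", destination="Chicago",
                          arrive="five twenty-five p.m.")
    print(" ".join(words))
```

.fi
Because every word is retrieved and played independently of its neighbours,
a scheme of this kind gives no control over pitch across the join between
template and slot filler.
.pp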
The templates and words are recorded, digitized, compressed
slightly, and stored on disk files (totalling a few hundred thousand bytes of
storage), using techniques similar to those of System\ X.
Again, no independent manipulation of pitch is possible, and so the utterances
sound intelligible but the transition between templates and slot fillers is not
completely fluent. However, the overall context of the interaction means that
the communication is not seriously disrupted even if the machine occasionally
misunderstands the man or vice versa. The user's attention is drawn away from
recognition accuracy and focussed on the exchange of information with the machine.
The authors conclude that progress in speech recognition can best be made by
studying it in the context of communication rather than in a vacuum or as part
of a one-way channel, and the same is undoubtedly true of speech synthesis as
well.
.sh "1.6 Reading machine for the blind"
.pp
Perhaps the most advanced attempt to provide speech output from a computer
is the Kurzweil reading machine for the blind, first marketed in the late
1970's (Figure 1.4).
.FC "Figure 1.4"
This device reads an ordinary book aloud. Users adjust the reading
speed according to the content of the material and their familiarity with
it, and the maximum rate has recently been improved to around 225 words per
minute \(em perhaps half as fast again as normal human speech rates.
.pp
As well as generating speech from text, the machine has to scan the document
being read and identify the characters presented to it. A scanning camera
is used, controlled by a program which searches for and tracks the lines of
text. The output of the camera is digitized, and the image is enhanced
using signal-processing techniques. Next each individual letter must be
isolated, and its geometric features identified and compared with a pre-stored
table of letter shapes.
Isolation of letters is not at all trivial, for
many type fonts have "ligatures" which are combinations of characters joined
together (for example, the letters "fi" are often run together). The
machine must cope with many printed type fonts, as well as typewritten ones.
The text-recognition side of the Kurzweil reading machine is in fact one of
its most advanced features.
.pp
We will discuss the problem of speech generation from text in Chapter 9.
It has many facets. First there is pronunciation, the
translation of letters to sounds. It is important to take into account
the morphological structure of words, dividing them into "root" and "endings".
Many words have concatenated suffixes (like "like-li-ness"). These are
important to detect, because a final "e" which appears on a root word
is not pronounced itself but affects the pronunciation of the previous
vowel. Then there is the difficulty that some words look the same
but are pronounced differently, depending on their meaning or on the syntactic
part that they play in the sentence.
Appropriate intonation is extremely difficult to generate from a plain textual
representation, for it depends on the meaning of the text and the way in which
emphasis is given to it by the reader. Similarly the rhythmic structure is
important, partly for correct pronunciation and partly for purposes of
emphasis.
Finally the sounds that have been deduced from the text need to be synthesized
into acoustic form, taking due account of the many and varied contextual effects
that occur in natural speech. This by itself is a challenging problem.
.pp
The performance of the Kurzweil reading machine is not good. While it seems
to be true that some blind people can make use of it, it is far from
comprehensible to an untrained listener.
For example,
it will miss out words and even whole phrases, hesitate in a
stuttering manner, blatantly mis-pronounce many words, fail to detect
"e"s which should be silent, and give completely wrong rhythms
to words, making them impossible to understand.
Its intonation is decidedly unnatural, monotonous, and often downright
misleading. When it reads completely new text to people unfamiliar with its
quirks,
they invariably fail to understand more than an odd word here and there,
and do not improve significantly when the text is repeated more than once.
Naturally performance improves if the material is familiar or expected
in some way.
One useful feature is the machine's ability to spell out difficult words
on command from the user.
.pp
While not wishing to denigrate the Kurzweil machine, which is a remarkable
achievement in that it integrates together many different advanced
technologies, there is no doubt that the state of the art in speech synthesis
directly from unadorned text is extremely primitive, at present.
It is vital not to overemphasize the potential usefulness of abysmal speech,
which takes a great deal of training on the part of the user before
it becomes at all intelligible. To make a rather extreme analogy,
Morse code could be used as
audio output, requiring a great deal of training, but capable of being understood
at quite high rates by an expert.
It could be generated very cheaply.
But clearly the man in the street would find it quite unacceptable as
an audio output medium, because of the excessive effort required to learn to use
it.
In many applications, very bad synthetic speech is just as useless.
However, the issue is complicated by the fact that for people who use
synthesizers regularly, synthetic speech becomes quite easily comprehensible.
We will return to the problem of evaluating the quality of artificial speech
later in the book (Chapter 8).
.sh "1.7 System considerations for speech output"
.pp
Fortunately, very many of the applications of speech output from computers
do not need to read unadorned text.
In all the example systems described above (except the reading machine),
it is enough to be able to store utterances in some representation which can
include pre-programmed cues for pronunciation, rhythm, and intonation in
a much more explicit way than ordinary text does.
.pp
Of course, techniques
for storing audio information have been in use for decades.
For example, a domestic cassette tape recorder stores speech at much better
than telephone quality at very low cost. The method of direct
recording of an analogue waveform is currently used for announcements in
the telephone network to provide information such as the time, weather
forecasts, and even bedtime stories.
However, it is difficult to provide rapid access to messages stored in
analogue form, and although some computer peripherals which use analogue
recordings for voice-response applications have been marketed \(em they are
discussed briefly at the beginning of Chapter 3 \(em they have been
superseded by digital storage techniques.
.pp
Although direct storage of a digitized audio waveform is used in some
voice-response systems, the approach has certain limitations. The most
obvious one is the large storage requirement: suitable coding can reduce
the data-rate of speech to as little as one hundredth of that needed by
direct digitization, and textual representations reduce it by another factor
of ten or twenty. (Of course, the speech quality is inevitably compromised
somewhat by data-compression techniques.)
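.pp
The ratios quoted above can be turned into rough storage figures.
The short Python sketch below does the arithmetic; the reduction factors come
from the text, but the direct-digitization rate of 64,000 bits per second
(8000 samples per second at 8 bits each) is an assumption made here for
illustration, and the names storage_bytes and compare are hypothetical.
.nf

```python
# Assumption (not from the text): direct digitization at 8000 samples/s, 8 bits each.
DIRECT_BITS_PER_SECOND = 64_000
# From the text: coding can reach "as little as one hundredth" of the direct rate.
CODING_FACTOR = 100
# From the text: text reduces the rate by "another factor of ten or twenty".
TEXT_FACTORS = (10, 20)


def storage_bytes(seconds, bits_per_second):
    """Bytes needed to hold `seconds` of speech at the given data rate."""
    return seconds * bits_per_second / 8


def compare(seconds):
    """Return storage estimates, in bytes, for one utterance of given length."""
    coded_rate = DIRECT_BITS_PER_SECOND / CODING_FACTOR
    return {
        "direct": storage_bytes(seconds, DIRECT_BITS_PER_SECOND),
        "coded": storage_bytes(seconds, coded_rate),
        "text_high": storage_bytes(seconds, coded_rate / TEXT_FACTORS[0]),
        "text_low": storage_bytes(seconds, coded_rate / TEXT_FACTORS[1]),
    }


if __name__ == "__main__":
    for label, size in compare(10).items():
        print(label, size)
```

.fi
Under these assumptions a ten-second message occupies 80,000 bytes when
digitized directly, 800 bytes at the most aggressive coding rate, and between
40 and 80 bytes as text.
.pp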
However, the cost of storage is
dropping so fast that this is not necessarily an overriding factor.
A more fundamental limitation is that utterances stored directly cannot sensibly
be modified in any way to take account of differing contexts.
.pp
If the results of certain kinds of analyses
of utterances are stored, instead of simply the digitized waveform,
a great deal more flexibility can be gained.
It is possible to separate out the features of intonation and amplitude from
the articulation of the speech, and this raises the attractive possibility
of regenerating utterances with pitch contours different from those with which they were
recorded.
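.pp
One way to picture such a representation is as a sequence of analysis frames,
each holding pitch, amplitude, and articulation separately.
The Python sketch below is a hypothetical layout chosen for illustration, not
a description of any particular analysis method: the Frame fields, the
function names, and the linear falling contour are all assumptions.
.nf

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Frame:
    """One analysis frame of a stored utterance (hypothetical layout)."""
    pitch_hz: float                  # intonation
    amplitude: float                 # loudness
    articulation: Tuple[float, ...]  # parameters describing the speech sound


def with_pitch_contour(frames: List[Frame], contour: List[float]) -> List[Frame]:
    """Return a copy of the utterance carrying a new pitch contour.

    Amplitude and articulation are left untouched, which is the benefit
    of storing the three kinds of information separately.
    """
    if len(frames) != len(contour):
        raise ValueError("contour must supply one pitch value per frame")
    return [replace(frame, pitch_hz=pitch)
            for frame, pitch in zip(frames, contour)]


def falling_contour(n_frames: int, start_hz: float, end_hz: float) -> List[float]:
    """Linear fall from start_hz to end_hz: a crude stand-in for a real contour."""
    if n_frames == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (n_frames - 1)
    return [start_hz + i * step for i in range(n_frames)]


if __name__ == "__main__":
    recorded = [Frame(120.0, 0.5, (1.0, 2.0)),
                Frame(125.0, 0.6, (1.5, 2.5)),
                Frame(118.0, 0.4, (2.0, 3.0))]
    contour = falling_contour(len(recorded), 140.0, 100.0)
    for frame in with_pitch_contour(recorded, contour):
        print(frame)
```

.fi
A directly stored waveform offers no such handle: pitch, loudness, and
articulation are bound together in the samples, so none of them can be
changed without re-recording.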