http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html

sounds are caused by forces exerted on the laryngeal walls when air flows through the glottis (the gap between the vocal cords <A HREF=#note:1>*</A>).  Hess <A HREF=#ref:hess>[5]</A> describes a model of the vocal cords as proposed by Hirano <A HREF=#ref:hirano>[6]</A>.  For the purposes of this paper, though, it is sufficient to know that the glottis repeatedly opens and closes, thus providing bursts of air through the vocal tract.<P>The vocal tract can be modeled as a linear passive transmission system with a transfer function H(z).  If we add an additional transfer function R(z) to take the radiation into account, the output impedance of the vocal tract can be set approximately to zero.  In the neutral position, where the vocal tract can be regarded as a uniform tube, the resonances of the vocal tract occur at sound wavelengths of <IMG ALT="[IMG]" SRC="http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming/eq1.gif"> <BR> With L = 17 cm (the average vocal-tract length) and a sound propagation speed of c = 340 m/s, the frequencies of these resonances are: <IMG ALT="[IMG]" SRC="http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming/eq2.gif"> <BR> The frequencies F[k] are called formant frequencies.<P>The sound that we hear is considered to be the convolution of the excitation pulse created by the glottis with the formant frequencies.  Therefore, to model a speech signal, we start with a train of excitation pulses as shown in figure 2.  For the formant frequencies, we use equation (2) with k = 1, 2, and 3, giving F[1] = 500 Hz, F[2] = 1500 Hz, and F[3] = 2500 Hz.
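<P>As a quick check on equation (2), the resonance frequencies of the uniform-tube model can be computed directly.  The following is a minimal sketch (the function name and the choice of centimetre units are ours, not from the paper):

```python
# Resonances of a uniform tube closed at the glottis and open at the
# lips, per equation (2): F[k] = (2k - 1) * c / (4 * L).
def formant_frequencies(L=17.0, c=34000.0, n=3):
    """L: vocal-tract length in cm, c: speed of sound in cm/s.

    Returns the first n formant frequencies in Hz.
    """
    return [(2 * k - 1) * c / (4 * L) for k in range(1, n + 1)]

print(formant_frequencies())  # [500.0, 1500.0, 2500.0]
```

With L = 17 cm and c = 340 m/s this reproduces the 500 Hz, 1500 Hz, and 2500 Hz formants used in the synthesis below.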
Combining these frequencies and adding an exponential envelope produces the formant structure shown in figure 3.  Convolving the train of excitation pulses with the formant structure yields a synthesized pitch as shown in figure 4.<P><IMG ALT="[IMG]" SRC="http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming/fig2.gif"><BR><STRONG>Figure 2.</STRONG> Excitation signal used to create the synthesized pitch.  The period of the excitation train is T[0] = 0.01 s, making the pitch 100 Hz.<P><IMG ALT="[IMG]" SRC="http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming/fig3.gif"><BR><STRONG>Figure 3.</STRONG> Formant structure created using 500 Hz, 1500 Hz, and 2500 Hz as the formant frequencies.<P><IMG ALT="[IMG]" SRC="http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming/fig4.gif"><BR><STRONG>Figure 4.</STRONG> Synthesized pitch of 100 Hz created by convolving the train of excitation pulses (spaced by 0.01 s) with the formant structure.<P>Now that we have a model of the human voice, how can it be converted to pitch?  The most prevalent view is that what we hear as pitch is actually the frequency at which the bursts of air occur.  So if we can track those bursts of air, we can find the pitch of the segment.<H3> <A NAME="subsec:Tracking">Tracking pitch</A></H3>We tried three methods for tracking pitch: autocorrelation, maximum likelihood, and cepstrum analysis.<UL><LI> Autocorrelation <P>Autocorrelation is one of the oldest of the classical pitch trackers <A HREF=#ref:autocor>[7]</A>.  Autocorrelation isolates and tracks the peak energy levels of the signal, which is a measure of the pitch.  Referring back to figure 3, we see that the signal s(n) peaks where the impulses occur.
Therefore, tracking the frequency of these peaks should give us the pitch of the signal.<P>To find the frequency of these peaks, we can employ autocorrelation as defined by:<BR><IMG ALT="[IMG]" SRC="http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming/eq3.gif"><BR>Unfortunately, autocorrelation is subject to aliasing (picking an integer multiple of the actual pitch) and is computationally complex.  We found our implementation of autocorrelation to require approximately 45 seconds for 10 seconds of 44 kHz, 16-bit audio on a 90 MHz Pentium workstation.<LI>Maximum Likelihood<P>Maximum likelihood <A HREF=#ref:mlh>[14]</A> is a modification of autocorrelation that increases the accuracy of the pitch estimate and decreases the chance of aliasing.<P>Unfortunately, the computational complexity of this method makes autocorrelation look blindingly fast.  A straightforward implementation in Matlab takes approximately one hour to evaluate 10 seconds of audio on a 90 MHz Pentium workstation.  With some optimizations, we improved the performance to approximately 15 minutes per 10 seconds of audio, but this is still far too slow for our purposes.  Therefore, we discarded this method.  For a detailed explanation of this method, the reader may refer to <A HREF=#ref:mlh>[14]</A>.<LI>Cepstrum Analysis<P>Cepstrum analysis is the definitive classical method of pitch extraction.  For an explanation, the reader is directed to Oppenheim and Schafer's original work in <A HREF=#ref:opp>[10]</A> or to a more compact treatment in <A HREF=#ref:dtsp>[11]</A>.
We found that this method did not give very accurate results for humming.</UL><P>The output of these methods can be construed as a sequence of frequency estimates for successive pitches in the input.  We convert these estimates into a three-step contour representation by comparing each estimated pitch with the previous one.  In our system, adjacent pitches are considered the same if they are within a quarter-step of each other (on an equal-tempered musical scale), but this parameter is adjustable.<P>After analyzing the costs and benefits of these methods, we decided to use a modified form of autocorrelation for our implementation.<H5><A HREF="#sec:Tracking">&lt;-- Tracking Pitch in Hummed Queries</A></H5><H5><A HREF="#toc">&lt;-- Table of Contents</A></H5><HR><H2><A NAME="sec:Searching">Searching the database</A></H2><P>Having described how the user input (a hummed tune) is converted into a string over a three-letter alphabet, we now discuss our method for searching an audio database.  Our method is simple: songs in the database are preprocessed to convert the melody into a stream of U, D, S characters, and the converted user input (the <em>key</em>) is compared with all the songs.  The pattern matching uses a ``fuzzy'' search to allow for errors within the matches.  These errors reflect the inaccuracies in the way people hum as well as errors in the representation of the songs themselves.<P>For performing the key-search within the database we need an efficient approximate pattern matching algorithm.
By ``approximate'' we mean that the algorithm should be able to take various forms of error into account.<P>Figure 5 summarizes the forms of error anticipated in a typical pattern matching scheme.<P><IMG ALT="[IMG]" SRC="http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming/fig5.gif"><BR><STRONG>Figure 5.</STRONG> Three forms of anticipated errors, each with one mismatch.<P>The algorithm that we adopted for this purpose is described by Baeza-Yates and Perleberg <A HREF=#ref:asifa>[1]</A>.  This algorithm addresses the problem of string matching with <i>k</i> mismatches.  The problem consists of finding all instances of a pattern string <i>P = p1 p2 p3 ... pm</i> in a text string <i>T = t1 t2 t3 ... tn</i> such that there are at most <i>k</i> mismatches (characters that are not the same) for each instance of <i>P</i> in <i>T</i>.  When <i>k = 0</i> (no mismatches) we have the simple string matching problem, solvable in <i>O(n)</i> time.  When <i>k = m</i>, every substring of <i>T</i> of length <i>m</i> qualifies as a match, since every character of <i>P</i> can be mismatched.  Each of the errors in the figure above corresponds to <i>k = 1</i>.<P>It is worth mentioning that several algorithms have been developed to address approximate string matching, with running times ranging from <i>O(mn)</i> for the brute-force algorithm to <i>O(kn)</i> <A HREF=#ref:asifb>[9]</A> or <i>O(n log(m))</i> <A HREF=#ref:asifd>[2]</A>.
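<P>As a concrete illustration of the <i>k</i>-mismatch problem, the following sketch is the brute-force <i>O(mn)</i> formulation, not the Baeza-Yates and Perleberg algorithm itself; the contour strings over the U, D, S alphabet are hypothetical examples of ours:

```python
# Brute-force k-mismatch search: report every position in text T where
# pattern P matches with at most k character mismatches.  This is the
# O(mn) baseline that the adopted algorithm improves on in the average case.
def k_mismatch_search(P, T, k):
    m, n = len(P), len(T)
    hits = []
    for i in range(n - m + 1):
        mismatches = sum(1 for a, b in zip(P, T[i:i + m]) if a != b)
        if mismatches <= k:
            hits.append(i)
    return hits

# Hypothetical song contour and hummed key over the {U, D, S} alphabet:
song = "SUUDSUDDSUUDSUD"
key = "UUDSUDD"
print(k_mismatch_search(key, song, k=1))  # [1]
```

With <i>k</i> = 0 only exact occurrences of the key are reported; raising <i>k</i> widens the net, which is how the error-tolerance discussed below maps onto the search.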
The algorithm that we adopted offers better average-case performance than most other algorithms.<P>The worst case for this algorithm occurs when <i>P</i> (the key) consists of <i>m</i> occurrences of a single distinct character and <i>T</i> (the contour representation of a song) consists of <i>n</i> instances of that character; in this case the running time is <i>O(mn)</i>.  However, this is neither a common nor a useful situation for our purposes.  In the average case, with an alphabet in which each character is equally likely to occur, the running time is<BR><IMG ALT="[IMG]" SRC="http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming/eq4.gif"><BR>where <IMG ALT="[IMG]" SRC="http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming/eq5.gif"> is the size of the alphabet.<P>The database incorporates the key-searching scheme (using the pattern matching techniques explained above).  We envisioned the following design goals for the database.  For a given query, the database returns a list of songs ranked by how well they matched the query, not just one best match.  The number of matches that the database retrieves depends upon the error-tolerance used during the key-search.  This error-tolerance can be set in one of two ways: either as a user-definable parameter, or determined by the database itself using heuristics based, for example, on the length of the key.  This design gives the user an opportunity to perform queries even when unsure of some notes within the tune.<P>From the results of the query the user can identify the song of interest.  If the list is too large, the user can perform a new query on a restricted search list consisting of the songs just retrieved.
A consequence of this scheme is that the user can identify sets of songs that contain similar melodies.<H5><A HREF="#toc">&lt;-- Table of Contents</A></H5><HR>
