FI124869B

FI124869B - Voice activity detector and approver for noisy environments

Info

Publication number: FI124869B
Application number: FI20041013A
Authority: FI
Inventors: Douglas Ralph Ealey; Holly Louise Kelleher; David John Benjamin Pearce
Original assignee: Motorola Mobility Llc
Priority date: 2002-01-24
Filing date: 2004-07-22
Publication date: 2015-02-27
Also published as: GB2384670A; KR20040075959A; GB2384670B; CN1623186A; WO2003063138A1; JP2005516247A; FI20041013A; KR20090127182A; GB0201585D0; KR100976082B1; CN1307613C; JP2010061151A

Description

Voice activity detector and approver for noisy environments

Field of the Invention

The invention relates to the detection of voice (commonly known as voice activity detection (VAD)) in a noisy environment. The invention is applicable to, but not limited to, energy acceleration measurement of voice signals in a speech recognition system.

Background of the Invention

Many voice communication systems, such as the Global System for Mobile Communications (GSM) cellular telephony standard and the TErrestrial Trunked RAdio (TETRA) system for private mobile radio users, use speech processing units to encode and decode speech patterns. In such voice communication systems, a speech encoder converts the analog speech pattern into a suitable digital format for transmission. A speech decoder converts a received digital speech signal into an audible audio speech pattern.

Methods and apparatus for detecting voice activity are known in the art. A voice activity detector (VAD) operates on the assumption that speech is present for only part of the time in an audio signal. This assumption is usually correct, because audio signals contain many intervals in which only silence or background noise occurs. A voice activity detector can be used for many purposes. These include suppressing overall transmission activity in a transmission system when no speech is present, potentially saving power and channel bandwidth. When the VAD detects that speech activity has resumed, it can restart transmission. A voice activity detector can also be used with speech storage devices to distinguish audio portions that contain speech from "speechless" portions. The portions containing speech are then stored in the storage device and the "speechless" portions are discarded.

Conventional methods of detecting voice are based, at least in part, on detecting and estimating the power of the speech signal. The estimated power is compared with either a constant or an adaptive threshold to decide whether or not the signal is speech. The main advantage of these methods is their low complexity, which makes them suitable for implementations with limited processing resources. The main disadvantage is that background noise can inadvertently cause "speech" to be detected when no "speech" is actually present. Conversely, "speech" that is present may go undetected because background noise obscures it and makes it hard to detect.
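The conventional power-threshold approach described above can be sketched as follows. This is a minimal illustration only, not a method taken from any cited specification: the function names, the sample values and the threshold of 0.05 are all hypothetical.

```python
def frame_power(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def power_threshold_vad(frames, threshold):
    """Flag a frame as speech when its power exceeds a fixed threshold."""
    return [frame_power(f) > threshold for f in frames]


# Hypothetical frames: quiet background, a speech burst, and a loud noise burst.
quiet = [0.01, -0.02, 0.015, -0.01]
speech = [0.5, -0.6, 0.55, -0.45]
loud_noise = [0.4, -0.35, 0.38, -0.42]

decisions = power_threshold_vad([quiet, speech, loud_noise], threshold=0.05)
print(decisions)
```

With these illustrative values the loud noise frame is flagged as speech, which is the false-detection weakness noted above.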

Some voice activity detection methods are intended for a noisy car environment and are based on adaptive filtering of the speech signal. This reduces the noise content of the signal before the final decision is made. The frequency spectrum and noise level can vary, because the method is used with different speakers and in different environments. The input filter and thresholds are therefore adaptive, so that they can track these variations.

Examples of these methods are given in GSM technical specification 06.42, "Voice Activity Detector (VAD) for half rate, full rate and enhanced full rate speech traffic channels respectively". Another such method is the "Multiboundary Voice Activity Detection Algorithm" set out in Annex B of ITU G.729. These methods are accurate in a noisy environment but are considerably complex to implement.

All these methods require the speech signal as input. Some applications that use speech decompression methods require that speech detection be performed during the speech decompression process.

European Patent Application No. EP-A-0785419, naming Benyassine et al. as inventors, is directed to a method of voice activity detection comprising the steps of: (i) extracting a predetermined set of parameters from the incoming speech signal for each frame; and (ii) making a frame voicing decision on the incoming speech signal for each frame according to a set of difference measures extracted from the predetermined set of parameters.

The VAD in cellular telephone systems is biased to ensure that when a party is speaking, the radio system (including the speech codec, RF circuitry and so on) is active to convey that speech to the other party in the presence of background noise and other impairments. However, this causes data to be transmitted when the party is not speaking. The cost is slightly reduced battery life and slightly increased interference to users of the same frequency channel in other cells of the system. In terms of importance, these are essentially second-order (or higher-order) effects. These systems have no concept of a limited resource being available for a two-way call. It is entirely possible, and consistent, for the uplink and the downlink, which normally use different carriers, to use full bandwidth simultaneously.

In the field of this invention it is known that some voice activity detectors or voice onset detectors (VAD/VOD) attempt to use properties of speech, such as its harmonic structure (for example, by means of autocorrelation), to discriminate voiced speech. In noise, however, these structural indicators can fail, either because the speech structure breaks down or because structure is present in the noise itself. This may be, for example, engine, tire or air-conditioning noise inside a car. These methods are also poor at detecting unvoiced speech.

An alternative is simply to use the energy level of a frame to detect speech. This is sufficient for speech under good signal-to-noise ratio (SNR) conditions, where an arbitrary threshold above the noise level can be set to indicate speech. However, this method does not work in more realistic noise conditions.

For non-normalized databases, that is, real-world applications, noise levels in one set of samples may well be higher than speech levels in another, which makes it impossible to set a single threshold. The traditional way to cope with this is to average the first 100 ms or so of an utterance, on the assumption that it represents noise, and to create a threshold specific to that case. Again, this is not sufficient for non-stationary noise, where the noise can depart abruptly from the initial estimate when it has high variance, or when the first few frames actually contain speech rather than the assumed noise.
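The traditional initial-estimate approach can be sketched as below. The sketch assumes 10 ms frames, so that 100 ms corresponds to 10 frames, and uses a hypothetical margin factor of 2 above the estimated noise energy. The function names and the energy values are illustrative only.

```python
def initial_noise_threshold(frame_energies, n_init=10, margin=2.0):
    """Threshold derived from the mean energy of the first n_init frames.

    n_init=10 assumes 10 ms frames (about 100 ms); margin is hypothetical.
    """
    head = frame_energies[:n_init]
    return margin * sum(head) / len(head)


def vad_with_initial_estimate(frame_energies, n_init=10, margin=2.0):
    """Flag frames whose energy exceeds the fixed, start-of-utterance threshold."""
    threshold = initial_noise_threshold(frame_energies, n_init, margin)
    return [e > threshold for e in frame_energies]


# Hypothetical energies: 10 noise frames, then the noise floor rises (no speech at all).
energies = [1.0] * 10 + [3.0] * 5
flags = vad_with_initial_estimate(energies)
print(flags)
```

Because the threshold is frozen after the first frames, the later rise in the noise floor is reported as speech, which illustrates why this approach is insufficient for non-stationary noise.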

There is therefore a need for an improved voice activity detector and approver for noisy environments that alleviates the above-mentioned disadvantages.

Summary of the Invention

According to a first aspect of the present invention, there is provided a communication device as set forth in claim 1.

According to a second aspect of the present invention, there is provided a method of detecting a speech signal input to a communication device, as set forth in claim 11.

According to a third aspect of the present invention, there is provided a method of deciding whether a signal input to a communication device is speech or noise, as set forth in claim 14.

Further aspects of the present invention are set forth in the dependent claims.

In summary, the present invention aims to handle the case of arbitrary amplitude and changing noise by using energy acceleration measurement, rather than energy amplitude measurement, as the primary indication of the presence or absence of speech.

Brief Description of the Drawings

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which:
FIG. 1 shows a block diagram of a communication device adapted to perform voice activity detection and approval in accordance with a preferred embodiment of the present invention;
FIG. 2 shows a flowchart of energy-acceleration-based voice activity detection for noisy environments in accordance with a preferred embodiment of the present invention;
FIG. 3 shows a flowchart of energy-acceleration-based voice activity detection for noisy environments in accordance with a preferred embodiment of the present invention; and
FIG. 4 illustrates a buffering operation in accordance with a preferred embodiment of the present invention.

Description of Preferred Embodiments

Voiced speech has a relatively high energy acceleration value, because its onset depends on the activation of the vocal cords, which are either vibrating or stationary. Similarly, unvoiced onsets (e.g. plosives) also have high energy acceleration.

The inventors have found that, in a representative domain in which the presence of voice is emphasized, such as a narrowband power spectrum or Mel spectrum, the resulting energy acceleration is considerably greater than that of stationary noise. The only notable exceptions are impulsive noises (for example, hand claps). Thus, in accordance with a preferred embodiment of the present invention, the inventors have appreciated that even these sounds can be discriminated by focusing on the energy in a frequency range likely to contain the fundamental pitch of the human voice signal. In particular, the inventors of the present invention propose using an unstructured property of speech, namely energy acceleration (or the acceleration of some measure that reflects the energy of speech or of its components).

A preferred application of the inventive concept is, in particular, distributed speech recognition (DSR), the standard for which is currently defined by the European Telecommunications Standards Institute (ETSI): "Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm", ETSI ES 201 108 V1.1.2 (2000-04), April 2000.

Reference is now made to FIG. 1, which shows a block diagram of an audio subscriber unit 100 adapted to support the inventive concept of the preferred embodiments of the present invention.

The preferred embodiment of the present invention is described with reference to a wireless audio communication device, for example one capable of operating in accordance with the 3rd Generation Partnership Project (3GPP) standard for future wireless cellular communication systems and offering DSR capabilities. However, it is within the contemplation of the invention that the inventive concept described herein, relating to voice activity detection and its approval, applies equally to any electronic device that responds to voice signals and can benefit from improved voice activity detection circuitry.

As is known in the art, the audio subscriber unit 100 contains an antenna 102, preferably coupled to a duplex filter, antenna switch or circulator 104 that provides isolation between the receive and transmit chains within the audio subscriber unit 100.

The receiver chain includes receiver front-end circuitry 106 (which performs reception, filtering and conversion to an intermediate frequency or baseband). The front-end circuitry 106 is serially coupled to a signal processing function 108 (generally realized by a digital signal processor (DSP)). The signal processing function 108 performs signal demodulation, error correction and formatting. Recovered data from the signal processing function 108 is serially coupled to an audio processing function 109, which formats the received signal in a suitable manner to send to an audio enunciator/display 111.

In different embodiments of the invention, the signal processing function 108 and the audio processing function 109 may be provided within the same physical device. A controller 114 is configured to control the information flow and the operational state of the elements of the subscriber unit 100.

As regards the transmit chain, this essentially includes an audio input device 120 coupled in series with the audio processing function 109, the signal processing function 108, transmitter/modulation circuitry 122 and a power amplifier 124. The processor 108, the transmitter/modulation circuitry 122 and the power amplifier 124 are operationally responsive to the controller. The power amplifier output is coupled to the duplex filter, antenna switch or circulator 104 and to the antenna 102 in order to radiate the final radio frequency signal.

In particular, the audio processing function 109 includes a voice activity (voice onset) detection (VAD) function 130 operably coupled to a voice activity decision function 135. In accordance with the preferred embodiments of the present invention, the VAD function 130 and the voice activity decision function 135 are adapted to provide an improved voice detection and decision mechanism, the operation of which is described with reference to FIGS. 2 and 3. Notably, the voice activity detection function 130 includes a frame-by-frame detection stage consisting of three measurements. The three measurements cover: (i) the whole spectrum; (ii) sub-bands of the spectrum; and (iii) the variance of the spectrum. The voice activity decision function 135 then makes its decision based on buffered measurements, which are analyzed for their likelihood of being speech. The final decision from the decision stage is applied retrospectively to an earlier frame in the buffer.

In the preferred embodiment of the present invention, a timer/counter 118 is also adapted to perform the timing functions in the detection and decision process of FIGS. 2 and 3.

The signal processor function 108, the audio processing function 109, the VAD function 130 and the voice activity decision function 135 may be implemented as discrete, operably coupled processing elements. Alternatively, one or more processors may be used to implement one or more of the corresponding processing operations. In a yet further alternative embodiment, the aforementioned functions may be implemented as a mixture of hardware, software and firmware elements, using application-specific integrated circuits (ASICs) and/or processors, for example digital signal processors (DSPs).

Of course, the various components within the audio subscriber unit 100 can be realized in discrete or integrated component form, with the ultimate structure being merely an arbitrary selection.

In addition, there are numerous methods by which energy acceleration information can be obtained for use in the preferred embodiment of the present invention.
(i) The theoretically ideal method is literally to double-differentiate the energy level with respect to successive frames of the utterance, as can be seen from the previously published patent application US 6009391. The drawback of this approach is that it tends to introduce delay, because the analysis must examine a number of frames on either side of the frame in question.
(ii) A zero-delay estimate of energy acceleration can be obtained by comparing the ratio of a short-term average to the instantaneous value, for example, using a frame average:

[1] or using a moving average:

[2]

In either case, the method yields a value that can be interpreted as follows: deceleration < 1 < acceleration. Empirical values can thus be found for A and for the denominator length that best discriminate speech from noise.
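For comparison, the delay inherent in method (i) above can be seen in a small sketch. It assumes the simplest discrete form of double differentiation, a central second difference over per-frame energies. The function name and the energy values are hypothetical, and this is not taken from US 6009391.

```python
def second_difference(energies):
    """Central second difference e[t+1] - 2*e[t] + e[t-1] for interior frames.

    The value for frame t needs frame t+1, so it cannot be produced
    until one further frame has arrived.
    """
    return [
        energies[t + 1] - 2 * energies[t] + energies[t - 1]
        for t in range(1, len(energies) - 1)
    ]


# Hypothetical per-frame energies with an onset starting at the fourth frame.
energies = [1.0, 1.0, 1.0, 2.0, 5.0, 10.0]
accel = second_difference(energies)
print(accel)
```

The result has two fewer entries than the input because the first and last frames lack a neighbor, which is the look-ahead cost that the zero-delay estimate of method (ii) avoids.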

The inventors of the present invention have found that a preferred optimal solution is to find a denominator that is able to track changing noise quickly, but is too long to keep up with a voice onset. The proposed value sequence for the moving average is a = 0.2, b = 0.8a, c = 0.8b, etc., which can be expressed simply by the recursion: $d_t = 0.2\,x_t + 0.8\,d_{t-1}$ [3]

Then: $A = x_t / d_t$ [4]
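Equations [3] and [4] can be sketched in a few lines. The text does not say how the average is initialized, so this sketch assumes it starts at the first frame's energy. It also assumes strictly positive energies, and the function name and sample values are hypothetical.

```python
def acceleration_ratios(energies, alpha=0.2):
    """Return A_t = x_t / d_t, where d_t = alpha * x_t + (1 - alpha) * d_{t-1}.

    Assumption: d is initialized to the first energy value; the source does
    not specify the initial condition. Energies must be strictly positive.
    """
    d = energies[0]
    ratios = []
    for x in energies:
        d = alpha * x + (1.0 - alpha) * d  # equation [3]
        ratios.append(x / d)               # equation [4]
    return ratios


# Hypothetical energies: steady noise, a sudden onset, then a drop back to noise.
energies = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 1.0]
ratios = acceleration_ratios(energies)
print([round(r, 3) for r in ratios])
```

With these values the ratio stays near 1 in steady noise, rises well above 1 at the onset and falls below 1 when the energy drops, which matches the interpretation deceleration < 1 < acceleration.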

The preferred VAD and parameter initialization scheme of the detection stage is summarized in the flowchart of FIG. 2. In non-stationary noise, long-term energy thresholds are not reliable indicators of speech. Similarly, in high-noise conditions the structure of speech (for example, its harmonics) cannot be wholly relied upon as an indicator, because it may be corrupted by noise, or structured noise may confuse the detector. The preferred voice activity detector therefore uses a noise-robust characteristic of speech, namely the energy acceleration at voice onset.

Referring now to Figure 2, a flowchart 200 of the preferred detection process is shown. As indicated above, the process involves frame-by-frame analysis. The preferred VAD mechanism concerns the "whole-spectrum" measure. First, the frame counter is evaluated to determine whether it is less than "N", which defines the number of buffered frames, as shown in step 205. In an example of the preferred embodiment, N is set to 15, on the assumption that each frame in the system lasts 10 ms. If the frame counter is less than "N" in step 205, the moving average of the initial acceleration test is updated, as shown in step 210. Otherwise, step 210 is skipped.

A determination is then made as to whether the energy acceleration measurement lies within one or more specified margins, as shown in step 235. If it does, the moving average is updated with the results of subsequent energy acceleration tests, as in step 240. If it does not, step 240 is skipped.

A determination is then made as to whether the energy acceleration measurement is greater than a specified threshold, as shown in step 260. If it is, the frame is assumed to be a speech frame, as in step 265. If it is not, the frame is assumed to be a noise frame, as in step 270.

The frame counter is then incremented, as in step 275, and the process repeats from step 205.

As an enhancement to this process, a sub-region measurement process, shown in optional steps 215 and 245, may be performed instead of or in addition to the whole-spectrum measurement process. The particular sub-region of the spectrum is chosen to be the one most likely to contain the fundamental pitch.

In the sub-region process, once the moving average of the initial acceleration test has been updated in step 210 for the whole-spectrum measurement, a check is made as to whether the energy acceleration measurement exceeds a threshold, as shown in step 220. If it does, the initialization of other parameters is suspended, as shown in step 225. If it does not, the initialization of the other parameters is updated, as in step 230. The process then returns to step 235, as shown.

A further preferred determination is made after the determination in step 235 of whether the energy acceleration measurement lies within one or more specified margins. The decay value is evaluated to determine whether it is "high" in step 250 and, if so, the moving average of the energy acceleration test is updated slowly, as shown in step 255. The process then returns to the whole-spectrum method at step 260. In this way, the generally high signal-to-noise ratios (SNRs) of the subband detector make this detector very noise robust. However, it is vulnerable to adverse microphone and speaker changes and to band-limited noise. Its measurements should therefore not be relied upon in all circumstances. For this reason, the preferred embodiment of the present invention includes the subband detector as additional support for the whole-spectrum measurement.

The additional measurement is preferably performed using the "acceleration" of the variance of the values within, for example, the lower half of the spectrum of each frame. The variance measure detects structure in the lower half of the spectrum, which makes it highly sensitive to voiced speech. The variance measurement follows the procedure of the subband process, with the lower half of the spectrum selected as the particular subband. This variance measurement further complements the whole-spectrum measure, which is better able to detect unvoiced and plosive speech.

All three measures take their raw input from the spectral representation of the filter gains produced by the first stage of a double Wiener filter, as described in US patent application no. US 09/427497, with Motorola Inc. as applicant and Yan-Ming Chen as inventor. As described above, each measure uses a different aspect of this data.

In particular, the whole-spectrum detector uses the known Mel-filtered spectral representation of the filter gains produced by the first stage of the double Wiener filter. A single input value is obtained by squaring the sum of the Mel filter banks.

In the preferred embodiment of the invention, the whole-spectrum detector applies the following process to all frames:

Step 1 initializes the noise estimate Tracker as follows:

If Frame < 15 AND Acceleration < 2.5 then Tracker = MAX(Tracker, Input).

The energy acceleration measure prevents Tracker from being updated if speech occurs within the 15-frame lead-in period.

Step 2 updates the Tracker value if the current input is similar to the noise estimate, as follows:

If Input < Tracker * UpperBound AND Input > Tracker * LowerBound then Tracker = a * Tracker + (1 - a) * Input

Step 3 provides a failsafe mechanism for cases where there is speech or an uncharacteristically high noise content during the first few frames. It causes the resulting erroneously high noise estimate to decay. Step 3 preferably operates as follows:

If Input < Tracker * Floor then Tracker = b * Tracker + (1 - b) * Input

Step 4 returns "true" for speech if the current input exceeds 165 % of Tracker, as follows:

If Input > Tracker * Threshold then output TRUE, otherwise output FALSE.

The ratio of the instantaneous input to the short-term Tracker average is a function of the energy acceleration of successive inputs.

In the above: a = 0.8 and b = 0.97.

UpperBound is 150 % and LowerBound is 75 %.

Floor is 50 % and Threshold is 165 %.
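Steps 1 to 4 and the parameter values above can be sketched in Python as follows. This is an illustrative sketch rather than the patented implementation: the class and attribute names are hypothetical, the percentages are expressed as multipliers, the tracker is assumed to start at zero, and the acceleration value is assumed to be supplied by the caller.

```python
class WholeSpectrumDetector:
    """Noise tracker with a per-frame speech/noise decision (steps 1 to 4)."""

    def __init__(self, a=0.8, b=0.97, upper=1.5, lower=0.75,
                 floor=0.5, threshold=1.65, lead_in=15, accel_limit=2.5):
        self.a, self.b = a, b
        self.upper, self.lower = upper, lower
        self.floor, self.threshold = floor, threshold
        self.lead_in, self.accel_limit = lead_in, accel_limit
        self.tracker = 0.0  # assumption: noise estimate starts at zero
        self.frame = 0

    def step(self, value, acceleration):
        """Process one frame input and return True for speech, False for noise."""
        # Step 1: initialize the noise estimate during the lead-in,
        # unless the acceleration measure suggests speech is present.
        if self.frame < self.lead_in and acceleration < self.accel_limit:
            self.tracker = max(self.tracker, value)
        # Step 2: track inputs that resemble the current noise estimate.
        if self.tracker * self.lower < value < self.tracker * self.upper:
            self.tracker = self.a * self.tracker + (1 - self.a) * value
        # Step 3: failsafe, let an erroneously high estimate decay slowly.
        if value < self.tracker * self.floor:
            self.tracker = self.b * self.tracker + (1 - self.b) * value
        # Step 4: speech if the input clearly exceeds the noise estimate.
        is_speech = value > self.tracker * self.threshold
        self.frame += 1
        return is_speech
```

A caller would create one detector per measure and feed it one input per frame, for example `detector.step(energy, acceleration)`.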

Note that no update occurs if the value is greater than UpperBound or lies between LowerBound and Floor. Furthermore, the energy acceleration input indicated above may be computed either by double differentiation of successive inputs or by estimation, tracking the ratio of two moving averages of the inputs.

Note that the ratio of a fast-adapting to a slow-adapting moving average reflects the energy acceleration of successive inputs.

By way of example, the contribution rates for the averages used above were: (i) 0 * average + 1 * input, and (ii) ((Frame - 1) * average + 1 * input) / Frame, which makes the energy acceleration measure very sensitive over the first fifteen frames.

The subband detector preferably uses the average of the second, third and fourth Mel filter banks derived for the "whole-spectrum" measure. The detector then applies the following process to all frames:

(i) Input = p * CurrentInput + (1 - p) * PreviousInput
(ii) If Frame < 15 then Tracker = MAX(Tracker, Input)
(iii) If Input < Tracker * UpperBound AND Input > Tracker * LowerBound then Tracker = a * Tracker + (1 - a) * Input
(iv) If Input < Tracker * Floor then Tracker = b * Tracker + (1 - b) * Input
(v) If Input > Tracker * Threshold then output TRUE, otherwise output FALSE

where, for the sub-region measurement, p = 0.75.

All other parameters are the same as for the whole-spectrum measurement, except Threshold, which equals 3.25.
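The input smoothing of item (i) can be sketched as below. The sketch assumes that each frame is given as a list of Mel filter bank values, that "second, third and fourth" refers to 1-based positions, and that PreviousInput means the previous frame's smoothed input; none of these points is settled by the text, and the function name is hypothetical.

```python
def subband_inputs(mel_frames, p=0.75):
    """Return the smoothed subband input for each frame (item (i))."""
    inputs = []
    previous = None
    for banks in mel_frames:
        # Average of the 2nd, 3rd and 4th Mel banks (indices 1..3, 0-based).
        current = sum(banks[1:4]) / 3.0
        # Assumption: the first frame has no predecessor and is used unsmoothed.
        smoothed = current if previous is None else p * current + (1 - p) * previous
        inputs.append(smoothed)
        previous = smoothed
    return inputs
```

The resulting values would then pass through items (ii) to (v) with Threshold set to 3.25.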

For the spectral variance measurement, the input is the variance of the values comprising the lower frequency half of the narrowband spectral representation of the gain for each frame. The detector then applies exactly the same process as for the whole-spectrum measurement.

The variance is calculated as follows:

[5]

where N = FFT length / 4, and w_i are the values of the narrowband spectral representation of the gain.

In accordance with the preferred embodiment of the present invention, the three measures described in detail above are presented to a VAD decision algorithm, as shown in the flowchart of Figure 3. Successive inputs are fed into a buffer, which provides contextual analysis. This introduces a frame delay equal to the length of the buffer minus one frame.

Referring now to Figure 3, a flowchart 300 of an acceleration-based voice activity validation process for noisy environments is shown, in accordance with the preferred embodiment of the present invention.

For a buffer of N=1 frames, the most recent true/false speech input is stored at position N in the data buffer, as shown in step 305. The decision logic applies a number of steps, and preferably each of the following:

Step 1: V_N = Measure 1 OR Measure 2 OR Measure 3;

The input V_N is defined as "true" (T) if any of the three measures returns true as an indication of speech.

Step 2:

[6]

The algorithm searches for the longest contiguous sequence of "true" values in the buffer, as in step 310. Thus, for example, for the sequence "T T F T T T F", M would equal 3.

Step 3:

If M >= S_P AND T < L_S, T = L_S, where S_P is the first threshold in step 315. If the longest sequence of "true" (T) speech values equals or exceeds the first threshold in step 315, i.e. there are S_P = 3 or more consecutive "true" values, the buffer is judged to contain "possible" speech. A short-term timer T of, say, L_S = 5 frames (Time_1) is activated in step 325 if it is not already in force (or exceeded), as determined in step 320.

Step 4:

If M >= S_L AND F > F_S, T = L_M, otherwise T = L_L, where S_L is the second threshold in step 330. If there are S_L = 4 or more consecutive "true" values, the buffer is judged to contain "probable" speech. A medium-term timer T of, say, L_M = 22 frames is activated in step 340 if the current frame F is outside an initial lead-in safety period F_S, as determined in step 335. Otherwise, a failsafe long-term timer T of, say, L_L = 40 frames is used in step 345. This arrangement is used because the early presence of speech in an utterance may cause the VAD's noise estimate to be too high.

Step 5:

If M < S_P AND T > 0, T--

If the process determines that there are fewer than S_P = 3 consecutive "true" values in step 350, and the timer is greater than zero in step 355, the timer value is decremented in step 360.

Step 6:

If T > 0, output TRUE, otherwise output FALSE

If the timer is greater than zero in step 365, the process outputs a "true" speech decision, as shown in step 370. Alternatively, if the timer is not greater than zero in step 365, the process outputs a "noise" decision, as shown in step 375.

Step 7:

Frame++, shift the buffer left and return to step 1.

In preparation for the next frame, in step 380 the buffer is shifted left to make room for the next input, as shown in Figure 4. The output speech decision is applied to the frame being removed from the buffer. The process then repeats from step 305 for the next true/false input entering the buffer.
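Steps 1 to 7 can be sketched in Python as shown below. Several points are assumptions made for illustration only: the class and parameter names are hypothetical; the default buffer length of 7 is taken from the Figure 4 walk-through, where frames #1 to #7 occupy the buffer; the lead-in safety period `f_s=35` is a placeholder, since the text gives no value for F_S; and step 4 is read, following the prose, as applying only when M >= S_L.

```python
from collections import deque


def longest_true_run(flags):
    """Step 2: length M of the longest contiguous run of True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


class HangoverDecision:
    """Buffered speech/noise decision with hangover timers (steps 1 to 7)."""

    def __init__(self, buffer_len=7, s_p=3, s_l=4,
                 l_s=5, l_m=22, l_l=40, f_s=35):
        self.buffer = deque([False] * buffer_len, maxlen=buffer_len)
        self.s_p, self.s_l = s_p, s_l
        self.l_s, self.l_m, self.l_l = l_s, l_m, l_l
        self.f_s = f_s  # placeholder value, not given in the text
        self.timer = 0
        self.frame = 0

    def step(self, measure1, measure2, measure3):
        # Step 1: OR the three measures; appending to a full deque also
        # drops the oldest entry, which is the left shift of step 7.
        self.buffer.append(measure1 or measure2 or measure3)
        m = longest_true_run(self.buffer)          # step 2
        if m >= self.s_p and self.timer < self.l_s:  # step 3: possible speech
            self.timer = self.l_s
        if m >= self.s_l:                            # step 4: probable speech
            self.timer = self.l_m if self.frame > self.f_s else self.l_l
        if m < self.s_p and self.timer > 0:          # step 5: count down
            self.timer -= 1
        self.frame += 1                              # step 7: Frame++
        # Step 6: the decision applies to the frame leaving the buffer.
        return self.timer > 0
```

Each call to `step` consumes the three per-frame measure outputs and returns the decision for the frame being shifted out of the buffer.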

It is envisaged that an alternative mechanism may be implemented for making the speech or noise decision on the basis of the energy acceleration process described above. For example, the decision mechanism need not be based on one or more timers, and the decision may be made purely according to whether one or more energy acceleration thresholds are exceeded.

Referring now to Figure 4, an example of the buffering operation 400 in accordance with the preferred embodiment of the present invention is shown in more detail. Assume that the first threshold is set to three consecutive "true" values. Assume that at time "t" 410 only the current input (frame #7) 425 and the previous input (frame #6) 420 were "true". Accordingly, when the buffer is shifted, the first frame (frame #1) 415 is marked "false".

At time "t+1" 430, a third "true" input (frame #8) 450 has been received following the two earlier "true" inputs 440, 445. Therefore, when the buffer is shifted, the next output frame (frame #2) 435 is marked "true".

Note that in the above decision process the only constraints are: (i) Time_1 < Time_2 < Time_3 and (ii) Threshold 1 < Threshold 2.

Assuming that only these three inputs (frame #6, frame #7 and frame #8) are "true", the full output sequence is:

Frame:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Output: F T T T T T T T T T  T  T  T  T  T  T  T  T  F  F  F  F

where frames #2-#5 report "true" because of the buffer lead-in function. Frames #6-#8 report "true" because the speech input was originally "true" at these positions. Frames #9-#12 report "true" because of the buffer lead-out function. Frames #13-#18 report "true" in response to the hangover timer that was applied. Once all frames of the utterance have been taken as input, the buffer shifts in "false" entries (frames #19-#LM) until it empties.

It is within the contemplation of the invention that the buffer length and hangover timers may be adjusted dynamically to suit the needs of the audio communication unit. Accordingly, the preferred embodiment, which uses a buffer length "N" of 8 and a hangover timer value of five frames, is given by way of example only. Note, however, that the buffer length "N" must always be chosen such that N >= S_L.

Although use of the invention in a VAD has merit in its own right, it is within the contemplation of the invention that the energy acceleration measure obtained in the method steps of Figure 2 may be used to validate the initialization of other parameters. For example, the spectral subtraction procedure requires an initial noise estimate based on the first ten frames of speech (typically 100 ms). Even when the noise is stationary, numerous events may occur that render the initial estimate invalid. Examples of such events are:

(a) Signal rise:

For a variety of possible reasons, the very start of a recording may "rise" to full level during the estimation period. Causes of such a rise include buffer filling in digital systems, and capacitance or tape head contact in analog systems. The effect of such events may be to invalidate the estimate. The energy acceleration measure can therefore be used to detect such a rise and prevent the error.

(b) Spikes in the initial signal:

A "spike" commonly occurs when the press-to-talk (PTT) button of a subscriber radio unit is fully sprung out, so that the electrical contact slightly precedes the button striking the back of the switch. The energy acceleration measure, as described above, may be used to postpone an estimation process such as that shown in step 225 of Figure 2 when such an event occurs.

(c) Speech in the initial signal:

Another common occurrence, particularly in PTT systems, is that the user starts speaking as soon as he or she presses the PTT button. With this behaviour, the electrical contact is made only after speech has begun. The energy acceleration measure is able to detect this and either postpone noise-based initialization such as that shown in step 225 of Figure 2 or direct that default estimates be used.
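The gating of an initial noise estimate by the energy acceleration measure, as discussed for events (a) to (c), might look like the following minimal sketch. The function name, the use of a simple mean, the reuse of 2.5 as the acceleration limit and the `default` fallback are all assumptions for illustration; the text specifies only that the estimate is drawn from the first ten frames and may be postponed or replaced by defaults.

```python
def initial_noise_estimate(energies, accelerations, n_init=10,
                           accel_limit=2.5, default=None):
    """Average the first n_init frame energies, skipping suspect frames."""
    accepted = [
        energy
        for energy, accel in zip(energies[:n_init], accelerations[:n_init])
        # Frames with high energy acceleration (rise, spike or early
        # speech) are excluded from the estimate.
        if accel <= accel_limit
    ]
    if not accepted:
        # No trustworthy frame: fall back to a caller-supplied default.
        return default
    return sum(accepted) / len(accepted)
```

With this sketch a spike in one of the first frames is simply left out of the average, and a lead-in consisting entirely of suspect frames yields the default estimate.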

In summary, a communication device has been described that includes an audio processing unit having a voice activity detection mechanism. The voice activity detection mechanism provides an indication of the energy acceleration of a signal input to the communication device and determines, on the basis of that indication, whether the input signal is speech or noise.

In addition, a method of detecting a speech signal input to a communication device has been described. The method includes the steps of indicating the acceleration of the signal input to the communication device, and determining, on the basis of the indicating step, whether the input signal is speech or noise.

In addition, a method of deciding whether a signal input to a communication device is speech or noise has been described. The method includes the step of deciding whether the input signal is speech or noise on the basis of energy acceleration, using, for example, a frame average or a moving average of a number of input signals.

It will thus be understood that the energy-acceleration-based voice activity detector and validator for noisy environments described above provides the advantages of noise robustness and rapid response. Because the preferred embodiment uses a measure that depends on energy acceleration rather than an absolute measure, the inventive concept described here can be applied to speech arriving at any level.

Although specific and preferred implementations of embodiments of the present invention have been described above, it is clear that a person skilled in the art could apply variations and modifications of this inventive concept that would still fall within the scope of the present invention.

Thus, an improved voice activity detector and approver for noisy environments has been described, in which the aforementioned disadvantages associated with the prior art have been substantially reduced.

Claims

1. A communication device (100) comprising an audio processing unit (109) having a voice activity detection mechanism (130, 135), said communication device (100) being characterized in that the voice activity detection mechanism (130, 135) is adapted to measure the energy acceleration of a signal input to the communication device (100) by tracking the ratio of fast-adapting and slow-adapting moving averages of the input, and to determine frame by frame whether said input signal is speech or noise on the basis of said measurement, wherein, if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold, the input frame is considered to be a speech frame (265).

2. The communication device (100) of claim 1, wherein the voice activity detection mechanism includes a voice activity detection function (130) that performs speech detection frame by frame on the signals input to the voice activity detection mechanism (130, 135).

3. The communication device (100) of claim 2, wherein said frame-by-frame detection comprises performing an energy acceleration measurement on the signal input to the voice activity detection mechanism (130, 135) for one or more of the following frequency sub-ranges: (i) the whole spectrum, (ii) a sub-region of the spectrum, and (iii) the spectral variance.

4. The communication device (100) of claim 3, wherein the voice activity detection mechanism includes a voice activity decision function (135) that is operably coupled to the voice activity detection function (130) and arranged to determine whether said input signal is speech, based on buffering input frames of the input signal in a buffer and on one or more of said energy acceleration measurements, wherein the voice activity decision function (135) is further arranged to assign a true or false indication to each buffered input frame contained in the buffer, a true indication being assigned when any of said one or more energy acceleration measurements of the input frame produces a speech indication, and wherein the voice activity decision function (135) is further arranged to determine that said input signal in the buffer is speech when the indications for a sequence of buffered input frames contained in the buffer are true.

5. The communication device (100) of claim 1, wherein the voice activity detection mechanism (135) is arranged to measure the energy acceleration using a frame mean or a moving average of a set of input signals.

6. The communication device (100) according to any one of claims 1 to 4, wherein the energy acceleration is estimated by tracking the ratio of two moving averages of the input, computed as (0 * mean + 1 * input) and ((Frame - 1) * mean + 1 * input) / Frame, where Frame is the value of a frame counter.

7. The communication device (100) of claim 5, wherein the estimate of the energy acceleration using the frame mean is:

[1]

8. The communication device (100) of claim 5 or 6, wherein, when the energy acceleration measurement is within one or more defined limits, the estimate of the energy acceleration using the moving average is:

[2]

9. The communication device (100) of claim 4, wherein the buffer has a length of N frames and successive input frames are shifted into and out of the buffer, and wherein, when an input frame contained in the buffer is determined to be a speech frame, the decision that the input frame is a speech frame (265) is applied backwards to an earlier frame in the buffer.

10. The communication device (100) according to any of claims 3, 4 or 9, wherein, if a sub-region of the input signal spectrum is selected, the sub-region is selected such that it contains the fundamental pitch of the audio signal.

11. The communication device (100) of claim 1, wherein the voice activity detection mechanism is arranged to measure the energy acceleration from a signal input that is taken from the Mel-filtered spectral representation of the filter gain formed in the first stage of a double Wiener filter.

12. A method of detecting a speech signal input to a communication device, the method being characterized by a measurement step in which the ratio of fast-adapting and slow-adapting moving averages of the input is tracked; and a step of determining (315, 330, 350) frame by frame whether said input signal is speech (370) or noise (375) on the basis of said measurement step, wherein, if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold, the input frame is considered to be a speech frame (265).

13. The method of detecting a speech signal according to claim 12, further characterized by the step of performing speech detection frame by frame on a portion of the signals input to the communication device.

14. The method of detecting a speech signal according to claim 13, wherein the frame-by-frame detection comprises the step of performing an energy acceleration measurement on said input signal for one or more of the following frequency sub-ranges: (i) the whole spectrum, (ii) a sub-region of the spectrum, and (iii) the spectral variance.

15. The method of detecting a speech signal according to any of claims 12 to 14, wherein the step of measuring the energy acceleration uses a frame mean or a moving average of a set of input signals.

16. The method of detecting a speech signal according to claim 12, 13 or 14, wherein the energy acceleration is estimated by tracking the ratio of two moving averages of the input, computed as (0 * mean + 1 * input) and ((Frame - 1) * mean + 1 * input) / Frame, where Frame is the value of a frame counter.

17. The method of detecting a speech signal according to claim 15 or 16, wherein the step of measuring the energy acceleration comprises estimating the energy acceleration using the frame mean:

[1]

18. The method of detecting a speech signal according to claim 15 or 16, wherein the step of measuring the energy acceleration comprises estimating the energy acceleration using the moving average when the energy acceleration measurement is within one or more of said limits:

[2]

19. The method of detecting a speech signal according to claim 12, further comprising: applying said determination that the input frame is a speech frame backwards to an earlier frame in a buffer of input signals.

20. The method of detecting a speech signal according to claim 12, wherein the determining step further comprises: buffering input frames of the input signal in a buffer; assigning a true or false indication to each buffered input frame in the buffer, a true indication being assigned when the energy acceleration measurement for the input frame gives a speech indication; and determining that said input signal in the buffer is speech when the assigned indications for a sequence of input frames buffered in the buffer are true.

21. The method of detecting a speech signal according to claim 12, wherein measuring the energy acceleration of the signal input comprises measuring the energy acceleration of a signal input that is taken from the Mel-filtered spectral representation of the filter gain formed in the first stage of a double Wiener filter.
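The buffered decision recited in the device and method claims (a true or false indication per buffered frame, speech declared when a sequence of buffered indications is true, and the decision applied backwards to earlier frames in the buffer) might be sketched in Python as follows. This is an illustrative reading only: the function names, the buffer length n = 7, the minimum run length of 3 and the choice to mark every frame still in the buffer are assumptions of this sketch, not values or rules stated in the claims.

```python
from collections import deque
from typing import Iterable, List


def longest_true_run(flags: Iterable[bool]) -> int:
    """Length of the longest run of consecutive True indications."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def buffered_decisions(flags: Iterable[bool], n: int = 7,
                       min_run: int = 3) -> List[bool]:
    """Turn per-frame indications into buffered speech decisions.

    n and min_run are hypothetical parameters chosen for this sketch.
    """
    flags = list(flags)
    decisions = [False] * len(flags)
    buffer = deque(maxlen=n)  # holds (frame index, indication) pairs
    for index, flag in enumerate(flags):
        buffer.append((index, flag))  # the oldest frame drops out when full
        if longest_true_run(f for _, f in buffer) >= min_run:
            # Apply the speech decision backwards to every buffered frame.
            for buffered_index, _ in buffer:
                decisions[buffered_index] = True
    return decisions


if __name__ == "__main__":
    indications = [False, False, True, True, True, False, False, False]
    print(buffered_decisions(indications, n=4, min_run=3))
```

In this sketch an isolated true indication never triggers a speech decision, while a sustained run also marks the frames just before the run as speech, which is one way to avoid clipping the onset of an utterance.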