CN103177721B

CN103177721B - Speech recognition method and system

Info

Publication number: CN103177721B
Application number: CN201110440273.4A
Authority: CN
Inventors: 冯克威; 赵江
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2011-12-26
Filing date: 2011-12-26
Publication date: 2015-08-19
Anticipated expiration: 2031-12-26
Also published as: CN103177721A

Abstract

The present invention discloses a speech recognition method and system. In the speech recognition method, speech quality evaluation is performed on first voice information and second voice information; the voice information with better speech quality is selected as reference voice information, and the voice information with poorer speech quality is used as auxiliary voice information. Speech recognition is performed on the reference voice information to obtain reference recognition information. The n words with the highest confidence in the reference recognition information are selected as keywords. For each keyword, m levels of related-word sets are generated according to a predetermined vocabulary. The weight values of the keywords and related words in the speech recognition model dictionary are increased. Using the updated speech recognition model dictionary, speech recognition is performed on the reference voice information and the auxiliary voice information respectively. Because the weight values of the related words are modified according to the conversation content, the speech recognition model describes the current conversation content more accurately, which improves the accuracy of speech recognition.

Description

Speech recognition method and system

Technical field

The present invention relates to the field of information processing, and in particular to a speech recognition method and system.

Background art

Language is the most natural and most commonly used means of human communication. Speech recognition, or automatic speech recognition (ASR), is a discipline that has developed over the past half century. The goal of speech recognition is to make machines "understand" natural human speech. The recognized information can be used as control signals in many fields, and speech recognition has broad application prospects in industry, the military, transportation, medicine, civilian use and other areas. According to the requirements placed on the speaker's manner of speaking, speech recognition systems can be divided into isolated-word, connected-word and continuous speech recognition systems; according to the degree of dependence on the speaker, into speaker-dependent and speaker-independent recognition systems; and according to vocabulary size, into small-vocabulary, medium-vocabulary, large-vocabulary and unlimited-vocabulary recognition systems. Although different speech recognition systems differ in implementation details, the basic framework they adopt is similar.

Existing mainstream speech recognition systems are mainly based on the hidden Markov model (HMM). A typical recognition system mainly uses an acoustic model (AM) and a language model (LM), and obtains the recognition result through a decoding operation. The most widely used form of language model is the statistical language model, which uses probability and statistics to reveal the statistical regularities inherent in linguistic units; among such models, the N-gram model is simple and effective and is widely used.

For telephone speech recognition, and in particular for speech recognition and speech retrieval in call centers, the speech quality is relatively poor compared with ordinary speech recognition scenarios such as an office environment, so recognition performance is limited. The causes of poor speech quality include: background noise; differences between customers' voice capture devices; noise from the call equipment; noise and interference on the communication line; differences introduced by different communication lines or switches, for example different speech coding schemes used by different types of customer terminal devices during communication; and the customer speaking with an accent or in dialect, or speaking indistinctly. All of these factors may degrade speech recognition performance.

On the other hand, the content differs greatly from one conversation to another, and each call is usually not very long: generally only a few minutes, with content ranging from a few hundred to one or two thousand words. For a single conversation, and especially for the customer's speech, there is hardly enough data for satisfactory acoustic model adaptation or language model adaptation.

Summary of the invention

The technical problem to be solved by the present invention is to provide a speech recognition method and system in which the weight values (also called probability values) of related words are modified according to the conversation content, so that the speech recognition model describes the current conversation content more accurately and the accuracy of speech recognition is improved.

According to one aspect of the present invention, a speech recognition method is provided, comprising:

obtaining first voice information of a first speaker and second voice information of a second speaker respectively from conversation voice information;

performing speech quality evaluation on the first voice information and the second voice information respectively, selecting the voice information with better speech quality as reference voice information, and using the voice information with poorer speech quality as auxiliary voice information;

performing speech recognition on the reference voice information to obtain reference recognition information;

in the reference recognition information, selecting the n words with the highest confidence as keywords, where n is a positive integer;

for each keyword, generating m levels of related-word sets according to a predetermined vocabulary, wherein each related word in the first-level related-word set is associated with a keyword, each related word in the L-th-level related-word set is associated with a related word in the (L-1)-th-level related-word set, m and L are positive integers, and 2≤L≤m; the m levels of related-word sets contain no keyword, and no related word is repeated within them;

increasing the weight values, in the speech recognition model dictionary, of the keywords and of the related words in the m levels of related-word sets, wherein each weight value is multiplied by a factor, the factor for keywords is greater than the factor for related words in the m levels of related-word sets, and the factor for related words in the (L-1)-th-level set is greater than the factor for related words in the L-th-level set; and normalizing the weight values of all words in the speech recognition model dictionary to obtain an updated speech recognition model;

using the updated speech recognition model, performing speech recognition on the reference voice information and the auxiliary voice information respectively to obtain first recognition information and second recognition information.

According to another aspect of the present invention, a speech recognition system is provided, comprising:

an acquiring unit, configured to obtain first voice information of a first speaker and second voice information of a second speaker respectively from conversation voice information;

an assessment unit, configured to perform speech quality evaluation on the first voice information and the second voice information respectively, select the voice information with better speech quality as reference voice information, and use the voice information with poorer speech quality as auxiliary voice information;

a first voice recognition unit, configured to perform speech recognition on the reference voice information to obtain reference recognition information;

a keyword generation unit, configured to select, in the reference recognition information, the n words with the highest confidence as keywords, where n is a positive integer;

a related term generation unit, configured to generate, for each keyword, m levels of related-word sets according to a predetermined vocabulary, wherein each related word in the first-level related-word set is associated with a keyword, each related word in the L-th-level related-word set is associated with a related word in the (L-1)-th-level related-word set, m and L are positive integers, and 2≤L≤m; the m levels of related-word sets contain no keyword, and no related word is repeated within them;

a weight adjustment unit, configured to increase the weight values, in the speech recognition model dictionary, of the keywords and of the related words in the m levels of related-word sets, wherein each weight value is multiplied by a factor, the factor for keywords is greater than the factor for related words in the m levels of related-word sets, and the factor for related words in the (L-1)-th-level set is greater than the factor for related words in the L-th-level set; and to normalize the weight values of all words in the speech recognition model dictionary to obtain an updated speech recognition model;

a second voice recognition unit, configured to use the updated speech recognition model to perform speech recognition on the reference voice information and the auxiliary voice information respectively, obtaining first recognition information and second recognition information.

The present invention performs speech recognition on the reference voice information, which has the better speech quality in the conversation, to obtain reference recognition information. In the reference recognition information, the n words with the highest confidence are selected as keywords, where n is a positive integer. For each keyword, m levels of related-word sets are generated according to a predetermined vocabulary, wherein each related word in the first-level set is associated with a keyword, each related word in the L-th-level set is associated with a related word in the (L-1)-th-level set, m and L are positive integers, and 2≤L≤m. The weight values, in the speech recognition model dictionary, of the keywords and of the related words in the m levels of related-word sets are increased, and the weight values of all words in the dictionary are normalized to obtain an updated speech recognition model dictionary. Using the updated dictionary, speech recognition is performed on the reference voice information and the auxiliary voice information respectively to obtain first recognition information and second recognition information. Because the weight values of the related words are modified according to the conversation content, the speech recognition model describes the current conversation content more accurately, which improves the accuracy of speech recognition.

The description of the present invention is given for the purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the disclosed form. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and thereby design various embodiments, with various modifications, suited to particular uses.

Brief description of the drawings

Fig. 1 is a schematic diagram of one embodiment of the speech recognition method of the present invention.

Fig. 2 is a schematic diagram of another embodiment of the speech recognition method of the present invention.

Fig. 3 is a schematic diagram of one embodiment of the speech recognition system of the present invention.

Fig. 4 is a schematic diagram of another embodiment of the speech recognition system of the present invention.

Detailed description

The present invention is described more fully below with reference to the accompanying drawings, in which exemplary embodiments of the invention are illustrated.

Fig. 1 is a schematic diagram of one embodiment of the speech recognition method of the present invention. As shown in Fig. 1, the speech recognition method of this embodiment is as follows:

Step 101: obtain first voice information of a first speaker and second voice information of a second speaker respectively from conversation voice information.

Step 102: perform speech quality evaluation on the first voice information and the second voice information respectively, select the voice information with better speech quality as reference voice information, and use the voice information with poorer speech quality as auxiliary voice information.

Step 103: perform speech recognition on the reference voice information to obtain reference recognition information.

Step 104: in the reference recognition information, select the n words with the highest confidence as keywords, where n is a positive integer.

Step 105: for each keyword, generate m levels of related-word sets according to a predetermined vocabulary, wherein each related word in the first-level related-word set is associated with a keyword, each related word in the L-th-level related-word set is associated with a related word in the (L-1)-th-level related-word set, m and L are positive integers, and 2≤L≤m; the m levels of related-word sets contain no keyword, and no related word is repeated within them.

Step 106: increase the weight values, in the speech recognition model dictionary, of the keywords and of the related words in the m levels of related-word sets, wherein each weight value is multiplied by a factor, the factor for keywords is greater than the factor for related words in the m levels of related-word sets, and the factor for related words in the (L-1)-th-level set is greater than the factor for related words in the L-th-level set; normalize the weight values of all words in the speech recognition model dictionary to obtain an updated speech recognition model.

Step 107: using the updated speech recognition model, perform speech recognition on the reference voice information and the auxiliary voice information respectively to obtain first recognition information and second recognition information.

The speech recognition method provided by the above embodiment of the present invention performs speech recognition on the reference voice information, which has the better speech quality in the conversation, to obtain reference recognition information. In the reference recognition information, the n words with the highest confidence are selected as keywords, where n is a positive integer. For each keyword, m levels of related-word sets are generated according to a predetermined vocabulary, wherein each related word in the first-level set is associated with a keyword, each related word in the L-th-level set is associated with a related word in the (L-1)-th-level set, m and L are positive integers, and 2≤L≤m. The weight values, in the speech recognition model dictionary, of the keywords and of the related words in the m levels of related-word sets are increased, and the weight values of all words in the dictionary are normalized to obtain an updated speech recognition model dictionary. Using the updated dictionary, speech recognition is performed on the reference voice information and the auxiliary voice information respectively to obtain first recognition information and second recognition information. Because the weight values of the related words are modified according to the conversation content, the speech recognition model describes the current conversation content more accurately, which improves the accuracy of speech recognition.

Fig. 2 is a schematic diagram of another embodiment of the speech recognition method of the present invention. As shown in Fig. 2, the speech recognition method of this embodiment is as follows:

Step 201: obtain first voice information of a first speaker and second voice information of a second speaker respectively from conversation voice information.

According to another specific embodiment of the present invention, within the conversation voice information, the first voice information is obtained from a first-channel signal corresponding to the first speaker, and the second voice information is obtained from a second-channel signal corresponding to the second speaker.

Step 202: perform speech quality evaluation on the first voice information and the second voice information respectively, select the voice information with better speech quality as reference voice information, and use the voice information with poorer speech quality as auxiliary voice information.

In call center speech recognition, one favorable factor is that customer service staff generally speak fairly standard Mandarin and work in a relatively stable and uniform acoustic environment. Their speech quality is therefore higher, and their speech is recognized more accurately than the customer's. The recognition result for the customer service staff's speech can thus be used to improve recognition of the customer's speech. Of course, the case where the customer's speech quality is higher than that of the customer service staff is not excluded.

Those skilled in the art will understand that speech quality evaluation of voice information is known in the art; for example, a signal-to-noise ratio estimation method, an objective speech quality assessment method, a pronunciation standardness assessment method, or a combination of such methods may be used.
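As an illustration only, the choice between reference and auxiliary voice information in step 202 could be sketched with a crude signal-to-noise ratio estimate. The patent does not specify an estimator; the function names `estimate_snr_db` and `pick_reference`, the frame length, and the rule of treating the quietest 10% of frames as noise are all assumptions made for this sketch.

```python
import math


def estimate_snr_db(samples, frame_len=160):
    """Rough SNR estimate in dB (assumed heuristic, not from the patent).

    The quietest 10% of frames are treated as noise; the remaining
    average energy above that floor is treated as signal.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not frames:
        raise ValueError("signal is shorter than one frame")
    energies = sorted(sum(x * x for x in f) / frame_len for f in frames)
    k = max(1, len(energies) // 10)
    eps = 1e-12  # avoids division by zero and log of zero
    noise = max(sum(energies[:k]) / k, eps)
    total = sum(energies) / len(energies)
    signal = max(total - noise, eps)
    return 10.0 * math.log10(signal / noise)


def pick_reference(first, second):
    """Return (reference, auxiliary): the higher-SNR channel is the reference."""
    if estimate_snr_db(first) >= estimate_snr_db(second):
        return first, second
    return second, first
```

A production system would more likely combine several measures, as the text notes, rather than rely on a single energy-based estimate.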

Step 203: perform speech recognition on the reference voice information to obtain reference recognition information.

Existing speech recognition technology can be used to perform speech recognition on the reference voice information.

Step 204: in the reference recognition information, select the n words with the highest confidence as keywords, where n is a positive integer.

For example, in many cases the speech quality of the customer service staff is higher, so a preliminary recognition of their speech yields relatively reliable results.
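The keyword selection of step 204 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `select_keywords`, the `idf` lookup table and the `min_idf` threshold are hypothetical, and the inverse-document-frequency filter stands in for the TF-IDF-style removal of common words that the worked example later in this embodiment mentions.

```python
def select_keywords(recognized, n, idf=None, min_idf=0.0):
    """Pick the n highest-confidence words, then drop low-information ones.

    recognized: list of (word, confidence) pairs from the reference recognition.
    idf:        optional word -> inverse document frequency table (assumed input);
                words missing from the table are treated as having IDF 0.
    min_idf:    words whose IDF is below this threshold are discarded.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(recognized, key=lambda wc: wc[1], reverse=True)
    top = [word for word, _ in ranked[:n]]
    if idf is None:
        return top
    return [word for word in top if idf.get(word, 0.0) >= min_idf]
```

Filtering after the top-n cut mirrors the order used in the worked example, so the final keyword list may contain fewer than n words.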

Step 205: for each keyword, generate m levels of related-word sets according to a predetermined vocabulary, wherein each related word in the first-level related-word set is associated with a keyword, each related word in the L-th-level related-word set is associated with a related word in the (L-1)-th-level related-word set, m and L are positive integers, and 2≤L≤m; the m levels of related-word sets contain no keyword, and no related word is repeated within them.

By determining the keywords and the m levels of related-word sets, a sequence of words is obtained whose relation to the keywords ranges from close to loose.

Step 206: increase the weight values, in the speech recognition model dictionary, of the keywords and of the related words in the m levels of related-word sets, wherein each weight value is multiplied by a factor, the factor for keywords is greater than the factor for related words in the m levels of related-word sets, and the factor for related words in the (L-1)-th-level set is greater than the factor for related words in the L-th-level set; normalize the weight values of all words in the speech recognition model dictionary to obtain an updated speech recognition model.

According to another specific embodiment of the present invention, in order to use the conversation content to improve the accuracy of speech recognition, when the weight values are updated, the weight value of a keyword is greater than the weight value of the related words in the m levels of related-word sets, and the weight value of the related words in the (L-1)-th-level set is greater than the weight value of the related words in the L-th-level set.

After the weight values of some words are increased, the sum of the weights of all words may exceed 1. The weights of all words therefore need to be scaled proportionally so that they sum to 1, which keeps the language model complete and well-formed.
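The boost-then-rescale operation of step 206 can be illustrated with a short sketch. It assumes the dictionary weights are unigram probabilities held in a plain mapping; the function name `boost_and_normalize` and the data layout are assumptions, not part of the patent.

```python
def boost_and_normalize(weights, factors):
    """Multiply selected word weights by their factors, then rescale to sum to 1.

    weights: word -> weight (probability) in the recognition dictionary.
    factors: word -> multiplier for keywords and related words;
             words not listed keep a multiplier of 1.0.
    """
    scaled = {w: p * factors.get(w, 1.0) for w, p in weights.items()}
    total = sum(scaled.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Proportional scaling restores a valid probability distribution.
    return {w: p / total for w, p in scaled.items()}
```

Note that after rescaling, the weights of unboosted words decrease slightly even though their multiplier was 1.0, because the total mass has to remain 1.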

Step 207: using the updated speech recognition model, perform speech recognition on the reference voice information and the auxiliary voice information respectively to obtain first recognition information and second recognition information.

Step 208: according to a preset condition, determine whether iterative processing of the reference voice information and the auxiliary voice information is needed. If it is needed, return to step 204; if not, end this procedure.
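The control flow of steps 203 to 208 might be organized as in the following sketch. The callables `recognize`, `update_model` and `should_iterate` are hypothetical placeholders supplied by the caller, and two points are assumptions of this sketch rather than statements in the patent: the latest reference recognition result is fed back into keyword selection on each round, and the loop is capped by `max_rounds`.

```python
def recognize_with_iteration(reference, auxiliary, model,
                             recognize, update_model, should_iterate,
                             max_rounds=3):
    """Run recognition, adapt the model, and repeat while the preset condition holds.

    recognize(audio, model)        -> recognition result          (steps 203, 207)
    update_model(model, result)    -> adapted model               (steps 204-206)
    should_iterate(ref_res, aux_res) -> bool, the preset condition (step 208)
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    ref_result = recognize(reference, model)      # step 203
    aux_result = None
    for _ in range(max_rounds):
        model = update_model(model, ref_result)   # steps 204-206
        ref_result = recognize(reference, model)  # step 207, reference side
        aux_result = recognize(auxiliary, model)  # step 207, auxiliary side
        if not should_iterate(ref_result, aux_result):  # step 208
            break
    return ref_result, aux_result
```

The patent leaves the preset condition open; a change threshold on the recognized text between rounds would be one possible choice.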

The scheme is illustrated below with a concrete example. Suppose that recognizing the speech of the customer service staff gives the following result:

" to (0.9) railway station (0.9), how (0.7) is walked (0.8) ", " from (0.33) emperor (0.55) mansion (0.8) (0.7) ".

The numbers in parentheses are confidence values. The words with the highest confidence are selected as keywords, for example the following:

To (0.9), railway station (0.9), walk (0.8), mansion (0.8).

These high-confidence words are then filtered to remove words that are common in a general sense, that is, words that carry little information. For example, a term frequency-inverse document frequency (TF-IDF) rule is used to remove "to (0.9)" and "walk (0.8)", and the keywords are finally determined to be

" railway station ", " mansion ".

According to the predetermined vocabulary, several "related words" are selected for these keywords. The word-to-word relation list is computed in advance from a large amount of text data; for each word, it describes a sequence of words whose relation to that word ranges from close to loose. For example,

words closely related to "railway station" include "train number", "traffic" and "bus station", and words closely related to "mansion" include "floor", "office building" and "commercial affairs". Suppose that two related words are selected for each keyword to form the first-level related-word set, where:

" railway station ": related term is " train number ", " traffic ".

" mansion ": related term is " floor ", " office building ".

Thus the first-level related-word set contains "train number", "traffic", "floor" and "office building".

Similarly, a second-level related-word set can be generated from the first-level related-word set, where:

"train number": the related words are "train" and "moment".

"traffic": the related words are "automobile" and "railway".

"floor": the related words are "elevator" and "one deck".

"office building": the related words are "mansion" and "rent".

Thus the second-level related-word set contains "train", "moment", "automobile", "railway", "elevator", "one deck", "mansion" and "rent". Note that a second-level related-word set may contain a keyword (here, "mansion"), which would cause that word's weight to be adjusted more than once. Keywords contained in the related-word sets at every level therefore need to be removed, and the final second-level related-word set is:

Train, moment, automobile, railway, elevator, one deck, rent.
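The level-by-level expansion described above can be sketched as a breadth-first walk over a relation table, skipping keywords and any word already placed in an earlier level. The function name `build_related_levels`, the `related` mapping (word to a list ordered from closest to loosest relation) and the `per_word` parameter are assumptions chosen to match the example, in which two related words are taken per word.

```python
def build_related_levels(keywords, related, m, per_word=2):
    """Return a list of m related-word lists, one per level.

    keywords: the selected keywords.
    related:  word -> list of related words, closest first (assumed layout).
    m:        number of levels to generate.
    per_word: how many related words to take for each word.
    """
    if m < 1:
        raise ValueError("m must be a positive integer")
    seen = set(keywords)          # keywords must never appear in any level
    levels = []
    frontier = list(keywords)
    for _ in range(m):
        level = []
        for word in frontier:
            for candidate in related.get(word, [])[:per_word]:
                if candidate not in seen:   # no keyword, no repeated related word
                    seen.add(candidate)
                    level.append(candidate)
        levels.append(level)
        frontier = level          # next level expands the words just added
    return levels
```

Applied to the "railway station" and "mansion" example with m = 2, this yields the first-level set of four words and the seven-word second-level set listed above, with "mansion" excluded from the second level.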

Multiple levels of related-word sets can be configured as required.

In the speech recognition model dictionary, the weight values of the keywords and of the related words in each level's set are increased. For example, the weight of a keyword becomes 3 times its original value; the weight of a related word in the first-level set becomes 2.5 times its original value; and the weight of a related word in the second-level set becomes 1.5 times its original value. The weights of all other words remain unchanged. This weight adjustment is applied to the unigram part of the N-gram model.

The speech of the customer and of the customer service staff is then recognized using the updated speech recognition model dictionary. Because the dictionary has been dynamically adjusted according to the conversation content, the accuracy of speech recognition can be improved.

Fig. 3 is a schematic diagram of one embodiment of the speech recognition system of the present invention. The embodiment shown in Fig. 3 comprises an acquiring unit 301, an assessment unit 302, a first voice recognition unit 303, a keyword generation unit 304, a related term generation unit 305, a weight adjustment unit 306 and a second voice recognition unit 307, wherein:

The acquiring unit 301 is configured to obtain first voice information of a first speaker and second voice information of a second speaker respectively from conversation voice information.

The assessment unit 302 is configured to perform speech quality evaluation on the first voice information and the second voice information respectively, select the voice information with better speech quality as reference voice information, and use the voice information with poorer speech quality as auxiliary voice information.

The first voice recognition unit 303 is configured to perform speech recognition on the reference voice information to obtain reference recognition information.

The keyword generation unit 304 is configured to select, in the reference recognition information, the n words with the highest confidence as keywords, where n is a positive integer.

The related term generation unit 305 is configured to generate, for each keyword, m levels of related-word sets according to a predetermined vocabulary, wherein each related word in the first-level related-word set is associated with a keyword, each related word in the L-th-level related-word set is associated with a related word in the (L-1)-th-level related-word set, m and L are positive integers, and 2≤L≤m; the m levels of related-word sets contain no keyword, and no related word is repeated within them.

The weight adjustment unit 306 is configured to increase the weight values, in the speech recognition model dictionary, of the keywords and of the related words in the m levels of related-word sets, wherein each weight value is multiplied by a factor, the factor for keywords is greater than the factor for related words in the m levels of related-word sets, and the factor for related words in the (L-1)-th-level set is greater than the factor for related words in the L-th-level set; and to normalize the weight values of all words in the speech recognition model dictionary to obtain an updated speech recognition model.

The second voice recognition unit 307 is configured to use the updated speech recognition model to perform speech recognition on the reference voice information and the auxiliary voice information respectively, obtaining first recognition information and second recognition information.

The speech recognition system provided by the above embodiment of the present invention performs speech recognition on the reference voice information, which has the better speech quality in the conversation, to obtain reference recognition information. In the reference recognition information, the n words with the highest confidence are selected as keywords, where n is a positive integer. For each keyword, m levels of related-word sets are generated according to a predetermined vocabulary, wherein each related word in the first-level set is associated with a keyword, each related word in the L-th-level set is associated with a related word in the (L-1)-th-level set, m and L are positive integers, and 2≤L≤m. The weight values, in the speech recognition model dictionary, of the keywords and of the related words in the m levels of related-word sets are increased, and the weight values of all words in the dictionary are normalized to obtain an updated speech recognition model dictionary. Using the updated dictionary, speech recognition is performed on the reference voice information and the auxiliary voice information respectively to obtain first recognition information and second recognition information. Because the weight values of the related words are modified according to the conversation content, the speech recognition model describes the current conversation content more accurately, which improves the accuracy of speech recognition.

According to another specific embodiment of the present invention, the acquiring unit 301 is specifically configured to obtain, within the conversation voice information, the first voice information from a first-channel signal corresponding to the first speaker and the second voice information from a second-channel signal corresponding to the second speaker.

Fig. 4 is a schematic diagram of another embodiment of the speech recognition system of the present invention. Compared with the embodiment shown in Fig. 3, the embodiment shown in Fig. 4 further comprises a judging unit 401, configured to determine, after the second voice recognition unit 307 has used the updated speech recognition model dictionary to perform speech recognition on the first voice information and the second voice information respectively, whether iterative processing of the reference voice information and the auxiliary voice information is needed and, if so, to instruct the keyword generation unit 304 to perform the operation of selecting, in the reference recognition information, the n words with the highest confidence as keywords.

According to another specific embodiment of the present invention, the weight value of a keyword is greater than the weight value of the related words in the m levels of related-word sets, and the weight value of the related words in the (L-1)-th-level set is greater than the weight value of the related words in the L-th-level set.

Claims

1. A speech recognition method, characterized by comprising:

increasing the weight values, in the speech recognition model dictionary, of the keywords and of the related words in the m levels of related-word sets, wherein each weight value is multiplied by a factor, the factor for keywords is greater than the factor for related words in the m levels of related-word sets, and the factor for related words in the (L-1)-th-level set is greater than the factor for related words in the L-th-level set; normalizing the weight values of all words in the speech recognition model dictionary to obtain an updated speech recognition model, wherein the weight value of a keyword is greater than the weight value of the related words in the m levels of related-word sets, and the weight value of the related words in the (L-1)-th-level set is greater than the weight value of the related words in the L-th-level set,

2. The method according to claim 1, characterized in that,

after the updated speech recognition model dictionary is used to perform speech recognition on the first voice information and the second voice information respectively, the method further comprises:

determining whether iterative processing of the reference voice information and the auxiliary voice information is needed;

if iterative processing of the reference voice information and the auxiliary voice information is needed, performing the step of selecting, in the reference recognition information, the n words with the highest confidence as keywords.

3. method according to claim 1 and 2, is characterized in that,

Second voice messaging of described the first voice messaging and the second teller that obtain the first teller from corresponding voice messaging respectively comprises:

In dialogic voice information, from correspond to the first teller first via signal obtain the first voice messaging, from correspond to the second teller the second road signal obtain the second voice messaging.

4. A speech recognition system, characterized by comprising:

a weight adjustment unit, configured to increase the weight values, in the speech recognition model dictionary, of the keywords and of the related words in the m levels of related-word sets, wherein each weight value is multiplied by a factor, the factor for keywords is greater than the factor for related words in the m levels of related-word sets, and the factor for related words in the (L-1)-th-level set is greater than the factor for related words in the L-th-level set; and to normalize the weight values of all words in the speech recognition model dictionary to obtain an updated speech recognition model, wherein the weight value of a keyword is greater than the weight value of the related words in the m levels of related-word sets, and the weight value of the related words in the (L-1)-th-level set is greater than the weight value of the related words in the L-th-level set,

5. The system according to claim 4, characterized by further comprising:

a judging unit, configured to determine, after the second voice recognition unit has used the updated speech recognition model dictionary to perform speech recognition on the first voice information and the second voice information respectively, whether iterative processing of the reference voice information and the auxiliary voice information is needed and, if so, to instruct the selection unit to perform the operation of selecting, in the reference recognition information, the n words with the highest confidence as keywords.

6. The system according to claim 4 or 5, characterized in that,

the acquiring unit is specifically configured to obtain, within the conversation voice information, the first voice information from a first-channel signal corresponding to the first speaker and the second voice information from a second-channel signal corresponding to the second speaker.