WO2012001458A1 - Voice-tag method and apparatus based on confidence score - Google Patents

Voice-tag method and apparatus based on confidence score

Info

Publication number
WO2012001458A1
WO2012001458A1 (PCT/IB2010/052954)
Authority
WO
WIPO (PCT)
Prior art keywords
pronunciation
tag
confidence score
tags
recognition
Prior art date
Application number
PCT/IB2010/052954
Other languages
French (fr)
Inventor
Lei He
Rui Zhao
Original Assignee
Kabushiki Kaisha Toshiba
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kabushiki Kaisha Toshiba filed Critical Kabushiki Kaisha Toshiba
Priority to PCT/IB2010/052954 priority Critical patent/WO2012001458A1/en
Priority to CN2010800015191A priority patent/CN102439660A/en
Publication of WO2012001458A1 publication Critical patent/WO2012001458A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Definitions

  • the present invention relates to information processing technology, specifically to a voice-tag method and apparatus based on confidence score.
  • the voice-tag technology is an application of speech recognition technology, which is widely used especially in embedded speech recognition systems.
  • the working process of a voice-tag technology based system is as follows: firstly, the voice registration process is performed, that is, the user inputs a registration speech and the system converts the registration speech into a tag which represents the pronunciation of the speech; then, the speech recognition process is performed, that is, when the user inputs a testing speech, the system performs recognition on the testing speech based on its recognition network consisting of voice tag items to determine the content of the testing speech.
  • the recognition network of a voice-tag system consists of not only the voice tag items of registered speeches but also other items whose pronunciations are decided by a dictionary or a grapheme-to-phoneme (G2P) converting module; the latter can be called dictionary items.
  • G2P grapheme-to-phoneme
  • the original voice-tag technology is usually implemented based on template matching framework in which, in the registration process, one or more templates are extracted from a registration speech as the tags of the registration speech; in the recognition process, the Dynamic Time Warping (DTW) algorithm is applied between testing speech and template tags to do matching.
  • DTW Dynamic Time Warping
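The template-matching scheme described above can be sketched as follows. This is an illustrative simplification, not the patent's implementation: frames are scalars and the local cost is an absolute difference, whereas real systems compare acoustic feature vectors with a spectral distance.

```python
def dtw_distance(test, template):
    """Minimal DTW: accumulated cost of the best alignment of two sequences."""
    n, m = len(test), len(template)
    INF = float("inf")
    # d[i][j]: minimal accumulated cost aligning test[:i] with template[:j]
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(test[i - 1] - template[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

# Hypothetical template tags; the tag with the smallest DTW distance
# to the testing speech is taken as the recognition result.
templates = {"tag_a": [1.0, 2.0, 3.0], "tag_b": [2.0, 2.0, 2.0]}
test = [1.1, 2.1, 2.9]
best = min(templates, key=lambda k: dtw_distance(test, templates[k]))
```

In the registration process the templates themselves are extracted from the registration speech; only the matching step is sketched here.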
  • HMM Hidden Markov Model
  • phoneme sequences are obtained by performing phoneme recognition on the registration speeches.
  • the advantages of phoneme sequence tags are as follows: firstly, a phoneme sequence tag occupies less memory space than a template tag; secondly, phoneme sequence tag items are easily combined with dictionary items to form new items. These advantages of phoneme sequence tags are very helpful to enlarge the number of items provided by a recognition network.
  • phoneme sequence tags also have shortcomings: firstly, under the current phoneme recognition capability, phoneme recognition errors are unavoidable, with the result that a phoneme sequence tag may not correctly represent the pronunciation of a registration speech, thereby causing recognition errors; secondly, a mismatch between registration speech and testing speech may exist, which will also cause recognition errors.
  • the voice recognition system may give an incorrect recognition result, for example the Initial and Final sequence "w an m ing" for the registration speech "wang ming"; thereby the incorrect sequence "w an m ing" will be added into the recognition network as the pronunciation tag of the registration speech.
  • if the testing speech is also "wang ming" and the system determines that the testing speech is nearest to the sequence "w an m ing" in the recognition network, then the recognition result will be correct; however, since the system may determine that the testing speech is nearest to another sequence in the recognition network, an incorrect recognition result may be obtained.
  • a voice tag item corresponding to the registration speech is constituted by a plurality of pronunciation tags based on different phoneme sequences. Specifically, when performing phoneme recognition on the registration speech, the N best phoneme sequence recognition results or phoneme lattice recognition result are obtained as the pronunciation tags of the registration speech.
  • the above sequences are combined into a voice tag item corresponding to the registration speech "wang ming" and added into the recognition network. Therefore, in the recognition process, as long as the recognition network determines that a testing speech is nearest to any one of the above three sequences, the match between the testing speech and the registration speech "wang ming" can be carried out. Thus, the recognition rate can be improved.
  • the multi-pronunciation registration also has disadvantages: since for a registration speech a plurality of phoneme sequences are added into the recognition network, compared with the single phoneme sequence added in the single-pronunciation registration, the multi-pronunciation registration will increase the scale of the recognition network. Further, constituting a voice tag item by using a plurality of pronunciation sequences will also increase the confusion of the recognition network, and will especially degrade the recognition performance for dictionary items in the voice-tag system.
  • the present invention is proposed to resolve the above problem in the prior art, the object of which is to provide a voice-tag method and apparatus based on confidence score, in order to optimize voice tags based on confidence score in the multi-pronunciation registration technology and thereby reduce the confusion of the recognition network consisting of voice tags.
  • a voice-tag method based on confidence score comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; calculating a confidence score for each of the plurality of pronunciation tags; selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
  • a voice-tag method based on confidence score comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; determining a confidence score based weight for each of the plurality of pronunciation tags; creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly recording the confidence score based weight of each of the plurality of pronunciation tags; and when recognition on a testing speech is performed by using the recognition network, combining a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
  • a voice-tag apparatus based on confidence score, comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence score calculating unit configured to calculate a confidence score for each of the plurality of pronunciation tags; a pronunciation tag selecting unit configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
  • a voice-tag apparatus based on confidence score comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence weight determining unit configured to determine a confidence score based weight for each of the plurality of pronunciation tags; a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags; and
  • a recognition result combining unit configured to, when recognition on a testing speech is performed by using the recognition network, combine a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
  • Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention;
  • Fig.2 depicts an example of the phoneme lattice of a registration speech;
  • Fig.3 depicts a flowchart of the voice-tag method based on confidence score according to the second embodiment of the invention;
  • Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention;
  • Fig.5 depicts a block diagram of the voice-tag apparatus based on confidence score according to the fourth embodiment of the invention.
  • Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention.
  • the confidence score is used as the basis of selection of pronunciation tags for a registration speech.
  • the method performs phoneme recognition on a registration speech input by a user, to obtain a plurality of pronunciation tags of the registration speech.
  • the plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech.
  • phoneme lattice is a multi-pronunciation representation generated by combining same parts in the plurality of phoneme sequences representing the pronunciations of the speech together.
  • a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, is employed to perform phoneme recognition to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
  • any phoneme recognition system or method presently known or future knowable may be employed but not limited to the above commonly used phoneme recognition system in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, and there is no special limitation on this in the present invention.
  • a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech.
  • a confidence score is calculated for a single phoneme on each of arcs in the phoneme lattice.
  • any method presently known or future knowable for calculating a confidence score for a phoneme sequence or a single phoneme may be adopted, for example the posterior probability based confidence score calculating method or the anti-model based confidence score calculating method.
  • At step 115, at least one best pronunciation tag is selected from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
  • the pronunciation tag with the highest confidence score is selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
  • the phoneme sequence with the highest confidence score is selected from the plurality of best phoneme sequences as the best pronunciation tag.
  • the plurality of pronunciation tags are the phoneme lattice of the registration speech, on the basis of the confidence scores of the phonemes on the respective arcs in the phoneme lattice, the path in which the phonemes on the arcs thereof have the highest confidence scores in the phoneme lattice is reserved, while other arcs are removed, thereby constructing the best pronunciation tag of the registration speech by using the reserved path.
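The path selection described above can be sketched by representing the phoneme lattice as arcs carrying per-phoneme confidence scores and reserving the path whose arcs have the highest total confidence. The node layout, phonemes and scores below are hypothetical, and nodes are assumed to be numbered in topological order:

```python
def best_path(arcs, start, end):
    """Return the phoneme sequence on the highest-confidence path.

    arcs: list of (from_node, to_node, phoneme, confidence).
    """
    # best[node] = (total confidence, phoneme sequence) reaching that node
    best = {start: (0.0, [])}
    # sorting by from_node processes nodes in topological order here
    for frm, to, phoneme, conf in sorted(arcs):
        if frm in best:
            score = best[frm][0] + conf
            if to not in best or score > best[to][0]:
                best[to] = (score, best[frm][1] + [phoneme])
    return best[end][1]

# Illustrative lattice with competing arcs ("ang" vs "an", "ing" vs "in")
arcs = [
    (0, 1, "w", 0.9),
    (1, 2, "ang", 0.8), (1, 2, "an", 0.6),
    (2, 3, "m", 0.9),
    (3, 4, "ing", 0.7), (3, 4, "in", 0.5),
]
tag = best_path(arcs, 0, 4)
```

The reserved path here is "w ang m ing"; the removed arcs ("an", "in") correspond to the lower-confidence recognition alternatives.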
  • the pronunciation tags whose confidence scores are higher than a preset confidence threshold are selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
  • the phoneme sequences whose confidence scores are higher than the preset confidence threshold are selected from the plurality of best phoneme sequences. For example, in the case of the above three sequences 1-3 of the registration speech "wang ming", if the confidence threshold is set to 65, then the sequences 1 and 3, whose confidence scores are higher than the confidence threshold, will be selected from the three sequences as the best pronunciation tags of the registration speech "wang ming".
  • the plurality of pronunciation tags are the phoneme lattice of the registration speech
  • the arcs whose phonemes have lower confidence scores than the preset confidence threshold are removed from the phoneme lattice, thereby constructing the best pronunciation tags of the registration speech by using the reserved arcs.
  • the above confidence threshold may be decided according to the experience of developers. Specifically, for example, firstly, a large amount of testing data is prepared, then the phoneme recognition system used at step 105 is applied to perform phoneme recognition on the testing data, and further confidence scores are calculated for the phoneme recognition results, and then a suitable confidence threshold may be set with reference to the confidence scores of high quality recognition results in order to ensure that the high quality recognition results can be selected with the confidence threshold.
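The threshold-based selection can be sketched as below, reusing the threshold 65 from the example; the three sequences and their confidence scores are hypothetical values chosen so that sequences 1 and 3 pass, since the patent does not give the actual scores.

```python
def select_tags(tags, threshold):
    """Keep the pronunciation tags whose confidence score exceeds the threshold."""
    return [seq for seq, score in tags if score > threshold]

# Illustrative N-best pronunciation tags with assumed confidence scores
tags = [
    ("w ang m ing", 80),  # sequence 1
    ("w an m ing", 60),   # sequence 2
    ("w ang m in", 70),   # sequence 3
]
best_tags = select_tags(tags, threshold=65)
```

With these assumed scores, sequences 1 and 3 are selected as the best pronunciation tags and sequence 2 is discarded, matching the example in the text.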
  • a voice tag item corresponding to the registration speech is created based on the at least one best pronunciation tag to add into a recognition network.
  • the recognition can be performed on the testing speech by using the recognition network. Since the creation and addition of a voice tag item are existing knowledge in the art, the detailed description thereof is omitted.
  • the voice-tag method based on confidence score according to the first embodiment of the present invention.
  • the voice tags can be optimized and the negative effects of multi-pronunciation registration on application of voice tags can be reduced.
  • the scale of recognition network consisting of voice tags can be decreased, the confusion of recognition network can be reduced, and the recognition performance of voice tags, especially the dictionary items can be improved.
  • since the method of the present embodiment still keeps the advantages of the multi-pronunciation registration to some degree, it can reduce the negative effect due to phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
  • the voice-tag method based on confidence score according to the second embodiment of the present invention will be described in combination with Fig.3.
  • the confidence score is used to combine a plurality of pronunciation tags of a registration speech.
  • step 305 the method performs phoneme recognition on a registration speech inputted by a user, to obtain a plurality of pronunciation tags of the registration speech. Since the step is the same as the above step 105 in Fig.l, the detailed description thereof is omitted.
  • step 310 a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech. Since the step is the same as the above step 110 in Fig.l, the detailed description thereof is omitted.
  • a confidence score based weight is determined for each of the pronunciation tags of the registration speech.
  • the confidence score based weight is calculated for each of the plurality of pronunciation tags in accordance with the following equation (1):
  • weight i = confidence score i / (confidence score 1 + confidence score 2 + ... + confidence score n)    (1)
  • the weight i denotes the confidence score based weight of the i th pronunciation tag
  • the confidence score 1 denotes the confidence score of the first pronunciation tag
  • the confidence score 2 denotes the confidence score of the second pronunciation tag, and so on
  • the confidence score n denotes the confidence score of the n th pronunciation tag
  • n denotes the number of the plurality of the pronunciation tags.
  • the confidence score based weight of each of the plurality of pronunciation tags is the ratio of the confidence score of this pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags.
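Equation (1) can be written directly as code; the scores below are illustrative:

```python
def confidence_weights(scores):
    """Equation (1): each tag's weight is its confidence score divided by
    the sum of the confidence scores of all tags of the item."""
    total = sum(scores)
    return [s / total for s in scores]

weights = confidence_weights([80.0, 60.0, 60.0])
# the weights sum to 1, and the highest-confidence tag gets the largest weight
```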
  • each of the plurality of pronunciation tags of the registration speech is defined as a component of the voice tag of the registration speech by using the confidence score based weight.
  • a voice tag item corresponding to the registration speech is created based on the plurality of pronunciation tags of the registration speech to add into a recognition network and meanwhile the confidence score based weight of each of the plurality of pronunciation tags is recorded.
  • the voice tag item corresponding to the registration speech may be created directly based on the plurality of pronunciation tags obtained for the registration speech at step 305, or be created based on at least one best pronunciation tag which is selected from the plurality of pronunciation tags on the basis of the confidence score of each of the plurality of pronunciation tags like step 115 in the first embodiment.
  • the foregoing detailed description about step 115 may be referred to, and the detailed description of this step is accordingly omitted.
  • step 325 when a user inputs a testing speech, the recognition is performed on the testing speech by using the recognition network to obtain a plurality of best recognition result candidates of the testing speech.
  • the recognition network obtains the nearest pronunciation sequence "w u m ing" and a similar sequence, as well as the three sequences in the voice tag item corresponding to the registration speech "wang ming", and finally outputs the following recognition results arrayed in the descending order of acoustic score for the testing speech:
  • the plurality of recognition result candidates belonging to a same voice tag item are combined with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
  • the plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates of the testing speech are combined into one recognition result candidate, and a weighted sum of the acoustic scores of the plurality of recognition result candidates belonging to a same voice tag item is calculated on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
  • the recognition result candidates 1, 4 and 5 are combined into one recognition result candidate, which corresponds to the voice tag item of the registration speech "wang ming".
  • the recognition result candidates 1, 4 and 5 can be combined into one recognition result candidate because they belong to one voice tag item and correspond to the registration speech "wang ming" before combination; even after they are combined, the obtained combined recognition result candidate can still correspond to the registration speech "wang ming".
  • the recognition result candidate with the highest acoustic score is selected from the recognition result candidates formed by the plurality of best recognition result candidates after combination as the final recognition result.
  • the recognition result candidates 2 and 3 of the testing speech "wu ming" will also belong to a same voice tag item, and then the recognition result candidates 2 and 3 will also be combined on the basis of the confidence score based weights. Further, if the combined recognition result of the recognition result candidates 2 and 3 has the highest acoustic score, then it will be selected, so that the voice tag item to which the recognition result candidates 2 and 3 belong will become the one matching the testing speech "wu ming"; thus the correct content of the testing speech "wu ming" can be recognized based on this voice tag item.
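The combining step above can be sketched as follows: candidates belonging to the same voice tag item are merged, the merged candidate's acoustic score is the weighted sum of the member scores using the confidence score based weights recorded at registration, and the merged candidate with the highest score becomes the final result. The acoustic scores and weights below are illustrative, not taken from the patent:

```python
def combine_candidates(candidates):
    """candidates: list of (voice_tag_item, acoustic_score, weight).

    Merge candidates per voice tag item by weighted sum of acoustic scores,
    then return the (item, combined_score) pair with the highest score.
    """
    combined = {}
    for item, score, weight in candidates:
        combined[item] = combined.get(item, 0.0) + weight * score
    return max(combined.items(), key=lambda kv: kv[1])

# Illustrative candidate list: "wang ming" has three candidates (one per
# pronunciation tag of its voice tag item), "wu ming" has two.
candidates = [
    ("wu ming", 90.0, 0.6),
    ("wang ming", 85.0, 0.5),
    ("wu ming", 80.0, 0.4),
    ("wang ming", 70.0, 0.3),
    ("wang ming", 60.0, 0.2),
]
result = combine_candidates(candidates)
```

With these assumed numbers, the combined "wu ming" item scores 0.6*90 + 0.4*80 = 86 against 75.5 for "wang ming", so the testing speech is matched to the "wu ming" voice tag item.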
  • the above is a description of the voice-tag method based on confidence score according to the second embodiment of the present invention.
  • the negative effects of multi-pronunciation registration on application of voice tags can be reduced.
  • the confusion of recognition network consisting of voice tags can be reduced, and the recognition performance of voice tags, especially the dictionary items can be improved.
  • since the method of the present embodiment still keeps the advantages of the multi-pronunciation registration, it can reduce the negative effect due to phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
  • the present invention provides a voice-tag apparatus based on confidence score which will be described in detail below in conjunction with drawings.
  • Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention.
  • the voice-tag apparatus 40 based on confidence score of the present embodiment comprises: phoneme recognition unit 41, confidence score calculating unit 42, pronunciation tag selecting unit 43, voice tag creating unit 44, testing speech recognizing unit 45 and recognition network 46.
  • the phoneme recognition unit 41 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech.
  • the plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech.
  • the phoneme recognition unit 41 is implemented based on a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, and the phoneme recognition unit 41 performs phoneme recognition on the registration speech inputted by the user to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
  • the phoneme recognition unit 41 may be implemented with any phoneme recognition system or method presently known or future knowable, there is no limitation on this in the present invention.
  • the confidence score calculating unit 42 is configured to calculate a confidence score for each of the plurality of pronunciation tags.
  • the confidence score calculating unit 42 calculates a confidence score for each of the phoneme sequences. In addition, in the case that the plurality of pronunciation tags of the registration speech are phoneme lattice, the confidence score calculating unit 42 calculates a confidence score for a single phoneme on each of arcs in the phoneme lattice.
  • the confidence score calculating unit 42 may be implemented based on any method presently known or future knowable for calculating a confidence score for a phoneme sequence or a single phoneme, for example the posterior probability based confidence score calculating method or the anti-model based confidence score calculating method.
  • the pronunciation tag selecting unit 43 is configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
  • the pronunciation tag selecting unit 43 selects the pronunciation tag with the highest confidence score from the plurality of pronunciation tags as the at least one best pronunciation tag.
  • the pronunciation tag selecting unit 43 selects the pronunciation tags whose confidence scores are higher than a preset confidence threshold from the plurality of pronunciation tags as the at least one best pronunciation tag.
  • the confidence threshold may be decided based on testing data prepared in advance and according to experience of the developers.
  • the voice tag creating unit 44 is configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into the recognition network 46.
  • the testing speech recognizing unit 45 is configured to, when a user inputs a testing speech, perform recognition on the testing speech by using the recognition network 46 to recognize the content of the testing speech.
  • the above is a description of the voice-tag apparatus based on confidence score of the embodiment.
  • the voice-tag apparatus 40 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the first embodiment described above.
  • the recognition network 46 is included in the voice-tag apparatus 40 based on confidence score in this embodiment, it is not limited to this. The recognition network 46 may also reside outside the voice-tag apparatus 40 based on confidence score in other embodiments.
  • the voice-tag apparatus 50 based on confidence score of the present embodiment comprises: phoneme recognition unit 51, confidence score calculating unit 52, confidence weight determining unit 53, voice tag creating unit 54, testing speech recognizing unit 55, recognition result combining unit 56 and recognition network 57.
  • the phoneme recognition unit 51 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech.
  • the confidence score calculating unit 52 is configured to calculate a confidence score for each of the plurality of pronunciation tags of the registration speech.
  • the confidence weight determining unit 53 is configured to determine a confidence score based weight for each of the plurality of pronunciation tags. Herein, the higher the confidence score of a pronunciation tag is, the larger weight will be determined for the pronunciation tag.
  • the confidence weight determining unit 53 calculates the ratio of the confidence score of the pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags as the confidence score based weight of the pronunciation tag.
  • the voice tag creating unit 54 is configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into the recognition network 57 and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags.
  • the voice tag creating unit 54 selects at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags and creates the voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag.
  • the testing speech recognizing unit 55 is configured to, when a user inputs a testing speech, perform recognition on the testing speech by using the recognition network 57 to obtain a plurality of best recognition result candidates of the testing speech.
  • the recognition result combining unit 56 is configured to combine a plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates obtained by the testing speech recognizing unit 55 with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
  • the recognition result combining unit 56, for the plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates, performs the following process: it combines the plurality of recognition result candidates into one recognition result candidate, and calculates a weighted sum of the acoustic scores of the plurality of recognition result candidates on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
  • the recognition result combining unit 56 selects the best recognition result candidate, namely the one with the highest acoustic score among the recognition result candidates formed by the plurality of best recognition result candidates after combination as the final recognition result.
  • the above is a description of the voice-tag apparatus based on confidence score of the embodiment.
  • the voice-tag apparatus 50 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the second embodiment described above.
  • the recognition network 57 is included in the voice-tag apparatus 50 based on confidence score in this embodiment, it is not limited to this. The recognition network 57 may also reside outside the voice-tag apparatus 50 based on confidence score in other embodiments.
  • the voice-tag apparatuses 40, 50 based on confidence score of the third and fourth embodiments as well as their components can be implemented with specifically designed circuits or chips or be implemented by a computing device (information processing device) executing corresponding programs.

Abstract

The invention provides a voice-tag method and apparatus based on confidence score. The voice-tag method based on confidence score comprises: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; calculating a confidence score for each of the plurality of pronunciation tags; selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network. The present invention optimizes voice tags based on confidence score to reduce the confusion of the recognition network consisting of voice tags in the multi-pronunciation registration based voice-tag technology.

Description

VOICE-TAG METHOD AND APPARATUS BASED ON CONFIDENCE SCORE
TECHNICAL FIELD
[0001] The present invention relates to information processing technology, specifically to a voice-tag method and apparatus based on confidence score.
TECHNICAL BACKGROUND
[0002] The voice-tag technology is an application of speech recognition technology, which is widely used especially in embedded speech recognition systems.
[0003] The working process of a voice-tag technology based system is as follows: firstly, the voice registration process is performed, that is, the user inputs a registration speech and the system converts the registration speech into a tag which represents the pronunciation of the speech; then, the speech recognition process is performed, that is, when the user inputs a testing speech, the system performs recognition on the testing speech based on its recognition network consisting of voice tag items to determine the content of the testing speech. Usually, the recognition network of a voice-tag system consists of not only the voice tag items of registration speeches but also other items whose pronunciations are decided by a dictionary or a grapheme-to-phoneme (G2P) converting module, which can be called dictionary items.
[0004] The original voice-tag technology is usually implemented based on a template matching framework in which, in the registration process, one or more templates are extracted from a registration speech as the tags of the registration speech, and in the recognition process, the Dynamic Time Warping (DTW) algorithm is applied between the testing speech and the template tags to do matching. Recently, along with the wide use of the phoneme based Hidden Markov Model (HMM) in the speech recognition field, phoneme sequences are more often used as the pronunciation tags of registration speeches in current mainstream voice-tag methods. It should be noted that, depending on the language, the unit of pronunciation may be a voice unit other than the phoneme; for Chinese, for example, the Initial and Final sequence may be used as the voice tag of a registration speech.
[0005] In the method which uses phoneme sequences as the pronunciation tags of registration speeches, the phoneme sequences are obtained by performing phoneme recognition on the registration speeches. The advantages of phoneme sequence tags are as follows: firstly, a phoneme sequence tag occupies less memory space than a template tag; secondly, phoneme sequence tag items are easily combined with dictionary items to form new items. These advantages of phoneme sequence tags are very helpful for enlarging the number of items provided by a recognition network.
[0006] However, phoneme sequence tags also have shortcomings: firstly, under the current phoneme recognition capability, phoneme recognition errors are unavoidable, with the result that a phoneme sequence tag may not correctly represent the pronunciation of a registration speech, thereby causing recognition errors; secondly, a mismatch between registration speech and testing speech may exist, which will also cause recognition errors.
[0007] Specifically, suppose that the registration speech is "wang ming"; then the correct Initial and Final sequence corresponding to the registration speech should be "w ang m ing". However, due to the current recognition capability, the voice recognition system may give an incorrect recognition result, for example the Initial and Final sequence "w an m ing" for the registration speech, and the incorrect sequence "w an m ing" will be added into the recognition network as the pronunciation tag of the registration speech. In this case, when the testing speech is also "wang ming", if the system determines that the testing speech is nearest to the sequence "w an m ing" in the recognition network, the recognition result will be correct; however, if the system determines that the testing speech is nearest to another sequence in the recognition network, an incorrect recognition result will be obtained.
[0008] Therefore, in the phoneme sequence tag based voice-tag technology, how to reduce the recognition errors caused by the above reasons has become a focus of current research.
[0009] In order to overcome the shortages of the above phoneme sequence tag method, researchers proposed the following multi-pronunciation registration approach: for a registration speech, a voice tag item corresponding to the registration speech is constituted by a plurality of pronunciation tags based on different phoneme sequences. Specifically, when performing phoneme recognition on the registration speech, the N best phoneme sequence recognition results or phoneme lattice recognition result are obtained as the pronunciation tags of the registration speech.
[0010] Specifically, by still taking the registration speech "wang ming" as an example, suppose that the voice recognition system gave the following three best Initial and Final sequences, arrayed in the descending order of acoustic score, after recognition of the registration speech:
1. "w an m ing";
2. "w an m in";
3. "w ang m ing";
then in the multi-pronunciation registration, the above sequences are combined into a voice tag item corresponding to the registration speech "wang ming" and added into the recognition network. Therefore, in the recognition process, as long as the recognition network determines that a testing speech is nearest to any one of the above three sequences, the match between the testing speech and the registration speech "wang ming" can be carried out. Thus, the recognition rate can be improved.
[0011] By using such a multi-pronunciation registration method, the negative effects on voice recognition due to phoneme recognition errors can be obviously reduced, and the recognition performance degradation due to the mismatch between registration speech and testing speech can be alleviated.
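The matching behaviour of multi-pronunciation registration described above can be sketched in code. This is a minimal illustration only: the edit-distance matcher and the plain string tags are assumptions for clarity, not the HMM/Viterbi decoder that an actual voice-tag system would use.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

# Voice tag item for the registration speech "wang ming" holds
# all three recognized Initial and Final sequences from the example.
tag_item = [
    "w an m ing".split(),
    "w an m in".split(),
    "w ang m ing".split(),
]

testing = "w ang m ing".split()  # decoded testing speech

# The item matches if the testing speech is near to ANY member tag,
# which is why multi-pronunciation registration improves the recognition rate.
best = min(edit_distance(testing, tag) for tag in tag_item)
print(best)  # 0: an exact match with the third tag
```

Even though the first recognized sequence is wrong, the testing speech still matches the item through the third tag, illustrating the tolerance to phoneme recognition errors.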
[0012] However, whereas single-pronunciation registration adds only one phoneme sequence to the recognition network for each registration speech, multi-pronunciation registration adds a plurality of phoneme sequences, which increases the scale of the recognition network. Further, constituting a voice tag item from a plurality of pronunciation sequences also increases the confusion of the recognition network, and in particular degrades the recognition performance for dictionary items in the voice-tag system.
SUMMARY OF THE INVENTION
[0013] The present invention is proposed to resolve the above problem in the prior art, the object of which is to provide a voice-tag method and apparatus based on confidence score, in order to optimize voice tags based on confidence score in the multi-pronunciation registration technology, thereby reducing the confusion of a recognition network consisting of voice tags.
[0014] According to one aspect of the invention, there is provided a voice-tag method based on confidence score, comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; calculating a confidence score for each of the plurality of pronunciation tags; selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
[0015] According to another aspect of the invention, there is provided a voice-tag method based on confidence score, comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; determining a confidence score based weight for each of the plurality of pronunciation tags; creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly recording the confidence score based weight of each of the plurality of pronunciation tags; and when recognition on a testing speech is performed by using the recognition network, combining a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
[0016] According to a further aspect of the invention, there is provided a voice-tag apparatus based on confidence score, comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence score calculating unit configured to calculate a confidence score for each of the plurality of pronunciation tags; a pronunciation tag selecting unit configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
[0017] According to yet another aspect of the invention, there is provided a voice-tag apparatus based on confidence score, comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence weight determining unit configured to determine a confidence score based weight for each of the plurality of pronunciation tags; a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags; and a recognition result combining unit configured to, when recognition on a testing speech is performed by using the recognition network, combine a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
BRIEF DESCRIPTION OF THE DRAWINGS
It is believed that the features, advantages and purposes of the present invention will be better understood from the following description of the detailed implementation of the present invention read in conjunction with the accompanying drawings, in which:
Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention;
Fig.2 depicts an example of the phoneme lattice of a registration speech;
Fig.3 depicts a flowchart of the voice-tag method based on confidence score according to the second embodiment of the invention;
Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention; and
Fig.5 depicts a block diagram of the voice-tag apparatus based on confidence score according to the fourth embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
Next, a detailed description of preferred embodiments of the present invention will be given with reference to the drawings.
( First embodiment )
[0018] Firstly, the first embodiment of the present invention will be described in combination with Figs.1~2. Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention. In the present embodiment, the confidence score is used as the basis for selecting pronunciation tags for a registration speech.
[0019] Specifically, as shown in Fig.1, firstly, at step 105, the method performs phoneme recognition on a registration speech input by a user, to obtain a plurality of pronunciation tags of the registration speech. Specifically, the plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech. A so-called phoneme lattice is a multi-pronunciation representation generated by merging the identical parts of the plurality of phoneme sequences that represent the pronunciations of the speech.
[0020] At this step, for the registration speech input by the user, a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, is employed to perform phoneme recognition to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
[0021] However, the person skilled in the art can appreciate that, as long as a plurality of pronunciation tags can be obtained at this step, any phoneme recognition system or method presently known or knowable in the future may be employed, not limited to the above commonly used phoneme recognition system in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm; there is no special limitation on this in the present invention.
[0022] At step 110, a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech.
[0023] Specifically, in the case that the plurality of pronunciation tags of the registration speech are a plurality of best phoneme sequences, a confidence score is calculated for each of the phoneme sequences. Herein, by still taking the foregoing registration speech "wang ming" as an example, suppose that after the user inputted this registration speech "wang ming", the following three Initial and Final sequences arrayed in the descending order of acoustic score are obtained through recognition:
1. "w an m ing";
2. "w an m in";
3. "w ang m ing";
then at this step, a confidence score is calculated for each of the above three sequences, and it is supposed that the confidence scores are obtained as follows:
1. "w an m ing", confidence score: 70;
2. "w an m in", confidence score: 60;
3. "w ang m ing", confidence score: 75.
[0024] On the other hand, in the case that the plurality of pronunciation tags of the registration speech are a phoneme lattice, a confidence score is calculated for the single phoneme on each arc in the phoneme lattice.
For example, suppose that after recognition on the registration speech "wang ming", another multi-pronunciation representation, namely the Initial and Final lattice shown in Fig.2 corresponding to the above Initial and Final sequences 1~3, is obtained; it is generated by merging the identical parts of the above sequences 1~3. In this case, at this step, for the Initial and Final lattice, a confidence score is calculated for each element (Initial or Final) "w", "an", "ang", "m", "in", "ing" on the arcs.
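The Initial and Final lattice of Fig.2 can be sketched as a small graph in which shared parts of the three sequences are merged and alternative arcs carry the differing units. The node numbering and adjacency-list representation below are assumptions for illustration, not the system's internal format:

```python
# Assumed node layout: 0 --w--> 1 --an/ang--> 2 --m--> 3 --in/ing--> 4
lattice = {
    0: [("w", 1)],
    1: [("an", 2), ("ang", 2)],
    2: [("m", 3)],
    3: [("in", 4), ("ing", 4)],
    4: [],  # final node: no outgoing arcs
}

def paths(node=0):
    """Enumerate every pronunciation encoded by the lattice."""
    if not lattice[node]:
        yield []
        return
    for label, nxt in lattice[node]:
        for rest in paths(nxt):
            yield [label] + rest

print(sorted(" ".join(p) for p in paths()))
# ['w an m in', 'w an m ing', 'w ang m in', 'w ang m ing']
```

Note that merging the three sequences also introduces a fourth path ("w ang m in") that was not among the original recognition results; the lattice is a compact superset of the N-best list.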
[0025] The person skilled in the art can appreciate that at this step, any method presently known or knowable in the future for calculating a confidence score for a phoneme sequence or a single phoneme may be adopted, for example, the posterior-probability based confidence score calculating method or the anti-model based confidence score calculating method.
[0026] Next, at step 115, at least one best pronunciation tag is selected from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
[0027] In an embodiment, at this step, the pronunciation tag with the highest confidence score is selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
[0028] In this case, when the plurality of pronunciation tags are a plurality of best phoneme sequences of the registration speech, on the basis of the confidence scores of the respective phoneme sequences, the phoneme sequence with the highest confidence score is selected from the plurality of best phoneme sequences as the best pronunciation tag. On the other hand, when the plurality of pronunciation tags are the phoneme lattice of the registration speech, on the basis of the confidence scores of the phonemes on the respective arcs in the phoneme lattice, the path whose arcs carry the phonemes with the highest confidence scores is retained in the phoneme lattice, while the other arcs are removed, thereby constructing the best pronunciation tag of the registration speech from the retained path.
[0029] In addition, in another embodiment, at this step, the pronunciation tags whose confidence scores are higher than a preset confidence threshold are selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
[0030] In this case, when the plurality of pronunciation tags are a plurality of best phoneme sequences of the registration speech, on the basis of the confidence scores of the respective phoneme sequences, the phoneme sequences whose confidence scores are higher than the preset confidence threshold are selected from the plurality of best phoneme sequences. For example, in the case of the above three sequences 1~3 of the registration speech "wang ming", if the confidence threshold is set to 65, then the sequences 1 and 3, whose confidence scores are higher than the confidence threshold, will be selected from the three sequences 1~3 as the best pronunciation tags of the registration speech "wang ming".
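The threshold-based selection just described can be sketched directly; the scores and the threshold value 65 come from the running example:

```python
# Recognized sequences with their confidence scores (from the example).
tags = [
    ("w an m ing", 70),
    ("w an m in", 60),
    ("w ang m ing", 75),
]
THRESHOLD = 65  # preset confidence threshold, decided from testing data

# Keep only the pronunciation tags scoring above the threshold.
best_tags = [seq for seq, score in tags if score > THRESHOLD]
print(best_tags)  # ['w an m ing', 'w ang m ing']
```

Sequence 2 ("w an m in", score 60) is discarded, so only two of the three tags enter the voice tag item, shrinking the recognition network while keeping the most trustworthy alternatives.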
[0031] On the other hand, when the plurality of pronunciation tags are the phoneme lattice of the registration speech, on the basis of the confidence scores of the phonemes on the respective arcs in the phoneme lattice, the arcs whose phonemes have confidence scores lower than the preset confidence threshold are removed from the phoneme lattice, thereby constructing the best pronunciation tags of the registration speech from the remaining arcs.
[0032] Herein, the above confidence threshold may be decided according to the experience of developers. Specifically, for example, firstly, a large amount of testing data is prepared, then the phoneme recognition system used at step 105 is applied to perform phoneme recognition on the testing data, and further confidence scores are calculated for the phoneme recognition results, and then a suitable confidence threshold may be set with reference to the confidence scores of high quality recognition results in order to ensure that the high quality recognition results can be selected with the confidence threshold.
[0033] At step 120, a voice tag item corresponding to the registration speech is created based on the at least one best pronunciation tag and added into a recognition network. Thus, when a user inputs a testing speech, recognition can be performed on the testing speech by using the recognition network. Since the creation and addition of a voice tag item are existing knowledge in the art, the detailed description thereof is omitted.
[0034] The above is a description of the voice-tag method based on confidence score according to the first embodiment of the present invention. In the present embodiment, by selecting at least one best pronunciation tag from a plurality of pronunciation tags of a registration speech based on confidence scores to create a voice tag item corresponding to the registration speech, the voice tags can be optimized and the negative effects of multi-pronunciation registration on the application of voice tags can be reduced. Specifically, the scale of the recognition network consisting of voice tags can be decreased, the confusion of the recognition network can be reduced, and the recognition performance of voice tags, especially of the dictionary items, can be improved. At the same time, the method of the present embodiment still keeps the advantages of multi-pronunciation registration to some degree: it can reduce the negative effect of phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
( Second embodiment )
[0035] Next, the voice-tag method based on confidence score according to the second embodiment of the present invention will be described in combination with Fig.3. In the present embodiment, the confidence score is used to combine a plurality of pronunciation tags of a registration speech.
[0036] Specifically, as shown in Fig.3, firstly, at step 305, the method performs phoneme recognition on a registration speech inputted by a user, to obtain a plurality of pronunciation tags of the registration speech. Since the step is the same as the above step 105 in Fig.l, the detailed description thereof is omitted.
[0037] At step 310, a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech. Since the step is the same as the above step 110 in Fig.l, the detailed description thereof is omitted.
[0038] Next, at step 315, a confidence score based weight is determined for each of the pronunciation tags of the registration speech. Herein, the higher the confidence score of a pronunciation tag, the larger the weight determined for that pronunciation tag.
[0039] In an embodiment, at this step, the confidence score based weight is calculated for each of the plurality of pronunciation tags in accordance with the following equation (1):
weight_i = confidence score_i / (confidence score_1 + confidence score_2 + ... + confidence score_n)   (1)
wherein weight_i denotes the confidence score based weight of the ith pronunciation tag, confidence score_1 denotes the confidence score of the first pronunciation tag, confidence score_2 denotes the confidence score of the second pronunciation tag, ..., confidence score_n denotes the confidence score of the nth pronunciation tag, and n denotes the number of the plurality of pronunciation tags. In other words, in accordance with the above equation (1), the confidence score based weight of each of the plurality of pronunciation tags is the ratio of the confidence score of this pronunciation tag to the sum of the confidence scores of all the plurality of pronunciation tags.
[0040] Next, the description will be given in combination with a specific example. By still taking the foregoing registration speech "wang ming" as an example, suppose that the recognition results and confidence score calculation results are the same as those of the first embodiment, namely:
1. "w an m ing", confidence score: 70;
2. "w an m in", confidence score: 60;
3. "w ang m ing", confidence score: 75;
then in this case, at this step, the confidence score based weights are calculated in accordance with the above equation (1) as follows:
1. "w an m ing", confidence score: 70, weight = 70 / (70+60+75) = 0.34;
2. "w an m in", confidence score: 60, weight = 60 / (70+60+75) = 0.29;
3. "w ang m ing", confidence score: 75, weight = 75 / (70+60+75) = 0.37.
That is, in the present embodiment, each of the plurality of pronunciation tags of the registration speech is defined as a component of the voice tag of the registration speech by using the confidence score based weight.
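Equation (1) above can be expressed directly in code; the scores are those of the running "wang ming" example, and the two-decimal rounding matches the weights listed above:

```python
# Confidence scores of the three pronunciation tags (from the example).
scores = {"w an m ing": 70, "w an m in": 60, "w ang m ing": 75}
total = sum(scores.values())  # 70 + 60 + 75 = 205

# Equation (1): weight_i = confidence score_i / sum of all confidence scores.
weights = {tag: round(s / total, 2) for tag, s in scores.items()}
print(weights)
# {'w an m ing': 0.34, 'w an m in': 0.29, 'w ang m ing': 0.37}
```

By construction the unrounded weights sum to 1, so each tag's weight expresses its share of the total confidence in the voice tag item.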
[0041] Next, at step 320, a voice tag item corresponding to the registration speech is created based on the plurality of pronunciation tags of the registration speech to add into a recognition network and meanwhile the confidence score based weight of each of the plurality of pronunciation tags is recorded.
[0042] At this step, the voice tag item corresponding to the registration speech may be created directly based on the plurality of pronunciation tags obtained for the registration speech at step 305, or be created based on at least one best pronunciation tag which is selected from the plurality of pronunciation tags on the basis of the confidence score of each of the plurality of pronunciation tags like step 115 in the first embodiment. As to this step, the foregoing detailed description about step 115 may be referred to, and the detailed description of this step is accordingly omitted.
[0043] Next, at step 325, when a user inputs a testing speech, the recognition is performed on the testing speech by using the recognition network to obtain a plurality of best recognition result candidates of the testing speech.
[0044] Specifically, at this step, when performing recognition on the testing speech by using the recognition network, all pronunciation sequences (namely pronunciation tags) near to the testing speech are obtained from the recognition network through matching, as the plurality of best recognition result candidates of the testing speech.
[0045] For example, in the case that the user inputs the testing speech "wu ming", suppose that by obtaining all sequences near to the testing speech, the recognition network obtains the nearest pronunciation sequence "w u m ing" and a similar sequence, as well as the three sequences in the voice tag item corresponding to the registration speech "wang ming", and finally outputs the following recognition results arrayed in the descending order of acoustic score for the testing speech:
1. w an m in, acoustic score: 90;
2. w u m ing, acoustic score: 89;
3. w u n ing, acoustic score: 87;
4. w an m ing, acoustic score: 80;
5. w ang m ing, acoustic score: 70.
[0046] At step 330, among the plurality of best recognition result candidates of the testing speech, the plurality of recognition result candidates belonging to a same voice tag item are combined with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
[0047] Specifically, at this step, the plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates of the testing speech are combined into one recognition result candidate, and a weighted sum of the acoustic scores of the plurality of recognition result candidates belonging to a same voice tag item is calculated on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
[0048] Next, the description will be given in combination with a specific example. By still taking the foregoing testing speech "wu ming" and its recognition result candidates 1~5 as an example, suppose that according to the recognition network, the recognition result candidates 1, 4 and 5 belong to a same voice tag item, namely the voice tag item corresponding to the registration speech "wang ming", while the recognition result candidates 2 and 3 belong to different voice tag items; then at this step, the recognition result candidates 1, 4 and 5 will be combined into one recognition result candidate, and a weighted sum of the acoustic scores of the recognition result candidates 1, 4 and 5 will be calculated on the basis of the confidence score based weights of the respective pronunciation tags corresponding to the recognition result candidates 1, 4 and 5, as the acoustic score of the combined recognition result candidate. Thereby, through combination, the recognition result candidates will become:
1, 4, 5. w an m in (w an m ing, w ang m ing), acoustic score after combination: 90*0.29+80*0.34+70*0.37=79.2;
2. w u m ing, acoustic score: 89;
3. w u n ing, acoustic score: 87.
[0049] Thus, the recognition result candidates 1, 4 and 5 are combined into one recognition result candidate, which corresponds to the voice tag item of the registration speech "wang ming".
[0050] Herein, it should be noted that although the recognition result candidates 1, 4 and 5 are combined into one recognition result candidate, since before combination they belong to one voice tag item and correspond to the registration speech "wang ming", the combined recognition result candidate can still correspond to the registration speech "wang ming".
[0052] At step 335, the recognition result candidate with the highest acoustic score is selected from the recognition result candidates formed by the plurality of best recognition result candidates after combination as the final recognition result.
[0054] Therefore, in the above example, since the recognition result candidate 2, "w u m ing", becomes the one with the highest acoustic score after the weight based combination of the recognition result candidates 1~5, it will be selected as the final recognition result, and thus the correct recognition result can be obtained.
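The combination and selection of steps 330 and 335 can be sketched as follows. The item labels ("item2", "item3") and the per-candidate weight column are assumptions matching the worked example; candidates 2 and 3 stand alone, so their weight is taken as 1.0:

```python
# (sequence, voice tag item, acoustic score, confidence-based weight)
candidates = [
    ("w an m in",   "wang ming", 90, 0.29),
    ("w u m ing",   "item2",     89, 1.0),
    ("w u n ing",   "item3",     87, 1.0),
    ("w an m ing",  "wang ming", 80, 0.34),
    ("w ang m ing", "wang ming", 70, 0.37),
]

# Step 330: merge candidates of the same voice tag item, accumulating
# a weighted sum of their acoustic scores.
combined = {}
for seq, item, acoustic, weight in candidates:
    combined[item] = combined.get(item, 0.0) + acoustic * weight

# Step 335: the candidate with the highest combined acoustic score wins.
final_item = max(combined, key=combined.get)
print(round(combined["wang ming"], 1), final_item)
# 79.2 item2
```

The three "wang ming" candidates collapse to a combined score of 79.2, so the correct item for "w u m ing" (score 89) now outranks them, exactly as in paragraph [0054].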
[0055] In addition, if it is supposed that the recognition result candidates 2 and 3 of the testing speech "wu ming" also belong to a same voice tag item, then the recognition result candidates 2 and 3 will also be combined on the basis of the confidence score based weights. Further, if the combined recognition result of the recognition result candidates 2 and 3 still has the highest acoustic score, then it will be selected, and the voice tag item to which the recognition result candidates 2 and 3 belong will become the one matching the testing speech "wu ming"; thus the correct content of the testing speech "wu ming" can be recognized based on this voice tag item.
[0056] The above is a description of the voice-tag method based on confidence score according to the second embodiment of the present invention. In the present embodiment, by combining recognition result candidates belonging to a same voice tag item with confidence score based weights, the negative effects of multi-pronunciation registration on the application of voice tags can be reduced. Specifically, the confusion of the recognition network consisting of voice tags can be reduced, and the recognition performance of voice tags, especially of the dictionary items, can be improved. At the same time, the method of the present embodiment still keeps the advantages of multi-pronunciation registration: it can reduce the negative effect of phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
( Third embodiment)
[0057] Under the same inventive conception, the present invention provides a voice-tag apparatus based on confidence score which will be described in detail below in conjunction with drawings.
[0058] Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention. As shown in Fig.4, the voice-tag apparatus 40 based on confidence score of the present embodiment comprises: phoneme recognition unit 41, confidence score calculating unit 42, pronunciation tag selecting unit 43, voice tag creating unit 44, testing speech recognizing unit 45 and recognition network 46.
[0059] Specifically, the phoneme recognition unit 41 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech. The plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech.
[0060] In an embodiment, the phoneme recognition unit 41 is implemented based on a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, and the phoneme recognition unit 41 performs phoneme recognition on the registration speech inputted by the user to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
[0061] Of course, the invention is not limited to this; the phoneme recognition unit 41 may be implemented with any phoneme recognition system or method presently known or knowable in the future, and there is no limitation on this in the present invention.
[0062] The confidence score calculating unit 42 is configured to calculate a confidence score for each of the plurality of pronunciation tags.
[0063] Specifically, in the case that the plurality of pronunciation tags of the registration speech are a plurality of best phoneme sequences, the confidence score calculating unit 42 calculates a confidence score for each of the phoneme sequences. In addition, in the case that the plurality of pronunciation tags of the registration speech are a phoneme lattice, the confidence score calculating unit 42 calculates a confidence score for a single phoneme on each of the arcs in the phoneme lattice.
[0064] The confidence score calculating unit 42 may be implemented based on any method presently known or knowable in the future for calculating a confidence score for a phoneme sequence or a single phoneme, for example, the posterior-probability based confidence score calculating method or the anti-model based confidence score calculating method.
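As one plausible instantiation of the posterior-probability style of confidence named above, the log-domain acoustic scores of the N-best candidates can be normalized into posteriors; the function name is hypothetical and this is only a sketch, not the patented method itself.

```python
import math

def posterior_confidence(acoustic_scores):
    """Softmax over the log-domain acoustic scores of the N-best
    pronunciation tags: each tag's confidence is its share of the
    total probability mass."""
    m = max(acoustic_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in acoustic_scores]
    total = sum(exps)
    return [e / total for e in exps]
```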
[0065] The pronunciation tag selecting unit 43 is configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
[0066] In an embodiment, the pronunciation tag selecting unit 43 selects the pronunciation tag with the highest confidence score from the plurality of pronunciation tags as the at least one best pronunciation tag.
[0067] In addition, in another embodiment, the pronunciation tag selecting unit 43 selects the pronunciation tags whose confidence scores are higher than a preset confidence threshold from the plurality of pronunciation tags as the at least one best pronunciation tag. As mentioned above, the confidence threshold may be decided based on testing data prepared in advance and according to the experience of the developers.
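The two selection strategies of paragraphs [0066] and [0067] can be sketched together as follows; the helper name, the tag representation as phoneme strings, and the fallback to the single best tag when no tag passes the threshold are illustrative assumptions, not part of the described method.

```python
def select_best_tags(tags, scores, threshold=None):
    """Select the best pronunciation tag(s): the single highest-scoring
    tag when no threshold is given, otherwise every tag whose confidence
    exceeds the preset threshold (falling back to the single best tag if
    none passes)."""
    best = max(zip(tags, scores), key=lambda pair: pair[1])[0]
    if threshold is None:
        return [best]
    chosen = [t for t, s in zip(tags, scores) if s > threshold]
    return chosen or [best]
```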
[0068] The voice tag creating unit 44 is configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into the recognition network 46.
[0069] The testing speech recognizing unit 45 is configured to perform recognition on a testing speech by using the recognition network 46 to recognize the content of the testing speech when a user inputs the testing speech.
[0070] The above is a description of the voice-tag apparatus based on confidence score of the embodiment. The voice-tag apparatus 40 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the first embodiment described above.
It should be noted that although the recognition network 46 is included in the voice-tag apparatus 40 based on confidence score in this embodiment, it is not limited to this. The recognition network 46 may also reside outside the voice-tag apparatus 40 based on confidence score in other embodiments.
(Fourth embodiment)
[0071] Next, the voice-tag apparatus based on confidence score according to the fourth embodiment of the present invention will be described in combination with Fig.5.
[0072] As shown in Fig.5, the voice-tag apparatus 50 based on confidence score of the present embodiment comprises: phoneme recognition unit 51, confidence score calculating unit 52, confidence weight determining unit 53, voice tag creating unit 54, testing speech recognizing unit 55, recognition result combining unit 56 and recognition network 57.
[0073] Specifically, the phoneme recognition unit 51 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech.
[0074] The confidence score calculating unit 52 is configured to calculate a confidence score for each of the plurality of pronunciation tags of the registration speech.
[0075] The confidence weight determining unit 53 is configured to determine a confidence score based weight for each of the plurality of pronunciation tags. Herein, the higher the confidence score of a pronunciation tag is, the larger the weight determined for that pronunciation tag will be.
[0076] In an embodiment, the confidence weight determining unit 53, for each of the plurality of pronunciation tags, calculates the ratio of the confidence score of the pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags as the confidence score based weight of the pronunciation tag.
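The ratio computation of paragraph [0076] can be sketched directly; the function name is hypothetical. By construction the resulting weights sum to one.

```python
def confidence_weights(scores):
    """Weight of each pronunciation tag = its confidence score divided
    by the sum of the confidence scores of all the tags."""
    total = sum(scores)
    return [s / total for s in scores]
```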
[0077] The voice tag creating unit 54 is configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into the recognition network 57 and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags.
[0078] In an embodiment, the voice tag creating unit 54 selects at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags and creates the voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag.
[0079] The testing speech recognizing unit 55 is configured to perform recognition on a testing speech by using the recognition network 57 to obtain a plurality of best recognition result candidates of the testing speech when a user inputs the testing speech.
[0080] The recognition result combining unit 56 is configured to combine a plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates obtained by the testing speech recognizing unit 55 with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
[0081] In an embodiment, the recognition result combining unit 56 performs the following process for the plurality of recognition result candidates belonging to the same voice tag item among the plurality of best recognition result candidates: it combines the plurality of recognition result candidates into one recognition result candidate, and calculates a weighted sum of the acoustic scores of the plurality of recognition result candidates, on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
[0082] In addition, the recognition result combining unit 56 selects the best recognition result candidate, namely the one with the highest acoustic score among the recognition result candidates formed from the plurality of best recognition result candidates after combination, as the final recognition result.
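The combination described in paragraphs [0081] and [0082] can be sketched as follows, assuming each candidate is represented as a (voice-tag item, acoustic score, confidence weight) triple; the names and the triple representation are illustrative assumptions.

```python
def combine_candidates(candidates):
    """candidates: iterable of (tag_item_id, acoustic_score, weight).
    Candidates sharing a voice-tag item are merged by taking the
    weighted sum of their acoustic scores; the item with the highest
    combined score is returned as the final recognition result."""
    combined = {}
    for item_id, score, weight in candidates:
        combined[item_id] = combined.get(item_id, 0.0) + weight * score
    return max(combined.items(), key=lambda kv: kv[1])
```

Merging candidates of the same voice tag item in this way is what keeps the multiple registered pronunciations from competing against each other at recognition time.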
[0083] The above is a description of the voice-tag apparatus based on confidence score of the embodiment. The voice-tag apparatus 50 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the second embodiment described above.
It should be noted that although the recognition network 57 is included in the voice-tag apparatus 50 based on confidence score in this embodiment, it is not limited to this. The recognition network 57 may also reside outside the voice-tag apparatus 50 based on confidence score in other embodiments.
[0084] It can be appreciated by the person skilled in the art that the voice-tag apparatuses 40, 50 based on confidence score of the third and fourth embodiments as well as their components can be implemented with specifically designed circuits or chips or be implemented by a computing device (information processing device) executing corresponding programs.
[0085] While the voice-tag method and apparatus based on confidence score of the present invention have been described in detail with some exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art may make various variations and modifications within the spirit and scope of the present invention. Therefore, the present invention is not limited to these embodiments; its scope is defined only by the appended claims.

Claims

1. A voice-tag method based on confidence score, comprising:
performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
calculating a confidence score for each of the plurality of pronunciation tags;
selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and
creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
2. A voice-tag method based on confidence score, comprising:
performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
determining a confidence score based weight for each of the plurality of pronunciation tags;
creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly recording the confidence score based weight of each of the plurality of pronunciation tags; and
when recognition on a testing speech is performed by using the recognition network, combining a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
3. The method according to claim 2, wherein the step of determining a confidence score based weight for each of the plurality of pronunciation tags further comprises: calculating a confidence score for each of the plurality of pronunciation tags; and determining the confidence score based weight for each of the plurality of pronunciation tags, wherein the higher the confidence score of the pronunciation tag is, the larger weight will be determined for the pronunciation tag.
4. The method according to claim 2, wherein:
the confidence score based weight of each of the plurality of pronunciation tags is the ratio of the confidence score of the pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags.
5. The method according to claim 2, wherein the step of creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags further comprises:
selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and
creating the voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag.
6. The method according to claim 1 or 5, wherein the step of selecting at least one best pronunciation tag further comprises:
selecting the pronunciation tag with the highest confidence score from the plurality of pronunciation tags as the at least one best pronunciation tag.
7. The method according to claim 1 or 5, wherein the step of selecting at least one best pronunciation tag further comprises:
selecting the pronunciation tags whose confidence scores are higher than a preset confidence threshold from the plurality of pronunciation tags as the at least one best pronunciation tag.
8. The method according to claim 2, wherein the step of combining further comprises:
for the plurality of recognition result candidates belonging to a same voice tag item among the recognition result candidates:
combining the plurality of recognition result candidates into one recognition result candidate; and
calculating a weighted sum of the acoustic scores of the plurality of recognition result candidates on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
9. A voice-tag apparatus based on confidence score, comprising:
a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
a confidence score calculating unit configured to calculate a confidence score for each of the plurality of pronunciation tags;
a pronunciation tag selecting unit configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and
a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
10. A voice-tag apparatus based on confidence score, comprising:
a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
a confidence weight determining unit configured to determine a confidence score based weight for each of the plurality of pronunciation tags;
a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags; and
a recognition result combining unit configured to, when recognition on a testing speech is performed by using the recognition network, combine a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
PCT/IB2010/052954 2010-06-29 2010-06-29 Voice-tag method and apparatus based on confidence score WO2012001458A1 (en)
