WO2012001458A1 - Voice-tag method and apparatus based on confidence score - Google Patents

Voice-tag method and apparatus based on confidence score

Info

Publication number
WO2012001458A1
WO2012001458A1 (PCT/IB2010/052954)
Authority
WO
WIPO (PCT)
Prior art keywords
pronunciation
tag
confidence score
tags
recognition
Prior art date
Application number
PCT/IB2010/052954
Other languages
French (fr)
Inventor
Lei He
Rui Zhao
Original Assignee
Kabushiki Kaisha Toshiba
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kabushiki Kaisha Toshiba filed Critical Kabushiki Kaisha Toshiba
Priority to PCT/IB2010/052954 priority Critical patent/WO2012001458A1/en
Priority to CN2010800015191A priority patent/CN102439660A/en
Publication of WO2012001458A1 publication Critical patent/WO2012001458A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Definitions

  • the present invention relates to information processing technology, specifically to a voice-tag method and apparatus based on confidence score.
  • the voice-tag technology is an application of speech recognition technology, which is widely used especially in embedded speech recognition systems.
  • the working process of a voice-tag technology based system is as follows: firstly, the voice registration process is performed, that is, the user inputs a registration speech and the system converts the registration speech into a tag which represents the pronunciation of the speech; then, the speech recognition process is performed, that is, when the user inputs a testing speech, the system performs recognition on the testing speech based on its recognition network consisting of voice tag items to determine the content of the testing speech.
  • the recognition network of a voice-tag system consists of not only the voice tag items of registered speeches but also other items whose pronunciations are decided by a dictionary or a grapheme-to-phoneme (G2P) converting module; the latter can be called dictionary items.
  • G2P grapheme-to-phoneme
  • the original voice-tag technology is usually implemented based on template matching framework in which, in the registration process, one or more templates are extracted from a registration speech as the tags of the registration speech; in the recognition process, the Dynamic Time Warping (DTW) algorithm is applied between testing speech and template tags to do matching.
  • DTW Dynamic Time Warping
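The template-matching scheme described above can be sketched as follows. This is an illustrative simplification, not the patent's implementation: frames are scalars and the local cost is an absolute difference, whereas real systems compare acoustic feature vectors with a spectral distance.

```python
def dtw_distance(test, template):
    """Minimal DTW: accumulated cost of the best alignment of two sequences."""
    n, m = len(test), len(template)
    INF = float("inf")
    # d[i][j]: minimal accumulated cost aligning test[:i] with template[:j]
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(test[i - 1] - template[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

# Hypothetical template tags; the tag with the smallest DTW distance
# to the testing speech is taken as the recognition result.
templates = {"tag_a": [1.0, 2.0, 3.0], "tag_b": [2.0, 2.0, 2.0]}
test = [1.1, 2.1, 2.9]
best = min(templates, key=lambda k: dtw_distance(test, templates[k]))
```

In the registration process the templates themselves are extracted from the registration speech; only the matching step is sketched here.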
  • HMM Hidden Markov Model
  • phoneme sequences are obtained by performing phoneme recognition on the registration speeches.
  • the advantages of phoneme sequence tags are as follows: firstly, a phoneme sequence tag occupies less memory space than a template tag; secondly, phoneme sequence tag items are easily combined with dictionary items to form new items. These advantages of phoneme sequence tags are very helpful to enlarge the number of items provided by a recognition network.
  • phoneme sequence tags also have shortcomings: firstly, under the current phoneme recognition capability, phoneme recognition errors are unavoidable, with the result that a phoneme sequence tag may not correctly represent the pronunciation of a registration speech, thereby causing recognition errors; secondly, a mismatch between registration speech and testing speech may exist, which will also cause recognition errors.
  • the voice recognition system may give an incorrect recognition result, for example the Initial and Final sequence "w an m ing" for the registration speech "wang ming"; thereby the incorrect sequence "w an m ing" will be added into the recognition network as the pronunciation tag of the registration speech.
  • if the testing speech is also "wang ming" and the system determines that the testing speech is nearest to the sequence "w an m ing" in the recognition network, then the recognition result will be correct; however, since the system may determine that the testing speech is nearest to another sequence in the recognition network, an incorrect recognition result may be obtained.
  • a voice tag item corresponding to the registration speech is constituted by a plurality of pronunciation tags based on different phoneme sequences. Specifically, when performing phoneme recognition on the registration speech, the N best phoneme sequence recognition results or phoneme lattice recognition result are obtained as the pronunciation tags of the registration speech.
  • the above sequences are combined into a voice tag item corresponding to the registration speech "wang ming" and added into the recognition network. Therefore, in the recognition process, as long as the recognition network determines that a testing speech is nearest to any one of the above three sequences, the match between the testing speech and the registration speech "wang ming" can be carried out. Thus, the recognition rate can be improved.
  • the multi-pronunciation registration also has disadvantages: since for a registration speech a plurality of phoneme sequences are added into the recognition network, compared with the single phoneme sequence added in the single-pronunciation registration, the multi-pronunciation registration will increase the scale of the recognition network. Further, constituting a voice tag item by using a plurality of pronunciation sequences will also increase the confusion of the recognition network, and will especially degrade the recognition performance for dictionary items in the voice-tag system.
  • the present invention is proposed to resolve the above problem in the prior art, the object of which is to provide a voice-tag method and apparatus based on confidence score, in order to optimize voice tags based on confidence score in the multi-pronunciation registration technology and thereby reduce the confusion of the recognition network consisting of voice tags.
  • a voice-tag method based on confidence score comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; calculating a confidence score for each of the plurality of pronunciation tags; selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
  • a voice-tag method based on confidence score comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; determining a confidence score based weight for each of the plurality of pronunciation tags; creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly recording the confidence score based weight of each of the plurality of pronunciation tags; and when recognition on a testing speech is performed by using the recognition network, combining a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
  • a voice-tag apparatus based on confidence score, comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence score calculating unit configured to calculate a confidence score for each of the plurality of pronunciation tags; a pronunciation tag selecting unit configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
  • a voice-tag apparatus based on confidence score comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence weight determining unit configured to determine a confidence score based weight for each of the plurality of pronunciation tags; a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags; and
  • a recognition result combining unit configured to, when recognition on a testing speech is performed by using the recognition network, combine a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
  • Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention;
  • Fig.2 depicts an example of the phoneme lattice of a registration speech;
  • Fig.3 depicts a flowchart of the voice-tag method based on confidence score according to the second embodiment of the invention;
  • Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention;
  • Fig.5 depicts a block diagram of the voice-tag apparatus based on confidence score according to the fourth embodiment of the invention.
  • Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention.
  • the confidence score is used as the basis of selection of pronunciation tags for a registration speech.
  • the method performs phoneme recognition on a registration speech input by a user, to obtain a plurality of pronunciation tags of the registration speech.
  • the plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech.
  • phoneme lattice is a multi-pronunciation representation generated by combining same parts in the plurality of phoneme sequences representing the pronunciations of the speech together.
  • a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, is employed to perform phoneme recognition to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
  • any phoneme recognition system or method presently known or future knowable may be employed but not limited to the above commonly used phoneme recognition system in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, and there is no special limitation on this in the present invention.
  • a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech.
  • a confidence score is calculated for a single phoneme on each of arcs in the phoneme lattice.
  • any method presently known or future knowable for calculating a confidence score for a phoneme sequence or a single phoneme may be adopted, for example the posterior probability based confidence score calculating method or the anti-model based confidence score calculating method.
  • At step 115, at least one best pronunciation tag is selected from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
  • the pronunciation tag with the highest confidence score is selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
  • the phoneme sequence with the highest confidence score is selected from the plurality of best phoneme sequences as the best pronunciation tag.
  • the plurality of pronunciation tags are the phoneme lattice of the registration speech, on the basis of the confidence scores of the phonemes on the respective arcs in the phoneme lattice, the path in which the phonemes on the arcs thereof have the highest confidence scores in the phoneme lattice is reserved, while other arcs are removed, thereby constructing the best pronunciation tag of the registration speech by using the reserved path.
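The path selection described above can be sketched by representing the phoneme lattice as arcs carrying per-phoneme confidence scores and reserving the path whose arcs have the highest total confidence. The node layout, phonemes and scores below are hypothetical, and nodes are assumed to be numbered in topological order:

```python
def best_path(arcs, start, end):
    """Return the phoneme sequence on the highest-confidence path.

    arcs: list of (from_node, to_node, phoneme, confidence).
    """
    # best[node] = (total confidence, phoneme sequence) reaching that node
    best = {start: (0.0, [])}
    # sorting by from_node processes nodes in topological order here
    for frm, to, phoneme, conf in sorted(arcs):
        if frm in best:
            score = best[frm][0] + conf
            if to not in best or score > best[to][0]:
                best[to] = (score, best[frm][1] + [phoneme])
    return best[end][1]

# Illustrative lattice with competing arcs ("ang" vs "an", "ing" vs "in")
arcs = [
    (0, 1, "w", 0.9),
    (1, 2, "ang", 0.8), (1, 2, "an", 0.6),
    (2, 3, "m", 0.9),
    (3, 4, "ing", 0.7), (3, 4, "in", 0.5),
]
tag = best_path(arcs, 0, 4)
```

The reserved path here is "w ang m ing"; the removed arcs ("an", "in") correspond to the lower-confidence recognition alternatives.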
  • the pronunciation tags whose confidence scores are higher than a preset confidence threshold are selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
  • the phoneme sequences whose confidence scores are higher than the preset confidence threshold are selected from the plurality of best phoneme sequences. For example, in the case of the above three sequences 1-3 of the registration speech "wang ming", if the confidence threshold is set to 65, then the sequences 1 and 3, whose confidence scores are higher than the confidence threshold, will be selected from the three sequences as the best pronunciation tags of the registration speech "wang ming".
  • the plurality of pronunciation tags are the phoneme lattice of the registration speech
  • the arcs whose phonemes have lower confidence scores than the preset confidence threshold are removed from the phoneme lattice, thereby constructing the best pronunciation tags of the registration speech by using the reserved arcs.
  • the above confidence threshold may be decided according to the experience of developers. Specifically, for example, firstly, a large amount of testing data is prepared, then the phoneme recognition system used at step 105 is applied to perform phoneme recognition on the testing data, and further confidence scores are calculated for the phoneme recognition results, and then a suitable confidence threshold may be set with reference to the confidence scores of high quality recognition results in order to ensure that the high quality recognition results can be selected with the confidence threshold.
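The threshold-based selection can be sketched as below, reusing the threshold 65 from the example; the three sequences and their confidence scores are hypothetical values chosen so that sequences 1 and 3 pass, since the patent does not give the actual scores.

```python
def select_tags(tags, threshold):
    """Keep the pronunciation tags whose confidence score exceeds the threshold."""
    return [seq for seq, score in tags if score > threshold]

# Illustrative N-best pronunciation tags with assumed confidence scores
tags = [
    ("w ang m ing", 80),  # sequence 1
    ("w an m ing", 60),   # sequence 2
    ("w ang m in", 70),   # sequence 3
]
best_tags = select_tags(tags, threshold=65)
```

With these assumed scores, sequences 1 and 3 are selected as the best pronunciation tags and sequence 2 is discarded, matching the example in the text.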
  • a voice tag item corresponding to the registration speech is created based on the at least one best pronunciation tag to add into a recognition network.
  • the recognition can be performed on the testing speech by using the recognition network. Since the creation and addition of a voice tag item are existing knowledge in the art, the detailed description thereof is omitted.
  • the voice-tag method based on confidence score according to the first embodiment of the present invention.
  • the voice tags can be optimized and the negative effects of multi-pronunciation registration on application of voice tags can be reduced.
  • the scale of recognition network consisting of voice tags can be decreased, the confusion of recognition network can be reduced, and the recognition performance of voice tags, especially the dictionary items can be improved.
  • since the method of the present embodiment still keeps the advantages of the multi-pronunciation registration to some degree, it can reduce the negative effect due to phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
  • the voice-tag method based on confidence score according to the second embodiment of the present invention will be described in combination with Fig.3.
  • the confidence score is used to combine a plurality of pronunciation tags of a registration speech.
  • step 305 the method performs phoneme recognition on a registration speech inputted by a user, to obtain a plurality of pronunciation tags of the registration speech. Since the step is the same as the above step 105 in Fig.l, the detailed description thereof is omitted.
  • step 310 a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech. Since the step is the same as the above step 110 in Fig.l, the detailed description thereof is omitted.
  • a confidence score based weight is determined for each of the pronunciation tags of the registration speech.
  • the confidence score based weight is calculated for each of the plurality of pronunciation tags in accordance with the following equation (1):
  • weight i = confidence score i / (confidence score 1 + confidence score 2 + ... + confidence score n)    (1)
  • the weight i denotes the confidence score based weight of the i th pronunciation tag
  • the confidence score 1 denotes the confidence score of the first pronunciation tag
  • the confidence score 2 denotes the confidence score of the second pronunciation tag, and so on
  • the confidence score n denotes the confidence score of the n th pronunciation tag
  • n denotes the number of the plurality of the pronunciation tags.
  • the confidence score based weight of each of the plurality of pronunciation tags is the ratio of the confidence score of this pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags.
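Equation (1) can be written directly as code; the scores below are illustrative:

```python
def confidence_weights(scores):
    """Equation (1): each tag's weight is its confidence score divided by
    the sum of the confidence scores of all tags of the item."""
    total = sum(scores)
    return [s / total for s in scores]

weights = confidence_weights([80.0, 60.0, 60.0])
# the weights sum to 1, and the highest-confidence tag gets the largest weight
```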
  • each of the plurality of pronunciation tags of the registration speech is defined as a component of the voice tag of the registration speech by using the confidence score based weight.
  • a voice tag item corresponding to the registration speech is created based on the plurality of pronunciation tags of the registration speech to add into a recognition network and meanwhile the confidence score based weight of each of the plurality of pronunciation tags is recorded.
  • the voice tag item corresponding to the registration speech may be created directly based on the plurality of pronunciation tags obtained for the registration speech at step 305, or be created based on at least one best pronunciation tag which is selected from the plurality of pronunciation tags on the basis of the confidence score of each of the plurality of pronunciation tags like step 115 in the first embodiment.
  • the foregoing detailed description about step 115 may be referred to, and the detailed description of this step is accordingly omitted.
  • step 325 when a user inputs a testing speech, the recognition is performed on the testing speech by using the recognition network to obtain a plurality of best recognition result candidates of the testing speech.
  • the recognition network obtains the nearest pronunciation sequence "w u m ing" and a similar sequence, as well as the three sequences in the voice tag item corresponding to the registration speech "wang ming", and finally outputs the following recognition results arrayed in the descending order of acoustic score for the testing speech:
  • the plurality of recognition result candidates belonging to a same voice tag item are combined with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
  • the plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates of the testing speech are combined into one recognition result candidate, and a weighted sum of the acoustic scores of the plurality of recognition result candidates belonging to a same voice tag item is calculated on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
  • the recognition result candidates 1, 4 and 5 are combined into one recognition result candidate, which corresponds to the voice tag item of the registration speech "wang ming".
  • the recognition result candidates 1, 4 and 5 can be combined into one recognition result candidate because they belong to one voice tag item and correspond to the registration speech "wang ming" before combination; even after they are combined, the obtained combined recognition result candidate can still correspond to the registration speech "wang ming".
  • the recognition result candidate with the highest acoustic score is selected from the recognition result candidates formed by the plurality of best recognition result candidates after combination as the final recognition result.
  • the recognition result candidates 2 and 3 of the testing speech "wu ming" will also belong to a same voice tag item, and then the recognition result candidates 2 and 3 will also be combined on the basis of the confidence score based weights. Further, if the combined recognition result of the recognition result candidates 2 and 3 has the highest acoustic score, then it will be selected, so that the voice tag item to which the recognition result candidates 2 and 3 belong will become the one matching the testing speech "wu ming"; thus the correct content of the testing speech "wu ming" can be recognized based on this voice tag item.
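The combining step above can be sketched as follows: candidates belonging to the same voice tag item are merged, the merged candidate's acoustic score is the weighted sum of the member scores using the confidence score based weights recorded at registration, and the merged candidate with the highest score becomes the final result. The acoustic scores and weights below are illustrative, not taken from the patent:

```python
def combine_candidates(candidates):
    """candidates: list of (voice_tag_item, acoustic_score, weight).

    Merge candidates per voice tag item by weighted sum of acoustic scores,
    then return the (item, combined_score) pair with the highest score.
    """
    combined = {}
    for item, score, weight in candidates:
        combined[item] = combined.get(item, 0.0) + weight * score
    return max(combined.items(), key=lambda kv: kv[1])

# Illustrative candidate list: "wang ming" has three candidates (one per
# pronunciation tag of its voice tag item), "wu ming" has two.
candidates = [
    ("wu ming", 90.0, 0.6),
    ("wang ming", 85.0, 0.5),
    ("wu ming", 80.0, 0.4),
    ("wang ming", 70.0, 0.3),
    ("wang ming", 60.0, 0.2),
]
result = combine_candidates(candidates)
```

With these assumed numbers, the combined "wu ming" item scores 0.6*90 + 0.4*80 = 86 against 75.5 for "wang ming", so the testing speech is matched to the "wu ming" voice tag item.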
  • the above is a description of the voice-tag method based on confidence score according to the second embodiment of the present invention.
  • the negative effects of multi-pronunciation registration on application of voice tags can be reduced.
  • the confusion of recognition network consisting of voice tags can be reduced, and the recognition performance of voice tags, especially the dictionary items can be improved.
  • since the method of the present embodiment still keeps the advantages of the multi-pronunciation registration, it can reduce the negative effect due to phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
  • the present invention provides a voice-tag apparatus based on confidence score which will be described in detail below in conjunction with drawings.
  • Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention.
  • the voice-tag apparatus 40 based on confidence score of the present embodiment comprises: phoneme recognition unit 41, confidence score calculating unit 42, pronunciation tag selecting unit 43, voice tag creating unit 44, testing speech recognizing unit 45 and recognition network 46.
  • the phoneme recognition unit 41 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech.
  • the plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech.
  • the phoneme recognition unit 41 is implemented based on a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, and the phoneme recognition unit 41 performs phoneme recognition on the registration speech inputted by the user to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
  • the phoneme recognition unit 41 may be implemented with any phoneme recognition system or method presently known or future knowable, there is no limitation on this in the present invention.
  • the confidence score calculating unit 42 is configured to calculate a confidence score for each of the plurality of pronunciation tags.
  • the confidence score calculating unit 42 calculates a confidence score for each of the phoneme sequences. In addition, in the case that the plurality of pronunciation tags of the registration speech are phoneme lattice, the confidence score calculating unit 42 calculates a confidence score for a single phoneme on each of arcs in the phoneme lattice.
  • the confidence score calculating unit 42 may be implemented based on any method presently known or future knowable for calculating a confidence score for a phoneme sequence or a single phoneme, for example the posterior probability based confidence score calculating method or the anti-model based confidence score calculating method.
  • the pronunciation tag selecting unit 43 is configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
  • the pronunciation tag selecting unit 43 selects the pronunciation tag with the highest confidence score from the plurality of pronunciation tags as the at least one best pronunciation tag.
  • the pronunciation tag selecting unit 43 selects the pronunciation tags whose confidence scores are higher than a preset confidence threshold from the plurality of pronunciation tags as the at least one best pronunciation tag.
  • the confidence threshold may be decided based on testing data prepared in advance and according to experience of the developers.
  • the voice tag creating unit 44 is configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into the recognition network 46.
  • the testing speech recognizing unit 45 is configured to, when a user inputs a testing speech, perform recognition on the testing speech by using the recognition network 46 to recognize the content of the testing speech.
  • the above is a description of the voice-tag apparatus based on confidence score of the embodiment.
  • the voice-tag apparatus 40 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the first embodiment described above.
  • the recognition network 46 is included in the voice-tag apparatus 40 based on confidence score in this embodiment, it is not limited to this. The recognition network 46 may also reside outside the voice-tag apparatus 40 based on confidence score in other embodiments.
  • the voice-tag apparatus 50 based on confidence score of the present embodiment comprises: phoneme recognition unit 51, confidence score calculating unit 52, confidence weight determining unit 53, voice tag creating unit 54, testing speech recognizing unit 55, recognition result combining unit 56 and recognition network 57.
  • the phoneme recognition unit 51 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech.
  • the confidence score calculating unit 52 is configured to calculate a confidence score for each of the plurality of pronunciation tags of the registration speech.
  • the confidence weight determining unit 53 is configured to determine a confidence score based weight for each of the plurality of pronunciation tags. Herein, the higher the confidence score of a pronunciation tag is, the larger weight will be determined for the pronunciation tag.
  • the confidence weight determining unit 53 calculates the ratio of the confidence score of the pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags as the confidence score based weight of the pronunciation tag.
  • the voice tag creating unit 54 is configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into the recognition network 57 and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags.
  • the voice tag creating unit 54 selects at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags and creates the voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag.
  • the testing speech recognizing unit 55 is configured to, when a user inputs a testing speech, perform recognition on the testing speech by using the recognition network 57 to obtain a plurality of best recognition result candidates of the testing speech.
  • the recognition result combining unit 56 is configured to combine a plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates obtained by the testing speech recognizing unit 55 with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
  • the recognition result combining unit 56, for the plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates, performs the following process: it combines the plurality of recognition result candidates into one recognition result candidate, and calculates a weighted sum of the acoustic scores of the plurality of recognition result candidates on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
  • the recognition result combining unit 56 selects the best recognition result candidate, namely the one with the highest acoustic score among the recognition result candidates formed by the plurality of best recognition result candidates after combination as the final recognition result.
  • the above is a description of the voice-tag apparatus based on confidence score of the embodiment.
  • the voice-tag apparatus 50 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the second embodiment described above.
  • the recognition network 57 is included in the voice-tag apparatus 50 based on confidence score in this embodiment, it is not limited to this. The recognition network 57 may also reside outside the voice-tag apparatus 50 based on confidence score in other embodiments.
  • the voice-tag apparatuses 40, 50 based on confidence score of the third and fourth embodiments as well as their components can be implemented with specifically designed circuits or chips or be implemented by a computing device (information processing device) executing corresponding programs.

Abstract

The invention provides a voice-tag method and apparatus based on confidence score. The voice-tag method based on confidence score comprises: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; calculating a confidence score for each of the plurality of pronunciation tags; selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network. The present invention optimizes voice tags based on confidence score to reduce the confusion of the recognition network consisting of voice tags in the multi-pronunciation registration based voice-tag technology.

Description

VOICE-TAG METHOD AND APPARATUS BASED ON CONFIDENCE SCORE
TECHNICAL FIELD
[0001] The present invention relates to information processing technology, specifically to a voice-tag method and apparatus based on confidence score.
TECHNICAL BACKGROUND
[0002] The voice-tag technology is an application of speech recognition technology, which is widely used especially in embedded speech recognition systems.
[0003] The working process of a voice-tag technology based system is as follows: firstly, the voice registration process is performed, that is, the user inputs a registration speech and the system converts the registration speech into a tag which represents the pronunciation of the speech; then, the speech recognition process is performed, that is, when the user inputs a testing speech, the system performs recognition on the testing speech based on its recognition network consisting of voice tag items to determine the content of the testing speech. Usually, the recognition network of a voice-tag system consists of not only the voice tag items of registration speeches but also other items whose pronunciations are decided by a dictionary or a grapheme-to-phoneme (G2P) converting module, which can be called dictionary items.
[0004] The original voice-tag technology is usually implemented based on a template matching framework in which, in the registration process, one or more templates are extracted from a registration speech as the tags of the registration speech, and in the recognition process, the Dynamic Time Warping (DTW) algorithm is applied between the testing speech and the template tags to do matching. Recently, along with the wide use of the phoneme based Hidden Markov Model (HMM) in the speech recognition field, phoneme sequences are more often used as the pronunciation tags of registration speeches in current mainstream voice-tag methods. It should be noted that, depending on the language, the unit of pronunciation may be a voice unit other than the phoneme; for Chinese, for example, the Initial and Final sequence may be used as the voice tag of a registration speech.
[0005] In the method which uses phoneme sequences as the pronunciation tags of registration speeches, the phoneme sequences are obtained by performing phoneme recognition on the registration speeches. The advantages of phoneme sequence tags are as follows: firstly, a phoneme sequence tag occupies less memory space than a template tag; secondly, phoneme sequence tag items are easily combined with dictionary items to form new items. These advantages of phoneme sequence tags are very helpful for enlarging the number of items provided by a recognition network.
[0006] However, phoneme sequence tags also have shortcomings: firstly, under the current phoneme recognition capability, phoneme recognition errors are unavoidable, with the result that a phoneme sequence tag may not correctly represent the pronunciation of a registration speech, thereby causing recognition errors; secondly, a mismatch between registration speech and testing speech may exist, which will also cause recognition errors.
[0007] Specifically, suppose that the registration speech is "wang ming"; then the correct Initial and Final sequence corresponding to the registration speech should be "w ang m ing". However, due to the current recognition capability, the voice recognition system may give an incorrect recognition result, for example the Initial and Final sequence "w an m ing" for the registration speech, and the incorrect sequence "w an m ing" will be added into the recognition network as the pronunciation tag of the registration speech. In this case, when the testing speech is also "wang ming", if the system determines that the testing speech is nearest to the sequence "w an m ing" in the recognition network, the recognition result will be correct; however, if the system determines that the testing speech is nearest to another sequence in the recognition network, an incorrect recognition result will be obtained.
[0008] Therefore, in the phoneme sequence tag based voice-tag technology, how to reduce the recognition errors caused by the above reasons has become a focus of current research.
[0009] In order to overcome the shortages of the above phoneme sequence tag method, researchers proposed the following multi-pronunciation registration approach: for a registration speech, a voice tag item corresponding to the registration speech is constituted by a plurality of pronunciation tags based on different phoneme sequences. Specifically, when performing phoneme recognition on the registration speech, the N best phoneme sequence recognition results or phoneme lattice recognition result are obtained as the pronunciation tags of the registration speech.
[0010] Specifically, by still taking the registration speech "wang ming" as an example, suppose that the voice recognition system gave the following three best Initial and Final sequences, arrayed in the descending order of acoustic score, after recognition of the registration speech:
1. "w an m ing";
2. "w an m in";
3. "w ang m ing";
then in the multi-pronunciation registration, the above sequences are combined into a voice tag item corresponding to the registration speech "wang ming" and added into the recognition network. Therefore, in the recognition process, as long as the recognition network determines that a testing speech is nearest to any one of the above three sequences, the match between the testing speech and the registration speech "wang ming" can be carried out. Thus, the recognition rate can be improved.
[0011] By using such a multi-pronunciation registration method, the negative effects on voice recognition due to phoneme recognition errors can be obviously reduced, and the recognition performance degradation due to the mismatch between registration speech and testing speech can be alleviated.
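The matching behaviour of multi-pronunciation registration described above can be sketched in code. This is a minimal illustration only: the edit-distance matcher and the plain string tags are assumptions for clarity, not the HMM/Viterbi decoder that an actual voice-tag system would use.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

# Voice tag item for the registration speech "wang ming" holds
# all three recognized Initial and Final sequences from the example.
tag_item = [
    "w an m ing".split(),
    "w an m in".split(),
    "w ang m ing".split(),
]

testing = "w ang m ing".split()  # decoded testing speech

# The item matches if the testing speech is near to ANY member tag,
# which is why multi-pronunciation registration improves the recognition rate.
best = min(edit_distance(testing, tag) for tag in tag_item)
print(best)  # 0: an exact match with the third tag
```

Even though the first recognized sequence is wrong, the testing speech still matches the item through the third tag, illustrating the tolerance to phoneme recognition errors.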
[0012] However, whereas single-pronunciation registration adds only one phoneme sequence to the recognition network for each registration speech, multi-pronunciation registration adds a plurality of phoneme sequences, which increases the scale of the recognition network. Further, constituting a voice tag item from a plurality of pronunciation sequences also increases the confusion of the recognition network, and in particular degrades the recognition performance for dictionary items in the voice-tag system.
SUMMARY OF THE INVENTION
[0013] The present invention is proposed to resolve the above problem in the prior art, the object of which is to provide a voice-tag method and apparatus based on confidence score, in order to optimize voice tags based on confidence score in the multi-pronunciation registration technology, thereby reducing the confusion of a recognition network consisting of voice tags.
[0014] According to one aspect of the invention, there is provided a voice-tag method based on confidence score, comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; calculating a confidence score for each of the plurality of pronunciation tags; selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
[0015] According to another aspect of the invention, there is provided a voice-tag method based on confidence score, comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; determining a confidence score based weight for each of the plurality of pronunciation tags; creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly recording the confidence score based weight of each of the plurality of pronunciation tags; and when recognition on a testing speech is performed by using the recognition network, combining a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
[0016] According to a further aspect of the invention, there is provided a voice-tag apparatus based on confidence score, comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence score calculating unit configured to calculate a confidence score for each of the plurality of pronunciation tags; a pronunciation tag selecting unit configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
[0017] According to yet another aspect of the invention, there is provided a voice-tag apparatus based on confidence score, comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence weight determining unit configured to determine a confidence score based weight for each of the plurality of pronunciation tags; a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags; and a recognition result combining unit configured to, when recognition on a testing speech is performed by using the recognition network, combine a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
BRIEF DESCRIPTION OF THE DRAWINGS
It is believed that the features, advantages and purposes of the present invention will be better understood from the following description of the detailed implementation of the present invention read in conjunction with the accompanying drawings, in which:
Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention;
Fig.2 depicts an example of the phoneme lattice of a registration speech;
Fig.3 depicts a flowchart of the voice-tag method based on confidence score according to the second embodiment of the invention;
Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention; and
Fig.5 depicts a block diagram of the voice-tag apparatus based on confidence score according to the fourth embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
Next, a detailed description of preferred embodiments of the present invention will be given with reference to the drawings.
( First embodiment )
[0018] Firstly, the first embodiment of the present invention will be described in combination with Figs.1~2. Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention. In the present embodiment, the confidence score is used as the basis for selecting pronunciation tags for a registration speech.
[0019] Specifically, as shown in Fig.1, firstly, at step 105, the method performs phoneme recognition on a registration speech input by a user, to obtain a plurality of pronunciation tags of the registration speech. Specifically, the plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech. A so-called phoneme lattice is a multi-pronunciation representation generated by merging the identical parts of the plurality of phoneme sequences that represent the pronunciations of the speech.
[0020] At this step, for the registration speech input by the user, a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, is employed to perform phoneme recognition to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
[0021] However, the person skilled in the art can appreciate that, as long as a plurality of pronunciation tags can be obtained at this step, any phoneme recognition system or method presently known or knowable in the future may be employed, not limited to the above commonly used phoneme recognition system in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm; there is no special limitation on this in the present invention.
[0022] At step 110, a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech.
[0023] Specifically, in the case that the plurality of pronunciation tags of the registration speech are a plurality of best phoneme sequences, a confidence score is calculated for each of the phoneme sequences. Herein, by still taking the foregoing registration speech "wang ming" as an example, suppose that after the user inputted this registration speech "wang ming", the following three Initial and Final sequences arrayed in the descending order of acoustic score are obtained through recognition:
1. "w an m ing";
2. "w an m in";
3. "w ang m ing";
then at this step, a confidence score is calculated for each of the above three sequences, and it is supposed that the confidence scores are obtained as follows:
1. "w an m ing", confidence score: 70;
2. "w an m in", confidence score: 60;
3. "w ang m ing", confidence score: 75.
[0024] On the other hand, in the case that the plurality of pronunciation tags of the registration speech are a phoneme lattice, a confidence score is calculated for the single phoneme on each arc in the phoneme lattice.
For example, suppose that after recognition on the registration speech "wang ming", another multi-pronunciation representation, namely the Initial and Final lattice shown in Fig.2 corresponding to the above Initial and Final sequences 1~3, is obtained; it is generated by merging the identical parts of the above sequences 1~3. In this case, at this step, for the Initial and Final lattice, a confidence score is calculated for each element (Initial or Final) "w", "an", "ang", "m", "in", "ing" on the arcs.
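The Initial and Final lattice of Fig.2 can be sketched as a small graph in which shared parts of the three sequences are merged and alternative arcs carry the differing units. The node numbering and adjacency-list representation below are assumptions for illustration, not the system's internal format:

```python
# Assumed node layout: 0 --w--> 1 --an/ang--> 2 --m--> 3 --in/ing--> 4
lattice = {
    0: [("w", 1)],
    1: [("an", 2), ("ang", 2)],
    2: [("m", 3)],
    3: [("in", 4), ("ing", 4)],
    4: [],  # final node: no outgoing arcs
}

def paths(node=0):
    """Enumerate every pronunciation encoded by the lattice."""
    if not lattice[node]:
        yield []
        return
    for label, nxt in lattice[node]:
        for rest in paths(nxt):
            yield [label] + rest

print(sorted(" ".join(p) for p in paths()))
# ['w an m in', 'w an m ing', 'w ang m in', 'w ang m ing']
```

Note that merging the three sequences also introduces a fourth path ("w ang m in") that was not among the original recognition results; the lattice is a compact superset of the N-best list.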
[0025] The person skilled in the art can appreciate that at this step, any method presently known or knowable in the future for calculating a confidence score for a phoneme sequence or a single phoneme may be adopted, for example, the posterior-probability based confidence score calculating method or the anti-model based confidence score calculating method.
[0026] Next, at step 115, at least one best pronunciation tag is selected from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
[0027] In an embodiment, at this step, the pronunciation tag with the highest confidence score is selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
[0028] In this case, when the plurality of pronunciation tags are a plurality of best phoneme sequences of the registration speech, on the basis of the confidence scores of the respective phoneme sequences, the phoneme sequence with the highest confidence score is selected from the plurality of best phoneme sequences as the best pronunciation tag. On the other hand, when the plurality of pronunciation tags are the phoneme lattice of the registration speech, on the basis of the confidence scores of the phonemes on the respective arcs in the phoneme lattice, the path whose arcs carry the phonemes with the highest confidence scores is retained in the phoneme lattice, while the other arcs are removed, thereby constructing the best pronunciation tag of the registration speech from the retained path.
[0029] In addition, in another embodiment, at this step, the pronunciation tags whose confidence scores are higher than a preset confidence threshold are selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
[0030] In this case, when the plurality of pronunciation tags are a plurality of best phoneme sequences of the registration speech, on the basis of the confidence scores of the respective phoneme sequences, the phoneme sequences whose confidence scores are higher than the preset confidence threshold are selected from the plurality of best phoneme sequences. For example, in the case of the above three sequences 1~3 of the registration speech "wang ming", if the confidence threshold is set to 65, then the sequences 1 and 3, whose confidence scores are higher than the confidence threshold, will be selected from the three sequences 1~3 as the best pronunciation tags of the registration speech "wang ming".
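The threshold-based selection just described can be sketched directly; the scores and the threshold value 65 come from the running example:

```python
# Recognized sequences with their confidence scores (from the example).
tags = [
    ("w an m ing", 70),
    ("w an m in", 60),
    ("w ang m ing", 75),
]
THRESHOLD = 65  # preset confidence threshold, decided from testing data

# Keep only the pronunciation tags scoring above the threshold.
best_tags = [seq for seq, score in tags if score > THRESHOLD]
print(best_tags)  # ['w an m ing', 'w ang m ing']
```

Sequence 2 ("w an m in", score 60) is discarded, so only two of the three tags enter the voice tag item, shrinking the recognition network while keeping the most trustworthy alternatives.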
[0031] On the other hand, when the plurality of pronunciation tags are the phoneme lattice of the registration speech, on the basis of the confidence scores of the phonemes on the respective arcs in the phoneme lattice, the arcs whose phonemes have confidence scores lower than the preset confidence threshold are removed from the phoneme lattice, thereby constructing the best pronunciation tags of the registration speech from the remaining arcs.
[0032] Herein, the above confidence threshold may be decided according to the experience of developers. Specifically, for example, firstly, a large amount of testing data is prepared, then the phoneme recognition system used at step 105 is applied to perform phoneme recognition on the testing data, and further confidence scores are calculated for the phoneme recognition results, and then a suitable confidence threshold may be set with reference to the confidence scores of high quality recognition results in order to ensure that the high quality recognition results can be selected with the confidence threshold.
[0033] At step 120, a voice tag item corresponding to the registration speech is created based on the at least one best pronunciation tag and added into a recognition network. Thus, when a user inputs a testing speech, recognition can be performed on the testing speech by using the recognition network. Since the creation and addition of a voice tag item are existing knowledge in the art, the detailed description thereof is omitted.
[0034] The above is a description of the voice-tag method based on confidence score according to the first embodiment of the present invention. In the present embodiment, by selecting at least one best pronunciation tag from a plurality of pronunciation tags of a registration speech based on confidence scores to create a voice tag item corresponding to the registration speech, the voice tags can be optimized and the negative effects of multi-pronunciation registration on the application of voice tags can be reduced. Specifically, the scale of the recognition network consisting of voice tags can be decreased, the confusion of the recognition network can be reduced, and the recognition performance of voice tags, especially of the dictionary items, can be improved. At the same time, the method of the present embodiment still keeps the advantages of multi-pronunciation registration to some degree: it can reduce the negative effect of phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
( Second embodiment )
[0035] Next, the voice-tag method based on confidence score according to the second embodiment of the present invention will be described in combination with Fig.3. In the present embodiment, the confidence score is used to combine a plurality of pronunciation tags of a registration speech.
[0036] Specifically, as shown in Fig.3, firstly, at step 305, the method performs phoneme recognition on a registration speech inputted by a user, to obtain a plurality of pronunciation tags of the registration speech. Since the step is the same as the above step 105 in Fig.l, the detailed description thereof is omitted.
[0037] At step 310, a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech. Since the step is the same as the above step 110 in Fig.l, the detailed description thereof is omitted.
[0038] Next, at step 315, a confidence score based weight is determined for each of the pronunciation tags of the registration speech. Herein, the higher the confidence score of a pronunciation tag, the larger the weight determined for that pronunciation tag.
[0039] In an embodiment, at this step, the confidence score based weight is calculated for each of the plurality of pronunciation tags in accordance with the following equation (1):
weight_i = confidence score_i / (confidence score_1 + confidence score_2 + ... + confidence score_n)   (1)
wherein weight_i denotes the confidence score based weight of the ith pronunciation tag, confidence score_1 denotes the confidence score of the first pronunciation tag, confidence score_2 denotes the confidence score of the second pronunciation tag, ..., confidence score_n denotes the confidence score of the nth pronunciation tag, and n denotes the number of the plurality of pronunciation tags. In other words, in accordance with the above equation (1), the confidence score based weight of each of the plurality of pronunciation tags is the ratio of the confidence score of this pronunciation tag to the sum of the confidence scores of all the plurality of pronunciation tags.
[0040] Next, the description will be given in combination with a specific example. By still taking the foregoing registration speech "wang ming" as an example, suppose that the recognition results and confidence score calculation results are the same as those of the first embodiment, namely:
1. "w an m ing", confidence score: 70;
2. "w an m in", confidence score: 60;
3. "w ang m ing", confidence score: 75;
then in this case, at this step, the confidence score based weights are calculated in accordance with the above equation (1) as follows:
1. "w an m ing", confidence score: 70, weight = 70 / (70+60+75) = 0.34;
2. "w an m in", confidence score: 60, weight = 60 / (70+60+75) = 0.29;
3. "w ang m ing", confidence score: 75, weight = 75 / (70+60+75) = 0.37.
That is, in the present embodiment, each of the plurality of pronunciation tags of the registration speech is defined as a component of the voice tag of the registration speech by using the confidence score based weight.
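Equation (1) above can be expressed directly in code; the scores are those of the running "wang ming" example, and the two-decimal rounding matches the weights listed above:

```python
# Confidence scores of the three pronunciation tags (from the example).
scores = {"w an m ing": 70, "w an m in": 60, "w ang m ing": 75}
total = sum(scores.values())  # 70 + 60 + 75 = 205

# Equation (1): weight_i = confidence score_i / sum of all confidence scores.
weights = {tag: round(s / total, 2) for tag, s in scores.items()}
print(weights)
# {'w an m ing': 0.34, 'w an m in': 0.29, 'w ang m ing': 0.37}
```

By construction the unrounded weights sum to 1, so each tag's weight expresses its share of the total confidence in the voice tag item.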
[0041] Next, at step 320, a voice tag item corresponding to the registration speech is created based on the plurality of pronunciation tags of the registration speech to add into a recognition network and meanwhile the confidence score based weight of each of the plurality of pronunciation tags is recorded.
[0042] At this step, the voice tag item corresponding to the registration speech may be created directly based on the plurality of pronunciation tags obtained for the registration speech at step 305, or be created based on at least one best pronunciation tag which is selected from the plurality of pronunciation tags on the basis of the confidence score of each of the plurality of pronunciation tags like step 115 in the first embodiment. As to this step, the foregoing detailed description about step 115 may be referred to, and the detailed description of this step is accordingly omitted.
[0043] Next, at step 325, when a user inputs a testing speech, the recognition is performed on the testing speech by using the recognition network to obtain a plurality of best recognition result candidates of the testing speech.
[0044] Specifically, at this step, when performing recognition on the testing speech by using the recognition network, all pronunciation sequences (namely pronunciation tags) near to the testing speech are obtained from the recognition network through matching, as the plurality of best recognition result candidates of the testing speech.
[0045] For example, in the case that the user inputs the testing speech "wu ming", suppose that by obtaining all sequences near to the testing speech, the recognition network obtains the nearest pronunciation sequence "w u m ing" and a similar sequence, as well as the three sequences in the voice tag item corresponding to the registration speech "wang ming", and finally outputs the following recognition results arrayed in the descending order of acoustic score for the testing speech:
1. w an m in, acoustic score: 90;
2. w u m ing, acoustic score: 89;
3. w u n ing, acoustic score: 87;
4. w an m ing, acoustic score: 80;
5. w ang m ing, acoustic score: 70.
[0046] At step 330, among the plurality of best recognition result candidates of the testing speech, the plurality of recognition result candidates belonging to a same voice tag item are combined with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
[0047] Specifically, at this step, the plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates of the testing speech are combined into one recognition result candidate, and a weighted sum of the acoustic scores of the plurality of recognition result candidates belonging to a same voice tag item is calculated on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
[0048] Next, the description will be given in combination with a specific example. By still taking the foregoing testing speech "wu ming" and its recognition result candidates 1~5 as an example, suppose that according to the recognition network, the recognition result candidates 1, 4 and 5 belong to a same voice tag item, namely the voice tag item corresponding to the registration speech "wang ming", while the recognition result candidates 2 and 3 belong to different voice tag items; then at this step, the recognition result candidates 1, 4 and 5 will be combined into one recognition result candidate, and a weighted sum of the acoustic scores of the recognition result candidates 1, 4 and 5 will be calculated on the basis of the confidence score based weights of the respective pronunciation tags corresponding to the recognition result candidates 1, 4 and 5, as the acoustic score of the combined recognition result candidate. Thereby, through combination, the recognition result candidates will become:
1, 4, 5. w an m in (w an m ing, w ang m ing), acoustic score after combination: 90*0.29+80*0.34+70*0.37=79.2;
2. w u m ing, acoustic score: 89;
3. w u n ing, acoustic score: 87.
[0049] Thus, the recognition result candidates 1, 4 and 5 are combined into one recognition result candidate, which corresponds to the voice tag item of the registration speech "wang ming".
[0050] Herein, it should be noted that although the recognition result candidates 1, 4 and 5 are combined into one recognition result candidate, since before combination they belong to one voice tag item and correspond to the registration speech "wang ming", the combined recognition result candidate can still correspond to the registration speech "wang ming".
[0052] At step 335, the recognition result candidate with the highest acoustic score is selected from the recognition result candidates formed by the plurality of best recognition result candidates after combination as the final recognition result.
[0054] Therefore, in the above example, since the recognition result candidate 2, "w u m ing", becomes the one with the highest acoustic score after the weight based combination of the recognition result candidates 1~5, it will be selected as the final recognition result, and thus the correct recognition result can be obtained.
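The combination and selection of steps 330 and 335 can be sketched as follows. The item labels ("item2", "item3") and the per-candidate weight column are assumptions matching the worked example; candidates 2 and 3 stand alone, so their weight is taken as 1.0:

```python
# (sequence, voice tag item, acoustic score, confidence-based weight)
candidates = [
    ("w an m in",   "wang ming", 90, 0.29),
    ("w u m ing",   "item2",     89, 1.0),
    ("w u n ing",   "item3",     87, 1.0),
    ("w an m ing",  "wang ming", 80, 0.34),
    ("w ang m ing", "wang ming", 70, 0.37),
]

# Step 330: merge candidates of the same voice tag item, accumulating
# a weighted sum of their acoustic scores.
combined = {}
for seq, item, acoustic, weight in candidates:
    combined[item] = combined.get(item, 0.0) + acoustic * weight

# Step 335: the candidate with the highest combined acoustic score wins.
final_item = max(combined, key=combined.get)
print(round(combined["wang ming"], 1), final_item)
# 79.2 item2
```

The three "wang ming" candidates collapse to a combined score of 79.2, so the correct item for "w u m ing" (score 89) now outranks them, exactly as in paragraph [0054].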
[0055] In addition, if it is supposed that the recognition result candidates 2 and 3 of the testing speech "wu ming" also belong to a same voice tag item, then the recognition result candidates 2 and 3 will also be combined on the basis of the confidence score based weights. Further, if the combined recognition result of the recognition result candidates 2 and 3 still has the highest acoustic score, then it will be selected, and the voice tag item to which the recognition result candidates 2 and 3 belong will become the one matching the testing speech "wu ming"; thus the correct content of the testing speech "wu ming" can be recognized based on this voice tag item.
[0056] The above is a description of the voice-tag method based on confidence score according to the second embodiment of the present invention. In the present embodiment, by combining recognition result candidates belonging to a same voice tag item with confidence score based weights, the negative effects of multi-pronunciation registration on the application of voice tags can be reduced. Specifically, the confusion of the recognition network consisting of voice tags can be reduced, and the recognition performance of voice tags, especially of the dictionary items, can be improved. At the same time, the method of the present embodiment still keeps the advantages of multi-pronunciation registration: it can reduce the negative effect of phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
( Third embodiment)
[0057] Under the same inventive conception, the present invention provides a voice-tag apparatus based on confidence score which will be described in detail below in conjunction with drawings.
[0058] Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention. As shown in Fig.4, the voice-tag apparatus 40 based on confidence score of the present embodiment comprises: phoneme recognition unit 41, confidence score calculating unit 42, pronunciation tag selecting unit 43, voice tag creating unit 44, testing speech recognizing unit 45 and recognition network 46.
[0059] Specifically, the phoneme recognition unit 41 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech. The plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech.
[0060] In an embodiment, the phoneme recognition unit 41 is implemented based on a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, and the phoneme recognition unit 41 performs phoneme recognition on the registration speech inputted by the user to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
[0061] Of course, the invention is not limited to this; the phoneme recognition unit 41 may be implemented with any phoneme recognition system or method presently known or knowable in the future, and there is no limitation on this in the present invention.
[0062] The confidence score calculating unit 42 is configured to calculate a confidence score for each of the plurality of pronunciation tags.
[0063] Specifically, in the case that the plurality of pronunciation tags of the registration speech are a plurality of best phoneme sequences, the confidence score calculating unit 42 calculates a confidence score for each of the phoneme sequences. In addition, in the case that the plurality of pronunciation tags of the registration speech are a phoneme lattice, the confidence score calculating unit 42 calculates a confidence score for a single phoneme on each of the arcs in the phoneme lattice.
[0064] The confidence score calculating unit 42 may be implemented based on any method presently known or knowable in the future for calculating a confidence score for a phoneme sequence or a single phoneme, for example, the posterior-probability based confidence score calculating method or the anti-model based confidence score calculating method.
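As one plausible instantiation of the posterior-probability style of confidence named above, the log-domain acoustic scores of the N-best candidates can be normalized into posteriors; the function name is hypothetical and this is only a sketch, not the patented method itself.

```python
import math

def posterior_confidence(acoustic_scores):
    """Softmax over the log-domain acoustic scores of the N-best
    pronunciation tags: each tag's confidence is its share of the
    total probability mass."""
    m = max(acoustic_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in acoustic_scores]
    total = sum(exps)
    return [e / total for e in exps]
```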
[0065] The pronunciation tag selecting unit 43 is configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
[0066] In an embodiment, the pronunciation tag selecting unit 43 selects the pronunciation tag with the highest confidence score from the plurality of pronunciation tags as the at least one best pronunciation tag.
[0067] In addition, in another embodiment, the pronunciation tag selecting unit 43 selects the pronunciation tags whose confidence scores are higher than a preset confidence threshold from the plurality of pronunciation tags as the at least one best pronunciation tag. As mentioned above, the confidence threshold may be decided based on testing data prepared in advance and according to the experience of the developers.
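The two selection strategies of paragraphs [0066] and [0067] can be sketched together as follows; the helper name, the tag representation as phoneme strings, and the fallback to the single best tag when no tag passes the threshold are illustrative assumptions, not part of the described method.

```python
def select_best_tags(tags, scores, threshold=None):
    """Select the best pronunciation tag(s): the single highest-scoring
    tag when no threshold is given, otherwise every tag whose confidence
    exceeds the preset threshold (falling back to the single best tag if
    none passes)."""
    best = max(zip(tags, scores), key=lambda pair: pair[1])[0]
    if threshold is None:
        return [best]
    chosen = [t for t, s in zip(tags, scores) if s > threshold]
    return chosen or [best]
```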
[0068] The voice tag creating unit 44 is configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into the recognition network 46.
[0069] The testing speech recognizing unit 45 is configured to perform recognition on a testing speech by using the recognition network 46 to recognize the content of the testing speech when a user inputs the testing speech.
[0070] The above is a description of the voice-tag apparatus based on confidence score of the embodiment. The voice-tag apparatus 40 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the first embodiment described above.
It should be noted that although the recognition network 46 is included in the voice-tag apparatus 40 based on confidence score in this embodiment, it is not limited to this. The recognition network 46 may also reside outside the voice-tag apparatus 40 based on confidence score in other embodiments.
(Fourth embodiment)
[0071] Next, the voice-tag apparatus based on confidence score according to the fourth embodiment of the present invention will be described in combination with Fig.5.
[0072] As shown in Fig.5, the voice-tag apparatus 50 based on confidence score of the present embodiment comprises: phoneme recognition unit 51, confidence score calculating unit 52, confidence weight determining unit 53, voice tag creating unit 54, testing speech recognizing unit 55, recognition result combining unit 56 and recognition network 57.
[0073] Specifically, the phoneme recognition unit 51 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech.
[0074] The confidence score calculating unit 52 is configured to calculate a confidence score for each of the plurality of pronunciation tags of the registration speech.
[0075] The confidence weight determining unit 53 is configured to determine a confidence score based weight for each of the plurality of pronunciation tags. Herein, the higher the confidence score of a pronunciation tag is, the larger the weight determined for that pronunciation tag will be.
[0076] In an embodiment, the confidence weight determining unit 53, for each of the plurality of pronunciation tags, calculates the ratio of the confidence score of the pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags as the confidence score based weight of the pronunciation tag.
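The ratio computation of paragraph [0076] can be sketched directly; the function name is hypothetical. By construction the resulting weights sum to one.

```python
def confidence_weights(scores):
    """Weight of each pronunciation tag = its confidence score divided
    by the sum of the confidence scores of all the tags."""
    total = sum(scores)
    return [s / total for s in scores]
```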
[0077] The voice tag creating unit 54 is configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into the recognition network 57 and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags.
[0078] In an embodiment, the voice tag creating unit 54 selects at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags and creates the voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag.
[0079] The testing speech recognizing unit 55 is configured to perform recognition on a testing speech by using the recognition network 57 to obtain a plurality of best recognition result candidates of the testing speech when a user inputs the testing speech.
[0080] The recognition result combining unit 56 is configured to combine a plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates obtained by the testing speech recognizing unit 55 with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
[0081] In an embodiment, the recognition result combining unit 56 performs the following process for the plurality of recognition result candidates belonging to the same voice tag item among the plurality of best recognition result candidates: it combines the plurality of recognition result candidates into one recognition result candidate, and calculates a weighted sum of the acoustic scores of the plurality of recognition result candidates, on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
[0082] In addition, the recognition result combining unit 56 selects the best recognition result candidate, namely the one with the highest acoustic score among the recognition result candidates formed from the plurality of best recognition result candidates after combination, as the final recognition result.
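The combination described in paragraphs [0081] and [0082] can be sketched as follows, assuming each candidate is represented as a (voice-tag item, acoustic score, confidence weight) triple; the names and the triple representation are illustrative assumptions.

```python
def combine_candidates(candidates):
    """candidates: iterable of (tag_item_id, acoustic_score, weight).
    Candidates sharing a voice-tag item are merged by taking the
    weighted sum of their acoustic scores; the item with the highest
    combined score is returned as the final recognition result."""
    combined = {}
    for item_id, score, weight in candidates:
        combined[item_id] = combined.get(item_id, 0.0) + weight * score
    return max(combined.items(), key=lambda kv: kv[1])
```

Merging candidates of the same voice tag item in this way is what keeps the multiple registered pronunciations from competing against each other at recognition time.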
[0083] The above is a description of the voice-tag apparatus based on confidence score of the embodiment. The voice-tag apparatus 50 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the second embodiment described above.
It should be noted that although the recognition network 57 is included in the voice-tag apparatus 50 based on confidence score in this embodiment, it is not limited to this. The recognition network 57 may also reside outside the voice-tag apparatus 50 based on confidence score in other embodiments.
[0084] It can be appreciated by the person skilled in the art that the voice-tag apparatuses 40, 50 based on confidence score of the third and fourth embodiments as well as their components can be implemented with specifically designed circuits or chips or be implemented by a computing device (information processing device) executing corresponding programs.
[0085] While the voice-tag method and apparatus based on confidence score of the present invention have been described in detail with some exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art may make various variations and modifications within the spirit and scope of the present invention. Therefore, the present invention is not limited to these embodiments; its scope is defined only by the appended claims.

Claims

1. A voice-tag method based on confidence score, comprising:
performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
calculating a confidence score for each of the plurality of pronunciation tags;
selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and
creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
2. A voice-tag method based on confidence score, comprising:
performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
determining a confidence score based weight for each of the plurality of pronunciation tags;
creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly recording the confidence score based weight of each of the plurality of pronunciation tags; and
when recognition on a testing speech is performed by using the recognition network, combining a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
3. The method according to claim 2, wherein the step of determining a confidence score based weight for each of the plurality of pronunciation tags further comprises: calculating a confidence score for each of the plurality of pronunciation tags; and determining the confidence score based weight for each of the plurality of pronunciation tags, wherein the higher the confidence score of the pronunciation tag is, the larger weight will be determined for the pronunciation tag.
4. The method according to claim 2, wherein:
the confidence score based weight of each of the plurality of pronunciation tags is the ratio of the confidence score of the pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags.
5. The method according to claim 2, wherein the step of creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags further comprises:
selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and
creating the voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag.
6. The method according to claim 1 or 5, wherein the step of selecting at least one best pronunciation tag further comprises:
selecting the pronunciation tag with the highest confidence score from the plurality of pronunciation tags as the at least one best pronunciation tag.
7. The method according to claim 1 or 5, wherein the step of selecting at least one best pronunciation tag further comprises:
selecting the pronunciation tags whose confidence scores are higher than a preset confidence threshold from the plurality of pronunciation tags as the at least one best pronunciation tag.
8. The method according to claim 2, wherein the step of combining further comprises:
for the plurality of recognition result candidates belonging to a same voice tag item among the recognition result candidates:
combining the plurality of recognition result candidates into one recognition result candidate; and
calculating a weighted sum of the acoustic scores of the plurality of recognition result candidates on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
9. A voice-tag apparatus based on confidence score, comprising:
a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
a confidence score calculating unit configured to calculate a confidence score for each of the plurality of pronunciation tags;
a pronunciation tag selecting unit configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and
a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
10. A voice-tag apparatus based on confidence score, comprising:
a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
a confidence weight determining unit configured to determine a confidence score based weight for each of the plurality of pronunciation tags;
a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags; and
a recognition result combining unit configured to, when recognition on a testing speech is performed by using the recognition network, combine a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
PCT/IB2010/052954 2010-06-29 2010-06-29 Voice-tag method and apparatus based on confidence score WO2012001458A1 (en)
