CN106782560B - Method and device for determining target recognition text
- Publication number
- CN106782560B (grant) · CN201710127503.9A / CN201710127503A (application)
- Authority
- CN
- China
- Prior art keywords
- text
- recognition
- determined
- texts
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Abstract
The application provides a method and a device for determining a target recognition text. The method includes: determining a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to the speech data to be recognized, where the determined recognition text is the part that is the same across the candidate recognition texts and the to-be-determined recognition text is the part that differs; calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, where the target comparison text is a text in a preset text library whose sentence-pattern structure is consistent with that of the candidate recognition texts and which contains the determined recognition text; and configuring, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text corresponding to the maximum similarity. The target recognition text is thereby further screened from the candidate recognition texts, improving its accuracy.
Description
Technical Field
The present application relates to speech recognition technologies, and in particular, to a method and an apparatus for determining a target recognition text.
Background
With the development of voice control technology, more and more intelligent devices have a speech recognition function, for example, smart televisions, smart refrigerators and smart air conditioners with a voice control function, and smartphones and computers with a voice input function.
Current speech recognition mainly comprises speech preprocessing, acoustic model decoding, pronunciation dictionary lookup, language model decoding and the like. Speech preprocessing performs simple processing on the received speech signal to obtain a speech feature file. Acoustic model decoding takes the speech feature file as input and produces the most probable phoneme file. The phoneme information is then converted into possible character combinations by querying the pronunciation dictionary, and high-probability character combinations are obtained as candidate recognition results through the context association information of the language model. Because the corpus sources of the language model are broad, the candidate recognition results cannot guarantee accuracy, so an accurate recognition result needs to be screened out by some method.
However, no suitable selection method exists in the prior art.
Summary
The application provides a method and a device for determining a target recognition text, which are used for screening out an accurate recognition result from candidate recognition results of voice data to be recognized.
The first aspect of the present application provides a method for determining a target recognition text from at least two candidate recognition texts, comprising:
determining a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to speech data to be recognized, wherein the determined recognition text is the part that is the same across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs among them;
calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, wherein the target comparison text is a text in a preset text library whose sentence-pattern structure is consistent with that of the candidate recognition texts, and the target comparison text comprises the determined recognition text;
and configuring, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text corresponding to the maximum similarity.
A second aspect of the present application provides an apparatus for determining a target recognition text from candidate recognition texts, including:
a first determining module, configured to determine a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to speech data to be recognized, wherein the determined recognition text is the part that is the same across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs among them;
a calculating module, configured to calculate the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, wherein the target comparison text is a text in a preset text library whose sentence-pattern structure is consistent with that of the candidate recognition texts, and the target comparison text comprises the determined recognition text;
and a second determining module, configured to configure, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text corresponding to the maximum similarity.
The beneficial effects of the application are as follows:
According to the method for determining the target recognition text, a determined recognition text and a to-be-determined recognition text are first identified in at least two candidate recognition texts corresponding to the speech data to be recognized. For the to-be-determined recognition text, the similarity to the text at the corresponding position of the target comparison text is calculated, the to-be-determined recognition text corresponding to the maximum similarity is determined as the correct result for the speech data to be recognized, and the candidate recognition text composed of that to-be-determined recognition text and the determined recognition text is configured as the target recognition text. Thus, when several candidate recognition texts with close probabilities are obtained, the to-be-determined recognition text closest to the speech data input by the user is determined according to the target comparison text with the same sentence-pattern structure, and further according to the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text; that text and the determined recognition text are then combined into the target recognition text and fed back to the user. In other words, the differing parts of candidate recognition texts with close probabilities are further resolved by reference to the target comparison text, which improves the accuracy of recognizing the speech data to be recognized and the user experience of speech recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1 is a schematic flowchart of a method for determining a target recognition text from at least two candidate recognition texts according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for determining a target recognition text from at least two candidate recognition texts according to another embodiment of the present application;
fig. 3 is a schematic structural diagram of an apparatus for determining a target recognition text from at least two candidate recognition texts according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an apparatus for determining a target recognition text from at least two candidate recognition texts according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the protection scope of the present invention.
Before the embodiments of the present invention are explained in detail, their application environment is described. The method for determining a target recognition text provided in the embodiments of the present invention is applied to a terminal; for example, the terminal may be a smart television, a smartphone or a tablet computer with an Android or iOS operating system, or a computer or PDA (Personal Digital Assistant) with a Windows or iOS operating system, which is not specifically limited in the embodiments of the present invention.
The application provides a method for determining a target recognition text from at least two candidate recognition texts: on the basis of the plurality of recognition results obtained by speech recognition, the final speech recognition text is further analyzed and selected from among them, thereby improving the accuracy of speech recognition.
Fig. 1 is a flowchart illustrating a method for determining a target recognition text from at least two candidate recognition texts according to an embodiment of the present application, as shown in fig. 1, the method includes:
s101, determining a determined recognition text and a text to be determined in at least two candidate recognition texts corresponding to the voice data to be recognized.
In a specific implementation, after a user inputs speech data to be recognized, a plurality of speech recognition texts may be recognized for reasons such as similar pronunciations or limited recognition accuracy.
For example, a user speaks the sentence "I want to listen to Gao Shengmei's songs", and a plurality of speech recognition texts may be obtained that differ only in the rendering of the singer's name (homophones of "Gao Shengmei"), and the like.
Candidate recognition texts are determined from the plurality of speech recognition texts, and the accurate recognition result is then further selected from them.
The candidate recognition texts are composed of the determined recognition text and the to-be-determined recognition text. The determined recognition text is the part that is the same across the at least two candidate recognition texts; the to-be-determined recognition text is the part that differs. For example, "I want to listen to" and "songs" are determined recognition texts, while the differing renderings of "Gao Shengmei" are the to-be-determined recognition texts.
That is, the part that is the same across the multiple candidate recognition texts can be considered an accurate result, while the differing part, the to-be-determined recognition text, needs to be further resolved to obtain a more accurate result.
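A minimal sketch of step S101, under the assumption that candidates are aligned with Python's difflib; the patent does not name a specific alignment algorithm, so SequenceMatcher is used here purely as one plausible way to locate the common and differing parts:

```python
from difflib import SequenceMatcher

def split_candidates(cand_a: str, cand_b: str):
    """Split two candidate texts into the shared (determined) parts and the
    differing (to-be-determined) parts at corresponding positions."""
    determined, to_be_determined = [], []
    for op, a0, a1, b0, b1 in SequenceMatcher(None, cand_a, cand_b).get_opcodes():
        if op == "equal":
            determined.append(cand_a[a0:a1])                     # same in both
        else:
            to_be_determined.append((cand_a[a0:a1], cand_b[b0:b1]))  # differs
    return determined, to_be_determined

# Both candidates share "I want to listen to ... songs";
# only the singer's name differs and must be resolved further.
shared, pending = split_candidates(
    "I want to listen to Gao Shengmei songs",
    "I want to listen to Gao Shengmay songs",
)
```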
S102, calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text.
The target comparison text is a text which is consistent with the sentence pattern structure of the candidate recognition text in the preset text library, and the target comparison text comprises the determined recognition text.
The preset text library may include a plurality of pre-stored sentences, vocabulary combinations, and the like. Target comparison texts whose sentence pattern is consistent with that of the candidate recognition texts can be matched in the preset text library by word sense, part of speech (noun, verb), and so on. For example, "I want to listen to Gao Shengmei's songs" may be matched to the target comparison text "I want to listen to Zhou Jielun's songs". For another example, "please give me a cup of coffee" may match the target comparison text "please give me a cup of milk".
The target comparison text also contains the above-mentioned determined recognition text; that is, "I want to listen to Zhou Jielun's songs" contains the determined recognition text "I want to listen to ... songs".
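How the target comparison text might be looked up can be sketched as follows. Representing the sentence pattern as a part-of-speech sequence, and the pos_tag tagger itself, are illustrative assumptions; the patent only says matching may use word senses and parts of speech:

```python
def match_comparison_texts(candidate, determined_parts, library, pos_tag):
    """Return library sentences whose POS pattern matches the candidate's
    and that contain every determined (common) fragment.

    pos_tag: a function mapping a sentence to its POS sequence, e.g.
    pos_tag("I want to listen to X songs") -> ["PRON","V","TO","V","TO","NAME","N"].
    """
    pattern = pos_tag(candidate)
    return [
        s for s in library
        if pos_tag(s) == pattern and all(part in s for part in determined_parts)
    ]
```

Any tagger with that shape could be plugged in; the point is only that the library sentence must share the sentence pattern and contain the determined recognition text.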
S103, configuring, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text corresponding to the maximum similarity.
Optionally, the similarity between each to-be-determined recognition text and the text at the corresponding position of the target comparison text is calculated separately. For example, the similarity between each candidate rendering of the singer's name and "Zhou Jielun" is determined respectively.
If the similarity between "Gao Shengmei" and "Zhou Jielun" is the largest, the target recognition text is configured as "I want to listen to Gao Shengmei's songs".
The similarity may be semantic similarity, or may be similarity of category, part of speech, and the like, which is not limited here.
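Steps S102-S103 can then be sketched as follows, with similarity standing for whichever measure is chosen (semantic, category, or part-of-speech similarity); all names here are assumptions made for the sketch:

```python
def pick_target(candidates, fragments, comparison_fragment, similarity):
    """candidates[i] is the full candidate text whose differing fragment is
    fragments[i]; comparison_fragment is the text at the corresponding
    position of the target comparison text."""
    scores = [similarity(f, comparison_fragment) for f in fragments]
    best = max(range(len(scores)), key=scores.__getitem__)
    # the winning fragment, recombined with the determined parts, is the
    # target recognition text
    return candidates[best]

# e.g. pick_target(texts, ["Gao Shengmei", "Gao Shengmay"], "Zhou Jielun",
#                  word_vector_similarity)
```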
In this embodiment, a determined recognition text and a to-be-determined recognition text are first identified in at least two candidate recognition texts corresponding to the speech data to be recognized. For the to-be-determined recognition text, the similarity to the text at the corresponding position of the target comparison text is calculated, the to-be-determined recognition text corresponding to the maximum similarity is determined as the correct result for the speech data to be recognized, and the candidate recognition text composed of that to-be-determined recognition text and the determined recognition text is configured as the target recognition text. Thus, when several candidate recognition texts with close probabilities are obtained, the to-be-determined recognition text closest to the speech data input by the user is determined according to the target comparison text sharing their sentence-pattern structure, and further according to the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text; that text and the determined recognition text are then combined into the target recognition text and fed back to the user. In other words, the differing parts of candidate recognition texts with close probabilities are further resolved by reference to the target comparison text, improving both the accuracy of recognizing the speech data to be recognized and the user experience of speech recognition.
Fig. 2 is a flowchart illustrating a method for determining a target recognition text from candidate recognition texts according to another embodiment of the present application. As shown in fig. 2, on the basis of fig. 1, before S101, the method further includes:
s201, acquiring a plurality of voice recognition texts corresponding to the voice data to be recognized.
When a user inputs a segment of speech, the terminal may obtain a plurality of results according to a preset speech recognition decoder. Generally, the preset speech recognition decoder may include one or more models for recognizing the speech data to be recognized. Because some pronunciations in the speech information are fuzzy, or many vocabularies share the same or similar pronunciations, a plurality of speech recognition texts may be recognized.
Specifically, after the speech data to be recognized is acquired, it may undergo preprocessing such as front-end signal processing and endpoint detection; speech features are extracted frame by frame and sent to a preset speech recognition decoder, which combines an acoustic model, a language model and a pronunciation dictionary to obtain a plurality of speech recognition texts.
The acoustic model mainly describes the likelihood of the features under a pronunciation model; a Hidden Markov Model (HMM) may be adopted. The language model mainly describes the probability of words occurring in sequence; an n-gram model is adopted, which for Chinese is called a Chinese Language Model (CLM). The language model may contain a large amount of corpus data, such as sentences and vocabularies, and the word search results can be constrained according to the statistical co-occurrence probability of adjacent words. The pronunciation dictionary is mainly used for converting between words and phonemes. In the specific conversion, acoustic model decoding searches the acoustic model with the feature file of the sound signal to generate the optimal phoneme recognition result, where phonemes can be denoted by letters. The phoneme recognition result is converted into characters by querying the pronunciation dictionary. Finally, language model decoding selects the most likely word combination, from the word combinations obtained by querying the pronunciation dictionary, as the speech recognition text.
It should be noted that, reference may be made to related technologies for the operation of recognizing the voice data to be recognized to obtain the corresponding voice recognition text, which is not described in detail in the embodiment of the present invention.
For example, the operation of recognizing the speech data to be recognized to obtain the corresponding speech recognition text can be realized through the following formulas.
W1 = argmax P(W|X)    (1)
In formula (1), W represents any character sequence stored in a database, where a character sequence consists of words or characters, and the database may be a corpus used for speech recognition; X represents the speech data input by the user; W1 represents the character sequence, among the stored character sequences, that best matches the speech data to be recognized; and P(W|X) represents the probability that the speech data to be recognized corresponds to the character sequence W. Applying the Bayesian formula, formula (1) can be rewritten as formula (2):
W2 = argmax P(X|W)P(W)/P(X)    (2)
In formula (2), W2 likewise represents the character sequence that best matches the speech data to be recognized; P(X|W) represents the probability that the character sequence is pronounced as the observed audio; P(W) represents the probability that the character sequence forms words or characters; and P(X) represents the probability that the speech data to be recognized is the observed audio information.
In the above recognition process, P (W) may be determined by a language model, and P (X | W) may be determined by an acoustic model, so as to complete speech recognition on the speech data to be recognized, and obtain a speech recognition text corresponding to the speech data to be recognized. The language model and the acoustic model will be briefly described below, respectively.
Language model
The language model usually uses the chain rule to decompose the probability of a word or character sequence into the product of the probabilities of each word or character, i.e., W is decomposed into w1, w2, w3, ..., wn-1, wn, and P(W) is determined by the following formula (3).
P(W) = P(w1)P(w2|w1)P(w3|w1,w2)...P(wn|w1,w2,...,wn-1)    (3)
In formula (3), each factor of P(W) is the probability that the current word or character occurs given that all the preceding words or characters are known.
When P(W) is determined by formula (3), an overly long conditioning history makes the computation inefficient, which affects subsequent speech recognition. Therefore, to improve the efficiency of determining P(W), it is typically determined by an n-gram language model, in which the probability of the n-th word depends only on the (n-1)-th word immediately preceding it, so that P(W) can be determined by the following formula (4).
P(W) = P(w1)P(w2|w1)P(w3|w2)...P(wn|wn-1)    (4)
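As a concrete illustration, the following is a minimal sketch of formula (4); the toy corpus, the add-one smoothing, and all names are assumptions made for this sketch, not part of the patent:

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_prob(words, unigrams, bigrams, vocab_size):
    """P(W) = P(w1) * prod P(w_n | w_{n-1}), with add-one smoothing."""
    total = sum(unigrams.values())
    p = (unigrams[words[0]] + 1) / (total + vocab_size)
    for prev, cur in zip(words, words[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return p
```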
Acoustic model
Since the pronunciation of each word must be known when scoring it, the pronunciation of each word is determined through a dictionary. The dictionary is a model alongside the acoustic model and the language model, and it converts a single word into a phoneme string. Through the dictionary, the acoustic model can determine which sounds the words in the user's speech data should produce in sequence, and a dynamic programming algorithm such as the Viterbi algorithm finds the boundary of each phoneme, thereby determining the start and end time of each phoneme and hence the degree of matching between the user's speech data and the phoneme string, that is, P(X|W).
In general, the distribution of the feature vectors of each phoneme can be estimated by a classifier such as a Gaussian mixture model. In the speech recognition stage, the probability P(x_t|s_i) that the feature vector x_t of each frame of the user's speech data was produced by the corresponding phoneme s_i is determined, and the per-frame probabilities are multiplied to obtain P(X|W).
The classifier can be trained in advance as follows: a large number of feature vectors, and the phoneme corresponding to each feature vector, are extracted from training data via Mel-frequency cepstral coefficients (MFCC), and a classifier from features to phonemes is trained on them.
It should be noted that, in practical applications, P(X|W) need not be determined only in the above manner; other manners are possible, such as directly producing P(s_i|x_t) through a neural network, converting it into P(x_t|s_i) by the Bayesian formula, and then multiplying the per-frame values to obtain P(X|W). This is only for illustration and does not limit the embodiments of the present invention.
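The posterior-to-likelihood conversion just mentioned can be sketched as follows; the array shapes, the availability of per-frame posteriors, and a precomputed frame-to-phoneme alignment are assumptions:

```python
import numpy as np

def scaled_log_likelihoods(posteriors: np.ndarray, priors: np.ndarray) -> np.ndarray:
    """posteriors: (T, S) frame-by-phoneme posteriors P(s_i|x_t); priors: (S,).
    Dividing by the priors gives P(x_t|s_i) up to the constant P(x_t)."""
    return np.log(posteriors) - np.log(priors)

def score_alignment(posteriors, priors, alignment) -> float:
    """alignment: the phoneme index for each frame (e.g. from Viterbi).
    Summing per-frame log-likelihoods is the log-domain product giving P(X|W)."""
    ll = scaled_log_likelihoods(posteriors, priors)
    return float(ll[np.arange(len(alignment)), alignment].sum())
```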
S202, determining a maximum probability value and a second-highest probability value among the probability values corresponding to the plurality of speech recognition texts.
The recognition probability of each speech recognition text can be calculated by adopting a preset algorithm according to the character combination of each speech recognition text.
Optionally, the probability value Prec of each speech recognition text may be calculated as
Prec = P(X|Q) · P(Q|W) · P(W)
where P(X|Q) is the decoding score of the acoustic model, P(Q|W) is the decoding score of the pronunciation dictionary, and P(W) is the decoding score of the language model; X denotes the feature file of the speech data to be recognized, W the recognized character combination, and Q the phoneme sequence.
Therefore, substituting the character combination and phoneme sequence of each speech recognition text, together with the feature file of the speech data to be recognized, yields the probability value corresponding to each speech recognition text.
Assuming that there are N speech recognition texts in total, the probability value of each speech recognition text is denoted Pn, where n = 1, 2, ..., N. The maximum probability value Pmax and the second-highest probability value P2max can then be selected.
S203, determining whether the difference between the maximum probability value and the second-highest probability value is larger than a preset probability threshold.
Further, the difference between the maximum probability value and the second-highest probability value may be obtained. If the difference is greater than or equal to the preset probability threshold, the speech recognition text corresponding to the maximum probability value is sufficiently accurate and may be directly determined as the target recognition text.
In a specific implementation, the differences between the maximum probability value Pmax and the other probability values Pn can be calculated in turn; optionally, their mean absolute value is taken as the acoustic probability difference EP:
EP = (1/(N-1)) · Σ|Pmax − Pn|, summed over the remaining N-1 probability values    (II)
EP reflects the distribution of the speech recognition texts and measures the direct gap between the optimal speech recognition text and the rest. When EP is greater than the preset threshold, the speech recognition text corresponding to Pmax can be directly determined as the target recognition text without further semantic analysis.
Further, when the difference between the maximum probability value and the second-highest probability value is smaller than the preset probability threshold, at least two candidate recognition texts are determined from the plurality of speech recognition texts.
Optionally, determining the at least two candidate recognition texts from the plurality of speech recognition texts may be: acquiring a first speech recognition text, among the plurality of speech recognition texts, whose probability value differs from the maximum probability value by less than the preset probability threshold, and determining the first speech recognition text and the speech recognition text corresponding to the maximum probability value as the at least two candidate recognition texts.
That is, the maximum probability value is compared with each of the other probability values; when the difference is smaller than the preset probability threshold, the speech recognition text corresponding to the compared probability value is taken as a candidate recognition text. If the difference is greater than or equal to the preset probability threshold, the probability that the corresponding speech recognition text is the target recognition text is very low, and it is not analyzed further.
Alternatively, the probability values of the plurality of speech recognition texts may be ranked, and a preset number of speech recognition texts with the highest probability values may be selected as candidate recognition texts. Or, going from top to bottom, candidate recognition texts may be selected in turn according to the difference between the probability values of adjacent speech recognition texts: if the difference between the maximum and the second-highest probability value is greater than the preset threshold, the speech recognition text with the highest probability value is directly used as the target recognition text and no further comparison is made; otherwise, the texts with the highest and second-highest probability values are taken as candidate recognition texts, the difference between the second-highest and the next probability value is examined in the same way, and so on, until some difference exceeds the preset threshold. Of course, the selection is not limited to these methods; candidate recognition texts may be determined flexibly as needed, or obtained with a formula or algorithm, as sketched below.
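The screening logic of S202-S203 can be sketched as follows. Reading EP as the mean absolute difference between the best score and the rest follows the surrounding text; the original formula image is not preserved, so this reading is an assumption:

```python
def select_candidates(probs, threshold):
    """Return the probability values whose texts remain candidates."""
    ranked = sorted(probs, reverse=True)
    if len(ranked) < 2:
        return ranked
    p_max, rest = ranked[0], ranked[1:]
    e_p = sum(p_max - p for p in rest) / len(rest)   # acoustic probability difference
    if e_p > threshold:
        return [p_max]      # best text is clear-cut, use it directly
    # otherwise keep every text whose gap to the best is under the threshold
    return [p_max] + [p for p in rest if p_max - p < threshold]
```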
If only one candidate speech recognition text is determined, this candidate speech recognition text can be directly configured as the target speech recognition text. If there are a plurality of candidate speech recognition texts, a result that best meets the actual situation is further determined as the target speech recognition text.
Optionally, calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text may include: determining the semantic similarity between them by adopting a preset word vector model.
The preset word vector model is used for identifying semantic similarity among words through word vector distances.
The preset word vector model can be obtained through word vector training. Specifically, text content is converted into low-dimensional real-valued vectors, commonly 50- or 100-dimensional. The distance between vectors can be measured by the traditional Euclidean distance or by the cosine of the included angle, which is not limited here. The vector distance reflects the semantic distance between words, i.e., the semantic similarity between words can be expressed by the vector distance. Word vector training can be performed with a word vector training tool: first, training corpus data that comprehensively covers the basic vocabulary of Chinese is obtained and preprocessed accordingly; then the training tool is invoked to generate a vector representation, for example a 50-dimensional vector for each word in the corpus, which is not limited here. The larger the vector distance, the greater the semantic distance between words, and vice versa.
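A minimal sketch of the vector comparison, assuming embeddings such as the 50- or 100-dimensional vectors described above are already available from any word2vec-style training tool:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the included angle; larger means smaller semantic distance."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```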
Specifically, the to-be-determined fragments of the candidate recognition texts and the text at the corresponding position of the target comparison text appear in the same sentence pattern at the same position, so they are very likely to denote the same kind of thing; the similarity is then further determined by the word vector distance.
Table 1 is taken as an example to illustrate:
TABLE 1 (word-vector distances between each candidate rendering of the singer's name and "Zhou Jielun"; the table's contents are not preserved in the source text)
As the word-vector distance between "Gao Shengmei" and "Zhou Jielun" is the smallest, "I want to listen to Gao Shengmei's songs" is configured as the target recognition text, which is output and displayed to the user. If the target recognition text is speech information of the control-instruction class, the related instruction can be executed according to it; details are not repeated here.
Optionally, determining the semantic similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text with the preset word vector model may be: when the to-be-determined recognition text includes at least two vocabularies, determining, with the preset word vector model, the semantic similarity between each vocabulary of the to-be-determined recognition text and the vocabulary at the corresponding position in the target comparison text respectively.
That is, the vocabularies at different positions are compared separately. For example, when comparing the recognition text "eating fruit at breakfast is beneficial to health" with the target comparison text "eating coarse grains at dinner is beneficial to health", the semantic similarity between "breakfast" and "dinner" and the semantic similarity between "fruit" and "coarse grains" can be determined respectively.
Fig. 3 is a schematic structural diagram of an apparatus for determining a target recognition text from at least two candidate recognition texts according to an embodiment of the present application, as shown in fig. 3, the apparatus includes: a first determination module 301, a calculation module 302, and a second determination module 303, wherein:
the first determining module 301 is configured to determine a recognition determining text and a recognition to be determined text in at least two candidate recognition texts corresponding to the voice data to be recognized.
The identification text to be determined is the same part in at least two candidate identification texts, and the identification text to be determined is the different part in at least two candidate identification texts.
The calculating module 302 is configured to calculate the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text.
The target comparison text is a text which is consistent with the sentence pattern structure of the candidate recognition text in a preset text library, and the target comparison text comprises the determined recognition text.
The second determining module 303 is configured to configure, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text corresponding to the maximum similarity.
In this embodiment, the first determining module 301 first determines a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to the speech data to be recognized. The calculating module 302 then calculates, for the to-be-determined recognition text, the similarity to the text at the corresponding position of the target comparison text, and the to-be-determined recognition text corresponding to the maximum similarity is determined as the correct result for the speech data to be recognized. The second determining module 303 then configures the candidate recognition text composed of that to-be-determined recognition text and the determined recognition text as the target recognition text. Thus, when several candidate recognition texts with close probabilities are obtained, the to-be-determined recognition text closest to the speech data input by the user is determined according to the target comparison text with the same sentence-pattern structure, and further according to the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text; that text and the determined recognition text are combined into the target recognition text and fed back to the user. In other words, the differing parts of candidate recognition texts with close probabilities are further resolved by reference to the target comparison text, improving both the accuracy of recognizing the speech data to be recognized and the user experience of speech recognition.
Fig. 4 is a schematic structural diagram of an apparatus for determining a target recognition text from at least two candidate recognition texts according to another embodiment of the present application, as shown in fig. 4, on the basis of fig. 3, the apparatus further includes: a third determination module 401, wherein:
the third determining module 401 is configured to determine a maximum probability value and a second approximate probability value in the plurality of speech recognition texts corresponding to the speech data to be recognized before the first determining module 301 determines the determined recognition text and the recognized text to be determined in the at least two candidate recognition texts corresponding to the speech data to be recognized.
In this embodiment, the first determining module 301 determines at least two candidate recognized texts from the plurality of speech recognized texts when a difference between the maximum probability value and the second approximate probability value is smaller than a preset probability threshold.
Optionally, the first determining module 301 is specifically configured to acquire a first speech recognition text in the plurality of speech recognition texts, where a difference between the probability value and the maximum probability value is smaller than a preset probability threshold; and determining the first voice recognition text and the voice recognition text corresponding to the maximum probability value as the at least two candidate recognition texts.
Further, the calculating module 302 is specifically configured to determine the semantic similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text by adopting a preset word vector model. The preset word vector model is used for identifying semantic similarity between words through word vector distance.
Optionally, the calculating module 302 is specifically configured to, when the to-be-determined recognition text includes at least two vocabularies, respectively determine semantic similarity between each vocabulary in the to-be-determined recognition text and a vocabulary in a corresponding position in the target comparison text by using the preset word vector model.
It should be noted that: in the apparatus for determining a target recognition text provided in the above embodiment, when determining a target recognition text from at least two candidate recognition texts, only the division of the above function modules is used as an example, in practical applications, the function distribution may be completed by different function modules according to needs, that is, the internal structure of the apparatus is divided into different function modules, so as to complete all or part of the functions described above. In addition, the apparatus for determining a target recognition text and the method for determining a target recognition text provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to perform some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A method for determining a target recognition text from at least two candidate recognition texts, comprising:
determining a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to speech data to be recognized, wherein the determined recognition text is the part that is the same across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs among them;
calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, wherein the target comparison text is a text in a preset text library whose sentence-pattern structure is consistent with that of the candidate recognition texts, the target comparison text comprises the determined recognition text, and the to-be-determined recognition text and the corresponding text of the target comparison text occupy the same position in the same sentence-pattern structure;
and configuring, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text corresponding to the maximum similarity.
2. The method according to claim 1, wherein before determining the determined recognition text and the to-be-determined recognition text in the at least two candidate recognition texts corresponding to the speech data to be recognized, the method further comprises:
determining a maximum probability value and a second-highest probability value among a plurality of speech recognition texts corresponding to the speech data to be recognized;
determining the at least two candidate recognition texts from the plurality of speech recognition texts when the difference between the maximum probability value and the second-highest probability value is smaller than a preset probability threshold.
3. The method of claim 2, wherein determining the at least two candidate recognition texts from the plurality of speech recognition texts comprises:
acquiring a first speech recognition text, among the plurality of speech recognition texts, whose probability value differs from the maximum probability value by less than the preset probability threshold;
and determining the first speech recognition text and the speech recognition text corresponding to the maximum probability value as the at least two candidate recognition texts.
4. The method according to claim 1, wherein calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text specifically comprises:
determining the semantic similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text by adopting a preset word vector model, wherein the preset word vector model is used for identifying semantic similarity between words through word vector distance.
5. The method according to claim 4, wherein determining the semantic similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text with the preset word vector model specifically comprises:
when the to-be-determined recognition text includes at least two vocabularies, respectively determining, with the preset word vector model, the semantic similarity between each vocabulary of the to-be-determined recognition text and the vocabulary at the corresponding position in the target comparison text.
6. An apparatus for determining a target recognition text from at least two candidate recognition texts, comprising:
a first determining module, configured to determine a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to speech data to be recognized, wherein the determined recognition text is the part that is the same across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs among them;
a calculating module, configured to calculate the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, wherein the target comparison text is a text in a preset text library whose sentence-pattern structure is consistent with that of the candidate recognition texts, the target comparison text comprises the determined recognition text, and the to-be-determined recognition text and the corresponding text of the target comparison text occupy the same position in the same sentence-pattern structure;
and a second determining module, configured to configure, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text corresponding to the maximum similarity.
7. The apparatus of claim 6, further comprising: a third determination module;
the third determining module is configured to determine a maximum probability value and a second-highest probability value among a plurality of speech recognition texts corresponding to the speech data to be recognized, before the first determining module determines the determined recognition text and the to-be-determined recognition text in the at least two candidate recognition texts;
the first determining module is specifically configured to determine the at least two candidate recognition texts from the plurality of speech recognition texts when the difference between the maximum probability value and the second-highest probability value is smaller than a preset probability threshold.
8. The apparatus according to claim 7, wherein the first determining module is specifically configured to acquire a first speech recognition text, among the plurality of speech recognition texts, whose probability value differs from the maximum probability value by less than the preset probability threshold; and to determine the first speech recognition text and the speech recognition text corresponding to the maximum probability value as the at least two candidate recognition texts.
9. The apparatus according to claim 6, wherein the calculating module is specifically configured to determine the semantic similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text by using a preset word vector model, wherein the preset word vector model is used to identify semantic similarity between words by word vector distance.
10. The apparatus according to claim 9, wherein the calculating module is specifically configured to, when the to-be-determined recognition text includes at least two vocabularies, respectively determine the semantic similarity between each vocabulary of the to-be-determined recognition text and the vocabulary at the corresponding position in the target comparison text by using the preset word vector model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710127503.9A CN106782560B (en) | 2017-03-06 | 2017-03-06 | Method and device for determining target recognition text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710127503.9A CN106782560B (en) | 2017-03-06 | 2017-03-06 | Method and device for determining target recognition text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106782560A (en) | 2017-05-31
CN106782560B (en) | 2020-06-16
Family
ID=58962349
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710127503.9A Active CN106782560B (en) | 2017-03-06 | 2017-03-06 | Method and device for determining target recognition text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106782560B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107329843B (en) * | 2017-06-30 | 2021-06-01 | 百度在线网络技术(北京)有限公司 | Application program voice control method, device, equipment and storage medium |
CN107277645A (en) * | 2017-07-27 | 2017-10-20 | 广东小天才科技有限公司 | Error correction method and device for subtitle content |
CN107680585B (en) * | 2017-08-23 | 2020-10-02 | 海信集团有限公司 | Chinese word segmentation method, Chinese word segmentation device and terminal |
CN108197102A (en) | 2017-12-26 | 2018-06-22 | 百度在线网络技术(北京)有限公司 | A kind of text data statistical method, device and server |
CN108417210B (en) * | 2018-01-10 | 2020-06-26 | 苏州思必驰信息科技有限公司 | Word embedding language model training method, word recognition method and system |
CN108364655B (en) * | 2018-01-31 | 2021-03-09 | 网易乐得科技有限公司 | Voice processing method, medium, device and computing equipment |
CN110188338B (en) * | 2018-02-23 | 2023-02-21 | 富士通株式会社 | Text-dependent speaker verification method and apparatus |
CN109829704A (en) * | 2018-12-07 | 2019-05-31 | 创发科技有限责任公司 | Payment channel configuration method, device and computer readable storage medium |
CN111681670B (en) * | 2019-02-25 | 2023-05-12 | 北京嘀嘀无限科技发展有限公司 | Information identification method, device, electronic equipment and storage medium |
CN109918680B (en) * | 2019-03-28 | 2023-04-07 | 腾讯科技(上海)有限公司 | Entity identification method and device and computer equipment |
CN110705274B (en) * | 2019-09-06 | 2023-03-24 | 电子科技大学 | Fusion type word meaning embedding method based on real-time learning |
CN110853635B (en) * | 2019-10-14 | 2022-04-01 | 广东美的白色家电技术创新中心有限公司 | Speech recognition method, audio annotation method, computer equipment and storage device |
CN110706707B (en) | 2019-11-13 | 2020-09-18 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and computer-readable storage medium for voice interaction |
JP7374756B2 (en) * | 2019-12-20 | 2023-11-07 | キヤノン株式会社 | Information processing device, information processing method, and program |
CN111667821A (en) * | 2020-05-27 | 2020-09-15 | 山西东易园智能家居科技有限公司 | Voice recognition system and recognition method |
CN112614263A (en) * | 2020-12-30 | 2021-04-06 | 浙江大华技术股份有限公司 | Method and device for controlling gate, computer equipment and storage medium |
CN113177114B (en) * | 2021-05-28 | 2022-10-21 | 重庆电子工程职业学院 | Natural language semantic understanding method based on deep learning |
CN113539270B (en) * | 2021-07-22 | 2024-04-02 | 阳光保险集团股份有限公司 | Position identification method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101655837A (en) * | 2009-09-08 | 2010-02-24 | 北京邮电大学 | Method for detecting and correcting error on text after voice recognition |
CN102999483A (en) * | 2011-09-16 | 2013-03-27 | 北京百度网讯科技有限公司 | Method and device for correcting text |
CN105374351A (en) * | 2014-08-12 | 2016-03-02 | 霍尼韦尔国际公司 | Methods and apparatus for interpreting received speech data using speech recognition |
CN105513586A (en) * | 2015-12-18 | 2016-04-20 | 百度在线网络技术(北京)有限公司 | Speech recognition result display method and speech recognition result display device |
CN105869642A (en) * | 2016-03-25 | 2016-08-17 | 海信集团有限公司 | Voice text error correction method and device |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003308094A (en) * | 2002-02-12 | 2003-10-31 | Advanced Telecommunication Research Institute International | Method for correcting recognition error place in speech recognition |
US7587308B2 (en) * | 2005-11-21 | 2009-09-08 | Hewlett-Packard Development Company, L.P. | Word recognition using ontologies |
CN103699530A (en) * | 2012-09-27 | 2014-04-02 | 百度在线网络技术(北京)有限公司 | Method and equipment for inputting texts in target application according to voice input information |
CN104021786B (en) * | 2014-05-15 | 2017-05-24 | 北京中科汇联信息技术有限公司 | Speech recognition method and speech recognition device |
KR102380833B1 (en) * | 2014-12-02 | 2022-03-31 | 삼성전자주식회사 | Voice recognizing method and voice recognizing appratus |
CN106326303B (en) * | 2015-06-30 | 2019-09-13 | 芋头科技(杭州)有限公司 | A kind of spoken semantic analysis system and method |
CN106469554B (en) * | 2015-08-21 | 2019-11-15 | 科大讯飞股份有限公司 | A kind of adaptive recognition methods and system |
- 2017-03-06: Application CN201710127503.9A filed; granted as patent CN106782560B (status: Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101655837A (en) * | 2009-09-08 | 2010-02-24 | 北京邮电大学 | Method for detecting and correcting error on text after voice recognition |
CN102999483A (en) * | 2011-09-16 | 2013-03-27 | 北京百度网讯科技有限公司 | Method and device for correcting text |
CN105374351A (en) * | 2014-08-12 | 2016-03-02 | 霍尼韦尔国际公司 | Methods and apparatus for interpreting received speech data using speech recognition |
CN105513586A (en) * | 2015-12-18 | 2016-04-20 | 百度在线网络技术(北京)有限公司 | Speech recognition result display method and speech recognition result display device |
CN105869642A (en) * | 2016-03-25 | 2016-08-17 | 海信集团有限公司 | Voice text error correction method and device |
Also Published As
Publication number | Publication date |
---|---|
CN106782560A (en) | 2017-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106782560B (en) | Method and device for determining target recognition text | |
US10283111B1 (en) | Disambiguation in speech recognition | |
US11496582B2 (en) | Generation of automated message responses | |
US11380330B2 (en) | Conversational recovery for voice user interface | |
US10489393B1 (en) | Quasi-semantic question answering | |
US10134388B1 (en) | Word generation for speech recognition | |
US9484021B1 (en) | Disambiguation in speech recognition | |
US20220189458A1 (en) | Speech based user recognition | |
US10210862B1 (en) | Lattice decoding and result confirmation using recurrent neural networks | |
US10332508B1 (en) | Confidence checking for speech processing and query answering | |
US10388274B1 (en) | Confidence checking for speech processing and query answering | |
US9934785B1 (en) | Identification of taste attributes from an audio signal | |
US10121467B1 (en) | Automatic speech recognition incorporating word usage information | |
US11823678B2 (en) | Proactive command framework | |
US10056078B1 (en) | Output of content based on speech-based searching and browsing requests | |
CN106463113B (en) | Predicting pronunciation in speech recognition | |
US7421387B2 (en) | Dynamic N-best algorithm to reduce recognition errors | |
US10963497B1 (en) | Multi-stage query processing | |
US20180137109A1 (en) | Methodology for automatic multilingual speech recognition | |
US9704483B2 (en) | Collaborative language model biasing | |
US11093110B1 (en) | Messaging feedback mechanism | |
US20140195238A1 (en) | Method and apparatus of confidence measure calculation | |
CN109637537B (en) | Method for automatically acquiring annotated data to optimize user-defined awakening model | |
US10366442B1 (en) | Systems and methods to update shopping cart | |
US20050187767A1 (en) | Dynamic N-best algorithm to reduce speech recognition errors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||