CN106782560B - Method and device for determining target recognition text - Google Patents

Method and device for determining target recognition text

Info

Publication number
CN106782560B
Authority
CN
China
Prior art keywords
text
recognition
determined
texts
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710127503.9A
Other languages
Chinese (zh)
Other versions
CN106782560A (en)
Inventor
陈仲帅
马宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Co Ltd
Original Assignee
Hisense Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Co Ltd filed Critical Hisense Co Ltd
Priority to CN201710127503.9A priority Critical patent/CN106782560B/en
Publication of CN106782560A publication Critical patent/CN106782560A/en
Application granted granted Critical
Publication of CN106782560B publication Critical patent/CN106782560B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a device for determining a target recognition text. The method comprises: determining a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to the voice data to be recognized, wherein the determined recognition text is the part that is the same across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs; calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, wherein the target comparison text is a text in a preset text library that is consistent with the sentence pattern structure of the candidate recognition texts, and the target comparison text comprises the determined recognition text; and configuring the candidate recognition text composed of the to-be-determined recognition text corresponding to the maximum value of the similarity and the determined recognition text as the target recognition text. The target recognition text is thus further screened from the candidate recognition texts, improving the accuracy of the target recognition text.

Description

Method and device for determining target recognition text
Technical Field
The present application relates to speech recognition technologies, and in particular, to a method and an apparatus for determining a target recognition text.
Background
With the development of voice control technology, more and more intelligent devices have a voice recognition function, for example, smart televisions, smart refrigerators and smart air conditioners with a voice control function, and smartphones and computers with a voice input function.
Current voice recognition mainly comprises processes such as voice preprocessing, acoustic model decoding, pronunciation dictionary lookup and language model decoding. Voice preprocessing performs simple processing on the received voice signal to obtain a voice feature file and the like. The input of acoustic model decoding is the feature file of the voice, and the phoneme file with the highest probability is obtained through acoustic model decoding. The phoneme information is then converted into possible character combinations by querying the pronunciation dictionary, and the character combinations with high probability are taken as candidate recognition results according to the context association information of the language model. Because the corpus sources of the language model are wide, the candidate recognition results cannot guarantee accuracy, and therefore an accurate recognition result needs to be screened out by some method.
However, no suitable selection method exists in the prior art.
Summary of the application
The application provides a method and a device for determining a target recognition text, which are used for screening out an accurate recognition result from candidate recognition results of voice data to be recognized.
The first aspect of the present application provides a method for determining a target recognition text from at least two candidate recognition texts, comprising:
determining a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to voice data to be recognized, wherein the determined recognition text is the same part in the at least two candidate recognition texts, and the to-be-determined recognition text is the different part in the at least two candidate recognition texts;
calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, wherein the target comparison text is a text in a preset text library that is consistent with the sentence pattern structure of the candidate recognition texts, and the target comparison text comprises the determined recognition text;
and configuring the candidate recognition text composed of the to-be-determined recognition text corresponding to the maximum value of the similarity and the determined recognition text as the target recognition text.
A second aspect of the present application provides an apparatus for determining a target recognition text from candidate recognition texts, including:
a first determining module, configured to determine a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to voice data to be recognized, wherein the determined recognition text is the part that is the same across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs;
a calculating module, configured to calculate the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, wherein the target comparison text is a text in a preset text library that is consistent with the sentence pattern structure of the candidate recognition texts, and the target comparison text comprises the determined recognition text;
and a second determining module, configured to configure the candidate recognition text composed of the to-be-determined recognition text corresponding to the maximum value of the similarity and the determined recognition text as the target recognition text.
The beneficial effects of this application are as follows:
In the method for determining the target recognition text provided by the application, a determined recognition text and a to-be-determined recognition text are first determined in at least two candidate recognition texts corresponding to the voice data to be recognized. For each to-be-determined recognition text, the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text is then calculated, the to-be-determined recognition text corresponding to the maximum similarity is determined as the correct result corresponding to the voice data to be recognized, and the candidate recognition text composed of this to-be-determined recognition text and the determined recognition text is configured as the target recognition text. In this way, when a plurality of candidate recognition texts with close probabilities are obtained, the to-be-determined recognition text closest to the voice data input by the user is determined according to the target comparison text with the same sentence pattern structure, and further according to the similarity between each to-be-determined recognition text and the text at the corresponding position of the target comparison text; the to-be-determined recognition text and the determined recognition text are then combined into the target recognition text and fed back to the user. That is, the differing parts of the candidate recognition texts with close probabilities are further selected by reference to the target comparison text, which improves the accuracy of recognizing the voice data to be recognized and the user experience of voice recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of a method for determining a target recognition text from at least two candidate recognition texts according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for determining a target recognition text from at least two candidate recognition texts according to another embodiment of the present application;
fig. 3 is a schematic structural diagram of an apparatus for determining a target recognition text from at least two candidate recognition texts according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an apparatus for determining a target recognition text from at least two candidate recognition texts according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings. It is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Before explaining the embodiments of the present invention in detail, an application environment of the embodiments of the present invention will be described. The method provided in the embodiments of the present invention is applied to a terminal. For example, the terminal may be a smart television, a smartphone, a tablet computer, or the like with an Android or iOS operating system, and the terminal may also be a computer, a PDA (Personal Digital Assistant), or the like with a Windows or iOS operating system, which is not specifically limited in the embodiments of the present invention.
The application provides a method for determining a target recognition text from at least two candidate recognition texts, and on the basis that a plurality of recognition results are obtained by voice recognition, a final voice recognition text is further analyzed and selected from the plurality of recognition results, so that the accuracy of the voice recognition is improved.
Fig. 1 is a flowchart illustrating a method for determining a target recognition text from at least two candidate recognition texts according to an embodiment of the present application, as shown in fig. 1, the method includes:
s101, determining a determined recognition text and a text to be determined in at least two candidate recognition texts corresponding to the voice data to be recognized.
In a specific implementation process, after a user inputs voice data to be recognized, a plurality of voice recognition texts may be recognized for reasons such as similar pronunciations or limited recognition accuracy.
For example, a user speaks the sentence "I want to listen to Gao Shengmei's songs", and a plurality of voice recognition texts may be obtained, such as "I want to listen to Gao Shengmei's songs" with the singer's name written in different homophonic characters.
Candidate recognition texts are determined from the plurality of voice recognition texts, and the accurate recognition result is further selected from them.
A candidate recognition text is composed of a determined recognition text and a to-be-determined recognition text. The determined recognition text is the part that is the same across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs. In the example above, "I want to listen to" and "songs" are the determined recognition texts, and the differently written forms of the singer's name are the to-be-determined recognition texts.
That is, the part that is the same across the multiple candidate recognition texts can be considered an accurate result, while the differing part is the to-be-determined recognition text, which needs to be further recognized to obtain a more accurate result.
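By way of illustration, the split in S101 can be obtained with an ordinary sequence alignment. The following minimal Python sketch uses the standard-library difflib for the alignment; the patent does not prescribe a particular alignment algorithm, and the character-level split and the homophone variant in the example are illustrative assumptions.

from difflib import SequenceMatcher

def split_candidates(cand_a, cand_b):
    """Split two candidate recognition texts into the determined part
    (spans shared by both) and the to-be-determined part (spans that differ)."""
    determined, undetermined = [], []
    for op, a0, a1, b0, b1 in SequenceMatcher(a=cand_a, b=cand_b).get_opcodes():
        if op == "equal":
            determined.append(cand_a[a0:a1])                     # same part
        else:
            undetermined.append((cand_a[a0:a1], cand_b[b0:b1]))  # differing part
    return determined, undetermined

# Two hypotheses for "I want to listen to Gao Shengmei's songs"; the second
# writes the singer's name with an (illustrative) homophone character.
det, undet = split_candidates("我想听高胜美的歌", "我想听高圣美的歌")
print(det)    # ['我想听高', '美的歌']  -- the shared spans
print(undet)  # [('胜', '圣')]          -- the span still to be determined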
S102, calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text.
The target comparison text is a text which is consistent with the sentence pattern structure of the candidate recognition text in the preset text library, and the target comparison text comprises the determined recognition text.
The preset text library may include a plurality of pre-stored sentences, vocabulary combinations, and the like, and the target comparison texts consistent with the sentence pattern of the candidate recognition texts may be matched in the preset text library through word senses, parts of speech (nouns, verbs), and the like. For example, "I want to listen to Gao Shengmei's songs" may be matched to the target comparison text "I want to listen to Zhou Jielun's songs". For another example, "please give me a cup of coffee" may match the target comparison text "please give me a cup of milk".
The target comparison text includes the above-mentioned determined recognition text; that is, "I want to listen to Zhou Jielun's songs" includes the determined recognition text "I want to listen to ... songs".
S103, configuring the candidate recognition text composed of the to-be-determined recognition text corresponding to the maximum value of the similarity and the determined recognition text as the target recognition text.
Optionally, the similarity between each to-be-determined recognition text and the text at the corresponding position of the target comparison text is calculated respectively. For example, the similarity between "Gao Shengmei" and "Zhou Jielun" and the similarity between the homophonic variant of the name and "Zhou Jielun" are determined respectively.
If the similarity between "Gao Shengmei" and "Zhou Jielun" is the largest, the target recognition text is configured as "I want to listen to Gao Shengmei's songs".
The similarity may refer to semantic similarity, or to similarity of the categories to which the words belong, similarity of parts of speech, and the like, which is not limited herein.
In this embodiment, a determined recognition text and a to-be-determined recognition text are first determined in at least two candidate recognition texts corresponding to the voice data to be recognized. For each to-be-determined recognition text, the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text is then calculated, the to-be-determined recognition text corresponding to the maximum similarity is determined as the correct result corresponding to the voice data to be recognized, and the candidate recognition text composed of this to-be-determined recognition text and the determined recognition text is configured as the target recognition text. In this way, when a plurality of candidate recognition texts with close probabilities are obtained, the to-be-determined recognition text closest to the voice data input by the user is determined according to the target comparison text with the same sentence pattern structure, and further according to the similarity between each to-be-determined recognition text and the text at the corresponding position of the target comparison text; the to-be-determined recognition text and the determined recognition text are then combined into the target recognition text and fed back to the user. That is, the differing parts of the candidate recognition texts with close probabilities are further selected by reference to the target comparison text, which improves the accuracy of recognizing the voice data to be recognized and the user experience of voice recognition.
Fig. 2 is a flowchart illustrating a method for determining a target recognition text from candidate recognition texts according to another embodiment of the present application. As shown in fig. 2, on the basis of fig. 1, before S101, the method further includes:
s201, acquiring a plurality of voice recognition texts corresponding to the voice data to be recognized.
When a user inputs a segment of speech, the terminal may obtain a plurality of results according to a preset speech recognition decoder. Generally, the preset speech recognition decoder may include one or more models for speech recognition to recognize the speech data to be recognized. Because some pronunciations in the speech information are fuzzy, or because many words share the same or similar pronunciations, a plurality of speech recognition texts may be recognized.
Specifically: after the voice data to be recognized is acquired, it may first be preprocessed, for example by front-end signal processing and endpoint detection; speech features are then extracted frame by frame, and the extracted features are sent to a preset speech recognition decoder, which may include an acoustic model, a language model, and a pronunciation dictionary. The decoder combines the acoustic model, the language model, and the pronunciation dictionary to obtain a plurality of speech recognition texts.
The acoustic model mainly describes the likelihood probability of the features under the pronunciation model; a Hidden Markov Model (HMM) can be adopted. The language model mainly describes the probability of continuous occurrence between words; an n-gram model is adopted, which for Chinese is called a Chinese Language Model (CLM). The language model can contain a large amount of corpus data, such as sentences and vocabularies, and the word search result can be constrained according to the statistical probability of co-occurrence between preceding and following words. The pronunciation dictionary is mainly used for converting between words and phonemes. In the specific conversion, acoustic model decoding searches the feature file of the sound signal in the acoustic model to generate the optimal phoneme recognition result, where phonemes can be represented by letters. The phoneme recognition result is then converted into characters by querying the pronunciation dictionary. Finally, the goal of language model decoding is to select the most likely word combination from the word combinations obtained by querying the pronunciation dictionary as the speech recognition text.
It should be noted that, reference may be made to related technologies for the operation of recognizing the voice data to be recognized to obtain the corresponding voice recognition text, which is not described in detail in the embodiment of the present invention.
For example, the operation of recognizing the speech data to be recognized to obtain the corresponding speech recognition text can be realized through the following formulas.
W1 = argmax P(W|X) (1)
W2 = argmax P(X|W)P(W)/P(X) (2)
In formula (1), W represents any character sequence stored in a database, where a character sequence comprises words or characters, and the database may be a corpus used for speech recognition; X represents the voice data input by the user; W1 represents the character sequence, obtained from the stored character sequences, that can be matched with the speech data to be recognized; and P(W|X) represents the probability that the speech data to be recognized corresponds to that character sequence. Formula (2) follows from formula (1) by the Bayesian formula: W2 represents the matching degree between the voice data to be recognized and the character sequence; P(X|W) represents the probability that the character sequence produces the pronunciation; P(W) represents the probability that the character sequence is a word or character; and P(X) represents the probability that the voice data to be recognized is audio information.
In the above recognition process, P(W) may be determined by a language model, and P(X|W) may be determined by an acoustic model, so as to complete the speech recognition of the speech data to be recognized and obtain the corresponding speech recognition text. The language model and the acoustic model are briefly described below.
Language model
The language model usually uses the chain rule to decompose the probability of a word or character sequence into the product of the probabilities of each word or character, i.e., W is decomposed into w1, w2, w3, ..., wn-1, wn, and P(W) is determined by the following formula (3).
P(W)=P(w1)P(w2|w1)P(w3|w1,w2)...P(wn|w1,w2,...,wn-1) (3)
In the above formula (3), each factor of P(W) is the probability that the current word or character occurs given all the preceding words or characters in the sequence.
When P(W) is determined by the above formula (3), a long conditioning history makes the computation inefficient, which affects subsequent speech recognition. Therefore, to improve the efficiency of determining P(W), P(W) is typically determined by an n-gram language model. In the n-gram language model shown below, the probability of the nth word depends only on the (n-1)th word immediately preceding it, and P(W) can be determined by the following formula (4).
P(W)=P(w1)P(w2|w1)P(w3|w2)...P(wn|wn-1) (4)
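As a toy illustration of formula (4), the following sketch estimates bigram probabilities by maximum likelihood over a two-sentence corpus; real language models use far larger corpora and add smoothing, which this sketch omits.

from collections import Counter

corpus = [["i", "want", "to", "listen"], ["i", "want", "a", "song"]]
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_sentence(words):
    """P(W) = P(w1) * P(w2|w1) * ... * P(wn|wn-1), as in formula (4)."""
    p = unigrams[words[0]] / sum(unigrams.values())   # P(w1)
    for prev, word in zip(words, words[1:]):
        p *= bigrams[(prev, word)] / unigrams[prev]   # P(word|prev)
    return p

print(p_sentence(["i", "want", "to", "listen"]))  # 0.25 * 1.0 * 0.5 * 1.0 = 0.125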
Acoustic model
Since the pronunciation of each word needs to be determined, a dictionary is required. The dictionary is a model juxtaposed with the acoustic model and the language model, and it can convert a single word into a phoneme string. Through the dictionary, the acoustic model can determine which sounds should be pronounced in sequence for the words in the user-input speech data, and find the boundary of each phoneme through a dynamic programming algorithm such as the Viterbi algorithm, thereby determining the start and end time of each phoneme and thus the degree of matching between the user-input speech data and the phoneme string, that is, determining P(X|W).
In general, the distribution of the feature vectors of each phoneme can be estimated by a classifier such as a Gaussian mixture model. In the speech recognition stage, the probability P(xt|si) that the feature vector xt of each frame in the user-input speech data is produced by the corresponding phoneme si is determined, and the probabilities of all frames are multiplied to obtain P(X|W).
The classifier can be trained in advance; the specific operation is as follows: a large number of feature vectors and the phoneme corresponding to each feature vector are extracted from training data through Mel-Frequency Cepstral Coefficients (MFCC), and a classifier from features to phonemes is thus trained.
It should be noted that, in practical applications, determining P(X|W) is not limited to the above manner; other manners may be used, such as directly giving P(si|xt) through a neural network, which can be converted into P(xt|si) by the Bayesian formula and then multiplied over frames to obtain P(X|W). This is only for illustration and does not limit the embodiments of the present invention.
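The neural-network variant just mentioned can be sketched as follows. The per-frame posteriors, priors, and phoneme alignment are invented numbers, and the constant factor P(xt) is dropped because it does not affect the comparison between hypotheses.

import math

def log_p_x_given_w(posteriors, priors, aligned_phonemes):
    """Sum over frames of log P(xt|si), obtained from posteriors by Bayes:
    P(xt|si) = P(si|xt) * P(xt) / P(si); the constant P(xt) is omitted."""
    total = 0.0
    for frame_posterior, phoneme in zip(posteriors, aligned_phonemes):
        total += math.log(frame_posterior[phoneme]) - math.log(priors[phoneme])
    return total

posteriors = [{"g": 0.7, "au": 0.3}, {"g": 0.2, "au": 0.8}]  # P(s|xt) per frame
priors = {"g": 0.5, "au": 0.5}                               # P(s)
print(log_p_x_given_w(posteriors, priors, ["g", "au"]))      # log 1.4 + log 1.6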
S202, determining the maximum probability value and the second-highest probability value among the probability values corresponding to the plurality of voice recognition texts.
The recognition probability of each speech recognition text can be calculated by adopting a preset algorithm according to the character combination of each speech recognition text.
Optionally, the probability value Prec of each speech recognition text may be calculated by the formula
Prec = P(O|Q) P(Q|W) P(W)
where P(O|Q) is the decoding rate of the acoustic model, P(Q|W) is the decoding rate of the pronunciation dictionary, and P(W) is the decoding rate of the language model; O represents the feature file of the speech data to be recognized, W is the recognized word combination, and Q is the phoneme sequence.
Therefore, substituting the word combination and phoneme sequence of each speech recognition text, together with the feature file of the speech data to be recognized, into the formula yields the probability value corresponding to each speech recognition text.
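A sketch of this score combination, computed in log space as practical decoders do; the scores are invented, and the language-model weight lm_weight is a common engineering addition rather than part of the formula above.

import math

def log_p_rec(p_acoustic, p_lexicon, p_lm, lm_weight=1.0):
    """log Prec = log P(O|Q) + log P(Q|W) + lm_weight * log P(W)."""
    return math.log(p_acoustic) + math.log(p_lexicon) + lm_weight * math.log(p_lm)

# Two hypotheses for the same utterance; the higher combined score wins.
h1 = log_p_rec(0.020, 0.9, 0.0010)
h2 = log_p_rec(0.018, 0.9, 0.0012)
print("h1" if h1 > h2 else "h2", "wins")  # h2 wins: the language model tips it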
Assuming that there are N speech recognition texts in total, the probability value of each speech recognition text is denoted as Pn, where n = 1, 2, ..., N. The maximum probability value Pmax and the second-highest probability value P2max can then be selected.
S203, determining whether the difference between the maximum probability value and the second-highest probability value is larger than a preset probability threshold.
Further, the difference between the maximum probability value and the second-highest probability value may be obtained; if the difference is greater than or equal to the preset probability threshold, the accuracy of the speech recognition text corresponding to the maximum probability value is high, and that speech recognition text may be directly determined as the target recognition text.
In specific implementation, the differences between the maximum probability value Pmax and the other probability values Pn can be calculated in sequence. Optionally, the mean absolute difference can be calculated as the acoustic probability value difference EP:
EP = (1/(N-1)) Σ |Pmax - Pn|
where the sum runs over the N-1 probability values other than Pmax. EP reflects the distribution of the speech recognition texts and measures the direct gap between the optimal speech recognition text and the rest. When EP is larger than the preset threshold, the speech recognition text corresponding to the maximum probability value Pmax can be directly determined as the target recognition text without further semantic analysis.
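A minimal sketch of this check, assuming the mean-absolute-difference form of EP given above; the threshold and scores are illustrative values.

def acoustic_gap(probs):
    """EP: mean absolute gap between the best score and the N-1 others."""
    ranked = sorted(probs, reverse=True)
    p_max, rest = ranked[0], ranked[1:]
    return sum(p_max - p for p in rest) / len(rest)

probs = [0.42, 0.40, 0.39]       # invented hypothesis scores
if acoustic_gap(probs) > 0.10:   # preset threshold (assumed value)
    print("accept the best hypothesis directly")
else:
    print("scores too close: run the semantic comparison of S101-S103")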
Further, when the difference between the maximum probability value and the second-highest probability value is smaller than the preset probability threshold, at least two candidate recognition texts are determined from the plurality of voice recognition texts.
Optionally, determining the at least two candidate recognition texts from the plurality of speech recognition texts may be: acquiring the first voice recognition texts, among the plurality of voice recognition texts, whose difference between their probability value and the maximum probability value is smaller than the preset probability threshold, and determining these first voice recognition texts and the voice recognition text corresponding to the maximum probability value as the at least two candidate recognition texts.
That is, the maximum probability value is compared with each other probability value; when the difference is smaller than the preset probability threshold, the voice recognition text corresponding to the compared probability value is taken as a candidate recognition text. If the difference is greater than or equal to the preset probability threshold, the probability that the corresponding voice recognition text is the target recognition text is very low, and it is not analyzed further.
Alternatively, the probability values of the plurality of speech recognition texts may be ranked, and a preset number of speech recognition texts with the highest probability values may be selected as candidate recognition texts. Or, going from top to bottom, candidate recognition texts may be selected in sequence according to the difference between the probability values of two adjacent voice recognition texts, as shown in the sketch below. For example, if the difference between the maximum probability value and the second-highest probability value is greater than the preset threshold, the voice recognition text with the highest probability value is directly taken as the target recognition text and no further comparison is made; otherwise, the voice recognition texts with the highest and second-highest probability values are taken as candidate recognition texts, the difference between the second-highest probability value and the next probability value is determined in turn to extend the candidate set, and so on, until some difference is larger than the preset threshold and the comparison stops. Of course, the selection is not limited to these methods; candidate recognition texts may be determined flexibly as needed, for example by a formula or an algorithm.
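A sketch of the adjacent-gap strategy just described; the threshold and scores are invented values.

def select_candidates(texts, probs, threshold=0.05):
    """Walk down the ranked scores; stop once two adjacent scores differ
    by more than the threshold, keeping everything seen so far."""
    ranked = sorted(zip(probs, texts), reverse=True)
    candidates = [ranked[0][1]]
    for (p_hi, _), (p_lo, text) in zip(ranked, ranked[1:]):
        if p_hi - p_lo > threshold:
            break
        candidates.append(text)
    return candidates

print(select_candidates(["A", "B", "C"], [0.40, 0.38, 0.10]))  # ['A', 'B']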
If only one candidate recognition text is determined, this candidate recognition text can be directly configured as the target recognition text. If there are a plurality of candidate recognition texts, the result that best fits the actual situation is further determined from them as the target recognition text.
Optionally, calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text may include: determining the semantic similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text by adopting a preset word vector model.
The preset word vector model is used for identifying semantic similarity among words through word vector distances.
The preset word vector model can be obtained through word vector training. Specifically, text content can be converted into low-dimensional real-number vectors; 50 and 100 dimensions are common. The distance between vectors can be measured by the traditional Euclidean distance or by the cosine of the included angle, which is not limited herein. The vector distance reflects the semantic distance between words, i.e., the semantic similarity between words can be expressed by the vector distance. Word vector training can be carried out with a word vector training tool: first, a training corpus that comprehensively covers the basic words of Chinese is obtained and preprocessed accordingly; then the word vector training tool is invoked to generate the vector representations, for example a corresponding 50-dimensional vector for each word in the corpus, which is not limited herein. The larger the vector distance, the farther the semantic distance between the words; conversely, the smaller the distance, the closer the semantics.
Specifically, since the to-be-determined part of the candidate recognition text and the text at the corresponding position of the target comparison text appear at the same position in the same sentence pattern, the probability that they refer to the same kind of thing is very high, and the similarity is then further determined according to the word vector distance.
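The comparison itself reduces to a vector distance. The sketch below uses cosine similarity over tiny invented 4-dimensional vectors standing in for trained 50- or 100-dimensional embeddings; the numbers are chosen only so that the correct writing of the name is the closer match.

import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

vectors = {                                        # invented embeddings
    "zhou_jielun": [0.9, 0.1, 0.8, 0.2],           # singer in the comparison text
    "gao_shengmei": [0.8, 0.2, 0.9, 0.1],          # singer, correct writing
    "gao_shengmei_variant": [0.1, 0.9, 0.2, 0.7],  # homophone, not a known entity
}

for word in ("gao_shengmei", "gao_shengmei_variant"):
    print(word, round(cosine_similarity(vectors[word], vectors["zhou_jielun"]), 3))
# gao_shengmei 0.987 / gao_shengmei_variant 0.337 -> the first form is kept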
Table 1 is taken as an example for illustration:
TABLE 1 (reproduced as an image in the original publication; it lists the word vector distances between each written form of the to-be-determined recognition text and the word "Zhou Jielun" at the corresponding position of the target comparison text)
Therefore, the word vector distance between "Gao Shengmei" and "Zhou Jielun" is the closest, so "I want to listen to Gao Shengmei's songs" is configured as the target recognition text, which is output and displayed to the user. If the target recognition text is voice information of the control instruction class, the related instructions can be executed according to the target recognition text, which is not repeated herein.
Optionally, determining the semantic similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text by adopting the preset word vector model may be: when the to-be-determined recognition text includes at least two words, respectively determining, with the preset word vector model, the semantic similarity between each word in the to-be-determined recognition text and the word at the corresponding position in the target comparison text.
That is, the words at different positions are compared respectively. For example, when comparing the to-be-determined recognition text in "eating fruit at breakfast is good for health" with the target comparison text "eating coarse grains at dinner is good for health", the semantic similarity between "breakfast" and "dinner" and the semantic similarity between "fruit" and "coarse grains" can be determined respectively.
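Reusing cosine_similarity from the previous sketch, the multi-word case can be handled per position. Averaging the per-position similarities is an assumption; the patent does not fix how they are combined.

def text_similarity(undetermined_words, comparison_words, vectors):
    """Average per-position similarity between the to-be-determined words
    and the words at the corresponding positions of the comparison text."""
    sims = [cosine_similarity(vectors[u], vectors[c])
            for u, c in zip(undetermined_words, comparison_words)]
    return sum(sims) / len(sims)

# e.g. comparing ("breakfast", "fruit") against ("dinner", "coarse_grains")
# requires a vector for each word; the candidate with the higher average wins.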
Fig. 3 is a schematic structural diagram of an apparatus for determining a target recognition text from at least two candidate recognition texts according to an embodiment of the present application. As shown in fig. 3, the apparatus includes: a first determining module 301, a calculating module 302, and a second determining module 303, wherein:
the first determining module 301 is configured to determine a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to the voice data to be recognized.
The determined recognition text is the part that is the same across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs.
A calculating module 302, configured to calculate the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text.
The target comparison text is a text which is consistent with the sentence pattern structure of the candidate recognition text in a preset text library, and the target comparison text comprises the determined recognition text.
A second determining module 303, configured to configure the candidate recognition text composed of the to-be-determined recognition text corresponding to the maximum value of the similarity and the determined recognition text as the target recognition text.
In this embodiment, the first determining module 301 first determines a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to the voice data to be recognized. The calculating module 302 then calculates, for each to-be-determined recognition text, the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text, and the to-be-determined recognition text corresponding to the maximum similarity is determined as the correct result corresponding to the voice data to be recognized. The second determining module 303 then configures the candidate recognition text composed of this to-be-determined recognition text and the determined recognition text as the target recognition text. In this way, when a plurality of candidate recognition texts with close probabilities are obtained, the to-be-determined recognition text closest to the voice data input by the user is determined according to the target comparison text with the same sentence pattern structure, and further according to the similarity between each to-be-determined recognition text and the text at the corresponding position of the target comparison text; the to-be-determined recognition text and the determined recognition text are then combined into the target recognition text and fed back to the user. That is, the differing parts of the candidate recognition texts with close probabilities are further selected by reference to the target comparison text, which improves the accuracy of recognizing the voice data to be recognized and the user experience of voice recognition.
Fig. 4 is a schematic structural diagram of an apparatus for determining a target recognition text from at least two candidate recognition texts according to another embodiment of the present application. As shown in fig. 4, on the basis of fig. 3, the apparatus further includes: a third determining module 401, wherein:
the third determining module 401 is configured to determine a maximum probability value and a second-highest probability value among the plurality of speech recognition texts corresponding to the speech data to be recognized before the first determining module 301 determines the determined recognition text and the to-be-determined recognition text in the at least two candidate recognition texts corresponding to the speech data to be recognized.
In this embodiment, the first determining module 301 determines at least two candidate recognition texts from the plurality of speech recognition texts when the difference between the maximum probability value and the second-highest probability value is smaller than the preset probability threshold.
Optionally, the first determining module 301 is specifically configured to acquire the first speech recognition texts, among the plurality of speech recognition texts, whose difference between their probability value and the maximum probability value is smaller than the preset probability threshold, and to determine these first speech recognition texts and the speech recognition text corresponding to the maximum probability value as the at least two candidate recognition texts.
Further, the calculating module 302 is specifically configured to determine the semantic similarity between the to-be-determined recognition text and the text at the corresponding position in the target comparison text by adopting a preset word vector model. The preset word vector model is used for identifying the semantic similarity between words through word vector distance.
Optionally, the calculating module 302 is specifically configured to, when the to-be-determined recognition text includes at least two words, respectively determine, by using the preset word vector model, the semantic similarity between each word in the to-be-determined recognition text and the word at the corresponding position in the target comparison text.
It should be noted that, in the apparatus for determining a target recognition text provided in the above embodiment, the division into the above functional modules is only used as an example when determining a target recognition text from at least two candidate recognition texts; in practical applications, the functions may be distributed among different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for determining a target recognition text and the method for determining a target recognition text provided by the above embodiments belong to the same concept; their specific implementation processes are detailed in the method embodiments and are not described here again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor to perform some of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method for determining a target recognition text from at least two candidate recognition texts, comprising:
determining a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to voice data to be recognized, wherein the determined recognition text is the same part in the at least two candidate recognition texts, and the to-be-determined recognition text is the different part in the at least two candidate recognition texts;
calculating the similarity between the to-be-determined recognition text and a text at the corresponding position of a target comparison text, wherein the target comparison text is a text in a preset text library that is consistent with the sentence pattern structure of the candidate recognition texts, the target comparison text comprises the determined recognition text, and the to-be-determined recognition text and the text at the corresponding position of the target comparison text occupy the same position in the same sentence pattern structure;
and configuring the candidate recognition text composed of the to-be-determined recognition text corresponding to the maximum value of the similarity and the determined recognition text as the target recognition text.
2. The method according to claim 1, wherein before determining the determined recognition text and the to-be-determined recognition text in the at least two candidate recognition texts corresponding to the speech data to be recognized, the method further comprises:
determining a maximum probability value and a second-highest probability value among a plurality of voice recognition texts corresponding to the voice data to be recognized;
determining the at least two candidate recognition texts from the plurality of voice recognition texts when the difference between the maximum probability value and the second-highest probability value is smaller than a preset probability threshold.
3. The method of claim 2, wherein determining the at least two candidate recognition texts from the plurality of speech recognition texts comprises:
acquiring a first voice recognition text, among the plurality of voice recognition texts, whose difference between its probability value and the maximum probability value is smaller than the preset probability threshold;
and determining the first voice recognition text and the voice recognition text corresponding to the maximum probability value as the at least two candidate recognition texts.
4. The method according to claim 1, wherein calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text specifically comprises:
determining the semantic similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text by adopting a preset word vector model, wherein the preset word vector model is used for identifying the semantic similarity between words through word vector distance.
5. The method according to claim 4, wherein determining the semantic similarity between the to-be-determined recognition text and the text at the corresponding position in the target comparison text by using the preset word vector model specifically comprises:
when the to-be-determined recognition text comprises at least two words, respectively determining, by using the preset word vector model, the semantic similarity between each word in the to-be-determined recognition text and the word at the corresponding position in the target comparison text.
6. An apparatus for determining a target recognized text from at least two candidate recognized texts, comprising:
a first determining module, configured to determine a determined recognition text and a to-be-determined recognition text in at least two candidate recognition texts corresponding to voice data to be recognized, wherein the determined recognition text is the part that is the same across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs;
a calculating module, configured to calculate the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, wherein the target comparison text is a text in a preset text library that is consistent with the sentence pattern structure of the candidate recognition texts, the target comparison text comprises the determined recognition text, and the to-be-determined recognition text and the text at the corresponding position of the target comparison text occupy the same position in the same sentence pattern structure;
and a second determining module, configured to configure the candidate recognition text composed of the to-be-determined recognition text corresponding to the maximum value of the similarity and the determined recognition text as the target recognition text.
7. The apparatus of claim 6, further comprising: a third determining module;
the third determining module is configured to determine a maximum probability value and a second-highest probability value among a plurality of voice recognition texts corresponding to the voice data to be recognized before the first determining module determines the determined recognition text and the to-be-determined recognition text in the at least two candidate recognition texts corresponding to the voice data to be recognized;
the first determining module is specifically configured to determine the at least two candidate recognition texts from the plurality of speech recognition texts when the difference between the maximum probability value and the second-highest probability value is smaller than a preset probability threshold.
8. The apparatus according to claim 7, wherein the first determining module is specifically configured to acquire a first speech recognition text, among the plurality of speech recognition texts, whose difference between its probability value and the maximum probability value is smaller than the preset probability threshold; and to determine the first speech recognition text and the speech recognition text corresponding to the maximum probability value as the at least two candidate recognition texts.
9. The apparatus according to claim 6, wherein the calculating module is specifically configured to determine the semantic similarity between the to-be-determined recognition text and the text at the corresponding position in the target comparison text by using a preset word vector model, wherein the preset word vector model is used for identifying the semantic similarity between words through word vector distance.
10. The apparatus according to claim 9, wherein the calculating module is specifically configured to, when the to-be-determined recognition text comprises at least two words, respectively determine, by using the preset word vector model, the semantic similarity between each word in the to-be-determined recognition text and the word at the corresponding position in the target comparison text.
CN201710127503.9A 2017-03-06 2017-03-06 Method and device for determining target recognition text Active CN106782560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710127503.9A CN106782560B (en) 2017-03-06 2017-03-06 Method and device for determining target recognition text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710127503.9A CN106782560B (en) 2017-03-06 2017-03-06 Method and device for determining target recognition text

Publications (2)

Publication Number Publication Date
CN106782560A CN106782560A (en) 2017-05-31
CN106782560B true CN106782560B (en) 2020-06-16

Family

ID=58962349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710127503.9A Active CN106782560B (en) 2017-03-06 2017-03-06 Method and device for determining target recognition text

Country Status (1)

Country Link
CN (1) CN106782560B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107329843B (en) * 2017-06-30 2021-06-01 百度在线网络技术(北京)有限公司 Application program voice control method, device, equipment and storage medium
CN107277645A (en) * 2017-07-27 2017-10-20 广东小天才科技有限公司 Error correction method and device for subtitle content
CN107680585B (en) * 2017-08-23 2020-10-02 海信集团有限公司 Chinese word segmentation method, Chinese word segmentation device and terminal
CN108197102A (en) * 2017-12-26 2018-06-22 百度在线网络技术(北京)有限公司 A kind of text data statistical method, device and server
CN108417210B (en) * 2018-01-10 2020-06-26 苏州思必驰信息科技有限公司 Word embedding language model training method, word recognition method and system
CN108364655B (en) * 2018-01-31 2021-03-09 网易乐得科技有限公司 Voice processing method, medium, device and computing equipment
CN110188338B (en) * 2018-02-23 2023-02-21 富士通株式会社 Text-dependent speaker verification method and apparatus
CN109829704A (en) * 2018-12-07 2019-05-31 创发科技有限责任公司 Payment channel configuration method, device and computer readable storage medium
CN111681670B (en) * 2019-02-25 2023-05-12 北京嘀嘀无限科技发展有限公司 Information identification method, device, electronic equipment and storage medium
CN109918680B (en) * 2019-03-28 2023-04-07 腾讯科技(上海)有限公司 Entity identification method and device and computer equipment
CN110705274B (en) * 2019-09-06 2023-03-24 电子科技大学 Fusion type word meaning embedding method based on real-time learning
CN110853635B (en) * 2019-10-14 2022-04-01 广东美的白色家电技术创新中心有限公司 Speech recognition method, audio annotation method, computer equipment and storage device
CN110706707B (en) * 2019-11-13 2020-09-18 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer-readable storage medium for voice interaction
JP7374756B2 (en) * 2019-12-20 2023-11-07 キヤノン株式会社 Information processing device, information processing method, and program
CN111667821A (en) * 2020-05-27 2020-09-15 山西东易园智能家居科技有限公司 Voice recognition system and recognition method
CN112614263A (en) * 2020-12-30 2021-04-06 浙江大华技术股份有限公司 Method and device for controlling gate, computer equipment and storage medium
CN113177114B (en) * 2021-05-28 2022-10-21 重庆电子工程职业学院 Natural language semantic understanding method based on deep learning
CN113539270B (en) * 2021-07-22 2024-04-02 阳光保险集团股份有限公司 Position identification method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101655837A (en) * 2009-09-08 2010-02-24 北京邮电大学 Method for detecting and correcting error on text after voice recognition
CN102999483A (en) * 2011-09-16 2013-03-27 北京百度网讯科技有限公司 Method and device for correcting text
CN105374351A (en) * 2014-08-12 2016-03-02 霍尼韦尔国际公司 Methods and apparatus for interpreting received speech data using speech recognition
CN105513586A (en) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition result display method and speech recognition result display device
CN105869642A (en) * 2016-03-25 2016-08-17 海信集团有限公司 Voice text error correction method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003308094A (en) * 2002-02-12 2003-10-31 Advanced Telecommunication Research Institute International Method for correcting recognition error place in speech recognition
US7587308B2 (en) * 2005-11-21 2009-09-08 Hewlett-Packard Development Company, L.P. Word recognition using ontologies
CN103699530A (en) * 2012-09-27 2014-04-02 百度在线网络技术(北京)有限公司 Method and equipment for inputting texts in target application according to voice input information
CN104021786B (en) * 2014-05-15 2017-05-24 北京中科汇联信息技术有限公司 Speech recognition method and speech recognition device
KR102380833B1 (en) * 2014-12-02 2022-03-31 삼성전자주식회사 Voice recognizing method and voice recognizing appratus
CN106326303B (en) * 2015-06-30 2019-09-13 芋头科技(杭州)有限公司 A kind of spoken semantic analysis system and method
CN106469554B (en) * 2015-08-21 2019-11-15 科大讯飞股份有限公司 A kind of adaptive recognition methods and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101655837A (en) * 2009-09-08 2010-02-24 北京邮电大学 Method for detecting and correcting error on text after voice recognition
CN102999483A (en) * 2011-09-16 2013-03-27 北京百度网讯科技有限公司 Method and device for correcting text
CN105374351A (en) * 2014-08-12 2016-03-02 霍尼韦尔国际公司 Methods and apparatus for interpreting received speech data using speech recognition
CN105513586A (en) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition result display method and speech recognition result display device
CN105869642A (en) * 2016-03-25 2016-08-17 海信集团有限公司 Voice text error correction method and device

Also Published As

Publication number Publication date
CN106782560A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106782560B (en) Method and device for determining target recognition text
US10283111B1 (en) Disambiguation in speech recognition
US11496582B2 (en) Generation of automated message responses
US11380330B2 (en) Conversational recovery for voice user interface
US10943583B1 (en) Creation of language models for speech recognition
US10489393B1 (en) Quasi-semantic question answering
US10134388B1 (en) Word generation for speech recognition
US9484021B1 (en) Disambiguation in speech recognition
US20220189458A1 (en) Speech based user recognition
US10210862B1 (en) Lattice decoding and result confirmation using recurrent neural networks
US10332508B1 (en) Confidence checking for speech processing and query answering
US10388274B1 (en) Confidence checking for speech processing and query answering
US9934785B1 (en) Identification of taste attributes from an audio signal
US10121467B1 (en) Automatic speech recognition incorporating word usage information
US11823678B2 (en) Proactive command framework
US10056078B1 (en) Output of content based on speech-based searching and browsing requests
US10339920B2 (en) Predicting pronunciation in speech recognition
US7421387B2 (en) Dynamic N-best algorithm to reduce recognition errors
US20180137109A1 (en) Methodology for automatic multilingual speech recognition
US10963497B1 (en) Multi-stage query processing
US9704483B2 (en) Collaborative language model biasing
US20140195238A1 (en) Method and apparatus of confidence measure calculation
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
US11093110B1 (en) Messaging feedback mechanism
US10366442B1 (en) Systems and methods to update shopping cart

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant