CN112863499B - Speech recognition method and device, storage medium - Google Patents

Speech recognition method and device, storage medium

Info

Publication number
CN112863499B
CN112863499B
Authority
CN
China
Prior art keywords
voice data
voice
intention
preset
data
Prior art date
Legal status
Active
Application number
CN202110041968.9A
Other languages
Chinese (zh)
Other versions
CN112863499A (en)
Inventor
谢巧菁 (Xie Qiaojing)
崔世起 (Cui Shiqi)
秦斌 (Qin Bin)
Current Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202110041968.9A
Publication of CN112863499A
Application granted
Publication of CN112863499B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure relates to a speech recognition method and device, and a storage medium. The method includes: receiving input voice data; determining whether the text length corresponding to the voice data is greater than a preset length threshold; if the text length corresponding to the voice data is smaller than the preset length threshold, determining whether the voice data is speech with unknown intention according to a first rule; and if the text length corresponding to the voice data is greater than or equal to the preset length threshold, determining whether the voice data is speech with unknown intention according to a second rule. This method improves the accuracy of judging whether an intention is clear.

Description

Speech recognition method and device, storage medium
Technical Field
The present disclosure relates to the field of intelligent speech technologies, and in particular, to a speech recognition method and apparatus, and a storage medium.
Background
With the rapid development of computers and artificial intelligence technologies, intelligent voice conversation has also developed greatly. Users convey their needs, such as numerical calculation, weather inquiry, and smart home control, to an intelligent voice assistant (an application on a voice device) through speech.
After receiving the user's speech, the intelligent voice assistant converts it into text through Automatic Speech Recognition (ASR) technology and analyzes the user's needs through back-end Natural Language Processing (NLP) technology, for example, recognizing the user's intention.
Disclosure of Invention
The disclosure provides a voice recognition method and device and a storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a speech recognition method, including:
receiving input voice data;
determining whether the text length corresponding to the voice data is larger than a preset length threshold value;
if the text length corresponding to the voice data is smaller than the preset length threshold, determining whether the voice data is voice with unknown intention according to a first rule;
and if the text length corresponding to the voice data is greater than or equal to the preset length threshold, determining whether the voice data is voice with unknown intention according to a second rule.
In some embodiments, the determining whether the voice data is a voice with unknown intention according to a second rule if the text length corresponding to the voice data is greater than or equal to the preset length threshold includes:
if the text length is larger than or equal to the preset length threshold, inputting the voice data into a first language model, and determining a confusion value of the voice data;
and determining whether the voice data is voice with unknown intention according to the confusion value.
In some embodiments, said determining whether said speech data is speech of unknown intent from said confusion value comprises:
if the confusion value is larger than a preset confusion threshold value, determining the voice data as the voice with unknown intention;
or,
if the confusion value is smaller than or equal to a preset confusion threshold value, at least inputting the voice data into a second language model, and determining the confidence coefficient of the voice data as nonsense voice;
and determining whether the voice data is voice with unknown intention according to the confidence of the nonsense speech.
In some embodiments, the method further comprises:
acquiring keyword information included in a text corresponding to the voice data;
if the confusion value is smaller than or equal to a preset confusion threshold value, at least inputting the voice data into a second language model, and determining the confidence level that the voice data is nonsense voice, wherein the method comprises the following steps:
and if the confusion value is smaller than or equal to a preset confusion threshold value, inputting the keyword information and/or the confusion value and the voice data into the second language model, and determining the confidence of the voice data as nonsense voice.
In some embodiments, the second language model is a model trained using a CNN network.
In some embodiments, the first language model is a model trained using a BERT network.
In some embodiments, the determining whether the voice data is a voice with unknown intention according to a first rule if the text length corresponding to the voice data is smaller than the preset length threshold includes:
if the text length is smaller than the preset length threshold, inputting the voice data into a preset unknown intention database, and determining whether the voice data is matched with data in the preset unknown intention database;
and if the voice data is matched with the data in the preset unknown intention database, determining that the voice data is the voice with unknown intention.
In some embodiments, the method further comprises:
performing intention recognition on the voice data to obtain an intention scoring value of the voice data; wherein the intent score value characterizes an intent clarity of the speech data;
if the text length corresponding to the voice data is smaller than the preset length threshold, determining whether the voice data is a voice with unknown intention according to a first rule, including:
if the text length is smaller than the preset length threshold, determining whether the voice data is voice with unknown intention by combining the intention scoring value and the first rule;
if the text length corresponding to the voice data is greater than or equal to the preset length threshold, determining whether the voice data is a voice with unknown intention according to a second rule, including:
and if the text length is greater than or equal to the preset length threshold, determining whether the voice data is voice with unknown intention by combining the intention score value and the second rule.
In some embodiments, the method further comprises:
and if the voice data is determined to be the voice with unknown intention, outputting a preset response reply.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
a receiving module configured to receive input voice data;
the first judgment module is configured to determine whether the text length corresponding to the voice data is greater than a preset length threshold value;
the second judgment module is configured to determine whether the voice data is a voice with unknown intention according to a first rule if the text length corresponding to the voice data is smaller than the preset length threshold;
and the third judging module is configured to determine whether the voice data is the voice with unknown intention according to a second rule if the text length corresponding to the voice data is greater than or equal to the preset length threshold.
In some embodiments, the third determining module is specifically configured to, if the text length is greater than or equal to the preset length threshold, input the voice data into a first language model and determine a confusion value of the voice data, and to determine whether the voice data is voice with unknown intention according to the confusion value.
In some embodiments, the third determining module is specifically configured to determine that the voice data is voice with unknown intention if the confusion value is greater than a preset confusion threshold; or, if the confusion value is less than or equal to the preset confusion threshold, to input at least the voice data into a second language model and determine the confidence that the voice data is nonsense speech, and to determine whether the voice data is voice with unknown intention according to that confidence.
In some embodiments, the apparatus further comprises:
the first acquisition module is configured to acquire keyword information included in a text corresponding to the voice data;
the third determining module is specifically configured to input the keyword information and/or the confusion value and the voice data into the second language model if the confusion value is less than or equal to a preset confusion threshold value, and determine the confidence level that the voice data is nonsense voice.
In some embodiments, the second language model is a model trained using a CNN network.
In some embodiments, the first language model is a model trained using a BERT network.
In some embodiments, the second determining module is specifically configured to, if the text length is smaller than the preset length threshold, input the voice data into a preset unknown-intention database, and determine whether the voice data matches with data in the preset unknown-intention database; and if the voice data is matched with the data in the preset unknown intention database, determining that the voice data is the voice with unknown intention.
In some embodiments, the apparatus further comprises:
the second acquisition module is configured to perform intention recognition on the voice data and acquire an intention score value of the voice data; wherein the intent score value characterizes an intent clarity of the speech data;
the second judging module is specifically configured to determine whether the voice data is a voice with unknown intention or not by combining the intention score value and the first rule if the text length is smaller than the preset length threshold;
the third determining module is specifically configured to determine whether the voice data is a voice with unknown intention by combining the intention score value and the second rule if the text length is greater than or equal to the preset length threshold.
In some embodiments, the apparatus further comprises:
and the output module is configured to output a preset response reply if the voice data is determined to be the voice with unknown intention.
According to a third aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the speech recognition method as described in the first aspect above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium including:
the instructions in the storage medium, when executed by a processor of a computer, enable the computer to perform the speech recognition method as described in the first aspect above.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the method and the device, different rules are adopted to determine whether the voice data is voice with unknown intention according to whether the text length corresponding to the voice data is larger than the preset length threshold value. The amount of information provided is different due to different lengths of the voice data, for example, in the longer voice data (the voice data with the text length greater than or equal to the preset length threshold), the text length is longer, and the contexts are associated with each other, so that a larger amount of information can be provided; in the shorter voice data (voice data with the text length smaller than the preset length threshold), the text length is shorter, and the provided context-related information is less, so that the provided information amount is small, and the contribution degree of different information amounts to the unknown recognition is different. Based on the method, the device and the system, whether the intention is clear or not is identified by adopting the corresponding rules aiming at the text length of the voice data, and the accuracy of judging whether the intention is clear or not can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present disclosure.
Fig. 2 is a flow chart illustrating a decision scheme for explicit intent determination according to an embodiment of the present disclosure.
Fig. 3 is a network structure diagram of a BERT model according to an embodiment of the present disclosure.
FIG. 4 is a diagram illustrating an example of a model structure for determining, through multi-feature fusion, whether the intention of speech data is clear, according to an embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating a speech recognition method according to an exemplary embodiment of the disclosure.
FIG. 6 is a diagram illustrating a speech recognition device according to an example embodiment.
Fig. 7 is a block diagram illustrating a terminal according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present disclosure, and as shown in fig. 1, the speech recognition method includes the following steps:
s11, receiving input voice data;
s12, determining whether the text length corresponding to the voice data is larger than a preset length threshold value;
S13A, if the text length corresponding to the voice data is smaller than the preset length threshold, determining whether the voice data is voice with unknown intention according to a first rule;
and S13B, if the text length corresponding to the voice data is larger than or equal to the preset length threshold, determining whether the voice data is a voice with unknown intention or not according to a second rule.
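Purely as an illustration of this dispatch (none of the names, the threshold value, or the database entries below appear in the patent; they are assumptions), steps S12, S13A, and S13B can be sketched in Python:

```python
PRESET_LENGTH_THRESHOLD = 6  # assumed value; the patent sets this empirically

def first_rule(text: str) -> bool:
    """S13A: short texts are matched against a preset unknown-intention
    database (illustrative entries only)."""
    preset_unknown_intention_db = {"our weather"}
    return text in preset_unknown_intention_db

def second_rule(text: str) -> bool:
    """S13B: long texts go through the model-based pipeline; a fuller
    sketch of this rule appears later in the description."""
    raise NotImplementedError

def is_unknown_intention(text: str) -> bool:
    """Dispatch on the ASR text length (S12)."""
    if len(text) < PRESET_LENGTH_THRESHOLD:
        return first_rule(text)
    return second_rule(text)
```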
In the embodiment of the disclosure, the speech recognition method may be applied to a terminal voice device or to a server. When applied to a terminal voice device, the device supports voice acquisition and audio output, on the basis of which human-computer voice interaction can be realized. Voice devices include smart phones, smart speakers, wearable devices that support voice interaction, and the like.
For example, taking a smart speaker as the voice device, the voice data input by the user may be collected by a voice acquisition component included in the smart speaker, and, after the smart speaker analyzes and processes the collected data, the corresponding response information is output through the smart speaker's voice output component. The voice acquisition component may be a microphone, and the voice output component may be a loudspeaker.
If the voice recognition method is applied to the server, the server may receive the voice data collected by the voice device and perform subsequent processing. In the embodiment of the present disclosure, a case where the voice recognition method is applied to a voice device is described as an example.
In steps S11 to S12, after receiving the voice data, the voice device determines whether the text length corresponding to the voice data is greater than a preset length threshold. The text length corresponding to the voice data may be the length of the text obtained after converting the voice data to text with ASR technology. The preset length threshold may be determined empirically: after judging whether the intentions of a large amount of voice data are clear, the threshold can be set to the text length at which the judgment accuracy differs significantly between texts shorter and longer than it.
In step S13A, if the text length corresponding to the voice data is smaller than the preset length threshold, determining whether the voice data is a voice with unknown intention according to a first rule; in step S13B, if the text length corresponding to the voice data is greater than or equal to the preset length threshold, it is determined whether the voice data is a voice with unknown intention according to the second rule. That is, according to the text length corresponding to the voice data, different rules are adopted to determine whether the voice data is a voice with unknown intention.
For example, if the voice data is "how is the weather today", its intention is to ask about the weather, and the intention is clear; if the voice data is "our weather", the intention is unclear, and the voice data may be speech with unknown intention.
In the related art, there is only a single scheme for recognizing whether an intention is clear. Fig. 2 is a flowchart illustrating such a scheme according to an embodiment of the present disclosure. As shown in fig. 2, a sentence confusion value is calculated by a language model in step S101, and it is then determined whether the sentence is of a nonsense type according to the confusion value in step S102. This processing method does not describe how the language model is implemented, and the same scheme may be applied to all input sentences. It can be understood that the scheme performs no targeted processing according to the characteristics of sentences, so the accuracy of recognizing unknown intentions may not be high.
In contrast, the present disclosure uses different rules to determine whether the voice data is speech with unknown intention, depending on whether the text length corresponding to the voice data is greater than the preset length threshold. Voice data of different lengths provides different amounts of information: longer voice data (text length greater than or equal to the preset length threshold) has longer text whose contexts are correlated with each other, so it can provide a larger amount of information; shorter voice data (text length smaller than the preset length threshold) has shorter text and little context-related information, so the amount of information is small. Different amounts of information contribute differently to recognizing an unknown intention. By recognizing whether the intention is clear with a rule suited to the text length of the voice data, the accuracy of judging whether the intention is clear can be improved.
In some embodiments, the determining, according to a second rule, whether the voice data is a voice with unknown intention if the text length corresponding to the voice data is greater than or equal to the preset length threshold includes:
if the text length is larger than or equal to the preset length threshold, inputting the voice data into a first language model, and determining a confusion value of the voice data;
and determining whether the voice data is voice with unknown intention according to the confusion value.
In this embodiment, if the text length of the voice data is greater than or equal to the preset length threshold, the voice data is input to a first language model that calculates a confusion (perplexity) value for the current voice data; the first language model may be obtained by training on a large number of voice data samples. In the embodiment of the present disclosure, the confusion value obtained from the first language model measures the probability that the input voice data is speech with unknown intention: the larger the confusion value, the higher the probability that the voice data is speech with unknown intention; the smaller the confusion value, the lower that probability.
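The patent does not fix an implementation for the confusion value; as one common realization, perplexity can be computed from per-token probabilities supplied by the first language model. A minimal sketch, where token_prob is an assumed callback rather than an interface named in the patent:

```python
import math

def confusion_value(tokens, token_prob):
    """Perplexity-style confusion value: the exponential of the average
    negative log-probability. token_prob(i, tokens) is an assumed callback
    returning P(tokens[i] | context) from the first language model."""
    if not tokens:
        return float("inf")
    nll = -sum(math.log(token_prob(i, tokens)) for i in range(len(tokens)))
    return math.exp(nll / len(tokens))
```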
In some embodiments, said determining whether said speech data is speech of unknown intent from said confusion value comprises:
if the confusion value is larger than a preset confusion threshold value, determining the voice data as voice with unknown intention;
or,
if the confusion value is smaller than or equal to a preset confusion threshold value, at least inputting the voice data into a second language model, and determining the confidence coefficient of the voice data as nonsense voice;
and determining whether the voice data is voice with unknown intention according to the confidence of the nonsense speech.
In this embodiment, when the confusion value obtained by the first language model is greater than the preset confusion threshold, the voice data is determined to be the speech with unknown intention, and when the confusion value is less than or equal to the preset confusion threshold, the voice data is not directly determined to be the speech with clear intention, but at least the voice data is input into the second language model to determine the confidence level that the voice data is the speech with no meaning, and then whether the voice data is the speech with unknown intention is determined according to the confidence level of the speech with no meaning.
As described above, the smaller the confusion value, the less likely the speech data is to be speech with unknown intention; however, the judgment of a single model cannot be made one hundred percent accurate. Therefore, when the confusion value of the speech data is less than or equal to the preset confusion threshold, the present disclosure continues to determine, through at least the second language model, the confidence that the speech data is nonsense speech, and then determines from that confidence whether the speech data is speech with unknown intention. This multi-model combination further improves the accuracy of recognizing whether the intention is clear.
In this embodiment, the determining whether the speech data is speech with unknown intention according to the confidence level of the meaningless speech includes:
if the confidence coefficient is larger than a preset confidence coefficient threshold value, determining that the voice data is the voice with unknown intention;
and if the confidence coefficient is less than or equal to the preset confidence coefficient threshold value, determining that the voice data is voice with clear intention.
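Putting these thresholds together, the second rule becomes a two-stage cascade. The sketch below fleshes out the second_rule stub from the earlier sketch with an extended signature; the threshold values and model interfaces are assumptions, not values given in the patent:

```python
PRESET_CONFUSION_THRESHOLD = 200.0   # assumed value
PRESET_CONFIDENCE_THRESHOLD = 0.5    # assumed value

def second_rule(text, ppl_model, nonsense_model) -> bool:
    """Two-stage cascade for long texts (both models are assumed callables)."""
    ppl = ppl_model(text)  # confusion value from the first language model
    if ppl > PRESET_CONFUSION_THRESHOLD:
        return True        # high confusion: speech with unknown intention
    # low confusion: ask the second language model for the nonsense confidence
    confidence = nonsense_model(text)
    return confidence > PRESET_CONFIDENCE_THRESHOLD
```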
In some embodiments, the method further comprises:
acquiring keyword information included in a text corresponding to the voice data;
if the confusion value is smaller than or equal to a preset confusion threshold value, at least inputting the voice data into a second language model, and determining the confidence of the voice data as nonsense voice, wherein the steps of:
and if the confusion value is smaller than or equal to a preset confusion threshold value, inputting the keyword information and/or the confusion value and the voice data into the second language model, and determining the confidence coefficient of the voice data as nonsense voice.
In this embodiment, keyword information included in the text corresponding to the voice data may also be obtained. The keyword information may be the words most related to the intention expressed by the voice data, for example, words that occur frequently in the voice data, or words belonging to a predetermined noun library or verb library. Keyword information can characterize the intention information or intention type; a keyword may also be slot information, such as a location or a time, which serves as key information for triggering an intention.
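As a sketch of one way such keywords might be picked out (the patent names frequent words and predetermined noun/verb libraries but no concrete algorithm, so the heuristic and libraries below are assumptions):

```python
from collections import Counter

def extract_keywords(tokens, noun_library, verb_library, min_count=2):
    """Keep tokens that recur in the utterance or belong to predetermined
    noun/verb libraries (which may carry slot information such as
    location or time). min_count is an assumed frequency cutoff."""
    counts = Counter(tokens)
    return [t for t in tokens
            if counts[t] >= min_count or t in noun_library or t in verb_library]
```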
After the keyword information included in the voice data is acquired, the keyword information and/or the confusion value and the voice data are input into the second language model when the confusion value is smaller than or equal to the preset confusion threshold value, and the confidence coefficient that the voice data are nonsense voice is determined.
Specifically, the second language model may be a model trained based on the speech data for training and the keyword information in the speech data, and thus when the keyword information and the speech data are input to the second language model, the confidence that the input speech data is nonsense speech can be obtained.
The second language model may instead be trained on the training speech data together with an input confusion value. For example, when the confusion value input to the model is smaller than another preset confusion threshold, the confidence of nonsense speech obtained from the speech data itself is multiplied by a first weight to give the confidence finally output by the second language model; when the confusion value is greater than or equal to that other preset confusion threshold, the confidence obtained from the speech data itself is multiplied by a second weight instead, where the first weight is less than the second weight.
It should be noted that, when the keyword information, the confusion value, and the voice data are all input into the second language model together, the two manners described above may be combined; the details are not repeated here.
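The confusion-dependent weighting just described might look like the following; the inner threshold and both weights are illustrative assumptions (the patent only requires the first weight to be less than the second):

```python
INNER_CONFUSION_THRESHOLD = 80.0  # "another preset confusion threshold" (assumed)
FIRST_WEIGHT = 0.6                # assumed; must be less than SECOND_WEIGHT
SECOND_WEIGHT = 1.0               # assumed

def weighted_nonsense_confidence(base_confidence: float, ppl: float) -> float:
    """Scale the confidence obtained from the speech data itself by a weight
    chosen from the confusion value, as described for the second model."""
    weight = FIRST_WEIGHT if ppl < INNER_CONFUSION_THRESHOLD else SECOND_WEIGHT
    return base_confidence * weight
```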
In some embodiments, the second language model is a model trained using a CNN network.
In the embodiment of the present disclosure, the second language model is a model obtained by training based on a Convolutional Neural Network (CNN), but it should be noted that the second language model of the present disclosure is not limited to the CNN, and may also be a model obtained by training based on a Deep Neural Network (DNN).
It is understood that, in this embodiment, when the confusion value is less than or equal to the preset confusion threshold, the keyword information and/or the confusion value are combined together to determine whether the voice data is the voice with unknown intention on the basis of the second language model. By the multi-feature multi-rule combination mode, the accuracy of identifying whether the intention is clear or not can be greatly improved.
In some embodiments, the first language model is a model trained using a BERT network.
BERT is a deep bidirectional neural network model. Fig. 3 is a network structure diagram of the BERT model according to an embodiment of the present disclosure. As shown in fig. 3, the BERT model uses a bidirectional Transformer encoder, performing bidirectional training with both left-side and right-side information to improve precision, and finally produces a deep bidirectional language representation that fuses left and right context information. The following formula (1) is the calculation formula of the loss function in the BERT model:
$$\mathrm{loss} = -\sum_{i=1}^{n} \log P\left(w_i \mid w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_n\right) \qquad (1)$$

where $w_i$ is the currently predicted word, whose loss value is calculated by combining context information in both the left and right directions; the model is trained by continuously optimizing (minimizing) this loss value.
The following formula (2) is a calculation formula of a loss function in a conventional Recurrent Neural Network (RNN) model:
$$\mathrm{loss} = -\sum_{i=1}^{n} \log P\left(w_i \mid w_1, \ldots, w_{i-1}\right) \qquad (2)$$

In the RNN model, the predicted word is combined only with the preceding (left-side) information to calculate the loss value used to train the model.
Since this implementation handles voice data whose text length is greater than or equal to the preset length threshold, the voice data, as described above, contains many words that are correlated with their surrounding context and can therefore provide a large amount of information, which makes it suitable for the BERT model. Compared with a unidirectional model, the BERT model yields a first language model with better accuracy, and thus a more accurate result when the first language model is used to judge the intention of the voice data.
In some embodiments, the determining whether the voice data is a voice with unknown intention according to a first rule if the text length corresponding to the voice data is smaller than the preset length threshold includes:
if the text length is smaller than the preset length threshold, inputting the voice data into a preset unknown intention database, and determining whether the voice data is matched with data in the preset unknown intention database;
and if the voice data is matched with the data in the preset unknown intention database, determining that the voice data is the voice with unknown intention.
In this embodiment, if the text length corresponding to the voice data is smaller than the preset length threshold, the voice data is input into the preset unknown intention database, and whether the voice data is a voice with unknown intention is determined in a data matching manner.
It should be noted that, in the embodiment of the present disclosure, the preset unknown-intention database may be built by offline mining, and all data in it is unknown-intention data. When establishing the database by offline mining, voice data that the voice device received but could not respond to can be stored, and the preset unknown-intention database is established after manual screening.
In the related art, short texts receive no special processing and are input to a language model just like long texts. In fact, because a short text carries little information, it is difficult to recognize with a language model or other models, so the recognition accuracy is poor. In the present disclosure, for a short text whose length is smaller than the preset length threshold, whether it is speech with unknown intention is determined by matching against the data in the preset unknown-intention database, which improves the accuracy of recognizing whether a short text's intention is clear.
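A minimal sketch of the matching step for short texts, assuming exact lookup after light normalization (the patent does not specify the matching method, so the normalization is an assumed detail):

```python
import string

def matches_unknown_intention_db(text: str, unknown_db: set[str]) -> bool:
    """First rule for short texts: normalize, then look up in the preset
    unknown-intention database."""
    normalized = text.strip().lower().translate(
        str.maketrans("", "", string.punctuation))
    return normalized in unknown_db
```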
In some embodiments, the method further comprises:
performing intention recognition on the voice data to obtain an intention score value of the voice data; wherein the intent score value characterizes an intent clarity of the speech data;
if the text length corresponding to the voice data is smaller than the preset length threshold, determining whether the voice data is a voice with unknown intention according to a first rule, including:
if the text length is smaller than the preset length threshold, determining whether the voice data is voice with unknown intention by combining the intention scoring value and the first rule;
if the text length corresponding to the voice data is greater than or equal to the preset length threshold, determining whether the voice data is a voice with unknown intention according to a second rule, including:
and if the text length is greater than or equal to the preset length threshold, determining whether the voice data is voice with unknown intention by combining the intention score value and the second rule.
In this embodiment, intention recognition is performed on the voice data to obtain an intent score value representing the clarity of the voice data's intention, and, according to the text length of the voice data, the intent score value is combined with the first rule or the second rule to determine whether the voice data is speech with unknown intention. For example, the present disclosure scores the intention clarity of voice data through Natural Language Understanding (NLU) techniques: the intent score value of "our weather", mentioned earlier, may be 0.5, while that of "how is the weather today" may be 0.95.
In some embodiments, when determining whether the speech data is speech with unknown intention by combining the intent score value with the first rule or the second rule, the intent score value may be compared with a preset intent score threshold: the speech data is determined to be speech with unknown intention when the intent score value is less than the preset intent score threshold and the first rule or the second rule also judges the intention to be unknown. However, the present disclosure is not limited to this combination; any combination of the intent score value with the first rule or the second rule is intended to fall within the scope of the present disclosure. For example, the output of the first rule or the second rule may be multiplied by a third weight and the intent score value by a fourth weight to obtain a final score, and whether the speech data is speech with unknown intention is determined from that final score. In some embodiments, when the intent score value is combined with the first rule, the speech data is determined to have an unclear intention when, for example, the speech data matches data in the preset unknown-intention database and the intent score value is less than the preset intent score threshold.
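Both combinations described above can be sketched briefly; the threshold and the third/fourth weights are assumed values, not values fixed by the patent:

```python
PRESET_INTENT_SCORE_THRESHOLD = 0.6  # assumed value
THIRD_WEIGHT = 0.5                   # assumed weight for the rule output
FOURTH_WEIGHT = 0.5                  # assumed weight for the intent score

def combine_by_threshold(rule_says_unknown: bool, intent_score: float) -> bool:
    """Unknown intention only if the rule fires AND the intent score is low."""
    return rule_says_unknown and intent_score < PRESET_INTENT_SCORE_THRESHOLD

def combine_by_weights(rule_output: float, intent_score: float) -> float:
    """Weighted final score; a downstream threshold then makes the decision."""
    return THIRD_WEIGHT * rule_output + FOURTH_WEIGHT * intent_score
```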
It should be noted that, in the embodiment of the present disclosure, the unknown-intention database may be obtained in combination with intention recognition technology. For example, intention recognition is performed on a large amount of voice data, low-scoring voice data is filtered out based on the intent score values, part-of-speech tagging is performed on that low-scoring data, and utterances containing entity words with clear meanings are filtered out; the remainder serves as mined data to form the preset unknown-intention database of the present disclosure.
Forming the unknown-intention database by offline mining combined with intention recognition removes the need to manually verify, item by item, whether each piece of voice data has a clear intention: intention recognition performs the preliminary filtering and part-of-speech tagging performs the secondary filtering. This greatly improves the efficiency of building the unknown-intention database and enhances the accuracy of intention recognition for short voice texts.
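The offline mining pipeline can be sketched as follows. intent_score and pos_tag are assumed helper functions, and the entity-word part-of-speech tags shown follow one common Chinese NLP convention; they are not tags named in the patent:

```python
ENTITY_POS_TAGS = {"nr", "ns", "nt"}  # person/place/organization (assumed tags)

def mine_unknown_intention_db(corpus, intent_score, pos_tag,
                              score_threshold=0.3):
    """Primary filter: keep utterances with low intent scores.
    Secondary filter: drop utterances containing clear-meaning entity words
    found by part-of-speech tagging. The survivors form the mined database
    (before manual screening)."""
    mined = set()
    for text in corpus:
        if intent_score(text) >= score_threshold:
            continue
        if any(tag in ENTITY_POS_TAGS for _word, tag in pos_tag(text)):
            continue
        mined.add(text)
    return mined
```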
In some possible embodiments, suppose the second rule is: when the confusion value determined by the first language model is less than or equal to the preset confusion threshold, at least the voice data is input to the second language model to obtain the confidence of nonsense speech. Then, when combining the intent score value with this second rule, the intent score value, the keyword information, the confusion value, and the voice data may be jointly input to the second language model (such as a model trained with a CNN network). In the second language model, for example, corresponding convolution kernels may be assigned according to the intent score value and/or the confusion value (that is, different intent score values and/or confusion values may correspond to different convolution kernels), and the assigned kernels are then used to convolve the voice data and the keyword information to obtain the confidence that the voice data is nonsense speech, from which it is determined whether the voice data is speech with unknown intention.
FIG. 4 is a diagram illustrating an example of a model structure for determining, through multi-feature fusion, whether the intention of speech data is clear; this structure is applicable to the second language model of the present disclosure. As shown in fig. 4, the input speech data is first converted into a text matrix; a convolution layer produces the convolution result, a pooling layer reduces its dimensionality, and a fully connected layer then combines it with the natural language processing features, such as the intent score, to obtain the final confidence that the speech data is nonsense speech.
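A hedged PyTorch sketch of the Fig. 4 structure (all layer sizes are illustrative assumptions; the patent only specifies convolution, pooling, and fusion of NLP features at the fully connected layer):

```python
import torch
import torch.nn as nn

class FusionTextCNN(nn.Module):
    """Text matrix -> convolution -> max pooling, then NLP features (e.g.,
    intent score, confusion value) concatenated before the fully connected
    layer, which outputs the confidence that the input is nonsense speech."""

    def __init__(self, vocab_size=5000, embed_dim=128, n_filters=64,
                 kernel_sizes=(2, 3, 4), n_extra_features=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes) + n_extra_features, 1)

    def forward(self, token_ids, extra_features):
        # token_ids: (batch, seq_len); extra_features: (batch, n_extra_features)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed, seq)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        fused = torch.cat(pooled + [extra_features], dim=1)  # FC-layer fusion
        return torch.sigmoid(self.fc(fused)).squeeze(-1)
```

A forward pass returns, per utterance, the confidence of nonsense speech; selecting different convolution kernels according to the intent score or confusion value, as described above, would be a further refinement of this baseline.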
It can be understood that combining the intent score value obtained through intention recognition with the first rule or the second rule according to the text length, that is, adopting multi-feature fusion, can greatly improve the accuracy of recognizing whether the intention is clear.
In some embodiments, the method further comprises:
and if the voice data is determined to be the voice with unknown intention, outputting a preset response reply.
In the embodiment of the disclosure, after the voice data is determined to be speech with unknown intention, a preset response reply is output, for example, "please say that again", so as to avoid giving an irrelevant answer when the user's intention cannot be fully understood because the audio is unclear, the audio is truncated, the content is meaningless, and the like. Giving a preset reply when the intention is unclear can improve the response success rate by guiding the user, and a unified reply gives users a consistent experience, making voice interaction better.
In the present disclosure, when the voice data is determined to be speech with a clear intention, a corresponding response may be output according to the content of the voice data.
Fig. 5 is a flowchart illustrating a speech recognition method according to an exemplary embodiment of the present disclosure, where as shown in fig. 5, the speech recognition method includes the following steps:
and S21, receiving input voice data.
And S22, performing natural language understanding processing on the voice data.
In this embodiment, performing natural language understanding processing on the voice data includes acquiring a text corresponding to the voice data, and performing intent recognition on the voice data to acquire an intent score value of the voice data.
S23, judging the length; if the short voice belongs to the short voice, executing the step S24; if the voice belongs to the long voice, step S26 is executed.
In this embodiment, judging the length means comparing the text length corresponding to the voice data with the preset length threshold: if the text length corresponding to the voice data is smaller than the preset length threshold, the voice data is determined to be short voice; if the text length is greater than or equal to the preset length threshold, the voice data is determined to be long voice.
And S24, mining and matching the voice data.
In the embodiment, the voice data is mined and matched, that is, the voice data is input into the database with unknown preset intention, and whether the voice data is matched with the data in the database with unknown preset intention is determined.
And S25, if the matching is successful, determining that the voice data is meaningless voice.
In this embodiment, if the voice data matches with data in the database with unknown intent, the voice data is determined to be nonsense voice, i.e., the voice data is determined to be speech with unknown intent.
And S26, if the matching is unsuccessful, inputting the voice data into the language model for recognition to obtain a confusion value of the voice data.
In this embodiment, if the matching between the voice data and the data in the database with unknown intent is unsuccessful, the voice data is input into the language model for recognition, that is, the voice data is input into the first language model, and the confusion value of the voice data is determined.
And S27, if the confusion value is high, determining the voice data as nonsense voice.
In this embodiment, after the confusion value is obtained, it is compared with the preset confusion threshold; if the confusion value is greater than the preset confusion threshold, it belongs to the high-confusion case, and the voice data is determined to be speech with unknown intention.
And S28, if the speech data belong to the low confusion value, inputting the speech data into a meaningless classification model for recognition to obtain the confidence coefficient of the speech data.
In this embodiment, a confusion value less than or equal to the preset confusion threshold belongs to the low-confusion case. Inputting the speech data into the nonsense classification model for recognition means inputting the speech data into the second language model and determining the confidence that the speech data is nonsense speech. It should be noted that the intent score value may also be input into the nonsense classification model together with the speech data to obtain the confidence of the speech data.
And S29, if the confidence level is high, determining that the voice data is meaningless voice.
In this embodiment, if the confidence level of the voice data is greater than the preset confidence level threshold, it is a high confidence level, and the voice data is determined to be an uninteresting voice.
And S30, if the confidence coefficient is low, determining that the voice data is meaningful voice.
In this embodiment, if the confidence level of the speech data is less than or equal to the preset confidence level threshold, it is a low confidence level, and the speech data is determined to be the intended speech.
It can be understood that the present disclosure combines multiple rules, layer by layer, according to the text length corresponding to the voice data to determine whether the voice data is speech with unknown intention, which can greatly improve the accuracy of determining whether the intention is clear.
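Tying steps S21 through S30 together, an end-to-end sketch (models, thresholds, and the mined database are all assumed stand-ins, not interfaces named in the patent):

```python
def recognize(text, intent_score, mined_db, ppl_model, nonsense_model,
              length_threshold=6, ppl_threshold=200.0, conf_threshold=0.5):
    """End-to-end flow of Fig. 5; every parameter value here is assumed."""
    if len(text) < length_threshold and text in mined_db:
        return "unknown intention"            # S23 -> S24 -> S25
    ppl = ppl_model(text)                     # S26: confusion value
    if ppl > ppl_threshold:
        return "unknown intention"            # S27: high confusion
    conf = nonsense_model(text, intent_score) # S28: nonsense classification
    if conf > conf_threshold:
        return "unknown intention"            # S29: high confidence
    return "clear intention"                  # S30
```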
FIG. 6 is a diagram illustrating a speech recognition device according to an example embodiment. Referring to fig. 6, the voice recognition apparatus includes:
a receiving module 201 configured to receive input voice data;
a first determining module 202, configured to determine whether a text length corresponding to the voice data is greater than a preset length threshold;
a second determining module 203, configured to determine whether the voice data is a speech with unknown intention according to a first rule if the text length corresponding to the voice data is smaller than the preset length threshold;
the third determining module 204 is configured to determine whether the voice data is a voice with unknown intention according to a second rule if the text length corresponding to the voice data is greater than or equal to the preset length threshold.
In some embodiments, the third determining module 204 is specifically configured to, if the text length is greater than or equal to the preset length threshold, input the voice data into a first language model and determine a confusion value of the voice data, and to determine whether the voice data is voice with unknown intention according to the confusion value.
In some embodiments, the third determining module 204 is specifically configured to determine that the voice data is voice with unknown intention if the confusion value is greater than a preset confusion threshold; or, if the confusion value is less than or equal to the preset confusion threshold, to input at least the voice data into a second language model and determine the confidence that the voice data is nonsense speech, and to determine whether the voice data is voice with unknown intention according to that confidence.
In some embodiments, the apparatus further comprises:
a first obtaining module 205, configured to obtain keyword information included in a text corresponding to the voice data;
the third determining module 204 is specifically configured to input the keyword information and/or the confusion value and the voice data into the second language model and determine the confidence level that the voice data is meaningless voice if the confusion value is less than or equal to a preset confusion threshold value.
In some embodiments, the second language model is a model trained using a CNN network.
In some embodiments, the first language model is a model trained using a BERT network.
In some embodiments, the second determining module 203 is specifically configured to, if the text length is smaller than the preset length threshold, input the voice data into a preset unknown-intention database, and determine whether the voice data matches with data in the preset unknown-intention database; and if the voice data is matched with the data in the preset unknown intention database, determining that the voice data is the voice with unknown intention.
In some embodiments, the apparatus further comprises:
a second obtaining module 206, configured to perform intention recognition on the voice data, and obtain an intention score value of the voice data; wherein the intent score value characterizes an intent clarity of the speech data;
the second determining module 203 is specifically configured to determine whether the voice data is a voice with unknown intention by combining the intention score value and the first rule if the text length is smaller than the preset length threshold;
the third determining module 204 is specifically configured to determine whether the voice data is a voice with unknown intention by combining the intention score and the second rule if the text length is greater than or equal to the preset length threshold.
In some embodiments, the apparatus further comprises:
the output module 207 is configured to output a preset response reply if it is determined that the voice data is a voice with unknown intention.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
Fig. 7 is a block diagram illustrating a terminal according to an example embodiment. For example, the terminal device 800 may be a smart speaker, a smart phone, or the like.
Referring to fig. 7, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, audio component 810 includes a Microphone (MIC) configured to receive external audio signals when apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800; the sensor assembly 814 may also detect a change in position of the apparatus 800 or of a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The device 800 may access a wireless network based on a communication standard, such as Wi-Fi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic elements for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Also provided is a non-transitory computer-readable storage medium having stored therein instructions that, when executed by a processor of a terminal, enable the terminal to perform a speech recognition method, the method comprising:
receiving input voice data;
determining whether the text length corresponding to the voice data is greater than a preset length threshold;
if the text length corresponding to the voice data is smaller than the preset length threshold, determining whether the voice data is a voice with unknown intention according to a first rule;
and if the text length corresponding to the voice data is greater than or equal to the preset length threshold, determining whether the voice data is a voice with unknown intention according to a second rule.
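For orientation only, the Python sketch below shows one way the decision flow described above could be wired together. It operates on the text recognized from the voice data (the speech-to-text step is assumed to happen upstream), and every threshold, the toy unknown-intention database, and the two stand-in model functions are invented for this example; none of these names or values come from the disclosure.

```python
# Illustrative sketch only. The thresholds, the toy database, and the
# stand-in model functions are assumptions for this example, not
# values or identifiers taken from the disclosure.

LENGTH_THRESHOLD = 6               # preset length threshold (assumed)
CONFUSION_THRESHOLD = 120.0        # preset confusion threshold (assumed)
NONSENSE_CONFIDENCE_CUTOFF = 0.5   # meaningless-voice confidence cutoff (assumed)

# Toy stand-in for the preset unknown-intention database.
UNKNOWN_INTENTION_DB = {"uh", "um", "hmm", "hey"}

def first_model_confusion(text: str) -> float:
    # Stand-in for the first language model (claim 6 suggests a BERT-based
    # model); a real system would return that model's confusion (perplexity)
    # value for `text`.
    return 40.0 if " " in text else 200.0

def second_model_nonsense_confidence(text: str) -> float:
    # Stand-in for the second language model (claim 5 suggests a CNN-based
    # model); a real system would return a confidence in [0, 1] that the
    # utterance is meaningless voice.
    return 0.1 if text.endswith("?") else 0.8

def is_unknown_intention(text: str) -> bool:
    """Dispatch on text length, then apply the first or the second rule."""
    if len(text) < LENGTH_THRESHOLD:
        # First rule: match short utterances against the preset database.
        return text in UNKNOWN_INTENTION_DB
    # Second rule: a high confusion value indicates unclear intention.
    if first_model_confusion(text) > CONFUSION_THRESHOLD:
        return True
    # Otherwise fall back on the meaningless-voice confidence.
    return second_model_nonsense_confidence(text) > NONSENSE_CONFIDENCE_CUTOFF

print(is_unknown_intention("hmm"))                         # True (database hit)
print(is_unknown_intention("what is the weather today?"))  # False (clear intent)
```

In a deployment of this kind, the function would run on the transcribed text, and a preset response reply would be output whenever it reports an unknown intention.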
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

1. A speech recognition method, comprising:
receiving input voice data;
determining whether the text length corresponding to the voice data is greater than a preset length threshold;
performing intention recognition on the voice data to obtain an intention score value of the voice data; wherein the intention score value characterizes an intention clarity of the voice data;
if the text length corresponding to the voice data is smaller than the preset length threshold, determining whether the voice data is a voice with unknown intention by combining the intention score value and a first rule;
and if the text length corresponding to the voice data is greater than or equal to the preset length threshold, determining whether the voice data is a voice with unknown intention by combining the intention score value and a second rule.
2. The method of claim 1, wherein, if the text length corresponding to the voice data is greater than or equal to the preset length threshold, determining whether the voice data is a voice with unknown intention according to the second rule comprises:
if the text length is larger than or equal to the preset length threshold value, inputting the voice data into a first language model, and determining a confusion value of the voice data;
and determining, according to the confusion value, whether the voice data is a voice with unknown intention.
3. The method of claim 2, wherein determining, according to the confusion value, whether the voice data is a voice with unknown intention comprises:
if the confusion value is greater than a preset confusion threshold, determining that the voice data is a voice with unknown intention;
or,
if the confusion value is smaller than or equal to the preset confusion threshold, inputting at least the voice data into a second language model, and determining a confidence that the voice data is meaningless voice;
and determining, according to the confidence, whether the voice data is a voice with unknown intention.
4. The method of claim 3, further comprising:
acquiring keyword information included in a text corresponding to the voice data;
wherein inputting at least the voice data into the second language model and determining the confidence that the voice data is meaningless voice, if the confusion value is smaller than or equal to the preset confusion threshold, comprises:
if the confusion value is smaller than or equal to the preset confusion threshold, inputting the keyword information and/or the confusion value, together with the voice data, into the second language model, and determining the confidence that the voice data is meaningless voice.
5. The method of claim 3, wherein the second language model is a model trained using a CNN network.
6. The method of claim 2, wherein the first language model is a model trained using a BERT network.
7. The method of claim 1, wherein, if the text length corresponding to the voice data is smaller than the preset length threshold, determining whether the voice data is a voice with unknown intention according to the first rule comprises:
if the text length is smaller than the preset length threshold, inputting the voice data into a preset unknown-intention database, and determining whether the voice data matches data in the preset unknown-intention database;
and if the voice data matches data in the preset unknown-intention database, determining that the voice data is a voice with unknown intention.
8. The method according to any one of claims 1 to 7, further comprising:
performing intention recognition on the voice data to obtain an intention score value of the voice data; wherein the intention score value characterizes an intention clarity of the voice data;
wherein, if the text length corresponding to the voice data is smaller than the preset length threshold, determining whether the voice data is a voice with unknown intention according to the first rule comprises:
if the text length is smaller than the preset length threshold, determining whether the voice data is a voice with unknown intention by combining the intention score value and the first rule;
and wherein, if the text length corresponding to the voice data is greater than or equal to the preset length threshold, determining whether the voice data is a voice with unknown intention according to the second rule comprises:
if the text length is greater than or equal to the preset length threshold, determining whether the voice data is a voice with unknown intention by combining the intention score value and the second rule.
9. The method of claim 1, further comprising:
and if the voice data is determined to be a voice with unknown intention, outputting a preset response reply.
10. A speech recognition apparatus, comprising:
a receiving module configured to receive input voice data;
a first judging module configured to determine whether the text length corresponding to the voice data is greater than a preset length threshold;
a second acquisition module configured to perform intention recognition on the voice data and acquire an intention score value of the voice data; wherein the intention score value characterizes an intention clarity of the voice data;
a second judging module configured to determine, by combining the intention score value and a first rule, whether the voice data is a voice with unknown intention if the text length corresponding to the voice data is smaller than the preset length threshold;
and a third judging module configured to determine, by combining the intention score value and a second rule, whether the voice data is a voice with unknown intention if the text length corresponding to the voice data is greater than or equal to the preset length threshold.
11. The apparatus of claim 10, wherein
the third judging module is specifically configured to: input the voice data into a first language model and determine a confusion value of the voice data if the text length is greater than or equal to the preset length threshold; and determine, according to the confusion value, whether the voice data is a voice with unknown intention.
12. The apparatus of claim 11, wherein
the third judging module is specifically configured to: determine that the voice data is a voice with unknown intention if the confusion value is greater than a preset confusion threshold; or, if the confusion value is smaller than or equal to the preset confusion threshold, input at least the voice data into a second language model and determine a confidence that the voice data is meaningless voice; and determine, according to the confidence, whether the voice data is a voice with unknown intention.
13. The apparatus of claim 12, further comprising:
a first acquisition module configured to acquire keyword information included in a text corresponding to the voice data;
wherein the third judging module is specifically configured to input the keyword information and/or the confusion value, together with the voice data, into the second language model if the confusion value is smaller than or equal to the preset confusion threshold, and to determine the confidence that the voice data is meaningless voice.
14. The apparatus of claim 13, wherein the second language model is a model trained using a CNN network.
15. The apparatus of claim 12, wherein the first language model is a model trained using a BERT network.
16. The apparatus of claim 10, wherein
the second judging module is specifically configured to: input the voice data into a preset unknown-intention database if the text length is smaller than the preset length threshold, and determine whether the voice data matches data in the preset unknown-intention database; and determine that the voice data is a voice with unknown intention if the voice data matches data in the preset unknown-intention database.
17. The apparatus of any one of claims 10 to 16, further comprising:
the second acquisition module is configured to perform intention recognition on the voice data and acquire an intention score value of the voice data; wherein the intention score value characterizes an intention clarity of the voice data;
the second judging module is specifically configured to determine, by combining the intention score value and the first rule, whether the voice data is a voice with unknown intention if the text length is smaller than the preset length threshold;
and the third judging module is specifically configured to determine, by combining the intention score value and the second rule, whether the voice data is a voice with unknown intention if the text length is greater than or equal to the preset length threshold.
18. The apparatus of claim 10, further comprising:
an output module configured to output a preset response reply if the voice data is determined to be a voice with unknown intention.
19. A speech recognition apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the speech recognition method of any one of claims 1 to 9.
20. A non-transitory computer-readable storage medium in which instructions, when executed by a processor of a computer, enable the computer to perform the speech recognition method of any of claims 1 to 9.
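As a reading aid for claims 2 to 5: if the claimed "confusion value" is interpreted as standard language-model perplexity, i.e. the exponential of the negative mean token log-probability, the comparison against a preset confusion threshold can be illustrated as in the sketch below. The token probabilities and the threshold value are invented for the example and are not part of the claimed subject matter.

```python
# Reading aid only: interprets the claimed "confusion value" as standard
# language-model perplexity. The token probabilities below are invented.

import math
from typing import Sequence

def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(-mean log-probability) over the tokens of the transcribed text."""
    if not token_log_probs:
        return float("inf")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

CONFUSION_THRESHOLD = 50.0  # preset confusion threshold (assumed value)

# A fluent sentence draws high token probabilities -> low perplexity;
# garbled input draws low probabilities -> high perplexity.
fluent = [math.log(p) for p in (0.40, 0.30, 0.50, 0.35)]
garbled = [math.log(p) for p in (0.01, 0.005, 0.02, 0.008)]

print(perplexity(fluent))   # ~2.6 -> below the threshold: pass to the second model
print(perplexity(garbled))  # ~106 -> above the threshold: voice with unknown intention
```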
CN202110041968.9A 2021-01-13 2021-01-13 Speech recognition method and device, storage medium Active CN112863499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110041968.9A CN112863499B (en) 2021-01-13 2021-01-13 Speech recognition method and device, storage medium

Publications (2)

Publication Number Publication Date
CN112863499A CN112863499A (en) 2021-05-28
CN112863499B true CN112863499B (en) 2023-01-24

Family

ID=76003248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110041968.9A Active CN112863499B (en) 2021-01-13 2021-01-13 Speech recognition method and device, storage medium

Country Status (1)

Country Link
CN (1) CN112863499B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048907B (en) * 2022-05-31 2024-02-27 北京深言科技有限责任公司 Text data quality determining method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783804A (en) * 2018-12-17 2019-05-21 北京百度网讯科技有限公司 Low-quality speech recognition methods, device, equipment and computer readable storage medium
CN109815314A (en) * 2019-01-04 2019-05-28 平安科技(深圳)有限公司 A kind of intension recognizing method, identification equipment and computer readable storage medium
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing
CN110162633A (en) * 2019-05-21 2019-08-23 深圳市珍爱云信息技术有限公司 Voice data is intended to determine method, apparatus, computer equipment and storage medium
CN110211571A (en) * 2019-04-26 2019-09-06 平安科技(深圳)有限公司 Wrong sentence detection method, device and computer readable storage medium
CN111382270A (en) * 2020-03-05 2020-07-07 中国平安人寿保险股份有限公司 Intention recognition method, device and equipment based on text classifier and storage medium
CN112100339A (en) * 2020-11-04 2020-12-18 北京淇瑀信息科技有限公司 User intention recognition method and device for intelligent voice robot and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9734826B2 (en) * 2015-03-11 2017-08-15 Microsoft Technology Licensing, Llc Token-level interpolation for class-based language models
CN111563208B (en) * 2019-01-29 2023-06-30 株式会社理光 Method and device for identifying intention and computer readable storage medium

Also Published As

Publication number Publication date
CN112863499A (en) 2021-05-28

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant