CN110097886B - Intention recognition method and device, storage medium and terminal - Google Patents

Intention recognition method and device, storage medium and terminal

Info

Publication number
CN110097886B
CN110097886B (application CN201910356912.5A)
Authority
CN
China
Prior art keywords
recognition result
intention
voice recognition
sentence
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910356912.5A
Other languages
Chinese (zh)
Other versions
CN110097886A (en)
Inventor
李杭泰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Xiaoai Robot Technology Co ltd
Original Assignee
Guizhou Xiaoai Robot Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Xiaoai Robot Technology Co ltd filed Critical Guizhou Xiaoai Robot Technology Co ltd
Priority to CN201910356912.5A priority Critical patent/CN110097886B/en
Publication of CN110097886A publication Critical patent/CN110097886A/en
Application granted granted Critical
Publication of CN110097886B publication Critical patent/CN110097886B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Abstract

An intention recognition method and apparatus, a storage medium, and a terminal are provided. The intention recognition method includes: performing initial intention recognition on a current speech recognition result of a user, the current speech recognition result being text data; when the initial intention recognition fails, determining the number of words contained in the current speech recognition result, a word being the minimum unit with semantics in the text data; when the number of words reaches a preset threshold, splitting the current speech recognition result to obtain M sentences, M being a positive integer greater than 1; performing intention recognition on the M sentences respectively to obtain N intentions, N being a positive integer less than or equal to M; and determining the intention of the current speech recognition result at least according to the N intentions. The technical solution can improve the accuracy of intention recognition.

Description

Intention recognition method and device, storage medium and terminal
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to an intention recognition method and apparatus, a storage medium, and a terminal.
Background
In the process of human-computer interaction by voice, the prior art performs speech recognition on the voice data input by a user using a speech engine, and directly uses the entire content obtained by speech recognition as the input of a semantic understanding engine to obtain the user's intention.
However, voice interaction is much more complex than direct text input. In a voice interaction scenario, the following situations exist: the content of a single voice interaction is too long (e.g., more than 20 words); the speech engine does not insert sentence breaks, especially for sentences with more than 20 characters; or the user speaks intermittently, so that the content picked up by the speech recognition engine in one pass does not form a complete sentence. In these three cases (a long sentence, a long sentence without sentence breaks, and content that does not form a complete sentence), the user's intention cannot be recognized, which degrades the user experience.
Disclosure of Invention
The invention solves the technical problem of how to improve the accuracy of intention identification.
In order to solve the above technical problem, an embodiment of the present invention provides an intention recognition method, including: performing initial intention recognition on a current speech recognition result of a user, the current speech recognition result being text data; when the initial intention recognition fails, determining the number of words contained in the current speech recognition result, a word being the minimum unit with semantics in the text data; when the number of words reaches a preset threshold, splitting the current speech recognition result to obtain M sentences, M being a positive integer greater than 1; performing intention recognition on the M sentences respectively to obtain N intentions, N being a positive integer less than or equal to M; and determining the intention of the current speech recognition result at least according to the N intentions.
Optionally, N is a positive integer greater than or equal to 2, and the determining the intention of the current speech recognition result at least according to the N intentions includes: calculating the importance of the sentences from which the N intentions were obtained; and selecting the intention of the sentence with the highest importance as the intention of the current speech recognition result.
Optionally, the calculating the importance of the sentences from which the N intentions were obtained includes: calculating the term frequency-inverse document frequency of each of the N sentences as the importance of that sentence.
Optionally, the determining the intention of the current speech recognition result at least according to the N intentions includes: determining the positions, in the current speech recognition result, of the sentences from which the N intentions were obtained; and selecting the intention of the last-positioned sentence as the intention of the current speech recognition result.
Optionally, the splitting the current speech recognition result includes: splitting the current speech recognition result using a preset regular expression.
Optionally, before splitting the current speech recognition result, the method further includes: judging whether the current speech recognition result has been broken into sentences with punctuation marks; and if not, performing sentence breaking on the current speech recognition result using a pre-trained sentence-breaking model.
Optionally, the intention recognition method further includes: when the number of words does not reach the preset threshold, judging whether the number of words contained in a previous speech recognition result, obtained before the current speech recognition result, reached the preset threshold and whether its intention recognition succeeded; if the number of words contained in the previous speech recognition result did not reach the preset threshold and its intention recognition failed, merging at least the current speech recognition result and the previous speech recognition result; and performing intention recognition using the merged speech recognition result.
Optionally, the merging at least the current speech recognition result and the previous speech recognition result includes: storing the current speech recognition result in a sentence list cache; and if the number of recognition results in the sentence list cache is greater than 1, merging all speech recognition results in the sentence list cache.
Optionally, the intention recognition method further includes: clearing the sentence list cache if the intention recognition of the merged speech recognition result succeeds; or clearing the sentence list cache if the number of words contained in the previous speech recognition result reached the preset threshold or its intention recognition succeeded.
Optionally, the performing intention recognition using the merged speech recognition result includes: calculating the smoothness of the merged speech recognition result; and if the smoothness reaches a preset threshold, performing intention recognition using the merged speech recognition result.
In order to solve the above technical problem, an embodiment of the present invention further discloses an intention recognition apparatus, including: an initial intention recognition module adapted to perform initial intention recognition on a current speech recognition result of a user, the current speech recognition result being text data; a word count determination module adapted to determine, when the initial intention recognition fails, the number of words contained in the current speech recognition result, a word being the minimum unit with semantics in the text data; a splitting module adapted to split the current speech recognition result when the number of words reaches a preset threshold, to obtain M sentences, M being a positive integer greater than 1; an intention recognition module adapted to perform intention recognition on the M sentences respectively to obtain N intentions, N being a positive integer less than or equal to M; and an intention determination module adapted to determine the intention of the current speech recognition result at least according to the N intentions.
The embodiment of the present invention further discloses a storage medium storing computer instructions which, when executed, perform the steps of the above intention recognition method.
The embodiment of the present invention further discloses a terminal including a memory and a processor, the memory storing computer instructions executable on the processor, the processor performing the steps of the above intention recognition method when executing the computer instructions.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
Because intention recognition tends to fail for text data whose number of words exceeds a preset threshold, the technical solution of the present invention splits the current speech recognition result according to the number of words it contains to obtain M sentences, performs intention recognition on the M sentences, and determines the intention of the current speech recognition result based on the N intentions of the M sentences. This avoids the situation in the prior art in which the intention of a long sentence (that is, a sentence whose number of words exceeds the preset threshold) cannot be recognized, improves the success rate of intention recognition for long sentences, and thereby improves the user's interactive experience.
Further, when determining the intention of the current speech recognition result according to the N intentions, the positions, in the current speech recognition result, of the sentences from which the N intentions were obtained may be determined, and the intention of the last-positioned sentence may be selected as the intention of the current speech recognition result. Considering that, by users' language expression habits, important sentences are usually placed later, selecting the intention of the last-positioned sentence when choosing the user's final intention from the N intentions helps ensure the accuracy of user intention recognition.
Further, when the number of words does not reach the preset threshold, it is judged whether the number of words contained in a previous speech recognition result, obtained before the current speech recognition result, reached the preset threshold and whether its intention recognition succeeded; if the number of words contained in the previous speech recognition result did not reach the preset threshold and its intention recognition failed, at least the current speech recognition result and the previous speech recognition result are merged, and intention recognition is performed using the merged speech recognition result. In the technical solution of the present invention, a current speech recognition result whose number of words does not reach the preset threshold and whose intention recognition failed can be merged with a previous speech recognition result whose number of words did not reach the preset threshold and whose intention recognition failed, and intention recognition can be performed again using the merged result, so as to improve the success rate of intention recognition for short sentences (that is, sentences whose number of words does not reach the preset threshold).
Drawings
FIG. 1 is a flow chart of an intent recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart of one embodiment of step S105 shown in FIG. 1;
FIG. 3 is a flowchart of another embodiment of step S105 shown in FIG. 1;
FIG. 4 is a partial flow diagram of an intent recognition method according to an embodiment of the present invention;
FIG. 5 is a flowchart of one embodiment of step S402 shown in FIG. 4;
FIG. 6 is a schematic structural diagram of an intention recognition apparatus according to an embodiment of the present invention.
Detailed Description
As described in the background art, in the above three cases (a long sentence, a long sentence without sentence breaks, and content that does not form a complete sentence), the user's intention cannot be recognized, which degrades the user experience.
Because intention recognition tends to fail for text data whose number of words exceeds a preset threshold, the technical solution of the present invention splits the current speech recognition result according to the number of words it contains to obtain M sentences, performs intention recognition on the M sentences, and determines the intention of the current speech recognition result based on the N intentions of the M sentences. This avoids the situation in the prior art in which the intention of a long sentence (that is, a sentence whose number of words exceeds the preset threshold) cannot be recognized, improves the success rate of intention recognition for long sentences, and thereby improves the user's interactive experience.
A long sentence in the embodiments of the present invention may refer to a sentence whose number of words reaches a preset threshold, for example, a sentence with more than 20 words.
A short sentence may refer to a sentence whose number of words does not reach the preset threshold, for example, a sentence with 20 or fewer words.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 1 is a flowchart of an intention identifying method according to an embodiment of the present invention.
The method shown in fig. 1 may be suitable for any scenario requiring voice interaction, such as reception halls, shopping malls, banks, airports, etc. Specifically, the steps of the method may be performed by a human-computer interaction device, such as a virtual robot, a physical robot, and the like.
The intent recognition method shown in fig. 1 may include the steps of:
step S101: performing initial intention recognition on a current voice recognition result of a user, wherein the current voice recognition result is text data;
step S102: when the initial intention recognition fails, determining the number of words contained in the current voice recognition result, wherein the words are the minimum units with semantics in the text data;
step S103: when the number of the characters reaches a preset threshold, splitting the current voice recognition result to obtain M sentences, wherein M is a positive integer greater than 1;
step S104: respectively carrying out intention identification on the M sentences to obtain N intentions, wherein N is a positive integer and is less than or equal to M;
step S105: determining an intention of the current speech recognition result at least according to the N intentions.
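For illustration, steps S101 through S105 can be sketched in Python. The helpers below (`recognize_intent`, a toy keyword engine that fails on long inputs, and `split_sentences`, a punctuation splitter) are simplified stand-ins assumed for this sketch, not the engines disclosed in the patent:

```python
import re

WORD_THRESHOLD = 20  # preset threshold on the number of words

def recognize_intent(text):
    # Toy semantic-understanding engine: fails on long inputs (the failure
    # mode the method addresses) and otherwise does a keyword lookup.
    if len(text.split()) >= WORD_THRESHOLD:
        return None
    for keyword, intent in [("railway station", "railway station route"),
                            ("weather", "weather")]:
        if keyword in text:
            return intent
    return None

def split_sentences(text):
    # Step S103: split the recognition result on punctuation into M sentences.
    return [s.strip() for s in re.split(r"[,.!?]", text) if s.strip()]

def identify(current_result):
    intent = recognize_intent(current_result)            # step S101
    if intent is not None:
        return intent
    if len(current_result.split()) >= WORD_THRESHOLD:    # steps S102-S103
        sentences = split_sentences(current_result)      # M sentences
        intents = [recognize_intent(s) for s in sentences]
        intents = [i for i in intents if i is not None]  # N intentions (S104)
        if intents:
            return intents[-1]  # step S105: pick the last-positioned intent
    return None
```

A short English utterance stands in for the Chinese examples given later in the description; with Chinese text, the word count would be a character count.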
It should be noted that the sequence numbers of the steps in this embodiment do not represent a limitation on the execution sequence of the steps.
Regarding the specific process of performing initial intention recognition on the current speech recognition result in step S101, reference may be made to related intention recognition algorithms in the prior art, which are not described in detail here.
After the intention of the current speech recognition result is obtained, an answer to the intention can be determined and fed back to the user. Specifically, a question matching the intention can be searched for in a knowledge base, and the answer to the matched question is used as the answer for the intention.
Those skilled in the art will appreciate that the process of obtaining the answer to a question using a knowledge base follows the prior art, and further description is omitted here.
In the prior art, intention recognition is performed only once on a user's speech recognition result, and no second attempt at intention recognition is made if it fails.
Unlike the prior art, in the case that the initial intention recognition fails, the embodiment of the present invention may perform intention recognition again on the current speech recognition result through steps S102 to S105 to obtain the intention of the current speech recognition result.
In a specific implementation of step S102, the number of words contained in the current speech recognition result may be determined, a word being the minimum unit with semantics in the text data. For example, when the language of the text data is Chinese, a word refers to a single Chinese character; when the language is English, a word refers to a single word; the same applies to other languages, which are not described here again.
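As a rough illustration of this per-language counting rule (the Han-character range and the whitespace rule below are assumptions; the patent does not prescribe a specific counting routine):

```python
import re

def count_words(text, lang):
    # Count the minimal semantic units in a speech recognition result.
    if lang == "zh":
        # In Chinese, each Han character is the minimal unit with semantics.
        return len(re.findall(r"[\u4e00-\u9fff]", text))
    # In English, each whitespace-delimited word is the minimal unit.
    return len(text.split())
```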
In a specific implementation of step S103, when the number of words reaches the preset threshold, the current speech recognition result is a long sentence. In order to obtain the intention of the long sentence accurately, the current speech recognition result may be split to obtain M sentences, where M is greater than or equal to 2.
Regarding the specific splitting manner, the splitting may be performed according to punctuation marks, which may specifically be periods, exclamation marks, question marks, commas, and the like; for example, a complete sentence lies between two periods. Alternatively, the splitting may be performed according to semantics: for example, the multiple semantic meanings of the current speech recognition result are obtained, and the part carrying a single semantic meaning is taken as one sentence.
In a specific embodiment of the present invention, a preset regular expression may be adopted to split the current speech recognition result.
It should be noted that, the specific form of the preset regular expression may refer to the prior art, and the embodiment of the present invention is not limited to this.
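A minimal sketch of such regular-expression splitting follows; since the actual preset regular expression is not disclosed, the pattern below (covering common Chinese and Western punctuation) is an assumption for illustration:

```python
import re

# Assumed stand-in for the undisclosed "preset regular expression":
# split on Chinese and Western sentence punctuation.
SPLIT_PATTERN = re.compile(r"[，。！？；,.!?;]+")

def split_result(text):
    # Split the current speech recognition result into M sentences,
    # discarding empty fragments left by trailing punctuation.
    return [s.strip() for s in SPLIT_PATTERN.split(text) if s.strip()]
```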
Further, in a specific implementation of step S104, intention recognition may be performed on the M sentences respectively. The number of words contained in each of the M sentences obtained after splitting is smaller than the preset threshold, that is, the M sentences are short sentences, and the success rate of intention recognition is higher for short sentences than for long sentences; N intentions can thus be obtained, where N is less than or equal to M.
Those skilled in the art will understand that different sentences may yield the same intention; in other words, at least two of the N intentions may be identical.
That is, when intention recognition succeeds for all M sentences, N equals M, in which case the M sentences and the N intentions may correspond one to one; if intention recognition fails for at least one of the M sentences, that is, only N sentences are recognized successfully, N is smaller than M, in which case only the N successfully recognized sentences and the N intentions correspond one to one.
Furthermore, since M sentences are split from the current speech recognition result, N intents obtained from the M sentences can represent the intention of the current speech recognition result. In a specific implementation of step S105, the intention of the current speech recognition result may be determined at least according to the N intentions.
According to the embodiment of the present invention, the current speech recognition result is split according to the number of words it contains to obtain M sentences, intention recognition is performed on the M sentences, and the intention of the current speech recognition result is determined based on the N intentions of the M sentences. This avoids the situation in the prior art in which the intention of a long sentence (that is, a sentence whose number of words exceeds the preset threshold) cannot be recognized, improves the success rate of intention recognition for long sentences, and thereby improves the user's interactive experience.
In one non-limiting embodiment of the invention, N is a positive integer greater than or equal to 2. Referring to fig. 2, step S105 shown in fig. 1 may include the following steps:
step S201: calculating the importance of the sentences from which the N intentions were obtained;
step S202: selecting the intention of the sentence with the highest importance as the intention of the current speech recognition result.
In this way, the intention of the current speech recognition result can be determined from the plurality of intentions, the number of intentions of the current speech recognition result being 1.
In this embodiment, the importance of a sentence may be its semantic importance, which represents how important the sentence is within the current speech recognition result. By obtaining the importance of the sentences from which the N intentions were obtained, the sentence with the highest importance, which best represents the current speech recognition result, can be selected; the intention of that sentence is then the intention of the current speech recognition result.
Further, step S201 shown in fig. 2 may include: calculating the term frequency-inverse document frequency of each of the N sentences as the importance of that sentence.
In the embodiment of the present invention, the importance of a sentence can be represented by its term frequency-inverse document frequency (TF-IDF).
In a specific implementation, when calculating the TF-IDF value of a sentence, the sentence may first be segmented into words; TF-IDF values are calculated for the words obtained after segmentation; a preset number of words with the largest TF-IDF values, for example the 3 largest, are selected; and the average of the TF-IDF values of the selected words is taken as the TF-IDF value of the sentence.
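The procedure above can be sketched as follows, treating each of the N sentences as a document; whitespace tokenisation stands in for real word segmentation, and the exact TF-IDF weighting scheme is an assumption since the patent does not specify one:

```python
import math
from collections import Counter

def sentence_importance(sentences, top_k=3):
    # For each sentence: compute a TF-IDF value for every word it contains,
    # keep the top_k values, and use their mean as the sentence's importance.
    docs = [s.split() for s in sentences]
    n = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        tfidf = [(tf[w] / len(d)) * math.log(n / doc_freq[w]) for w in tf]
        top = sorted(tfidf, reverse=True)[:top_k]
        scores.append(sum(top) / len(top) if top else 0.0)
    return scores
```

A sentence made only of words shared by every other sentence scores zero, while distinctive wording raises a sentence's importance.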
In another non-limiting embodiment of the invention, N is a positive integer greater than or equal to 2. Referring to fig. 3, step S105 shown in fig. 1 may include the following steps:
step S301: determining the positions, in the current speech recognition result, of the sentences from which the N intentions were obtained;
step S302: selecting the intention of the last-positioned sentence as the intention of the current speech recognition result.
Unlike the previous embodiment, when determining the intention of the current speech recognition result, this embodiment selects the intention of the sentence positioned last.
Considering that, by users' language expression habits, important sentences are usually placed later, selecting the intention of the last-positioned sentence when choosing the user's final intention from the N intentions ensures the accuracy of user intention recognition.
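A minimal sketch of this last-position rule, where each pair holds a sentence's position in the current speech recognition result and the intent recognised from it (the pair representation and names are illustrative):

```python
def pick_last_intent(positioned_intents):
    # positioned_intents: list of (position, intent) pairs for the N
    # successfully recognised sentences. The intent of the sentence
    # positioned last in the utterance wins (steps S301-S302).
    position, intent = max(positioned_intents, key=lambda pair: pair[0])
    return intent
```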
In one non-limiting embodiment of the present invention, before the splitting in step S103 shown in fig. 1, the method may further include the following steps: judging whether the current speech recognition result has been broken into sentences with punctuation marks; and if not, performing sentence breaking on the current speech recognition result using a pre-trained sentence-breaking model.
As described above, when the current speech recognition result is split, it is split according to punctuation marks, so it must be ensured that the current speech recognition result has been broken into sentences with punctuation marks.
When the current speech recognition result has not been punctuated, the embodiment of the present invention may use the sentence-breaking model to break it into sentences. In particular, the sentence-breaking model may determine the sentences in the current speech recognition result and add punctuation marks after each sentence.
In particular, the sentence-breaking model may be a Deep Neural Network (DNN) language model. The sentence-break model is a trained model.
It should be noted that sample data may be selected in advance to train the sentence break model, so that the trained sentence break model is obtained. The specific process of training the DNN language model may refer to the prior art, and the embodiment of the present invention is not limited thereto.
In a specific application scenario, the user's current speech recognition result is the unpunctuated text "hello the weather is really good today this is the first time I have come to Guiyang I am not familiar with it at all do you know how to get to the Guiyang railway station". After sentence breaking with the DNN language model, the following text data is obtained: "Hello, the weather is really good today. This is the first time I have come to Guiyang, and I am not familiar with it at all. Do you know how to get to the Guiyang railway station?" Since the number of words of the text data is greater than 20, the text data is split into 5 short sentences: "hello", "the weather is really good today", "this is the first time I have come to Guiyang", "I am not familiar with it at all", and "do you know how to get to the Guiyang railway station". Intention recognition is performed on the 5 short sentences respectively, and the intentions of the second and fifth short sentences can be determined to be "weather" and "railway station route", respectively. The intention of the fifth short sentence (that is, "railway station route") is selected as the intention of the current speech recognition result.
Further, the answer to the intention may also be looked up in the knowledge base, yielding the following answer: "Route to the Guiyang railway station: walk 300 meters to the International Ecological Conference Center (north) stop and take bus 60 to the railway station; or walk 350 meters to the Guizhou Financial City (south) stop, take bus 23, get off at the old sun gate stop, and transfer to bus 219 to the railway station." The answer is fed back to the user, and may be presented in voice or text form.
In a non-limiting embodiment of the present invention, referring to fig. 4, the method shown in fig. 1 may further include the following steps:
step S401: when the number of words does not reach the preset threshold, judging whether the number of words contained in a previous speech recognition result, obtained before the current speech recognition result, reached the preset threshold and whether its intention recognition succeeded;
step S402: if the number of words contained in the previous speech recognition result did not reach the preset threshold and its intention recognition failed, merging at least the current speech recognition result and the previous speech recognition result;
step S403: performing intention recognition using the merged speech recognition result.
In the embodiment of the present invention, if the number of words contained in the current speech recognition result does not reach the preset threshold, the current speech recognition result is a short sentence, and an intention recognition process for short sentences can be triggered.
In a specific implementation, it can be judged whether the previous speech recognition result, obtained before the current one, is a short sentence and whether its intention recognition succeeded. When the previous speech recognition result is a short sentence and its intention recognition failed, the current speech recognition result may be merged with it; specifically, the two short sentences may be combined into one sentence, and intention recognition may then be performed on the combined sentence.
In the embodiment of the present invention, a current speech recognition result whose number of words does not reach the preset threshold and whose intention recognition failed can be merged with a previous speech recognition result whose number of words did not reach the preset threshold and whose intention recognition failed, and intention recognition can be performed again using the merged result, so as to improve the success rate of intention recognition for short sentences (that is, sentences whose number of words does not reach the preset threshold).
In a specific application scenario, the user's speech recognition result at time 1 is short sentence 1, "Guiyang"; 30 seconds later, the user's speech recognition result is short sentence 2, "today"; and another 30 seconds later, the user's speech recognition result is short sentence 3, "weather". When intention recognition fails for short sentences 1, 2, and 3, the three short sentences may be merged into sentence 4, "Guiyang today weather", and intention recognition may be performed on sentence 4 to obtain the intention "query today's weather in Guiyang".
Further, for this intention, the answer may be obtained by calling a third-party application, such as a weather application: "The highest temperature in Guiyang today is 14 degrees and the lowest is 8 degrees, with rain showers; the air pollution diffusion conditions are good, and the weather is favorable for the diffusion of air pollutants." The answer is fed back to the user, and may be presented in voice or text form.
Referring to fig. 5, step S402 shown in fig. 4 may include the following steps:
step S501: storing the current voice recognition result to a sentence list for caching;
step S502: and if the number of the recognition results in the sentence list cache is more than 1, merging all the voice recognition results in the sentence list cache.
In this embodiment, a sentence list cache is set to store sentences (i.e., recognition results) which contain words whose number does not reach a preset threshold and are intended to be recognized unsuccessfully, such as a current speech recognition result and previous speech recognition results.
When the number of recognition results in the sentence list cache is greater than 1, a merging operation may be performed, that is, all the speech recognition results in the sentence list cache are merged into one sentence for subsequent intent recognition.
Further, the intention recognition method shown in fig. 4 may further include the following steps: emptying the sentence list cache if intention recognition on the merged speech recognition result succeeds; or emptying the sentence list cache if the word count of the previous speech recognition result reaches the preset threshold or its intention recognition succeeds.
In this embodiment, the sentences in the sentence list cache are all short sentences, kept there to be merged for intention recognition. Thus, if intention recognition on the merged speech recognition result succeeds, the intention recognition of the sentences in the sentence list cache has succeeded, and the cache can be cleared.
In addition, if the previous speech recognition result is a long sentence, it is not applicable to the short-sentence intention recognition process and needs to be removed; the recognition results remaining in the sentence list cache then become semantically discontinuous, so the cache can be emptied. Likewise, if intention recognition on the previous speech recognition result succeeded, it does not need to enter the short-sentence intention recognition process and also needs to be removed; again the cached recognition results become semantically discontinuous, so the cache can be emptied.
In a specific implementation, the maximum capacity of the sentence list cache can be configured, for example, 5 short sentences, that is, 5 speech recognition results. In this case, if the number of recognition results in the sentence list cache reaches 5 and intention recognition still fails, the sentence list cache is emptied so as to keep the response time of intention recognition bounded.
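For illustration only (not part of the claimed subject matter), the cache behavior described above can be sketched in Python as follows; all names are hypothetical, and `MAX_CACHE` corresponds to the configurable maximum capacity of 5 short sentences:

```python
# Sketch of the sentence list cache: short sentences whose intention
# recognition failed are buffered and merged; the cache is cleared on
# success or when the capacity limit is reached while still failing.

MAX_CACHE = 5

class SentenceListCache:
    """Buffers short sentences whose intention recognition failed."""

    def __init__(self):
        self.items = []

    def add(self, result):
        self.items.append(result)

    def merged(self):
        # Merge all buffered recognition results into one sentence.
        return "".join(self.items)

    def clear(self):
        self.items = []

def handle_short_sentence(cache, current, recognize_intent):
    """`recognize_intent` is a stand-in for the real intent recognizer;
    it returns an intention on success, or None on failure."""
    cache.add(current)
    if len(cache.items) > 1:
        intent = recognize_intent(cache.merged())
        if intent is not None:
            cache.clear()  # success: the buffered sentences are consumed
            return intent
        if len(cache.items) >= MAX_CACHE:
            cache.clear()  # capacity reached and still failing: drop context
    return None
```

With the scenario above, feeding "贵阳", "今天" and "天气" in turn fails twice and then succeeds on the merged sentence "贵阳今天天气", after which the cache is empty; clearing on a long sentence or on a successful stand-alone recognition would simply call `clear()`.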
Further, step S403 shown in fig. 4 may include the following steps: calculating the smoothness of the merged speech recognition result; and if the smoothness reaches a preset threshold, performing intention recognition using the merged speech recognition result.
In the embodiment of the invention, the smoothness of a speech recognition result represents its semantic coherence. The higher the smoothness of a speech recognition result, the higher the probability that it is a complete sentence, and the higher the success rate of intention recognition on it.
Therefore, in order to improve the success rate of intention recognition, intention recognition is performed with the merged speech recognition result only when its smoothness reaches the preset threshold, which indicates that the merged result is semantically coherent.
Specifically, the smoothness may be calculated using a DNN language model. A specific process for calculating the smoothness of the merged speech recognition result may be: starting from the first word of the merged speech recognition result, calculate, using the DNN language model, the probability of the second word following the first word; in the same way, calculate the probability of the third word following the second word; and so on, until all words in the merged speech recognition result have been traversed, and then take the product of all the probabilities as the smoothness.
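For illustration, the chained computation described above can be sketched as follows; `next_word_prob` is a hypothetical stand-in for the trained DNN language model and returns the probability of a word given the words before it:

```python
# Sketch of the smoothness computation: the product of the language
# model's probability for each word given all preceding words.

def smoothness(words, next_word_prob):
    score = 1.0
    for i in range(1, len(words)):
        # P(words[i] | words[0..i-1]) from the language model
        score *= next_word_prob(words[:i], words[i])
    return score
```

A fluent merged result yields a larger product than a semantically discontinuous one, so comparing this product with the preset threshold implements the gate described above. In practice log-probabilities would typically be summed instead, to avoid numeric underflow on longer inputs.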
Referring to fig. 6, an intention identifying apparatus 60 is further disclosed in the embodiments of the present invention, and the intention identifying apparatus 60 may include an initial intention identifying module 601, a word number determining module 602, a splitting module 603, an intention identifying module 604, and an intention determining module 605.
The initial intention recognition module 601 is adapted to perform initial intention recognition on a current speech recognition result of the user, where the current speech recognition result is text data; the word number determination module 602 is adapted to determine the number of words contained in the current speech recognition result when the initial intention recognition fails, a word being a minimum unit of the text data; the splitting module 603 is adapted to split the current speech recognition result when the number of the words reaches a preset threshold, so as to obtain M sentences, where M is a positive integer greater than 1; the intention identifying module 604 is adapted to perform intention identification on the M sentences, respectively, to obtain N intentions, where N is a positive integer and is less than or equal to M; the intent determination module 605 is adapted to determine the intent of the current speech recognition result from at least the N intents.
In particular implementations, the word count determination module 602 may determine the number of words contained in the current speech recognition result, where a word is the minimum unit with semantics in the text data. For example, when the language of the text data is Chinese, a word refers to a single Chinese character; when the language of the text data is English, a word refers to a single word.
The splitting module 603 may split according to punctuation marks, which may specifically be periods, exclamation marks, question marks, commas and the like; for example, a complete sentence is formed between two periods. Alternatively, the splitting may be performed according to semantics: for example, the multiple semantics of the current speech recognition result are obtained, and the portion carrying a single semantics is taken as one sentence.
In a specific embodiment of the present invention, the splitting module 603 may split the current speech recognition result by using a preset regular expression.
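As a sketch only, such punctuation-based splitting with a regular expression might look like the following; the pattern is an assumption for illustration and not the patent's actual preset regular expression:

```python
import re

# Split on Chinese and Western sentence-ending punctuation and commas;
# empty fragments left by trailing punctuation are discarded.
SPLIT_PATTERN = re.compile(r"[。！？，.!?,]+")

def split_sentences(text):
    return [s for s in SPLIT_PATTERN.split(text) if s.strip()]
```

For example, splitting "今天天气怎么样？帮我订机票。" yields the two short sentences "今天天气怎么样" and "帮我订机票", which can then be passed to intention recognition individually.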
The intention recognition module 604 may then perform intention recognition on the M sentences respectively. The word count of each of the M sentences obtained after splitting is smaller than the preset threshold, that is, the M sentences are short sentences, and the success rate of intention recognition on short sentences is higher than on long sentences; N intentions can thus be obtained, where N is less than or equal to M.
Since the M sentences are split from the current speech recognition result, the N intentions obtained by the M sentences can represent the intention of the current speech recognition result. The intent determination module 605 may determine the intent of the current speech recognition result from at least the N intents.
According to the embodiment of the invention, whether to split the current speech recognition result is determined according to the number of words it contains; M sentences are obtained by splitting, intention recognition is performed on the M sentences, and the intention of the current speech recognition result is determined based on the N intentions of the M sentences. This avoids the situation in the prior art in which the intention of a long sentence (i.e., a sentence whose word count reaches the preset threshold) cannot be recognized, improves the success rate of intention recognition for long sentences, and thereby improves the user's interactive experience.
In one non-limiting embodiment of the present invention, the intention determining module 605 may comprise: an importance calculating unit, adapted to calculate the importance of each sentence from which the N intentions are obtained; and a first intention selecting unit, adapted to select the intention of the sentence with the highest importance as the intention of the current speech recognition result.
In other words, the intention of the current speech recognition result can be determined from the plurality of intentions, and the number of intentions finally attributed to the current speech recognition result is 1.
In this embodiment, the importance of a sentence may be its semantic importance, representing how important the sentence is within the current speech recognition result. By calculating the importance of the sentences yielding the N intentions, the sentence with the highest importance, i.e., the one most representative of the current speech recognition result, can be selected; its intention is then taken as the intention of the current speech recognition result.
In a specific implementation, the importance calculating unit may calculate the term frequency-inverse document frequency (TF-IDF) of each of the N sentences as its importance.
Specifically, when calculating the TF-IDF value of a sentence, the importance calculating unit may first segment the sentence into words, calculate a TF-IDF value for each word obtained, select a preset number of words with the largest TF-IDF values, for example the 3 largest, compute the average of the selected words' TF-IDF values, and take this average as the TF-IDF value of the sentence.
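The averaging step can be sketched as follows; `tfidf_of` is a hypothetical lookup built from some document collection, and the whitespace tokenizer is a deliberate simplification (Chinese text would require a real word segmenter):

```python
# Sketch of the sentence-importance computation: the average of the
# top_k largest per-word TF-IDF values in the sentence.

def sentence_importance(sentence, tfidf_of, top_k=3):
    words = sentence.split()  # naive segmentation, for illustration only
    top = sorted((tfidf_of(w) for w in words), reverse=True)[:top_k]
    return sum(top) / len(top) if top else 0.0
```

The sentence with the highest such score is then taken as the most representative, and its intention becomes the intention of the current speech recognition result.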
In another non-limiting embodiment of the invention, N is a positive integer greater than or equal to 2. The intent determination module 605 may include: a position determination unit, adapted to determine the positions, in the current speech recognition result, of the sentences from which the N intentions are obtained; and a second intention selecting unit, adapted to select the intention of the sentence positioned furthest back as the intention of the current speech recognition result.
Unlike the previous embodiment, when determining the intention of the current speech recognition result, this embodiment of the present invention selects the intention of the sentence positioned furthest back.
Considering that, as a matter of language habit, users generally place the important sentence toward the end, selecting the intention of the last-positioned sentence among the N intentions helps ensure the accuracy of user intention recognition.
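The position-based selection can be sketched as follows (all names hypothetical); each candidate pairs a sentence that yielded an intention with that intention:

```python
# Sketch of position-based selection: of the sentences that produced an
# intention, pick the one whose position in the original text is last.

def select_last_intent(text, sentence_intents):
    """sentence_intents: list of (sentence, intent) pairs; returns the
    intent of the sentence positioned furthest back in `text`."""
    sentence, intent = max(sentence_intents, key=lambda p: text.rfind(p[0]))
    return intent
```

For example, for a recognition result containing both a music request and a later weather request, the weather intention would be selected as the final intention.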
In one non-limiting embodiment of the present invention, the intention recognition apparatus 60 may further include: a judging module, adapted to judge whether the current speech recognition result has been broken into sentences according to punctuation marks; and a sentence-breaking module, adapted to break the current speech recognition result into sentences using a pre-trained sentence-breaking model when the current speech recognition result has not been broken into sentences according to punctuation marks.
As described above, the current speech recognition result is split according to punctuation marks, so it must be ensured that the current speech recognition result has been broken into sentences by punctuation marks.
If the current speech recognition result is not punctuated, the embodiment of the invention can use the sentence-breaking model to break it into sentences. Specifically, the sentence-breaking model may determine each sentence in the current speech recognition result and append a punctuation mark after it.
In particular, the sentence-breaking model may be a Deep Neural Network (DNN) language model that has been trained in advance.
It should be noted that sample data may be selected in advance to train the sentence-breaking model, thereby obtaining the trained model. For the specific process of training a DNN language model, reference may be made to the prior art; the embodiment of the present invention is not limited in this respect.
In one non-limiting embodiment of the present invention, the intention recognition apparatus 60 may further include: a short-sentence judging module, adapted to judge, when the word count of the current speech recognition result does not reach the preset threshold, whether the word count of the previous speech recognition result before the current speech recognition result reaches the preset threshold and whether its intention recognition succeeded; a merging module, adapted to merge at least the current speech recognition result and the previous speech recognition result when the word count of the previous speech recognition result does not reach the preset threshold and its intention recognition failed; and a recognition module, adapted to perform intention recognition using the merged speech recognition result.
In the embodiment of the invention, if the word count of the current speech recognition result does not reach the preset threshold, the current speech recognition result is a short sentence, and the intention recognition process for short sentences can be triggered.
In a specific implementation, it can be determined whether the previous speech recognition result is a short sentence and whether its intention recognition failed. If the previous speech recognition result is a short sentence whose intention recognition failed, the current speech recognition result may be merged with it; specifically, the two short sentences may be combined into one sentence, and intention recognition may then be performed on the combined sentence.
In the embodiment of the invention, a current speech recognition result whose word count does not reach the preset threshold and whose intention recognition failed can be merged with a previous speech recognition result of the same kind, and intention recognition can be performed again on the merged result, so as to improve the success rate of intention recognition for short sentences (i.e., sentences whose word count does not reach the preset threshold).
Further, the merging module may include: a storage unit, adapted to store the current speech recognition result into a sentence list cache; and a merging unit, adapted to merge all the speech recognition results in the sentence list cache when the number of recognition results in the sentence list cache is greater than 1.
In this embodiment, a sentence list cache is set up to store sentences (i.e., recognition results) whose word count does not reach the preset threshold and whose intention recognition failed, such as the current speech recognition result and the previous speech recognition results.
When the number of recognition results in the sentence list cache is greater than 1, a merging operation may be performed, that is, all the speech recognition results in the sentence list cache are merged into one sentence for subsequent intention recognition.
Further, the intention recognition apparatus 60 shown in fig. 6 may further include a clearing module, adapted to empty the sentence list cache when intention recognition on the merged speech recognition result succeeds, or to empty the sentence list cache when the word count of the previous speech recognition result reaches the preset threshold or its intention recognition succeeds.
In this embodiment, the sentences in the sentence list cache are all short sentences, kept there to be merged for intention recognition. Thus, if intention recognition on the merged speech recognition result succeeds, the intention recognition of the sentences in the sentence list cache has succeeded, and the cache can be cleared.
In addition, if the previous speech recognition result is a long sentence, it is not applicable to the short-sentence intention recognition process and needs to be removed; the recognition results remaining in the sentence list cache then become semantically discontinuous, so the cache can be emptied. Likewise, if intention recognition on the previous speech recognition result succeeded, it does not need to enter the short-sentence intention recognition process and also needs to be removed; again the cached recognition results become semantically discontinuous, so the cache can be emptied.
In a specific implementation, the maximum capacity of the sentence list cache can be configured, for example, 5 short sentences, that is, 5 speech recognition results. In this case, if the number of recognition results in the sentence list cache reaches 5 and intention recognition still fails, the sentence list cache is emptied so as to keep the response time of intention recognition bounded.
In a particular embodiment, the recognition module may include: a smoothness calculation unit, adapted to calculate the smoothness of the merged speech recognition result; and a recognition unit, adapted to perform intention recognition using the merged speech recognition result when the smoothness reaches a preset threshold.
In the embodiment of the invention, the smoothness of a speech recognition result represents its semantic coherence. The higher the smoothness of a speech recognition result, the higher the probability that it is a complete sentence, and the higher the success rate of intention recognition on it.
Therefore, in order to improve the success rate of intention recognition, intention recognition is performed with the merged speech recognition result only when its smoothness reaches the preset threshold, which indicates that the merged result is semantically coherent.
Specifically, the smoothness may be calculated using a DNN language model. A specific process for calculating the smoothness of the merged speech recognition result may be: starting from the first word of the merged speech recognition result, calculate, using the DNN language model, the probability of the second word following the first word; in the same way, calculate the probability of the third word following the second word; and so on, until all words in the merged speech recognition result have been traversed, and then take the product of all the probabilities as the smoothness.
For more details on the operating principle and operating manner of the intention recognition apparatus 60, reference may be made to the related descriptions of figures 1 to 5, which are not repeated here.
The embodiment of the invention also discloses a storage medium having computer instructions stored thereon; when the computer instructions are executed, the steps of the methods shown in figures 1 to 5 may be performed. The storage medium may include ROM, RAM, magnetic disks or optical disks, etc. The storage medium may further include a non-volatile memory or a non-transitory memory, and the like.
The embodiment of the invention also discloses a terminal, which may comprise a memory and a processor, the memory storing computer instructions executable on the processor. When executing the computer instructions, the processor may perform the steps of the methods shown in figures 1 to 5. The terminal includes, but is not limited to, terminal devices such as mobile phones, computers and tablet computers.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (11)

1. An intent recognition method, comprising:
performing initial intention recognition on a current voice recognition result of a user, wherein the current voice recognition result is text data;
when the initial intention recognition fails, determining the number of words contained in the current voice recognition result, wherein the words are the minimum units with semantics in the text data;
when the number of the characters reaches a preset threshold, splitting the current voice recognition result to obtain M sentences, wherein M is a positive integer greater than 1;
respectively carrying out intention identification on the M sentences to obtain N intentions, wherein N is a positive integer and is less than or equal to M;
determining an intention of the current speech recognition result at least according to the N intentions;
when the number of the characters does not reach a preset threshold, judging whether the number of the characters contained in a previous voice recognition result before the current voice recognition result reaches the preset threshold and whether the intention recognition is successful; if the number of words contained in the previous voice recognition result does not reach a preset threshold and the intention recognition fails, storing the current voice recognition result into a sentence list cache, and if the number of the recognition results in the sentence list cache is greater than 1, merging all the voice recognition results in the sentence list cache;
and performing intention recognition by using the combined voice recognition result.
2. The method according to claim 1, wherein N is a positive integer greater than or equal to 2, and the determining the intention corresponding to the current speech recognition result according to at least the N intentions comprises:
calculating the importance of the sentences of the N intentions;
and selecting the intention of the sentence with the highest importance degree as the intention of the current voice recognition result.
3. The method according to claim 2, wherein the calculating the importance of the sentences of the N intentions comprises:
calculating the term frequency-inverse document frequency of each of the N sentences, respectively, as the importance of each of the N sentences.
4. The method according to claim 1, wherein the determining the intention corresponding to the current speech recognition result according to at least the N intentions comprises:
determining the positions, in the current speech recognition result, of the sentences from which the N intentions are obtained;
and selecting the intention of the sentence positioned furthest back as the intention of the current speech recognition result.
5. The intent recognition method of claim 1, wherein the splitting the current speech recognition result comprises:
and splitting the current voice recognition result by adopting a preset regular expression.
6. The intent recognition method according to claim 1, wherein the splitting of the current speech recognition result further comprises:
judging whether the current speech recognition result has been broken into sentences according to punctuation marks;
and if the current speech recognition result has not been broken into sentences, performing sentence-breaking on the current speech recognition result by using a pre-trained sentence-breaking model.
7. The intention recognition method according to claim 1, further comprising:
emptying the sentence list cache if the intention recognition of the merged speech recognition result is successful; or emptying the sentence list cache if the number of the words contained in the previous voice recognition result reaches a preset threshold or the intention recognition is successful.
8. The intent recognition method according to claim 1, wherein the performing intent recognition using the merged speech recognition result comprises:
calculating the smoothness of the combined voice recognition result;
and if the smoothness reaches a preset threshold value, performing intention recognition by using the combined voice recognition result.
9. An intention recognition apparatus, comprising:
the system comprises an initial intention identification module, a voice recognition module and a voice recognition module, wherein the initial intention identification module is suitable for carrying out initial intention identification on a current voice recognition result of a user, and the current voice recognition result is text data;
the word number determining module is suitable for determining the number of words contained in the current voice recognition result when the initial intention recognition fails, wherein the words are the minimum units with semantics in the text data;
the splitting module is suitable for splitting the current voice recognition result when the number of the characters reaches a preset threshold so as to obtain M sentences, wherein M is a positive integer greater than 1;
the intention identification module is suitable for respectively carrying out intention identification on the M sentences to obtain N intentions, wherein N is a positive integer and is less than or equal to M;
an intent determination module adapted to determine an intent of the current speech recognition result based at least on the N intents;
the phrase judgment module is used for judging whether the number of the characters contained in the previous voice recognition result before the current voice recognition result reaches a preset threshold and whether the intention recognition is successful or not when the number of the characters does not reach the preset threshold;
a merging module, configured to store the current speech recognition result in a sentence list cache when the number of words included in the previous speech recognition result does not reach a preset threshold and the intention recognition fails, and merge all speech recognition results in the sentence list cache when the number of recognition results in the sentence list cache is greater than 1;
and the recognition module is used for performing intention recognition by utilizing the combined voice recognition result.
10. A storage medium having stored thereon computer instructions, wherein said computer instructions when executed perform the steps of the intent recognition method of any of claims 1-8.
11. A terminal comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the intent recognition method of any of claims 1-8.
CN201910356912.5A 2019-04-29 2019-04-29 Intention recognition method and device, storage medium and terminal Active CN110097886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910356912.5A CN110097886B (en) 2019-04-29 2019-04-29 Intention recognition method and device, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN110097886A CN110097886A (en) 2019-08-06
CN110097886B true CN110097886B (en) 2021-09-10


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700767B (en) * 2019-10-21 2022-08-26 思必驰科技股份有限公司 Man-machine conversation interruption method and device
CN112017663B (en) * 2020-08-14 2024-04-30 博泰车联网(南京)有限公司 Voice generalization method and device and computer storage medium
CN112069786A (en) * 2020-08-25 2020-12-11 北京字节跳动网络技术有限公司 Text information processing method and device, electronic equipment and medium
CN112992151B (en) * 2021-03-15 2023-11-07 中国平安财产保险股份有限公司 Speech recognition method, system, device and readable storage medium
CN114238566A (en) * 2021-12-10 2022-03-25 零犀(北京)科技有限公司 Data enhancement method and device for voice or text data
CN115577092B (en) * 2022-12-09 2023-03-24 深圳市人马互动科技有限公司 User speech processing method and device, electronic equipment and computer storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539349B1 (en) * 2006-10-31 2013-09-17 Hewlett-Packard Development Company, L.P. Methods and systems for splitting a chinese character sequence into word segments
CN104391980A (en) * 2014-12-08 2015-03-04 百度在线网络技术(北京)有限公司 Song generating method and device
CN106874419A (en) * 2017-01-22 2017-06-20 北京航空航天大学 A kind of real-time focus polymerization of many granularities
CN106980686A (en) * 2017-03-31 2017-07-25 努比亚技术有限公司 The segmenting method and terminal of a kind of search term
CN107315737A (en) * 2017-07-04 2017-11-03 北京奇艺世纪科技有限公司 A kind of semantic logic processing method and system
CN108091328A (en) * 2017-11-20 2018-05-29 北京百度网讯科技有限公司 Speech recognition error correction method, device and readable medium based on artificial intelligence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7711545B2 (en) * 2003-07-02 2010-05-04 Language Weaver, Inc. Empirical methods for splitting compound words with application to machine translation



Similar Documents

Publication Publication Date Title
CN110097886B (en) Intention recognition method and device, storage medium and terminal
US11435898B2 (en) Modality learning on mobile devices
CN110765244B (en) Method, device, computer equipment and storage medium for obtaining answering operation
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
CN107240398B (en) Intelligent voice interaction method and device
CN107305575B (en) Sentence-break recognition method and device for a human-machine intelligent question-answering system
JP5901001B1 (en) Method and device for acoustic language model training
US11068519B2 (en) Conversation oriented machine-user interaction
CN110930980B (en) Acoustic recognition method and system for Chinese and English mixed voice
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN110415679B (en) Voice error correction method, device, equipment and storage medium
JP6677419B2 (en) Voice interaction method and apparatus
JP7213943B2 (en) Audio processing method, device, device and storage medium for in-vehicle equipment
JP2010518534A (en) Contextual input method
CN111428010A (en) Man-machine intelligent question and answer method and device
WO2014117553A1 (en) Method and system of adding punctuation and establishing language model
CN111727442A (en) Training sequence generation neural network using quality scores
CN112686051A (en) Semantic recognition model training method, recognition method, electronic device, and storage medium
CN113486170A (en) Natural language processing method, device, equipment and medium based on man-machine interaction
KR102102287B1 (en) Method for crowdsourcing data of chat model for chatbot
CN114399772A (en) Sample generation, model training and trajectory recognition methods, devices, equipment and medium
CN110020429B (en) Semantic recognition method and device
JP7096199B2 (en) Information processing equipment, information processing methods, and programs
CN115620726A (en) Voice text generation method, and training method and device of voice text generation model
CN112036135B (en) Text processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant