CN111292745A - Method and device for processing voice recognition result and electronic equipment


Info

Publication number
CN111292745A
Authority
CN
China
Prior art keywords
text
recognition result
target
voice recognition
vocabulary
Prior art date
Legal status
Granted
Application number
CN202010076388.9A
Other languages
Chinese (zh)
Other versions
CN111292745B (en)
Inventor
苏少炜
陈孝良
冯大航
常乐
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202010076388.9A
Publication of CN111292745A
Application granted
Publication of CN111292745B
Status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/08 - Speech classification or search
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1822 - Parsing for meaning understanding
    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 2015/221 - Announcement of recognition results
    • G10L 2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a method and an apparatus for processing a voice recognition result, and an electronic device. By optimizing the voice recognition result with scene information acquired from the display page of a target terminal, the method and apparatus adapt better to the user's current viewing scene, that is, they come closer to the user's actual need, and thereby further improve the parsing accuracy of the voice information.

Description

Method and device for processing voice recognition result and electronic equipment
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a method and an apparatus for processing a speech recognition result, and an electronic device.
Background
For screen-equipped interactive devices such as televisions, set-top boxes, tablet computers, and smart refrigerators, speech recognition results are often inaccurate because of device and environmental noise, speaker accents, and the like. Yet such devices frequently need to recognize a text menu item or program name on the screen accurately in order to perform the related interactive operation, for example, to trigger the on-screen key operation that plays a movie shown on the screen.
Therefore, the low semantic parsing accuracy caused by the limited recognition accuracy of existing voice recognition technology in real voice interaction scenes is a problem in urgent need of a solution.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for processing a voice recognition result, and an electronic device, so as to solve the problem of low semantic parsing accuracy caused by the limited recognition accuracy of existing voice recognition technology in real voice interaction scenes.
To solve the above technical problems, the invention adopts the following technical solutions:
a method for processing a speech recognition result comprises the following steps:
acquiring a voice recognition result of voice information input by a user;
acquiring scene information according to a display page of a target terminal, and acquiring a text vocabulary corresponding to the scene information;
and performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
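For orientation, the following minimal Python sketch walks through the three claimed steps end to end. Everything in it is an illustrative stand-in rather than the patented implementation: the voice recognition result arrives as plain text, the scene vocabulary is a toy dictionary, and difflib similarity stands in for the text similarity pipeline elaborated in the detailed description.

```python
# Illustrative sketch only: toy data and difflib stand in for the patented
# ASR front end, scene inference, and similarity pipeline.
import difflib

SCENE_VOCABULARY = {  # hypothetical scene -> text vocabulary mapping
    "children": ["Pleasant Goat and Big Big Wolf", "Peppa Pig", "Animal World"],
}

def process(asr_result: str, scene: str) -> str:
    vocabulary = SCENE_VOCABULARY.get(scene, [])  # step 2: scene vocabulary
    if not vocabulary:
        return asr_result                         # nothing to correct against
    # Step 3: pick the vocabulary item most similar to the recognition result.
    best = max(vocabulary,
               key=lambda w: difflib.SequenceMatcher(None, asr_result, w).ratio())
    return best                                   # target text vocabulary

print(process("play Pleasant Goat", "children"))  # -> "Pleasant Goat and Big Big Wolf"
```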
Optionally, obtaining a text vocabulary corresponding to the scene information includes:
acquiring first text content displayed on the display page;
acquiring second text content which is related to the scene information but is not displayed on the display page;
determining the first text content and the second text content as the text vocabulary.
Optionally, performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining that the target text vocabulary is the target speech recognition result includes:
extracting entity words in the voice recognition result, and determining a user operation intention corresponding to the voice recognition result;
matching the entity words with the first text content;
if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result;
if no match is found, matching the entity words with the second text content;
and if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result.
Optionally, performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining that the target text vocabulary is the target speech recognition result, further comprising:
if the entity words and the second text content cannot be matched, performing longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result;
screening out the longest common subsequence in the LCS calculation result, determining the text vocabulary corresponding to the longest common subsequence, and taking it as the text vocabulary to be analyzed;
performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result;
screening out the optimal result among the fuzzy matching results;
and taking the optimal result and the word corresponding to the user operation intention as the target voice recognition result.
Optionally, performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result, including:
and carrying out fuzzy matching by calculating the minimum Levensstein distance to obtain a fuzzy matching result.
Optionally, before performing the longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain the LCS calculation result, the method further includes:
converting the entity words and the text vocabulary into the corresponding Chinese pinyin.
A speech recognition result processing apparatus comprising:
the first data acquisition module is used for acquiring a voice recognition result of voice information input by a user;
the second data acquisition module is used for acquiring scene information according to a display page of the target terminal and acquiring text vocabularies corresponding to the scene information;
and the text processing module is used for performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
Optionally, the second data obtaining module includes:
the first obtaining sub-module is used for obtaining first text content displayed on the display page;
a second obtaining sub-module, configured to obtain second text content that is related to the scene information but is not displayed on the display page;
a determining submodule configured to determine the first text content and the second text content as the text vocabulary.
Optionally, the text processing module includes:
the data processing submodule is used for extracting entity words in the voice recognition result and determining a user operation intention corresponding to the voice recognition result;
the first matching submodule is used for matching the entity words with the first text content;
the result determining submodule is used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the first text content;
the second matching submodule is used for matching the entity words with the second text content if the entity words cannot be matched with the first text content;
and the result determining submodule is further used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the second text content.
An electronic device, comprising: a memory and a processor;
wherein the memory is used for storing programs;
the processor calls a program and is used to:
acquiring a voice recognition result of voice information input by a user;
acquiring scene information and acquiring text vocabularies corresponding to the scene information according to a display page of a target terminal;
and performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a method and a device for processing a voice recognition result and electronic equipment. According to the invention, the voice recognition result is optimized, and the scene information acquired according to the display page of the target terminal is introduced, so that the method and the device are more suitable for the watching scene of the user, namely, the method and the device are closer to the requirement of the user, and the analysis accuracy rate of the voice information is further improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flowchart of a method for processing a speech recognition result according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for processing speech recognition results according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech recognition result processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The embodiment of the invention provides a method for processing a voice recognition result: after voice information of a user is received, voice recognition is performed on the voice information to obtain a voice recognition result.
Referring to fig. 1, a method of processing a speech recognition result may include:
and S11, acquiring a voice recognition result of the voice information input by the user.
The usage scenario in this embodiment may be a screen-equipped interactive device such as a television, a set-top box, a tablet computer, or a smart refrigerator. The user speaks voice information to the device, for example, "I want to watch Xi Yangyang" (the children's cartoon also known as "Pleasant Goat"). The device receives the voice information and performs voice recognition on it; the recognition may use automatic speech recognition (ASR) technology, a weighted finite-state transducer (WFST) model, and the like, yielding a voice recognition result. The voice recognition result is the text corresponding to the voice information input by the user: if the user says "I want to watch Xi Yangyang", the voice recognition result is the text "I want to watch Xi Yangyang".
S12, acquiring scene information according to the display page of the target terminal and acquiring the text vocabulary corresponding to the scene information.
For example, when the correspondence between programs and scenes has been annotated, the corresponding scene is determined directly from the category shown on the television screen (such as variety shows or games). When that correspondence has not been annotated, the scene is inferred through intelligent learning over the text and context displayed on the screen, yielding scenes such as popular TV dramas, popular music, children's education, children's English, or health and wellness. One scene-learning method is to run LDA (Latent Dirichlet Allocation) topic clustering over the words appearing in the current scene and determine the scene category from the top-ranked clustering results: "Da Zhai Men" (a classic TV drama) indicates a film-and-television scene, and "Uncle Kai Tells Stories" indicates a children's education scene.
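As a sketch of this scene-learning step, the fragment below clusters on-screen words with scikit-learn's LatentDirichletAllocation and maps the dominant topic's top words to a scene label. The keyword table, function names, and the whitespace tokenization (a real Chinese pipeline would need a word segmenter) are all assumptions for illustration.

```python
# Hedged sketch: infer a scene label from on-screen text via LDA topic clustering.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

SCENE_KEYWORDS = {  # hypothetical top-word -> scene mapping
    "cartoon": "children's education",
    "drama": "popular TV drama",
    "song": "popular music",
}

def infer_scene(screen_texts, n_topics=2, top_k=5):
    """Cluster the words shown on screen and read a scene off the dominant topic."""
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(screen_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    dominant = lda.components_.sum(axis=1).argmax()        # topic with the most mass
    top_words = lda.components_[dominant].argsort()[::-1][:top_k]
    for idx in top_words:                                  # first keyword hit wins
        if vocab[idx] in SCENE_KEYWORDS:
            return SCENE_KEYWORDS[vocab[idx]]
    return "general"

# infer_scene(["popular cartoon episodes", "cartoon channel for kids"])
# -> "children's education"
```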
The first text content in the scene, i.e., the text vocabulary that appears directly on the display interface (for example, "My Talking Tom", "Pleasant Goat and Big Big Wolf", and "Animal World" shown on the screen), may also be called the high-priority incremental knowledge base. Related vocabulary that is not displayed on the screen in the current scene is collected in the system background and forms the second text content, which may also be called the sub-priority incremental knowledge base; this includes text vocabulary that may be relevant in the scene (such as "Peppa Pig", which is not displayed on the screen but is a very popular program in the children's-cartoon scene) as well as text vocabulary built into the system, such as "previous", "next", "history", "stop playing", and "open the first one". The first text content and the second text content together make up the text vocabulary, i.e., the full incremental knowledge base. The number of items in the first text content and in the second text content is not limited, and each may be plural.
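A minimal sketch of this two-tier knowledge base follows, with placeholder data; in practice the first tier would come from the rendered page and the second from scene catalogs and built-in commands.

```python
# Two-tier incremental knowledge base (illustrative placeholder data).
first_text_content = [   # high-priority: vocabulary visible on the screen
    "My Talking Tom",
    "Pleasant Goat and Big Big Wolf",
    "Animal World",
]
second_text_content = [  # sub-priority: scene-related but not displayed
    "Peppa Pig",                                   # popular in this scene
    "previous", "next", "history", "stop playing", "open the first one",
]
# Together the two tiers form the text vocabulary (the incremental knowledge base).
text_vocabulary = first_text_content + second_text_content
```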
S13, performing text similarity calculation between the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as the target voice recognition result.
In this step, the text vocabulary most similar to the voice recognition result is screened out from the text vocabulary, and the result is output as a set.
In this embodiment, after the user inputs voice information and the corresponding voice recognition result is obtained, scene information and the text vocabulary corresponding to the scene information are acquired according to the display page of the target terminal, text similarity calculation is performed between the voice recognition result and the text vocabulary to obtain a target text vocabulary, and the target text vocabulary is determined as the target voice recognition result. By optimizing the voice recognition result with scene information acquired from the display page of the target terminal, the method adapts better to the user's current viewing scene, that is, it comes closer to the user's actual need, and thereby further improves the parsing accuracy of the voice information.
In another embodiment of the present invention, the implementation of "performing text similarity calculation between the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as the target voice recognition result" is described in detail. Referring to fig. 2, the implementation process may include:
and S21, extracting entity words in the voice recognition result, and determining the user operation intention corresponding to the voice recognition result.
Specifically, named entity extraction and recognition are performed on the voice recognition result through word segmentation and a named entity recognition model to obtain entity words, where the entity words may be various named entities and other proper nouns, such as person names, place names, song titles, and organization names.
Then, intention recognition is performed on the voice recognition result through an intention recognition model to identify the user's operation intention. For example, if the voice recognition result is "play Xi Yangyang", the user operation intention is the "play" intention and the entity word is "Xi Yangyang".
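As a toy illustration of S21, the fragment below pulls an intention and an entity out of the recognition text with regular expressions. A production system would use word segmentation, a named entity recognition model, and an intention recognition model; the patterns and names here are assumptions only.

```python
# Toy rule-based stand-in for the NER + intention-recognition models of S21.
import re

INTENT_PATTERNS = {  # hypothetical intention -> trigger pattern
    "play": re.compile(r"^(?:play|i want to watch)\s+(?P<entity>.+)$", re.I),
    "open": re.compile(r"^open\s+(?P<entity>.+)$", re.I),
}

def extract_intent_and_entity(asr_text):
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.match(asr_text.strip())
        if match:
            return intent, match.group("entity")
    return None, None

# extract_intent_and_entity("play Xi Yangyang") -> ("play", "Xi Yangyang")
```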
S22, matching the entity words with the first text content: if a match is found, step S23 is executed; otherwise, step S24 is executed.
S23, taking the entity words and the word corresponding to the user operation intention as the final voice recognition result.
S24, matching the entity words with the second text content: if a match is found, step S23 is executed; otherwise, step S25 is executed.
Specifically, the entity words are looked up first in the high-priority incremental knowledge base (the first text content) and then in the sub-priority incremental knowledge base (the second text content). If the entity words are found in the first text content, the entity words and the word corresponding to the user operation intention are taken as the target voice recognition result. If not, the entity words are looked up in the second text content: if found there, the entity words and the word corresponding to the user operation intention are taken as the target voice recognition result; if still not found, step S25 is executed.
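This cascade amounts to an ordered exact-match lookup over the two tiers; a minimal sketch, with illustrative names:

```python
# Ordered exact-match cascade over the two knowledge bases (S22-S24).
def cascade_match(entity, intent_word, first_text_content, second_text_content):
    for knowledge_base in (first_text_content, second_text_content):
        if entity in knowledge_base:
            # Hit (S23): intent word plus matched entity is the target result.
            return f"{intent_word} {entity}"
    return None  # both tiers missed: fall through to LCS + fuzzy matching (S25+)
```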
S25, performing longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result.
S26, screening out the longest common subsequence in the LCS calculation result, determining the text vocabulary corresponding to the longest common subsequence, and taking it as the text vocabulary to be analyzed.
In practical applications, before the LCS calculation is performed, the entity words and the text vocabulary, i.e., the high-priority incremental knowledge base (the first text content) and the sub-priority incremental knowledge base (the second text content), are all converted into pinyin; for example, "Xi Yangyang" (喜羊羊) becomes "xiyangyang" and "Xue Bao" (雪暴, "Snow Storm") becomes "xuebao".
LCS calculation is then performed between the entity words and the text vocabulary in the knowledge base, and the matched text vocabulary is selected as the high-priority hit reference result.
For example, the user's query is "play Xi Yangyang". Word segmentation and entity recognition yield the play intention with the playing content "Xi Yangyang". LCS calculation is performed between "Xi Yangyang" and the vocabulary in the scene's high-priority incremental knowledge base, a hit on "Xi Yangyang and Hui Tailang" ("Pleasant Goat and Big Big Wolf") is found, and the user intention is updated to "play Xi Yangyang and Hui Tailang". Here, "Xi Yangyang and Hui Tailang" is the text vocabulary to be analyzed.
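A sketch of the pinyin conversion and LCS screening (S25 and S26). pypinyin's lazy_pinyin is one common way to romanize Chinese text, though the patent does not name a specific tool; the LCS length uses the standard dynamic-programming recurrence, and all function names are illustrative.

```python
# Pinyin-level LCS screening (S25-S26); pypinyin is an assumed tool choice.
from pypinyin import lazy_pinyin

def to_pinyin(text):
    return "".join(lazy_pinyin(text))  # e.g. "喜羊羊" -> "xiyangyang"

def lcs_length(a, b):
    """Classic O(len(a)*len(b)) longest-common-subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_screen(entity, text_vocabulary):
    """Keep the vocabulary items whose pinyin shares the longest LCS with the entity."""
    entity_py = to_pinyin(entity)
    scores = {word: lcs_length(entity_py, to_pinyin(word)) for word in text_vocabulary}
    best = max(scores.values())
    return [word for word, score in scores.items() if score == best]
```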
S27, performing fuzzy matching between the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result.
S28, screening out the optimal result among the fuzzy matching results.
S29, taking the optimal result and the word corresponding to the user operation intention as the target voice recognition result.
In practical applications, fuzzy matching can be performed by calculating the minimum Levenshtein distance to obtain the fuzzy matching result.
Specifically, fuzzy matching is performed by calculating the Levenshtein distance and similar measures, and the result with the minimum distance is selected as the reference result. The Levenshtein distance is a measure of the edit distance between two character strings; the distance between the first i characters of string a and the first j characters of string b is calculated as follows:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\[4pt]\min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\\operatorname{lev}_{a,b}(i,j-1)+1\\\operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{cases} & \text{otherwise.}\end{cases}$$
after the Levenstein distance is obtained through calculation, the minimum Levenstein distance is screened out, and a text vocabulary corresponding to the Levenstein distance, such as the terms corresponding to the operation intention of the user, such as the terms corresponding to the playing intention, are added, such as the terms corresponding to the playing intention are played. The final speech recognition result is "play jubilance and gray-tarry".
For example: assume the current scene is films and the user's query is "play Xue Bao"; the extracted intention is "play" and the play entity is "Xue Bao" (雪豹, "Snow Leopard"). Because the current scene is films and the pinyin is "xuebao", the entity matches "Xue Bao" (雪暴, "Snow Storm", identical pinyin) in the text of the incremental high-quality knowledge base, and the final action is "play Xue Bao (the film Snow Storm)". Note: Snow Storm (雪暴) is a popular 2019 film, and Snow Leopard (雪豹) is a hit 2010 TV drama; in an animal-sounds scene, "xuebao" would resolve to the cry of the snow leopard, and in a short-video scene, to short videos about snow leopards.
In this embodiment, further gains can be obtained with other fuzzy matching methods based on the scene knowledge base. Meanwhile, pre-storing and pre-fetching can further mitigate the problem of long program names: even when the user reads out only part of a program name, the related content and instructions can be prepared in advance, which improves the response speed and accuracy of human-computer interaction and makes the machine more intelligent.
The embodiment of the invention can solve the problem of a low voice recognition rate caused by homophones, near-homophones, real-scene noise interference, human misreading, and the like, as well as recognition failures when only part of a long program name is spoken. It greatly improves the accuracy of recognition and semantic understanding in screen-equipped scenes and greatly improves the user experience of human-computer interaction.
Optionally, on the basis of the above embodiment of the method for processing the speech recognition result, another embodiment of the present invention provides a device for processing the speech recognition result, and with reference to fig. 3, the device may include:
the first data acquisition module 11 is configured to acquire a voice recognition result of voice information input by a user;
the second data acquisition module 12 is configured to acquire scene information according to a display page of the target terminal and acquire a text vocabulary corresponding to the scene information;
and the text processing module 13 is configured to perform text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determine that the target text vocabulary is a target speech recognition result.
Further, the second data obtaining module includes:
the first obtaining sub-module is used for obtaining first text content displayed on the display page;
a second obtaining sub-module, configured to obtain second text content that is related to the scene information but is not displayed on the display page;
a determining submodule configured to determine the first text content and the second text content as the text vocabulary.
In this embodiment, after the user inputs voice information and the corresponding voice recognition result is obtained, scene information and the text vocabulary corresponding to the scene information are acquired according to the display page of the target terminal, text similarity calculation is performed between the voice recognition result and the text vocabulary to obtain a target text vocabulary, and the target text vocabulary is determined as the target voice recognition result. By optimizing the voice recognition result with scene information acquired from the display page of the target terminal, the apparatus adapts better to the user's current viewing scene, that is, it comes closer to the user's actual need, and thereby further improves the parsing accuracy of the voice information.
It should be noted that, for the working processes of each module and sub-module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.
Optionally, on the basis of the above embodiment of the speech recognition result processing device, the text processing module includes:
the data processing submodule is used for extracting entity words in the voice recognition result and determining a user operation intention corresponding to the voice recognition result;
the first matching submodule is used for matching the entity words with the first text content;
the result determining submodule is used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the first text content;
the second matching submodule is used for matching the entity words with the second text content if the entity words cannot be matched with the first text content;
and the result determining submodule is further used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the second text content.
Further, the text processing module further comprises:
the calculation submodule is used for performing longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result if the entity words and the second text content cannot be matched;
the first screening submodule is used for screening out the longest common subsequence in the LCS calculation result, determining the text vocabulary corresponding to the longest common subsequence, and taking it as the text vocabulary to be analyzed;
the matching submodule is used for performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result;
the second screening submodule is used for screening out the optimal result among the fuzzy matching results;
and the result determining submodule is further used for taking the optimal result and the word corresponding to the user operation intention as the target voice recognition result.
Further, when performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result, the matching submodule is specifically configured to:
perform fuzzy matching by calculating the minimum Levenshtein distance to obtain the fuzzy matching result.
Further, the apparatus further includes:
and the pinyin conversion submodule is used for converting the entity words and the text words into corresponding Chinese pinyin.
The embodiment of the invention can solve the problem of a low voice recognition rate caused by homophones, near-homophones, real-scene noise interference, human misreading, and the like, as well as recognition failures when only part of a long program name is spoken. It greatly improves the accuracy of recognition and semantic understanding in screen-equipped scenes and greatly improves the user experience of human-computer interaction.
It should be noted that, for the working processes of each module and sub-module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.
Optionally, on the basis of the above embodiment of the method and apparatus for processing a speech recognition result, another embodiment of the present invention provides an electronic device, including: a memory and a processor;
wherein the memory is used for storing programs;
the processor calls a program and is used to:
acquiring a voice recognition result of voice information input by a user;
acquiring scene information according to a display page of a target terminal, and acquiring a text vocabulary corresponding to the scene information;
and performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
Further, acquiring a text vocabulary corresponding to the scene information includes:
acquiring first text content displayed on the display page;
acquiring second text content which is related to the scene information but is not displayed on the display page;
determining the first text content and the second text content as the text vocabulary.
Further, performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target speech recognition result, including:
extracting entity words in the voice recognition result, and determining a user operation intention corresponding to the voice recognition result;
matching the entity words with the first text content;
if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result;
if no match is found, matching the entity words with the second text content;
and if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result.
Further, performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining that the target text vocabulary is a target speech recognition result, the method further includes:
if the entity words and the second text content cannot be matched, performing longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result;
screening out the longest common subsequence in the LCS calculation result, determining the text vocabulary corresponding to the longest common subsequence, and taking it as the text vocabulary to be analyzed;
performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result;
screening out the optimal result among the fuzzy matching results;
and taking the optimal result and the word corresponding to the user operation intention as the target voice recognition result.
Further, fuzzy matching is carried out on the entity words and the text vocabulary to be analyzed, so as to obtain fuzzy matching results, and the fuzzy matching results comprise:
and carrying out fuzzy matching by calculating the minimum Levensstein distance to obtain a fuzzy matching result.
Further, before performing the longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result, the method further includes:
converting the entity words and the text vocabulary into the corresponding Chinese pinyin.
In this embodiment, after the user inputs voice information and the corresponding voice recognition result is obtained, scene information and the text vocabulary corresponding to the scene information are acquired according to the display page of the target terminal, text similarity calculation is performed between the voice recognition result and the text vocabulary to obtain a target text vocabulary, and the target text vocabulary is determined as the target voice recognition result. By optimizing the voice recognition result with scene information acquired from the display page of the target terminal, the electronic device adapts better to the user's current viewing scene, that is, it comes closer to the user's actual need, and thereby further improves the parsing accuracy of the voice information.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for processing a speech recognition result, comprising:
acquiring a voice recognition result of voice information input by a user;
acquiring scene information according to a display page of a target terminal, and acquiring a text vocabulary corresponding to the scene information;
and performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
2. The processing method according to claim 1, wherein obtaining a text vocabulary corresponding to the scene information comprises:
acquiring first text content displayed on the display page;
acquiring second text content which is related to the scene information but is not displayed on the display page;
determining the first text content and the second text content as the text vocabulary.
3. The processing method of claim 2, wherein performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target speech recognition result comprises:
extracting entity words in the voice recognition result, and determining a user operation intention corresponding to the voice recognition result;
matching the entity words with the first text content;
if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result;
if no match is found, matching the entity words with the second text content;
and if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result.
4. The processing method according to claim 2, wherein performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target speech recognition result, further comprising:
if the entity words and the second text content cannot be matched, performing longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result;
screening out the longest common subsequence in the LCS calculation result, determining the text vocabulary corresponding to the longest common subsequence, and taking it as the text vocabulary to be analyzed;
performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result;
screening out the optimal result among the fuzzy matching results;
and taking the optimal result and the word corresponding to the user operation intention as the target voice recognition result.
5. The processing method according to claim 4, wherein performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result comprises:
and carrying out fuzzy matching by calculating the minimum Levensstein distance to obtain a fuzzy matching result.
6. The processing method of claim 4, wherein before performing the longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result, the method further comprises:
converting the entity words and the text vocabulary into the corresponding Chinese pinyin.
7. An apparatus for processing a speech recognition result, comprising:
the first data acquisition module is used for acquiring a voice recognition result of voice information input by a user;
the second data acquisition module is used for acquiring scene information according to a display page of the target terminal and acquiring text vocabularies corresponding to the scene information;
and the text processing module is used for performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
8. The processing apparatus as claimed in claim 7, wherein the second data acquisition module comprises:
the first obtaining sub-module is used for obtaining first text content displayed on the display page;
a second obtaining sub-module, configured to obtain second text content that is related to the scene information but is not displayed on the display page;
a determining submodule configured to determine the first text content and the second text content as the text vocabulary.
9. The processing apparatus according to claim 8, wherein the text processing module comprises:
the data processing submodule is used for extracting entity words in the voice recognition result and determining a user operation intention corresponding to the voice recognition result;
the first matching submodule is used for matching the entity words with the first text content;
the result determining submodule is used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the first text content;
the second matching submodule is used for matching the entity words with the second text content if the entity words cannot be matched with the first text content;
and the result determining submodule is further used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the second text content.
10. An electronic device, comprising: a memory and a processor;
wherein the memory is used for storing programs;
the processor is configured to call the program to:
acquiring a voice recognition result of voice information input by a user;
acquiring scene information according to a display page of a target terminal, and acquiring a text vocabulary corresponding to the scene information;
and performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
CN202010076388.9A 2020-01-23 2020-01-23 Method and device for processing voice recognition result and electronic equipment Active CN111292745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010076388.9A CN111292745B (en) 2020-01-23 2020-01-23 Method and device for processing voice recognition result and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010076388.9A CN111292745B (en) 2020-01-23 2020-01-23 Method and device for processing voice recognition result and electronic equipment

Publications (2)

Publication Number Publication Date
CN111292745A 2020-06-16
CN111292745B 2023-03-24

Family

ID=71029223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010076388.9A Active CN111292745B (en) 2020-01-23 2020-01-23 Method and device for processing voice recognition result and electronic equipment

Country Status (1)

Country Link
CN (1) CN111292745B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101422041A (en) * 2006-04-17 2009-04-29 微软公司 Internet search-based television
CN106649409A (en) * 2015-11-04 2017-05-10 陈包容 Method and apparatus for displaying search result based on scene information
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition
CN109036419A (en) * 2018-07-23 2018-12-18 努比亚技术有限公司 A kind of speech recognition match method, terminal and computer readable storage medium
CN109325233A (en) * 2018-09-27 2019-02-12 北京安云世纪科技有限公司 Global semantic understanding method, apparatus, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112908306A (en) * 2021-01-30 2021-06-04 云知声智能科技股份有限公司 Voice recognition method, device, terminal and storage medium for optimizing screen-on effect
CN112908337A (en) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 Method, device and equipment for displaying voice recognition text and storage medium
CN113539271A (en) * 2021-07-23 2021-10-22 北京梧桐车联科技有限责任公司 Speech recognition method, device, equipment and computer readable storage medium
CN115547337A (en) * 2022-11-25 2022-12-30 深圳市人马互动科技有限公司 Speech recognition method and related product
CN115547337B (en) * 2022-11-25 2023-03-03 深圳市人马互动科技有限公司 Speech recognition method and related product

Also Published As

Publication number Publication date
CN111292745B (en) 2023-03-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant