CN111292745A - Method and device for processing voice recognition result and electronic equipment


Info

Publication number
CN111292745A
Authority
CN
China
Prior art keywords
text
recognition result
target
voice recognition
vocabulary
Prior art date
Legal status
Granted
Application number
CN202010076388.9A
Other languages
Chinese (zh)
Other versions
CN111292745B (en)
Inventor
苏少炜
陈孝良
冯大航
常乐
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202010076388.9A
Publication of CN111292745A
Application granted
Publication of CN111292745B
Status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/08 - Speech classification or search
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1822 - Parsing for meaning understanding
    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 2015/221 - Announcement of recognition results
    • G10L 2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a method and an apparatus for processing a voice recognition result, and an electronic device. By optimizing the voice recognition result with scene information acquired from the display page of a target terminal, the method and apparatus adapt better to the user's current viewing scene, that is, they come closer to the user's actual need, and thereby further improve the parsing accuracy of the voice information.

Description

Method and device for processing voice recognition result and electronic equipment
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a method and an apparatus for processing a speech recognition result, and an electronic device.
Background
For screen-equipped interactive devices such as televisions, set-top boxes, tablet computers, and smart refrigerators, speech recognition results are often inaccurate because of device and environmental noise, speaker accents, and the like. Yet such devices frequently need to recognize a text menu item or program name on the screen accurately in order to perform the related interactive operation, for example, to trigger the on-screen key operation that plays a movie shown on the screen.
Therefore, the low semantic parsing accuracy caused by the limited recognition accuracy of existing voice recognition technology in real voice interaction scenes is a problem in urgent need of a solution.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for processing a voice recognition result, and an electronic device, so as to solve the problem of low semantic parsing accuracy caused by the limited recognition accuracy of existing voice recognition technology in real voice interaction scenes.
To solve the above technical problems, the invention adopts the following technical solutions:
a method for processing a speech recognition result comprises the following steps:
acquiring a voice recognition result of voice information input by a user;
acquiring scene information according to a display page of a target terminal, and acquiring a text vocabulary corresponding to the scene information;
and performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
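For orientation, the following minimal Python sketch walks through the three claimed steps end to end. Everything in it is an illustrative stand-in rather than the patented implementation: the voice recognition result arrives as plain text, the scene vocabulary is a toy dictionary, and difflib similarity stands in for the text similarity pipeline elaborated in the detailed description.

```python
# Illustrative sketch only: toy data and difflib stand in for the patented
# ASR front end, scene inference, and similarity pipeline.
import difflib

SCENE_VOCABULARY = {  # hypothetical scene -> text vocabulary mapping
    "children": ["Pleasant Goat and Big Big Wolf", "Peppa Pig", "Animal World"],
}

def process(asr_result: str, scene: str) -> str:
    vocabulary = SCENE_VOCABULARY.get(scene, [])  # step 2: scene vocabulary
    if not vocabulary:
        return asr_result                         # nothing to correct against
    # Step 3: pick the vocabulary item most similar to the recognition result.
    best = max(vocabulary,
               key=lambda w: difflib.SequenceMatcher(None, asr_result, w).ratio())
    return best                                   # target text vocabulary

print(process("play Pleasant Goat", "children"))  # -> "Pleasant Goat and Big Big Wolf"
```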
Optionally, obtaining a text vocabulary corresponding to the scene information includes:
acquiring first text content displayed on the display page;
acquiring second text content which is related to the scene information but is not displayed on the display page;
determining the first text content and the second text content as the text vocabulary.
Optionally, performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining that the target text vocabulary is the target speech recognition result includes:
extracting entity words in the voice recognition result, and determining a user operation intention corresponding to the voice recognition result;
matching the entity words with the first text content;
if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result;
if no match is found, matching the entity words with the second text content;
and if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result.
Optionally, performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining that the target text vocabulary is the target speech recognition result, further comprising:
if the entity words and the second text content cannot be matched, performing longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result;
screening out the longest common subsequence in the LCS calculation result, determining the text vocabulary corresponding to the longest common subsequence, and taking it as the text vocabulary to be analyzed;
performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result;
screening out the optimal result among the fuzzy matching results;
and taking the optimal result and the word corresponding to the user operation intention as the target voice recognition result.
Optionally, performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result, including:
and carrying out fuzzy matching by calculating the minimum Levensstein distance to obtain a fuzzy matching result.
Optionally, before performing the longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain the LCS calculation result, the method further includes:
converting the entity words and the text vocabulary into the corresponding Chinese pinyin.
A speech recognition result processing apparatus comprising:
the first data acquisition module is used for acquiring a voice recognition result of voice information input by a user;
the second data acquisition module is used for acquiring scene information according to a display page of the target terminal and acquiring text vocabularies corresponding to the scene information;
and the text processing module is used for performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
Optionally, the second data obtaining module includes:
the first obtaining sub-module is used for obtaining first text content displayed on the display page;
a second obtaining sub-module, configured to obtain second text content that is related to the scene information but is not displayed on the display page;
a determining submodule configured to determine the first text content and the second text content as the text vocabulary.
Optionally, the text processing module includes:
the data processing submodule is used for extracting entity words in the voice recognition result and determining a user operation intention corresponding to the voice recognition result;
the first matching submodule is used for matching the entity words with the first text content;
the result determining submodule is used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the first text content;
the second matching submodule is used for matching the entity words with the second text content if the entity words cannot be matched with the first text content;
and the result determining submodule is further used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the second text content.
An electronic device, comprising: a memory and a processor;
wherein the memory is used for storing programs;
the processor calls a program and is used to:
acquiring a voice recognition result of voice information input by a user;
acquiring scene information and acquiring text vocabularies corresponding to the scene information according to a display page of a target terminal;
and performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a method and a device for processing a voice recognition result and electronic equipment. According to the invention, the voice recognition result is optimized, and the scene information acquired according to the display page of the target terminal is introduced, so that the method and the device are more suitable for the watching scene of the user, namely, the method and the device are closer to the requirement of the user, and the analysis accuracy rate of the voice information is further improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flowchart of a method for processing a speech recognition result according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for processing speech recognition results according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech recognition result processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The embodiment of the invention provides a method for processing a voice recognition result: after voice information of a user is received, voice recognition is performed on the voice information to obtain a voice recognition result.
Referring to fig. 1, a method of processing a speech recognition result may include:
and S11, acquiring a voice recognition result of the voice information input by the user.
The usage scenario in this embodiment may be a screen-equipped interactive device such as a television, a set-top box, a tablet computer, or a smart refrigerator. The user speaks voice information to the device, for example, "I want to watch Xi Yangyang" (the children's cartoon also known as "Pleasant Goat"). The device receives the voice information and performs voice recognition on it; the recognition may use automatic speech recognition (ASR) technology, a weighted finite-state transducer (WFST) model, and the like, yielding a voice recognition result. The voice recognition result is the text corresponding to the voice information input by the user: if the user says "I want to watch Xi Yangyang", the voice recognition result is the text "I want to watch Xi Yangyang".
S12, acquiring scene information according to the display page of the target terminal and acquiring the text vocabulary corresponding to the scene information.
For example, when the correspondence between programs and scenes has been annotated, the corresponding scene is determined directly from the category shown on the television screen (such as variety shows or games). When that correspondence has not been annotated, the scene is inferred through intelligent learning over the text and context displayed on the screen, yielding scenes such as popular TV dramas, popular music, children's education, children's English, or health and wellness. One scene-learning method is to run LDA (Latent Dirichlet Allocation) topic clustering over the words appearing in the current scene and determine the scene category from the top-ranked clustering results: "Da Zhai Men" (a classic TV drama) indicates a film-and-television scene, and "Uncle Kai Tells Stories" indicates a children's education scene.
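As a sketch of this scene-learning step, the fragment below clusters on-screen words with scikit-learn's LatentDirichletAllocation and maps the dominant topic's top words to a scene label. The keyword table, function names, and the whitespace tokenization (a real Chinese pipeline would need a word segmenter) are all assumptions for illustration.

```python
# Hedged sketch: infer a scene label from on-screen text via LDA topic clustering.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

SCENE_KEYWORDS = {  # hypothetical top-word -> scene mapping
    "cartoon": "children's education",
    "drama": "popular TV drama",
    "song": "popular music",
}

def infer_scene(screen_texts, n_topics=2, top_k=5):
    """Cluster the words shown on screen and read a scene off the dominant topic."""
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(screen_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    dominant = lda.components_.sum(axis=1).argmax()        # topic with the most mass
    top_words = lda.components_[dominant].argsort()[::-1][:top_k]
    for idx in top_words:                                  # first keyword hit wins
        if vocab[idx] in SCENE_KEYWORDS:
            return SCENE_KEYWORDS[vocab[idx]]
    return "general"

# infer_scene(["popular cartoon episodes", "cartoon channel for kids"])
# -> "children's education"
```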
The first text content in the scene, i.e., the text vocabulary that appears directly on the display interface (for example, "My Talking Tom", "Pleasant Goat and Big Big Wolf", and "Animal World" shown on the screen), may also be called the high-priority incremental knowledge base. Related vocabulary that is not displayed on the screen in the current scene is collected in the system background and forms the second text content, which may also be called the sub-priority incremental knowledge base; this includes text vocabulary that may be relevant in the scene (such as "Peppa Pig", which is not displayed on the screen but is a very popular program in the children's-cartoon scene) as well as text vocabulary built into the system, such as "previous", "next", "history", "stop playing", and "open the first one". The first text content and the second text content together make up the text vocabulary, i.e., the full incremental knowledge base. The number of items in the first text content and in the second text content is not limited, and each may be plural.
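A minimal sketch of this two-tier knowledge base follows, with placeholder data; in practice the first tier would come from the rendered page and the second from scene catalogs and built-in commands.

```python
# Two-tier incremental knowledge base (illustrative placeholder data).
first_text_content = [   # high-priority: vocabulary visible on the screen
    "My Talking Tom",
    "Pleasant Goat and Big Big Wolf",
    "Animal World",
]
second_text_content = [  # sub-priority: scene-related but not displayed
    "Peppa Pig",                                   # popular in this scene
    "previous", "next", "history", "stop playing", "open the first one",
]
# Together the two tiers form the text vocabulary (the incremental knowledge base).
text_vocabulary = first_text_content + second_text_content
```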
S13, performing text similarity calculation between the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as the target voice recognition result.
In this step, the text vocabulary most similar to the voice recognition result is screened out from the text vocabulary, and the result is output as a set.
In this embodiment, after the user inputs voice information and the corresponding voice recognition result is obtained, scene information and the text vocabulary corresponding to the scene information are acquired according to the display page of the target terminal, text similarity calculation is performed between the voice recognition result and the text vocabulary to obtain a target text vocabulary, and the target text vocabulary is determined as the target voice recognition result. By optimizing the voice recognition result with scene information acquired from the display page of the target terminal, the method adapts better to the user's current viewing scene, that is, it comes closer to the user's actual need, and thereby further improves the parsing accuracy of the voice information.
In another embodiment of the present invention, the implementation of "performing text similarity calculation between the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as the target voice recognition result" is described in detail. Referring to fig. 2, the implementation process may include:
and S21, extracting entity words in the voice recognition result, and determining the user operation intention corresponding to the voice recognition result.
Specifically, named entity extraction and recognition are performed on the voice recognition result through word segmentation and a named entity recognition model to obtain entity words, where the entity words may be various named entities and other proper nouns, such as person names, place names, song titles, and organization names.
Then, intention recognition is performed on the voice recognition result through an intention recognition model to identify the user's operation intention. For example, if the voice recognition result is "play Xi Yangyang", the user operation intention is the "play" intention and the entity word is "Xi Yangyang".
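As a toy illustration of S21, the fragment below pulls an intention and an entity out of the recognition text with regular expressions. A production system would use word segmentation, a named entity recognition model, and an intention recognition model; the patterns and names here are assumptions only.

```python
# Toy rule-based stand-in for the NER + intention-recognition models of S21.
import re

INTENT_PATTERNS = {  # hypothetical intention -> trigger pattern
    "play": re.compile(r"^(?:play|i want to watch)\s+(?P<entity>.+)$", re.I),
    "open": re.compile(r"^open\s+(?P<entity>.+)$", re.I),
}

def extract_intent_and_entity(asr_text):
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.match(asr_text.strip())
        if match:
            return intent, match.group("entity")
    return None, None

# extract_intent_and_entity("play Xi Yangyang") -> ("play", "Xi Yangyang")
```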
S22, matching the entity words with the first text content: if a match is found, step S23 is executed; otherwise, step S24 is executed.
S23, taking the entity words and the word corresponding to the user operation intention as the final voice recognition result.
S24, matching the entity words with the second text content: if a match is found, step S23 is executed; otherwise, step S25 is executed.
Specifically, the entity words are looked up first in the high-priority incremental knowledge base (the first text content) and then in the sub-priority incremental knowledge base (the second text content). If the entity words are found in the first text content, the entity words and the word corresponding to the user operation intention are taken as the target voice recognition result. If not, the entity words are looked up in the second text content: if found there, the entity words and the word corresponding to the user operation intention are taken as the target voice recognition result; if still not found, step S25 is executed.
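This cascade amounts to an ordered exact-match lookup over the two tiers; a minimal sketch, with illustrative names:

```python
# Ordered exact-match cascade over the two knowledge bases (S22-S24).
def cascade_match(entity, intent_word, first_text_content, second_text_content):
    for knowledge_base in (first_text_content, second_text_content):
        if entity in knowledge_base:
            # Hit (S23): intent word plus matched entity is the target result.
            return f"{intent_word} {entity}"
    return None  # both tiers missed: fall through to LCS + fuzzy matching (S25+)
```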
S25, performing longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result.
S26, screening out the longest common subsequence in the LCS calculation result, determining the text vocabulary corresponding to the longest common subsequence, and taking it as the text vocabulary to be analyzed.
In practical applications, before the LCS calculation is performed, the entity words and the text vocabulary, i.e., the high-priority incremental knowledge base (the first text content) and the sub-priority incremental knowledge base (the second text content), are all converted into pinyin; for example, "Xi Yangyang" (喜羊羊) becomes "xiyangyang" and "Xue Bao" (雪暴, "Snow Storm") becomes "xuebao".
LCS calculation is then performed between the entity words and the text vocabulary in the knowledge base, and the matched text vocabulary is selected as the high-priority hit reference result.
For example, the user's query is "play Xi Yangyang". Word segmentation and entity recognition yield the play intention with the playing content "Xi Yangyang". LCS calculation is performed between "Xi Yangyang" and the vocabulary in the scene's high-priority incremental knowledge base, a hit on "Xi Yangyang and Hui Tailang" ("Pleasant Goat and Big Big Wolf") is found, and the user intention is updated to "play Xi Yangyang and Hui Tailang". Here, "Xi Yangyang and Hui Tailang" is the text vocabulary to be analyzed.
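A sketch of the pinyin conversion and LCS screening (S25 and S26). pypinyin's lazy_pinyin is one common way to romanize Chinese text, though the patent does not name a specific tool; the LCS length uses the standard dynamic-programming recurrence, and all function names are illustrative.

```python
# Pinyin-level LCS screening (S25-S26); pypinyin is an assumed tool choice.
from pypinyin import lazy_pinyin

def to_pinyin(text):
    return "".join(lazy_pinyin(text))  # e.g. "喜羊羊" -> "xiyangyang"

def lcs_length(a, b):
    """Classic O(len(a)*len(b)) longest-common-subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_screen(entity, text_vocabulary):
    """Keep the vocabulary items whose pinyin shares the longest LCS with the entity."""
    entity_py = to_pinyin(entity)
    scores = {word: lcs_length(entity_py, to_pinyin(word)) for word in text_vocabulary}
    best = max(scores.values())
    return [word for word, score in scores.items() if score == best]
```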
S27, performing fuzzy matching between the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result.
S28, screening out the optimal result among the fuzzy matching results.
S29, taking the optimal result and the word corresponding to the user operation intention as the target voice recognition result.
In practical applications, fuzzy matching can be performed by calculating the minimum Levenshtein distance to obtain the fuzzy matching result.
Specifically, fuzzy matching is performed by calculating the Levenshtein distance and similar measures, and the result with the minimum distance is selected as the reference result. The Levenshtein distance is a measure of the edit distance between two character strings; the distance between the first i characters of string a and the first j characters of string b is calculated as follows:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\[4pt]\min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\\operatorname{lev}_{a,b}(i,j-1)+1\\\operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{cases} & \text{otherwise.}\end{cases}$$
after the Levenstein distance is obtained through calculation, the minimum Levenstein distance is screened out, and a text vocabulary corresponding to the Levenstein distance, such as the terms corresponding to the operation intention of the user, such as the terms corresponding to the playing intention, are added, such as the terms corresponding to the playing intention are played. The final speech recognition result is "play jubilance and gray-tarry".
For example: assume the current scene is films and the user's query is "play Xue Bao"; the extracted intention is "play" and the play entity is "Xue Bao" (雪豹, "Snow Leopard"). Because the current scene is films and the pinyin is "xuebao", the entity matches "Xue Bao" (雪暴, "Snow Storm", identical pinyin) in the text of the incremental high-quality knowledge base, and the final action is "play Xue Bao (the film Snow Storm)". Note: Snow Storm (雪暴) is a popular 2019 film, and Snow Leopard (雪豹) is a hit 2010 TV drama; in an animal-sounds scene, "xuebao" would resolve to the cry of the snow leopard, and in a short-video scene, to short videos about snow leopards.
In this embodiment, further gains can be obtained with other fuzzy matching methods based on the scene knowledge base. Meanwhile, pre-storing and pre-fetching can further mitigate the problem of long program names: even when the user reads out only part of a program name, the related content and instructions can be prepared in advance, which improves the response speed and accuracy of human-computer interaction and makes the machine more intelligent.
The embodiment of the invention can solve the problem of a low voice recognition rate caused by homophones, near-homophones, real-scene noise interference, human misreading, and the like, as well as recognition failures when only part of a long program name is spoken. It greatly improves the accuracy of recognition and semantic understanding in screen-equipped scenes and greatly improves the user experience of human-computer interaction.
Optionally, on the basis of the above embodiment of the method for processing the speech recognition result, another embodiment of the present invention provides a device for processing the speech recognition result, and with reference to fig. 3, the device may include:
the first data acquisition module 11 is configured to acquire a voice recognition result of voice information input by a user;
the second data acquisition module 12 is configured to acquire scene information according to a display page of the target terminal and acquire a text vocabulary corresponding to the scene information;
and the text processing module 13 is configured to perform text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determine that the target text vocabulary is a target speech recognition result.
Further, the second data obtaining module includes:
the first obtaining sub-module is used for obtaining first text content displayed on the display page;
a second obtaining sub-module, configured to obtain second text content that is related to the scene information but is not displayed on the display page;
a determining submodule configured to determine the first text content and the second text content as the text vocabulary.
In this embodiment, after the user inputs voice information and the corresponding voice recognition result is obtained, scene information and the text vocabulary corresponding to the scene information are acquired according to the display page of the target terminal, text similarity calculation is performed between the voice recognition result and the text vocabulary to obtain a target text vocabulary, and the target text vocabulary is determined as the target voice recognition result. By optimizing the voice recognition result with scene information acquired from the display page of the target terminal, the apparatus adapts better to the user's current viewing scene, that is, it comes closer to the user's actual need, and thereby further improves the parsing accuracy of the voice information.
It should be noted that, for the working processes of each module and sub-module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.
Optionally, on the basis of the above embodiment of the speech recognition result processing device, the text processing module includes:
the data processing submodule is used for extracting entity words in the voice recognition result and determining a user operation intention corresponding to the voice recognition result;
the first matching submodule is used for matching the entity words with the first text content;
the result determining submodule is used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the first text content;
the second matching submodule is used for matching the entity words with the second text content if the entity words cannot be matched with the first text content;
and the result determining submodule is further used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the second text content.
Further, the text processing module further comprises:
the calculation submodule is used for performing longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result if the entity words and the second text content cannot be matched;
the first screening submodule is used for screening out the longest common subsequence in the LCS calculation result, determining the text vocabulary corresponding to the longest common subsequence, and taking it as the text vocabulary to be analyzed;
the matching submodule is used for performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result;
the second screening submodule is used for screening out the optimal result among the fuzzy matching results;
and the result determining submodule is further used for taking the optimal result and the word corresponding to the user operation intention as the target voice recognition result.
Further, when performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result, the matching submodule is specifically configured to:
perform fuzzy matching by calculating the minimum Levenshtein distance to obtain the fuzzy matching result.
Further, the apparatus further includes:
and the pinyin conversion submodule is used for converting the entity words and the text words into corresponding Chinese pinyin.
The embodiment of the invention can solve the problem of a low voice recognition rate caused by homophones, near-homophones, real-scene noise interference, human misreading, and the like, as well as recognition failures when only part of a long program name is spoken. It greatly improves the accuracy of recognition and semantic understanding in screen-equipped scenes and greatly improves the user experience of human-computer interaction.
It should be noted that, for the working processes of each module and sub-module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.
Optionally, on the basis of the above embodiment of the method and apparatus for processing a speech recognition result, another embodiment of the present invention provides an electronic device, including: a memory and a processor;
wherein the memory is used for storing programs;
the processor calls a program and is used to:
acquiring a voice recognition result of voice information input by a user;
acquiring scene information according to a display page of a target terminal, and acquiring a text vocabulary corresponding to the scene information;
and performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
Further, acquiring a text vocabulary corresponding to the scene information includes:
acquiring first text content displayed on the display page;
acquiring second text content which is related to the scene information but is not displayed on the display page;
determining the first text content and the second text content as the text vocabulary.
Further, performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target speech recognition result, including:
extracting entity words in the voice recognition result, and determining a user operation intention corresponding to the voice recognition result;
matching the entity words with the first text content;
if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result;
if no match is found, matching the entity words with the second text content;
and if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result.
Further, performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining that the target text vocabulary is a target speech recognition result, the method further includes:
if the entity words and the second text content cannot be matched, performing longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result;
screening out the longest common subsequence in the LCS calculation result, determining the text vocabulary corresponding to the longest common subsequence, and taking it as the text vocabulary to be analyzed;
performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result;
screening out the optimal result among the fuzzy matching results;
and taking the optimal result and the word corresponding to the user operation intention as the target voice recognition result.
Further, fuzzy matching is carried out on the entity words and the text vocabulary to be analyzed, so as to obtain fuzzy matching results, and the fuzzy matching results comprise:
and carrying out fuzzy matching by calculating the minimum Levensstein distance to obtain a fuzzy matching result.
Further, before performing the longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result, the method further includes:
converting the entity words and the text vocabulary into the corresponding Chinese pinyin.
In this embodiment, after the user inputs voice information and the corresponding voice recognition result is obtained, scene information and the text vocabulary corresponding to the scene information are acquired according to the display page of the target terminal, text similarity calculation is performed between the voice recognition result and the text vocabulary to obtain a target text vocabulary, and the target text vocabulary is determined as the target voice recognition result. By optimizing the voice recognition result with scene information acquired from the display page of the target terminal, the electronic device adapts better to the user's current viewing scene, that is, it comes closer to the user's actual need, and thereby further improves the parsing accuracy of the voice information.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for processing a speech recognition result, comprising:
acquiring a voice recognition result of voice information input by a user;
acquiring scene information according to a display page of a target terminal, and acquiring a text vocabulary corresponding to the scene information;
and performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
2. The processing method according to claim 1, wherein obtaining a text vocabulary corresponding to the scene information comprises:
acquiring first text content displayed on the display page;
acquiring second text content which is related to the scene information but is not displayed on the display page;
determining the first text content and the second text content as the text vocabulary.
3. The processing method of claim 2, wherein performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target speech recognition result comprises:
extracting entity words in the voice recognition result, and determining a user operation intention corresponding to the voice recognition result;
matching the entity words with the first text content;
if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result;
if no match is found, matching the entity words with the second text content;
and if a match is found, taking the entity words and the word corresponding to the user operation intention as the target voice recognition result.
4. The processing method according to claim 2, wherein performing text similarity calculation on the speech recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target speech recognition result, further comprising:
if the entity words and the second text content cannot be matched, performing longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result;
screening out the longest common subsequence in the LCS calculation result, determining the text vocabulary corresponding to the longest common subsequence, and taking it as the text vocabulary to be analyzed;
performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result;
screening out the optimal result among the fuzzy matching results;
and taking the optimal result and the word corresponding to the user operation intention as the target voice recognition result.
5. The processing method according to claim 4, wherein performing fuzzy matching on the entity words and the text vocabulary to be analyzed to obtain a fuzzy matching result comprises:
and carrying out fuzzy matching by calculating the minimum Levensstein distance to obtain a fuzzy matching result.
6. The processing method of claim 4, wherein before performing the longest common subsequence (LCS) calculation on the entity words and the text vocabulary to obtain an LCS calculation result, the method further comprises:
converting the entity words and the text vocabulary into the corresponding Chinese pinyin.
7. An apparatus for processing a speech recognition result, comprising:
the first data acquisition module is used for acquiring a voice recognition result of voice information input by a user;
the second data acquisition module is used for acquiring scene information according to a display page of the target terminal and acquiring text vocabularies corresponding to the scene information;
and the text processing module is used for performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
8. The processing apparatus as claimed in claim 7, wherein the second data acquisition module comprises:
the first obtaining sub-module is used for obtaining first text content displayed on the display page;
a second obtaining sub-module, configured to obtain second text content that is related to the scene information but is not displayed on the display page;
a determining submodule configured to determine the first text content and the second text content as the text vocabulary.
9. The processing apparatus according to claim 8, wherein the text processing module comprises:
the data processing submodule is used for extracting entity words in the voice recognition result and determining a user operation intention corresponding to the voice recognition result;
the first matching submodule is used for matching the entity words with the first text content;
the result determining submodule is used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the first text content;
the second matching submodule is used for matching the entity words with the second text content if the entity words cannot be matched with the first text content;
and the result determining submodule is further used for taking the entity words and the word corresponding to the user operation intention as the target voice recognition result if the entity words can be matched with the second text content.
10. An electronic device, comprising: a memory and a processor;
wherein the memory is used for storing programs;
the processor is configured to call the program to:
acquiring a voice recognition result of voice information input by a user;
acquiring scene information according to a display page of a target terminal, and acquiring a text vocabulary corresponding to the scene information;
and performing text similarity calculation on the voice recognition result and the text vocabulary to obtain a target text vocabulary, and determining the target text vocabulary as a target voice recognition result.
CN202010076388.9A 2020-01-23 2020-01-23 Method and device for processing voice recognition result and electronic equipment Active CN111292745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010076388.9A CN111292745B (en) 2020-01-23 2020-01-23 Method and device for processing voice recognition result and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010076388.9A CN111292745B (en) 2020-01-23 2020-01-23 Method and device for processing voice recognition result and electronic equipment

Publications (2)

Publication Number Publication Date
CN111292745A 2020-06-16
CN111292745B 2023-03-24

Family

ID=71029223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010076388.9A Active CN111292745B (en) 2020-01-23 2020-01-23 Method and device for processing voice recognition result and electronic equipment

Country Status (1)

Country Link
CN (1) CN111292745B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101422041A (en) * 2006-04-17 2009-04-29 微软公司 Internet search-based television
CN106649409A (en) * 2015-11-04 2017-05-10 陈包容 Method and apparatus for displaying search result based on scene information
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition
CN109036419A (en) * 2018-07-23 2018-12-18 努比亚技术有限公司 A kind of speech recognition match method, terminal and computer readable storage medium
CN109325233A (en) * 2018-09-27 2019-02-12 北京安云世纪科技有限公司 Global semantic understanding method, apparatus, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112908306A (en) * 2021-01-30 2021-06-04 云知声智能科技股份有限公司 Voice recognition method, device, terminal and storage medium for optimizing screen-on effect
CN112908337A (en) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 Method, device and equipment for displaying voice recognition text and storage medium
CN113539271A (en) * 2021-07-23 2021-10-22 北京梧桐车联科技有限责任公司 Speech recognition method, device, equipment and computer readable storage medium
CN115547337A (en) * 2022-11-25 2022-12-30 深圳市人马互动科技有限公司 Speech recognition method and related product
CN115547337B (en) * 2022-11-25 2023-03-03 深圳市人马互动科技有限公司 Speech recognition method and related product

Also Published As

Publication number Publication date
CN111292745B (en) 2023-03-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant