WO2016101577A1 - Voice recognition method, client and terminal device - Google Patents

Voice recognition method, client and terminal device

Info

Publication number
WO2016101577A1
WO2016101577A1 (PCT/CN2015/082972)
Authority
WO
WIPO (PCT)
Prior art keywords
scene
voice recognition
keyword information
corrected
polyphonic
Prior art date
Application number
PCT/CN2015/082972
Other languages
French (fr)
Chinese (zh)
Inventor
谢志华 (XIE Zhihua)
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corporation (中兴通讯股份有限公司)
Publication of WO2016101577A1 publication Critical patent/WO2016101577A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice recognition method, a client, and a terminal device. The method includes: acquiring an original voice recognition result of a voice input by a user, and parsing out the user's voice recognition scene on the basis of the original voice recognition result; acquiring, from the original voice recognition result according to the voice recognition scene, keyword information to be corrected and the polyphonic characters in each piece of keyword information; generating one or more junk words containing the polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information; acquiring, according to the voice recognition scene, the actual information in the terminal device corresponding to the voice recognition scene, matching the junk words against the actual information, filtering out the correct polyphonic characters, and filling the correct polyphonic characters into the keyword information to be corrected to obtain correct keywords; and generating, according to the correct keywords, a final voice recognition result that conforms to the current voice recognition scene. The above technical solution improves users' voice interaction experience.

Description

Voice recognition method, client and terminal device
Technical field
This document relates to the field of communication technologies, and in particular to a voice recognition method, a client, and a terminal device.
Background
As voice has become one of the interaction methods most familiar to the public, voice recognition software of every kind keeps emerging on the market, and its quality varies widely; one of the criteria for measuring the quality of voice recognition software is the voice recognition rate. Although, with cloud-based recognition, every speech recognition engine provider now offers a natural language understanding capability, the capabilities each provider offers differ, and none of them can yet fully understand the semantics of different people in different scenarios. How to correctly identify the user's semantics in the current terminal scene, improve the accuracy of voice recognition, and ultimately achieve the best voice user experience is therefore both meaningful and valuable.
In the related art, the approach most engine providers adopt is generally to process the user's voice on a cloud server with language models and certain algorithms, finally derive the user's intention, and inform the user of it. In many cases, however, because certain utterances are ambiguous, the cloud server has no way to produce a unique result, so the result fed back to the user may differ from the actual result the user expected; the user then feels that recognition is inaccurate, and the user experience is poor.
Summary of the invention
Embodiments of the present invention provide a voice recognition method, a client, and a terminal device, to solve the technical problem of how to make voice recognition results better match the current user's expectation and improve the user's voice interaction experience.
According to one aspect of the embodiments of the present invention, a voice recognition method applied to the terminal device side is provided. The method includes: acquiring an original voice recognition result of a voice input by a user, and parsing out the user's voice recognition scene according to the original voice recognition result; acquiring, from the original voice recognition result according to the voice recognition scene, keyword information to be corrected and the polyphonic characters in each piece of the keyword information; generating one or more junk words containing the polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information; acquiring, according to the voice recognition scene or the category to which the keyword information to be corrected belongs, the actual information in the terminal device corresponding to the voice recognition scene or the keyword information to be corrected, matching the junk words against the actual information, filtering out the correct polyphonic characters, and filling the correct polyphonic characters into the keyword information to be corrected to obtain correct keywords; and generating, according to the correct keywords, a final voice recognition result that conforms to the current voice recognition scene.
Optionally, parsing out the user's voice recognition scene according to the original voice recognition result includes: matching the original voice recognition result against a preset correspondence table between voice recognition results and scenes to obtain the voice recognition scene of the user corresponding to the original voice recognition result.
Optionally, acquiring, from the original voice recognition result according to the voice recognition scene, the keyword information to be corrected and the polyphonic characters in each piece of keyword information includes: acquiring the keyword information to be corrected from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table; and determining whether polyphonic characters exist in the keyword information to be corrected, and if so, acquiring the polyphonic characters in each piece of keyword information.
Optionally, generating one or more junk words containing polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information includes: converting the polyphonic characters into the corresponding pinyin, and then extracting, according to a homophone correspondence table, one or more Chinese characters corresponding to that pinyin; and filling those Chinese characters into the keyword information to be corrected in place of the polyphonic characters, to form one or more junk words containing polyphonic characters.
Optionally, the correspondence table between voice recognition results and scenes and the scene key information extraction table are in XML (Extensible Markup Language) format.
According to another aspect of the embodiments of the present invention, a voice recognition client applied to the terminal device side is further provided. The client includes: a scene parsing module, configured to acquire an original voice recognition result of a voice input by a user and parse out the user's voice recognition scene according to the original voice recognition result; a polyphonic character extraction module, configured to acquire, from the original voice recognition result according to the voice recognition scene, keyword information to be corrected and the polyphonic characters in each piece of the keyword information; a junk word generation module, configured to generate one or more junk words containing the polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information; a polyphonic character correction module, configured to acquire, according to the voice recognition scene or the category to which the keyword information to be corrected belongs, the actual information in the terminal device corresponding to the voice recognition scene or the keyword information to be corrected, match the junk words against the actual information, filter out the correct polyphonic characters, and fill the correct polyphonic characters into the keyword information to be corrected to obtain correct keywords; and a result processing module, configured to generate, according to the correct keywords, a final voice recognition result that conforms to the current voice recognition scene.
Optionally, the scene parsing module is configured to match the original voice recognition result against a preset correspondence table between voice recognition results and scenes to obtain the voice recognition scene of the user corresponding to the original voice recognition result.
Optionally, the polyphonic character extraction module is configured to acquire the keyword information to be corrected from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table, determine whether polyphonic characters exist in the keyword information to be corrected, and if so, acquire the polyphonic characters in each piece of keyword information.
Optionally, the polyphonic character correction module is configured to convert the polyphonic characters into the corresponding pinyin, then extract, according to a homophone correspondence table, one or more Chinese characters corresponding to that pinyin, and fill those Chinese characters into the keyword information to be corrected in place of the polyphonic characters, to form one or more junk words containing polyphonic characters.
Optionally, the correspondence table between voice recognition results and scenes and the scene key information extraction table are in XML (Extensible Markup Language) format.
According to yet another aspect of the embodiments of the present invention, a terminal device is further provided, including the voice recognition client described above.
According to yet another aspect of the embodiments of the present invention, a computer storage medium is further provided, in which computer-executable instructions are stored, the computer-executable instructions being used to perform the method described above.
In the embodiments of the present invention, the recognition result fed back by the engine provider is further optimized in combination with the current scene, so that the recognition result better matches the current user's expectation and the user's voice interaction experience is improved.
Brief description of the drawings
FIG. 1 is a first flowchart of the voice recognition method on the terminal device side in an embodiment of the present invention;
FIG. 2 is a second flowchart of the voice recognition method on the terminal device side in an embodiment of the present invention; and
FIG. 3 is a structural block diagram of a voice recognition terminal device in an embodiment of the present invention.
Preferred embodiments of the invention
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be conveyed completely to those skilled in the art.
Embodiments of the present invention provide a voice recognition method, a client, and a terminal device applied to the terminal device side. An original voice recognition result of a voice input by a user is acquired, and the user's voice recognition scene is parsed out according to the original voice recognition result, where the original voice recognition result is obtained by a cloud server by recognizing the voice input by the user; keyword information to be corrected and the polyphonic characters in each piece of keyword information are acquired from the original voice recognition result according to the voice recognition scene; one or more junk words containing polyphonic characters are generated according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information; the actual information in the terminal device corresponding to the voice recognition scene or the keyword information to be corrected is acquired according to the voice recognition scene or the category to which the keyword information to be corrected belongs, the junk words are matched against the actual information, the correct polyphonic characters are filtered out, and the correct polyphonic characters are filled into the keyword information to be corrected to obtain correct keywords; and a final voice recognition result conforming to the current voice recognition scene is generated according to the correct keywords. Because the original voice recognition result can be further optimized in combination with the current voice recognition scene and the actual information in the terminal device, it is converted into a final voice recognition result that conforms to the current voice recognition scene and the terminal device, which improves the voice recognition rate.
As shown in FIG. 1, which is a first flowchart of the voice recognition method on the terminal device side in an embodiment of the present invention, the method includes the following steps.
Step S101: Acquire an original voice recognition result of the voice input by a user, and parse out the user's voice recognition scene according to the original voice recognition result, where the original voice recognition result is obtained by a cloud server by recognizing the voice input by the user.
Optionally, in step S101, the voice recognition scene of the user corresponding to the original voice recognition result can be obtained by matching against a preset correspondence table between voice recognition results and scenes. The voice recognition scene indicates in what scene the user is using voice recognition and may include a calling scene, a music scene, and so on; the correspondence table maps all the different expressions that represent the same scene to one unified scene. For the calling scene, for example, the original voice recognition result returned by each engine provider's recognition may differ: some return "打电话" (make a phone call), some may return "呼叫" (call), and some may return "Call". In this embodiment, the correspondence table between recognition results and scenes maps these different expressions of the same scene to one and the same unified scene, so that step S101 finally yields the unique voice recognition scene of the recognition result.
Optionally, the correspondence table between voice recognition results and scenes is in XML (Extensible Markup Language) format; example code is as follows:
<?xml version="1.0"encoding="utf-8"?><? Xml version="1.0"encoding="utf-8"? >
<SceneMapTable><SceneMapTable>
<Domain><Domain>
<Scene>打电话</Scene><Scene>Call</Scene>
<Value><Value>
<V>呼叫</V><V>Call</V>
<V>打电话</V><V>Call</V>
<V>Call</V><V>Call</V>
</Value></Value>
</Domain></Domain>
<Domain> <Domain>
<Scene>音乐</Scene><Scene>Music</Scene>
<Value><Value>
<V>播放音乐</V><V>Play music</V>
<V>听音乐</V><V>Listen to music</V>
<V>Music</V><V>Music</V>
</Value></Value>
</Domain></Domain>
</SceneMapTable></SceneMapTable>
The correspondence table between recognition results and scenes set up by the above code records the correspondence between "呼叫", "打电话", "Call" and the calling scene, and between "播放音乐", "听音乐", "Music" and the music scene. If the original voice recognition result includes "呼叫", step S101 determines that the user's voice recognition scene is the calling scene; if it includes "播放音乐", step S101 determines that the user's voice recognition scene is the music scene.
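For illustration only, the following minimal sketch shows how a client might load such a SceneMapTable and resolve the unique scene of step S101. It assumes the XML layout of the example above; the function and variable names are invented for this sketch and do not come from the disclosure.

import xml.etree.ElementTree as ET
from typing import Optional

def load_scene_map(xml_text: str) -> dict:
    """Map every expression (<V>) to its unified scene (<Scene>)."""
    table = {}
    for domain in ET.fromstring(xml_text).findall("Domain"):
        scene = domain.findtext("Scene")
        for v in domain.findall("Value/V"):
            table[v.text] = scene
    return table

def resolve_scene(result_text: str, table: dict) -> Optional[str]:
    """Return the unified scene whose expression occurs in the result."""
    for expression, scene in table.items():
        if expression in result_text:
            return scene
    return None  # unknown scene: handled by other logic

# e.g. resolve_scene("呼叫张乐", load_scene_map(xml_text)) -> "打电话"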
In the embodiments of the present invention, the cloud server can use the natural language understanding capability of the related art to recognize the voice input by the user and obtain the original voice recognition result.
Step S103: Acquire, from the original voice recognition result according to the voice recognition scene, keyword information to be corrected and the polyphonic characters in each piece of the keyword information.
Optionally, the keyword information to be corrected is acquired from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table; it is then determined whether polyphonic characters exist in the keyword information to be corrected, and if so, the polyphonic characters in each piece of keyword information are acquired.
That is, in step S103, the keyword information that may need correction can be extracted from the original voice recognition result according to the scene key information extraction table, after which it is determined whether polyphonic characters exist in that keyword information; if so, the keyword and the polyphonic characters are extracted. Whether polyphonic characters exist in the keyword information can be determined with a polyphonic character dictionary: each character of the keyword information is looked up in the dictionary to confirm whether it is polyphonic, and all characters confirmed to be polyphonic are saved separately. For example, if the scene is determined to be a phone call, the contact information needs to be extracted as the keyword information that may need correction; it is then determined whether the recognized contact contains polyphonic characters, and if so, the contact and the polyphonic characters are extracted.
Optionally, the scene key information extraction table is in XML format; example code is as follows:
<?xml version="1.0"encoding="utf-8"?><? Xml version="1.0"encoding="utf-8"? >
<KeywordMapTable><KeywordMapTable>
<Domain><Domain>
<Scene>打电话</Scene><Scene>Call</Scene>
<Keyword><Keyword>
<Key>联系人</Key><Key>Contact</Key>
</Keyword></Keyword>
</Domain></Domain>
<Domain><Domain>
<Scene>音乐</Scene><Scene>Music</Scene>
<Keyword><Keyword>
<Key>歌曲名</Key><Key>Song name</Key>
<Key>专辑名</Key><Key> album name</Key>
<Key>艺术家</Key><Key>Artist</Key>
</Keyword></Keyword>
</Domain></Domain>
</KeywordMapTable></KeywordMapTable>
The above code also shows that if the scene is determined to be music, the song name, album name, or artist needs to be extracted as the keyword information that may need correction; it is then determined whether the recognized song name, album name, or artist contains polyphonic characters, and if so, the song name, album name, or artist and the polyphonic characters are extracted.
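As a further illustration of step S103, the sketch below reads the keyword fields a scene requires from the KeywordMapTable above and flags polyphonic characters with a toy dictionary; a real implementation would use a full polyphonic character dictionary, and all names here are assumptions made for the example.

import xml.etree.ElementTree as ET

# Tiny stand-in for a real polyphonic character dictionary (多音字词典).
POLYPHONIC_DICT = {"乐", "行", "重", "单"}

def keys_for_scene(xml_text: str, scene: str) -> list:
    """Read the <Key> entries of the extraction table for one scene."""
    for domain in ET.fromstring(xml_text).findall("Domain"):
        if domain.findtext("Scene") == scene:
            return [k.text for k in domain.findall("Keyword/Key")]
    return []

def polyphonic_chars(keyword: str) -> list:
    """Keep every character of the keyword that the dictionary marks."""
    return [ch for ch in keyword if ch in POLYPHONIC_DICT]

# e.g. keys_for_scene(xml_text, "打电话") -> ["联系人"]
#      polyphonic_chars("张乐") -> ["乐"]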
Step S105: Generate one or more junk words containing polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information.
Optionally, the polyphonic characters extracted in step S103 are converted into the corresponding pinyin, all possible Chinese characters corresponding to that pinyin are then extracted according to the homophone correspondence table, and all of those characters are filled into the keyword information to be corrected in place of the polyphonic characters, to form one or more junk words containing polyphonic characters.
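A minimal sketch of this junk-word generation step follows. It uses the third-party pypinyin package for the character-to-pinyin conversion; the homophone correspondence table is a toy subset, and the table contents and function names are illustrative assumptions, not part of the disclosure.

from pypinyin import pinyin, Style  # third-party: pip install pypinyin

# Toy homophone correspondence table: pinyin -> characters with that reading.
HOMOPHONE_TABLE = {
    "yue": "悦越月乐",
    "le": "乐",
}

def junk_words(keyword: str, polyphonic_char: str) -> list:
    """Substitute every homophone of every reading into the keyword."""
    # heteronym=True returns all readings of a polyphonic character
    readings = pinyin(polyphonic_char, style=Style.NORMAL, heteronym=True)[0]
    words = set()
    for py in readings:
        for ch in HOMOPHONE_TABLE.get(py, ""):
            words.add(keyword.replace(polyphonic_char, ch))
    return sorted(words)

# e.g. junk_words("张乐", "乐") -> ["张乐", "张悦", "张月", "张越"]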
Step S107: Acquire, according to the voice recognition scene or the category to which the keyword information to be corrected belongs, the actual information in the terminal device corresponding to the voice recognition scene or the keyword information to be corrected; match the junk words against the actual information, filter out the correct polyphonic characters, and fill the correct polyphonic characters into the keyword information to be corrected to obtain correct keywords.
For example, if the voice recognition scene is the calling scene, or the keyword information to be corrected belongs to the contact category, the actual information is the actual contact information list; it can of course be understood that the embodiments of the present invention do not limit how the actual information is represented.
Step S109: Generate, according to the correct keywords, a final voice recognition result that conforms to the current voice recognition scene.
Optionally, in step S107, the actual information corresponding to the terminal device is extracted according to the voice recognition scene information and the category to which the keyword belongs; for example, if the current scene is the calling scene and the current keyword belongs to the contact category, the real contact information of the current mobile phone is extracted. Then, still in step S107, the junk words above are compared one by one against the set of real contact information; if an exact match is found, that junk word is retained and, in step S109, serves as the final contact recognition result.
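Continuing the same illustrative assumptions, the sketch below shows the filtering of steps S107 and S109, with a hard-coded list standing in for the phone's real contacts:

def correct_keyword(candidates: list, contacts: list):
    """Keep the junk word that exactly matches an actual contact."""
    for word in candidates:
        if word in contacts:  # exact match against the actual information
            return word       # the correctly written contact name
    return None               # no match: report a recognition failure

contacts = ["张悦", "李明"]  # stand-in for the terminal's contact list
# e.g. correct_keyword(junk_words("张乐", "乐"), contacts) -> "张悦"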
In the embodiments of the present invention, adding junk words improves the recognition rate for polyphonic characters. The voice recognition approach in the embodiments of the present invention is applicable to all scenes involving recognition of key terminal information, such as contacts, music names, artists, album names, application names, and so on; the above embodiments can generate recognition results that are more accurate and better match the user's expectation, thereby improving the voice recognition rate and the user's voice interaction experience.
The following describes the voice recognition flow in an embodiment of the present invention with reference to FIG. 2, taking the calling scene as an example; the steps are as follows (an illustrative end-to-end sketch follows the list).
Step S201: Parse the scene of the original voice recognition result to obtain the unique voice recognition scene.
Optionally, the parsing is performed according to the scene keywords in the original voice recognition result to obtain the unique voice recognition scene corresponding to those scene keywords.
Step S203: Determine whether the scene is the calling scene; if so, proceed to step S205, in which the contact entry is extracted and the polyphonic characters of the name in the contact entry are acquired; if not, handle according to the other scenes.
Step S207: Convert the polyphonic characters acquired above into pinyin.
Step S209: Query all Chinese characters matching the above pinyin; if there are matching characters, perform step S211; if there is no match, handle the character as a non-polyphonic character.
Step S211: Substitute the acquired junk characters for the polyphonic characters in the original keyword to generate name junk words.
Step S213: Acquire the actual contact information list of the terminal device.
Step S215: Filter the junk words against the actual contact list; if the correct contact information is obtained, perform step S217; if the correct contact information is not obtained, give a voice recognition result such as a recognition failure.
Step S217: Recombine the correct contact information with the voice recognition scene information to generate the final voice recognition result.
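Purely to show how steps S201 to S217 compose, the following hypothetical glue function chains the sketches defined earlier (resolve_scene, polyphonic_chars, junk_words, correct_keyword); the crude contact extraction and all names are assumptions for the example, not code from the disclosure.

def recognize_call(original_result: str, scene_table: dict, contacts: list) -> str:
    scene = resolve_scene(original_result, scene_table)      # step S201
    if scene != "打电话":                                     # step S203
        return original_result    # other scenes handled elsewhere
    name = original_result.replace("呼叫", "")  # crude contact extraction (assumed)
    chars = polyphonic_chars(name)                           # step S205
    if not chars:
        return original_result    # nothing to correct
    for ch in chars:
        candidates = junk_words(name, ch)                    # steps S207-S211
        corrected = correct_keyword(candidates, contacts)    # steps S213-S215
        if corrected:
            return "呼叫" + corrected                         # step S217
    return "识别失败"  # recognition failure when no candidate matches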
As shown in FIG. 3, which is a schematic structural diagram of the voice recognition client applied to the terminal device side in an embodiment of the present invention, the client includes:
a scene parsing module 301, configured to acquire an original voice recognition result of a voice input by a user and parse out the user's voice recognition scene according to the original voice recognition result, where the original voice recognition result is obtained by a cloud server by recognizing the voice input by the user;
a polyphonic character extraction module 302, configured to acquire, from the original voice recognition result according to the voice recognition scene, keyword information to be corrected and the polyphonic characters in each piece of the keyword information;
a junk word generation module 303, configured to generate one or more junk words containing the polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information;
a polyphonic character correction module 304, configured to acquire, according to the voice recognition scene or the category to which the keyword information to be corrected belongs, the actual information in the terminal device corresponding to the voice recognition scene or the keyword information to be corrected, match the junk words against the actual information, filter out the correct polyphonic characters, and fill the correct polyphonic characters into the keyword information to be corrected to obtain correct keywords; and
a result processing module 305, configured to generate, according to the correct keywords, a final voice recognition result that conforms to the current voice recognition scene.
Optionally, in this embodiment of the present invention, the scene parsing module 301 is configured to match the original voice recognition result against a preset correspondence table between voice recognition results and scenes to obtain the voice recognition scene of the user corresponding to the original voice recognition result.
Optionally, in this embodiment of the present invention, the polyphonic character extraction module 302 is configured to acquire the keyword information to be corrected from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table, determine whether polyphonic characters exist in the keyword information to be corrected, and if so, acquire the polyphonic characters in each piece of keyword information.
Optionally, in this embodiment of the present invention, the polyphonic character correction module 304 is configured to convert the polyphonic characters into the corresponding pinyin, then extract, according to a homophone correspondence table, one or more Chinese characters corresponding to that pinyin, and fill those Chinese characters into the keyword information to be corrected in place of the polyphonic characters, to form one or more junk words containing polyphonic characters.
Optionally, in this embodiment of the present invention, the correspondence table between voice recognition results and scenes and the scene key information extraction table are in XML (Extensible Markup Language) format.
According to yet another aspect of the embodiments of the present invention, a terminal device is further provided, including the voice recognition client described above.
The above are preferred embodiments of the present invention. It should be noted that several improvements and refinements can be made by those of ordinary skill in the art without departing from the principles of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments can be implemented as a computer program flow; the computer program can be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, apparatus, or device), and when executed, it performs one of, or a combination of, the steps of the method embodiments.
Optionally, all or part of the steps of the above embodiments can also be implemented with integrated circuits; these steps can be fabricated as individual integrated circuit modules, or several of the modules or steps can be fabricated as a single integrated circuit module.
The devices/functional modules/functional units in the above embodiments can be implemented with general-purpose computing devices; they can be centralized on a single computing device or distributed over a network formed by multiple computing devices.
When the devices/functional modules/functional units in the above embodiments are implemented in the form of software functional modules and sold or used as stand-alone products, they can be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disk, or the like.
Industrial applicability
The above technical solution makes the recognition result better match the current user's expectation and improves the user's voice interaction experience.

Claims (12)

  1. A voice recognition method applied to a terminal device side, the method comprising:
    obtaining an original voice recognition result of a voice input by a user, and parsing out a voice recognition scene of the user according to the original voice recognition result;
    according to the voice recognition scene, obtaining, from the original voice recognition result, keyword information that needs to be corrected and a polyphonic word in each piece of the keyword information;
    generating one or more junk words containing the polyphonic words according to the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information;
    according to the voice recognition scene or the range to which the keyword information that needs to be corrected belongs, obtaining actual information in the terminal device corresponding to the voice recognition scene or the keyword information that needs to be corrected, matching the junk words against the actual information, filtering out the correct polyphonic word, and filling the correct polyphonic word into the keyword information that needs to be corrected to obtain a correct keyword;
    generating, according to the correct keyword, a final voice recognition result conforming to the current voice recognition scene.
  2. The method according to claim 1, wherein parsing out the voice recognition scene of the user according to the original voice recognition result comprises:
    obtaining, by matching against a preset speech-recognition-result-to-scene correspondence table, the voice recognition scene of the user corresponding to the original voice recognition result.
  3. The method according to claim 2, wherein obtaining, from the original voice recognition result, the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information according to the voice recognition scene comprises:
    obtaining the keyword information that needs to be corrected from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table;
    determining whether a polyphonic word exists in the keyword information that needs to be corrected, and if so, obtaining the polyphonic word in each piece of keyword information.
  4. The method according to claim 1, wherein generating one or more junk words containing polyphonic words according to the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information comprises:
    converting the polyphonic word into its corresponding pinyin, and then extracting one or more Chinese characters corresponding to the pinyin according to a homophone correspondence table;
    filling the Chinese characters into the keyword information that needs to be corrected to replace the polyphonic word, so as to form one or more junk words containing polyphonic words.
  5. The method according to claim 3, wherein the speech-recognition-result-to-scene correspondence table and the scene key information extraction table are in XML (Extensible Markup Language) format.
  6. A voice recognition client applied to a terminal device side, the client comprising:
    a scene parsing module, configured to obtain an original voice recognition result of a voice input by a user and to parse out a voice recognition scene of the user according to the original voice recognition result;
    a polyphonic word extraction module, configured to obtain, from the original voice recognition result, keyword information that needs to be corrected and a polyphonic word in each piece of the keyword information according to the voice recognition scene;
    a junk word generation module, configured to generate one or more junk words containing the polyphonic words according to the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information;
    a polyphonic word correction module, configured to obtain, according to the voice recognition scene or the range to which the keyword information that needs to be corrected belongs, actual information in the terminal device corresponding to the voice recognition scene or the keyword information that needs to be corrected, to match the junk words against the actual information, to filter out the correct polyphonic word, and to fill the correct polyphonic word into the keyword information that needs to be corrected to obtain a correct keyword;
    a result processing module, configured to generate, according to the correct keyword, a final voice recognition result conforming to the current voice recognition scene.
  7. The client according to claim 6, wherein the scene parsing module is configured to parse out the voice recognition scene of the user according to the original voice recognition result in the following manner:
    obtaining, by matching against a preset speech-recognition-result-to-scene correspondence table, the voice recognition scene of the user corresponding to the original voice recognition result.
  8. The client according to claim 6, wherein the polyphonic word extraction module is configured to obtain, from the original voice recognition result, the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information according to the voice recognition scene in the following manner:
    obtaining the keyword information that needs to be corrected from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table; determining whether a polyphonic word exists in the keyword information that needs to be corrected, and if so, obtaining the polyphonic word in each piece of keyword information.
  9. The client according to claim 6, wherein the polyphonic word correction module is configured to generate one or more junk words containing polyphonic words according to the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information in the following manner:
    converting the polyphonic word into its corresponding pinyin, and then extracting one or more Chinese characters corresponding to the pinyin according to a homophone correspondence table; filling the Chinese characters into the keyword information that needs to be corrected to replace the polyphonic word, so as to form one or more junk words containing polyphonic words.
  10. The client according to claim 6, wherein the speech-recognition-result-to-scene correspondence table and the scene key information extraction table are in XML (Extensible Markup Language) format.
  11. A terminal device, comprising the voice recognition client according to any one of claims 6 to 10.
  12. A computer storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the method according to any one of claims 1 to 5.
PCT/CN2015/082972 2014-12-24 2015-06-30 Voice recognition method, client and terminal device WO2016101577A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410817478.3A CN105786880A (en) 2014-12-24 2014-12-24 Voice recognition method, client and terminal device
CN201410817478.3 2014-12-24

Publications (1)

Publication Number Publication Date
WO2016101577A1 true WO2016101577A1 (en) 2016-06-30

Family

ID=56149133

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/082972 WO2016101577A1 (en) 2014-12-24 2015-06-30 Voice recognition method, client and terminal device

Country Status (2)

Country Link
CN (1) CN105786880A (en)
WO (1) WO2016101577A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694940B (en) * 2017-04-10 2020-07-03 北京猎户星空科技有限公司 Voice recognition method and device and electronic equipment
CN106997761A (en) * 2017-04-20 2017-08-01 滁州职业技术学院 The method and mobile terminal of a kind of secret protection
CN106875949B (en) * 2017-04-28 2020-09-22 深圳市大乘科技股份有限公司 Correction method and device for voice recognition
CN107315742A (en) * 2017-07-03 2017-11-03 中国科学院自动化研究所 The Interpreter's method and system that personalize with good in interactive function
CN107424612B (en) * 2017-07-28 2021-07-06 北京搜狗科技发展有限公司 Processing method, apparatus and machine-readable medium
CN108334567B (en) * 2018-01-16 2021-09-10 北京奇艺世纪科技有限公司 Junk text distinguishing method and device and server
CN109616111B (en) * 2018-12-24 2023-03-14 北京恒泰实达科技股份有限公司 Scene interaction control method based on voice recognition
CN111696545B (en) * 2019-03-15 2023-11-03 北京汇钧科技有限公司 Speech recognition error correction method, device and storage medium
CN112291281B (en) * 2019-07-09 2023-11-03 钉钉控股(开曼)有限公司 Voice broadcasting and voice broadcasting content setting method and device
CN110837734A (en) * 2019-11-14 2020-02-25 维沃移动通信有限公司 Text information processing method and mobile terminal

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101217035A (en) * 2007-12-29 2008-07-09 无敌科技(西安)有限公司 A vocabulary database construction method and the corresponding hunting and comparison method for voice identification system
CN102074231A (en) * 2010-12-30 2011-05-25 万音达有限公司 Voice recognition method and system
CN103594085A (en) * 2012-08-16 2014-02-19 百度在线网络技术(北京)有限公司 Method and system providing speech recognition result
CN103674012A (en) * 2012-09-21 2014-03-26 高德软件有限公司 Voice customizing method and device and voice identification method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107785018A (en) * 2016-08-31 2018-03-09 科大讯飞股份有限公司 More wheel interaction semantics understanding methods and device
CN108305629A (en) * 2017-12-25 2018-07-20 广东小天才科技有限公司 A kind of scene learning Content acquisition methods, device, facility for study and storage medium
CN108305629B (en) * 2017-12-25 2021-07-20 广东小天才科技有限公司 Scene learning content acquisition method and device, learning equipment and storage medium
CN111402887A (en) * 2018-12-17 2020-07-10 北京未来媒体科技股份有限公司 Method and device for escaping characters by voice
CN113053362A (en) * 2021-03-30 2021-06-29 建信金融科技有限责任公司 Method, device, equipment and computer readable medium for speech recognition
CN114120972A (en) * 2022-01-28 2022-03-01 科大讯飞华南有限公司 Intelligent voice recognition method and system based on scene
CN114120972B (en) * 2022-01-28 2022-04-12 科大讯飞华南有限公司 Intelligent voice recognition method and system based on scene

Also Published As

Publication number Publication date
CN105786880A (en) 2016-07-20

Similar Documents

Publication Publication Date Title
WO2016101577A1 (en) Voice recognition method, client and terminal device
US11615791B2 (en) Voice application platform
US10235999B1 (en) Voice application platform
KR102058131B1 (en) Modulation of Packetized Audio Signals
TW201833793A (en) Semantic extraction method and device of natural language and computer storage medium
US20130198268A1 (en) Generation of a music playlist based on text content accessed by a user
WO2020253399A1 (en) Log classification rule generation method, device, apparatus, and readable storage medium
CN111666746A (en) Method and device for generating conference summary, electronic equipment and storage medium
JP5496863B2 (en) Emotion estimation apparatus, method, program, and recording medium
AU2017216520A1 (en) Common data repository for improving transactional efficiencies of user interactions with a computing device
JP2019003319A (en) Interactive business support system and interactive business support program
CN108121455B (en) Identification correction method and device
CN106713111B (en) Processing method for adding friends, terminal and server
CN106681523A (en) Library configuration method, library configuration device and call handling method of input method
WO2020226617A1 (en) Invoking functions of agents via digital assistant applications using address templates
CN111354350A (en) Voice processing method and device, voice processing equipment and electronic equipment
JP2014229275A (en) Query answering device and method
KR20130073709A (en) Method and apparatus of recognizing business card using image and voice information
WO2019236444A1 (en) Voice application platform
WO2022143349A1 (en) Method and device for determining user intent
CN113741864A (en) Automatic design method and system of semantic service interface based on natural language processing
CN111401034B (en) Semantic analysis method, semantic analysis device and terminal for text
US20060242578A1 (en) Method for managing content
Lindgren Machine recognition of human language Part II: Theoretical models of speech perception and language
JP2017134162A (en) Voice recognition device, voice recognition method, and voice recognition program

Legal Events

Code   Title and Description
121    Ep: the epo has been informed by wipo that ep was designated in this application
       (Ref document number: 15871652; Country of ref document: EP; Kind code of ref document: A1)
NENP   Non-entry into the national phase
       (Ref country code: DE)
122    Ep: pct application non-entry in european phase
       (Ref document number: 15871652; Country of ref document: EP; Kind code of ref document: A1)