CN111475129A

CN111475129A - Method and equipment for displaying candidate homophones through voice recognition

Info

Publication number: CN111475129A
Application number: CN201910067927.XA
Authority: CN
Inventors: 周末
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2019-01-24
Filing date: 2019-01-24
Publication date: 2020-07-31

Abstract

The invention discloses a method for displaying candidate homophones for voice recognition, comprising the following steps: receiving voice-recognized data from a server; parsing the data and determining whether it contains candidate words; and, if candidate words exist, displaying the word with the highest recognition probability as the main word in the manner of a hyperlink, so that the main word can be clicked. The application also provides corresponding electronic equipment and a computer-readable storage medium. By applying the disclosed technical scheme, the intelligence of intelligent equipment in voice recognition can be improved, and the user no longer needs to retype text manually after voice recognition.

Description

Method and equipment for displaying candidate homophones through voice recognition

Technical Field

The invention relates to the technical field of voice recognition, in particular to a method and equipment for displaying candidate homophones for voice recognition.

Background

With the development of technology, people often use the voice recognition function when using applications on intelligent devices (such as various mobile devices, handheld devices, etc.). However, owing to the breadth and depth of the Chinese language, the accuracy of existing speech recognition cannot reach 99.99%. The specific reasons are as follows:

1. The Chinese character encoding used in software is commonly GB2312. GB2312 includes 6763 Chinese characters, not counting ancient characters, whereas Chinese has only 21 initials, 35 finals, four tones and about 400 syllables, so the number of syllables is far smaller than the number of Chinese characters. That is to say, Chinese contains a large number of homophonic characters and homophonic words.

2. Automatic speech recognition (ASR) is a technology that enables machines to "understand" human speech. The main flow of speech recognition is shown in FIG. 1:

first, signal processing, including noise reduction and framing, is performed on an input segment of voice;

then, feature extraction is carried out on the result of signal processing, and acoustic pattern matching is carried out based on an acoustic model;

finally, language processing is carried out based on the language model, and the text result corresponding to the voice is obtained.

According to the flow shown in FIG. 1, speech is converted into text in the language processing stage. The main principle of language processing is as follows: an acoustic sequence (which can be simply understood as pinyin) is received, and the result with the maximum recognition probability for that acoustic sequence is produced according to a language model trained on a large amount of text, context semantics and statistical rules; this result is the finally recognized text.

The above process is illustrated below by a simple example:

Step 1: Voice input: yuè fù. Because a patent document must be expressed in text, pinyin is used here for representation; what is actually input is the sound signal corresponding to the pinyin.

Step 2: The first syllable, yuè, is recognized. It can correspond to many characters, for example those meaning "month", "about", "over", "happy", etc. Because further context follows, no result is returned yet.

Step 3: The second syllable is recognized, giving yuè fù. Combined with the preceding syllable, the recognition result changes significantly, and combinations that do not form words in daily use are excluded. For example, the homophone options identified may be: yuenangg, Yuesfu, Yufu, Lefu, etc. According to the judgment of the language model, the word with the highest recognition probability among the homophones is selected and returned as the recognition result.
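
The selection made in the last step can be sketched as follows. This is only a minimal illustration under stated assumptions, not the recognizer's actual algorithm: the `HOMOPHONES` table, the toneless key `"yue fu"` and the function `best_candidate` are hypothetical, and the probabilities are the example values given for this acoustic sequence in the first embodiment of the detailed description.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table standing in for the language model's output:
# acoustic sequence -> (word, recognition probability) pairs.
HOMOPHONES: Dict[str, List[Tuple[str, float]]] = {
    "yue fu": [
        ("Yue father", 0.67),
        ("monthly payment", 0.87),
        ("Yufu", 0.32),
    ],
}


def best_candidate(acoustic_sequence: str) -> Tuple[str, float]:
    """Return the homophone with the highest recognition probability."""
    candidates = HOMOPHONES[acoustic_sequence]
    return max(candidates, key=lambda pair: pair[1])


word, probability = best_candidate("yue fu")
print(word, probability)  # monthly payment 0.87
```

Only the single best word is returned in this sketch, which is exactly why a less probable but intended homophone is lost in the conventional flow.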

The probabilities are obtained by training the language model on a large amount of text. The more text used in training, the higher the probability of accurate recognition. However, the prior art described above performs poorly on low-probability context semantics and in other special cases.

For these reasons, when an intelligent device is used for voice recognition in daily life, the input voice often corresponds to homophones, and the text displayed after voice recognition is not the target word the user wants. In the prior art, the user then usually has to re-enter the text manually with an input method in order to change it into the target word.

Therefore, with the currently common voice recognition schemes, if a homophone is spoken, only the word with the higher usage rate is recognized, and the word the user wants to express may not be recognized correctly. In the example above, the word recognized with the highest probability is "monthly payment", but the speaker's intent is "Yue father". To correct it, the user can only delete the original text and type it in again manually. When a large amount of text has been entered, the part to be modified has to be searched for line by line. These problems seriously impair the intelligence of the intelligent device.

Disclosure of Invention

The embodiment of the invention provides a method and equipment for displaying candidate homophones for voice recognition and a computer-readable storage medium, which are used for avoiding the problem that a user needs to manually type and input again in voice recognition.

The embodiment of the application discloses a method for displaying candidate homophones for voice recognition, which comprises the following steps:

receiving the voice-recognized data from the server;

analyzing the data and judging whether candidate words exist in the data or not;

and if the candidate words exist, displaying the words with the highest recognition probability as main words in a hyperlink mode, wherein the main words can be clicked.

Preferably, the method further comprises:

underlining below the primary word;

or, the main word is presented in a different color from the other words;

or, the main word is both underlined and presented in a different color from the other words.

Preferably, the method further comprises:

and when the clicking operation on the main word is detected, displaying the candidate word of the main word by using a candidate word display frame.

Preferably, the candidate words of the main word are sequentially displayed according to the sequence from high recognition probability to low recognition probability.

Preferably, the method further comprises:

and when the selection operation of any candidate word is detected, displaying the selected candidate word in the main text and hiding the candidate word display box.

Preferably, the presenting the selected candidate word in the main text comprises:

and displaying the selected candidate words in a hyperlink mode, wherein the candidate words can be clicked.

Preferably, underlining is performed below the candidate word shown in the main text;

or displaying the candidate words shown in the main text in a color different from other words;

or, underlining the candidate word displayed in the main text and displaying the candidate word displayed in the main text in a color different from other words.

Preferably, the method further comprises:

and when the clicking operation on the candidate words displayed in the main text is detected, displaying the main words and other candidate words by using the candidate word display box.

The embodiment of the application also discloses an electronic device, which comprises a memory, a processor and a computer program which is stored on the memory and can be run on the processor, wherein the processor executes the program to realize the following steps:

receiving the voice-recognized data from the server;

Preferably, the processor executes the program to further implement the following steps:

underlining below the primary word;

or, the main word is presented in a different color from the other words;

and when the clicking operation on the main word is detected, displaying the candidate words of the main word in sequence by using a candidate word display frame according to the sequence from high to low of the recognition probability.

and when the selection operation of any candidate word is detected, displaying the selected candidate word in the main text in a hyperlink mode, wherein the candidate word can be clicked, and hiding a candidate word display box.

underlining below the candidate word presented in the main text;

Embodiments of the application also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method steps of any of claims 1-8.

The method and device for displaying candidate homophones for voice recognition provided by the embodiments of the invention improve the voice recognition function of existing intelligent terminals: for a voice recognition result that has homophones, the recognized candidate homophones are displayed for the user to choose from, so that the user can pick a candidate word by clicking. This improves the intelligence of the intelligent device in voice recognition and avoids the need for the user to retype text manually after voice recognition.

Drawings

FIG. 1 is a schematic diagram of a main flow of conventional speech recognition;

FIG. 2 is a flowchart illustrating a method for displaying candidate homophones for speech recognition according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a conventional json data format;

FIG. 4 is a schematic diagram of an exemplary interface showing a word with the highest recognition probability according to an embodiment of the present invention;

FIG. 5 is a flowchart illustrating an interactive selection method for candidate homophones in speech recognition according to a second embodiment of the present invention;

FIG. 6 is a schematic diagram of an exemplary interface for displaying candidate words according to a second embodiment of the present invention;

FIG. 7 is a schematic diagram of an interface showing a candidate word selected by a user according to a second embodiment of the present invention;

FIG. 8 is a schematic interface diagram illustrating a candidate word display box for displaying candidate words according to a second embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and examples.

The embodiment of the invention provides a technical scheme for improving the voice recognition function of an intelligent terminal: for a voice recognition result that has homophones, the recognized candidate homophones are displayed together for the user to choose from, so that the user can pick a word by clicking. This improves the intelligence of intelligent equipment in voice recognition and avoids the need for the user to retype text manually after voice recognition.

The method for displaying the candidate homophones for voice recognition provided by the embodiment of the invention comprises the following steps:

firstly, receiving data after voice recognition from a server;

then, analyzing the data and judging whether candidate words exist in the data or not;

finally, if candidate words exist, displaying the word with the highest recognition probability as the main word in a hyperlink mode, wherein the main word can be clicked.

The main word may be underlined, displayed in a color different from that of other words, or both, so as to distinguish it from other words.

As described above, a main word may be clicked, and when a click operation on the main word is detected, a candidate word of the main word is displayed by using a candidate word display box, so that a candidate word to be selected is displayed to a user. When the candidate words are presented, the candidate words can be presented in sequence from high to low in recognition probability.

In addition, when a selection operation on any candidate word is detected, this indicates that the user wishes to use that word as the new main word; therefore, the selected candidate word is presented in the main text, and the previously displayed candidate word display box is hidden. The selected candidate word can be presented in a hyperlink manner as described above, so that it can be clicked.

Similarly, a candidate word displayed in the main text may be underlined, displayed in a color different from that of other words, or both.

When a click operation on a candidate word displayed in the main text is detected, the main word and the other candidate words are displayed in the candidate word display box, so that the other selectable candidate words are shown to the user.

The technical solution of the present application is further described in detail by three preferred embodiments:

Embodiment one:

A flow chart of the method for displaying candidate homophones for speech recognition provided by this embodiment of the present invention is shown in FIG. 2 and includes the following steps:

Step 1: The client receives the voice-recognized data from the server.

In this embodiment, the client refers to an application client providing a voice recognition function in the smart device.

During speech recognition, the server extracts 1 to N homophones according to the existing speech model algorithm and returns them to the client, for example in a common JSON data format as shown in FIG. 3:

Still taking "yuè fù" from the background as an example, the server returns three homophones to the client, "monthly payment", "Yue father" and "Yufu", together with their respective recognition probabilities: 0.87, 0.67 and 0.32.
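
As a rough illustration of what such a response and its parsing might look like, the sketch below uses a hypothetical payload. FIG. 3 is not reproduced in this text, so the field names `text`, `candidates`, `word` and `probability` and the function `parse_candidates` are assumptions; only the three words and their probabilities come from the example above.

```python
import json
from typing import Dict, List

# Hypothetical server response; the field names are assumed, not taken from FIG. 3.
RESPONSE = """
{
  "text": "monthly payment",
  "candidates": [
    {"word": "Yue father", "probability": 0.67},
    {"word": "monthly payment", "probability": 0.87},
    {"word": "Yufu", "probability": 0.32}
  ]
}
"""


def parse_candidates(payload: str) -> List[Dict]:
    """Parse the response and return its homophones, most probable first.

    Returns an empty list when the response carries no candidate words.
    """
    data = json.loads(payload)
    candidates = data.get("candidates", [])
    return sorted(candidates, key=lambda c: c["probability"], reverse=True)


for candidate in parse_candidates(RESPONSE):
    print(candidate["word"], candidate["probability"])
```

Sorting on the client is a design choice of this sketch; a server could equally return the list already ordered.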

Step 2: The client parses the data returned by the server.

Step 3: The client determines whether the returned data contains candidate words. If so, step 4 is executed; otherwise, the result is displayed in the conventional manner and the flow ends.

Step 4: The word with the highest recognition probability is displayed in a hyperlink-like manner: it is underlined and can be clicked.

Preferably, the word may also be presented in a different color than the other words. In this embodiment, the word with the highest recognition probability is referred to as the "main word", and the "main word" is relative to the "candidate word".

An exemplary interface presenting the word with the highest recognition probability is shown in FIG. 4. According to step 1 of this embodiment, "monthly payment" has the highest recognition probability, so in the interface shown in FIG. 4 "monthly payment" is displayed in a hyperlink manner: it can be clicked, and it is shown in a blue font and underlined.
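
Steps 3 and 4 of this embodiment can be sketched as a small rendering function. This is an illustration under assumptions rather than the client's actual implementation: the `segment` dictionary layout, the `main-word` class name and the use of HTML markup are hypothetical, chosen only to show the branch between conventional display and hyperlink-like display.

```python
from html import escape
from typing import Dict


def render_segment(segment: Dict) -> str:
    """Render one recognized segment as HTML-like markup.

    A segment without candidate words is shown conventionally; otherwise the
    most probable word is shown as a blue, underlined, clickable main word.
    """
    candidates = segment.get("candidates", [])
    if not candidates:
        return escape(segment["text"])  # conventional display, flow ends
    main = max(candidates, key=lambda c: c["probability"])
    return (
        '<a href="#" class="main-word" '
        'style="color: blue; text-decoration: underline">'
        f'{escape(main["word"])}</a>'
    )


plain = render_segment({"text": "hello"})
linked = render_segment({
    "text": "monthly payment",
    "candidates": [
        {"word": "monthly payment", "probability": 0.87},
        {"word": "Yue father", "probability": 0.67},
        {"word": "Yufu", "probability": 0.32},
    ],
})
print(plain)
print(linked)
```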

Embodiment two:

After the word with the highest recognition probability has been displayed according to embodiment one, the candidate homophones can be further selected according to the interaction method provided by the second embodiment of the present invention, as shown in FIG. 5, which includes:

Step 1: A click operation by the user is detected.

Step 2: If the word on which the user clicked is clickable, this indicates that the word has candidate words, and step 3 is executed; otherwise, the flow ends.

Step 3: The candidate words of the word are displayed in the candidate word display box.

An exemplary interface presenting candidate words is shown in FIG. 6, which presents the two candidate words for "monthly payment": "Yue father" and "Yufu". Preferably, the candidate words are displayed in order of recognition probability from high to low, and at most N candidate words are displayed at once, for example N = 3. If there are more candidates than can be displayed, the user can slide the display box to view the rest.

Step 4: When it is detected that the user has selected any candidate word, step 5 is executed.

Step 5: The candidate word selected by the user is taken as the new main word and displayed in the main text, and the candidate word display box is hidden.

Preferably, the candidate word selected by the user is also presented in a hyperlink-like manner: it is underlined and can be clicked. Assuming that the user selects "Yue father", the resulting interface is as shown in FIG. 7.

When the user clicks the current main word "Yue father" again, the original main word "monthly payment" is displayed in the candidate word display box together with the other candidate word "Yufu", as shown in FIG. 8. For the same word, the user can thus switch among the candidate words by clicking repeatedly.
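
The interaction in this embodiment can be modelled as a small piece of state, sketched below. The class `HomophoneSlot`, its method names and the `max_shown` parameter (standing in for N) are hypothetical; the sketch only returns the first N words of the display box and does not model sliding to view the rest.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HomophoneSlot:
    """State of one clickable word: current main word plus all homophones."""

    main_word: str
    homophones: List[Tuple[str, float]]  # (word, probability), main word included
    box_visible: bool = False

    def click(self, max_shown: int = 3) -> List[str]:
        """Open the display box: every homophone except the current main
        word, ordered from high to low probability, at most max_shown."""
        self.box_visible = True
        others = [h for h in self.homophones if h[0] != self.main_word]
        others.sort(key=lambda h: h[1], reverse=True)
        return [word for word, _ in others[:max_shown]]

    def select(self, word: str) -> None:
        """Make the chosen candidate the new main word and hide the box."""
        known = [w for w, _ in self.homophones]
        if word not in known or word == self.main_word:
            raise ValueError(f"{word!r} is not a selectable candidate")
        self.main_word = word
        self.box_visible = False


slot = HomophoneSlot(
    "monthly payment",
    [("monthly payment", 0.87), ("Yue father", 0.67), ("Yufu", 0.32)],
)
print(slot.click())        # ['Yue father', 'Yufu']
slot.select("Yue father")
print(slot.main_word)      # Yue father
print(slot.click())        # ['monthly payment', 'Yufu']
```

Keeping the full homophone list in the slot, rather than only the words currently in the box, is what allows the original main word to reappear as a candidate after a switch.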

Embodiment three:

The second embodiment replaces the original main word with a candidate word selected by the user. The original main word, now displayed in the candidate word display box, can also be restored as the main word by selecting it again, as described in this embodiment.

Continuing from FIG. 8 of the second embodiment, the steps are as follows:

Step 1: When it is detected that the user clicks the current main word "Yue father", the original main word "monthly payment" and the other candidate word "Yufu" are displayed together in the candidate word display box.

Step 2: When it is detected that the user selects "monthly payment" in the candidate word display box, step 3 is executed.

Step 3: "monthly payment", as selected by the user, is displayed in the main text as the new main word, and the candidate word display box is hidden.

At this point, the original main word "monthly payment" has become the main word again and is displayed in the main text, and the other candidate words are hidden.
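
The round trip described here, switching to a candidate and then restoring the original main word, can be checked with a small pure function. The function `select_word` is a hypothetical helper written for this illustration; for simplicity it puts the previous main word into the position vacated by the chosen word instead of re-sorting the box by probability.

```python
from typing import List, Tuple


def select_word(main: str, box: List[str], chosen: str) -> Tuple[str, List[str]]:
    """Return the new (main word, display box contents) after a selection.

    The chosen word becomes the main word, and the previous main word takes
    its place among the candidates in the display box.
    """
    if chosen not in box:
        raise ValueError(f"{chosen!r} is not in the candidate display box")
    new_box = [main if word == chosen else word for word in box]
    return chosen, new_box


# Second embodiment: the user switches to "Yue father".
main, box = select_word("monthly payment", ["Yue father", "Yufu"], "Yue father")
print(main, box)  # Yue father ['monthly payment', 'Yufu']

# Third embodiment: the user selects "monthly payment" again.
main, box = select_word(main, box, "monthly payment")
print(main, box)  # monthly payment ['Yue father', 'Yufu']
```

After the second selection the state is identical to the initial one, which is the behaviour this embodiment describes.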

Corresponding to the above method, an embodiment of the present application further provides an electronic device, whose constituent structure is shown in fig. 9, and includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the following steps:

receiving the voice-recognized data from the server;

underlining below the primary word;

or, the main word is presented in a different color from the other words;

underlining below the candidate word presented in the main text;

The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for presenting the candidate homophones for speech recognition according to the embodiment of the present application.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for displaying candidate homophones for speech recognition is characterized by comprising the following steps:

receiving the voice-recognized data from the server;

2. The method of claim 1, further comprising:

underlining below the primary word;

or, the main word is presented in a different color from the other words;

3. The method according to claim 1 or 2, characterized in that the method further comprises:

4. The method of claim 3, wherein:

and displaying the candidate words of the main word in sequence according to the sequence of the recognition probability from high to low.

5. The method of claim 3, further comprising:

6. The method of claim 5, wherein presenting the selected candidate word in a main text comprises:

7. The method of claim 6, wherein:

underlining below the candidate word presented in the main text;

8. The method of claim 6, further comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of:

receiving the voice-recognized data from the server;

10. The electronic device of claim 9, wherein the processor, when executing the program, further performs the steps of:

underlining below the primary word;

or, the main word is presented in a different color from the other words;

11. The electronic device according to claim 9 or 10, wherein the processor when executing the program further performs the steps of:

12. The electronic device of claim 11, wherein the processor, when executing the program, further performs the steps of:

13. The electronic device of claim 12, wherein the processor, when executing the program, further performs the steps of:

underlining below the candidate word presented in the main text;

14. The electronic device of claim 13, wherein the processor, when executing the program, further performs the steps of:

15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 8.