WO2021000068A1 - Speech recognition method and apparatus used by non-native speaker - Google Patents

Speech recognition method and apparatus used by non-native speaker

Info

Publication number
WO2021000068A1
WO2021000068A1 (PCT/CN2019/093947; CN2019093947W)
Authority
WO
WIPO (PCT)
Prior art keywords
language
accent
module
decoded
standard
Prior art date
Application number
PCT/CN2019/093947
Other languages
French (fr)
Chinese (zh)
Inventor
郑小龙 (ZHENG Xiaolong)
Original Assignee
播闪机械人有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 播闪机械人有限公司
Priority to PCT/CN2019/093947
Publication of WO2021000068A1

Classifications

    • G - Physics
    • G10 - Musical instruments; Acoustics
    • G10L - Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 - Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G - Physics
    • G10 - Musical instruments; Acoustics
    • G10L - Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • The present invention relates to a speech recognition method and device, and in particular to a speech recognition method and device for use by non-native speakers.
  • Speech recognition technology is now widely used in many different fields, such as home control devices, smart speakers, personal assistants, and telephone interfaces.
  • Commonly used speech recognition solutions and products include Amazon's Alexa, Google Assistant, and Apple's Siri.
  • Companies such as IBM, Apple, Amazon, and Google provide developers with powerful speech recognition APIs for developing further applications.
  • A powerful speech recognition engine requires substantial resources to develop, and it takes a long time to tune the engine to provide a high-quality speech-to-text function. Handling audio signals affected by noise, echo, and reverberation requires a broad range of engineering techniques. Developing a high-quality speech recognition module is therefore costly.
  • Non-native speakers in different regions each have an accent based on their own native language.
  • Collecting pronunciation samples from all non-native speakers would cost too much, and many new pronunciation modules would have to be developed to cover non-native speakers in many different regions.
  • A low-cost, efficient method is therefore required to adapt a speech recognition engine to decode different accents.
  • Long short-term memory (LSTM) and other recurrent neural networks have been applied in many deep learning fields, such as translation, natural language processing, and speech recognition. When an LSTM is used for translation, it usually performs letter-based or word-based sequence-to-sequence translation.
  • For adapting to accents, neither letter-based nor word-based methods are effective, because they do not correlate fully with pronunciation.
  • The present invention proposes using an existing speech recognition module to obtain a decoding result for the accented speech of non-native speakers, even though that decoding result may be incorrect.
  • An accent module is derived from the decoding results of a non-native accent group, and the accent module is used to translate the decoded output of the speech recognition module.
  • The present invention provides the following technical scheme:
  • A speech recognition method for use by non-native speakers includes the following steps:
  • Step S10: the speech recognition module converts the received speech of the non-native speaker into a decoded language and transmits the decoded language to the language matching module and the non-native speech translation module respectively;
  • Step S11: according to the received decoded language, the language matching module retrieves the standard language corresponding to the decoded language stored in the language matching module; the standard language and the decoded language received from the speech recognition module form a decoded-language/standard-language pair, which is transmitted to the accent analyzer;
  • Step S12: the accent analyzer compares the received decoded-language/standard-language pair with the accent categories stored in the accent analyzer to determine the accent category corresponding to the pair, and sends that accent category to the accent module database;
  • Step S13: according to the accent category, the accent module database retrieves the accent module corresponding to the category and transmits the accent module to the non-native speech translation module;
  • Step S14: the non-native speech translation module uses the accent module to translate the decoded language from the speech recognition module into a standard sentence for output.
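The S10-S14 flow can be sketched as a chain of lookups. Everything below is a hypothetical, toy illustration of the data flow only: the module names mirror the steps, but the sample decoding, the stored pair, the accent category, and the string-replacement "accent module" are invented stand-ins for the patent's real components.

```python
# Toy sketch of steps S10-S14; all names and sample data are hypothetical.

# S10: an existing engine returns a (possibly wrong) decoding of accented speech
def speech_recognition_module(audio):
    return {"accented_audio_01": "sank you"}[audio]

# S11: retrieve the stored standard language and form a (decoded, standard) pair
STANDARD_LANGUAGE = {"sank you": "thank you"}
def language_matching_module(decoded):
    return (decoded, STANDARD_LANGUAGE[decoded])

# S12: map the pair to a stored accent category
ACCENT_CATEGORIES = {("sank you", "thank you"): "th-to-s substitution"}
def accent_analyzer(pair):
    return ACCENT_CATEGORIES[pair]

# S13: the database returns the accent module for that category
# (here a plain function; in the patent it is an LSTM-based translator)
ACCENT_MODULES = {"th-to-s substitution": lambda text: text.replace("sank", "thank")}
def accent_module_database(category):
    return ACCENT_MODULES[category]

# S14: the translation module applies the accent module to the decoded language
def recognize(audio):
    decoded = speech_recognition_module(audio)
    pair = language_matching_module(decoded)
    accent_module = accent_module_database(accent_analyzer(pair))
    return accent_module(decoded)

print(recognize("accented_audio_01"))  # -> thank you
```

The point of the sketch is the division of labor: the existing recognition engine is never retrained; only the per-accent translation step (S14) corrects its output.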
  • The decoded language in step S10 is decoded text or decoded phonemes.
  • The accent module in step S13 is obtained from decoded language received from non-native speakers together with manually added decoded language.
  • The accent module contains multiple sequentially connected LSTM layers and one dense layer, with the last LSTM layer connected to the dense layer.
  • A speech recognition device for use by non-native speakers includes:
  • the speech recognition module, configured to convert the received speech of the non-native speaker into a decoded language and to transmit the decoded language to the language matching module and the non-native speech translation module respectively;
  • the language matching module, configured to retrieve, according to the received decoded language, the standard language corresponding to the decoded language stored in the language matching module; the standard language and the decoded language received from the speech recognition module form a decoded-language/standard-language pair, which is transmitted to the accent analyzer;
  • the accent analyzer, configured to compare the received decoded-language/standard-language pair with the accent categories stored in the accent analyzer to determine the accent category corresponding to the pair, and to send that accent category to the accent module database;
  • the accent module database, configured to retrieve, according to the accent category, the accent module corresponding to the category and to transmit the accent module to the non-native speech translation module;
  • the non-native speech translation module, configured to use the accent module to translate the decoded language from the speech recognition module into a standard sentence for output.
  • Compared with the prior art, the present invention has the following beneficial effects: in the speech recognition method for use by non-native speakers, the accent analyzer compares the received decoded-language/standard-language pair with the pairs stored in the accent analyzer to determine the accent category to which the decoded text belongs; this decoded-language/standard-language pairing converts the speech of non-native speakers into the corresponding standard native-speaker language more accurately.
  • Collecting manually added language for the accent module saves a great deal of time and removes the time cost of collecting large amounts of speech from non-native speakers.
  • Figure 1 is a flowchart of the speech recognition method for use by non-native speakers according to the present invention;
  • Figure 2 is a flowchart of accent module collection in the speech recognition method for use by non-native speakers according to the present invention;
  • Figure 3 is a flowchart of the non-native speech translation module in the speech recognition method for use by non-native speakers according to the present invention.
  • Step S10: the speech recognition module 101 converts the received speech of the non-native speaker into a decoded language and transmits the decoded language to the language matching module 111 and the non-native speech translation module 104 respectively; the decoded language may be decoded text or decoded phonemes.
  • The speech recognition module 101 also handles all noise, echo, and reverberation problems in speech signal processing.
  • Step S11: according to the received decoded language, the language matching module 111 retrieves the standard language corresponding to the decoded language stored in the language matching module 111; the standard language and the decoded language received from the speech recognition module 101 form a decoded-language/standard-language pair, which is transmitted to the accent analyzer 102. The language matching module 111 stores a large number of standard languages.
  • Step S12: the accent analyzer 102 compares the received decoded-language/standard-language pair with the accent categories stored in the accent analyzer 102 to determine the accent category corresponding to the pair, and sends that accent category to the accent module database 103.
  • Each accent category 303 stored in the analyzer 102 contains many pairs of decoded language 313 and standard language 323, which likewise come from the speech recognition module 101 and the language matching module 111 respectively.
  • During a preliminary storage stage before use, the analyzer 102 collects and stores a large number of accent categories.
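The patent does not specify how the analyzer 102 compares an incoming pair with its stored categories 303. One plausible realization, sketched here with invented categories and example pairs, is to summarize each (decoded, standard) pair by its divergence pattern and pick the stored category whose example pairs share the most patterns:

```python
from difflib import SequenceMatcher

# Hypothetical stored accent categories (303): each maps to previously
# collected (decoded language 313, standard language 323) example pairs.
ACCENT_CATEGORIES = {
    "th-to-s": [("sank you", "thank you"), ("sree", "three")],
    "r-to-l": [("flied lice", "fried rice")],
}

def pair_signature(decoded, standard):
    # Describe how the decoded text diverges from the standard text
    ops = SequenceMatcher(a=decoded, b=standard).get_opcodes()
    return {(tag, decoded[i1:i2], standard[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"}

def classify(pair):
    # Pick the category whose examples share the most divergence patterns
    sig = pair_signature(*pair)
    def score(category):
        return sum(len(sig & pair_signature(d, s))
                   for d, s in ACCENT_CATEGORIES[category])
    return max(ACCENT_CATEGORIES, key=score)

print(classify(("sing", "thing")))  # -> th-to-s
```

A production analyzer would likely operate on phoneme sequences rather than characters, but the pattern-matching idea carries over directly.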
  • Step S13: according to the accent category, the accent module database 103 retrieves the accent module 416 corresponding to the category from the accent module database 103 and transmits the accent module 416 to the non-native speech translation module 104; the accent module database 103 stores multiple accent modules 416.
  • The accent module 416 is collected as follows: a small amount of speech from non-native speakers and a large amount of manually added language are received through the speech recognition module 101, converted into decoded language, and transmitted to the data processing unit 404.
  • The data processing unit 404 receives the decoded language from the small number of non-native speakers via the speech recognition module 101 together with the large amount of manually added decoded language, normalizes the format of the decoded language, and then sends the format-processed decoded language to the data embedding module 405; the data embedding module 405 converts the format-processed decoded language into digital codes and sends them to the accent module 416.
  • Each accent module 416 contains multiple sequentially connected LSTM layers 406 (long short-term memory layers) and one dense layer 408, with the last LSTM layer 406 connected to the dense layer 408.
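As a concrete illustration of the data processing (404) and data embedding (405) stages, the sketch below normalizes a decoded phoneme string and converts it into fixed-length digital codes. The phoneme inventory, the normalization rule, and the padding length are all assumptions made for the example, not details given in the patent.

```python
# Hypothetical phoneme vocabulary; 0 doubles as padding / unknown
PHONEME_VOCAB = {"<pad>": 0, "s": 1, "th": 2, "ae": 3, "ng": 4, "k": 5, "y": 6, "uw": 7}

def process_format(decoded):
    # Data processing unit 404: unify the format of the decoded language
    # (here: lowercase and split into phoneme tokens)
    return decoded.lower().split()

def embed(tokens, max_len=8):
    # Data embedding module 405: map tokens to digital codes, padded to max_len
    codes = [PHONEME_VOCAB.get(tok, 0) for tok in tokens]
    return (codes + [0] * max_len)[:max_len]

digital_codes = embed(process_format("S AE NG K"))
print(digital_codes)  # -> [1, 3, 4, 5, 0, 0, 0, 0]
```

The fixed-length integer sequence is what the LSTM layers of the accent module would consume.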
  • Step S14: the non-native speech translation module 104 uses the accent module 416 to translate the decoded language from the speech recognition module 101 into standard sentences for output; as shown in Figure 3, the non-native speech translation module 104 includes the data processing unit 404, the data embedding module 405, and the received accent module 416; the data processing unit 404 is connected to the data embedding module 405, and the data embedding module 405 is connected to the accent module 416.
  • The data processing unit 404 normalizes the format of the received decoded language and then sends the format-processed decoded language to the data embedding module 405.
  • The data embedding module 405 converts the format-processed decoded language into digital codes and sends them to the accent module 416.
  • Each accent module 416 contains multiple sequentially connected LSTM layers 406 (long short-term memory layers) and one dense layer 408.
  • The last LSTM layer 406 is connected to the dense layer 408.
  • After receiving the digital codes, the sequentially connected LSTM layers 406 convert them into a two-dimensional code and pass it to the dense layer 408.
  • The dense layer 408 converts the two-dimensional code into standard sentences and outputs them.
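To make the layer wiring above concrete, here is a minimal pure-Python forward pass through two stacked LSTM layers followed by a dense softmax layer. The weights are random and untrained, the dimensions are tiny, and the output vocabulary is invented, so this only illustrates the shape of the computation (digital codes in, per-step LSTM outputs as the "two-dimensional code", one word distribution out), not a working translator.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(n_in, n_hid, rng):
    # One weight row per hidden unit and gate, acting on [input; previous hidden]
    return {gate: [[rng.uniform(-0.5, 0.5) for _ in range(n_in + n_hid)]
                   for _ in range(n_hid)]
            for gate in ("f", "i", "o", "g")}

def dot(row, vec):
    return sum(w * v for w, v in zip(row, vec))

def run_lstm(layer, inputs, n_hid):
    h, c = [0.0] * n_hid, [0.0] * n_hid
    outputs = []
    for x in inputs:
        z = x + h  # concatenate input with previous hidden state
        f = [sigmoid(dot(r, z)) for r in layer["f"]]    # forget gate
        i = [sigmoid(dot(r, z)) for r in layer["i"]]    # input gate
        o = [sigmoid(dot(r, z)) for r in layer["o"]]    # output gate
        g = [math.tanh(dot(r, z)) for r in layer["g"]]  # candidate cell state
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        outputs.append(h)  # per-step hidden states form the "two-dimensional code"
    return outputs

def dense_softmax(weights, h):
    logits = [dot(row, h) for row in weights]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
digital_codes = [1, 3, 4, 5]                 # codes from the embedding step
inputs = [[float(code)] for code in digital_codes]
layer1 = make_lstm(1, 4, rng)                # first LSTM layer
layer2 = make_lstm(4, 4, rng)                # last LSTM layer, feeds the dense layer
code_2d = run_lstm(layer2, run_lstm(layer1, inputs, 4), 4)
VOCAB = ["thank", "you", "three", "rice"]    # hypothetical output vocabulary
dense_weights = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in VOCAB]
probs = dense_softmax(dense_weights, code_2d[-1])
predicted = VOCAB[max(range(len(VOCAB)), key=probs.__getitem__)]
```

A trained implementation would normally use a framework such as TensorFlow or PyTorch, but the structure here (stacked LSTM layers whose final output feeds a dense layer) matches the arrangement described for the accent module 416.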
  • A speech recognition device for use by non-native speakers includes:
  • the speech recognition module 101, configured to convert the received speech of the non-native speaker into a decoded language and to transmit the decoded language to the language matching module 111 and the non-native speech translation module 104 respectively;
  • the language matching module 111, configured to retrieve, according to the received decoded language, the standard language corresponding to the decoded language stored in the language matching module 111; the standard language and the decoded language received from the speech recognition module 101 form a decoded-language/standard-language pair, which is transmitted to the accent analyzer 102;
  • the accent analyzer 102, configured to compare the received decoded-language/standard-language pair with the accent categories stored in the accent analyzer 102 to determine the accent category corresponding to the pair, and to send that accent category to the accent module database 103;
  • the accent module database 103, configured to retrieve, according to the accent category, the accent module 416 corresponding to the category from the accent module database 103 and to transmit the accent module 416 to the non-native speech translation module 104;
  • the non-native speech translation module 104, configured to use the accent module 416 to translate the decoded language from the speech recognition module 101 into standard sentences for output.
  • The decoded language is decoded text or decoded phonemes; the accent module 416 is obtained from the decoded language of non-native speakers together with manually added decoded language; the accent module 416 contains multiple sequentially connected LSTM layers and one dense layer, with the last LSTM layer connected to the dense layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

Provided are a speech recognition method and apparatus used by a non-native speaker. The speech recognition apparatus comprises: a speech recognition module (101) for converting a received language of a non-native speaker into a decoded language, and transmitting the decoded language to a language matching module (111) and a non-native speech translation module (104) respectively; the language matching module (111) for invoking, according to the received decoded language, a standard language corresponding to the decoded language and stored in the language matching module (111), wherein the standard language and the decoded language received from the speech recognition module (101) form a decoded language and standard language pair, and the decoded language and standard language pair is transmitted to an accent analyzer (102); the accent analyzer (102) for comparing the received decoded language and standard language pair with accent types stored in the accent analyzer (102), obtaining, through analysis, an accent type corresponding to the decoded language and standard language pair, and sending the accent type to an accent module database (103); the accent module database (103) for invoking, according to the accent type and from the accent module database (103), an accent module (416) corresponding to the accent type, and transmitting the accent module (416) to the non-native speech translation module (104); and the non-native speech translation module (104) for translating the decoded language from the speech recognition module (101) into a standard sentence by means of the accent module (416) and outputting the standard sentence. By means of the decoded language and standard language pair, the language of the non-native speaker can be converted into a corresponding standard language of a native speaker more accurately, and an accent module collection method can save a great amount of time.

Description

Speech recognition method and device for use by non-native speakers

Technical Field

The present invention relates to a speech recognition method and device, and in particular to a speech recognition method and device for use by non-native speakers.

Background Art
At present, speech recognition technology is widely used in many different fields, such as home control devices, smart speakers, personal assistants, and telephone interfaces. Commonly used speech recognition solutions and products include Amazon's Alexa, Google Assistant, and Apple's Siri. In addition, companies such as IBM, Apple, Amazon, and Google provide developers with powerful speech recognition APIs for developing further applications.

A powerful speech recognition engine requires substantial resources to develop, and it takes a long time to tune the engine to provide a high-quality speech-to-text function. Handling audio signals affected by noise, echo, and reverberation requires a broad range of engineering techniques. Developing a high-quality speech recognition module is therefore costly.

Most speech recognition engines are developed and tuned on native speakers: to build a speech recognition engine, speech samples must be collected from a large number of native speakers for the engine to achieve its best performance. When such an engine is used by non-native speakers who cannot pronounce the language as native speakers do, it does not work properly; when a non-native speaker speaks with an accent, the engine returns incorrect decoded text.

For any language, English for example, there are many non-native speakers who speak with different accents. Non-native speakers in different regions each have an accent based on their own native language. Collecting pronunciation samples from all non-native speakers would cost too much, and many new pronunciation modules would have to be developed to cover non-native speakers in many different regions. A low-cost, efficient method is therefore required to adapt a speech recognition engine to decode different accents.

Deep learning methods are commonly used to solve language problems. Long short-term memory (LSTM) and other recurrent neural networks have been applied in many deep learning fields, such as translation, natural language processing, and speech recognition. When an LSTM is used for translation, it usually performs letter-based or word-based sequence-to-sequence translation. For adapting to accents, neither letter-based nor word-based methods are effective, because they do not correlate fully with pronunciation.
Summary of the Invention

To solve the above problems, the present invention proposes using an existing speech recognition module to obtain a decoding result for the accented speech of non-native speakers, even though that decoding result may be incorrect. An accent module is derived from the decoding results of a non-native accent group, and the accent module is used to translate the decoded output of the speech recognition module.

To achieve the above objectives, the present invention provides the following technical scheme:
A speech recognition method for use by non-native speakers, the method including the following steps:

Step S10: the speech recognition module converts the received speech of the non-native speaker into a decoded language and transmits the decoded language to the language matching module and the non-native speech translation module respectively;

Step S11: according to the received decoded language, the language matching module retrieves the standard language corresponding to the decoded language stored in the language matching module; the standard language and the decoded language received from the speech recognition module form a decoded-language/standard-language pair, which is transmitted to the accent analyzer;

Step S12: the accent analyzer compares the received decoded-language/standard-language pair with the accent categories stored in the accent analyzer to determine the accent category corresponding to the pair, and sends that accent category to the accent module database;

Step S13: according to the accent category, the accent module database retrieves the accent module corresponding to the category and transmits the accent module to the non-native speech translation module;

Step S14: the non-native speech translation module uses the accent module to translate the decoded language from the speech recognition module into a standard sentence for output.

Preferably, the decoded language in step S10 is decoded text or decoded phonemes.

Preferably, the accent module in step S13 is obtained from decoded language received from non-native speakers together with manually added decoded language.

Preferably, the accent module contains multiple sequentially connected LSTM layers and one dense layer, with the last LSTM layer connected to the dense layer.
A speech recognition device for use by non-native speakers, including:

the speech recognition module, configured to convert the received speech of the non-native speaker into a decoded language and to transmit the decoded language to the language matching module and the non-native speech translation module respectively;

the language matching module, configured to retrieve, according to the received decoded language, the standard language corresponding to the decoded language stored in the language matching module; the standard language and the decoded language received from the speech recognition module form a decoded-language/standard-language pair, which is transmitted to the accent analyzer;

the accent analyzer, configured to compare the received decoded-language/standard-language pair with the accent categories stored in the accent analyzer to determine the accent category corresponding to the pair, and to send that accent category to the accent module database;

the accent module database, configured to retrieve, according to the accent category, the accent module corresponding to the category and to transmit the accent module to the non-native speech translation module;

the non-native speech translation module, configured to use the accent module to translate the decoded language from the speech recognition module into a standard sentence for output.

Compared with the prior art, the present invention has the following beneficial effects: in the speech recognition method for use by non-native speakers, the accent analyzer compares the received decoded-language/standard-language pair with the pairs stored in the accent analyzer to determine the accent category to which the decoded text belongs; this decoded-language/standard-language pairing converts the speech of non-native speakers into the corresponding standard native-speaker language more accurately. Collecting manually added language for the accent module saves a great deal of time and removes the time cost of collecting large amounts of speech from non-native speakers.
Brief Description of the Drawings

Figure 1 is a flowchart of the speech recognition method for use by non-native speakers according to the present invention;

Figure 2 is a flowchart of accent module collection in the speech recognition method for use by non-native speakers according to the present invention;

Figure 3 is a flowchart of the non-native speech translation module in the speech recognition method for use by non-native speakers according to the present invention.
Detailed Description of Embodiments

For a better understanding of the technical scheme of the present invention, the embodiments provided by the present invention are described in detail below with reference to the accompanying drawings. As shown in Figures 1-3, the method includes the following steps:
Step S10: the speech recognition module 101 converts the received speech of the non-native speaker into a decoded language and transmits the decoded language to the language matching module 111 and the non-native speech translation module 104 respectively. The decoded language may be decoded text or decoded phonemes, and the speech recognition module 101 also handles all noise, echo, and reverberation problems in speech signal processing.

Step S11: according to the received decoded language, the language matching module 111 retrieves the standard language corresponding to the decoded language stored in the language matching module 111; the standard language and the decoded language received from the speech recognition module 101 form a decoded-language/standard-language pair, which is transmitted to the accent analyzer 102. The language matching module 111 stores a large number of standard languages.

Step S12: the accent analyzer 102 compares the received decoded-language/standard-language pair with the accent categories stored in the accent analyzer 102 to determine the accent category corresponding to the pair, and sends that accent category to the accent module database 103. Each accent category 303 stored in the analyzer 102 contains many pairs of decoded language 313 and standard language 323, which likewise come from the speech recognition module 101 and the language matching module 111 respectively. During a preliminary storage stage before use, the analyzer 102 collects and stores a large number of accent categories.

Step S13: according to the accent category, the accent module database 103 retrieves the accent module 416 corresponding to the category from the accent module database 103 and transmits the accent module 416 to the non-native speech translation module 104. The accent module database 103 stores multiple accent modules 416.

As shown in Figure 2, the accent module 416 is collected as follows: a small amount of speech from non-native speakers and a large amount of centrally, manually added language are received through the speech recognition module 101, converted into decoded language, and transmitted to the data processing unit 404. The data processing unit 404 receives the decoded language from the small number of non-native speakers via the speech recognition module 101 together with the large amount of manually added decoded language, normalizes the format of the decoded language, and then sends the format-processed decoded language to the data embedding module 405. The data embedding module 405 converts the format-processed decoded language into digital codes and sends them to the accent module 416. Each accent module 416 contains multiple sequentially connected LSTM layers 406 (long short-term memory layers) and one dense layer 408, with the last LSTM layer 406 connected to the dense layer 408.

Step S14: the non-native speech translation module 104 uses the accent module 416 to translate the decoded language from the speech recognition module 101 into standard sentences for output. As shown in Figure 3, the non-native speech translation module 104 includes the data processing unit 404, the data embedding module 405, and the received accent module 416; the data processing unit 404 is connected to the data embedding module 405, and the data embedding module 405 is connected to the accent module 416. The data processing unit 404 normalizes the format of the received decoded language and then sends the format-processed decoded language to the data embedding module 405. The data embedding module 405 converts the format-processed decoded language into digital codes and sends them to the accent module 416. Each accent module 416 contains multiple sequentially connected LSTM layers 406 (long short-term memory layers) and one dense layer 408, with the last LSTM layer 406 connected to the dense layer 408. After receiving the digital codes, the sequentially connected LSTM layers 406 convert them into a two-dimensional code and pass it to the dense layer 408, and the dense layer 408 converts the two-dimensional code into standard sentences and outputs them.
A speech recognition apparatus for use by non-native speakers, comprising:
The speech recognition module 101 is configured to convert received speech from a non-native speaker into a decoded language, and to transmit the decoded language to the language matching module 111 and the non-native speech translation module 104 respectively.
The language matching module 111 is configured to retrieve, according to the received decoded language, the standard language stored in the language matching module 111 that corresponds to that decoded language; the standard language and the decoded language received from the speech recognition module 101 form a decoded-language/standard-language pair, which is transmitted to the accent analyzer 102.
The accent analyzer 102 is configured to compare the received decoded-language/standard-language pair against the accent categories stored in the accent analyzer 102, to determine the accent category corresponding to the pair, and to send the accent category to the accent module database 103.
The accent module database 103 is configured to retrieve, according to the accent category, the accent module 416 corresponding to that accent category from the accent module database 103, and to transmit the accent module 416 to the non-native speech translation module 104.
The non-native speech translation module 104 is configured to use the accent module 416 to translate the decoded language from the speech recognition module 101 into standard sentences for output.
The decoded language is decoded text or decoded phonemes. The accent module 416 is obtained by receiving the decoded language of non-native speakers together with manually added decoded language. The accent module 416 contains a plurality of sequentially connected LSTM layers and one dense layer, the last LSTM layer being connected to the dense layer.
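Taken together, the apparatus above can be sketched as a chain of five cooperating components. The class names, method names, and toy lookup tables below are assumptions made for illustration; the patent specifies only the modules' responsibilities and reference numerals, not any concrete interface.

```python
# Illustrative end-to-end sketch of modules 101, 111, 102, 103, 104.
# All lookup tables and the string-replacement "accent module" are toy
# stand-ins for the stored standard languages and trained LSTM models.

class SpeechRecognitionModule:            # module 101
    def recognize(self, audio):
        return audio.lower()              # stand-in for real decoding

class LanguageMatchingModule:             # module 111
    def __init__(self, store):
        self.store = store                # decoded language -> standard language
    def match(self, decoded):
        return decoded, self.store[decoded]

class AccentAnalyzer:                     # module 102
    def analyze(self, decoded, standard):
        # toy rule: any mismatch is classed as one accent category
        return "accent_A" if decoded != standard else "standard"

class AccentModuleDatabase:               # module 103
    def __init__(self, modules):
        self.modules = modules            # accent category -> accent module 416
    def fetch(self, category):
        return self.modules[category]

class NonNativeTranslationModule:         # module 104
    def translate(self, decoded, accent_module):
        return accent_module(decoded)     # accent module 416 does the mapping

def pipeline(audio):
    recog = SpeechRecognitionModule()
    matcher = LanguageMatchingModule({"zis is good": "this is good"})
    analyzer = AccentAnalyzer()
    db = AccentModuleDatabase({"accent_A": lambda s: s.replace("zis", "this"),
                               "standard": lambda s: s})
    translator = NonNativeTranslationModule()

    decoded = recog.recognize(audio)                     # step S10
    pair = matcher.match(decoded)                        # step S11
    category = analyzer.analyze(*pair)                   # step S12
    accent_module = db.fetch(category)                   # step S13
    return translator.translate(decoded, accent_module)  # step S14

print(pipeline("Zis is good"))  # -> this is good
```

The sketch mirrors steps S10 through S14 one-to-one; in the described apparatus, the database lookup in step S13 would return a trained per-accent model rather than a string replacement.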
The speech recognition method for use by non-native speakers provided by the embodiments of the present invention has been described in detail above. A person of ordinary skill in the art may, following the ideas of the embodiments of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (8)

  1. A speech recognition method for use by non-native speakers, the method comprising the following steps:
    Step S10: the speech recognition module (101) converts received speech from a non-native speaker into a decoded language, and transmits the decoded language to the language matching module (111) and the non-native speech translation module (104) respectively;
    Step S11: the language matching module (111) retrieves, according to the received decoded language, the standard language stored in the language matching module (111) that corresponds to that decoded language; the standard language and the decoded language received from the speech recognition module (101) form a decoded-language/standard-language pair, which is transmitted to the accent analyzer (102);
    Step S12: the accent analyzer (102) compares the received decoded-language/standard-language pair against the accent categories stored in the accent analyzer (102) to determine the accent category corresponding to the pair, and sends the accent category to the accent module database (103);
    Step S13: the accent module database (103) retrieves, according to the accent category, the accent module (416) corresponding to that accent category from the accent module database (103), and transmits the accent module (416) to the non-native speech translation module (104);
    Step S14: the non-native speech translation module (104) uses the accent module (416) to translate the decoded language from the speech recognition module (101) into standard sentences for output.
  2. The speech recognition method for use by non-native speakers according to claim 1, wherein the decoded language in step S10 is decoded text or decoded phonemes.
  3. The speech recognition method for use by non-native speakers according to claim 1, wherein the accent module (416) in step S13 is obtained by receiving the decoded language of non-native speakers together with manually added decoded language.
  4. The speech recognition method for use by non-native speakers according to claim 1, wherein each accent module (416) contains a plurality of sequentially connected LSTM layers and one dense layer, the last LSTM layer being connected to the dense layer.
  5. A speech recognition apparatus for use by non-native speakers, comprising:
    the speech recognition module (101), configured to convert received speech from a non-native speaker into a decoded language, and to transmit the decoded language to the language matching module (111) and the non-native speech translation module (104) respectively;
    the language matching module (111), configured to retrieve, according to the received decoded language, the standard language stored in the language matching module (111) that corresponds to that decoded language, the standard language and the decoded language received from the speech recognition module (101) forming a decoded-language/standard-language pair, which is transmitted to the accent analyzer (102);
    the accent analyzer (102), configured to compare the received decoded-language/standard-language pair against the accent categories stored in the accent analyzer (102), to determine the accent category corresponding to the pair, and to send the accent category to the accent module database (103);
    the accent module database (103), configured to retrieve, according to the accent category, the accent module (416) corresponding to that accent category from the accent module database (103), and to transmit the accent module (416) to the non-native speech translation module (104);
    the non-native speech translation module (104), configured to use the accent module (416) to translate the decoded language from the speech recognition module (101) into standard sentences for output.
  6. The speech recognition apparatus for use by non-native speakers according to claim 5, wherein the decoded language is decoded text or decoded phonemes.
  7. The speech recognition apparatus for use by non-native speakers according to claim 5, wherein the accent module (416) is obtained by receiving the decoded language of non-native speakers together with manually added decoded language.
  8. The speech recognition apparatus for use by non-native speakers according to claim 5, wherein the accent module (416) contains a plurality of sequentially connected LSTM layers and one dense layer, the last LSTM layer being connected to the dense layer.
PCT/CN2019/093947 2019-06-29 2019-06-29 Speech recognition method and apparatus used by non-native speaker WO2021000068A1 (en)


Publications (1)

Publication Number Publication Date
WO2021000068A1 true WO2021000068A1 (en) 2021-01-07

Family

ID=74100097


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080147404A1 (en) * 2000-05-15 2008-06-19 Nusuara Technologies Sdn Bhd System and methods for accent classification and adaptation
CN101650943A (en) * 2008-12-19 2010-02-17 中国科学院声学研究所 Non-native speech recognition system and method thereof
CN105408952A (en) * 2013-02-21 2016-03-16 谷歌技术控股有限责任公司 Recognizing accented speech
CN107452379A (en) * 2017-08-17 2017-12-08 广州腾猴科技有限公司 The identification technology and virtual reality teaching method and system of a kind of dialect language
CN108346426A (en) * 2018-02-01 2018-07-31 威盛电子股份有限公司 Speech recognition equipment and audio recognition method
CN108682420A (en) * 2018-05-14 2018-10-19 平安科技(深圳)有限公司 A kind of voice and video telephone accent recognition method and terminal device
CN109785832A (en) * 2018-12-20 2019-05-21 安徽声讯信息技术有限公司 A kind of old man's set-top box Intelligent voice recognition method suitable for accent again



Legal Events

121  Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19936516; Country of ref document: EP; Kind code of ref document: A1)

NENP  Non-entry into the national phase (Ref country code: DE)

32PN  Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.05.2022))

122  Ep: pct application non-entry in european phase (Ref document number: 19936516; Country of ref document: EP; Kind code of ref document: A1)