CN112037762A - Chinese-English mixed speech recognition method - Google Patents

Chinese-English mixed speech recognition method

Info

Publication number
CN112037762A
CN112037762A
Authority
CN
China
Prior art keywords
chinese
recognition
section
english
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010948079.6A
Other languages
Chinese (zh)
Inventor
朱羿孜
许召辉
马翼平
陈年生
范光宇
饶蕾
周圣杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Avic East China Photoelectric Shanghai Co ltd
Original Assignee
Avic East China Photoelectric Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Avic East China Photoelectric Shanghai Co ltd filed Critical Avic East China Photoelectric Shanghai Co ltd
Priority to CN202010948079.6A priority Critical patent/CN112037762A/en
Publication of CN112037762A publication Critical patent/CN112037762A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
        • G10L15/005 Language recognition
        • G10L15/04 Segmentation; Word boundary detection
        • G10L15/08 Speech classification or search
            • G10L2015/088 Word spotting
        • G10L15/26 Speech to text systems

Abstract

The invention relates to a Chinese-English mixed speech recognition method. After Chinese-English mixed speech is collected, the speech is divided into several sections of speech signals according to a certain frame length with a 40% section overlap rate; the segmented speech signals are high-pass filtered and then windowed to obtain a windowed signal for each section; whether each speech signal lies in a silent section is calculated and judged, and Chinese recognition is performed on the speech signals of non-silent sections. When Chinese recognition succeeds, the Chinese of the section is output and a language identifier is set to 1; when it fails, the language identifier is set to 0 and English recognition is performed; when English recognition succeeds, the English of the section is output and the language identifier is updated to 0; when it fails, the language identifier is set back to 1 and Chinese recognition is performed again. Dividing the speech into multiple speech signals and judging whether each lies in a silent section can effectively improve recognition efficiency; meanwhile, performing Chinese recognition on each speech signal and falling back to English recognition on failure effectively ensures the accuracy of Chinese and English recognition.

Description

Chinese-English mixed speech recognition method
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a Chinese and English mixed voice recognition method.
Background
With the development of information globalization, multilingual communication is becoming more and more common. A single-language speech recognition system cannot effectively handle multilingual communication, and building a speech recognition system capable of recognizing speech signals in multiple languages is a new task for speech recognition technology.
Chinese currently has the most users of any language, and English is the most widely distributed language; therefore, building a Chinese-English bilingual recognition system has very good application prospects.
In the prior art, a first bilingual speech recognition system for chinese and english is implemented as follows: the Chinese speech recognizer and the English speech recognizer are integrated together, firstly, language recognition is carried out on input speech data, and then the corresponding speech recognizer is called according to a language recognition result, so that a task of Chinese and English bilingual speech recognition is realized.
In the prior art, a second bilingual speech recognition system for chinese and english is implemented as follows: parameter sharing of Chinese and English models is achieved according to linguistic knowledge or a data driving method, model confusion is reduced, and an acoustic model and a language model shared in Chinese and English are trained on the basis. Thus, only one recognizer is used to recognize the Chinese, English, Chinese and English mixed speech signal.
However, the first scheme requires a large amount of model training in the early stage, which is relatively costly, while the second scheme shares model parameters only at the model level via linguistic knowledge or data driving, which leads to insufficient parameter sharing and relatively high confusion between the Chinese and English models, and thus to inaccurate recognition by the Chinese-English bilingual speech recognition system.
Disclosure of Invention
1. Technical problem to be solved by the invention
The invention aims to solve the problem that the existing Chinese and English mixed speech recognition accuracy is not high.
2. Technical scheme
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
the invention relates to a Chinese-English mixed speech recognition method: after Chinese-English mixed speech is collected, the speech is divided into several sections of speech signals according to a certain frame length with a 40% section overlap rate; the segmented speech signals are high-pass filtered and then windowed to obtain a windowed signal for each section; whether each speech signal lies in a silent section is calculated and judged, and Chinese recognition is performed on the speech signals of non-silent sections. When Chinese recognition succeeds, the Chinese of the section is output and a language identifier is set to 1; when it fails, the language identifier is set to 0 and English recognition is performed; when English recognition succeeds, the English of the section is output and the language identifier is updated to 0; when it fails, the language identifier is set back to 1 and Chinese recognition is performed again.
Preferably, the speech is divided into several sections of speech signals according to a certain frame length: specifically, the speech is divided into N speech sections according to a certain frame length L, with the frame length controlled so that each section contains only one character as far as possible and the section overlap rate is 40%; the speech signal of section a is denoted x_a(n).
Preferably, the windowing process is calculated by the following formula (1) and formula (2):

(Formula (1) defines the window function w(n); it appears only as an image in the original. A Hamming window, w(n) = 0.54 − 0.46·cos(2πn/(L−1)) for 0 ≤ n ≤ L−1, is the conventional choice.)

y_a(n) = x_a(n)·w(n)    (2)

where w(n) is the window function and n is the speech frame index; calculation yields the windowed signal y_a(n).
Preferably, judging whether the speech signal is a silent section specifically comprises calculating the short-time energy E(a) and the zero-crossing rate Z(a) of each speech section by formula (3) and formula (4), weighting them with weights k_1 and k_2, calculating the weighted judgment function H(a) by formula (5), and setting a threshold H_set: if H(a) ≥ H_set, the section is a speech section; if H(a) < H_set, the section is a silent section;

E(a) = Σ_n y_a(n)²    (3)

Z(a) = (1/2) Σ_n |sgn(y_a(n)) − sgn(y_a(n−1))|    (4)

H(a) = k_1·E(a) + k_2·Z(a)    (5)

wherein the threshold H_set is 10; the values of the weights k_1 and k_2 are given by a formula that appears only as an image in the original. (Formulas (3) and (4) are likewise images in the original; the standard definitions of short-time energy and zero-crossing rate are shown here.)
Preferably, each section of the speech signal is judged for silence: if a silent section lies between two speech sections, the silent section is discarded and the two speech sections are left unmerged and regarded as two single characters or words; if no silent section lies between two speech sections, the two are merged into one speech section and regarded as one character or word.
Preferably, Chinese recognition specifically matches the speech signal of a non-silent section against a Chinese database, which contains basic everyday single characters (and words) and does not contain Chinese homophones of English-derived words. When matching succeeds, the Chinese of the section is output and the language identifier is set to 1. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and Chinese recognition is attempted again; if this succeeds, the Chinese of the section is output and the language identifier is set to 1; if it fails, the language identifier is set to 0 and English recognition is performed.
Preferably, English recognition specifically matches a speech signal whose language identifier is 0 against an English database, which contains basic everyday words and does not contain English homophones of Chinese-derived words. When matching succeeds, the English of the section is output and the language identifier is set to 0. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and English recognition is attempted again; if this succeeds, the English of the section is output and the language identifier is set to 0; if it fails, the language identifier is set to 1 and judgment continues.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
the invention relates to a Chinese-English mixed speech recognition method, which comprises the steps of collecting Chinese-English mixed speech, dividing the speech into a plurality of sections of speech signals according to a certain frame length, enabling the overlap rate of the sections to be 40%, carrying out high-pass filtering on the segmented speech signals, carrying out windowing processing after filtering to obtain a windowing function of each section of speech signals, calculating and judging whether the speech signals are in a mute section, carrying out Chinese recognition judgment on the speech signals in a non-mute section, outputting Chinese in the section and setting a language identifier to be 1 when the speech signals are successfully recognized, carrying out English recognition when the speech signals are unsuccessfully recognized and setting the language identifier to be 0, outputting English in the section and updating the language identifier to be 0 when the speech signals are unsuccessfully recognized, and setting the language identifier to be 1 when the speech signals are unsu. The voice is divided into a plurality of voice signals and whether the voice signals are in a mute section or not is judged, so that the recognition efficiency can be effectively improved, meanwhile, Chinese recognition is respectively carried out on each voice signal, English failure is carried out when the recognition fails, and the accuracy of Chinese and English recognition can be effectively ensured.
Drawings
FIG. 1 is a general flow diagram of hybrid speech recognition of the present invention;
FIG. 2 is a flow chart of the recognition process when the phonetic language identification is 1;
fig. 3 is a flowchart of recognition when the speech language flag is 0.
Detailed Description
In order to facilitate an understanding of the invention, it will now be described more fully with reference to the accompanying drawings, in which several embodiments are shown. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure will be more thorough.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs; the terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention; as used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to fig. 1 to fig. 3, in the Chinese-English mixed speech recognition method of this embodiment, after Chinese-English mixed speech is collected, the speech is divided into several sections of speech signals according to a certain frame length with a 40% section overlap rate; the segmented speech signals are high-pass filtered and then windowed to obtain a windowed signal for each section; whether each speech signal lies in a silent section is calculated and judged, and Chinese recognition is performed on the speech signals of non-silent sections. When Chinese recognition succeeds, the Chinese word is output and the language identifier is set to 1; when it fails, the language identifier is set to 0 and English recognition is performed; when English recognition succeeds, the English word is output and the language identifier is updated to 0; when it fails, the language identifier is set back to 1 and Chinese recognition is performed again. Dividing the speech into multiple speech signals and judging whether each lies in a silent section can effectively improve recognition efficiency; meanwhile, performing Chinese recognition on each speech signal and falling back to English recognition on failure effectively ensures the accuracy of Chinese and English recognition.
The speech is divided into several sections of speech signals according to a certain frame length: specifically, the speech is divided into N speech sections according to a certain frame length L, with the frame length controlled so that each section contains only one character as far as possible and the section overlap rate is 40%; the speech signal of each section is denoted x_a(n). By controlling the frame length so that each section contains at most one word, overall recognition accuracy can be improved, since single-character recognition is simpler and more accurate than word recognition; meanwhile, the 40% segment overlap rate effectively preserves continuity between consecutive speech sections, reduces cases where a word's speech is cut apart by segmentation, and ensures recognition accuracy and completeness.
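As a rough illustration (not part of the patent text), the segmentation with a 40% overlap rate described above can be sketched as follows; the hop of 60% of the frame length is derived from the stated overlap rate, and the frame length value is an arbitrary assumption:

```python
import numpy as np

def split_frames(speech, frame_len, overlap=0.4):
    """Split a 1-D speech signal into overlapping sections.

    frame_len plays the role of the frame length L in the method;
    overlap=0.4 gives the 40% section overlap rate described above.
    """
    hop = int(frame_len * (1 - overlap))            # step between frame starts
    n_frames = 1 + max(0, len(speech) - frame_len) // hop
    return [speech[i * hop : i * hop + frame_len] for i in range(n_frames)]

signal = np.arange(100.0)                 # stand-in for sampled speech
frames = split_frames(signal, frame_len=10)
# adjacent frames share 4 of their 10 samples (40% overlap)
```

With a frame length of 10 and a hop of 6, each frame shares its last 4 samples with the next frame, so no word boundary falls entirely between frames.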
The windowing process is calculated by the following formula (1) and formula (2):

(Formula (1) defines the window function w(n); it appears only as an image in the original. A Hamming window, w(n) = 0.54 − 0.46·cos(2πn/(L−1)) for 0 ≤ n ≤ L−1, is the conventional choice.)

y_a(n) = x_a(n)·w(n)    (2)

where w(n) is the window function and n is the speech frame index; calculation yields the windowed signal y_a(n). Windowing can reduce spectral leakage during speech segmentation.
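The windowing step can be sketched as below. Since formula (1) is rendered only as an image in the original, the Hamming window used here is an assumption, chosen as the conventional window for speech framing:

```python
import numpy as np

def window_frame(x_a):
    """Apply a window to one speech frame: y_a(n) = x_a(n) * w(n) (formula (2)).

    The exact window function of the patent's formula (1) is not
    recoverable from the text; a Hamming window is assumed here.
    """
    L = len(x_a)
    n = np.arange(L)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))   # assumed Hamming w(n)
    return x_a * w

y = window_frame(np.ones(256))
# tapers the frame edges toward 0.08 while leaving the center near 1.0
```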
Judging whether the speech signal is a silent section specifically comprises calculating the short-time energy E(a) and the zero-crossing rate Z(a) of each speech section by formula (3) and formula (4), weighting them with weights k_1 and k_2, calculating the weighted judgment function H(a) by formula (5), and setting a threshold H_set: if H(a) ≥ H_set, the section is a speech section; if H(a) < H_set, the section is a silent section;

E(a) = Σ_n y_a(n)²    (3)

Z(a) = (1/2) Σ_n |sgn(y_a(n)) − sgn(y_a(n−1))|    (4)

H(a) = k_1·E(a) + k_2·Z(a)    (5)

wherein the threshold H_set is 10; the values of the weights k_1 and k_2 are given by a formula that appears only as an image in the original. (Formulas (3) and (4) are likewise images in the original; the standard definitions of short-time energy and zero-crossing rate are shown here.)
Speech sections can be distinguished from silent sections by this weighting of the zero-crossing rate and the short-time energy.
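A minimal sketch of the weighted speech/silence decision of formulas (3)-(5) follows. The standard definitions of short-time energy and zero-crossing rate are assumed, and because the patent gives the weights k_1 and k_2 only as an image, their default values here are illustrative assumptions:

```python
import numpy as np

def is_speech(y_a, k1=1.0, k2=0.1, h_set=10.0):
    """Return True for a speech section, False for a silent section.

    E(a): sum of y_a(n)^2 (short-time energy), formula (3).
    Z(a): half the summed sign-change magnitudes (zero-crossing rate),
          formula (4).
    H(a) = k1*E(a) + k2*Z(a), formula (5); H(a) >= H_set means speech.
    The k1, k2 defaults are assumptions, not the patent's values.
    """
    energy = float(np.sum(y_a ** 2))                            # E(a)
    zcr = 0.5 * float(np.sum(np.abs(np.diff(np.sign(y_a)))))    # Z(a)
    return k1 * energy + k2 * zcr >= h_set                      # H_set = 10

rng = np.random.default_rng(0)
loud = rng.normal(0.0, 1.0, 400)    # energetic frame -> speech
quiet = np.zeros(400)               # all-zero frame -> silence
```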
Each section of the speech signal is judged for silence: if a silent section lies between two speech sections, the silent section is discarded and the two speech sections are left unmerged and regarded as two single characters or words; if no silent section lies between two speech sections, the two are merged into one speech section and regarded as one character or word.
Chinese recognition specifically matches the speech signal of a non-silent section against a Chinese database, which contains basic everyday single characters (and words) and does not contain Chinese homophones of English-derived words. When matching succeeds, the Chinese of the section is output and the language identifier is set to 1. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and Chinese recognition is attempted again; if this succeeds, the Chinese of the section is output and the language identifier is set to 1; if it fails, the language identifier is set to 0 and English recognition is performed.
English recognition specifically matches a speech signal whose language identifier is 0 against an English database, which contains basic everyday words and does not contain English homophones of Chinese-derived words. When matching succeeds, the English of the section is output and the language identifier is set to 0. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and English recognition is attempted again; if this succeeds, the English of the section is output and the language identifier is set to 0; if it fails, the language identifier is set to 1 and judgment continues.
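The alternating Chinese-first/English-fallback control flow driven by the language identifier can be sketched as a toy loop. The string tokens and set-based dictionaries stand in for acoustic matching against the Chinese and English databases and are purely illustrative:

```python
def recognize(segments, cn_db, en_db):
    """Toy sketch of the language-identifier control flow.

    flag == 1 means try Chinese first; flag == 0 means try English
    first. On a successful match the flag records the matched
    language; an unmatched segment is emitted as None.
    """
    out, flag = [], 1                       # start with Chinese recognition
    for seg in segments:
        first, second = (cn_db, en_db) if flag == 1 else (en_db, cn_db)
        if seg in first:                    # current language succeeds
            out.append(seg)
            flag = 1 if first is cn_db else 0
        elif seg in second:                 # fall back to the other language
            out.append(seg)
            flag = 1 if second is cn_db else 0
        else:
            out.append(None)                # matched neither database
    return out, flag

words, flag = recognize(["你好", "hello", "world", "世界"],
                        {"你好", "世界"}, {"hello", "world"})
```

After an English match the loop keeps trying English first, mirroring the way the language identifier persists across sections until a recognition failure flips it.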
The above-mentioned embodiment expresses only one implementation of the present invention, and its description is specific and detailed, but it is not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the invention, and these fall within its protection scope; therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (7)

1. A Chinese-English mixed speech recognition method is characterized in that: after Chinese and English mixed voice is collected, dividing the voice into a plurality of sections of voice signals according to a certain frame length, wherein the overlap rate of the sections is 40%, carrying out high-pass filtering on the segmented voice signals, carrying out windowing processing after filtering to obtain a windowing function of each section of voice signal, calculating and judging whether the voice signals are in a mute section, carrying out Chinese recognition judgment on the voice signals in a non-mute section, outputting Chinese in the section and setting a language identifier to be 1 when the recognition is successful, setting the language identifier to be 0 when the recognition is failed and carrying out English recognition, outputting English in the section and updating the language identifier to be 0 when the recognition is successful, and setting the language identifier to be 1 when the recognition is failed and carrying out Chinese recognition again.
2. The method of claim 1, wherein the speech is divided into several sections of speech signals according to a certain frame length: specifically, the speech is divided into N speech sections according to a certain frame length L, with the frame length controlled so that each section contains only one character as far as possible and the section overlap rate is 40%; the speech signal of each section is denoted x_a(n).
3. The method of claim 1, wherein the windowing process is calculated by the following formula (1) and formula (2):

(Formula (1) defines the window function w(n); it appears only as an image in the original.)

y_a(n) = x_a(n)·w(n)    (2)

where w(n) is the window function and n is the speech frame index; calculation yields the windowed signal y_a(n).
4. The method of claim 3, wherein judging whether the speech signal is a silent section specifically comprises calculating the short-time energy E(a) and the zero-crossing rate Z(a) of each speech section by formula (3) and formula (4), weighting them with weights k_1 and k_2, calculating the weighted judgment function H(a) by formula (5), and setting a threshold H_set: if H(a) ≥ H_set, the section is a speech section; if H(a) < H_set, the section is a silent section;

E(a) = Σ_n y_a(n)²    (3)

Z(a) = (1/2) Σ_n |sgn(y_a(n)) − sgn(y_a(n−1))|    (4)

H(a) = k_1·E(a) + k_2·Z(a)    (5)

wherein the threshold H_set is 10; the values of the weights k_1 and k_2 are given by a formula that appears only as an image in the original.
5. The method of claim 4, wherein each section of the speech signal is judged for silence: if a silent section lies between two speech sections, the silent section is discarded and the two speech sections are left unmerged and regarded as two single characters or words; if no silent section lies between two speech sections, the two are merged into one speech section and regarded as one character or word.
6. The method of claim 5, wherein Chinese recognition specifically matches the speech signal of a non-silent section against a Chinese database, which contains basic everyday single characters (and words) and does not contain Chinese homophones of English-derived words. When matching succeeds, the Chinese of the section is output and the language identifier is set to 1. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and Chinese recognition is attempted again; if this succeeds, the Chinese of the section is output and the language identifier is set to 1; if it fails, the language identifier is set to 0 and English recognition is performed.
7. The method of claim 6, wherein English recognition specifically matches a speech signal whose language identifier is 0 against an English database, which contains basic everyday words and does not contain English homophones of Chinese-derived words. When matching succeeds, the English of the section is output and the language identifier is set to 0. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and English recognition is attempted again; if this succeeds, the English of the section is output and the language identifier is set to 0; if it fails, the language identifier is set to 1 and judgment continues.
CN202010948079.6A 2020-09-10 2020-09-10 Chinese-English mixed speech recognition method Pending CN112037762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010948079.6A CN112037762A (en) 2020-09-10 2020-09-10 Chinese-English mixed speech recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010948079.6A CN112037762A (en) 2020-09-10 2020-09-10 Chinese-English mixed speech recognition method

Publications (1)

Publication Number Publication Date
CN112037762A (en) 2020-12-04

Family

ID=73585283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010948079.6A Pending CN112037762A (en) 2020-09-10 2020-09-10 Chinese-English mixed speech recognition method

Country Status (1)

Country Link
CN (1) CN112037762A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737629A (en) * 2011-11-11 2012-10-17 东南大学 Embedded type speech emotion recognition method and device
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
US10388272B1 (en) * 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
CN110335617A (en) * 2019-05-24 2019-10-15 国网新疆电力有限公司乌鲁木齐供电公司 A kind of noise analysis method in substation
CN110517668A (en) * 2019-07-23 2019-11-29 普强信息技术(北京)有限公司 A kind of Chinese and English mixing voice identifying system and method
CN111179937A (en) * 2019-12-24 2020-05-19 上海眼控科技股份有限公司 Method, apparatus and computer-readable storage medium for text processing
CN111243597A (en) * 2020-01-10 2020-06-05 上海电机学院 Chinese-English mixed speech recognition method


Similar Documents

Publication Publication Date Title
CN110364171B (en) Voice recognition method, voice recognition system and storage medium
CN110517663B (en) Language identification method and system
CN107039034B (en) Rhythm prediction method and system
CN101447185B (en) Audio frequency rapid classification method based on content
CN107305541A (en) Speech recognition text segmentation method and device
CN108847241A (en) It is method, electronic equipment and the storage medium of text by meeting speech recognition
CN107657947A (en) Method of speech processing and its device based on artificial intelligence
CN103810994B (en) Speech emotional inference method based on emotion context and system
EP2860727A1 (en) Voice recognition method and device
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
US20080294433A1 (en) Automatic Text-Speech Mapping Tool
CN111754978A (en) Rhythm hierarchy marking method, device, equipment and storage medium
CN110120221A (en) The offline audio recognition method of user individual and its system for vehicle system
CN112927679B (en) Method for adding punctuation marks in voice recognition and voice recognition device
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
CN105261246A (en) Spoken English error correcting system based on big data mining technology
JP6875819B2 (en) Acoustic model input data normalization device and method, and voice recognition device
CN106205601B (en) Determine the method and system of text voice unit
CN109545197A (en) Recognition methods, device and the intelligent terminal of phonetic order
JP2002215187A (en) Speech recognition method and device for the same
JP2004094257A (en) Method and apparatus for generating question of decision tree for speech processing
Khandelwal et al. Black-box adaptation of ASR for accented speech
CN111798838A (en) Method, system, equipment and storage medium for improving speech recognition accuracy
TW201937479A (en) Multilingual mixed speech recognition method
CN114999463B (en) Voice recognition method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201204