CN112037762A - Chinese-English mixed speech recognition method - Google Patents

Chinese-English mixed speech recognition method

Info

Publication number
CN112037762A
CN112037762A
Authority
CN
China
Prior art keywords
chinese
recognition
section
english
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010948079.6A
Other languages
Chinese (zh)
Inventor
朱羿孜
许召辉
马翼平
陈年生
范光宇
饶蕾
周圣杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Avic East China Photoelectric Shanghai Co ltd
Original Assignee
Avic East China Photoelectric Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Avic East China Photoelectric Shanghai Co ltd filed Critical Avic East China Photoelectric Shanghai Co ltd
Priority to CN202010948079.6A priority Critical patent/CN112037762A/en
Publication of CN112037762A publication Critical patent/CN112037762A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
        • G10L15/005 Language recognition
        • G10L15/04 Segmentation; Word boundary detection
        • G10L15/08 Speech classification or search
            • G10L2015/088 Word spotting
        • G10L15/26 Speech to text systems

Abstract

The invention relates to a Chinese-English mixed speech recognition method. After Chinese-English mixed speech is collected, the speech is divided into several sections of speech signals according to a certain frame length with a 40% section overlap rate; the segmented speech signals are high-pass filtered and then windowed to obtain a windowed signal for each section; whether each speech signal lies in a silent section is calculated and judged, and Chinese recognition is performed on the speech signals of non-silent sections. When Chinese recognition succeeds, the Chinese of the section is output and a language identifier is set to 1; when it fails, the language identifier is set to 0 and English recognition is performed; when English recognition succeeds, the English of the section is output and the language identifier is updated to 0; when it fails, the language identifier is set back to 1 and Chinese recognition is performed again. Dividing the speech into multiple speech signals and judging whether each lies in a silent section can effectively improve recognition efficiency; meanwhile, performing Chinese recognition on each speech signal and falling back to English recognition on failure effectively ensures the accuracy of Chinese and English recognition.

Description

Chinese-English mixed speech recognition method
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a Chinese and English mixed voice recognition method.
Background
With the development of information globalization, multilingual communication is becoming more and more common. A single-language speech recognition system cannot effectively handle multilingual communication, and building a speech recognition system capable of recognizing speech signals in multiple languages is a new task for speech recognition technology.
Chinese currently has the most users of any language, and English is the most widely distributed language; therefore, building a Chinese-English bilingual recognition system has very good application prospects.
In the prior art, a first bilingual speech recognition system for chinese and english is implemented as follows: the Chinese speech recognizer and the English speech recognizer are integrated together, firstly, language recognition is carried out on input speech data, and then the corresponding speech recognizer is called according to a language recognition result, so that a task of Chinese and English bilingual speech recognition is realized.
In the prior art, a second bilingual speech recognition system for chinese and english is implemented as follows: parameter sharing of Chinese and English models is achieved according to linguistic knowledge or a data driving method, model confusion is reduced, and an acoustic model and a language model shared in Chinese and English are trained on the basis. Thus, only one recognizer is used to recognize the Chinese, English, Chinese and English mixed speech signal.
However, the first scheme requires a large amount of model training in the early stage, which is relatively costly, while the second scheme shares model parameters only at the model level via linguistic knowledge or data driving, which leads to insufficient parameter sharing and relatively high confusion between the Chinese and English models, and thus to inaccurate recognition by the Chinese-English bilingual speech recognition system.
Disclosure of Invention
1. Technical problem to be solved by the invention
The invention aims to solve the problem that the existing Chinese and English mixed speech recognition accuracy is not high.
2. Technical scheme
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
the invention relates to a Chinese-English mixed speech recognition method: after Chinese-English mixed speech is collected, the speech is divided into several sections of speech signals according to a certain frame length with a 40% section overlap rate; the segmented speech signals are high-pass filtered and then windowed to obtain a windowed signal for each section; whether each speech signal lies in a silent section is calculated and judged, and Chinese recognition is performed on the speech signals of non-silent sections. When Chinese recognition succeeds, the Chinese of the section is output and a language identifier is set to 1; when it fails, the language identifier is set to 0 and English recognition is performed; when English recognition succeeds, the English of the section is output and the language identifier is updated to 0; when it fails, the language identifier is set back to 1 and Chinese recognition is performed again.
Preferably, the speech is divided into several sections of speech signals according to a certain frame length: specifically, the speech is divided into N speech sections according to a certain frame length L, with the frame length controlled so that each section contains only one character as far as possible and the section overlap rate is 40%; the speech signal of section a is denoted x_a(n).
Preferably, the windowing process is calculated by the following formula (1) and formula (2):

(Formula (1) defines the window function w(n); it appears only as an image in the original. A Hamming window, w(n) = 0.54 − 0.46·cos(2πn/(L−1)) for 0 ≤ n ≤ L−1, is the conventional choice.)

y_a(n) = x_a(n)·w(n)    (2)

where w(n) is the window function and n is the speech frame index; calculation yields the windowed signal y_a(n).
Preferably, judging whether the speech signal is a silent section specifically comprises calculating the short-time energy E(a) and the zero-crossing rate Z(a) of each speech section by formula (3) and formula (4), weighting them with weights k_1 and k_2, calculating the weighted judgment function H(a) by formula (5), and setting a threshold H_set: if H(a) ≥ H_set, the section is a speech section; if H(a) < H_set, the section is a silent section;

E(a) = Σ_n y_a(n)²    (3)

Z(a) = (1/2) Σ_n |sgn(y_a(n)) − sgn(y_a(n−1))|    (4)

H(a) = k_1·E(a) + k_2·Z(a)    (5)

wherein the threshold H_set is 10; the values of the weights k_1 and k_2 are given by a formula that appears only as an image in the original. (Formulas (3) and (4) are likewise images in the original; the standard definitions of short-time energy and zero-crossing rate are shown here.)
Preferably, each section of the speech signal is judged for silence: if a silent section lies between two speech sections, the silent section is discarded and the two speech sections are left unmerged and regarded as two single characters or words; if no silent section lies between two speech sections, the two are merged into one speech section and regarded as one character or word.
Preferably, Chinese recognition specifically matches the speech signal of a non-silent section against a Chinese database, which contains basic everyday single characters (and words) and does not contain Chinese homophones of English-derived words. When matching succeeds, the Chinese of the section is output and the language identifier is set to 1. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and Chinese recognition is attempted again; if this succeeds, the Chinese of the section is output and the language identifier is set to 1; if it fails, the language identifier is set to 0 and English recognition is performed.
Preferably, English recognition specifically matches a speech signal whose language identifier is 0 against an English database, which contains basic everyday words and does not contain English homophones of Chinese-derived words. When matching succeeds, the English of the section is output and the language identifier is set to 0. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and English recognition is attempted again; if this succeeds, the English of the section is output and the language identifier is set to 0; if it fails, the language identifier is set to 1 and judgment continues.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
the invention relates to a Chinese-English mixed speech recognition method, which comprises the steps of collecting Chinese-English mixed speech, dividing the speech into a plurality of sections of speech signals according to a certain frame length, enabling the overlap rate of the sections to be 40%, carrying out high-pass filtering on the segmented speech signals, carrying out windowing processing after filtering to obtain a windowing function of each section of speech signals, calculating and judging whether the speech signals are in a mute section, carrying out Chinese recognition judgment on the speech signals in a non-mute section, outputting Chinese in the section and setting a language identifier to be 1 when the speech signals are successfully recognized, carrying out English recognition when the speech signals are unsuccessfully recognized and setting the language identifier to be 0, outputting English in the section and updating the language identifier to be 0 when the speech signals are unsuccessfully recognized, and setting the language identifier to be 1 when the speech signals are unsu. The voice is divided into a plurality of voice signals and whether the voice signals are in a mute section or not is judged, so that the recognition efficiency can be effectively improved, meanwhile, Chinese recognition is respectively carried out on each voice signal, English failure is carried out when the recognition fails, and the accuracy of Chinese and English recognition can be effectively ensured.
Drawings
FIG. 1 is a general flow diagram of hybrid speech recognition of the present invention;
FIG. 2 is a flow chart of the recognition process when the phonetic language identification is 1;
fig. 3 is a flowchart of recognition when the speech language flag is 0.
Detailed Description
In order to facilitate an understanding of the invention, it will now be described more fully with reference to the accompanying drawings, in which several embodiments are shown. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure will be more thorough.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs; the terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention; as used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to fig. 1 to fig. 3, in the Chinese-English mixed speech recognition method of this embodiment, after Chinese-English mixed speech is collected, the speech is divided into several sections of speech signals according to a certain frame length with a 40% section overlap rate; the segmented speech signals are high-pass filtered and then windowed to obtain a windowed signal for each section; whether each speech signal lies in a silent section is calculated and judged, and Chinese recognition is performed on the speech signals of non-silent sections. When Chinese recognition succeeds, the Chinese word is output and the language identifier is set to 1; when it fails, the language identifier is set to 0 and English recognition is performed; when English recognition succeeds, the English word is output and the language identifier is updated to 0; when it fails, the language identifier is set back to 1 and Chinese recognition is performed again. Dividing the speech into multiple speech signals and judging whether each lies in a silent section can effectively improve recognition efficiency; meanwhile, performing Chinese recognition on each speech signal and falling back to English recognition on failure effectively ensures the accuracy of Chinese and English recognition.
The speech is divided into several sections of speech signals according to a certain frame length: specifically, the speech is divided into N speech sections according to a certain frame length L, with the frame length controlled so that each section contains only one character as far as possible and the section overlap rate is 40%; the speech signal of each section is denoted x_a(n). By controlling the frame length so that each section contains at most one word, overall recognition accuracy can be improved, since single-character recognition is simpler and more accurate than word recognition; meanwhile, the 40% segment overlap rate effectively preserves continuity between consecutive speech sections, reduces cases where a word's speech is cut apart by segmentation, and ensures recognition accuracy and completeness.
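As a rough illustration (not part of the patent text), the segmentation with a 40% overlap rate described above can be sketched as follows; the hop of 60% of the frame length is derived from the stated overlap rate, and the frame length value is an arbitrary assumption:

```python
import numpy as np

def split_frames(speech, frame_len, overlap=0.4):
    """Split a 1-D speech signal into overlapping sections.

    frame_len plays the role of the frame length L in the method;
    overlap=0.4 gives the 40% section overlap rate described above.
    """
    hop = int(frame_len * (1 - overlap))            # step between frame starts
    n_frames = 1 + max(0, len(speech) - frame_len) // hop
    return [speech[i * hop : i * hop + frame_len] for i in range(n_frames)]

signal = np.arange(100.0)                 # stand-in for sampled speech
frames = split_frames(signal, frame_len=10)
# adjacent frames share 4 of their 10 samples (40% overlap)
```

With a frame length of 10 and a hop of 6, each frame shares its last 4 samples with the next frame, so no word boundary falls entirely between frames.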
The windowing process is calculated by the following formula (1) and formula (2):

(Formula (1) defines the window function w(n); it appears only as an image in the original. A Hamming window, w(n) = 0.54 − 0.46·cos(2πn/(L−1)) for 0 ≤ n ≤ L−1, is the conventional choice.)

y_a(n) = x_a(n)·w(n)    (2)

where w(n) is the window function and n is the speech frame index; calculation yields the windowed signal y_a(n). Windowing can reduce spectral leakage during speech segmentation.
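The windowing step can be sketched as below. Since formula (1) is rendered only as an image in the original, the Hamming window used here is an assumption, chosen as the conventional window for speech framing:

```python
import numpy as np

def window_frame(x_a):
    """Apply a window to one speech frame: y_a(n) = x_a(n) * w(n) (formula (2)).

    The exact window function of the patent's formula (1) is not
    recoverable from the text; a Hamming window is assumed here.
    """
    L = len(x_a)
    n = np.arange(L)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))   # assumed Hamming w(n)
    return x_a * w

y = window_frame(np.ones(256))
# tapers the frame edges toward 0.08 while leaving the center near 1.0
```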
Judging whether the speech signal is a silent section specifically comprises calculating the short-time energy E(a) and the zero-crossing rate Z(a) of each speech section by formula (3) and formula (4), weighting them with weights k_1 and k_2, calculating the weighted judgment function H(a) by formula (5), and setting a threshold H_set: if H(a) ≥ H_set, the section is a speech section; if H(a) < H_set, the section is a silent section;

E(a) = Σ_n y_a(n)²    (3)

Z(a) = (1/2) Σ_n |sgn(y_a(n)) − sgn(y_a(n−1))|    (4)

H(a) = k_1·E(a) + k_2·Z(a)    (5)

wherein the threshold H_set is 10; the values of the weights k_1 and k_2 are given by a formula that appears only as an image in the original. (Formulas (3) and (4) are likewise images in the original; the standard definitions of short-time energy and zero-crossing rate are shown here.)
Speech sections can be distinguished from silent sections by this weighting of the zero-crossing rate and the short-time energy.
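A minimal sketch of the weighted speech/silence decision of formulas (3)-(5) follows. The standard definitions of short-time energy and zero-crossing rate are assumed, and because the patent gives the weights k_1 and k_2 only as an image, their default values here are illustrative assumptions:

```python
import numpy as np

def is_speech(y_a, k1=1.0, k2=0.1, h_set=10.0):
    """Return True for a speech section, False for a silent section.

    E(a): sum of y_a(n)^2 (short-time energy), formula (3).
    Z(a): half the summed sign-change magnitudes (zero-crossing rate),
          formula (4).
    H(a) = k1*E(a) + k2*Z(a), formula (5); H(a) >= H_set means speech.
    The k1, k2 defaults are assumptions, not the patent's values.
    """
    energy = float(np.sum(y_a ** 2))                            # E(a)
    zcr = 0.5 * float(np.sum(np.abs(np.diff(np.sign(y_a)))))    # Z(a)
    return k1 * energy + k2 * zcr >= h_set                      # H_set = 10

rng = np.random.default_rng(0)
loud = rng.normal(0.0, 1.0, 400)    # energetic frame -> speech
quiet = np.zeros(400)               # all-zero frame -> silence
```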
Each section of the speech signal is judged for silence: if a silent section lies between two speech sections, the silent section is discarded and the two speech sections are left unmerged and regarded as two single characters or words; if no silent section lies between two speech sections, the two are merged into one speech section and regarded as one character or word.
Chinese recognition specifically matches the speech signal of a non-silent section against a Chinese database, which contains basic everyday single characters (and words) and does not contain Chinese homophones of English-derived words. When matching succeeds, the Chinese of the section is output and the language identifier is set to 1. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and Chinese recognition is attempted again; if this succeeds, the Chinese of the section is output and the language identifier is set to 1; if it fails, the language identifier is set to 0 and English recognition is performed.
English recognition specifically matches a speech signal whose language identifier is 0 against an English database, which contains basic everyday words and does not contain English homophones of Chinese-derived words. When matching succeeds, the English of the section is output and the language identifier is set to 0. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and English recognition is attempted again; if this succeeds, the English of the section is output and the language identifier is set to 0; if it fails, the language identifier is set to 1 and judgment continues.
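The alternating Chinese-first/English-fallback control flow driven by the language identifier can be sketched as a toy loop. The string tokens and set-based dictionaries stand in for acoustic matching against the Chinese and English databases and are purely illustrative:

```python
def recognize(segments, cn_db, en_db):
    """Toy sketch of the language-identifier control flow.

    flag == 1 means try Chinese first; flag == 0 means try English
    first. On a successful match the flag records the matched
    language; an unmatched segment is emitted as None.
    """
    out, flag = [], 1                       # start with Chinese recognition
    for seg in segments:
        first, second = (cn_db, en_db) if flag == 1 else (en_db, cn_db)
        if seg in first:                    # current language succeeds
            out.append(seg)
            flag = 1 if first is cn_db else 0
        elif seg in second:                 # fall back to the other language
            out.append(seg)
            flag = 1 if second is cn_db else 0
        else:
            out.append(None)                # matched neither database
    return out, flag

words, flag = recognize(["你好", "hello", "world", "世界"],
                        {"你好", "世界"}, {"hello", "world"})
```

After an English match the loop keeps trying English first, mirroring the way the language identifier persists across sections until a recognition failure flips it.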
The above-mentioned embodiment expresses only one implementation of the present invention, and its description is specific and detailed, but it is not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the invention, and these fall within its protection scope; therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (7)

1. A Chinese-English mixed speech recognition method is characterized in that: after Chinese and English mixed voice is collected, dividing the voice into a plurality of sections of voice signals according to a certain frame length, wherein the overlap rate of the sections is 40%, carrying out high-pass filtering on the segmented voice signals, carrying out windowing processing after filtering to obtain a windowing function of each section of voice signal, calculating and judging whether the voice signals are in a mute section, carrying out Chinese recognition judgment on the voice signals in a non-mute section, outputting Chinese in the section and setting a language identifier to be 1 when the recognition is successful, setting the language identifier to be 0 when the recognition is failed and carrying out English recognition, outputting English in the section and updating the language identifier to be 0 when the recognition is successful, and setting the language identifier to be 1 when the recognition is failed and carrying out Chinese recognition again.
2. The method of claim 1, wherein the speech is divided into several sections of speech signals according to a certain frame length: specifically, the speech is divided into N speech sections according to a certain frame length L, with the frame length controlled so that each section contains only one character as far as possible and the section overlap rate is 40%; the speech signal of each section is denoted x_a(n).
3. The method of claim 1, wherein the windowing process is calculated by the following formula (1) and formula (2):

(Formula (1) defines the window function w(n); it appears only as an image in the original.)

y_a(n) = x_a(n)·w(n)    (2)

where w(n) is the window function and n is the speech frame index; calculation yields the windowed signal y_a(n).
4. The method of claim 3, wherein judging whether the speech signal is a silent section specifically comprises calculating the short-time energy E(a) and the zero-crossing rate Z(a) of each speech section by formula (3) and formula (4), weighting them with weights k_1 and k_2, calculating the weighted judgment function H(a) by formula (5), and setting a threshold H_set: if H(a) ≥ H_set, the section is a speech section; if H(a) < H_set, the section is a silent section;

E(a) = Σ_n y_a(n)²    (3)

Z(a) = (1/2) Σ_n |sgn(y_a(n)) − sgn(y_a(n−1))|    (4)

H(a) = k_1·E(a) + k_2·Z(a)    (5)

wherein the threshold H_set is 10; the values of the weights k_1 and k_2 are given by a formula that appears only as an image in the original.
5. The method of claim 4, wherein each section of the speech signal is judged for silence: if a silent section lies between two speech sections, the silent section is discarded and the two speech sections are left unmerged and regarded as two single characters or words; if no silent section lies between two speech sections, the two are merged into one speech section and regarded as one character or word.
6. The method of claim 5, wherein Chinese recognition specifically matches the speech signal of a non-silent section against a Chinese database, which contains basic everyday single characters (and words) and does not contain Chinese homophones of English-derived words. When matching succeeds, the Chinese of the section is output and the language identifier is set to 1. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and Chinese recognition is attempted again; if this succeeds, the Chinese of the section is output and the language identifier is set to 1; if it fails, the language identifier is set to 0 and English recognition is performed.
7. The method of claim 6, wherein English recognition specifically matches a speech signal whose language identifier is 0 against an English database, which contains basic everyday words and does not contain English homophones of Chinese-derived words. When matching succeeds, the English of the section is output and the language identifier is set to 0. If matching fails, the unmatched speech section is merged with the preceding and following speech sections and English recognition is attempted again; if this succeeds, the English of the section is output and the language identifier is set to 0; if it fails, the language identifier is set to 1 and judgment continues.
CN202010948079.6A 2020-09-10 2020-09-10 Chinese-English mixed speech recognition method Pending CN112037762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010948079.6A CN112037762A (en) 2020-09-10 2020-09-10 Chinese-English mixed speech recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010948079.6A CN112037762A (en) 2020-09-10 2020-09-10 Chinese-English mixed speech recognition method

Publications (1)

Publication Number Publication Date
CN112037762A (en) 2020-12-04

Family

ID=73585283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010948079.6A Pending CN112037762A (en) 2020-09-10 2020-09-10 Chinese-English mixed speech recognition method

Country Status (1)

Country Link
CN (1) CN112037762A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737629A (en) * 2011-11-11 2012-10-17 东南大学 Embedded type speech emotion recognition method and device
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
US10388272B1 (en) * 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
CN110335617A (en) * 2019-05-24 2019-10-15 国网新疆电力有限公司乌鲁木齐供电公司 A kind of noise analysis method in substation
CN110517668A (en) * 2019-07-23 2019-11-29 普强信息技术(北京)有限公司 A kind of Chinese and English mixing voice identifying system and method
CN111179937A (en) * 2019-12-24 2020-05-19 上海眼控科技股份有限公司 Method, apparatus and computer-readable storage medium for text processing
CN111243597A (en) * 2020-01-10 2020-06-05 上海电机学院 Chinese-English mixed speech recognition method


Similar Documents

Publication Publication Date Title
CN110364171B (en) Voice recognition method, voice recognition system and storage medium
CN110517663B (en) Language identification method and system
CN107039034B (en) Rhythm prediction method and system
CN101447185B (en) Audio frequency rapid classification method based on content
CN107305541A (en) Speech recognition text segmentation method and device
CN108847241A (en) It is method, electronic equipment and the storage medium of text by meeting speech recognition
CN107657947A (en) Method of speech processing and its device based on artificial intelligence
CN103810994B (en) Speech emotional inference method based on emotion context and system
EP2860727A1 (en) Voice recognition method and device
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
US20080294433A1 (en) Automatic Text-Speech Mapping Tool
CN111754978A (en) Rhythm hierarchy marking method, device, equipment and storage medium
CN110120221A (en) The offline audio recognition method of user individual and its system for vehicle system
CN112927679B (en) Method for adding punctuation marks in voice recognition and voice recognition device
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
CN105261246A (en) Spoken English error correcting system based on big data mining technology
JP6875819B2 (en) Acoustic model input data normalization device and method, and voice recognition device
CN106205601B (en) Determine the method and system of text voice unit
CN109545197A (en) Recognition methods, device and the intelligent terminal of phonetic order
JP2002215187A (en) Speech recognition method and device for the same
JP2004094257A (en) Method and apparatus for generating question of decision tree for speech processing
Khandelwal et al. Black-box adaptation of ASR for accented speech
CN111798838A (en) Method, system, equipment and storage medium for improving speech recognition accuracy
TW201937479A (en) Multilingual mixed speech recognition method
CN114999463B (en) Voice recognition method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201204