CN112037762A - Chinese-English mixed speech recognition method - Google Patents
- Publication number
- CN112037762A CN202010948079.6A
- Authority
- CN
- China
- Prior art keywords
- chinese
- recognition
- section
- english
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING; G10L15/00—Speech recognition
  - G10L15/005—Language recognition
  - G10L15/04—Segmentation; Word boundary detection
  - G10L15/08—Speech classification or search
  - G10L15/26—Speech to text systems
  - G10L2015/088—Word spotting
Abstract
The invention relates to a Chinese-English mixed speech recognition method. After Chinese-English mixed speech is collected, the speech is divided into several segments of a certain frame length with a 40% overlap between adjacent segments. The segmented speech signals are high-pass filtered and then windowed to obtain the windowed signal of each segment, and each segment is judged to be speech or silence. Chinese recognition is attempted on each non-silent segment: on success, the segment's Chinese is output and the language identifier is set to 1; on failure, the language identifier is set to 0 and English recognition is attempted. If English recognition succeeds, the segment's English is output and the language identifier is updated to 0; if it fails, the language identifier is set back to 1 and Chinese recognition is attempted again. Dividing the speech into multiple segments and discarding silent segments effectively improves recognition efficiency, while attempting Chinese recognition on each segment and falling back to English recognition on failure effectively ensures the accuracy of Chinese and English recognition.
Description
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a Chinese and English mixed voice recognition method.
Background
With the development of information globalization, multilingual communication is becoming more and more common. A monolingual speech recognition system cannot effectively recognize multilingual communication, so building a speech recognition system capable of recognizing speech signals in multiple languages is a new task for speech recognition technology.
Chinese currently has the largest number of speakers, and English is the most widely distributed language, so a Chinese-English bilingual recognition system has very good application prospects.
In the prior art, a first Chinese-English bilingual speech recognition scheme is implemented as follows: a Chinese speech recognizer and an English speech recognizer are integrated; language identification is first performed on the input speech data, and the corresponding recognizer is then invoked according to the identification result, thereby accomplishing the task of Chinese-English bilingual speech recognition.
In the prior art, a second Chinese-English bilingual speech recognition scheme is implemented as follows: parameter sharing between the Chinese and English models is achieved through linguistic knowledge or a data-driven method to reduce model confusion, and a shared Chinese-English acoustic model and language model are trained on this basis. A single recognizer is thus used to recognize Chinese, English, and mixed Chinese-English speech signals.
However, the first scheme requires a large amount of model training up front, which is relatively costly, while the second scheme shares model parameters only at the model level through linguistic knowledge or data driving; the resulting parameter sharing is insufficient and the confusion between the Chinese and English models remains relatively high, so the recognition performance of the Chinese-English bilingual speech recognition system is inaccurate.
Disclosure of Invention
1. Technical problem to be solved by the invention
The invention aims to solve the problem that the existing Chinese and English mixed speech recognition accuracy is not high.
2. Technical scheme
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
the invention relates to a Chinese-English mixed speech recognition method: after Chinese-English mixed speech is collected, the speech is divided into several segments of a certain frame length with a 40% overlap between adjacent segments; the segmented speech signals are high-pass filtered and then windowed to obtain the windowed signal of each segment; each segment is judged to be speech or silence; Chinese recognition is attempted on each non-silent segment, and on success the segment's Chinese is output and the language identifier is set to 1; on failure the language identifier is set to 0 and English recognition is attempted; if English recognition succeeds, the segment's English is output and the language identifier is updated to 0; if it fails, the language identifier is set back to 1 and Chinese recognition is attempted again.
Preferably, dividing the speech into several segments of speech signal according to a certain frame length specifically comprises dividing the speech into N segments of frame length L, controlling the frame length so that each segment contains only one character as far as possible, with a 40% overlap between adjacent segments; the speech signal of segment a is denoted x_a(n).
Preferably, the windowing processing is calculated by the following formula (1) and formula (2):
y_a(n) = x_a(n) · w(n)    (2)
where w(n) is the window function given by formula (1) and n is the speech frame number, yielding the windowed signal y_a(n).
Preferably, judging whether a speech signal is a silent segment specifically comprises calculating the short-time energy E(a) and zero-crossing rate Z(a) of each speech segment through formula (3) and formula (4), weighting them with weights k_1 and k_2, computing the weighted judgment function H(a) by formula (5), and setting a threshold H_set: if H(a) ≥ H_set the segment is a speech segment; if H(a) < H_set it is a silent segment;
H(a) = k_1·E(a) + k_2·Z(a)    (5)
Preferably, each speech segment is judged to be silent or not. If a silent segment lies between two speech segments, the silent segment is discarded and the two speech segments are left unmerged, each treated as a single character or word; if no silent segment lies between two speech segments, they are merged into one speech segment and treated as one character or word.
Preferably, Chinese recognition specifically comprises matching the speech signal of a non-silent segment against a Chinese database that contains basic everyday single characters (words) and excludes Chinese homophones of English-derived words; when matching succeeds, the segment's Chinese is output and the language identifier is set to 1. If matching fails, the unmatched segment is combined with its preceding and following segments and Chinese recognition is retried; on success the segment's Chinese is output and the language identifier is set to 1, and on failure the language identifier is set to 0 and English recognition is performed.
Preferably, English recognition specifically comprises matching speech signals whose language identifier is 0 against an English database that contains basic everyday words and excludes English homophones of Chinese-derived words; when matching succeeds, the segment's English is output and the language identifier is set to 0. If matching fails, the unmatched segment is combined with its preceding and following segments and English recognition is retried; on success the segment's English is output and the language identifier is set to 0, and on failure the language identifier is set to 1 and judgment continues.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
the invention relates to a Chinese-English mixed speech recognition method: after Chinese-English mixed speech is collected, the speech is divided into several segments of a certain frame length with a 40% overlap between adjacent segments; the segmented speech signals are high-pass filtered and then windowed to obtain the windowed signal of each segment; each segment is judged to be speech or silence; Chinese recognition is attempted on each non-silent segment, and on success the segment's Chinese is output and the language identifier is set to 1; on failure the language identifier is set to 0 and English recognition is attempted; if English recognition succeeds, the segment's English is output and the language identifier is updated to 0; if it fails, the language identifier is set back to 1 and Chinese recognition is attempted again. Dividing the speech into multiple segments and discarding silent segments effectively improves recognition efficiency, while attempting Chinese recognition on each segment and falling back to English recognition on failure effectively ensures the accuracy of Chinese and English recognition.
Drawings
FIG. 1 is a general flow diagram of hybrid speech recognition of the present invention;
FIG. 2 is a flow chart of the recognition process when the phonetic language identification is 1;
FIG. 3 is a flow chart of the recognition process when the phonetic language identification is 0.
Detailed Description
To facilitate an understanding of the invention, it will now be described more fully with reference to the accompanying drawings, in which several embodiments of the invention are shown. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure of the invention will be more thorough.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs; the terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention; as used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to FIG. 1 to FIG. 3, in the Chinese-English mixed speech recognition method of this embodiment, after Chinese-English mixed speech is collected, the speech is divided into several segments of a certain frame length with a 40% overlap between adjacent segments. The segmented speech signals are high-pass filtered and then windowed to obtain the windowed signal of each segment, and each segment is judged to be speech or silence. Chinese recognition is attempted on each non-silent segment: on success, the segment's Chinese is output and the language identifier is set to 1; on failure, the language identifier is set to 0 and English recognition is attempted. If English recognition succeeds, the segment's English is output and the language identifier is updated to 0; if it fails, the language identifier is set back to 1 and Chinese recognition is attempted again. Dividing the speech into multiple segments and discarding silent segments effectively improves recognition efficiency, while attempting Chinese recognition on each segment and falling back to English recognition on failure effectively ensures the accuracy of Chinese and English recognition.
Dividing the speech into several segments of speech signal according to a certain frame length specifically comprises dividing the speech into N segments of frame length L, controlling the frame length so that each segment contains only one character as far as possible, with a 40% overlap between adjacent segments; the speech signal of segment a is denoted x_a(n). Controlling the frame length so that each segment contains only one character improves overall recognition accuracy, since single-character recognition is simpler and more accurate than word recognition. Meanwhile, the 40% segment overlap effectively preserves continuity between consecutive speech segments, reduces the chance that a character's speech is cut apart by segmentation, and ensures the accuracy and completeness of recognition.
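The segmentation described above — frames of length L with a 40% overlap between neighbors — can be sketched as follows. This is an illustrative implementation, not the patent's own code; the function name and the frame length used in the example are assumptions.

```python
import numpy as np

def split_into_segments(speech, frame_length, overlap=0.4):
    """Split a sampled speech signal into overlapping segments x_a(n).

    frame_length is the per-segment length L; overlap is the fraction
    of samples shared by consecutive segments (40% per the method).
    """
    hop = int(frame_length * (1 - overlap))  # samples to advance per segment
    segments = []
    for start in range(0, len(speech) - frame_length + 1, hop):
        segments.append(speech[start:start + frame_length])
    return segments

signal = np.arange(100.0)  # stand-in for sampled speech
segs = split_into_segments(signal, frame_length=20)
# with L=20 and 40% overlap, the hop is 12 samples, so
# the last 8 samples of one segment open the next segment
assert np.array_equal(segs[0][12:], segs[1][:8])
```

In practice L would be chosen (as the description says) so that one segment covers roughly one spoken character.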
The windowing processing is calculated by the following formula (1) and formula (2):
y_a(n) = x_a(n) · w(n)    (2)
where w(n) is the window function given by formula (1) and n is the speech frame number, yielding the windowed signal y_a(n). Windowing reduces spectral leakage introduced by speech segmentation.
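Formula (2), y_a(n) = x_a(n)·w(n), is an element-wise product of the segment with a window function. The patent's formula (1) defining w(n) is not legible in this text, so the sketch below assumes a Hamming window — a common choice for reducing leakage at segment edges — purely for illustration.

```python
import numpy as np

def window_segment(x_a, window=None):
    """Apply y_a(n) = x_a(n) * w(n) as in formula (2).

    w(n) is assumed to be a Hamming window here; the patent's
    own formula (1) for w(n) is not reproduced in the source.
    """
    if window is None:
        window = np.hamming(len(x_a))  # assumed w(n)
    return x_a * window

x = np.ones(16)
y = window_segment(x)
# the window tapers the segment edges while keeping the center near full weight
assert y[0] < y[len(y) // 2]
```

Any tapering window (Hann, Blackman, etc.) would serve the same leakage-reduction purpose stated in the description.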
Judging whether a speech signal is a silent segment specifically comprises calculating the short-time energy E(a) and zero-crossing rate Z(a) of each speech segment through formula (3) and formula (4), weighting them with weights k_1 and k_2, computing the weighted judgment function H(a) by formula (5), and setting a threshold H_set: if H(a) ≥ H_set the segment is a speech segment; if H(a) < H_set it is a silent segment;
H(a) = k_1·E(a) + k_2·Z(a)    (5)
wherein the threshold H_set is 10, with fixed weights k_1 and k_2. The weighted combination of zero-crossing rate and short-time energy distinguishes speech segments from silent segments.
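The weighted judgment of formula (5) can be sketched as below. The patent gives H_set = 10 but the values of k_1 and k_2 are not legible in the source, so unit weights are used as placeholders; the energy and zero-crossing computations stand in for the unreproduced formulas (3) and (4).

```python
import numpy as np

def is_speech_segment(y_a, k1=1.0, k2=1.0, h_set=10.0):
    """Speech/silence decision via H(a) = k1*E(a) + k2*Z(a) >= H_set.

    E(a): short-time energy of the windowed segment.
    Z(a): zero-crossing count of the segment.
    k1, k2 are placeholder weights; the patent's values are not
    reproduced in the source text. H_set = 10 is as stated.
    """
    energy = float(np.sum(y_a ** 2))                          # E(a)
    signs = np.sign(y_a)
    zero_crossings = int(np.sum(np.abs(np.diff(signs)) > 0))  # Z(a)
    h = k1 * energy + k2 * zero_crossings                     # H(a)
    return h >= h_set

silence = np.zeros(160)
tone = 2.0 * np.sin(2 * np.pi * np.arange(160) * 440 / 16000)
assert not is_speech_segment(silence)  # H(a) = 0 < 10
assert is_speech_segment(tone)         # large energy pushes H(a) past 10
```

With real speech, the two weights would be tuned so that low-energy but high-zero-crossing unvoiced sounds are still classified as speech.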
Each speech segment is judged to be silent or not. If a silent segment lies between two speech segments, the silent segment is discarded and the two speech segments are left unmerged, each treated as a single character or word; if no silent segment lies between two speech segments, they are merged into one speech segment and treated as one character or word.
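The merge rule just described — silence separates word candidates, adjacent speech segments coalesce — reduces to a simple grouping pass. This is an illustrative sketch; the function name and list representation are assumptions.

```python
def merge_segments(segments, is_silent):
    """Group consecutive non-silent segments into character/word candidates.

    A silent segment between two speech segments is discarded and
    closes the current candidate; adjacent speech segments with no
    silence between them are merged into one candidate.
    """
    words, current = [], []
    for seg, silent in zip(segments, is_silent):
        if silent:
            if current:
                words.append(current)  # silence ends the current word
            current = []               # the silent segment itself is dropped
        else:
            current.append(seg)        # adjacent speech segments merge
    if current:
        words.append(current)
    return words

flags = [False, False, True, False]  # speech, speech, silence, speech
words = merge_segments(["s1", "s2", "sil", "s3"], flags)
assert words == [["s1", "s2"], ["s3"]]
```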
Chinese recognition specifically comprises matching the speech signal of a non-silent segment against a Chinese database that contains basic everyday single characters (words) and excludes Chinese homophones of English-derived words; when matching succeeds, the segment's Chinese is output and the language identifier is set to 1. If matching fails, the unmatched segment is combined with its preceding and following segments and Chinese recognition is retried; on success the segment's Chinese is output and the language identifier is set to 1, and on failure the language identifier is set to 0 and English recognition is performed.
English recognition specifically comprises matching speech signals whose language identifier is 0 against an English database that contains basic everyday words and excludes English homophones of Chinese-derived words; when matching succeeds, the segment's English is output and the language identifier is set to 0. If matching fails, the unmatched segment is combined with its preceding and following segments and English recognition is retried; on success the segment's English is output and the language identifier is set to 0, and on failure the language identifier is set to 1 and judgment continues.
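The Chinese-first, English-fallback flow of FIGS. 1-3 can be sketched as a flag-switching loop. Dictionary lookup here stands in for the acoustic matching the patent leaves unspecified; the databases and the pinyin keys are illustrative assumptions.

```python
def recognize_mixed(words, chinese_db, english_db):
    """Sketch of the language-identifier switching flow.

    Each word candidate is first tried against the Chinese database
    (identifier 1); on failure the identifier flips to 0 and the
    English database is tried; on a second failure the identifier
    returns to 1 so Chinese recognition can be attempted again
    (e.g. after combining with neighboring segments).
    """
    output = []
    for word in words:
        if word in chinese_db:                      # Chinese recognition
            output.append((chinese_db[word], 1))    # success: flag 1
        elif word in english_db:                    # fallback: English
            output.append((english_db[word], 0))    # success: flag 0
        else:
            output.append((None, 1))                # both failed: flag back to 1
    return output

zh = {"ni3hao3": "你好"}      # hypothetical Chinese database entry
en = {"hello": "hello"}       # hypothetical English database entry
result = recognize_mixed(["ni3hao3", "hello"], zh, en)
assert result == [("你好", 1), ("hello", 0)]
```

A full implementation would also carry out the neighbor-combining retry the description specifies before flipping the identifier.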
The above-mentioned embodiments only express a certain implementation mode of the present invention, and the description thereof is specific and detailed, but not construed as limiting the scope of the present invention; it should be noted that, for those skilled in the art, without departing from the concept of the present invention, several variations and modifications can be made, which are within the protection scope of the present invention; therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (7)
1. A Chinese-English mixed speech recognition method, characterized in that: after Chinese-English mixed speech is collected, the speech is divided into several segments of a certain frame length with a 40% overlap between adjacent segments; the segmented speech signals are high-pass filtered and then windowed to obtain the windowed signal of each segment; each segment is judged to be speech or silence; Chinese recognition is attempted on each non-silent segment, and on success the segment's Chinese is output and the language identifier is set to 1; on failure the language identifier is set to 0 and English recognition is performed; if English recognition succeeds, the segment's English is output and the language identifier is updated to 0; if it fails, the language identifier is set to 1 and Chinese recognition is performed again.
2. The Chinese-English mixed speech recognition method of claim 1, characterized in that: dividing the speech into several segments of speech signal according to a certain frame length specifically comprises dividing the speech into N segments of frame length L, controlling the frame length so that each segment contains only one character as far as possible, with a 40% overlap between adjacent segments; the speech signal of segment a is denoted x_a(n).
4. The Chinese-English mixed speech recognition method of claim 3, characterized in that: judging whether a speech signal is a silent segment specifically comprises calculating the short-time energy E(a) and zero-crossing rate Z(a) of each speech segment through formula (3) and formula (4), weighting them with weights k_1 and k_2, computing the weighted judgment function H(a) by formula (5), and setting a threshold H_set: if H(a) ≥ H_set the segment is a speech segment; if H(a) < H_set it is a silent segment;
H(a) = k_1·E(a) + k_2·Z(a)    (5)
5. The Chinese-English mixed speech recognition method of claim 4, characterized in that: each speech segment is judged to be silent or not; if a silent segment lies between two speech segments, the silent segment is discarded and the two speech segments are left unmerged, each treated as a single character or word; if no silent segment lies between two speech segments, they are merged into one speech segment and treated as one character or word.
6. The Chinese-English mixed speech recognition method of claim 5, characterized in that: Chinese recognition specifically comprises matching the speech signal of a non-silent segment against a Chinese database that contains basic everyday single characters (words) and excludes Chinese homophones of English-derived words; when matching succeeds, the segment's Chinese is output and the language identifier is set to 1; if matching fails, the unmatched segment is combined with its preceding and following segments and Chinese recognition is retried; on success the segment's Chinese is output and the language identifier is set to 1, and on failure the language identifier is set to 0 and English recognition is performed.
7. The Chinese-English mixed speech recognition method of claim 6, characterized in that: English recognition specifically comprises matching speech signals whose language identifier is 0 against an English database that contains basic everyday words and excludes English homophones of Chinese-derived words; when matching succeeds, the segment's English is output and the language identifier is set to 0; if matching fails, the unmatched segment is combined with its preceding and following segments and English recognition is retried; on success the segment's English is output and the language identifier is set to 0, and on failure the language identifier is set to 1 and judgment continues.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010948079.6A CN112037762A (en) | 2020-09-10 | 2020-09-10 | Chinese-English mixed speech recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112037762A true CN112037762A (en) | 2020-12-04 |
Family
ID=73585283
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010948079.6A Pending CN112037762A (en) | 2020-09-10 | 2020-09-10 | Chinese-English mixed speech recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112037762A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737629A (en) * | 2011-11-11 | 2012-10-17 | 东南大学 | Embedded type speech emotion recognition method and device |
CN108564942A (en) * | 2018-04-04 | 2018-09-21 | 南京师范大学 | One kind being based on the adjustable speech-emotion recognition method of susceptibility and system |
US10388272B1 (en) * | 2018-12-04 | 2019-08-20 | Sorenson Ip Holdings, Llc | Training speech recognition systems using word sequences |
CN110335617A (en) * | 2019-05-24 | 2019-10-15 | 国网新疆电力有限公司乌鲁木齐供电公司 | A kind of noise analysis method in substation |
CN110517668A (en) * | 2019-07-23 | 2019-11-29 | 普强信息技术(北京)有限公司 | A kind of Chinese and English mixing voice identifying system and method |
CN111179937A (en) * | 2019-12-24 | 2020-05-19 | 上海眼控科技股份有限公司 | Method, apparatus and computer-readable storage medium for text processing |
CN111243597A (en) * | 2020-01-10 | 2020-06-05 | 上海电机学院 | Chinese-English mixed speech recognition method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201204 |