CN109545186B - Speech recognition training system and method - Google Patents

Speech recognition training system and method

Info

Publication number
CN109545186B
Authority
CN
China
Prior art keywords: loss function, recognition, unit, voice, recognized
Prior art date
Legal status: Active
Application number
CN201811538408.9A
Other languages
Chinese (zh)
Other versions
CN109545186A (en)
Inventor
胡杰
Current Assignee
Momenta Suzhou Technology Co Ltd
Original Assignee
Momenta Suzhou Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Momenta Suzhou Technology Co Ltd
Priority to CN201811538408.9A
Publication of CN109545186A
Application granted
Publication of CN109545186B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Abstract

The invention relates to a speech recognition training system and method, belonging to the technical field of speech recognition. In the prior art, the error-correction mechanism or loss function used for error correction is generic or single-purpose, so errors in speech recognition cannot be corrected quickly and accurately. The invention provides a speech recognition training system and method that set up several loss functions, each targeting a common class of recognition error, thereby improving the accuracy and speed of the system.

Description

Speech recognition training system and method
Technical Field
The invention relates to speech recognition technology, and in particular to a Chinese speech recognition training method.
Background
Communicating with a machine by voice, so that the machine understands what you say, is something people have dreamed of for a long time. The China Internet of Things school-enterprise alliance regards speech recognition as the hearing system of the machine. Speech recognition technology is a high technology that lets machines convert speech signals into corresponding text or commands through a process of recognition and understanding.
In recent years, research on applying artificial neural networks to speech recognition has advanced. Most of these studies employ multilayer perceptron networks based on the back-propagation (BP) algorithm. Artificial neural networks can discriminate complex classification boundaries, which clearly helps pattern classification. Telephone speech recognition in particular has become a hot spot of current speech recognition applications because of its broad application prospects.
However, in existing artificial neural networks applied to speech recognition, the error-correction mechanism or the loss function used for error correction is generic or single-purpose, and speech recognition errors cannot be corrected quickly and accurately.
Disclosure of Invention
In view of the problems in the prior art, the present invention provides a speech recognition training system, characterized in that the system comprises a feature extraction unit, a speech recognition unit, and a loss function;
the feature extraction unit is used for extracting features from the speech information to be recognized;
the speech recognition unit is used for performing speech recognition on the input speech information to be recognized to obtain a recognition result;
the system compares the pre-labeling of the speech information to be recognized with the recognition result, constructs the loss function, and finally corrects the speech recognition unit and the feature extraction unit layer by layer through back-propagation of the loss function;
the loss function is formed as the sum of at least two loss functions of different types.
Preferably, the two different types of loss functions are, respectively, a homophonic loss function and an approximate loss function.
Preferably, the homophonic loss function represents the probability of recognition errors among different characters with the same pronunciation, and the approximate loss function represents the probability of recognition errors among different characters with similar pronunciation.
Preferably, the loss function of the system is a·(homophonic loss function) + b·(approximate loss function), where a and b are weight coefficients.
Preferably, b > a when the recognition result includes different characters with similar pronunciation, and b < a when the recognition result includes different characters with the same pronunciation.
Preferably, the speech recognition unit includes a first speech recognition unit and a second speech recognition unit, corresponding to the homophonic loss function and the approximate loss function, respectively.
Preferably, the system further comprises a mapping unit that predicts the recognition result through a mapping based on a character dictionary or a word dictionary.
Preferably, the system further comprises a sentence loss function representing the probability of recognition errors on ambiguity-prone sentences.
Preferably, the system comprises a plurality of speech recognition units.
The invention also provides a method for speech recognition training using the above system, characterized in that the method comprises the following steps:
a feature extraction step: extracting features from the speech information to be recognized;
a speech recognition step: performing speech recognition on the input speech information to be recognized to obtain a recognition result;
an error correction step: the system compares the pre-labeling of the speech information to be recognized with the recognition result, constructs the loss function, and finally corrects the speech recognition unit and the feature extraction unit layer by layer through back-propagation of the loss function;
the loss function is formed as the sum of at least two loss functions of different types.
Preferably, the two different types of loss functions are, respectively, a homophonic loss function and an approximate loss function.
Preferably, the loss function further comprises a sentence loss function representing the probability of recognition errors on ambiguity-prone sentences.
The inventive points of the invention include, but are not limited to, the following:
(1) The invention proposes expressing the loss function as the sum of a homophonic loss function and an approximate loss function; by setting the weight between the two under different conditions, different types of speech recognition errors can be handled; the classification can also be limited to commonly used characters according to the actual situation.
(2) The loss function of the present invention may also include a sentence loss function, which improves the accuracy and speed of the training system on sentences that are prone to ambiguity.
(3) The invention can also use a plurality of speech recognition units, i.e. two recurrent neural networks, each working on its own target, thereby improving efficiency.
Drawings
FIG. 1 shows the deep-learning-based speech recognition training structure of embodiment 1 of the present invention;
FIG. 2 shows the deep-learning-based speech recognition training structure of embodiment 2 of the present invention.
Detailed Description
The technical solution of the invention is further explained below through specific embodiments in conjunction with the accompanying drawings.
To better illustrate the invention and facilitate understanding of its technical solution, typical but non-limiting examples are given as follows:
the invention provides a speech recognition training method based on deep learning, which comprises the steps of firstly determining speech information to be recognized, extracting characteristics of the input speech information through a Convolutional Neural Network (CNN), then inputting the extracted characteristics into a Recurrent Neural Network (RNN), then outputting a recognition result through the Recurrent Neural Network, then comparing the specific speech content with the recognition result through marking of the speech information to be recognized, constructing a loss function, finally conducting reversely step by step through the loss function, and correcting the Neural Network step by step to achieve the training purpose.
The invention provides a speech recognition training system based on deep learning: speech information to be recognized is input through the speech input module of the system and passes through a preprocessing unit, which filters the input speech information to suppress interference; the filtered speech information is then fed into the feature extraction unit for feature extraction, and the extracted features are input into the speech recognition unit (generally a neural network) for recognition. A sketch of such a front end follows below.
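As a non-limiting illustration of the front end, the sketch below filters the waveform and produces features for the CNN extractor. The patent only says the preprocessing unit filters the input; the pre-emphasis filter, the mel-spectrogram features, and the use of the torchaudio package are assumptions.

```python
import torch
import torchaudio

def preprocess(waveform, sample_rate=16000):
    # Pre-emphasis high-pass filtering as a simple interference filter
    # (an assumption; the patent does not specify the filter type).
    emphasized = torch.cat(
        [waveform[:, :1], waveform[:, 1:] - 0.97 * waveform[:, :-1]], dim=1)
    # Mel-spectrogram features as input for the feature extraction unit.
    mels = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_mels=80)(emphasized)
    return mels                                  # (channel, n_mels, time)
```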
Example 1
The speech recognition training system of the invention is shown in fig. 1 and comprises a preprocessing unit, a feature extraction unit, a speech recognition unit, a loss function, and a mapping unit; the feature extraction unit is specifically a convolutional neural network (CNN), and the speech recognition unit is specifically a recurrent neural network (RNN).
The system also requires some preparation beforehand, specifically: (1) building a training sample set, i.e., labeling speech information samples so that each label records the specific content of the speech; (2) classifying all characters in the Chinese character inventory and labeling each character's category, as follows:
in some embodiments, in mode 1, the classification by glyph is employed
Words with similar glyphs are all labeled as category 1, for example: "vegetables" and "dishes", "go" and "lose", "forest" and "wood", etc.;
all characters with no similarity of font are marked as category 2;
in some embodiments, approach 2 is employed, classifying by pronunciation
Different words with the same pronunciation are labeled as category 1, for example: "Ming" and "Ming", "ren" and "ren";
the characters with similar pronunciation are marked as category 2, and the pronunciation in the text can be different from the edge sound and the nose sound, the front nose sound and the back nose sound, and the tone; for example: "flow" and "cow", "root" and "more", "creep" and "permit", and the like;
because the number of the Chinese characters is more, the Chinese characters with the same or similar pronunciation are also more; the classification can be limited to the range of common words as required, and the rarely used words can be labeled in another way.
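As a non-limiting illustration, the sketch below labels character pairs by the pronunciation rules of mode 2, assuming the third-party pypinyin package for pinyin lookup; the normalization of l/n initials and -n/-ng finals follows the text, and everything else is an illustrative assumption.

```python
from pypinyin import pinyin, Style

def toned(ch):
    return pinyin(ch, style=Style.TONE3)[0][0]    # e.g. 'ming2'

def toneless(ch):
    return pinyin(ch, style=Style.NORMAL)[0][0]   # e.g. 'ming'

def normalize(syllable):
    # Collapse the confusion pairs named in the text:
    # lateral/nasal initials (l/n) and front/back nasal finals (-n/-ng).
    if syllable.startswith('n'):
        syllable = 'l' + syllable[1:]
    if syllable.endswith('ng'):
        syllable = syllable[:-1]
    return syllable

def category(a, b):
    if toned(a) == toned(b):
        return 1   # same pronunciation, including tone
    if normalize(toneless(a)) == normalize(toneless(b)):
        return 2   # similar pronunciation: l/n, -n/-ng, or tone differs
    return 0       # not confusable under these rules

print(category('明', '鸣'))   # homophones -> category 1
```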
The classification of mode 1 can also be used for character (glyph) recognition technology, whose pipeline resembles that of speech recognition: features are extracted by a convolutional neural network, classified by a recurrent neural network, corrected through a loss function, and training is thereby completed.
The feature extraction unit is implemented by constructing a convolutional neural network (CNN). The CNN first performs initial feature extraction on the speech information through its convolution kernels; each initially extracted feature covers part of the speech information, which may be one character or several characters. Then, one or more further extraction layers in the CNN extract features from those of the previous level step by step, yielding the required precise features and removing redundant ones. Finally, a fully connected layer of the CNN concatenates all sub-speech features extracted from the same speech information into a complete set of extracted features.
The speech recognition unit is implemented by constructing a recurrent neural network (RNN). The input of the RNN comprises two kinds of data: the first is the feature data extracted by the CNN, and the second is the RNN's own output at the previous time step; the RNN then outputs the speech recognition result. To ensure recognition accuracy, the common usage of the language is usually taken into account, so the RNN input may further include a third kind of data, namely the prediction for the current time step made at the previous time step, obtained through a character-dictionary or word-dictionary mapping. A sketch of such a recognizer follows below.
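As a non-limiting illustration, the sketch below shows a recognition step whose input combines the three kinds of data named above: CNN features, the recognizer's previous output, and a dictionary-mapped prediction. The embedding sizes and the way the dictionary prediction is supplied are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DictAwareRecognizer(nn.Module):
    def __init__(self, feat_dim=256, n_chars=4000, emb=64, hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb)  # previous output (2nd kind)
        self.pred_emb = nn.Embedding(n_chars, emb)  # dictionary guess (3rd kind)
        self.cell = nn.GRUCell(feat_dim + 2 * emb, hidden)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, feats, dict_pred):
        # feats: (batch, time, feat_dim); dict_pred: (batch, time) char ids
        batch, time, _ = feats.shape
        h = feats.new_zeros(batch, self.cell.hidden_size)
        prev = feats.new_zeros(batch, dtype=torch.long)  # id 0 = start token
        logits = []
        for t in range(time):
            x = torch.cat([feats[:, t],                  # 1st kind: CNN features
                           self.char_emb(prev),          # 2nd kind: own last output
                           self.pred_emb(dict_pred[:, t])], dim=-1)
            h = self.cell(x, h)
            step = self.out(h)
            logits.append(step)
            prev = step.argmax(dim=-1)                   # feed the output back in
        return torch.stack(logits, dim=1)                # (batch, time, n_chars)
```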
A speech recognition result is obtained through the CNN and the RNN and compared with the pre-labeling of the speech information; when the comparison differs, the error is back-propagated, and each neural network is corrected step by step during back-propagation. This process is repeated until the accuracy (or error rate) of the recognition results reaches the set threshold, as sketched below.
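Continuing the earlier train_step sketch, the loop below repeats comparison and back-propagation until the recognition accuracy reaches a set threshold. The data loader, threshold value, and epoch cap are illustrative assumptions.

```python
import torch

def train_until(loader, threshold=0.95, max_epochs=100):
    for epoch in range(max_epochs):
        correct = total = 0
        for speech, labels in loader:
            train_step(speech, labels)                  # correct the CNN and RNN
            with torch.no_grad():
                pred = recognizer(extractor(speech)).argmax(-1)
                correct += (pred == labels).sum().item()
                total += labels.numel()
        if correct / total >= threshold:                # accuracy target reached
            return epoch
    return max_epochs
```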
The comparison between the recognition result and the pre-labeling is embodied by the loss function. According to past experience, the errors found in such comparisons fall mainly into two types, homophone errors and near-homophone errors, both of which are character-level errors.
Character errors can be corrected indirectly through a character-level loss function, driving the network to extract the most discriminative features.
The total loss function of this embodiment is the homophonic loss function + the approximate loss function, which handles these errors in speech recognition well; a sketch of such a composite loss is given below.
Example 2
The speech recognition training system of the invention is shown in fig. 2 and comprises a preprocessing unit, a feature extraction unit, a speech recognition unit 1, a loss function 1, a speech recognition unit 2, a loss function 2, and a mapping unit; the feature extraction unit is specifically a convolutional neural network (CNN), and each speech recognition unit is specifically a recurrent neural network (RNN).
The system likewise requires some preparation beforehand, specifically: (1) building a training sample set, i.e., labeling speech information samples so that each label records the specific content of the speech; (2) classifying all characters in the Chinese character inventory and labeling each character's category, as follows:
different characters with the same pronunciation are labeled as category 1, for example distinct characters that share the reading "ming", or the reading "ren";
characters with similar pronunciation are labeled as category 2, where "similar" again covers pronunciations differing only in lateral vs. nasal initials (l/n), front vs. back nasal finals (-n/-ng), or tone; for example (English glosses of the Chinese pairs): "flow" and "cow", "root" and "more", "creep" and "permit", and so on.
then, feature extraction is carried out on the input voice information through a feature extraction unit, the extracted features are simultaneously and respectively input into a voice recognition unit 1 and a voice recognition unit 2, and then recognition results are output by the voice recognition unit 1 and the voice recognition unit 2; and then comparing the marking of the voice information to be recognized, namely the content of the specific voice with the recognition result, constructing a loss function, and finally conducting reversely step by the loss function so as to modify the neural network step by step to realize the training purpose.
The total loss function is a·(homophonic loss function) + b·(approximate loss function), where a and b are weight coefficients and a + b = 1. If category 1 (same pronunciation) appears in the recognition result, then b < a, preferably with a being 0.7-0.9; if category 2 (similar pronunciation) appears, then b > a, preferably with a being 0.1-0.2; if both appear, set b = a. A weight-selection sketch follows below.
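As a non-limiting illustration, the weight rule above (with a + b = 1) can be written as follows; the concrete values chosen inside each preferred range are assumptions.

```python
def choose_weights(has_cat1, has_cat2):
    """Return (a, b) for the composite loss, following the rule above."""
    if has_cat1 and has_cat2:
        return 0.5, 0.5        # both categories present: b = a
    if has_cat1:               # same-pronunciation characters in the result
        return 0.8, 0.2        # b < a, with a in the preferred 0.7-0.9 range
    if has_cat2:               # similar-pronunciation characters in the result
        return 0.15, 0.85      # b > a, with a in the preferred 0.1-0.2 range
    return 0.5, 0.5            # neither category appears: split evenly (assumption)
```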
Here, speech recognition unit 1 and speech recognition unit 2, i.e. the first recurrent neural network and the second recurrent neural network, can focus on different directions: the second recurrent neural network, connected to the approximate loss function, can specialize in recognizing characters with similar pronunciation (category 2), while the first recurrent neural network, connected to the homophonic loss function, can focus on recognizing characters with the same pronunciation (category 1). This is one of the innovative points of the invention.
In this embodiment the feature extraction unit is shared; beyond that, the speech recognition unit (not shown) may also be shared, in which case it outputs its results simultaneously to loss function 1 and loss function 2.
In addition, besides single-character recognition errors, sentence recognition errors also occur in speech recognition, and they are quite common; therefore a speech recognition unit 3 and a loss function 3 can be arranged in the system in parallel with speech recognition unit 1, loss function 1, speech recognition unit 2, and loss function 2.
the sentence recognition error generally includes the following cases:
(1) errors caused by different sentence breaks;
(2) errors due to ambiguous words;
(3) errors due to the bias phrase;
(4) errors due to multiple fixed or idioms;
If the speech information to be recognized includes any of the above situations, a sentence loss function can be included in the total loss function, which then becomes: a·(homophonic loss function) + b·(approximate loss function) + c·(sentence loss function); if such sentences appear in the recognition result, c may be set to 0.5 and a + b to 0.5, with a and b apportioned in the ratio described above, as sketched below.
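As a non-limiting illustration, the three-term weighting can reuse choose_weights from the sketch above: when an ambiguity-prone sentence is detected, half the weight goes to the sentence loss and the remainder is split between a and b as before. The detection flag is an assumption.

```python
def total_loss(homo_loss, near_loss, sent_loss,
               has_cat1, has_cat2, ambiguous_sentence):
    if ambiguous_sentence:
        c = 0.5                                   # half the weight to the sentence loss
        a, b = (0.5 * w for w in choose_weights(has_cat1, has_cat2))
    else:
        c = 0.0
        a, b = choose_weights(has_cat1, has_cat2)
    return a * homo_loss + b * near_loss + c * sent_loss
```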
The preferred embodiments of the present invention have been described in detail above; however, the invention is not limited to the specific details of these embodiments. Various simple modifications may be made to the technical solution of the invention within its technical concept, and such simple modifications all fall within the protection scope of the invention.
It should also be noted that the specific technical features described in the above embodiments may be combined in any suitable manner, provided there is no contradiction; to avoid unnecessary repetition, the invention does not separately describe every possible combination.
In addition, the various embodiments of the invention may be combined arbitrarily, and such combinations should likewise be regarded as disclosed by the invention, as long as they do not depart from the concept of the invention.

Claims (9)

1. A speech recognition training system, characterized in that the system comprises a feature extraction unit, a speech recognition unit, and a loss function;
the feature extraction unit is used for extracting features from the speech information to be recognized;
the speech recognition unit is used for performing speech recognition on the input speech information to be recognized to obtain a recognition result;
the system compares the pre-labeling of the speech information to be recognized with the recognition result, constructs the loss function, and finally corrects the speech recognition unit and the feature extraction unit layer by layer through back-propagation of the loss function;
the loss function is formed as the sum of at least two loss functions of different types;
the two different types of loss functions are, respectively, a homophonic loss function and an approximate loss function;
the homophonic loss function represents the probability of recognition errors among different characters with the same pronunciation, and the approximate loss function represents the probability of recognition errors among different characters with similar pronunciation.
2. The system of claim 1, wherein: the loss function of the system is a·(homophonic loss function) + b·(approximate loss function), where a and b are weight coefficients.
3. The system of claim 2, wherein: b > a when the recognition result includes different characters with similar pronunciation, and b < a when the recognition result includes different characters with the same pronunciation.
4. The system of claim 1, wherein: the speech recognition unit includes a first speech recognition unit and a second speech recognition unit, corresponding to the homophonic loss function and the approximate loss function, respectively.
5. The system of claim 1, wherein: the system further comprises a mapping unit that predicts the recognition result through a mapping based on a character dictionary or a word dictionary.
6. The system of claim 1, wherein: the system further comprises a sentence loss function representing the probability of recognition errors on ambiguity-prone sentences.
7. The system of claim 1, wherein: the system comprises a plurality of speech recognition units.
8. A method for speech recognition training using a system according to any of claims 1-6, characterized in that the method comprises the following steps:
a feature extraction step: extracting features from the speech information to be recognized;
a speech recognition step: performing speech recognition on the input speech information to be recognized to obtain a recognition result;
an error correction step: the system compares the pre-labeling of the speech information to be recognized with the recognition result, constructs the loss function, and finally corrects the speech recognition unit and the feature extraction unit layer by layer through back-propagation of the loss function;
the loss function is formed as the sum of at least two loss functions of different types;
the two different types of loss functions are, respectively, a homophonic loss function and an approximate loss function;
the homophonic loss function represents the probability of recognition errors among different characters with the same pronunciation, and the approximate loss function represents the probability of recognition errors among different characters with similar pronunciation.
9. The method of claim 8, wherein: the loss function further includes a sentence loss function representing the probability of recognition errors on ambiguity-prone sentences.
CN201811538408.9A 2018-12-16 2018-12-16 Speech recognition training system and method Active CN109545186B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811538408.9A CN109545186B (en) 2018-12-16 2018-12-16 Speech recognition training system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811538408.9A CN109545186B (en) 2018-12-16 2018-12-16 Speech recognition training system and method

Publications (2)

Publication Number Publication Date
CN109545186A CN109545186A (en) 2019-03-29
CN109545186B true CN109545186B (en) 2022-05-27

Family

ID=65854899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811538408.9A Active CN109545186B (en) 2018-12-16 2018-12-16 Speech recognition training system and method

Country Status (1)

Country Link
CN (1) CN109545186B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827801B (en) * 2020-01-09 2020-04-17 成都无糖信息技术有限公司 Automatic voice recognition method and system based on artificial intelligence
CN115512692B (en) * 2022-11-04 2023-02-28 腾讯科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7617103B2 (en) * 2006-08-25 2009-11-10 Microsoft Corporation Incrementally regulated discriminative margins in MCE training for speech recognition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Optimize method, system, equipment and the storage medium of voice recognition acoustic model
CN108766440A (en) * 2018-05-28 2018-11-06 平安科技(深圳)有限公司 Speaker's disjunctive model training method, two speaker's separation methods and relevant device
CN108920622A (en) * 2018-06-29 2018-11-30 北京奇艺世纪科技有限公司 A kind of training method of intention assessment, training device and identification device

Also Published As

Publication number Publication date
CN109545186A (en) 2019-03-29


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
TA01: Transfer of patent application right
Effective date of registration: 2021-11-26
Address after: 215100, floor 23, Tiancheng Times Business Plaza, No. 58 Qinglonggang Road, high-speed rail new town, Xiangcheng District, Suzhou, Jiangsu Province
Applicant after: MOMENTA (SUZHOU) TECHNOLOGY Co., Ltd.
Address before: Room 601-a32, Tiancheng Information Building, No. 88 South Tiancheng Road, high-speed rail new town, Xiangcheng District, Suzhou City, Jiangsu Province
Applicant before: MOMENTA (SUZHOU) TECHNOLOGY Co., Ltd.
GR01: Patent grant