CN108630192B - non-Chinese speech recognition method, system and construction method thereof - Google Patents

Publication number: CN108630192B (application CN201710156620.8A; authority: CN, China)
Other versions: CN108630192A
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Inventors: 王东, 张之勇, 赵梦原, 黄伟明, 李国强
Assignees: Advanced Systems Development Co ltd; Tsinghua University (the listed assignees may be inaccurate)
Application filed by Advanced Systems Development Co ltd and Tsinghua University; priority to CN201710156620.8A

Classifications

    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/26 — Speech-to-text systems
    • G10L 25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks


Abstract

The invention relates to a non-Chinese speech recognition method and system, and a method for constructing the system. The construction method comprises the following steps: extracting speech features from Chinese speech data in a Chinese language database using a Chinese feature extraction model; establishing a Chinese acoustic model from the extracted features; processing the Chinese acoustic model to obtain a non-Chinese acoustic model; processing the Chinese feature extraction model to obtain a non-Chinese feature extraction model; and constructing a non-Chinese speech recognition system from the non-Chinese feature extraction model and the non-Chinese acoustic model. By reusing existing Chinese speech resources, the models trained on them, and only a small amount of data in the target language, the invention can quickly construct an effective non-Chinese speech recognition system, effectively reducing cost and time overhead.

Description

non-Chinese speech recognition method, system and construction method thereof
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a non-Chinese speech recognition method and system, and a method for constructing the system.
Background
Speech recognition is a technique that converts sound into text. It requires a large amount of accurately labeled speech data for model training; otherwise it is difficult to reach a practically useful recognition rate. Collecting and correctly labeling speech data demands substantial manpower, material resources, and time, and a large amount of data is hard to accumulate in a short period. For Chinese speech recognition, data can be accumulated efficiently by purchasing from professional data companies or by outsourcing the labeling of online data. When constructing a speech recognition system for a language other than Chinese, however, data in that language must be accumulated from scratch, incurring significant cost and time overhead.
Disclosure of Invention
The invention aims to solve the above technical problem of the prior art by providing a non-Chinese speech recognition method and system, and a method for constructing the system.
The technical scheme for solving the technical problems is as follows: a construction method of a non-Chinese speech recognition system comprises the following steps:
step 1, extracting speech features from Chinese speech data in a Chinese language database using a Chinese feature extraction model;
step 2, establishing a Chinese acoustic model from the extracted speech features;
step 3, processing the Chinese acoustic model to obtain a non-Chinese acoustic model;
step 4, processing the Chinese feature extraction model to obtain a non-Chinese feature extraction model;
and 5, constructing a non-Chinese speech recognition system according to the non-Chinese feature extraction model and the non-Chinese acoustic model.
The invention has the beneficial effects that: the non-Chinese speech recognition system is constructed by processing the Chinese acoustic model and the Chinese feature extraction model to obtain a non-Chinese feature extraction model and a non-Chinese acoustic model and constructing the non-Chinese speech recognition system according to the non-Chinese feature extraction model and the non-Chinese acoustic model, so that an effective non-Chinese speech recognition system can be quickly constructed by utilizing the existing Chinese speech resources and trained models thereof and a small amount of necessary language data resources, and cost and time overhead are effectively reduced.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, before step 5, the method further comprises: step 6, enhancing the non-Chinese acoustic model by utilizing cross-language factors, wherein the cross-language factors are language-independent factors and comprise: an environment factor, a channel factor, and a speaker factor.
Further, the Chinese feature extraction model and the non-Chinese feature extraction model are each formed by a deep neural network (DNN) or a convolutional neural network (CNN), and the Chinese acoustic model and the non-Chinese acoustic model are each formed by a recurrent neural network (RNN).
Further, step 3 comprises:
step 3.1, processing the Chinese acoustic model using an i-vector algorithm to obtain a non-Chinese acoustic model; or,
step 3.2, processing the Chinese acoustic model using a CNN- or RNN-based autoencoder to obtain a non-Chinese acoustic model.
Further, step 4 comprises:
step 4.1, directly copying the Chinese feature extraction model and using the copy as the non-Chinese feature extraction model; or,
step 4.2, processing the Chinese feature extraction model with an objective-function constraint method to obtain a non-Chinese feature extraction model.
Further, the objective function is:
L(x; w) = H(x; w) + Σ_x ||h_c(x) − h_j(x)||²
wherein H(x; w) is the objective function of conventional neural network training, and h_c(x) and h_j(x) are the vectors of activation values of all hidden nodes of the Chinese and non-Chinese feature extraction networks, respectively, for training sample x.
Another technical solution of the present invention for solving the above technical problems is as follows: a non-Chinese speech recognition system constructed by the construction method of a non-Chinese speech recognition system according to any one of the above embodiments.
Another technical solution of the present invention for solving the above technical problems is as follows: a non-Chinese speech recognition method, comprising:
step 1, acquiring a speech signal to be recognized;
step 2, determining, using the non-Chinese speech recognition system in the above embodiment, whether the speech signal to be recognized is a non-Chinese speech signal;
and step 3, outputting a speech recognition result.
The invention has the beneficial effects that: with the non-Chinese speech recognition system, it can be conveniently determined whether an acquired speech signal to be recognized is a non-Chinese speech signal.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, step 2 comprises:
step 2.1, extracting speech features from the speech signal to be recognized;
and step 2.2, inputting the extracted speech features into the non-Chinese speech recognition system, comparing them against the decoding graph for non-Chinese speech recognition, and determining whether the speech signal to be recognized is a non-Chinese speech signal.
Another technical solution of the present invention for solving the above technical problems is as follows: a non-Chinese speech recognition apparatus, comprising:
the acquisition module is used for acquiring a speech signal to be recognized;
a speech recognition module, configured to determine whether the speech signal to be recognized acquired by the acquisition module is a non-Chinese speech signal by using the non-Chinese speech recognition system in the foregoing embodiment;
and the output module is used for outputting the speech recognition result determined by the speech recognition module.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention or in the description of the prior art will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for constructing a non-Chinese speech recognition system according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram of a method of constructing a non-Chinese speech recognition system in accordance with another embodiment of the present invention;
FIG. 3 is a schematic flow chart of a non-Chinese speech recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart diagram of a non-Chinese speech recognition method according to another embodiment of the present invention;
fig. 5 is a schematic block diagram of a non-chinese speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
A method 100 of constructing a non-Chinese speech recognition system, as shown in fig. 1, includes:
110. and extracting the voice characteristics from the Chinese voice data of the Chinese language database by using the Chinese characteristic extraction model.
Specifically, in this embodiment, a speech feature describes the spectral and signal characteristics of the physical speech signal; feature types may include MFCC, PLP, FBANK, and the like, but the embodiment of the present invention is not limited thereto.
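By way of illustration only (not part of the claimed method), the following sketch computes FBANK-style log mel filterbank features in plain Python. A naive DFT is used for clarity; real systems use an FFT, and the sample rate, window length, and filter count here are illustrative assumptions rather than values from the patent.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sample_rate=8000, frame_len=64, hop=32, n_fft=64, n_mels=8):
    """Compute log mel filterbank (FBANK) features, frame by frame."""
    # 1. Split the signal into overlapping frames and apply a Hamming window.
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] *
                       (0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)))
                       for n in range(frame_len)])
    # 2. Naive DFT power spectrum per frame (educational; use an FFT in practice).
    n_bins = n_fft // 2 + 1
    spectra = []
    for frame in frames:
        mags = []
        for k in range(n_bins):
            re = sum(frame[n] * math.cos(2 * math.pi * k * n / n_fft)
                     for n in range(frame_len))
            im = -sum(frame[n] * math.sin(2 * math.pi * k * n / n_fft)
                      for n in range(frame_len))
            mags.append(re * re + im * im)  # power in bin k
        spectra.append(mags)
    # 3. Triangular mel filters spanning 0 Hz .. Nyquist, then log energies.
    mel_pts = [mel_to_hz(i * hz_to_mel(sample_rate / 2) / (n_mels + 1))
               for i in range(n_mels + 2)]
    bins = [int(n_fft * f / sample_rate) for f in mel_pts]
    feats = []
    for power in spectra:
        row = []
        for m in range(1, n_mels + 1):
            e = 0.0
            for k in range(bins[m - 1], bins[m + 1] + 1):
                if k >= n_bins:
                    break
                if k < bins[m]:
                    w = (k - bins[m - 1]) / max(1, bins[m] - bins[m - 1])
                else:
                    w = (bins[m + 1] - k) / max(1, bins[m + 1] - bins[m])
                e += w * power[k]
            row.append(math.log(e + 1e-10))  # log energy per mel band
        feats.append(row)
    return feats
```

Each input signal thus becomes a sequence of short-time feature vectors, which is the form consumed by the feature extraction and acoustic models described below.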
120. Establishing a Chinese acoustic model from the extracted speech features.
130. Processing the Chinese acoustic model to obtain a non-Chinese acoustic model.
Specifically, in this embodiment, the non-Chinese language may include Japanese, Korean, English, and the like, which is not limited by the embodiment of the present invention.
140. Processing the Chinese feature extraction model to obtain a non-Chinese feature extraction model.
Specifically, in this embodiment, part of the structure of the Chinese acoustic model may be trimmed and used as the initial structure of the non-Chinese acoustic model. Trimming here refers to pruning: a language model derived from a large corpus of natural-language text forms a very large statistical and distribution space, and running it in real time consumes substantial computing resources (memory, CPU, GPU, etc.) and time. A speech recognition application therefore prunes this large language model by directly deleting entries that are statistically rare, in exchange for a faster response. This saves considerable computing resources, at the cost of a lower recognition rate for some rarely occurring word pronunciations.
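The pruning idea can be sketched minimally as follows (illustrative only: production systems typically prune n-gram models by probability or relative-entropy criteria rather than raw counts, and the threshold here is an assumption):

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Count bigrams over a tokenized corpus (a list of token lists)."""
    counts = Counter()
    for sentence in corpus:
        for a, b in zip(sentence, sentence[1:]):
            counts[(a, b)] += 1
    return counts

def prune(counts, min_count=2):
    """Delete statistically rare bigrams to shrink the model,
    trading some coverage of rare word sequences for speed and memory."""
    return Counter({bg: c for bg, c in counts.items() if c >= min_count})
```

After pruning, the smaller model responds faster at decode time, exactly the trade-off described above.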
150. Constructing a non-Chinese speech recognition system from the non-Chinese feature extraction model and the non-Chinese acoustic model.
Specifically, in this embodiment, the Chinese feature extraction model and the non-Chinese feature extraction model are each formed by a deep neural network (DNN) or a convolutional neural network (CNN), and the Chinese acoustic model and the non-Chinese acoustic model are each formed by a recurrent neural network (RNN).
The construction method provided in the above embodiment processes the Chinese acoustic model and the Chinese feature extraction model to obtain a non-Chinese acoustic model and a non-Chinese feature extraction model, and constructs the non-Chinese speech recognition system from them. An effective non-Chinese speech recognition system can thus be quickly constructed from existing Chinese speech resources, the models trained on them, and only a small amount of data in the target language, effectively reducing cost and time overhead.
Specifically, in this embodiment, an i-vector algorithm may be used in step 130 to process the Chinese acoustic model to obtain the non-Chinese acoustic model.
The i-vector algorithm characterizes all factors related to the long-term characteristics of speech as a low-dimensional feature vector. Since pronunciation content is a short-term characteristic, this vector contains no pronunciation-content information and is therefore language-independent. Long-term features are obtained by re-segmenting the short-time-framed speech with a long-term window and then analyzing the re-segmented speech.
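The long-term re-segmentation just described can be sketched as follows. This is a deliberate simplification: a real i-vector extractor accumulates Baum-Welch statistics against a universal background model, whereas here each long window is summarized only by its mean feature vector, and the window sizes are illustrative assumptions.

```python
def long_term_segments(frames, window=10, hop=5):
    """Re-segment short-time feature frames with a long-term window and
    summarize each segment by its mean vector, discarding the short-term
    (pronunciation-content) variation within the window."""
    segments = []
    for start in range(0, len(frames) - window + 1, hop):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        segments.append([sum(f[d] for f in chunk) / window
                         for d in range(dim)])
    return segments
```

The per-window means retain slowly varying, largely language-independent characteristics (speaker, channel, environment) while averaging out the short-term content.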
Alternatively, step 130 may use a CNN- or RNN-based autoencoder to process the Chinese acoustic model to obtain the non-Chinese acoustic model.
A CNN- or RNN-based autoencoder expresses the speech signal as a feature sequence, compresses it into a low-dimensional feature vector, and then regenerates the original feature sequence from that vector. As with the i-vector algorithm, the compressed low-dimensional vector does not express short-time pronunciation features and is therefore language-independent.
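To make the bottleneck idea concrete, here is a toy linear autoencoder in plain Python. This is an illustrative assumption, not the patent's implementation: real systems use deep CNN/RNN encoders over whole feature sequences, whereas this sketch compresses a single feature vector through a low-dimensional bottleneck and trains by gradient descent on the reconstruction error.

```python
import random

def make_autoencoder(in_dim, bottleneck, seed=0):
    """Random small-weight encoder/decoder matrices."""
    rnd = random.Random(seed)
    W_enc = [[rnd.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(bottleneck)]
    W_dec = [[rnd.uniform(-0.1, 0.1) for _ in range(bottleneck)] for _ in range(in_dim)]
    return W_enc, W_dec

def encode(W_enc, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_enc]

def decode(W_dec, h):
    return [sum(w * hi for w, hi in zip(row, h)) for row in W_dec]

def train_step(W_enc, W_dec, x, lr=0.05):
    """One gradient-descent step on the reconstruction error ||dec(enc(x)) - x||^2."""
    h = encode(W_enc, x)
    y = decode(W_dec, h)
    err = [yi - xi for yi, xi in zip(y, x)]       # dL/dy (up to a factor of 2)
    # Backpropagate the error through the decoder BEFORE updating it.
    dh = [sum(err[i] * W_dec[i][j] for i in range(len(W_dec)))
          for j in range(len(W_enc))]
    for i in range(len(W_dec)):                   # decoder gradient: err[i] * h[j]
        for j in range(len(W_dec[i])):
            W_dec[i][j] -= lr * err[i] * h[j]
    for j in range(len(W_enc)):                   # encoder gradient: dh[j] * x[k]
        for k in range(len(W_enc[j])):
            W_enc[j][k] -= lr * dh[j] * x[k]
    return sum(e * e for e in err)
```

After training, `encode` yields the low-dimensional bottleneck vector, the language-independent representation the text refers to.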
In the conversion from speech signal to text (i.e., the recognition process), the signal is developed hierarchically: lower levels are closer to the original speech source, and higher levels are closer to the final recognition result, namely text. Because the available non-Chinese speech corpus is insufficient, the speech recognition technology and resources accumulated for Chinese can be reused; the lower-level representations reflect characteristics common to human speech and are independent of the components that vary with language, so the Chinese speech system can be reused at these deeper levels.
In addition, in this embodiment, the Chinese feature extraction model may be directly copied in step 140 and the copy used as the non-Chinese feature extraction model. Alternatively,
in step 140, the Chinese feature extraction model may be processed according to an objective-function constraint method to obtain the non-Chinese feature extraction model. The objective function is:
L(x; w) = H(x; w) + Σ_x ||h_c(x) − h_j(x)||²
wherein H(x; w) is the objective function of conventional neural network training, and h_c(x) and h_j(x) are the vectors of activation values of all hidden nodes of the Chinese and non-Chinese feature extraction networks, respectively, for training sample x.
It should be noted that, in this embodiment, the non-Chinese training data is passed simultaneously through the Chinese feature extraction model and the non-Chinese feature extraction model, and the deviation between the two at each hidden-layer node of the neural network is added to the training objective function as a constraint term. That is, training the non-Chinese acoustic model optimizes not only the acoustic model output target (i.e., H(x; w)) but also keeps the feature extraction result as close as possible to the output of the Chinese feature extraction model. The feature extraction knowledge learned on Chinese is thus passed to the non-Chinese feature extraction model through the constraint term. Compared with direct copying, this method balances minimizing the model's classification error against retaining what was learned from Chinese.
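The constrained objective L(x; w) = H(x; w) + Σ_x ||h_c(x) − h_j(x)||² can be computed directly. In this sketch the function names and data shapes are illustrative assumptions; the base loss and the hidden activations would come from the actual networks during training.

```python
def constrained_loss(base_loss, hidden_cn, hidden_noncn):
    """L(x; w) = H(x; w) + sum over samples of ||h_c(x) - h_j(x)||^2.

    base_loss    : H(x; w), the usual training objective on the batch
    hidden_cn    : per-sample hidden activation vectors h_c(x) from the
                   (frozen) Chinese feature extraction model
    hidden_noncn : per-sample hidden activations h_j(x) from the
                   non-Chinese model being trained
    """
    penalty = 0.0
    for hc, hj in zip(hidden_cn, hidden_noncn):
        penalty += sum((a - b) ** 2 for a, b in zip(hc, hj))
    return base_loss + penalty
```

When the two hidden representations agree, the penalty vanishes and the loss reduces to the conventional objective; the further the non-Chinese model drifts from the Chinese one, the larger the added term.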
Optionally, in an embodiment, as shown in fig. 2, before step 150, the method 100 further includes:
160. Enhancing the non-Chinese acoustic model with cross-language factors.
Wherein, the cross-language factor is a language independent factor, including: an environment factor, a channel factor, and a speaker factor, but embodiments of the invention are not limited in this respect.
Because Chinese has a large amount of training corpus covering various complexities such as channels, speakers, and accents, the Chinese feature extraction model is highly robust to these language-independent factors. In this embodiment, cross-language factors derived from the Chinese feature extraction model may be used to enhance the non-Chinese acoustic model.
Specifically, a factor model of the language-independent factors is trained on Chinese speech data, and this factor model is then used to generate language-independent factors during training and decoding of the non-Chinese recognition system.
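One of the simplest possible factor models of this kind, shown purely as a hedged illustration (a cepstral-mean-style simplification, not the patent's actual factor model), estimates a channel/environment factor from Chinese data and removes it from utterances in any language:

```python
def estimate_channel_factor(chinese_utterances):
    """Estimate a language-independent channel/environment factor as the
    global mean feature vector over Chinese training utterances
    (each utterance is a list of feature frames)."""
    dim = len(chinese_utterances[0][0])
    total = [0.0] * dim
    n = 0
    for utt in chinese_utterances:
        for frame in utt:
            for d in range(dim):
                total[d] += frame[d]
            n += 1
    return [t / n for t in total]

def apply_factor(utterance, factor):
    """Remove the factor from a (possibly non-Chinese) utterance,
    normalizing away channel/environment variation."""
    return [[f - c for f, c in zip(frame, factor)] for frame in utterance]
```

Because the factor is estimated from plentiful Chinese data but describes language-independent variation, the same normalization can be applied during non-Chinese training and decoding.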
It should be understood that, in this embodiment, the step 160 and the step 140 are not strictly in an execution order, and may be executed in parallel, or executed sequentially, and this is not limited in this embodiment of the present invention.
It should be understood that, in the embodiments of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
In addition, the present invention also provides a non-Chinese speech recognition system, which is constructed by the construction method of the non-Chinese speech recognition system as shown in fig. 1 and 2.
The method for constructing a non-Chinese speech recognition system according to an embodiment of the present invention has been described in detail above with reference to fig. 1 and 2; a non-Chinese speech recognition method according to an embodiment of the present invention is described in detail below with reference to fig. 3 and 4.
A non-Chinese speech recognition method 300, as shown in fig. 3, includes:
310. and acquiring a voice signal to be recognized.
320. A non-chinese speech recognition system is utilized to determine whether a speech signal to be recognized is a non-chinese speech signal.
Specifically, in this embodiment, the non-chinese speech recognition system is a non-chinese speech recognition system constructed by a method of constructing a non-chinese speech recognition system as shown in fig. 1 and 2 above. In addition, the non-chinese language may include japanese, korean, english, etc., and the embodiment of the present invention is not limited thereto.
330. And outputting a voice recognition result.
With the non-Chinese speech recognition system, the method provided in the above embodiment can conveniently determine whether an acquired speech signal to be recognized is a non-Chinese speech signal.
Optionally, in an embodiment, as shown in fig. 4, step 320 includes:
321. Extracting speech features from the speech signal to be recognized.
322. Inputting the extracted speech features into the non-Chinese speech recognition system, comparing them against the decoding graph for non-Chinese speech recognition, and determining whether the speech signal to be recognized is a non-Chinese speech signal.
Specifically, in this embodiment, a large amount of accurately labeled original speech is processed by various technical means to generate the acoustic model, a large amount of original text corpus is used to train the language model, and the two are combined with a pronunciation dictionary to generate the decoding graph used for recognition.
Features are extracted from the speech to be recognized and matched against the decoding graph to form the recognition result. Specifically: after front-end signal processing, endpoint detection, and the like, speech features (of types such as MFCC, PLP, and FBANK) are extracted frame by frame and sent to the decoder, which, guided jointly by the acoustic model, the language model, and the pronunciation dictionary, finds the best-matching word sequence and outputs it as the recognition result.
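The joint guidance of acoustic and language-model scores can be sketched as a toy Viterbi search over per-frame word scores. All scores and the back-off value below are illustrative assumptions; a real decoder searches a compiled WFST decoding graph that also incorporates the pronunciation dictionary.

```python
def viterbi_decode(frame_scores, lm, words):
    """Toy decoder: frame_scores[t][w] is the acoustic log-score of word w
    at frame t; lm[(a, b)] is a bigram language-model log-score (with a
    crude -5.0 back-off for unseen bigrams).  Viterbi dynamic programming
    finds the word sequence maximizing acoustic + LM log-probability."""
    best = {w: (frame_scores[0][w], [w]) for w in words}
    for t in range(1, len(frame_scores)):
        new_best = {}
        for w in words:
            prev_w, (score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + lm.get((kv[0], w), -5.0))
            new_best[w] = (score + lm.get((prev_w, w), -5.0) + frame_scores[t][w],
                           path + [w])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]
```

Even in this toy form, a frame whose acoustic evidence is ambiguous can be resolved by the language model, which is exactly the "common guidance" described above.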
A non-Chinese speech recognition method according to an embodiment of the present invention has been described in detail above with reference to fig. 3 and 4; a non-Chinese speech recognition apparatus according to an embodiment of the present invention is described in detail below with reference to fig. 5.
A non-Chinese speech recognition apparatus 500, as shown in fig. 5, includes: an acquisition module 510, a speech recognition module 520, and an output module 530, wherein:
the obtaining module 510 is configured to obtain a speech signal to be recognized.
The speech recognition module 520 is configured to determine, using a non-Chinese speech recognition system, whether the speech signal to be recognized acquired by the acquisition module 510 is a non-Chinese speech signal. The non-Chinese speech recognition system is constructed by the construction method shown in fig. 1 and fig. 2.
The output module 530 is configured to output the speech recognition result determined by the speech recognition module 520.
It should be understood that the non-Chinese speech recognition apparatus 500 according to the embodiment of the present invention may correspond to an execution body of the non-Chinese speech recognition method 300 according to the embodiment of the present invention, and that the above and other operations and/or functions of each module in the apparatus 500 respectively implement the corresponding processes of the methods in fig. 3 and fig. 4; for brevity, they are not described again here.
Specifically, in this embodiment, the speech recognition module 520 may be configured to extract speech features from the speech signal to be recognized acquired by the acquisition module 510, input the extracted features into the non-Chinese speech recognition system, compare them against the decoding graph for non-Chinese speech recognition, and determine whether the speech signal is a non-Chinese speech signal.
It should be understood that, in the embodiments of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; the components and steps of the examples have been described generally in terms of their functionality in the foregoing description in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and design constraints of the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for constructing a non-Chinese speech recognition system, characterized by comprising the following steps:
step 1, extracting speech features from Chinese speech data in a Chinese corpus by using a Chinese feature extraction model;
step 2, establishing a Chinese acoustic model from the extracted speech features;
step 3, processing the Chinese acoustic model to obtain a non-Chinese acoustic model;
step 4, processing the Chinese feature extraction model to obtain a non-Chinese feature extraction model;
step 5, constructing the non-Chinese speech recognition system from the non-Chinese feature extraction model and the non-Chinese acoustic model.
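As an illustration only (not part of the claims), the five construction steps above can be sketched as a pipeline. All function names and the stub "models" (a scalar feature scale, the mean as an acoustic model, an additive adaptation shift) are hypothetical placeholders standing in for the real neural-network components:

```python
def extract_features(audio, feat_model):
    """Step 1: apply the (stub) Chinese feature extraction model to one utterance."""
    return [feat_model * x for x in audio]

def train_acoustic_model(features):
    """Step 2: fit a trivial stand-in 'acoustic model' (here: the mean feature value)."""
    flat = [v for utt in features for v in utt]
    return sum(flat) / len(flat)

def adapt(model, shift):
    """Steps 3-4: derive a non-Chinese model from the Chinese one (stub transform)."""
    return model + shift

def build_non_chinese_system(chinese_audio, cn_feat_model, shift):
    """Step 5: assemble the non-Chinese system from the adapted models."""
    feats = [extract_features(utt, cn_feat_model) for utt in chinese_audio]
    cn_acoustic = train_acoustic_model(feats)
    non_cn_acoustic = adapt(cn_acoustic, shift)
    non_cn_feat_model = cn_feat_model  # cf. claim 5, step 4.1: direct copy
    return non_cn_feat_model, non_cn_acoustic
```

In the patent the feature extractor and acoustic model are neural networks; this sketch only shows the ordering and data flow of the five steps.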
2. The method according to claim 1, further comprising, before step 5:
step 6, enhancing the non-Chinese acoustic model by using cross-language factors, the cross-language factors being language-independent factors that include an environment factor, a channel factor and a speaker factor.
3. The method according to claim 1 or 2, wherein the Chinese feature extraction model and the non-Chinese feature extraction model are each formed by a deep neural network (DNN) or a convolutional neural network (CNN), and the Chinese acoustic model and the non-Chinese acoustic model are each formed by a recurrent neural network (RNN).
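For illustration only, a DNN feature extractor of the kind named in claim 3 is a stack of fully-connected layers applied to each acoustic frame. The layer sizes, the ReLU nonlinearity and the function name below are illustrative assumptions; the patent does not fix a specific architecture:

```python
import numpy as np

def relu(x):
    # Elementwise rectified linear activation.
    return np.maximum(x, 0.0)

def dnn_extract(frame, weights, biases):
    """Forward pass of a small fully-connected (DNN) feature extractor:
    each layer computes relu(W @ h + b) on the previous layer's output."""
    h = np.asarray(frame, dtype=float)
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h
```

A CNN extractor would replace the matrix products with convolutions over time/frequency, and the acoustic model of claim 3 would consume these frame-level features with an RNN.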
4. The method according to claim 3, wherein step 3 comprises:
step 3.1, processing the Chinese acoustic model with an i-vector algorithm to obtain the non-Chinese acoustic model; or,
step 3.2, processing the Chinese acoustic model with an auto-encoder based on a deep neural network (DNN) or a convolutional neural network (CNN) to obtain the non-Chinese acoustic model.
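As an illustrative aside (not the patent's implementation), the auto-encoder of step 3.2 encodes its input to a hidden code and decodes it back, trained to minimize reconstruction error. The single hidden layer, tanh nonlinearity and linear decoder below are assumed choices for the sketch:

```python
import numpy as np

def autoencoder_reconstruction_error(x, W_enc, W_dec):
    """One pass of a single-hidden-layer auto-encoder and its squared
    reconstruction error: encode with tanh(W_enc @ x), decode linearly."""
    x = np.asarray(x, dtype=float)
    h = np.tanh(W_enc @ x)      # encoder: input -> hidden code
    x_hat = W_dec @ h           # decoder: hidden code -> reconstruction
    return float(np.sum((x - x_hat) ** 2))
```

The i-vector alternative of step 3.1 instead summarizes an utterance as a low-dimensional vector in a total-variability subspace; it is omitted here as it requires a trained subspace model.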
5. The method according to claim 4, wherein step 4 comprises:
step 4.1, directly copying the Chinese feature extraction model and using the copy as the non-Chinese feature extraction model; or,
step 4.2, processing the Chinese feature extraction model according to an objective function constraint method to obtain the non-Chinese feature extraction model.
6. The method of claim 5, wherein the objective function is:
L(x; w) = H(x; w) + Σ_x ||h_c(x) − h_j(x)||²
wherein H(x; w) is the objective function of conventional neural network training, and h_c(x) and h_j(x) are the activation-value vectors of all hidden nodes in the Chinese and non-Chinese feature extraction networks, respectively, for the training sample x.
7. A non-Chinese speech recognition system constructed by the method for constructing a non-Chinese speech recognition system according to any one of claims 1-6.
8. A method for non-Chinese speech recognition, characterized by comprising:
step 1, acquiring a speech signal to be recognized;
step 2, determining whether the speech signal to be recognized is a non-Chinese speech signal by using the non-Chinese speech recognition system according to claim 7;
step 3, outputting a speech recognition result.
9. The method according to claim 8, wherein step 2 comprises:
step 2.1, extracting speech features from the speech signal to be recognized;
step 2.2, inputting the extracted speech features into the non-Chinese speech recognition system, comparing them against a decoding graph for non-Chinese speech recognition, and determining whether the speech signal to be recognized is a non-Chinese speech signal.
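Purely as an illustrative sketch of the two recognition sub-steps above: extract features, score them against a decoding graph, and decide. The callables `extract` and `graph_score` and the decision `threshold` are hypothetical stand-ins for the patent's feature extraction model and decoding-graph comparison:

```python
def recognize(signal, extract, graph_score, threshold=0.5):
    """Step 2.1: extract features from the signal.
    Step 2.2: score the features against a decoding graph and decide
    whether the signal is non-Chinese speech."""
    feats = extract(signal)
    score = graph_score(feats)
    return score >= threshold, score
```

In a real system `graph_score` would be a decoder searching a weighted finite-state decoding graph, not a scalar scoring function.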
10. A non-Chinese speech recognition apparatus, comprising:
an acquisition module for acquiring a speech signal to be recognized;
a speech recognition module for determining, by using the non-Chinese speech recognition system according to claim 7, whether the speech signal acquired by the acquisition module is a non-Chinese speech signal;
and an output module for outputting the speech recognition result determined by the speech recognition module.
CN201710156620.8A 2017-03-16 2017-03-16 non-Chinese speech recognition method, system and construction method thereof Active CN108630192B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710156620.8A CN108630192B (en) 2017-03-16 2017-03-16 non-Chinese speech recognition method, system and construction method thereof


Publications (2)

Publication Number Publication Date
CN108630192A CN108630192A (en) 2018-10-09
CN108630192B true CN108630192B (en) 2020-06-26

Family

ID=63687365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710156620.8A Active CN108630192B (en) 2017-03-16 2017-03-16 non-Chinese speech recognition method, system and construction method thereof

Country Status (1)

Country Link
CN (1) CN108630192B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369978B (en) * 2018-12-26 2024-05-17 北京搜狗科技发展有限公司 Data processing method and device for data processing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727901A (en) * 2009-12-10 2010-06-09 清华大学 Method for recognizing Chinese-English bilingual voice of embedded system
CN103400577A (en) * 2013-08-01 2013-11-20 百度在线网络技术(北京)有限公司 Acoustic model building method and device for multi-language voice identification
CN103839545A (en) * 2012-11-23 2014-06-04 三星电子株式会社 Apparatus and method for constructing multilingual acoustic model
CN106469552A (en) * 2015-08-20 2017-03-01 三星电子株式会社 Speech recognition apparatus and method
CN106486115A (en) * 2015-08-28 2017-03-08 株式会社东芝 Improve method and apparatus and audio recognition method and the device of neutral net language model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI349925B (en) * 2008-01-10 2011-10-01 Delta Electronics Inc Speech recognition device and method thereof


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Long, Yang Jun'an, Chen Lei, Lin Wei; "Modeling method for Chinese language models based on recurrent neural networks" (《基于循环神经网络的汉语语言模型建模方法》); Technical Acoustics (《声学技术》); 2015-10-15; Vol. 34, No. 5; pp. 431-436 *

Also Published As

Publication number Publication date
CN108630192A (en) 2018-10-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant