CN108630192B - non-Chinese speech recognition method, system and construction method thereof - Google Patents

Publication number: CN108630192B (application CN201710156620.8A; authority: CN, China)
Other versions: CN108630192A
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Inventors: 王东, 张之勇, 赵梦原, 黄伟明, 李国强
Assignees: Advanced Systems Development Co ltd; Tsinghua University (the listed assignees may be inaccurate)
Application filed by Advanced Systems Development Co ltd and Tsinghua University; priority to CN201710156620.8A

Classifications

    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/26 — Speech-to-text systems
    • G10L 25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks


Abstract

The invention relates to a non-Chinese speech recognition method and system, and a method for constructing the system. The construction method comprises the following steps: extracting speech features from Chinese speech data in a Chinese language database using a Chinese feature extraction model; establishing a Chinese acoustic model from the extracted features; processing the Chinese acoustic model to obtain a non-Chinese acoustic model; processing the Chinese feature extraction model to obtain a non-Chinese feature extraction model; and constructing a non-Chinese speech recognition system from the non-Chinese feature extraction model and the non-Chinese acoustic model. By reusing existing Chinese speech resources, the models trained on them, and only a small amount of data in the target language, the invention can quickly construct an effective non-Chinese speech recognition system, effectively reducing cost and time overhead.

Description

non-Chinese speech recognition method, system and construction method thereof
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a non-Chinese speech recognition method and system, and a method for constructing the system.
Background
Speech recognition is a technique that converts sound into text. It requires a large amount of accurately labeled speech data for model training; otherwise it is difficult to reach a practically useful recognition rate. Collecting and correctly labeling speech data demands substantial manpower, material resources, and time, and a large amount of data is hard to accumulate in a short period. For Chinese speech recognition, data can be accumulated efficiently by purchasing from professional data companies or by outsourcing the labeling of online data. When constructing a speech recognition system for a language other than Chinese, however, data in that language must be accumulated from scratch, incurring significant cost and time overhead.
Disclosure of Invention
The invention aims to solve the above technical problem of the prior art by providing a non-Chinese speech recognition method and system, and a method for constructing the system.
The technical scheme for solving the technical problems is as follows: a construction method of a non-Chinese speech recognition system comprises the following steps:
step 1, extracting speech features from Chinese speech data in a Chinese language database using a Chinese feature extraction model;
step 2, establishing a Chinese acoustic model from the extracted speech features;
step 3, processing the Chinese acoustic model to obtain a non-Chinese acoustic model;
step 4, processing the Chinese feature extraction model to obtain a non-Chinese feature extraction model;
and 5, constructing a non-Chinese speech recognition system according to the non-Chinese feature extraction model and the non-Chinese acoustic model.
The invention has the beneficial effects that: the non-Chinese speech recognition system is constructed by processing the Chinese acoustic model and the Chinese feature extraction model to obtain a non-Chinese feature extraction model and a non-Chinese acoustic model and constructing the non-Chinese speech recognition system according to the non-Chinese feature extraction model and the non-Chinese acoustic model, so that an effective non-Chinese speech recognition system can be quickly constructed by utilizing the existing Chinese speech resources and trained models thereof and a small amount of necessary language data resources, and cost and time overhead are effectively reduced.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, before step 5, the method further comprises: step 6, enhancing the non-Chinese acoustic model by utilizing cross-language factors, wherein the cross-language factors are language-independent factors and comprise: an environment factor, a channel factor, and a speaker factor.
Further, the Chinese feature extraction model and the non-Chinese feature extraction model are each formed by a deep neural network (DNN) or a convolutional neural network (CNN), and the Chinese acoustic model and the non-Chinese acoustic model are each formed by a recurrent neural network (RNN).
Further, step 3 comprises:
step 3.1, processing the Chinese acoustic model using an i-vector algorithm to obtain a non-Chinese acoustic model; or,
step 3.2, processing the Chinese acoustic model using a CNN- or RNN-based autoencoder to obtain a non-Chinese acoustic model.
Further, step 4 comprises:
step 4.1, directly copying the Chinese feature extraction model and using the copy as the non-Chinese feature extraction model; or,
step 4.2, processing the Chinese feature extraction model with an objective-function constraint method to obtain a non-Chinese feature extraction model.
Further, the objective function is:
L(x; w) = H(x; w) + Σ_x ||h_c(x) − h_j(x)||²
wherein H(x; w) is the objective function of conventional neural network training, and h_c(x) and h_j(x) are the vectors of activation values of all hidden nodes of the Chinese and non-Chinese feature extraction networks, respectively, for training sample x.
Another technical solution of the present invention for solving the above technical problems is as follows: a non-Chinese speech recognition system constructed by the construction method of a non-Chinese speech recognition system according to any one of the above embodiments.
Another technical solution of the present invention for solving the above technical problems is as follows: a non-Chinese speech recognition method, comprising:
step 1, acquiring a speech signal to be recognized;
step 2, determining, using the non-Chinese speech recognition system in the above embodiment, whether the speech signal to be recognized is a non-Chinese speech signal;
and step 3, outputting a speech recognition result.
The invention has the beneficial effects that: with the non-Chinese speech recognition system, it can be conveniently determined whether an acquired speech signal to be recognized is a non-Chinese speech signal.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, step 2 comprises:
step 2.1, extracting speech features from the speech signal to be recognized;
and step 2.2, inputting the extracted speech features into the non-Chinese speech recognition system, comparing them against the decoding graph for non-Chinese speech recognition, and determining whether the speech signal to be recognized is a non-Chinese speech signal.
Another technical solution of the present invention for solving the above technical problems is as follows: a non-Chinese speech recognition apparatus, comprising:
the acquisition module is used for acquiring a speech signal to be recognized;
a speech recognition module, configured to determine whether the speech signal to be recognized acquired by the acquisition module is a non-Chinese speech signal by using the non-Chinese speech recognition system in the foregoing embodiment;
and the output module is used for outputting the speech recognition result determined by the speech recognition module.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention or in the description of the prior art will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for constructing a non-Chinese speech recognition system according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram of a method of constructing a non-Chinese speech recognition system in accordance with another embodiment of the present invention;
FIG. 3 is a schematic flow chart of a non-Chinese speech recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart diagram of a non-Chinese speech recognition method according to another embodiment of the present invention;
fig. 5 is a schematic block diagram of a non-chinese speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
A method 100 of constructing a non-Chinese speech recognition system, as shown in fig. 1, includes:
110. and extracting the voice characteristics from the Chinese voice data of the Chinese language database by using the Chinese characteristic extraction model.
Specifically, in this embodiment, a speech feature describes the spectral and signal characteristics of the physical speech signal; feature types may include MFCC, PLP, FBANK, and the like, but the embodiment of the present invention is not limited thereto.
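By way of illustration only (not part of the claimed method), the following sketch computes FBANK-style log mel filterbank features in plain Python. A naive DFT is used for clarity; real systems use an FFT, and the sample rate, window length, and filter count here are illustrative assumptions rather than values from the patent.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sample_rate=8000, frame_len=64, hop=32, n_fft=64, n_mels=8):
    """Compute log mel filterbank (FBANK) features, frame by frame."""
    # 1. Split the signal into overlapping frames and apply a Hamming window.
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] *
                       (0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)))
                       for n in range(frame_len)])
    # 2. Naive DFT power spectrum per frame (educational; use an FFT in practice).
    n_bins = n_fft // 2 + 1
    spectra = []
    for frame in frames:
        mags = []
        for k in range(n_bins):
            re = sum(frame[n] * math.cos(2 * math.pi * k * n / n_fft)
                     for n in range(frame_len))
            im = -sum(frame[n] * math.sin(2 * math.pi * k * n / n_fft)
                      for n in range(frame_len))
            mags.append(re * re + im * im)  # power in bin k
        spectra.append(mags)
    # 3. Triangular mel filters spanning 0 Hz .. Nyquist, then log energies.
    mel_pts = [mel_to_hz(i * hz_to_mel(sample_rate / 2) / (n_mels + 1))
               for i in range(n_mels + 2)]
    bins = [int(n_fft * f / sample_rate) for f in mel_pts]
    feats = []
    for power in spectra:
        row = []
        for m in range(1, n_mels + 1):
            e = 0.0
            for k in range(bins[m - 1], bins[m + 1] + 1):
                if k >= n_bins:
                    break
                if k < bins[m]:
                    w = (k - bins[m - 1]) / max(1, bins[m] - bins[m - 1])
                else:
                    w = (bins[m + 1] - k) / max(1, bins[m + 1] - bins[m])
                e += w * power[k]
            row.append(math.log(e + 1e-10))  # log energy per mel band
        feats.append(row)
    return feats
```

Each input signal thus becomes a sequence of short-time feature vectors, which is the form consumed by the feature extraction and acoustic models described below.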
120. Establishing a Chinese acoustic model from the extracted speech features.
130. Processing the Chinese acoustic model to obtain a non-Chinese acoustic model.
Specifically, in this embodiment, the non-Chinese language may include Japanese, Korean, English, and the like, which is not limited by the embodiment of the present invention.
140. Processing the Chinese feature extraction model to obtain a non-Chinese feature extraction model.
Specifically, in this embodiment, part of the structure of the Chinese acoustic model may be trimmed and used as the initial structure of the non-Chinese acoustic model. Trimming here refers to pruning: a language model derived from a large corpus of natural-language text forms a very large statistical and distribution space, and running it in real time consumes substantial computing resources (memory, CPU, GPU, etc.) and time. A speech recognition application therefore prunes this large language model by directly deleting entries that are statistically rare, in exchange for a faster response. This saves considerable computing resources, at the cost of a lower recognition rate for some rarely occurring word pronunciations.
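The pruning idea can be sketched minimally as follows (illustrative only: production systems typically prune n-gram models by probability or relative-entropy criteria rather than raw counts, and the threshold here is an assumption):

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Count bigrams over a tokenized corpus (a list of token lists)."""
    counts = Counter()
    for sentence in corpus:
        for a, b in zip(sentence, sentence[1:]):
            counts[(a, b)] += 1
    return counts

def prune(counts, min_count=2):
    """Delete statistically rare bigrams to shrink the model,
    trading some coverage of rare word sequences for speed and memory."""
    return Counter({bg: c for bg, c in counts.items() if c >= min_count})
```

After pruning, the smaller model responds faster at decode time, exactly the trade-off described above.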
150. Constructing a non-Chinese speech recognition system from the non-Chinese feature extraction model and the non-Chinese acoustic model.
Specifically, in this embodiment, the Chinese feature extraction model and the non-Chinese feature extraction model are each formed by a deep neural network (DNN) or a convolutional neural network (CNN), and the Chinese acoustic model and the non-Chinese acoustic model are each formed by a recurrent neural network (RNN).
The construction method provided in the above embodiment processes the Chinese acoustic model and the Chinese feature extraction model to obtain a non-Chinese acoustic model and a non-Chinese feature extraction model, and constructs the non-Chinese speech recognition system from them. An effective non-Chinese speech recognition system can thus be quickly constructed from existing Chinese speech resources, the models trained on them, and only a small amount of data in the target language, effectively reducing cost and time overhead.
Specifically, in this embodiment, an i-vector algorithm may be used in step 130 to process the Chinese acoustic model to obtain the non-Chinese acoustic model.
The i-vector algorithm characterizes all factors related to the long-term characteristics of speech as a low-dimensional feature vector. Since pronunciation content is a short-term characteristic, this vector contains no pronunciation-content information and is therefore language-independent. Long-term features are obtained by re-segmenting the short-time-framed speech with a long-term window and then analyzing the re-segmented speech.
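The long-term re-segmentation just described can be sketched as follows. This is a deliberate simplification: a real i-vector extractor accumulates Baum-Welch statistics against a universal background model, whereas here each long window is summarized only by its mean feature vector, and the window sizes are illustrative assumptions.

```python
def long_term_segments(frames, window=10, hop=5):
    """Re-segment short-time feature frames with a long-term window and
    summarize each segment by its mean vector, discarding the short-term
    (pronunciation-content) variation within the window."""
    segments = []
    for start in range(0, len(frames) - window + 1, hop):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        segments.append([sum(f[d] for f in chunk) / window
                         for d in range(dim)])
    return segments
```

The per-window means retain slowly varying, largely language-independent characteristics (speaker, channel, environment) while averaging out the short-term content.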
Alternatively, step 130 may use a CNN- or RNN-based autoencoder to process the Chinese acoustic model to obtain the non-Chinese acoustic model.
A CNN- or RNN-based autoencoder expresses the speech signal as a feature sequence, compresses it into a low-dimensional feature vector, and then regenerates the original feature sequence from that vector. As with the i-vector algorithm, the compressed low-dimensional vector does not express short-time pronunciation features and is therefore language-independent.
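To make the bottleneck idea concrete, here is a toy linear autoencoder in plain Python. This is an illustrative assumption, not the patent's implementation: real systems use deep CNN/RNN encoders over whole feature sequences, whereas this sketch compresses a single feature vector through a low-dimensional bottleneck and trains by gradient descent on the reconstruction error.

```python
import random

def make_autoencoder(in_dim, bottleneck, seed=0):
    """Random small-weight encoder/decoder matrices."""
    rnd = random.Random(seed)
    W_enc = [[rnd.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(bottleneck)]
    W_dec = [[rnd.uniform(-0.1, 0.1) for _ in range(bottleneck)] for _ in range(in_dim)]
    return W_enc, W_dec

def encode(W_enc, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_enc]

def decode(W_dec, h):
    return [sum(w * hi for w, hi in zip(row, h)) for row in W_dec]

def train_step(W_enc, W_dec, x, lr=0.05):
    """One gradient-descent step on the reconstruction error ||dec(enc(x)) - x||^2."""
    h = encode(W_enc, x)
    y = decode(W_dec, h)
    err = [yi - xi for yi, xi in zip(y, x)]       # dL/dy (up to a factor of 2)
    # Backpropagate the error through the decoder BEFORE updating it.
    dh = [sum(err[i] * W_dec[i][j] for i in range(len(W_dec)))
          for j in range(len(W_enc))]
    for i in range(len(W_dec)):                   # decoder gradient: err[i] * h[j]
        for j in range(len(W_dec[i])):
            W_dec[i][j] -= lr * err[i] * h[j]
    for j in range(len(W_enc)):                   # encoder gradient: dh[j] * x[k]
        for k in range(len(W_enc[j])):
            W_enc[j][k] -= lr * dh[j] * x[k]
    return sum(e * e for e in err)
```

After training, `encode` yields the low-dimensional bottleneck vector, the language-independent representation the text refers to.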
In the conversion from speech signal to text (i.e., the recognition process), the signal is developed hierarchically: lower levels are closer to the original speech source, and higher levels are closer to the final recognition result, namely text. Because the available non-Chinese speech corpus is insufficient, the speech recognition technology and resources accumulated for Chinese can be reused; the lower-level representations reflect characteristics common to human speech and are independent of the components that vary with language, so the Chinese speech system can be reused at these deeper levels.
In addition, in this embodiment, the Chinese feature extraction model may be directly copied in step 140 and the copy used as the non-Chinese feature extraction model. Alternatively,
in step 140, the Chinese feature extraction model may be processed according to an objective-function constraint method to obtain the non-Chinese feature extraction model. The objective function is:
L(x; w) = H(x; w) + Σ_x ||h_c(x) − h_j(x)||²
wherein H(x; w) is the objective function of conventional neural network training, and h_c(x) and h_j(x) are the vectors of activation values of all hidden nodes of the Chinese and non-Chinese feature extraction networks, respectively, for training sample x.
It should be noted that, in this embodiment, the non-Chinese training data is passed simultaneously through the Chinese feature extraction model and the non-Chinese feature extraction model, and the deviation between the two at each hidden-layer node of the neural network is added to the training objective function as a constraint term. That is, training the non-Chinese acoustic model optimizes not only the acoustic model output target (i.e., H(x; w)) but also keeps the feature extraction result as close as possible to the output of the Chinese feature extraction model. The feature extraction knowledge learned on Chinese is thus passed to the non-Chinese feature extraction model through the constraint term. Compared with direct copying, this method balances minimizing the model's classification error against retaining what was learned from Chinese.
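The constrained objective L(x; w) = H(x; w) + Σ_x ||h_c(x) − h_j(x)||² can be computed directly. In this sketch the function names and data shapes are illustrative assumptions; the base loss and the hidden activations would come from the actual networks during training.

```python
def constrained_loss(base_loss, hidden_cn, hidden_noncn):
    """L(x; w) = H(x; w) + sum over samples of ||h_c(x) - h_j(x)||^2.

    base_loss    : H(x; w), the usual training objective on the batch
    hidden_cn    : per-sample hidden activation vectors h_c(x) from the
                   (frozen) Chinese feature extraction model
    hidden_noncn : per-sample hidden activations h_j(x) from the
                   non-Chinese model being trained
    """
    penalty = 0.0
    for hc, hj in zip(hidden_cn, hidden_noncn):
        penalty += sum((a - b) ** 2 for a, b in zip(hc, hj))
    return base_loss + penalty
```

When the two hidden representations agree, the penalty vanishes and the loss reduces to the conventional objective; the further the non-Chinese model drifts from the Chinese one, the larger the added term.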
Optionally, in an embodiment, as shown in fig. 2, before step 150, the method 100 further includes:
160. Enhancing the non-Chinese acoustic model with cross-language factors.
Wherein, the cross-language factor is a language independent factor, including: an environment factor, a channel factor, and a speaker factor, but embodiments of the invention are not limited in this respect.
Because Chinese has a large amount of training corpus covering various complexities such as channels, speakers, and accents, the Chinese feature extraction model is highly robust to these language-independent factors. In this embodiment, cross-language factors derived from the Chinese feature extraction model may be used to enhance the non-Chinese acoustic model.
Specifically, a factor model of the language-independent factors is trained on Chinese speech data, and this factor model is then used to generate language-independent factors during training and decoding of the non-Chinese recognition system.
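One of the simplest possible factor models of this kind, shown purely as a hedged illustration (a cepstral-mean-style simplification, not the patent's actual factor model), estimates a channel/environment factor from Chinese data and removes it from utterances in any language:

```python
def estimate_channel_factor(chinese_utterances):
    """Estimate a language-independent channel/environment factor as the
    global mean feature vector over Chinese training utterances
    (each utterance is a list of feature frames)."""
    dim = len(chinese_utterances[0][0])
    total = [0.0] * dim
    n = 0
    for utt in chinese_utterances:
        for frame in utt:
            for d in range(dim):
                total[d] += frame[d]
            n += 1
    return [t / n for t in total]

def apply_factor(utterance, factor):
    """Remove the factor from a (possibly non-Chinese) utterance,
    normalizing away channel/environment variation."""
    return [[f - c for f, c in zip(frame, factor)] for frame in utterance]
```

Because the factor is estimated from plentiful Chinese data but describes language-independent variation, the same normalization can be applied during non-Chinese training and decoding.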
It should be understood that, in this embodiment, the step 160 and the step 140 are not strictly in an execution order, and may be executed in parallel, or executed sequentially, and this is not limited in this embodiment of the present invention.
It should be understood that, in the embodiments of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
In addition, the present invention also provides a non-Chinese speech recognition system, which is constructed by the construction method of the non-Chinese speech recognition system as shown in fig. 1 and 2.
The method for constructing a non-Chinese speech recognition system according to an embodiment of the present invention has been described in detail above with reference to fig. 1 and 2; a non-Chinese speech recognition method according to an embodiment of the present invention is described in detail below with reference to fig. 3 and 4.
A non-Chinese speech recognition method 300, as shown in fig. 3, includes:
310. and acquiring a voice signal to be recognized.
320. A non-chinese speech recognition system is utilized to determine whether a speech signal to be recognized is a non-chinese speech signal.
Specifically, in this embodiment, the non-chinese speech recognition system is a non-chinese speech recognition system constructed by a method of constructing a non-chinese speech recognition system as shown in fig. 1 and 2 above. In addition, the non-chinese language may include japanese, korean, english, etc., and the embodiment of the present invention is not limited thereto.
330. And outputting a voice recognition result.
With the non-Chinese speech recognition system, the method provided in the above embodiment can conveniently determine whether an acquired speech signal to be recognized is a non-Chinese speech signal.
Optionally, in an embodiment, as shown in fig. 4, step 320 includes:
321. Extracting speech features from the speech signal to be recognized.
322. Inputting the extracted speech features into the non-Chinese speech recognition system, comparing them against the decoding graph for non-Chinese speech recognition, and determining whether the speech signal to be recognized is a non-Chinese speech signal.
Specifically, in this embodiment, a large amount of accurately labeled original speech is processed by various technical means to generate the acoustic model, a large amount of original text corpus is used to train the language model, and the two are combined with a pronunciation dictionary to generate the decoding graph used for recognition.
Features are extracted from the speech to be recognized and matched against the decoding graph to form the recognition result. Specifically: after front-end signal processing, endpoint detection, and the like, speech features (of types such as MFCC, PLP, and FBANK) are extracted frame by frame and sent to the decoder, which, guided jointly by the acoustic model, the language model, and the pronunciation dictionary, finds the best-matching word sequence and outputs it as the recognition result.
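The joint guidance of acoustic and language-model scores can be sketched as a toy Viterbi search over per-frame word scores. All scores and the back-off value below are illustrative assumptions; a real decoder searches a compiled WFST decoding graph that also incorporates the pronunciation dictionary.

```python
def viterbi_decode(frame_scores, lm, words):
    """Toy decoder: frame_scores[t][w] is the acoustic log-score of word w
    at frame t; lm[(a, b)] is a bigram language-model log-score (with a
    crude -5.0 back-off for unseen bigrams).  Viterbi dynamic programming
    finds the word sequence maximizing acoustic + LM log-probability."""
    best = {w: (frame_scores[0][w], [w]) for w in words}
    for t in range(1, len(frame_scores)):
        new_best = {}
        for w in words:
            prev_w, (score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + lm.get((kv[0], w), -5.0))
            new_best[w] = (score + lm.get((prev_w, w), -5.0) + frame_scores[t][w],
                           path + [w])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]
```

Even in this toy form, a frame whose acoustic evidence is ambiguous can be resolved by the language model, which is exactly the "common guidance" described above.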
A non-Chinese speech recognition method according to an embodiment of the present invention has been described in detail above with reference to fig. 3 and 4; a non-Chinese speech recognition apparatus according to an embodiment of the present invention is described in detail below with reference to fig. 5.
A non-Chinese speech recognition apparatus 500, as shown in fig. 5, includes: an acquisition module 510, a speech recognition module 520, and an output module 530, wherein:
the obtaining module 510 is configured to obtain a speech signal to be recognized.
The speech recognition module 520 is configured to determine, using a non-Chinese speech recognition system, whether the speech signal to be recognized acquired by the acquisition module 510 is a non-Chinese speech signal. The non-Chinese speech recognition system is constructed by the construction method shown in fig. 1 and fig. 2.
The output module 530 is configured to output the speech recognition result determined by the speech recognition module 520.
It should be understood that the non-Chinese speech recognition apparatus 500 according to the embodiment of the present invention may correspond to an execution body of the non-Chinese speech recognition method 300 according to the embodiment of the present invention, and that the above and other operations and/or functions of each module in the apparatus 500 respectively implement the corresponding processes of the methods in fig. 3 and fig. 4; for brevity, they are not described again here.
Specifically, in this embodiment, the speech recognition module 520 may be configured to extract speech features from the speech signal to be recognized acquired by the acquisition module 510, input the extracted features into the non-Chinese speech recognition system, compare them against the decoding graph for non-Chinese speech recognition, and determine whether the speech signal is a non-Chinese speech signal.
It should be understood that, in the embodiments of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; the components and steps of the examples have been described generally in terms of their functionality in the foregoing description in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and design constraints of the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for constructing a non-Chinese speech recognition system, characterized by comprising the following steps:
step 1, extracting speech features from Chinese speech data in a Chinese corpus by using a Chinese feature extraction model;
step 2, establishing a Chinese acoustic model from the extracted speech features;
step 3, processing the Chinese acoustic model to obtain a non-Chinese acoustic model;
step 4, processing the Chinese feature extraction model to obtain a non-Chinese feature extraction model;
step 5, constructing the non-Chinese speech recognition system from the non-Chinese feature extraction model and the non-Chinese acoustic model.
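As an illustration only (not part of the claims), the five construction steps above can be sketched as a pipeline. All function names and the stub "models" (a scalar feature scale, the mean as an acoustic model, an additive adaptation shift) are hypothetical placeholders standing in for the real neural-network components:

```python
def extract_features(audio, feat_model):
    """Step 1: apply the (stub) Chinese feature extraction model to one utterance."""
    return [feat_model * x for x in audio]

def train_acoustic_model(features):
    """Step 2: fit a trivial stand-in 'acoustic model' (here: the mean feature value)."""
    flat = [v for utt in features for v in utt]
    return sum(flat) / len(flat)

def adapt(model, shift):
    """Steps 3-4: derive a non-Chinese model from the Chinese one (stub transform)."""
    return model + shift

def build_non_chinese_system(chinese_audio, cn_feat_model, shift):
    """Step 5: assemble the non-Chinese system from the adapted models."""
    feats = [extract_features(utt, cn_feat_model) for utt in chinese_audio]
    cn_acoustic = train_acoustic_model(feats)
    non_cn_acoustic = adapt(cn_acoustic, shift)
    non_cn_feat_model = cn_feat_model  # cf. claim 5, step 4.1: direct copy
    return non_cn_feat_model, non_cn_acoustic
```

In the patent the feature extractor and acoustic model are neural networks; this sketch only shows the ordering and data flow of the five steps.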
2. The method according to claim 1, further comprising, before step 5:
step 6, enhancing the non-Chinese acoustic model by using cross-language factors, the cross-language factors being language-independent factors that include an environment factor, a channel factor and a speaker factor.
3. The method according to claim 1 or 2, wherein the Chinese feature extraction model and the non-Chinese feature extraction model are each formed by a deep neural network (DNN) or a convolutional neural network (CNN), and the Chinese acoustic model and the non-Chinese acoustic model are each formed by a recurrent neural network (RNN).
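For illustration only, a DNN feature extractor of the kind named in claim 3 is a stack of fully-connected layers applied to each acoustic frame. The layer sizes, the ReLU nonlinearity and the function name below are illustrative assumptions; the patent does not fix a specific architecture:

```python
import numpy as np

def relu(x):
    # Elementwise rectified linear activation.
    return np.maximum(x, 0.0)

def dnn_extract(frame, weights, biases):
    """Forward pass of a small fully-connected (DNN) feature extractor:
    each layer computes relu(W @ h + b) on the previous layer's output."""
    h = np.asarray(frame, dtype=float)
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h
```

A CNN extractor would replace the matrix products with convolutions over time/frequency, and the acoustic model of claim 3 would consume these frame-level features with an RNN.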
4. The method according to claim 3, wherein step 3 comprises:
step 3.1, processing the Chinese acoustic model with an i-vector algorithm to obtain the non-Chinese acoustic model; or,
step 3.2, processing the Chinese acoustic model with an auto-encoder based on a deep neural network (DNN) or a convolutional neural network (CNN) to obtain the non-Chinese acoustic model.
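As an illustrative aside (not the patent's implementation), the auto-encoder of step 3.2 encodes its input to a hidden code and decodes it back, trained to minimize reconstruction error. The single hidden layer, tanh nonlinearity and linear decoder below are assumed choices for the sketch:

```python
import numpy as np

def autoencoder_reconstruction_error(x, W_enc, W_dec):
    """One pass of a single-hidden-layer auto-encoder and its squared
    reconstruction error: encode with tanh(W_enc @ x), decode linearly."""
    x = np.asarray(x, dtype=float)
    h = np.tanh(W_enc @ x)      # encoder: input -> hidden code
    x_hat = W_dec @ h           # decoder: hidden code -> reconstruction
    return float(np.sum((x - x_hat) ** 2))
```

The i-vector alternative of step 3.1 instead summarizes an utterance as a low-dimensional vector in a total-variability subspace; it is omitted here as it requires a trained subspace model.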
5. The method according to claim 4, wherein step 4 comprises:
step 4.1, directly copying the Chinese feature extraction model and using the copy as the non-Chinese feature extraction model; or,
step 4.2, processing the Chinese feature extraction model according to an objective function constraint method to obtain the non-Chinese feature extraction model.
6. The method of claim 5, wherein the objective function is:
L(x; w) = H(x; w) + Σ_x ||h_c(x) − h_j(x)||²
wherein H(x; w) is the objective function of conventional neural network training, and h_c(x) and h_j(x) are the activation-value vectors of all hidden nodes in the Chinese and non-Chinese feature extraction networks, respectively, for the training sample x.
7. A non-Chinese speech recognition system constructed by the method for constructing a non-Chinese speech recognition system according to any one of claims 1-6.
8. A method for non-Chinese speech recognition, characterized by comprising:
step 1, acquiring a speech signal to be recognized;
step 2, determining whether the speech signal to be recognized is a non-Chinese speech signal by using the non-Chinese speech recognition system according to claim 7;
step 3, outputting a speech recognition result.
9. The method according to claim 8, wherein step 2 comprises:
step 2.1, extracting speech features from the speech signal to be recognized;
step 2.2, inputting the extracted speech features into the non-Chinese speech recognition system, comparing them against a decoding graph for non-Chinese speech recognition, and determining whether the speech signal to be recognized is a non-Chinese speech signal.
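Purely as an illustrative sketch of the two recognition sub-steps above: extract features, score them against a decoding graph, and decide. The callables `extract` and `graph_score` and the decision `threshold` are hypothetical stand-ins for the patent's feature extraction model and decoding-graph comparison:

```python
def recognize(signal, extract, graph_score, threshold=0.5):
    """Step 2.1: extract features from the signal.
    Step 2.2: score the features against a decoding graph and decide
    whether the signal is non-Chinese speech."""
    feats = extract(signal)
    score = graph_score(feats)
    return score >= threshold, score
```

In a real system `graph_score` would be a decoder searching a weighted finite-state decoding graph, not a scalar scoring function.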
10. A non-Chinese speech recognition apparatus, comprising:
an acquisition module for acquiring a speech signal to be recognized;
a speech recognition module for determining, by using the non-Chinese speech recognition system according to claim 7, whether the speech signal acquired by the acquisition module is a non-Chinese speech signal;
and an output module for outputting the speech recognition result determined by the speech recognition module.
CN201710156620.8A 2017-03-16 2017-03-16 non-Chinese speech recognition method, system and construction method thereof Active CN108630192B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710156620.8A CN108630192B (en) 2017-03-16 2017-03-16 non-Chinese speech recognition method, system and construction method thereof


Publications (2)

Publication Number Publication Date
CN108630192A CN108630192A (en) 2018-10-09
CN108630192B true CN108630192B (en) 2020-06-26

Family

ID=63687365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710156620.8A Active CN108630192B (en) 2017-03-16 2017-03-16 non-Chinese speech recognition method, system and construction method thereof

Country Status (1)

Country Link
CN (1) CN108630192B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369978B (en) * 2018-12-26 2024-05-17 北京搜狗科技发展有限公司 Data processing method and device for data processing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727901A (en) * 2009-12-10 2010-06-09 清华大学 Method for recognizing Chinese-English bilingual voice of embedded system
CN103400577A (en) * 2013-08-01 2013-11-20 百度在线网络技术(北京)有限公司 Acoustic model building method and device for multi-language voice identification
CN103839545A (en) * 2012-11-23 2014-06-04 三星电子株式会社 Apparatus and method for constructing multilingual acoustic model
CN106469552A (en) * 2015-08-20 2017-03-01 三星电子株式会社 Speech recognition apparatus and method
CN106486115A (en) * 2015-08-28 2017-03-08 株式会社东芝 Improve method and apparatus and audio recognition method and the device of neutral net language model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI349925B (en) * 2008-01-10 2011-10-01 Delta Electronics Inc Speech recognition device and method thereof


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Long, Yang Jun'an, Chen Lei, Lin Wei; "Modeling method for Chinese language models based on recurrent neural networks" (《基于循环神经网络的汉语语言模型建模方法》); Technical Acoustics (《声学技术》); 2015-10-15; Vol. 34, No. 5; pp. 431-436 *

Also Published As

Publication number Publication date
CN108630192A (en) 2018-10-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant