CN114664292A - Model training method, model training device, speech recognition method, speech recognition device, speech recognition equipment and readable storage medium - Google Patents


Info

Publication number
CN114664292A
CN114664292A (application CN202011527010.2A)
Authority
CN
China
Prior art keywords
layer
hidden
network
voice recognition
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011527010.2A
Other languages
Chinese (zh)
Other versions
CN114664292B (en
Inventor
杨斌
王洪斌
蒋宁
吴海英
杨春勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd filed Critical Mashang Xiaofei Finance Co Ltd
Priority to CN202011527010.2A priority Critical patent/CN114664292B/en
Publication of CN114664292A publication Critical patent/CN114664292A/en
Application granted granted Critical
Publication of CN114664292B publication Critical patent/CN114664292B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a model training method, a speech recognition method, corresponding devices, and a readable storage medium, and relates to the technical field of speech recognition, with the aim of improving recognition speed. The method comprises the following steps: acquiring training data; and training a target speech recognition network with the training data. The target speech recognition network comprises an input layer, a hidden-layer network, and an output layer, where the hidden-layer network comprises at least two identical groups of hidden layers connected in parallel. The at least two identical groups are obtained by selecting at least one group of hidden layers from the hidden layers of a first speech recognition network as a target hidden layer and copying that target hidden layer, where the first speech recognition network is a speech recognition network with a time-delay property. Embodiments of the invention can improve recognition speed.

Description

Model training method, model training device, speech recognition method, speech recognition device, speech recognition equipment and readable storage medium
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a model training method, a speech recognition method, a model training device, a speech recognition device, an electronic device, and a readable storage medium.
Background
In the field of speech recognition, common speech recognition models include: HMM-GMM (Hidden Markov Model combined with a Gaussian Mixture Model), DNN (Deep Neural Networks), CNN (Convolutional Neural Networks), TDNN (Time Delay Neural Networks), TDNN-F (Factorized Time Delay Neural Networks), and the like.
However, the recognition speed of these speech recognition models still leaves room for improvement.
Disclosure of Invention
Embodiments of the invention provide a model training method, a speech recognition method, corresponding devices, an electronic device, and a readable storage medium, with the aim of improving recognition speed.
In a first aspect, an embodiment of the present invention provides a model training method, including:
acquiring training data;
training a target voice recognition network by using the training data, wherein the target voice recognition network comprises an input layer, a hidden layer network and an output layer, the hidden layer network comprises at least two groups of same hidden layers, and the at least two groups of same hidden layers are connected in parallel;
wherein at least one group of hidden layers is selected from the hidden layers of a first speech recognition network to serve as a target hidden layer, and the target hidden layer is copied to obtain the at least two identical groups of hidden layers; the first speech recognition network is a speech recognition network with a time-delay property.
In a second aspect, an embodiment of the present invention further provides a speech recognition method, where the method includes:
acquiring a voice signal to be recognized;
inputting the voice signal to be recognized into a voice recognition network; the voice recognition network comprises an input layer, a hidden layer network and an output layer, wherein the hidden layer network comprises at least two groups of same hidden layers, the at least two groups of same hidden layers are connected in parallel, at least one group of hidden layers is selected from the hidden layers of the first voice recognition network to serve as a target hidden layer, and the target hidden layer is copied to obtain the at least two groups of same hidden layers; the first voice recognition network is a voice recognition network with a time delay property;
and utilizing the output of the voice recognition network as a voice recognition result.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps of the method of the first or second aspect as described above when executing the program.
In a fourth aspect, the embodiments of the present invention also provide a readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the steps in the method of the first aspect or the second aspect as described above.
In the embodiment of the invention, training data is obtained and used to train the target speech recognition network. The target speech recognition network comprises an input layer, a hidden-layer network, and an output layer; the hidden-layer network comprises at least two identical groups of hidden layers connected in parallel. At least one group of hidden layers is selected from the hidden layers of a first speech recognition network to serve as a target hidden layer, and the target hidden layer is copied to obtain the at least two identical groups; the first speech recognition network is a speech recognition network with a time-delay property. Because the hidden layers adopt identical groups connected in parallel, the parameter dimensionality of the hidden layers in the target speech recognition network can be reduced. Thus, when the target speech recognition network is used for data processing, the amount of computation is reduced, the computation speed of the target speech recognition network is higher, and the recognition speed improves accordingly.
Drawings
FIG. 1 is a flow chart of a model training method provided by an embodiment of the invention;
FIG. 2 is a schematic structural diagram of TDNN-F under different step parameters;
FIG. 3 is a schematic structural diagram of a multilayer TDNN-F;
FIG. 4 is a schematic diagram of a TDNN-F network in the prior art;
FIG. 5 is a schematic diagram of an improved TDNN-F network according to an embodiment of the present invention;
FIG. 6 is a second schematic diagram of an improved TDNN-F network according to an embodiment of the present invention;
FIG. 7 is a flow chart of a speech recognition method provided by an embodiment of the invention;
FIG. 8 is a block diagram of a model training apparatus according to an embodiment of the present invention;
fig. 9 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
The term "and/or" in the embodiments of the present invention describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the preceding and following objects.
In the embodiments of the present application, the term "plurality" means two or more, and other terms are similar thereto.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart of a model training method provided in an embodiment of the present invention, and as shown in fig. 1, includes the following steps:
step 101, training data is obtained.
And step 102, training a target voice recognition network by using the training data.
Speech data for training the model can be obtained from a database and processed to extract audio spectrum features, which are used as the training data. The output of the target speech recognition network is a set of output probabilities corresponding to a number of basic speech units. During training, the output probabilities of the target speech recognition network are compared with expected output probabilities computed in advance from the labeled data, and the network weights are updated iteratively to train the network.
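Comparing the network's output probabilities against expected probabilities derived from labels is, in essence, a softmax-plus-cross-entropy objective. A minimal stdlib-only sketch of that comparison step (the function names are illustrative, not from the patent):

```python
import math

def softmax(logits):
    """Convert raw network outputs to probabilities over basic speech units."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(predicted, expected):
    """Loss between the network's output distribution and the label-derived one."""
    eps = 1e-12  # guard against log(0)
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, expected))

probs = softmax([2.0, 0.5, -1.0])             # network output for one frame
loss = cross_entropy(probs, [1.0, 0.0, 0.0])  # expected: first speech unit
```

In an actual training loop this loss would be backpropagated to update the network weights, as the paragraph above describes.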
The target voice recognition network comprises an input layer, a hidden layer network and an output layer, the hidden layer network comprises at least two sets of same hidden layers, the at least two sets of same hidden layers are connected in parallel, at least one set of hidden layers are selected from the hidden layers of the first voice recognition network to serve as the target hidden layer, and the target hidden layer is copied to obtain the at least two sets of same hidden layers. Wherein each set of hidden layers may comprise at least one hidden layer.
The first voice recognition network is a voice recognition network with a time delay property. For example, the first speech recognition network may be CNN, VDCNN (Very Deep Convolutional neural Networks), TDNN-F, etc., or the first speech recognition network may also be a variant composite structure with Convolutional neural Networks, such as a combination of CNN and TDNN-F, a combination of CNN and LSTM (Long Short-Term Memory network), a combination of TDNN-F and LSTM, etc.
The target hidden layer is obtained by selecting, from the hidden layers of the first speech recognition network, at least one group of hidden layers whose layer depth meets a preset requirement. The at least two identical groups of hidden layers are obtained by copying the network structure formed by the target hidden layer, so this network structure is the same as each copied group. The at least two identical groups are groups whose parameters have been adjusted, the parameters including at least the step, which is an integer. Empirically, the speech recognition network performs better when the step is 3, 6, or 9, so the steps of the copied groups can be chosen from these values. When there are several groups of hidden layers, the steps of different groups (or at least some of them) can take different values to improve the performance of the speech recognition network.
The parameters further include: hidden layer dimensions of at least two identical sets of hidden layers. And the hidden layer dimension of the at least two groups of same hidden layers is the quotient of the number of neurons included in the target hidden layer and the step length of the target hidden layer. The hidden layer network is obtained by connecting the at least two groups of hidden layers in parallel. The target voice recognition network is obtained by sequentially connecting an input layer, a hidden layer network, a normalization layer, a random loss layer, an activation layer and an output layer, and the input layer and the output layer of the target voice recognition network are correspondingly the same as those of the first voice recognition network. The input end of the hidden layer network is the same as the input end of the hidden layer with the lowest layer depth in the target hidden layer in the first voice recognition network, and the output end of the hidden layer network is the same as the output end of the hidden layer with the highest layer depth in the target hidden layer in the first voice recognition network.
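The dimension rule above (hidden-layer dimension equals the number of neurons in the target hidden layer divided by its step) can be sketched directly. The values 1536 and 3 anticipate the TDNN-F example discussed later; the function name is illustrative:

```python
def branch_dimension(target_neurons: int, step: int) -> int:
    """Hidden-layer dimension of each copied group: the quotient of the
    target hidden layer's neuron count and its step."""
    if target_neurons % step != 0:
        raise ValueError("neuron count must be divisible by the step")
    return target_neurons // step

# A 1536-neuron target hidden layer with step 3 yields 512-dim copies, so
# three parallel 512-dim groups keep the total neuron count at 1536.
dims = [branch_dimension(1536, 3) for _ in (3, 6, 9)]
```

Note that all copies share the same dimension (derived from the target layer's step) even though their own steps may later be set to different values such as 3, 6, and 9.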
Hereinafter, the construction process of the target voice recognition network is described in detail.
Specifically, the construction process of the target speech recognition network includes:
step 1021, selecting at least one group of hidden layers from the hidden layers of the first voice recognition network as a target hidden layer.
The first speech recognition network may include an input layer, at least one group of hidden layers, and an output layer, where each group of hidden layers may comprise at least one hidden layer. In the embodiment of the invention, at least one group of hidden layers whose layer depth meets a preset requirement is selected from the hidden layers of the first speech recognition network as the target hidden layer. "Meeting the preset requirement" means that the layer depth is greater than a certain preset depth, which can be set empirically. For example, if there are 5 hidden layers, the 3rd, 4th, and 5th hidden layers can be selected as the target hidden layer according to actual needs. Which layers are selected as the target hidden layer depends on the recognition task and on system performance constraints. The more target hidden layers are selected, the higher the accuracy of the resulting model. When selecting the target hidden layer, the hidden layers related to the high-order features of the acoustic model, i.e., the deep hidden layers, may be preferred.
Step 1022, at least two groups of the same hidden layers are obtained by using the target hidden layer.
In this step, the network structure formed by the target hidden layer is copied to obtain at least two groups of same hidden layers; wherein the network structure formed by the target hidden layer has the same structure as the at least two groups of hidden layers.
Since the target hidden layer may be one or more layers, if the target hidden layer is one layer, the one layer of target hidden layer may be formed with a network structure; if the target hidden layer is multi-layered, the multi-layered target hidden layer and the connection relationship between them may form a network structure. In the embodiment of the invention, the network structure formed by the target hidden layer is copied, and at least two groups of the same hidden layers can be obtained. In practice, each duplicate hidden layer of at least two identical sets of hidden layers may also be understood as comprising a network structure formed by the target hidden layer, however, the parameters (e.g. step size, hidden layer dimensions, etc.) of at least two identical sets of hidden layers may be different from those of the target hidden layer.
And 1023, obtaining a hidden layer network by using the at least two groups of same hidden layers.
In this step, the at least two groups of the same hidden layers are connected in parallel to obtain the hidden layer network. Namely, the input ends of at least two groups of hidden layers are connected together, and the data or signals output by the output ends are spliced to be output as a hidden layer network.
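The parallel connection described above — the groups share one input, and their outputs are spliced together — can be sketched with placeholder branch functions. The real branches are TDNN-F layer stacks; these linear stand-ins are illustrative only:

```python
def make_branch(out_dim, scale):
    """Stand-in for one copied hidden-layer group producing an out_dim vector."""
    def branch(x):
        return [scale * sum(x)] * out_dim  # placeholder computation
    return branch

def hidden_layer_network(branches, x):
    """Feed the same input to every parallel branch and splice the outputs."""
    spliced = []
    for branch in branches:
        spliced.extend(branch(x))  # concatenation = the "splicing" in the text
    return spliced

branches = [make_branch(512, s) for s in (1.0, 2.0, 3.0)]  # three parallel groups
out = hidden_layer_network(branches, [0.1, 0.2, 0.3])
```

The spliced output dimension is the sum of the branch dimensions (here 3 × 512 = 1536), which is what the following layer of the network receives.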
And 1024, obtaining a target voice recognition network by at least utilizing the input layer of the first voice recognition network, the hidden layer network and the output layer of the first voice recognition network.
Specifically, in this step, the input layer of the first speech recognition network, the hidden layer network, and the output layer of the first speech recognition network may be sequentially connected to obtain the target speech recognition network. In order to increase the robustness of the obtained identification model, in practical application, a normalization layer, a random loss layer, an activation layer and the like can be added. Specifically, the input layer of the first speech recognition network, the hidden layer network, the normalization layer, the random loss layer, the active layer, and the output layer of the first speech recognition network are sequentially connected to obtain the target speech recognition network.
The input end of the hidden layer network is the same as the input end of the hidden layer with the lowest layer depth in the target hidden layer in the first voice recognition network, and the output end of the hidden layer network is the same as the output end of the hidden layer with the highest layer depth in the target hidden layer in the first voice recognition network.
In practical applications, the target hidden layer may be all hidden layers in the first speech recognition network, or may be a partial hidden layer.
If the target hidden layer is all hidden layers in the first voice recognition network, the input end of the hidden layer network is connected with the input layer, and the output end of the hidden layer network is connected with the output layer. In order to increase the robustness of the obtained recognition model, in practical application, the input end of a hidden layer network is connected with the input layer of the first voice recognition network, and the output end of the hidden layer network is sequentially connected with a normalization layer, a random loss layer, an activation layer and the output layer of the first voice recognition network.
If the target hidden layer is a partial set of the hidden layers in the first speech recognition network, then the input end of the target hidden layer with the smallest layer depth is the input end of the hidden-layer network, and the output end of the target hidden layer with the largest layer depth is the output end of the hidden-layer network. For example, if the first speech recognition network has 5 hidden layers and layers 3, 4, and 5 are determined as the target hidden layer, then the input of layer 3 is the input of the hidden-layer network and the output of layer 5 is its output.
In this case, the hidden layers of the first speech recognition network other than the target hidden layer may be connected, together with the hidden-layer network, between the input layer and the output layer. For example, in one case, the input end of the hidden-layer network is connected to the input layer, the output end of the hidden-layer network is connected to the input ends of the other hidden layers, and the output end of the deepest of those other hidden layers may be connected in turn to the normalization layer, the random loss layer, the activation layer, and the output layer. In another case, the input end of the hidden-layer network is connected to the other hidden layers of the first speech recognition network, and its output end may be connected in turn to the normalization layer, the random loss layer, the activation layer, and the output layer.
In practical application, the input of the input layer is the training data; after obtaining the input information, the input layer provides it to the hidden layers. Whether a hidden layer belongs to the hidden-layer network or is one of the other hidden layers of the first speech recognition network, it serves the same purpose: extracting features. The difference is that hidden layers in different configurations receive different inputs. In short, each neural unit of a hidden layer applies different weights to its input and thus extracts features from a different angle. The output layer connects to the hidden layers and outputs the model result; its weights are adjusted so that it responds correctly to the stimuli of the different hidden-layer neurons and produces the output.
Because the hidden layers adopt identical groups connected in parallel, the parameter dimensionality of the hidden layers in the target speech recognition network can be reduced. Thus, when the target speech recognition network is used for data processing, the amount of computation is reduced, the computation speed of the target speech recognition network is higher, and the recognition speed improves.
For the convolutional neural network, the value of each output node of the convolutional layer depends on only one region of the input layer, and other input values outside the region do not influence the output value, and the region is the receptive field. In a specific application, in order to further improve the calculation speed and the recognition accuracy of the model and to model different input receptive fields, in the embodiment of the present invention, on the basis of the above embodiment, parameters of at least two sets of hidden layers may be modified or adjusted.
Specifically, the parameters may include the step, and further the hidden-layer dimensions of the at least two identical groups of hidden layers. The step is an integer; empirically, the speech recognition network performs better when the step is 3, 6, or 9, so the steps of the copied groups can be chosen from these values. When there are several groups of hidden layers, different groups (or at least some of them) can use different steps to improve the performance of the speech recognition network. The hidden-layer dimension of the at least two identical groups is the quotient of the number of neurons in the target hidden layer and the step of the target hidden layer.
Since the parallel connection would otherwise increase the number of hidden-layer neurons in the resulting network, modifying (reducing) the hidden-layer dimension keeps the number of hidden-layer neurons essentially the same before and after the change, i.e., keeps the model size consistent.
Hereinafter, how to improve the TDNN-F to obtain an improved TDNN-F model will be described by taking the first speech recognition model as TDNN-F as an example. By the scheme of the embodiment of the invention, the number of model parameters can be reduced, the model reasoning speed is increased, and the robustness of the model for voice modeling is improved.
A neural network model is generally formed by connecting and combining sublayers of different structures in various ways. Commonly used sublayer structures include fully connected layers, convolutional layers, recurrent layers, attention layers, and the like; common connection patterns include serial connection, parallel connection, residual networks, and the like. The connection relation between adjacent layers of a TDNN-F network is determined by the step parameter, and different step parameters correspond to different ranges of the input layer. Fig. 2 shows the structure of TDNN-F under different step parameters, and Fig. 3 shows the structure of a multi-layer TDNN-F.
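The effect of the step parameter on the input range can be made concrete with a receptive-field calculation. This sketch assumes each TDNN-F layer with step d splices frames at offsets {-d, 0, +d} — a common TDNN convention that the patent itself does not spell out:

```python
def receptive_field(steps):
    """Total input context (in frames) covered by stacked TDNN-F layers.

    Assumption: a layer with step d looks d frames to each side, so the
    context grows by d on the left and d on the right per layer.
    """
    reach = sum(steps)            # one-sided context of the whole stack
    return 2 * reach + 1          # frames from t-reach to t+reach inclusive

narrow = receptive_field([3, 3, 3])  # three layers, step 3 each
wide = receptive_field([3, 6, 9])    # mixed steps widen the receptive field
```

Under this assumption, larger or mixed steps let deeper layers cover a wider span of input frames, which is why different step parameters correspond to different input ranges.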
In general, the TDNN-F may include an input layer, at least one hidden layer, and an output layer. In the embodiment of the invention, the network structure of part of the hidden layer is copied to obtain the copied hidden layer. And then, connecting the copied hidden layers in parallel to obtain a hidden layer network. And then, the hidden layer dimensionality, the step length and the like of the hidden layer network can be modified, so that modeling of different input receptive fields can be realized, and the identification accuracy and the identification speed of the model can be improved. In addition, in order to increase the robustness of the model, a standardization layer, a random loss layer, an activation layer and the like can be added.
Fig. 4 is a schematic diagram of a TDNN-F network structure in the prior art, where each hidden layer includes 1536 neurons and the step is 3. This structure can be regarded as TDNN-F with a single step. Starting from Fig. 4, the hidden layers of layers 4, 5, and 6 are selected and copied to obtain three identical groups of hidden layers, which are connected in parallel to obtain a hidden-layer network. The input end of the hidden-layer network is the input end of layer 4, and its output end is the output end of layer 6; the output signals or data of the parallel groups are spliced to form the output of the hidden-layer network. In addition, to increase the robustness of the model, an activation layer (ReLU), a normalization layer (BN), and a random loss layer (Dropout) are connected between the output end of layer 6 and the output layer (softmax layer). The resulting improved TDNN-F network structure is shown in Fig. 5.
Since the number of neurons n in the TDNN-F network structure before the improvement is 1536 and the step is 3, the hidden-layer dimension of each copied hidden layer in the improved structure is 1536 / 3 = 512, and the steps may be set to 3, 6, and 9, respectively. Optionally, among the copied hidden layers, one copy keeps the same step as the copied target hidden layer.
Assume that the dimension of the hidden-layer input matrix in the TDNN-F network before the improvement is M × K and the hidden-layer parameter dimension is K × N, so the time complexity of the neural network is O(M × K × N). When S hidden-layer groups are connected in parallel, the parameter dimension of each group in the improved TDNN-F network is (K/S) × (N/S), and the time complexity becomes S × O(M × (K/S) × (N/S)) = O(M × K × N) / S, where M, K, N, and S are integers greater than 0. The time complexity of the improved TDNN-F network is therefore reduced.
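The S-fold reduction can be checked with a direct multiply-accumulate count; the function names and concrete sizes here are illustrative, with K = N = 1536 and S = 3 matching the TDNN-F example in the text:

```python
def dense_ops(m, k, n):
    """Multiply-accumulate count for an m-by-k input against a k-by-n weight matrix."""
    return m * k * n

def parallel_ops(m, k, n, s):
    """Same layer split into s parallel branches, each of dimension (k/s) x (n/s)."""
    assert k % s == 0 and n % s == 0
    return s * dense_ops(m, k // s, n // s)

M, K, N, S = 100, 1536, 1536, 3
baseline = dense_ops(M, K, N)          # single wide hidden layer
parallel = parallel_ops(M, K, N, S)    # three narrow branches in parallel
# The split divides the work by exactly S: parallel == baseline / S.
```

Each branch costs (1/S²) of the baseline, and there are S branches, hence the overall factor of 1/S.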
As shown in Fig. 6, in one embodiment, the hidden-layer dimension D of the TDNN-F is 512. The single-step TDNN-F network structure 61 uses 4 serially connected layers, each with step d = 3; the multi-step TDNN-F network structure (the hidden-layer network) 62 uses 7 serially connected layers per stream with 3 different steps, i.e., d = 3, 6, and 9. The multi-step TDNN-F streams receive the same input in parallel, and their outputs are spliced frame by frame; the spliced result is finally passed through a ReLU layer, a BN layer, and a Dropout layer into a softmax layer, whose output serves as the output of the model.
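The Fig. 6 pipeline can be followed with simple dimension bookkeeping. This trace is a sketch: the layer counts, stream count, and dimension D = 512 come from the text, while the frame count and the assumption that the input features are also D-dimensional are illustrative:

```python
def trace_shapes(frames, single_layers, stream_layers, streams, dim):
    """Track the (name, frames, feature_dim) shape through the pipeline."""
    trace = [("input", frames, dim)]
    for i in range(single_layers):                      # structure 61: serial, step 3
        trace.append((f"tdnnf_single_{i+1}", frames, dim))
    for i in range(stream_layers):                      # structure 62: each of 3 streams
        trace.append((f"tdnnf_multi_{i+1}", frames, dim))
    trace.append(("splice", frames, dim * streams))     # frame-wise concatenation
    trace.append(("relu_bn_dropout", frames, dim * streams))
    return trace

shapes = trace_shapes(frames=100, single_layers=4, stream_layers=7,
                      streams=3, dim=512)
```

The spliced dimension 3 × 512 = 1536 is what feeds the ReLU/BN/Dropout stack before the softmax output layer.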
As can be seen from the above, with the scheme of the embodiment of the invention, the improved TDNN-F model can perform acoustic modeling of pronunciation units with different speech rates and different lengths, so the accuracy of the acoustic model is improved and the robustness is better. With the total number of parameters and the model structure kept essentially unchanged, and the computational complexity of the neural network being O(N³), the smaller hidden-layer parameter dimension N in the hidden-layer network gives the improved TDNN-F network a smaller amount of computation and a higher computation speed.
In one practical experiment, the setup was as follows. Data: the training data, test data, pronunciation dictionary, and language model all used the AISHELL-1 open-source data set. Environment: a server with a 40-core Intel(R) Xeon(R) Gold 6226 CPU.
Except the neural network structure, other decoding networks are kept completely consistent. The neural network structure is set as follows:
single4(512) -single7 (1536): and (3) representing 11 layers of tdnn-f cascade neural networks, which correspond to the structure of the original neural network. single4(512) -multi7 (6-9-12512): and the 4-layer tdnn-f cascade + 7-layer multi-stream cascade neural network is shown and is a multi-stream neural network structure provided by the embodiment of the invention.
The results of the experiment are shown in table 1 below:
TABLE 1
[Table 1 is reproduced as an image in the original publication; its contents are not available in text form.]
Here, CER denotes the character error rate, which measures the accuracy of the model's recognition results; RTF denotes the real-time factor, which measures the running speed of the model.
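The two metrics can be computed as follows. This is a standard sketch (the patent does not specify its scoring tooling): CER is the character-level Levenshtein distance between reference and hypothesis divided by the reference length, and RTF is processing time divided by audio duration.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance via a one-row DP."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)

def rtf(processing_seconds, audio_seconds):
    """Real-time factor: below 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

print(round(cer("hello world", "hello word"), 3))  # 0.091: one deletion / 11 chars
print(rtf(2.5, 10.0))  # 0.25: decoding runs 4x faster than real time
```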
The above experiments show that the scheme of the embodiment of the invention yields a smaller model, a higher recognition accuracy and a higher computation speed.
Referring to fig. 7, fig. 7 is a flowchart of a speech recognition method according to an embodiment of the present invention, as shown in fig. 7, including the following steps:
step 701, acquiring a voice signal to be recognized;
step 702, inputting the voice signal to be recognized into a voice recognition network; the voice recognition network comprises an input layer, a hidden layer network and an output layer, wherein the hidden layer network comprises at least two groups of the same hidden layers connected in parallel; the at least two groups of the same hidden layers are obtained by selecting at least one group of hidden layers from the hidden layers of a first voice recognition network as a target hidden layer and copying the target hidden layer; the first voice recognition network is a voice recognition network with a time delay property;
step 703, using the output of the voice recognition network as a voice recognition result;
the voice recognition network is a target voice recognition network obtained by any model training method.
Because the hidden layers of the target voice recognition network adopt the hidden layer network described above, the parameter dimension of the hidden layers in the target voice recognition network is reduced and the amount of computation is reduced, so the target voice recognition network computes faster and the recognition speed is improved accordingly. Therefore, the scheme of the embodiment of the invention can improve both the speed and the accuracy of voice recognition.
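Steps 701 to 703 of fig. 7 can be sketched as a thin wrapper. This is a hypothetical illustration only: `recognize`, `tokens` and the dummy network are placeholders, and a real system would map network posteriors to text through a decoder rather than a per-frame argmax.

```python
import numpy as np

def recognize(signal, network, tokens):
    """Fig. 7 flow: acquire a signal (step 701, the argument), feed it to
    the voice recognition network (step 702), and use the network output
    as the recognition result (step 703)."""
    posteriors = network(signal)        # step 702: softmax output per frame
    ids = posteriors.argmax(axis=-1)    # step 703: best token per frame
    return [tokens[i] for i in ids]

# dummy stand-ins for illustration only
tokens = ["a", "b", "<blank>"]
dummy_net = lambda x: np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])
result = recognize(np.zeros(320), dummy_net, tokens)
print(result)  # ['b', 'a']
```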
The embodiment of the invention also provides a model training apparatus. Referring to fig. 8, fig. 8 is a structural diagram of a model training apparatus according to an embodiment of the present invention. Because the principle by which the model training apparatus solves the problem is similar to that of the model training method in the embodiment of the invention, the implementation of the apparatus can refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 8, the model training apparatus 800 includes:
a first obtaining module 801, configured to obtain training data;
a first processing module 802, configured to train a target speech recognition network using the training data, wherein the target speech recognition network comprises an input layer, a hidden layer network and an output layer; the hidden layer network comprises at least two groups of the same hidden layers connected in parallel, obtained by selecting at least one group of hidden layers from the hidden layers of a first speech recognition network as a target hidden layer and copying the target hidden layer; the first speech recognition network is a speech recognition network with a time delay property.
Wherein the target hidden layer is obtained by selecting at least one group of hidden layers with the layer depth meeting the preset requirement from the hidden layers of the first voice recognition network.
Wherein the at least two groups of the same hidden layers are at least two groups of the same hidden layers after adjusting parameters, wherein the parameters at least comprise step length.
Wherein, the value of the step length comprises any one of 3, 6 and 9;
the parameters further include: hidden layer dimensions of at least two groups of identical hidden layers;
and the hidden layer dimension of the at least two groups of same hidden layers is the quotient of the number of neurons included in the target hidden layer and the step length of the target hidden layer.
The input layer, the hidden layer network, the normalization layer, the random loss layer, the activation layer and the output layer of the target voice recognition network are sequentially connected, and the input layer and the output layer of the target voice recognition network are correspondingly the same as those of the first voice recognition network;
wherein the group of hidden layers comprises at least one hidden layer; the input end of the hidden layer network is the same as the input end of the hidden layer with the lowest layer depth in the target hidden layer in the first voice recognition network, and the output end of the hidden layer network is the same as the output end of the hidden layer with the highest layer depth in the target hidden layer in the first voice recognition network.
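The dimension relation above (hidden layer dimension as the quotient of the target hidden layer's neuron count and its step length) can be checked with one line of arithmetic; the 1536 and 3 below are taken from the fig. 6 embodiment and its comparison structure.

```python
# hidden layer dimension = target-layer neuron count / step length
target_neurons = 1536  # per the single7(1536) comparison structure
stride = 3             # smallest step length in the embodiment
print(target_neurons // stride)  # 512, matching D in fig. 6
```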
The apparatus provided in the embodiment of the present invention may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
The embodiment of the invention also provides a voice recognition apparatus. Referring to fig. 9, fig. 9 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention. Because the principle by which the speech recognition apparatus solves the problem is similar to that of the speech recognition method in the embodiment of the invention, the implementation of the apparatus can refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 9, the speech recognition apparatus 900 includes:
a first obtaining module 901, configured to acquire a voice signal to be recognized; a first recognition module 902, configured to input the voice signal to be recognized into a voice recognition network, wherein the voice recognition network comprises an input layer, a hidden layer network and an output layer; the hidden layer network comprises at least two groups of the same hidden layers connected in parallel, obtained by selecting at least one group of hidden layers from the hidden layers of a first voice recognition network as a target hidden layer and copying the target hidden layer; the first voice recognition network is a voice recognition network with a time delay property; a second obtaining module 903, configured to use the output of the voice recognition network as a voice recognition result.
The voice recognition network is a target voice recognition network obtained by utilizing the model training method.
The apparatus provided in the embodiment of the present invention may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. With such an understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
An embodiment of the present invention further provides an electronic device, including: a memory, a processor, and a program stored on the memory and executable on the processor; the processor is configured to read the program in the memory to implement the steps of the aforementioned model training method, or to implement the steps of the aforementioned speech recognition method.
The embodiment of the present invention further provides a readable storage medium, where a program is stored on the readable storage medium, and when the program is executed by a processor, the program implements each process of the above-mentioned embodiment of the model training method or the speech recognition method, and can achieve the same technical effect, and in order to avoid repetition, the detailed description is omitted here. The readable storage medium may be any available medium or data storage device that can be accessed by a processor, including but not limited to magnetic memory (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.), optical memory (e.g., CD, DVD, BD, HVD, etc.), and semiconductor memory (e.g., ROM, EPROM, EEPROM, nonvolatile memory (NAND FLASH), Solid State Disk (SSD)), etc.

Claims (10)

1. A method of model training, comprising:
acquiring training data;
training a target voice recognition network by using the training data, wherein the target voice recognition network comprises an input layer, a hidden layer network and an output layer, the hidden layer network comprises at least two groups of same hidden layers, and the at least two groups of same hidden layers are connected in parallel;
wherein at least one group of hidden layers is selected from hidden layers of a first voice recognition network as a target hidden layer, and the target hidden layer is copied to obtain the at least two groups of the same hidden layers; the first voice recognition network is a voice recognition network with a time delay property.
2. The method of claim 1,
the target hidden layer is obtained by selecting at least one group of hidden layers with layer depths meeting preset requirements from the hidden layers of the first voice recognition network.
3. The method of claim 1,
the at least two groups of the same hidden layers are at least two groups of the same hidden layers after parameters are adjusted, wherein the parameters at least comprise step length.
4. The method of claim 3,
the value of the step length comprises any one of 3, 6 and 9;
the parameters further include: a hidden layer dimension of the at least two groups of identical hidden layers;
and the hidden layer dimension of the at least two groups of same hidden layers is the quotient of the number of neurons included in the target hidden layer and the step length of the target hidden layer.
5. The method according to claim 1, wherein the input layer, the hidden layer network, the normalization layer, the random loss layer, the active layer and the output layer of the target speech recognition network are connected in sequence, and the input layer and the output layer of the target speech recognition network are correspondingly the same as those of the first speech recognition network;
wherein, the group of hidden layers comprises at least one hidden layer; the input end of the hidden layer network is the same as the input end of the hidden layer with the lowest layer depth in the target hidden layer in the first voice recognition network, and the output end of the hidden layer network is the same as the output end of the hidden layer with the highest layer depth in the target hidden layer in the first voice recognition network.
6. A method of speech recognition, the method comprising:
acquiring a voice signal to be recognized;
inputting the voice signal to be recognized into a voice recognition network, wherein the voice recognition network comprises an input layer, a hidden layer network and an output layer, the hidden layer network comprises at least two groups of same hidden layers, the at least two groups of same hidden layers are connected in parallel, at least one group of hidden layers is selected from the hidden layers of the first voice recognition network to serve as a target hidden layer, and the target hidden layer is copied to obtain the at least two groups of same hidden layers; the first voice recognition network is a voice recognition network with a time delay property;
and utilizing the output of the voice recognition network as a voice recognition result.
7. The speech recognition method of claim 6, wherein the speech recognition network is a target speech recognition network obtained by using the model training method of any one of claims 1 to 5.
8. A model training apparatus, comprising:
the first acquisition module is used for acquiring training data;
the first processing module is used for training a target voice recognition network by using the training data, wherein the voice recognition network comprises an input layer, a hidden layer network and an output layer, the hidden layer network comprises at least two groups of same hidden layers, the at least two groups of same hidden layers are connected in parallel, and the at least two groups of same hidden layers are obtained by selecting at least one group of hidden layers from the hidden layers of the first voice recognition network as the target hidden layer and copying the target hidden layer to obtain the at least two groups of same hidden layers; the first voice recognition network is a voice recognition network with a time delay property.
9. An electronic device, comprising: a memory, a processor, and a program stored on the memory and executable on the processor; wherein the processor is configured to read the program in the memory to implement the steps in the model training method of any one of claims 1 to 5, or to implement the steps in the speech recognition method of claim 6 or 7.
10. A readable storage medium storing a program which, when executed by a processor, implements the steps in the model training method of any one of claims 1 to 5, or implements the steps in the speech recognition method of claim 6 or 7.
CN202011527010.2A 2020-12-22 2020-12-22 Model training method, speech recognition method, device, equipment and readable storage medium Active CN114664292B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011527010.2A CN114664292B (en) 2020-12-22 2020-12-22 Model training method, speech recognition method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011527010.2A CN114664292B (en) 2020-12-22 2020-12-22 Model training method, speech recognition method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN114664292A true CN114664292A (en) 2022-06-24
CN114664292B CN114664292B (en) 2023-08-01

Family

ID=82024437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011527010.2A Active CN114664292B (en) 2020-12-22 2020-12-22 Model training method, speech recognition method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114664292B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130212052A1 (en) * 2012-02-15 2013-08-15 Microsoft Corporation Tensor deep stacked neural network
CN105745700A (en) * 2013-11-27 2016-07-06 国立研究开发法人情报通信研究机构 Statistical-acoustic-model adaptation method, acoustic-model learning method suitable for statistical-acoustic-model adaptation, storage medium in which parameters for building deep neural network are stored, and computer program for adapting statistical acoustic model
JP2017097693A (en) * 2015-11-26 2017-06-01 Kddi株式会社 Data prediction device, information terminal, program, and method performing learning with data of different periodic layer
CN109147774A (en) * 2018-09-19 2019-01-04 华南理工大学 A kind of improved Delayed Neural Networks acoustic model
CN109344959A (en) * 2018-08-27 2019-02-15 联想(北京)有限公司 Neural network training method, nerve network system and computer system
KR20190061433A (en) * 2017-11-28 2019-06-05 한국생산기술연구원 System and method for estimating output data of a rotating device using a neural network
CN110473634A (en) * 2019-04-23 2019-11-19 浙江大学 A kind of Inherited Metabolic Disorders auxiliary screening method based on multiple domain fusion study
US20200134424A1 (en) * 2018-10-31 2020-04-30 Sony Interactive Entertainment Inc. Systems and methods for domain adaptation in neural networks using domain classifier
CN111667835A (en) * 2020-06-01 2020-09-15 马上消费金融股份有限公司 Voice recognition method, living body detection method, model training method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DIAMANTINO CASEIRO: "Multiple parallel hidden layers and other improvements to recurrent neural network language modeling", 2013 IEEE International Conference on Acoustics, Speech and Signal Processing *
WANG ZILONG: "End-to-end speech recognition based on recurrent neural networks", Computer & Digital Engineering *

Also Published As

Publication number Publication date
CN114664292B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
US11948066B2 (en) Processing sequences using convolutional neural networks
JP6952201B2 (en) Multi-task learning as a question answering
JP6980119B2 (en) Speech recognition methods and their devices, devices, storage media and programs
US11417317B2 (en) Determining input data for speech processing
Sainath et al. Low-rank matrix factorization for deep neural network training with high-dimensional output targets
Saon et al. Speaker adaptation of neural network acoustic models using i-vectors
Li et al. Learning small-size DNN with output-distribution-based criteria
US20140156575A1 (en) Method and Apparatus of Processing Data Using Deep Belief Networks Employing Low-Rank Matrix Factorization
US20200126539A1 (en) Speech recognition using convolutional neural networks
Lakomkin et al. On the robustness of speech emotion recognition for human-robot interaction with deep neural networks
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
CN111159416A (en) Language task model training method and device, electronic equipment and storage medium
CN108763535A (en) Information acquisition method and device
CN113641822B (en) Fine-grained emotion classification method based on graph neural network
CN111814489A (en) Spoken language semantic understanding method and system
CN109637527A (en) The semantic analytic method and system of conversation sentence
CN116049387A (en) Short text classification method, device and medium based on graph convolution
WO2020135324A1 (en) Audio signal processing
CN112489651A (en) Voice recognition method, electronic device and storage device
CN114664292B (en) Model training method, speech recognition method, device, equipment and readable storage medium
CN111507218A (en) Matching method and device of voice and face image, storage medium and electronic equipment
Sainath et al. Improvements to filterbank and delta learning within a deep neural network framework
CN116150311A (en) Training method of text matching model, intention recognition method and device
CN111814469B (en) Relation extraction method and device based on tree type capsule network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant