CN112735388B - Network model training method, voice recognition processing method and related equipment - Google Patents

Network model training method, voice recognition processing method and related equipment

Info

Publication number
CN112735388B
CN112735388B (application CN202011577841.0A)
Authority
CN
China
Prior art keywords
feature
layer
feature vector
residual
output
Prior art date
Legal status
Active
Application number
CN202011577841.0A
Other languages
Chinese (zh)
Other versions
CN112735388A (en)
Inventor
孟庆林
吴海英
蒋宁
王洪斌
赵立军
Current Assignee
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd
Priority to CN202011577841.0A
Publication of CN112735388A
Application granted
Publication of CN112735388B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L17/00: Speaker identification or verification
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques specially adapted for comparison or discrimination, for estimating an emotional state

Abstract

The invention provides a network model training method, a voice recognition processing method, and related equipment. The method includes: performing iterative training on a to-be-trained voice recognition network model by using labeled sample data to obtain a voice recognition network model. The to-be-trained voice recognition network model includes a time-delay neural network layer, a first residual layer, N second residual layers, and N-1 shallow feature fusion network layers. The shallow feature fusion network layers fuse the N feature vectors of different scales output by the N second residual layers to obtain N first feature vectors, and the N first feature vectors are fused with the output of the time-delay neural network layer and the output of the first residual layer to obtain a second feature vector. Voiceprint feature recognition and emotion feature recognition can therefore be performed with the trained voice recognition network model, which reduces the cost of detecting user emotion and improves the reliability of user identity verification.

Description

Network model training method, voice recognition processing method and related equipment
Technical Field
The invention relates to the technical field of voice recognition, in particular to a network model training method, a voice recognition processing method and related equipment.
Background
With the development of communication technology, the demand for telephone-based work keeps increasing. For example, in the consumer finance field, a customer service call center may handle thousands of hotline, return-visit, and collection calls every day. To improve service quality, it is very important to effectively supervise the customer service attitude. In the prior art, manual spot checks are usually used; however, to ensure adequate coverage of call detection, the volume of call data to be checked is large, which makes detection costly. In addition, to ensure the reliability of the calling user, the user's identity information usually needs to be verified. Currently, verification is usually performed against the user name or the user identity number, but both can be stolen, so the reliability of such identity authentication is low. Therefore, in the prior art, the cost of detecting user emotion is high and the reliability of identity verification is low.
Disclosure of Invention
The embodiment of the invention provides a network model training method, a voice recognition processing method and related equipment, and aims to solve the problems of high cost of emotion detection of a user and low reliability of identity verification.
In a first aspect, an embodiment of the present invention provides a method for training a speech recognition network model, including:
carrying out iterative training on the voice recognition network model to be trained by using the labeled sample data to obtain a voice recognition network model;
the to-be-trained voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series, the shallow feature fusion network layers are used for fusing the N different-scale feature vectors output by the N second residual layers to obtain N first feature vectors, the N first feature vectors are fused with the output of the time-delay neural network layer and the output of the first residual layer to obtain a second feature vector, the second feature vector is used for representing voiceprint feature information or emotion feature information, and N is an integer greater than 1.
In a second aspect, an embodiment of the present invention provides a speech recognition processing method, including:
preprocessing first voice data to be recognized to obtain a sixth feature vector, wherein the sixth feature vector is used for representing voiceprint feature information of the first voice data;
inputting the sixth feature vector into a voice recognition network model to obtain a voiceprint feature vector to be confirmed;
inputting the voiceprint feature vector into a preset classification model to obtain a first classification result;
determining that the first voice data is voice data of a first user under the condition that the first classification result is matched with a reference result corresponding to the first user;
the voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series, the N second residual layers and the first residual layer are used for performing convolution processing on the sixth feature vector and outputting feature vectors of different scales at different residual layers, the shallow feature fusion network layers are used for fusing the N different-scale feature vectors output by the N second residual layers to obtain N seventh feature vectors, the voiceprint feature vector is obtained by fusing the N seventh feature vectors with the output of the time-delay neural network layer and the output of the first residual layer, and N is an integer greater than 1.
In a third aspect, an embodiment of the present invention provides a speech recognition processing method, including:
preprocessing second voice data to be recognized to obtain an eleventh feature vector, wherein the eleventh feature vector is used for representing emotion feature information of the second voice data;
inputting the eleventh feature vector into a voice recognition network model to obtain an emotion classification result of the second voice data;
the voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers, N-1 shallow feature fusion network layers and a classification network layer, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series, the N second residual layers and the first residual layer are used for performing convolution processing on the eleventh feature vector and outputting feature vectors of different scales at different residual layers, the shallow feature fusion network layers are used for fusing the N different-scale feature vectors output by the N second residual layers to obtain N twelfth feature vectors, the classification network layer is used for fusing the N twelfth feature vectors, the output of the time-delay neural network layer and the output of the first residual layer and then performing emotion classification to obtain the emotion classification result, and N is an integer greater than 1.
In a fourth aspect, an embodiment of the present invention provides a speech recognition network model training apparatus, including:
the training module is used for carrying out iterative training on the voice recognition network model to be trained by using the labeled sample data to obtain the voice recognition network model;
the to-be-trained voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series, the shallow feature fusion network layers are used for fusing the N different-scale feature vectors output by the N second residual layers to obtain N first feature vectors, the N first feature vectors are fused with the output of the time-delay neural network layer and the output of the first residual layer to obtain a second feature vector, the second feature vector is used for representing voiceprint feature information or emotion feature information, and N is an integer greater than 1.
In a fifth aspect, an embodiment of the present invention provides a speech recognition processing apparatus, including:
the voice recognition device comprises a first preprocessing module, a second preprocessing module and a voice recognition module, wherein the first preprocessing module is used for preprocessing first voice data to be recognized to obtain a sixth feature vector, and the sixth feature vector is used for representing voiceprint feature information of the first voice data;
the first input module is used for inputting the sixth feature vector to a voice recognition network model to obtain a voiceprint feature vector to be confirmed;
the second input module is used for inputting the voiceprint feature vectors into a preset classification model to obtain a first classification result;
the determining module is used for determining that the first voice data is the voice data of the first user under the condition that the first classification result is matched with a reference result corresponding to the first user;
the voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series, the N second residual layers and the first residual layer are used for performing convolution processing on the sixth feature vector and outputting feature vectors of different scales at different residual layers, the shallow feature fusion network layers are used for fusing the N different-scale feature vectors output by the N second residual layers to obtain N seventh feature vectors, the voiceprint feature vector is obtained by fusing the N seventh feature vectors with the output of the time-delay neural network layer and the output of the first residual layer, and N is an integer greater than 1.
In a sixth aspect, an embodiment of the present invention provides a speech recognition processing apparatus, including:
the second preprocessing module is used for preprocessing second voice data to be recognized to obtain an eleventh feature vector, and the eleventh feature vector is used for representing emotion feature information of the second voice data;
the third input module is used for inputting the eleventh feature vector to a voice recognition network model to obtain an emotion classification result of the second voice data;
the voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers, N-1 shallow feature fusion network layers and a classification network layer, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series, the N second residual layers and the first residual layer are used for performing convolution processing on the eleventh feature vector and outputting feature vectors of different scales at different residual layers, the shallow feature fusion network layers are used for fusing the N different-scale feature vectors output by the N second residual layers to obtain N twelfth feature vectors, the classification network layer is used for fusing the N twelfth feature vectors, the output of the time-delay neural network layer and the output of the first residual layer and then performing emotion classification to obtain the emotion classification result, and N is an integer greater than 1.
In a seventh aspect, an embodiment of the present invention provides an electronic device, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the speech recognition network model training method provided in the first aspect, or the computer program, when executed by the processor, implements the steps of the speech recognition processing method provided in the second aspect or the third aspect.
In an eighth aspect, the present invention provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the speech recognition network model training method provided in the first aspect, or implements the steps of the speech recognition processing method provided in the second aspect or the third aspect.
In the embodiment of the invention, the voice recognition network model to be trained is iteratively trained with labeled sample data to obtain the voice recognition network model. The to-be-trained voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers; the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series; the shallow feature fusion network layers fuse the N different-scale feature vectors output by the N second residual layers to obtain N first feature vectors; the N first feature vectors are fused with the output of the time-delay neural network layer and the output of the first residual layer to obtain a second feature vector, which represents voiceprint feature information or emotion feature information; and N is an integer greater than 1. Because the to-be-trained voice recognition network model is built on residual layers combined with shallow feature fusion network layers, the model has a small parameter count, strong feature-mapping capability and strong short-term feature expression capability. Voiceprint feature recognition and emotion feature recognition can therefore be performed with the trained voice recognition network model, which reduces the cost of detecting user emotion and improves the reliability of user identity verification.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flow chart of a network model training method provided by an embodiment of the invention;
FIG. 2 is a block diagram of a speech recognition network model to be trained in the network model training method according to an embodiment of the present invention;
FIG. 3 is a second frame diagram of a speech recognition network model to be trained in the network model training method according to the embodiment of the present invention;
FIG. 4 is a flow chart of a speech recognition processing method according to an embodiment of the present invention;
FIG. 5 is a flow chart of another speech recognition processing method provided by the embodiments of the present invention;
FIG. 6 is a block diagram of a network model training apparatus according to an embodiment of the present invention;
fig. 7 is a block diagram of a speech recognition processing apparatus according to an embodiment of the present invention;
fig. 8 is a block diagram of another speech recognition processing apparatus according to an embodiment of the present invention;
fig. 9 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a method for training a speech recognition network model according to an embodiment of the present invention, as shown in fig. 1, including the following steps:
step 101, performing iterative training on a to-be-trained voice recognition network model by using labeled sample data to obtain a voice recognition network model;
the to-be-trained voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series, the shallow feature fusion network layers are used for fusing the N different-scale feature vectors output by the N second residual layers to obtain N first feature vectors, the N first feature vectors are fused with the output of the time-delay neural network layer and the output of the first residual layer to obtain a second feature vector, the second feature vector is used for representing voiceprint feature information or emotion feature information, and N is an integer greater than 1.
In the embodiment of the present invention, the first residual layer and the second residual layers may be collectively referred to as residual layers, denoted for example as ResNetBlock. As shown in fig. 2 and fig. 3, the first residual layer and the N second residual layers can be understood as the residual layers of a residual network whose fully connected layer has been replaced by a time-delay neural network layer, giving an adjusted residual network structure. The to-be-trained speech recognition network model is constructed from this adjusted structure combined with the shallow feature fusion network layers; its specific structure is shown in fig. 2. The value of N can be set according to actual needs; in the embodiment of the invention, N is 3.
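For orientation, the following is a minimal PyTorch sketch of how the layers described above could be wired together for N = 3. The channel widths, strides, kernel sizes, the ResNetBlock internals, and the use of resample-and-add fusion are all illustrative assumptions; the patent's figures, not this sketch, define the exact structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNetBlock(nn.Module):
    # Illustrative residual block; channel widths and strides are assumptions.
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(y + self.skip(x))

class SpeechRecModel(nn.Module):
    # N = 3 second residual layers, one first residual layer, a TDNN head,
    # and N - 1 = 2 shallow feature fusion layers (temp2, temp3), as in Fig. 2.
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.BatchNorm2d(32))  # conv & BN
        self.res3 = ResNetBlock(32, 64)     # 1st second residual layer -> C3
        self.res4 = ResNetBlock(64, 128)    # 2nd second residual layer -> C4
        self.res5 = ResNetBlock(128, 256)   # 3rd (target) second residual layer -> C5
        self.res6 = ResNetBlock(256, 512)   # first residual layer -> P6
        self.tdnn = nn.Conv1d(512, 512, kernel_size=5)  # time-delay layer -> P7
        self.lat4 = nn.Conv2d(128, 256, 1)  # temp2: match C4's channels to P5
        self.lat3 = nn.Conv2d(64, 256, 1)   # temp3: match C3's channels to P4

    def forward(self, x):  # x: (batch, 3, 80, K) three-channel feature map
        c3 = self.res3(self.stem(x))
        c4 = self.res4(c3)
        c5 = self.res5(c4)
        p5 = c5  # the target layer's output is itself one of the N first feature vectors
        # temp2/temp3: resample to the neighbour's scale and fuse by addition
        p4 = self.lat4(c4) + F.interpolate(p5, size=c4.shape[-2:])
        p3 = self.lat3(c3) + F.interpolate(p4, size=c3.shape[-2:])
        p6 = self.res6(c5)
        p7 = self.tdnn(p6.flatten(2))  # one-dimensional extension with time info
        return p3, p4, p5, p6, p7      # fused downstream into the second feature vector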
It should be understood that when the feature vector representing the voiceprint feature information is input into the speech recognition network model to be trained, a second feature vector representing the voiceprint feature information can be obtained; when the feature vector representing the emotion feature information is input to the speech recognition network model to be trained, a second feature vector representing the emotion feature information can be obtained.
Optionally, the N-1 shallow feature fusion network layers may sample and multiplex feature vectors of different scales, which effectively improves the expression of short-term features in conversational scenes; and because a residual network is used as the base model, the model has a small parameter count and strong feature-mapping capability.
In the embodiment of the invention, the voice recognition network model to be trained is iteratively trained with labeled sample data to obtain the voice recognition network model. The to-be-trained voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers; the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series; the shallow feature fusion network layers fuse the N different-scale feature vectors output by the N second residual layers to obtain N first feature vectors; the N first feature vectors are fused with the output of the time-delay neural network layer and the output of the first residual layer to obtain a second feature vector, which represents voiceprint feature information or emotion feature information; and N is an integer greater than 1. Because the to-be-trained voice recognition network model is built on residual layers combined with shallow feature fusion network layers, the model has a small parameter count, strong feature-mapping capability and strong short-term feature expression capability. Voiceprint feature recognition and emotion feature recognition can therefore be performed with the trained voice recognition network model, which reduces the cost of detecting user emotion and improves the reliability of user identity verification.
It should be noted that the manner in which the shallow feature fusion network layers fuse the N different-scale feature vectors output by the N second residual layers into N first feature vectors can be set according to actual needs. For example, in some embodiments, it is as follows:
the N-1 shallow feature fusion network layers are used for sampling and fusing N feature vectors of different scales output by the N second residual error layers to obtain N first feature vectors, the N first feature vectors comprise feature vectors output by a target residual error layer, and the target residual error layer is a second residual error layer adjacent to the first residual error layer.
In this embodiment, the statement that the N first feature vectors include the feature vector output by the target residual layer should be understood to mean that one of the N first feature vectors is the feature vector output by the target residual layer itself.
Optionally, in some embodiments, the N-1 shallow feature fusion network layers are connected in one-to-one correspondence to the N-1 second residual layers other than the target residual layer, and the 1st shallow feature fusion network layer is additionally connected to the target residual layer. The 1st shallow feature fusion network layer fuses the feature vectors output by the two second residual layers it is connected to, obtaining the 1st fusion feature vector, and the feature vector output by the target residual layer is determined as a first feature vector. The i-th shallow feature fusion network layer fuses the feature vector output by the second residual layer it is connected to with the (i-1)-th fusion feature vector output by the (i-1)-th shallow feature fusion network layer, obtaining the i-th fusion feature vector, which is determined as the i-th first feature vector; i is an integer greater than or equal to 2 and less than or equal to N-1, and N is an integer greater than or equal to 2;
or, when N is 2, the shallow feature fusion network layer fuses the feature vector output by the second residual layer connected to the target residual layer with the output of the target residual layer to obtain a target fusion feature vector, and both the feature vector output by the target residual layer and the target fusion feature vector are determined as first feature vectors.
In the embodiment of the application, a larger value of i corresponds to a shallower feature vector output by the connected second residual layer. Fusing the feature vectors of the different layers in sequence samples and multiplexes feature vectors of different scales, which can effectively improve the expression of short-term features in a conversation scene; a generic sketch of this sequential fusion follows.
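The following framework-agnostic sketch shows the sequential fusion just described. The `fuse` callable stands in for a shallow feature fusion network layer and is assumed to resample its first argument to its second argument's scale before merging; this mirrors, but does not define, the patent's temp layers.

```python
def shallow_feature_fusion(res_outputs, fuse):
    # res_outputs: [C_shallowest, ..., C_target], the feature vectors output by
    # the N second residual layers (C_target comes from the target residual
    # layer). fuse(a, b) is assumed to resample a to b's scale and merge the
    # two maps. Returns the N first feature vectors.
    fused = res_outputs[-1]      # the target layer's output is itself a first feature vector
    firsts = [fused]
    for c in reversed(res_outputs[:-1]):  # larger i corresponds to a shallower layer
        fused = fuse(fused, c)            # the i-th fusion feature vector
        firsts.append(fused)
    return firsts                         # N first feature vectors in total
```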
As shown in fig. 2 and fig. 3, taking N = 3 as an example: the number of shallow feature fusion network layers is 2, and the two layers may be denoted temp2 and temp3. Assume the first second residual layer outputs feature vector C3, the second second residual layer performs residual processing on C3 to output C4, and the third second residual layer performs residual processing on C4 to output C5. The third second residual layer can be understood as the target residual layer; temp2 is connected to the third and the second of the second residual layers, and temp3 is connected to the first.
When N is 2, the number of shallow feature fusion network layers is 1, and the layer may be denoted temp4. Assume the first second residual layer outputs feature vector C8, and the second second residual layer, which can be understood as the target residual layer, performs residual processing on C8 to output C9; temp4 is connected to both of these residual layers. temp4 fuses the feature vectors output by the two layers to obtain a target fusion feature vector, and both the target residual layer's output and the target fusion feature vector are determined as first feature vectors.
Optionally, in some embodiments, the step of iteratively training the to-be-trained speech recognition network model by using the labeled sample data to obtain the speech recognition network model includes:
preprocessing the labeled sample data to obtain a third feature vector;
performing iterative training on the to-be-trained voice recognition network model by using the third feature vector to obtain the voice recognition network model;
wherein, when the second feature vector is used for representing emotion feature information, the third feature vector is used for representing voiceprint feature information of the sample data; when the second feature vector is used for representing voiceprint feature information, the third feature vector is used for representing emotion feature information of the sample data.
In this embodiment of the present invention, when the third feature vector is used to represent voiceprint feature information of the sample data, the third feature vector includes an Fbank feature, a first-order difference feature and a tone information feature; when the third feature vector is used to represent emotion feature information of the sample data, the third feature vector includes a Mel spectrum feature, a first-order difference feature and a second-order difference feature.
It should be understood that the sample data may be recorded calls from financial customer service; for example, a thousand hours of recordings may be used as sample data. When performing voiceprint feature training, the sample data may be labeled with voiceprint features, that is, a given section of voice is labeled as the voice of a given user. When performing emotion training, the sample data may be labeled with emotion classes.
Optionally, in some embodiments, when performing voiceprint feature training, preprocessing the labeled sample data to obtain a third feature vector includes:
performing data expansion on the labeled sample data in at least one of the following modes: adding noise, increasing speech speed, increasing data disturbance and the like;
performing feature extraction on the sample data after data expansion by using a feature extraction script in Kaldi to obtain Fbank features and tone information features;
for the Fbank features, extracting first-order difference features using Python;
and performing fusion processing on the Fbank characteristic, the tone information characteristic and the first-order difference characteristic to obtain a third feature vector.
In this embodiment, the third feature vector may represent a three-channel feature map comprising Fbank, first-order difference, and tone information features. Optionally, adding data disturbance can be understood as adding background music, reverberation, speech, and similar processing. Performing data expansion on the sample data improves how realistically the data covers actual conditions. Because the third feature vector fuses the Fbank feature, the first-order difference feature and the tone information feature, it effectively captures the short-term characteristics of client/customer-service conversation in financial dialogue scenarios; spectrum enhancement can also be introduced during training, effectively improving coverage of scene features.
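A minimal sketch of this fusion step follows, assuming the Fbank and tone (pitch) features have already been produced by Kaldi's feature-extraction scripts and loaded as NumPy arrays, and that the tone map has been resized to the Fbank shape; the function name and librosa usage are assumptions, not the patent's tooling.

```python
import numpy as np
import librosa

def build_voiceprint_features(fbank: np.ndarray, pitch: np.ndarray) -> np.ndarray:
    # fbank: (num_mel, num_frames) Fbank features, e.g. loaded from Kaldi output;
    # pitch: tone-information map assumed to be already resized to fbank's shape.
    delta = librosa.feature.delta(fbank, order=1)    # first-order difference
    # Stack into the three-channel Fbank / first-order-difference / tone map.
    return np.stack([fbank, delta, pitch], axis=-1)  # (num_mel, num_frames, 3)
```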
Optionally, in some embodiments, when performing emotion feature training, preprocessing the labeled sample data to obtain a third feature vector includes:
performing feature extraction on the labeled sample data to obtain Mel frequency spectrum features;
for the Mel spectrum features, extracting first-order difference features using Python;
for the first-order difference features, extracting second-order difference features using Python;
and performing fusion processing on the Mel frequency spectrum characteristic, the first-order difference characteristic and the second-order difference characteristic to obtain a third feature vector.
In the embodiment of the invention, the Mel spectrum feature, the first-order difference feature and the second-order difference feature are combined into a three-channel feature, which improves feature coverage and facilitates the classification learning of the voice recognition network model.
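A corresponding sketch for the emotion-feature preprocessing, assuming librosa for both the Mel spectrogram and the differences; n_mels=80 matches the 80-dimensional Mel spectrogram mentioned later in the description, while the sample rate and dB scaling are assumptions.

```python
import numpy as np
import librosa

def build_emotion_features(wav_path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    d1 = librosa.feature.delta(mel, order=1)   # first-order difference
    d2 = librosa.feature.delta(mel, order=2)   # second-order difference
    return np.stack([mel, d1, d2], axis=-1)    # three-channel (n_mels, frames, 3) map
```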
Optionally, in some embodiments, the step of performing iterative training on the to-be-trained speech recognition network model by using the third feature vector to obtain the speech recognition network model includes:
in the L-th training iteration, inputting the third feature vector into the to-be-trained speech recognition network model to obtain the second feature vector, where L is a positive integer;
inputting the second feature vector into a Softmax classifier to obtain a classification result of the second feature vector, and computing a loss value between the classification result and the labeling result of the third feature vector, where the loss value is used to adjust the network parameters of the to-be-trained speech recognition network model;
and if the change of the loss value is smaller than a preset value, determining the currently trained voice recognition network model to be trained as the voice recognition network model.
In the embodiment of the invention, the sample data may be divided into multiple groups for iterative training, with the third feature vectors corresponding to one group of sample data used as the input of each iteration. During a single iteration, the second feature vector output by the to-be-trained speech recognition network model is input into the Softmax classifier to obtain the corresponding classification result (specifically, an emotion classification result or a voiceprint feature classification result); the classification result is compared with the labeled result in the Softmax classifier to obtain the loss value. When the change of the loss value is greater than or equal to the preset value, the network parameters of the to-be-trained speech recognition network model are adjusted and the next iteration continues. If the change of the loss value is smaller than the preset value, the current classification result is close to the true value, so the currently trained model can be determined as the voice recognition network model.
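A minimal PyTorch sketch of this loop follows. The model is assumed to end in a classification head producing class logits; the optimizer choice, learning rate, and the preset value are all assumptions.

```python
import torch
import torch.nn as nn

def train(model, loader, preset_value=1e-4, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()    # Softmax classifier + loss vs. labels
    prev_loss = None
    for features, labels in loader:      # one group of sample data per iteration
        logits = model(features)
        loss = criterion(logits, labels) # loss between classification and labeling result
        opt.zero_grad()
        loss.backward()
        opt.step()                       # adjust network parameters using the loss
        if prev_loss is not None and abs(prev_loss - loss.item()) < preset_value:
            break                        # loss change below the preset value: converged
        prev_loss = loss.item()
    return model
```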
Optionally, in some embodiments, the to-be-trained speech recognition network model further includes a normalization convolutional layer (conv & BN), where the normalization convolutional layer is configured to perform normalization processing on the third feature vector to obtain a fourth feature vector;
and the time-delay neural network layer is used for performing one-dimensional extension processing on the feature vector output by the first residual layer to obtain a fifth feature vector with time information.
As shown in fig. 2 and 3, in the embodiment of the present invention the residual network includes three second residual layers; the normalized convolutional layer, the three second residual layers, the first residual layer and the time-delay neural network layer are connected in series in that order, and each residual layer can output feature vectors of a different scale. Each second residual layer outputs its residual-processed feature vector both to the next residual layer and to the shallow feature fusion network layer. It can be understood as follows: assume the first second residual layer outputs feature vector C3; the second second residual layer performs residual processing on C3 to output C4; the third second residual layer performs residual processing on C4 to output C5; the first residual layer performs residual processing on C5 to output P6; and the time-delay neural network layer performs one-dimensional extension processing on P6 to obtain feature vector P7.
In fig. 2 and 3, the feature vector C5 may be represented as a feature vector P5 after being input into the shallow feature fusion network layer. In the shallow feature fusion network layer, a feature vector P5 can be downsampled through temp2 and fused with a feature vector C4 to obtain a feature vector P4, and a feature vector P4 can be downsampled through temp3 and fused with a feature vector C3 to obtain a feature vector P3.
It should be noted that, in the embodiments of the present application, C3 to C5 and P3 to P7 are only used for distinguishing different vector representations for convenience of description, and do not represent specific meanings of feature vectors.
It should be understood that the feature vectors C3 through C5 have different scales, and the processing of the feature vectors P3 through P7 differs depending on the function the to-be-trained speech recognition network model is to realize. For example, in an alternative embodiment, when the to-be-trained speech recognition network model is used for voiceprint feature training, a concat structure may be used to splice and fuse the feature vectors. The second feature vector obtained in this case can be understood as a speech x-vector feature. In other words, in the embodiment of the present invention, when the second feature vector is used to represent voiceprint feature information, the second feature vector is the feature vector obtained by splicing P3 through P7.
Optionally, when emotion feature training is performed with the to-be-trained speech recognition network model, a bidirectional gated recurrent unit (BiGRU) layer may be used to interconnect and fuse the multiple feature vectors across features, so as to obtain the second feature vector. In other words, in this embodiment of the present invention, when the second feature vector is used to represent emotion feature information, the to-be-trained speech recognition network model further includes a bidirectional gated recurrent neural network layer (BiGRU) and an attention mechanism layer (Attention). The BiGRU layer interconnects and fuses the N first feature vectors, the fifth feature vector, and the feature vector output by the first residual layer to obtain the second feature vector; after the attention mechanism layer weights the second feature vector, it is output to the Softmax classifier to obtain an emotion classification result.
It should be appreciated that the attention mechanism layer, in addition to weighting the second feature vector, may also be used to align speech frame-level features.
In the embodiment of the invention, the feature vector output by the time-delay neural network layer represents timing information, and the bidirectional gated recurrent network can encode it and perform inter-frame alignment with the features P3 through P6, so that the fused output combines speech-emotion timing information. Compared with a conventional bidirectional long short-term memory (BiLSTM), the BiGRU used in this embodiment of the application has a simpler network structure, which effectively reduces the number of network parameters and speeds up network computation without greatly affecting accuracy. Because the residual network's parameter count is small, the capacity for processing speech sequence features can be improved.
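A minimal sketch of such a BiGRU-plus-attention head follows; the feature dimension, hidden size and number of emotion classes are illustrative assumptions, and the additive frame-attention form is one common choice rather than the patent's specified design.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, feat_dim=512, hidden=128, num_emotions=4):
        super().__init__()
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)           # per-frame attention score
        self.cls = nn.Linear(2 * hidden, num_emotions)

    def forward(self, seq):                           # seq: (batch, frames, feat_dim)
        h, _ = self.bigru(seq)                        # encode and align across frames
        w = torch.softmax(self.att(h), dim=1)         # frame-level attention weights
        pooled = (w * h).sum(dim=1)                   # weighted fusion of the sequence
        return torch.log_softmax(self.cls(pooled), dim=-1)  # emotion class scores
```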
It should be noted that, in the embodiment of the present invention, when performing emotion classification by using the speech recognition network model, an attention mechanism layer and a Softmax classifier are designed in the speech recognition network model to be trained, so that the trained speech recognition network model can be directly used to perform emotion classification recognition. Of course, in other embodiments, the attention mechanism layer and the Softmax classifier may be independently set, and the attention mechanism layer and the Softmax classifier are connected in series on the basis of the trained speech recognition network model, so that the emotion classification can be realized.
Optionally, in an embodiment of the present invention, the structure of each network layer in the residual network and its output feature dimensions are shown in the following table:
[Table not recoverable from the source: it gave each layer's structure and output feature dimensions.]
In the above table, for 80 × K × 3: 80 represents the dimension of the Mel spectrogram corresponding to one audio clip, K represents the clip's duration information, and 3 represents the composition of the Mel spectrogram; for example, the feature map may be composed of Fbank, first-order difference and timbre information features, or of Fbank, first-order difference and second-order difference features. T represents the frame-count information of one audio clip. For example, in an alternative embodiment, a clip of approximately 2 seconds corresponds to about 200 frames.
In this embodiment of the application, 128 audio clips may be input in one iteration, so the third feature vectors obtained by preprocessing the 128 clips form a 128 × 80 × K × 3-dimensional feature vector. After this is input to the normalization convolutional layer, normalization convolution is performed and the feature map of each clip shrinks: the Mel spectrogram dimension is reduced to 39, the frame-count dimension is reduced correspondingly, and the number of channels is increased to 32. [The exact intermediate dimensions appeared as formula images in the source and are not recoverable.] After the residual features are extracted by the successive residual layers, the feature vector finally output by the first residual layer is passed to the time-delay neural network layer for one-dimensional extension processing, yielding the feature vector with time information. It should be noted that after residual features are extracted in multiple residual layers, feature vectors of different scales are output: shallow feature vectors tend to represent original voiceprint information, while after multiple rounds of residual feature extraction, deep feature vectors tend to represent semantic feature information.
Optionally, in some embodiments, a pooling network layer may also be provided between the first residual layer and the time-delay neural network layer; pooling reduces the amount of data the time-delay neural network layer must process and increases its processing speed, as in the sketch below.
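A hypothetical pooling step between the first residual layer and the TDNN; the kernel size and the example shape are assumptions.

```python
import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=2)
p6 = torch.randn(128, 512, 6, 12)  # stand-in for the first residual layer's output
p6_pooled = pool(p6)               # (128, 512, 3, 6): 4x fewer values for the TDNN
```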
Referring to fig. 4, an embodiment of the present invention further provides a speech recognition processing method, and as shown in fig. 4, the speech recognition processing method includes:
step 401, preprocessing first voice data to be recognized to obtain a sixth feature vector, where the sixth feature vector is used to represent voiceprint feature information of the first voice data;
step 402, inputting the sixth feature vector into a voice recognition network model to obtain a voiceprint feature vector to be confirmed;
step 403, inputting the voiceprint feature vector into a preset classification model to obtain a first classification result;
step 404, determining that the first voice data is the voice data of the first user when the first classification result is matched with a reference result corresponding to the first user;
the voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series, the N second residual layers and the first residual layer are used for performing convolution processing on the sixth feature vector and outputting feature vectors of different scales at different residual layers, the shallow feature fusion network layers are used for fusing the N different-scale feature vectors output by the N second residual layers to obtain N seventh feature vectors, the voiceprint feature vector is obtained by fusing the N seventh feature vectors with the output of the time-delay neural network layer and the output of the first residual layer, and N is an integer greater than 1.
In the embodiment of the invention, when conducting telephone business with the first user, a conversation may first be carried out with the first user, and the first user's voice data is collected during the conversation so that the first user's identity can be recognized.
In the embodiment of the present invention, the first residual layer and the second residual layers may be collectively referred to as residual layers, denoted for example as ResNetBlock. As shown in fig. 2, the first residual layer and the N second residual layers can be understood as the residual layers of a residual network whose fully connected layer has been replaced by a time-delay neural network layer, giving an adjusted residual network structure. The voice recognition network model is constructed from this adjusted structure combined with the shallow feature fusion network layers; its specific structure is shown in fig. 2. The value of N can be set according to actual needs; in the embodiment of the present invention, N is 3.
Optionally, in some embodiments, the shallow feature fusion network layer is configured to fuse the N different scale feature vectors output by the N second residual error layers to obtain N seventh feature vectors, where the N seventh feature vectors include:
the N-1 shallow feature fusion network layers are used for sampling and fusing N feature vectors of different scales output by the N second residual error layers to obtain N seventh feature vectors, the N seventh feature vectors comprise feature vectors output by a target residual error layer, and the target residual error layer is a second residual error layer adjacent to the first residual error layer.
Optionally, the N-1 shallow feature fusion network layers are connected in one-to-one correspondence to the N-1 second residual layers other than the target residual layer, and the 1st shallow feature fusion network layer is additionally connected to the target residual layer. The 1st shallow feature fusion network layer fuses the feature vectors output by the two second residual layers it is connected to, obtaining the 1st fusion feature vector, and the feature vector output by the target residual layer is determined as the first of the seventh feature vectors. The i-th shallow feature fusion network layer fuses the feature vector output by the second residual layer it is connected to with the (i-1)-th fusion feature vector output by the (i-1)-th shallow feature fusion network layer, obtaining the i-th fusion feature vector, which is determined as the i-th seventh feature vector; i is an integer greater than or equal to 2 and less than or equal to N-1, and N is an integer greater than or equal to 2;
or, when N is 2, the shallow feature fusion network layer fuses the feature vector output by the second residual layer connected to the target residual layer with the output of the target residual layer to obtain a target fusion feature vector, and both the feature vector output by the target residual layer and the target fusion feature vector are determined as seventh feature vectors.
In the embodiment of the application, the feature vectors with different scales are sampled and multiplexed through the N-1 shallow feature fusion network layers, so that the expression capability of short-time features in a dialogue scene can be effectively improved, and a residual error network is used as a basic model, so that the model has small parameter quantity and strong feature mapping capability.
It should be understood that, in the embodiment of the present invention, the above-mentioned preprocessing process may include the following steps:
performing feature extraction on the first voice data by using a feature extraction script in Kaldi to obtain Fbank features and tone information features;
for the Fbank features, extracting first-order difference features using Python;
and performing fusion processing on the Fbank characteristic, the tone information characteristic and the first-order difference characteristic to obtain a sixth feature vector.
That is, in the embodiment of the present invention, the sixth feature vector includes the Fbank feature, the first order difference feature, and the tone information feature.
The preset classification model may be a PLDA classifier. Specifically, the PLDA classifier may first be trained with the labeled sample data to learn each speaker's feature representation in its respective space, and full-probability posterior estimation is then performed with an EM algorithm until the best-fitting representation parameters are found. During recognition, the voiceprint feature vector output by the speech recognition network model is input into the PLDA classifier to obtain the first classification result, which is compared with the reference result corresponding to the first user to determine whether the first voice data to be recognized is the first user's voice data.
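The comparison step can be sketched as follows. Cosine scoring is used here only as a simplified stand-in for PLDA scoring, since a full PLDA implementation is beyond a sketch; a real system would score with the trained PLDA model, and the threshold is an assumption.

```python
import numpy as np

def verify(voiceprint: np.ndarray, reference: np.ndarray, threshold: float = 0.7) -> bool:
    # Score the voiceprint feature vector against the first user's reference
    # result and accept only if they match closely enough.
    score = float(np.dot(voiceprint, reference) /
                  (np.linalg.norm(voiceprint) * np.linalg.norm(reference)))
    return score >= threshold  # match with the first user's reference result
```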
In the embodiment of the invention, because the residual error network is used as the basis in the voice recognition network model and is combined with the shallow feature fusion network layer, the parameter quantity of the model is small, the feature mapping capability is strong, and the short-term feature expression capability can be stronger. Therefore, the voice print characteristic recognition can be carried out by utilizing the trained voice recognition network model, so that the reliability of user identity verification is improved.
Optionally, in some embodiments, before the sixth feature vector is input to the speech recognition network model and the voiceprint feature vector to be confirmed is obtained, the method further includes:
acquiring reference voice data input by the first user during registration;
preprocessing the reference voice data to obtain an eighth feature vector, wherein the eighth feature vector is used for representing the voiceprint feature information of the reference voice data;
inputting the eighth feature vector to the voice recognition network model to obtain a reference voiceprint feature vector of the first user;
and the reference result is a result of classifying the reference voiceprint feature vector based on the preset classification model.
In the embodiment of the invention, when the user registers the account, the user can input a section of reference voice data; the above preprocessing process is consistent with the processing process of the first voice data, and reference may be specifically made to the description of the above embodiment, which is not described herein again.
It should be understood that the processing procedure of the eighth feature vector by the speech recognition network model is the same as the processing procedure of the third feature vector, and the specific processing procedure may refer to the above embodiments and is not described herein again. After obtaining the reference voiceprint feature vector, the reference voiceprint feature vector and the user identification information may be stored in association for subsequent voiceprint recognition comparison.
Optionally, in some embodiments, the speech recognition network model further includes a normalized convolutional layer, where the normalized convolutional layer is configured to perform normalization processing on the sixth feature vector to obtain a ninth feature vector;
and the time-delay neural network layer is used for performing one-dimensional extension processing on the feature vector output by the first residual layer to obtain a tenth feature vector with time information.
As shown in fig. 2, in the embodiment of the present invention the residual network includes three second residual layers; the normalized convolutional layer, the three second residual layers, the first residual layer and the time-delay neural network layer are connected in series in that order, and each residual layer can output feature vectors of a different scale. Each second residual layer outputs its residual-processed feature vector both to the next residual layer and to the shallow feature fusion network layer. As shown in fig. 2, assume the first second residual layer outputs feature vector C3; the second second residual layer performs residual processing on C3 to output C4; the third second residual layer performs residual processing on C4 to output C5; the first residual layer performs residual processing on C5 to output P6; and the time-delay neural network layer performs one-dimensional extension processing on P6 to obtain feature vector P7.
In fig. 2, the number of the shallow feature fusion network layers is 2, and may be referred to as temp2 and temp3 hereinafter. The feature vector C5 can be expressed as a feature vector P5 when input to the shallow feature fusion network layer. In the shallow feature fusion network layer, a feature vector P5 can be downsampled through temp2 and fused with a feature vector C4 to obtain a feature vector P4, and a feature vector P4 can be downsampled through temp3 and fused with a feature vector C3 to obtain a feature vector P3.
It should be understood that the feature vectors C3 through C5 are feature vectors of different scales. Optionally, in the embodiment of the present invention, for the processing of the feature vectors P3 through P7, a concat structure may be used to splice and fuse the feature vectors. The resulting voiceprint feature vector can be understood as a voice x-vector feature. In other words, in this embodiment of the present invention, the voiceprint feature vector is a feature vector obtained by performing subsequent processing on the N seventh feature vectors, the tenth feature vector, and the feature vector output by the first residual layer.
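The temp2/temp3 fusion and the concat-based splicing can be sketched as follows. This is a hedged PyTorch illustration: the fusion operator (1x1 lateral convolution plus element-wise sum), the resampling mode, and mean pooling of 4-D feature maps before the splice are all assumptions, since the patent only states that the pyramid features are sampled, fused, and spliced with a concat structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed sketch of a shallow feature fusion layer (temp2/temp3) and the
# concat fusion of P3..P7 into an x-vector-style voiceprint feature.

class ShallowFusion(nn.Module):
    """Resample the incoming pyramid feature to the scale of the lateral
    residual feature, then fuse the two (1x1 conv + element-wise sum)."""
    def __init__(self, pyramid_ch, lateral_ch):
        super().__init__()
        self.lateral = nn.Conv2d(lateral_ch, pyramid_ch, kernel_size=1)

    def forward(self, p, c):
        p = F.interpolate(p, size=c.shape[-2:])   # match the scale of C
        return p + self.lateral(c)                # fused feature (P4 or P3)

def splice_to_xvector(p_features):
    """Concat-style fusion: pool each 4-D map over its two trailing spatial
    axes and splice the pooled vectors into one voiceprint feature vector."""
    pooled = [f.mean(dim=(-2, -1)) for f in p_features]   # (batch, ch) each
    return torch.cat(pooled, dim=1)

# Usage under these assumptions:
# P5 = C5 (possibly via a 1x1 conv)
# P4 = ShallowFusion(ch, ch)(P5, C4)   # temp2
# P3 = ShallowFusion(ch, ch)(P4, C3)   # temp3
```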
When N is 2, the number of shallow feature fusion network layers is 1, and the shallow feature fusion network layer may be denoted as temp4. Assume that the feature vector output by the first second residual layer is a feature vector C8, and that the second second residual layer is configured to perform residual processing on the feature vector C8 to output a feature vector C9; the second second residual layer can be understood as the target residual layer, and temp4 is connected with the first second residual layer and the second second residual layer. temp4 is configured to perform fusion processing on the feature vector output by the second residual layer connected to the target residual layer and the feature vector output by the target residual layer to obtain a target fusion feature vector, and both the feature vector output by the target residual layer and the target fusion feature vector are determined as seventh feature vectors.
Referring to fig. 5, an embodiment of the present invention further provides a speech recognition processing method, as shown in fig. 5, the speech recognition processing method includes:
step 501, preprocessing second voice data to be recognized to obtain an eleventh feature vector, wherein the eleventh feature vector is used for representing emotion feature information of the second voice data;
step 502, inputting the eleventh feature vector into a voice recognition network model to obtain an emotion classification result of the second voice data;
the voice recognition network model comprises a delay neural network layer, a first residual layer, N second residual layers, N-1 shallow feature fusion network layers and a classification network layer, wherein the N second residual layers, the first residual layer and the delay neural network layer are sequentially connected in series; the N second residual layers and the first residual layer are used for performing convolution processing on the eleventh feature vector and outputting feature vectors of different scales at different residual layers; the shallow feature fusion network layers are used for fusing the N feature vectors of different scales output by the N second residual layers to obtain N twelfth feature vectors; and the emotion classification network is used for fusing the N twelfth feature vectors, the output of the delay neural network layer and the output of the first residual layer and then performing emotion classification to obtain the emotion classification result, where N is an integer greater than 1.
In the embodiment of the invention, when a customer service agent is on a call with the second user, the call between the agent and the second user can be recorded; after the call ends, the call data can be obtained, so that the emotion of the agent and/or the emotion of the second user can be classified.
It should be understood that, in the embodiment of the present invention, the above-mentioned preprocessing process may include the following steps:
performing feature extraction on the second voice data to obtain an Fbank feature;
extracting a first-order difference feature from the Fbank feature by using python;
extracting a second-order difference feature from the first-order difference feature by using python;
and fusing the Fbank feature, the first-order difference feature and the second-order difference feature to obtain the eleventh feature vector.
That is, in the embodiment of the present invention, the eleventh feature vector includes a mel-frequency spectrum feature, a first order difference feature, and a second order difference feature.
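The preprocessing chain can be illustrated with a short Python sketch. The patent only says the difference features are extracted "using python" without naming a package, so librosa is an assumed choice here, and the sampling rate and mel-band count are illustrative parameters.

```python
import numpy as np
import librosa

# Assumed sketch of the preprocessing: Fbank (log-mel) features, a
# first-order difference, a second-order difference computed from the
# first-order one, and fusion of the three into the eleventh feature vector.

def make_eleventh_feature_vector(wav_path, sr=16000, n_mels=40):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)                 # Fbank / mel-spectrum feature
    delta1 = librosa.feature.delta(fbank, order=1)   # first-order difference
    delta2 = librosa.feature.delta(delta1, order=1)  # second-order difference
    return np.concatenate([fbank, delta1, delta2], axis=0)  # fused features
```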
The embodiment of the invention combines the residual network and the shallow feature fusion network layers in the voice recognition network model, so that the model has a small parameter quantity, strong feature mapping capability and strong short-term feature expression capability. Emotion recognition can therefore be carried out by using the trained voice recognition network model, which reduces the cost of detecting user emotion.
Optionally, in some embodiments, the fusing, by the shallow feature fusion network layer, of the N feature vectors of different scales output by the N second residual layers to obtain N twelfth feature vectors includes:
the N-1 shallow feature fusion network layers are used for sampling and fusing N feature vectors of different scales output by the N second residual error layers to obtain N twelfth feature vectors, the N twelfth feature vectors comprise feature vectors output by a target residual error layer, and the target residual error layer is a second residual error layer adjacent to the first residual error layer.
Optionally, the N-1 shallow feature fusion network layers are connected, in one-to-one correspondence, to the N-1 second residual layers other than the target residual layer, and the 1st shallow feature fusion network layer is further connected to the target residual layer; the 1st shallow feature fusion network layer is configured to perform fusion processing on the feature vectors output by the two second residual layers connected to it to obtain a 1st fusion feature vector, and to determine the feature vector output by the target residual layer as the 1st twelfth feature vector; and the i-th shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to it and the (i-1)-th fusion feature vector output by the (i-1)-th shallow feature fusion network layer to obtain an i-th fusion feature vector, and to determine the i-th fusion feature vector as the i-th twelfth feature vector, wherein i is an integer greater than or equal to 2 and less than or equal to N-1, and N is an integer greater than or equal to 2;
or, when N is 2, the shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to the target residual layer and the feature vector output by the target residual layer to obtain a target fusion feature vector, and to determine both the feature vector output by the target residual layer and the target fusion feature vector as twelfth feature vectors.
In the embodiment of the invention, the feature vectors of different scales are sampled and reused by the N-1 shallow feature fusion network layers, which can effectively improve the expression capability of short-time features in a dialogue scene; and a residual network is used as the base model, so that the model has a small parameter quantity and strong feature mapping capability.
Optionally, in some embodiments, the speech recognition network model further includes a normalized convolutional layer, where the normalized convolutional layer is configured to perform normalization processing on the eleventh feature vector to obtain a thirteenth feature vector;
and the delay neural network layer is used for performing one-dimensional extension processing on the feature vector output by the first residual error layer to obtain a fourteenth feature vector with time information.
As shown in fig. 3, in the embodiment of the present invention, the residual network includes three second residual layers connected in series, wherein the normalized convolutional layer, the three second residual layers, the first residual layer, and the delay neural network layer are sequentially connected in series, and each residual layer can output a feature vector of a different scale. Each second residual layer may output its residual-processed feature vector to the next residual layer and, at the same time, to the shallow feature fusion network layer. As shown in fig. 3, it is assumed that the feature vector output by the first second residual layer is a feature vector C3; the second second residual layer is configured to perform residual processing on the feature vector C3 to output a feature vector C4; the third second residual layer is configured to perform residual processing on the feature vector C4 to output a feature vector C5; the first residual layer is configured to perform residual processing on the feature vector C5 to output a feature vector P6; and the delay neural network layer is configured to perform one-dimensional extension processing on the feature vector P6 to obtain a feature vector P7.
As shown in fig. 3, the number of shallow feature fusion network layers is 2; they may be referred to as temp2 and temp3 hereinafter. The feature vector C5 may be denoted as a feature vector P5 when input to the shallow feature fusion network layer. In the shallow feature fusion network layers, the feature vector P5 can be downsampled by temp2 and fused with the feature vector C4 to obtain a feature vector P4, and the feature vector P4 can be downsampled by temp3 and fused with the feature vector C3 to obtain a feature vector P3.
When N is 2, the number of shallow feature fusion network layers is 1, and the shallow feature fusion network layer may be denoted as temp4. Assume that the feature vector output by the first second residual layer is a feature vector C8, and that the second second residual layer is configured to perform residual processing on the feature vector C8 to output a feature vector C9; the second second residual layer can be understood as the target residual layer, and temp4 is connected with the first second residual layer and the second second residual layer. temp4 is configured to perform fusion processing on the feature vector output by the second residual layer connected to the target residual layer and the feature vector output by the target residual layer to obtain a target fusion feature vector, and both the feature vector output by the target residual layer and the target fusion feature vector are determined as twelfth feature vectors.
Optionally, in some embodiments, the emotion classification network includes a bidirectional gated recurrent neural network layer (a bidirectional GRU), an attention mechanism layer, and a Softmax classifier, where the bidirectional gated recurrent neural network layer is configured to perform inter-feature interconnection and fusion processing on the N twelfth feature vectors, the fourteenth feature vector, and the feature vector output by the first residual layer to obtain a second feature vector, which is weighted by the attention mechanism layer and then output to the Softmax classifier for emotion classification.
As shown in fig. 3, the bidirectional gated recurrent neural network layer may perform inter-feature interconnection and fusion processing on the vectors P3 through P7 to obtain the second feature vector, where the second feature vector is used to represent feature information of the second voice data and may also be understood as an emotion feature vector.
It should be appreciated that the above-described attention mechanism layer, in addition to weighting the second feature vector, may also be used to align frame-level speech features.
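As a concrete reading of this head, the following is an assumed PyTorch sketch: a bidirectional GRU interconnects and fuses the frame-level features, a simple additive attention layer produces the weights, and a Softmax (log-softmax) classifier outputs the emotion classes. The dimensions, the attention form, and the pooling are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

# Assumed sketch of the emotion classification head: BiGRU + attention
# weighting + Softmax classifier. Dimensions are illustrative.

class EmotionHead(nn.Module):
    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.attention = nn.Linear(2 * hidden, 1)   # per-frame attention score
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):        # x: (batch, time, in_dim), fused P3..P7 features
        h, _ = self.bigru(x)                          # interconnect/fuse over time
        w = torch.softmax(self.attention(h), dim=1)   # attention weights per frame
        second_vector = (w * h).sum(dim=1)            # weighted emotion feature vector
        return torch.log_softmax(self.classifier(second_vector), dim=-1)
```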
It should be noted that, various optional implementations described in the embodiments of the present invention may be implemented in combination with each other or implemented separately, and the embodiments of the present invention are not limited thereto.
Referring to fig. 6, fig. 6 is a structural diagram of a speech recognition network model training apparatus according to an embodiment of the present invention, and as shown in fig. 6, the speech recognition network model training apparatus 600 includes:
the training module 601 is configured to perform iterative training on the speech recognition network model to be trained by using the labeled sample data to obtain a speech recognition network model;
the to-be-trained voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series; the shallow feature fusion network layers are used for fusing the N feature vectors of different scales output by the N second residual layers to obtain N first feature vectors; the N first feature vectors are fused with the output of the time-delay neural network layer and the output of the first residual layer to obtain a second feature vector; the second feature vector is used for representing voiceprint feature information or emotion feature information; and N is an integer greater than 1.
Optionally, the fusing, by the shallow feature fusion network layer, of the N feature vectors of different scales output by the N second residual layers to obtain N first feature vectors includes:
the N-1 shallow feature fusion network layers are used for sampling and fusing N feature vectors of different scales output by the N second residual error layers to obtain N first feature vectors, the N first feature vectors comprise feature vectors output by a target residual error layer, and the target residual error layer is a second residual error layer adjacent to the first residual error layer.
Optionally, the N-1 shallow feature fusion network layers are connected, in one-to-one correspondence, to the N-1 second residual layers other than the target residual layer, and the 1st shallow feature fusion network layer is further connected to the target residual layer; the 1st shallow feature fusion network layer is configured to perform fusion processing on the feature vectors output by the two second residual layers connected to it to obtain a 1st fusion feature vector, and to determine the feature vector output by the target residual layer as the 1st first feature vector; and the i-th shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to it and the (i-1)-th fusion feature vector output by the (i-1)-th shallow feature fusion network layer to obtain an i-th fusion feature vector, and to determine the i-th fusion feature vector as the i-th first feature vector, wherein i is an integer greater than or equal to 2 and less than or equal to N-1, and N is an integer greater than or equal to 2;
or, when N is 2, the shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to the target residual layer and the feature vector output by the target residual layer to obtain a target fusion feature vector, and to determine both the feature vector output by the target residual layer and the target fusion feature vector as first feature vectors.
Optionally, the training module 601 includes:
the preprocessing unit is used for preprocessing the marked sample data to obtain a third feature vector;
the training unit is used for carrying out iterative training on the to-be-trained voice recognition network model by utilizing the third feature vector to obtain the voice recognition network model;
wherein, when the second feature vector is used for representing voiceprint feature information, the third feature vector is used for representing the voiceprint feature information of the sample data; when the second feature vector is used for representing emotion feature information, the third feature vector is used for representing emotion feature information of the sample data.
Optionally, the training unit is specifically configured to perform the following operations:
in the L-th iteration of the training process, inputting the third feature vector to the to-be-trained speech recognition network model to obtain the second feature vector, wherein L is a positive integer;
inputting the second feature vector into a Softmax classifier to obtain a classification result of the second feature vector, and computing a loss value between the classification result and the labeling result of the third feature vector, wherein the loss value is used for adjusting network parameters of the to-be-trained speech recognition network model;
and if the change of the loss value is smaller than a preset value, determining the currently trained voice recognition network model to be trained as the voice recognition network model.
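The training loop implied by these operations can be sketched as follows. This is an assumed PyTorch rendering: the optimizer, the learning rate, and the preset change threshold `eps` are illustrative choices, and `model` and `classifier` stand for the to-be-trained network and the Softmax classifier respectively.

```python
import torch
import torch.nn as nn

# Assumed sketch of iterative training with the loss-change stopping rule.

def train(model, classifier, loader, max_epochs=100, eps=1e-4):
    params = list(model.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()          # softmax classification loss
    previous = None
    for _ in range(max_epochs):
        total = 0.0
        for third_vector, label in loader:   # labeled sample features
            second_vector = model(third_vector)           # second feature vector
            loss = loss_fn(classifier(second_vector), label)
            optimizer.zero_grad()
            loss.backward()                  # loss adjusts network parameters
            optimizer.step()
            total += loss.item()
        if previous is not None and abs(previous - total) < eps:
            break                            # loss change below the preset value
        previous = total
    return model
```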
Optionally, the to-be-trained speech recognition network model further includes a normalization convolutional layer, where the normalization convolutional layer is configured to perform normalization processing on the third feature vector to obtain a fourth feature vector;
and the delay neural network layer is used for performing one-dimensional extension processing on the feature vector output by the first residual error layer to obtain a fifth feature vector with time information.
Optionally, when the second feature vector is used to represent emotion feature information, the to-be-trained speech recognition network model further includes a bidirectional gated recurrent neural network layer and an attention mechanism layer, where the bidirectional gated recurrent neural network layer is configured to perform inter-feature interconnection and fusion processing on the N first feature vectors, the fifth feature vector, and the feature vector output by the first residual layer to obtain the second feature vector, which is weighted by the attention mechanism layer and then output to the Softmax classifier to obtain an emotion classification result.
Optionally, when the third feature vector is used to represent voiceprint feature information of the sample data, the third feature vector includes an Fbank feature, a first-order difference feature, and a tone information feature; and when the third feature vector is used for representing emotion feature information of the sample data, the third feature vector comprises a Mel frequency spectrum feature, a first-order difference feature and a second-order difference feature.
The speech recognition network model training device provided by the embodiment of the invention can realize each process in the method embodiment of fig. 1, and is not repeated here to avoid repetition.
Referring to fig. 7, fig. 7 is a structural diagram of a speech recognition processing apparatus according to an embodiment of the present invention, and as shown in fig. 7, a speech recognition processing apparatus 700 includes:
a first preprocessing module 701, configured to preprocess first voice data to be recognized to obtain a sixth feature vector, where the sixth feature vector is used to represent voiceprint feature information of the first voice data;
a first input module 702, configured to input the sixth feature vector to a speech recognition network model, so as to obtain a voiceprint feature vector to be confirmed;
a second input module 703, configured to input the voiceprint feature vector to a preset classification model to obtain a first classification result;
a determining module 704, configured to determine that the first voice data is voice data of a first user when the first classification result matches a reference result corresponding to the first user;
the voice recognition network model comprises a delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the delay neural network layer are sequentially connected in series; the N second residual layers and the first residual layer are used for performing convolution processing on the sixth feature vector and outputting feature vectors of different scales at different residual layers; the N-1 shallow feature fusion network layers are used for fusing the N feature vectors of different scales output by the N second residual layers to obtain N seventh feature vectors; the voiceprint feature vector is obtained by fusing the N seventh feature vectors with the output of the delay neural network layer and the output of the first residual layer; and N is an integer greater than 1.
Optionally, the fusing, by the shallow feature fusion network layer, of the N feature vectors of different scales output by the N second residual layers to obtain N seventh feature vectors includes:
the N-1 shallow feature fusion network layers are used for sampling and fusing N feature vectors of different scales output by the N second residual error layers to obtain N seventh feature vectors, the N seventh feature vectors comprise feature vectors output by a target residual error layer, and the target residual error layer is a second residual error layer adjacent to the first residual error layer.
Optionally, the N-1 shallow feature fusion network layers are connected, in one-to-one correspondence, to the N-1 second residual layers other than the target residual layer, and the 1st shallow feature fusion network layer is further connected to the target residual layer; the 1st shallow feature fusion network layer is configured to perform fusion processing on the feature vectors output by the two second residual layers connected to it to obtain a 1st fusion feature vector, and to determine the feature vector output by the target residual layer as the 1st seventh feature vector; and the i-th shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to it and the (i-1)-th fusion feature vector output by the (i-1)-th shallow feature fusion network layer to obtain an i-th fusion feature vector, and to determine the i-th fusion feature vector as the i-th seventh feature vector, wherein i is an integer greater than or equal to 2 and less than or equal to N-1, and N is an integer greater than or equal to 2;
or, when N is 2, the shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to the target residual layer and the feature vector output by the target residual layer to obtain a target fusion feature vector, and to determine both the feature vector output by the target residual layer and the target fusion feature vector as seventh feature vectors.
Optionally, the speech recognition processing device 700 further includes:
the acquisition module is used for acquiring reference voice data input by the first user during registration;
the second preprocessing module 701 is further configured to preprocess the voice data to obtain an eighth feature vector, where the eighth feature vector is used to represent voiceprint feature information of the reference voice data;
the first input module 702 is further configured to input the eighth feature vector to the speech recognition network model, so as to obtain a reference voiceprint feature vector of the first user;
and the reference result is a result of classifying the reference voiceprint feature vector based on the preset classification model.
Optionally, the speech recognition network model further includes a normalization convolutional layer, where the normalization convolutional layer is configured to perform normalization processing on the sixth feature vector to obtain a ninth feature vector;
and the delay neural network layer is used for performing one-dimensional extension processing on the feature vector output by the first residual error layer to obtain a tenth feature vector with time information.
Optionally, the voiceprint feature vector is a feature vector obtained by performing subsequent processing on the N seventh feature vectors, the tenth feature vector, and the feature vector output by the first residual error layer.
Optionally, the sixth feature vector includes Fbank features, first order difference features and tone information features.
The speech recognition processing apparatus provided in the embodiment of the present invention can implement each process in the method embodiment of fig. 4, and is not described here again to avoid repetition.
Referring to fig. 8, fig. 8 is a block diagram of a speech recognition processing apparatus according to an embodiment of the present invention, and as shown in fig. 8, the speech recognition processing apparatus 800 includes:
a second preprocessing module 801, configured to preprocess second voice data to be recognized to obtain an eleventh feature vector, where the eleventh feature vector is used to represent emotion feature information of the second voice data;
a third input module 802, configured to input the eleventh feature vector to a speech recognition network model, so as to obtain an emotion classification result of the second speech data;
the voice recognition network model comprises a delay neural network layer, a first residual layer, N second residual layers, N-1 shallow feature fusion network layers and a classification network layer, wherein the N second residual layers, the first residual layer and the delay neural network layer are sequentially connected in series; the N second residual layers and the first residual layer are used for performing convolution processing on the eleventh feature vector and outputting feature vectors of different scales at different residual layers; the shallow feature fusion network layers are used for fusing the N feature vectors of different scales output by the N second residual layers to obtain N twelfth feature vectors; and the emotion classification network is used for fusing the N twelfth feature vectors, the output of the delay neural network layer and the output of the first residual layer and then performing emotion classification to obtain the emotion classification result, where N is an integer greater than 1.
Optionally, the fusing, by the shallow feature fusion network layer, of the N feature vectors of different scales output by the N second residual layers to obtain N twelfth feature vectors includes:
the N-1 shallow feature fusion network layers are used for sampling and fusing N feature vectors of different scales output by the N second residual error layers to obtain N twelfth feature vectors, the N twelfth feature vectors comprise feature vectors output by a target residual error layer, and the target residual error layer is a second residual error layer adjacent to the first residual error layer.
Optionally, the N-1 shallow feature fusion network layers are connected, in one-to-one correspondence, to the N-1 second residual layers other than the target residual layer, and the 1st shallow feature fusion network layer is further connected to the target residual layer; the 1st shallow feature fusion network layer is configured to perform fusion processing on the feature vectors output by the two second residual layers connected to it to obtain a 1st fusion feature vector, and to determine the feature vector output by the target residual layer as the 1st twelfth feature vector; and the i-th shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to it and the (i-1)-th fusion feature vector output by the (i-1)-th shallow feature fusion network layer to obtain an i-th fusion feature vector, and to determine the i-th fusion feature vector as the i-th twelfth feature vector, wherein i is an integer greater than or equal to 2 and less than or equal to N-1, and N is an integer greater than or equal to 2;
or, when N is 2, the shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to the target residual layer and the feature vector output by the target residual layer to obtain a target fusion feature vector, and to determine both the feature vector output by the target residual layer and the target fusion feature vector as twelfth feature vectors.
Optionally, the speech recognition network model further includes a normalization convolutional layer, where the normalization convolutional layer is configured to perform normalization processing on the eleventh feature vector to obtain a thirteenth feature vector;
and the delay neural network layer is used for performing one-dimensional extension processing on the feature vector output by the first residual error layer to obtain a fourteenth feature vector with time information.
Optionally, the emotion classification network includes a bidirectional gated recurrent neural network layer, an attention mechanism layer, and a Softmax classifier, where the bidirectional gated recurrent neural network layer is configured to perform inter-feature interconnection and fusion processing on the N twelfth feature vectors, the fourteenth feature vector, and the feature vector output by the first residual layer to obtain a second feature vector, which is weighted by the attention mechanism layer and then output to the Softmax classifier for emotion classification.
Optionally, the eleventh feature vector comprises mel-frequency spectral features, first order difference features and second order difference features.
The speech recognition processing apparatus provided in the embodiment of the present invention can implement each process in the method embodiment of fig. 5, and is not described here again to avoid repetition.
Fig. 9 is a schematic diagram of a hardware structure of an electronic device implementing various embodiments of the present invention.
The electronic device 900 includes, but is not limited to: a radio frequency unit 901, a network module 902, an audio output unit 903, an input unit 904, a sensor 905, a display unit 906, a user input unit 907, an interface unit 908, a memory 909, a processor 910, and a power supply 911. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 9 does not constitute a limitation of the electronic device, and that the electronic device may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
The processor 910 is configured to perform iterative training on the to-be-trained speech recognition network model by using the labeled sample data to obtain a speech recognition network model;
the to-be-trained voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series; the shallow feature fusion network layers are used for fusing the N feature vectors of different scales output by the N second residual layers to obtain N first feature vectors; the N first feature vectors are fused with the output of the time-delay neural network layer and the output of the first residual layer to obtain a second feature vector; the second feature vector is used for representing voiceprint feature information or emotion feature information; and N is an integer greater than 1.
Alternatively, the processor 910 is configured to perform the following operations:
preprocessing first voice data to be recognized to obtain a sixth feature vector, wherein the sixth feature vector is used for representing voiceprint feature information of the first voice data;
inputting the sixth feature vector into a voice recognition network model to obtain a voiceprint feature vector to be confirmed;
inputting the voiceprint feature vector into a preset classification model to obtain a first classification result;
determining that the first voice data is voice data of a first user under the condition that the first classification result is matched with a reference result corresponding to the first user;
the voice recognition network model comprises a delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the delay neural network layer are sequentially connected in series; the N second residual layers and the first residual layer are used for performing convolution processing on the sixth feature vector and outputting feature vectors of different scales at different residual layers; the shallow feature fusion network layers are used for fusing the N feature vectors of different scales output by the N second residual layers to obtain N seventh feature vectors; the voiceprint feature vector is obtained by fusing the N seventh feature vectors with the output of the delay neural network layer and the output of the first residual layer; and N is an integer greater than 1.
Alternatively, the processor 910 is configured to perform the following operations:
preprocessing second voice data to be recognized to obtain an eleventh feature vector, wherein the eleventh feature vector is used for representing emotion feature information of the second voice data;
inputting the eleventh feature vector into a voice recognition network model to obtain an emotion classification result of the second voice data;
the voice recognition network model comprises a delay neural network layer, a first residual layer, N second residual layers, N-1 shallow feature fusion network layers and a classification network layer, wherein the N second residual layers, the first residual layer and the delay neural network layer are sequentially connected in series; the N second residual layers and the first residual layer are used for performing convolution processing on the eleventh feature vector and outputting feature vectors of different scales at different residual layers; the shallow feature fusion network layers are used for fusing the N feature vectors of different scales output by the N second residual layers to obtain N twelfth feature vectors; and the emotion classification network is used for fusing the N twelfth feature vectors, the output of the delay neural network layer and the output of the first residual layer and then performing emotion classification to obtain the emotion classification result, where N is an integer greater than 1.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 901 may be used for receiving and sending signals during a message transmission and reception process or a call process, and specifically, after receiving downlink data from a base station, the downlink data is processed by the processor 910; in addition, the uplink data is transmitted to the base station. Generally, the radio frequency unit 901 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 901 can also communicate with a network and other devices through a wireless communication system.
The electronic device provides wireless broadband internet access to the user via the network module 902, such as assisting the user in sending and receiving e-mails, browsing web pages, and accessing streaming media.
The audio output unit 903 may convert audio data received by the radio frequency unit 901 or the network module 902 or stored in the memory 909 into an audio signal and output as sound. Also, the audio output unit 903 may provide audio output related to a specific function performed by the electronic device 900 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 903 includes a speaker, a buzzer, a receiver, and the like.
The input unit 904 is used to receive audio or video signals. The input unit 904 may include a graphics processing unit (GPU) 9041 and a microphone 9042; the graphics processor 9041 processes image data of still pictures or video obtained by an image capturing device (such as a camera) in a video capture mode or an image capture mode. The processed image frames may be displayed on the display unit 906. The image frames processed by the graphics processor 9041 may be stored in the memory 909 (or other storage medium) or transmitted via the radio frequency unit 901 or the network module 902. The microphone 9042 can receive sounds and process them into audio data. In the phone call mode, the processed audio data may be converted into a format transmittable to a mobile communication base station via the radio frequency unit 901 for output.
The electronic device 900 also includes at least one sensor 905, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 9061 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 9061 and/or the backlight when the electronic device 900 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of an electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 905 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which are not described in detail herein.
The display unit 906 is used to display information input by the user or information provided to the user. The Display unit 906 may include a Display panel 9061, and the Display panel 9061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 907 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 907 includes a touch panel 9071 and other input devices 9072. The touch panel 9071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations by a user on or near the touch panel 9071 using a finger, a stylus, or any other suitable object or accessory). The touch panel 9071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 910, and receives and executes commands from the processor 910. In addition, the touch panel 9071 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The user input unit 907 may include other input devices 9072 in addition to the touch panel 9071. Specifically, the other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described herein again.
Further, the touch panel 9071 may be overlaid on the display panel 9061. When the touch panel 9071 detects a touch operation on or near it, the operation is transmitted to the processor 910 to determine the type of the touch event, and the processor 910 then provides a corresponding visual output on the display panel 9061 according to the type of the touch event. Although in fig. 9 the touch panel 9071 and the display panel 9061 are two independent components implementing the input and output functions of the electronic device, in some embodiments the touch panel 9071 and the display panel 9061 may be integrated to implement the input and output functions of the electronic device, which is not limited herein.
The interface unit 908 is an interface for connecting an external device to the electronic apparatus 900. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 908 may be used to receive input from external devices (e.g., data information, power, etc.) and transmit the received input to one or more elements within the electronic device 900 or may be used to transmit data between the electronic device 900 and external devices.
The memory 909 may be used to store software programs as well as various data. The memory 909 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 909 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 910 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 909 and calling data stored in the memory 909, thereby performing overall monitoring of the electronic device. Processor 910 may include one or more processing units; preferably, the processor 910 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 910.
The electronic device 900 may further include a power supply 911 (e.g., a battery) for supplying power to various components, and preferably, the power supply 911 may be logically connected to the processor 910 through a power management system, so as to manage charging, discharging, and power consumption management functions through the power management system.
In addition, the electronic device 900 includes some functional modules that are not shown, and thus are not described in detail herein.
Preferably, an embodiment of the present invention further provides an electronic device, which includes a processor 910, a memory 909, and a computer program stored in the memory 909 and capable of running on the processor 910, and when the computer program is executed by the processor 910, the computer program implements each process of the above-mentioned embodiment of the speech recognition network model training method or the speech recognition processing method, and can achieve the same technical effect, and in order to avoid repetition, it is not described herein again.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned speech recognition network model training method or speech recognition processing method embodiment, and can achieve the same technical effect, and is not described herein again to avoid repetition. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (18)

1. A method for training a speech recognition network model is characterized by comprising the following steps:
carrying out iterative training on the voice recognition network model to be trained by using the labeled sample data to obtain a voice recognition network model;
the to-be-trained voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are sequentially connected in series; the shallow feature fusion network layers are used for fusing the N feature vectors of different scales output by the N second residual layers to obtain N first feature vectors; the N first feature vectors are fused with the output of the time-delay neural network layer and the output of the first residual layer to obtain a second feature vector; the second feature vector is used for representing voiceprint feature information or emotion feature information; and N is an integer greater than 1;
the n-th second residual layer is used for processing the feature vector output by the (n-1)-th second residual layer, the first residual layer is used for processing the feature vector output by the N-th second residual layer, and the time-delay neural network layer is used for processing the feature vector output by the first residual layer.
2. The method according to claim 1, wherein the fusing, by the shallow feature fusion network layer, of the N feature vectors of different scales output by the N second residual layers to obtain N first feature vectors includes:
the N-1 shallow feature fusion network layers are used for sampling and fusing N feature vectors of different scales output by the N second residual error layers to obtain N first feature vectors, the N first feature vectors comprise feature vectors output by a target residual error layer, and the target residual error layer is a second residual error layer adjacent to the first residual error layer.
3. The method according to claim 2, wherein the N-1 shallow feature fusion network layers are connected, in one-to-one correspondence, to the N-1 second residual layers other than the target residual layer, and the 1st shallow feature fusion network layer is further connected to the target residual layer; the 1st shallow feature fusion network layer is configured to perform fusion processing on the feature vectors output by the two second residual layers connected to it to obtain a 1st fusion feature vector, and to determine the feature vector output by the target residual layer as the 1st first feature vector; and the i-th shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to it and the (i-1)-th fusion feature vector output by the (i-1)-th shallow feature fusion network layer to obtain an i-th fusion feature vector, and to determine the i-th fusion feature vector as the i-th first feature vector, wherein i is an integer greater than or equal to 2 and less than or equal to N-1, and N is an integer greater than or equal to 2;
or, when N is 2, the shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to the target residual layer and the feature vector output by the target residual layer to obtain a target fusion feature vector, and to determine both the feature vector output by the target residual layer and the target fusion feature vector as first feature vectors.
4. The method according to claim 1, wherein the step of iteratively training the to-be-trained speech recognition network model by using the labeled sample data to obtain the speech recognition network model comprises:
preprocessing the labeled sample data to obtain a third feature vector;
performing iterative training on the to-be-trained voice recognition network model by using the third feature vector to obtain the voice recognition network model;
wherein, when the second feature vector is used for representing voiceprint feature information, the third feature vector is used for representing the voiceprint feature information of the sample data; when the second feature vector is used for representing emotion feature information, the third feature vector is used for representing emotion feature information of the sample data.
5. The method according to claim 4, wherein the to-be-trained speech recognition network model further includes a normalized convolutional layer, and the normalized convolutional layer is used for performing normalization processing on the third feature vector to obtain a fourth feature vector;
and the delay neural network layer is used for performing one-dimensional extension processing on the feature vector output by the first residual error layer to obtain a fifth feature vector with time information.
6. The method according to claim 5, wherein, when the second feature vector is used to represent the emotion feature information, the to-be-trained speech recognition network model further includes a bidirectional gated recurrent neural network layer and an attention mechanism layer, and the bidirectional gated recurrent neural network layer is configured to perform inter-feature interconnection and fusion processing on the N first feature vectors, the fifth feature vector, and the feature vector output by the first residual layer to obtain the second feature vector, which is weighted by the attention mechanism layer and then output to a Softmax classifier to obtain an emotion classification result.
7. A speech recognition processing method, comprising:
preprocessing first voice data to be recognized to obtain a sixth feature vector, wherein the sixth feature vector is used for representing voiceprint feature information of the first voice data;
inputting the sixth feature vector into a voice recognition network model to obtain a voiceprint feature vector to be confirmed;
inputting the voiceprint feature vector into a preset classification model to obtain a first classification result;
determining that the first voice data is voice data of a first user under the condition that the first classification result is matched with a reference result corresponding to the first user;
the voice recognition network model comprises a delay neural network layer, a first residual layer, N second residual layers and N-1 shallow feature fusion network layers, wherein the N second residual layers, the first residual layer and the delay neural network layer are sequentially connected in series; the N second residual layers and the first residual layer are used for performing convolution processing on the sixth feature vector and outputting feature vectors of different scales at different residual layers; the shallow feature fusion network layers are used for fusing the N feature vectors of different scales output by the N second residual layers to obtain N seventh feature vectors; the voiceprint feature vector is obtained by fusing the N seventh feature vectors with the output of the delay neural network layer and the output of the first residual layer; and N is an integer greater than 1;
the n-th second residual layer is used for processing the feature vector output by the (n-1)-th second residual layer, the first residual layer is used for processing the feature vector output by the N-th second residual layer, and the delay neural network layer is used for processing the feature vector output by the first residual layer.
8. The method according to claim 7, wherein the fusing, by the shallow feature fusion network layer, of the N feature vectors of different scales output by the N second residual layers to obtain the N seventh feature vectors includes:
the N-1 shallow feature fusion network layers are used for sampling and fusing N feature vectors of different scales output by the N second residual error layers to obtain N seventh feature vectors, the N seventh feature vectors comprise feature vectors output by a target residual error layer, and the target residual error layer is a second residual error layer adjacent to the first residual error layer.
9. The method according to claim 8, wherein the N-1 shallow feature fusion network layers are connected, in one-to-one correspondence, to the N-1 second residual layers other than the target residual layer, and the 1st shallow feature fusion network layer is further connected to the target residual layer; the 1st shallow feature fusion network layer is configured to perform fusion processing on the feature vectors output by the two second residual layers connected to it to obtain a 1st fusion feature vector, and to determine the feature vector output by the target residual layer as the 1st seventh feature vector; and the i-th shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to it and the (i-1)-th fusion feature vector output by the (i-1)-th shallow feature fusion network layer to obtain an i-th fusion feature vector, and to determine the i-th fusion feature vector as the i-th seventh feature vector, wherein i is an integer greater than or equal to 2 and less than or equal to N-1, and N is an integer greater than 2;
or, when N is 2, the shallow feature fusion network layer is configured to perform fusion processing on the feature vector output by the second residual layer connected to the target residual layer and the feature vector output by the target residual layer to obtain a target fusion feature vector, and to determine both the feature vector output by the target residual layer and the target fusion feature vector as seventh feature vectors.
10. The method according to claim 7, wherein before the inputting of the sixth feature vector into the voice recognition network model to obtain the voiceprint feature vector to be confirmed, the method further comprises:
acquiring reference voice data input by the first user during registration;
preprocessing the reference voice data to obtain an eighth feature vector, wherein the eighth feature vector is used for representing voiceprint feature information of the reference voice data;
inputting the eighth feature vector into the voice recognition network model to obtain a reference voiceprint feature vector of the first user;
and the reference result is a result of classifying the reference voiceprint feature vector based on the preset classification model.
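Claim 10 is an enrollment step. A minimal sketch of that flow is given below, with preprocess, model and classifier passed in as placeholder callables, since the patent names no concrete APIs:

def enroll(reference_wave, preprocess, model, classifier):
    eighth = preprocess(reference_wave)            # voiceprint feature information
    ref_voiceprint = model(eighth)                 # reference voiceprint feature vector
    reference_result = classifier(ref_voiceprint)  # stored as the reference result
    return reference_result                        # later matched against first voice data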
11. The method according to claim 7, wherein the voice recognition network model further comprises a normalized convolutional layer, and the normalized convolutional layer is configured to normalize the sixth feature vector to obtain a ninth feature vector;
and the time-delay neural network layer is configured to perform one-dimensional extension processing on the feature vector output by the first residual layer to obtain a tenth feature vector carrying time information.
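One possible reading of claim 11, assuming the normalized convolutional layer is a convolution followed by batch normalization and the one-dimensional extension flattens the frequency axis while preserving the time axis:

import torch.nn as nn

# assumed form of the normalized convolutional layer
norm_conv = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32))

def one_dim_extend(residual_out):
    # (batch, channels, freq, time) -> (batch, channels * freq, time);
    # the time axis survives, so the result carries time information.
    b, c, f, t = residual_out.shape
    return residual_out.reshape(b, c * f, t)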
12. A voice recognition processing method, comprising:
preprocessing second voice data to be recognized to obtain an eleventh feature vector, wherein the eleventh feature vector is used for representing emotion feature information of the second voice data;
inputting the eleventh feature vector into a voice recognition network model to obtain an emotion classification result of the second voice data;
the voice recognition network model comprises a time-delay neural network layer, a first residual layer, N second residual layers, N-1 shallow feature fusion network layers and a classification network layer, wherein the N second residual layers, the first residual layer and the time-delay neural network layer are connected in series in sequence; the N second residual layers and the first residual layer are configured to perform convolution processing on the eleventh feature vector and to output feature vectors of different scales at different residual layers; the shallow feature fusion network layers are configured to fuse the N feature vectors of different scales output by the N second residual layers to obtain N twelfth feature vectors; the classification network layer is configured to fuse the N twelfth feature vectors, the output of the time-delay neural network layer and the output of the first residual layer, and to perform emotion classification on the fused result to obtain the emotion classification result; and N is an integer greater than 1;
the n-th second residual layer is configured to process the feature vector output by the (n-1)-th second residual layer, where 2 ≤ n ≤ N; the first residual layer is configured to process the feature vector output by the N-th second residual layer; and the time-delay neural network layer is configured to process the feature vector output by the first residual layer.
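End to end, the emotion path of claim 12 mirrors the voiceprint path but ends in a classification rather than an embedding. A usage-level sketch with placeholder callables:

def recognize_emotion(wave, preprocess, emotion_model):
    eleventh = preprocess(wave)     # emotion feature information of the utterance
    return emotion_model(eleventh)  # emotion classification result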
13. The method according to claim 12, wherein fusing, by the shallow feature fusion network layers, the N feature vectors of different scales output by the N second residual layers to obtain the N twelfth feature vectors comprises:
the N-1 shallow feature fusion network layers are configured to sample and fuse the N feature vectors of different scales output by the N second residual layers to obtain the N twelfth feature vectors, wherein the N twelfth feature vectors comprise the feature vector output by a target residual layer, and the target residual layer is the second residual layer adjacent to the first residual layer.
14. The method according to claim 13, wherein the N-1 shallow feature fusion network layers are connected in one-to-one correspondence to the N-1 second residual layers other than the target residual layer, and the 1st shallow feature fusion network layer is further connected to the target residual layer; the 1st shallow feature fusion network layer is configured to fuse the feature vectors output by the two second residual layers connected to it to obtain a 1st fused feature vector, and the feature vector output by the target residual layer is determined as the 1st twelfth feature vector; the i-th shallow feature fusion network layer is configured to fuse the feature vector output by the second residual layer connected to it with the (i-1)-th fused feature vector output by the (i-1)-th shallow feature fusion network layer to obtain an i-th fused feature vector, and the i-th fused feature vector is determined as the i-th twelfth feature vector, where i is an integer greater than or equal to 2 and less than or equal to N-1, and N is an integer greater than 2;
or, when N is 2, the shallow feature fusion network layer is configured to fuse the feature vector output by the second residual layer connected to the target residual layer with the feature vector output by the target residual layer to obtain a target fused feature vector, and both the feature vector output by the target residual layer and the target fused feature vector are determined as the twelfth feature vectors.
15. The method according to claim 12, wherein the voice recognition network model further comprises a normalized convolutional layer, and the normalized convolutional layer is configured to normalize the eleventh feature vector to obtain a thirteenth feature vector;
and the time-delay neural network layer is configured to perform one-dimensional extension processing on the feature vector output by the first residual layer to obtain a fourteenth feature vector carrying time information.
16. The method according to claim 15, wherein the classification network layer comprises a bidirectional gated recurrent neural network layer, an attention mechanism layer and a Softmax classifier; the bidirectional gated recurrent neural network layer is configured to interconnect and fuse the N twelfth feature vectors, the fourteenth feature vector and the feature vector output by the first residual layer to obtain a second feature vector; and the second feature vector is weighted by the attention mechanism layer and then output to the Softmax classifier for emotion classification.
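A sketch of the classification head of claim 16, assuming each of the fused vectors is pooled to a fixed length and stacked into a sequence for the bidirectional gated recurrent layer; the hidden size and the number of emotion classes are illustrative only:

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, dim=128, n_classes=6):
        super().__init__()
        self.bigru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * dim, 1)        # attention mechanism layer
        self.fc = nn.Linear(2 * dim, n_classes)  # feeds the Softmax classifier

    def forward(self, feats):
        # feats: (batch, seq, dim), the sequence built from the N twelfth
        # feature vectors, the fourteenth feature vector and the first
        # residual layer's output.
        h, _ = self.bigru(feats)                # interconnect and fuse the features
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over the sequence
        pooled = (w * h).sum(dim=1)             # weighted fusion of the sequence
        return torch.softmax(self.fc(pooled), dim=-1)  # emotion class probabilities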
17. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the voice recognition network model training method according to any one of claims 1 to 6, or implements the steps of the voice recognition processing method according to any one of claims 7 to 16.
18. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the voice recognition network model training method according to any one of claims 1 to 6, or implements the steps of the voice recognition processing method according to any one of claims 7 to 16.
CN202011577841.0A 2020-12-28 2020-12-28 Network model training method, voice recognition processing method and related equipment Active CN112735388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011577841.0A CN112735388B (en) 2020-12-28 2020-12-28 Network model training method, voice recognition processing method and related equipment

Publications (2)

Publication Number Publication Date
CN112735388A (en) 2021-04-30
CN112735388B (en) 2021-11-09

Family

ID=75606218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011577841.0A Active CN112735388B (en) 2020-12-28 2020-12-28 Network model training method, voice recognition processing method and related equipment

Country Status (1)

Country Link
CN (1) CN112735388B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327584B (en) * 2021-05-28 2024-02-27 平安科技(深圳)有限公司 Language identification method, device, equipment and storage medium
CN117234429B (en) * 2023-11-09 2024-01-26 联和存储科技(江苏)有限公司 Writing-in and erasing speed analysis method and device for memory chip

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003524792A (en) * 1998-09-01 2003-08-19 スイスコム アーゲー Speech recognition system and method
CN108022587B (en) * 2017-12-15 2021-03-05 深圳市声扬科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN109147774B (en) * 2018-09-19 2021-07-20 华南理工大学 Improved time-delay neural network acoustic model
CN111681645B (en) * 2019-02-25 2023-03-31 北京嘀嘀无限科技发展有限公司 Emotion recognition model training method, emotion recognition device and electronic equipment
CN110992942B (en) * 2019-11-29 2022-07-08 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
CN110942777B (en) * 2019-12-05 2022-03-08 出门问问信息科技有限公司 Training method and device for voiceprint neural network model and storage medium
CN111401375B (en) * 2020-03-09 2022-12-30 苏宁云计算有限公司 Text recognition model training method, text recognition device and text recognition equipment
CN111429948B (en) * 2020-03-27 2023-04-28 南京工业大学 Voice emotion recognition model and method based on attention convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant