WO2020156153A1 - Audio recognition method, system, and machine device - Google Patents

Audio recognition method, system, and machine device

Info

Publication number
WO2020156153A1
WO2020156153A1 (PCT/CN2020/072063; CN2020072063W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
label
given
data stream
loss function
Prior art date
Application number
PCT/CN2020/072063
Other languages
English (en)
French (fr)
Inventor
苏丹 (Su Dan)
王珺 (Wang Jun)
陈杰 (Chen Jie)
俞栋 (Yu Dong)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority to EP20748958.4A priority Critical patent/EP3920178A4/en
Publication of WO2020156153A1 publication Critical patent/WO2020156153A1/zh
Priority to US17/230,515 priority patent/US11900917B2/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2132 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Definitions

  • This application relates to the field of computer acoustics application technology, in particular to audio recognition methods, systems, and machine devices.
  • The realization of audio recognition in an acoustic scene, that is, the execution of various audio classification tasks such as automatic speech recognition based on audio recognition, is often limited by the variability of the acoustic scene, which makes it difficult to apply audio recognition to various audio classification tasks.
  • The variability of the acoustic scene comes from many aspects, such as speaker, accent, background noise, reverberation, channel, and recording conditions.
  • Therefore, this application provides a neural network training method, system, and machine device for audio recognition.
  • An audio recognition method includes:
  • acquiring an audio data stream for audio recognition, the audio data stream including audio data corresponding to several time frames;
  • for the different audio data of each time frame in the audio data stream, performing feature extraction through each layer of the network in the neural network to obtain the depth features output for the corresponding time frame;
  • for a given label in the labeling data, fusing, through the depth features under the set loss function, the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the given label; and
  • obtaining through fusion the loss function values relative to a series of given labels in the labeling data, and obtaining the audio labeling result for the audio data stream.
  • An audio recognition system includes:
  • a data stream acquisition module, used to acquire an audio data stream for audio recognition, the audio data stream including audio data corresponding to several time frames;
  • a feature extraction module, used to perform, for the different audio data of each time frame in the audio data stream, feature extraction through each layer of the network in the neural network to obtain the depth features output for the corresponding time frame;
  • a fusion calculation module, used to fuse, through the depth features under the set loss function, the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to a given label in the labeling data;
  • a result obtaining module, used to obtain through fusion the loss function values relative to a series of given labels in the labeling data, and obtain the audio labeling result for the audio data stream.
  • A machine device includes:
  • a processor; and a memory in which computer-readable instructions are stored, the aforementioned method being implemented when the computer-readable instructions are executed by the processor.
  • A storage medium includes a stored program, the above method being executed when the program runs.
  • A computer program product comprises instructions which, when run on a computer, cause the computer to execute the above-mentioned method.
  • An audio data stream is obtained for neural network training for audio recognition.
  • This audio data stream includes audio data corresponding to several time frames.
  • In the neural network being trained, feature extraction is performed layer by layer on the different audio data of each time frame to obtain the depth features output for the corresponding time frame; these depth features characterize the audio data stream for labeling it. On this basis, for a given label in the labeling data, the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the given label are fused through the depth features under the set loss function, and the parameters in the neural network are finally updated with the loss function value obtained by the fusion.
  • The parameters of each layer of the network are updated based on the loss function value obtained by the fusion.
  • Fusing the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the given label improves the robustness of the resulting neural network to acoustic conditions that are unseen during training or vary greatly.
  • The inter-class confusion measurement index of the audio data stream relative to a given label ensures the inter-class discrimination of the depth features in audio recognition, while the intra-class distance penalty value relative to the given label enhances the discrimination performance of the extracted depth features. The fusion of the two therefore ensures that the depth features are both discriminative between classes and compactly distributed within classes, improving the robustness of the resulting neural network to unseen and highly variable acoustic conditions, and thereby effectively improving the performance of audio recognition.
  • Figure 1 is a schematic diagram of the implementation environment involved in this application.
  • Fig. 2 is a block diagram showing the hardware structure of an audio recognition terminal according to an exemplary embodiment
  • Fig. 3 is a flowchart showing a neural network training method for audio recognition according to an exemplary embodiment
  • Fig. 4 is a flowchart of an audio recognition method according to another exemplary embodiment
  • Fig. 5 is a flowchart illustrating step 350 according to the embodiment corresponding to Fig. 3;
  • Fig. 6 is a flowchart illustrating step 350 in another exemplary embodiment according to the embodiment corresponding to Fig. 3;
  • Fig. 7 is a flowchart illustrating step 370 according to the embodiment corresponding to Fig. 3;
  • Fig. 8 is a schematic diagram showing a network architecture of a neural network in an automatic speech recognition system according to an exemplary embodiment
  • Fig. 9 is a schematic diagram showing forward propagation and backward propagation error signal flow when fusion loss function supervises training neural network according to an exemplary embodiment
  • Fig. 10 is a block diagram showing an audio recognition system according to an exemplary embodiment
  • Fig. 11 is a block diagram of a fusion calculation module according to the embodiment corresponding to Fig. 10;
  • Fig. 12 is a block diagram of another exemplary embodiment of the fusion computing module according to the embodiment corresponding to Fig. 10;
  • Fig. 13 is a block diagram showing an update module according to the embodiment corresponding to Fig. 10.
  • Figure 1 is a schematic diagram of the implementation environment involved in this application.
  • The implementation environment includes an audio source 110 and an audio recognition terminal 130.
  • With audio from the audio source 110, for example a piece of speech, neural network training is performed on the audio recognition terminal 130.
  • The audio source 110 may be a speaker or a terminal device: a piece of voice is output to the audio recognition terminal 130 through the speaker's speech, or a piece of audio is output to the audio recognition terminal 130 through audio playback performed by a terminal device.
  • The audio recognition terminal 130 may be a smart speaker, a smart TV, an online voice recognition system, etc.
  • The audio source 110 provides the audio data stream used as training data for the training of the neural network.
  • The neural network training logic for audio recognition implemented in this application is applied to the audio recognition terminal 130 to perform neural network training on the audio input by the audio source 110. It should be understood that the specific framework of the implementation environment is strongly related to the deployment scenario; different scenarios lead to different architecture deployments of the implementation environment beyond the audio source 110 and the audio recognition terminal 130.
  • The audio recognition terminal 130 faces various audio sources 110, for example the devices where various applications are located, which provide the audio recognition terminal 130 with audio data streams for neural network training.
  • The trained neural network can be applied in many scenarios, for example audio monitoring, speaker recognition, and human-computer interaction in security surveillance, which are not listed exhaustively here, to achieve audio recognition in those scenarios.
  • Fig. 2 is a block diagram showing the hardware structure of an audio recognition terminal according to an exemplary embodiment.
  • the audio recognition terminal may be a server, of course, it may also be a terminal device with excellent computing capabilities.
  • Fig. 2 is a block diagram showing the hardware structure of a server serving as an audio recognition terminal according to an exemplary embodiment.
  • The server 200 is only an example adapted to the present disclosure and cannot be considered as limiting the scope of use of the present disclosure in any way.
  • Nor can the server 200 be interpreted as needing to depend on, or to have, one or more of the components of the exemplary server 200 shown in Fig. 2.
  • the hardware structure of the server 200 may vary greatly due to differences in configuration or performance.
  • The server 200 includes: a power supply 210, an interface 230, at least one storage medium 250, and at least one central processing unit (CPU) 270.
  • the power supply 210 is used to provide working voltage for each hardware device on the server 200.
  • the interface 230 includes at least one wired or wireless network interface 231, at least one serial-to-parallel conversion interface 233, at least one input/output interface 235, at least one USB interface 237, etc., for communicating with external devices.
  • The storage medium 250 can be a random access storage medium, a magnetic disk, an optical disc, etc.
  • the resources stored on it include an operating system 251, application programs 253, data 255, etc., and the storage method can be short-term storage or permanent storage.
  • the operating system 251 is used to manage and control the hardware devices and application programs 253 on the server 200 to realize the calculation and processing of the massive data 255 by the central processing unit 270. It can be Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM , FreeBSDTM, etc.
  • The application program 253 is a computer program that completes at least one specific task based on the operating system 251. It may include at least one module (not shown in Fig. 2), and each module may contain a series of operation instructions for the server 200.
  • the data 255 may be photos, pictures, etc. stored in the disk.
  • the central processing unit 270 may include one or more processors, and is configured to communicate with the storage medium 250 through a bus, and is used for computing and processing the massive data 255 in the storage medium 250.
  • the server 200 applicable to the present disclosure will read a series of operation instructions stored in the storage medium 250 through the central processing unit 270 to perform audio recognition.
  • Fig. 3 is a flow chart showing a neural network training method for audio recognition according to an exemplary embodiment.
  • the neural network training method for realizing audio recognition includes at least the following steps.
  • step 310 an audio data stream is obtained for neural network training for audio recognition, and the audio data stream includes audio data corresponding to several time frames, respectively.
  • the audio data stream corresponding to the audio can be obtained first, so that the audio data stream can be used to perform the neural network training process later.
  • the audio data stream describes the audio content and also reflects the speaker who outputs the audio content.
  • The audio data stream is composed of audio data frame by frame; therefore, the audio data stream contains audio data corresponding to several time frames. These audio data form a time sequence, that is, the audio data stream corresponds to audio sequence data constructed in a certain time order.
  • In an exemplary embodiment, step 310 includes: obtaining a noisy and continuous audio data stream, together with its labels, as training data for the neural network.
  • Audio recognition can refer to the classification of audio data streams. That is, in the audio recognition process, the audio data stream is audio-labeled, so that the audio label indicates the category to which the audio data stream belongs; the speaker corresponding to the audio data stream, or the category label of its content, can then be learned from the audio label. Based on this, it can be seen that in realizing a neural network for audio recognition, the audio data stream and the labeling data corresponding to the audio data stream should be used as training data, so that the labeling data cooperates with the audio data stream for neural network training.
  • the audio recognition method further includes the following steps:
  • the audio data stream is divided into frames to obtain audio data corresponding to several time frames.
  • the audio data corresponding to the time frames will complete audio recognition through the prediction of the corresponding audio label.
  • The audio data stream is often of arbitrary length and fully labeled.
  • It can be a short input voice, a currently ongoing speech, etc.; therefore the audio data stream to be recognized is framed according to a certain frame length and frame shift to obtain the audio data corresponding to each time frame, and a given label in the labeling data will correspond to the audio data of a time frame.
  • the audio recognition realized by the neural network is a kind of time series classification.
  • the audio data obtained by framing forms the time series data in the time series classification.
  • Time-series modeling can be performed on the audio data, which is used to output features for the audio data in each time frame.
  • The process of audio recognition is the prediction process of audio labeling: it predicts the category of the audio data stream where the audio data is located, and then adds the corresponding labels, which may also be called tags, to obtain the audio labeling result.
  • The labeling result confirms the corresponding speaker or the category of the audio content.
  • The training of the neural network corresponds to this; therefore, it is necessary to use the labeled audio data stream for neural network training.
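  • The framing by frame length and frame shift described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the names `frame_signal`, `frame_length`, and `frame_shift` are assumptions, and the choice to drop the last partial frame is one common convention.

```python
def frame_signal(samples, frame_length, frame_shift):
    """Split a sample sequence into overlapping frames.

    Any-length input is accepted; the trailing partial frame is dropped.
    """
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift  # advance by the frame shift, frames overlap
    return frames

# Example: a 10-sample "stream", frame length 4, frame shift 2
# yields frames starting at samples 0, 2, 4, and 6.
stream = list(range(10))
frames = frame_signal(stream, frame_length=4, frame_shift=2)
```

In practice the frame length and shift are chosen in milliseconds (e.g. 25 ms frames with a 10 ms shift are common in speech processing), but the slicing logic is the same.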
  • step 330 for different audio data of each time frame in the audio data stream, feature extraction of each layer of the network is performed in the trained neural network to obtain the depth features output by the corresponding time frame.
  • feature extraction of different audio data in each time frame is performed.
  • This feature extraction is performed in a neural network.
  • the depth features corresponding to the time frame are obtained through feature extraction of each layer of the network in the neural network.
  • the neural network for feature extraction of audio data can be applied to a variety of model types and network topologies, and can also expand the network structure as needed, or even replace various more effective network topologies.
  • The neural network can be, for example, a multi-layer structure composed of convolutional network layers and max-pooling layers, an LSTM (Long Short-Term Memory) multi-layer structure, and fully connected layers.
  • Through this network, the audio data under each time frame outputs depth features.
  • The depth feature output corresponding to the time frame is a numerical description of the audio data; the audio data is thereby characterized, and the audio data stream can be labeled.
  • In an exemplary embodiment, step 330 includes: extracting features of the different audio data in each time frame of the audio data stream layer by layer in the neural network until the last layer of the network is reached, to obtain the depth features output for the corresponding time frame.
  • the audio data of each time frame in the audio data stream is extracted in the neural network through each layer of the network to complete the extraction of the depth features, so as to obtain the features in units of frames.
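  • The layer-by-layer extraction described above can be sketched as follows, where the output of the last layer is taken as the frame's depth feature. The tiny fully connected layers and ReLU activation here are illustrative assumptions; as noted above, the patent's network may combine convolutional, max-pooling, LSTM, and fully connected layers.

```python
def relu(v):
    """Element-wise rectified linear activation."""
    return [max(0.0, x) for x in v]

def layer_forward(weights, bias, x):
    """One fully connected layer: y = relu(W x + b)."""
    return relu([sum(w * xi for w, xi in zip(row, x)) + b
                 for row, b in zip(weights, bias)])

def extract_depth_feature(layers, x):
    """Pass x through every layer in turn; the last output is the depth feature."""
    for weights, bias in layers:
        x = layer_forward(weights, bias, x)
    return x

# Two small layers mapping a 3-dim frame feature to a 2-dim depth feature.
layers = [
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),
]
feat = extract_depth_feature(layers, [1.0, 2.0, 3.0])
```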
  • the audio recognition method further includes:
  • step 410 for the depth features, the depth features of the specified number of time frames before and after the corresponding time frame are acquired.
  • The obtained depth feature is extracted from the audio data of one time frame; in this exemplary embodiment, depth features are spliced over a certain length around this time frame, and the result is used as the depth feature output for this time frame.
  • The depth features of a specified number of time frames before and after the time frame are acquired.
  • For example, the specified number of time frames may be 5, in which case the depth features of the audio data of the 5 frames before and after the time frame are obtained.
  • step 430 the depth features are spliced with the depth features of a specified number of time frames before and after the corresponding time frame according to the time sequence to obtain the depth features output by the neural network in the time frame.
  • After step 410 is executed and the depth features of the specified number of time frames before and after the current time frame have been acquired, these depth features are spliced according to the time frames they correspond to, to obtain the depth features output by the neural network for the current time frame.
  • the audio data stream is divided into frames to obtain audio data corresponding to several time frames, and each audio data describes a part of the audio data stream.
  • Feature extraction is performed on all audio data so as to accurately classify and recognize the audio data stream.
  • For the audio data stream requiring neural network training, according to the hardware deployment of the audio recognition terminal itself, the audio data stream is divided according to a certain time length to obtain audio data corresponding to several time frames, in order to adapt to arbitrary audio recognition conditions and machine deployment conditions and to enhance the reliability and versatility of the neural network.
  • For the different audio data corresponding to the several time frames, depth features are spliced for the current time frame over the specified number of time frames, so as to obtain depth features that reflect context information, thereby enhancing the accuracy of the neural network.
  • The current time frame referred to is the time frame currently being processed in the depth feature splicing.
  • The depth feature splicing is performed for each time frame.
  • That is, around each time frame, the depth features of the time frames before and after it are spliced with its own depth feature to obtain the depth features output for this time frame.
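  • The context splicing described above can be sketched as follows: for each time frame, the depth features of a specified number of frames before and after it are concatenated in time order. Edge frames are handled here by repeating the first or last frame; that padding strategy is an assumption for illustration, not specified by the patent.

```python
def splice_features(features, context=5):
    """Splice each frame's feature with `context` frames before and after.

    features: list of per-frame feature vectors (lists of floats).
    Returns one concatenated vector per frame, in time order.
    """
    n = len(features)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp at the stream edges
            window.extend(features[idx])
        spliced.append(window)
    return spliced

feats = [[float(t)] for t in range(8)]   # 8 frames with 1-dim features
out = splice_features(feats, context=2)  # each output spans 2*2+1 = 5 frames
```

With the 5-frames-before-and-after example from the text, `context=5` would turn a D-dimensional frame feature into an 11·D-dimensional spliced feature.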
  • In step 350, for a given label in the labeling data, the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the given label are fused through the depth features under the set loss function.
  • the depth feature is obtained by extracting the audio data of the time frame
  • the depth feature is used to characterize the audio data, and the neural network training of the audio data is performed.
  • the annotation data corresponds to the audio data stream.
  • the annotation data is input for the training process of the neural network.
  • The labeling data is used to provide all possible labels for the label prediction of the audio data stream; the calculation performed in step 350 then determines, via the inter-class confusion measurement index, which label's category the audio data stream is compared against, so as to determine the loss function value and complete one training iteration of the neural network.
  • the set loss function is used to take the depth feature as the input to realize the fusion calculation between the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the given label.
  • The set loss function is a fusion loss function.
  • the loss function value is provided for the training of the neural network under the action of the set loss function.
  • the annotation data includes a number of given annotations.
  • For each given label in the labeling data, the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the given label are fused through the depth features under the set loss function; in this way the loss function value for this given label is obtained. This loss function value determines whether this iteration of neural network training converges.
  • the neural network training performed will be controlled by minimizing the loss function value to ensure that the neural network iterative training performed can be converged and ended.
  • the obtained parameters are updated to the neural network.
  • The corresponding minimized loss function value is obtained by fusing the inter-class confusion measurement index and the intra-class distance penalty value; therefore, the inter-class confusion measurement index and the intra-class distance penalty value are minimized together.
  • each given label corresponds to a category, and the given label will exist as the label of the corresponding category.
  • The inter-class confusion measurement index of the audio data stream relative to a given label is used to characterize the possibility that the audio data stream belongs to the category corresponding to this given label, so as to enhance the discrimination between classes.
  • The intra-class distance penalty value of the audio data stream relative to a given label is used to enhance discrimination performance through the intra-class distance penalty, so as to achieve a compact intra-class distribution.
  • That is, the smaller the intra-class distance penalty, the more compact the intra-class distribution and the stronger the intra-class discrimination performance obtained.
  • the obtained inter-class confusion index and intra-class distance penalty value with respect to a given label are for audio data of a time frame.
  • the fusion of the inter-class confusion measurement index and the intra-class distance penalty value of the audio data with respect to a given label will be achieved through its depth characteristics.
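  • The frame-level fusion described above can be sketched as follows, under the common interpretation that the inter-class confusion measurement index is a softmax cross-entropy term and the intra-class distance penalty is the squared Euclidean distance from the depth feature to the class center (as in center loss). The weight `lam` and all function names are assumptions for illustration, not the patent's notation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    s = sum(exps)
    return [e / s for e in exps]

def fused_loss(logits, feature, centers, label, lam=0.5):
    """Fuse cross-entropy (inter-class) with a center-distance penalty (intra-class)."""
    probs = softmax(logits)
    ce = -math.log(probs[label])  # inter-class confusion measurement term
    c = centers[label]
    intra = sum((f - ci) ** 2 for f, ci in zip(feature, c))  # intra-class penalty
    return ce + lam * intra

# Toy example: 3 classes, 2-dim depth feature lying exactly on class 0's center.
logits = [2.0, 0.5, 0.1]
feature = [1.0, 1.0]
centers = [[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]
loss = fused_loss(logits, feature, centers, label=0)
```

Minimizing this fused value pushes the network both to score the correct class highly (inter-class discrimination) and to pull depth features toward their class center (intra-class compactness).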
  • the obtained inter-class confusion index and intra-class distance penalty value for a given label are for the entire audio data stream.
  • For the audio data stream as a whole, the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the current given label are fused.
  • A label sequence is obtained for labeling the entire audio data stream, and the loss function value thus obtained is the probability of the audio data stream relative to a possible label sequence. The value of this probability is determined by the inter-class confusion measurement index of the audio data stream relative to this label sequence and the intra-class distance penalty value relative to this label sequence.
  • The labeling of single audio data is thus generalized to the prediction of all possible label sequences for the audio data stream, so it is no longer necessary to guarantee frame-level labeling in the training of the neural network, nor for the audio data of each time frame to have a corresponding label during the training process.
  • The input signal stream of the training process no longer needs to match the length of the labels. It should be understood that, for a piece of audio, it is normal for the audio data of one or several time frames to have no corresponding label; labeling the current time frame often requires several time frames. Therefore, labeling the audio data stream as a whole means that the realization of audio recognition no longer requires frame-level labels in the training process; it can support and adopt a sequence modeling mechanism, and can learn discriminative feature expressions while training sequence recognition.
  • In the neural network, a softmax layer is also included; the output of the result is completed through the softmax layer.
  • the output result is the probability distribution of the audio data stream with respect to each given label, that is, the aforementioned loss function value, so as to optimize the neural network through the minimized loss function value.
  • step 350 will be performed through the softmax layer in the neural network, and thereby obtain the loss function value of the audio data stream relative to a series of given annotations in the annotation data.
  • The softmax layer of the neural network performs the fusion between the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to a given label, which is realized through the set fusion loss function.
  • the intra-class distance penalty value can be calculated by Euclidean distance, or calculated by using other distance types, such as angular distance.
  • The calculation of the intra-class distance penalty value can be realized by the center loss function, but is not limited to this; it can also be realized by the contrastive loss function, the triplet loss function, or angular-distance losses such as the SphereFace loss function and the CosFace loss function, which are not listed one by one here.
  • step 370 the loss function values of a series of given labels in the relative label data are obtained through fusion, and the parameters in the neural network are updated.
  • step 350 after obtaining the loss function value of the audio data stream relative to a series of given labels in the label data, the training of the neural network can be controlled by the loss function value.
  • the series of given labels referred to are all given labels corresponding to the output loss function value of the audio data stream through the softmax layer.
  • in one case, the series of given labels corresponding to the loss function values obtained by fusion for the audio data stream includes the given label mapped through the softmax layer for the audio data of each time frame.
  • in another case, the series of given labels corresponding to the loss function values obtained by fusion is the given sequence label mapped by the audio data stream as a whole through the softmax layer.
  • the error rate of audio recognition under unseen acoustic conditions is thereby significantly reduced, and the ability of audio recognition to generalize over noise variability is effectively improved, so that a very low error rate can be obtained under clean speech conditions, the training acoustic conditions, and unseen acoustic conditions alike.
  • FIG. 5 is a flowchart illustrating step 350 according to the embodiment corresponding to FIG. 3.
  • this step 350 includes:
  • step 351 for a given label in the label data, a center vector corresponding to the category of the given label is obtained, and the center vector is used to describe the centers of all depth features in the category.
  • in step 353, for the audio data of each time frame, the fusion between the inter-class confusion measurement index and the intra-class distance penalty value relative to the given label is performed in the set loss function according to the depth feature and the center vector, to obtain the loss function value of the audio data relative to the given label.
  • this exemplary embodiment performs the fusion calculation oriented to frame-level audio data, and the loss function value of the audio data of each time frame relative to a given label is obtained through the set loss function.
  • the label data includes a number of given labels. In the calculation of the intra-class distance penalty value using the Euclidean distance, the intra-class distance of the depth feature is calculated according to the center vector of the category of the given label, and the intra-class distance penalty value is then obtained by penalizing this intra-class distance.
  • the center vector describes the center of the category of a given label. In the fusion calculation performed by the softmax layer of the neural network, for the audio data of each time frame, the intra-class distance penalty value relative to each given label in the label data is calculated based on the corresponding center vector.
  • the audio data of each time frame will also be predicted to measure the inter-class confusion index of each given label.
  • the fusion calculation is performed for each given label in the label data: for a given label, the inter-class confusion measurement index and the intra-class distance penalty value of the audio data relative to that label are each calculated, and the two are then fused to obtain the loss function value of the audio data relative to the given label; by analogy, the loss function values of the audio data of each time frame relative to all given labels are calculated.
  • in this way, the annotation of audio data is robust under new acoustic conditions: even in a new recording environment, with new speakers, or even with new accents and background noises, audio recognition can be completed stably and reliably.
  • the step 353 includes: calculating the center loss of a given label through the depth feature and the center vector to obtain the intra-class distance penalty value of the audio data of the time frame relative to the given label.
  • the center vector corresponding to a given label will be used as the center of the category.
  • for the audio data of each time frame, its extracted depth feature is used to calculate the center loss of that depth feature within the corresponding category.
  • the calculation of the center loss of the audio data relative to a given label can be realized by the center loss function shown below, namely:

    L_cl = (1/2) Σ_t ||u_t − c_{k_t}||²

  • where L_cl is the intra-class distance penalty value, u_t is the depth feature of the audio data of time frame t, that is, the output of the penultimate layer of the neural network at the t-th time frame, and c_{k_t} is the center vector of the category k_t to which the depth feature belongs.
  • the goal is that the sum of squared distances between the depth features of the audio data and their centers should be as small as possible, that is, the intra-class distance should be as small as possible.
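As a minimal sketch of the center loss just described, assuming depth features are plain vectors (all names here are illustrative, not from the patent):

```python
def center_loss(features, labels, centers):
    """L_cl = 1/2 * sum_t ||u_t - c_{k_t}||^2.

    features: list of depth-feature vectors u_t, one per time frame
    labels:   list of class indices k_t, one per time frame
    centers:  dict mapping class index -> center vector c_k
    """
    total = 0.0
    for u, k in zip(features, labels):
        total += sum((ui - ci) ** 2 for ui, ci in zip(u, centers[k]))
    return 0.5 * total
```

In practice the center vectors would be updated batch by batch along with the network parameters, rather than held fixed as here.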
  • the step 353 further includes: using a cross-entropy loss function to calculate the inter-class confusion index of the audio data of the time frame relative to a given label according to the depth feature.
  • the cross-entropy loss function is used to ensure the inter-class discrimination of depth features.
  • the cross-entropy loss function is:

    L_ce = −Σ_t log y_t^{k_t},  with  y_t^j = exp(a_t^j) / Σ_{j'=1}^{K} exp(a_t^{j'})

  • where L_ce is the inter-class confusion measurement index of the audio data relative to the given label, and y_t^{k_t} is the output of the k_t-th node of the output layer of the neural network after the softmax operation for the t-th time frame.
  • the neural network has K output nodes, which represent the K output categories.
  • a_t = W·u_t + b is the output of the last layer of the neural network at time frame t, that is, the input to the softmax layer; a_t^j denotes its j-th node, and W and b correspond to the weight matrix and bias vector of the last layer, respectively.
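A minimal numeric sketch of the softmax and per-frame cross-entropy terms above (pure Python, illustrative only):

```python
import math

def softmax(a):
    """y^j = exp(a^j) / sum_j' exp(a^j'), computed stably by shifting by max(a)."""
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy_frame(logits, k):
    """Per-frame cross-entropy term -log y^{k} for the labelled class k."""
    return -math.log(softmax(logits)[k])
```

Summing `cross_entropy_frame` over all time frames gives the L_ce term of the document.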
  • in an exemplary embodiment, step 353 further includes: performing, in the set loss function, a weighted calculation between the intra-class distance penalty value and the inter-class confusion measurement index of the audio data relative to the given label according to a specified weight factor, to obtain the loss function value of the audio data relative to the given label.
  • the center loss function and the cross-entropy loss function are fused by the following fusion loss function, namely:

    L_fmf = L_ce + λ·L_cl

  • where L_fmf is the loss function value of the audio data relative to a given label, and λ is the specified weight factor.
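Putting the two terms together, a hedged sketch of the frame-level fusion L_fmf = L_ce + λ·L_cl for a single time frame (the function name and the default value of λ are illustrative assumptions, not from the patent):

```python
import math

def fused_frame_loss(logits, feature, label, centers, lam=0.01):
    """L_fmf = L_ce + lam * L_cl for one time frame."""
    # Inter-class confusion term: cross-entropy after softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    ce = -math.log(exps[label] / sum(exps))
    # Intra-class distance penalty: center loss toward the label's center vector
    cl = 0.5 * sum((u - c) ** 2 for u, c in zip(feature, centers[label]))
    return ce + lam * cl
```

The weight factor λ trades off inter-class discrimination against intra-class compactness; the patent leaves its value to the implementation.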
  • the audio data of different time frames in the audio data stream is labeled by given labels in the label data.
  • the audio data stream includes different audio data corresponding to several time frames.
  • the audio data of each time frame is labeled, and there is a corresponding given label in the labeled data.
  • the given annotations in the annotation data all correspond to the audio data of different time frames in the audio data stream, so as to ensure the alignment of the annotation data and the audio data stream in the neural network training.
  • FIG. 6 is a flow chart illustrating step 350 in another exemplary embodiment according to the embodiment corresponding to FIG. 3.
  • this step 350 includes:
  • step 501 for a given label and supplementary blank labels in the label data, the center vector corresponding to the category is obtained.
  • in step 503, for the depth feature sequence formed in time order by the depth features of the audio data stream, the probability that the audio data stream is mapped to a given sequence label and the distances of the given sequence label to the center vectors are calculated, to obtain the intra-class distance penalty value of the audio data stream relative to the given sequence label.
  • the given sequence label includes supplementary blank label and given label.
  • the blank label is a label newly added to the label data, and it corresponds to a "blank category". It should be understood that an audio data stream often contains audio data of one or several time frames for which it is not known which given label applies. For this reason, such audio data can be attributed to the blank label, which ensures the alignment of the audio data stream with the given sequence label and solves the problem of inconsistent lengths between the audio data stream and the annotations; audio recognition is no longer limited by frame-level annotation data.
  • a given sequence of labels includes a number of given labels and blank labels inserted between the given labels.
  • a blank label will be inserted at the beginning and the end to solve the problem that the first frame of audio data and the last frame of audio data in the audio data stream have no meaning, and thus cannot be labeled.
  • the label data of the audio data stream is an unaligned discrete label string, and blank labels are added to this discrete label string; the supplemented blank labels and the given labels in the label data respectively correspond to the audio data of different time frames in the audio data stream.
  • the unaligned discrete label string of the audio data stream provides the given labels for the given sequence label. The discrete label string includes a number of given labels, but they cannot be matched frame by frame to the input signal stream; in other words, it is not known which frames of the input signal stream correspond to a given label in the discrete label string.
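The blank-augmented sequence z' described above can be sketched as follows (CTC-style; using index 0 for the blank label is an assumption):

```python
def insert_blanks(z, blank=0):
    """Build z' by inserting a blank before, between, and after the given labels."""
    zprime = [blank]
    for label in z:
        zprime += [label, blank]
    return zprime
```

For z = [1, 2] this yields z' = [0, 1, 0, 2, 0], so a label string of length r becomes a sequence of 2r + 1 positions, with blanks at the beginning and end as the document describes.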
  • the audio data stream and the unaligned discrete label string are used as training data for the training of the neural network.
  • the training of the neural network and the subsequent audio recognition are therefore no longer limited to frame-level annotation data, that is, no longer limited by the inability to align the input signal stream with the discrete label string.
  • the intra-class distance penalty value of the audio data stream relative to the given sequence label obtained through the center loss calculation is the expected value of the distance between the depth feature in the audio data stream and the center vector for the given sequence label.
  • the given labeling sequence is the labeling sequence that the audio data stream may correspond to, which is composed of the given labeling and the blank labeling.
  • the probability that the audio data stream is mapped to a given sequence label is calculated with respect to each possible given sequence label, and is used to describe the mapping relationship between the audio data stream and the given sequence label.
  • the calculation of the probability that the audio data stream is mapped to a given sequence label can be realized by calculating the conditional probability distribution shown below, namely:

    P(π_t = z'_s | z, x) = α_t(s)·β_t(s) / Σ_{s'} α_t(s')·β_t(s')

  • where α_t(s) and β_t(s) represent the forward and backward variables respectively, which can be calculated according to the maximum likelihood criterion in CTC (Connectionist Temporal Classification), and z is a sequence label of length r.
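For completeness, a minimal sketch of the CTC forward recursion obeyed by the forward variables α_t(s), applied to the blank-augmented sequence z'. The per-frame output probabilities `probs[t][k]` are assumed given; this is the textbook CTC form, not code from the patent:

```python
def ctc_forward_prob(probs, zprime, blank=0):
    """Return p(z|x) via the alpha recursion over z' (blank-augmented labels)."""
    T, S = len(probs), len(zprime)
    alpha = [[0.0] * S for _ in range(T)]
    # Valid paths may start with the leading blank or the first real label.
    alpha[0][0] = probs[0][zprime[0]]
    if S > 1:
        alpha[0][1] = probs[0][zprime[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                       # stay on the same position
            if s > 0:
                a += alpha[t - 1][s - 1]              # advance by one position
            if s > 1 and zprime[s] != blank and zprime[s] != zprime[s - 2]:
                a += alpha[t - 1][s - 2]              # skip a blank between distinct labels
            alpha[t][s] = a * probs[t][zprime[s]]
    # Valid paths end on the final label or the trailing blank.
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
```

A production implementation would work in log space to avoid underflow; the plain-probability form is kept here for readability.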
  • the intra-class distance penalty value of the audio data stream relative to the given sequence label is calculated by the conditional expected center loss function below, namely:

    L_ecl = E[(1/2)||u_t − c_{z'_s}||²] = Σ_t Σ_s P(π_t = z'_s | z, x) · (1/2)||u_t − c_{z'_s}||²

  • where L_ecl is the intra-class distance penalty value of the audio data stream relative to the given sequence label, z' is the given sequence label obtained after inserting a blank label at the beginning and end of the sequence label z and between each pair of adjacent given labels, and S is the training set containing the label pairs of audio data stream x and sequence label z.
  • the probability that the audio data stream is mapped to the given sequence label and the distances of the given sequence label to the center vectors are thus both required to complete the calculation of the conditional expected center loss function.
  • Each possible label sequence that can be composed of a given label and a blank label in the label data will be used as a given sequence label to participate in the calculation.
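Given the per-position posteriors P(π_t = z'_s | z, x) (written `gamma` below), the conditional expected center loss can be sketched as follows (names are illustrative):

```python
def expected_center_loss(gamma, features, centers, zprime):
    """L_ecl = sum_t sum_s gamma[t][s] * 0.5 * ||u_t - c_{z'_s}||^2.

    gamma:    T x S matrix of posteriors over positions of z' (each row sums to 1)
    features: list of depth-feature vectors u_t, one per time frame
    centers:  dict mapping label index -> center vector
    zprime:   blank-augmented label sequence z'
    """
    total = 0.0
    for t, u in enumerate(features):
        for s, k in enumerate(zprime):
            d2 = sum((ui - ci) ** 2 for ui, ci in zip(u, centers[k]))
            total += gamma[t][s] * 0.5 * d2
    return total
```

When `gamma` puts all mass on a single alignment, this reduces to the frame-level center loss of the earlier embodiment.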
  • in an exemplary embodiment, step 350 includes: calculating the probability distribution of the audio data stream relative to the given sequence label according to the depth features, and calculating the log-likelihood cost of the audio data stream through this probability distribution as the inter-class confusion measurement index of the audio data stream relative to the given sequence label.
  • the inter-class confusion measurement index of the audio data stream relative to the given sequence label is also calculated.
  • the calculation of the inter-class confusion measurement index of the audio data stream relative to the given sequence label aims to maximize the probability of the correct label sequence relative to the audio data stream, that is, to minimize the log-likelihood cost of the probability distribution of the audio data stream relative to the given sequence label.
  • the log-likelihood cost of the audio data stream relative to a given sequence label can be calculated by the following formula, namely:

    L_ml = −ln p(z | x)

  • where L_ml is the inter-class confusion measurement index of the audio data stream relative to the given sequence label, that is, the log-likelihood cost of the audio data stream, and p(z | x) is the probability of the sequence label z given the audio data stream x, obtained in CTC by summing over all alignment paths.
  • in an exemplary embodiment, step 350 further includes: performing, in the set loss function, a weighted calculation between the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the given sequence label according to a specified weight factor, to obtain the loss function value of the audio data stream relative to the given sequence label.
  • the end of neural network training is determined by convergence of the minimized loss function value; correspondingly, after the weighted calculation is performed for each given sequence label according to the specified weight factor, the parameters corresponding to the minimized loss function value can be updated into the neural network.
  • the following temporal multi-loss fusion function is used to calculate the loss function value of the audio data stream relative to a given sequence label, namely:

    L_tmf = L_ml + λ·L_ecl

  • where L_tmf is the loss function value of the audio data stream relative to the given sequence label, and λ is the specified weight factor.
  • FIG. 7 is a flowchart illustrating step 370 according to the embodiment corresponding to FIG. 3.
  • this step 370 includes:
  • in step 371, iterative training that updates the parameters of the network layers in the neural network is performed according to the loss function values, obtained by fusion, relative to the series of given labels in the label data, until the minimized loss function value is obtained.
  • step 373 the parameters corresponding to the minimized loss function value are updated to each layer of the neural network.
  • the neural network that realizes robust audio recognition is obtained by training on noisy and continuous audio data streams.
  • such training data often covers a variety of different acoustic conditions, which enables the trained neural network to adapt to various acoustic conditions and to have more reliable stability.
  • the weight parameters of each layer of the network are optimized according to the minimized loss function value, so as to obtain a neural network that is robust to unseen acoustic conditions. That is to say, in an exemplary embodiment, the training of the neural network will be performed with the minimum set loss function as the training target, so that the annotation prediction of the audio data stream can be realized through the neural network.
  • the training of the neural network propagates the audio data stream forward until the output produces an error signal, and the error information is propagated backward to update the parameters, such as the weight matrices of the network layers and the parameters of the softmax layer, so as to complete the training of the multilayer neural network, which is then applied to audio classification tasks.
  • the temporal multi-loss fusion function used in the softmax layer is differentiable, so the standard back-propagation algorithm for neural networks can be used for training.
  • the neural network obtained by training will be continuously optimized, thereby continuously enhancing the accuracy of the neural network for audio recognition.
  • the training implementation of the exemplary embodiment described above can be applied to neural networks of various network structures, that is, the model type and network structure of the neural network are not limited, and can be replaced with various effective new network structures.
  • the softmax layer is constructed for whatever neural network is used; no additional complexity is introduced, no additional hyperparameter or network structure tuning is required, and a consistent performance improvement is obtained.
  • the automatic speech recognition system trains on the input audio data stream to obtain a neural network.
  • existing automatic speech recognition cannot be applied to all possible and varying acoustic conditions. On the one hand, the neural network used cannot cover all acoustic conditions during training; on the other hand, frame-level training requires each sample frame to have a corresponding category label, which is not available in the actual neural network training process.
  • the training data that can actually be used is a noisy and continuous audio data stream with an unaligned discrete label sequence, where it is not known which frames of the input signal stream correspond to a certain label.
  • the method described above will be applied to perform automatic speech recognition.
  • as the audio data stream passes through the network layers of the neural network to extract depth features, the inter-class confusion measurement index and the intra-class distance penalty value relative to the given labels are fused, yielding the loss function values of the audio data stream relative to the series of given labels in the label data and completing the training of the neural network.
  • Fig. 8 is a schematic diagram showing a network architecture of a neural network in an automatic speech recognition system according to an exemplary embodiment.
  • the network architecture of the neural network in the automatic speech recognition system of this application at least includes a multi-layer structure 1010 of convolutional network layers plus max-pooling layers, a multi-layer LSTM structure 1030, a fully connected layer 1050, and a fusion loss function calculation module.
  • correspondingly, the audio data stream passes through the feature extraction module to obtain the input features, then through the multi-layer structure 1010 of convolutional layers plus max-pooling layers, then through the LSTM multi-layer structure 1030, and then through the fully connected layer 1050, whose output is fed to the fusion loss function calculation module; the neural network training is completed through the fusion loss function realized by this module.
  • the annotation data may be the phoneme representation of the audio data stream.
  • the output phoneme will be used as the training target for supervised training.
  • the neural network in FIG. 8 has K output nodes, representing K output categories, for example, context-dependent phonemes, context-dependent sub-phonemes, or hidden Markov state labels.
  • the loss function value of the audio data stream with respect to a series of given sequence annotations calculated by the temporal multi-loss fusion function is realized by calculating the probability distribution on all possible annotation sequences. Given this probability distribution, the temporal multi-loss function directly maximizes the probability of correct annotation while penalizing the distance between the depth feature and the corresponding center, so that it is no longer restricted by frame-level annotation data.
  • the network structure and hyperparameters of the neural network are configured as shown in Table 1.
  • the network structure first includes two two-dimensional convolutional layers with 64 and 80 output channels respectively, each with a kernel size of (3, 3) and a stride of (1, 1); each convolutional layer is followed by a max-pooling layer with a kernel size of (2, 2) and a stride of (2, 2); these are followed by five LSTM layers, each with 1024 hidden nodes and 512 output nodes; finally, a fully connected layer is attached, whose number of output nodes corresponds to the K output categories.
  • 12K context-related phonemes can be used in the detailed implementation.
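The structure of Table 1 can be written down as a configuration sketch (the field names and the `network_config` structure are illustrative assumptions, not from the patent):

```python
K = 12000  # e.g. 12K context-dependent phoneme classes

network_config = (
    [
        {"type": "conv2d", "out_channels": 64, "kernel": (3, 3), "stride": (1, 1)},
        {"type": "maxpool", "kernel": (2, 2), "stride": (2, 2)},
        {"type": "conv2d", "out_channels": 80, "kernel": (3, 3), "stride": (1, 1)},
        {"type": "maxpool", "kernel": (2, 2), "stride": (2, 2)},
    ]
    + [{"type": "lstm", "hidden": 1024, "proj": 512}] * 5  # five LSTM layers
    + [{"type": "linear", "out": K}]                       # fully connected output
)
```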
  • L_fmf = L_ce + λ·L_cl
  • the training optimization algorithm adopts Adam method.
  • the learning rate is set to an initial value of 1e-4 at the beginning of training; when the average validation likelihood (computed after every 5K batches of training) does not decrease for 3 consecutive evaluations, the learning rate is halved. If the average validation likelihood does not decrease for 8 consecutive evaluations, training is terminated early.
  • Fig. 9 is a schematic diagram showing the forward propagation and back propagation error signal flow when the neural network is supervised and trained by the fusion loss function according to an exemplary embodiment.
  • the learning algorithm of temporal multi-loss fusion function includes:
  • the input part: take the training label pairs (x, z) ∈ S as input, and set the initialization parameters of the convolutional and LSTM layers, the initialization weight parameters W of the fully connected layer, the initialization center vectors {c_j | j = 1, 2, ..., K}, the weight factor λ, the batch momentum, and the learning rate.
  • the parameters ⁇ and W will be adjusted, and the parameters of the center vector will be updated after the blank label is inserted.
  • the CTC loss function is calculated.
  • the generated back-propagation error signal, as shown in Figure 9, passes through the softmax layer to give the back-propagation error signal of the log-likelihood cost L_ml of the audio data stream, namely:

    ∂L_ml/∂a_t^k = y_t^k − Σ_{s: z'_s = k} P(π_t = z'_s | z, x)

  • where y_t^k is the softmax output of node k at time frame t.
  • the neural network obtained through the above-mentioned loss function training is applied to the automatic speech recognition system to obtain robustness to unseen acoustic conditions.
  • Fig. 10 is a block diagram showing an audio recognition system according to an exemplary embodiment.
  • the audio recognition system includes but is not limited to: a data stream acquisition module 1210, a feature extraction module 1230, a fusion calculation module 1250, and an update module 1270.
  • the data stream acquisition module 1210 is configured to acquire audio data streams for neural network training of audio recognition, where the audio data streams include audio data corresponding to several time frames respectively;
  • the feature extraction module 1230 is configured to extract features of each layer of the network in the trained neural network for different audio data of each time frame in the audio data stream, to obtain the depth features output by the corresponding time frame;
  • the fusion calculation module 1250 is configured to fuse, for the given labels in the label data and through the depth features, the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream in the set loss function;
  • the update module 1270 is configured to obtain, through fusion, the loss function values relative to a series of given labels in the label data, and to perform the parameter update in the neural network.
  • Fig. 11 is a block diagram of a fusion calculation module according to the embodiment corresponding to Fig. 10.
  • the fusion computing module 1250 includes:
  • the center vector obtaining unit 1251 is configured to obtain, for a given label in the label data, a center vector corresponding to a category to which the given label belongs, and the center vector is used to describe the centers of all depth features in the category;
  • the loss function value fusion unit 1253 is configured to perform, for the audio data of each time frame, the fusion between the inter-class confusion measurement index and the intra-class distance penalty value relative to the given label in the set loss function according to the depth feature and the center vector, to obtain the loss function value of the audio data relative to the given label.
  • the loss function value fusion unit 1253 is further configured to perform the center loss calculation of the given label through the depth feature and the center vector to obtain the relative audio data of the time frame. The penalty value of the distance within the class of the given label.
  • the loss function value fusion unit 1253 is further configured to use a cross-entropy loss function to calculate the inter-class confusion degree of the audio data of the time frame relative to the given label according to the depth feature Measure the index.
  • the loss function value fusion unit 1253 is further configured to perform, in the set loss function, a weighted calculation between the intra-class distance penalty value and the inter-class confusion measurement index of the audio data relative to the given label according to a specified weight factor, to obtain the loss function value of the audio data relative to the given label.
  • Fig. 12 is a block diagram showing another exemplary embodiment of the fusion computing module according to the embodiment corresponding to Fig. 10.
  • the fusion calculation module 1250 includes:
  • the category center obtaining unit 1301 is configured to obtain a center vector corresponding to a category for a given label in the label data and the supplemented blank label;
  • the intra-class distance penalty value calculation unit 1303 is configured to calculate the probability that the audio data stream is mapped to a given sequence label and the given sequence for the depth feature sequence formed by the audio data stream in time sequence Labeling the respective distances from the center vector to obtain the intra-class distance penalty value of the audio data stream with respect to the given sequence;
  • the given sequence label includes the supplementary blank label and the given label.
  • the fusion calculation module 1250 further includes a probability distribution calculation unit, configured to calculate the probability distribution of the audio data stream relative to the given sequence label according to the depth features, and to calculate the log-likelihood cost of the audio data stream from this probability distribution as the inter-class confusion measurement index of the audio data stream relative to the given sequence label.
  • the fusion calculation module 1250 further includes a weighting calculation unit, configured to perform, in the set loss function, a weighted calculation between the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the given sequence label according to a specified weight factor, to obtain the loss function value of the audio data stream relative to the given sequence label.
  • Fig. 13 is a block diagram showing an update module according to the embodiment corresponding to Fig. 10.
  • the update module 1270 includes:
  • the iterative training unit 371 is configured to perform iterative training that updates the parameters of the network layers in the neural network according to the loss function values, obtained by fusion, relative to a series of given labels in the label data, until the minimized loss function value is obtained;
  • the parameter update unit 373 is configured to update the parameters corresponding to the minimized loss function value to each layer of the neural network.
  • this application also provides a machine device, which can be used in the implementation environment shown in FIG. 1 to perform all or part of the steps of the method shown in any one of FIGS. 3, 4, 5, 6 and 7.
  • the device includes:
  • a memory for storing processor executable instructions
  • the processor is configured to execute the aforementioned method.
  • an embodiment of the present application also provides a storage medium, which includes a stored program, and the program executes the steps in any of the foregoing methods when the program is running.
  • embodiments of the present application also provide a computer program product including instructions, which when run on a computer, cause the computer to execute the steps in any of the foregoing methods.


Abstract

A neural network training method for realizing audio recognition, the method including: acquiring an audio data stream (310); for the different audio data of each time frame in the audio data stream, performing feature extraction through the layers of a neural network to obtain the depth features output for the corresponding time frames (330); for the given labels in the label data, fusing, through the depth features, the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the given labels in a set loss function (350); and performing the parameter update in the neural network through the loss function values obtained by fusion (370). The method trains the neural network based on the fused loss function values, combining the inter-class confusion measurement index of the audio data stream relative to the given labels with the penalty on the distance to the center vectors, thereby improving the robustness of the realized audio recognition.

Description

Audio recognition method, system, and machine device
This application claims priority to the Chinese patent application No. 201910087286.4, filed with the China National Intellectual Property Administration on January 29, 2019 and entitled "Audio recognition method, system and machine device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of computer acoustic applications, and in particular to an audio recognition method, system, and machine device.
Background
The realization of audio recognition in acoustic scenes, that is, the execution of various audio classification tasks, is often limited by the variability of the acoustic scene, for example in automatic speech recognition based on audio recognition, which makes audio recognition difficult to apply to various audio classification tasks. The variability of acoustic scenes has many sources, for example, speakers, accents, background noise, reverberation, channels, and recording conditions.
With the development of science and technology and the great improvement of hardware computing power, audio recognition is implemented based on neural networks. However, audio recognition based on neural networks still cannot guarantee robustness to changing acoustic scenes.
Summary
In order to solve the problem in the related art that neural networks for audio recognition lack robustness to acoustic conditions that are unseen during training or vary greatly, this application provides a neural network training method, system, and machine device for realizing audio recognition.
一种音频识别方法,所述方法包括:
获取进行音频识别的音频数据流,所述音频数据流包括分别对应若干时间帧的音频数据;
对所述音频数据流中每个时间帧的不同音频数据,在神经网络中进行网络各层的特征抽取,获得对应时间帧输出的深度特征;
为标注数据中的给定标注,通过所述深度特征对所述音频数据流在设定损失函数中融合相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值;
通过融合得到相对标注数据中一系列给定标注的损失函数值，对所述音频数据流获得音频标注结果。
一种音频识别系统,所述音频识别系统包括:
数据流获取模块,用于获取进行音频识别的音频数据流,所述音频数据流包括分别对应若干时间帧的音频数据;
特征抽取模块,用于对所述音频数据流中每个时间帧的不同音频数据,在神经网络中进行网络各层的特征抽取,获得对应时间帧输出的深度特征;
融合计算模块,用于为标注数据中的给定标注,通过所述深度特征对所述音频数据流在设定损失函数融合相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值;
结果获取模块,用于通过融合得到相对标注数据中一系列给定标注的损失函数值,对所述音频数据流获得音频标注结果。
一种机器设备,包括:
处理器;以及
存储器,所述存储器上存储有计算机可读指令,所述计算机可读指令被所述处理器执行时实现如前所述的方法。
一种存储介质,该存储介质包括存储的程序,程序运行时执行上述的方法。
一种包括指令的计算机程序产品,当其在计算机上运行时,使得所述计算机执行上述的方法。
本申请的实施例提供的技术方案可以包括以下有益效果:
对于给定音频，为音频识别的神经网络训练获取音频数据流，此音频数据流包括了分别对应若干时间帧的音频数据。对音频数据流中每个时间帧的不同音频数据，在所训练神经网络中进行网络各层的特征抽取，获得对应时间帧输出的深度特征，至此便为每一时间帧的不同音频数据都获得了用于对音频数据流进行标注、以识别此音频数据流的深度特征。在此基础之上，再为标注数据中的给定标注，通过深度特征对音频数据流在设定损失函数中融合相对此给定标注的类间混淆度衡量指数以及类内距离惩罚值，最后通过所融合得到的损失函数值来进行神经网络中的参数更新。对于用于进行音频识别的神经网络而言，基于所融合得到的损失函数值来进行网络各层的参数更新，综合音频数据流相对给定标注的类间混淆度衡量指数以及类内距离惩罚值，从而提高所实现神经网络对训练时未见以及变化大的声学条件的鲁棒性。
音频数据流相对给定标注的类间混淆度衡量指数,将保证了音频识别中深度特征的类间区分性;而音频数据流相对给定标注的类内距离惩罚值,对于音频识别而言,则增强了所抽取得到的深度特征的鉴别性能,因此,在此基础上所进行的二者之间融合,保证了深度特征具备类间区分性和类内分布的紧密性,从而得以提高所实现神经网络对训练时未见以及变化大的声学条件的鲁棒性,进而有效提升音频识别的性能。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性的,并不能限制本申请。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实施例,并于说明书一起用于解释本申请的原理。
图1是根据本申请所涉及的实施环境的示意图;
图2是根据一示例性实施例示出的一种音频识别端的硬件结构框图;
图3是根据一示例性实施例示出的一种实现音频识别的神经网络训练方法的流程图;
图4是根据另一示例性实施例示出的一种音频识别方法的流程图；
图5是根据图3对应实施例示出的对步骤350进行描述的流程图;
图6是根据图3对应实施例示出的对步骤350在另一个示例性实施例进行描述的流程图;
图7是根据图3对应实施例示出的对步骤370进行描述的流程图;
图8是根据一示例性实施例示出的自动语音识别系统中神经网络的网络架构示意图;
图9是根据一示例性实施例示出的融合损失函数监督训练神经网络时的前向传播和反向传播错误信号流的示意图;
图10是根据一示例性实施例示出的一种音频识别系统的框图;
图11是根据图10对应实施例示出的融合计算模块的框图;
图12是根据图10对应实施例示出的融合计算模块在另一个示例性实施例的框图；
图13是根据图10对应实施例示出的更新模块的框图。
具体实施方式
这里将详细地对示例性实施例执行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。
图1是根据本申请所涉及的实施环境的示意图。在一个示例性实施例中,该实施环境包括音频源110以及音频识别端130,对于音频源110所发出的音频,例如,一段语音,在音频识别端130进行着神经网络的训练,以此来获得可供实现音频识别的神经网络。
例如,如图1所示的,音频源110,可以是一说话人,也可以是一终端设备,通过说话人的说话向音频识别端130输出一段语音,或者通过一终端设备进行的音频播放向音频识别端130输出一段音频。
音频识别端130可以是智能音箱、智能电视、在线语音识别系统等,对于神经网络的训练过程而言,音频源110将为所进行的神经网络训练提供音频数据流作为训练数据。本申请所实现音频识别的神经网络训练逻辑将应用于音频识别端130,对音频源110输入的音频进行神经网络训练。应当理解,该实施环境的具体框架将与所落地的场景强相关,不同的场景,将使得所在的实施环境除了音频源110和音频识别端130之外,有着不同的架构部署。
音频识别端130将面向于各种音频源110,例如,各种应用所在的设备,通过各种音频源110来为音频识别端130提供进行神经网络训练的音频数据流。
所训练得到的神经网络将应用于诸多场景,例如,安全监控中的音频监控、说话人识别以及人机交互等在此不进行一一列举,实现诸多场景下的音频识别。
图2是根据一示例性实施例示出的一种音频识别端的硬件结构框图。在一个示例性实施例中,音频识别端可以是服务器,当然,其也可以是计算能力优 秀的终端设备。图2是根据一示例性实施例示出的作为音频识别端的服务器的硬件结构框图。需要说明的是,该服务器200只是一个适配于本公开的示例,不能认为是提供了对本公开的使用范围的任何限制。该服务器200也不能解释为需要依赖于或者必须具有图2中示出的示例性的服务器200中的一个或者多个组件。
该服务器200的硬件结构可因配置或者性能的不同而产生较大的差异,如图2所示,服务器200包括:电源210、接口230、至少一存储介质250、以及至少一中央处理器(CPU,Central Processing Units)270。
其中,电源210用于为服务器200上的各硬件设备提供工作电压。
接口230包括至少一有线或无线网络接口231、至少一串并转换接口233、至少一输入输出接口235以及至少一USB接口237等,用于与外部设备通信。
存储介质250作为资源存储的载体,可以是随机存储介质、磁盘或者光盘等,其上所存储的资源包括操作系统251、应用程序253及数据255等,存储方式可以是短暂存储或者永久存储。其中,操作系统251用于管理与控制服务器200上的各硬件设备以及应用程序253,以实现中央处理器270对海量数据255的计算与处理,其可以是Windows ServerTM、Mac OS XTM、UnixTM、LinuxTM、FreeBSDTM等。应用程序253是基于操作系统251之上完成至少一项特定工作的计算机程序,其可以包括至少一模块(图2中未示出),每个模块都可以分别包含有对服务器200的一系列操作指令。数据255可以是存储于磁盘中的照片、图片等等。
中央处理器270可以包括一个或多个以上的处理器,并设置为通过总线与存储介质250通信,用于运算与处理存储介质250中的海量数据255。
如上面所详细描述的,适用本公开的服务器200将通过中央处理器270读取存储介质250中存储的一系列操作指令的形式来进行音频的识别。
图3是根据一示例性实施例示出的一种实现音频识别的神经网络训练方法的流程图。在一个示例性实施例中,该实现音频识别的神经网络训练方法,如图3所示,至少包括以下步骤。
在步骤310中,为音频识别的神经网络训练获取音频数据流,音频数据流包括分别对应若干时间帧的音频数据。
其中,对实现音频识别的神经网络进行训练之前,可以先获取此音频对应的音频数据流,以便后续可以利用该音频数据流来执行神经网络的训练过程。应当理解,音频数据流描述了音频内容,也反映了输出此音频内容的说话人。音频数据流是由一帧帧音频数据所组成的,因此,音频数据流包含了对应时间帧的若干音频数据。这些音频数据将形成时间序列,也就是说,音频数据流将对应于按照一定的时间顺序所构成的音频序列数据。
在一个示例性实施例中,步骤310包括:获取带噪且连续的音频数据流以及标注数据为神经网络的训练数据。
音频识别可以是指对音频数据流进行分类。也就是，在音频识别过程中对音频数据流进行音频标注，使得该音频标注标示了音频数据流归属的类别，从而使得后续可以基于该音频标注获知音频数据流所对应的说话人，或者其在内容上归属的标签。基于此可知，在训练实现音频识别的神经网络的过程中，应当将音频数据流以及该音频数据流对应的标注数据作为训练数据，以便标注数据与音频数据流相配合进行神经网络的训练。
在一个示例性实施例中,对于所获取的音频数据流,在执行步骤330之前,该音频识别方法还包括以下步骤:
对音频数据流进行分帧,获得对应若干时间帧的音频数据,对应于时间帧的音频数据将通过相应音频标注的预测完成音频识别。
其中,音频数据流往往是任意长度且完成标注的,例如,其可以是短暂输入的语音,也可以是当前所进行的演讲等等,因此,可以按照一定的帧长和帧移对该进行音频识别的音频数据流进行分帧,以此来获得每一时间帧对应的音频数据,其标注数据中的给定标注将对应于一时间帧的音频数据。
神经网络所实现的音频识别,作为时序分类的一种,由分帧所获得的音频数据形成了时序分类中的时序数据,在后续所进行的特征抽取中,按时序对音频数据进行即可,以此来为每一时间帧下的音频数据输出特征。
音频识别的进行,即为音频标注的预测过程,预测音频数据所在音频数据流的类别,进而为此而打上相应的标注,亦可称之为标签,由此即可获得音频标注结果,通过音频标注结果确认相应的说话人,或者音频在内容上的类别。神经网络的训练与此相对应,因此,需要使用标注的音频数据流来进行神经网 络训练。
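上述按一定帧长和帧移对音频数据流进行分帧的过程，可以用如下极简代码示意。其中采样率16kHz为演示用的假设取值；帧长25ms、帧移10ms与后文实例中的取值一致：

```python
def split_frames(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """按帧长和帧移把采样点序列切分为若干时间帧（末尾不足一帧的部分丢弃）。"""
    frame_len = sample_rate * frame_ms // 1000   # 25ms -> 每帧 400 个采样点
    shift_len = sample_rate * shift_ms // 1000   # 10ms -> 帧移 160 个采样点
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift_len
    return frames

# 1 秒 16kHz 的音频可分出 (16000-400)//160 + 1 = 98 帧
frames = split_frames([0.0] * 16000)
```

分帧后的每一时间帧音频数据，即后续在神经网络中逐帧进行特征抽取的基本单位。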
在步骤330中,对音频数据流中每个时间帧的不同音频数据,在所训练神经网络中进行网络各层的特征抽取,获得对应时间帧输出的深度特征。
其中,对音频数据流,进行每一时间帧下不同音频数据的特征抽取,此特征抽取是在神经网络中进行的,通过神经网络中网络各层的特征抽取来对应于时间帧获得深度特征。
首先应当说明的是,对音频数据进行特征抽取的神经网络,可以适用于多种模型类型和网络拓扑结构,也可以根据需要扩展网络结构,甚至替换各种更为有效的网络拓扑结构。在一个示例性实施例中,神经网络可以通过卷积网络层和Max pool层构成的多层结构、LSTM(Long Short-Term Memory,长短期记忆网络)的多层结构以及全连接层来为不同时间帧下的音频数据输出深度特征。
对应时间帧所输出的深度特征,将是对音频数据的数值化描述,因此,将表征音频数据而进行音频数据流的标注。
在一个示例性实施例中,对步骤330包括:对音频数据流中每个时间帧的不同音频数据,在神经网络中的网络各层逐层进行特征抽取,直至抵达网络最后一层,获得对应时间帧输出的深度特征。
其中,音频数据流中每一时间帧的音频数据,都在神经网络经由网络各层完成深度特征的抽取,以此来以帧为单位获取特征。
而在另一个示例性实施例中,对于步骤330所获得对应时间帧的深度特征,在执行步骤350之前,正如图4所示出的,该音频识别方法还包括:
在步骤410中,对深度特征,获取所对应时间帧之前和之后各指定数量时间帧的深度特征。
其中,在前述示例性实施例中,所获得的深度特征,是对一时间帧的音频数据抽取得到的,而在本示例性实施例中,将为此时间帧按照一定长度拼接深度特征,以此来作为此时间帧输出的深度特征。
基于此，针对对应于每一时间帧的深度特征，都获取在此时间帧之前和之后各指定数量时间帧的深度特征。例如，指定数量时间帧可以是5帧，即获取此时间帧之前和之后各5帧音频数据的深度特征。
在步骤430中,将深度特征按照时序与所对应时间帧之前和之后指定数量时间帧的深度特征拼接,获得时间帧在神经网络输出的深度特征。
其中,在通过步骤410的执行,为时间帧获取指定数量时间帧的深度特征之后,将按照所获取深度特征对应的时间帧,进行深度特征的按时序拼接,以此来获得当前时间帧在神经网络输出的深度特征。
应当理解,对音频数据流分帧而获得若干时间帧对应的音频数据,每一音频数据都描述了音频数据流中的一部分内容。对所有音频数据都进行特征抽取,方能准确进行音频数据流的分类识别。
对于请求进行神经网络训练的音频数据流,通过前述示例性实施例,根据音频识别端自身的硬件部署情况,将音频数据流按照一定的时间长度进行分割,便得到了对应若干时间帧的音频数据,以此来适应于任意音频识别状况和机器部署状况,增强神经网络的可靠性和通用性。
而对应于若干时间帧的不同音频数据，都按指定数量时间帧为当前所对应的时间帧进行深度特征的拼接，以此来获得能够反映上下文信息的深度特征，进而增强神经网络的精准性。
在此应当说明的是,对于所进行的深度特征拼接,所指的当前时间帧,是所进行的深度特征拼接中,当前所处理到的时间帧。所进行的深度特征拼接,是针对于每一时间帧进行的,分别围绕着每一时间帧而为所对应深度特征拼接此时间帧之前以及之后的深度特征,以此来获得此时间帧输出的深度特征。
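作为上述深度特征按时序拼接的一个示意，可以为每一时间帧拼接其之前和之后各N帧的特征（文中示例取N=5）；首尾越界时此处以重复边界帧的方式补齐，该补齐方式仅为本示意的假设：

```python
def splice(features, n_context=5):
    """把每帧特征与其前后各 n_context 帧的特征按时序拼接成一个长向量。"""
    spliced, T = [], len(features)
    for t in range(T):
        window = []
        for k in range(t - n_context, t + n_context + 1):
            k = min(max(k, 0), T - 1)   # 越界时重复边界帧
            window.extend(features[k])
        spliced.append(window)
    return spliced

# 8 个时间帧、每帧 120 维的特征，拼接后每帧变为 120*(5+5+1)=1320 维
out = splice([[0.0] * 120 for _ in range(8)])
```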
在步骤350中,为标注数据中的给定标注,通过深度特征对音频数据流在设定损失函数中融合相对给定标注的类间混淆度衡量指数和类内距离惩罚值。
其中,在对时间帧的音频数据抽取得到深度特征之后,即使用深度特征来表征音频数据,进行此音频数据参与的神经网络训练。
标注数据对应于音频数据流，标注数据是为神经网络的训练过程所输入的，将用于为音频数据流的标注预测提供所有可能的标注，进而通过步骤350计算各标注所对应类别相对于音频数据流的类间混淆度衡量指数，从而确定损失函数值，以此来完成神经网络的一次迭代训练。
而设定损失函数，用于以深度特征为输入，实现音频数据流相对给定标注的类间混淆度衡量指数和类内距离惩罚值二者之间的融合计算。也就是说，设定损失函数即为融合损失函数。在设定损失函数的作用下为神经网络的训练提供损失函数值。
标注数据包括若干给定标注,对于每一给定标注,都通过深度特征对音频数据流在设定损失函数中融合相对这一给定标注的类间混淆度衡量指数和类内距离惩罚值,以此来获得这一给定标注的损失函数值。此损失函数值将决定着本次所迭代进行的神经网络训练是否收敛结束。
应当理解的,对于设定损失函数所计算得到的损失函数值,将通过最小化损失函数值来控制所进行的神经网络训练,以保证所进行的神经网络迭代训练能够得到收敛而结束,进而将由此得到的参数更新到神经网络中。
对于所训练得到的神经网络而言,其所对应最小化的损失函数值,由于是由类间混淆度衡量指数和类内距离惩罚值所融合得到的,因此,类间混淆度衡量指数和类内距离惩罚值都将是最小化的。
每一给定标注都对应于一类别,给定标注将作为所对应类别的标签而存在。应当说明的是,音频数据流相对给定标注的类间混淆度衡量指数,用于表征音频数据流归属于这一给定标注所对应类别的可能性,以增强类间区分性,即类间混淆度衡量指数越小,则类间区分性越强;而音频数据流相对给定标注的类内距离惩罚值,则用于通过类内距离的惩罚来增强鉴别性能,以通过类内分布紧凑来满足类内鉴别性能,即类内距离惩罚值越小,则类内分布的紧凑性越强,进而获得类内鉴别性能的增强。
在一个示例性实施例中，所获得相对给定标注的类间混淆度衡量指数和类内距离惩罚值，是面向于时间帧的音频数据而言的。对每一时间帧的音频数据，都将通过其深度特征来实现此音频数据相对给定标注的类间混淆度衡量指数以及类内距离惩罚值二者之间的融合。
在另一个示例性实施例中,所获得给定标注的类间混淆度衡量指数和类内距离惩罚值,是面向于整个音频数据流而言的。针对于标注数据中的每一给定标注,都对音频数据流在整体上进行音频数据流相对当前给定标注的类间混淆度衡量指数和类内距离惩罚值二者之间的融合。
在此示例性实施例中,面向于音频数据流整体进行标注而获得标注序列,由此所获得的损失函数值即为音频数据流相对一可能的标注序列的概率,此概率的数值大小将由音频数据流相对此标注序列的类间混淆度衡量指数以及相对此标注序列的类内距离惩罚值决定。
至此,将单一音频数据的标注优化为音频数据流对所有可能标注序列的预测,从而将不再需要保证神经网络的训练中帧级别标注的进行,不需要为每一时间帧的音频数据都在训练过程中提供所对应的标注,训练过程的输入信号流不再需要保证与标注的长度一致,应当理解,对于一段音频而言,某一个或者某几个时间帧的音频数据无对应标注是正常的,往往会经过几个时间帧才能够对当前时间帧进行音频数据的标注,因此,面向于音频数据流进行整体上的标注,将使得音频识别的实现不再需要在训练过程中进行帧级别的标注,能够支持和采纳序列建模的机制,且能够在序列鉴别训练的同时学习有鉴别性的特征表达。
如前所述的,通过神经网络中的网络各层进行了特征抽取,以此来获得了时间帧音频数据的深度特征,除此之外,对于神经网络而言,还包括了softmax层,将通过softmax层完成结果的输出,当然,所输出的结果是音频数据流相对各给定标注的概率分布,即前述所指的损失函数值,以此来通过最小化的损失函数值优化神经网络。
因此,步骤350的实现将是通过神经网络中的softmax层执行的,进而以此来获得音频数据流相对标注数据中一系列给定标注的损失函数值。
神经网络的softmax层为进行音频数据流相对给定标注的类间混淆度衡量指数和类内距离惩罚值二者之间的融合,将通过设定的融合损失函数实现。
应当理解的,类内距离惩罚值,可以通过欧几里得距离计算得到,也可采用其它距离类型计算得到,例如角度距离。与此相对应的,类内距离惩罚值的计算可以通过中心损失函数实现,但也不限于此,也可以通过采用角度距离的Contrastive损失函数、Triplet损失函数、Sphere face损失函数和CosFace损失函数等实现类内距离惩罚值的计算,在此不一一进行列举。
在步骤370中,通过融合得到相对标注数据中一系列给定标注的损失函数值,进行神经网络中的参数更新。
其中,通过步骤350的执行,获得音频数据流相对标注数据中一系列给定标注的损失函数值之后,即可由此损失函数值来控制神经网络的训练。
应当说明的是，所指的一系列给定标注，是音频数据流通过softmax层输出损失函数值所对应的所有给定标注。在一个示例性实施例中，音频数据流融合得到损失函数值所对应的一系列给定标注，包括了每一时间帧所对应音频数据通过softmax层映射的给定标注。在另一个示例性实施例中，音频数据流融合得到损失函数值所对应的一系列给定标注，则是音频数据流通过softmax层所映射的给定标注。
通过此示例性实施例,将得以显著降低音频识别在未见声学条件下的错误率,有效提高了音频识别对噪声可变性的泛化能力,进而在干净语音条件、训练己见声学条件以及未见声学条件下都能够获得非常低的错误率。
图5是根据图3对应实施例示出的对步骤350进行描述的流程图。在一个示例性实施例中,如图5所示,该步骤350包括:
在步骤351中,为标注数据中的给定标注,获取给定标注所属类别对应的中心向量,该中心向量用于描述所属类别中所有深度特征的中心。
在步骤353中,根据深度特征和中心向量对时间帧的音频数据进行设定损失函数中自身相对给定标注的类间混淆度衡量指数和类内距离惩罚值二者之间的融合,获得音频数据相对给定标注的损失函数值。
其中,此示例性实施例是面向于音频数据进行的融合计算,通过设定损失函数来获得每一时间帧音频数据相对给定标注的损失函数值。
如前所述的,标注数据包括若干给定标注。因此,在采用欧几里得距离所进行的类内距离惩罚值计算中,将根据给定标注所在类别的中心向量对深度特征计算类内距离,进而通过惩罚类内距离而获得类内距离惩罚值。应当理解的,中心向量用于描述给定标注所在类别的中心。在神经网络的softmax层所进行的融合计算中,对于每一时间帧的音频数据,都针对于标注数据中的每一给定标注基于中心向量进行相对这一给定标注的类内距离惩罚值的计算。
与此相对应的,对于这一给定标注,也将预测每一时间帧的音频数据相对每一给定标注的类间混淆度衡量指数。
由此可知,所进行的融合计算是针对于标注数据中的每一给定标注进行 的,并且在设定损失函数所进行的融合计算中,针对于相同给定标注,都进行自身相对此给定标注的类间混淆度衡量指数及类内距离惩罚值计算,进而进行二者之间的融合,得到音频数据相对于此给定标注的损失函数值,以此类推,运算得到每一时间帧音频数据相对所有给定标注的损失函数值。
通过此示例性实施例,使得音频数据的标注能够在新的声学条件下具备鲁棒性,即便在新的录音环境、遇到新的说话人甚至于新的口音和背景噪声,也能够稳定可靠的完成音频识别。
在另一个示例性实施例中,该步骤353包括:通过深度特征和中心向量,进行给定标注的中心损失计算,获得时间帧的音频数据相对给定标注的类内距离惩罚值。
其中,正如前述所指出的,对应于给定标注的中心向量,将作为所在类别的中心,每一时间帧的音频数据都将通过自身所抽取得到的深度特征对中心向量计算深度特征在相应类别中的类内紧凑性和鉴别性能,这是通过惩罚深度特征和中心向量之间的类内距离所实现的。
因此,在一个示例性实施例中,对于音频数据相对给定标注的中心损失计算,可通过如下所示的中心损失函数实现,即:
L_cl = (1/2)·Σ_{t=1}^{T} ‖u_t − c_{k_t}‖²
其中，L_cl是类内距离惩罚值，u_t是时间帧t的音频数据的深度特征，即神经网络中倒数第二层在第t时间帧的输出，c_{k_t}是第k_t类深度特征的中心向量。
在所进行的中心损失计算中，其目标是希望音频数据的深度特征与所属类别中心之间距离的平方和越小越好，即类内距离越小越好。
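上述中心损失的计算可以用如下极简代码示意：惩罚各时间帧深度特征与其所属类别中心向量之间欧几里得距离的平方，系数1/2为中心损失的常见取法，数值均为演示用的假设输入：

```python
def center_loss(features, labels, centers):
    """L_cl = (1/2) * sum_t ||u_t - c_{k_t}||^2 的直接实现。"""
    loss = 0.0
    for u_t, k_t in zip(features, labels):
        loss += 0.5 * sum((u - c) ** 2 for u, c in zip(u_t, centers[k_t]))
    return loss

centers = {0: [0.0, 0.0], 1: [1.0, 1.0]}   # 两个类别的中心向量（假设值）
features = [[0.0, 0.0], [1.0, 2.0]]        # 两个时间帧的深度特征 u_t
labels = [0, 1]                            # 各时间帧对应的给定标注 k_t
loss = center_loss(features, labels, centers)   # 0.5*0 + 0.5*1 = 0.5
```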
在另一个示例性实施例中,该步骤353还包括:根据深度特征,采用交叉熵损失函数计算时间帧的音频数据相对给定标注的类间混淆度衡量指数。
其中,交叉熵损失函数用于保证深度特征的类间区分性。
在一个示例性实施例中,交叉熵损失函数为:
L_ce = −Σ_{t=1}^{T} log ŷ_t^{k_t}
其中，L_ce是各时间帧的音频数据归属于给定标注的类间混淆度衡量指数对时间帧t的累加，ŷ_t^{k_t}是神经网络输出层经过softmax操作之后对应第k_t个结点的输出，神经网络中有K个输出结点，代表K类输出类别。
进一步的，对于ŷ_t^{k_t}，将通过下述公式得到，即：
ŷ_t^{j} = exp(a_t^{j}) / Σ_{j'=1}^{K} exp(a_t^{j'})，j = 1, 2, …, K
a_t = W·u_t + B
其中，a_t是神经网络最后一层（即softmax层的前一层）对应时间帧t的输出，a_t^{j}表示其第j个结点的分量，W和B分别对应最后一层的权重矩阵和偏置向量。
在另一个示例性实施例中,该步骤353还包括:按照指定权重因子,在设定损失函数对类内距离惩罚值和音频数据相对给定标注的类间混淆度衡量指数进行加权计算,得到音频数据相对给定标注的损失函数值。
其中,所进行的融合计算,是按照指定权重因子在设定损失函数进行二者之间的加权计算,以此来获得音频数据相对给定标注的损失函数值。
在一个示例性实施例中,作为融合损失函数的设定损失函数,将通过以下融合损失函数对中心损失函数和交叉熵损失函数进行融合计算,即:
L_fmf = L_ce + λ·L_cl
其中，L_fmf是音频数据相对给定标注的损失函数值，λ是指定权重因子。
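按指定权重因子λ对交叉熵损失与中心损失进行加权融合（L_fmf = L_ce + λ·L_cl）的过程，可以用如下单帧示意代码表示，其中logits与λ的取值均为演示用的假设：

```python
import math

def softmax(a):
    m = max(a)                       # 减去最大值以保证数值稳定
    exps = [math.exp(x - m) for x in a]
    s = sum(exps)
    return [e / s for e in exps]

def fused_loss(logits, label, l_cl, lam=1e-3):
    y = softmax(logits)
    l_ce = -math.log(y[label])       # 类间混淆度衡量指数（交叉熵）
    return l_ce + lam * l_cl         # 融合类内距离惩罚值 l_cl

loss = fused_loss([2.0, 0.5, 0.1], label=0, l_cl=0.5, lam=1e-3)
```

当λ取0时，融合损失退化为普通的交叉熵损失。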
在一个示例性实施例中,音频数据流中的不同时间帧都通过标注数据中的给定标注进行音频数据的标注。
正如前述所指出的,音频数据流包括了分别对应于若干时间帧的不同音频数据。每一时间帧的音频数据都进行了标注,在标注数据中有着对应的给定标注。
换而言之,标注数据中的给定标注都是对应于音频数据流中不同时间帧的音频数据的,以此来保证神经网络训练中标注数据与音频数据流的对齐。
图6是根据图3对应实施例示出的对步骤350在另一个示例性实施例进行描述的流程图。在另一个示例性实施例中,标注数据中补充空白标注,如图6所示,该步骤350包括:
在步骤501中,对标注数据中的给定标注和补充的空白标注,获取所属类别对应的中心向量。
在步骤503中，对音频数据流按时序对深度特征形成的深度特征序列，计算音频数据流映射为给定序列标注的概率以及给定序列标注分别相对中心向量的距离，获得音频数据流相对给定序列标注的类内距离惩罚值。
其中,给定序列标注包括补充的空白标注和给定标注。
首先应当说明的是,空白标注是标注数据中的新增标注,空白标注对应于“空白类”。应当理解,音频数据流中,往往存在不知道对应于哪一给定标注的一时间帧或者某几时间帧的音频数据,为此将音频数据归属于空白标注即可,由此将得以保证音频数据流与给定序列标注的对齐,即解决音频数据流与标注的长度不一致的问题,音频识别不再受限于帧级别标注数据的限制。
可以理解的,对于音频数据流而言,空白标注将存在于音频数据流归属的给定标注中,即空白标注分隔给定标注。
给定序列标注包括若干给定标注以及给定标注之间插入的空白标注。除此之外，在给定序列标注中，还将在首尾插入空白标注，以此来解决音频数据流中首帧音频数据以及最后一帧音频数据无含义，进而无法标注的问题。
由此,在一个示例性实施例中,音频数据流的标注数据是未对齐的离散标签串,在离散标签串补充空白标注,补充的空白标注和标注数据中的给定标注分别对应于音频数据流中不同时间帧的音频数据。
音频数据流所未对齐的离散标签串,是一给定序列标注,因此,离散标签串包括了若干给定标注,但是,并无法针对于输入信号流的每一帧而对应上每一给定标注。也就是说,并不知道离散标签串中某一给定标注对应到输入信号流的哪些帧。
以音频数据流和未对齐的离散标签串作为训练数据来进行神经网络的训练,在此作用下将使得神经网络的训练以及后续音频识别的实现不再受限于帧级别的标注数据,即不再受限于输入信号流与离散标签串二者之间的无法对齐。
通过所进行的中心损失计算获得的音频数据流相对给定序列标注的类内距离惩罚值，是对给定序列标注计算音频数据流中深度特征偏离中心向量的距离的期望值。给定序列标注是音频数据流可能对应的标注序列，由给定标注和空白标注构成。
而音频数据流映射为给定序列标注的概率，是相对于每一可能的给定序列标注所进行的计算，用于描述音频数据流与给定序列标注之间的映射关系。
在一个示例性实施例中,所进行音频数据流映射为给定序列标注概率的计算,可通过如下所示的条件概率分布计算实现,即:
p(s, t|z) = α_t(s)·β_t(s)
其中，α_t(s)和β_t(s)分别表示前向变量和后向变量，可依据CTC（Connectionist Temporal Classification）中的最大似然准则计算得到，z是长度为r的序列标注。
由于给定序列标注实质是一序列标注z插入空白标注所得到的,因此,对此给定序列标注计算音频数据流映射为这一给定序列标注的概率,实质是对序列标注z进行的。
与此相对应的,音频数据流相对给定序列标注的类内距离惩罚值,将通过如下条件期望中心损失函数计算得到,即:
L_ecl = Σ_{(x,z)∈S} Σ_{t=1}^{T} Σ_{s=1}^{2r+1} p(s, t|z')·(1/2)·‖u_t − c_{z'_s}‖²
其中，L_ecl是音频数据流相对给定序列标注的类内距离惩罚值，z'是在序列标注z的首尾及每个相邻给定标注之间插入空白标注之后得到的长度为2r+1的给定序列标注，c_{z'_s}是给定序列标注中第s个标注所属类别的中心向量，S则是音频数据流x与序列标注z这一标注对所在的训练集。
在为音频数据流所进行的融合计算中,进行着音频数据流映射为给定序列标注的概率以及给定序列标注分别相对中心向量的距离的计算,以此来完成条件期望中心损失函数的计算,获得音频数据流相对给定序列标注的类内距离惩罚值。标注数据中给定标注和空白标注所能够组成的每一可能的标注序列,都将作为给定序列标注参与计算。
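在序列标注z中插入空白标注得到z'，并以对齐后验p(s, t|z')为权重对类内距离求期望的过程，可以用如下玩具代码示意。其中的后验取值为演示用的假设输入，并非由前向、后向变量真实算得：

```python
BLANK = "<b>"

def insert_blanks(z):
    """在序列标注 z 的首尾及相邻标注之间插入空白标注，得到长度 2r+1 的 z'。"""
    zp = [BLANK]
    for label in z:
        zp += [label, BLANK]
    return zp

def expected_center_loss(features, zp, posterior, centers):
    """以 p(s,t|z') 为权重，对深度特征偏离各对应中心向量的平方距离求期望。"""
    loss = 0.0
    for t, u_t in enumerate(features):
        for s, label in enumerate(zp):
            c = centers[label]
            loss += posterior[t][s] * 0.5 * sum((u - v) ** 2 for u, v in zip(u_t, c))
    return loss

zp = insert_blanks(["a", "b"])                   # 长度 2*2+1 = 5
centers = {BLANK: [0.0], "a": [1.0], "b": [2.0]}
posterior = [[0.5, 0.5, 0.0, 0.0, 0.0]]          # 单个时间帧的对齐后验（假设值）
loss = expected_center_loss([[1.0]], zp, posterior, centers)
```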
在另一个示例性实施例中,该步骤350包括:根据深度特征,计算音频数据流相对给定序列标注的概率分布,且通过概率分布计算音频数据流的对数似然代价为音频数据流相对给定序列标注的类间混淆度衡量指数。
其中，随着音频数据流相对给定序列标注类内距离惩罚值的计算，也为此音频数据流进行其相对给定序列标注类间混淆度衡量指数的计算。音频数据流相对给定序列标注的类间混淆度衡量指数计算，其目标是最大化相对音频数据流而言给定序列标注为正确标注的概率，将最大化所有正确标注的概率，即最小化音频数据流相对给定序列标注概率分布的对数似然代价。
在一个示例性实施例中,音频数据流相对给定序列标注的概率分布可通过如下所述的公式计算得到,即:
p(z|x)
由此可以得到通过概率分布计算音频数据流的对数似然代价为:
L_ml = −Σ_{(x,z)∈S} ln p(z|x)
其中，L_ml是音频数据流相对给定序列标注的类间混淆度衡量指数，即音频数据流的对数似然代价。
在另一个示例性实施例中,该步骤350还包括:按照指定权重因子,在设定损失函数中对音频数据流相对给定序列标注的类间混淆度衡量指数和类内距离惩罚值进行加权计算,得到音频数据流相对给定序列标注的损失函数值。
其中,基于如上所述音频数据流相对给定序列标注的类间混淆度衡量指数和类内距离惩罚值,进行二者之间的融合计算,即按照指定权重因子进行二者之间的加权计算,以此来得到音频数据流相对给定序列标注的损失函数值。
在一个示例性实施例中,将根据最小化的损失函数值来确定神经网络训练的收敛结束,因此,与此相对应的,音频数据流相对每一给定序列标注都按照指定权重因子进行加权计算,所对应最小化损失函数值的参数即可更新至神经网络中。
在一个示例性实施例,将通过下述时态多损失融合函数来计算得到音频数据流相对给定序列标注的损失函数值,即:
L_tmf = L_ml + λ·L_ecl
其中，L_tmf是音频数据流相对给定序列标注的损失函数值，λ是指定权重因子。
通过时态多损失融合函数,得以保证深度特征在类间的区分性,而条件期望中心损失函数则提高了深度特征在类内分布的紧凑程度,即保证鉴别性。
图7是根据图3对应实施例示出的对步骤370进行描述的流程图。在一个示例性实施例中,如图7所示,该步骤370包括:
在步骤371中，根据融合得到相对标注数据中一系列给定标注的损失函数值，进行神经网络中网络各层所更新参数的迭代训练，直至获得最小化的所述损失函数值。
在步骤373中,将最小化损失函数值对应的参数更新至神经网络的网络各层。
其中,实现音频识别且具备鲁棒性的神经网络,是通过带噪且连续的音频数据流进行训练所得到的。在音频数据流和融合损失函数的作用下,将使得训练所得到的神经网络往往涵盖着各种不同的声学条件,将使得所训练得到的神经网络能够适应各种不同的声学条件,具备更佳的可靠稳定性。
并且在通过神经网络的网络各层进行训练的过程中,根据所最小化的损失函数值进行着网络各层权重参数的优化,以此来获得对未见声学条件具备鲁棒性的神经网络。也就是说,在一个示例性实施例中,将以最小化的设定损失函数为训练目标来进行神经网络的训练,从而方能够通过神经网络实现音频数据流的标注预测。
神经网络的训练,将通过前向传递音频数据流直至输出产生误差信号,反向传播误差信息更新参数,例如网络各层的权重矩阵、softmax层的参数等,完成多层神经网络的训练,进而应用到音频分类任务中。
例如，对于softmax层所采用的时态多损失融合函数，其也是可微的，因此，可以通过神经网络标准的反向传播算法来训练。
通过此示例性实施例,将不断优化训练所得到的神经网络,进而不断增强神经网络进行音频识别的准确性。
通过如上所述的示例性实施例,便得以在各种声学条件,例如,干净语音条件、训练已见声学条件以及训练未见声学条件下实现自动语音识别等多种应用,并且能够取得非常低的字错误率。并且,在未见声学条件下通过如上所述的示例性实施例带来的相对字错误率降低的幅度是在所有声学条件下最为显著的。这都有力说明了通过如上所述的示例性实施例能有效地提高鲁棒性,并且通过同时保证深度特征在类间的区分性和类内分布的紧凑性,能够有效提高对于噪声可变性的泛化能力。
如上所述示例性实施例的训练实现，能够适用于各种网络结构的神经网络，也就是说，并不限定神经网络的模型类型和网络结构，可以替换为各种有效的新型的网络结构，并为所采用的神经网络构建softmax层，在此并未额外增加复杂度，也不需要针对性的做额外的超参或网络结构的调优，一致性的性能得到提高。
通过如上所述的示例性实施例,将能够应用到包括智能音箱、智能电视、在线语音识别系统、智能语音助手、同声传译以及虚拟人等多个项目和产品应用中,在复杂的具有高度可变性的真实声学环境中显著地改善准确率,性能得到极大的提升。
以描述自动语音识别系统的实现为例,结合上述方法实现进行阐述。作为音频识别的一种应用,自动语音识别系统将对输入的音频数据流进行训练,以获得神经网络。现有的自动语音识别的进行,一方面无法适用于所有可能的声学条件以及变化的声学条件,这是由于所采用的神经网络无法在训练时涵盖所有声学条件导致的;另一方面的,在进行神经网络的训练时,需要每个样本帧都具备对应的类别标注,但是这对于实际所进行的神经网络训练过程而言是无法满足的,所能够使用的训练数据是带噪的、连续的音频数据流和未对齐的离散标签序列,并不知道其中某一个标签对应到输入信号流的哪些帧。
为此,将应用如上所述的方法执行自动语音识别,在经由神经网络中的网络各层通过深度特征对音频数据流融合相对给定标注的类间混淆度衡量指数和类内距离惩罚值之后,即可得到音频数据流相对标注数据中一系列给定标注的损失函数值,完成神经网络的训练。
图8是根据一示例性实施例示出的自动语音识别系统中神经网络的网络架构示意图。在一个示例性实施例中,如图8所示,本申请实现自动语音识别系统所属神经网络的网络架构至少包括了卷积网络层加Max pool层的多层结构1010、LSTM的多层结构1030、全连接层1050以及融合损失函数计算模块;与此相对应的,音频数据流经特征提取模块得到输入特征之后经卷积网络层加Max pool层的多层结构1010,再经LSTM的多层结构1030,然后再通过全连接层1050,输出到融合损失函数计算模块,通过其所实现的融合损失函数完成神经网络训练。标注数据可以是音频数据流的音素表达。
而对于这一神经网络,将利用输出音素作为训练目标进行监督训练所得到。
示例性的，假设图8中的神经网络有K个输出结点，代表K类输出类别，例如，上下文相关的音素、上下文相关的子音素以及隐马尔可夫状态标签等。并且假设已有训练数据和对应的帧级别的标注 (x_t, k_t)，t=1,…,T，表示x_t是属于第k_t类的输入数据，此时即可采用前述示例性实施例所述的融合损失函数，即 L_fmf = L_ce + λ·L_cl，来计算得到音频数据流相对标注数据中一系列给定标注的损失函数值，以此来完成神经网络的训练。
对于自动语音识别系统的实现而言，在此融合损失函数的作用下，得以同时保证抽取得到的深度特征的类间区分性和类内分布的紧密度，从而提高所使用神经网络对训练时未见的声学场景在测试时的鲁棒性。
在此基础上进一步延伸的，图8所示出神经网络中的融合损失函数计算模块将采用时态多损失融合函数，即 L_tmf = L_ml + λ·L_ecl，通过此融合损失函数计算模块的实现，保证类间区分性和类内鉴别性。
此时,应当理解的,通过时态多损失融合函数所计算得到音频数据流相对一系列给定序列标注的损失函数值,是通过在所有可能的标注序列上概率分布计算的进行实现的,给定该概率分布,时态多损失函数在直接最大化正确标注的概率的同时惩罚深度特征和对应中心的距离,由此不再受到帧级别标注数据的限制。
在为此自动语音识别系统的神经网络训练中,为计算训练数据的输入特征,取帧长25ms、帧移10ms提取40维的Fbank特征,然后计算它们的一阶和二阶差分构成120维的向量,归一化之后,将当前帧之前和之后的各5帧向量拼接起来,构成120*(5+5+1)=1320维的输入特征向量,即前述所指的对应时间帧的深度特征。
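上述输入特征的维度核算可以用如下代码示意，其中一阶、二阶差分以相邻帧的简单差分近似（实际系统多用回归窗计算差分，此处仅为演示用的简化假设）：

```python
def add_deltas(frames):
    """为每帧静态特征追加一阶、二阶差分，维度由 D 变为 3D。"""
    T = len(frames)
    def delta(seq):
        return [[(seq[min(t + 1, T - 1)][d] - seq[max(t - 1, 0)][d]) / 2.0
                 for d in range(len(seq[0]))] for t in range(T)]
    d1 = delta(frames)
    d2 = delta(d1)
    return [frames[t] + d1[t] + d2[t] for t in range(T)]

feats = add_deltas([[0.0] * 40 for _ in range(3)])   # 40 维 Fbank -> 120 维
input_dim = len(feats[0]) * (5 + 5 + 1)              # 前后各拼 5 帧 -> 1320 维
```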
神经网络的网络结构和超参如表1进行配置,正如前述所指出的该网络结构首先包含两层二维的卷积层,输出频道数分别为64和80,每层kernel size为(3,3),stride为(1,1);各卷积层分别接一层maxpool层,其kernel size为(2,2),stride为(2,2);然后接五层LSTM层,每层隐结点数为1024,输出结点数为512;然后接一个全联接层,输出结点数即对应K类输出类别,例如,详细实现中可以采用12K类上下文相关的音素。
表1.本申请所采用神经网络中网络结构的一个配置实例
网络层 超参
Conv2d_1 kernel_size=(3,3),stride=(1,1),out_channels=64
Max_pool_1 kernel_size=(2,2),stride=(2,2)
Conv2d_2 kernel_size=(3,3),stride=(1,1),out_channels=80
Max_pool_2 kernel_size=(2,2),stride=(2,2)
LSTM(5层) hidden_size=1024,with peephole,output_size=512
全联接层 output_size=12K
基于上述配置实例所配置的网络架构，可以采用融合损失函数，即 L_fmf = L_ce + λ·L_cl，或者时态多损失融合函数进行训练。对于采用融合损失函数 L_fmf = L_ce + λ·L_cl 的训练过程，在无噪声干净语音条件下的训练过程中，指定权重因子λ取1e-3；在带噪语音条件下的训练过程中，指定权重因子λ取1e-4。训练的优化算法采用Adam方法。学习率在训练开始时设初始值1e-4，而当平均验证似然值（每5K个分批训练后计算一次）连续3次未降时，学习率减半。如果平均验证似然值连续8次未降，则提前终止训练。
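文中学习率调度与提前终止的逻辑可以用如下代码示意，此处将"未降"按"未获改善"理解，接口与计数方式均为本示意的假设：

```python
def schedule(val_likelihoods, lr=1e-4, halve_after=3, stop_after=8):
    """平均验证似然值连续 halve_after 次未改善则学习率减半，
    连续 stop_after 次未改善则提前终止。返回 (最终学习率, 是否提前终止)。"""
    best, since_best = float("-inf"), 0
    for v in val_likelihoods:
        if v > best:
            best, since_best = v, 0
        else:
            since_best += 1
            if since_best % halve_after == 0:
                lr *= 0.5                 # 连续 3 次未改善：减半
            if since_best >= stop_after:
                return lr, True           # 连续 8 次未改善：提前终止
    return lr, False

result = schedule([1.0, 2.0] + [0.0] * 8)   # 先改善两次，随后 8 次未改善
```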
而对于采用时态多损失融合函数所进行的神经网络训练，由于时态多损失融合函数也是可微的，因此可以通过标准的反向传播算法来训练。基于前述所描述的神经网络学习方法，对应的，图9是根据一示例性实施例示出的融合损失函数监督训练神经网络时的前向传播和反向传播错误信号流的示意图。
示例性的,时态多损失融合函数的学习算法,包括:
输入部分，即以训练标注对 (x, z)∈S 为输入，并且设置卷积层和LSTM层的初始化参数θ，全连接层的初始化权重参数W和初始化中心向量 {c_j | j=1, 2, …, K}，权重因子λ，批动量（batch momentum）μ和学习率γ。
输出部分,在时态多损失融合函数的学习算法中,将进行参数θ和W的调整以及插入空白标注之后中心向量的参数更新。
具体的，按照时态多损失融合函数，计算由CTC损失函数 L_ml = −ln p(z|x) 产生的反向传播错误信号，如图9所示出的，经由softmax层，即可获得音频数据流的对数似然代价L_ml的反向传播错误信号，即：
δ_ml = ∂L_ml/∂a_t，其第k个分量为 ŷ_t^{k} − (1/p(z|x))·Σ_{s∈B(z',k)} α_t(s)·β_t(s)，其中B(z', k)表示给定序列标注z'中取值为第k类标注的位置集合。
然后计算由条件期望中心损失函数产生的反向传播错误信号，即：
δ_ecl = ∂L_ecl/∂u_t = Σ_{s=1}^{2r+1} p(s, t|z')·(u_t − c_{z'_s})
经由如图9中的倒数第二层,计算融合的反向传播错误信号,即:
δ = W^T·δ_ml + λ·δ_ecl
至此，根据链式准则，利用上述反向传播错误信号δ_ml和δ计算参数W和θ的调整值△W和△θ。
并且更新中心向量，即：
c_j ← c_j − γ·△c_j，其中 △c_j = Σ_t Σ_{s: z'_s=j} p(s, t|z')·(c_j − u_t) / (1 + Σ_t Σ_{s: z'_s=j} p(s, t|z'))，j = 1, 2, …, K
以此类推,直至收敛。
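上述融合反向传播错误信号 δ = W^T·δ_ml + λ·δ_ecl 的计算，可以用如下纯Python代码示意，矩阵维度与数值均为演示用的假设：

```python
def fuse_error(W, delta_ml, delta_ecl, lam=1e-3):
    """δ = W^T·δ_ml + λ·δ_ecl：W 为 K×d 的最后一层权重矩阵，
    delta_ml 为 K 维输出层错误信号，delta_ecl 为 d 维中心损失错误信号。"""
    d = len(W[0])
    back = [sum(W[k][i] * delta_ml[k] for k in range(len(W))) for i in range(d)]
    return [b + lam * e for b, e in zip(back, delta_ecl)]

W = [[1.0, 0.0], [0.0, 1.0]]               # 2x2 单位权重矩阵（假设值）
delta = fuse_error(W, [0.2, -0.1], [1.0, 1.0], lam=0.5)
# -> [0.2 + 0.5*1.0, -0.1 + 0.5*1.0] = [0.7, 0.4]
```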
通过如上所述的损失函数训练得到的神经网络应用到自动语音识别系统中,进而获得对未见声学条件的鲁棒性。
当然,应当理解的,也可采用其它训练方法来基于本申请所述的方法获得神经网络对未见声学条件的鲁棒性。
下述为本申请装置实施例,用于执行本申请上述音频识别方法实施例。对于本申请装置实施例中未披露的细节,请参照本申请音频识别方法实施例。
图10是根据一示例性实施例示出的一种音频识别系统的框图。在一个示例性实施例中,如图10所示,该音频识别系统包括但不限于:数据流获取模块1210、特征抽取模块1230、融合计算模块1250以及更新模块1270。
数据流获取模块1210,用于为音频识别的神经网络训练获取音频数据流,所述音频数据流包括分别对应若干时间帧的音频数据;
特征抽取模块1230,用于对所述音频数据流中每个时间帧的不同音频数据,在所训练神经网络中进行网络各层的特征抽取,获得对应时间帧输出的深度特征;
融合计算模块1250,用于为标注数据中的给定标注,通过所述深度特征对所述音频数据流在设定损失函数融合相对所述给定标注的类间混淆度衡量 指数和类内距离惩罚值;
更新模块1270,用于通过融合得到相对标注数据中一系列给定标注的损失函数值,进行所述神经网络中的参数更新。
图11是根据图10对应实施例示出的融合计算模块的框图。在一个示例性实施例中,如图11所示,该融合计算模块1250包括:
中心向量获取单元1251,用于为所述标注数据中的给定标注,获取所述给定标注所属类别对应的中心向量,所述中心向量用于描述所述类别中所有深度特征的中心;
损失函数值融合单元1253,用于根据所述深度特征和所述中心向量对所述时间帧的音频数据进行设定损失函数中自身相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值二者之间的融合,获得所述音频数据相对所述给定标注的损失函数值。
在另一个示例性实施例中,该损失函数值融合单元1253进一步用于通过所述深度特征和所述中心向量,进行所述给定标注的中心损失计算,获得所述时间帧的音频数据相对所述给定标注的类内距离惩罚值。
在另一个示例性实施例中，该损失函数值融合单元1253进一步用于根据所述深度特征，采用交叉熵损失函数计算所述时间帧的音频数据相对所述给定标注的类间混淆度衡量指数。
在另一个示例性实施例中,该损失函数值融合单元1253进一步用于按照指定权重因子,在设定损失函数对所述类内距离惩罚值和所述音频数据相对所述给定标注的类间混淆度衡量指数进行加权计算,得到所述音频数据相对所述给定标注的损失函数值。
图12是根据图10对应实施例示出的融合计算模块在另一个示例性实施例的框图。在一个示例性实施例中,标注数据中补充空白标注,如图12所示,该融合计算模块1250包括:
类别中心获取单元1301,用于对所述标注数据中的给定标注和补充的所述空白标注,获取所属类别对应的中心向量;
类内距离惩罚值计算单元1303,用于对所述音频数据流按时序对所述深度特征形成的深度特征序列,计算所述音频数据流映射为给定序列标注的概率 以及所述给定序列标注分别相对所述中心向量的距离,获得所述音频数据流相对所述给定序列标注的类内距离惩罚值;
其中,所述给定序列标注包括补充的所述空白标注和给定标注。
在另一个示例性实施例中,该融合计算模块1250还包括概率分布计算单元,概率分布计算单元用于根据所述深度特征,计算所述音频数据流相对所述给定序列标注的概率分布,且通过所述概率分布计算所述音频数据流的对数似然代价为所述音频数据流相对所述给定序列标注的类间混淆度衡量指数。
在另一个示例性实施例中,该融合计算模块1250还包括加权计算单元,加权计算单元用于按照指定权重因子,在设定损失函数中对所述音频数据流相对所述给定序列标注的类间混淆度衡量指数和类内距离惩罚值进行加权计算,得到所述音频数据流相对所述给定序列标注的损失函数值。
图13是根据图10对应实施例示出的更新模块的框图。在另一个示例性实施例中，如图13所示，该更新模块1270包括：
迭代训练单元371，用于根据融合得到相对标注数据中一系列给定标注的损失函数值，进行所述神经网络中网络各层所更新参数的迭代训练，直至获得最小化的所述损失函数值；
参数更新单元373,用于将最小化损失函数值对应的参数更新至所述神经网络的网络各层。
可选的,本申请还提供一种机器设备,该机器设备可以用于图1所示实施环境中,执行图3、图4、图5、图6和图7任一所示的方法的全部或者部分步骤。所述装置包括:
处理器;
用于存储处理器可执行指令的存储器;
其中,所述处理器被配置为执行实现前述所指的方法。
该实施例中的装置的处理器执行操作的具体方式已经在有关前述实施例中执行了详细描述,此处将不做详细阐述说明。
另外，本申请实施例还提供了一种存储介质，该存储介质包括存储的程序，程序运行时执行上述方法的任一实施方式中的步骤。
另外，本申请实施例还提供了一种包括指令的计算机程序产品，当其在计算机上运行时，使得所述计算机执行上述方法的任一实施方式中的步骤。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围执行各种修改和改变。本申请的范围仅由所附的权利要求来限制。

Claims (16)

  1. 一种实现音频识别的神经网络训练方法,应用于音频识别端,所述方法包括:
    为音频识别的神经网络训练获取音频数据流,所述音频数据流包括分别对应若干时间帧的音频数据;
    对所述音频数据流中每个时间帧的不同音频数据,在所训练神经网络中进行网络各层的特征抽取,获得对应时间帧输出的深度特征;
    为标注数据中的给定标注,通过所述深度特征对所述音频数据流在设定损失函数中融合相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值;
    通过融合得到相对标注数据中一系列给定标注的损失函数值,进行所述神经网络中的参数更新。
  2. 根据权利要求1所述的方法,所述为音频识别的神经网络训练获取音频数据流,包括:
    获取带噪且连续的音频数据流以及标注数据为所述神经网络的训练数据。
  3. 根据权利要求1所述的方法,所述为标注数据中的给定标注,通过所述深度特征对所述音频数据流在设定损失函数中融合相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值,包括:
    为所述标注数据中的给定标注,获取所述给定标注所属类别对应的中心向量,所述中心向量用于描述所述类别中所有深度特征的中心;
    根据所述深度特征和所述中心向量对所述时间帧的音频数据进行设定损失函数中自身相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值二者之间的融合,获得所述音频数据相对所述给定标注的损失函数值。
  4. 根据权利要求3所述的方法,所述根据所述深度特征和所述中心向量对所述时间帧的音频数据进行设定损失函数中自身相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值二者之间的融合,获得所述音频数据相对所述给定标注的损失函数值,包括:
    通过所述深度特征和所述中心向量,进行所述给定标注的中心损失计算,获得所述时间帧的音频数据相对所述给定标注的类内距离惩罚值。
  5. 根据权利要求4所述的方法,所述根据所述深度特征和所述中心向量 对所述时间帧的音频数据进行设定损失函数中自身相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值二者之间的融合,获得所述音频数据相对所述给定标注的损失函数值,还包括:
    根据所述深度特征,采用交叉熵损失函数计算所述时间帧的音频数据相对所述给定标注的类间混淆度衡量指数。
  6. 根据权利要求4或5所述的方法,所述根据所述深度特征和所述中心向量对所述时间帧的音频数据进行设定损失函数中自身相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值二者之间的融合,获得所述音频数据相对所述给定标注的损失函数值,还包括:
    按照指定权重因子,在设定损失函数对所述类内距离惩罚值和所述音频数据相对所述给定标注的类间混淆度衡量指数进行加权计算,得到所述音频数据相对所述给定标注的损失函数值。
  7. 根据权利要求6所述的方法,所述音频数据流中的不同时间帧都通过所述标注数据中的给定标注进行音频数据的标注。
  8. 根据权利要求1所述的方法,所述标注数据中补充空白标注,所述为标注数据中的给定标注,通过所述深度特征对所述音频数据流在设定损失函数中融合相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值,包括:
    对所述标注数据中的给定标注和补充的所述空白标注,获取所属类别对应的中心向量;
    对所述音频数据流按时序对所述深度特征形成的深度特征序列,计算所述音频数据流映射为给定序列标注的概率以及所述给定序列标注分别相对所述中心向量的距离,获得所述音频数据流相对所述给定序列标注的类内距离惩罚值;
    其中,所述给定序列标注包括补充的所述空白标注和给定标注。
  9. 根据权利要求8所述的方法,所述为标注数据中的给定标注,通过所述深度特征对所述音频数据流在设定损失函数中融合相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值,包括:
    根据所述深度特征,计算所述音频数据流相对所述给定序列标注的概率分布,且通过所述概率分布计算所述音频数据流的对数似然代价为所述音频数据 流相对所述给定序列标注的类间混淆度衡量指数。
  10. 根据权利要求8或9所述的方法,所述为标注数据中的给定标注,通过所述深度特征对所述音频数据流在设定损失函数中融合相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值,还包括:
    按照指定权重因子,在所述设定损失函数中对所述音频数据流相对所述给定序列标注的类间混淆度衡量指数和类内距离惩罚值进行加权计算,得到所述音频数据流相对所述给定序列标注的损失函数值。
  11. 根据权利要求10所述的方法,所述音频数据流的标注数据是未对齐的离散标签串,在所述离散标签串补充空白标注,补充的所述空白标注和所述标注数据中的给定标注分别对应于所述音频数据流中不同时间帧的音频数据。
  12. 根据权利要求1所述的方法,所述通过融合得到相对标注数据中一系列给定标注的损失函数值,进行所述神经网络中的参数更新,包括:
    根据融合得到相对标注数据中一系列给定标注的损失函数值，进行所述神经网络中网络各层所更新参数的迭代训练，直至获得最小化的所述损失函数值；
    将最小化损失函数值对应的参数更新至所述神经网络的网络各层。
  13. 一种实现音频识别的神经网络训练系统,所述音频识别系统包括:
    数据流获取模块,用于为音频识别的神经网络训练获取音频数据流,所述音频数据流包括分别对应若干时间帧的音频数据;
    特征抽取模块,用于对所述音频数据流中每个时间帧的不同音频数据,在所训练神经网络中进行网络各层的特征抽取,获得对应时间帧输出的深度特征;
    融合计算模块,用于为标注数据中的给定标注,通过所述深度特征对所述音频数据流在设定损失函数融合相对所述给定标注的类间混淆度衡量指数和类内距离惩罚值;
    更新模块,用于通过融合得到相对标注数据中一系列给定标注的损失函数值,进行所述神经网络中的参数更新。
  14. 一种机器设备,包括:
    处理器;以及
    存储器,所述存储器上存储有计算机可读指令,所述计算机可读指令被所述处理器执行时实现根据权利要求1至12中任一项所述的方法。
  15. 一种存储介质,所述存储介质包括存储的计算机程序,其中,所述计算机程序运行时执行上述权利要求1至12任一项中所述的方法。
  16. 一种包括指令的计算机程序产品,当其在计算机上运行时,使得所述计算机执行权利要求1至12任意一项所述的方法。
PCT/CN2020/072063 2019-01-29 2020-01-14 音频识别方法、系统和机器设备 WO2020156153A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20748958.4A EP3920178A4 (en) 2019-01-29 2020-01-14 METHOD AND SYSTEM FOR AUDIO DETECTION AND DEVICE
US17/230,515 US11900917B2 (en) 2019-01-29 2021-04-14 Audio recognition method and system and machine device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910087286.4 2019-01-29
CN201910087286.4A CN109859743B (zh) 2019-01-29 2019-01-29 音频识别方法、系统和机器设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/230,515 Continuation US11900917B2 (en) 2019-01-29 2021-04-14 Audio recognition method and system and machine device

Publications (1)

Publication Number Publication Date
WO2020156153A1 true WO2020156153A1 (zh) 2020-08-06

Family

ID=66896707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/072063 WO2020156153A1 (zh) 2019-01-29 2020-01-14 音频识别方法、系统和机器设备

Country Status (4)

Country Link
US (1) US11900917B2 (zh)
EP (1) EP3920178A4 (zh)
CN (2) CN110517666B (zh)
WO (1) WO2020156153A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233655A (zh) * 2020-09-28 2021-01-15 上海声瀚信息科技有限公司 一种提高语音命令词识别性能的神经网络训练方法
CN112259078A (zh) * 2020-10-15 2021-01-22 上海依图网络科技有限公司 一种音频识别模型的训练和非正常音频识别的方法和装置
CN112529104A (zh) * 2020-12-23 2021-03-19 东软睿驰汽车技术(沈阳)有限公司 一种车辆故障预测模型生成方法、故障预测方法及装置
CN114040319A (zh) * 2021-11-17 2022-02-11 青岛海信移动通信技术股份有限公司 一种终端设备外放音质优化方法、装置、设备和介质

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517666B (zh) 2019-01-29 2021-03-02 腾讯科技(深圳)有限公司 音频识别方法、系统、机器设备和计算机可读介质
CN112216286B (zh) * 2019-07-09 2024-04-23 北京声智科技有限公司 语音唤醒识别方法、装置、电子设备及存储介质
CN110767231A (zh) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 一种基于时延神经网络的声控设备唤醒词识别方法及装置
CN110807465B (zh) * 2019-11-05 2020-06-30 北京邮电大学 一种基于通道损失函数的细粒度图像识别方法
CN111261146B (zh) * 2020-01-16 2022-09-09 腾讯科技(深圳)有限公司 语音识别及模型训练方法、装置和计算机可读存储介质
CN111265317B (zh) * 2020-02-10 2022-06-17 上海牙典医疗器械有限公司 一种牙齿正畸过程预测方法
CN113191218A (zh) * 2021-04-13 2021-07-30 南京信息工程大学 基于双线性注意力汇集和卷积长短期记忆的车型识别方法
CN113177482A (zh) * 2021-04-30 2021-07-27 中国科学技术大学 一种基于最小类别混淆的跨个体脑电信号分类方法
CN113823292B (zh) * 2021-08-19 2023-07-21 华南理工大学 基于通道注意力深度可分卷积网络的小样本话者辨认方法
CN114881213A (zh) * 2022-05-07 2022-08-09 天津大学 基于三分支特征融合神经网络的声音事件检测方法
CN114881212A (zh) * 2022-05-07 2022-08-09 天津大学 基于双分支判别特征神经网络的声音事件检测方法
CN115409073B (zh) * 2022-10-31 2023-03-24 之江实验室 一种面向i/q信号识别的半监督宽度学习方法及装置
CN115775562B (zh) * 2023-02-13 2023-04-07 深圳市深羽电子科技有限公司 一种用于蓝牙耳机的声音外泄检测方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105551483A (zh) * 2015-12-11 2016-05-04 百度在线网络技术(北京)有限公司 语音识别的建模方法和装置
US20170148431A1 (en) * 2015-11-25 2017-05-25 Baidu Usa Llc End-to-end speech recognition
CN108766445A (zh) * 2018-05-30 2018-11-06 苏州思必驰信息科技有限公司 声纹识别方法及系统
CN109256139A (zh) * 2018-07-26 2019-01-22 广东工业大学 一种基于Triplet-Loss的说话人识别方法
CN109859743A (zh) * 2019-01-29 2019-06-07 腾讯科技(深圳)有限公司 音频识别方法、系统和机器设备

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100136890A (ko) 2009-06-19 2010-12-29 삼성전자주식회사 컨텍스트 기반의 산술 부호화 장치 및 방법과 산술 복호화 장치 및 방법
US10147442B1 (en) * 2015-09-29 2018-12-04 Amazon Technologies, Inc. Robust neural network acoustic model with side task prediction of reference signals
CN106328122A (zh) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 一种利用长短期记忆模型递归神经网络的语音识别方法
CN107871497A (zh) * 2016-09-23 2018-04-03 北京眼神科技有限公司 语音识别方法和装置
CN106682734A (zh) * 2016-12-30 2017-05-17 中国科学院深圳先进技术研究院 一种提升卷积神经网络泛化能力的方法及装置
US10657955B2 (en) * 2017-02-24 2020-05-19 Baidu Usa Llc Systems and methods for principled bias reduction in production speech models
CN108804453B (zh) * 2017-04-28 2020-06-02 深圳荆虹科技有限公司 一种视音频识别方法及装置
CN106952649A (zh) * 2017-05-14 2017-07-14 北京工业大学 基于卷积神经网络和频谱图的说话人识别方法
KR102339716B1 (ko) * 2017-06-30 2021-12-14 삼성에스디에스 주식회사 음성 인식 방법 및 그 장치
US10714076B2 (en) * 2017-07-10 2020-07-14 Sony Interactive Entertainment Inc. Initialization of CTC speech recognition with standard HMM
CN108109613B (zh) 2017-12-12 2020-08-25 苏州思必驰信息科技有限公司 用于智能对话语音平台的音频训练和识别方法及电子设备
CN108364662B (zh) * 2017-12-29 2021-01-05 中国科学院自动化研究所 基于成对鉴别任务的语音情感识别方法与系统
CN108694949B (zh) * 2018-03-27 2021-06-22 佛山市顺德区中山大学研究院 基于重排序超向量和残差网络的说话人识别方法及其装置
US10699698B2 (en) * 2018-03-29 2020-06-30 Tencent Technology (Shenzhen) Company Limited Adaptive permutation invariant training with auxiliary information for monaural multi-talker speech recognition
CN108932950B (zh) * 2018-05-18 2021-07-09 华南师范大学 一种基于标签扩增与多频谱图融合的声音场景识别方法
CN108776835A (zh) * 2018-05-28 2018-11-09 嘉兴善索智能科技有限公司 一种深度神经网络训练方法
CN108922537B (zh) 2018-05-28 2021-05-18 Oppo广东移动通信有限公司 音频识别方法、装置、终端、耳机及可读存储介质
CN109165566B (zh) * 2018-08-01 2021-04-27 中国计量大学 一种基于新型损失函数的人脸识别卷积神经网络训练方法
CN109215662B (zh) * 2018-09-18 2023-06-20 平安科技(深圳)有限公司 端对端语音识别方法、电子装置及计算机可读存储介质
CN109065033B (zh) * 2018-09-19 2021-03-30 华南理工大学 一种基于随机深度时延神经网络模型的自动语音识别方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170148431A1 (en) * 2015-11-25 2017-05-25 Baidu Usa Llc End-to-end speech recognition
CN105551483A (zh) * 2015-12-11 2016-05-04 百度在线网络技术(北京)有限公司 语音识别的建模方法和装置
CN108766445A (zh) * 2018-05-30 2018-11-06 苏州思必驰信息科技有限公司 声纹识别方法及系统
CN109256139A (zh) * 2018-07-26 2019-01-22 广东工业大学 一种基于Triplet-Loss的说话人识别方法
CN109859743A (zh) * 2019-01-29 2019-06-07 腾讯科技(深圳)有限公司 音频识别方法、系统和机器设备

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NA LI, DEYI TUO, DAN SU, ZHIFENG LI, AND DONG YU: "Deep Discriminative Embeddings for Duration Robust Speaker Verification", INTERSPEECH 2018, 6 September 2018 (2018-09-06), pages 1 - 5, XP055724143, DOI: 10.21437/Interspeech.2018-1769 *
SARTHAK YADAV , ATUL RAI: "Learning Discriminative Features for Speaker Identification and Verification", INTERSPEECH 2018, 6 September 2018 (2018-09-06), pages 2237 - 2241, XP055724147, DOI: 10.21437/Interspeech.2018-1015 *
See also references of EP3920178A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233655A (zh) * 2020-09-28 2021-01-15 上海声瀚信息科技有限公司 一种提高语音命令词识别性能的神经网络训练方法
CN112259078A (zh) * 2020-10-15 2021-01-22 上海依图网络科技有限公司 一种音频识别模型的训练和非正常音频识别的方法和装置
CN112529104A (zh) * 2020-12-23 2021-03-19 东软睿驰汽车技术(沈阳)有限公司 一种车辆故障预测模型生成方法、故障预测方法及装置
CN114040319A (zh) * 2021-11-17 2022-02-11 青岛海信移动通信技术股份有限公司 一种终端设备外放音质优化方法、装置、设备和介质
CN114040319B (zh) * 2021-11-17 2023-11-14 青岛海信移动通信技术有限公司 一种终端设备外放音质优化方法、装置、设备和介质

Also Published As

Publication number Publication date
US11900917B2 (en) 2024-02-13
CN110517666A (zh) 2019-11-29
US20210233513A1 (en) 2021-07-29
EP3920178A4 (en) 2022-03-30
CN110517666B (zh) 2021-03-02
EP3920178A1 (en) 2021-12-08
CN109859743B (zh) 2023-12-08
CN109859743A (zh) 2019-06-07

Similar Documents

Publication Publication Date Title
WO2020156153A1 (zh) 音频识别方法、系统和机器设备
US11538463B2 (en) Customizable speech recognition system
WO2020232867A1 (zh) 唇语识别方法、装置、计算机设备及存储介质
US10957309B2 (en) Neural network method and apparatus
US20190034814A1 (en) Deep multi-task representation learning
Trigeorgis et al. Deep canonical time warping for simultaneous alignment and representation learning of sequences
JP7431291B2 (ja) ドメイン分類器を使用したニューラルネットワークにおけるドメイン適応のためのシステム及び方法
WO2016037350A1 (en) Learning student dnn via output distribution
JP2019035936A (ja) ニューラルネットワークを用いた認識方法及び装置並びにトレーニング方法及び電子装置
Tong et al. A comparative study of robustness of deep learning approaches for VAD
JP2023537705A (ja) オーディオ・ビジュアル・イベント識別システム、方法、プログラム
US11908457B2 (en) Orthogonally constrained multi-head attention for speech tasks
Bae et al. End-to-End Speech Command Recognition with Capsule Network.
WO2023061102A1 (zh) 视频行为识别方法、装置、计算机设备和存储介质
JP2020086436A (ja) 人工神経網における復号化方法、音声認識装置及び音声認識システム
Bilkhu et al. Attention is all you need for videos: Self-attention based video summarization using universal transformers
Zhu et al. Unsupervised voice-face representation learning by cross-modal prototype contrast
CN116246214B (zh) 视听事件定位方法、模型训练方法、装置及设备和介质
JP2020042257A (ja) 音声認識方法及び装置
KR20200044173A (ko) 전자 장치 및 그의 제어 방법
Kim et al. Speaker-adaptive lip reading with user-dependent padding
Nguyen et al. Joint deep cross-domain transfer learning for emotion recognition
CN113870863B (zh) 声纹识别方法及装置、存储介质及电子设备
US11775617B1 (en) Class-agnostic object detection
KR20230057765A (ko) 자기-지도 학습 기반의 다중 객체 추적 장치 및 방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20748958

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020748958

Country of ref document: EP

Effective date: 20210830