WO2019100998A1 - Speech signal processing model training method, electronic device, and storage medium - Google Patents
Speech signal processing model training method, electronic device, and storage medium
- Publication number: WO2019100998A1 (application PCT/CN2018/115704)
- Authority: WIPO (PCT)
- Prior art keywords: task, signal processing, training, speech signal, layer
Classifications
- G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
- G10L15/063: Training of speech recognition systems (creation of reference templates; adaptation to the speaker's voice)
- G06N3/04: Neural network architecture, e.g. interconnection topology
- G06N3/048: Activation functions
- G06N3/08: Learning methods
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/183: Speech classification or search using natural language modelling with context dependencies, e.g. language models
- G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L17/04: Speaker identification or verification: training, enrolment or model building
- G10L17/18: Speaker identification or verification: artificial neural networks; connectionist approaches
- G10L2015/0635: Training: updating or merging of old and new templates; mean values; weighting
- G10L21/0208: Speech enhancement: noise filtering
- G10L2021/02082: Noise filtering where the noise is echo or reverberation of the speech
- Y02T10/40: Engine management systems (climate change mitigation technologies related to transportation)
Definitions
- the embodiments of the present invention relate to the field of voice processing technologies, and in particular, to a voice signal processing model training method, an electronic device, and a storage medium.
- the performance of a terminal's speech signal processing technology is particularly important; in a typical speech recognition process, the terminal performs speech signal processing on the input multi-channel speech, outputs single-channel speech, and then sends the single-channel speech to a speech back-end server for speech recognition.
- a conventional speech signal processing pipeline generally comprises a plurality of speech signal processing tasks; the input multi-channel speech is processed step by step through these tasks, and single-channel speech is output.
- FIG. 1 shows a conventional speech signal processing pipeline of a terminal, which is composed of a plurality of speech signal processing tasks; these may specifically include: an echo cancellation task, a voice detection task, a voice direction detection task, a microphone array enhancement task, a single-channel noise reduction task, a reverberation elimination task, and so on. After the input multi-channel speech has been processed cooperatively by these tasks, single-channel speech is output and the terminal's speech signal processing is complete.
- with the development of neural network technology, neural networks are being applied more and more widely.
- one such technique uses a neural network to optimize the terminal's speech signal processing.
- the technique trains a speech signal processing model with a neural network and uses the model to replace, or to assist, the terminal's traditional speech signal processing pipeline, thereby improving the terminal's speech signal processing performance; it can be seen that training a speech signal processing model based on a neural network has important technical significance, such as improving speech signal processing performance.
- the problem with using a neural network to train a speech signal processing model is that the number of speech signal processing tasks involved in speech signal processing is large and the computational complexity involved in training is high, which leads to low training efficiency of the speech signal processing model.
- the embodiments of the present invention provide a speech signal processing model training method, an electronic device, and a storage medium, so as to reduce the computational complexity of training the speech signal processing model and improve the training efficiency of the speech signal processing model.
- the embodiment of the present invention provides the following technical solutions:
- a method for training a voice signal processing model is provided, the method being applied to an electronic device, including:
- the task input features of each speech signal processing task of the sample speech are used as the training input of the multi-task neural network to be trained, minimizing the target training loss function is used as the training target, and the parameters of the shared layer and of each task layer of the multi-task neural network to be trained are updated until the multi-task neural network to be trained converges, to obtain a speech signal processing model;
- the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each voice signal processing task.
- an embodiment of the present invention further provides a voice signal processing model training device, where the device is applied to an electronic device, including:
- a task input feature determining module configured to acquire sample voice, and determine a task input feature of each voice signal processing task of the sample voice
- a target loss function determining module configured to determine a target training loss function according to a training loss function of each of the respective speech signal processing tasks
- a model training module configured to use the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network to be trained, with minimizing the target training loss function as the training target,
- and to update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained until the multi-task neural network to be trained converges, to obtain a speech signal processing model;
- the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each voice signal processing task.
- an embodiment of the present invention further provides an electronic device, including: at least one memory and at least one processor; the memory stores a program, and the processor calls the program stored in the memory, where the program is used to:
- use the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network to be trained, take minimizing the target training loss function as the training target, and update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained until the multi-task neural network to be trained converges, to obtain a speech signal processing model;
- the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each voice signal processing task.
- an embodiment of the present invention further provides a storage medium storing a program suitable for execution by a processor, the program being used to:
- use the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network to be trained, take minimizing the target training loss function as the training target, and update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained until the multi-task neural network to be trained converges, to obtain a speech signal processing model;
- the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each voice signal processing task.
- the target training loss function is determined from the training loss functions of the plurality of speech signal processing tasks, the task input features of the plurality of speech signal processing tasks are used as the training input of the multi-task neural network, and, with minimizing the target training loss function as the training target, the multi-task neural network to be trained is trained to obtain a speech signal processing model.
- the multi-task neural network includes a shared layer and a task layer corresponding to each speech signal processing task; because the speech signal processing model is obtained by training this multi-task neural network, instead of training a separate neural network for each speech signal processing task, the computational complexity of training the speech signal processing model is effectively reduced and the training efficiency is improved.
- FIG. 1 is a schematic diagram of a conventional voice signal processing process
- FIG. 2 is a schematic diagram of a conventional speech signal processing model using a neural network
- FIG. 3 is a schematic structural diagram of a multi-task neural network according to an embodiment of the present invention.
- FIG. 4 is another schematic structural diagram of a multi-task neural network according to an embodiment of the present invention.
- FIG. 5 is a flowchart of a method for training a voice signal processing model according to an embodiment of the present invention
- FIG. 6 is a schematic diagram of training of a speech signal processing model
- FIG. 7 is another flowchart of a method for training a voice signal processing model according to an embodiment of the present invention.
- FIG. 8 is another training schematic diagram of a speech signal processing model
- FIG. 9 is still another flowchart of a method for training a voice signal processing model according to an embodiment of the present invention.
- FIG. 10 is still another flowchart of a method for training a voice signal processing model according to an embodiment of the present invention.
- FIG. 11 is a diagram showing an example of an application scenario of a speech signal processing model
- FIG. 12 is a diagram showing an example of use of an output result of a speech signal processing model
- FIG. 13 is a structural block diagram of a speech signal processing model training apparatus according to an embodiment of the present invention.
- FIG. 14 is another structural block diagram of a speech signal processing model training apparatus according to an embodiment of the present invention.
- FIG. 15 is still another structural block diagram of a speech signal processing model training apparatus according to an embodiment of the present invention.
- FIG. 16 is a block diagram showing the hardware structure of an electronic device.
- FIG. 2 is a schematic diagram of a conventional speech signal processing model using neural networks. As shown in FIG. 2, a separate neural network is constructed for each speech signal processing task involved in the speech signal processing pipeline, each neural network corresponding to one speech signal processing task; the neural network of each task is trained separately, and when a neural network reaches the training convergence condition of its corresponding task, the training of that neural network is complete. After every neural network has been trained, the trained neural networks are jointly combined into a speech signal processing model. The problem with this approach is that neural network training must be performed separately for each speech signal processing task, so for a large number of tasks the computational complexity of training is high; at the same time, the neural networks are relatively independent and the correlation between speech signal processing tasks is lost, so the performance of the resulting speech signal processing model has certain limitations.
- the embodiment of the present invention therefore improves the neural network structure of the speech signal processing model and trains the model on the improved structure, reducing the computational complexity of training the speech signal processing model and improving training efficiency; furthermore, the correlation between speech signal processing tasks is reflected in the training process, so the trained speech signal processing model is guaranteed to have reliable performance.
- the embodiment of the invention proposes a novel multi-task neural network.
- the multi-task neural network can reduce the computational complexity of training the speech signal processing model and further guarantee that the performance of the speech signal processing model is reliable.
- the multi-task neural network can be as shown in FIG. 3, including: a shared layer, and a task layer corresponding to each voice signal processing task;
- the output of the shared layer may be fed into each task layer, and each task layer outputs the task processing result of the speech signal processing task corresponding to that task layer; the shared layer may reflect the commonality among, and the association between, the speech signal processing tasks,
- while each task layer can reflect the task characteristics of its corresponding speech signal processing task, so that the output of each task layer better reflects the task requirements of the corresponding speech signal processing task.
- optionally, the shared layer may be defined as an LSTM (Long Short-Term Memory) network; for example, the shared layer may be a two-layer LSTM network.
- the task layers may be defined as MLP (Multi-Layer Perceptron) fully connected networks; for example, each task layer may be a one-layer MLP fully connected network.
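The structure above (one shared layer feeding several per-task layers) can be sketched as follows. This is an illustrative NumPy sketch, not the patent's implementation: a plain fully connected stack stands in for the two-layer LSTM shared layer, and all layer sizes and task names are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    # One fully connected layer (weights and bias), small random init.
    return {"W": rng.normal(0, 0.1, (in_dim, out_dim)),
            "b": np.zeros(out_dim)}

def forward(layer, x):
    return np.tanh(x @ layer["W"] + layer["b"])

# Shared layer feeding every task layer (a dense stand-in for the
# two-layer LSTM named in the text), plus one MLP head per task.
tasks = ["echo_cancellation", "voice_detection"]
shared = [dense(40, 64), dense(64, 64)]      # "two-layer" shared stack
heads = {t: dense(64, 1) for t in tasks}     # one-layer MLP per task

def multi_task_forward(x):
    h = x
    for layer in shared:                     # common shared representation
        h = forward(layer, h)
    # Every task layer consumes the same shared output.
    return {t: forward(heads[t], h) for t in tasks}

x = rng.normal(size=(8, 40))                 # batch of 8 feature frames
outputs = multi_task_forward(x)
print({t: o.shape for t, o in outputs.items()})
```

Note how the shared stack is evaluated once per input while each head produces its own task result, which is the property the shared/task-layer split is meant to provide.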
- optionally, the multi-task neural network provided by the embodiment of the present invention may include: a shared layer, and a task layer corresponding to each speech signal processing task involved in the speech signal processing pipeline;
- the plurality of speech signal processing tasks is not limited to those shown in FIG. 1; speech signal processing tasks may also be removed and/or added on the basis of the tasks shown in FIG. 1, and the specific speech signal processing tasks are not limited in this embodiment of the present invention.
- the embodiment of the present invention can perform training on the multi-task neural network to obtain a speech signal processing model.
- optionally, the embodiment of the present invention can train the multi-task neural network on all the speech signal processing tasks simultaneously, updating the parameters of the shared layer and of each task layer of the multi-task neural network;
- FIG. 5 shows an optional flow of the speech signal processing model training method provided by the embodiment of the present invention; the method can be applied to an electronic device with data processing capability, which may be a terminal device having data processing capability, such as a notebook computer or a PC (personal computer), or may be a server on the network side; this is not specifically limited in the embodiment of the present invention.
- the method process may include:
- Step S100: The electronic device acquires sample speech, and determines the task input features of each speech signal processing task of the sample speech.
- the sample speech is the sample used to train the speech signal processing model, and may be multi-channel speech; the number of sample speeches obtained by the embodiment of the present invention may be more than one, and for each sample speech the task input features of each speech signal processing task can be determined.
- the embodiment of the present invention may acquire, for the sample speech, the task input features of each speech signal processing task separately; optionally, the plurality of speech signal processing tasks involved in the terminal's speech signal processing pipeline may be as shown in FIG. 1, or speech signal processing tasks may be removed from the plurality of tasks shown in FIG. 1 and/or other speech signal processing tasks may be added.
- optionally, the plurality of speech signal processing tasks may include an echo cancellation task and a voice detection task; the echo cancellation task can be used to estimate a clean single-channel speech spectrum, and the voice detection task can be used to estimate the probability that speech is present.
- correspondingly, the embodiment of the present invention can obtain the task input features of the echo cancellation task of the sample speech, such as the spectral energy of the sample speech's single-channel speech and the spectral energy labeled as clean speech, and the task input features of the voice detection task of the sample speech, such as a label value indicating whether the sample speech contains speech; the label value may be 0 or 1, where 0 indicates that no speech is present and 1 indicates that speech is present.
- of course, the speech signal processing tasks described in the above paragraph are only examples; the speech signal processing pipeline may involve more speech signal processing tasks, and the embodiment of the present invention may acquire, for the sample speech, the task input features corresponding to the different speech signal processing tasks separately; the task input features corresponding to different speech signal processing tasks may differ.
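To make the per-task features concrete, the following sketch prepares inputs for the two example tasks named above. The `spectral_energy` helper, array shapes, and the random "clean" reference are all illustrative assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical helper: per-frame spectral energy of a single channel
# (shapes and frame sizes are illustrative, not from the patent).
def spectral_energy(frames):
    spectrum = np.fft.rfft(frames, axis=-1)
    return np.abs(spectrum) ** 2

n_frames, frame_len = 100, 256
noisy = rng.normal(size=(n_frames, frame_len))   # sample speech frames
clean = 0.5 * noisy                              # stand-in "clean" reference

task_inputs = {
    # Echo cancellation: noisy spectral energy plus clean-speech target.
    "echo_cancellation": {
        "input": spectral_energy(noisy),
        "target": spectral_energy(clean),
    },
    # Voice detection: same features, 0/1 speech-presence labels as target.
    "voice_detection": {
        "input": spectral_energy(noisy),
        "target": rng.integers(0, 2, size=n_frames),
    },
}
print(task_inputs["echo_cancellation"]["input"].shape)
```

The point of the dictionary layout is simply that each task carries its own input/target pair, matching the text's statement that task input features may differ between tasks.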
- Step S110: The electronic device determines a target training loss function according to the training loss function of each speech signal processing task.
- the embodiment of the present invention updates the parameters of the shared layer and of each task layer of the multi-task neural network by training on all the speech signal processing tasks, so the total training loss function used in training (referred to as the target training loss function) needs to be determined from the training loss function of each speech signal processing task;
- optionally, the embodiment of the present invention may determine the training loss function of each speech signal processing task; then, for any speech signal processing task, the training loss function of that task may be multiplied by the weight corresponding to that task to obtain the multiplication result corresponding to that task; after the multiplication result corresponding to each speech signal processing task has been determined, the multiplication results are added together to obtain the target training loss function;
- the target training loss function L can be determined according to the following formula: L = a_1*L_1 + a_2*L_2 + ... + a_N*L_N, where L_i is the training loss function of the i-th speech signal processing task and a_i is the weight corresponding to the i-th speech signal processing task;
- the value of a_i can be set according to the actual situation, or can be uniformly set to 1; N is the total number of speech signal processing tasks.
- Step S120: The electronic device uses the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network, takes minimizing the target training loss function as the training target, and updates the parameters of the shared layer and of each task layer of the multi-task neural network until the multi-task neural network converges, obtaining a speech signal processing model.
- specifically, the embodiment of the present invention can train the multi-task neural network to update the parameters of the shared layer and of each task layer: the task input features of each speech signal processing task of the sample speech are used as the training input of the multi-task neural network, minimizing the target training loss function is used as the training target, and the multi-task neural network is trained, updating the parameters of the shared layer and of each task layer, until the multi-task neural network converges, thereby obtaining a speech signal processing model; the multi-task neural network converges when it reaches its convergence condition.
- optionally, the convergence condition may include, but is not limited to: the number of training iterations reaching a maximum, or the target training loss function no longer decreasing.
- optionally, the embodiment of the present invention may use SGD (Stochastic Gradient Descent) and/or BP (Back Propagation) to update the parameters of the shared layer and of each task layer of the multi-task neural network;
- optionally, the parameters of the shared layer may be updated according to the target training loss function; for example, stochastic gradient descent may be used to update the shared layer's parameters according to the target training loss function obtained in each training round. The parameters of the task layer corresponding to any speech signal processing task may be updated according to the training loss function of that task; for example, stochastic gradient descent may be used to update the parameters of the task layer corresponding to a speech signal processing task according to that task's training loss function obtained in each training round.
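The split update rule just described (shared layer driven by the total loss, each task layer driven only by its own loss) can be sketched with a tiny linear model. Plain linear layers stand in for the LSTM/MLP, squared error stands in for the task losses, and all sizes, targets, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.normal(size=4)                      # one input feature vector
targets = {"echo": 1.0, "vad": 0.0}         # per-task training targets

W_shared = rng.normal(size=(3, 4)) * 0.1    # shared layer (linear stand-in)
heads = {t: rng.normal(size=3) * 0.1 for t in targets}  # per-task layers
lr = 0.02

for _ in range(500):                        # plain SGD iterations
    h = W_shared @ x                        # shared representation
    errs = {t: heads[t] @ h - targets[t] for t in targets}
    # Task layers: gradient of each task's OWN squared loss only.
    for t in targets:
        heads[t] -= lr * 2 * errs[t] * h
    # Shared layer: gradient of the summed (target) loss over ALL tasks.
    grad_shared = sum(2 * errs[t] * np.outer(heads[t], x) for t in targets)
    W_shared -= lr * grad_shared

total_loss = sum((heads[t] @ (W_shared @ x) - targets[t]) ** 2
                 for t in targets)
print(round(total_loss, 6))
```

The two update lines mirror the text's rule: `grad_shared` accumulates over every task, whereas each `heads[t]` update touches only that task's error.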
- in this way, the commonality of, and the association between, the speech signal processing tasks can be reflected through the shared layer, and the task characteristics of each corresponding speech signal processing task can be reflected by its task layer, so that the output of each task layer better reflects the task requirements of the corresponding speech signal processing task.
- as an optional example, the shared layer may be an LSTM network and a task layer may be an MLP fully connected network; updating the parameters of the shared layer of the multi-task neural network may then mean updating the parameters of the LSTM network, and updating the parameters of a task layer of the multi-task neural network may mean updating the parameters of the MLP fully connected network, including but not limited to updating the connection parameters from the input layer to the hidden layer of the MLP fully connected network, the connection parameters from the hidden layer to the output layer, and so on.
- as an example, assume the weight corresponding to each speech signal processing task is uniformly set to 1, and the plurality of speech signal processing tasks include an echo cancellation task and a voice detection task; the training of the speech signal processing model can then be illustrated as shown in FIG. 6, and the process is as follows:
- the task input features of the echo cancellation task and of the voice detection task of the sample speech are used as the training input of the multi-task neural network; with minimizing the sum of the training loss function of the echo cancellation task and the training loss function of the voice detection task as the training target, the parameters of the shared layer, of the echo cancellation task layer and of the voice detection task layer of the multi-task neural network are updated until the number of training iterations reaches the maximum, or the sum of the two training loss functions no longer decreases, and a speech signal processing model is obtained.
- optionally, the parameters of the shared layer of the multi-task neural network may be updated according to the sum of the training loss functions of the echo cancellation task and the voice detection task obtained in each training round; the parameters of the echo cancellation task layer may be updated according to the training loss function of the echo cancellation task obtained in each training round; and the parameters of the voice detection task layer may be updated according to the training loss function of the voice detection task obtained in each training round.
- optionally, the training loss function of the echo cancellation task may be the difference between the estimated clean speech spectral energy and its true value, and the training loss function of the voice detection task may be the difference between the estimated probability of speech presence and its true value; correspondingly, if the weight corresponding to each speech signal processing task is uniformly set to 1, the target training loss function is the sum of the training loss function of the echo cancellation task and the training loss function of the voice detection task, so when training the multi-task neural network, minimizing this sum is the training target.
- that is, minimizing the sum of the two training loss functions means minimizing the result of adding the difference between the estimated clean speech spectral energy and its true value to the difference between the estimated probability of speech presence and its true value.
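The combined target for this two-task example can be written out directly. Mean squared difference is used here as one concrete reading of "difference between estimate and true value"; the arrays and values are made up for illustration.

```python
import numpy as np

def echo_cancellation_loss(est_clean_energy, true_clean_energy):
    # Difference between estimated and true clean-speech spectral energy
    # (mean squared difference, one concrete choice of "difference").
    return float(np.mean((est_clean_energy - true_clean_energy) ** 2))

def voice_detection_loss(est_presence_prob, true_presence):
    # Difference between estimated speech-presence probability and the
    # 0/1 speech-presence label.
    return float(np.mean((est_presence_prob - true_presence) ** 2))

est_energy = np.array([1.0, 2.0, 3.0])
true_energy = np.array([1.0, 2.5, 3.0])
est_prob = np.array([0.9, 0.2])
true_label = np.array([1.0, 0.0])

# With both task weights set to 1, the target loss is the plain sum.
target_loss = (echo_cancellation_loss(est_energy, true_energy)
               + voice_detection_loss(est_prob, true_label))
print(target_loss)
```

Minimizing `target_loss` drives both differences toward zero together, which is exactly the combined training target stated above.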
- the speech signal processing model training method shown in FIG. 5 thus trains a speech signal processing model based on a multi-task neural network that includes a shared layer and a task layer corresponding to each speech signal processing task, using the task input features of each speech signal processing task of the sample speech to update the parameters of the shared layer and of each task layer. Because the embodiment of the present invention updates the parameters of the shared layer and of every task layer simultaneously, from the task input features of each speech signal processing task of the sample speech, instead of training a separate neural network for each speech signal processing task, the computational complexity involved in training the speech signal processing model is greatly reduced and the training efficiency of the speech signal processing model is improved.
- the above way of training a multi-task neural network based on all voice signal processing tasks, updating the shared layer of the multi-task neural network and the parameters of each task layer, reduces computational complexity compared with training a separate neural network for each voice signal processing task.
- based on the task characteristics of each speech signal processing task in the speech signal processing process, the embodiment of the present invention further provides a scheme for training the multi-task neural network in stages; when the differences between the speech signal processing tasks in the processing process are large,
- this scheme can first train the multi-task neural network with only part of the speech signal processing tasks, which guarantees the parameter convergence of the multi-task neural network.
- FIG. 7 shows another optional process of the voice signal processing model training method provided by the embodiment of the present invention.
- the method is applicable to an electronic device with data processing capability.
- the method process may include:
- Step S200 The electronic device acquires sample voice.
- Step S210 The electronic device determines at least one first type of voice signal processing task from the plurality of voice signal processing tasks of the voice signal processing process.
- the first type of voice signal processing task may be a basic task among the multiple voice signal processing tasks involved in the voice signal processing process; it may be understood that a basic task is a task that,
- among the plurality of voice signal processing tasks of the voice signal processing process, has an auxiliary effect on the other voice signal processing tasks;
- for example, if the plurality of voice signal processing tasks includes an echo cancellation task and a voice detection task, then because the echo cancellation task can estimate the single-channel speech spectrum and thereby greatly improve the accuracy of the speech-presence probability estimation, the echo cancellation task can be considered a basic speech signal processing task.
- alternatively, the first type of voice signal processing task may be a task with high training complexity among the multiple voice signal processing tasks involved in the voice signal processing process;
- the first type of voice signal processing task may then be determined as follows: when the training complexity of a voice signal processing task is higher than a set complexity threshold, that voice signal processing task is determined to be a first type of voice signal processing task; otherwise, it is not.
- for example, the echo cancellation task can be regarded as a first type of speech signal processing task with high training complexity.
- the number of the first type of voice signal processing tasks may be one or more.
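- The threshold rule above can be sketched as follows; the task names and complexity scores are hypothetical placeholders, not values from the embodiment.

```python
def select_first_type_tasks(tasks, complexity, threshold):
    # A task is a first type of voice signal processing task when its
    # training complexity exceeds the set complexity threshold
    return [t for t in tasks if complexity[t] > threshold]

tasks = ["echo_cancellation", "voice_detection"]
complexity = {"echo_cancellation": 0.9, "voice_detection": 0.3}  # hypothetical
first_type = select_first_type_tasks(tasks, complexity, 0.5)
```

With these placeholder scores, only the echo cancellation task is selected, matching the example in the text.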
- Step S220 The electronic device determines a task input feature of the first type of voice signal processing task of the sample voice, and a task input feature of each voice signal processing task of the sample voice.
- that is, the embodiment of the present invention determines both the task input feature of the first type of speech signal processing task of the sample speech and the task input feature of every speech signal processing task involved in the speech signal processing process;
- for example, determining the task input feature of the first type of speech signal processing task may be: determining the task input feature of the echo cancellation task of the sample speech;
- and determining the task input feature of each voice signal processing task may be: determining the task input feature of the echo cancellation task of the sample voice, and the task input feature of the voice detection task.
- Step S230 The electronic device determines a first target training loss function according to a training loss function of the first type of voice signal processing task; and determines a target training loss function according to a training loss function of each voice signal processing task.
- the embodiment of the present invention may determine the training loss function of each first type of voice signal processing task, where the number of first type of voice signal processing tasks is at least one;
- for each first type of speech signal processing task, the training loss function of that task is multiplied by the weight corresponding to the task, and the multiplication results of all first type of speech signal processing tasks
- are added to obtain the first target training loss function.
- assuming that the training loss function of the i-th first-type speech signal processing task is L1 i,
- and that a1 i is the weight corresponding to the i-th first-type speech signal processing task,
- the first target training loss function L1 all can be determined according to the following formula: L1 all = a1 1 × L1 1 + a1 2 × L1 2 + ... + a1 N1 × L1 N1 (that is, the sum over i = 1 to N1 of a1 i × L1 i),
- where N1 is the total number of first-type voice signal processing tasks.
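- The weighted sum defining the first target training loss function can be sketched directly:

```python
def first_target_loss(task_losses, weights):
    # L1_all = a1_1*L1_1 + a1_2*L1_2 + ... + a1_N1*L1_N1, where task_losses
    # holds the L1_i values and weights holds the a1_i values for the
    # N1 first-type speech signal processing tasks
    assert len(task_losses) == len(weights)
    return sum(a * l for a, l in zip(weights, task_losses))
```

For a single first-type task with weight 1, this reduces to that task's own training loss function.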
- for each voice signal processing task, the target training loss function may be obtained according to the foregoing step S110.
- Step S240 The electronic device uses the task input feature of the first type of speech signal processing task of the sample speech as the training input of the multi-task neural network, minimizes the first target training loss function as the training target, and updates the parameters of the shared layer of the multi-task neural network
- and of the task layer corresponding to the first type of speech signal processing task, until the multi-task neural network converges, to obtain the first multi-task neural network.
- that is, the embodiment of the present invention may first train the initial multi-task neural network based on the task input feature of the first type of speech signal processing task, minimizing the first target training loss function as the training target,
- during which the parameters of the shared layer of the multi-task neural network and of the task layer corresponding to the first type of speech signal processing task are updated; the specific parameter updating process may be: the electronic device uses the task input feature of the first type of speech signal processing task
- of the sample speech as the training input of the multi-task neural network, minimizes the first target training loss function as the training target, and through multiple iterations updates the parameters of the shared layer of the multi-task neural network and of the task layer corresponding to the first type of speech signal processing
- task, until the maximum number of iterations is reached or the first target training loss function no longer decreases, thereby obtaining the first multi-task neural network.
- the embodiment of the present invention may update the parameters of the shared layer according to the first target training loss function obtained in each training iteration; and the task layer corresponding to each first type of voice signal processing task may be updated, in each training iteration,
- according to the training loss function of that first type of speech signal processing task.
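- The iterate-until-converged loop described above (stop when the maximum iteration count is reached, or when the training loss no longer decreases) can be sketched as follows; the one-parameter toy model and learning rate are illustrative assumptions.

```python
def train_until_converged(step_fn, max_iters=100, tol=1e-6):
    # Run parameter-update iterations until the maximum number of iterations
    # is reached, or the training loss function no longer decreases
    prev_loss = float("inf")
    for i in range(max_iters):
        loss = step_fn()
        if prev_loss - loss <= tol:  # loss no longer decreasing
            return i + 1, loss
        prev_loss = loss
    return max_iters, prev_loss

# Toy example: one scalar "parameter" w trained by gradient descent on (w - 3)^2
state = {"w": 0.0}

def step():
    g = 2.0 * (state["w"] - 3.0)   # gradient of the toy loss
    state["w"] -= 0.1 * g          # gradient-descent parameter update
    return (state["w"] - 3.0) ** 2

iters, final_loss = train_until_converged(step)
```

In the patent's setting, `step_fn` would perform one parameter update of the shared layer and the first-type task layers and return the first target training loss.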
- Step S250 The electronic device uses the task input feature of each voice signal processing task of the sample voice as the training input of the first multi-task neural network, minimizes the target training loss function as the training target, and updates the parameters of the shared layer of the first multi-task neural network
- and of the task layer corresponding to each voice signal processing task, until the first multi-task neural network converges, to obtain the voice signal processing model.
- that is, after the embodiment of the present invention trains the multi-task neural network based on the task input feature of the first type of speech signal processing task, minimizing the first target training loss function as the training target, to obtain the first multi-task neural network,
- it then trains the first multi-task neural network, minimizing the target training loss function as the training target, to obtain the speech signal processing model.
- in this stage, the parameters of the shared layer of the first multi-task neural network and of the task layer corresponding to each voice signal processing task are updated; the specific parameter updating process may be: using the task input feature of each voice signal processing task of the sample voice
- as the training input of the first multi-task neural network, minimizing the target training loss function as the training target, and iteratively updating the parameters of the shared layer of the first multi-task neural network and of the task layer corresponding to each speech signal processing task,
- until the maximum number of iterations is reached or the target training loss function no longer decreases, thereby obtaining the speech signal processing model;
- the embodiment of the present invention may update the parameters of the shared layer according to the target training loss function obtained in each training iteration; and the task layer corresponding to each voice signal processing task may be updated, in each training iteration, according to the training loss function
- of that voice signal processing task.
- taking the case where the plurality of voice signal processing tasks includes an echo cancellation task
- and a voice detection task as an example, the training process of the voice signal processing model in the embodiment of the present invention can be as shown in FIG. 8; the process is as follows:
- first, the input feature of the echo cancellation task of the sample speech is used as the training input of the multi-task neural network, minimizing the training loss function of the echo cancellation task is the training target, and the parameters of the shared layer of the multi-task neural network and of the task layer corresponding to the echo cancellation task
- are updated until the number of iterations of the multi-task neural network reaches the maximum number, or the training loss function of the echo cancellation task no longer decreases, yielding the first multi-task neural network.
- the input features of the echo cancellation task may be: the spectral energy of the noisy single-channel speech of the sample speech and the spectral energy labeled as clean speech; the training objective may be: minimizing the difference between the estimated clean-speech spectral energy and its true value.
- then, the input features of the echo cancellation task and of the speech detection task of the sample speech are used as the training inputs of the first multi-task neural network, with minimizing the sum of the training loss function of the echo cancellation task and the training loss function of the speech detection task as the
- training target; the parameters of the shared layer of the first multi-task neural network and of the task layers of the echo cancellation task and the voice detection task are updated until the number of iterations of the first multi-task neural network reaches the maximum number, or the sum of the training loss functions of the echo cancellation task and the speech detection task no longer decreases, yielding the speech signal processing model.
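- The two-stage schedule of FIG. 8 can be sketched abstractly; here `train` stands in for a routine that runs one stage to convergence and returns the updated network, which is an assumption of this sketch rather than anything the embodiment specifies.

```python
def two_stage_training(train, echo_inputs, vad_inputs, echo_loss, vad_loss):
    # Stage 1: echo cancellation inputs only, echo-cancellation loss only
    # -> yields the first multi-task neural network
    first_net = train(network=None, inputs=[echo_inputs], loss=echo_loss)
    # Stage 2: both tasks' inputs, sum of both task losses
    # -> yields the speech signal processing model
    model = train(network=first_net, inputs=[echo_inputs, vad_inputs],
                  loss=lambda out: echo_loss(out) + vad_loss(out))
    return model

# Stub `train` that just records how many task inputs each stage used
stage_inputs = []
def train(network, inputs, loss):
    stage_inputs.append(len(inputs))
    return (network or 0) + 1

model = two_stage_training(train, "echo_features", "vad_features",
                           lambda out: 0.0, lambda out: 0.0)
```

The stub shows the control flow only: stage 1 sees one task's inputs, stage 2 sees both.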
- it can be seen that the embodiment of the present invention can determine, from the multiple voice signal processing tasks, the basic tasks or the tasks with high training complexity, obtaining at least one first type of voice signal processing
- task; first, the task input feature of the first type of speech signal processing task is used as the training input of the multi-task neural network, and the shared layer of the multi-task neural network and the task layer corresponding to the first type of speech signal processing task undergo parameter-update training,
- obtaining the first multi-task neural network; then the task input features of all speech signal processing tasks are used as the training input of the first multi-task neural network, and the shared layer of the first multi-task neural network and every task layer undergo parameter-update training, yielding the speech signal processing model.
- because a separate neural network is not trained for each speech signal processing task, the computational complexity involved in training the speech signal processing model is reduced; meanwhile, first training the multi-task neural network with the task input
- feature of the first type of speech signal processing task, and then training it with the task input features of all speech signal processing tasks, lets the training process reflect the correlation between the speech signal processing tasks and guarantees that the parameters of the multi-task neural network converge effectively, which ensures the reliability of the trained speech signal processing model.
- the method shown in FIG. 7 updates, based on the task input feature of the first type of speech signal processing task, the parameters of the shared layer of the multi-task neural network and of the task layer corresponding to the first type of speech signal processing task,
- to obtain the first multi-task neural network; in this process, since the first type of speech signal processing task is a basic task of the speech signal processing process, or a task of high training complexity, the reliable convergence of the parameters of the task layer corresponding to the first type of speech signal processing task is particularly critical to the performance of the speech signal processing model obtained by the subsequent training.
- FIG. 9 shows another optional process of the voice signal processing model training method provided by the embodiment of the present invention. It should be noted that the process shown in FIG. 9 is only optional; when training the first multi-task neural network, the first multi-task neural network can also be trained directly based on all the task input features of the first type of speech signal processing task, without training the multi-task neural network in multiple stages as shown in FIG. 9;
- the method shown in FIG. 9 is applicable to an electronic device with data processing capability.
- the method process may include:
- Step S300 The electronic device acquires sample voice.
- Step S310 The electronic device determines at least one first type of voice signal processing task from the plurality of voice signal processing tasks of the voice signal processing process.
- the implementation process of step S310 is the same as that of step S210;
- reference may be made to the description of step S210, and details are not described herein again.
- Step S320 The electronic device determines the task input features of the first type of voice signal processing task of the sample voice, and the task input feature of each voice signal processing task of the sample voice; the first type of voice signal processing task has multiple task input features, and each task input feature contains at least one feature.
- that is, any first type of voice signal processing task may have multiple task input features,
- and the number of features contained in any one task input feature is at least one.
- taking the case where the first type of voice signal processing task includes an echo cancellation task as an example,
- the embodiment of the present invention may set multiple task input features, such as setting the first task
- input feature of the echo cancellation task to be: the spectral energy of the noisy single-channel speech, and the spectral energy labeled as clean speech;
- setting the second task input feature of the echo cancellation task to be: the spectral energy of the multi-channel speech;
- and setting the third
- task input feature of the echo cancellation task to be: the spectral energy of the multi-channel speech, and the spectral energy of the reference signal (such as music played by a smart speaker).
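- The three alternative task input features just described can be represented, for example, as follows; the dict keys are hypothetical names for illustration, and the energies would in practice be spectral-energy arrays.

```python
def echo_task_input_features(noisy_single_ch, clean_label, multi_ch, reference):
    # Three alternative task input features for the echo cancellation task
    return [
        # first: noisy single-channel spectral energy + clean-speech label
        {"noisy_single_channel_energy": noisy_single_ch,
         "clean_speech_energy": clean_label},
        # second: multi-channel spectral energy
        {"multi_channel_energy": multi_ch},
        # third: multi-channel spectral energy + reference-signal spectral
        # energy (e.g. music played by a smart speaker)
        {"multi_channel_energy": multi_ch, "reference_energy": reference},
    ]

features = echo_task_input_features([0.1], [0.2], [[0.1], [0.3]], [0.4])
```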
- Step S330 The electronic device determines a first target training loss function according to a training loss function of the first type of voice signal processing task; and determines a target training loss function according to a training loss function of each voice signal processing task.
- the implementation process of step S330 is the same as that of step S230;
- reference may be made to the description of step S230, and details are not described herein again.
- Step S340 The electronic device selects, according to the current training phase, the current task input feature corresponding to the current training phase from the multiple task input features of the first type of voice signal processing task of the sample voice; uses the current task input feature as
- the training input of the multi-task neural network trained in the previous training phase; minimizes the first target training loss function as the training target; and updates the parameters of the shared layer, and of the task layer corresponding to the first type of speech signal processing
- task, of the multi-task neural network trained in the previous training phase, until the multi-task neural network trained on the last task input feature converges, to obtain the first multi-task neural network.
- step S340 may be considered an optional implementation in which the electronic device trains the multi-task neural network in multiple training phases, according to the multiple task input features of the first type of voice signal processing task of the sample voice, to obtain the first
- multi-task neural network, where each training phase uses one task input feature as the training input and minimizes the first target training loss function as the training target; the process of
- training the multi-task neural network across multiple training phases may be: updating the parameters of the shared layer of the multi-task neural network and of the task layer corresponding to the first-type speech signal processing task progressively, phase by phase.
- of course, the embodiment of the present invention does not exclude other ways of using the multiple task input features of the first type of speech signal processing task of the sample speech to train the multi-task neural network in multiple training stages.
- that is, the embodiment of the present invention may train the first multi-task neural network in multiple training stages, using, stage by stage, each task input feature of the first type of voice signal processing task as the training input, to obtain the first multi-task neural network; and, in the current training stage, the currently selected task input feature of the first-type speech signal processing task is used as
- the training input of the multi-task neural network completed in the previous training phase.
- taking the case where the task input features of the first type of voice signal processing task include three parts, namely a first task input feature, a second task input feature, and a third task input feature, as an example:
- the embodiment of the invention may first use the first task input feature as the training input of the multi-task neural network to be trained, minimize the first target training loss function as the training target, and update the parameters of the shared layer of the multi-task neural network and of the task layer corresponding to the first type of speech
- signal processing task, until the multi-task neural network trained on the first task input feature converges, obtaining the multi-task neural network of the first training phase; here, using the first task
- input feature as the training input of the multi-task neural network to be trained means that, for the first training phase, the selected task input feature of the current training phase is the first task input feature.
- then, the second task input feature is used as the training input of the multi-task neural network trained in the first training phase, minimizing the first target training loss function is the training target, and the parameters of the shared layer, and of the task layer corresponding to the first type of speech signal processing task, of the multi-task neural network of the first training phase
- are updated until the multi-task neural network trained on the second task input feature converges, obtaining the multi-task neural network of the second training phase;
- using the second task input feature as the training input of the multi-task neural network trained in the first training phase means that, for the second training phase, the selected task input feature of the current training phase is the second task input feature.
- then, the third task input feature is used as the training input of the multi-task neural network trained in the second training phase, minimizing the first target training loss function is the training target, and the parameters of the shared layer, and of the task layer corresponding to
- the first type of voice signal processing task, of the multi-task neural network of the second training phase are updated until the multi-task neural network trained on the third task input feature converges, obtaining the first multi-task neural network; this completes the process of training the multi-task
- neural network in multiple training stages based on the multiple task input features of the first type of speech signal processing task to obtain the first multi-task neural network.
- using the third task input feature as the training input of the multi-task neural network trained in the second training phase means that, for the third training phase, the selected task input feature of the current training phase is the third task input feature.
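- The phase-by-phase schedule above can be sketched as follows; `train_stage` is assumed to run one phase's parameter updates to convergence, starting from the network produced by the previous phase.

```python
def staged_training(initial_network, task_input_features, train_stage):
    # Each training phase takes the network produced by the previous phase
    # and one task input feature as the training input
    network = initial_network
    for features in task_input_features:
        network = train_stage(network, features)
    return network  # the first multi-task neural network

# Stub stage trainer that records the order in which the features are used
used = []
def train_stage(network, features):
    used.append(features)
    return (network or 0) + 1

first_net = staged_training(None, ["first", "second", "third"], train_stage)
```

The stub shows that the three task input features are consumed in order, each phase continuing from the previous phase's network.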
- for example, the first task input feature of the echo cancellation task is: the spectral energy of the noisy single-channel speech, and the spectral energy labeled as clean speech;
- the second task input feature of the echo cancellation task is: the spectral energy of the multi-channel speech;
- and the third task input feature of the echo cancellation task is: the spectral energy of the multi-channel speech, and the spectral energy of the reference signal; here, the reference signal can be, for example, music played by a smart speaker.
- first, the spectral energy of the noisy single-channel speech of the sample speech and the spectral energy labeled as clean speech are used as the training input of the multi-task neural network, minimizing the difference between the estimated clean-speech spectral energy and
- its true value is the training target, and the parameters of the shared layer of the multi-task neural network and of the task layer of the echo cancellation task are updated until the number of iterations reaches the maximum number or the training target no longer decreases.
- then, the spectral energy of the multi-channel speech of the sample speech is used as the training input of the multi-task neural network completed in the previous phase, minimizing the difference between the estimated clean-speech spectral energy and its true value is the training target, and the parameters
- of the shared layer of the multi-task neural network and of the task layer of the echo cancellation task are updated until the number of iterations reaches the maximum number or the training target no longer decreases, so that the trained multi-task neural network gains the capability of multi-channel spatial filtering.
- finally, the spectral energy of the multi-channel speech of the sample speech and the spectral energy of the reference signal are used as the training input of the multi-task neural network completed in the previous phase, minimizing the difference between the estimated clean-speech spectral energy and
- its true value is the training target, and the parameters of the shared layer of the multi-task neural network and of the task layer of the echo cancellation task are updated until the number of iterations reaches the maximum number or the training target no longer decreases, obtaining the first multi-task neural network, so that the first multi-task neural network can better fit multi-channel input signals and reference signals.
- the foregoing examples of multiple task input features of the first type of voice signal processing task are optional;
- the embodiment of the present invention may set, according to the specific situation, the number of task input features of the first type of voice signal processing task and the specific features contained in each task input feature; as in the example above, the task input features consisting of the spectral energy of the noisy single-channel speech, the spectral energy labeled as clean speech, and the spectral energy of the multi-channel speech can also be combined for training.
- Step S350 The electronic device uses the task input feature of each voice signal processing task of the sample voice as the training input of the first multi-task neural network, minimizes the target training loss function as the training target, and updates the parameters of the shared layer of the first multi-task neural network
- and of the task layer corresponding to each voice signal processing task, until the first multi-task neural network converges, to obtain the voice signal processing model.
- the implementation process of step S350 is the same as that of step S250;
- reference may be made to the description of step S250, and details are not described herein again.
- since the remaining tasks are relatively simple and relatively independent of each other, they can be combined and trained together, and
- the first multi-task neural network is trained to obtain the speech signal processing model.
- in any of the above training processes, the parameter update of the shared layer is performed based on the sum of the training loss functions of all the tasks used in the current training, while
- the parameter update of a task layer is performed based on the training loss function of the task corresponding to that task layer, so that the trained speech signal processing model can capture the correlation between the speech signal processing tasks through the shared layer, while the task characteristics of each corresponding speech signal processing task are reflected by its task layer.
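- The gradient routing just described (shared layer updated from the sum of the task losses, each task layer from its own loss) can be demonstrated on a toy scalar network; the bilinear model, learning rate, and targets are assumptions for illustration only.

```python
# Toy model: a shared scalar s feeds two scalar task "layers" t1 and t2.
# Predictions: y1 = t1*s*x (task 1), y2 = t2*s*x (task 2).
def train_step(params, x, d1, d2, lr=0.05):
    s, t1, t2 = params["s"], params["t1"], params["t2"]
    y1, y2 = t1 * s * x, t2 * s * x
    e1, e2 = y1 - d1, y2 - d2
    # Each task layer is updated from the gradient of its own task loss
    params["t1"] -= lr * 2 * e1 * s * x
    params["t2"] -= lr * 2 * e2 * s * x
    # The shared layer is updated from the gradient of the summed loss
    params["s"] -= lr * (2 * e1 * t1 * x + 2 * e2 * t2 * x)
    return e1 ** 2 + e2 ** 2  # target training loss before the update

params = {"s": 1.0, "t1": 0.5, "t2": 0.5}
losses = [train_step(params, x=1.0, d1=1.0, d2=2.0) for _ in range(1500)]
```

Iterating the step drives both task losses toward zero while the shared parameter serves both tasks, mirroring the shared-layer/task-layer split described above.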
- the basic process of the voice signal processing model training method provided by the embodiment of the present invention may be as shown in FIG. 10, which shows
- another optional process of the speech signal processing model training method; referring to FIG. 10, the method flow may include:
- Step S400 The electronic device acquires sample voice, and determines a task input feature of each voice signal processing task of the sample voice.
- the implementation process of step S400 is the same as that of step S100;
- reference may be made to the description of step S100, and details are not described herein again.
- Step S410 The electronic device determines a target training loss function according to a training loss function of each voice signal processing task.
- the implementation process of step S410 is the same as that of step S110;
- reference may be made to the description of step S110, and details are not described herein again.
- Step S420 The electronic device uses the task input feature of each voice signal processing task of the sample voice as the training input of the multi-task neural network to be trained, minimizes the target training loss function as the training target, and updates the parameters of the shared layer
- and of each task layer of the multi-task neural network to be trained, until the multi-task neural network to be trained converges, to obtain the voice signal processing model.
- optionally, the multi-task neural network to be trained may be an initial multi-task neural network (the corresponding process may reduce to the process shown in FIG. 5);
- optionally, the multi-task neural network to be trained may also be the first multi-task neural network: the embodiment of the present invention may first train
- the first multi-task neural network using the flow shown in FIG. 7 and take the first multi-task neural network as the multi-task neural network to be trained; then, according to the method shown in FIG. 10, the task input feature of each speech signal
- processing task of the sample speech is used as the training input of the first multi-task neural network, minimizing the target training loss function is the training target, and the parameters of the shared layer of the first multi-task neural network and of each task layer are updated until the first multi-task neural network converges, obtaining the speech signal processing model.
- the training of the first multi-task neural network may be implemented based on the task input feature of the first type of speech signal processing task of the sample speech; further, as an optional example, the first type of speech signal processing task may have multiple task input features, and the embodiment of the present invention may obtain the first multi-task neural network through the multi-stage training process shown in FIG. 9.
- whether the above-mentioned multi-task neural network to be trained is an initial multi-task neural network or a first multi-task neural network,
- the structure of the multi-task neural network to be trained necessarily includes a shared layer and a task layer corresponding to each speech signal processing task; for the shared layer, the target training loss function is minimized as the training target, and the parameters of the shared layer are updated according to the target training loss function; for the task layer of any speech
- signal processing task, the target training loss function is minimized as the training target, and the parameters of the task layer of that speech signal processing task are updated according to the training loss function of the speech signal processing task.
- the speech signal processing model training method provided by the embodiment of the present invention trains the speech signal processing model based on a multi-task neural network that includes a shared layer and a task layer corresponding to each speech signal processing task, instead of training a separate neural network for each speech signal
- processing task, which effectively reduces the computational complexity of training the speech signal processing model and improves training efficiency;
- further, training first on the task input feature of the first type of speech signal processing task of the sample speech, and then on the task input features of all speech signal processing tasks, mines
- the correlation between the tasks in the speech signal processing process, improves the performance of speech signal processing, and ensures the reliability of the speech signal processing model obtained by training.
- the embodiment of the present invention can use the speech signal processing model to replace the terminal's traditional speech signal processing process; specifically, the output of each task layer of the speech signal processing model can serve as the task processing result of the corresponding speech signal processing task.
- alternatively, the embodiment of the present invention may use the speech signal processing model to assist the terminal's traditional speech signal processing process; specifically, the output of each task layer of the speech signal processing model can assist the terminal's corresponding speech signal processing task in performing task processing.
- FIG. 11 is a diagram showing an example of an application scenario of a voice signal processing model.
- the embodiment of the present invention may use the speech signal processing model to perform front-end speech signal processing on the speech to be recognized that is input to an instant messaging client, and then transmit the processed speech to the voice background server of the instant messaging application for speech recognition; optionally, the instant messaging client can treat the output of each task layer of the speech signal processing model for the speech to be recognized as an auxiliary processing signal of the corresponding speech signal processing task, assisting the processing of each speech signal processing task and improving the accuracy of its output.
- the specific application process may include:
- the instant messaging client obtains the input speech to be recognized.
- the instant messaging client determines, according to the trained speech signal processing model, an output result of each task layer of the speech signal processing model for the speech to be recognized.
- the speech signal processing model is obtained by training the multi-task neural network with minimizing the target training loss function as the training target, where the target training loss function is determined according to the training loss function of each speech signal processing task, and the multi-task neural network includes the shared layer and a task layer corresponding to each speech signal processing task.
- the instant messaging client either takes the output result of each task layer for the speech to be recognized as the task processing result of the speech signal processing task corresponding to that task layer, or uses the output result of each task layer for the speech to be recognized to assist the corresponding speech signal processing task in performing task processing, obtaining the front-end speech signal processing result.
- the instant messaging client sends the front-end speech signal processing result to the voice background server, so that the voice background server performs speech recognition on the speech to be recognized according to the speech signal processing result.
- FIG. 12 shows an example of use of the output result of the speech signal processing model.
- the echo cancellation task layer of the speech signal processing model can output the speech spectrum estimation result of the speech to be recognized, so that this result can serve as an auxiliary processing signal for the terminal's traditional echo cancellation task; the echo cancellation task can then better distinguish the proportions of the reference signal and the speech signal during processing, improving its output result.
- in another implementation, the embodiment of the present invention can directly use the output of the echo cancellation task layer of the speech signal processing model as the output result of the echo cancellation task.
- the voice detection task layer of the speech signal processing model can output a result for the speech to be recognized, which is used as an auxiliary processing signal for the terminal's traditional voice detection task, so that the accuracy of the voice detection task's output can be improved; here, a weighted average of the output of the voice detection task layer and the output of the terminal's traditional voice detection task can be used as the final output result of the voice detection task.
- in another implementation, the embodiment of the present invention can directly use the output of the voice detection task layer for the speech to be recognized as the output result of the voice detection task.
- the voice direction detection task layer of the speech signal processing model can output a result for the speech to be recognized, which is then used to assist the terminal's traditional voice direction detection task in estimating the speech and noise of the speech to be recognized; the output of the voice direction detection task layer may be a speech/noise spectrum estimation result of the speech to be recognized.
- in another implementation, the speech/noise spectrum estimation result output by the voice direction detection task layer may be directly used as the output result of the voice direction detection task.
- the microphone array enhancement task layer of the speech signal processing model can output the speech/noise spectrum of the speech to be recognized, thereby assisting the terminal's traditional microphone array enhancement task in more accurately estimating parameters such as the target direction and the noise covariance matrix required by the array algorithm; obviously, in another implementation, the embodiment of the present invention can directly use the output of the microphone array enhancement task layer as the output result of the microphone array enhancement task.
- the single-channel noise reduction task layer of the speech signal processing model can output the speech/noise spectrum of the speech to be recognized, thereby assisting the terminal's traditional single-channel noise reduction task in acquiring key parameters such as the required signal-to-noise ratio, improving the processing effect of the single-channel noise reduction task.
- in another implementation, the embodiment of the present invention can directly use the output of the single-channel noise reduction task layer as the output result of the single-channel noise reduction task.
- the reverberation elimination task layer of the speech signal processing model can output a room reverberation estimate, thereby assisting the terminal's traditional reverberation elimination task in adjusting its algorithm parameters to control the degree of reverberation elimination; obviously, in another implementation, the output of the reverberation elimination task layer can also be directly used as the output result of the reverberation elimination task.
- the application of the speech signal processing model described above in processing the speech to be recognized is only an example, and can be understood as an application of the speech signal processing process in a smart speaker scenario; obviously, in different application scenarios, the application mode of the speech signal processing model can be adapted to the actual situation, without departing from the idea of using the speech signal processing model either to replace the terminal's traditional speech signal processing process or to assist it.
- the speech signal processing model training apparatus provided by the embodiment of the present invention is described below; the apparatus described below can be considered as the program modules that an electronic device needs in order to implement the speech signal processing model training method provided by the embodiment of the present invention. The apparatus described below and the speech signal processing model training method described above may be referred to in correspondence with each other.
- FIG. 13 is a structural block diagram of a voice signal processing model training apparatus according to an embodiment of the present disclosure.
- the apparatus is applicable to an electronic device having data processing capability. Referring to FIG. 13, the apparatus may include:
- the task input feature determining module 100 is configured to acquire a sample voice, and determine a task input feature of each voice signal processing task of the sample voice;
- the target loss function determining module 200 is configured to determine a target training loss function according to a training loss function of each voice signal processing task;
- the model training module 300 is configured to use the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network to be trained, with minimizing the target training loss function as the training target, and to update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained, until the multi-task neural network to be trained converges, obtaining the speech signal processing model;
- the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each speech signal processing task.
- the model training module 300 being configured to update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained by minimizing the target training loss function specifically includes: with minimizing the target training loss function as the training target, updating the parameters of the shared layer according to the target training loss function; and, for the task layer corresponding to any speech signal processing task, with minimizing the target training loss function as the training target, updating the parameters of that task layer according to the training loss function of the corresponding speech signal processing task.
- the multi-task neural network to be trained may include: a first multi-task neural network; correspondingly, FIG. 14 shows another structural block diagram of the speech signal processing model training apparatus provided by the embodiment of the present invention. With reference to FIG. 13 and FIG. 14, the apparatus may further include:
- the first network training module 400, configured to: determine at least one first type of speech signal processing task from the multiple speech signal processing tasks of the speech signal processing process, and determine the task input features of the first type of speech signal processing task of the sample speech; determine a first target training loss function according to the training loss function of the first type of speech signal processing task; and, taking the task input features of the first type of speech signal processing task of the sample speech as the training input of the initial multi-task neural network, with minimizing the first target training loss function as the training target, update the parameters of the shared layer of the initial multi-task neural network and of the task layer corresponding to the first type of speech signal processing task, until the initial multi-task neural network converges, obtaining the first multi-task neural network.
- the first network training module 400 being configured to determine at least one first type of speech signal processing task from the multiple speech signal processing tasks of the speech signal processing process specifically includes: determining a basic task among the multiple speech signal processing tasks, and determining the basic task as a first type of speech signal processing task, where a basic task is a task among the multiple speech signal processing tasks that has an auxiliary effect on the other speech signal processing tasks.
- the first network training module 400 may alternatively be configured to determine at least one first type of speech signal processing task from the multiple speech signal processing tasks of the speech signal processing process by: determining a speech signal processing task whose training complexity is higher than a set complexity threshold as a first type of speech signal processing task.
- the first network training module 400 being configured to determine the first target training loss function according to the training loss function of the first type of speech signal processing task specifically includes: for each first type of speech signal processing task, multiplying its training loss function by the weight corresponding to that task to obtain the multiplication result of that task, thereby determining the multiplication result of each first type of speech signal processing task;
- adding the multiplication results of all first type of speech signal processing tasks to obtain the first target training loss function.
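As a minimal sketch, the weighted sum just described is a one-liner; the loss values and weights below are illustrative:

```python
def target_training_loss(losses, weights):
    """Weighted sum of per-task training losses: L_all = sum(a_i * L_i)."""
    return sum(a * l for a, l in zip(weights, losses))

# Two first-type tasks; weights uniformly set to 1, as the patent allows.
print(target_training_loss([0.25, 0.5], [1.0, 1.0]))  # prints 0.75
```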
- the task input features of the first type of speech signal processing task of the sample speech may include: multiple task input features, where one task input feature includes at least one feature;
- the first network training module 400 being configured to take the task input features of the first type of speech signal processing task of the sample speech as the training input of the initial multi-task neural network, with minimizing the first target training loss function as the training target, and to update the parameters of the shared layer of the initial multi-task neural network and of the task layer corresponding to the first type of speech signal processing task until the initial multi-task neural network converges to obtain the first multi-task neural network, specifically includes:
- training the initial multi-task neural network in multiple training stages to obtain the first multi-task neural network, where each training stage uses one task input feature of the first type of speech signal processing task of the sample speech as the training input, and minimizes the first target training loss function as the training target.
- the first network training module 400 being configured to train the initial multi-task neural network in multiple training stages, according to the multiple task input features of the first type of speech signal processing task of the sample speech, to obtain the first multi-task neural network specifically includes:
- for the current training stage, selecting the current task input feature corresponding to the current training stage; taking the current task input feature as the training input of the multi-task neural network trained in the previous training stage, with minimizing the first target training loss function as the training target, and updating the parameters of the shared layer of the multi-task neural network trained in the previous training stage and of the task layer corresponding to the first type of speech signal processing task, until the multi-task neural network trained according to the last task input feature reaches convergence, obtaining the first multi-task neural network.
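The stage-by-stage procedure above can be sketched as a loop that carries the same network from one stage to the next. `train_until_converged` here is a hypothetical stand-in for the SGD training the patent describes, and the stage feature names are invented for illustration:

```python
def train_in_stages(network, stage_features, train_until_converged):
    """Each training stage resumes from the parameters left by the
    previous stage, using that stage's task input feature as input."""
    for feature in stage_features:          # one training stage per feature
        network = train_until_converged(network, feature)
    return network                          # the first multi-task neural network

# Illustrative stub: the "network" is just a log of the features that trained it,
# which makes the parameter hand-off between stages visible.
stages = ["spectral_energy", "voice_presence_label"]
model = train_in_stages([], stages, lambda net, feat: net + [feat])
```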
- the multi-task neural network to be trained may include: an initial multi-task neural network; the target loss function determining module 200 being configured to determine the target training loss function according to the training loss function of each speech signal processing task specifically includes:
- for each speech signal processing task, multiplying the training loss function of the speech signal processing task by the weight corresponding to that task to obtain the multiplication result corresponding to the task, thereby determining the multiplication result corresponding to each speech signal processing task;
- adding the multiplication results corresponding to all speech signal processing tasks to obtain the target training loss function.
- the shared layer in the multi-task neural network may include an LSTM network, and the task layer corresponding to each speech signal processing task may include: an MLP fully connected network corresponding to each speech signal processing task;
- the model training module 300 being configured to update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained may then specifically include: updating the input-layer-to-hidden-layer connection parameters, the hidden-layer-to-output-layer connection parameters, or the hidden-layer-to-hidden-layer connection parameters of the LSTM network of the multi-task neural network to be trained; and updating the input-layer-to-hidden-layer connection parameters or the hidden-layer-to-output-layer connection parameters of the MLP fully connected network corresponding to each speech signal processing task.
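As a shape-level sketch of that structure, a single-timestep forward pass through an LSTM shared layer feeding one single-layer MLP head per task might look as follows. The dimensions and the two task names are illustrative assumptions (the patent's example uses a two-layer LSTM and one head per speech signal processing task):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step (the shared layer): the four gates are computed
    from the current input x and the previous hidden state h."""
    z = x @ W + h @ U + b                     # shape (4*H,)
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g                         # new cell state
    h = o * np.tanh(c)                        # new hidden state
    return h, c

rng = np.random.default_rng(1)
D, H = 10, 6                                  # input dim, hidden dim (illustrative)
W = rng.normal(scale=0.1, size=(D, 4 * H))    # input-to-hidden connection parameters
U = rng.normal(scale=0.1, size=(H, 4 * H))    # hidden-to-hidden connection parameters
b = np.zeros(4 * H)

# One single-layer MLP head per task (task names are illustrative).
heads = {name: rng.normal(scale=0.1, size=(H, 1))
         for name in ("echo_cancellation", "voice_detection")}

h, c = np.zeros(H), np.zeros(H)
x = rng.normal(size=D)
h, c = lstm_step(x, h, c, W, U, b)            # shared-layer output feeds every head
outputs = {name: (h @ Wt).item() for name, Wt in heads.items()}
```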
- FIG. 15 illustrates yet another structural block diagram of the speech signal processing model training apparatus provided by the embodiment of the present invention; as shown in FIG. 14 and FIG. 15, the apparatus may further include:
- the model application module 500, configured to determine, for each task layer of the speech signal processing model, an output result for the speech to be recognized, and to take the output result of each task layer for the speech to be recognized as the task processing result of the speech signal processing task corresponding to that task layer.
- the model application module 500 may alternatively be configured to determine, for each task layer of the speech signal processing model, an output result for the speech to be recognized, and to use the output result of each task layer for the speech to be recognized to assist the speech signal processing task corresponding to that task layer in performing task processing.
- model application module 500 can also be used in the apparatus shown in FIG.
- the speech signal processing model training apparatus provided by the embodiment of the present invention can be applied to an electronic device.
- the hardware structure of the electronic device can be as shown in FIG. 16, including: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
- in the embodiment of the present invention, the number of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with each other through the communication bus 4;
- the processor 1 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
- the memory 3 may include a high speed RAM memory, and may also include a non-volatile memory such as at least one disk memory;
- the memory stores a program
- the processor calls the program
- the program is used to:
- acquire a sample speech, and determine the task input features of each speech signal processing task of the sample speech; determine a target training loss function according to the training loss function of each speech signal processing task; take the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network to be trained, with minimizing the target training loss function as the training target, and update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained, until the multi-task neural network to be trained converges, obtaining the speech signal processing model;
- the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each speech signal processing task.
- the program is also used to:
- with minimizing the target training loss function as the training target, update the parameters of the shared layer according to the target training loss function;
- with minimizing the target training loss function as the training target, update the parameters of the task layer of each speech signal processing task according to the training loss function of that speech signal processing task.
- the program is also used to:
- take the task input features of the first type of speech signal processing task of the sample speech as the training input of the initial multi-task neural network, with minimizing the first target training loss function as the training target, and update the parameters of the shared layer of the initial multi-task neural network and of the task layer corresponding to the first type of speech signal processing task, until the initial multi-task neural network converges, obtaining the first multi-task neural network.
- the program is also used to:
- for each first type of speech signal processing task, multiply the training loss function of the task by the weight corresponding to that task to obtain the multiplication result of the task, thereby determining the multiplication result of each first type of speech signal processing task;
- add the multiplication results of all first type of speech signal processing tasks to obtain the first target training loss function.
- the program is also used to:
- determine a basic task among the multiple speech signal processing tasks, and determine the basic task as a first type of speech signal processing task, where a basic task is a task among the multiple speech signal processing tasks that has an auxiliary effect on the other speech signal processing tasks.
- the program is also used to:
- the speech signal processing task whose training complexity is higher than the set complexity threshold is determined as the first type of speech signal processing task.
- the program is also used to:
- the initial multi-task neural network is trained in a plurality of training stages to obtain the first multi-task neural network
- a training phase uses a task input feature of the first type of speech signal processing task of the sample speech as a training input, and minimizes the first target training loss function as a training target.
- the program is also used to:
- for the current training stage, select the current task input feature corresponding to the current training stage from the multiple task input features of the first type of speech signal processing task of the sample speech;
- take the current task input feature as the training input of the multi-task neural network trained in the previous training stage, with minimizing the first target training loss function as the training target, and update the parameters of the shared layer of the multi-task neural network trained in the previous training stage and of the task layer corresponding to the first type of speech signal processing task, until the multi-task neural network trained according to the last task input feature reaches convergence, obtaining the first multi-task neural network.
- the program is also used to:
- for each speech signal processing task, multiply the training loss function of the speech signal processing task by the weight corresponding to that task to obtain the multiplication result corresponding to the task, thereby determining the multiplication result corresponding to each speech signal processing task;
- add the multiplication results corresponding to all speech signal processing tasks to obtain the target training loss function.
- the program is also used to:
- update the input-layer-to-hidden-layer connection parameters, the hidden-layer-to-output-layer connection parameters, or the hidden-layer-to-hidden-layer connection parameters of the LSTM network of the multi-task neural network to be trained;
- update the input-layer-to-hidden-layer connection parameters or the hidden-layer-to-output-layer connection parameters of the MLP fully connected network corresponding to each speech signal processing task.
- the program is also used to:
- the output result of the speech to be recognized by each task layer is taken as the task processing result of the speech signal processing task corresponding to each task layer.
- the program is also used to:
- the output result of the speech to be recognized by each task layer is used to assist the speech signal processing task corresponding to each task layer to perform task processing.
- for the refinement functions and extended functions of the program, reference may be made to the corresponding parts above.
- an embodiment of the present invention further provides a storage medium, which may be, for example, a memory; the storage medium stores a program suitable for execution by a processor, and the program is used to:
- acquire a sample speech, and determine the task input features of each speech signal processing task of the sample speech; determine a target training loss function according to the training loss function of each speech signal processing task; take the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network to be trained, with minimizing the target training loss function as the training target, and update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained, until the multi-task neural network to be trained converges, obtaining the speech signal processing model; where the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each speech signal processing task.
- for the refinement functions and extended functions of the program, reference may be made to the corresponding parts in the foregoing.
- the steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two.
- the software module can be placed in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Abstract
A speech signal processing model training method, an electronic device and a storage medium. A sample speech is acquired, and the task input features of each speech signal processing task of the sample speech are determined (S100); a target training loss function is determined according to the training loss function of each speech signal processing task (S110); the task input features of each speech signal processing task are input into the multi-task neural network to be trained, with minimizing the target training loss function as the training target, and the parameters of the shared layer and of each task layer of the multi-task neural network to be trained are updated until the multi-task neural network to be trained converges, obtaining the speech signal processing model (S120). Through the multi-task neural network, computational complexity is reduced and the training efficiency of the speech signal processing model is improved.
Description
This application claims priority to Chinese patent application No. 201711191604.9, filed on November 24, 2017 and entitled "Speech signal processing model training method, apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
The embodiments of the present invention relate to the technical field of speech processing, and in particular to a speech signal processing model training method, an electronic device and a storage medium.
With the development of speech recognition technology, in order to maintain a good speech recognition rate in complex environments, the performance of the terminal's speech signal processing technology is particularly important. At present, a typical speech recognition process is as follows: the terminal performs speech signal processing on the input multi-channel speech and outputs single-channel speech, and the single-channel speech is then sent to a voice background server for speech recognition.
A traditional speech signal processing process generally includes multiple speech signal processing tasks, which process the input multi-channel speech progressively and cooperatively to output single-channel speech. Taking a smart speaker scenario as an example, FIG. 1 shows the traditional speech signal processing process of a terminal, which consists of multiple speech signal processing tasks; these may specifically include: an echo cancellation task, a voice detection task, a voice direction detection task, a microphone array enhancement task, a single-channel noise reduction task, a reverberation elimination task, and so on. After the input multi-channel speech is cooperatively processed by the above multiple speech signal processing tasks, single-channel speech can be output, completing the terminal's speech signal processing.
With the development of deep learning technology, neural network technology is applied in more and more fields. To improve the speech signal processing performance of the terminal, the art uses neural networks to optimize the terminal's speech signal processing process: a speech signal processing model is trained using a neural network, and the model either replaces or assists the terminal's traditional speech signal processing process, improving the terminal's speech signal processing performance. It can be seen that training a speech signal processing model based on a neural network is of significant technical importance for improving speech signal processing performance.
The current difficulty in training a speech signal processing model with neural networks is that, because the speech signal processing process involves a large number of speech signal processing tasks, the computational complexity involved in training is high, resulting in low training efficiency of the speech signal processing model.
Summary of the Invention
In view of this, embodiments of the present invention provide a speech signal processing model training method, an electronic device and a storage medium, so as to reduce the computational complexity of training a speech signal processing model and improve the training efficiency of the speech signal processing model.
To achieve the above objective, embodiments of the present invention provide the following technical solutions:
In one aspect, a speech signal processing model training method is provided, applied to an electronic device, including:
acquiring a sample speech, and determining the task input features of each speech signal processing task of the sample speech;
determining a target training loss function according to the training loss function of each speech signal processing task;
taking the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network to be trained, with minimizing the target training loss function as the training target, updating the parameters of the shared layer and of each task layer of the multi-task neural network to be trained, until the multi-task neural network to be trained converges, to obtain a speech signal processing model;
where the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each speech signal processing task.
In another aspect, an embodiment of the present invention further provides a speech signal processing model training apparatus, applied to an electronic device, including:
a task input feature determining module, configured to acquire a sample speech and determine the task input features of each speech signal processing task of the sample speech;
a target loss function determining module, configured to determine a target training loss function according to the training loss function of each speech signal processing task;
a model training module, configured to take the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network to be trained, with minimizing the target training loss function as the training target, update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained, until the multi-task neural network to be trained converges, to obtain a speech signal processing model;
where the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each speech signal processing task.
In another aspect, an embodiment of the present invention further provides an electronic device, including: at least one memory and at least one processor; the memory stores a program, the processor calls the program stored in the memory, and the program is used to:
acquire a sample speech, and determine the task input features of each speech signal processing task of the sample speech;
determine a target training loss function according to the training loss function of each speech signal processing task;
take the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network to be trained, with minimizing the target training loss function as the training target, update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained, until the multi-task neural network to be trained converges, to obtain a speech signal processing model;
where the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each speech signal processing task.
In another aspect, an embodiment of the present invention further provides a storage medium, the storage medium storing a program suitable for execution by a processor, the program being used to:
acquire a sample speech, and determine the task input features of each speech signal processing task of the sample speech;
determine a target training loss function according to the training loss function of each speech signal processing task;
take the task input features of each speech signal processing task of the sample speech as the training input of the multi-task neural network to be trained, with minimizing the target training loss function as the training target, update the parameters of the shared layer and of each task layer of the multi-task neural network to be trained, until the multi-task neural network to be trained converges, to obtain a speech signal processing model;
where the multi-task neural network to be trained includes: a shared layer, and a task layer corresponding to each speech signal processing task.
In the embodiments of the present invention, a target training loss function is determined from the training loss functions of the multiple speech signal processing tasks, and the task input features of the multiple speech signal processing tasks serve as the training input of the multi-task neural network; with minimizing the target training loss function as the training target, the multi-task neural network to be trained is trained to obtain the speech signal processing model. The multi-task neural network includes a shared layer and a task layer corresponding to each speech signal processing task; the speech signal processing model is trained on this multi-task neural network instead of training a separate neural network for each speech signal processing task, which effectively reduces the computational complexity of training the speech signal processing model and improves training efficiency.
In order to more clearly illustrate the technical solutions in the embodiments of the present invention or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of a traditional speech signal processing process;
FIG. 2 is a schematic diagram of traditionally training a speech signal processing model using neural networks;
FIG. 3 is a schematic structural diagram of a multi-task neural network according to an embodiment of the present invention;
FIG. 4 is another schematic structural diagram of a multi-task neural network according to an embodiment of the present invention;
FIG. 5 is a flowchart of a speech signal processing model training method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of training a speech signal processing model;
FIG. 7 is another flowchart of a speech signal processing model training method according to an embodiment of the present invention;
FIG. 8 is another schematic diagram of training a speech signal processing model;
FIG. 9 is yet another flowchart of a speech signal processing model training method according to an embodiment of the present invention;
FIG. 10 is still another flowchart of a speech signal processing model training method according to an embodiment of the present invention;
FIG. 11 is an example diagram of an application scenario of a speech signal processing model;
FIG. 12 is an example diagram of the use of the output results of a speech signal processing model;
FIG. 13 is a structural block diagram of a speech signal processing model training apparatus according to an embodiment of the present invention;
FIG. 14 is another structural block diagram of a speech signal processing model training apparatus according to an embodiment of the present invention;
FIG. 15 is yet another structural block diagram of a speech signal processing model training apparatus according to an embodiment of the present invention;
FIG. 16 is a block diagram of the hardware structure of an electronic device.
FIG. 2 is a schematic diagram of traditionally training a speech signal processing model using neural networks. As shown in FIG. 2, a separate neural network is constructed for each speech signal processing task involved in the speech signal processing process, each neural network corresponding to one speech signal processing task, and the neural network of each speech signal processing task is trained separately; when a neural network reaches the training convergence condition of its corresponding speech signal processing task, the training of that neural network is completed. After the training of every neural network is completed, the trained neural networks are combined to form the speech signal processing model. The problems with this process are: a neural network needs to be trained separately for each speech signal processing task, so for a large number of speech signal processing tasks the computational complexity of training is high; at the same time, each neural network is relatively independent and the correlation between speech signal processing tasks is missing, so the performance of the trained speech signal processing model has certain limitations.
On this basis, the embodiments of the present invention consider improving the neural network structure of the speech signal processing model, and training the speech signal processing model based on the improved neural network structure, so as to reduce the computational complexity of training the speech signal processing model and improve training efficiency; further, the correlation between speech signal processing tasks is reflected in the training process, ensuring that the trained speech signal processing model has reliable performance.
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention proposes a novel multi-task neural network. By improving the neural network structure of the speech signal processing model, this multi-task neural network can reduce the computational complexity of training the speech signal processing model and further ensure the reliable performance of the speech signal processing model. As shown in FIG. 3, the multi-task neural network includes: a shared layer, and a task layer corresponding to each speech signal processing task;
Optionally, in the embodiment of the present invention, the shared layer feeds into each task layer, and each task layer outputs the task processing result of its corresponding speech signal processing task. The shared layer can reflect the correlation between speech signal processing tasks that share commonality, and each task layer can reflect the task characteristics of its corresponding speech signal processing task, so that the output of each task layer can better reflect the task requirements of the corresponding speech signal processing task.
Optionally, in the embodiment of the present invention, the shared layer may be defined as an LSTM (Long Short-Term Memory) network; as an optional example, the shared layer may be a two-layer LSTM network. A task layer may be defined as an MLP (Multi-Layer Perceptron) fully connected network, that is, each task layer may be an MLP fully connected network; as an optional example, each task layer may be a one-layer MLP fully connected network.
Taking the multiple speech signal processing tasks shown in FIG. 1 as an example, the multi-task neural network provided by the embodiment of the present invention may be as shown in FIG. 4, including:
a shared layer, an echo cancellation task layer, a voice detection task layer, ..., a single-channel noise reduction task layer, and a reverberation elimination task layer.
Obviously, in a specific speech signal processing process, the multiple speech signal processing tasks are not limited to those shown in FIG. 1; some speech signal processing tasks may be removed from and/or added to those shown in FIG. 1, which is not specifically limited in the embodiments of the present invention.
Based on the multi-task neural network provided by the above embodiments of the present invention, the embodiments of the present invention can train the multi-task neural network to obtain the speech signal processing model.
In an optional implementation of training the speech signal processing model, the embodiment of the present invention can train the multi-task neural network based on all speech signal processing tasks simultaneously, updating the parameters of the shared layer and of each task layer of the multi-task neural network;
Optionally, FIG. 5 shows an optional flow of the speech signal processing model training method provided by the embodiment of the present invention. The method can be applied to an electronic device having data processing capability; the electronic device may be a terminal device with data processing capability such as a notebook computer or a PC (Personal Computer), or may be a server on the network side, which is not specifically limited in the embodiments of the present invention. Referring to FIG. 5, the method flow may include:
Step S100: the electronic device acquires a sample speech, and determines the task input features of each speech signal processing task of the sample speech.
Optionally, the sample speech can be regarded as a sample used for training the speech signal processing model, and the sample speech may be multi-channel speech; multiple sample speeches may be acquired in the embodiment of the present invention, and the task input features of each speech signal processing task may be determined for every sample speech.
For the multiple speech signal processing tasks involved in the terminal's speech signal processing process, the embodiment of the present invention can acquire, for the sample speech, the task input features of each speech signal processing task separately. Optionally, the multiple speech signal processing tasks involved in the terminal's speech signal processing process may be as shown in FIG. 1; of course, speech signal processing tasks may also be removed from those shown in FIG. 1, and/or other forms of speech signal processing tasks may be added;
Optionally, for ease of understanding, as an optional example, suppose the multiple speech signal processing tasks include: an echo cancellation task and a voice detection task, where the echo cancellation task can be used to estimate the single-channel speech spectrum and the voice detection task can be used to estimate the speech presence probability. Then the embodiment of the present invention can acquire the task input features of the echo cancellation task of the sample speech, for example: the spectral energy of the noisy single-channel speech of the sample speech and the spectral energy labeled as clean speech; and acquire the task input features of the voice detection task of the sample speech, for example: a label value indicating whether speech is present in the sample speech, where the label value may be 0 or 1, 0 indicating that no speech is present and 1 indicating that speech is present.
显然,上段描述的语音信号处理任务仅是作为示例,语音信号处理过程实际所涉及的语音信号处理任务可能更多,本发明实施例可对样本语音,分别获取不同语音信号处理任务相应的任务输入特征,而不同的语音信号处理任务所对应的任务输入特征可能不同。
步骤S110、电子设备根据每个语音信号处理任务的训练损失函数,确定目标训练损失函数。
本发明实施例是通过训练所有的语音信号处理任务,来实现多任务神经网络的共享层和每个任务层的参数更新,因此训练所使用的总训练损失函数(称为目标训练损失函数)需基于每个语音信号处理任务的训练损失函数确定;
可选的,鉴于传统的分别针对每个语音信号处理任务,单独进行神经网络训练的方案,本发明实施例可确定出每个语音信号处理任务的训练损失函数;从而对于任一语音信号处理任务,本发明实施例可将该语音信号处理任务的训练损失函数,乘以该语音信号处理任务相应的权重,得到该语音信号处理任务相应的相乘结果,以此确定出每个语音信号处理任务相应的相乘结果后,进而将每个相乘结果相加,可得到目标训练损失函数;
示例的，设第i个语音信号处理任务的训练损失函数为L_i，a_i为第i个语音信号处理任务相应的权重，则可根据如下公式确定目标训练损失函数L_all：

L_all = ∑_{i=1}^{N} a_i · L_i

其中，a_i的数值可以根据实际情况进行设置，也可统一设置为1；N为语音信号处理任务的总数。
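上述加权求和的计算可用如下简短代码示意（target_loss为示例函数名，假设各任务的训练损失已分别计算为标量）：

```python
def target_loss(task_losses, weights=None):
    """计算目标训练损失 L_all = sum_i a_i * L_i。

    task_losses: 每个语音信号处理任务的训练损失 L_i 组成的列表。
    weights: 每个任务相应的权重 a_i；缺省时统一设置为1。
    """
    if weights is None:
        weights = [1.0] * len(task_losses)
    return sum(a * L for a, L in zip(weights, task_losses))
```

当权重统一设置为1时，目标训练损失即各任务训练损失之和。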
步骤S120、电子设备将样本语音的每个语音信号处理任务的任务输入特征，作为多任务神经网络的训练输入，以最小化目标训练损失函数为训练目标，对多任务神经网络的共享层和每个任务层的参数进行更新，直至多任务神经网络收敛，得到语音信号处理模型。
在确定样本语音的每个语音信号处理任务的任务输入特征，及确定训练的目标训练损失函数后，本发明实施例可对多任务神经网络进行训练，以实现多任务神经网络的共享层和每个任务层的参数更新；具体的，本发明实施例可将样本语音的每个语音信号处理任务的任务输入特征，作为多任务神经网络的训练输入，以最小化目标训练损失函数为训练目标，对多任务神经网络进行训练，实现多任务神经网络的共享层和每个任务层的参数更新，直至多任务神经网络收敛，从而得到语音信号处理模型；其中，当该多任务神经网络达到收敛条件时，则该多任务神经网络收敛，该收敛条件可以包括但不限于：训练的迭代次数达到最大次数，或者目标训练损失函数不再减小等，本发明实施例对此不做具体限定。
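上述收敛条件的判断可用如下示意函数表达（converged为示例函数名，max_iters、patience、tol等阈值均为假设值，实际取值取决于具体训练配置）：

```python
def converged(iteration, losses, max_iters=10000, patience=3, tol=1e-6):
    """收敛判断示意：迭代次数达到最大次数，或目标训练损失函数
    在最近 patience 次迭代中不再减小（减小量不超过 tol），即视为收敛。"""
    if iteration >= max_iters:
        return True
    if len(losses) > patience:
        recent = losses[-patience - 1:]
        # 相邻两次损失逐对比较：若均不再减小，则视为收敛
        if all(later >= earlier - tol
               for earlier, later in zip(recent, recent[1:])):
            return True
    return False
```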
可选的,在确定训练输入和训练目标后,本发明实施例可使用随机梯度下降(Stochastic Gradient Descent,SGD)和/或反向传播(Back Propagation,BP)方法,对多任务神经网络的共享层和每个任务层的参数进行更新;
可选的,在以最小化目标训练损失函数为训练目标,对多任务神经网络的共享层和每个任务层的参数进行更新时,共享层的参数更新可根据目标训练损失函数实现,如在每次训练时,可使用随机梯度下降方法,根据每次训练得出的目标训练损失函数,更新共享层的参数;而任一语音信号处理任务对应的任务层的参数更新,可根据该语音信号处理任务的损失函数实现,如在每次训练时,可使用随机梯度下降方法,根据每次训练得出的该语音信号处理任务的训练损失函数,更新该语音信号处理任务对应的任务层的参数;从而既可通过共享层体现具有共性的语音信号处理任务之间的关联性,又可通过每个任务层体现相应的语音信号处理任务的任务特性,使得每个任务层的输出结果能够更好的反映相应的语音信号处理任务的任务需求。
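为说明“共享层按各任务训练损失之和的梯度更新、任务层仅按自身任务训练损失的梯度更新”这一规则，下面以标量参数、平方损失为例，手工求导给出一次梯度下降更新的示意（train_step等名称、学习率与各数值均为示例假设，且权重统一取1）：

```python
def train_step(w_shared, w_tasks, x, targets, lr=0.1):
    """对共享层参数 w_shared 与各任务层参数 w_tasks 做一次梯度下降更新。
    每个任务的损失为 L = (w_task * w_shared * x - target)^2。"""
    h = w_shared * x                     # 共享层输出
    grads_task, grad_shared = {}, 0.0
    for name, t in targets.items():
        e = w_tasks[name] * h - t        # 任务层输出与真实值的残差
        grads_task[name] = 2 * e * h     # dL/dw_task：仅含本任务的损失梯度
        grad_shared += 2 * e * w_tasks[name] * x  # 共享层累加所有任务的梯度
    w_shared -= lr * grad_shared         # 共享层：按各任务损失之和更新
    new_tasks = {name: w - lr * grads_task[name]  # 任务层：仅按自身损失更新
                 for name, w in w_tasks.items()}
    return w_shared, new_tasks
```

可以验证，一次更新后各任务损失之和（即目标训练损失）相比更新前下降。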
可选的,作为一种示例,共享层可以是LSTM网络,一任务层可以是MLP全连接网络;更新多任务神经网络的共享层的参数可以如,更新LSTM网络的参数,包括但不限于更新LSTM网络的输入层到隐含层的连接参数、隐含层到输出层的连接参数或隐含层到隐含层之间的连接参数等;更新多任务神经网络的一任务层的参数可以如,更新MLP全连接网络的参数,包括但不限于更新MLP全连接网络的输入层到隐含层的连接参数、隐含层到输出层的连接参数等。
可选的,为便于理解,作为一种可选示例,若统一设置每个语音信号处理任务相应的权重为1,且以多语音信号处理任务包括:回声消除任务和语音检测任务为例;则进行语音信号处理模型的训练示意可如图6所示,过程如下:
将样本语音的回声消除任务和语音检测任务的输入特征，作为多任务神经网络的训练输入；以最小化回声消除任务的训练损失函数与语音检测任务的训练损失函数的和，为训练目标；对多任务神经网络的共享层，回声消除任务层和语音检测任务层的参数进行更新，直至多任务神经网络的迭代次数达到最大次数，或者，回声消除任务的训练损失函数与语音检测任务的训练损失函数的和不再减小，得到语音信号处理模型。
具体的,在每次训练时,可根据每次训练得出的回声消除任务与语音检测任务的训练损失函数的和,更新多任务神经网络的共享层的参数;可根据每次训练得出的回声消除任务的训练损失函数,更新回声消除任务层的参数;可根据每次训练得出的语音检测任务的训练损失函数,更新语音检测任务层的参数。
可选的,一般而言,回声消除任务的训练损失函数可以如:所估计的干净语音频谱能量与真实值的差异值;语音检测任务的训练损失函数可以如:所估计的语音存在概率与真实值的差异值;相应的,若统一设置每个语音信号处理任务相应的权重为1,则可确定目标训练损失函数为:回声消除任务的训练损失函数与语音检测任务的训练损失函数的和;从而在进行多任务神经网络的训练时,可以最小化回声消除任务的训练损失函数与语音检测任务的训练损失函数的和,为训练目标。其中,最小化回声消除任务的训练损失函数与语音检测任务的训练损失函数的和具体可以为:最小化所估计的干净语音频谱能量与真实值的差异值,及所估计的语音存在概率与真实值的差异值的相加结果。
可见，图5所示的语音信号处理模型训练方法，可基于包括共享层和每个语音信号处理任务对应的任务层的多任务神经网络，将样本语音的每个语音信号处理任务的任务输入特征作为训练输入，进行多任务神经网络的共享层和每个任务层的参数更新，训练得到语音信号处理模型。由于本发明实施例是基于具有共享层和每个语音信号处理任务对应的任务层的多任务神经网络，根据样本语音的每个语音信号处理任务的任务输入特征，同时的进行多任务神经网络的共享层和每个任务层的参数更新训练，而不是相对于每一语音信号处理任务均单独进行神经网络的训练，因此，极大的降低了训练语音信号处理模型所涉及的计算复杂度，提升了语音信号处理模型的训练效率。
上述同时基于所有的语音信号处理任务训练多任务神经网络，来更新多任务神经网络的共享层和每个任务层的参数的方式，相比于传统的分别针对每个语音信号处理任务，单独训练神经网络的方式能够降低计算复杂度。进一步的，本发明实施例还提供了一种分阶段进行多任务神经网络训练的方案，该方案是基于语音信号处理过程中每个语音信号处理任务的任务特性所得到的方案，能够应对语音信号处理过程中各语音信号处理任务间差异较大的问题；同时，该方案可以采用部分语音信号处理任务训练多任务神经网络，能够保障多任务神经网络的参数收敛性。
可选的,图7示出了本发明实施例提供的语音信号处理模型训练方法的另一种可选流程,该方法可应用于具有数据处理能力的电子设备,参照图7,该方法流程可以包括:
步骤S200、电子设备获取样本语音。
步骤S210、电子设备从语音信号处理过程的多个语音信号处理任务中,确定至少一个第一类语音信号处理任务。
可选的,作为一种实现,第一类语音信号处理任务可以是,语音信号处理过程涉及的多个语音信号处理任务中的基本任务;可以理解的是,基本任务可以认为是语音信号处理过程的多个语音信号处理任务中,相对于其他的语音信号处理任务具有辅助效果的任务;
作为一种可选示例，以多个语音信号处理任务包括：回声消除任务和语音检测任务为例；由于回声消除任务能够实现单通道语音谱的估计，能极大提升语音存在概率估计的准确度，因此回声消除任务可以认为是基本语音信号处理任务。
可选的,作为另一种可选实现,第一类语音信号处理任务可以认为是,语音信号处理过程涉及的多个语音信号处理任务中训练复杂度较高的任务;
其中,第一类语音信号处理任务的确定过程可以为:当语音信号处理任务的训练复杂度高于设定的复杂度阈值时,则确定该语音信号处理任务为第一类语音信号处理任务;否则,该语音信号处理任务不是第一类语音信号处理任务。
作为一种可选示例，以多个语音信号处理任务包括：回声消除任务和语音检测任务为例；回声消除任务所进行的单通道语音谱的估计，需要得到所有M个频带的干净语音能量值（M一般为大于1的正整数，例如M的值可以为512），而语音检测任务所进行的语音存在概率估计，只需得到当前帧是否包含语音的单值估计；由于M远大于1，从训练复杂度的角度看，回声消除任务的训练复杂度远高于语音检测任务，因此回声消除任务可视为训练复杂度较高的第一类语音信号处理任务。
在本发明实施例中,第一类语音信号处理任务的数量可以为一个或多个。
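按训练复杂度阈值确定第一类语音信号处理任务的过程可示意如下（示例性假设：以任务输出的维度近似其训练复杂度，select_first_class_tasks为示例函数名）：

```python
def select_first_class_tasks(task_complexity, threshold):
    """将训练复杂度高于设定的复杂度阈值的任务，确定为第一类语音信号处理任务。

    task_complexity: 任务名到训练复杂度的映射（此处以输出维度近似），
    如回声消除任务需估计M=512个频带的干净语音能量，语音检测任务只需单值估计。
    """
    return [name for name, c in task_complexity.items() if c > threshold]
```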
步骤S220、电子设备确定样本语音的第一类语音信号处理任务的任务输入特征,及样本语音的每个语音信号处理任务的任务输入特征。
在确定第一类语音信号处理任务后，针对样本语音，本发明实施例可确定样本语音的第一类语音信号处理任务的任务输入特征。其中，以回声消除任务作为第一类语音信号处理任务为例，确定第一类语音信号处理任务的任务输入特征可以为：确定样本语音的回声消除任务的任务输入特征；同时，对于语音信号处理过程涉及的每个语音信号处理任务，确定出样本语音的每个语音信号处理任务的任务输入特征，例如：确定样本语音的回声消除任务的任务输入特征，和语音检测任务的任务输入特征等。
步骤S230、电子设备根据第一类语音信号处理任务的训练损失函数,确定第一目标训练损失函数;及根据每个语音信号处理任务的训练损失函数,确定目标训练损失函数。
可选的,对于第一类语音信号处理任务,本发明实施例可确定第一类语音信号处理任务的训练损失函数,在第一类语音信号处理任务的数量为至少一个的情况下,对于任一第一类语音信号处理任务,可将该第一类语音信号处理任务的训练损失函数,乘以该第一类语音信号处理任务相应的权重,得到该第一类语音信号处理任务的相乘结果,以确定出每个第一类语音信号处理任务的相乘结果后,进而将每个第一类语音信号处理任务的相乘结果相加,可得到第一目标训练损失函数。
可选的，设第i个第一类语音信号处理任务的训练损失函数为L1_i，a1_i为第i个第一类语音信号处理任务相应的权重，则第一目标训练损失函数L1_all的确定可根据如下公式实现：

L1_all = ∑_i a1_i · L1_i（对每个第一类语音信号处理任务求和）
可选的,本发明实施例可以参照上文步骤S110部分所示,对每个语音信号处理任务进行处理,得到目标训练损失函数。
步骤S240、电子设备将样本语音的第一类语音信号处理任务的任务输入特征,作为多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至多任务神经网络收敛,得到第一多任务神经网络。
可选的,本发明实施例可先基于第一类语音信号处理任务的任务输入特征,以最小化第一目标训练损失函数为训练目标,对初始的多任务神经网络进行训练。
在具体训练时,可对多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新;具体参数更新过程可以是:电子设备将样本语音的第一类语音信号处理任务的任务输入特征,作为多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,通过多次迭代的方式,更新多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数,直至达到最大迭代次数,或者第一目标训练损失函数不再减小,从而得到第一多任务神经网络。
可选的，本发明实施例可根据每次训练得出的第一目标训练损失函数，更新共享层的参数；而对于每一第一类语音信号处理任务对应的任务层，可根据每次训练得出的该第一类语音信号处理任务的训练损失函数，更新该第一类语音信号处理任务对应的任务层的参数。
步骤S250、电子设备将样本语音的每个语音信号处理任务的任务输入特征,作为第一多任务神经网络的训练输入,以最小化目标训练损失函数为训练目标,对多任务神经网络的共享层和每个语音信号处理任务对应的任务层的参数进行更新,直至第一多任务神经网络收敛,得到语音信号处理模型。
可选的,本发明实施例基于第一类语音信号处理任务的任务输入特征,以最小化第一目标训练损失函数为训练目标,对多任务神经网络进行训练,得到第一多任务神经网络后,可再基于每个语音信号处理任务的任务输入特征,以最小化目标训练损失函数为训练目标,对第一多任务神经网络进行训练,得到语音信号处理模型。
在具体训练时,可对第一多任务神经网络的共享层和每个语音信号处理任务对应的任务层的参数进行更新;具体参数更新过程可以是:将样本语音的每个语音信号处理任务的任务输入特征,作为第一多任务神经网络的训练输入,以最小化目标训练损失函数为训练目标,通过迭代的进行第一多任务神经网络的共享层和每个语音信号处理任务对应的任务层的参数更新,直至达到最大迭代次数,或者目标训练损失函数不再减小,从而得到语音信号处理模型;
可选的,本发明实施例可根据每次训练得出的目标训练损失函数,更新共享层的参数;而对于每一语音信号处理任务对应的任务层,可根据每次训练得出的该语音信号处理任务的训练损失函数,更新该语音信号处理任务对应的任务层的参数。
为便于理解步骤S240和步骤S250所示的先后训练过程,作为一种可选示例,若统一设置每个语音信号处理任务相应的权重为1,且以多个语音信号处理任务包括:回声消除任务和语音检测任务为例;则本发明实施例进行语音信号处理模型的训练过程可以如图8所示,过程如下:
先将样本语音的回声消除任务的输入特征，作为多任务神经网络的训练输入，以最小化回声消除任务的训练损失函数为训练目标，对多任务神经网络的共享层和回声消除任务对应的任务层的参数进行更新，直至多任务神经网络的迭代次数达到最大次数，或者，回声消除任务的训练损失函数不再减小，得到第一多任务神经网络。其中，该回声消除任务的输入特征可以为：样本语音的带噪单通道语音的频谱能量及标注为干净语音的频谱能量；该训练目标可以为：最小化所估计的干净语音频谱能量与真实值的差异值。
进而，将样本语音的回声消除任务和语音检测任务的输入特征，作为第一多任务神经网络的训练输入；以最小化回声消除任务的训练损失函数与语音检测任务的训练损失函数的和，为训练目标；对第一多任务神经网络的共享层，回声消除任务层和语音检测任务层的参数进行更新，直至第一多任务神经网络的迭代次数达到最大次数，或者，回声消除任务的训练损失函数与语音检测任务的训练损失函数的和不再减小，得到语音信号处理模型。
可见,基于图7所示语音信号处理模型训练方法,本发明实施例可从多个语音信号处理任务中确定出基本任务,或者训练复杂度较高的任务,得到至少一个第一类语音信号处理任务;进而先以第一类语音信号处理任务的任务输入特征,作为多任务神经网络的训练输入,进行多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数更新训练,得到第一多任务神经网络;然后再以每个语音信号处理任务的任务输入特征,作为第一多任务神经网络的训练输入,进行第一多任务神经网络的共享层和每个任务层的参数更新训练,训练得到语音信号处理模型。
这个过程中,由于没有对每一语音信号处理任务均单独进行神经网络的训练,因此,降低了训练语音信号处理模型所涉及的计算复杂度;同时,先以第一类语音信号处理任务的输入特征进行多任务神经网络的训练,再以每个语音信号处理任务的任务输入特征,作为训练输入进行多任务神经网络的训练,可使得训练过程可体现语音信号处理任务之间的关联性,保障多任务神经网络的参数能够有效收敛,保障了训练得出的语音信号处理模型的可靠性能。
图7所示方法进行语音信号处理模型训练的过程中,是先根据第一类语音信号处理任务的任务输入特征,进行多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,训练得到第一多任务神经网络;在训练得到第一多任务神经网络的过程中,由于第一类语音信号处理任务是语音信号处理过程中的基本任务或者训练复杂度较高的任务,因此,第一类语音信号处理任务对应的任务层的参数的可靠收敛,对于后续训练得出的语音信号处理模型的性能尤为关键。
可选的，本发明实施例中，还可以根据第一类语音信号处理任务的不同输入特征，分多个阶段地进行第一类语音信号处理任务对应的任务层的参数的收敛训练，以进一步保障第一类语音信号处理任务对应的任务层的参数的有效收敛。可选的，图9示出了本发明实施例提供的语音信号处理模型训练方法的再一种可选流程，需要说明的是，图9所示流程仅是可选的，在进行第一多任务神经网络的训练时，也可直接基于所有第一类语音信号处理任务的任务输入特征进行第一多任务神经网络的训练，而不需如图9所示分多个阶段的进行第一多任务神经网络的训练；
可选的,图9所示方法可应用于具有数据处理能力的电子设备,参照图9,该方法流程可以包括:
步骤S300、电子设备获取样本语音。
步骤S310、电子设备从语音信号处理过程的多个语音信号处理任务中,确定至少一个第一类语音信号处理任务。
可选的,步骤S310的实现过程,为与步骤S210的过程同理的过程,步骤S310的描述可参照步骤S210部分描述,此处不再一一赘述。
步骤S320、电子设备确定样本语音的第一类语音信号处理任务的任务输入特征，及样本语音的每个语音信号处理任务的任务输入特征；该第一类语音信号处理任务的任务输入特征包括：多份任务输入特征；一份任务输入特征所包含的特征数量为至少一个。
可选的,在本发明实施例中,对于一第一类语音信号处理任务而言,任一第一类语音信号处理任务的任务输入特征可以为多份,任一份任务输入特征所包含的特征数量可以为至少一个。
作为一种可选示例,以第一类语音信号处理任务包括回声消除任务为例,则对于回声消除任务,本发明实施例可设置多份任务输入特征,如设置回声消除任务的第一份任务输入特征为:带噪单通道语音的频谱能量,及标注为干净语音的频谱能量;设置回声消除任务的第二份任务输入特征为:多通道语音的频谱能量;设置回声消除任务的第三份任务输入特征为:多通道语音的频谱能量,及参考信号的频谱能量(如智能音箱播放的音乐)等。
步骤S330、电子设备根据第一类语音信号处理任务的训练损失函数,确定第一目标训练损失函数;及根据每个语音信号处理任务的训练损失函数,确定目标训练损失函数。
可选的,步骤S330的实现过程,为与步骤S230的过程同理的过程,步骤S330的介绍可参照步骤S230部分描述,此处不再一一赘述。
步骤S340、电子设备根据当前训练阶段,从样本语音的第一类语音信号处理任务的多份任务输入特征中,选取当前训练阶段相应的当前份任务输入特征;将该当前份任务输入特征,作为上一训练阶段训练完成的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对上一训练阶段训练完成的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至根据最后一份任务输入特征,训练的多任务神经网络达到收敛,得到第一多任务神经网络。
可选的，步骤S340可以认为是，电子设备根据样本语音的第一类语音信号处理任务的多份任务输入特征，分多个训练阶段递进的对多任务神经网络进行训练，得到第一多任务神经网络的一种可选实现，其中，一个训练阶段使用一份任务输入特征作为训练输入，且以最小化第一目标训练损失函数为训练目标；其中，分多个训练阶段递进的对多任务神经网络进行训练的过程可以为：分多个训练阶段递进的对多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新。另外，除通过步骤S340实现外，本发明实施例并不排除利用样本语音的第一类语音信号处理任务的多份任务输入特征，分多个训练阶段递进的对多任务神经网络进行训练的其他方式。
可选的,在步骤S340中,本发明实施例可分多个训练阶段的进行第一多任务神经网络的训练,从而依训练阶段的将第一类语音信号处理任务的每份任务输入特征,分别作为训练输入,对多任务神经网络进行训练,以得到第一多任务神经网络;且,在当前训练阶段中,第一类语音信号处理任务当前选取的当前份任务输入特征,作为上一训练阶段训练完成的多任务神经网络的训练输入。
可选的,作为示例,以第一类语音信号处理任务的任务输入特征包括三份,分别为第一份任务输入特征,第二份任务输入特征,第三份任务输入特征为例;则本发明实施例可先以第一份任务输入特征作为待训练的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至根据第一份任务输入特征,训练的多任务神经网络达到收敛,得到第一训练阶段训练完成的多任务神经网络;其中,以第一份任务输入特征作为待训练的多任务神经网络的训练输入的过程可以为:对于第一训练阶段,所选取的当前训练阶段的任务输入特征为第一份任务输入特征。
然后,以第二份任务输入特征作为第一训练阶段训练完成的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对第一训练阶段训练完成的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至根据第二份任务输入特征,训练的多任务神经网络达到收敛,得到第二训练阶段训练完成的多任务神经网络;其中,以第二份任务输入特征作为第一训练阶段训练完成的多任务神经网络的训练输入的过程可以为:对于第二训练阶段,所选取的当前训练阶段的任务输入特征为第二份任务输入特征。
再以第三份任务输入特征作为第二训练阶段训练完成的多任务神经网络的训练输入，以最小化第一目标训练损失函数为训练目标，对第二训练阶段训练完成的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新，直至根据第三份任务输入特征，训练的多任务神经网络达到收敛，得到第一多任务神经网络，完成基于第一类语音信号处理任务的多份任务输入特征，分多个训练阶段的训练得到第一多任务神经网络的过程。其中，以第三份任务输入特征作为第二训练阶段训练完成的多任务神经网络的训练输入的过程可以为：对于第三训练阶段，所选取的当前训练阶段的任务输入特征为第三份任务输入特征。
为便于理解,以第一类语音信号处理任务为回声消除任务为例,则回声消除任务的第一份任务输入特征为:带噪单通道语音的频谱能量,及标注为干净语音的频谱能量;回声消除任务的第二份任务输入特征为:多通道语音的频谱能量;回声消除任务的第三份任务输入特征为:多通道语音的频谱能量,及参考信号的频谱能量等;其中,该参考信号的频谱能量可以为:智能音箱播放的音乐。
相应的,本发明实施例可先以样本语音的带噪单通道语音的频谱能量,及标注为干净语音的频谱能量作为多任务神经网络的训练输入,以最小化所估计的干净语音频谱能量与真实值的差异值为训练目标,更新多任务神经网络的共享层和回声消除任务的任务层的参数,直至迭代次数达到最大次数或者训练目标不再减小。
然后,以样本语音的多通道语音的频谱能量作为上段训练完成的多任务神经网络的训练输入,以最小化所估计的干净语音频谱能量与真实值的差异值为训练目标,更新多任务神经网络的共享层和回声消除任务的任务层的参数,直至迭代次数达到最大次数或者训练目标不再减小,使得训练后的多任务神经网络具备多通道的空间滤波的能力。
在完成多通道训练之后,还可以样本语音的多通道语音的频谱能量,及参考信号的频谱能量作为上段训练完成的多任务神经网络的训练输入,以最小化所估计的干净语音频谱能量与真实值的差异值为训练目标,更新多任务神经网络的共享层和回声消除任务的任务层的参数,直至迭代次数达到最大次数或者训练目标不再减小,得到第一多任务神经网络,使得第一多任务神经网络能够较好地拟合多通道输入信号和参考信号。
可选的,上述的第一类语音信号处理任务的多份任务输入特征的示例仅是可选的,本发明实施例可根据具体情况,设置第一类语音信号处理任务的任务输入特征的份数,以及每份任务输入特征所包含的具体特征;如在上述的示例中,带噪单通道语音的频谱能量、标注为干净语音的频谱能量、和多通道语音的频谱能量的任务输入特征也可合并在一起训练。
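上述分多个训练阶段递进训练的控制流程可抽象为如下示意代码（staged_training、train_stage均为示例假设，train_stage表示将网络在一份任务输入特征上训练至收敛并返回该阶段训练完成的网络）：

```python
def staged_training(net, feature_sets, train_stage):
    """依训练阶段逐份使用任务输入特征：
    当前份任务输入特征作为上一训练阶段训练完成的网络的训练输入，
    直至根据最后一份任务输入特征训练的网络达到收敛。"""
    for features in feature_sets:
        net = train_stage(net, features)
    return net
```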
步骤S350、电子设备将样本语音的每个语音信号处理任务的任务输入特征,作为第一多任务神经网络的训练输入,以最小化目标训练损失函数为训练目标,对第一多任务神经网络的共享层和每个语音信号处理任务对应的任务层的参数进行更新,直至第一多任务神经网络收敛,得到语音信号处理模型。
可选的,步骤S350的实现过程,为与步骤S250的过程同理的过程,步骤S350的介绍可参照步骤S250部分描述,此处不再一一赘述。
可选的,在得到第一多任务神经网络后,对于语音检测、方向检测和混响消除等语音信号处理任务而言,这些任务较为简单且互相之间相对独立,可以合并在一起训练,因此可在得到第一多任务神经网络后,结合样本语音的每个语音信号处理任务的任务输入特征,进行第一多任务神经网络的训练,得到语音信号处理模型。
需要说明的是,上述无论采用何种训练方式进行,在进行共享层和某一任务层的参数更新时,共享层的参数更新,基于当前训练所使用的所有任务的训练损失函数之和进行;而一任务层的参数更新,基于该任务层对应的任务的训练损失函数进行,从而可使得训练的语音信号处理模型既可通过共享层体现具有共性的语音信号处理任务之间的关联性,又可通过每个任务层体现相应的语音信号处理任务的任务特性。
对上述说明的语音信号处理模型的各种训练过程进行归纳、总结,则本发明实施例提供的语音信号处理模型训练方法的基本核心流程可以如图10所示,图10为本发明实施例提供的语音信号处理模型训练方法的又一种可选流程,参照图10,该方法流程可以包括:
步骤S400、电子设备获取样本语音,确定样本语音的每个语音信号处理任务的任务输入特征。
可选的,步骤S400的介绍可参照步骤S100部分描述,步骤S400的实现过程,为与步骤S100的过程同理的过程,此处不再一一赘述。
步骤S410、电子设备根据每个语音信号处理任务的训练损失函数,确定目标训练损失函数。
可选的,步骤S410的介绍可参照步骤S110部分描述,步骤S410的实现过程,为与步骤S110的过程同理的过程,此处不再一一赘述。
步骤S420、电子设备将样本语音的每个语音信号处理任务的任务输入特征,作为待训练的多任务神经网络的训练输入,以最小化目标训练损失函数为训练目标,对待训练的多任务神经网络的共享层和每个任务层的参数进行更新,直至待训练的多任务神经网络达到收敛,得到语音信号处理模型。
可选的,作为一种可选实现,在步骤S420中,待训练的多任务神经网络可以是初始的多任务神经网络(相应的过程可归结到由图5所示流程实现);
可选的，作为另一种可选实现，在步骤S420中，待训练的多任务神经网络也可以是第一多任务神经网络，本发明实施例可利用图7所示方法流程中训练得到第一多任务神经网络的流程，先训练得到第一多任务神经网络，将第一多任务神经网络作为待训练的多任务神经网络；然后以图10所示方法，将样本语音的每个语音信号处理任务的任务输入特征，作为第一多任务神经网络的训练输入，以最小化目标训练损失函数为训练目标，对第一多任务神经网络的共享层和每个任务层的参数进行更新，直至第一多任务神经网络收敛，得到语音信号处理模型。
可选的,第一多任务神经网络的训练,可基于样本语音的第一类语音信号处理任务的任务输入特征实现;进一步,作为一种可选示例,第一类语音信号处理任务可以具有多份任务输入特征,本发明实施例可基于图9所示的第一多任务神经网络的训练流程,分多个训练阶段,训练得到第一多任务神经网络。
需要说明的是,上述的待训练的多任务神经网络无论是初始的多任务神经网络,还是第一多任务神经网络,待训练的多任务神经网络的结构必然是包括了共享层,和每个语音信号处理任务对应的任务层;而对于该共享层,是以最小化目标训练损失函数为训练目标,根据目标训练损失函数,对共享层的参数进行更新;对于任一语音信号处理任务对应的任务层,是以最小化目标训练损失函数为训练目标,根据该语音信号处理任务的训练损失函数,对该语音信号处理任务的任务层的参数进行更新。
本发明实施例提供的语音信号处理模型训练方法,可基于包括共享层和每个语音信号处理任务对应的任务层的多任务神经网络,训练得到语音信号处理模型,而不是相对于每一语音信号处理任务均单独进行神经网络的训练,有效的降低了训练语音信号处理模型的计算复杂度,提升了训练效率。
进一步,可在语音信号处理模型的训练过程中,通过先基于样本语音的第一类语音信号处理任务的任务输入特征进行训练,然后基于每个语音信号处理任务的任务输入特征进行训练,可挖掘出语音信号处理过程中多任务之间的关联性,提升语音信号处理性能,保障训练得到的语音信号处理模型的性能可靠。
在以上述方法流程训练得到语音信号处理模型后，可选的，本发明实施例可使用语音信号处理模型替代终端传统的语音信号处理过程，如具体可用语音信号处理模型的每个任务层的输出结果，替代终端传统的语音信号处理过程中，相应语音信号处理任务的任务处理结果。
而在另一种实现上,本发明实施例可使用语音信号处理模型,辅助终端传统的语音信号处理过程,如具体可用语音信号处理模型的每个任务层的输出,辅助终端传统的相应的语音信号处理任务进行任务处理。
图11示出了语音信号处理模型的应用场景示例图，如图11所示，在训练得到语音信号处理模型后，本发明实施例可使用语音信号处理模型对输入即时通讯客户端的待识别语音，进行前端的语音信号处理，然后输送到即时通讯应用的语音后台服务器进行语音识别；可选的，即时通讯客户端可将语音信号处理模型的每个任务层对待识别语音的输出，分别作为相应的语音信号处理任务的辅助处理信号，从而辅助每个语音信号处理任务的处理，提高了每个语音信号处理任务的结果输出的准确性。
参照图11,作为一种可选应用场景,在即时通讯客户端装载本发明实施例训练好的语音信号处理模型的基础上,具体应用过程可以包括:
S1、即时通讯客户端获取输入的待识别语音。
S2、即时通讯客户端根据训练完成的语音信号处理模型，确定语音信号处理模型的每个任务层对待识别语音的输出结果。
其中,语音信号处理模型以最小化目标训练损失函数为训练目标,训练多任务神经网络得到;其中,目标训练损失函数根据每个语音信号处理任务的训练损失函数确定;多任务神经网络包括共享层,和每个语音信号处理任务对应的任务层。
S3、即时通讯客户端将每个任务层对待识别语音的输出结果,作为每个任务层对应的语音信号处理任务的任务处理结果,或,使用每个任务层对待识别语音的输出结果,辅助相应的语音信号处理任务进行任务处理,以得到前端的语音信号处理结果。
S4、即时通讯客户端将前端的语音信号处理结果,发送给语音后台服务器,以便语音后台服务器根据语音信号处理结果,对待识别语音进行语音识别。
可选的,图12示出了语音信号处理模型的输出结果的一种使用示例,参照图12,针对终端传统的回声消除任务,语音信号处理模型的回声消除任务层可输出待识别语音的语音谱估计结果,从而将该语音谱估计结果作为终端传统的回声消除任务的辅助处理信号,使得回声消除任务在处理时能够更好的区分参考信号和语音信号的比例,提升回声消除任务的输出结果的准确性;显然,在另一种实现上,本发明实施例也可直接将语音信号处理模型的回声消除任务层对待识别语音的输出结果,作为回声消除任务的输出结果。
针对语音检测任务,语音信号处理模型的语音检测任务层可输出待识别语音的输出结果,将该输出结果作为终端传统的语音检测任务的辅助处理信号,使得语音检测任务的输出结果的准确性得以提升;其中,可以将语音检测任务层的输出结果,与终端传统的语音检测任务的输出结果的加权平均值作为最后的语音检测任务的输出结果。显然,在另一种实现上,本发明实施例也可直接将语音信号处理模型的语音检测任务层对待识别语音的输出结果,作为语音检测任务的输出结果。
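上段所述对语音检测结果做加权平均的融合方式可示意如下（fuse_vad为示例函数名，权重取值为假设，实际可按应用场景调节）：

```python
def fuse_vad(model_prob, legacy_prob, w_model=0.5, w_legacy=0.5):
    """将语音检测任务层输出的语音存在概率 model_prob，
    与终端传统语音检测任务输出的概率 legacy_prob 做加权平均，
    作为最终的语音检测任务的输出结果。"""
    return (w_model * model_prob + w_legacy * legacy_prob) / (w_model + w_legacy)
```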
针对语音方向检测任务，语音信号处理模型的语音方向检测任务层可输出待识别语音的输出结果，从而用该输出结果辅助终端传统的语音方向检测任务，进行待识别语音的语音和噪声的估计，从而得到更为准确的语音方向估计结果；其中，语音方向检测任务层输出的待识别语音的输出结果可以为：待识别语音的语音/噪声谱估计结果。显然，在另一种实现上，本发明实施例也可直接将语音方向检测任务层输出的待识别语音的语音/噪声谱估计结果，作为语音方向检测任务的输出结果。
针对麦克风阵列增强任务,语音信号处理模型的麦克风阵列增强任务层可输出待识别语音的语音/噪声谱,以此辅助终端传统的麦克风阵列增强任务,从而更准确的估计出阵列算法的目标方向,以及阵列算法中所需要的噪声协方差矩阵等参数;显然,在另一种实现上,本发明实施例也可直接将麦克风阵列增强任务层的输出结果,作为麦克风阵列增强任务的输出结果。
针对单通道降噪任务,语音信号处理模型的单通道降噪任务层可输出待识别语音的语音/噪声谱,从而辅助终端传统的单通道降噪任务,实现单通道降噪任务中所需要的信噪比等关键参数的获取,提升单通道降噪任务的处理效果;显然,在另一种实现上,本发明实施例也可直接将单通道降噪任务层的输出结果,作为单通道降噪任务的输出结果。
针对混响消除任务,语音信号处理模型的混响消除任务层可输出房间混响估计,从而辅助终端传统的混响消除任务进行算法的参数调节,来控制混响消除的程度;显然,在另一种实现上,本发明实施例也可直接将混响消除任务层的输出结果,作为混响消除任务的输出结果。
可选的,上述描述的语音信号处理模型在待识别语音的语音信号处理过程中的应用仅是示例,可以理解为是在智能音箱场景下的语音信号处理过程的应用;显然,在不同的应用场景下,语音信号处理模型的应用方式可以根据实际情况适配调整,但不脱离使用语音信号处理模型替代终端传统的语音信号处理过程,或,使用语音信号处理模型,辅助终端传统的语音信号处理过程的思路。
下面对本发明实施例提供的语音信号处理模型训练装置进行介绍,下文描述的语音信号处理模型训练装置可以认为是,电子设备为实现本发明实施例提供的语音信号处理模型训练方法,所需设置的程序模块;下文描述的语音信号处理模型训练装置可与上文描述的语音信号处理模型训练方法相互对应参照。
图13为本发明实施例提供的语音信号处理模型训练装置的结构框图,该装置可应用于具有数据处理能力的电子设备,参照图13,该装置可以包括:
任务输入特征确定模块100,用于获取样本语音,确定样本语音的每个语音信号处理任务的任务输入特征;
目标损失函数确定模块200,用于根据每个语音信号处理任务的训练损失函数,确定目标训练损失函数;
模型训练模块300,用于将样本语音的每个语音信号处理任务的任务输入特征,作为待训练的多任务神经网络的训练输入,以最小化目标训练损失函数为训练目标,对待训练的多任务神经网络的共享层和每个任务层的参数进行更新,直至待训练的多任务神经网络收敛,得到语音信号处理模型;
其中,该待训练的多任务神经网络包括:共享层,和每个语音信号处理任务对应的任务层。
可选的,模型训练模块300,用于以最小化目标训练损失函数为训练目标,对待训练的多任务神经网络的共享层和每个任务层的参数进行更新,具体包括:
对于该共享层,以最小化目标训练损失函数为训练目标,根据目标训练损失函数,对共享层的参数进行更新;及对于任一语音信号处理任务对应的任务层,以最小化目标训练损失函数为训练目标,根据该语音信号处理任务的训练损失函数,对该语音信号处理任务的任务层的参数进行更新。
可选的,该待训练的多任务神经网络可以包括:第一多任务神经网络;相应的,图14示出了本发明实施例提供的语音信号处理模型训练装置的另一结构框图,结合图13和图14所示,该装置还可以包括:
第一网络训练模块400,用于从语音信号处理过程的多个语音信号处理任务中,确定至少一个第一类语音信号处理任务;确定样本语音的第一类语音信号处理任务的任务输入特征;根据第一类语音信号处理任务的训练损失函数,确定第一目标训练损失函数;将样本语音的第一类语音信号处理任务的任务输入特征,作为初始的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对初始的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至初始的多任务神经网络收敛,得到第一多任务神经网络。
可选的,第一网络训练模块400,用于从语音信号处理过程的多个语音信号处理任务中,确定至少一个第一类语音信号处理任务,具体包括:
确定该多个语音信号处理任务中的基本任务,将该基本任务确定为第一类语音信号处理任务,该基本任务为该多个语音信号处理任务中,相对于其他的语音信号处理任务具有辅助效果的任务。
可选的,第一网络训练模块400,还用于从语音信号处理过程的多个语音信号处理任务中,确定至少一个第一类语音信号处理任务,具体包括:
将该多个语音信号处理任务中,训练复杂度高于设定的复杂度阈值的语音信号处理任务,确定为第一类语音信号处理任务。
可选的,该第一网络训练模块400,用于根据第一类语音信号处理任务的训练损失函数,确定第一目标训练损失函数,具体包括:
对于任一第一类语音信号处理任务,将该第一类语音信号处理任务的训练损失函数,乘以该第一类语音信号处理任务相应的权重,得到该第一类语音信号处理任务的相乘结果,以确定出每个第一类语音信号处理任务的相乘结果;
将每个第一类语音信号处理任务的相乘结果相加,得到第一目标训练损失函数。
可选的,该样本语音的第一类语音信号处理任务的任务输入特征包括:多份任务输入特征;一份任务输入特征所包含的特征数量为至少一个;
相应的,第一网络训练模块400,用于将样本语音的第一类语音信号处理任务的任务输入特征,作为初始的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对初始的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至初始的多任务神经网络收敛,得到第一多任务神经网络,具体包括:
根据样本语音的第一类语音信号处理任务的多份任务输入特征,分多个训练阶段递进的对初始的多任务神经网络进行训练,得到第一多任务神经网络;其中,一个训练阶段使用样本语音的第一类语音信号处理任务的一份任务输入特征作为训练输入,且以最小化第一目标训练损失函数为训练目标。
可选的,第一网络训练模块400,用于根据样本语音的第一类语音信号处理任务的多份任务输入特征,分多个训练阶段递进的对初始的多任务神经网络进行训练,得到第一多任务神经网络,具体包括:
根据当前训练阶段,从该样本语音的第一类语音信号处理任务的多份任务输入特征中,选取该当前训练阶段相应的当前份任务输入特征;将该当前份任务输入特征,作为上一训练阶段训练完成的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对该上一训练阶段训练完成的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至根据最后一份任务输入特征,训练的多任务神经网络达到收敛,得到第一多任务神经网络。
可选的,在另一种实现上,该待训练的多任务神经网络可以包括:初始的多任务神经网络;该目标损失函数确定模块200,用于根据每个语音信号处理任务的训练损失函数,确定目标训练损失函数,具体包括:
对于任一语音信号处理任务,将该语音信号处理任务的训练损失函数,乘以该语音信号处理任务相应的权重,得到该语音信号处理任务相应的相乘结果,以确定出每个语音信号处理任务相应的相乘结果;
将每个语音信号处理任务相应的相乘结果相加,得到目标训练损失函数。
可选的,多任务神经网络中的共享层可以包括LSTM网络,每个语音信号处理任务对应的任务层可以包括:每个语音信号处理任务对应的MLP全连接网络;
可选的,模型训练模块300,用于对待训练的多任务神经网络的共享层和每个任务层的参数进行更新,可以具体包括:
对待训练的多任务神经网络的LSTM网络的输入层到隐含层的连接参数、隐含层到输出层的连接参数或隐含层到隐含层之间的连接参数进行更新;及,对每个语音信号处理任务对应的MLP全连接网络的输入层到隐含层的连接参数或隐含层到输出层的连接参数进行更新。
可选的,在训练得到语音信号处理模型后,可在语音前端的语音信号处理过程中进行应用;可选的,图15示出了本发明实施例提供的语音信号处理模型训练装置的再一结构框图,结合图14和图15所示,该装置还可以包括:
模型应用模块500,用于确定语音信号处理模型的每个任务层对待识别语音的输出结果;将每个任务层对待识别语音的输出结果,作为该每个任务层对应的语音信号处理任务的任务处理结果。
可选的,模型应用模块500,还用于确定该语音信号处理模型的每个任务层对待识别语音的输出结果;使用每个任务层对待识别语音的输出结果,辅助该每个任务层对应的语音信号处理任务进行任务处理。
可选的,模型应用模块500也可在图13所示装置中进行使用。
本发明实施例提供的语音信号处理模型训练装置可应用于电子设备中,可选的,该电子设备的硬件结构可以如图16所示,包括:至少一个处理器1,至少一个通信接口2,至少一个存储器3和至少一个通信总线4;
在本发明实施例中，处理器1、通信接口2、存储器3、通信总线4的数量为至少一个，且处理器1、通信接口2、存储器3通过通信总线4完成相互间的通信；可选的，处理器1可能是一个中央处理器CPU，或者是专用集成电路ASIC(Application Specific Integrated Circuit)，或者是被配置成实施本发明实施例的一个或多个集成电路。存储器3可能包含高速RAM存储器，也可能还包括非易失性存储器(non-volatile memory)，例如至少一个磁盘存储器；
其中,该存储器存储有程序,该处理器调用该程序,该程序用于:
获取样本语音,确定样本语音的每个语音信号处理任务的任务输入特征;根据该每个语音信号处理任务的训练损失函数,确定目标训练损失函数;将样本语音的每个语音信号处理任务的任务输入特征,作为待训练的多任务神经网络的训练输入,以最小化目标训练损失函数为训练目标,对待训练的多任务神经网络的共享层和每个任务层的参数进行更新,直至待训练的多任务神经网络收敛,得到语音信号处理模型;
其中,该待训练的多任务神经网络包括:共享层,和每个语音信号处理任务对应的任务层。
可选的,该程序还用于:
对于该共享层,以最小化目标训练损失函数为训练目标,根据该目标训练损失函数,对该共享层的参数进行更新;
对于任一语音信号处理任务对应的任务层,以最小化目标训练损失函数为训练目标,根据该语音信号处理任务的训练损失函数,对该语音信号处理任务的任务层的参数进行更新。
可选的,该程序还用于:
从语音信号处理过程的多个语音信号处理任务中,确定至少一个第一类语音信号处理任务;
确定该样本语音的第一类语音信号处理任务的任务输入特征;
根据该第一类语音信号处理任务的训练损失函数,确定第一目标训练损失函数;
将该样本语音的第一类语音信号处理任务的任务输入特征,作为初始的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对该初始的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至该初始的多任务神经网络收敛,得到第一多任务神经网络。
可选的,该程序还用于:
对于任一第一类语音信号处理任务,将该第一类语音信号处理任务的训练损失函数,乘以该第一类语音信号处理任务相应的权重,得到该第一类语音信号处理任务的相乘结果,以确定出每个第一类语音信号处理任务的相乘结果;
将每个第一类语音信号处理任务的相乘结果相加,得到该第一目标训练损失函数。
可选的,该程序还用于:
确定该多个语音信号处理任务中的基本任务,将该基本任务确定为该第一类语音信号处理任务,该基本任务为该多个语音信号处理任务中,相对于其他的语音信号处理任务具有辅助效果的任务。
可选的,该程序还用于:
将该多个语音信号处理任务中,训练复杂度高于设定的复杂度阈值的语音信号处理任务,确定为该第一类语音信号处理任务。
可选的,该程序还用于:
根据该样本语音的第一类语音信号处理任务的多份任务输入特征,分多个训练阶段递进的对该初始的多任务神经网络进行训练,得到该第一多任务神经网络;
其中,一个训练阶段使用该样本语音的第一类语音信号处理任务的一份任务输入特征作为训练输入,且以最小化第一目标训练损失函数为训练目标。
可选的,该程序还用于:
根据当前训练阶段,从该样本语音的第一类语音信号处理任务的多份任务输入特征中,选取该当前训练阶段相应的当前份任务输入特征;
将该当前份任务输入特征,作为上一训练阶段训练完成的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对该上一训练阶段训练完成的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至根据最后一份任务输入特征,训练的多任务神经网络达到收敛,得到第一多任务神经网络。
可选的,该程序还用于:
对于任一语音信号处理任务,将该语音信号处理任务的训练损失函数,乘以该语音信号处理任务相应的权重,得到该语音信号处理任务相应的相乘结果,以确定出每个语音信号处理任务相应的相乘结果;
将每个语音信号处理任务相应的相乘结果相加,得到目标训练损失函数。
可选的,该程序还用于:
对该待训练的多任务神经网络的LSTM网络的输入层到隐含层的连接参数、隐含层到输出层的连接参数或隐含层到隐含层之间的连接参数进行更新;
对该每个语音信号处理任务对应的MLP全连接网络的输入层到隐含层的连接参数或隐含层到输出层的连接参数进行更新。
可选的,该程序还用于:
确定语音信号处理模型的每个任务层对待识别语音的输出结果;
将每个任务层对待识别语音的输出结果,作为该每个任务层对应的语音信号处理任务的任务处理结果。
可选的,该程序还用于:
确定该语音信号处理模型的每个任务层对待识别语音的输出结果;
使用该每个任务层对待识别语音的输出结果,辅助该每个任务层对应的语音信号处理任务进行任务处理。
其中,该程序的细化功能和扩展功能可参照上文相应部分。
进一步,本发明实施例还提供一种存储介质,该存储介质可选如存储器,所述存储介质存储有适用于处理器执行的程序,所述程序用于:
获取样本语音,确定样本语音的每个语音信号处理任务的任务输入特征;根据每个语音信号处理任务的训练损失函数,确定目标训练损失函数;将样本语音的每个语音信号处理任务的任务输入特征,作为待训练的多任务神经网络的训练输入,以最小化目标训练损失函数为训练目标,对待训练的多任务神经网络的共享层和每个任务层的参数进行更新,直至待训练的多任务神经网络收敛,得到语音信号处理模型;其中,所述待训练的多任务神经网络包括:共享层,和每个语音信号处理任务对应的任务层。
可选的,所述程序的细化功能和扩展功能可参照上文相应部分。
本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。
专业人员还可以进一步意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。
结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件模块,或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。
对所公开的实施例的上述说明，使本领域专业技术人员能够实现或使用本发明。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的，本文中所定义的一般原理可以在不脱离本发明的核心思想或范围的情况下，在其它实施例中实现。因此，本发明将不会被限制于本文所示的这些实施例，而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。
Claims (20)
- 一种语音信号处理模型训练方法,其特征在于,所述方法应用在电子设备上,包括:获取样本语音,确定所述样本语音的每个语音信号处理任务的任务输入特征;根据所述每个语音信号处理任务的训练损失函数,确定目标训练损失函数;将所述样本语音的每个语音信号处理任务的任务输入特征,作为待训练的多任务神经网络的训练输入,以最小化目标训练损失函数为训练目标,对所述待训练的多任务神经网络的共享层和每个任务层的参数进行更新,直至所述待训练的多任务神经网络收敛,得到语音信号处理模型;其中,所述待训练的多任务神经网络包括:共享层,和每个语音信号处理任务对应的任务层。
- 根据权利要求1所述的语音信号处理模型训练方法,其特征在于,所述以最小化目标训练损失函数为训练目标,对所述待训练的多任务神经网络的共享层和每个任务层的参数进行更新包括:对于所述共享层,以最小化目标训练损失函数为训练目标,根据所述目标训练损失函数,对所述共享层的参数进行更新;对于任一语音信号处理任务对应的任务层,以最小化目标训练损失函数为训练目标,根据所述语音信号处理任务的训练损失函数,对所述语音信号处理任务的任务层的参数进行更新。
- 根据权利要求1所述的语音信号处理模型训练方法,其特征在于,所述待训练的多任务神经网络包括:第一多任务神经网络;所述方法还包括:从语音信号处理过程的多个语音信号处理任务中,确定至少一个第一类语音信号处理任务;确定所述样本语音的第一类语音信号处理任务的任务输入特征;根据所述第一类语音信号处理任务的训练损失函数,确定第一目标训练损失函数;将所述样本语音的第一类语音信号处理任务的任务输入特征,作为初始的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对所述初始的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至所述初始的多任务神经网络收敛,得到第一多任务神经网络。
- 根据权利要求3所述的语音信号处理模型训练方法,其特征在于,所述从语音信号处理过程的多个语音信号处理任务中,确定至少一个第一类语音信号处理任务包括:确定所述多个语音信号处理任务中的基本任务,将所述基本任务确定为所述第一类语音信号处理任务,所述基本任务为所述多个语音信号处理任务中,相对于其他的语音信号处理任务具有辅助效果的任务。
- 根据权利要求3所述的语音信号处理模型训练方法,其特征在于,所述从语音信号处理过程的多个语音信号处理任务中,确定至少一个第一类语音信号处理任务包括:将所述多个语音信号处理任务中,训练复杂度高于设定的复杂度阈值的语音信号处理任务,确定为所述第一类语音信号处理任务。
- 根据权利要求1所述的语音信号处理模型训练方法,其特征在于,所述待训练的多任务神经网络包括:初始的多任务神经网络;所述根据每个语音信号处理任务的训练损失函数,确定目标训练损失函数包括:对于任一语音信号处理任务,将所述语音信号处理任务的训练损失函数,乘以该语音信号处理任务相应的权重,得到该语音信号处理任务相应的相乘结果,以确定出每个语音信号处理任务相应的相乘结果;将每个语音信号处理任务相应的相乘结果相加,得到目标训练损失函数。
- 根据权利要求1所述的语音信号处理模型训练方法,其特征在于,所述共享层包括长短期记忆LSTM网络,所述每个语音信号处理任务对应的任务层包括:每个语音信号处理任务对应的多层感知器MLP全连接网络;所述对所述待训练的多任务神经网络的共享层和每个任务层的参数进行更新包括:对所述待训练的多任务神经网络的LSTM网络的输入层到隐含层的连接参数、隐含层到输出层的连接参数或隐含层到隐含层之间的连接参数进行更新;对所述每个语音信号处理任务对应的MLP全连接网络的输入层到隐含层的连接参数或隐含层到输出层的连接参数进行更新。
- 一种电子设备,其特征在于,包括:至少一个存储器和至少一个处理器;所述存储器存储有程序,所述处理器调用所述存储器存储的程序,所述程序用于:获取样本语音,确定所述样本语音的每个语音信号处理任务的任务输入特征;根据所述每个语音信号处理任务的训练损失函数,确定目标训练损失函数;将所述样本语音的每个语音信号处理任务的任务输入特征,作为待训练的多任务神经网络的训练输入,以最小化目标训练损失函数为训练目标,对待训练的多任务神经网络的共享层和每个任务层的参数进行更新,直至待训练的多任务神经网络收敛,得到语音信号处理模型;其中,所述待训练的多任务神经网络包括:共享层,和每个语音信号处理任务对应的任务层。
- 根据权利要求8所述的电子设备,其特征在于,所述程序还用于:对于所述共享层,以最小化目标训练损失函数为训练目标,根据所述目标训练损失函数,对所述共享层的参数进行更新;对于任一语音信号处理任务对应的任务层,以最小化目标训练损失函数为训练目标,根据所述语音信号处理任务的训练损失函数,对所述语音信号处理任务的任务层的参数进行更新。
- 根据权利要求8所述的电子设备,其特征在于,所述程序还用于:从语音信号处理过程的多个语音信号处理任务中,确定至少一个第一类语音信号处理任务;确定所述样本语音的第一类语音信号处理任务的任务输入特征;根据所述第一类语音信号处理任务的训练损失函数,确定第一目标训练损失函数;将所述样本语音的第一类语音信号处理任务的任务输入特征,作为初始的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对所述初始的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至所述初始的多任务神经网络收敛,得到第一多任务神经网络。
- 根据权利要求10所述的电子设备,其特征在于,所述程序还用于:对于任一第一类语音信号处理任务,将所述第一类语音信号处理任务的训练损失函数,乘以所述第一类语音信号处理任务相应的权重,得到所述第一类语音信号处理任务的相乘结果,以确定出每个第一类语音信号处理任务的相乘结果;将每个第一类语音信号处理任务的相乘结果相加,得到所述第一目标训练损失函数。
- 根据权利要求10所述的电子设备,其特征在于,所述程序还用于:确定所述多个语音信号处理任务中的基本任务,将所述基本任务确定为所述第一类语音信号处理任务,所述基本任务为所述多个语音信号处理任务中,相对于其他的语音信号处理任务具有辅助效果的任务。
- 根据权利要求10所述的电子设备,其特征在于,所述程序还用于:将所述多个语音信号处理任务中,训练复杂度高于设定的复杂度阈值的语音信号处理任务,确定为所述第一类语音信号处理任务。
- 根据权利要求10所述的电子设备,其特征在于,所述程序还用于:根据所述样本语音的第一类语音信号处理任务的多份任务输入特征,分多个训练阶段递进的对所述初始的多任务神经网络进行训练,得到所述第一多任务神经网络;其中,一个训练阶段使用所述样本语音的第一类语音信号处理任务的一份任务输入特征作为训练输入,且以最小化第一目标训练损失函数为训练目标。
- 根据权利要求12所述的电子设备,其特征在于,所述程序还用于:根据当前训练阶段,从所述样本语音的第一类语音信号处理任务的多份任务输入特征中,选取所述当前训练阶段相应的当前份任务输入特征;将所述当前份任务输入特征,作为上一训练阶段训练完成的多任务神经网络的训练输入,以最小化第一目标训练损失函数为训练目标,对所述上一训练阶段训练完成的多任务神经网络的共享层和第一类语音信号处理任务对应的任务层的参数进行更新,直至根据最后一份任务输入特征,训练的多任务神经网络达到收敛,得到第一多任务神经网络。
- 根据权利要求8所述的电子设备,其特征在于,所述程序还用于:对于任一语音信号处理任务,将所述语音信号处理任务的训练损失函数,乘以该语音信号处理任务相应的权重,得到该语音信号处理任务相应的相乘结果,以确定出每个语音信号处理任务相应的相乘结果;将每个语音信号处理任务相应的相乘结果相加,得到目标训练损失函数。
- 根据权利要求8所述的电子设备,其特征在于,所述程序还用于:对所述待训练的多任务神经网络的LSTM网络的输入层到隐含层的连接参数、隐含层到输出层的连接参数或隐含层到隐含层之间的连接参数进行更新;对所述每个语音信号处理任务对应的MLP全连接网络的输入层到隐含层的连接参数或隐含层到输出层的连接参数进行更新。
- 根据权利要求8所述的电子设备,其特征在于,所述程序还用于:确定语音信号处理模型的每个任务层对待识别语音的输出结果;将每个任务层对待识别语音的输出结果,作为所述每个任务层对应的语音信号处理任务的任务处理结果。
- 根据权利要求8所述的电子设备,其特征在于,所述程序还用于:确定所述语音信号处理模型的每个任务层对待识别语音的输出结果;使用所述每个任务层对待识别语音的输出结果,辅助所述每个任务层对应的语音信号处理任务进行任务处理。
- 一种存储介质,其特征在于,所述存储介质存储有适用于处理器执行的程序,所述程序用于:获取样本语音,确定所述样本语音的每个语音信号处理任务的任务输入特征;根据所述每个语音信号处理任务的训练损失函数,确定目标训练损失函数;将所述样本语音的每个语音信号处理任务的任务输入特征,作为待训练的多任务神经网络的训练输入,以最小化目标训练损失函数为训练目标,对所述待训练的多任务神经网络的共享层和每个任务层的参数进行更新,直至所述待训练的多任务神经网络收敛,得到语音信号处理模型;其中,所述待训练的多任务神经网络包括:共享层,和所述每个语音信号处理任务对应的任务层。
Priority application: CN201711191604.9A (CN109841220B), filed 2017-11-24
International application: PCT/CN2018/115704, filed 2018-11-15
Related applications: EP18880575.8A (EP3611725B1), filed 2018-11-15; US16/655,548 (US11158304B2, continuation), filed 2019-10-17
Publication: WO2019100998A1 (zh), published 2019-05-31
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104881678A (zh) * | 2015-05-11 | 2015-09-02 | 中国科学技术大学 | 一种模型与特征联合学习的多任务学习方法 |
CN106355248A (zh) * | 2016-08-26 | 2017-01-25 | 深圳先进技术研究院 | 一种深度卷积神经网络训练方法及装置 |
WO2017083399A2 (en) * | 2015-11-09 | 2017-05-18 | Google Inc. | Training neural networks represented as computational graphs |
CN107357838A (zh) * | 2017-06-23 | 2017-11-17 | 上海交通大学 | 基于多任务学习的对话策略在线实现方法 |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9460711B1 (en) * | 2013-04-15 | 2016-10-04 | Google Inc. | Multilingual, acoustic deep neural networks |
US9665823B2 (en) * | 2013-12-06 | 2017-05-30 | International Business Machines Corporation | Method and system for joint training of hybrid neural networks for acoustic modeling in automatic speech recognition |
US10089576B2 (en) * | 2015-07-28 | 2018-10-02 | Microsoft Technology Licensing, Llc | Representation learning using multi-task deep neural networks |
US10339921B2 (en) * | 2015-09-24 | 2019-07-02 | Google Llc | Multichannel raw-waveform neural networks |
WO2017161233A1 (en) * | 2016-03-17 | 2017-09-21 | Sri International | Deep multi-task representation learning |
US9886949B2 (en) * | 2016-03-23 | 2018-02-06 | Google Inc. | Adaptive audio enhancement for multichannel speech recognition |
CN106228980B (zh) * | 2016-07-21 | 2019-07-05 | 百度在线网络技术(北京)有限公司 | 数据处理方法和装置 |
US10657437B2 (en) * | 2016-08-18 | 2020-05-19 | International Business Machines Corporation | Training of front-end and back-end neural networks |
CN106529402B (zh) * | 2016-09-27 | 2019-05-28 | 中国科学院自动化研究所 | 基于多任务学习的卷积神经网络的人脸属性分析方法 |
US10839284B2 (en) * | 2016-11-03 | 2020-11-17 | Salesforce.Com, Inc. | Joint many-task neural network model for multiple natural language processing (NLP) tasks |
CN106653056B (zh) * | 2016-11-16 | 2020-04-24 | 中国科学院自动化研究所 | Fundamental frequency extraction model based on LSTM recurrent neural network and training method |
WO2018140969A1 (en) * | 2017-01-30 | 2018-08-02 | Google Llc | Multi-task neural networks with task-specific paths |
CN109841220B (zh) | 2017-11-24 | 2022-09-13 | 深圳市腾讯计算机系统有限公司 | Speech signal processing model training method, apparatus, electronic device, and storage medium |
- 2017-11-24 CN CN201711191604.9A patent/CN109841220B/zh active Active
- 2017-11-24 CN CN201910745812.1A patent/CN110444214B/zh active Active
- 2018-11-15 WO PCT/CN2018/115704 patent/WO2019100998A1/zh unknown
- 2018-11-15 EP EP18880575.8A patent/EP3611725B1/en active Active
- 2019-10-17 US US16/655,548 patent/US11158304B2/en active Active
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111191675A (zh) * | 2019-12-03 | 2020-05-22 | 深圳市华尊科技股份有限公司 | Pedestrian attribute recognition model implementation method and related apparatus |
CN111191675B (zh) * | 2019-12-03 | 2023-10-24 | 深圳市华尊科技股份有限公司 | Pedestrian attribute recognition model implementation method and related apparatus |
WO2021137754A1 (en) * | 2019-12-31 | 2021-07-08 | National University Of Singapore | Feedback-controlled voice conversion |
CN111402929B (zh) * | 2020-03-16 | 2022-09-20 | 南京工程学院 | Domain-invariance-based few-shot speech emotion recognition method |
CN111402929A (zh) * | 2020-03-16 | 2020-07-10 | 南京工程学院 | Domain-invariance-based few-shot speech emotion recognition method |
CN111814959A (zh) * | 2020-06-30 | 2020-10-23 | 北京百度网讯科技有限公司 | Method, apparatus, system, and storage medium for processing model training data |
CN112989108A (zh) * | 2021-02-24 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based language detection method, apparatus, and electronic device |
CN113704388A (zh) * | 2021-03-05 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Training method, apparatus, electronic device, and medium for multi-task pre-training model |
CN113610150A (zh) * | 2021-08-05 | 2021-11-05 | 北京百度网讯科技有限公司 | Model training method, object classification method, apparatus, and electronic device |
CN113610150B (zh) * | 2021-08-05 | 2023-07-25 | 北京百度网讯科技有限公司 | Model training method, object classification method, apparatus, and electronic device |
CN113707134A (zh) * | 2021-08-17 | 2021-11-26 | 北京搜狗科技发展有限公司 | Model training method, device, and apparatus for model training |
CN113707134B (zh) * | 2021-08-17 | 2024-05-17 | 北京搜狗科技发展有限公司 | Model training method, device, and apparatus for model training |
CN113782000B (zh) * | 2021-09-29 | 2022-04-12 | 北京中科智加科技有限公司 | Multi-task-based language identification method |
CN113782000A (zh) * | 2021-09-29 | 2021-12-10 | 北京中科智加科技有限公司 | Multi-task-based language identification method |
CN114842834A (zh) * | 2022-03-31 | 2022-08-02 | 中国科学院自动化研究所 | Joint speech-text pre-training method and system |
Also Published As
Publication number | Publication date |
---|---|
CN110444214B (zh) | 2021-08-17 |
EP3611725A1 (en) | 2020-02-19 |
US11158304B2 (en) | 2021-10-26 |
EP3611725A4 (en) | 2020-12-23 |
CN109841220B (zh) | 2022-09-13 |
CN109841220A (zh) | 2019-06-04 |
EP3611725B1 (en) | 2024-01-17 |
US20200051549A1 (en) | 2020-02-13 |
CN110444214A (zh) | 2019-11-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019100998A1 (zh) | Speech signal processing model training method, electronic device, and storage medium | |
CN108463848B (zh) | Adaptive audio enhancement for multichannel speech recognition | |
JP7324753B2 (ja) | Voice enhancement of speech signals using a modified generalized eigenvalue beamformer | |
CN110503971A (zh) | Neural-network-based time-frequency mask estimation and beamforming for speech processing | |
US9154873B2 (en) | Echo suppression | |
EP3136700B1 (en) | Nearend speech detector | |
WO2019080551A1 (zh) | Target speech detection method and apparatus | |
US10771621B2 (en) | Acoustic echo cancellation based sub band domain active speaker detection for audio and video conferencing applications | |
CN113436643A (zh) | Training and application method, apparatus, device, and storage medium for speech enhancement model | |
CN112489668B (zh) | Dereverberation method, apparatus, electronic device, and storage medium | |
CN111722696B (zh) | Speech data processing method and apparatus for low-power devices | |
CN108074582A (zh) | Noise-suppression signal-to-noise-ratio estimation method and user terminal | |
WO2022218254A1 (zh) | Speech signal enhancement method, apparatus, and electronic device | |
CN113470685A (zh) | Training method and apparatus for speech enhancement model, and speech enhancement method and apparatus | |
WO2024000854A1 (zh) | Speech noise reduction method, apparatus, device, and computer-readable storage medium | |
Martín-Doñas et al. | Dual-channel DNN-based speech enhancement for smartphones | |
CN113257267B (zh) | Training method for interference signal cancellation model, and interference signal cancellation method and device | |
JP6891144B2 (ja) | Generation apparatus, generation method, and generation program | |
US8515096B2 (en) | Incorporating prior knowledge into independent component analysis | |
CN116403594B (zh) | Speech enhancement method and apparatus based on noise update factor | |
JP6711765B2 (ja) | Forming apparatus, forming method, and forming program | |
CN113488066B (zh) | Audio signal processing method, audio signal processing apparatus, and storage medium | |
CN113689870A (zh) | Multi-channel speech enhancement method and apparatus, terminal, and readable storage medium | |
US20240170003A1 (en) | Audio Signal Enhancement with Recursive Restoration Employing Deterministic Degradation | |
CN108922557A (zh) | Multi-speaker speech separation method and system for a chatbot |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18880575 Country of ref document: EP Kind code of ref document: A1 |
ENP | Entry into the national phase |
Ref document number: 2018880575 Country of ref document: EP Effective date: 20191111 |
NENP | Non-entry into the national phase |
Ref country code: DE |