WO2022160715A1 - Voice signal processing method and electronic device - Google Patents

Voice signal processing method and electronic device

Info

Publication number
WO2022160715A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
feature
features
signal frames
Application number
PCT/CN2021/116212
Other languages
French (fr)
Chinese (zh)
Inventor
邓峰
王晓瑞
王仲远
Original Assignee
北京达佳互联信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2022160715A1 publication Critical patent/WO2022160715A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • The present disclosure relates to the technical field of voice processing, and in particular to a voice signal processing method and an electronic device.
  • The collected speech signal typically contains noise, and the presence of noise adversely affects subsequent processing of the speech signal. Noise removal therefore plays a crucial role in speech signal processing.
  • A method for processing a voice signal includes: determining a plurality of first voice features of an original voice signal, each first voice feature corresponding to one voice signal frame in the original voice signal; processing the plurality of first voice features to obtain a plurality of non-local voice features, each non-local voice feature corresponding to one voice signal frame and obtained by fusing the first voice feature of that voice signal frame with the first voice features of the other voice signal frames; processing the non-local voice feature of each voice signal frame to obtain a mixed voice feature of each voice signal frame; obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames; and denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
  • An apparatus for processing a voice signal includes: a feature determination unit configured to determine a plurality of first voice features of an original voice signal, each corresponding to one voice signal frame in the original voice signal; a non-local feature acquisition unit configured to process the plurality of first voice features to obtain a plurality of non-local voice features, each corresponding to one voice signal frame and obtained by fusing the first voice feature of that frame with the first voice features of the other voice signal frames; a mixed feature acquisition unit configured to process the non-local voice feature of each voice signal frame to obtain a mixed voice feature of each frame; a denoising parameter acquisition unit configured to obtain denoising parameters based on the mixed voice features of the plurality of voice signal frames; and a target signal acquisition unit configured to denoise the original voice signal based on the denoising parameters to obtain a target voice signal.
  • An electronic device includes one or more processors and a memory for storing instructions executable by the one or more processors, wherein the one or more processors are configured to perform the steps of: determining a plurality of first speech features of an original speech signal, each corresponding to one speech signal frame in the original speech signal; processing the plurality of first speech features to obtain a plurality of non-local speech features, each corresponding to one speech signal frame and obtained by fusing the first speech feature of that frame with the first speech features of the other speech signal frames; processing the non-local speech feature of each speech signal frame to obtain a mixed speech feature of each frame; obtaining denoising parameters based on the mixed speech features of the plurality of speech signal frames; and denoising the original speech signal based on the denoising parameters to obtain a target speech signal.
  • A computer-readable storage medium is provided; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device performs the steps of: determining a plurality of first voice features of an original voice signal, each corresponding to one voice signal frame in the original voice signal; processing the plurality of first voice features to obtain a plurality of non-local voice features, each corresponding to one voice signal frame and obtained by fusing the first voice feature of that frame with the first voice features of the other voice signal frames; processing the non-local voice feature of each voice signal frame to obtain a mixed voice feature of each frame; obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames; and denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
  • A computer program product includes a computer program which, when executed by a processor, performs the steps of: determining a plurality of first speech features of an original speech signal, each corresponding to one voice signal frame in the original voice signal; processing the plurality of first voice features to obtain a plurality of non-local voice features, each corresponding to one voice signal frame and obtained by fusing the first voice feature of that frame with the first voice features of the other voice signal frames; processing the non-local voice feature of each voice signal frame to obtain a mixed voice feature of each frame; obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames; and denoising the original speech signal based on the denoising parameters to obtain a target speech signal.
  • In the above method, the context information of each speech signal frame is taken into account when acquiring its non-local speech feature, and the non-local speech feature of each frame is then processed individually to obtain the speech feature of the frame itself, yielding a mixed-form speech feature. The denoising parameters obtained from these mixed speech features are more accurate and precisely represent the proportion of the signal other than the noise signal in each speech signal frame; therefore, denoising the original speech signal with these parameters improves the denoising effect.
  • Fig. 1 is a schematic diagram of a speech processing model according to an exemplary embodiment.
  • Fig. 2 is a schematic diagram of another speech processing model according to an exemplary embodiment.
  • Fig. 3 is a schematic diagram of another speech processing model according to an exemplary embodiment.
  • Fig. 4 is a flow chart of a method for processing a speech signal according to an exemplary embodiment.
  • Fig. 5 is a flow chart of another voice signal processing method according to an exemplary embodiment.
  • Fig. 6 is a schematic diagram of a non-local attention sub-model according to an exemplary embodiment.
  • Fig. 7 is a flow chart of a method for acquiring non-local speech features according to an exemplary embodiment.
  • Fig. 8 is a schematic diagram of a first processing network according to an exemplary embodiment.
  • Fig. 9 is a schematic diagram of a second processing network according to an exemplary embodiment.
  • Fig. 10 is a schematic diagram of another second processing network according to an exemplary embodiment.
  • Fig. 11 is a schematic diagram of a residual non-local sub-network according to an exemplary embodiment.
  • Fig. 12 is a schematic diagram of another non-local attention sub-model according to an exemplary embodiment.
  • Fig. 13 is a flowchart showing another method for processing a speech signal according to an exemplary embodiment.
  • Fig. 14 is a block diagram of an apparatus for processing a speech signal according to an exemplary embodiment.
  • Fig. 15 is a block diagram of another apparatus for processing speech signals according to an exemplary embodiment.
  • Fig. 16 is a structural block diagram of a terminal according to an exemplary embodiment.
  • Fig. 17 is a structural block diagram of a server according to an exemplary embodiment.
  • The user information (including but not limited to user equipment information, user personal information, etc.) involved in this disclosure is information authorized by the user or fully authorized by all parties.
  • In the related art, spectral subtraction is used to denoise the speech signal: a silent segment of the speech signal is located, a noise estimate is extracted from it, and the noise in the speech signal is removed by subtracting the noise estimate from the speech signal. However, when the noise in the speech signal changes over time, spectral subtraction has difficulty removing it, and the denoising effect is poor.
  • The voice signal processing method provided by the embodiments of the present disclosure can be applied to various scenarios. For example, in a live-streaming scenario, the method removes the noise signal from the voice signal and improves its voice quality, so that viewer terminals play a clear voice signal and the live broadcast effect is improved. In a speech recognition scenario, the speech signal is denoised first and the denoised signal is then recognized, improving the accuracy of speech recognition. The methods provided by the embodiments of the present disclosure can also be applied to scenarios such as video playback, language recognition, speech synthesis, and identity recognition.
  • Fig. 1 is a schematic diagram of a speech processing model according to an exemplary embodiment. The speech processing model includes a non-local attention network 101 and a local attention network 102, the non-local attention network 101 being connected to the local attention network 102. The non-local attention network 101 processes the first speech features of the input original speech signal to obtain the non-local speech features of the original speech signal, and the local attention network 102 further processes these non-local speech features to obtain the mixed speech features of the original speech signal.
  • In some embodiments, the speech processing model further includes a feature extraction network 103, a feature reconstruction network 104, and a speech denoising network 105. The feature extraction network 103 is connected to the non-local attention network 101, the feature reconstruction network 104 is connected to the local attention network 102, and the speech denoising network 105 is connected to the feature reconstruction network 104.
  • The feature extraction network 103 extracts the first voice features of the original voice signal; the feature reconstruction network 104 performs feature reconstruction on the mixed voice features of the processed original voice signal to obtain the denoising parameters of the original voice signal; and the voice denoising network 105 denoises the original voice signal.
  • In some embodiments, the speech processing model includes a plurality of non-local attention networks 101 and a plurality of local attention networks 102, which can be connected in sequence in any order.
  • For example, the speech processing model includes two non-local attention networks 101 and two local attention networks 102: the feature extraction network 103 is connected to the first non-local attention network 101, the first non-local attention network 101 is connected to the first local attention network 102, the first local attention network 102 is connected to the second local attention network 102, the second local attention network 102 is connected to the second non-local attention network 101, and the second non-local attention network 101 is connected to the feature reconstruction network 104.
  • In the embodiments of the present disclosure, the non-local attention network may be referred to as a non-local attention sub-model, the local attention network as a local attention sub-model, the feature extraction network as a feature extraction sub-model, the feature reconstruction network as a feature reconstruction sub-model, and the speech denoising network as a speech denoising sub-model.
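  • To make the data flow concrete, the following is a minimal sketch of how these sub-models could be wired together, assuming PyTorch; the module classes passed in are illustrative stand-ins, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class SpeechProcessingModel(nn.Module):
    """Illustrative wiring of the sub-models described above."""
    def __init__(self, feature_extraction, non_local_attention,
                 local_attention, feature_reconstruction):
        super().__init__()
        self.feature_extraction = feature_extraction          # network 103
        self.non_local_attention = non_local_attention        # network 101
        self.local_attention = local_attention                # network 102
        self.feature_reconstruction = feature_reconstruction  # network 104

    def forward(self, original_amplitude: torch.Tensor) -> torch.Tensor:
        first_features = self.feature_extraction(original_amplitude)
        non_local_features = self.non_local_attention(first_features)
        mixed_features = self.local_attention(non_local_features)
        # Feature reconstruction yields the denoising parameters.
        denoising_params = self.feature_reconstruction(mixed_features)
        # Speech denoising network 105: apply the parameters to the amplitude.
        return denoising_params * original_amplitude
```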
  • The voice signal processing method provided by the embodiments of the present disclosure is performed by an electronic device, which is a terminal or a server.
  • The terminal is a portable, pocket-sized, hand-held, or other type of terminal, such as a mobile phone, a computer, or a tablet computer.
  • The server is a single server, a server cluster composed of several servers, or a cloud computing service center.
  • Fig. 4 is a flow chart showing a method for processing a voice signal according to an exemplary embodiment. Referring to Fig. 4, the method is executed by an electronic device and includes the following steps:
  • Each non-local voice feature corresponds to one voice signal frame and is obtained by fusing the first voice feature of that voice signal frame with the first voice features of the other voice signal frames.
  • Since the non-local attention sub-model is invoked to obtain the non-local speech feature of each speech signal frame, the context information of each frame is taken into account; the local attention sub-model is then invoked to process the non-local speech feature of each frame separately, obtaining the speech feature of the frame itself and thus a mixed-form speech feature. The denoising parameters obtained from these mixed features accurately represent the proportion of the signal other than the noise signal in each frame, so using them to denoise the original speech signal improves the denoising effect.
  • Fig. 5 is a flowchart of another voice signal processing method according to an exemplary embodiment. Referring to Fig. 5, the method is executed by an electronic device and includes the following steps:
  • the electronic device acquires the original amplitude and original phase of multiple speech signal frames in the original speech signal.
  • A speech signal consists of amplitude and phase, and the noise in the speech signal is contained in the amplitude. Therefore, the original amplitude and original phase of each speech signal frame in the original speech signal are obtained and only the original amplitude is denoised; the original phase is not processed, which realizes the denoising of the original speech signal while reducing the amount of processing.
  • the original voice signal is collected by an electronic device, or is a voice signal containing noise signals sent by other electronic devices to the electronic device.
  • the noise signal is a noise signal of environmental noise, white noise, or the like.
  • The original voice signal includes multiple voice signal frames. The electronic device performs a Fourier transform on each voice signal frame to obtain its original amplitude and original phase, and subsequently processes the original amplitude of each voice signal frame to achieve denoising of the amplitude.
  • the Fourier transform includes fast Fourier transform, short-time Fourier transform, and the like.
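  • As a concrete illustration of this step, the short-time Fourier transform splits a signal into per-frame amplitude and phase. A minimal sketch using scipy; the frame parameters are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

def decompose(signal: np.ndarray, sample_rate: int):
    """Split a speech signal into per-frame original amplitude and phase."""
    # Each column of Zxx is the spectrum of one speech signal frame.
    _, _, Zxx = stft(signal, fs=sample_rate, nperseg=512, noverlap=256)
    original_amplitude = np.abs(Zxx)   # the noise is contained here
    original_phase = np.angle(Zxx)     # left untouched during denoising
    return original_amplitude, original_phase
```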
  • In some embodiments, the signal length of the original speech signal cannot exceed a reference signal length; that is, the duration of the original speech signal cannot exceed a reference duration. The reference signal length and the reference duration can be set to any values; for example, the reference signal length is 64 speech signal frames.
  • The electronic device invokes the feature extraction sub-model to perform feature extraction on the original amplitude of each voice signal frame, obtaining the first voice feature of each frame, that is, the multiple first voice features of the original voice signal.
  • the first voice feature of the voice signal frame is used to describe the corresponding voice signal frame, and the first voice feature is represented by a vector, a matrix or other forms.
  • The first voice features of the multiple voice signal frames may be represented individually or combined into a single representation. For example, if the first voice feature of each voice signal frame is a vector, the vectors can be combined into a matrix in which each column represents the first voice feature of one voice signal frame.
  • the feature extraction sub-model includes a convolution layer, a batch normalization layer, and an activation function layer.
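  • A minimal sketch of such a feature extraction sub-model, assuming PyTorch; the channel counts and kernel size are illustrative assumptions:

```python
import torch.nn as nn

# Convolution + batch normalization + activation, as described above.
feature_extraction = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```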
  • the electronic device invokes the non-local attention sub-model to fuse the first speech features of the multiple speech signal frames to obtain the non-local speech features of each speech signal frame.
  • Each non-local voice feature corresponds to one voice signal frame and is obtained by fusing the first voice feature of that frame with the first voice features of the other voice signal frames. That is, the non-local voice feature of each voice signal frame combines the first voice features of multiple frames, taking into account the features of the voice signal frames before and after it.
  • In some embodiments, the non-local attention sub-model uses an attention mechanism and residual learning to process the first speech features. The attention mechanism takes the context information of each speech signal frame into account, making the resulting non-local speech features more accurate. Because some speech features are lost while the first speech features are being processed, residual learning combines the processed features with the input first speech features to obtain the non-local speech features, avoiding the loss of important speech features during processing.
  • the non-local attention sub-model includes a first processing unit, a second processing unit, a first fusion unit, and a second fusion unit.
  • The first processing unit may be referred to as a first processing network, the second processing unit as a second processing network, the first fusion unit as a first fusion network, and the second fusion unit as a second fusion network.
  • In some embodiments, the first processing network is a trunk branch (Trunk Branch) and the second processing network is a mask branch (Mask Branch).
  • The first processing network and the second processing network each process the input first voice features of the multiple voice signal frames; the first fusion network fuses the features output by the two processing networks; and the second fusion network fuses the result of the first fusion network with the features input to the non-local attention sub-model.
  • The process by which the electronic device invokes the non-local attention sub-model to process the first speech feature of each speech signal frame is shown in Fig. 7 and includes the following steps:
  • The electronic device invokes the first processing network to perform feature extraction on the first voice feature of each voice signal frame, obtaining the second voice feature of each frame. The second voice feature is a further extraction of the first voice feature and contains fewer noise features.
  • In some embodiments, the first processing network includes a plurality of atrous residual sub-units (Res. Unit), each of which may be called an atrous residual sub-network. Each atrous residual sub-network includes an atrous (dilated) convolution layer, a batch normalization layer, and an activation function layer, and the multiple atrous residual sub-networks are connected using the network structure of a residual learning network. The atrous convolution layer expands the receptive field and obtains more contextual information.
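  • A minimal sketch of one atrous residual sub-network under these assumptions, in PyTorch; the dilation rate and channel count are illustrative:

```python
import torch
import torch.nn as nn

class AtrousResidualSubNetwork(nn.Module):
    """Atrous convolution + batch norm + activation with a residual skip."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            # The dilated convolution widens the receptive field over frames.
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps features that processing might lose.
        return x + self.body(x)
```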
  • In some embodiments, the non-local attention sub-model further includes at least one atrous residual unit, which may be called an atrous residual network. Each atrous residual network includes two atrous residual sub-networks connected using the network structure of a residual learning network. Before calling the first processing network and the second processing network to process the first voice feature of each voice signal frame, the electronic device first calls the at least one atrous residual network to perform feature extraction on the first voice feature of each frame, obtaining further-extracted first voice features; the first and second processing networks then process these further-extracted features. Invoking the first processing network, which includes a plurality of atrous residual sub-networks, extracts the first speech features further to obtain deeper speech features.
  • The electronic device invokes the second processing network to fuse the first voice feature of each voice signal frame with the first voice features of the other voice signal frames, obtaining the third voice feature of each frame. That is, for each voice signal frame, the second processing network fuses the first voice feature of that frame with the first voice features of the frames other than it, so that the third voice feature combines the first voice features of the other voice signal frames.
  • In some embodiments, the second processing network includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit, which may be referred to as a residual non-local sub-network, a convolution sub-network, and a deconvolution sub-network, respectively.
  • The electronic device calls the residual non-local sub-network and, based on the weights of the multiple voice signal frames, performs weighted fusion of the first voice feature of each voice signal frame with the first voice features of the other voice signal frames, obtaining the weighted-fused first voice feature of each frame. The convolution sub-network is then called to encode the weighted-fused first voice feature of each frame, obtaining the encoded feature of each frame, and the deconvolution sub-network is called to decode the encoded features, obtaining the third voice feature of each frame.
  • In some embodiments, the second processing network further includes a plurality of feature reduction subunits, a plurality of first atrous residual subunits, a plurality of second atrous residual subunits, and an activation function subunit, which may be referred to as feature reduction sub-networks, first atrous residual sub-networks, second atrous residual sub-networks, and an activation function sub-network, respectively.
  • The residual non-local sub-network is connected to the first of the first atrous residual sub-networks, which are connected in sequence, and the last of them is connected to the convolution sub-network. The convolution sub-network is connected to the first feature reduction sub-network; the feature reduction sub-networks are connected in sequence, and the last is connected to the deconvolution sub-network. The deconvolution sub-network is connected to the first of the second atrous residual sub-networks, which are connected in sequence, and the last of them is connected to the activation function sub-network.
  • Fig. 10 shows two first atrous residual sub-networks, two second atrous residual sub-networks, and two feature reduction sub-networks only as an example; other numbers of these sub-networks are also possible.
  • In some embodiments, the activation function in the activation function sub-network is a sigmoid function or another activation function.
  • Although the first and second atrous residual sub-networks occupy different positions, each atrous residual sub-network includes an atrous convolution layer, a batch normalization layer, and an activation function layer. In some embodiments, the feature reduction sub-network is also an atrous residual sub-network.
  • The electronic device invokes the plurality of first atrous residual sub-networks to process the weighted-fused first voice feature of each voice signal frame, obtaining the further-processed first voice feature of each frame; calls the convolution sub-network to encode the further-processed first speech features, obtaining the encoded feature of each frame; performs feature reduction on the encoded features, obtaining multiple reduced encoded features; calls the deconvolution sub-network to decode the reduced encoded features, obtaining the decoded speech feature of each frame; and calls the plurality of second atrous residual sub-networks to process the decoded speech features, obtaining the third speech feature of each frame.
  • Performing reduction processing on the encoded features shrinks them, which reduces the amount of calculation and improves the processing speed.
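  • Under the structure just described, the second processing network can be sketched roughly as follows, reusing the AtrousResidualSubNetwork class from the earlier sketch; the layer sizes, block counts, and downsampling factor are illustrative assumptions:

```python
import torch.nn as nn

def make_mask_branch(channels: int, nl_unit: nn.Module) -> nn.Sequential:
    """Residual non-local unit -> first atrous residual blocks -> encode ->
    feature reduction -> decode -> second atrous residual blocks -> sigmoid."""
    # AtrousResidualSubNetwork is the class defined in the earlier sketch.
    return nn.Sequential(
        nl_unit,                                    # residual non-local sub-network
        AtrousResidualSubNetwork(channels),         # first atrous residual sub-networks
        AtrousResidualSubNetwork(channels),
        nn.Conv2d(channels, channels, 3, stride=2, padding=1),  # encode
        AtrousResidualSubNetwork(channels),         # feature reduction sub-networks
        AtrousResidualSubNetwork(channels),
        nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1),  # decode
        AtrousResidualSubNetwork(channels),         # second atrous residual sub-networks
        AtrousResidualSubNetwork(channels),
        nn.Sigmoid(),                               # activation function sub-network
    )
```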
  • In some embodiments, the residual non-local sub-network includes a first fusion layer and a second fusion layer. The electronic device calls the first fusion layer and, based on the weights of the multiple speech signal frames, performs weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames, obtaining the fusion feature of each frame; the second fusion layer is then called to fuse the first speech feature of each frame with its fusion feature, obtaining the weighted-fused first speech feature of each frame.
  • Calling the first fusion layer to fuse the first speech features of different speech signal frames with their corresponding weights yields more accurate fusion features. When the first fusion layer and the second fusion layer are both included, the residual non-local sub-network is a residual learning network: it fuses the fusion feature with the input first speech feature, so that the final weighted-fused first speech feature is more accurate and important features are not lost, improving the accuracy of the weighted-fused first speech feature. Moreover, the residual learning network is easier to optimize, which improves the training efficiency of the model during the training process.
  • Fig. 11 illustrates the processing of three speech signal frames. The residual non-local sub-network further includes a plurality of convolution layers, a third fusion layer, and a normalization layer. The third fusion layer is connected to two of the convolution layers and fuses the first speech features processed by those two layers; it is also connected to the normalization layer, which normalizes the fused speech features output by the third fusion layer. The normalization layer is connected to the first fusion layer, which fuses the first speech feature processed by another convolution layer with the normalized speech feature output by the normalization layer, obtaining the fusion feature of each speech signal frame. The fusion feature is then processed by a convolution layer and fused with the input first speech feature to obtain the weighted-fused first speech feature.
  • In some embodiments, the first fusion layer and the third fusion layer fuse speech features by matrix multiplication, and the second fusion layer fuses speech features by matrix addition.
  • In some embodiments, the first voice feature of a voice signal frame has shape T*K*C, representing a voice feature of dimension C at time T and frequency K. To multiply or add the speech features of different voice signal frames, the speech features must first be reshaped into compatible forms.
  • In some embodiments, the residual non-local sub-network processes the first speech feature of each speech signal frame $x_i$ with a formula of the following form, written here in the standard non-local attention formulation consistent with the symbol definitions that follow: $o_i = x_i + W_z y_i$, where $y_i = \sum_{j} \mathrm{softmax}\big((W_u x_i)^{\top} W_v x_j\big)\, W_g x_j$.
  • Here $o_i$ denotes the weighted-fused first speech feature of speech signal frame $x_i$; $W_z$, $W_u$, $W_v$ and $W_g$ are learned model parameters; softmax denotes normalization; $x_j$ ranges over the speech signal frames other than $x_i$; and $y_i$ denotes the fusion feature of speech signal frame $x_i$.
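  • The same computation can be sketched in code, assuming PyTorch; realizing $W_u$, $W_v$, $W_g$ and $W_z$ as 1x1 convolutions is an illustrative choice, not the patent's stated implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualNonLocal(nn.Module):
    """Residual non-local fusion: o_i = x_i + W_z * sum_j softmax(...) W_g x_j."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_u = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_g = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_z = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, T, K), C features per time frame T and frequency bin K
        b, c, t, k = x.shape
        u = self.w_u(x).reshape(b, c, t * k)  # queries
        v = self.w_v(x).reshape(b, c, t * k)  # keys
        g = self.w_g(x).reshape(b, c, t * k)  # values
        # Pairwise similarities between positions, normalized with softmax.
        attn = F.softmax(torch.bmm(u.transpose(1, 2), v), dim=-1)
        # Fusion feature y: weighted combination of the other frames' features.
        y = torch.bmm(g, attn.transpose(1, 2)).reshape(b, c, t, k)
        # Residual connection with the input first speech feature.
        return x + self.w_z(y)
```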
  • The electronic device invokes the first fusion network to fuse the second voice feature and the third voice feature of each voice signal frame, obtaining the non-local voice feature of each frame. In some embodiments, the first fusion network is a multiplication unit; that is, the second speech feature and the third speech feature of each frame are multiplied to obtain the fused non-local speech feature.
  • The electronic device then invokes the second fusion network to fuse the non-local speech feature of each speech signal frame with its first speech feature, obtaining the fused non-local speech feature of each frame. In some embodiments, the second fusion network is an addition unit; that is, the electronic device adds the non-local speech feature and the first speech feature of each frame.
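  • Putting the pieces together, the trunk/mask fusion can be sketched as follows, assuming PyTorch; the trunk and mask modules stand in for the first and second processing networks and are hypothetical names:

```python
import torch.nn as nn

class NonLocalAttention(nn.Module):
    """First fusion network: multiply; second fusion network: add input back."""
    def __init__(self, trunk: nn.Module, mask: nn.Module):
        super().__init__()
        self.trunk = trunk  # first processing network (deeper features)
        self.mask = mask    # second processing network (cross-frame fusion)

    def forward(self, x):
        fused = self.trunk(x) * self.mask(x)  # first fusion network
        return x + fused                      # second fusion network
```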
  • In the embodiments of the present disclosure, different networks in the non-local attention sub-model process different aspects of the first speech features. The first processing network, which includes multiple atrous residual sub-networks, further extracts the first voice features to obtain deeper voice features, while the second processing network adopts a non-local attention mechanism so that, when the first voice feature of each voice signal frame is processed, the features of the other frames in the voice signal are taken into account. The first fusion network then fuses the speech features obtained by the two processing networks to obtain the non-local speech features. In addition, the atrous residual sub-networks expand the receptive field and capture more contextual information.
  • In some embodiments, when the non-local attention sub-model includes the second fusion network, the non-local attention sub-model is a residual learning network: after the non-local speech features are obtained, they are fused with the input first speech features, so the final non-local speech features are more accurate and important features are not lost, improving their accuracy. Moreover, a residual learning network is easier to optimize, which improves the training efficiency of the model during the training process.
  • In some embodiments, the non-local attention sub-model includes a plurality of atrous residual units, which may be called atrous residual networks. The electronic device first calls the plurality of atrous residual networks to process the input first voice feature of each voice signal frame, and then feeds the processed first voice features to the first processing network and the second processing network.
  • Similarly, after the fused non-local speech features are obtained, multiple atrous residual networks are called to process them, and the processed non-local speech features are input into the subsequent local attention sub-model.
  • Fig. 12 shows four atrous residual networks only as an example.
  • The electronic device invokes the local attention sub-model to process the non-local speech features of each speech signal frame separately, obtaining the mixed speech feature of each speech signal frame.
  • The mixed speech feature no longer includes the noise feature, and because the speech features of the other speech signal frames were considered in deriving it, it is more accurate.
  • In some embodiments, the network structure of the local attention sub-model is similar to that of the non-local attention sub-model, except that the local attention sub-model does not include the residual non-local sub-network; its network structure is therefore not repeated here.
  • The embodiments of the present disclosure take one non-local attention sub-model and one local attention sub-model as an example for description.
  • In some embodiments, multiple non-local attention sub-models and multiple local attention sub-models are included; that is, after the mixed speech feature is obtained, it can be input into a subsequent non-local attention sub-model or local attention sub-model for further processing, yielding a more accurate mixed speech feature.
  • When the electronic device invokes the local attention sub-model, it processes the non-local speech features of each speech signal frame separately: the non-local voice features of the other voice signal frames, i.e. the context information of the frame, are no longer considered, so the processing extracts the voice features of the frame itself. Since the context information of each speech signal frame was already considered when its non-local speech feature was obtained, the resulting mixed speech feature reflects both the characteristics of the frame within the whole speech signal and the characteristics of the frame itself.
  • the electronic device invokes the feature recognition sub-model to perform feature recognition on the mixed speech features of multiple speech signal frames to obtain denoising parameters.
  • The feature recognition sub-model performs feature recognition on the mixed speech features of the multiple speech signal frames. From the mixed speech feature of each frame, it identifies the ratio between the noise signal and the remaining signal in the corresponding voice signal frame; identifying the multiple mixed voice features thus yields the denoising parameters corresponding to the multiple voice signal frames, i.e. the denoising parameters of the original voice signal. The denoising parameters represent the proportion of the speech signal other than the noise signal in each speech signal frame and are subsequently used to denoise the original speech signal.
  • In some embodiments, the denoising parameters are represented as a matrix in which each element, or each column or row, represents the denoising parameter of one speech signal frame.
  • the feature recognition sub-model is a convolutional network or other types of networks.
  • the electronic device invokes the speech denoising sub-model, and denoises the original amplitudes of the multiple speech signal frames according to the denoising parameters to obtain the target amplitudes of the multiple speech signal frames.
  • In some embodiments, the speech denoising sub-model is a multiplication network: the denoising parameters are multiplied by the original amplitudes to obtain the target amplitudes of the multiple speech signal frames, and the target amplitudes no longer contain the noise signal.
  • In some embodiments, the denoising parameter is a matrix, and each element of the matrix is multiplied by the original amplitude of the corresponding speech signal frame; alternatively, each column or row of the matrix is multiplied by the original amplitudes of the corresponding speech signal frame.
  • the electronic device combines the original phases and target amplitudes of multiple voice signal frames to obtain a target voice signal.
  • the electronic device performs inverse Fourier transform on the original phases and target amplitudes of the plurality of speech signal frames to obtain the target speech signal, where the target speech signal is the speech signal after removing the noise signal.
  • This method of denoising the original amplitudes of the speech signal frames only needs to process the amplitude of the speech signal, not the phase, which reduces the features to be processed and improves the processing speed.
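  • Continuing the earlier scipy sketch, applying the denoising parameters and recombining with the original phase could look like this; the denoising parameters are assumed to have the same shape as the amplitude spectrogram:

```python
import numpy as np
from scipy.signal import istft

def denoise(original_amplitude, original_phase, denoising_params, sample_rate):
    """Multiply the denoising parameters with the original amplitudes,
    then recombine with the untouched original phases."""
    target_amplitude = denoising_params * original_amplitude
    Zxx = target_amplitude * np.exp(1j * original_phase)
    # The inverse transform recovers the target speech signal.
    _, target_signal = istft(Zxx, fs=sample_rate, nperseg=512, noverlap=256)
    return target_signal
```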
  • Since the non-local attention sub-model is invoked to obtain the non-local speech feature of each speech signal frame, the context information of each frame is taken into account; the local attention sub-model is then invoked to process the non-local speech feature of each frame separately, obtaining the speech feature of the frame itself and thus a mixed-form speech feature. The denoising parameters obtained from these mixed features accurately represent the proportion of the signal other than the noise signal in each frame, so using them to denoise the original speech signal improves the denoising effect.
  • In the embodiments of the present disclosure, feature extraction is performed on the original amplitude of each speech signal frame, and the original amplitude is denoised according to the acquired denoising parameters to obtain a noise-free target amplitude; the target speech signal without the noise signal is then recovered from the target amplitude and the original phase, realizing the denoising of the original speech signal. This denoising method only needs to process the amplitude of the speech signal, not the phase, which reduces the features that need to be processed.
  • In some embodiments, the speech processing model needs to be trained before it is called to process the original speech signal. The training process is as follows: obtain a sample speech signal and a sample noise signal; mix the sample speech signal and the sample noise signal to obtain a sample mixed signal; call the speech processing model to process the multiple sample speech signal frames in the sample mixed signal to obtain the predicted denoising parameters corresponding to the sample mixed signal; denoise the sample mixed signal based on the predicted denoising parameters to obtain a predicted speech signal; and train the speech processing model based on the difference between the predicted speech signal and the sample speech signal.
  • In some embodiments, the sample speech signal is a clean speech signal that contains no noise, which improves the training speed of the model during the training process.
  • In some embodiments, sample voice signals of multiple users are obtained from a voice database, multiple sample noise signals are obtained from a noise database, and the sample noise signals are mixed with the sample voice signals at different signal-to-noise ratios to obtain multiple sample mixed signals, which are used to train the speech processing model.
  • In some embodiments, the sample amplitudes of the multiple sample speech signal frames in the sample mixed signal are obtained, and the speech processing model is invoked to process the multiple sample amplitudes to obtain the predicted denoising parameters corresponding to the sample mixed signal; the sample amplitudes are denoised based on the predicted denoising parameters to obtain the predicted amplitude of each frame, and the speech processing model is trained based on the difference between the predicted amplitude of each frame and the amplitudes of the corresponding frames of the sample speech signal.
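  • A minimal training sketch under these assumptions, in PyTorch; the SNR mixing and the magnitude-based loss follow the description above, while the optimizer and exact loss form are illustrative choices:

```python
import numpy as np
import torch
import torch.nn.functional as F

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the mixture has the requested signal-to-noise ratio."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def train_step(model, optimizer, sample_amplitude, clean_amplitude):
    """One step: predict denoising parameters, apply them to the sample
    amplitudes, and compare against the clean sample amplitudes."""
    optimizer.zero_grad()
    predicted_params = model(sample_amplitude)
    predicted_amplitude = predicted_params * sample_amplitude
    loss = F.mse_loss(predicted_amplitude, clean_amplitude)
    loss.backward()
    optimizer.step()
    return loss.item()
```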
  • In the figures, Conv. denotes the feature extraction sub-model or the feature recognition sub-model; RNAM denotes the non-local attention sub-model; RAM denotes the local attention sub-model; Res. Unit denotes an atrous residual network or atrous residual sub-network; Conv denotes the convolution sub-network; Deconv denotes the deconvolution sub-network; and NL Unit denotes the residual non-local sub-network.
  • The following related methods are compared with the method (RNANet) provided by the embodiments of the present disclosure:
  • Wiener Filtering
  • SEGAN (Speech Enhancement Generative Adversarial Network)
  • Wavenet (a WaveNet-based speech denoising method)
  • MMSE-GAN (a speech enhancement generative adversarial network)
  • DFL (Deep Feature Loss) method
  • MDPhD (a hybrid model)
  • RSGAN-GP (Speech Enhancement using Relativistic Generative Adversarial Networks with Gradient Penalty)
  • CSIG: an evaluation index, the mean opinion score of signal distortion; the larger the CSIG, the better the denoising effect.
  • CBAK: an evaluation index (the mean opinion score of background noise intrusiveness).
  • COVL: an evaluation index (the mean opinion score of overall quality).
  • STOI: Short-Time Objective Intelligibility.
  • In other embodiments, the electronic device denoises the original voice signal without invoking the voice processing model.
  • Fig. 13 is a flowchart of another voice signal processing method provided by an embodiment of the present application. Referring to Fig. 13, the method is performed by an electronic device and includes the following steps:
  • the electronic device performs feature extraction on the original amplitude of each voice signal frame, respectively, to obtain the first voice feature of each voice signal frame.
  • each non-local voice feature is obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature and the first voice features of other voice signal frames except the voice signal frame.
  • The electronic device performs feature extraction on the first voice feature of each voice signal frame to obtain the second voice feature of each frame; fuses the first voice feature of each frame with the first voice features of the other voice signal frames to obtain the third voice feature of each frame; and fuses the second voice feature and the third voice feature of each frame to obtain the non-local voice feature of each frame.
  • In some embodiments, based on the weights of the multiple voice signal frames, the electronic device performs weighted fusion of the first voice feature of each voice signal frame with the first voice features of the other voice signal frames to obtain the fusion feature of each frame, and then fuses the first voice feature of each frame with its fusion feature to obtain the weighted-fused first voice feature of each frame.
  • The electronic device performs feature reduction on the encoded feature of each voice signal frame to obtain multiple reduced encoded features, and decodes each reduced encoded feature to obtain the third voice feature of each voice signal frame.
  • the electronic device fuses the non-local speech feature of each speech signal frame with the first speech feature to obtain the fused non-local speech feature of each speech signal frame.
  • the electronic device performs feature recognition on the mixed speech features of the multiple speech signal frames to obtain the denoising parameter.
  • The electronic device denoises the original amplitudes of the multiple speech signal frames based on the denoising parameters to obtain the target amplitudes of the multiple speech signal frames, and then combines the original phases with the target amplitudes of the multiple speech signal frames to obtain the target speech signal.
  • In this method, the context information of each speech signal frame is taken into account when acquiring its non-local speech feature, and the non-local speech feature of each frame is then processed individually to obtain the speech feature of the frame itself, yielding a mixed-form speech feature. The denoising parameters obtained from these mixed speech features are more accurate and precisely represent the proportion of the signal other than the noise signal in each speech signal frame; therefore, denoising the original speech signal with these parameters improves the denoising effect.
  • Fig. 14 is a block diagram of an apparatus for processing a speech signal according to an exemplary embodiment.
  • the device includes:
  • the feature determining unit 1401 is configured to determine a plurality of first voice features of the original voice signal, each first voice feature corresponds to a voice signal frame in the original voice signal;
  • the non-local feature acquisition unit 1402 is configured to process a plurality of first speech features to obtain a plurality of non-local speech features, each non-local speech feature corresponds to a speech signal frame, and each non-local speech feature is based on Obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature and the first voice feature of other voice signal frames except the voice signal frame;
  • the mixed feature acquisition unit 1403 is configured to process the non-local voice features of each voice signal frame in the original voice signal respectively to obtain the mixed voice feature of each voice signal frame;
  • the denoising parameter obtaining unit 1404 is configured to obtain denoising parameters based on the mixed speech features of a plurality of speech signal frames;
  • the target signal obtaining unit 1405 is configured to perform denoising on the original speech signal based on the denoising parameter to obtain the target speech signal.
  • In the above apparatus, the context information of each voice signal frame is taken into account when acquiring its non-local voice feature, and the non-local voice feature of each frame is then processed individually to obtain the voice feature of the frame itself, yielding a mixed-form voice feature. The denoising parameters obtained from these mixed voice features are more accurate and precisely represent the proportion of the signal other than the noise signal in each voice signal frame; therefore, denoising the original voice signal with these parameters improves the denoising effect.
  • the feature determining unit 1401 is configured to perform feature extraction on the original amplitude of each speech signal frame, respectively, to obtain the first speech feature of each speech signal frame.
  • the target signal acquisition unit 1405 includes:
  • the amplitude obtaining subunit 1415 is configured to de-noise the original amplitudes of the multiple speech signal frames based on the denoising parameters to obtain the target amplitudes of the multiple speech signal frames;
  • the signal acquisition subunit 1425 is configured to combine the original phase and target amplitude of multiple speech signal frames to obtain the target speech signal.
  • the denoising parameter obtaining unit 1404 is configured to perform feature recognition on mixed speech features of multiple speech signal frames to obtain denoising parameters.
  • the non-local feature acquisition unit 1402 includes:
  • the feature extraction subunit 1412 is configured to perform feature extraction on the first voice feature of each voice signal frame, respectively, to obtain the second voice feature of each voice signal frame;
  • the first fusion subunit 1422 is configured to fuse the first speech features of each speech signal frame with the first speech features of other speech signal frames to obtain the third speech feature of each speech signal frame;
  • the second fusion subunit 1432 is configured to respectively fuse the second speech feature and the third speech feature of each speech signal frame to obtain the non-local speech feature of each speech signal frame.
  • the non-local feature acquisition unit 1402 further includes:
  • the third fusion subunit 1442 is configured to fuse the non-local speech feature of each speech signal frame with the first speech feature to obtain the fused non-local speech feature of each speech signal frame.
  • In some embodiments, the first fusion subunit 1422 is configured to: perform weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the weighted-fused first speech feature of each frame; encode the weighted-fused first speech features to obtain the encoded feature of each frame; perform feature reduction on the encoded features to obtain multiple reduced encoded features; and decode each reduced encoded feature to obtain the third voice feature of each voice signal frame.
  • In some embodiments, the first fusion subunit 1422 is configured to: perform weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the fusion feature of each frame; and fuse the first voice feature of each frame with its fusion feature to obtain the weighted-fused first voice feature of each frame.
  • the non-local feature acquisition unit 1402 is configured to invoke the non-local attention sub-model to process multiple first speech features to obtain multiple non-local speech features;
  • the mixed feature acquisition unit 1403 is configured to invoke the local attention sub-model to process the non-local speech features of each speech signal frame respectively to obtain the mixed speech features of each speech signal frame.
  • the non-local attention sub-model includes a first processing network, a second processing network and a first fusion network.
  • the non-local feature acquisition unit 1402 includes:
  • the feature extraction subunit 1412 is configured to call the first processing network to perform feature extraction on the first voice feature of each voice signal frame to obtain the second voice feature of each frame, the first processing network including a plurality of atrous residual sub-networks;
  • the first fusion subunit 1422 is configured to call the second processing network to fuse the first voice feature of each voice signal frame with the first voice features of the other voice signal frames, obtaining the third voice feature of each frame;
  • the second fusion subunit 1432 is configured to invoke the first fusion network to respectively fuse the second speech feature and the third speech feature of each speech signal frame to obtain the non-local speech feature of each speech signal frame.
  • the second processing network includes a residual non-local sub-network, a convolution sub-network and a deconvolution sub-network; the first fusion subunit 1422 is configured to:
  • invoke the residual non-local sub-network to weight-fuse, based on the weights of the multiple voice signal frames, the first voice feature of each voice signal frame with the first voice features of the other voice signal frames, to obtain the weighted-fused first voice feature of each voice signal frame;
  • invoke the convolution sub-network to encode each weighted-fused first voice feature, to obtain the encoded feature of each voice signal frame;
  • invoke the deconvolution sub-network to decode the encoded feature of each speech signal frame, to obtain the third speech feature of each speech signal frame.
  • the residual non-local sub-network includes a first fusion layer and a second fusion layer;
  • the first fusion subunit 1422 is configured to:
  • invoke the first fusion layer to weight-fuse, based on the weights of the multiple speech signal frames, the first speech feature of each speech signal frame with the first speech features of the other speech signal frames, to obtain the fusion feature of each speech signal frame;
  • invoke the second fusion layer to fuse the first speech feature of each speech signal frame with its fusion feature, to obtain the weighted-fused first speech feature of each speech signal frame.
  • an electronic device comprising one or more processors and a volatile or non-volatile memory for storing instructions executable by the one or more processors, wherein the one or more processors are configured to execute the voice signal processing method in the above embodiments.
  • FIG. 16 is a structural block diagram of a terminal 1600 according to an exemplary embodiment.
  • the terminal 1600 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop or a desktop computer.
  • Terminal 1600 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • the terminal 1600 includes: a processor 1601 and a memory 1602 .
  • the processor 1601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 1601 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) or PLA (Programmable Logic Array).
  • the processor 1601 may also include a main processor and a coprocessor.
  • the main processor is a processor for processing data in the wake-up state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state.
  • the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 1601 may further include an AI (Artificial Intelligence) processor, where the AI processor is used to process computing operations related to machine learning.
  • Memory 1602 may include one or more computer-readable storage media, which may be non-transitory. Memory 1602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 1602 is used to store at least one piece of program code, which is executed by the processor 1601 to implement the voice signal processing methods provided by the method embodiments of the present disclosure.
  • the terminal 1600 may also optionally include: a peripheral device interface 1603 and at least one peripheral device.
  • the processor 1601, the memory 1602 and the peripheral device interface 1603 can be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1603 through a bus, a signal line or a circuit board.
  • the peripheral devices include: at least one of a radio frequency circuit 1604 , a display screen 1605 , a camera assembly 1606 , an audio circuit 1607 , a positioning assembly 1608 and a power supply 1609 .
  • the peripheral device interface 1603 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 1601 and the memory 1602 .
  • in some embodiments, the processor 1601, the memory 1602 and the peripheral device interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1601, the memory 1602 and the peripheral device interface 1603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 1604 is used for receiving and transmitting RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 1604 communicates with communication networks and other communication devices via electromagnetic signals.
  • the radio frequency circuit 1604 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • radio frequency circuitry 1604 includes an antenna system, an RF transceiver, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and the like.
  • the radio frequency circuit 1604 may communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocol includes but is not limited to: the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks.
  • the radio frequency circuit 1604 may further include a circuit related to NFC (Near Field Communication), which is not limited in the present disclosure.
  • the display screen 1605 is used for displaying UI (User Interface, user interface).
  • the UI can include graphics, text, icons, video, and any combination thereof.
  • the display screen 1605 also has the ability to acquire touch signals on or above the surface of the display screen 1605 .
  • the touch signal can be input to the processor 1601 as a control signal for processing.
  • the display screen 1605 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards.
  • in some embodiments, there is one display screen 1605, which is arranged on the front panel of the terminal 1600; in other embodiments, there are at least two display screens 1605, which are respectively arranged on different surfaces of the terminal 1600 or adopt a folded design; in still other embodiments, the display screen 1605 may be a flexible display screen disposed on a curved or folding surface of the terminal 1600. The display screen 1605 can even be set as a non-rectangular irregular figure, that is, a special-shaped screen.
  • the display screen 1605 can be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
  • the camera assembly 1606 is used to capture images or video.
  • the camera assembly 1606 includes a front camera and a rear camera.
  • the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal.
  • in some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize the background blur function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions.
  • the camera assembly 1606 may also include a flash.
  • the flash can be a single color temperature flash or a dual color temperature flash. Dual color temperature flash refers to the combination of warm light flash and cold light flash, which can be used for light compensation under different color temperatures.
  • Audio circuitry 1607 may include a microphone and speakers.
  • the microphone is used to collect the sound waves of the user and the environment, convert the sound waves into electrical signals and input them to the processor 1601 for processing, or to the radio frequency circuit 1604 to realize voice communication.
  • the microphone may also be an array microphone or an omnidirectional collection microphone.
  • the speaker is used to convert the electrical signal from the processor 1601 or the radio frequency circuit 1604 into sound waves.
  • the loudspeaker can be a traditional thin-film loudspeaker or a piezoelectric ceramic loudspeaker.
  • audio circuitry 1607 may also include a headphone jack.
  • the positioning component 1608 is used to locate the current geographic location of the terminal 1600 to implement navigation or LBS (Location Based Service).
  • the positioning component 1608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • Power supply 1609 is used to power various components in terminal 1600 .
  • the power source 1609 may be alternating current, direct current, primary batteries, or rechargeable batteries.
  • the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. Wired rechargeable batteries are batteries that are charged through wired lines, and wireless rechargeable batteries are batteries that are charged through wireless coils.
  • the rechargeable battery can also be used to support fast charging technology.
  • terminal 1600 also includes one or more sensors 1610 .
  • the one or more sensors 1610 include, but are not limited to, an acceleration sensor 1611 , a gyro sensor 1612 , a pressure sensor 1613 , a fingerprint sensor 1614 , an optical sensor 1615 , and a proximity sensor 1616 .
  • the acceleration sensor 1611 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 1600 .
  • the acceleration sensor 1611 can be used to detect the components of the gravitational acceleration on the three coordinate axes.
  • the processor 1601 can control the display screen 1605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1611 .
  • the acceleration sensor 1611 can also be used for game or user movement data collection.
  • the gyroscope sensor 1612 can detect the body direction and rotation angle of the terminal 1600 , and the gyroscope sensor 1612 can cooperate with the acceleration sensor 1611 to collect 3D actions of the user on the terminal 1600 .
  • the processor 1601 can implement the following functions according to the data collected by the gyro sensor 1612: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 1613 may be disposed on the side frame of the terminal 1600 and/or the lower layer of the display screen 1605 .
  • the processor 1601 can perform left and right hand identification or shortcut operations according to the holding signal collected by the pressure sensor 1613.
  • the processor 1601 controls the operability controls on the UI interface according to the user's pressure operation on the display screen 1605.
  • the operability controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.
  • the fingerprint sensor 1614 is used to collect the user's fingerprint, and the processor 1601 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1614, or the fingerprint sensor 1614 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 1601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings.
  • the fingerprint sensor 1614 may be disposed on the front, back, or side of the terminal 1600 . When the terminal 1600 is provided with physical buttons or a manufacturer's logo, the fingerprint sensor 1614 may be integrated with the physical buttons or the manufacturer's logo.
  • Optical sensor 1615 is used to collect ambient light intensity.
  • the processor 1601 may control the display brightness of the display screen 1605 according to the ambient light intensity collected by the optical sensor 1615 . Specifically, when the ambient light intensity is high, the display brightness of the display screen 1605 is increased; when the ambient light intensity is low, the display brightness of the display screen 1605 is decreased.
  • the processor 1601 can also dynamically adjust the shooting parameters of the camera assembly 1606 according to the ambient light intensity collected by the optical sensor 1615 .
  • the proximity sensor 1616, also called a distance sensor, is provided on the front panel of the terminal 1600.
  • the proximity sensor 1616 is used to collect the distance between the user and the front of the terminal 1600.
  • when the proximity sensor 1616 detects that the distance between the user and the front of the terminal 1600 gradually decreases, the processor 1601 controls the display screen 1605 to switch from the screen-on state to the screen-off state; when the proximity sensor 1616 detects that the distance between the user and the front of the terminal 1600 gradually increases, the processor 1601 controls the display screen 1605 to switch from the screen-off state to the screen-on state.
  • FIG. 16 does not constitute a limitation on the terminal 1600, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • FIG. 17 is a structural block diagram of a server according to an exemplary embodiment.
  • the server 1700 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPU) 1701 and one or more memories 1702, where at least one piece of program code is stored in the memory 1702, and the at least one piece of program code is loaded and executed by the processor 1701 to implement the methods provided by the above method embodiments.
  • the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing device functions, which will not be described here.
  • a non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by the processor of the electronic device, the electronic device can execute the steps performed by the terminal or the server in the above voice signal processing method.
  • the storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • a computer program product is also provided; when the instructions in the computer program product are executed by the processor of the electronic device, the electronic device can execute the steps of the above voice signal processing method performed by the terminal or the server.
  • a method for processing a speech signal comprising:
  • the local attention network is called to process the non-local speech features of each speech signal frame separately, and the mixed speech features of each speech signal frame are obtained;
  • the original speech signal is denoised according to the denoising parameters to obtain the target speech signal.
  • determining the first speech feature of multiple speech signal frames in the original speech signal includes:
  • the feature extraction network is invoked to perform feature extraction on the original amplitudes of the multiple speech signal frames, respectively, to obtain the first speech features of the multiple speech signal frames.
  • the original speech signal is denoised according to the denoising parameters to obtain the target speech signal, including:
  • the target amplitude of each speech signal frame is obtained by weighting its original amplitude with the denoising parameters, and the original phase and target amplitude of the multiple speech signal frames are combined to obtain the target speech signal (see the sketch after this list).
  • the denoising parameters are obtained based on the mixed speech features of multiple speech signal frames, including:
  • the feature reconstruction network is called to perform feature reconstruction on the mixed speech features of multiple speech signal frames, and the denoising parameters are obtained.
  • the non-local attention network further includes a second fusion unit; the first fusion unit is invoked to fuse the second speech feature and the third speech feature of each speech signal frame, respectively, to obtain the non-local speech feature of each speech signal frame;
  • the processing method also includes:
  • invoking the second fusion unit to fuse the non-local speech feature of each speech signal frame with the first speech feature, to obtain the fused non-local speech feature of each speech signal frame.
  • the second processing unit further includes a feature reduction subunit; the convolution subunit is invoked to encode the weighted-fused first speech feature of each speech signal frame to obtain the encoded feature of each speech signal frame, and the feature reduction subunit is invoked to reduce the encoded features to obtain a plurality of reduced encoded features;
  • the processing method also includes:
  • invoking the deconvolution subunit to decode the encoded features of each voice signal frame to obtain the third voice feature of each voice signal frame, including:
  • invoking the deconvolution subunit to decode the plurality of reduced encoded features to obtain the third speech feature of each speech signal frame.
  • the speech processing model includes at least a non-local attention network and a local attention network;
  • the training process of the speech processing model is as follows:
  • the speech processing model is trained based on the difference between the predicted speech signal and the sample speech signal.
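
As referenced above, the following is a minimal sketch of how the denoising parameters might be applied, assuming the denoising parameter is a per-frame, per-frequency-bin mask in [0, 1] over the original amplitudes (the disclosure describes the parameter as the proportion of each speech signal frame that is not noise; the concrete mask form and NumPy implementation are illustrative assumptions):

```python
import numpy as np

def apply_denoising(frame_spectra, denoise_mask):
    """Weight the original amplitudes by the denoising parameters, recombine
    with the untouched original phases, and invert the per-frame transform.
    frame_spectra: complex (n_frames, n_bins) spectra of the original signal.
    denoise_mask: assumed mask in [0, 1] of the same shape (an illustrative
    form of the denoising parameter)."""
    magnitude = np.abs(frame_spectra)            # original amplitude
    phase = np.angle(frame_spectra)              # original phase, kept as-is
    target_magnitude = denoise_mask * magnitude  # target amplitude
    target_spectra = target_magnitude * np.exp(1j * phase)
    return np.fft.irfft(target_spectra, axis=-1)  # time-domain target frames
```

The resulting frames would then be overlap-added to reconstruct the target speech signal.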

Abstract

The present invention relates to the technical field of voice processing, and relates to a voice signal processing method and an electronic device. The method comprises: determining a plurality of first voice features in an original voice signal, each first voice feature corresponding to one voice signal frame in the original voice signal; processing the plurality of first voice features to obtain a plurality of non-local voice features, each non-local voice feature corresponding to one voice signal frame; processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each voice signal frame; obtaining a denoising parameter on the basis of the mixed voice features of the plurality of voice signal frames; and denoising the original voice signal on the basis of the denoising parameter to obtain a target voice signal.

Description

Voice signal processing method and electronic device
The present disclosure is based on, and claims priority to, Chinese patent application No. 202110125640.5 filed on January 29, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of voice processing, and in particular to a voice signal processing method and an electronic device.
Background
A collected speech signal usually contains noise, and the presence of noise adversely affects subsequent processing of the speech signal. Noise removal therefore plays a crucial role in speech signal processing.
Summary
According to an aspect of the embodiments of the present disclosure, a voice signal processing method is provided. The method includes: determining a plurality of first voice features of an original voice signal, each first voice feature corresponding to one voice signal frame in the original voice signal; processing the plurality of first voice features to obtain a plurality of non-local voice features, each non-local voice feature corresponding to one voice signal frame and obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of the other voice signal frames; processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each voice signal frame; obtaining a denoising parameter based on the mixed voice features of the plurality of voice signal frames; and denoising the original voice signal based on the denoising parameter to obtain a target voice signal.
According to another aspect of the embodiments of the present disclosure, a voice signal processing apparatus is provided. The apparatus includes: a feature determination unit configured to determine a plurality of first voice features of an original voice signal, each first voice feature corresponding to one voice signal frame in the original voice signal; a non-local feature acquisition unit configured to process the plurality of first voice features to obtain a plurality of non-local voice features, each non-local voice feature corresponding to one voice signal frame and obtained by fusing the first voice feature of the corresponding voice signal frame with the first voice features of the other voice signal frames; a mixed feature acquisition unit configured to process the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each voice signal frame; a denoising parameter acquisition unit configured to obtain a denoising parameter based on the mixed voice features of the plurality of voice signal frames; and a target signal acquisition unit configured to denoise the original voice signal based on the denoising parameter to obtain a target voice signal.
According to yet another aspect of the embodiments of the present disclosure, an electronic device is provided. The electronic device includes one or more processors and a memory for storing instructions executable by the one or more processors, wherein the one or more processors are configured to perform the steps of the voice signal processing method described above.
According to yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. When the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the steps of the voice signal processing method described above.
According to yet another aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, performs the steps of the voice signal processing method described above.
In the method provided by the embodiments of the present disclosure, the context information of each speech signal frame is taken into account when acquiring its non-local speech feature; the non-local speech feature of each speech signal frame is then processed separately to obtain the speech feature of the frame itself, yielding a mixed speech feature. The denoising parameter obtained from this mixed speech feature is more accurate and can precisely represent the proportion of each speech signal frame that is not noise, so denoising the original speech signal with this parameter improves the denoising effect.
Description of the Drawings
Fig. 1 is a schematic diagram of a speech processing model according to an exemplary embodiment.
Fig. 2 is a schematic diagram of another speech processing model according to an exemplary embodiment.
Fig. 3 is a schematic diagram of another speech processing model according to an exemplary embodiment.
Fig. 4 is a flowchart of a method for processing a speech signal according to an exemplary embodiment.
Fig. 5 is a flowchart of another method for processing a speech signal according to an exemplary embodiment.
Fig. 6 is a schematic diagram of a non-local attention sub-model according to an exemplary embodiment.
Fig. 7 is a flowchart of a method for acquiring non-local speech features according to an exemplary embodiment.
Fig. 8 is a schematic diagram of a first processing network according to an exemplary embodiment.
Fig. 9 is a schematic diagram of a second processing network according to an exemplary embodiment.
Fig. 10 is a schematic diagram of another second processing network according to an exemplary embodiment.
Fig. 11 is a schematic diagram of a residual non-local sub-network according to an exemplary embodiment.
Fig. 12 is a schematic diagram of another non-local attention sub-model according to an exemplary embodiment.
Fig. 13 is a flowchart of another method for processing a speech signal according to an exemplary embodiment.
Fig. 14 is a block diagram of an apparatus for processing a speech signal according to an exemplary embodiment.
Fig. 15 is a block diagram of another apparatus for processing a speech signal according to an exemplary embodiment.
Fig. 16 is a structural block diagram of a terminal according to an exemplary embodiment.
Fig. 17 is a structural block diagram of a server according to an exemplary embodiment.
Detailed Description
The terms "first", "second", and the like in the description and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the disclosure described herein can be implemented in sequences other than those illustrated or described herein.
The user information involved in the present disclosure (including but not limited to user device information, user personal information, and the like) is information authorized by the user or fully authorized by all parties.
In the related art, spectral subtraction is used to denoise the speech signal: the silent segments of the speech signal are located, a noise signal is extracted from the silent segments, and the noise is removed by subtracting the noise signal from the speech signal. However, when the noise in the speech signal changes, spectral subtraction has difficulty removing it and the denoising effect is poor.
The voice signal processing method provided by the embodiments of the present disclosure can be applied in a variety of scenarios.
For example, it can be applied in a live streaming scenario.
During a live broadcast, the anchor's voice signal collected by the anchor terminal may contain noise; if the viewer terminal plays this voice signal directly, the noise makes the voice unclear and degrades the viewing experience. In this case, the method provided by the embodiments of the present disclosure can remove the noise from the voice signal and improve its quality, so that the viewer terminal plays a clear voice signal and the live broadcast effect is improved.
For another example, it can be applied in an automatic speech recognition scenario.
During speech recognition, a noise signal in the speech signal affects recognition, lowering recognition accuracy and making it difficult to accurately recognize the content of the speech. In this case, the method provided by the embodiments of the present disclosure can first denoise the speech signal and then recognize the denoised signal, improving the accuracy of speech recognition.
The methods provided by the embodiments of the present disclosure can also be applied in scenarios such as video playback, language identification, speech synthesis and identity recognition.
Fig. 1 is a schematic diagram of a speech processing model provided according to an exemplary embodiment. The speech processing model includes a non-local attention network 101 and a local attention network 102, which are connected. The non-local attention network 101 processes the first speech features of the input original speech signal to obtain the non-local speech features of the original speech signal, and the local attention network 102 further processes the non-local speech features to obtain the mixed speech features of the original speech signal.
In some embodiments, referring to Fig. 2, the speech processing model further includes a feature extraction network 103, a feature reconstruction network 104 and a speech denoising network 105. The feature extraction network 103 is connected to the non-local attention network 101, the feature reconstruction network 104 is connected to the local attention network 102, and the speech denoising network 105 is connected to the feature reconstruction network 104. The feature extraction network 103 extracts the first speech features of the original speech signal, the feature reconstruction network 104 performs feature reconstruction on the processed mixed speech features to obtain the denoising parameters of the original speech signal, and the speech denoising network 105 denoises the original speech signal.
In some embodiments, the speech processing model includes multiple non-local attention networks 101 and multiple local attention networks 102, which can be connected in sequence in any order. For example, referring to Fig. 3, the speech processing model includes two non-local attention networks 101 and two local attention networks 102: the feature extraction network 103 is connected to the first non-local attention network 101, which is connected to the first local attention network 102; the first local attention network 102 is connected to the second local attention network 102, which is connected to the second non-local attention network 101; and the second non-local attention network 101 is connected to the feature reconstruction network 104.
In the embodiments of the present disclosure, the non-local attention network may be called the non-local attention sub-model, the local attention network the local attention sub-model, the feature extraction network the feature extraction sub-model, the feature reconstruction network the feature recognition sub-model, and the speech denoising network the speech denoising sub-model.
The voice signal processing method provided by the embodiments of the present disclosure is performed by an electronic device, which is a terminal or a server. The terminal is a portable, pocket-sized, handheld or other type of terminal, such as a mobile phone, a computer or a tablet computer. The server is a single server, a server cluster composed of several servers, or a cloud computing service center.
Fig. 4 is a flowchart of a method for processing a speech signal according to an exemplary embodiment. Referring to Fig. 4, the method is executed by an electronic device and includes the following steps:
401. Determine multiple first voice features of the original voice signal, each first voice feature corresponding to one voice signal frame in the original voice signal.
402. Call the non-local attention sub-model to fuse the multiple first voice features to obtain multiple non-local voice features, each non-local voice feature corresponding to one voice signal frame and obtained by fusing the first voice feature of the corresponding voice signal frame with the first voice features of the other voice signal frames.
403. Call the local attention sub-model to separately process the non-local voice feature of each voice signal frame in the original voice signal to obtain the mixed voice feature of each voice signal frame.
404. Obtain a denoising parameter based on the mixed voice features of the multiple voice signal frames.
405. Denoise the original voice signal based on the denoising parameter to obtain the target voice signal.
In the method provided by the embodiments of the present disclosure, the non-local attention sub-model is called to obtain the non-local voice feature of each voice signal frame, taking the frame's context information into account; the local attention sub-model is then called to process the non-local voice feature of each frame separately, obtaining the voice feature of the frame itself and thus a mixed voice feature. The denoising parameter obtained from this mixed voice feature is more accurate and can precisely represent the proportion of each voice signal frame that is not noise, so denoising the original voice signal with this parameter improves the denoising effect.
Fig. 5 is a flowchart of another method for processing a speech signal according to an exemplary embodiment. Referring to Fig. 5, the method is executed by an electronic device and includes the following steps:
501. The electronic device acquires the original amplitude and original phase of multiple speech signal frames in the original speech signal.
Since a speech signal consists of amplitude and phase, and the noise in a speech signal is contained in the amplitude, the embodiments of the present disclosure acquire the original amplitude and original phase of each speech signal frame in the original speech signal and denoise only the original amplitude, thereby denoising the original speech signal without processing the original phase, which reduces the amount of processing. The original speech signal is collected by the electronic device, or is a noise-containing speech signal sent to it by another electronic device; the noise signal is, for example, environmental noise, white noise or another type of noise.
The original speech signal includes multiple speech signal frames. The electronic device performs a Fourier transform on each speech signal frame to obtain its original amplitude and original phase, and subsequently processes the original amplitude of each frame to denoise it. The Fourier transform includes the fast Fourier transform, the short-time Fourier transform, and the like.
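As a concrete illustration of step 501, the following is a minimal sketch of per-frame amplitude and phase extraction, assuming a Hann window and illustrative frame and hop sizes (the disclosure does not fix these values):

```python
import numpy as np

def frame_magnitude_phase(signal, frame_len=512, hop=256):
    """Split a waveform into frames and take each frame's Fourier transform,
    returning the original amplitude (magnitude) and original phase per frame.
    frame_len and hop are illustrative assumptions."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=-1)  # per-frame Fourier transform
    return np.abs(spectra), np.angle(spectra)
```

Only the magnitudes would then be passed on for feature extraction; the phases are kept unchanged for later reconstruction.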
In some embodiments, since the signal length that the speech processing model can process at a time is limited (for example, one minute or two minutes of speech at a time), the signal length of the original speech signal cannot exceed a reference signal length, that is, the duration of the original speech signal cannot exceed a reference duration, where the reference signal length is any length and the reference duration is any duration. For example, the reference signal length is 64 speech signal frames.
502. The electronic device invokes the feature extraction sub-model to perform feature extraction on the original amplitudes of the multiple speech signal frames, respectively, to obtain the first speech features of the multiple speech signal frames.
That is, the electronic device invokes the feature extraction sub-model to perform feature extraction on the original amplitude of each speech signal frame, obtaining the first speech feature of each frame, that is, the multiple first speech features of the original speech signal.
The first speech feature of a speech signal frame describes the corresponding frame and is represented as a vector, a matrix or another form. Optionally, the first speech features of the multiple speech signal frames are represented separately, or combined together: for example, if the first speech feature of each frame is a vector, the multiple vectors are combined into a matrix in which each column represents the first speech feature of one speech signal frame.
In some embodiments, the feature extraction sub-model includes a convolution layer, a batch normalization layer and an activation function layer.
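As a rough sketch of such a feature extraction sub-model, the block below stacks one convolution, batch normalization and activation layer, as described above; the channel counts, kernel size and the choice of ReLU are assumptions:

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of the feature extraction sub-model: convolution, batch
    normalization and activation. Input is a batch of original amplitudes
    shaped (batch, 1, frames, freq_bins); output is the first speech features."""
    def __init__(self, in_ch=1, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )

    def forward(self, magnitudes):
        return self.net(magnitudes)  # one feature map column per signal frame
```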
503. The electronic device invokes the non-local attention sub-model to fuse the first speech features of the multiple speech signal frames to obtain the non-local speech feature of each speech signal frame.
That is, the electronic device invokes the non-local attention sub-model to process the multiple first speech features to obtain multiple non-local speech features. Each non-local speech feature corresponds to one speech signal frame and is obtained by fusing the first speech feature of the corresponding frame with the first speech features of the other speech signal frames. In other words, the non-local speech feature of each frame combines the first speech features of multiple frames, taking into account the features of the frames before and after it.
In the embodiments of the present disclosure, the non-local attention sub-model uses an attention mechanism and residual learning to process the first speech features. When the first speech feature of each speech signal frame is processed, the frame's context information is considered, making the resulting non-local speech feature more accurate. Because some speech features are lost when the first speech feature is processed, residual learning combines the processed result with the input first speech feature to obtain the non-local speech feature, avoiding the loss of important speech features in the process.
In some embodiments, referring to Fig. 6, the non-local attention sub-model includes a first processing unit, a second processing unit, a first fusion unit and a second fusion unit. The first processing unit may be called the first processing network, the second processing unit the second processing network, the first fusion unit the first fusion network, and the second fusion unit the second fusion network; the first processing network is a trunk branch (Trunk Branch) and the second processing network is a mask branch (Mask Branch). The first processing network and the second processing network separately process the first speech features of the input speech signal frames, the first fusion network fuses the features produced by the two processing networks, and the second fusion network fuses the output of the first fusion network with the features input to the non-local attention sub-model.
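A minimal sketch of this trunk/mask layout follows. The disclosure states only that the first fusion network fuses the two branch outputs and the second fusion network fuses that result with the block input; the elementwise product for the first fusion and the addition for the second are assumptions in the spirit of attention-plus-residual designs:

```python
import torch.nn as nn

class NonLocalAttentionBlock(nn.Module):
    """Sketch of the non-local attention sub-model's overall layout."""
    def __init__(self, trunk: nn.Module, mask: nn.Module):
        super().__init__()
        self.trunk = trunk  # first processing network (Trunk Branch)
        self.mask = mask    # second processing network (Mask Branch)

    def forward(self, x):
        fused = self.trunk(x) * self.mask(x)  # first fusion network (assumed product)
        return fused + x                      # second fusion network (residual add)
```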
The process by which the electronic device invokes the non-local attention sub-model to process the first speech feature of each speech signal frame is shown in Fig. 7 and includes the following steps:
701. The electronic device invokes the first processing network to perform feature extraction on the first speech features of the multiple speech signal frames, respectively, to obtain the second speech feature of each speech signal frame.
That is, the first processing network is invoked to perform feature extraction on the first speech feature of each speech signal frame to obtain its second speech feature. The second speech feature is obtained by further extraction from the first speech feature and contains fewer noise features than the first speech feature.
In some embodiments, referring to Fig. 8, the first processing network includes multiple atrous residual sub-units (Res. Unit), which may be called atrous residual sub-networks; Fig. 8 only takes two atrous residual sub-networks as an example. Each atrous residual sub-network includes an atrous (dilated) convolution layer, a batch normalization layer and an activation function layer, and the multiple atrous residual sub-networks are connected using the network structure of a residual learning network. The atrous convolution layer enlarges the receptive field and captures more context information.
In some embodiments, the non-local attention sub-model further includes at least one atrous residual unit, which may be called an atrous residual network. Each atrous residual network includes two atrous residual sub-networks connected using the network structure of a residual learning network. Before invoking the first processing network and the second processing network to process the first speech feature of each speech signal frame, the electronic device first invokes the at least one atrous residual network to perform feature extraction on the first speech feature of each frame, obtaining a further-extracted first speech feature, which the first processing network and the second processing network then process. Invoking the first processing network, which includes multiple atrous residual sub-networks, further extracts the first speech features to obtain deeper speech features.
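A sketch of a single atrous residual sub-network follows, assuming 2D features (frames × frequency bins), a 3×3 kernel and a dilation rate of 2 (all illustrative choices):

```python
import torch.nn as nn

class AtrousResidualUnit(nn.Module):
    """Sketch of one atrous residual sub-network: dilated convolution,
    batch normalization and activation, wrapped in a residual skip."""
    def __init__(self, channels=64, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            # The dilated convolution enlarges the receptive field, so each
            # frame's feature sees more of its context.
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection preserves the input features
```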
702、电子设备调用第二处理网络将每个语音信号帧的第一语音特征分别与其他语音信号帧的第一语音特征进行融合,得到每个语音信号帧的第三语音特征。702. The electronic device invokes the second processing network to fuse the first voice features of each voice signal frame with the first voice features of other voice signal frames respectively to obtain a third voice feature of each voice signal frame.
对于每个语音信号帧,电子设备调用第二处理网络,将该语音信号帧的第一语音特征分别与除该语音信号帧之外的其他语音信号帧的第一语音特征进行融合,得到该语音信号帧的第三语音特征。其中,每个语音信号帧的第三语音特征是结合了其他语音信号帧的第一语音特征得到的。For each voice signal frame, the electronic device invokes the second processing network, and fuses the first voice feature of the voice signal frame with the first voice features of other voice signal frames except the voice signal frame to obtain the voice The third speech feature of the signal frame. The third voice feature of each voice signal frame is obtained by combining the first voice features of other voice signal frames.
在一些实施例中,参见图9,第二处理网络包括残差非局部子单元、卷积子单元和反卷积子单元。其中,残差非局部子单元可称为残差非局部子网络,卷积子单元可称为卷积子网络,反卷积子单元可称为反卷积子网络。电子设备调用残差非局部子网络,基于多个语音信号帧的权重,将每个语音信号帧的第一语音特征分别与其他语音信号帧的第一语音特征进行加权融合,得到每个语音信号帧加权融合后的第一语音特征;调用卷积子网络,对每个语音信号帧加权融合后的第一语音特征进行编码,得到每个语音信号帧的编码特征;调用反卷积子网络,对每个语音信号帧的编码特征进行解码,得到每个语音信号帧的第三语音特征。也即是,对于每个语音信号帧,调用残差非局部子网络,基于多个语音信号帧的权重,将该语音信号帧的第一语音特征分别与除该语音信号帧之外的其他语音信号帧的第一语音特征进行加权融合,得到该语音信号帧加权融合后的第一语音特征。In some embodiments, referring to Figure 9, the second processing network includes a residual non-local subunit, a convolution subunit, and a deconvolution subunit. Among them, the residual non-local sub-unit may be referred to as a residual non-local sub-network, the convolution sub-unit may be referred to as a convolution sub-network, and the deconvolution sub-unit may be referred to as a de-convolution sub-network. The electronic device calls the residual non-local sub-network, and based on the weights of multiple voice signal frames, weighted fusion of the first voice feature of each voice signal frame and the first voice features of other voice signal frames respectively, to obtain each voice signal The first speech feature after frame weighted fusion; the convolution sub-network is called to encode the first speech feature after weighted fusion of each speech signal frame, and the encoded feature of each speech signal frame is obtained; the deconvolution sub-network is called, Decoding the encoded features of each voice signal frame to obtain a third voice feature of each voice signal frame. That is, for each speech signal frame, the residual non-local sub-network is called, and based on the weights of multiple speech signal frames, the first speech feature of the speech signal frame is respectively compared with other speech features except the speech signal frame. The first speech features of the signal frame are weighted and fused to obtain the weighted and fused first speech features of the speech signal frame.
在一些实施例中,参见图10,第二处理网络还包括多个特征缩小子单元、多个第一空洞残差子单元、多个第二空洞残差子单元和激活函数子单元,其中,特征缩小子单元可称为特征缩小子网络,第一空洞残差子单元可称为第一空洞残差子网络,第二空洞残差子单元可称为第二空洞残差子网络,激活函数子单元可称为激活函数子网络。残差非局部子网络与第一个第一空洞残差子网络连接,多个第一空洞残差子网络依次连接,最后一个空洞残差子网络与卷积子网络连接,卷积子网络与第一个特征缩小子网络连接,多个特征缩小子网络依次连接,最后一个特征缩小子网络与反卷积子网络连接,反卷积子网络与第一个第二空洞残差子网络连接,多个第二空洞残差子网络依次连接,最后一个空洞残差子网络与激活函数子网络连接。另外,图10仅是以两个第一空洞残差子网络、两个第二空洞残差子网络和两个特征缩小子网络为例,第一空洞残差子单、第二空洞残差子网络和特征缩小子网络还可以是其他数量。In some embodiments, referring to FIG. 10 , the second processing network further includes a plurality of feature reduction subunits, a plurality of first hole residual subunits, a plurality of second hole residual subunits, and an activation function subunit, wherein, The feature reduction sub-unit may be called the feature reduction sub-network, the first hole residual sub-unit may be called the first hole residual sub-network, the second hole residual sub-unit may be called the second hole residual sub-network, and the activation function The subunits may be referred to as activation function sub-networks. The residual non-local sub-network is connected to the first first hole residual sub-network, multiple first hole residual sub-networks are connected in turn, and the last hole residual sub-network is connected to the convolution sub-network, and the convolution sub-network is connected to the convolution sub-network. The first feature reduction sub-network is connected, multiple feature reduction sub-networks are connected in turn, the last feature reduction sub-network is connected with the deconvolution sub-network, and the deconvolution sub-network is connected with the first and second hole residual sub-network, A plurality of second hole residual sub-networks are connected in sequence, and the last hole residual sub-network is connected with the activation function sub-network. In addition, Figure 10 only takes two first hole residual sub-networks, two second hole residual sub-networks and two feature reduction sub-networks as examples. Other numbers of networks and feature reduction sub-networks are also possible.
The activation function in the activation function sub-network is a Sigmoid function or another activation function. Optionally, the first dilated residual sub-networks are identical to the second dilated residual sub-networks, or the two differ; each dilated residual sub-network includes a dilated (atrous) convolutional layer, a batch normalization layer, and an activation function layer. Optionally, the feature reduction sub-network is itself a kind of dilated residual sub-network.
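As a concrete illustration, the following is a minimal PyTorch sketch of one dilated residual sub-network as just described (a dilated convolution, batch normalization, and an activation, wrapped in a residual connection); the kernel size and the choice of PReLU are assumptions made for illustration, not fixed by the disclosure.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """One dilated residual sub-network: dilated conv + batch norm + activation,
    with a residual (skip) connection around it."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # padding = dilation keeps the time-frequency shape unchanged for kernel 3
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T frames, K frequency bins)
        return x + self.act(self.bn(self.conv(x)))
```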
In some embodiments, the electronic device invokes the multiple first dilated residual sub-networks to process the weighted-fused first speech feature of each speech signal frame, obtaining a further-processed first speech feature of each speech signal frame; invokes the convolution sub-network to encode the further-processed first speech feature of each speech signal frame, obtaining the encoded feature of each speech signal frame; invokes the multiple feature reduction sub-networks to perform feature reduction on the encoded feature of each speech signal frame, obtaining multiple reduced encoded features; invokes the deconvolution layer to decode the multiple reduced encoded features, obtaining the decoded speech feature of each speech signal frame; and invokes the multiple second dilated residual sub-networks to process the decoded speech feature of each speech signal frame, obtaining the third speech feature of each speech signal frame. Reducing the encoded features shrinks them, which cuts the amount of computation and speeds up their subsequent processing.
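The encode / reduce / decode path can be sketched as below, reusing the imports and DilatedResidualBlock from the previous sketch; the strides, channel count, and use of strided convolutions for feature reduction are illustrative assumptions (the disclosure does not fix them), chosen only to show how the reduction shrinks the encoded features before the deconvolution restores their resolution. The residual non-local stage that precedes this path is omitted here.

```python
class EncodeReduceDecode(nn.Module):
    """Sketch of the dilated-residual / conv / feature-reduction / deconv /
    dilated-residual / activation path of the second processing network."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.pre = nn.Sequential(DilatedResidualBlock(channels, 1),
                                 DilatedResidualBlock(channels, 2))   # first dilated residual sub-networks
        self.encode = nn.Conv2d(channels, channels, 3, padding=1)     # convolution sub-network
        # feature reduction: two strided convs halve the feature map twice
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.PReLU())
        # deconvolution sub-network restores the original resolution
        # (assumes T and K are divisible by 4)
        self.decode = nn.ConvTranspose2d(channels, channels, 4, stride=4)
        self.post = nn.Sequential(DilatedResidualBlock(channels, 1),
                                  DilatedResidualBlock(channels, 2),  # second dilated residual sub-networks
                                  nn.Sigmoid())                       # activation function sub-network

    def forward(self, x):
        x = self.pre(x)
        x = self.reduce(self.encode(x))
        x = self.decode(x)
        return self.post(x)
```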
In some embodiments, the residual non-local sub-network includes a first fusion layer and a second fusion layer. The electronic device invokes the first fusion layer to perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of each speech signal frame with the first speech features of the speech signal frames other than that frame, obtaining the fusion feature of each speech signal frame; and invokes the second fusion layer to fuse, for each speech signal frame, the first speech feature with the fusion feature, obtaining the weighted-fused first speech feature of each speech signal frame. That is, for each speech signal frame, the first fusion layer is invoked to perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of that frame with the first speech features of the other speech signal frames, obtaining the fusion feature of that frame.
In the embodiments of the present disclosure, the first fusion layer is invoked to fuse the first speech features of different speech signal frames according to their corresponding weights, yielding a more accurate fusion feature. Moreover, when both the first fusion layer and the second fusion layer are included, the residual non-local sub-network is a residual learning network: fusing the fusion feature with the input first speech feature makes the final weighted-fused first speech feature more accurate and prevents the fusion feature from losing important information, improving the accuracy of the weighted-fused first speech feature. In addition, a residual learning network is easier to optimize, which improves training efficiency during model training.
In some embodiments, referring to Figure 11, which illustrates the processing of three speech signal frames as an example, the residual non-local sub-network further includes multiple convolutional layers, a third fusion layer, and a normalization layer. The third fusion layer is connected to two of the convolutional layers and fuses the first speech features processed by those two layers; the third fusion layer is connected to the normalization layer, which normalizes the fused speech features output by the third fusion layer; the normalization layer is connected to the first fusion layer, which fuses the first speech features processed by another convolutional layer with the normalized speech features output by the normalization layer, obtaining the fusion feature of each speech signal frame. The fusion feature is then processed by a further convolutional layer and fused with the first speech feature, obtaining the weighted-fused first speech feature.
In some embodiments, the first fusion layer and the third fusion layer fuse speech features by matrix multiplication, and the second fusion layer fuses speech features by matrix addition. Optionally, for each speech signal frame, the first speech feature of the frame has the form T*K*C, representing the speech feature C corresponding to time T and frequency K; in order to multiply or add the speech features of different speech signal frames, the speech features must first be reshaped.
For example, the residual non-local sub-network processes the first speech feature of speech signal frame x_i with the following formula:
o_i = W_z y_i + x_i = W_z softmax((W_u x_i)^T (W_v x_j)) (W_g x_j) + x_i;
where o_i denotes the weighted-fused first speech feature of speech signal frame x_i; W_z, W_u, W_v, and W_g are known model parameters; softmax denotes normalization; x_j denotes the speech signal frames other than x_i; and y_i denotes the fusion feature of speech signal frame x_i.
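A minimal PyTorch sketch of this residual non-local computation follows (an embedded-Gaussian non-local block applied jointly over all T*K positions of a spectrogram-like feature map). Parameterizing W_u, W_v, W_g, and W_z as 1x1 convolutions, and the inner channel width, are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualNonLocal(nn.Module):
    """Sketch of o_i = W_z softmax((W_u x_i)^T W_v x_j)(W_g x_j) + x_i."""
    def __init__(self, channels: int, inner: int):
        super().__init__()
        self.w_u = nn.Conv2d(channels, inner, 1)  # W_u
        self.w_v = nn.Conv2d(channels, inner, 1)  # W_v
        self.w_g = nn.Conv2d(channels, inner, 1)  # W_g
        self.w_z = nn.Conv2d(inner, channels, 1)  # W_z

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, k = x.shape
        n = t * k
        u = self.w_u(x).flatten(2).transpose(1, 2)   # (b, n, inner)
        v = self.w_v(x).flatten(2)                   # (b, inner, n)
        g = self.w_g(x).flatten(2).transpose(1, 2)   # (b, n, inner)
        attn = F.softmax(u @ v, dim=-1)              # pairwise weights across positions
        y = (attn @ g).transpose(1, 2).reshape(b, -1, t, k)  # fusion feature y_i
        return self.w_z(y) + x                       # second fusion layer: residual add
```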
703. The electronic device invokes the first fusion network to fuse, for each speech signal frame, the second speech feature and the third speech feature, obtaining the non-local speech feature of each speech signal frame.
In some embodiments, the first fusion network is a multiplication unit; that is, the second speech feature and the third speech feature of each speech signal frame are multiplied to obtain the fused non-local speech feature.
704. The electronic device invokes the second fusion network to fuse the non-local speech feature of each speech signal frame with the first speech feature, obtaining the fused non-local speech feature of each speech signal frame.
In some embodiments, the second fusion network is an addition unit; that is, the electronic device adds the non-local speech feature of each speech signal frame to the first speech feature, obtaining the fused non-local speech feature of each speech signal frame.
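In code, these two fusion networks reduce to an elementwise multiply followed by a residual add; the sketch below assumes the two branch outputs have the same shape as the first speech feature.

```python
def fuse_branches(first_feat, second_feat, third_feat):
    """First fusion network (step 703): elementwise multiplication of the two
    branch outputs; second fusion network (step 704): residual addition of the
    input first speech feature."""
    non_local = second_feat * third_feat   # multiplication unit
    return non_local + first_feat          # addition unit (residual)
```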
In the embodiment shown in Figure 7, different networks of the non-local attention sub-model process the first speech features in different respects. The first processing network, which includes multiple dilated residual sub-networks, further extracts the first speech features to obtain deeper speech features, while the second processing network uses a non-local attention mechanism: when processing the first speech feature of each speech signal frame, it takes into account the speech signal frames other than that frame, i.e., it combines context information to obtain more accurate speech features. The first fusion network is then invoked to fuse the speech features obtained by the two processing networks, yielding the non-local speech features. In addition, the dilated residual sub-networks enlarge the receptive field, which also captures more context information.
Moreover, when the non-local attention sub-model includes the second fusion network, the sub-model is a residual learning network: after the non-local speech features are obtained, they are fused with the input first speech features, which makes the final non-local speech features more accurate, prevents them from losing important information, and improves their accuracy. In addition, a residual learning network is easier to optimize, which improves training efficiency during model training.
In addition, in some embodiments, referring to Figure 12, the non-local attention sub-model includes multiple dilated residual units, which may be referred to as dilated residual networks. The electronic device first invokes the multiple dilated residual networks to process the input first speech feature of each speech signal frame, then feeds the processed first speech features into the first processing network and the second processing network. Likewise, after the non-local speech features are obtained through the second fusion network, the multiple dilated residual networks are invoked to process them, and the processed non-local speech features are fed into the subsequent local attention sub-model. Figure 12 merely takes four dilated residual networks as an example.
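Putting the pieces together, the sketch below assembles an illustrative non-local attention sub-model from the blocks defined above; the branch compositions and block counts are assumptions that mirror the description, not the exact configuration of Figures 7 and 12.

```python
class NonLocalAttentionSubModel(nn.Module):
    """Illustrative assembly: dilated residual pre-processing, a first
    (feature-extraction) branch and a second (non-local) branch, multiplicative
    fusion, residual addition, and dilated residual post-processing."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.pre = nn.Sequential(DilatedResidualBlock(channels, 1),
                                 DilatedResidualBlock(channels, 2))
        self.branch1 = nn.Sequential(DilatedResidualBlock(channels, 1),
                                     DilatedResidualBlock(channels, 2))  # first processing network
        self.branch2 = ResidualNonLocal(channels, channels // 2)         # second processing network (simplified)
        self.post = nn.Sequential(DilatedResidualBlock(channels, 1),
                                  DilatedResidualBlock(channels, 2))

    def forward(self, x):
        x = self.pre(x)
        fused = self.branch1(x) * self.branch2(x)  # first fusion network
        return self.post(fused + x)                # second fusion network
```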
504. The electronic device invokes the local attention sub-model to process the non-local speech features of each speech signal frame separately, obtaining the mixed speech feature of each speech signal frame.
The mixed speech features no longer contain noise features, and the mixed speech feature of each speech signal frame is obtained after the speech features of the other speech signal frames have been taken into account, making it more accurate.
In the embodiments of the present disclosure, the network structure of the local attention sub-model is similar to that of the non-local attention sub-model, except that the local attention sub-model does not include the residual non-local sub-network; the network structure of the local attention sub-model is therefore not described again here.
It should be noted that the embodiments of the present disclosure are described with one non-local attention sub-model and one local attention sub-model as an example. Another embodiment includes multiple non-local attention sub-models and multiple local attention sub-models; that is, after the mixed speech features are obtained, they can be fed into a subsequent non-local attention sub-model or local attention sub-model for further processing, so as to obtain more accurate mixed speech features.
In the embodiments of the present disclosure, for each speech signal frame, the electronic device invokes the local attention sub-model to process the non-local speech feature of that frame on its own; during this processing, the non-local speech features of the speech signal frames other than that frame, i.e., the context information of the frame, are no longer considered, so the processing captures the speech features of the frame itself. Since the context information of the frame was already taken into account when the non-local speech features were obtained, the resulting mixed speech feature reflects both the speech characteristics of the frame within the whole speech signal and the speech characteristics of the frame itself.
505. The electronic device invokes the feature recognition sub-model to perform feature recognition on the mixed speech features of the multiple speech signal frames, obtaining the denoising parameters.
The feature recognition sub-model performs feature recognition on the mixed speech features of the multiple speech signal frames. For the mixed speech feature of each speech signal frame, the sub-model can identify, from the mixed speech feature, the ratio between the noise signal in the corresponding speech signal frame and the speech signal other than the noise signal. Identifying the multiple mixed speech features separately yields the denoising parameters corresponding to the multiple speech signal frames, i.e., the denoising parameters corresponding to the original speech signal. A denoising parameter represents the proportion of the speech signal other than the noise signal within a speech signal frame, and can subsequently be used to denoise the original speech signal. Optionally, the denoising parameters are expressed as a matrix, in which each element represents the denoising parameter of one speech signal frame, or one column or one row of elements represents the denoising parameters of one speech signal frame. The feature recognition sub-model is a convolutional network or another type of network.
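Since the denoising parameters are bounded ratios, one natural realization of the feature recognition sub-model is a convolution followed by a Sigmoid that emits one value in [0, 1] per time-frequency bin; the following is a sketch under that assumption, not the configuration fixed by the disclosure.

```python
class FeatureRecognition(nn.Module):
    """Sketch: map mixed speech features to denoising parameters (a mask)
    in [0, 1], one value per time-frequency bin."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, mixed_feat: torch.Tensor) -> torch.Tensor:
        # mixed_feat: (batch, channels, T, K) -> mask: (batch, 1, T, K)
        return torch.sigmoid(self.conv(mixed_feat))
```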
506. The electronic device invokes the speech denoising sub-model to denoise the original amplitudes of the multiple speech signal frames according to the denoising parameters, obtaining the target amplitudes of the multiple speech signal frames.
In some embodiments, the speech denoising sub-model is a multiplication network: the denoising parameters are multiplied by the multiple original amplitudes to obtain the target amplitudes of the multiple speech signal frames, and the target amplitudes no longer contain the noise signal. Optionally, when the denoising parameters form a matrix, each element of the matrix is multiplied by the original amplitude of the corresponding speech signal frame, or one column or one row of elements of the matrix is multiplied by the original amplitude of the corresponding speech signal frame.
507. The electronic device combines the original phases and the target amplitudes of the multiple speech signal frames to obtain the target speech signal.
In some embodiments, the electronic device performs an inverse Fourier transform on the original phases and the target amplitudes of the multiple speech signal frames to obtain the target speech signal, i.e., the speech signal after the noise signal has been removed.
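Steps 506 and 507 amount to magnitude masking followed by inverse STFT reconstruction with the original phase. The sketch below uses torch.stft/torch.istft with an assumed frame configuration; `model` is a placeholder assumed to map the magnitude spectrogram to a same-shaped mask of denoising parameters.

```python
import torch

def denoise_waveform(wav: torch.Tensor, model, n_fft: int = 512, hop: int = 128):
    """Sketch of steps 506-507: mask the magnitudes, keep the phases,
    and reconstruct the waveform (frame sizes are assumed)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft, hop_length=hop, window=window,
                      return_complex=True)           # (freq bins, frames)
    magnitude, phase = spec.abs(), spec.angle()      # original amplitude / phase
    mask = model(magnitude)                          # denoising parameters in [0, 1]
    target_mag = mask * magnitude                    # step 506: multiplication network
    target_spec = torch.polar(target_mag, phase)     # recombine with original phase
    return torch.istft(target_spec, n_fft, hop_length=hop, window=window)  # step 507
```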
This way of denoising the original amplitudes of the speech signal frames only needs to process the amplitudes of the speech signal, not the phases, which reduces the features to be processed and increases the processing speed.
With the method provided by the embodiments of the present disclosure, the non-local attention sub-model is invoked to obtain the non-local speech feature of each speech signal frame, taking the context information of the frame into account; the local attention sub-model is then invoked to process the non-local speech feature of each frame separately, capturing the speech features of the frame itself and thereby yielding speech features in a mixed form. The denoising parameters obtained from these mixed-form speech features are more accurate, accurately representing the proportion of the signal other than the noise signal in each speech signal frame; denoising the original speech signal with these parameters therefore improves the denoising effect on the original speech signal.
Furthermore, since the noise signal of a speech signal frame resides in the frame's original amplitude, feature extraction is performed on the original amplitude of each speech signal frame, and the original amplitudes are denoised according to the obtained denoising parameters, yielding target amplitudes that no longer contain the noise signal. This denoises the original amplitudes of the original speech signal; the target speech signal without the noise signal can then be recovered from the target amplitudes and the original phases, achieving denoising of the original speech signal. This way of denoising only needs to process the amplitudes of the speech signal, not the phases, which reduces the features to be processed.
In addition, before the speech processing model is invoked to process the original speech signal, it needs to be trained. The training process is as follows: obtain a sample speech signal and a sample noise signal; mix the sample speech signal with the sample noise signal to obtain a sample mixed signal; invoke the speech processing model to process the multiple sample speech signal frames in the sample mixed signal, obtaining the predicted denoising parameters corresponding to the sample mixed signal; denoise the sample mixed signal based on the predicted denoising parameters, obtaining a denoised predicted speech signal; and train the speech processing model based on the difference between the predicted speech signal and the sample speech signal. The sample speech signal is a clean speech signal that contains no noise signal. Moreover, since the speech processing model adopts the network structure of a residual learning network, the training speed of the model is improved during training.
For example, sample speech signals of multiple users are obtained from a speech database, and various sample noise signals are obtained from a noise database; the sample noise signals are mixed with the sample speech signals at different signal-to-noise ratios to obtain multiple sample mixed signals, which are then used to train the speech processing model.
In some embodiments, the sample amplitudes of the multiple sample speech signal frames in the sample mixed signal are obtained, and the speech processing model is invoked to process the multiple sample amplitudes, obtaining the predicted denoising parameters corresponding to the sample mixed signal; the sample amplitudes are denoised based on the predicted denoising parameters, obtaining the predicted amplitude of each speech signal frame, and the speech processing model is trained based on the difference between the predicted amplitude of each speech signal frame and the amplitudes of the corresponding speech signal frames in the sample speech signal.
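A minimal training-step sketch under these assumptions follows: the clean sample and noise are mixed at a chosen SNR, the model predicts a magnitude mask, and the loss is the difference between predicted and clean magnitudes. The SNR mixing rule, the MSE loss, and the STFT settings are illustrative choices, not specified by the disclosure; the noise is assumed to be at least as long as the speech.

```python
import torch
import torch.nn.functional as F

def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float):
    """Scale the noise so the mixture has the requested SNR (assumed rule)."""
    noise = noise[: speech.numel()]
    scale = (speech.pow(2).mean() /
             (noise.pow(2).mean() * 10 ** (snr_db / 10))).sqrt()
    return speech + scale * noise

def train_step(model, optimizer, speech, noise, snr_db, n_fft=512, hop=128):
    window = torch.hann_window(n_fft)
    mixed = mix_at_snr(speech, noise, snr_db)
    noisy_mag = torch.stft(mixed, n_fft, hop_length=hop, window=window,
                           return_complex=True).abs()   # sample amplitudes
    clean_mag = torch.stft(speech, n_fft, hop_length=hop, window=window,
                           return_complex=True).abs()
    predicted_mag = model(noisy_mag) * noisy_mag        # apply predicted denoising parameters
    loss = F.mse_loss(predicted_mag, clean_mag)         # difference to the clean amplitudes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```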
For example, when training the speech processing model, the convolution kernels, filters, and convolution parameters of the convolutional layers in the speech processing model are set as shown in Table 1 below:
Table 1 (provided as image PCTCN2021116212-appb-000001 in the original publication; its contents are not reproduced here)
In Table 1, Conv. denotes the feature extraction sub-model or the feature recognition sub-model, RNAM denotes the non-local attention sub-model, RAM denotes the local attention sub-model, Res.Unit denotes a dilated residual network or dilated residual sub-network, Conv denotes the convolution sub-network, Deconv denotes the deconvolution sub-network, and NL Unit denotes the residual non-local sub-network.
In addition, in some embodiments, the Wiener filtering method, the SEGAN (Speech Enhancement Generative Adversarial Network) method, the Wavelnet method, the MMSE-GAN method (a speech enhancement generative adversarial network), the DFL (Deep Feature Loss) method, MDPhD (a hybrid model), and the RSGAN-GP (Speech Enhancement using Relativistic Generative Adversarial Networks with Gradient Penalty) method are taken as reference methods, and these methods are compared with the method provided by the embodiments of the present disclosure (RNANet).
The comparison results between the above reference methods and the method provided by the embodiments of the present disclosure are shown in Table 2 below:
Table 2

| Method   | SSNR  | PESQ | CSIG | CBAK | COVL |
|----------|-------|------|------|------|------|
| Noisy    | 1.68  | 1.97 | 3.35 | 2.44 | 2.63 |
| Wiener   | 5.07  | 2.22 | 3.23 | 2.68 | 2.67 |
| SEGAN    | 7.73  | 2.16 | 3.48 | 2.94 | 2.80 |
| Wavelnet |       |      | 3.62 | 3.23 | 2.98 |
| DFL      |       |      | 3.86 | 3.33 | 3.22 |
| MMSE-GAN |       | 2.53 | 3.80 | 3.12 | 3.14 |
| MDPhD    | 10.22 | 2.70 | 3.85 | 3.39 | 3.27 |
| RNANet   | 10.16 | 2.71 | 3.98 | 3.42 | 3.35 |
Here, a larger SSNR (Segmental Signal-to-Noise Ratio) indicates a better denoising effect, and likewise a larger PESQ (Perceptual Evaluation of Speech Quality) indicates a better denoising effect; CSIG (an evaluation metric) is the mean opinion score of signal distortion, and a larger CSIG indicates a better denoising effect; CBAK (an evaluation metric) is the background noise prediction score, and a larger CBAK indicates a better denoising effect; COVL (an evaluation metric) is the score of the overall signal quality of the speech signal.
In some embodiments, to show the improvement in the intelligibility of the speech signal, STOI (Short-Time Objective Intelligibility) is used to compare the method provided by the embodiments of the present disclosure with the reference methods; the comparison results are shown in Table 3:
Table 3

| Evaluation metric | Noisy | MMSE-GAN | RSGAN-GP | RNANet |
|-------------------|-------|----------|----------|--------|
| STOI              | 0.921 | 0.930    | 0.942    | 0.946  |
Here, a larger STOI indicates a better denoising effect.
From the comparison results in Tables 2 and 3 above, it can be seen that the denoising effect of the method provided by the embodiments of the present disclosure is clearly better than that of the other methods.
The embodiment shown in Figure 5 above is described only with the example of invoking a model to denoise the original speech signal; in another embodiment, the electronic device may denoise the original speech signal without invoking the speech processing model.
Figure 13 is a flowchart of another speech signal processing method provided by an embodiment of the present application. The method is performed by an electronic device. Referring to Figure 13, the method includes:
1301. Determine multiple first speech features of the original speech signal, each first speech feature corresponding to one speech signal frame in the original speech signal.
In some embodiments, the electronic device performs feature extraction on the original amplitude of each speech signal frame to obtain the first speech feature of each speech signal frame.
1302. Process the multiple first speech features to obtain multiple non-local speech features, each non-local speech feature corresponding to one speech signal frame.
Each non-local speech feature is obtained by fusing the first speech feature of the speech signal frame corresponding to the non-local speech feature with the first speech features of the speech signal frames other than that frame.
In some embodiments, the electronic device performs feature extraction on the first speech feature of each speech signal frame to obtain the second speech feature of each speech signal frame; fuses the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the third speech feature of each speech signal frame; and fuses, for each speech signal frame, the second speech feature and the third speech feature to obtain the non-local speech feature of each speech signal frame.
In some embodiments, based on the weights of the multiple speech signal frames, the first speech feature of each speech signal frame is weighted-fused with the first speech features of the other speech signal frames to obtain the weighted-fused first speech feature of each speech signal frame; the weighted-fused first speech feature of each speech signal frame is encoded to obtain the encoded feature of each speech signal frame; and the encoded feature of each speech signal frame is decoded to obtain the third speech feature of each speech signal frame.
In some embodiments, based on the weights of the multiple speech signal frames, the electronic device performs weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the fusion feature of each speech signal frame, and fuses the first speech feature of each speech signal frame with the fusion feature to obtain the weighted-fused first speech feature of each speech signal frame.
In some embodiments, the electronic device performs feature reduction on the encoded feature of each speech signal frame to obtain multiple reduced encoded features, and decodes each reduced encoded feature to obtain the third speech feature of each speech signal frame.
In some embodiments, the electronic device fuses the non-local speech feature of each speech signal frame with the first speech feature to obtain the fused non-local speech feature of each speech signal frame.
1303. Process the non-local speech features of each speech signal frame in the original speech signal separately to obtain the mixed speech feature of each speech signal frame.
1304. Obtain the denoising parameters based on the mixed speech features of the multiple speech signal frames.
In some embodiments, the electronic device performs feature recognition on the mixed speech features of the multiple speech signal frames to obtain the denoising parameters.
1305. Denoise the original speech signal based on the denoising parameters to obtain the target speech signal.
In some embodiments, the electronic device denoises the original amplitudes of the multiple speech signal frames based on the denoising parameters to obtain the target amplitudes of the multiple speech signal frames, and combines the original phases and the target amplitudes of the multiple speech signal frames to obtain the target speech signal.
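As an illustration, steps 1301 to 1305 can be strung together as follows, reusing the sketches above; everything here (class names, block counts, feature shapes) is an assumption made for illustration rather than the configuration of the disclosure, and the resulting mask would be applied to the amplitudes as in the denoise_waveform sketch.

```python
class RNANetSketch(nn.Module):
    """Illustrative end-to-end pipeline: feature extraction, a non-local
    attention stage, a local attention stage, and a mask head."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.extract = nn.Conv2d(1, channels, 3, padding=1)            # step 1301
        self.non_local = NonLocalAttentionSubModel(channels)           # step 1302
        self.local = nn.Sequential(DilatedResidualBlock(channels, 1),
                                   DilatedResidualBlock(channels, 2))  # step 1303
        self.mask = FeatureRecognition(channels)                       # step 1304

    def forward(self, magnitude: torch.Tensor) -> torch.Tensor:
        # magnitude: (batch, T, K) original amplitudes of the frames
        feat = self.extract(magnitude.unsqueeze(1))
        feat = self.local(self.non_local(feat))
        return self.mask(feat).squeeze(1)  # denoising parameters for step 1305
```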
With the method provided by the embodiments of the present disclosure, the context information of each speech signal frame is taken into account when the non-local speech feature of the frame is obtained; the non-local speech feature of each frame is then processed separately to capture the speech features of the frame itself, yielding speech features in a mixed form. The denoising parameters obtained from these mixed-form speech features are more accurate, accurately representing the proportion of the signal other than the noise signal in each speech signal frame; denoising the original speech signal with these parameters therefore improves the denoising effect on the original speech signal.
Figure 14 is a block diagram of a speech signal processing apparatus according to an exemplary embodiment. Referring to Figure 14, the apparatus includes:
a feature determination unit 1401, configured to determine multiple first speech features of the original speech signal, each first speech feature corresponding to one speech signal frame in the original speech signal;
a non-local feature acquisition unit 1402, configured to process the multiple first speech features to obtain multiple non-local speech features, each non-local speech feature corresponding to one speech signal frame and being obtained by fusing the first speech feature of the speech signal frame corresponding to the non-local speech feature with the first speech features of the speech signal frames other than that frame;
a mixed feature acquisition unit 1403, configured to process the non-local speech features of each speech signal frame in the original speech signal separately to obtain the mixed speech feature of each speech signal frame;
a denoising parameter acquisition unit 1404, configured to obtain the denoising parameters based on the mixed speech features of the multiple speech signal frames;
a target signal acquisition unit 1405, configured to denoise the original speech signal based on the denoising parameters to obtain the target speech signal.
With the apparatus provided by the embodiments of the present disclosure, the context information of each speech signal frame is taken into account when the non-local speech feature of the frame is obtained; the non-local speech feature of each frame is then processed separately to capture the speech features of the frame itself, yielding speech features in a mixed form. The denoising parameters obtained from these mixed-form speech features are more accurate, accurately representing the proportion of the signal other than the noise signal in each speech signal frame; denoising the original speech signal with these parameters therefore improves the denoising effect on the original speech signal.
In some embodiments, the feature determination unit 1401 is configured to perform feature extraction on the original amplitude of each speech signal frame to obtain the first speech feature of each speech signal frame.
In some embodiments, referring to Figure 15, the target signal acquisition unit 1405 includes:
an amplitude acquisition subunit 1415, configured to denoise the original amplitudes of the multiple speech signal frames based on the denoising parameters to obtain the target amplitudes of the multiple speech signal frames;
a signal acquisition subunit 1425, configured to combine the original phases and the target amplitudes of the multiple speech signal frames to obtain the target speech signal.
In some embodiments, the denoising parameter acquisition unit 1404 is configured to perform feature recognition on the mixed speech features of the multiple speech signal frames to obtain the denoising parameters.
In some embodiments, referring to Figure 15, the non-local feature acquisition unit 1402 includes:
a feature extraction subunit 1412, configured to perform feature extraction on the first speech feature of each speech signal frame to obtain the second speech feature of each speech signal frame;
a first fusion subunit 1422, configured to fuse the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the third speech feature of each speech signal frame;
a second fusion subunit 1432, configured to fuse, for each speech signal frame, the second speech feature and the third speech feature to obtain the non-local speech feature of each speech signal frame.
In some embodiments, referring to Figure 15, the non-local feature acquisition unit 1402 further includes:
a third fusion subunit 1442, configured to fuse the non-local speech feature of each speech signal frame with the first speech feature to obtain the fused non-local speech feature of each speech signal frame.
In some embodiments, referring to Figure 15, the first fusion subunit 1422 is configured to:
perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the weighted-fused first speech feature of each speech signal frame;
encode the weighted-fused first speech feature of each speech signal frame to obtain the encoded feature of each speech signal frame;
decode the encoded feature of each speech signal frame to obtain the third speech feature of each speech signal frame.
In some embodiments, referring to Figure 15, the first fusion subunit 1422 is configured to:
perform feature reduction on the encoded feature of each speech signal frame to obtain multiple reduced encoded features;
decode each reduced encoded feature to obtain the third speech feature of each speech signal frame.
In some embodiments, referring to Figure 15, the first fusion subunit 1422 is configured to:
perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the fusion feature of each speech signal frame;
fuse the first speech feature of each speech signal frame with the fusion feature to obtain the weighted-fused first speech feature of each speech signal frame.
In some embodiments, the non-local feature acquisition unit 1402 is configured to invoke the non-local attention sub-model to process the multiple first speech features to obtain the multiple non-local speech features;
the mixed feature acquisition unit 1403 is configured to invoke the local attention sub-model to process the non-local speech features of each speech signal frame separately to obtain the mixed speech feature of each speech signal frame.
In some embodiments, the non-local attention sub-model includes a first processing network, a second processing network, and a first fusion network. Referring to Figure 15, the non-local feature acquisition unit 1402 includes:
a feature extraction subunit 1412, configured to invoke the first processing network to perform feature extraction on the first speech feature of each speech signal frame to obtain the second speech feature of each speech signal frame, the first processing network including multiple dilated residual sub-networks;
a first fusion subunit 1422, configured to invoke the second processing network to fuse the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the third speech feature of each speech signal frame;
a second fusion subunit 1432, configured to invoke the first fusion network to fuse, for each speech signal frame, the second speech feature and the third speech feature to obtain the non-local speech feature of each speech signal frame.
In some embodiments, the second processing network includes a residual non-local sub-network, a convolution sub-network, and a deconvolution sub-network, and the first fusion subunit 1422 is configured to:
invoke the residual non-local sub-network to perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the weighted-fused first speech feature of each speech signal frame;
invoke the convolution sub-network to encode the weighted-fused first speech feature of each speech signal frame to obtain the encoded feature of each speech signal frame;
invoke the deconvolution sub-network to decode the encoded feature of each speech signal frame to obtain the third speech feature of each speech signal frame.
In some embodiments, the residual non-local sub-network includes a first fusion layer and a second fusion layer, and the first fusion subunit 1422 is configured to:
invoke the first fusion layer to perform, based on the weights of the multiple speech signal frames, weighted fusion of the first speech feature of each speech signal frame with the first speech features of the other speech signal frames to obtain the fusion feature of each speech signal frame;
invoke the second fusion layer to fuse the first speech feature of each speech signal frame with the fusion feature to obtain the weighted-fused first speech feature of each speech signal frame.
In an exemplary embodiment, an electronic device is provided, including one or more processors and a volatile or non-volatile memory for storing instructions executable by the one or more processors, where the one or more processors are configured to perform the speech signal processing method of the above embodiments.
In some embodiments, the electronic device is provided as a terminal. Figure 16 is a structural block diagram of a terminal 1600 according to an exemplary embodiment. The terminal 1600 may be a portable mobile terminal such as a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 1600 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
The terminal 1600 includes a processor 1601 and a memory 1602.
The processor 1601 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 1601 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1601 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1601 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 1602 may include one or more computer-readable storage media, which may be non-transitory. The memory 1602 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 1602 stores at least one piece of program code, which is executed by the processor 1601 to implement the speech signal processing methods provided by the method embodiments of the present disclosure.
In some embodiments, the terminal 1600 may optionally further include a peripheral device interface 1603 and at least one peripheral device. The processor 1601, the memory 1602, and the peripheral device interface 1603 may be connected by buses or signal lines, and each peripheral device may be connected to the peripheral device interface 1603 by a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 1604, a display screen 1605, a camera assembly 1606, an audio circuit 1607, a positioning assembly 1608, and a power supply 1609.
The peripheral device interface 1603 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 1601 and the memory 1602. In some embodiments, the processor 1601, the memory 1602, and the peripheral device interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of them may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1604 receives and transmits RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1604 communicates with communication networks and other communication devices via electromagnetic signals, converting electrical signals into electromagnetic signals for transmission, or converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 1604 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 1604 may communicate with other terminals through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1604 may further include NFC (Near Field Communication) related circuits, which is not limited in the present disclosure.
The display screen 1605 displays a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 1605 is a touch display screen, it also has the ability to acquire touch signals on or above its surface; such touch signals may be input to the processor 1601 as control signals for processing, in which case the display screen 1605 may also provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there is one display screen 1605, arranged on the front panel of the terminal 1600; in other embodiments, there are at least two display screens 1605, arranged on different surfaces of the terminal 1600 or in a folded design; in still other embodiments, the display screen 1605 is a flexible display screen arranged on a curved surface or a folding surface of the terminal 1600. The display screen 1605 may even be set as a non-rectangular irregular figure, i.e., a special-shaped screen, and may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 1606 captures images or video. Optionally, the camera assembly 1606 includes a front camera and a rear camera; the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to achieve a background-blur function, and the main camera and the wide-angle camera can be fused to achieve panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1606 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash; a dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 1607 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals that are input to the processor 1601 for processing, or to the radio frequency circuit 1604 for voice communication. For stereo acquisition or noise reduction, there may be multiple microphones arranged at different parts of the terminal 1600; the microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 1601 or the radio frequency circuit 1604 into sound waves, and may be a traditional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 1607 may also include a headphone jack.
The positioning assembly 1608 locates the current geographic position of the terminal 1600 to implement navigation or LBS (Location Based Service). The positioning assembly 1608 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS positioning system of Russia, or the Galileo positioning system of the European Union.
The power supply 1609 supplies power to the components in the terminal 1600. The power supply 1609 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power supply 1609 includes a rechargeable battery, it may be a wired rechargeable battery, charged through a wired line, or a wireless rechargeable battery, charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
In some embodiments, the terminal 1600 further includes one or more sensors 1610. The one or more sensors 1610 include, but are not limited to, an acceleration sensor 1611, a gyroscope sensor 1612, a pressure sensor 1613, a fingerprint sensor 1614, an optical sensor 1615, and a proximity sensor 1616.
The acceleration sensor 1611 can detect the magnitude of acceleration along the three axes of the coordinate system established by the terminal 1600. For example, the acceleration sensor 1611 can detect the components of gravitational acceleration along the three axes. The processor 1601 can control the display screen 1605 to display the user interface in landscape or portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1611. The acceleration sensor 1611 can also be used to collect motion data for games or for the user.
The gyroscope sensor 1612 can detect the body orientation and rotation angle of the terminal 1600, and can cooperate with the acceleration sensor 1611 to capture the user's 3D actions on the terminal 1600. Based on the data collected by the gyroscope sensor 1612, the processor 1601 can implement functions such as motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1613 may be disposed on the side frame of the terminal 1600 and/or beneath the display screen 1605. When the pressure sensor 1613 is disposed on the side frame, it can detect the user's grip on the terminal 1600, and the processor 1601 can perform left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1613. When the pressure sensor 1613 is disposed beneath the display screen 1605, the processor 1601 controls the operable controls on the UI according to the user's pressure operations on the display screen 1605. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The fingerprint sensor 1614 is used to collect the user's fingerprint, and the user's identity is recognized either by the processor 1601 or by the fingerprint sensor 1614 itself based on the collected fingerprint. When the user's identity is recognized as trusted, the processor 1601 authorizes the user to perform sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 1614 may be disposed on the front, back, or side of the terminal 1600. When the terminal 1600 is provided with a physical button or a manufacturer's logo, the fingerprint sensor 1614 may be integrated with the physical button or the manufacturer's logo.
The optical sensor 1615 is used to collect the ambient light intensity. In one embodiment, the processor 1601 can control the display brightness of the display screen 1605 according to the ambient light intensity collected by the optical sensor 1615: when the ambient light intensity is high, the display brightness is increased; when the ambient light intensity is low, the display brightness is decreased. In another embodiment, the processor 1601 can also dynamically adjust the shooting parameters of the camera assembly 1606 according to the ambient light intensity collected by the optical sensor 1615.
The proximity sensor 1616, also called a distance sensor, is disposed on the front panel of the terminal 1600 and is used to measure the distance between the user and the front of the terminal 1600. In one embodiment, when the proximity sensor 1616 detects that this distance is gradually decreasing, the processor 1601 controls the display screen 1605 to switch from the screen-on state to the screen-off state; when the proximity sensor 1616 detects that the distance is gradually increasing, the processor 1601 controls the display screen 1605 to switch from the screen-off state to the screen-on state.
Those skilled in the art will understand that the structure shown in FIG. 16 does not limit the terminal 1600, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
In some embodiments, the electronic device is provided as a server. FIG. 17 is a structural block diagram of a server according to an exemplary embodiment. The server 1700 may vary greatly in configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 1701 and one or more memories 1702, where the memory 1702 stores at least one piece of program code, which is loaded and executed by the processor 1701 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may further include other components for implementing device functions, which are not described here.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided. When the instructions in the storage medium are executed by the processor of an electronic device, the electronic device is enabled to perform the steps performed by the terminal or the server in the above voice signal processing method. For example, the non-transitory computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided. When the instructions in the computer program product are executed by the processor of an electronic device, the electronic device is enabled to perform the steps performed by the terminal or the server in the above voice signal processing method.
In an exemplary embodiment, a method for processing a voice signal is provided; the method comprises the following steps, sketched in code after the list:
determining first voice features of a plurality of voice signal frames in an original voice signal;
invoking a non-local attention network to fuse the first voice features of the plurality of voice signal frames to obtain a non-local voice feature of each voice signal frame;
invoking a local attention network to separately process the non-local voice feature of each voice signal frame to obtain a mixed voice feature of each voice signal frame;
obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames;
denoising the original voice signal according to the denoising parameters to obtain a target voice signal.
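The five steps above can be read as a small neural pipeline. Below is a minimal PyTorch sketch of that reading; the module sizes, the use of multi-head attention for the non-local attention network, and a 1-D convolution standing in for the local attention network are all illustrative assumptions rather than the patent's implementation, and `SpeechDenoiser` is a hypothetical name.

```python
import torch
import torch.nn as nn

class SpeechDenoiser(nn.Module):
    def __init__(self, n_bins=257, d_model=128):
        super().__init__()
        # first voice features: one feature vector per voice signal frame
        self.extract = nn.Conv1d(n_bins, d_model, kernel_size=3, padding=1)
        # non-local attention network (assumed): every frame attends to all others
        self.non_local = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # local attention network (assumed): per-frame processing, small receptive field
        self.local = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        # denoising parameters: a mask in [0, 1] per time-frequency bin
        self.mask = nn.Sequential(nn.Conv1d(d_model, n_bins, kernel_size=1), nn.Sigmoid())

    def forward(self, amp):                              # amp: (batch, n_bins, frames)
        feat = self.extract(amp)                         # first voice features
        f = feat.transpose(1, 2)                         # (batch, frames, d_model)
        fused, _ = self.non_local(f, f, f)               # non-local voice features
        mixed = self.local((fused + f).transpose(1, 2))  # mixed voice features
        return amp * self.mask(mixed)                    # denoised target amplitudes
```

Here `amp` is assumed to be a batch of per-frame spectral amplitudes; the embodiments below spell out where those amplitudes come from and how the waveform is rebuilt.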
In some embodiments, determining the first voice features of the plurality of voice signal frames in the original voice signal includes:
invoking a feature extraction network to separately perform feature extraction on the original amplitudes of the plurality of voice signal frames to obtain the first voice features of the plurality of voice signal frames; the sketch below illustrates where those amplitudes come from.
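One hedged reading of this step is that the "original amplitude" of each frame is its short-time spectral magnitude. The helper below, with assumed window and hop sizes, is illustrative only:

```python
import torch

def frame_amplitudes(wave, n_fft=512, hop=128):
    """Frame a waveform; return (bins, frames) amplitudes and the original phases."""
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs(), spec.angle()  # original amplitudes, original phases
```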
In some embodiments, denoising the original voice signal according to the denoising parameters to obtain the target voice signal includes:
invoking a voice denoising network to separately denoise the original amplitudes of the plurality of voice signal frames according to the denoising parameters to obtain target amplitudes of the plurality of voice signal frames;
combining the original phases and the target amplitudes of the plurality of voice signal frames to obtain the target voice signal, as sketched below.
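A minimal sketch of the recombination, assuming the same STFT parameters as the hypothetical `frame_amplitudes` helper above:

```python
import torch

def reconstruct(target_amp, orig_phase, n_fft=512, hop=128):
    # target amplitude + original phase -> complex spectrum -> target voice signal
    spec = torch.polar(target_amp, orig_phase)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))
```

Keeping the noisy phase and replacing only the amplitude is a common design choice in amplitude-domain enhancement, since amplitude errors tend to be far more audible than phase errors at typical frame lengths.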
In some embodiments, obtaining the denoising parameters based on the mixed voice features of the plurality of voice signal frames includes:
invoking a feature reconstruction network to perform feature reconstruction on the mixed voice features of the plurality of voice signal frames to obtain the denoising parameters.
In some embodiments, the non-local attention network further includes a second fusion unit. After the first fusion unit is invoked to fuse the second voice feature and the third voice feature of each voice signal frame to obtain the non-local voice feature of each voice signal frame, the processing method further includes:
invoking the second fusion unit to fuse the non-local voice feature and the first voice feature of each voice signal frame to obtain a fused non-local voice feature of each voice signal frame.
In some embodiments, the second processing unit further includes a feature reduction subunit. After the convolution subunit is invoked to encode the weighted-fused first voice feature of each voice signal frame to obtain the encoded feature of each voice signal frame, the processing method further includes:
invoking the feature reduction subunit to perform feature reduction on the encoded feature of each voice signal frame to obtain a plurality of reduced encoded features;
and invoking the deconvolution subunit to decode the encoded feature of each voice signal frame to obtain the third voice feature of each voice signal frame includes:
invoking the deconvolution subunit to decode the plurality of reduced encoded features to obtain the third voice feature of each voice signal frame; one plausible rendering follows.
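Read together, these embodiments describe an attention-weighted fusion followed by a shrink-then-expand encode/decode path. The sketch below is one plausible rendering under assumed layer shapes: the strided convolution plays the roles of the convolution and feature reduction subunits, and the transposed convolution plays the deconvolution subunit. `SecondProcessingUnit` is a hypothetical name.

```python
import torch
import torch.nn as nn

class SecondProcessingUnit(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        # weighted fusion of each frame's feature with all other frames' features
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        # encode + feature reduction: halves the number of frames
        self.encode = nn.Conv1d(d_model, d_model, kernel_size=4, stride=2, padding=1)
        # decode: expands back to one third voice feature per frame
        self.decode = nn.ConvTranspose1d(d_model, d_model, kernel_size=4, stride=2, padding=1)

    def forward(self, feat):                      # feat: (batch, frames, d_model)
        fused, _ = self.attn(feat, feat, feat)    # weighted-fused first voice features
        fused = fused + feat                      # residual fusion with the input features
        enc = self.encode(fused.transpose(1, 2))  # reduced encoded features
        return self.decode(enc).transpose(1, 2)   # third voice features
```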
In some embodiments, the voice processing model includes at least the non-local attention network and the local attention network, and the training process of the voice processing model is as follows (a sketch follows these steps):
obtaining a sample voice signal and a sample noise signal;
mixing the sample voice signal with the sample noise signal to obtain a sample mixed signal;
invoking the voice processing model to process a plurality of sample voice signal frames in the sample mixed signal to obtain predicted denoising parameters corresponding to the sample mixed signal;
denoising the sample mixed signal according to the predicted denoising parameters to obtain a denoised predicted voice signal;
training the voice processing model according to the difference between the predicted voice signal and the sample voice signal.
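A hedged sketch of one training step for these five operations, reusing the hypothetical `frame_amplitudes` helper and `SpeechDenoiser` model from the earlier sketches; the MSE loss on amplitudes is an assumption, since the text only specifies training on the difference between the predicted and sample voice signals:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, clean_wave, noise_wave):
    mixed = clean_wave + noise_wave                      # sample mixed signal
    mixed_amp, _ = frame_amplitudes(mixed)
    clean_amp, _ = frame_amplitudes(clean_wave)
    pred_amp = model(mixed_amp.unsqueeze(0)).squeeze(0)  # predicted target amplitudes
    loss = F.mse_loss(pred_amp, clean_amp)               # gap to the sample voice signal
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```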
All embodiments of the present disclosure may be implemented independently or in combination with other embodiments, and both cases fall within the protection scope claimed by the present disclosure.

Claims (41)

  1. A method for processing a voice signal, performed by an electronic device, the method comprising:
    determining a plurality of first voice features of an original voice signal, each of the first voice features corresponding to one voice signal frame in the original voice signal;
    processing the plurality of first voice features to obtain a plurality of non-local voice features, each of the non-local voice features corresponding to one voice signal frame, and each of the non-local voice features being obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of voice signal frames other than the voice signal frame;
    separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each of the voice signal frames;
    obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames;
    denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
  2. The method according to claim 1, wherein determining the plurality of first voice features of the original voice signal comprises:
    separately performing feature extraction on an original amplitude of each of the voice signal frames to obtain the first voice feature of each of the voice signal frames.
  3. The method according to claim 2, wherein denoising the original voice signal based on the denoising parameters to obtain the target voice signal comprises:
    based on the denoising parameters, separately denoising the original amplitudes of the plurality of voice signal frames to obtain target amplitudes of the plurality of voice signal frames;
    combining the original phases and the target amplitudes of the plurality of voice signal frames to obtain the target voice signal.
  4. The method according to claim 1, wherein obtaining the denoising parameters based on the mixed voice features of the plurality of voice signal frames comprises:
    performing feature recognition on the mixed voice features of the plurality of voice signal frames to obtain the denoising parameters.
  5. The method according to claim 1, wherein processing the plurality of first voice features to obtain the plurality of non-local voice features comprises:
    separately performing feature extraction on the first voice feature of each of the voice signal frames to obtain a second voice feature of each of the voice signal frames;
    fusing the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a third voice feature of each of the voice signal frames;
    separately fusing the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  6. The method according to claim 5, wherein the method further comprises:
    fusing the non-local voice feature and the first voice feature of each of the voice signal frames to obtain a fused non-local voice feature of each of the voice signal frames.
  7. The method according to claim 5, wherein fusing the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the third voice feature of each of the voice signal frames comprises:
    based on weights of the plurality of voice signal frames, performing weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a weighted-fused first voice feature of each of the voice signal frames;
    encoding the weighted-fused first voice feature of each of the voice signal frames to obtain an encoded feature of each of the voice signal frames;
    decoding the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  8. The method according to claim 7, wherein the method further comprises:
    performing feature reduction on the encoded feature of each of the voice signal frames to obtain a plurality of reduced encoded features;
    wherein decoding the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames comprises:
    decoding each of the reduced encoded features to obtain the third voice feature of each of the voice signal frames.
  9. The method according to claim 7, wherein, based on the weights of the plurality of voice signal frames, performing weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the weighted-fused first voice feature of each of the voice signal frames comprises:
    based on the weights of the plurality of voice signal frames, performing weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a fusion feature of each of the voice signal frames;
    fusing the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  10. The method according to claim 1, wherein processing the plurality of first voice features to obtain the plurality of non-local voice features comprises:
    invoking a non-local attention sub-model to process the plurality of first voice features to obtain the plurality of non-local voice features;
    wherein separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain the mixed voice feature of each of the voice signal frames comprises:
    invoking a local attention sub-model to separately process the non-local voice feature of each of the voice signal frames to obtain the mixed voice feature of each of the voice signal frames.
  11. The method according to claim 10, wherein the non-local attention sub-model comprises a first processing network, a second processing network, and a first fusion network, and invoking the non-local attention sub-model to process the plurality of first voice features to obtain the plurality of non-local voice features comprises:
    invoking the first processing network to separately perform feature extraction on the first voice feature of each of the voice signal frames to obtain the second voice feature of each of the voice signal frames, the first processing network comprising a plurality of dilated residual sub-networks;
    invoking the second processing network to fuse the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the third voice feature of each of the voice signal frames;
    invoking the first fusion network to separately fuse the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  12. The method according to claim 11, wherein the second processing network comprises a residual non-local sub-network, a convolution sub-network, and a deconvolution sub-network; and invoking the second processing network to fuse the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the third voice feature of each of the voice signal frames comprises:
    invoking the residual non-local sub-network to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the weighted-fused first voice feature of each of the voice signal frames;
    invoking the convolution sub-network to encode the weighted-fused first voice feature of each of the voice signal frames to obtain the encoded feature of each of the voice signal frames;
    invoking the deconvolution sub-network to decode the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  13. The method according to claim 12, wherein the residual non-local sub-network comprises a first fusion layer and a second fusion layer, and invoking the residual non-local sub-network to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the weighted-fused first voice feature of each of the voice signal frames comprises:
    invoking the first fusion layer to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the fusion feature of each of the voice signal frames;
    invoking the second fusion layer to fuse the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  14. An apparatus for processing a voice signal, the apparatus comprising:
    a feature determining unit configured to determine a plurality of first voice features of an original voice signal, each of the first voice features corresponding to one voice signal frame in the original voice signal;
    a non-local feature obtaining unit configured to process the plurality of first voice features to obtain a plurality of non-local voice features, each of the non-local voice features corresponding to one voice signal frame, and each of the non-local voice features being obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of voice signal frames other than the voice signal frame;
    a mixed feature obtaining unit configured to separately process the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each of the voice signal frames;
    a denoising parameter obtaining unit configured to obtain denoising parameters based on the mixed voice features of the plurality of voice signal frames;
    a target signal obtaining unit configured to denoise the original voice signal based on the denoising parameters to obtain a target voice signal.
  15. The apparatus according to claim 14, wherein the feature determining unit is configured to separately perform feature extraction on an original amplitude of each of the voice signal frames to obtain the first voice feature of each of the voice signal frames.
  16. The apparatus according to claim 15, wherein the target signal obtaining unit comprises:
    an amplitude obtaining subunit configured to, based on the denoising parameters, separately denoise the original amplitudes of the plurality of voice signal frames to obtain target amplitudes of the plurality of voice signal frames;
    a signal obtaining subunit configured to combine the original phases and the target amplitudes of the plurality of voice signal frames to obtain the target voice signal.
  17. The apparatus according to claim 14, wherein the denoising parameter obtaining unit is configured to perform feature recognition on the mixed voice features of the plurality of voice signal frames to obtain the denoising parameters.
  18. The apparatus according to claim 14, wherein the non-local feature obtaining unit comprises:
    a feature extraction subunit configured to separately perform feature extraction on the first voice feature of each of the voice signal frames to obtain a second voice feature of each of the voice signal frames;
    a first fusion subunit configured to fuse the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a third voice feature of each of the voice signal frames;
    a second fusion subunit configured to separately fuse the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  19. The apparatus according to claim 18, wherein the non-local feature obtaining unit further comprises:
    a third fusion subunit configured to fuse the non-local voice feature and the first voice feature of each of the voice signal frames to obtain a fused non-local voice feature of each of the voice signal frames.
  20. The apparatus according to claim 18, wherein the first fusion subunit is configured to:
    based on weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a weighted-fused first voice feature of each of the voice signal frames;
    encode the weighted-fused first voice feature of each of the voice signal frames to obtain an encoded feature of each of the voice signal frames;
    decode the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  21. The apparatus according to claim 20, wherein the first fusion subunit is configured to:
    perform feature reduction on the encoded feature of each of the voice signal frames to obtain a plurality of reduced encoded features;
    decode each of the reduced encoded features to obtain the third voice feature of each of the voice signal frames.
  22. The apparatus according to claim 20, wherein the first fusion subunit is configured to:
    based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a fusion feature of each of the voice signal frames;
    fuse the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  23. The apparatus according to claim 14, wherein the non-local feature obtaining unit is configured to invoke a non-local attention sub-model to process the plurality of first voice features to obtain the plurality of non-local voice features;
    and the mixed feature obtaining unit is configured to invoke a local attention sub-model to separately process the non-local voice feature of each of the voice signal frames to obtain the mixed voice feature of each of the voice signal frames.
  24. The apparatus according to claim 23, wherein the non-local attention sub-model comprises a first processing network, a second processing network, and a first fusion network, and the non-local feature obtaining unit comprises:
    a feature extraction subunit configured to invoke the first processing network to separately perform feature extraction on the first voice feature of each of the voice signal frames to obtain the second voice feature of each of the voice signal frames, the first processing network comprising a plurality of dilated residual sub-networks;
    a first fusion subunit configured to invoke the second processing network to fuse the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the third voice feature of each of the voice signal frames;
    a second fusion subunit configured to invoke the first fusion network to separately fuse the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  25. The apparatus according to claim 24, wherein the second processing network comprises a residual non-local sub-network, a convolution sub-network, and a deconvolution sub-network; and the first fusion subunit is configured to:
    invoke the residual non-local sub-network to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the weighted-fused first voice feature of each of the voice signal frames;
    invoke the convolution sub-network to encode the weighted-fused first voice feature of each of the voice signal frames to obtain the encoded feature of each of the voice signal frames;
    invoke the deconvolution sub-network to decode the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  26. The apparatus according to claim 25, wherein the residual non-local sub-network comprises a first fusion layer and a second fusion layer, and the first fusion subunit is configured to:
    invoke the first fusion layer to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the fusion feature of each of the voice signal frames;
    invoke the second fusion layer to fuse the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  27. An electronic device, comprising:
    one or more processors;
    a memory for storing instructions executable by the one or more processors;
    wherein the one or more processors are configured to perform the following steps:
    determining a plurality of first voice features of an original voice signal, each of the first voice features corresponding to one voice signal frame in the original voice signal;
    processing the plurality of first voice features to obtain a plurality of non-local voice features, each of the non-local voice features corresponding to one voice signal frame, and each of the non-local voice features being obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of voice signal frames other than the voice signal frame;
    separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each of the voice signal frames;
    obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames;
    denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
  28. The electronic device according to claim 27, wherein the one or more processors are configured to perform the following step:
    separately performing feature extraction on an original amplitude of each of the voice signal frames to obtain the first voice feature of each of the voice signal frames.
  29. The electronic device according to claim 28, wherein the one or more processors are configured to perform the following steps:
    based on the denoising parameters, separately denoising the original amplitudes of the plurality of voice signal frames to obtain target amplitudes of the plurality of voice signal frames;
    combining the original phases and the target amplitudes of the plurality of voice signal frames to obtain the target voice signal.
  30. The electronic device according to claim 27, wherein the one or more processors are configured to perform the following step:
    performing feature recognition on the mixed voice features of the plurality of voice signal frames to obtain the denoising parameters.
  31. The electronic device according to claim 27, wherein the one or more processors are configured to perform the following steps:
    separately performing feature extraction on the first voice feature of each of the voice signal frames to obtain a second voice feature of each of the voice signal frames;
    fusing the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a third voice feature of each of the voice signal frames;
    separately fusing the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  32. The electronic device according to claim 31, wherein the one or more processors are configured to perform the following step:
    fusing the non-local voice feature and the first voice feature of each of the voice signal frames to obtain a fused non-local voice feature of each of the voice signal frames.
  33. The electronic device according to claim 31, wherein the one or more processors are configured to perform the following steps:
    based on weights of the plurality of voice signal frames, performing weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a weighted-fused first voice feature of each of the voice signal frames;
    encoding the weighted-fused first voice feature of each of the voice signal frames to obtain an encoded feature of each of the voice signal frames;
    decoding the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  34. The electronic device according to claim 33, wherein the one or more processors are configured to perform the following steps:
    performing feature reduction on the encoded feature of each of the voice signal frames to obtain a plurality of reduced encoded features;
    decoding each of the reduced encoded features to obtain the third voice feature of each of the voice signal frames.
  35. The electronic device according to claim 33, wherein the one or more processors are configured to perform the following steps:
    based on the weights of the plurality of voice signal frames, performing weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain a fusion feature of each of the voice signal frames;
    fusing the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  36. The electronic device according to claim 27, wherein the one or more processors are configured to perform the following steps:
    invoking a non-local attention sub-model to process the plurality of first voice features to obtain the plurality of non-local voice features;
    wherein separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain the mixed voice feature of each of the voice signal frames comprises:
    invoking a local attention sub-model to separately process the non-local voice feature of each of the voice signal frames to obtain the mixed voice feature of each of the voice signal frames.
  37. The electronic device according to claim 36, wherein the non-local attention sub-model comprises a first processing network, a second processing network, and a first fusion network, and the one or more processors are configured to perform the following steps:
    invoking the first processing network to separately perform feature extraction on the first voice feature of each of the voice signal frames to obtain the second voice feature of each of the voice signal frames, the first processing network comprising a plurality of dilated residual sub-networks;
    invoking the second processing network to fuse the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the third voice feature of each of the voice signal frames;
    invoking the first fusion network to separately fuse the second voice feature and the third voice feature of each of the voice signal frames to obtain the non-local voice feature of each of the voice signal frames.
  38. The electronic device according to claim 37, wherein the second processing network comprises a residual non-local sub-network, a convolution sub-network, and a deconvolution sub-network; and the one or more processors are configured to perform the following steps:
    invoking the residual non-local sub-network to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the weighted-fused first voice feature of each of the voice signal frames;
    invoking the convolution sub-network to encode the weighted-fused first voice feature of each of the voice signal frames to obtain the encoded feature of each of the voice signal frames;
    invoking the deconvolution sub-network to decode the encoded feature of each of the voice signal frames to obtain the third voice feature of each of the voice signal frames.
  39. The electronic device according to claim 38, wherein the residual non-local sub-network comprises a first fusion layer and a second fusion layer, and the one or more processors are configured to perform the following steps:
    invoking the first fusion layer to, based on the weights of the plurality of voice signal frames, perform weighted fusion of the first voice feature of each of the voice signal frames with the first voice features of other voice signal frames to obtain the fusion feature of each of the voice signal frames;
    invoking the second fusion layer to fuse the first voice feature of each of the voice signal frames with the fusion feature to obtain the weighted-fused first voice feature of each of the voice signal frames.
  40. A non-transitory computer-readable storage medium, wherein, when instructions in the non-transitory computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the following steps:
    determining a plurality of first voice features of an original voice signal, each of the first voice features corresponding to one voice signal frame in the original voice signal;
    processing the plurality of first voice features to obtain a plurality of non-local voice features, each of the non-local voice features corresponding to one voice signal frame, and each of the non-local voice features being obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of voice signal frames other than the voice signal frame;
    separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each of the voice signal frames;
    obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames;
    denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
  41. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    determining a plurality of first voice features of an original voice signal, each of the first voice features corresponding to one voice signal frame in the original voice signal;
    processing the plurality of first voice features to obtain a plurality of non-local voice features, each of the non-local voice features corresponding to one voice signal frame, and each of the non-local voice features being obtained by fusing the first voice feature of the voice signal frame corresponding to the non-local voice feature with the first voice features of voice signal frames other than the voice signal frame;
    separately processing the non-local voice feature of each voice signal frame in the original voice signal to obtain a mixed voice feature of each of the voice signal frames;
    obtaining denoising parameters based on the mixed voice features of the plurality of voice signal frames;
    denoising the original voice signal based on the denoising parameters to obtain a target voice signal.
PCT/CN2021/116212 2021-01-29 2021-09-02 Voice signal processing method and electronic device WO2022160715A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110125640.5A CN112967730A (en) 2021-01-29 2021-01-29 Voice signal processing method and device, electronic equipment and storage medium
CN202110125640.5 2021-01-29

Publications (1)

Publication Number Publication Date
WO2022160715A1 true WO2022160715A1 (en) 2022-08-04

Family

ID=76273584

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/116212 WO2022160715A1 (en) 2021-01-29 2021-09-02 Voice signal processing method and electronic device

Country Status (2)

Country Link
CN (1) CN112967730A (en)
WO (1) WO2022160715A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028861B (en) * 2019-12-10 2022-02-22 思必驰科技股份有限公司 Spectrum mask model training method, audio scene recognition method and system
CN112967730A (en) * 2021-01-29 2021-06-15 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium
CN113343924B (en) * 2021-07-01 2022-05-17 齐鲁工业大学 Modulation signal identification method based on cyclic spectrum characteristics and generation countermeasure network
CN113674753B (en) * 2021-08-11 2023-08-01 河南理工大学 Voice enhancement method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012096072A1 (en) * 2011-01-13 2012-07-19 日本電気株式会社 Audio-processing device, control method therefor, recording medium containing control program for said audio-processing device, vehicle provided with said audio-processing device, information-processing device, and information-processing system
WO2014070139A2 (en) * 2012-10-30 2014-05-08 Nuance Communications, Inc. Speech enhancement
CN106486131A (en) * 2016-10-14 2017-03-08 上海谦问万答吧云计算科技有限公司 A kind of method and device of speech de-noising
CN112071307A (en) * 2020-09-15 2020-12-11 江苏慧明智能科技有限公司 Intelligent incomplete voice recognition method for elderly people
CN112967730A (en) * 2021-01-29 2021-06-15 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284749A (en) * 2017-07-19 2019-01-29 微软技术许可有限责任公司 Refine image recognition
CN108010514B (en) * 2017-11-20 2021-09-10 四川大学 Voice classification method based on deep neural network
CN109147798B (en) * 2018-07-27 2023-06-09 北京三快在线科技有限公司 Speech recognition method, device, electronic equipment and readable storage medium
CN109919114A (en) * 2019-03-14 2019-06-21 浙江大学 One kind is based on the decoded video presentation method of complementary attention mechanism cyclic convolution
KR20200119410A (en) * 2019-03-28 2020-10-20 한국과학기술원 System and Method for Recognizing Emotions from Korean Dialogues based on Global and Local Contextual Information
US11580970B2 (en) * 2019-04-05 2023-02-14 Samsung Electronics Co., Ltd. System and method for context-enriched attentive memory network with global and local encoding for dialogue breakdown detection
CN110148091A (en) * 2019-04-10 2019-08-20 深圳市未来媒体技术研究院 Neural network model and image super-resolution method based on non local attention mechanism
CN110415702A (en) * 2019-07-04 2019-11-05 北京搜狗科技发展有限公司 Training method and device, conversion method and device
CN110298413B (en) * 2019-07-08 2021-07-16 北京字节跳动网络技术有限公司 Image feature extraction method and device, storage medium and electronic equipment
CN110739002B (en) * 2019-10-16 2022-02-22 中山大学 Complex domain speech enhancement method, system and medium based on generation countermeasure network
CN110992974B (en) * 2019-11-25 2021-08-24 百度在线网络技术(北京)有限公司 Speech recognition method, apparatus, device and computer readable storage medium
CN111341331B (en) * 2020-02-25 2023-04-18 厦门亿联网络技术股份有限公司 Voice enhancement method, device and medium based on local attention mechanism
CN112257758A (en) * 2020-09-27 2021-01-22 浙江大华技术股份有限公司 Fine-grained image recognition method, convolutional neural network and training method thereof

Also Published As

Publication number Publication date
CN112967730A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
WO2022160715A1 (en) Voice signal processing method and electronic device
CN110097019B (en) Character recognition method, character recognition device, computer equipment and storage medium
CN110543289B (en) Method for controlling volume and electronic equipment
CN111696532B (en) Speech recognition method, device, electronic equipment and storage medium
CN111445901B (en) Audio data acquisition method and device, electronic equipment and storage medium
CN110047468B (en) Speech recognition method, apparatus and storage medium
CN108320756B (en) Method and device for detecting whether audio is pure music audio
CN110933334B (en) Video noise reduction method, device, terminal and storage medium
CN109003621B (en) Audio processing method and device and storage medium
CN109243479B (en) Audio signal processing method and device, electronic equipment and storage medium
CN112233689B (en) Audio noise reduction method, device, equipment and medium
CN112581358A (en) Training method of image processing model, image processing method and device
CN111276122A (en) Audio generation method and device and storage medium
CN111613213A (en) Method, device, equipment and storage medium for audio classification
CN109961802B (en) Sound quality comparison method, device, electronic equipment and storage medium
CN109065068B (en) Audio processing method, device and storage medium
CN111176465A (en) Use state identification method and device, storage medium and electronic equipment
CN111223475A (en) Voice data generation method and device, electronic equipment and storage medium
CN112233688B (en) Audio noise reduction method, device, equipment and medium
CN116208704A (en) Sound processing method and device
CN112133319A (en) Audio generation method, device, equipment and storage medium
CN112508959A (en) Video object segmentation method and device, electronic equipment and storage medium
CN110232417B (en) Image recognition method and device, computer equipment and computer readable storage medium
CN113301444B (en) Video processing method and device, electronic equipment and storage medium
CN112151017B (en) Voice processing method, device, system, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21922310

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.11.2023)