WO2023222071A1 - Speech signal processing method, apparatus, device and medium - Google Patents

Speech signal processing method, apparatus, device and medium

Info

Publication number
WO2023222071A1
WO2023222071A1 PCT/CN2023/094965 CN2023094965W WO2023222071A1 WO 2023222071 A1 WO2023222071 A1 WO 2023222071A1 CN 2023094965 W CN2023094965 W CN 2023094965W WO 2023222071 A1 WO2023222071 A1 WO 2023222071A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
mixing
module
features
convolution
Prior art date
Application number
PCT/CN2023/094965
Other languages
English (en)
French (fr)
Inventor
王炳乾
宿绍勋
夏友祥
Original Assignee
京东方科技集团股份有限公司
Priority date
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司
Publication of WO2023222071A1

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present disclosure generally relates to the field of natural language processing, and relates to a speech signal processing method, device, equipment and medium.
  • wake-up technology based on template matching
  • wake-up technology based on hidden Markov model
  • wake-up technology based on deep learning.
  • the most widely used is wake-up technology based on deep learning.
  • the present disclosure provides a voice signal processing method, device, equipment and medium, which can provide faster voice recognition response and improve user experience.
  • embodiments of the present disclosure provide a voice signal processing method, including:
  • the recognition results of the speech signal are obtained
  • the response strategy corresponding to the recognition result is executed.
  • convolutional mixing processing is performed according to speech features to obtain shallow speech recognition features, including:
  • the convolution mixing module includes a spatial position convolution mixing module and a channel position convolution mixing module.
  • the mixing result of the spatial position convolution mixing module and the input of the spatial position convolution mixing module are input to the channel position convolution mixing module through a residual connection.
  • the spatial position convolution mixing module includes a depthwise separable convolution layer, and the channel position convolution mixing module includes a pointwise convolution layer.
  • the spatial position convolution mixing module further includes at least one of a first activation function layer and a first normalization layer, and the channel position convolution mixing module further includes at least one of a second activation function layer and a second normalization layer.
  • a convolutional mixing module is used to perform convolutional mixing processing on speech features, including:
  • hybrid processing based on multi-layer perception is performed according to shallow speech recognition features to obtain deep speech recognition features, including:
  • a multi-layer perceptual mixing module is used to perform convolutional mixing processing on shallow speech recognition features.
  • the multi-layer perceptual mixing module includes a spatial perceptual mixing module and a channel perceptual mixing module.
  • the spatial perceptual mixing module performs spatial perceptual mixing on shallow speech recognition features and provides the mixing results to the channel perceptual mixing module.
  • the channel perceptual mixing module performs channel-aware mixing on the mixing results.
  • the multi-layer perceptual mixing module also includes: a first transposition module and a second transposition module.
  • the first transposition module is used to transpose the feature vector of the shallow speech recognition feature.
  • the second transposition module is used to transpose the feature vector of the mixing result.
  • the spatial-aware mixing module includes a first fully connected layer and a second fully-connected layer
  • the channel-aware mixing module includes a third fully connected layer and a fourth fully connected layer.
  • the spatial-aware mixing module further includes a third activation function layer
  • the channel-aware mixing module further includes a fourth activation function layer.
  • before performing convolution mixing processing according to speech features to obtain shallow speech recognition features, the method further includes: downsampling the speech features.
  • Convolutional mixing processing is performed based on the downsampled speech features to obtain shallow speech recognition features.
  • downsampling speech features includes:
  • the feature embedding module includes a feature embedding layer.
  • the feature embedding module further includes: a fifth activation function layer and a third normalization layer.
  • extracting speech features of the speech signal includes: extracting the Mel frequency cepstral coefficients of the speech signal to obtain the speech features of the speech signal.
  • an embodiment of the present disclosure provides a speech signal processing device, including:
  • the acquisition module is used to acquire the voice signals collected from the environment
  • Speech feature extraction module used to extract speech features of speech signals
  • the convolution mixing module is used to perform convolution mixing processing based on speech features to obtain shallow speech recognition features
  • the multi-layer perceptual mixing module is used to perform multi-layer perceptual mixing based on shallow speech recognition features to obtain deep speech recognition features;
  • the classification module is used to obtain the recognition results of speech signals based on deep speech recognition features
  • the execution module is used to execute the response strategy corresponding to the recognition result according to the recognition result.
  • the convolution mixing module includes a spatial position convolution mixing module and a channel position convolution mixing module.
  • the mixing result of the spatial position convolution mixing module and the input of the spatial position convolution mixing module are input to the channel position convolution mixing module through a residual connection.
  • the spatial position convolution mixing module includes a depthwise separable convolution layer, and the channel position convolution mixing module includes a pointwise convolution layer.
  • the spatial position convolution mixing module further includes at least one of a first activation function layer and a first normalization layer, and the channel position convolution mixing module further includes at least one of a second activation function layer and a second normalization layer.
  • the speech signal processing device includes a plurality of convolution mixing modules connected in series.
  • the multi-layer perceptual blending module includes a spatially aware blending module and a channel-aware blending module.
  • the spatial-aware mixing module performs spatial-aware mixing on shallow speech recognition features and provides the mixing results to the channel-aware mixing module.
  • the channel-aware mixing module performs channel-aware mixing on the mixing results.
  • the multi-layer perceptual mixing module also includes: a first transposition module and a second transposition module.
  • the first transposition module is used to transpose the feature vector of the shallow speech recognition feature.
  • the second transposition module is used to transpose the feature vector of the mixing result.
  • the spatial-aware mixing module includes a first fully connected layer and a second fully-connected layer
  • the channel-aware mixing module includes a third fully connected layer and a fourth fully connected layer.
  • the spatial-aware mixing module further includes a third activation function layer
  • the channel-aware mixing module further includes a fourth activation function layer.
  • the voice signal processing device further includes:
  • Feature embedding module used to downsample speech features
  • the convolution mixing module is specifically used for:
  • Convolutional mixing processing is performed based on the downsampled speech features to obtain shallow speech recognition features.
  • the feature embedding module includes a feature embedding layer.
  • the feature embedding module further includes: a fifth activation function layer and a third normalization layer.
  • the speech feature extraction module is specifically used to: extract the Mel frequency cepstral coefficients of the speech signal to obtain the speech features of the speech signal.
  • embodiments of the present disclosure provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • when the processor executes the program, the method described in the embodiments of the present disclosure is implemented.
  • embodiments of the disclosure provide a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the method described in the embodiments of the disclosure is implemented.
  • Figure 1 shows the system architecture in the related art;
  • Figure 2 is an application scenario diagram of a speech signal processing method provided by an embodiment of the present disclosure;
  • Figure 3 is a schematic flowchart of a speech signal processing method provided by an embodiment of the present disclosure;
  • Figure 4 is a schematic flowchart of extracting speech features provided by an embodiment of the present disclosure;
  • Figure 5 is a schematic structural diagram of a convolution mixing module provided by an embodiment of the present disclosure;
  • Figure 6 is a schematic structural diagram of a multi-layer perceptual mixing module provided by an embodiment of the present disclosure;
  • Figure 7 is a schematic flowchart of another speech signal processing method provided by an embodiment of the present disclosure;
  • Figure 8 is a schematic diagram of the model structure corresponding to Figure 7;
  • Figure 9 is a block diagram of a speech signal processing device provided by an embodiment of the present disclosure;
  • Figure 10 is a block diagram of another speech signal processing device provided by an embodiment of the present disclosure;
  • Figure 11 is a schematic structural diagram of a computer system of an electronic device or server provided by an embodiment of the present disclosure.
  • the initial wake-up method is to manually click a button (such as pressing the recording button) for voice input.
  • wake-up technology based on deep learning models gradually took shape, forming the system architecture shown in Figure 1.
  • the client collects the user's voice signal in real time and then sends it to the cloud server through the wireless network.
  • the deep learning model on the cloud server analyzes the speech signal and obtains speech recognition results or obtains voice control instructions based on the speech recognition results.
  • the cloud server then returns the speech recognition results or voice control instructions to the client.
  • the client executes corresponding response strategies based on the returned speech recognition results or voice control instructions, for example responding to the user in reply to a wake-up sentence, or executing the control scheme corresponding to the voice control instruction. It can be seen that, because the terminal cannot perform speech recognition on its own, speech recognition cannot be carried out when the network connection is poor, which affects the user's experience with voice-controlled devices. Moreover, these technologies all have a large number of model parameters and need to be deployed in the cloud for computation, resulting in slow wake-up responses and a poor user experience.
  • the present disclosure proposes a voice signal processing method that has a small number of model parameters, can be deployed in a terminal device, and can ensure the user's voice control experience in an offline state.
  • FIG. 2 is a schematic diagram of an application scenario of the speech signal processing method proposed by an embodiment of the present disclosure.
  • the application scenario includes a sound collection device 20 and a voice signal processing device 10 .
  • the sound collection device 20 and the voice signal processing device 10 are jointly provided on a terminal device, which can be an intelligent voice control device.
  • for example, the terminal device is at least one of an intelligent voice control device, a home appliance with voice control functions, a smartphone, a tablet computer, a notebook computer, a wearable device, etc.
  • the sound collection device 20 is a device for collecting voice data, such as a microphone array.
  • the voice signal processing device 10 is connected to the sound collection device 20 and is used to execute the voice signal processing method proposed in this disclosure to identify the voice signal collected by the sound collection device 20 and control the terminal to execute the response strategy corresponding to the recognition result.
  • FIG. 3 is a schematic flowchart of a speech signal processing method provided by an embodiment of the present disclosure. As shown in Figure 3, the speech signal processing method proposed by the embodiment of the present disclosure includes the following steps:
  • Step 301 Obtain the voice signal collected from the environment.
  • the environment consists of various natural factors.
  • the environment can include a real environment and a virtual environment.
  • the real environment is the environment that exists in real life
  • the virtual environment is the environment obtained by simulating the real environment.
  • the voice signal acquired in step 301 may be a voice signal collected from the environment by the sound collection device, or may be a voice signal processed by the sound collection device based on the voice signal collected from the environment.
  • the processing of speech signals by the sound collection device includes but is not limited to sound source localization, enhanced speech and speech endpoint detection, etc.
  • Step 302 Extract the speech features of the speech signal.
  • the speech signal processing device 10 can extract the Mel Frequency Cepstrum Coefficient (MFCC) of the speech signal to obtain the speech characteristics of the speech signal. That is to say, the speech feature of the speech signal is the Mel frequency cepstrum coefficient of the speech signal.
  • step 302 includes the following steps:
  • Step 3021 Perform frame processing on the voice signal.
  • the voice signal processing device 10 performs frame processing on the voice signal to obtain multiple audio frames. Framing processing is to divide the speech signal into multiple speech signal segments of fixed length. Each segment of speech signal is called an audio frame. Optionally, the frame length is 10-30ms.
  • Framing the speech signal can be achieved by adding a time window to the speech signal.
  • the signal part within each time window is an audio frame.
  • an overlapping segmentation method can be used when performing framing. That is, when the time window used to take audio frames from the speech signal is shifted, adjacent audio frames need to partially overlap; this overlapping region is called the frame shift.
  • the ratio of the frame shift to the frame length ranges from 0-1/2. Determining the position of the next frame by frame shifting can take advantage of the short-term stability of the signal to make a smooth transition between frames and maintain its continuity. At the same time, it can avoid the problem of information omission caused by the boundaries of the time window.
  • the frame length is 25 milliseconds (ms)
  • the frame shift is 10ms
  • the sampling frequency of the voice signal is 16,000 sampling points per second
  • the length of the voice signal is 1s.
  • the speech signal before performing frame processing on the speech signal, can also be pre-emphasized to enhance the high-frequency signal in the speech signal.
  • Step 3022 Perform Fourier transform on the frame-processed speech signal to obtain the spectrum of the speech signal.
  • Fourier transform is used to convert time domain signals into frequency domain signals.
  • Fourier transform can use fast Fourier transform method.
  • Step 3023 Pass the spectrum of the speech signal through Mel filtering to obtain the Mel spectrum.
  • the Mel filter can be a triangular filter bank. Filtering with a Mel filter can make the obtained spectral characteristics more consistent with the hearing characteristics of the human ear.
  • Step 3024 Perform cepstral analysis on the Mel frequency spectrum to obtain the Mel frequency cepstral coefficient of the speech signal.
  • Performing cepstral analysis on the Mel spectrum includes: obtaining the absolute value or logarithm of the Mel spectrum to obtain its energy value, and performing inverse transformation based on the energy value of the Mel spectrum to obtain the Mel frequency cepstral coefficient of the speech signal.
  • the absolute value or square of the Mel spectrum can be obtained to obtain the energy of the Mel spectrum.
  • a logarithmic operation can be applied to the outputs of all filters in the triangular filter bank to obtain the energy of the Mel spectrum.
  • a discrete cosine transform (DCT) can then be applied to the energy values of the Mel spectrum to obtain the Mel frequency cepstral coefficients of the speech signal, that is, the speech features of the speech signal.
  • the mel frequency cepstrum coefficient can be represented by the MFCC feature vector.
  • the length of the MFCC feature vector is 40
  • the speech feature is a 98×40 two-dimensional feature vector, similar to an image with a length of 98, a width of 40, and a channel number of 1.
  • since standard Mel frequency cepstral coefficients usually only reflect the static characteristics of the speech signal, after obtaining them, their differential spectrum can also be added on top to incorporate the dynamic characteristics of the speech signal into the speech features, thereby improving the accuracy of speech signal recognition.
  • the dynamic characteristics of the speech signal are used to indicate the trajectory of the speech signal over time.
  • Step 303 Perform convolution mixing processing according to the speech features to obtain shallow speech recognition features.
  • a convolution mixing module may be used to perform convolution mixing processing on speech features.
  • the convolution mixing module includes a spatial position convolution mixing module and a channel position convolution mixing module.
  • the mixing result of the spatial position convolution mixing module and the input of the spatial position convolution mixing module are input to the channel position convolution mixing module through a residual connection.
  • the embodiment of the present disclosure first uses a spatial position convolution mixing module to mix the spatial position features in the speech features, and then uses a channel position convolution mixing module to mix the channel position features in the speech features.
  • the spatial position convolution mixing module may include a depthwise separable convolution layer.
  • the channel position convolution mixing module includes a pointwise convolution layer. In this case, it can be ensured that the convolution mixing module has a small number of model parameters.
  • the spatial position convolution module may also include at least one of a first activation function layer and a first normalization layer.
  • the channel position convolution mixing module may also include at least one of a second activation function layer and a second normalization layer.
  • both the first activation function layer and the second activation function layer can use GELU (Gaussian Error Linear Unit) as the activation function, which can increase the nonlinearity of the matrix and further facilitates extracting correlations across different dimensions.
  • the convolution mixing module can be a ConvMixer model, which can perform convolution mixing processing according to the speech features to obtain shallow speech recognition features of the speech signal, and which has a smaller number of model parameters than a traditional fully convolutional model, so it can be better deployed on terminal devices.
  • multiple convolution mixing modules connected in series can be used, according to the computing power of the terminal or the needs of speech recognition, to obtain the shallow speech recognition features of the speech signal, and the number of convolution mixing modules can be determined according to application requirements; this disclosure does not specifically limit it here.
  • Step 304 Perform hybrid processing based on multi-layer perception based on shallow speech recognition features to obtain deep speech recognition features.
  • a multi-layer perceptual mixing module can be used to perform convolutional mixing processing on shallow speech recognition features.
  • the multi-layer perceptual mixing module includes a space-aware mixing module and a channel-aware mixing module.
  • the spatial-aware mixing module allows communication between different spatial locations, is used to perform spatial-aware mixing of the shallow speech recognition features, and inputs the mixing results into the channel-aware mixing module.
  • the channel-aware mixing module allows communication between different channels and performs channel-aware mixing on the results of the spatial-aware mixing.
  • the multi-layer perceptual mixing module also includes a first transposition module and a second transposition module.
  • the first transposition module is used to transpose the feature matrix of shallow speech recognition features.
  • the spatial awareness mixing module is used to perform spatial awareness mixing on the transposed shallow speech recognition features.
  • the second transpose module is used to transpose the feature vector of the mixed result.
  • the channel mixing aware module is used to perform channel aware mixing on the transposed mixing results.
  • the implementation process of the multi-layer perceptual mixing module in effect converts row features into column features, mixes the spatial features based on the column features, transposes the mixed column features back into row features, and then mixes the channel features based on the row features.
  • the multi-layer perceptual mixing module includes a fourth normalization layer, a first transposition module T1, a spatial perceptual mixing module (token-mixing MLP), and a second transposition module T2 , the fifth normalization layer and the channel-aware mixing module (channel-mixing MLP).
  • after the shallow speech recognition features pass through the fourth normalization layer, channel-based row vector features are obtained.
  • the first transposition module transposes the row vector features into column vector features, which are then input into the spatial-aware mixing layer for mixing to obtain mixed column vector features.
  • the second transposition module transposes the mixed column vector features to obtain row vector features, and inputs the row vector features into the fifth normalization layer for normalization processing.
  • the processed row vector features are input to the channel-aware mixing layer for mixing to obtain deep speech recognition features.
  • a skip-connection is used between the fifth normalization layer and the second transposition module.
  • the output of the channel-aware mixing module and the output of the second transpose module are connected using residuals.
  • the output of the second transpose module and the input of the fourth normalization layer are connected using residuals.
  • the space-aware mixing module includes a first fully connected layer and a second fully connected layer
  • the channel-aware mixing module includes a third fully connected layer and a fourth fully connected layer.
  • the space-aware mixing module may also include a third activation function layer.
  • the channel-aware mixing module may also include a fourth activation function layer.
  • both the third activation function layer and the fourth activation function layer may use GELU as the activation function.
  • feature mixing is performed based on column features through the spatial awareness mixing module, so that all columns can share parameters in the space awareness mixing module.
  • when feature mixing is performed based on row features through the channel-aware mixing module, all rows share the parameters of the channel-aware mixing module.
  • the alternate execution of two types of perceptual hybrid modules can promote information interaction between the two dimensions.
  • multiple multi-layer perceptual mixing modules connected in series can be used, according to the computing power of the terminal or the needs of speech recognition, to obtain the deep speech recognition features of the speech signal, and the number of these mixing modules can be determined according to application requirements, which is not specifically limited in this disclosure.
  • the convolutional mixing module and the multi-layer perceptual mixing module can form a mixing unit, and multiple mixing units in series can be used to obtain the deep speech recognition features of the speech signal.
  • processing the shallow speech recognition features with the multi-layer perceptual mixing module to obtain deep speech recognition features can further improve the spatial mixing effect and channel mixing effect between speech features, thereby effectively improving the expressiveness of the speech features and, in turn, the speech recognition performance based on them.
  • Step 305 Obtain the recognition result of the speech signal based on the deep speech recognition features.
  • the present disclosure classifies speech recognition features through a classifier to obtain the recognition result of the speech signal.
  • the classifier may include a 2D average pooling layer and a fully connected layer. After the deep speech recognition features are input into the 2D average pooling layer, 2D average pooling is performed on them; the pooling result is then input into a fully connected layer using the softmax activation function, and an N+2-class output is obtained. The class with the highest probability among the N+2 output classes is the recognition result.
  • N is the number of wake-up words or command words obtained by the speech signal processing device through training. 2 represents two categories: silence and unknown.
  • the wake-up word is used to indicate the name of the awakened object, which includes but is not limited to "Little X Little X", “Hello Little X”, etc. Moreover, the wake-up word can be a string consisting of no less than 4 phrases to improve the precise wake-up of the awakened object.
  • Command words include but are not limited to control commands such as "turn on the air conditioner”, “increase volume”, “turn up brightness”, etc.
  • the model used to process speech features in this disclosure can, after training, be converted into the tflite format to run directly on smartphones, tablets and other smart terminals via Java or C++.
  • Step 306 According to the recognition result, execute the response strategy corresponding to the recognition result.
  • when the recognition result is a wake-up word, the intelligent voice control device is controlled to respond, for example with "Xiao X is here", "I'm here", etc.
  • the recognition result is a command word
  • the corresponding intelligent terminal is controlled to execute the control command.
  • the command word is "turn on the air conditioner”
  • the air conditioner is controlled to be turned on, and the air conditioner is controlled to operate according to the preset control strategy or the last set control strategy.
  • the command word is "increase volume”
  • the player currently playing is controlled to increase the volume by one level.
  • the command word is "increase brightness”
  • the lighting device currently in the lighting state is controlled to increase the brightness by one level.
  • optionally, before performing convolution mixing processing according to the speech features to obtain shallow speech recognition features, the method also includes: downsampling the speech features. In that case, performing convolution mixing processing according to the speech features to obtain shallow speech recognition features includes: performing convolution mixing processing based on the downsampled speech features to obtain the shallow speech recognition features.
  • a feature embedding module can be used to downsample speech features.
  • the feature embedding module includes a feature embedding layer.
  • the feature embedding module also includes a fifth activation function layer and a third normalization layer.
  • the feature embedding layer can be a 2D convolution layer with 64 convolution kernels and 2 channels. Therefore, the feature input to the convolutional mixing module after being processed by the feature embedding module is a tensor of [batch size, 49, 20, 64].
  • This feature embedding module can complete all of the downsampling of the neural network, effectively reducing the resolution of the feature map and increasing the receptive field, making it easier for the convolution mixing module and the multi-layer perception-based mixing module to capture spatial information from farther away.
  • the speech signal processing method includes the following steps:
  • Step 601 Obtain the voice signal collected from the environment.
  • Step 602 Extract speech features of the speech signal.
  • Step 603 Input the speech features into the feature embedding module to obtain down-sampled speech features.
  • Step 604 Input the down-sampled speech features to multiple convolution mixing modules connected in series to obtain shallow speech recognition features.
  • Step 605 Input shallow speech recognition features to multiple multi-layer perceptual mixing modules connected in series to obtain deep speech recognition features.
  • Step 606 Input the deep speech recognition features to the classifier to obtain the speech recognition result.
  • Step 607 Execute the response strategy corresponding to the speech recognition result.
  • the present disclosure also verifies the effectiveness of the speech signal processing method proposed by the present disclosure.
  • the Google Speech Commands V2 (GSC-V2) data set is used to test the speech signal processing method proposed in the embodiment of the present disclosure.
  • GSC-V2 contains 105,829 command-word utterances from 2,618 speakers, covering 35 command words including 'down', 'go', 'left', 'no', 'off', 'on', 'right', 'stop', 'up', 'yes', etc.
  • the ConvMlp-Mixer-S model corresponding to the speech signal processing method proposed in the embodiment of the present disclosure achieves an accuracy of 96.24 with only 96k parameters, while ConvMlp-Mixer-L achieves an accuracy of 97.77 with 0.299M parameters. Compared with MLP-based and transformer-based models, it performs better while having fewer model parameters, which demonstrates the effectiveness of the speech signal processing method provided by the embodiment of the present disclosure.
  • the speech signal processing method proposed by the embodiment of the present disclosure can effectively perform convolution mixing processing and multi-layer perception-based mixing processing on the speech features of the speech signal.
  • the number of parameters of the speech recognition model is reduced, so that the device executing the speech signal processing can be better deployed on terminal devices, providing a faster speech recognition response, improving the response efficiency of the terminal device to voice commands and helping to improve user experience.
  • Figure 9 is a block diagram of a voice signal processing device in an embodiment of the present disclosure.
  • the speech signal processing device 10 includes:
  • the acquisition module 11 is used to acquire speech signals collected from the environment.
  • the speech feature extraction module 12 is used to extract the speech features of the speech signal.
  • the convolution mixing module 13 is used to perform convolution mixing processing based on speech features to obtain shallow speech recognition features.
  • the multi-layer perceptual mixing module 14 is used to perform multi-layer perceptual mixing based on shallow speech recognition features to obtain deep speech recognition features.
  • the classification module 15 is used to obtain the recognition result of the speech signal based on the deep speech recognition features.
  • the execution module 16 is used to execute the response strategy corresponding to the recognition result according to the recognition result.
  • the convolution mixing module 13 includes a spatial position convolution mixing module and a channel position convolution mixing module.
  • the mixing result of the spatial position convolution mixing module and the input of the spatial position convolution mixing module are input to the channel position convolution mixing module through a residual connection.
  • the spatial position convolution mixing module includes a depthwise separable convolution layer, and the channel position convolution mixing module includes a pointwise convolution layer.
  • the spatial position convolution mixing module further includes at least one of a first activation function layer and a first normalization layer, and the channel position convolution mixing module further includes at least one of a second activation function layer and a second normalization layer.
  • the speech signal processing device 10 includes a plurality of convolution mixing modules 13 connected in series.
  • the multi-layer perceptual mixing module 14 includes a spatial-aware mixing module and a channel-aware mixing module.
  • the spatial-aware mixing module performs spatial-aware mixing on shallow speech recognition features and provides the mixing results to the channel-aware mixing module.
  • the channel-aware mixing module performs channel-aware blending of the blending results.
  • the multi-layer perceptual mixing module 14 further includes: a first transposition module and a second transposition module; the first transposition module is used to transpose the feature vector of the shallow speech recognition features, and the second transposition module is used to transpose the feature vector of the mixing result.
  • the spatial-aware mixing module includes a first fully connected layer and a second fully-connected layer
  • the channel-aware mixing module includes a third fully connected layer and a fourth fully connected layer.
  • the spatial-aware mixing module further includes a third activation function layer
  • the channel-aware mixing module further includes a fourth activation function layer.
  • the voice signal processing device 10 further includes:
  • Feature embedding module 17 used to downsample speech features.
  • the convolution mixing module 13 is specifically used for:
  • Convolutional mixing processing is performed based on the downsampled speech features to obtain shallow speech recognition features.
  • feature embedding module 17 includes a feature embedding layer.
  • the feature embedding module 17 further includes: a fifth activation function layer and a third normalization layer.
  • the speech feature extraction module 12 is specifically used to:
  • the various units or modules recorded in the speech signal processing device 10 correspond to various steps in the method described with reference to FIG. 2 . Therefore, the operations and features described above for the method are also applicable to the speech signal processing device 10 and the units included therein, and will not be described again here.
  • the voice signal processing device 10 can be pre-implemented in the browser or other security applications of the electronic device, or can be loaded into the browser of the electronic device or its security applications through downloading or other methods. Corresponding units in the voice signal processing device 10 may cooperate with units in the electronic device to implement the solutions of the embodiments of the present disclosure.
  • the voice signal processing device proposed in the embodiments of the present disclosure, by performing convolution mixing processing and multi-layer perception-based mixing processing on the speech features extracted from the speech signal, can effectively recognize voice commands while reducing the number of parameters of the speech recognition model, so that the speech signal processing device can be better deployed on terminal devices, can provide faster speech recognition responses, improves the response efficiency of terminal devices to voice commands, and helps improve user experience.
  • FIG. 11 shows a schematic structural diagram of a computer system suitable for implementing an electronic device or server according to an embodiment of the present disclosure.
  • the computer system includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage portion 1008 into a random access memory (RAM) 1003.
  • the RAM 1003 also stores various programs and data required for the operation of the system.
  • the CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004.
  • An input/output (I/O) interface 1005 is also connected to bus 1004.
  • the following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, etc.; an output section 1007 including a cathode ray tube (CRT) or liquid crystal display (LCD), speakers, etc.; a storage section 1008 including a hard disk, etc.; and a communication section 1009 including a network interface card such as a LAN card or a modem.
  • the communication section 1009 performs communication processing via a network such as the Internet.
  • Driver 1010 is also connected to I/O interface 1005 as needed.
  • Removable media 1011 such as magnetic disks, optical disks, magneto-optical disks, semiconductor memories, etc., are installed on the drive 1010 as needed, so that a computer program read therefrom is installed into the storage portion 1008 as needed.
  • the process described above with reference to the flowchart FIG. 2 may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication portion 1009 and/or installed from removable media 1011.
  • this computer program is executed by the central processing unit (CPU) 1001, the above-described functions defined in the system of the present disclosure are performed.
  • the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. More specific examples of computer readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any suitable combination of the foregoing.
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units or modules described in the embodiments of the present disclosure may be implemented in software or hardware.
  • the described unit or module can also be provided in a processor.
  • a processor includes an acquisition module, a speech feature extraction module, a convolution mixing module, a multi-layer perceptual mixing module, a classification module and an execution module.
  • the names of these units or modules do not constitute a limitation on the unit or module itself under certain circumstances.
  • the acquisition module can also be described as "acquiring the collected voice signal".
  • the present disclosure also provides a computer-readable storage medium.
  • the computer-readable storage medium may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device.
  • the above computer-readable storage medium stores one or more programs, which, when executed by one or more processors, perform the speech signal processing method described in the present disclosure.

Abstract

The present disclosure discloses a speech signal processing method, apparatus, device and medium. The method includes: acquiring a speech signal collected from the environment; extracting speech features of the speech signal; performing convolution mixing processing according to the speech features to obtain shallow speech recognition features; performing multi-layer perception-based mixing processing according to the shallow speech recognition features to obtain deep speech recognition features; obtaining a recognition result of the speech signal according to the deep speech recognition features; and executing a response strategy corresponding to the recognition result according to the recognition result. The method can provide a faster speech recognition response and improve user experience.

Description

Speech signal processing method, apparatus, device and medium
The present disclosure claims priority to Chinese patent application No. 202210560595.0, entitled "Speech signal processing method, apparatus, device and medium", filed on May 20, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure generally relates to the field of natural language processing, and relates to a speech signal processing method, apparatus, device and medium.
Background
With the development of artificial intelligence algorithms and hardware technologies such as AI chips, smart devices have been widely used in daily life, for example smart home voice control systems, smart speakers and smart conference systems. Voice interaction is used extremely widely in smart devices and is increasingly mature. In a traditional voice interaction scenario, a button has to be clicked manually (such as pressing a recording key) to perform voice input; the input speech wakes up the device, which can then be interacted with. To further improve the human-machine interaction experience, voice wake-up technology has emerged.
At present, there are three main voice wake-up approaches: wake-up technology based on template matching, wake-up technology based on hidden Markov models, and wake-up technology based on deep learning. Among them, wake-up technology based on deep learning is the most widely used.
Summary
The present disclosure provides a speech signal processing method, apparatus, device and medium, which can provide a faster speech recognition response and improve user experience.
In a first aspect, an embodiment of the present disclosure provides a speech signal processing method, including:
acquiring a speech signal collected from the environment;
extracting speech features of the speech signal;
performing convolution mixing processing according to the speech features to obtain shallow speech recognition features;
performing multi-layer perception-based mixing processing according to the shallow speech recognition features to obtain deep speech recognition features;
obtaining a recognition result of the speech signal according to the deep speech recognition features;
executing a response strategy corresponding to the recognition result according to the recognition result.
In some embodiments, performing convolution mixing processing according to the speech features to obtain shallow speech recognition features includes:
performing convolution mixing processing on the speech features using a convolution mixing module;
wherein the convolution mixing module includes a spatial position convolution mixing module and a channel position convolution mixing module, and the mixing result of the spatial position convolution mixing module and the input of the spatial position convolution mixing module are input to the channel position convolution mixing module through a residual connection.
In some embodiments, the spatial position convolution mixing module includes a depthwise separable convolution layer, and the channel position convolution mixing module includes a pointwise convolution layer.
In some embodiments, the spatial position convolution mixing module further includes at least one of a first activation function layer and a first normalization layer, and the channel position convolution mixing module further includes at least one of a second activation function layer and a second normalization layer.
In some embodiments, performing convolution mixing processing on the speech features using a convolution mixing module includes:
performing convolution mixing processing on the speech features using a plurality of convolution mixing modules connected in series.
In some embodiments, performing multi-layer perception-based mixing processing according to the shallow speech recognition features to obtain deep speech recognition features includes:
performing convolution mixing processing on the shallow speech recognition features using a multi-layer perceptual mixing module,
wherein the multi-layer perceptual mixing module includes a spatial-aware mixing module and a channel-aware mixing module; the spatial-aware mixing module performs spatial-aware mixing on the shallow speech recognition features and provides the mixing result to the channel-aware mixing module, and the channel-aware mixing module performs channel-aware mixing on the mixing result.
In some embodiments, the multi-layer perceptual mixing module further includes: a first transposition module and a second transposition module; the first transposition module is used to transpose the feature vector of the shallow speech recognition features, and the second transposition module is used to transpose the feature vector of the mixing result.
In some embodiments, the spatial-aware mixing module includes a first fully connected layer and a second fully connected layer, and the channel-aware mixing module includes a third fully connected layer and a fourth fully connected layer.
In some embodiments, the spatial-aware mixing module further includes a third activation function layer, and the channel-aware mixing module further includes a fourth activation function layer.
In some embodiments, before performing convolution mixing processing according to the speech features to obtain shallow speech recognition features, the method further includes:
downsampling the speech features;
and performing convolution mixing processing according to the speech features to obtain shallow speech recognition features includes:
performing convolution mixing processing based on the downsampled speech features to obtain the shallow speech recognition features.
In some embodiments, downsampling the speech features includes:
downsampling the speech features using a feature embedding module;
wherein the feature embedding module includes a feature embedding layer.
In some embodiments, the feature embedding module further includes: a fifth activation function layer and a third normalization layer.
In some embodiments, extracting speech features of the speech signal includes:
extracting Mel frequency cepstral coefficients of the speech signal to obtain the speech features of the speech signal.
In a second aspect, an embodiment of the present disclosure provides a speech signal processing apparatus, including:
an acquisition module, configured to acquire a speech signal collected from the environment;
a speech feature extraction module, configured to extract speech features of the speech signal;
a convolution mixing module, configured to perform convolution mixing processing according to the speech features to obtain shallow speech recognition features;
a multi-layer perceptual mixing module, configured to perform multi-layer perception-based mixing processing according to the shallow speech recognition features to obtain deep speech recognition features;
a classification module, configured to obtain a recognition result of the speech signal according to the deep speech recognition features;
an execution module, configured to execute a response strategy corresponding to the recognition result according to the recognition result.
In some embodiments, the convolution mixing module includes a spatial position convolution mixing module and a channel position convolution mixing module, and the mixing result of the spatial position convolution mixing module and the input of the spatial position convolution mixing module are input to the channel position convolution mixing module through a residual connection.
In some embodiments, the spatial position convolution mixing module includes a depthwise separable convolution layer, and the channel position convolution mixing module includes a pointwise convolution layer.
In some embodiments, the spatial position convolution mixing module further includes at least one of a first activation function layer and a first normalization layer, and the channel position convolution mixing module further includes at least one of a second activation function layer and a second normalization layer.
In some embodiments, the speech signal processing apparatus includes a plurality of convolution mixing modules connected in series.
In some embodiments, the multi-layer perceptual mixing module includes a spatial-aware mixing module and a channel-aware mixing module; the spatial-aware mixing module performs spatial-aware mixing on the shallow speech recognition features and provides the mixing result to the channel-aware mixing module, and the channel-aware mixing module performs channel-aware mixing on the mixing result.
In some embodiments, the multi-layer perceptual mixing module further includes: a first transposition module and a second transposition module; the first transposition module is used to transpose the feature vector of the shallow speech recognition features, and the second transposition module is used to transpose the feature vector of the mixing result.
In some embodiments, the spatial-aware mixing module includes a first fully connected layer and a second fully connected layer, and the channel-aware mixing module includes a third fully connected layer and a fourth fully connected layer.
In some embodiments, the spatial-aware mixing module further includes a third activation function layer, and the channel-aware mixing module further includes a fourth activation function layer.
In some embodiments, the speech signal processing apparatus further includes:
a feature embedding module, configured to downsample the speech features;
correspondingly, the convolution mixing module is specifically configured to:
perform convolution mixing processing based on the downsampled speech features to obtain the shallow speech recognition features.
In some embodiments, the feature embedding module includes a feature embedding layer.
In some embodiments, the feature embedding module further includes: a fifth activation function layer and a third normalization layer.
In some embodiments, the speech feature extraction module is specifically configured to:
extract Mel frequency cepstral coefficients of the speech signal to obtain the speech features of the speech signal.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the method described in the embodiments of the present disclosure is implemented.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method described in the embodiments of the present disclosure is implemented.
Additional aspects and advantages of the present disclosure will be given in part in the following description, will become apparent in part from the following description, or will be learned through practice of the present invention.
Brief Description of the Drawings
Figure 1 shows the system architecture in the related art;
Figure 2 is an application scenario diagram of a speech signal processing method provided by an embodiment of the present disclosure;
Figure 3 is a schematic flowchart of a speech signal processing method provided by an embodiment of the present disclosure;
Figure 4 is a schematic flowchart of extracting speech features provided by an embodiment of the present disclosure;
Figure 5 is a schematic structural diagram of a convolution mixing module provided by an embodiment of the present disclosure;
Figure 6 is a schematic structural diagram of a multi-layer perceptual mixing module provided by an embodiment of the present disclosure;
Figure 7 is a schematic flowchart of another speech signal processing method provided by an embodiment of the present disclosure;
Figure 8 is a schematic diagram of the model structure corresponding to Figure 7;
Figure 9 is a block diagram of a speech signal processing apparatus provided by an embodiment of the present disclosure;
Figure 10 is a block diagram of another speech signal processing apparatus provided by an embodiment of the present disclosure;
Figure 11 is a schematic structural diagram of a computer system of an electronic device or server provided by an embodiment of the present disclosure.
Detailed Description
The present disclosure is further described in detail below with reference to the drawings and embodiments. It can be understood that the specific embodiments described here are only used to explain the related invention, not to limit it. It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings.
It should be noted that, as long as there is no conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other. The present disclosure will be described in detail below with reference to the drawings and in combination with embodiments.
With the development of artificial intelligence algorithms and hardware technologies such as AI chips, smart devices have been widely used in daily life, for example smart home voice control systems, smart speakers and smart conference systems. Before executing a voice control strategy, these smart devices need to be woken up, that is, a wake-up operation needs to be performed. In the related art, the initial wake-up method was to manually click a button (such as pressing a recording key) for voice input. Later, wake-up technologies based on, for example, deep learning models gradually took shape, forming the system architecture shown in Figure 1. As shown in Figure 1, the client collects the user's speech signal in real time and then sends it to a cloud server through a wireless network. A deep learning model on the cloud server parses the speech signal to obtain a speech recognition result, or obtains a voice control instruction based on the speech recognition result. The cloud server then returns the speech recognition result or voice control instruction to the client. The client executes a corresponding response strategy based on the returned speech recognition result or voice control instruction, for example responding to the user in reply to a wake-up sentence, or executing the control scheme corresponding to the voice control instruction. It can be seen that, because the terminal cannot perform speech recognition on its own, speech recognition cannot be carried out when the network connection is poor, which affects the user's experience with voice-controlled devices. Moreover, these technologies all have a large number of model parameters and need to be deployed in the cloud for computation, resulting in slow wake-up responses and a poor user experience.
The present disclosure proposes a speech signal processing method that has a small number of model parameters, can be deployed in a terminal device, and can ensure the user's voice control experience in an offline state.
Figure 2 is a schematic diagram of an application scenario of the speech signal processing method proposed by an embodiment of the present disclosure. Referring to Figure 2, the application scenario includes a sound collection device 20 and a speech signal processing apparatus 10.
The sound collection device 20 and the speech signal processing apparatus 10 are jointly provided on a terminal device. The terminal device can be an intelligent voice control device, for example, at least one of an intelligent voice control device, a home appliance with voice control functions, a smartphone, a tablet computer, a notebook computer, a wearable device, etc.
The sound collection device 20 is a device for collecting voice data, such as a microphone array. The speech signal processing apparatus 10 is connected to the sound collection device 20 and is used to execute the speech signal processing method proposed in the present disclosure, so as to recognize the speech signal collected by the sound collection device 20 and control the terminal to execute the response strategy corresponding to the recognition result.
Figure 3 is a schematic flowchart of a speech signal processing method provided by an embodiment of the present disclosure. As shown in Figure 3, the speech signal processing method proposed by the embodiment of the present disclosure includes the following steps:
Step 301: Acquire a speech signal collected from the environment.
The environment is composed of various natural factors. The environment may include a real environment and a virtual environment; the real environment is an environment that exists in real life, and the virtual environment is an environment obtained by simulating the real environment.
The speech signal acquired in step 301 may be a speech signal collected from the environment by the sound collection device, or may be a speech signal obtained after the sound collection device processes the speech signal collected from the environment. The processing of the speech signal by the sound collection device includes but is not limited to sound source localization, speech enhancement, speech endpoint detection, and the like.
Step 302: Extract the speech features of the speech signal.
Optionally, the speech signal processing apparatus 10 may extract the Mel Frequency Cepstrum Coefficients (MFCC) of the speech signal to obtain the speech features of the speech signal. That is, the speech features of the speech signal are the Mel frequency cepstral coefficients of the speech signal.
As shown in Figure 4, the implementation of step 302 includes the following steps:
Step 3021: Perform framing processing on the speech signal.
The speech signal processing apparatus 10 performs framing processing on the speech signal to obtain multiple audio frames. Framing divides the speech signal into multiple speech signal segments of fixed length; each segment is called an audio frame. Optionally, the frame length is 10-30 ms.
Framing the speech signal can be achieved by applying a time window to the speech signal; the signal within each time window is one audio frame. In one implementation, to avoid information being missed at the boundaries of the time window, an overlapping segmentation method can be used during framing. That is, when the time window used to take audio frames from the speech signal is shifted, adjacent audio frames need to partially overlap; this overlapping region is called the frame shift. Optionally, the ratio of the frame shift to the frame length ranges from 0 to 1/2. Determining the position of the next frame through the frame shift exploits the short-term stationarity of the signal, making the transition between frames smooth and maintaining continuity, while avoiding information being missed at the boundaries of the time window. In the embodiment of the present disclosure, the frame length is 25 milliseconds (ms), the frame shift is 10 ms, the sampling frequency of the speech signal is 16,000 samples per second, and the length of the speech signal is 1 s.
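As a quick check of these framing parameters (a sketch assuming the final partial frame is not padded), the frame count for a 1 s, 16 kHz signal works out as follows:

```python
sample_rate = 16000                       # samples per second
frame_len = int(0.025 * sample_rate)      # 25 ms frame -> 400 samples
frame_shift = int(0.010 * sample_rate)    # 10 ms shift -> 160 samples
num_samples = 1 * sample_rate             # 1 s of audio
num_frames = 1 + (num_samples - frame_len) // frame_shift
print(num_frames)                         # 98, matching the 98x40 feature described below
```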
Optionally, before framing the speech signal, the speech signal can also be pre-emphasized to enhance the high-frequency components of the speech signal.
Step 3022: Perform a Fourier transform on the framed speech signal to obtain the spectrum of the speech signal.
Performing a Fourier transform on each audio frame yields the spectrum corresponding to each audio frame. The Fourier transform converts a time-domain signal into a frequency-domain signal; a fast Fourier transform can be used.
Step 3023: Pass the spectrum of the speech signal through Mel filtering to obtain the Mel spectrum.
The spectrum corresponding to each audio frame is filtered with a Mel filter to obtain the spectral features of each audio frame. The Mel filter can be a triangular filter bank. Filtering with a Mel filter makes the obtained spectral features more consistent with the auditory characteristics of the human ear.
Step 3024: Perform cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients of the speech signal.
Performing cepstral analysis on the Mel spectrum includes: taking the absolute value or logarithm of the Mel spectrum to obtain its energy values, and performing an inverse transform based on the energy values of the Mel spectrum to obtain the Mel frequency cepstral coefficients of the speech signal.
After obtaining the Mel spectrum of the speech signal, the absolute value or square of the Mel spectrum can be taken to obtain the energy of the Mel spectrum. For example, a logarithmic operation can be applied to the outputs of all filters in the triangular filter bank to obtain the energy of the Mel spectrum. Then, a discrete cosine transform (DCT) can be applied to the energy values of the Mel spectrum of the speech signal to obtain the Mel frequency cepstral coefficients of the speech signal, that is, the speech features of the speech signal. The Mel frequency cepstral coefficients can be represented by an MFCC feature vector. In the embodiment of the present disclosure, the length of the MFCC feature vector is 40, and the speech feature is a 98×40 two-dimensional feature vector, similar to an image with a length of 98, a width of 40 and 1 channel.
It should be noted that, since standard Mel frequency cepstral coefficients usually only reflect the static characteristics of the speech signal, after obtaining the Mel frequency cepstral coefficients of the speech signal, their differential spectrum can also be added on top of them, so as to add the dynamic characteristics of the speech signal to the speech features and thereby improve the accuracy of speech signal recognition. The dynamic characteristics of the speech signal indicate the trajectory of the speech signal over time.
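The front-end described in steps 3021-3024 can be sketched, for example, with librosa; the library choice, and the use of its built-in pre-emphasis and delta helpers, are illustrative assumptions rather than part of the disclosure:

```python
import librosa
import numpy as np

def extract_mfcc(wav_path: str) -> np.ndarray:
    """Pre-emphasis, framing, FFT, Mel filtering, log energy and DCT,
    yielding a (frames, 40) MFCC matrix (98 x 40 for a 1 s clip)."""
    y, sr = librosa.load(wav_path, sr=16000, duration=1.0)
    y = librosa.effects.preemphasis(y, coef=0.97)           # boost high-frequency components
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                                n_fft=400, hop_length=160,  # 25 ms frames, 10 ms shift
                                center=False)
    # Optionally append the differential spectrum to capture dynamic characteristics:
    # mfcc = np.concatenate([mfcc, librosa.feature.delta(mfcc)], axis=0)
    return mfcc.T
```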
Step 303: Perform convolution mixing processing according to the speech features to obtain shallow speech recognition features.
In the embodiment of the present disclosure, a convolution mixing module can be used to perform convolution mixing processing on the speech features. The convolution mixing module includes a spatial position convolution mixing module and a channel position convolution mixing module. The mixing result of the spatial position convolution mixing module and the input of the spatial position convolution mixing module are input to the channel position convolution mixing module through a residual connection. In one implementation, the embodiment of the present disclosure first uses the spatial position convolution mixing module to mix the spatial position features in the speech features, and then uses the channel position convolution mixing module to mix the channel position features in the speech features.
In one implementation, as shown in Figure 5, the spatial position convolution mixing module may include a depthwise separable convolution layer, and the channel position convolution mixing module includes a pointwise convolution layer. In this case, it can be ensured that the convolution mixing module has a small number of model parameters. Optionally, the spatial position convolution mixing module may also include at least one of a first activation function layer and a first normalization layer. Similarly, the channel position convolution mixing module may also include at least one of a second activation function layer and a second normalization layer. Both the first activation function layer and the second activation function layer can use GELU as the activation function, which can increase the nonlinearity of the matrix and further facilitates the extraction of correlations across different dimensions.
For example, the convolution mixing module can be a ConvMixer model, which can perform convolution mixing processing according to the speech features to obtain the shallow speech recognition features of the speech signal, and which has a smaller number of model parameters than a traditional fully convolutional model, so it can be better deployed on terminal devices.
Optionally, when performing convolution mixing processing on the speech features, multiple convolution mixing modules connected in series can be used, according to the computing power of the terminal or the needs of speech recognition, to obtain the shallow speech recognition features of the speech signal, and the number of convolution mixing modules can be determined according to application requirements, which is not specifically limited in this disclosure.
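A minimal sketch of one convolution mixing module in tf.keras is shown below; the kernel size, the use of batch normalization and the GELU activation are assumptions where the text leaves these choices open:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_mixing_block(x, dim: int = 64, kernel_size: int = 9):
    """Spatial position mixing (depthwise conv) followed, via a residual
    connection, by channel position mixing (pointwise 1x1 conv)."""
    residual = x
    # Spatial position convolution mixing: mixes across time/frequency positions per channel.
    x = layers.DepthwiseConv2D(kernel_size, padding="same")(x)
    x = layers.Activation("gelu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, residual])          # residual connection into the next module
    # Channel position convolution mixing: mixes across channels at each position.
    x = layers.Conv2D(dim, kernel_size=1)(x)
    x = layers.Activation("gelu")(x)
    x = layers.BatchNormalization()(x)
    return x
```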
Step 304: Perform multi-layer perception-based mixing processing according to the shallow speech recognition features to obtain deep speech recognition features.
In the embodiment of the present disclosure, a multi-layer perceptual mixing module can be used to perform convolution mixing processing on the shallow speech recognition features. The multi-layer perceptual mixing module includes a spatial-aware mixing module and a channel-aware mixing module. The spatial-aware mixing module allows communication between different spatial positions and is used to perform spatial-aware mixing on the shallow speech recognition features and input the mixing result into the channel-aware mixing module. The channel-aware mixing module allows communication between different channels and is used to perform channel-aware mixing on the result of the spatial-aware mixing.
Optionally, the multi-layer perceptual mixing module further includes a first transposition module and a second transposition module. The first transposition module is used to transpose the feature matrix of the shallow speech recognition features; in this case, the spatial-aware mixing module performs spatial-aware mixing on the transposed shallow speech recognition features. The second transposition module is used to transpose the feature vector of the mixing result; in this case, the channel-aware mixing module performs channel-aware mixing on the transposed mixing result. In other words, the multi-layer perceptual mixing module in effect converts row features into column features, mixes the spatial features based on the column features, transposes the mixed column features back into row features, and then mixes the channel features based on the row features.
In one implementation, as shown in Figure 6, the multi-layer perceptual mixing module includes a fourth normalization layer, a first transposition module T1, a spatial-aware mixing module (token-mixing MLP), a second transposition module T2, a fifth normalization layer and a channel-aware mixing module (channel-mixing MLP). After the shallow speech recognition features pass through the fourth normalization layer, channel-based row vector features are obtained; the first transposition module transposes the row vector features into column vector features, which are then input into the spatial-aware mixing layer for mixing to obtain mixed column vector features; the second transposition module transposes the mixed column vector features back into row vector features, which are input into the fifth normalization layer for normalization; the normalized row vector features are input into the channel-aware mixing layer for mixing to obtain the deep speech recognition features. A skip-connection is used between the fifth normalization layer and the second transposition module. The output of the channel-aware mixing module and the output of the second transposition module are connected by a residual connection. The output of the second transposition module and the input of the fourth normalization layer are connected by a residual connection.
Optionally, the spatial-aware mixing module includes a first fully connected layer and a second fully connected layer, and the channel-aware mixing module includes a third fully connected layer and a fourth fully connected layer. Optionally, the spatial-aware mixing module may also include a third activation function layer, and the channel-aware mixing module may also include a fourth activation function layer. Both the third activation function layer and the fourth activation function layer may use GELU as the activation function.
It should be understood that, in the embodiment of the present disclosure, performing feature mixing based on column features through the spatial-aware mixing module allows all columns to share the parameters of the spatial-aware mixing module, and performing feature mixing based on row features through the channel-aware mixing module allows all rows to share the parameters of the channel-aware mixing module. Alternating the two types of perceptual mixing modules promotes information interaction between the two dimensions.
Optionally, when performing multi-layer perception-based mixing processing on the shallow speech recognition features, multiple multi-layer perceptual mixing modules connected in series can be used, according to the computing power of the terminal or the needs of speech recognition, to obtain the deep speech recognition features of the speech signal, and the number of mixing modules can be determined according to application requirements, which is not specifically limited in this disclosure. Moreover, a convolution mixing module and a multi-layer perceptual mixing module can form a mixing unit, and multiple mixing units connected in series can be used to obtain the deep speech recognition features of the speech signal.
It should also be understood that obtaining deep speech recognition features by processing the shallow speech recognition features with the multi-layer perceptual mixing module can further improve the spatial mixing effect and channel mixing effect between speech features, thereby effectively improving the expressiveness of the speech features and, in turn, the speech recognition performance based on the speech features.
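Continuing the tf.keras sketch above, one multi-layer perceptual mixing module can be written as follows, operating on a [batch, positions, channels] tensor; the hidden widths are assumptions, since the disclosure only fixes the module order and the residual/skip connections:

```python
def mlp_mixing_block(x, tokens_hidden: int = 128, channels_hidden: int = 256):
    """Spatial-aware (token-mixing) MLP between two transpositions, then a
    channel-aware (channel-mixing) MLP, each wrapped in a residual connection."""
    num_positions = x.shape[1]
    num_channels = x.shape[2]
    y = layers.LayerNormalization()(x)                       # fourth normalization layer
    y = layers.Permute((2, 1))(y)                            # first transposition module T1
    y = layers.Dense(tokens_hidden, activation="gelu")(y)    # first fully connected layer
    y = layers.Dense(num_positions)(y)                       # second fully connected layer
    y = layers.Permute((2, 1))(y)                            # second transposition module T2
    x = layers.Add()([x, y])                                 # residual connection
    y = layers.LayerNormalization()(x)                       # fifth normalization layer
    y = layers.Dense(channels_hidden, activation="gelu")(y)  # third fully connected layer
    y = layers.Dense(num_channels)(y)                        # fourth fully connected layer
    return layers.Add()([x, y])                              # residual connection
```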
Step 305: Obtain the recognition result of the speech signal according to the deep speech recognition features.
In one implementation, the present disclosure classifies the speech recognition features through a classifier to obtain the recognition result of the speech signal. For example, the classifier may include a 2D average pooling layer and a fully connected layer. After the deep speech recognition features are input into the 2D average pooling layer, the 2D average pooling layer performs 2D average pooling on the deep speech recognition features; after the pooling result is input into a fully connected layer using the softmax activation function, an N+2-class output is obtained, and the class with the highest probability among the N+2 classes is the recognition result. N is the number of wake-up words or command words obtained by the speech signal processing apparatus through training, and 2 represents the two classes "silence" and "unknown". The wake-up word indicates the name of the object to be woken up, including but not limited to "Xiao X Xiao X", "Hello Xiao X", etc. Moreover, the wake-up word can be a string consisting of no fewer than 4 phrases to improve the precision of waking up the target object. Command words include but are not limited to control commands such as "turn on the air conditioner", "increase volume", "turn up brightness", etc.
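A sketch of this classifier head, assuming the deep features are kept as (or reshaped back to) a 4-D feature map so that 2D average pooling applies:

```python
def classification_head(x, num_keywords: int):
    """2D average pooling followed by a softmax fully connected layer with
    N + 2 outputs: N wake-up/command words plus 'silence' and 'unknown'."""
    x = layers.GlobalAveragePooling2D()(x)
    return layers.Dense(num_keywords + 2, activation="softmax")(x)
```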
Optionally, the model used in the present disclosure to process the speech features can, after training, be converted into the tflite format so that it runs directly on smartphones, tablets and other smart terminals via Java or C++.
Step 306: Execute the response strategy corresponding to the recognition result according to the recognition result.
For example, when the recognition result is a wake-up word, the intelligent voice control device is controlled to respond, for example with "Xiao X is here", "I'm here", etc. When the recognition result is a command word, the corresponding smart terminal is controlled to execute the control command. For example, when the command word is "turn on the air conditioner", the air conditioner is turned on and controlled to operate according to a preset control strategy or the last set control strategy. When the command word is "increase volume", the player currently playing is controlled to increase the volume by one level. When the command word is "turn up brightness", the lighting device currently in the lighting state is controlled to increase the brightness by one level.
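Step 306 can be realized, for example, as a simple mapping from recognized labels to device actions; the handler bodies below are purely illustrative placeholders, not part of the disclosure:

```python
def execute_response(label: str) -> None:
    """Dispatch a recognized wake-up word or command word to a response strategy."""
    handlers = {
        "wake_word": lambda: print("Xiao X is here"),                   # respond to the wake-up word
        "turn on the air conditioner": lambda: print("air conditioner on"),
        "increase volume": lambda: print("volume +1"),
        "turn up brightness": lambda: print("brightness +1"),
    }
    handlers.get(label, lambda: None)()                                 # ignore 'silence' / 'unknown'
```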
Optionally, before performing convolution mixing processing according to the speech features to obtain shallow speech recognition features, the method further includes: downsampling the speech features. In that case, performing convolution mixing processing according to the speech features to obtain shallow speech recognition features includes: performing convolution mixing processing based on the downsampled speech features to obtain the shallow speech recognition features.
In one implementation, a feature embedding module can be used to downsample the speech features. The feature embedding module includes a feature embedding layer. Optionally, the feature embedding module further includes a fifth activation function layer and a third normalization layer. For example, the feature embedding layer can be a 2D convolution layer with 64 convolution kernels and 2 channels. Therefore, after being processed by the feature embedding module, the feature input to the convolution mixing module is a tensor of [batch size, 49, 20, 64]. This feature embedding module can complete all of the downsampling of the neural network, effectively reducing the resolution of the feature map and increasing the receptive field, making it easier for the convolution mixing module and the multi-layer perception-based mixing model to capture spatial information from farther away.
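A sketch of the feature embedding module; because a 98×40×1 MFCC input must come out as a [batch, 49, 20, 64] tensor, the sketch assumes a stride of 2, and the kernel size is likewise an assumption:

```python
def feature_embedding(x):
    """Downsample the MFCC feature map: 2D convolution with 64 kernels,
    followed by an activation layer and a normalization layer."""
    x = layers.Conv2D(64, kernel_size=3, strides=2, padding="same")(x)  # 98x40x1 -> 49x20x64
    x = layers.Activation("gelu")(x)                                    # fifth activation function layer
    return layers.BatchNormalization()(x)                               # third normalization layer
```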
As an embodiment, as shown in Figures 7 and 8, the speech signal processing method includes the following steps:
Step 601: Acquire a speech signal collected from the environment.
Step 602: Extract the speech features of the speech signal.
Step 603: Input the speech features into the feature embedding module to obtain downsampled speech features.
Step 604: Input the downsampled speech features into multiple convolution mixing modules connected in series to obtain shallow speech recognition features.
Step 605: Input the shallow speech recognition features into multiple multi-layer perceptual mixing modules connected in series to obtain deep speech recognition features.
Step 606: Input the deep speech recognition features into the classifier to obtain the speech recognition result.
Step 607: Execute the response strategy corresponding to the speech recognition result.
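Putting the sketches above together illustrates steps 601-607 and the tflite export mentioned earlier; the block counts, the reshaping between the convolution mixing and MLP mixing stages, and the conversion call are assumptions, not values fixed by the disclosure:

```python
def build_convmlp_mixer(num_keywords: int,
                        num_conv_blocks: int = 2,
                        num_mlp_blocks: int = 8) -> tf.keras.Model:
    inp = layers.Input(shape=(98, 40, 1))          # MFCC features (step 602)
    x = feature_embedding(inp)                     # downsampling (step 603)
    for _ in range(num_conv_blocks):               # shallow features (step 604)
        x = conv_mixing_block(x)
    x = layers.Reshape((49 * 20, 64))(x)           # positions x channels for MLP mixing
    for _ in range(num_mlp_blocks):                # deep features (step 605)
        x = mlp_mixing_block(x)
    x = layers.Reshape((49, 20, 64))(x)            # back to a 2-D map for pooling
    out = classification_head(x, num_keywords)     # recognition result (step 606)
    return tf.keras.Model(inp, out)

model = build_convmlp_mixer(num_keywords=35)
converter = tf.lite.TFLiteConverter.from_keras_model(model)   # tflite export for on-device use
with open("keyword_spotter.tflite", "wb") as f:
    f.write(converter.convert())
```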
Further, the present disclosure also verifies the effectiveness of the proposed speech signal processing method. As an example, the Google Speech Commands V2 (GSC-V2) data set is used to test the speech signal processing method proposed in the embodiment of the present disclosure. GSC-V2 contains 105,829 command-word utterances from 2,618 speakers, covering 35 command words including 'down', 'go', 'left', 'no', 'off', 'on', 'right', 'stop', 'up', 'yes', etc. Models of different sizes are used in the experiments: ConvMlp-Mixer-S with a model depth of 8 and a hidden layer size of 64, ConvMlp-Mixer-M with a model depth of 12 and a hidden layer size of 64, and ConvMlp-Mixer-L with a model depth of 12 and a hidden layer size of 128. During the experiments, the adamw optimizer is used, with a warmup of 10 epochs, 25,000 iteration steps, a learning rate of 0.02 and a batch size of 256. The test results are shown in Table 1.
Table 1
It can be seen that the ConvMlp-Mixer-S model corresponding to the speech signal processing method proposed in the embodiment of the present disclosure achieves an accuracy of 96.24 with only 96k parameters, while ConvMlp-Mixer-L achieves an accuracy of 97.77 with 0.299M parameters. Compared with MLP-based models and transformer-based models, it performs better while having fewer model parameters, which demonstrates the effectiveness of the speech signal processing method provided by the embodiment of the present disclosure.
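A sketch of the reported optimization setup (AdamW, learning rate 0.02, batch size 256, 25,000 steps, 10 warmup epochs); the warmup schedule shape and the weight decay value are assumptions beyond what is stated, and tf.keras.optimizers.AdamW requires TensorFlow 2.11 or later:

```python
class WarmupThenConstant(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linearly ramp the learning rate over warmup_steps, then hold it constant."""
    def __init__(self, peak_lr: float = 0.02, warmup_steps: int = 1000):
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return self.peak_lr * tf.minimum(1.0, step / float(self.warmup_steps))

optimizer = tf.keras.optimizers.AdamW(learning_rate=WarmupThenConstant(),
                                      weight_decay=1e-4)   # decay value is an assumption
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```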
In summary, the speech signal processing method proposed by the embodiment of the present disclosure, by performing convolution mixing processing and multi-layer perception-based mixing processing on the speech features of the speech signal, can effectively recognize voice commands while reducing the number of parameters of the speech recognition model, so that the apparatus executing the speech signal processing can be better deployed on terminal devices, can provide a faster speech recognition response, improves the response efficiency of terminal devices to voice commands, and helps improve user experience.
It should be noted that although the operations of the method of the present invention are described in a specific order in the drawings, this does not require or imply that these operations must be performed in that specific order, or that all of the illustrated operations must be performed to achieve the desired results.
Figure 9 is a block diagram of a speech signal processing apparatus in an embodiment of the present disclosure.
As shown in Figure 9, the speech signal processing apparatus 10 of the embodiment of the present disclosure includes:
an acquisition module 11, configured to acquire a speech signal collected from the environment;
a speech feature extraction module 12, configured to extract the speech features of the speech signal;
a convolution mixing module 13, configured to perform convolution mixing processing according to the speech features to obtain shallow speech recognition features;
a multi-layer perceptual mixing module 14, configured to perform multi-layer perception-based mixing processing according to the shallow speech recognition features to obtain deep speech recognition features;
a classification module 15, configured to obtain the recognition result of the speech signal according to the deep speech recognition features;
an execution module 16, configured to execute the response strategy corresponding to the recognition result according to the recognition result.
In some embodiments, the convolution mixing module 13 includes a spatial position convolution mixing module and a channel position convolution mixing module; the mixing result of the spatial position convolution mixing module and the input of the spatial position convolution mixing module are input to the channel position convolution mixing module through a residual connection.
In some embodiments, the spatial position convolution mixing module includes a depthwise separable convolution layer, and the channel position convolution mixing module includes a pointwise convolution layer.
In some embodiments, the spatial position convolution mixing module further includes at least one of a first activation function layer and a first normalization layer, and the channel position convolution mixing module further includes at least one of a second activation function layer and a second normalization layer.
In some embodiments, the speech signal processing apparatus 10 includes a plurality of convolution mixing modules 13 connected in series.
In some embodiments, the multi-layer perceptual mixing module 14 includes a spatial-aware mixing module and a channel-aware mixing module; the spatial-aware mixing module performs spatial-aware mixing on the shallow speech recognition features and provides the mixing result to the channel-aware mixing module, and the channel-aware mixing module performs channel-aware mixing on the mixing result.
In some embodiments, the multi-layer perceptual mixing module 14 further includes: a first transposition module and a second transposition module; the first transposition module is used to transpose the feature vector of the shallow speech recognition features, and the second transposition module is used to transpose the feature vector of the mixing result.
In some embodiments, the spatial-aware mixing module includes a first fully connected layer and a second fully connected layer, and the channel-aware mixing module includes a third fully connected layer and a fourth fully connected layer.
In some embodiments, the spatial-aware mixing module further includes a third activation function layer, and the channel-aware mixing module further includes a fourth activation function layer.
In some embodiments, as shown in Figure 10, the speech signal processing apparatus 10 further includes:
a feature embedding module 17, configured to downsample the speech features;
correspondingly, the convolution mixing module 13 is specifically configured to:
perform convolution mixing processing based on the downsampled speech features to obtain the shallow speech recognition features.
In some embodiments, the feature embedding module 17 includes a feature embedding layer.
In some embodiments, the feature embedding module 17 further includes: a fifth activation function layer and a third normalization layer.
In some embodiments, the speech feature extraction module 12 is specifically configured to:
extract Mel frequency cepstral coefficients of the speech signal to obtain the speech features of the speech signal.
It should be understood that the units or modules recorded in the speech signal processing apparatus 10 correspond to the respective steps of the method described with reference to Figure 2. Therefore, the operations and features described above for the method also apply to the speech signal processing apparatus 10 and the units it contains, and will not be repeated here. The speech signal processing apparatus 10 can be implemented in advance in a browser or other security application of an electronic device, or can be loaded into the browser of the electronic device or its security application by downloading or other means. The corresponding units in the speech signal processing apparatus 10 can cooperate with units in the electronic device to implement the solutions of the embodiments of the present disclosure.
As for the several modules or units mentioned in the above detailed description, this division is not mandatory. In fact, according to the embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided into and embodied by multiple modules or units.
In summary, the speech signal processing apparatus proposed by the embodiment of the present disclosure, by performing convolution mixing processing and multi-layer perception-based mixing processing on the speech features extracted from the speech signal, can effectively recognize voice commands while reducing the number of parameters of the speech recognition model, so that the apparatus executing the speech signal processing can be better deployed on terminal devices, can provide a faster speech recognition response, improves the response efficiency of terminal devices to voice commands, and helps improve user experience.
下面参考图11,图11示出了适于用来实现本公开实施例的电子设备或服务器的计算机系统的结构示意图,
如图11所示,计算机系统包括中央处理单元(CPU)1001,其可以根据存储在只读存储器(ROM)1002中的程序或者从存储部分1008加载到随机访问存储器(RAM)1003中的程序而执行各种适当的动作和处理。在RAM1003中,还存储有系统的操作指令所需的各种程序和数据。CPU1001、ROM1002以及RAM1003通过总线1004彼此相连。输入/输出(I/O)接口1005也连接至总线1004。
以下部件连接至I/O接口1005;包括键盘、鼠标等的输入部分1006;包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的输出部分1007;包括硬盘等的存储部分1008;以及包括诸如LAN卡、调制解调器等的网络接口卡的通信部分1009。通信部分1009经由诸如因特网的网络执行通信处理。驱动器1010也根据需要连接至I/O接口1005。可拆卸介质1011,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器1010上,以便于从其上读出的计算机程序根据需要被安装入存储部分1008。
特别地,根据本公开的实施例,上文参考流程图图2描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信部分1009从网络上被下载和安装,和/或从可拆卸介质1011被安装。在该计算机程序被中央处理单元(CPU)1001执行时,执行本公开的系统中限定的上述功能。
需要说明的是,本公开所示的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读 存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以为的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:无线、电线、光缆、RF等等,或者上述的任意合适的组合。
The flowcharts and block diagrams in the drawings illustrate possible architectures, functions and operations of the systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, and the aforementioned module, program segment, or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules involved in the embodiments of the present disclosure may be implemented in software or in hardware. The described units or modules may also be provided in a processor; for example, a processor may be described as including an acquisition module, a speech feature extraction module, a convolution mixing module, a multi-layer perception mixing module, a classification module and an execution module. The names of these units or modules do not, under certain circumstances, constitute a limitation on the units or modules themselves; for example, the acquisition module may also be described as "a module for acquiring the collected speech signal".
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist independently without being assembled into the electronic device. The above computer-readable storage medium stores one or more programs, and when the above programs are executed by one or more processors, the speech signal processing method described in the present disclosure is performed.
The above description is only a preferred embodiment of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the disclosure involved herein is not limited to technical solutions formed by the specific combination of the above technical features, but should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the aforementioned disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Claims (20)

  1. A speech signal processing method, characterized by comprising:
    acquiring a speech signal collected from the environment;
    extracting speech features of the speech signal;
    performing convolution mixing processing according to the speech features to obtain shallow speech recognition features;
    performing mixing processing based on multi-layer perception according to the shallow speech recognition features to obtain deep speech recognition features;
    obtaining a recognition result of the speech signal according to the deep speech recognition features; and
    executing, according to the recognition result, a response strategy corresponding to the recognition result.
  2. The method according to claim 1, characterized in that performing convolution mixing processing according to the speech features to obtain shallow speech recognition features comprises:
    performing convolution mixing processing on the speech features by using a convolution mixing module;
    wherein the convolution mixing module includes a spatial position convolution mixing module and a channel position convolution mixing module, and the mixing result of the spatial position convolution mixing module together with the input of the spatial position convolution mixing module is input to the channel position convolution mixing module through a residual connection.
  3. The method according to claim 2, characterized in that the spatial position convolution mixing module includes a depthwise separable convolutional layer, and the channel position convolution mixing module includes a pointwise convolutional layer.
  4. The method according to claim 3, characterized in that the spatial position convolution mixing module further includes at least one of a first activation function layer and a first normalization layer, and the channel position convolution mixing module further includes at least one of a second activation function layer and a second normalization layer.
  5. The method according to claim 2, characterized in that performing convolution mixing processing on the speech features by using a convolution mixing module comprises:
    performing convolution mixing processing on the speech features by using a plurality of convolution mixing modules connected in series.
  6. The method according to claim 1, characterized in that performing mixing processing based on multi-layer perception according to the shallow speech recognition features to obtain deep speech recognition features comprises:
    performing mixing processing based on multi-layer perception on the shallow speech recognition features by using a multi-layer perception mixing module,
    wherein the multi-layer perception mixing module includes a spatial perception mixing module and a channel perception mixing module; the spatial perception mixing module performs spatial perception mixing on the shallow speech recognition features and provides a mixing result to the channel perception mixing module, and the channel perception mixing module performs channel perception mixing on the mixing result.
  7. The method according to claim 6, characterized in that the multi-layer perception mixing module further includes a first transposition module and a second transposition module; the first transposition module is configured to transpose feature vectors of the shallow speech recognition features, and the second transposition module is configured to transpose feature vectors of the mixing result.
  8. The method according to claim 6, characterized in that the spatial perception mixing module includes a first fully connected layer and a second fully connected layer, and the channel perception mixing module includes a third fully connected layer and a fourth fully connected layer.
  9. The method according to claim 8, characterized in that the spatial perception mixing module further includes a third activation function layer, and the channel perception mixing module further includes a fourth activation function layer.
  10. The method according to claim 1, characterized in that, before performing convolution mixing processing according to the speech features to obtain shallow speech recognition features, the method further comprises:
    downsampling the speech features;
    and performing convolution mixing processing according to the speech features to obtain shallow speech recognition features comprises:
    performing convolution mixing processing according to the downsampled speech features to obtain the shallow speech recognition features.
  11. The method according to claim 10, characterized in that downsampling the speech features comprises:
    downsampling the speech features by using a feature embedding module;
    wherein the feature embedding module includes a feature embedding layer.
  12. The method according to claim 11, characterized in that the feature embedding module further includes a fifth activation function layer and a third normalization layer.
  13. The method according to claim 1, characterized in that extracting speech features of the speech signal comprises:
    extracting Mel-frequency cepstral coefficients of the speech signal to obtain the speech features of the speech signal.
  14. A speech signal processing apparatus, characterized by comprising:
    an acquisition module, configured to acquire a speech signal collected from the environment;
    a speech feature extraction module, configured to extract speech features of the speech signal;
    a convolution mixing module, configured to perform convolution mixing processing according to the speech features to obtain shallow speech recognition features;
    a multi-layer perception mixing module, configured to perform mixing processing based on multi-layer perception according to the shallow speech recognition features to obtain deep speech recognition features;
    a classification module, configured to obtain a recognition result of the speech signal according to the deep speech recognition features; and
    an execution module, configured to execute, according to the recognition result, a response strategy corresponding to the recognition result.
  15. The apparatus according to claim 14, characterized in that the convolution mixing module includes a spatial position convolution mixing module and a channel position convolution mixing module, and the mixing result of the spatial position convolution mixing module together with the input of the spatial position convolution mixing module is input to the channel position convolution mixing module through a residual connection.
  16. The apparatus according to claim 15, characterized in that the spatial position convolution mixing module includes a depthwise separable convolutional layer, and the channel position convolution mixing module includes a pointwise convolutional layer.
  17. The apparatus according to claim 14, characterized in that the multi-layer perception mixing module includes a spatial perception mixing module and a channel perception mixing module; the spatial perception mixing module performs spatial perception mixing on the shallow speech recognition features and provides a mixing result to the channel perception mixing module, and the channel perception mixing module performs channel perception mixing on the mixing result.
  18. The apparatus according to claim 17, characterized in that the multi-layer perception mixing module further includes a first transposition module and a second transposition module; the first transposition module is configured to transpose feature vectors of the shallow speech recognition features, and the second transposition module is configured to transpose feature vectors of the mixing result.
  19. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that, when the processor executes the computer program, the speech signal processing method according to any one of claims 1 to 13 is implemented.
  20. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the speech signal processing method according to any one of claims 1 to 13 is implemented.
PCT/CN2023/094965 2022-05-20 2023-05-18 Speech signal processing method, apparatus, device and medium WO2023222071A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210560595.0 2022-05-20
CN202210560595.0A CN115035887A (zh) 2022-05-20 2022-05-20 Speech signal processing method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
WO2023222071A1 true WO2023222071A1 (zh) 2023-11-23

Family

ID=83120469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/094965 WO2023222071A1 (zh) 2022-05-20 2023-05-18 语音信号的处理方法、装置、设备及介质

Country Status (2)

Country Link
CN (1) CN115035887A (zh)
WO (1) WO2023222071A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035887A (zh) * 2022-05-20 2022-09-09 京东方科技集团股份有限公司 语音信号的处理方法、装置、设备及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010037195A1 (en) * 2000-04-26 2001-11-01 Alejandro Acero Sound source separation using convolutional mixing and a priori sound source knowledge
CN113889091A (zh) * 2021-10-26 2022-01-04 深圳地平线机器人科技有限公司 Speech recognition method and apparatus, computer-readable storage medium and electronic device
CN114333782A (zh) * 2022-01-13 2022-04-12 平安科技(深圳)有限公司 Speech recognition method, apparatus, device and storage medium
CN114399996A (zh) * 2022-03-16 2022-04-26 阿里巴巴达摩院(杭州)科技有限公司 Method, apparatus, storage medium and system for processing speech signals
CN114446318A (zh) * 2022-02-07 2022-05-06 北京达佳互联信息技术有限公司 Audio data separation method and apparatus, electronic device and storage medium
CN115035887A (zh) * 2022-05-20 2022-09-09 京东方科技集团股份有限公司 Speech signal processing method, apparatus, device and medium

Also Published As

Publication number Publication date
CN115035887A (zh) 2022-09-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23807022

Country of ref document: EP

Kind code of ref document: A1