WO2021042870A1 - Voice processing method, apparatus, electronic device, and computer-readable storage medium - Google Patents
Voice processing method, apparatus, electronic device, and computer-readable storage medium
- Publication number
- WO2021042870A1 (PCT/CN2020/101602)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- voice
- text
- speech
- bottleneck
- Prior art date
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/0208—Noise filtering (under G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation)
- G10L15/142—Hidden Markov Models [HMMs] (under G10L15/14—Speech classification or search using statistical models)
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26—Speech to text systems
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063—Training (under G10L15/06—Creation of reference templates; Training of speech recognition systems)
- G10L15/16—Speech classification or search using artificial neural networks
- G10L2015/225—Feedback of the input speech
- G10L2021/02082—Noise filtering, the noise being echo, reverberation of the speech
Definitions
- This application relates to the field of computer technology. Specifically, this application relates to a voice processing method, device, electronic device, and computer-readable storage medium.
- in the related art, the text-information extraction method used in Text To Speech (TTS) is typically applied to extract the corresponding text information, which is then spliced with the features of the noisy speech and fed into the noise-reduction network model for training.
- the embodiment of the present application provides a voice processing method, which is executed by an electronic device, and includes:
- determining the first voice feature according to the voice information to be processed includes:
- the first voice feature includes at least one of a logarithmic power spectrum and Mel-frequency cepstral coefficient (MFCC) features.
- determining the first text bottleneck feature according to the voice information to be processed includes:
- the second speech feature is input to the trained automatic speech recognition (ASR) network, and the first text bottleneck feature is extracted from the bottleneck linear layer of the trained ASR network.
- inputting the first combined feature vector to the trained one-way long short-term memory LSTM model, and performing voice processing on the first combined feature vector to obtain processed voice information includes:
- the method of training the ASR network includes:
- Training step: align the text annotations included in the corpus with their corresponding audio files through a Gaussian mixture model (GMM) to obtain the first text feature; the corpus is used to train the ASR network;
- the way of training the one-way LSTM model includes:
- the ASR network includes a deep neural network (DNN) with four hidden layers as the input stage, a bottleneck linear layer, and a probability-distribution softmax layer as the output layer.
- the embodiment of the present application also provides a voice processing device, including:
- the first processing module is used to collect voice information to be processed
- the second processing module is configured to determine the first voice feature and the first text bottleneck feature according to the voice information to be processed
- the third processing module is configured to determine the first combined feature vector according to the first voice feature and the first text bottleneck feature
- the fourth processing module is used to input the first combined feature vector to the trained one-way long short-term memory LSTM model, and perform voice processing on the first combined feature vector to obtain the voice information after noise reduction.
- the voice information is sent to other electronic devices for playback.
- the embodiment of the present application also provides an electronic device, including: a processor, a memory, and a bus;
- the bus is used to connect the processor and the memory;
- the memory is used to store operation instructions;
- the processor is configured to execute the voice processing method described in the embodiment of the present application by invoking an operation instruction.
- the embodiment of the present application also provides a computer-readable storage medium that stores a computer program, and the computer program is used to execute the voice processing method described in the embodiment of the present application.
- FIG. 1A is a system architecture diagram to which a voice processing method provided by an embodiment of this application is applicable;
- FIG. 1B is a schematic flowchart of a voice processing method provided by an embodiment of this application.
- FIG. 2 is a schematic diagram of an ASR network provided by an embodiment of the application.
- FIG. 3 is a schematic flowchart of another voice processing method provided by an embodiment of this application.
- FIG. 4 is a schematic diagram of extracting voice features provided by an embodiment of this application.
- FIG. 5 is a schematic diagram of a combined feature vector provided by an embodiment of this application.
- FIG. 6 is a schematic diagram of a conference system provided by an embodiment of the application.
- FIG. 7 is a schematic structural diagram of a voice processing apparatus provided by an embodiment of this application.
- FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
- AI Artificial Intelligence
- AI uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
- artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
- Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
- Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technologies and software-level technologies.
- Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
- Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning or deep learning.
- key speech technologies include automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice has become one of the most promising human-computer interaction methods.
- Neural network It is an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. This kind of network relies on the complexity of the system, and achieves the purpose of processing information by adjusting the interconnection between a large number of internal nodes.
- Deep neural network (DNN): a feedforward neural network that uses an activation function to introduce non-linearity, cross entropy as the loss function, and a back-propagation optimization algorithm (for example, stochastic gradient descent or batch gradient descent) for learning and training, i.e., adjusting and updating the weights between neurons.
- Automatic speech recognition The goal of ASR (Automatic Speech Recognition) technology is to allow a computer to dictate continuous speech spoken by different people, which is commonly known as a speech dictation machine, which is a technology that realizes the conversion of sound to text. Automatic speech recognition is also called speech recognition (Speech Recognition) or computer speech recognition (Computer Speech Recognition).
- Mel-frequency cepstral coefficients (MFCC): spectral features that take the characteristics of human hearing into account. The Mel frequency scale is based on the auditory characteristics of the human ear and has a non-linear correspondence with frequency in Hz; MFCCs exploit this relationship, and the spectral features computed from it are mainly used for voice feature extraction and for reducing the dimensionality of the computation.
- Probability distribution softmax layer The output of the softmax layer is a series of positive numbers that add up to 1, that is, the output from the softmax layer can be regarded as a probability distribution.
- the softmax layer turns the output of the neural network into a probability distribution.
- Speech enhancement: when the speech signal is interfered with, or even submerged by, various noises, speech enhancement extracts the useful speech signal from the noise background, suppresses and reduces the noise interference, and recovers the purest possible original speech from the noisy speech.
- Cross entropy (CE) can be regarded as the difficulty of expressing the probability distribution p(x) through the probability distribution q(x).
- cross entropy describes the distance between two probability distributions q(x) and p(x): the smaller the cross-entropy value (the smaller the relative entropy), the closer the two probability distributions q(x) and p(x) are.
- the cross-entropy loss function is often used in classification problems, especially neural-network classification problems. Since cross entropy involves computing the probability of each category, in neural networks it is closely related to the softmax function.
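As an illustration of the relationship between softmax and cross entropy described above, here is a minimal NumPy sketch; the logits and the one-hot expected distribution are made-up values, not taken from the patent:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p_expected, q_output, eps=1e-12):
    # CE(p, q) = -sum_x p(x) * log q(x); smaller means q is closer to p.
    return -np.sum(p_expected * np.log(q_output + eps))

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw network outputs
q = softmax(logits)                  # a valid probability distribution
p = np.array([1.0, 0.0, 0.0])        # one-hot expected output
loss = cross_entropy(p, q)
print(round(float(q.sum()), 6))      # prints 1.0
```

Because the expected distribution is one-hot, the loss reduces to the negative log-probability the network assigns to the correct class, which is why it pairs naturally with a softmax output layer.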
- LSTM (Long Short-Term Memory)
- LSTM is a temporal recurrent neural network, suitable for processing and predicting important events with relatively long intervals and delays in a time series.
- LSTM is a special kind of recurrent neural network designed to solve the vanishing-gradient problem of the standard RNN structure.
- LSTM is a kind of neural network containing LSTM blocks.
- an LSTM block can be regarded as an intelligent network unit that can memorize a value for an arbitrary length of time; gates in the LSTM block determine whether an input is important enough to be remembered and whether it can be output.
- a Gaussian model uses the Gaussian probability density function (normal distribution curve) to quantify things precisely, decomposing a thing into several models based on the Gaussian probability density function.
- as an example, in building a Gaussian model for an image background, the image gray-level histogram reflects the frequency of each gray value in the image and can be regarded as an estimate of the image's gray-level probability density.
- GMM (Gaussian mixture model)
- a GMM uses K Gaussian models (K a positive integer) to characterize each pixel in the image; the mixture model is updated after each new frame is obtained, and each pixel of the current image is matched against it.
- if a pixel matches, it is judged to be a background point; otherwise it is a foreground point.
- the whole Gaussian model is mainly determined by two parameters, the mean and the variance; how the mean and variance are learned, and which learning mechanism is adopted, directly affects the stability, accuracy, and convergence of the model.
- noise-reduction network models that extract text information in this way have the following defects: text information is required at test time, which is difficult in practical applications; the text information must be aligned with the noisy speech features, which makes real-time operation hard to achieve, and the alignment accuracy affects the noise-reduction results; the noisy speech used for training must have corresponding text annotations, and such a training corpus is difficult to obtain in practice.
- the embodiment of the present application provides a voice processing method.
- the technical solutions provided by the embodiments of the present application involve artificial intelligence voice technology.
- the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems will be described in detail with specific embodiments in conjunction with the accompanying drawings.
- FIG. 1A is a system architecture diagram to which the voice processing method provided in an embodiment of the present application is applicable.
- the system architecture diagram includes: a server 11, a network 12, and a user terminal 13, wherein the server 11 establishes a connection with the user terminal 13 through the network 12.
- the server 11 is a background server that processes the voice information to be processed.
- the server 11 and the user terminal 13 provide services for the user together. For example, after the server 11 processes the voice information to be processed, it sends the processed voice information to the user terminal 13 for use by the user.
- the server 11 may be a standalone server, or a cluster composed of multiple servers.
- the network 12 may include wired and wireless networks. As shown in FIG. 1A, on the access-network side, the user terminal 13 can be connected to the network 12 in a wireless or wired manner; on the core-network side, the server 11 is generally connected to the network 12 in a wired manner. Of course, the server 11 may also be connected to the network 12 wirelessly.
- the above-mentioned user terminal 13 may refer to a smart device with data calculation and processing functions, for example, it can play the processed voice information provided by the server, or after processing the voice information to be processed, directly play the processed voice information or send it to other users. Terminal to make it play.
- the user terminal 13 includes, but is not limited to, a smart phone (installed with a communication module), a palmtop computer, a tablet computer, and the like.
- An operating system is installed on the user terminal 13, including but not limited to: Android operating system, Symbian operating system, Windows mobile operating system, Apple iPhone OS operating system, and so on.
- an embodiment of the present application provides a voice processing method.
- the voice processing method is executed by an electronic device.
- the electronic device may be the server 11 in FIG. 1A or the user terminal 13 in FIG. 1A.
- the flow diagram of the method is shown in Fig. 1B, and the method includes the following steps:
- S101 Acquire voice information to be processed.
- the voice information to be processed is the call voice of the conference system.
- S102 Determine the first voice feature and the first text bottleneck feature according to the voice information to be processed.
- the first voice feature may be a logarithmic power spectrum or an MFCC (Mel-Frequency Cepstral Coefficients) feature.
- the first text bottleneck feature is extracted from the bottleneck linear layer, i.e., the bottleneck layer.
- the bottleneck layer is a middle layer in a multilayer perceptron whose number of neurons is much smaller than that of the other layers, so that the whole neural network narrows like a bottleneck; the features extracted from the bottleneck layer are the bottleneck features.
- S103 Determine a first combined feature vector according to the first voice feature and the first text bottleneck feature.
- the first speech feature and the first text bottleneck feature are spliced together to obtain a first combined feature vector.
- the per-frame dimension of the first combined feature vector is the sum of the per-frame dimensions of the first speech feature and the first text bottleneck feature.
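A minimal sketch of this per-frame splicing; the 257- and 100-dimensional feature sizes are hypothetical choices for illustration, and only the sum rule comes from the text above:

```python
import numpy as np

n_frames = 5
speech_feat = np.random.randn(n_frames, 257)   # e.g. log-power spectrum per frame (hypothetical size)
text_bn_feat = np.random.randn(n_frames, 100)  # text bottleneck feature per frame (hypothetical size)

# Splice frame by frame: the combined per-frame dimension is the
# sum of the two per-frame feature dimensions (257 + 100 = 357).
combined = np.concatenate([speech_feat, text_bn_feat], axis=1)
print(combined.shape)  # prints (5, 357)
```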
- S104 Input the first combined feature vector to the trained one-way long short-term memory LSTM model, perform voice processing on the first combined feature vector to obtain noise-reduced voice information, and combine the noise-reduced voice information Send to other electronic devices for playback.
- speech processing is speech enhancement (Speech Enhancement).
- the essence of speech enhancement is speech noise reduction.
- the speech collected by a microphone usually carries various kinds of noise.
- the main purpose of speech enhancement is to recover, from the noisy speech, the speech without the noise.
- with speech enhancement, various interference signals can be effectively suppressed and the target speech signal enhanced, which not only improves speech intelligibility and voice quality but also helps improve speech recognition.
- the voice information to be processed is collected; the first voice feature and the first text bottleneck feature are determined according to the voice information to be processed; the first combined feature vector is determined based on the first voice feature and the first text bottleneck feature; the first combined feature vector is input to the trained one-way long short-term memory (LSTM) model, and voice processing is performed on the first combined feature vector to obtain the processed voice information.
- the solution of the embodiment of the present application realizes speech processing based on the first text bottleneck feature, improving the efficiency of speech noise reduction and the speech quality.
- determining the first voice feature according to the voice information to be processed includes:
- the voice information to be processed is subjected to framing processing and windowing processing; the first voice feature is extracted from the voice information to be processed after framing and windowing; the first voice feature includes at least one of the logarithmic power spectrum and Mel-frequency cepstral coefficient (MFCC) features.
- the framing process cuts the variable-length audio in the voice information to be processed into small fixed-length segments. Framing is needed because the subsequent Fourier transform is suited to analyzing stationary signals while the audio signal changes rapidly, and to avoid signal content being missed at window boundaries, consecutive frames must overlap. A usual choice is a frame length of 25 ms with a frame shift of 10 ms, so adjacent frames overlap.
- the Fourier transform requires the input signal to be stationary, but the audio signal is not stationary as a whole.
- in the windowing process, each frame of the signal is multiplied by a smooth window function so that both ends of the frame decay smoothly to zero, which reduces the intensity of the side lobes after the Fourier transform and yields a higher-quality spectrum.
- the truncation into frames is done by the window function.
- a practical window function has side lobes of varying amplitudes; therefore, in the convolution, amplitude components appear not only at the discrete frequency points but also, to varying degrees, between adjacent frequency points.
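The framing, windowing, and Fourier-transform steps described above can be sketched as follows; the 16 kHz sample rate, Hamming window, and 512-point FFT are illustrative assumptions, while the 25 ms frame length and 10 ms frame shift follow the text:

```python
import numpy as np

def log_power_spectrum(signal, sr=16000, frame_ms=25, shift_ms=10, n_fft=512, eps=1e-10):
    """Frame the signal, apply a smooth window, FFT each frame,
    and return the per-frame logarithmic power spectrum."""
    frame_len = sr * frame_ms // 1000      # 400 samples at 16 kHz
    frame_shift = sr * shift_ms // 1000    # 160 samples, so adjacent frames overlap
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)         # both ends of each frame decay smoothly
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)  # one-sided FFT per frame
    power = np.abs(spectrum) ** 2                    # discrete power spectrum
    return np.log(power + eps)                       # logarithmic power spectrum

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
feat = log_power_spectrum(sig)
print(feat.shape)  # (n_frames, n_fft // 2 + 1)
```

With one second of 16 kHz audio this yields 98 frames of 257 log-power values each, matching the 15 ms inter-frame overlap implied by a 25 ms frame and 10 ms shift.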
- determining the first text bottleneck feature according to the voice information to be processed includes:
- the second speech feature is input to the trained automatic speech recognition (ASR) network, and the first text bottleneck feature is extracted from the bottleneck linear layer of the trained ASR network.
- a 40-dimensional filter-bank feature and a 3-dimensional pitch feature are extracted from the speech information to be processed (here N is 40 and M is 3); pitch is related to the fundamental frequency (F0) of the sound and reflects the pitch information, that is, the tone.
- a filter bank is a group of F filters (F a positive integer) that filter the same signal and output F synchronized signals; each filter can be assigned a different response function, center frequency, gain, and bandwidth. The center frequencies of the filters are arranged in ascending order, each concentrated on a different frequency; if the number of filters is large enough, the short-term energy of each output signal at different times can be determined to obtain a spectrogram.
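A simplified sketch of such a filter bank applied to one frame's power spectrum; the triangular, linearly spaced filters are an illustrative simplification (a real speech front end would typically space the 40 filters on the Mel scale):

```python
import numpy as np

def triangular_filter_bank(n_filters=40, n_fft=512):
    """Build n_filters triangular filters with ascending center frequencies
    spanning the one-sided FFT bins (linear spacing for simplicity)."""
    n_bins = n_fft // 2 + 1
    # n_filters + 2 evenly spaced edge positions over the frequency bins.
    edges = np.linspace(0, n_bins - 1, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        fbank[i, lo:ctr + 1] = np.linspace(0.0, 1.0, ctr - lo + 1)  # rising edge
        fbank[i, ctr:hi + 1] = np.linspace(1.0, 0.0, hi - ctr + 1)  # falling edge
    return fbank

fbank = triangular_filter_bank()
power = np.abs(np.fft.rfft(np.random.randn(512))) ** 2  # power spectrum of one frame
energies = fbank @ power                                # 40 filter-bank energies
print(energies.shape)  # prints (40,)
```

Each row of `fbank` is one band-limited filter; multiplying it against the frame's power spectrum gives that filter's short-term energy, and stacking the energies over frames gives the 40-dimensional filter-bank feature stream.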
- inputting the first combined feature vector to the trained one-way long short-term memory LSTM model, and performing voice processing on the first combined feature vector to obtain processed voice information includes:
- a text-related LSTM model is used to implement speech processing on the first combined feature vector, which improves the performance of speech noise reduction.
- the method of training the ASR network includes:
- Training step: align the text annotations included in the corpus for training the ASR network with their corresponding audio files by using a Gaussian mixture model (GMM) to obtain the first text feature;
- the output layer of the ASR network is a softmax layer, which outputs a probability distribution used by the loss function.
- the loss function is cross entropy: each value of the current output is normalized, and in the expected (one-hot) output the position of the maximum is set to 1 while the remaining values are 0.
- the loss function describes the degree of fit between the forward-propagation output and the expected value; the classical classification loss function is cross entropy, which describes the distance (similarity) between the network output probability distribution and the expected output probability distribution.
- the training corpus for ASR and the noise-reduction training corpus are separate, so the denoised speech does not need corresponding text annotations, and a corpus for training ASR is easy to obtain; backward (future) information is not used when training the ASR network, so real-time processing can be achieved.
- the way of training the one-way LSTM model includes:
- the ASR network includes a deep neural network (DNN) with four hidden layers as the input stage, a bottleneck linear layer, and a probability-distribution softmax layer as the output layer.
- x_t is the input of the ASR network and y_t is the output of the ASR network;
- x_t is the input of the first hidden layer of the ASR network;
- the output of the first hidden layer is used as the input of the second hidden layer;
- the output of the second hidden layer is used as the input of the third hidden layer;
- the output of the third hidden layer is used as the input of the bottleneck linear layer;
- the output of the bottleneck linear layer is used as the input of the fourth hidden layer; the output of the fourth hidden layer is used as the input of the softmax layer, and the output of the softmax layer is y_t.
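The layer-by-layer data flow above can be sketched as a plain NumPy forward pass; the layer widths, the 100-dimensional bottleneck, the ReLU activations, and the 3000-class softmax output are all hypothetical sizes, since the text fixes only the ordering of the layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical widths: 43-dim input (40 filter-bank + 3 pitch),
# four 512-wide hidden layers around a 100-dim bottleneck,
# softmax over 3000 output states.
sizes = [43, 512, 512, 512, 100, 512, 3000]
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x_t):
    h = x_t
    acts = []
    for i, W in enumerate(weights):
        h = h @ W
        # Layer index 3 is the linear bottleneck; the final layer feeds
        # softmax directly. All other hidden layers use ReLU.
        if i not in (3, len(weights) - 1):
            h = relu(h)
        acts.append(h)
    y_t = softmax(acts[-1])        # probability distribution over output states
    bottleneck_feature = acts[3]   # taken out as the text bottleneck feature
    return y_t, bottleneck_feature

y, bn = forward(rng.standard_normal(43))
print(bn.shape)  # prints (100,)
```

At inference time only the path up to the bottleneck layer matters: the 100-dimensional `bottleneck_feature` is what gets spliced with the speech feature, while the softmax output is used only during ASR training.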
- the embodiment of the present application provides another voice processing method.
- the voice processing method is executed by an electronic device.
- the electronic device may be the server 11 in FIG. 1A or the user terminal 13 in FIG. 1A.
- a schematic flowchart of this method is shown in Figure 3. The method includes the following steps:
- S201 Acquire speech containing noise, perform framing processing and windowing processing on the collected speech, and extract speech features.
- the voice containing noise is the voice information to be processed, and the voice feature is the first voice feature.
- the extracted speech feature may be a logarithmic power spectrum or an MFCC (Mel-Frequency Cepstral Coefficients) feature.
- the speech is first subjected to framing processing and windowing processing; then each frame is subjected to an FFT (Fast Fourier Transformation) to determine the discrete power spectrum, and the logarithm of the resulting discrete power spectrum is taken to obtain the logarithmic power spectrum, which gives the voice feature.
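The framing, windowing, and log-power-spectrum steps can be sketched as follows. For clarity this uses a naive DFT rather than a fast FFT, and the 512-sample frame length, 256-sample hop, and Hamming window are illustrative assumptions not fixed by the text:

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Split a signal into overlapping frames (framing processing)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def log_power_spectrum(frame):
    """Windowing + DFT -> discrete power spectrum -> logarithm."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, hamming(n))]  # windowing processing
    eps = 1e-12                                     # guard against log(0)
    spec = []
    for k in range(n // 2 + 1):                     # one-sided spectrum: n//2 + 1 bins
        X = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(math.log(abs(X) ** 2 + eps))    # log of the power spectrum
    return spec

# A 512-point frame yields the 257-dimensional per-frame feature mentioned
# later in the text (512 // 2 + 1 = 257).
signal = [math.sin(2 * math.pi * 0.05 * t) for t in range(1024)]
feature = [log_power_spectrum(f) for f in frames(signal, 512, 256)]
assert all(len(row) == 257 for row in feature)
```

A production system would use a fast FFT routine instead of the O(n²) loop above; the input/output shapes are the same.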
- the text bottleneck feature is the first text bottleneck feature.
- the 40-dimensional filter-bank feature and the 3-dimensional fundamental frequency (pitch) feature are extracted from the collected speech containing noise; the 40-dimensional filter-bank feature and the 3-dimensional pitch feature are spliced to obtain the second voice feature; the second voice feature is input to the trained automatic speech recognition (ASR) network, and the text bottleneck feature is extracted from the bottleneck linear layer of the trained ASR network.
- the combined feature vector is the first combined feature vector.
- the speech feature and the text bottleneck feature are spliced together to obtain a combined feature vector
- the dimension of the combined feature vector is the sum of the dimensions of each frame of the speech feature and the dimension of the text bottleneck feature.
- the dimension of each frame of the speech feature is 257
- the dimension of the text bottleneck feature is 100
- the dimension of the combined feature vector is the sum of the dimension of each frame of the speech feature and the dimension of the text bottleneck feature, that is, the dimension of the combined feature vector is 357.
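The splicing described above is plain per-frame vector concatenation. A minimal sketch using the dimensions given in the text (zeros stand in for real feature values):

```python
speech_frame = [0.0] * 257         # per-frame log power spectrum (257-dim)
text_bottleneck = [0.0] * 100      # per-frame text bottleneck feature (100-dim)

# "Splicing" is concatenation: 257 + 100 = 357 dimensions per frame.
combined = speech_frame + text_bottleneck
assert len(combined) == 357

# The same operation covers the ASR input feature: a 40-dim filter-bank
# vector spliced with a 3-dim pitch vector gives the 43-dim second voice feature.
asr_input = [0.0] * 40 + [0.0] * 3
assert len(asr_input) == 43
```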
- S204 Input the combined feature vector into the trained one-way LSTM model to perform speech enhancement.
- the input combined feature vector is subjected to speech enhancement processing, and then the output of the one-way LSTM model is subjected to an inverse feature transformation, converting the output of the one-way LSTM model from the frequency domain to the time domain to obtain the enhanced time-domain speech.
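A one-way LSTM is causal: each step depends only on the current frame and the past state, which is what permits frame-by-frame real-time enhancement. The minimal cell sketch below uses random, untrained weights and an arbitrary hidden size of 8 purely for illustration; a real model would have learned parameters and an output layer producing the enhanced spectrum:

```python
import math
import random

random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class UnidirectionalLSTM:
    """Minimal one-way LSTM sketch: step() reads only the current input
    frame and the PAST state (no future frames), so frames can be
    enhanced one by one in real time."""

    def __init__(self, input_size, hidden_size):
        self.hs = hidden_size
        # One weight matrix per gate (f, i, o, g), for input (W) and recurrence (U).
        self.W = {g: rand_mat(hidden_size, input_size) for g in "fiog"}
        self.U = {g: rand_mat(hidden_size, hidden_size) for g in "fiog"}
        self.h = [0.0] * hidden_size  # hidden state
        self.c = [0.0] * hidden_size  # cell state

    def _gate(self, g, x):
        return [sum(w * xi for w, xi in zip(self.W[g][j], x)) +
                sum(u * hi for u, hi in zip(self.U[g][j], self.h))
                for j in range(self.hs)]

    def step(self, x):
        f = [sigmoid(v) for v in self._gate("f", x)]    # forget gate
        i = [sigmoid(v) for v in self._gate("i", x)]    # input gate
        o = [sigmoid(v) for v in self._gate("o", x)]    # output gate
        g = [math.tanh(v) for v in self._gate("g", x)]  # candidate cell values
        self.c = [fj * cj + ij * gj for fj, ij, gj, cj in zip(f, i, g, self.c)]
        self.h = [oj * math.tanh(cj) for oj, cj in zip(o, self.c)]
        return self.h

lstm = UnidirectionalLSTM(input_size=357, hidden_size=8)
for _ in range(3):                   # feed 357-dim combined feature frames causally
    out = lstm.step([0.1] * 357)
assert len(out) == 8 and all(-1.0 < v < 1.0 for v in out)
```

By contrast, a bidirectional LSTM would also consume future frames and therefore could not run with frame-level latency.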
- the corpus for training ASR includes speech (noisy speech and/or clean speech) and text; the noise reduction training corpus includes noisy speech and the corresponding clean speech.
- the text information of the noisy speech is not required, and real-time noise reduction is achieved; the corpus for training ASR and the noise reduction training corpus are separate, so the noise-reduced speech does not need corresponding text annotations, and the corpus for training ASR is easy to obtain; backward information is not used when training the ASR network, so real-time processing can be achieved.
- because the one-way LSTM model is trained with text features as part of its input, the trained one-way LSTM model can, in experiments, essentially eliminate the noise in silent segments and suppress the noise component mixed with the human voice, so the noise reduction performance is effectively improved.
- both parties to the conference join the voice call through the conference software on their terminals, for example through the user terminal shown in Figure 1A, and conduct the voice call through the conference software.
- the voice processing is realized through the automatic gain control module, audio coding module, audio decoding module, echo cancellation module, voice noise reduction module, and howling suppression module.
- the voice noise reduction module is an important module affecting call quality.
- the speech noise reduction module first trains a general automatic speech recognition (ASR) network with a bottleneck linear layer, then inputs the speaker's noisy speech into the trained ASR network and extracts text bottleneck features from the bottleneck linear layer of the ASR network.
- the speech noise reduction module performs framing processing and windowing processing on the speaker's noisy speech, then performs a fast Fourier transform (FFT) on each frame, determines the discrete power spectrum, and takes the logarithm of the resulting discrete power spectrum to obtain the logarithmic power spectrum, which is the voice feature.
- the speech noise reduction module combines the extracted text bottleneck features with the speech features, inputs the combined feature vector into the trained one-way long short-term memory (LSTM) model, and performs speech enhancement processing through the trained one-way LSTM model.
- the output of the one-way LSTM model is subjected to feature inverse transformation, and the speaker's speech without noise in the time domain is output.
- the speech noise reduction module optimizes noise reduction performance by introducing the text bottleneck feature of the speaker's speech. The text bottleneck feature makes it possible to determine which speech frames are valid and which noise needs to be eliminated, so more speech is retained, the noise reduction result is further improved, the call becomes clearer, and the earlier problem of falsely cancelling speech is reduced. For example, during a meeting, when the speaker says the phrase "start meeting now", the speech recognition network ASR can obtain the text content of this utterance and judge that it is someone speaking and must not be deleted.
- obtaining the text bottleneck feature of the call speech to assist noise reduction further improves the noise reduction performance, and the overall experience is better; the problem of partially false-cancelling valid speech caused by noise reduction is greatly alleviated, making the call smoother and improving call quality.
- an embodiment of the present application also provides a voice processing device.
- the structure diagram of the device is shown in FIG. 7.
- the voice processing device 60 includes a first processing module 601, a second processing module 602, a third processing module 603, and a fourth processing module 604.
- the first processing module 601 is configured to obtain voice information to be processed
- the second processing module 602 is configured to determine the first voice feature and the first text bottleneck bottleneck feature according to the voice information to be processed;
- the third processing module 603 is configured to determine the first combined feature vector according to the first voice feature and the first text bottleneck feature
- the fourth processing module 604 is used to input the first combined feature vector into the trained one-way long short-term memory (LSTM) model, perform voice processing on the first combined feature vector to obtain denoised voice information, and send the denoised voice information to other electronic devices for playback.
- the second processing module 602 is specifically configured to perform framing processing and windowing processing on the voice information to be processed, and to extract the first voice feature from the voice information to be processed after the framing processing and windowing processing; the first voice feature includes at least one of a logarithmic power spectrum and a Mel-frequency cepstral coefficient (MFCC) feature.
- the second processing module 602 is specifically configured to extract the N-dimensional filter-bank feature and the M-dimensional fundamental frequency (pitch) feature from the voice information to be processed, where both N and M are positive integers; splice the N-dimensional filter-bank feature and the M-dimensional pitch feature to obtain the second voice feature; and input the second voice feature to the trained automatic speech recognition (ASR) network, extracting the first text bottleneck feature from the bottleneck linear layer of the trained ASR network.
- the fourth processing module 604 is specifically configured to perform voice processing on the first combined feature vector through the trained one-way LSTM model, and to perform an inverse feature transformation on the processing result, converting the voice information from the frequency domain to the time domain to obtain the processed voice information.
- the method of training the ASR network includes:
- Training step: align the text annotations included in the corpus for training the ASR network with the audio files corresponding to the text annotations by using the Gaussian mixture model (GMM) to obtain the first text feature;
- the way of training the one-way LSTM model includes:
- the ASR network includes a deep neural network (DNN) with four hidden layers as the input stage, a bottleneck linear layer, and a softmax layer that outputs a probability distribution as the output layer.
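Training of the ASR network, as described in the claims, repeats until the cross-entropy (CE) values of two consecutive training rounds differ by no more than a first threshold. A minimal sketch of that stopping rule, with made-up CE values and threshold:

```python
def train_until_ce_converges(ce_per_round, threshold):
    """Return the index of the first round whose CE differs from the
    previous round's CE by at most `threshold`, or None if it never does."""
    prev = None
    for t, ce in enumerate(ce_per_round):
        if prev is not None and abs(ce - prev) <= threshold:
            return t  # |CE_t - CE_{t-1}| within the first threshold range
        prev = ce
    return None

# Hypothetical per-round CE values; the differences shrink as the
# network fits the training corpus.
rounds = [2.1, 1.4, 0.9, 0.7, 0.62, 0.60]
assert train_until_ce_converges(rounds, threshold=0.05) == 5
```

In practice the loop would interleave a forward/backward pass per round; only the convergence test is shown here.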
- the solution of the embodiment of the present application realizes speech processing based on the first text bottleneck feature, improving the efficiency of speech noise reduction and the speech quality.
- an embodiment of the present application also provides an electronic device.
- a schematic structural diagram of the electronic device is shown in FIG. 8.
- the electronic device 6000 includes at least one processor 6001, a memory 6002, and a bus 6003.
- the processor 6001 is electrically connected to the memory 6002; the memory 6002 is configured to store at least one computer-executable instruction, and the processor 6001 is configured to execute the at least one computer-executable instruction, so as to execute the steps of any voice processing method provided by any embodiment or any optional implementation manner.
- the processor 6001 may be an FPGA (Field-Programmable Gate Array) or another device with logic processing capability, such as an MCU (Microcontroller Unit) or a CPU (Central Processing Unit).
- the solution of the embodiment of the present application realizes speech processing based on the first text bottleneck feature, improving the efficiency of speech noise reduction and the speech quality.
- the embodiment of the present application also provides another computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice processing steps of any embodiment or any optional implementation manner of the first embodiment of the present application.
- the computer-readable storage medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a readable storage medium includes any medium that stores or transmits information in a form readable by a device (for example, a computer).
- the solution of the embodiment of the present application realizes speech processing based on the first text bottleneck feature, improving the efficiency of speech noise reduction and the speech quality.
Claims (15)
- A voice processing method, executed by an electronic device, comprising: acquiring voice information to be processed; determining a first voice feature and a first text bottleneck feature according to the voice information to be processed; determining a first combined feature vector according to the first voice feature and the first text bottleneck feature; and inputting the first combined feature vector into a trained one-way long short-term memory (LSTM) model and performing voice processing on the first combined feature vector to obtain denoised voice information, and sending the denoised voice information to another electronic device for playback.
- The method according to claim 1, wherein determining the first voice feature according to the voice information to be processed comprises: performing framing processing and windowing processing on the voice information to be processed; and extracting the first voice feature from the voice information to be processed after the framing processing and windowing processing; the first voice feature comprising at least one of a logarithmic power spectrum and a Mel-frequency cepstral coefficient (MFCC) feature.
- The method according to claim 1, wherein determining the first text bottleneck feature according to the voice information to be processed comprises: extracting an N-dimensional filter-bank feature and an M-dimensional fundamental frequency (pitch) feature from the voice information to be processed, where N and M are both positive integers; splicing the N-dimensional filter-bank feature and the M-dimensional pitch feature to obtain a second voice feature; and inputting the second voice feature into a trained automatic speech recognition (ASR) network, and extracting the first text bottleneck feature from the bottleneck linear layer of the trained ASR network.
- The method according to claim 3, wherein the manner of training the ASR network comprises a training step of: aligning, by means of a Gaussian mixture model (GMM), the text annotations included in a corpus with the audio files corresponding to the text annotations to obtain a first text feature, the corpus being used to train the ASR network; extracting an N-dimensional filter-bank feature and an M-dimensional fundamental frequency (pitch) feature from the audio files; splicing the N-dimensional filter-bank feature and the M-dimensional pitch feature to obtain a third voice feature; inputting the third voice feature into the ASR network and training the ASR network to obtain a second text feature output by the output layer of the ASR network; and determining the cross-entropy (CE) value of the ASR network according to the value of the first text feature and the value of the second text feature; the training step being repeated, and the trained ASR network being obtained when the difference between the CE value of the ASR network obtained in the current training round and the CE value of the ASR network obtained in the previous training round is within a first threshold range.
- The method according to claim 4, wherein the ASR network comprises a deep neural network (DNN) with four hidden layers as an input stage, a bottleneck linear layer, and a softmax layer outputting a probability distribution as an output layer.
- The method according to claim 1, wherein inputting the first combined feature vector into the trained one-way long short-term memory (LSTM) model and performing voice processing on the first combined feature vector to obtain the denoised voice information comprises: performing speech enhancement processing on the first combined feature vector through the trained one-way LSTM model; and performing an inverse feature transformation on the processing result, converting the voice information from the frequency domain to the time domain to obtain the denoised voice information.
- The method according to claim 1, wherein the manner of training the one-way LSTM model comprises: collecting noisy speech and noise-free speech included in a noise-reduction training corpus; extracting a fourth voice feature and a second text bottleneck feature from the noisy speech, and extracting a fifth voice feature from the noise-free speech; combining the fourth voice feature with the second text bottleneck feature to obtain a second combined feature vector; and inputting the second combined feature vector into the one-way LSTM model and training the one-way LSTM model, the trained one-way LSTM model being obtained when the minimum mean square error between the reference value output by the one-way LSTM model and the value of the fifth voice feature is less than or equal to a second threshold.
- A voice processing apparatus, comprising: a first processing module, configured to acquire voice information to be processed; a second processing module, configured to determine a first voice feature and a first text bottleneck feature according to the voice information to be processed; a third processing module, configured to determine a first combined feature vector according to the first voice feature and the first text bottleneck feature; and a fourth processing module, configured to input the first combined feature vector into a trained one-way long short-term memory (LSTM) model, perform voice processing on the first combined feature vector to obtain denoised voice information, and send the denoised voice information to another electronic device for playback.
- The apparatus according to claim 8, wherein the second processing module is further configured to perform framing processing and windowing processing on the voice information to be processed, and to extract the first voice feature from the voice information to be processed after the framing processing and windowing processing; the first voice feature comprising at least one of a logarithmic power spectrum and a Mel-frequency cepstral coefficient (MFCC) feature.
- The apparatus according to claim 8, wherein the second processing module is further configured to extract an N-dimensional filter-bank feature and an M-dimensional fundamental frequency (pitch) feature from the voice information to be processed, where N and M are both positive integers; splice the N-dimensional filter-bank feature and the M-dimensional pitch feature to obtain a second voice feature; and input the second voice feature into a trained automatic speech recognition (ASR) network, extracting the first text bottleneck feature from the bottleneck linear layer of the trained ASR network.
- The apparatus according to claim 10, wherein the second processing module is further configured to train the ASR network; the manner of training the ASR network comprising a training step of: aligning, by means of a Gaussian mixture model (GMM), the text annotations included in a corpus with the audio files corresponding to the text annotations to obtain a first text feature, the corpus being used to train the ASR network; extracting an N-dimensional filter-bank feature and an M-dimensional fundamental frequency (pitch) feature from the audio files; splicing the N-dimensional filter-bank feature and the M-dimensional pitch feature to obtain a third voice feature; inputting the third voice feature into the ASR network and training the ASR network to obtain a second text feature output by the output layer of the ASR network; and determining the cross-entropy (CE) value of the ASR network according to the value of the first text feature and the value of the second text feature; the training step being repeated, and the trained ASR network being obtained when the difference between the CE value of the ASR network obtained in the current training round and the CE value of the ASR network obtained in the previous training round is within a first threshold range.
- The apparatus according to claim 8, wherein the fourth processing module is further configured to perform speech enhancement processing on the first combined feature vector through the trained one-way LSTM model, and to perform an inverse feature transformation on the processing result, converting the voice information from the frequency domain to the time domain to obtain the processed voice information.
- The apparatus according to claim 8, wherein the fourth processing module is further configured to train the one-way LSTM model; the manner of training the one-way LSTM model comprising: collecting noisy speech and noise-free speech included in a noise-reduction training corpus; extracting a fourth voice feature and a second text bottleneck feature from the noisy speech, and extracting a fifth voice feature from the noise-free speech; combining the fourth voice feature with the second text bottleneck feature to obtain a second combined feature vector; and inputting the second combined feature vector into the one-way LSTM model and training the one-way LSTM model, the trained one-way LSTM model being obtained when the minimum mean square error between the reference value output by the one-way LSTM model and the value of the fifth voice feature is less than or equal to a second threshold.
- An electronic device, comprising a processor and a memory; the memory being configured to store a computer program, and the processor being configured to execute the voice processing method according to any one of claims 1 to 7 by invoking the computer program.
- A computer-readable storage medium storing a computer program, the computer program being used to implement the voice processing method according to any one of claims 1 to 7 when executed by a processor.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021560990A JP7258182B2 (ja) | 2019-09-05 | 2020-07-13 | Speech processing method, apparatus, electronic device, and computer program |
EP20860732.5A EP3933829B1 (en) | 2019-09-05 | 2020-07-13 | Speech processing method and apparatus, electronic device, and computer-readable storage medium |
US17/460,924 US11948552B2 (en) | 2019-09-05 | 2021-08-30 | Speech processing method, apparatus, electronic device, and computer-readable storage medium |
US18/425,381 US20240169975A1 (en) | 2019-09-05 | 2024-01-29 | Speech processing method, apparatus, electronic device, and computer-readable storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910838192.6 | 2019-09-05 | ||
CN201910838192.6A CN110379412B (zh) | 2019-09-05 | 2019-09-05 | Speech processing method and apparatus, electronic device, and computer-readable storage medium |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/460,924 Continuation US11948552B2 (en) | 2019-09-05 | 2021-08-30 | Speech processing method, apparatus, electronic device, and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021042870A1 true WO2021042870A1 (zh) | 2021-03-11 |
Family
ID=68261527
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/101602 WO2021042870A1 (zh) | 2019-09-05 | 2020-07-13 | 语音处理的方法、装置、电子设备及计算机可读存储介质 |
Country Status (5)
Country | Link |
---|---|
US (2) | US11948552B2 (zh) |
EP (1) | EP3933829B1 (zh) |
JP (1) | JP7258182B2 (zh) |
CN (1) | CN110379412B (zh) |
WO (1) | WO2021042870A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113593598A (zh) * | 2021-08-09 | 2021-11-02 | 深圳远虑科技有限公司 | Noise reduction method and apparatus for an audio amplifier in standby state, and electronic device |
CN113763987A (zh) * | 2021-09-06 | 2021-12-07 | 中国科学院声学研究所 | Training method and apparatus for a voice conversion model |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110379412B (zh) | 2019-09-05 | 2022-06-17 | 腾讯科技(深圳)有限公司 | Speech processing method and apparatus, electronic device, and computer-readable storage medium |
CN112786001B (zh) * | 2019-11-11 | 2024-04-09 | 北京地平线机器人技术研发有限公司 | Speech synthesis model training method, speech synthesis method, and apparatus |
CN112786016B (zh) * | 2019-11-11 | 2022-07-19 | 北京声智科技有限公司 | Speech recognition method, apparatus, medium, and device |
CN110875037A (zh) * | 2019-11-19 | 2020-03-10 | 腾讯科技(深圳)有限公司 | Speech data processing method and apparatus, and electronic device |
CN111460094B (zh) * | 2020-03-17 | 2023-05-05 | 云知声智能科技股份有限公司 | TTS-based audio splicing optimization method and apparatus |
CN111583919B (zh) * | 2020-04-15 | 2023-10-13 | 北京小米松果电子有限公司 | Information processing method, apparatus, and storage medium |
WO2022027423A1 (zh) * | 2020-08-06 | 2022-02-10 | 大象声科(深圳)科技有限公司 | Deep learning noise reduction method and system fusing bone vibration sensor and dual-microphone signals |
CN112289299B (zh) * | 2020-10-21 | 2024-05-14 | 北京大米科技有限公司 | Training method and apparatus for a speech synthesis model, storage medium, and electronic device |
US11922963B2 (en) * | 2021-05-26 | 2024-03-05 | Microsoft Technology Licensing, Llc | Systems and methods for human listening and live captioning |
CN113674735B (zh) * | 2021-09-26 | 2022-01-18 | 北京奇艺世纪科技有限公司 | Voice conversion method and apparatus, electronic device, and readable storage medium |
CN114283794A (zh) * | 2021-12-14 | 2022-04-05 | 达闼科技(北京)有限公司 | Noise filtering method and apparatus, electronic device, and computer-readable storage medium |
CN113921023B (zh) * | 2021-12-14 | 2022-04-08 | 北京百瑞互联技术有限公司 | Bluetooth audio howling suppression method, apparatus, medium, and Bluetooth device |
CN114267345B (zh) * | 2022-02-25 | 2022-05-17 | 阿里巴巴达摩院(杭州)科技有限公司 | Model training method, speech processing method, and apparatus therefor |
CN114743012B (zh) * | 2022-04-08 | 2024-02-06 | 北京金堤科技有限公司 | Text recognition method and apparatus |
CN114863940B (zh) * | 2022-07-05 | 2022-09-30 | 北京百瑞互联技术有限公司 | Model training method for sound quality conversion, method for improving sound quality, apparatus, and medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2017134321A (ja) * | 2016-01-29 | 2017-08-03 | 日本電信電話株式会社 | Signal processing method, signal processing apparatus, and signal processing program |
US20180025721A1 (en) * | 2016-07-22 | 2018-01-25 | Google Inc. | Automatic speech recognition using multi-dimensional models |
CN109065067A (zh) * | 2018-08-16 | 2018-12-21 | 福建星网智慧科技股份有限公司 | Conference terminal speech noise reduction method based on a neural network model |
CN109785852A (zh) * | 2018-12-14 | 2019-05-21 | 厦门快商通信息技术有限公司 | Method and system for enhancing a speaker's speech |
CN109841226A (zh) * | 2018-08-31 | 2019-06-04 | 大象声科(深圳)科技有限公司 | Single-channel real-time noise reduction method based on a convolutional recurrent neural network |
CN109859767A (zh) * | 2019-03-06 | 2019-06-07 | 哈尔滨工业大学(深圳) | Environment-adaptive neural network noise reduction method, system, and storage medium for digital hearing aids |
CN110379412A (zh) * | 2019-09-05 | 2019-10-25 | 腾讯科技(深圳)有限公司 | Speech processing method and apparatus, electronic device, and computer-readable storage medium |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008076897A2 (en) * | 2006-12-14 | 2008-06-26 | Veoh Networks, Inc. | System for use of complexity of audio, image and video as perceived by a human observer |
US20110035215A1 (en) * | 2007-08-28 | 2011-02-10 | Haim Sompolinsky | Method, device and system for speech recognition |
CN107492382B (zh) * | 2016-06-13 | 2020-12-18 | 阿里巴巴集团控股有限公司 | 基于神经网络的声纹信息提取方法及装置 |
CN108447490B (zh) * | 2018-02-12 | 2020-08-18 | 阿里巴巴集团控股有限公司 | 基于记忆性瓶颈特征的声纹识别的方法及装置 |
CN108461085A (zh) * | 2018-03-13 | 2018-08-28 | 南京邮电大学 | 一种短时语音条件下的说话人识别方法 |
US10672414B2 (en) * | 2018-04-13 | 2020-06-02 | Microsoft Technology Licensing, Llc | Systems, methods, and computer-readable media for improved real-time audio processing |
CN108682418B (zh) * | 2018-06-26 | 2022-03-04 | 北京理工大学 | 一种基于预训练和双向lstm的语音识别方法 |
US11069352B1 (en) * | 2019-02-18 | 2021-07-20 | Amazon Technologies, Inc. | Media presence detection |
US11158307B1 (en) * | 2019-03-25 | 2021-10-26 | Amazon Technologies, Inc. | Alternate utterance generation |
US10923111B1 (en) * | 2019-03-28 | 2021-02-16 | Amazon Technologies, Inc. | Speech detection and speech recognition |
- 2019
- 2019-09-05 CN CN201910838192.6A patent/CN110379412B/zh active Active
- 2020
- 2020-07-13 JP JP2021560990A patent/JP7258182B2/ja active Active
- 2020-07-13 EP EP20860732.5A patent/EP3933829B1/en active Active
- 2020-07-13 WO PCT/CN2020/101602 patent/WO2021042870A1/zh active Application Filing
- 2021
- 2021-08-30 US US17/460,924 patent/US11948552B2/en active Active
- 2024
- 2024-01-29 US US18/425,381 patent/US20240169975A1/en active Pending
Non-Patent Citations (1)
Title |
---|
See also references of EP3933829A4 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113593598A (zh) * | 2021-08-09 | 2021-11-02 | 深圳远虑科技有限公司 | Noise reduction method and apparatus for an audio amplifier in standby state, and electronic device |
CN113593598B (zh) * | 2021-08-09 | 2024-04-12 | 深圳远虑科技有限公司 | Noise reduction method and apparatus for an audio amplifier in standby state, and electronic device |
CN113763987A (zh) * | 2021-09-06 | 2021-12-07 | 中国科学院声学研究所 | Training method and apparatus for a voice conversion model |
Also Published As
Publication number | Publication date |
---|---|
JP2022529641A (ja) | 2022-06-23 |
US11948552B2 (en) | 2024-04-02 |
US20240169975A1 (en) | 2024-05-23 |
US20210390946A1 (en) | 2021-12-16 |
EP3933829A4 (en) | 2022-06-08 |
CN110379412B (zh) | 2022-06-17 |
CN110379412A (zh) | 2019-10-25 |
EP3933829B1 (en) | 2024-06-12 |
EP3933829A1 (en) | 2022-01-05 |
JP7258182B2 (ja) | 2023-04-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021042870A1 (zh) | Speech processing method and apparatus, electronic device, and computer-readable storage medium | |
Zhao et al. | Monaural speech dereverberation using temporal convolutional networks with self attention | |
CN110600018B (zh) | Speech recognition method and apparatus, and neural network training method and apparatus | |
CN109065067B (zh) | Conference terminal speech noise reduction method based on a neural network model | |
WO2021143326A1 (zh) | Speech recognition method, apparatus, device, and storage medium | |
KR100908121B1 (ko) | Method and apparatus for converting speech feature vectors | |
TW201935464A (zh) | Method and apparatus for voiceprint recognition based on memorability bottleneck features | |
WO2019113130A1 (en) | Voice activity detection systems and methods | |
Zhao et al. | Late reverberation suppression using recurrent neural networks with long short-term memory | |
WO2018223727A1 (zh) | Method, apparatus, device, and medium for voiceprint recognition | |
CN111986679A (zh) | Speaker verification method, system, and storage medium for complex acoustic environments | |
CN111508519B (zh) | Method and apparatus for enhancing the human voice in an audio signal | |
CN111192598A (zh) | Speech enhancement method using a skip-connection deep neural network | |
CN114338623B (zh) | Audio processing method, apparatus, device, and medium | |
WO2023216760A1 (zh) | Speech processing method, apparatus, storage medium, computer device, and program product | |
CN114267372A (zh) | Speech noise reduction method, system, electronic device, and storage medium | |
Ali et al. | Speech enhancement using dilated wave-u-net: an experimental analysis | |
Li et al. | A Convolutional Neural Network with Non-Local Module for Speech Enhancement. | |
US20230186943A1 (en) | Voice activity detection method and apparatus, and storage medium | |
CN113763978B (zh) | Speech signal processing method, apparatus, electronic device, and storage medium | |
Sivapatham et al. | Gammatone filter bank-deep neural network-based monaural speech enhancement for unseen conditions | |
Li et al. | Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network | |
Romaniuk et al. | Efficient low-latency speech enhancement with mobile audio streaming networks | |
Tzudir et al. | Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients | |
CN112750469A (zh) | Method for detecting music in speech, method for optimizing voice communication, and corresponding apparatus | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20860732 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 20860732.5 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2020860732 Country of ref document: EP Effective date: 20210930 |
|
ENP | Entry into the national phase |
Ref document number: 2021560990 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |