WO2021057239A1 - Voice data processing method and apparatus, electronic device, and readable storage medium - Google Patents

Voice data processing method and apparatus, electronic device, and readable storage medium Download PDF

Info

Publication number
WO2021057239A1
WO2021057239A1 (PCT/CN2020/105034)
Authority
WO
WIPO (PCT)
Prior art keywords: voice, enhancement, data, processing, speech
Prior art date
Application number
PCT/CN2020/105034
Other languages
English (en)
French (fr)
Inventor
黄俊
王燕南
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2021558880A (patent JP7301154B2)
Priority to EP20868291.4A (patent EP3920183A4)
Publication of WO2021057239A1
Priority to US17/447,536 (patent US20220013133A1)

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • This application relates to the field of Internet technology. Specifically, this application relates to a voice data processing method and apparatus, an electronic device, and a computer-readable storage medium.
  • The essence of speech enhancement is speech noise reduction.
  • Speech collected by a microphone usually carries various kinds of noise.
  • The main purpose of speech enhancement is to recover noise-free speech from noisy speech.
  • Through speech enhancement, various interference signals can be effectively suppressed and the target speech signal can be enhanced, which not only improves speech intelligibility and voice quality but also helps improve speech recognition.
  • An embodiment of the present application provides a method for processing voice data. The method is executed by a server and includes: receiving first voice data sent by a sender and acquiring corresponding voice enhancement parameters; performing voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and determining a first voice enhancement parameter based on the first voice data; and sending the first voice enhancement data to a receiver and updating the acquired voice enhancement parameters with the first voice enhancement parameter to obtain updated voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters.
  • An embodiment of the present application provides a device for processing voice data, which includes:
  • a receiving module, used to receive the first voice data sent by the sender;
  • an acquisition module, used to acquire corresponding voice enhancement parameters;
  • a processing module, configured to perform voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and to determine a first voice enhancement parameter based on the first voice data;
  • an update module, configured to update the acquired voice enhancement parameters with the first voice enhancement parameter to obtain updated voice enhancement parameters, which are used, when second voice data sent by the sender is received, to perform voice enhancement processing on the second voice data based on the updated voice enhancement parameters;
  • a sending module, used to send the first voice enhancement data to the receiver.
  • An embodiment of the present application also provides an electronic device, which includes:
  • a processor, a memory, and a bus;
  • the bus, used to connect the processor and the memory;
  • the memory, used to store operation instructions;
  • the processor, configured to, by calling the operation instructions, execute the executable instructions so that the processor performs the operations corresponding to the voice data processing method shown above in this application.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the voice data processing method shown above in this application is implemented.
  • FIG. 1A is a system architecture diagram to which a voice data processing method provided by an embodiment of the application is applicable;
  • FIG. 1B is a schematic flowchart of a method for processing voice data according to an embodiment of this application
  • Figure 2 is a schematic diagram of the structure of the LSTM model in this application.
  • Figure 3 is a schematic diagram of the logical steps of speech feature extraction in this application.
  • FIG. 4 is a schematic structural diagram of a voice data processing device provided by another embodiment of this application.
  • FIG. 5 is a schematic structural diagram of an electronic device for processing voice data according to another embodiment of this application.
  • Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning or deep learning.
  • The key technologies of speech technology include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition.
  • Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice has become one of the most promising modes of human-computer interaction.
  • When performing speech enhancement on speech to be processed in the related art, a general noise reduction model is first trained; then, for each speaker, the model is adaptively trained with that speaker's voice data, and the noise reduction model corresponding to each speaker is stored. For each speaker, the corresponding noise reduction model must be obtained and used to perform noise reduction on that speaker's voice data. In this way, the noise reduction model corresponding to every speaker needs to be stored, and the storage requirement is high.
  • The embodiments of the present application provide a voice data processing method and apparatus, an electronic device, and a computer-readable storage medium, aiming to solve the above technical problems in the related art.
  • FIG. 1A is a system architecture diagram to which the voice processing method provided in an embodiment of the present application is applicable.
  • the system architecture diagram includes: a server 11, a network 12, and terminal devices 13 and 14, wherein the server 11 establishes a connection with the terminal device 13 and the terminal device 14 through the network 12.
  • the server 11 is a background server that processes the received voice data after receiving the voice data sent by the sender.
  • The server 11, together with the terminal device 13 and the terminal device 14, provides services for users.
  • For example, the server 11 processes the voice data sent by the terminal device 13 (or the terminal device 14) corresponding to the sender, and then sends the obtained voice enhancement data to the terminal device 14 (or terminal device 13) corresponding to the recipient so that it can be provided to the user; the server 11 may be a single server or a cluster composed of multiple servers.
  • The network 12 may include wired and wireless networks. As shown in Figure 1A, on the access network side, the terminal device 13 and the terminal device 14 can be connected to the network 12 wirelessly or by wire; on the core network side, the server 11 is generally connected to the network 12 by wire. Of course, the server 11 may also be connected to the network 12 wirelessly.
  • the above-mentioned terminal device 13 and terminal device 14 may refer to smart devices with data calculation and processing functions, for example, they can play processed voice enhancement data provided by a server.
  • the terminal device 13 and the terminal device 14 include, but are not limited to, a smart phone (installed with a communication module), a handheld computer, a tablet computer, and the like.
  • the terminal device 13 and the terminal device 14 are respectively installed with operating systems, including but not limited to: Android operating system, Symbian operating system, Windows mobile operating system, Apple iPhone OS operating system, and so on.
  • an embodiment of the present application provides a method for processing voice data, and the processing method is executed by the server 11 in FIG. 1A. As shown in Figure 1B, the method includes:
  • Step S101 When the first voice data sent by the sender is received, corresponding voice enhancement parameters are obtained.
  • In acquiring the corresponding voice enhancement parameters, pre-stored voice enhancement parameters corresponding to the sender are obtained; if no voice enhancement parameters corresponding to the sender are obtained, preset voice enhancement parameters are obtained instead.
  • The sender can be the party that sends the voice data.
  • For example, if user A speaks through the terminal device 13, the terminal device 13 can be the sender, and the content of user A's speech can be the first voice data, which is transmitted to the server through the network.
  • After the server receives the first voice data, it can obtain the corresponding voice enhancement parameters and then perform voice enhancement processing on the first voice data.
  • The server can run an LSTM (Long Short-Term Memory) model, which can be used to perform voice enhancement processing on voice data.
  • Step S102: Perform voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and determine a first voice enhancement parameter based on the first voice data.
  • In some embodiments, if no voice enhancement parameters corresponding to the sender are obtained, voice enhancement processing is performed on the first voice data based on the preset voice enhancement parameters to obtain the first voice enhancement data.
  • In some embodiments, if voice enhancement parameters corresponding to the sender are obtained, voice enhancement processing is performed on the first voice data based on those parameters to obtain the first voice enhancement data.
  • In practical applications, if no voice enhancement parameters corresponding to the sender are obtained, the first voice data is enhanced based on the preset voice enhancement parameters; if voice enhancement parameters corresponding to the sender are obtained, the first voice data is enhanced based on the voice enhancement parameters corresponding to the sender.
  • In some embodiments, performing voice enhancement processing on the first voice data based on the preset voice enhancement parameters to obtain the first voice enhancement data, and determining the first voice enhancement parameter based on the first voice data, includes: performing feature sequence processing on the first voice data through the trained voice enhancement model to obtain a first voice feature sequence, the voice enhancement model being set with the preset voice enhancement parameters; performing batch calculation on the first voice feature sequence using the preset voice enhancement parameters to obtain a processed first voice feature sequence and the first voice enhancement parameter; and performing inverse feature transformation on the processed first voice feature sequence to obtain the first voice enhancement data.
  • In some embodiments, performing voice enhancement processing on the first voice data based on the voice enhancement parameters corresponding to the sender to obtain the first voice enhancement data, and determining the first voice enhancement parameter based on the first voice data, includes: performing feature sequence processing on the first voice data through the trained voice enhancement model to obtain a second voice feature sequence; performing batch calculation on the second voice feature sequence using the voice enhancement parameters corresponding to the sender to obtain a processed second voice feature sequence and a second voice enhancement parameter; and performing inverse feature transformation on the processed second voice feature sequence to obtain processed second voice enhancement data, which is used as the first voice enhancement data.
  • Step S103: Send the first voice enhancement data to the receiver, and update the acquired voice enhancement parameters with the first voice enhancement parameter to obtain updated voice enhancement parameters, which are used, when second voice data sent by the sender is received, to perform voice enhancement processing on the second voice data based on the updated voice enhancement parameters.
  • In some embodiments, if no voice enhancement parameters corresponding to the sender are obtained, the acquired preset voice enhancement parameters are updated based on the first voice enhancement parameter to obtain the updated voice enhancement parameters, and the first voice enhancement parameter is used as the voice enhancement parameter corresponding to the sender.
  • In some embodiments, if voice enhancement parameters corresponding to the sender are obtained, the first voice enhancement parameter is used to update the voice enhancement parameters corresponding to the sender to obtain the updated voice enhancement parameters.
  • Specifically, after the first voice enhancement parameter is determined based on the first voice data, if the storage container holds no voice enhancement parameter corresponding to the sender, the first voice enhancement parameter can be stored in the storage container as the voice enhancement parameter corresponding to the sender; if a voice enhancement parameter corresponding to the sender has already been saved, the first voice enhancement parameter replaces it, as in the sketch below.
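  • As a minimal illustration of this per-sender parameter store, a sketch follows (the class and method names are hypothetical, not taken from the patent):

```python
class EnhancementParamStore:
    """Holds per-sender speech enhancement parameters (e.g. the mean and
    variance used by the batch processing layer), falling back to the
    preset parameters of the general model."""

    def __init__(self, preset_params):
        self.preset_params = preset_params  # parameters of the general model
        self._store = {}                    # sender_id -> parameters

    def get(self, sender_id):
        # Return the sender's parameters if stored, else the preset ones.
        return self._store.get(sender_id, self.preset_params)

    def update(self, sender_id, new_params):
        # First utterance: store the newly determined parameters; later
        # utterances: replace the saved parameters with the new ones.
        self._store[sender_id] = new_params
```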
  • At the same time, the server sends the first voice enhancement data obtained through the voice enhancement processing to the receiver, and the receiver only needs to play it after receiving it.
  • In some embodiments, the trained speech enhancement model is generated in the following manner, schematically sketched below: acquiring first speech sample data containing noise and performing speech feature extraction on it to obtain a first speech feature sequence; acquiring second speech sample data containing no noise and performing speech feature extraction on it to obtain a second speech feature sequence; and training a preset speech enhancement model with the first speech feature sequence, obtaining the first speech feature sequence output by the model under training, and computing the similarity between that output sequence and the second speech feature sequence, until the similarity exceeds a preset similarity threshold, at which point the trained speech enhancement model is obtained.
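  • A schematic of this train-until-similar loop (purely illustrative: the stopping threshold is an assumption, and `model_step` stands for one training update of the speech enhancement model that returns its current output features):

```python
def train_until_similar(model_step, noisy_feats, clean_feats,
                        similarity, threshold=0.95, max_steps=10000):
    """Run training steps until the model's output feature sequence is
    similar enough to the noise-free target feature sequence."""
    for step in range(max_steps):
        enhanced = model_step(noisy_feats, clean_feats)  # one training update
        if similarity(enhanced, clean_feats) > threshold:
            return step  # similarity threshold exceeded: training done
    return max_steps
```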
  • In some embodiments, the voice feature sequence is extracted as follows (as sketched below): performing framing and windowing on the voice sample data to obtain at least two voice frames of the voice sample data; performing a fast Fourier transform on each voice frame to obtain the discrete power spectrum corresponding to each voice frame; and taking the logarithm of each discrete power spectrum to obtain the logarithmic power spectrum corresponding to each voice frame, the logarithmic power spectra serving as the voice feature sequence of the voice sample data.
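  • The following sketch illustrates this framing/windowing/FFT/log pipeline with NumPy (frame length, hop size, and window choice are illustrative assumptions, not values given in the patent):

```python
import numpy as np

def log_power_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Frame the signal, window each frame, and return the per-frame
    log power spectra as the voice feature sequence."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)

    features = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)            # fast Fourier transform
        power = np.abs(spectrum) ** 2            # discrete power spectrum
        features.append(np.log(power + 1e-10))   # logarithmic power spectrum
    return np.stack(features)                    # shape: (frames, bins)
```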
  • In the embodiment of the present application, when the first voice data sent by the sender is received, the corresponding voice enhancement parameters are acquired; voice enhancement processing is then performed on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data, and the first voice enhancement parameter is determined based on the first voice data; the first voice enhancement parameter is then used to update the acquired voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing can be performed on the second voice data based on the updated voice enhancement parameters; and the first voice enhancement data is sent to the receiver.
  • In this way, the server can perform voice enhancement processing on the sender's voice data based on the voice enhancement parameters corresponding to that sender.
  • Because different senders correspond to different voice enhancement parameters, the voice enhancement effect obtained for different senders also differs, so the enhancement is targeted to each sender. This achieves targeted speech enhancement without requiring multiple models: only the voice enhancement parameters need to be stored, not a model per speaker, so the storage requirement is low.
  • the embodiment of the present application describes in detail a method for processing voice data as shown in FIG. 1B.
  • Step S101 when the first voice data sent by the sender is received, corresponding voice enhancement parameters are obtained;
  • The sender can be the party that sends the voice data.
  • For example, if user A speaks through the terminal device 13, the terminal device 13 can be the sender, and the content of user A's speech can be the first voice data, which is transmitted to the server through the network. After the server receives the first voice data, it can obtain the corresponding voice enhancement parameters and then perform voice enhancement processing on the first voice data.
  • The server can run an LSTM (Long Short-Term Memory) model, which can be used to perform voice enhancement processing on voice data.
  • The essence of speech enhancement is speech noise reduction.
  • Speech collected by a microphone usually carries various kinds of noise.
  • The main purpose of speech enhancement is to recover noise-free speech from noisy speech.
  • Through speech enhancement, various interference signals can be effectively suppressed and the target speech signal can be enhanced, which not only improves speech intelligibility and speech quality but also helps improve speech recognition.
  • The basic structure of the LSTM model is shown in Figure 2, and includes a front-end LSTM layer, a batch processing layer, and a back-end LSTM layer, where X is each frame of voice in the voice data and t is a time window.
  • A frame of speech is a short segment of the speech signal.
  • Macroscopically, the voice signal is not stationary; microscopically it is stationary and has short-term stability (the voice signal can be considered approximately unchanged within 10 to 30 ms).
  • This property can be used to divide the voice signal into short segments, each of which is called a frame.
  • For example, if a segment of speech is 1 s long and the length of one frame is 10 ms, the segment includes 100 frames.
  • The front-end LSTM layer, batch processing layer, and back-end LSTM layer simultaneously process voice frames in different time windows.
  • The batch processing layer is used to calculate the voice enhancement parameters corresponding to the voice data, such as the mean and variance; a structural sketch follows.
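  • A minimal sketch of such a structure, written here in PyTorch purely for illustration (the patent does not specify a framework, layer sizes, or the exact wiring; these are assumptions):

```python
import torch.nn as nn

class EnhancementLSTM(nn.Module):
    """Front-end LSTM -> batch processing layer -> back-end LSTM.
    Feature and hidden dimensions are illustrative assumptions."""

    def __init__(self, feat_dim=257, hidden=256):
        super().__init__()
        self.front = nn.LSTM(feat_dim, hidden, batch_first=True)
        # The batch processing layer normalizes features using a mean and
        # variance: the "voice enhancement parameters" that can be swapped
        # per sender.
        self.batch = nn.BatchNorm1d(hidden)
        self.back = nn.LSTM(hidden, feat_dim, batch_first=True)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        h, _ = self.front(x)
        h = self.batch(h.transpose(1, 2)).transpose(1, 2)
        y, _ = self.back(h)
        return y                           # enhanced feature sequence
```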
  • the terminal device 13 and the terminal device 14 may also have the following characteristics:
  • In the hardware system, the device has a central processing unit, a memory, an input component, and an output component; that is to say, the device is often a microcomputer device with communication functions.
  • The device can also have a variety of input methods, such as a keyboard, mouse, touch screen, microphone, and camera, and the input can be adjusted as needed.
  • The device often has a variety of output methods, such as a receiver and a display screen, which can also be adjusted as needed.
  • In the software system, the device must have an operating system, such as Windows Mobile, Symbian, Palm, Android, or iOS. These operating systems are becoming more and more open, and personalized applications based on these open operating system platforms keep emerging, such as address books, calendars, notepads, calculators, and various games, which greatly satisfy the needs of individual users.
  • the device has flexible access methods and high-bandwidth communication performance, and can automatically adjust the selected communication method according to the selected service and the environment in which it is located, so that it is convenient for users to use.
  • the equipment can support GSM (Global System for Mobile Communication), WCDMA (Wideband Code Division Multiple Access), CDMA2000 (Code Division Multiple Access), TDSCDMA (Time Division- Synchronous Code Division Multiple Access, Time Division Synchronous Code Division Multiple Access), Wi-Fi (Wireless-Fidelity, Wireless Fidelity), and WiMAX (Worldwide Interoperability for Microwave Access), etc., so as to adapt to multiple standard networks, Not only supports voice services, but also supports multiple wireless data services;
  • Such devices pay more attention to humanization, individualization, and multi-functionality.
  • Devices have moved from a "device-centric" model to a "human-centric" model, integrating embedded computing, control technology, artificial intelligence technology, and biometric authentication technology, which fully embodies the people-oriented aim.
  • the equipment can be adjusted according to individual needs and become more personalized.
  • the device itself integrates many software and hardware, and its functions are becoming more and more powerful.
  • the acquiring corresponding speech enhancement parameters includes:
  • the server may use the trained LSTM model to perform voice enhancement processing on the first voice data.
  • The trained LSTM model is a general model whose speech enhancement parameters are the preset ones; that is, the speech enhancement parameters in the trained LSTM model are general-purpose.
  • The trained LSTM model can therefore perform speech enhancement processing on any user's speech data.
  • In order to provide targeted speech enhancement for different users, the trained LSTM model can be further trained with a given user's speech data to obtain that user's speech enhancement parameters; in this way, when that user's speech data is subsequently processed, the user's own speech enhancement parameters can be used for the voice enhancement.
  • the voice data of user A is used to train the trained LSTM model to obtain the voice enhancement parameters of user A.
  • the trained LSTM model can use user A's voice enhancement parameters for voice enhancement processing.
  • the server when the server receives the user's first voice data, it may first obtain the user's voice enhancement parameters.
  • the voice enhancement parameters corresponding to each user may be stored in the storage container of the server, or may be stored in the storage container of other devices, which is not limited in the embodiment of the present application.
  • the server does not obtain the user's voice enhancement parameters, it means that the server has received the user's voice data for the first time, and it is sufficient to obtain the preset voice enhancement parameters at this time.
  • Step S102 Perform voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and determine the first voice enhancement parameters based on the first voice data;
  • In practical applications, if no voice enhancement parameters corresponding to the sender are obtained, the first voice data is enhanced based on the preset voice enhancement parameters; if voice enhancement parameters corresponding to the sender are obtained, voice enhancement processing is performed on the first voice data based on the voice enhancement parameters corresponding to the sender.
  • In some embodiments, if no voice enhancement parameters corresponding to the sender are obtained, the step of performing voice enhancement processing on the first voice data based on the obtained voice enhancement parameters to obtain the first voice enhancement data, and determining the first voice enhancement parameter based on the first voice data, includes the following.
  • The first voice data can be input into the trained LSTM model, which performs feature sequence processing on the first voice data to obtain the first voice feature sequence corresponding to the first voice data, where the first voice feature sequence includes at least two voice features.
  • The first voice feature sequence is then batch-calculated using the preset voice enhancement parameters to obtain the processed first voice feature sequence, and inverse feature transformation is performed on the processed first voice feature sequence to obtain the first voice enhancement data; that is, the trained LSTM model (the general model) is used to perform voice enhancement processing on the first voice data.
  • During the batch calculation, the following formula (1) and formula (2) can be used:

    $\hat{x}_i = (x_i - \mu_B) / \sqrt{\sigma_B^2 + \epsilon}$  (1)

    $y_i = \gamma \hat{x}_i + \beta$  (2)

  • where $\mu_B$ is the mean in the speech enhancement parameters, $\sigma_B^2$ is the variance in the speech enhancement parameters, $x_i$ is the input speech feature, $y_i$ is the output speech feature after speech enhancement, and $\gamma$, $\beta$, and $\epsilon$ are variable parameters.
  • At the same time, the first voice data is used to train the trained LSTM model to obtain the first voice enhancement parameter, that is, the voice enhancement parameter corresponding to the sender, which is then stored.
  • During this training of the trained LSTM model, the following formula (3) and formula (4) can be used:

    $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$  (3)

    $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$  (4)

  • where $\mu_B$ is the mean in the speech enhancement parameters, $\sigma_B^2$ is the variance in the speech enhancement parameters, $x_i$ is the input speech feature, and $m$ is the number of speech features.
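  • Taken together, formulas (1) to (4) are the standard batch normalization computation. A NumPy sketch of this batch calculation follows (an illustration under assumptions: the γ, β, and ε values are arbitrary, and `stats` plays the role of the stored per-sender enhancement parameters):

```python
import numpy as np

def batch_calculate(features, gamma=1.0, beta=0.0, eps=1e-5, stats=None):
    """Apply formulas (1)-(4) to a feature sequence of shape (frames, dims).
    If `stats` (mean, variance) is given, it acts as the stored speech
    enhancement parameters; otherwise new statistics are computed from
    this utterance and returned as the new parameters."""
    if stats is None:
        mu = features.mean(axis=0)                 # formula (3)
        var = features.var(axis=0)                 # formula (4)
    else:
        mu, var = stats
    x_hat = (features - mu) / np.sqrt(var + eps)   # formula (1)
    y = gamma * x_hat + beta                       # formula (2)
    return y, (mu, var)  # processed sequence and the statistics used
```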
  • It should be noted that performing voice enhancement processing on the first voice data based on the acquired voice enhancement parameters and determining the first voice enhancement parameter based on the first voice data may be executed sequentially, in parallel, and so on; in actual applications this can be adjusted according to actual needs, and the embodiment of the present application does not limit the execution order.
  • In some embodiments, if voice enhancement parameters corresponding to the sender are obtained, the step of performing voice enhancement processing on the first voice data based on the obtained voice enhancement parameters to obtain the first voice enhancement data, and determining the first voice enhancement parameter based on the first voice data, includes the following.
  • The first voice data can be input into the trained LSTM model, which performs feature sequence processing on the first voice data to obtain the second voice feature sequence corresponding to the first voice data, where the second voice feature sequence includes at least two voice features.
  • The second voice feature sequence is then batch-calculated using the voice enhancement parameters corresponding to the sender to obtain the processed second voice feature sequence, and inverse feature transformation is performed on it to obtain the second voice enhancement data; that is, the voice enhancement parameters in the trained LSTM model are replaced with the voice enhancement parameters corresponding to the sender, and the updated LSTM model is then used to perform the voice enhancement processing.
  • During this batch calculation, formula (1) and formula (2) can likewise be used, which will not be repeated here.
  • Likewise, formula (3) and formula (4) can be used when training the updated LSTM model, which will not be repeated here.
  • the trained speech enhancement model is generated in the following manner:
  • Specifically, first voice sample data containing noise is acquired, and voice feature extraction is performed on it to obtain voice feature a; second voice sample data containing no noise is acquired, and voice feature extraction is performed on it to obtain voice feature b. Voice feature a is then input into the original LSTM model, and voice feature b is used as the training target to perform one-way training on the original LSTM model, that is, the parameters in the LSTM model are adjusted in one direction.
  • The similarity calculation can use similarity measures such as the cosine of the included angle or the Pearson correlation coefficient, or distance measures such as the Euclidean distance or the Manhattan distance; other calculation methods can of course also be used, as sketched below. The specific calculation mode can be set according to actual needs, which is not limited in the embodiment of the present application.
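  • For instance, a cosine similarity measure and a Euclidean distance measure over two feature sequences could be sketched as follows (a distance measure would need to be inverted or thresholded from below, since a smaller distance means more similar):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened feature sequences."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Euclidean distance between two flattened feature sequences."""
    return float(np.linalg.norm(a.ravel() - b.ravel()))
```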
  • the manner of speech feature extraction includes:
  • Specifically, the voice sample data is a voice signal, which is a time-domain signal that the processor cannot process directly. Therefore, framing and windowing are performed on the voice sample data to obtain at least two voice frames of the voice sample data, so that the time-domain signal can be converted into a frequency-domain signal that the processor can process, as shown in Figure 3.
  • An FFT (Fast Fourier Transform) is then performed on each voice frame to obtain the discrete power spectrum corresponding to each voice frame, and the logarithm of each discrete power spectrum is taken to obtain the logarithmic power spectrum corresponding to each voice frame, thereby obtaining the voice feature corresponding to each voice frame; the collection of all voice features is the voice feature sequence corresponding to the voice sample data. Inverse feature transformation converts the frequency-domain voice feature sequence back into a time-domain voice signal, as sketched below.
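  • As an illustration of the inverse transformation (the patent does not detail it; reusing the phase of the original noisy frames and overlap-adding is a common approach and an assumption here):

```python
import numpy as np

def frames_from_log_power(log_power, noisy_phase):
    """Recover time-domain frames from enhanced log power spectra,
    reusing the phase of the original noisy frames."""
    magnitude = np.sqrt(np.exp(log_power))       # log power -> magnitude
    spectrum = magnitude * np.exp(1j * noisy_phase)
    return np.fft.irfft(spectrum, axis=-1)       # one time-domain frame per row

def overlap_add(frames, hop_len):
    """Stitch overlapping frames back into a single waveform."""
    frame_len = frames.shape[1]
    out = np.zeros(hop_len * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop_len:i * hop_len + frame_len] += frame
    return out
```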
  • The feature extraction method for the first voice sample data is the same as that for the second voice sample data; therefore, for convenience of description, the embodiment of the present application refers to the first voice sample data and the second voice sample data collectively as voice sample data.
  • Step S103: Send the first voice enhancement data to the receiver, and update the acquired voice enhancement parameters with the first voice enhancement parameter to obtain updated voice enhancement parameters, which are used, when second voice data sent by the sender is received, to perform voice enhancement processing on the second voice data based on the updated voice enhancement parameters.
  • Specifically, the first voice enhancement parameter is used to update the acquired voice enhancement parameters to obtain the updated voice enhancement parameters; in this way, no adaptive training is required.
  • If the storage container holds no voice enhancement parameter corresponding to the sender, the first voice enhancement parameter can be stored in the storage container as the voice enhancement parameter corresponding to the sender; if a voice enhancement parameter corresponding to the sender has already been saved, the first voice enhancement parameter replaces it.
  • When the second voice data sent by the sender is received, voice enhancement processing can be performed on the second voice data based on the first voice enhancement parameter, that is, the updated voice enhancement parameter.
  • In this way, the server can continuously perform one-way training on the trained LSTM model based on the latest voice data sent by the sender, thereby continuously updating the voice enhancement parameters corresponding to the sender, so that the parameters match the sender better and better and the voice enhancement effect for the sender keeps improving.
  • At the same time, the server sends the first voice enhancement data obtained through the voice enhancement processing to the receiver, and the receiver only needs to play it after receiving it.
  • It should be noted that updating the voice enhancement parameters and sending the voice enhancement data can be executed sequentially or in parallel; in actual applications this can be set according to actual needs, which is not limited in the embodiments of this application.
  • For example, suppose the trained LSTM model is running in the server with general speech enhancement parameters, and neither the server's storage container nor any other storage container holds user A's speech enhancement parameters.
  • When user A speaks the first sentence, the terminal device corresponding to user A sends the first sentence to the server. After receiving user A's first sentence, the server searches for the voice enhancement parameters corresponding to user A; because no storage container holds user A's voice enhancement parameters, they cannot be obtained. In this case, the general voice enhancement parameters of the trained LSTM model are obtained and used to perform voice enhancement processing on the first sentence; the enhanced first sentence is obtained and sent to the terminal devices corresponding to user B and user C. At the same time, the first sentence is used to perform one-way training on the trained LSTM model, and user A's first voice enhancement parameter is obtained and stored.
  • After user A finishes the second sentence, the terminal device sends the second sentence to the server. After receiving it, the server again searches for the voice enhancement parameters corresponding to user A; this time the search succeeds, the first voice enhancement parameter is retrieved and used to perform voice enhancement processing on the second sentence, and the enhanced second sentence is obtained and sent to the terminal devices corresponding to user B and user C. At the same time, the second sentence is used to perform one-way training on the updated LSTM model, user A's second voice enhancement parameter is obtained, and it replaces the first voice enhancement parameter.
  • The speech enhancement processing for subsequent sentences can be deduced by analogy and will not be repeated here; the overall per-utterance flow is sketched below.
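  • A schematic of this per-utterance server flow, chaining the earlier sketches (`EnhancementParamStore`, `log_power_features`, and `batch_calculate` are the hypothetical helpers defined above; `inverse_transform` and `send` stand in for the inverse transformation and network delivery and are likewise assumptions):

```python
def handle_utterance(store, sender_id, audio, receivers):
    """One round of the example above: enhance the utterance with the
    sender's current parameters, then update the stored parameters."""
    feats = log_power_features(audio)

    # Enhance with the stored parameters (the preset general ones if
    # this is the sender's first utterance).
    enhanced_feats, _ = batch_calculate(feats, stats=store.get(sender_id))

    # Determine new parameters from this utterance and store them for
    # the sender's next utterance, replacing any previous ones.
    _, new_params = batch_calculate(feats, stats=None)
    store.update(sender_id, new_params)

    for receiver in receivers:
        send(receiver, inverse_transform(enhanced_feats))  # hypothetical helpers
```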
  • In the embodiment of the present application, when the first voice data sent by the sender is received, the corresponding voice enhancement parameters are acquired; voice enhancement processing is then performed on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data, and the first voice enhancement parameter is determined based on the first voice data; the first voice enhancement parameter is then used to update the acquired voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing can be performed on it based on the updated voice enhancement parameters; and the first voice enhancement data is sent to the receiver.
  • In this way, the server can perform voice enhancement processing on the sender's voice data based on the voice enhancement parameters corresponding to that sender.
  • Because different senders correspond to different voice enhancement parameters, the voice enhancement effect obtained for different senders also differs, so the enhancement is targeted. This achieves targeted speech enhancement without requiring multiple models: only the voice enhancement parameters need to be stored, not a model per speaker, so the storage requirement is low.
  • Further, the server can continuously perform one-way training on the trained LSTM model based on the latest voice data sent by the sender, thereby continuously updating the sender's voice enhancement parameters, so that the parameters match the sender better and better and the voice enhancement effect for the sender keeps improving.
  • Moreover, it is enough to train the speech enhancement parameters; there is no need to train the entire trained LSTM model or a whole layer of it, which reduces the cost and increases the speed of training.
  • FIG. 4 is a schematic structural diagram of a voice data processing apparatus provided by another embodiment of this application. As shown in FIG. 4, the apparatus in this embodiment may include:
  • the receiving module 401 is configured to receive the first voice data sent by the sender
  • the obtaining module 402 is used to obtain corresponding speech enhancement parameters
  • the processing module 403 is configured to perform voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and determine the first voice enhancement parameters based on the first voice data;
  • the update module 404 is configured to use the first voice enhancement parameter to update the acquired voice enhancement parameter to obtain the updated voice enhancement parameter, which is used when the second voice data sent by the sender is received, based on the Performing voice enhancement processing on the second voice data with the updated voice enhancement parameters;
  • the sending module 405 is configured to send the first voice enhanced data to the receiver.
  • the acquisition module is specifically configured to:
  • In some embodiments, the update module is further configured to update the acquired preset voice enhancement parameters based on the first voice enhancement parameter to obtain the updated voice enhancement parameters, and to use the first voice enhancement parameter as the voice enhancement parameter corresponding to the sender.
  • In some embodiments, the update module is further configured to use the first voice enhancement parameter to update the voice enhancement parameters corresponding to the sender to obtain the updated voice enhancement parameters.
  • In some embodiments, the processing module is further configured to perform voice enhancement processing on the first voice data based on the preset voice enhancement parameters to obtain the first voice enhancement data.
  • the processing module includes: a feature sequence processing sub-module, a batch processing calculation sub-module, and a feature inverse transformation processing sub-module;
  • the feature sequence processing sub-module is used to perform feature sequence processing on the first voice data through the trained voice enhancement model to obtain the first voice feature sequence, the voice enhancement model being set with the preset voice enhancement parameters;
  • a batch calculation sub-module configured to perform batch calculation on the first voice feature sequence by using the preset voice enhancement parameters to obtain the processed first voice feature sequence and the first voice enhancement parameters;
  • the feature inverse transformation processing sub-module is configured to perform feature inverse transformation processing on the processed first speech feature sequence to obtain the first speech enhancement data.
  • In some embodiments, the processing module is further configured to perform voice enhancement processing on the first voice data based on the voice enhancement parameters corresponding to the sender to obtain the first voice enhancement data.
  • the processing module includes: a feature sequence processing sub-module, a batch processing calculation sub-module, and a feature inverse transformation processing sub-module;
  • the feature sequence processing submodule is also used to perform feature sequence processing on the first voice data through the trained voice enhancement model to obtain a second voice feature sequence;
  • the batch calculation sub-module is further configured to perform batch calculation on the second voice feature sequence by using the voice enhancement parameter to obtain the processed second voice feature sequence and the second voice enhancement parameter;
  • the feature inverse transformation processing sub-module is further configured to perform feature inverse transformation processing on the processed second speech feature sequence to obtain processed second speech enhancement data, and to combine the processed second speech enhancement data As the first speech enhancement data.
  • the trained speech enhancement model is generated in the following manner:
  • the manner of extracting the speech feature sequence includes:
  • the voice data processing device of this embodiment can execute the voice data processing method shown in the first embodiment of the present application, and the implementation principles are similar, and will not be repeated here.
  • In the embodiment of the present application, when the first voice data sent by the sender is received, the corresponding voice enhancement parameters are acquired; voice enhancement processing is then performed on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data, and the first voice enhancement parameter is determined based on the first voice data; the first voice enhancement parameter is then used to update the acquired voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing can be performed on it based on the updated voice enhancement parameters; and the first voice enhancement data is sent to the receiver.
  • In this way, the server can perform voice enhancement processing on the sender's voice data based on the voice enhancement parameters corresponding to that sender. Because different senders correspond to different voice enhancement parameters, the voice enhancement effect obtained for different senders also differs, so the enhancement is targeted. This achieves targeted speech enhancement without requiring multiple models: only the voice enhancement parameters need to be stored, so the storage requirement is low.
  • Further, the server can continuously perform one-way training on the trained LSTM model based on the latest voice data sent by the sender, thereby continuously updating the sender's voice enhancement parameters, so that the parameters match the sender better and better and the voice enhancement effect for the sender keeps improving.
  • Moreover, it is enough to train the speech enhancement parameters; there is no need to train the entire trained LSTM model or a whole layer of it, which reduces the cost and increases the speed of training.
  • In an embodiment of the present application, an electronic device includes a memory and a processor, and at least one program stored in the memory and executed by the processor implements the following: when the first voice data sent by the sender is received, the corresponding voice enhancement parameters are acquired; voice enhancement processing is then performed on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data, and the first voice enhancement parameter is determined based on the first voice data; the first voice enhancement parameter is then used to update the acquired voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing is performed on it based on the updated voice enhancement parameters; and the first voice enhancement data is sent to the receiver.
  • In this way, the server can perform voice enhancement processing on the sender's voice data based on the voice enhancement parameters corresponding to that sender. Because different senders correspond to different voice enhancement parameters, the voice enhancement effect obtained for different senders also differs, so the enhancement is targeted. This achieves targeted speech enhancement without requiring multiple models: only the voice enhancement parameters need to be stored, so the storage requirement is low.
  • Further, the server can continuously perform one-way training on the trained LSTM model based on the latest voice data sent by the sender, thereby continuously updating the sender's voice enhancement parameters, so that the parameters match the sender better and better and the voice enhancement effect for the sender keeps improving.
  • Moreover, it is enough to train the speech enhancement parameters; there is no need to train the entire trained LSTM model or a whole layer of it, which reduces the cost and increases the speed of training.
  • an electronic device is provided.
  • the electronic device 5000 shown in FIG. 5 includes a processor 5001 and a memory 5003. Among them, the processor 5001 and the memory 5003 are connected, for example, through a bus 5002.
  • the electronic device 5000 may further include a transceiver 5004. It should be noted that in actual applications, the transceiver 5004 is not limited to one, and the structure of the electronic device 5000 does not constitute a limitation to the embodiment of the present application.
  • the processor 5001 may be a CPU, a general-purpose processor, DSP, ASIC, FPGA, or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It can implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of this application.
  • The processor 5001 may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on.
  • the bus 5002 may include a path for transferring information between the above-mentioned components.
  • the bus 5002 may be a PCI bus or an EISA bus.
  • The bus 5002 can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 5, but this does not mean that there is only one bus or one type of bus.
  • The memory 5003 can be a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
  • the memory 5003 is used to store application program codes for executing the solutions of the present application, and the processor 5001 controls the execution.
  • the processor 5001 is configured to execute the application program code stored in the memory 5003 to implement the content shown in any of the foregoing method embodiments.
  • The electronic equipment includes, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (such as vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • Another embodiment of the present application provides a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when it runs on a computer, the computer can execute the corresponding content in the foregoing method embodiment.
  • In the embodiment of the present application, when the first voice data sent by the sender is received, voice enhancement processing is performed on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data, and the first voice enhancement parameter is determined based on the first voice data; the first voice enhancement parameter is then used to update the acquired voice enhancement parameters to obtain the updated voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters, and the first voice enhancement data is sent to the receiver.
  • In this way, the server can perform voice enhancement processing on the sender's voice data based on the voice enhancement parameters corresponding to that sender. Because different senders correspond to different voice enhancement parameters, the voice enhancement effect obtained for different senders also differs, so the enhancement is targeted. This achieves targeted speech enhancement without requiring multiple models: only the voice enhancement parameters need to be stored, so the storage requirement is low.
  • Further, the server can continuously perform one-way training on the trained LSTM model based on the latest voice data sent by the sender, thereby continuously updating the sender's voice enhancement parameters, so that the parameters match the sender better and better and the voice enhancement effect for the sender keeps improving.
  • Moreover, it is enough to train the speech enhancement parameters; there is no need to train the entire trained LSTM model or a whole layer of it, which reduces the cost and increases the speed of training.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

A voice data processing method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: receiving first voice data sent by a sender, and acquiring corresponding voice enhancement parameters (S101); performing voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and determining a first voice enhancement parameter based on the first voice data (S102); and sending the first voice enhancement data to a receiver, and updating the acquired voice enhancement parameters with the first voice enhancement parameter to obtain updated voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters (S103).

Description

语音数据的处理方法、装置、电子设备及可读存储介质
本申请要求于2019年9月23日提交中国专利局、申请号为201910900060.1、名称为“语音数据的处理方法、装置、电子设备及可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及互联网技术领域,具体而言,本申请涉及一种语音数据的处理方法、装置、电子设备及计算机可读存储介质。
背景
语音增强(Speech Enhancement)的本质是语音降噪,麦克风采集的语音通常是带有不同噪声的语音,语音增强的主要目的就是从带噪声的语音中恢复不带噪声的语音。通过语音增强可以有效抑制各种干扰信号,增强目标语音信号,不仅提高语音可懂度和话音质量,还有助于提高语音识别。
在对待处理的语音进行语音增强时,首先训练生成一个通用的降噪模型,然后针对不同发言人,利用各个发言人对应的语音数据对整个降噪模型或者模型中的某些层进行自适应训练,得到不同发言人分别对应的降噪模型并存储。在实际应用时,针对不同的发言人,获取对应的降噪模型,并采用降噪模型对该发言人的语音数据进行降噪处理。
技术内容
本申请实施例提供了一种语音数据的处理方法,该方法由服务器执行,包括:
接收发送方发送的第一语音数据,并获取相应的语音增强参数;
基于获取到的语音增强参数对所述第一语音数据进行语音增强处理,以得到第一语音增强数据,并基于所述第一语音数据确定第一语音增强参数;
将所述第一语音增强数据发送至接收方,并采用所述第一语音增强参数对获取到的语音增强参数进行更新,得到更新后的语音增强参数,以用于当接收到发送方发送的第二语音数据,基于所述更新后的语音增强参数对所述第二语音数据 进行语音增强处理。
本申请实施例提供了一种语音数据的处理的装置,该装置包括:
接收模块,用于接收发送方发送的第一语音数据;
获取模块,用于获取相应的语音增强参数;
处理模块,用于基于获取到的语音增强参数对所述第一语音数据进行语音增强处理,以得到第一语音增强数据,并基于所述第一语音数据确定第一语音增强参数;
更新模块,用于采用所述第一语音增强参数对获取到的语音增强参数进行更新,得到更新后的语音增强参数,以用于当接收到发送方发送的第二语音数据,基于所述更新后的语音增强参数对所述第二语音数据进行语音增强处理;
发送模块,用于将所述第一语音增强数据发送至接收方。
本申请实施例还提供了一种电子设备,该电子设备包括:
处理器、存储器和总线;
所述总线,用于连接所述处理器和所述存储器;
所述存储器,用于存储操作指令;
所述处理器,用于通过调用所述操作指令,可执行指令使处理器执行如本申请上述所示的语音数据的处理方法对应的操作。
本申请实施例还提供了一种计算机可读存储介质,计算机可读存储介质上存储有计算机程序,该程序被处理器执行时实现本申请上述所示的语音数据的处理方法。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对本申请实施例描述中所需要使用的附图作简单地介绍。
图1A为本申请实施例提供的一种语音数据的处理方法所适用的系统架构图;
图1B为本申请一个实施例提供的一种语音数据的处理方法的流程示意图;
图2为本申请中LSTM模型的结构示意图;
图3为本申请中语音特征提取的逻辑步骤示意图;
图4为本申请又一实施例提供的一种语音数据的处理装置的结构示意图;
图5为本申请又一实施例提供的一种语音数据的处理的电子设备的结构示意图。
Description of Embodiments
Embodiments of this application are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary and are only used to explain this application; they should not be construed as limiting this application.
A person skilled in the art will understand that, unless specifically stated otherwise, the singular forms "a", "an", "the", and "said" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the specification of this application refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is described as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include a wireless connection or wireless coupling. The term "and/or" used herein includes all or any unit and all combinations of one or more of the associated listed items.
Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic AI technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include several major directions such as computer vision, speech processing, natural language processing, and machine learning or deep learning.
The key technologies of speech technology include automatic speech recognition (ASR), text-to-speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of human-computer interaction in the future, and speech is regarded as one of the most promising modes of human-computer interaction.
To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application are further described in detail below with reference to the drawings.
As described above, when performing speech enhancement on speech to be processed, the noise reduction model corresponding to each speaker needs to be obtained, and that noise reduction model is used to perform noise reduction processing on the speaker's speech data. This requires storing a noise reduction model for every speaker, resulting in a high storage requirement.
Therefore, the embodiments of this application provide a speech data processing method and apparatus, an electronic device, and a computer-readable storage medium, aiming to solve the above technical problems in the related art.
The technical solutions of this application, and how they solve the above technical problems, are described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of this application are described below with reference to the drawings.
FIG. 1A is a diagram of the system architecture to which the speech processing method provided by an embodiment of this application is applicable. Referring to FIG. 1A, the system architecture includes a server 11, a network 12, and terminal devices 13 and 14, where the server 11 establishes connections with the terminal devices 13 and 14 through the network 12.
In some examples of this application, the server 11 is a backend server that processes speech data received from a sender. The server 11 provides services to users together with the terminal devices 13 and 14. For example, after the server 11 processes the speech data sent by the terminal device 13 corresponding to the sender (which may also be the terminal device 14), it sends the resulting enhanced speech data to the terminal device 14 corresponding to the receiver (which may also be the terminal device 13) so that it can be provided to the user. The server 11 may be a standalone server or a cluster of multiple servers.
The network 12 may include a wired network and a wireless network. As shown in FIG. 1A, on the access network side, the terminal devices 13 and 14 may access the network 12 in a wireless or wired manner; on the core network side, the server 11 is generally connected to the network 12 in a wired manner. Of course, the server 11 may also be connected to the network 12 wirelessly.
The terminal devices 13 and 14 may be intelligent devices with data computing and processing capabilities, for example capable of playing the processed enhanced speech data provided by the server. The terminal devices 13 and 14 include, but are not limited to, smartphones (with a communication module installed), palmtop computers, tablet computers, and the like. Operating systems are installed on the terminal devices 13 and 14, including but not limited to the Android operating system, the Symbian operating system, the Windows Mobile operating system, and Apple's iPhone OS operating system.
Based on the system architecture shown in FIG. 1A, an embodiment of this application provides a speech data processing method, which is performed by the server 11 in FIG. 1A. As shown in FIG. 1B, the method includes:
Step S101: when first speech data sent by a sender is received, obtain corresponding speech enhancement parameters.
In some embodiments, in the process of obtaining the corresponding speech enhancement parameters, pre-stored speech enhancement parameters corresponding to the sender are obtained; if no speech enhancement parameters corresponding to the sender are obtained, preset speech enhancement parameters are obtained.
In actual application, the embodiments of this application can be applied to network-based voice communication scenarios such as teleconferencing and video conferencing. The sender may be the party that sends speech data. For example, if user A speaks through the terminal device 13, the terminal device 13 may be the sender, and the content of user A's utterance may be the first speech data. The first speech data is transmitted to the server over the network; after receiving the first speech data, the server can obtain the corresponding speech enhancement parameters and then perform speech enhancement processing on the first speech data. An LSTM (Long Short-Term Memory) model may run on the server, and this model may be used to perform speech enhancement processing on speech data.
Step S102: perform speech enhancement processing on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data, and determine first speech enhancement parameters based on the first speech data.
In some embodiments, if no speech enhancement parameters corresponding to the sender are obtained, speech enhancement processing is performed on the first speech data based on the preset speech enhancement parameters to obtain the first enhanced speech data.
In some embodiments, if speech enhancement parameters corresponding to the sender are obtained, speech enhancement processing is performed on the first speech data based on the speech enhancement parameters corresponding to the sender to obtain the first enhanced speech data.
In actual application, if no speech enhancement parameters corresponding to the sender are obtained, speech enhancement processing is performed on the first speech data based on the preset speech enhancement parameters; if speech enhancement parameters corresponding to the sender are obtained, speech enhancement processing is performed on the first speech data based on the speech enhancement parameters corresponding to the sender.
In some embodiments, if no speech enhancement parameters corresponding to the sender are obtained, performing speech enhancement processing on the first speech data based on the preset speech enhancement parameters to obtain the first enhanced speech data, and determining the first speech enhancement parameters based on the first speech data, include: performing feature sequence processing on the first speech data through a trained speech enhancement model to obtain a first speech feature sequence, the speech enhancement model being provided with the preset speech enhancement parameters; performing batch computation on the first speech feature sequence using the preset speech enhancement parameters to obtain a processed first speech feature sequence and the first speech enhancement parameters; and performing inverse feature transformation on the processed first speech feature sequence to obtain the first enhanced speech data.
In some embodiments, if speech enhancement parameters corresponding to the sender are obtained, performing speech enhancement processing on the first speech data based on the speech enhancement parameters corresponding to the sender to obtain the first enhanced speech data, and determining the first speech enhancement parameters based on the first speech data, include: performing feature sequence processing on the first speech data through a trained speech enhancement model to obtain a second speech feature sequence; performing batch computation on the second speech feature sequence using the speech enhancement parameters corresponding to the sender to obtain a processed second speech feature sequence and second speech enhancement parameters; and performing inverse feature transformation on the processed second speech feature sequence to obtain processed second enhanced speech data, and using the processed second enhanced speech data as the first enhanced speech data.
Step S103: send the first enhanced speech data to a receiver, and update the obtained speech enhancement parameters with the first speech enhancement parameters to obtain updated speech enhancement parameters, so that when second speech data sent by the sender is received, speech enhancement processing is performed on the second speech data based on the updated speech enhancement parameters.
In some embodiments, if no speech enhancement parameters corresponding to the sender are obtained, the obtained preset speech enhancement parameters are updated based on the first speech enhancement parameters to obtain the updated speech enhancement parameters, and the first speech enhancement parameters are used as the speech enhancement parameters corresponding to the sender.
In some embodiments, if speech enhancement parameters corresponding to the sender are obtained, the speech enhancement parameters corresponding to the sender are updated with the first speech enhancement parameters to obtain the updated speech enhancement parameters.
Specifically, after the first speech enhancement parameters are determined based on the first speech data, if the storage container holds no speech enhancement parameters corresponding to the sender, the first speech enhancement parameters can be used as the speech enhancement parameters corresponding to the sender and saved in the storage container; if the storage container already holds speech enhancement parameters corresponding to the sender, the first speech enhancement parameters can replace the saved speech enhancement parameters. At the same time, the server sends the first enhanced speech data obtained through speech enhancement processing to the receiver, and the receiver simply plays the first enhanced speech data after receiving it.
In some embodiments, the trained speech enhancement model is generated as follows: obtaining first speech sample data containing noise, and performing speech feature extraction on the first speech sample data to obtain a first speech feature sequence; obtaining second speech sample data not containing noise, and performing speech feature extraction on the second speech sample data to obtain a second speech feature sequence; and training a preset speech enhancement model using the first speech feature sequence to obtain the first speech feature sequence output by the speech enhancement model under training, and computing the similarity between the first speech feature sequence obtained by training the speech enhancement model and the second speech feature sequence, until that similarity exceeds a preset similarity threshold, thereby obtaining the trained speech enhancement model.
In some embodiments, the way of extracting a speech feature sequence includes: performing speech framing and windowing on speech sample data to obtain at least two speech frames of the speech sample data; performing fast Fourier transform on each speech frame to obtain the discrete power spectrum corresponding to each speech frame; and performing logarithmic computation on each discrete power spectrum to obtain the log power spectrum corresponding to each speech frame, and using the log power spectra as the speech feature sequence of the speech sample data.
In the embodiments of this application, when first speech data sent by a sender is received, corresponding speech enhancement parameters are obtained; speech enhancement processing is then performed on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data, and first speech enhancement parameters are determined based on the first speech data; the obtained speech enhancement parameters are then updated with the first speech enhancement parameters to obtain updated speech enhancement parameters, so that when second speech data sent by the sender is received, speech enhancement processing is performed on the second speech data based on the updated speech enhancement parameters, and the first enhanced speech data is sent to the receiver. In this way, the server can perform speech enhancement processing on the sender's speech data based on the speech enhancement parameters corresponding to that sender. Because different senders correspond to different speech enhancement parameters, the speech enhancement effects obtained for different senders also differ. Thus, without requiring multiple models, speech enhancement remains speaker-specific, and only the speech enhancement parameters need to be stored rather than multiple models, so the storage requirement is low.
The speech data processing method shown in FIG. 1B is described in detail below.
Step S101: when first speech data sent by a sender is received, obtain corresponding speech enhancement parameters.
In actual application, the embodiments of this application can be applied to network-based voice communication scenarios such as teleconferencing and video conferencing. The sender may be the party that sends speech data. For example, if user A speaks through the terminal device 13, the terminal device 13 may be the sender, and the content of user A's utterance may be the first speech data. The first speech data is transmitted to the server over the network; after receiving the first speech data, the server can obtain the corresponding speech enhancement parameters and then perform speech enhancement processing on the first speech data.
An LSTM (Long Short-Term Memory) model may run on the server, and this model may be used to perform speech enhancement processing on speech data.
The essence of speech enhancement is speech noise reduction. Speech picked up by a microphone usually carries various kinds of noise, and the main purpose of speech enhancement is to recover noise-free speech from noisy speech. Speech enhancement can effectively suppress various interfering signals and enhance the target speech signal; it not only improves speech intelligibility and speech quality but also helps improve speech recognition.
The basic structure of the LSTM model may be as shown in FIG. 2, including a front-end LSTM layer, a batch processing layer, and a back-end LSTM layer, where X is each frame of speech in the speech data and t is the time window.
A speech frame is a short segment of the speech signal. Specifically, a speech signal is non-stationary at the macro level but stationary at the micro level and exhibits short-time stationarity (within 10-30 ms the speech signal can be considered approximately unchanged), so the speech signal can be divided into short segments for processing, each of which is called a frame. For example, in a 1 s segment of speech where each frame is 10 ms long, the segment contains 100 frames.
When the LSTM model processes speech data, the front-end LSTM layer, the batch processing layer, and the back-end LSTM layer compute on speech frames of different time windows simultaneously, where the batch processing layer is used to compute the speech enhancement parameters corresponding to the speech data, such as the mean and variance.
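As an illustration only, the following is a minimal PyTorch sketch of such a three-stage network; the layer sizes, the use of nn.BatchNorm1d as the batch processing layer, and all identifiers are assumptions of this description rather than the patent's disclosed implementation:

```python
import torch
import torch.nn as nn

class SpeechEnhancementLSTM(nn.Module):
    """Front-end LSTM -> batch processing (normalization) layer -> back-end LSTM.

    A minimal sketch of the structure described for FIG. 2; all sizes are
    hypothetical. The running mean/variance held by the batch layer play the
    role of the per-sender speech enhancement parameters.
    """
    def __init__(self, feat_dim: int = 257, hidden: int = 256):
        super().__init__()
        self.front = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.batch = nn.BatchNorm1d(hidden)   # holds mean/variance parameters
        self.back = nn.LSTM(hidden, feat_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) log-power-spectrum features
        h, _ = self.front(x)
        h = self.batch(h.transpose(1, 2)).transpose(1, 2)  # normalize per feature
        y, _ = self.back(h)
        return y  # enhanced feature sequence
```

In this reading, storing and swapping a sender's enhancement parameters amounts to storing and swapping the statistics of the batch processing layer, while both LSTM layers stay shared across all senders.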
Further, in the embodiments of this application, the terminal devices 13 and 14 may also have the following characteristics:
(1) In terms of hardware, the device has a central processing unit, memory, input components, and output components; that is, the device is often a microcomputer device with communication capability. It may also support multiple input modes, such as a keyboard, mouse, touch screen, microphone, and camera, and the input can be adjusted as needed. The device also often has multiple output modes, such as a receiver and a display screen, which can likewise be adjusted as needed;
(2) In terms of software, the device must have an operating system, such as Windows Mobile, Symbian, Palm, Android, or iOS. These operating systems are becoming more and more open, and personalized applications developed on these open operating system platforms emerge in an endless stream, such as address books, calendars, notepads, calculators, and various games, greatly satisfying the needs of personalized users;
(3) In terms of communication capability, the device has flexible access modes and high-bandwidth communication performance, and can automatically adjust the selected communication mode according to the selected service and the environment, thereby facilitating user use. The device can support GSM (Global System for Mobile Communication), WCDMA (Wideband Code Division Multiple Access), CDMA2000 (Code Division Multiple Access), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), Wi-Fi (Wireless Fidelity), and WiMAX (Worldwide Interoperability for Microwave Access), thus adapting to networks of multiple standards and supporting not only voice services but also a variety of wireless data services;
(4) In terms of functional use, the device focuses more on humanization, personalization, and multifunctionality. With the development of computer technology, devices have moved from a "device-centered" mode to a "human-centered" mode, integrating embedded computing, control technology, artificial intelligence technology, and biometric authentication technology, fully embodying the people-oriented purpose. Thanks to the development of software technology, the device can adjust its settings according to individual needs and become more personalized. At the same time, the device integrates numerous software and hardware components, and its functions are becoming more and more powerful.
In a preferred embodiment of this application, the obtaining corresponding speech enhancement parameters includes:
obtaining speech enhancement parameters corresponding to the sender;
if no speech enhancement parameters corresponding to the sender are obtained, obtaining preset speech enhancement parameters.
Specifically, after receiving the first speech data, the server may use the trained LSTM model to perform speech enhancement processing on it. The trained LSTM model is a general-purpose model with preset speech enhancement parameters, namely the speech enhancement parameters inside the trained LSTM model, and it can perform speech enhancement processing on the speech data of any user.
In the embodiments of this application, in order to provide targeted speech enhancement for different users, a user's speech data can be used to train the trained LSTM model to obtain that user's speech enhancement parameters. In this way, when speech enhancement processing is performed on that user's speech data, that user's own speech enhancement parameters can be used.
For example, user A's speech data is used to train the trained LSTM model to obtain user A's speech enhancement parameters. When performing speech enhancement processing on user A's subsequent speech data, the trained LSTM model can then use user A's speech enhancement parameters.
Therefore, when the server receives a user's first speech data, it may first obtain that user's speech enhancement parameters. In the embodiments of this application, the speech enhancement parameters corresponding to each user may be stored in the server's storage container or in the storage container of another device; the embodiments of this application place no limitation on this.
If the server does not obtain the user's speech enhancement parameters, this means the server has received that user's speech data for the first time, in which case it is sufficient to obtain the preset speech enhancement parameters.
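A minimal sketch of this lookup-with-fallback step, assuming a simple in-memory mapping (the identifiers param_store, DEFAULT_PARAMS, and get_enhancement_params are hypothetical; the patent does not prescribe a data structure):

```python
from typing import Dict, Tuple

# Hypothetical store: sender id -> (mean, variance) enhancement parameters.
param_store: Dict[str, Tuple[float, float]] = {}

DEFAULT_PARAMS = (0.0, 1.0)  # preset parameters of the general trained model

def get_enhancement_params(sender_id: str) -> Tuple[float, float]:
    """Return the sender's stored parameters, or the preset ones on first contact."""
    return param_store.get(sender_id, DEFAULT_PARAMS)
```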
Step S102: perform speech enhancement processing on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data, and determine first speech enhancement parameters based on the first speech data.
In actual application, if no speech enhancement parameters corresponding to the sender are obtained, speech enhancement processing is performed on the first speech data based on the preset speech enhancement parameters; if speech enhancement parameters corresponding to the sender are obtained, speech enhancement processing is performed on the first speech data based on the speech enhancement parameters corresponding to the sender.
In a preferred embodiment of this application, if no speech enhancement parameters corresponding to the sender are obtained, the step of performing speech enhancement processing on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data and determining first speech enhancement parameters based on the first speech data includes:
performing feature sequence processing on the first speech data through the trained speech enhancement model to obtain a first speech feature sequence, the speech enhancement model being provided with the preset speech enhancement parameters;
performing batch computation on the first speech feature sequence using the preset speech enhancement parameters to obtain a processed first speech feature sequence;
performing inverse feature transformation on the processed first speech feature sequence to obtain the first enhanced speech data, and determining the first speech enhancement parameters based on the first speech data.
Specifically, if no speech enhancement parameters corresponding to the sender are obtained, the first speech data can be input into the trained LSTM model, which performs feature sequence processing on it to obtain the first speech feature sequence corresponding to the first speech data, where the first speech feature sequence includes at least two speech features. Batch computation is then performed on the first speech feature sequence using the preset speech enhancement parameters to obtain the processed first speech feature sequence, and inverse feature transformation is performed on the processed first speech feature sequence to obtain the first enhanced speech data; that is, the trained LSTM model (the general-purpose model) is used to perform speech enhancement processing on the first speech data. The batch computation may use the following formulas (1) and (2):

$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$  (1)

$y_i = \gamma \hat{x}_i + \beta$  (2)

where $\mu_B$ is the mean in the speech enhancement parameters, $\sigma_B^2$ is the variance in the speech enhancement parameters, $x_i$ is the input speech feature, $y_i$ is the output enhanced speech feature, and $\varepsilon$, $\gamma$, $\beta$ are variable parameters.
In addition, the first speech data is used to train the trained LSTM model to obtain the first speech enhancement parameters, namely the speech enhancement parameters corresponding to the sender, which are then stored. Training the trained LSTM model may use the following formulas (3) and (4):

$\mu_B = \dfrac{1}{m}\sum_{i=1}^{m} x_i$  (3)

$\sigma_B^2 = \dfrac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$  (4)

where $\mu_B$ is the mean in the speech enhancement parameters, $\sigma_B^2$ is the variance in the speech enhancement parameters, $x_i$ is the input speech feature, and $m$ is the number of speech features.
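For illustration, a NumPy sketch of the batch computation in formulas (1) and (2) and the parameter estimation in formulas (3) and (4); the array shapes and the fixed values of gamma, beta, and eps are assumptions of this description:

```python
import numpy as np

def batch_enhance(x, mu_b, var_b, gamma=1.0, beta=0.0, eps=1e-5):
    """Formulas (1)-(2): normalize features x with the stored mean/variance,
    then scale and shift to produce the enhanced features y."""
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)
    return gamma * x_hat + beta

def estimate_params(x):
    """Formulas (3)-(4): per-feature mean and variance over the m input
    speech features; these become the sender's enhancement parameters."""
    mu_b = x.mean(axis=0)
    var_b = ((x - mu_b) ** 2).mean(axis=0)
    return mu_b, var_b

# Usage: features is an (m, feat_dim) array of log power spectra.
features = np.random.randn(100, 257)
mu, var = estimate_params(features)
enhanced = batch_enhance(features, mu, var)
```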
It should be noted that performing speech enhancement processing on the first speech data based on the obtained speech enhancement parameters and determining the first speech enhancement parameters based on the first speech data may be executed one after the other or in parallel; in actual application the order can be adjusted according to actual requirements, and the embodiments of this application place no limitation on the execution order.
In a preferred embodiment of this application, if speech enhancement parameters corresponding to the sender are obtained, the step of performing speech enhancement processing on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data and determining first speech enhancement parameters based on the first speech data includes:
performing feature sequence processing on the first speech data through the trained speech enhancement model to obtain a second speech feature sequence;
performing batch computation on the second speech feature sequence using the speech enhancement parameters to obtain a processed second speech feature sequence;
performing inverse feature transformation on the processed second speech feature sequence to obtain processed second enhanced speech data, using the processed second enhanced speech data as the first enhanced speech data, and determining second speech enhancement parameters based on the first speech data.
Specifically, if the speech enhancement parameters corresponding to the sender are obtained, the first speech data can be input into the trained LSTM model, which performs feature sequence processing on it to obtain the second speech feature sequence corresponding to the first speech data, where the second speech feature sequence includes at least two speech features. Batch computation is then performed on the second speech feature sequence using the speech enhancement parameters corresponding to the sender to obtain the processed second speech feature sequence, and inverse feature transformation is performed on the processed second speech feature sequence to obtain the second enhanced speech data. That is, the speech enhancement parameters in the trained LSTM model are replaced with the speech enhancement parameters corresponding to the sender, and the updated LSTM model is then used to perform speech enhancement processing on the first speech data. The batch computation may likewise use formulas (1) and (2), which are not repeated here.
In addition, the first speech data is used to train the updated LSTM model to obtain the second speech enhancement parameters, namely the latest speech enhancement parameters corresponding to the sender, which are then stored. Training the updated LSTM model may likewise use formulas (3) and (4), which are not repeated here.
In a preferred embodiment of this application, the trained speech enhancement model is generated as follows:
obtaining first speech sample data containing noise, and performing speech feature extraction on the first speech sample data to obtain a first speech feature;
obtaining second speech sample data not containing noise, and performing speech feature extraction on the second speech sample data to obtain a second speech feature;
training a preset speech enhancement model using the first speech feature to obtain a trained first speech feature;
computing the similarity between the trained first speech feature and the second speech feature, until the similarity between the trained first speech feature and the second speech feature exceeds a preset similarity threshold, thereby obtaining the trained speech enhancement model.
Specifically, first speech sample data containing noise is obtained and speech feature extraction is performed on it to obtain a first speech feature a; second speech sample data not containing noise is obtained and speech feature extraction is performed on it to obtain a second speech feature b. The speech feature a is then input into the original LSTM model with the speech feature b as the training target, and unidirectional training is performed on the original LSTM model; that is, all parameters in the LSTM model, including the speech enhancement parameters, are adjusted unidirectionally to obtain a trained first speech feature a'. The similarity between the trained first speech feature a' and the second speech feature b is then computed, until the similarity between a' and b exceeds the preset similarity threshold, at which point the trained LSTM model is obtained.
The similarity may be computed with similarity measures such as cosine similarity or the Pearson correlation coefficient, or with distance measures such as the Euclidean distance or the Manhattan distance. Other computation methods may of course also be used, and the specific method can be chosen according to actual requirements; the embodiments of this application place no limitation on this.
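A sketch of this training loop, assuming PyTorch with cosine similarity as the stopping criterion; the optimizer, learning rate, threshold, and step limit are hypothetical choices, not values given by the patent:

```python
import torch
import torch.nn.functional as F

def train_until_similar(model, noisy_feats, clean_feats,
                        threshold=0.95, lr=1e-3, max_steps=10_000):
    """Adjust all model parameters until the cosine similarity between the
    model's output a' and the clean target b exceeds the threshold."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_steps):
        out = model(noisy_feats)                       # trained feature a'
        sim = F.cosine_similarity(out.flatten(), clean_feats.flatten(), dim=0)
        if sim.item() > threshold:
            break                                      # similarity criterion met
        loss = 1.0 - sim                               # push a' toward b
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```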
In a preferred embodiment of this application, the way of extracting speech features includes:
performing speech framing and windowing on speech sample data to obtain at least two speech frames of the speech sample data;
performing fast Fourier transform on each speech frame to obtain the discrete power spectrum corresponding to each speech frame;
performing logarithmic computation on each discrete power spectrum to obtain the log power spectrum corresponding to each speech frame, and using the log power spectra as the speech features of the speech sample data.
Specifically, the speech sample data is a speech signal, and a speech signal is a time-domain signal that the processor cannot process directly. The speech sample data therefore needs to be framed and windowed to obtain at least two speech frames, so that the time-domain signal can be converted into a frequency-domain signal the processor can process, as shown in FIG. 3. An FFT (Fast Fourier Transform) is then performed on each speech frame to obtain the discrete power spectrum corresponding to each frame, and logarithmic computation is performed on each discrete power spectrum to obtain the log power spectrum corresponding to each frame, which yields the speech feature corresponding to each frame; the set of all speech features is the speech feature sequence corresponding to the speech sample data. Performing inverse feature transformation on a speech feature sequence converts the frequency-domain feature sequence back into a time-domain speech signal.
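A minimal NumPy sketch of this extraction pipeline; the frame length, hop size, and Hann window are hypothetical choices, since the text only fixes the framing/windowing, FFT, and log power spectrum sequence:

```python
import numpy as np

def extract_log_power_features(signal, sample_rate=16000,
                               frame_ms=25, hop_ms=10):
    """Frame and window the signal, take the FFT of each frame, and return
    the per-frame log power spectra as the speech feature sequence."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)

    features = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window   # framing + windowing
        spectrum = np.fft.rfft(frame)                      # fast Fourier transform
        power = np.abs(spectrum) ** 2                      # discrete power spectrum
        features.append(np.log(power + 1e-10))            # log power spectrum
    return np.stack(features)
```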
It should be noted that feature extraction is performed on the first speech sample data in the same way as on the second speech sample data, so for convenience of description the embodiments of this application refer to them collectively as speech sample data.
Step S103: send the first enhanced speech data to a receiver, and update the obtained speech enhancement parameters with the first speech enhancement parameters to obtain updated speech enhancement parameters, so that when second speech data sent by the sender is received, speech enhancement processing is performed on the second speech data based on the updated speech enhancement parameters.
Usually, obtaining a noise reduction model corresponding to a speaker through training requires adaptive training, and adaptive training requires a large amount of data, so it takes a long time and is inefficient.
In the embodiments of this application, by contrast, it suffices to update the obtained speech enhancement parameters with the first speech enhancement parameters to obtain the updated speech enhancement parameters, so no adaptive training is needed.
Specifically, after the first speech enhancement parameters are determined based on the first speech data, if the storage container holds no speech enhancement parameters corresponding to the sender, the first speech enhancement parameters can be used as the speech enhancement parameters corresponding to the sender and saved in the storage container; if the storage container already holds speech enhancement parameters corresponding to the sender, the first speech enhancement parameters can replace the saved ones.
When second speech data sent by the sender is received, speech enhancement processing can be performed on it based on the first speech enhancement parameters, namely the updated speech enhancement parameters. In this way, the server can continuously perform unidirectional training on the trained LSTM model based on the latest speech data sent by the sender, thereby continuously updating the speech enhancement parameters corresponding to the sender, so that the speech enhancement parameters match the sender better and better and the speech enhancement effect for the sender keeps improving.
At the same time, the server sends the first enhanced speech data obtained through speech enhancement processing to the receiver, and the receiver simply plays it after receiving it.
It should be noted that the server's updating of the speech enhancement parameters and its sending of the enhanced speech data may be executed one after the other or in parallel; in actual application this can be set according to actual requirements, and the embodiments of this application place no limitation on it.
For ease of understanding, the embodiments of this application are explained in detail through the following example.
Suppose users A, B, and C hold a teleconference, the trained LSTM model is running on the server with general-purpose speech enhancement parameters, and neither the server's storage container nor any other storage container holds user A's speech enhancement parameters.
In this case, after user A finishes the first utterance, user A's terminal device sends the first utterance to the server. After receiving it, the server looks up the speech enhancement parameters corresponding to user A. Because neither the server's storage container nor any other storage container holds user A's speech enhancement parameters, they cannot be obtained, so the general-purpose speech enhancement parameters of the trained LSTM model are obtained and used to perform speech enhancement processing on the first utterance. The enhanced first utterance is sent to the terminal devices of users B and C, and at the same time the first utterance is used to perform unidirectional training on the trained LSTM model to obtain user A's first speech enhancement parameters, which are stored.
After user A finishes the second utterance, the terminal device sends it to the server. After receiving user A's second utterance, the server looks up the speech enhancement parameters corresponding to user A. This time the lookup succeeds: user A's first speech enhancement parameters are obtained and replace the general-purpose speech enhancement parameters in the trained LSTM model, yielding the updated LSTM model. The updated LSTM model is then used to perform speech enhancement processing on the second utterance; the enhanced second utterance is sent to the terminal devices of users B and C, and at the same time the second utterance is used to perform unidirectional training on the updated LSTM model to obtain user A's second speech enhancement parameters, which replace the first speech enhancement parameters. The speech enhancement processing of subsequent utterances proceeds by analogy and is not repeated here.
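Tying the pieces together, the per-utterance server flow in this example can be sketched as follows, reusing the hypothetical helpers sketched earlier (get_enhancement_params, batch_enhance, estimate_params); this is a composition for illustration, not the patent's code:

```python
def handle_utterance(sender_id, features):
    """One server round for the teleconference example: look up the sender's
    parameters (falling back to the preset ones on the first utterance),
    enhance, then store fresh parameters for the next utterance."""
    mu, var = get_enhancement_params(sender_id)          # S101: lookup or preset
    enhanced = batch_enhance(features, mu, var)          # S102: formulas (1)-(2)
    param_store[sender_id] = estimate_params(features)   # S103: formulas (3)-(4)
    return enhanced                                      # forwarded to receivers
```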
In the embodiments of this application, when first speech data sent by a sender is received, corresponding speech enhancement parameters are obtained; speech enhancement processing is then performed on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data, and first speech enhancement parameters are determined based on the first speech data; the obtained speech enhancement parameters are then updated with the first speech enhancement parameters to obtain updated speech enhancement parameters, so that when second speech data sent by the sender is received, speech enhancement processing is performed on the second speech data based on the updated speech enhancement parameters, and the first enhanced speech data is sent to the receiver. In this way, the server can perform speech enhancement processing on the sender's speech data based on the speech enhancement parameters corresponding to that sender. Because different senders correspond to different speech enhancement parameters, the speech enhancement effects obtained for different senders also differ. Thus, without requiring multiple models, speech enhancement remains speaker-specific, and only the speech enhancement parameters need to be stored rather than multiple models, so the storage requirement is low.
Further, the server can also continuously perform unidirectional training on the trained LSTM model based on the latest speech data sent by the sender, thereby continuously updating the speech enhancement parameters corresponding to the sender, so that the speech enhancement parameters match the sender better and better and the speech enhancement effect for the sender keeps improving. Meanwhile, during this continuous unidirectional training, only the speech enhancement parameters need to be trained; it is not necessary to train the entire trained LSTM model or an entire layer of the model, which reduces the cost and increases the speed of training.
FIG. 4 is a schematic structural diagram of a speech data processing apparatus provided by another embodiment of this application. As shown in FIG. 4, the apparatus of this embodiment may include:
a receiving module 401, configured to receive first speech data sent by a sender;
an obtaining module 402, configured to obtain corresponding speech enhancement parameters;
a processing module 403, configured to perform speech enhancement processing on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data, and determine first speech enhancement parameters based on the first speech data;
an updating module 404, configured to update the obtained speech enhancement parameters with the first speech enhancement parameters to obtain updated speech enhancement parameters, so that when second speech data sent by the sender is received, speech enhancement processing is performed on the second speech data based on the updated speech enhancement parameters;
a sending module 405, configured to send the first enhanced speech data to a receiver.
In a preferred embodiment of this application, the obtaining module is specifically configured to:
obtain pre-stored speech enhancement parameters corresponding to the sender; and if no speech enhancement parameters corresponding to the sender are obtained, obtain preset speech enhancement parameters.
In some embodiments, if no speech enhancement parameters corresponding to the sender are obtained, the updating module is further configured to update the obtained preset speech enhancement parameters based on the first speech enhancement parameters to obtain updated speech enhancement parameters, and use the first speech enhancement parameters as the speech enhancement parameters corresponding to the sender.
In some embodiments, if speech enhancement parameters corresponding to the sender are obtained, the updating module is further configured to update the speech enhancement parameters corresponding to the sender with the first speech enhancement parameters to obtain updated speech enhancement parameters.
In some embodiments, if no speech enhancement parameters corresponding to the sender are obtained, the processing module is further configured to perform speech enhancement processing on the first speech data based on the preset speech enhancement parameters to obtain first enhanced speech data.
In some embodiments of this application, the processing module includes a feature sequence processing submodule, a batch computation submodule, and an inverse feature transformation submodule;
if no speech enhancement parameters corresponding to the sender are obtained, the feature sequence processing submodule is configured to perform feature sequence processing on the first speech data through the trained speech enhancement model to obtain a first speech feature sequence, the speech enhancement model being provided with the preset speech enhancement parameters;
the batch computation submodule is configured to perform batch computation on the first speech feature sequence using the preset speech enhancement parameters to obtain a processed first speech feature sequence and the first speech enhancement parameters;
the inverse feature transformation submodule is configured to perform inverse feature transformation on the processed first speech feature sequence to obtain the first enhanced speech data.
In some embodiments, if speech enhancement parameters corresponding to the sender are obtained, the processing module is further configured to perform speech enhancement processing on the first speech data based on the speech enhancement parameters corresponding to the sender to obtain first enhanced speech data.
In some embodiments of this application, the processing module includes a feature sequence processing submodule, a batch computation submodule, and an inverse feature transformation submodule;
if speech enhancement parameters corresponding to the sender are obtained, the feature sequence processing submodule is further configured to perform feature sequence processing on the first speech data through the trained speech enhancement model to obtain a second speech feature sequence;
the batch computation submodule is further configured to perform batch computation on the second speech feature sequence using the speech enhancement parameters to obtain a processed second speech feature sequence and second speech enhancement parameters;
the inverse feature transformation submodule is further configured to perform inverse feature transformation on the processed second speech feature sequence to obtain processed second enhanced speech data, and use the processed second enhanced speech data as the first enhanced speech data.
In a preferred embodiment of this application, the trained speech enhancement model is generated as follows:
obtaining first speech sample data containing noise, and performing speech feature extraction on the first speech sample data to obtain a first speech feature sequence;
obtaining second speech sample data not containing noise, and performing speech feature extraction on the second speech sample data to obtain a second speech feature sequence;
training a preset speech enhancement model using the first speech feature sequence to obtain the first speech feature sequence output by the trained speech enhancement model;
computing the similarity between the first speech feature sequence obtained by training the speech enhancement model and the second speech feature sequence, until that similarity exceeds a preset similarity threshold, thereby obtaining the trained speech enhancement model.
In some preferred embodiments of this application, the way of extracting a speech feature sequence includes:
performing speech framing and windowing on speech sample data to obtain at least two speech frames of the speech sample data;
performing fast Fourier transform on each speech frame to obtain the discrete power spectrum corresponding to each speech frame;
performing logarithmic computation on each discrete power spectrum to obtain the log power spectrum corresponding to each speech frame, and using the log power spectra as the speech feature sequence of the speech sample data.
The speech data processing apparatus of this embodiment can perform the speech data processing method shown in the first embodiment of this application; their implementation principles are similar and are not repeated here.
In the embodiments of this application, when first speech data sent by a sender is received, corresponding speech enhancement parameters are obtained; speech enhancement processing is then performed on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data, and first speech enhancement parameters are determined based on the first speech data; the obtained speech enhancement parameters are then updated with the first speech enhancement parameters to obtain updated speech enhancement parameters, so that when second speech data sent by the sender is received, speech enhancement processing is performed on the second speech data based on the updated speech enhancement parameters, and the first enhanced speech data is sent to the receiver. In this way, the server can perform speech enhancement processing on the sender's speech data based on the speech enhancement parameters corresponding to that sender. Because different senders correspond to different speech enhancement parameters, the speech enhancement effects obtained for different senders also differ. Thus, without requiring multiple models, speech enhancement remains speaker-specific, and only the speech enhancement parameters need to be stored rather than multiple models, so the storage requirement is low.
Further, the server can also continuously perform unidirectional training on the trained LSTM model based on the latest speech data sent by the sender, thereby continuously updating the speech enhancement parameters corresponding to the sender, so that the speech enhancement parameters match the sender better and better and the speech enhancement effect for the sender keeps improving. Meanwhile, during this continuous unidirectional training, only the speech enhancement parameters need to be trained; it is not necessary to train the entire trained LSTM model or an entire layer of the model, which reduces the cost and increases the speed of training.
Yet another embodiment of this application provides an electronic device, including a memory and a processor, and at least one program stored in the memory which, when executed by the processor, implements the following: in the embodiments of this application, when first speech data sent by a sender is received, corresponding speech enhancement parameters are obtained; speech enhancement processing is then performed on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data, and first speech enhancement parameters are determined based on the first speech data; the obtained speech enhancement parameters are then updated with the first speech enhancement parameters to obtain updated speech enhancement parameters, so that when second speech data sent by the sender is received, speech enhancement processing is performed on the second speech data based on the updated speech enhancement parameters, and the first enhanced speech data is sent to the receiver. In this way, the server can perform speech enhancement processing on the sender's speech data based on the speech enhancement parameters corresponding to that sender. Because different senders correspond to different speech enhancement parameters, the speech enhancement effects obtained for different senders also differ. Thus, without requiring multiple models, speech enhancement remains speaker-specific, and only the speech enhancement parameters need to be stored rather than multiple models, so the storage requirement is low.
Further, the server can also continuously perform unidirectional training on the trained LSTM model based on the latest speech data sent by the sender, thereby continuously updating the speech enhancement parameters corresponding to the sender, so that the speech enhancement parameters match the sender better and better and the speech enhancement effect for the sender keeps improving. Meanwhile, during this continuous unidirectional training, only the speech enhancement parameters need to be trained; it is not necessary to train the entire trained LSTM model or an entire layer of the model, which reduces the cost and increases the speed of training.
In some embodiments an electronic device is provided, as shown in FIG. 5. The electronic device 5000 shown in FIG. 5 includes a processor 5001 and a memory 5003, where the processor 5001 is connected to the memory 5003, for example through a bus 5002. The electronic device 5000 may further include a transceiver 5004. It should be noted that in actual application the transceiver 5004 is not limited to one, and the structure of the electronic device 5000 does not constitute a limitation on the embodiments of this application.
The processor 5001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA, or another programmable logic device, transistor logic device, hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of this application. The processor 5001 may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 5002 may include a path for transferring information between the above components. The bus 5002 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is drawn in FIG. 5, but this does not mean there is only one bus or one type of bus.
The memory 5003 may be a ROM or another type of static storage device capable of storing static information and instructions, a RAM or another type of dynamic storage device capable of storing information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 5003 is configured to store the application program code for executing the solutions of this application, and execution is controlled by the processor 5001. The processor 5001 is configured to execute the application program code stored in the memory 5003 to implement the contents shown in any of the foregoing method embodiments.
The electronic device includes, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (such as in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
Yet another embodiment of this application provides a computer-readable storage medium storing a computer program that, when run on a computer, enables the computer to perform the corresponding contents of the foregoing method embodiments. In the embodiments of this application, when first speech data sent by a sender is received, corresponding speech enhancement parameters are obtained; speech enhancement processing is then performed on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data, and first speech enhancement parameters are determined based on the first speech data; the obtained speech enhancement parameters are then updated with the first speech enhancement parameters to obtain updated speech enhancement parameters, so that when second speech data sent by the sender is received, speech enhancement processing is performed on the second speech data based on the updated speech enhancement parameters, and the first enhanced speech data is sent to the receiver. In this way, the server can perform speech enhancement processing on the sender's speech data based on the speech enhancement parameters corresponding to that sender. Because different senders correspond to different speech enhancement parameters, the speech enhancement effects obtained for different senders also differ. Thus, without requiring multiple models, speech enhancement remains speaker-specific, and only the speech enhancement parameters need to be stored rather than multiple models, so the storage requirement is low.
Further, the server can also continuously perform unidirectional training on the trained LSTM model based on the latest speech data sent by the sender, thereby continuously updating the speech enhancement parameters corresponding to the sender, so that the speech enhancement parameters match the sender better and better and the speech enhancement effect for the sender keeps improving. Meanwhile, during this continuous unidirectional training, only the speech enhancement parameters need to be trained; it is not necessary to train the entire trained LSTM model or an entire layer of the model, which reduces the cost and increases the speed of training.
It should be understood that although the steps in the flowcharts of the drawings are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The above descriptions are only some implementations of this application. It should be noted that a person of ordinary skill in the art can make several improvements and refinements without departing from the principles of this application, and these improvements and refinements shall also fall within the protection scope of this application.

Claims (20)

  1. A speech data processing method, performed by a server, comprising:
    receiving first speech data sent by a sender, and obtaining corresponding speech enhancement parameters;
    performing speech enhancement processing on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data, and determining first speech enhancement parameters based on the first speech data;
    sending the first enhanced speech data to a receiver, and updating the obtained speech enhancement parameters with the first speech enhancement parameters to obtain updated speech enhancement parameters, so that when second speech data sent by the sender is received, speech enhancement processing is performed on the second speech data based on the updated speech enhancement parameters.
  2. The speech data processing method according to claim 1, wherein the obtaining corresponding speech enhancement parameters comprises:
    obtaining pre-stored speech enhancement parameters corresponding to the sender;
    if no speech enhancement parameters corresponding to the sender are obtained, obtaining preset speech enhancement parameters.
  3. The speech data processing method according to claim 2, wherein if no speech enhancement parameters corresponding to the sender are obtained, the updating the obtained speech enhancement parameters with the first speech enhancement parameters to obtain updated speech enhancement parameters comprises:
    updating the obtained preset speech enhancement parameters based on the first speech enhancement parameters to obtain updated speech enhancement parameters, and using the first speech enhancement parameters as the speech enhancement parameters corresponding to the sender.
  4. The speech data processing method according to claim 2, wherein if speech enhancement parameters corresponding to the sender are obtained, the updating the obtained speech enhancement parameters with the first speech enhancement parameters to obtain updated speech enhancement parameters comprises:
    updating the speech enhancement parameters corresponding to the sender with the first speech enhancement parameters to obtain updated speech enhancement parameters.
  5. The speech data processing method according to claim 2, wherein if no speech enhancement parameters corresponding to the sender are obtained, the performing speech enhancement processing on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data comprises:
    performing speech enhancement processing on the first speech data based on the preset speech enhancement parameters to obtain first enhanced speech data.
  6. The speech data processing method according to claim 5, wherein if no speech enhancement parameters corresponding to the sender are obtained, the performing speech enhancement processing on the first speech data based on the preset speech enhancement parameters to obtain first enhanced speech data and the determining first speech enhancement parameters based on the first speech data comprise:
    performing feature sequence processing on the first speech data through a trained speech enhancement model to obtain a first speech feature sequence, the speech enhancement model being provided with the preset speech enhancement parameters;
    performing batch computation on the first speech feature sequence using the preset speech enhancement parameters to obtain a processed first speech feature sequence and the first speech enhancement parameters;
    performing inverse feature transformation on the processed first speech feature sequence to obtain the first enhanced speech data.
  7. The speech data processing method according to claim 2, wherein if speech enhancement parameters corresponding to the sender are obtained, the performing speech enhancement processing on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data comprises:
    performing speech enhancement processing on the first speech data based on the speech enhancement parameters corresponding to the sender to obtain first enhanced speech data.
  8. The speech data processing method according to claim 7, wherein if speech enhancement parameters corresponding to the sender are obtained, the performing speech enhancement processing on the first speech data based on the speech enhancement parameters corresponding to the sender to obtain first enhanced speech data, and the determining first speech enhancement parameters based on the first speech data, comprise:
    performing feature sequence processing on the first speech data through a trained speech enhancement model to obtain a second speech feature sequence;
    performing batch computation on the second speech feature sequence using the speech enhancement parameters corresponding to the sender to obtain a processed second speech feature sequence and second speech enhancement parameters;
    performing inverse feature transformation on the processed second speech feature sequence to obtain processed second enhanced speech data, and using the processed second enhanced speech data as the first enhanced speech data.
  9. The speech data processing method according to claim 6 or 8, wherein the trained speech enhancement model is generated as follows:
    obtaining first speech sample data containing noise, and performing speech feature extraction on the first speech sample data to obtain a first speech feature sequence;
    obtaining second speech sample data not containing noise, and performing speech feature extraction on the second speech sample data to obtain a second speech feature sequence;
    training a preset speech enhancement model using the first speech feature sequence to obtain the first speech feature sequence output by the trained speech enhancement model, and computing the similarity between the first speech feature sequence obtained by training the speech enhancement model and the second speech feature sequence, until the similarity between the first speech feature sequence obtained by training the speech enhancement model and the second speech feature sequence exceeds a preset similarity threshold, thereby obtaining the trained speech enhancement model.
  10. The speech data processing method according to claim 9, wherein the way of extracting a speech feature sequence comprises:
    performing speech framing and windowing on speech sample data to obtain at least two speech frames of the speech sample data;
    performing fast Fourier transform on each speech frame to obtain the discrete power spectrum corresponding to each speech frame;
    performing logarithmic computation on each discrete power spectrum to obtain the log power spectrum corresponding to each speech frame, and using the log power spectra as the speech feature sequence of the speech sample data.
  11. A speech data processing apparatus, comprising:
    a receiving module, configured to receive first speech data sent by a sender;
    an obtaining module, configured to obtain corresponding speech enhancement parameters;
    a processing module, configured to perform speech enhancement processing on the first speech data based on the obtained speech enhancement parameters to obtain first enhanced speech data, and determine first speech enhancement parameters based on the first speech data;
    an updating module, configured to update the obtained speech enhancement parameters with the first speech enhancement parameters to obtain updated speech enhancement parameters, so that when second speech data sent by the sender is received, speech enhancement processing is performed on the second speech data based on the updated speech enhancement parameters;
    a sending module, configured to send the first enhanced speech data to a receiver.
  12. The apparatus according to claim 11, wherein the obtaining module is further configured to obtain pre-stored speech enhancement parameters corresponding to the sender; and if no speech enhancement parameters corresponding to the sender are obtained, obtain preset speech enhancement parameters.
  13. The apparatus according to claim 12, wherein if no speech enhancement parameters corresponding to the sender are obtained, the updating module is further configured to update the obtained preset speech enhancement parameters based on the first speech enhancement parameters to obtain updated speech enhancement parameters, and use the first speech enhancement parameters as the speech enhancement parameters corresponding to the sender.
  14. The apparatus according to claim 12, wherein if speech enhancement parameters corresponding to the sender are obtained, the updating module is further configured to update the speech enhancement parameters corresponding to the sender with the first speech enhancement parameters to obtain updated speech enhancement parameters.
  15. The apparatus according to claim 12, wherein if no speech enhancement parameters corresponding to the sender are obtained, the processing module is further configured to perform speech enhancement processing on the first speech data based on the preset speech enhancement parameters to obtain first enhanced speech data.
  16. The apparatus according to claim 15, wherein the processing module comprises a feature sequence processing submodule, a batch computation submodule, and an inverse feature transformation submodule;
    if no speech enhancement parameters corresponding to the sender are obtained, the feature sequence processing submodule is configured to perform feature sequence processing on the first speech data through a trained speech enhancement model to obtain a first speech feature sequence, the speech enhancement model being provided with the preset speech enhancement parameters;
    the batch computation submodule is configured to perform batch computation on the first speech feature sequence using the preset speech enhancement parameters to obtain a processed first speech feature sequence and the first speech enhancement parameters;
    the inverse feature transformation submodule is configured to perform inverse feature transformation on the processed first speech feature sequence to obtain the first enhanced speech data.
  17. The apparatus according to claim 12, wherein if speech enhancement parameters corresponding to the sender are obtained, the processing module is further configured to perform speech enhancement processing on the first speech data based on the speech enhancement parameters corresponding to the sender to obtain first enhanced speech data.
  18. The apparatus according to claim 17, wherein the processing module comprises a feature sequence processing submodule, a batch computation submodule, and an inverse feature transformation submodule;
    if speech enhancement parameters corresponding to the sender are obtained, the feature sequence processing submodule is configured to perform feature sequence processing on the first speech data through a trained speech enhancement model to obtain a second speech feature sequence;
    the batch computation submodule is configured to perform batch computation on the second speech feature sequence using the speech enhancement parameters corresponding to the sender to obtain a processed second speech feature sequence and second speech enhancement parameters;
    the inverse feature transformation submodule is configured to perform inverse feature transformation on the processed second speech feature sequence to obtain processed second enhanced speech data, and use the processed second enhanced speech data as the first enhanced speech data.
  19. An electronic device, comprising:
    a processor, a memory, and a bus;
    the bus being configured to connect the processor and the memory;
    the memory being configured to store operation instructions;
    the processor being configured to, by invoking the operation instructions, perform the speech data processing method according to any one of claims 1 to 10.
  20. A computer-readable storage medium storing computer instructions that, when run on a computer, enable the computer to perform the speech data processing method according to any one of claims 1 to 10.
PCT/CN2020/105034 2019-09-23 2020-07-28 语音数据的处理方法、装置、电子设备及可读存储介质 WO2021057239A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2021558880A JP7301154B2 (ja) 2019-09-23 2020-07-28 音声データの処理方法並びにその、装置、電子機器及びコンピュータプログラム
EP20868291.4A EP3920183A4 (en) 2019-09-23 2020-07-28 SPEECH DATA PROCESSING METHOD AND DEVICE, ELECTRONIC DEVICE AND READABLE STORAGE MEDIUM
US17/447,536 US20220013133A1 (en) 2019-09-23 2021-09-13 Speech data processing method and apparatus, electronic device, and readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910900060.1 2019-09-23
CN201910900060.1A CN110648680B (zh) 2019-09-23 2019-09-23 语音数据的处理方法、装置、电子设备及可读存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/447,536 Continuation US20220013133A1 (en) 2019-09-23 2021-09-13 Speech data processing method and apparatus, electronic device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2021057239A1 true WO2021057239A1 (zh) 2021-04-01

Family

ID=69011077

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105034 WO2021057239A1 (zh) 2019-09-23 2020-07-28 语音数据的处理方法、装置、电子设备及可读存储介质

Country Status (5)

Country Link
US (1) US20220013133A1 (zh)
EP (1) EP3920183A4 (zh)
JP (1) JP7301154B2 (zh)
CN (1) CN110648680B (zh)
WO (1) WO2021057239A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110648680B (zh) * 2019-09-23 2024-05-14 腾讯科技(深圳)有限公司 语音数据的处理方法、装置、电子设备及可读存储介质
CN112820307B (zh) * 2020-02-19 2023-12-15 腾讯科技(深圳)有限公司 语音消息处理方法、装置、设备及介质
WO2021189979A1 (zh) * 2020-10-26 2021-09-30 平安科技(深圳)有限公司 语音增强方法、装置、计算机设备及存储介质
CN112562704B (zh) * 2020-11-17 2023-08-18 中国人民解放军陆军工程大学 基于blstm的分频拓谱抗噪语音转换方法
CN114999508B (zh) * 2022-07-29 2022-11-08 之江实验室 一种利用多源辅助信息的通用语音增强方法和装置

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800322A (zh) * 2011-05-27 2012-11-28 中国科学院声学研究所 一种噪声功率谱估计与语音活动性检测方法
US9058820B1 (en) * 2013-05-21 2015-06-16 The Intellisis Corporation Identifying speech portions of a sound model using various statistics thereof
CN104952448A (zh) * 2015-05-04 2015-09-30 张爱英 一种双向长短时记忆递归神经网络的特征增强方法及系统
US9208794B1 (en) * 2013-08-07 2015-12-08 The Intellisis Corporation Providing sound models of an input signal using continuous and/or linear fitting
CN108615533A (zh) * 2018-03-28 2018-10-02 天津大学 一种基于深度学习的高性能语音增强方法
CN108877823A (zh) * 2018-07-27 2018-11-23 三星电子(中国)研发中心 语音增强方法和装置
CN109102823A (zh) * 2018-09-05 2018-12-28 河海大学 一种基于子带谱熵的语音增强方法
CN109273021A (zh) * 2018-08-09 2019-01-25 厦门亿联网络技术股份有限公司 一种基于rnn的实时会议降噪方法及装置
CN109427340A (zh) * 2017-08-22 2019-03-05 杭州海康威视数字技术股份有限公司 一种语音增强方法、装置及电子设备
CN109979478A (zh) * 2019-04-08 2019-07-05 网易(杭州)网络有限公司 语音降噪方法及装置、存储介质及电子设备
CN110648680A (zh) * 2019-09-23 2020-01-03 腾讯科技(深圳)有限公司 语音数据的处理方法、装置、电子设备及可读存储介质

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007116585A (ja) 2005-10-24 2007-05-10 Matsushita Electric Ind Co Ltd ノイズキャンセル装置およびノイズキャンセル方法
JP5188300B2 (ja) * 2008-07-14 2013-04-24 日本電信電話株式会社 基本周波数軌跡モデルパラメータ抽出装置、基本周波数軌跡モデルパラメータ抽出方法、プログラム及び記録媒体
US8234111B2 (en) * 2010-06-14 2012-07-31 Google Inc. Speech and noise models for speech recognition
JP5870476B2 (ja) * 2010-08-04 2016-03-01 富士通株式会社 雑音推定装置、雑音推定方法および雑音推定プログラム
PL2866228T3 (pl) * 2011-02-14 2016-11-30 Dekoder audio zawierający estymator szumu tła
CN103650040B (zh) * 2011-05-16 2017-08-25 谷歌公司 使用多特征建模分析语音/噪声可能性的噪声抑制方法和装置
JP5916054B2 (ja) * 2011-06-22 2016-05-11 クラリオン株式会社 音声データ中継装置、端末装置、音声データ中継方法、および音声認識システム
JP2015004959A (ja) * 2013-05-22 2015-01-08 ヤマハ株式会社 音響処理装置
GB2519117A (en) * 2013-10-10 2015-04-15 Nokia Corp Speech processing
GB2520048B (en) * 2013-11-07 2018-07-11 Toshiba Res Europe Limited Speech processing system
CN104318927A (zh) * 2014-11-04 2015-01-28 东莞市北斗时空通信科技有限公司 一种抗噪声的低速率语音编码方法及解码方法
JP2016109933A (ja) 2014-12-08 2016-06-20 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America 音声認識方法ならびに音声認識システムおよびそれに含まれる音声入力装置
CN105355199B (zh) * 2015-10-20 2019-03-12 河海大学 一种基于gmm噪声估计的模型组合语音识别方法
CN106971741B (zh) * 2016-01-14 2020-12-01 芋头科技(杭州)有限公司 实时将语音进行分离的语音降噪的方法及系统
CN106340304B (zh) * 2016-09-23 2019-09-06 桂林航天工业学院 一种适用于非平稳噪声环境下的在线语音增强方法
CN106898348B (zh) * 2016-12-29 2020-02-07 北京小鸟听听科技有限公司 一种出声设备的去混响控制方法和装置
US10978091B2 (en) * 2018-03-19 2021-04-13 Academia Sinica System and methods for suppression by selecting wavelets for feature compression in distributed speech recognition
US10811000B2 (en) * 2018-04-13 2020-10-20 Mitsubishi Electric Research Laboratories, Inc. Methods and systems for recognizing simultaneous speech by multiple speakers
CN110176245A (zh) * 2019-05-29 2019-08-27 贾一焜 一种语音降噪系统
KR102260216B1 (ko) * 2019-07-29 2021-06-03 엘지전자 주식회사 지능적 음성 인식 방법, 음성 인식 장치, 지능형 컴퓨팅 디바이스 및 서버
CN110648681B (zh) * 2019-09-26 2024-02-09 腾讯科技(深圳)有限公司 语音增强的方法、装置、电子设备及计算机可读存储介质

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800322A (zh) * 2011-05-27 2012-11-28 中国科学院声学研究所 一种噪声功率谱估计与语音活动性检测方法
US9058820B1 (en) * 2013-05-21 2015-06-16 The Intellisis Corporation Identifying speech portions of a sound model using various statistics thereof
US9208794B1 (en) * 2013-08-07 2015-12-08 The Intellisis Corporation Providing sound models of an input signal using continuous and/or linear fitting
CN104952448A (zh) * 2015-05-04 2015-09-30 张爱英 一种双向长短时记忆递归神经网络的特征增强方法及系统
CN109427340A (zh) * 2017-08-22 2019-03-05 杭州海康威视数字技术股份有限公司 一种语音增强方法、装置及电子设备
CN108615533A (zh) * 2018-03-28 2018-10-02 天津大学 一种基于深度学习的高性能语音增强方法
CN108877823A (zh) * 2018-07-27 2018-11-23 三星电子(中国)研发中心 语音增强方法和装置
CN109273021A (zh) * 2018-08-09 2019-01-25 厦门亿联网络技术股份有限公司 一种基于rnn的实时会议降噪方法及装置
CN109102823A (zh) * 2018-09-05 2018-12-28 河海大学 一种基于子带谱熵的语音增强方法
CN109979478A (zh) * 2019-04-08 2019-07-05 网易(杭州)网络有限公司 语音降噪方法及装置、存储介质及电子设备
CN110648680A (zh) * 2019-09-23 2020-01-03 腾讯科技(深圳)有限公司 语音数据的处理方法、装置、电子设备及可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3920183A4

Also Published As

Publication number Publication date
JP7301154B2 (ja) 2023-06-30
EP3920183A4 (en) 2022-06-08
CN110648680A (zh) 2020-01-03
US20220013133A1 (en) 2022-01-13
JP2022527527A (ja) 2022-06-02
EP3920183A1 (en) 2021-12-08
CN110648680B (zh) 2024-05-14

Similar Documents

Publication Publication Date Title
WO2021057239A1 (zh) 语音数据的处理方法、装置、电子设备及可读存储介质
US8996372B1 (en) Using adaptation data with cloud-based speech recognition
CN108198569B (zh) 一种音频处理方法、装置、设备及可读存储介质
JP2021515277A (ja) オーディオ信号処理システム、及び入力オーディオ信号を変換する方法
US11355097B2 (en) Sample-efficient adaptive text-to-speech
CN109410973B (zh) 变声处理方法、装置和计算机可读存储介质
CN113362812B (zh) 一种语音识别方法、装置和电子设备
CN112562691A (zh) 一种声纹识别的方法、装置、计算机设备及存储介质
CN111583944A (zh) 变声方法及装置
CN106165015B (zh) 用于促进基于加水印的回声管理的装置和方法
CN109801635A (zh) 一种基于注意力机制的声纹特征提取方法及装置
CN111128221A (zh) 一种音频信号处理方法、装置、终端及存储介质
US20230335148A1 (en) Speech Separation Method, Electronic Device, Chip, and Computer-Readable Storage Medium
CN111583906A (zh) 一种语音会话的角色识别方法、装置及终端
CN111009257A (zh) 一种音频信号处理方法、装置、终端及存储介质
EP4254408A1 (en) Speech processing method and apparatus, and apparatus for processing speech
US11776563B2 (en) Textual echo cancellation
CN110827808A (zh) 语音识别方法、装置、电子设备和计算机可读存储介质
CN114898762A (zh) 基于目标人的实时语音降噪方法、装置和电子设备
US11354520B2 (en) Data processing method and apparatus providing translation based on acoustic model, and storage medium
CN107437412B (zh) 一种声学模型处理方法、语音合成方法、装置及相关设备
WO2022147692A1 (zh) 一种语音指令识别方法、电子设备以及非瞬态计算机可读存储介质
CN110580910B (zh) 一种音频处理方法、装置、设备及可读存储介质
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN114783455A (zh) 用于语音降噪的方法、装置、电子设备和计算机可读介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20868291

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020868291

Country of ref document: EP

Effective date: 20210901

ENP Entry into the national phase

Ref document number: 2021558880

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE