WO2021159772A1 - Speech enhancement method and apparatus, electronic device, and computer-readable storage medium

Speech enhancement method and apparatus, electronic device, and computer-readable storage medium

Info

Publication number
WO2021159772A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
matrix
sparse matrix
enhanced
neural network
Application number
PCT/CN2020/126345
Other languages
English (en)
French (fr)
Inventor
方雪飞
崔晓春
李从兵
刘晓宇
曹木勇
余涛
杨栋
周荣鑫
李文焱
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to EP20918152.8A (EP4002360A4)
Publication of WO2021159772A1
Priority to US17/717,620 (US20220262386A1)


Classifications

    • G10L21/0208 — Noise filtering
    • G10L21/0264 — Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L21/0232 — Noise filtering with processing in the frequency domain
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Combinations of networks
    • G06N3/084 — Backpropagation, e.g. using gradient descent
    • G10L25/18 — Extracted parameters being spectral information of each sub-band
    • G10L21/028 — Voice signal separating using properties of sound source
    • G10L25/30 — Speech or voice analysis using neural networks

Definitions

  • This application relates to the technical field of speech noise reduction, and in particular to a speech enhancement method and device, electronic equipment, and computer-readable storage medium.
  • Voice is one of the most convenient and natural communication tools for human beings. It can eliminate the distance barriers between people and improve the efficiency of interaction between people and machines.
  • However, the ubiquitous noise in real environments degrades the quality of voice communication to varying degrees. For example, in a rich game scene, when a user uses game voice in a noisy environment, the microphone collects various environmental noises; and in multi-person team voice, when one party is interfered with by noise, the call quality of all team members is affected.
  • the embodiments of the present application provide a speech enhancement method and device, electronic equipment, and computer-readable storage medium, which can provide a depth dictionary that can characterize the depth features of clean speech, so as to better perform speech enhancement on noisy speech.
  • An embodiment of the application proposes a speech enhancement method, which includes: obtaining a clean speech sample; decomposing the clean speech sample to obtain a first sparse matrix and m basis matrices, where the values in the first sparse matrix are all positive numbers and m is a positive integer greater than 1; obtaining the visible-layer neuron state vector of the target neural network according to the first sparse matrix and the weight matrix of the target neural network; and updating the weight matrix according to the visible-layer neuron state vector and the clean speech sample to obtain a depth dictionary for speech enhancement.
  • The embodiment of the present application provides another speech enhancement method.
  • This speech enhancement method includes: acquiring speech to be enhanced; acquiring a depth dictionary for speech enhancement according to any of the above-mentioned methods; performing depth expansion on the speech to be enhanced according to the depth dictionary to determine a second sparse matrix of the speech to be enhanced; and determining the clean speech of the speech to be enhanced according to the second sparse matrix and the depth dictionary.
  • An embodiment of the present application also provides a voice enhancement device, which includes a sample acquisition module, a decomposition module, a visible layer reconstruction module, and a depth dictionary acquisition module.
  • The sample acquisition module may be configured to acquire clean speech samples; the decomposition module may be configured to decompose the clean speech samples to obtain a first sparse matrix and m basis matrices, where the values in the first sparse matrix are all positive numbers and m is a positive integer greater than 1.
  • The visible layer reconstruction module may be configured to obtain the visible-layer neuron state vector of the target neural network according to the first sparse matrix and the weight matrix of the target neural network.
  • The depth dictionary acquisition module may be configured to update the weight matrix according to the visible-layer neuron state vector and the clean speech sample to acquire a depth dictionary for speech enhancement.
  • An embodiment of the present application also proposes an electronic device, which includes: one or more processors; and a memory configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the speech enhancement methods described above.
  • the embodiment of the present application also proposes a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the speech enhancement method as described in any one of the above is implemented.
  • In the embodiments of the present application, a first sparse matrix that represents the clean speech as concisely as possible is obtained by deep decomposition of the clean speech, and this first sparse matrix is then introduced into the hidden layer of the target neural network.
  • Completing the training of the target neural network in this way yields a depth dictionary that can represent the depth information of the clean speech signal.
  • Compared with related technologies that train a neural network to determine the mapping relationship between noisy signals and clean signals, this solution needs only clean speech signals to obtain the depth dictionary used in the speech enhancement technology, and has stronger generalization ability.
  • The technical solution provided by this application obtains a depth dictionary that can characterize the depth features of the clean speech signal by decomposing the clean speech, and can learn a deeper representation of the clean speech.
  • Fig. 1 is a schematic diagram of a system architecture of a speech enhancement system provided by an embodiment of the present application.
  • Fig. 2 is a schematic structural diagram of a computer system applied to a speech enhancement device provided by an embodiment of the present application.
  • Fig. 3 is a flowchart of a voice enhancement method provided by an embodiment of the present application.
  • Fig. 4 is a flowchart of step S2 in Fig. 3 in an exemplary embodiment.
  • Fig. 5 is a flowchart of step S3 in Fig. 3 in an exemplary embodiment.
  • Fig. 6 is a flowchart of step S4 in Fig. 3 in an exemplary embodiment.
  • Fig. 7 is a flowchart of step S4 in Fig. 3 in an exemplary embodiment.
  • Fig. 8 is a flowchart of a voice enhancement method provided by an embodiment of the present application.
  • Fig. 9 is a flowchart of step S03 in Fig. 8 in an exemplary embodiment.
  • Fig. 10 is a flowchart of step S03 in Fig. 8 in an exemplary embodiment.
  • FIG. 11 is a schematic structural diagram of an iterative algorithm for learning soft thresholds provided by an embodiment of the present application.
  • Fig. 12 is a flowchart of step S03 in Fig. 8 in an exemplary embodiment.
  • FIG. 13 is a schematic structural diagram of a trainable iterative threshold algorithm provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of the effect of speech enhancement provided by an embodiment of the present application.
  • Fig. 15 is a schematic diagram of a game speech engine provided by an embodiment of the present application.
  • Fig. 16 is a block diagram of a speech enhancement device provided by an embodiment of the present application.
  • The terms “a”, “an”, “the”, “said”, and “at least one” are used to indicate that there are one or more elements/components/etc.; the terms “including” and “having” are used to mean open-ended inclusion, that is, in addition to the listed elements/components/etc., there may be other elements/components/etc.; and the terms “first”, “second”, “third”, etc. are used only as markers, not as a restriction on the number of objects.
  • FIG. 1 is a schematic diagram of a system architecture 100 of a speech enhancement system provided by an embodiment of the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, wearable devices, virtual reality devices, smart home devices, and so on.
  • the server 105 may be a server that provides various services, such as a background management server that provides support for devices operated by users using the terminal devices 101, 102, and 103.
  • the background management server can analyze and process the received request and other data, and feed back the processing result to the terminal device.
  • the server 105 may, for example, obtain a clean voice sample; decompose the clean voice sample to obtain a first sparse matrix and m basis matrices, wherein the values in the first sparse matrix are all positive numbers, and m is a positive integer greater than 1;
  • The first sparse matrix and the weight matrix of the target neural network are used to obtain the visible-layer neuron state vector of the target neural network; the weight matrix is updated according to the visible-layer neuron state vector and the clean voice sample to obtain a depth dictionary for speech enhancement.
  • The server 105 may be a physical server, such as an independent physical server, a server cluster composed of multiple physical servers, or a distributed system.
  • It may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, and big data and artificial intelligence platforms.
  • FIG. 2 shows a schematic structural diagram of a computer system 200 suitable for implementing a terminal device according to an embodiment of the present application.
  • the terminal device shown in FIG. 2 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • The computer system 200 includes a central processing unit (Central Processing Unit, CPU) 201, which can execute various appropriate actions and processing according to a program stored in a read-only memory (Read-Only Memory, ROM) 202 or a program loaded from a storage part 208 into a random access memory (Random Access Memory, RAM) 203. The RAM 203 also stores various programs and data required for the operation of the system 200.
  • the CPU 201, the ROM 202, and the RAM 203 are connected to each other through a bus 204.
  • An input/output (I/O) interface 205 is also connected to the bus 204.
  • The following components are connected to the I/O interface 205: an input part 206 including a keyboard, a mouse, etc.; an output part 207 including a cathode ray tube (Cathode Ray Tube, CRT), a liquid crystal display (LCD), speakers, and the like; a storage part 208 including a hard disk, etc.; and a communication part 209 including a network interface card such as a LAN card, a modem, and the like.
  • the communication section 209 performs communication processing via a network such as the Internet.
  • the drive 210 is also connected to the I/O interface 205 as needed.
  • the removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 210 as needed, so that the computer program read from it is installed into the storage part 208 as needed.
  • the process described above with reference to the flowchart can be implemented as a computer software program.
  • the embodiments of the present application include a computer program product, which includes a computer program carried on a computer-readable storage medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from the network through the communication section 209, and/or installed from the removable medium 211.
  • When the computer program is executed by the central processing unit (CPU) 201, the above-mentioned functions defined in the system of the present application are executed.
  • the computer-readable storage medium shown in this application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above.
  • Examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, and a computer-readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium.
  • Such a medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable storage medium can be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
  • Each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the above-mentioned module, program segment, or part of code contains one or more executable instructions for realizing the specified logical function.
  • The functions marked in the blocks may also occur in a different order than noted in the drawings. For example, two blocks shown one after another can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved.
  • Each block in the block diagram or flowchart, and each combination of blocks in the block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • modules and/or units involved in the embodiments described in the present application can be implemented in software or hardware.
  • the described modules and/or units may also be provided in the processor, for example, it may be described as: a processor includes a sending unit, an acquiring unit, a determining unit, and a first processing unit. Wherein, the names of these modules and/or units do not constitute a limitation on the modules and/or units themselves under certain circumstances.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be included in the device described in the foregoing embodiment; or it may exist alone without being assembled into the device.
  • the above-mentioned computer-readable storage medium carries one or more programs.
  • The functions that can be realized by the device include: obtaining a clean voice sample; decomposing the clean voice sample to obtain a first sparse matrix and m basis matrices, where the values in the first sparse matrix are all positive numbers and m is a positive integer greater than 1; obtaining the visible-layer neuron state vector of the target neural network according to the first sparse matrix and the weight matrix of the target neural network; and updating the weight matrix according to the visible-layer neuron state vector and the clean speech sample to obtain a depth dictionary for speech enhancement.
  • Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Speech Technology includes Automatic Speech Recognition (ASR), Text To Speech (TTS), and voiceprint recognition technology. Enabling computers to be able to listen, see, speak, and feel is the future development direction of human-computer interaction, among which voice has become one of the most promising human-computer interaction methods in the future.
  • Machine Learning is a multi-field interdisciplinary subject, involving multiple subjects such as probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. Specializing in the study of how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills, and reorganize the existing knowledge structure to continuously improve its own performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications are in all fields of artificial intelligence.
  • Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
  • Fig. 3 is a flowchart of a speech enhancement method provided by the present application.
  • the method provided in the embodiments of the present application can be processed by any electronic device with computing and processing capabilities, such as the server 105 and/or the terminal devices 102 and 103 in the embodiment of FIG. 1 above.
  • In the following description, the server 105 is taken as the execution body by way of example.
  • the voice enhancement method provided by the embodiment of the present application may include the following steps.
  • Speech enhancement refers to the technology that extracts useful speech signals from the noise background to suppress and reduce noise interference when the speech signal is disturbed by various noises or even submerged.
  • speech enhancement is to extract the purest possible original speech from noisy speech.
  • In related technologies, a neural network model is usually trained in a supervised fashion on noisy speech and its corresponding clean speech to determine the mapping relationship between noisy speech and clean speech.
  • The above-mentioned speech enhancement method needs to collect a large amount of clean speech signals and noise signals during the training process, and the collection of noise signals is time-consuming and laborious, which is not conducive to improving the efficiency of speech enhancement.
  • In addition, the generalization ability of a neural network model trained on noisy and clean speech is mediocre; that is, when the noise in the test set differs significantly from the noise in the training set, the noise reduction ability of the neural network model declines significantly.
  • the neural network has poor interpretability and cannot properly explain the speech enhancement process.
  • step S1 a clean voice sample is obtained.
  • Clean speech can refer to some pure speech signals that do not include noise.
  • some clean speech can be obtained as training samples of the target neural network.
  • the original voice in the game (excluding environmental noise) can be obtained as the training sample of the target neural network.
  • Speech signals can usually be represented by formula (1):

    Y = DX + n,  (1)

    where Y ∈ R^(M×N) is the observed speech, X ∈ R^(M×N) is the sparse matrix, D ∈ R^(M×N) is the dictionary, and n is the noise; M is the number of rows and N is the number of columns of the speech signal, both positive integers greater than or equal to 1, and R represents the real number field.
  • the speech signal can be sparsely decomposed by formula (1).
  • the sparse decomposition of the speech signal can refer to the expression of most or all of the original signal Y with a linear combination of fewer basic signals according to a given over-complete dictionary D, so as to obtain a more concise representation of the signal.
  • The basic signals used in such a linear combination can be called atoms.
  • the signal sparse decomposition based on an over-complete dictionary is a new signal representation theory, which uses an over-complete redundant function system to replace the traditional orthogonal basis functions, which provides great flexibility for signal adaptive sparse expansion.
  • Sparse decomposition can achieve the efficiency of data compression, and more importantly, can use the redundancy characteristics of the dictionary to capture the intrinsic characteristics of the signal.
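  • For illustration, the following is a minimal numpy sketch of the sparse model in formula (1): a random over-complete dictionary, a sparse code with a few active atoms per frame, and a noisy observation. The dimensions, the random dictionary, and the noise level are illustrative assumptions, not values from this application.

```python
import numpy as np

# A minimal sketch of the sparse model Y = D X + n from formula (1),
# using a random over-complete dictionary (more atoms than dimensions).
rng = np.random.default_rng(0)
M, K, N = 64, 256, 10           # signal dim, number of atoms, number of frames
D = rng.standard_normal((M, K))
D /= np.linalg.norm(D, axis=0)  # unit-norm atoms

# Build a ground-truth sparse code: a few active atoms per frame.
X_true = np.zeros((K, N))
for j in range(N):
    support = rng.choice(K, size=5, replace=False)
    X_true[support, j] = rng.standard_normal(5)

Y = D @ X_true + 0.01 * rng.standard_normal((M, N))  # observed "speech" + noise

# With the sparse code known, D @ X reconstructs the signal up to the noise floor.
print(np.linalg.norm(Y - D @ X_true) / np.linalg.norm(Y))
```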
  • step S2 the clean speech sample is decomposed to obtain a first sparse matrix and m basis matrices, wherein the values in the first sparse matrix are all positive numbers, and m is a positive integer greater than 1.
  • This embodiment creatively proposes extracting, from clean speech, a depth dictionary that can characterize the deep features of the speech signal, and using it to complete the enhancement of noisy speech.
  • In some embodiments, deep semi-nonnegative matrix factorization (Deep semi-Nonnegative Matrix Factorization, Deep semi-NMF) can be introduced into the hidden layer of the Restricted Boltzmann Machine (RBM) to train on the characteristics of speech signals.
  • The deep semi-NMF can be expressed by formula (2):

    Y^± ≈ Z_1^± Z_2^± … Z_m^± H_m^+,  (2)

    where Y^± is the clean speech sample (it can be used as the visible-layer observed neuron state variable in the RBM), H_m^+ is the sparse matrix (it can be used as the hidden-layer neuron state variable in the RBM), the superscript ± indicates that the values in a matrix may be positive or negative, the superscript + indicates that the values in a matrix are restricted to positive numbers, and m is a positive integer greater than or equal to 1.
  • In some embodiments, the clean speech samples may be deeply decomposed by formula (2) to obtain the first sparse matrix H_m and the m basis matrices Z_1, …, Z_m.
  • In other embodiments, deep non-negative matrix factorization may also be used to deeply decompose the clean speech sample to obtain the first sparse matrix and the m basis matrices.
  • In step S3, the visible-layer neuron state vector of the target neural network is obtained according to the first sparse matrix and the weight matrix of the target neural network.
  • the target neural network may be a restricted Boltzmann machine, and the restricted Boltzmann machine may include a visible layer and a hidden layer.
  • Each element w_ij of the weight matrix of the restricted Boltzmann machine may specify the edge weight between a visible-layer unit v_i and a hidden-layer unit h_j.
  • Each visible-layer unit v_i may have a first bias term a_i.
  • Each hidden-layer unit h_j may have a second bias term b_j.
  • In some embodiments, a restricted Boltzmann machine may be used to process the clean speech sample to obtain a hidden representation h that can characterize the intrinsic characteristics of the clean speech sample (for example, any one of H_1, H_2, H_3, …, H_m in formula (2)), and h can be used as the neuron state variable of the hidden layer of the restricted Boltzmann machine.
  • Before training the restricted Boltzmann machine, the weight matrix W, the first bias term a, and the second bias term b of the restricted Boltzmann machine may first be initialized.
  • The reconstructed visible-layer neuron state variable v* and the initial visible-layer neuron state variable v determined according to the clean speech samples are then used to update the parameters in the target neural network model.
  • In step S4, the weight matrix is updated according to the visible-layer neuron state vector and the clean speech sample to obtain a depth dictionary for speech enhancement.
  • In some embodiments, short-time Fourier transform processing may be performed on the clean speech sample to obtain its spectrogram, and the spectrogram of the clean speech sample may be determined as the visible-layer observed neuron state variable v of the restricted Boltzmann machine.
  • The weight matrix W, the first bias term a, and the second bias term b of the target neural network can then be updated in the reverse direction through the visible-layer observed neuron state variable v and the reconstructed visible-layer neuron state variable v*.
  • different clean speech samples may be used to train the target neural network until the training standard is reached.
  • the weight matrix W of the target neural network obtained after training may be the final depth dictionary D.
  • The speech enhancement method provided by the embodiments of the present application obtains a first sparse matrix that represents the clean speech as concisely as possible through deep decomposition of the clean speech, and then introduces the first sparse matrix into the hidden layer of the target neural network; after the training of the target neural network is completed, a depth dictionary that can represent the depth information of the clean speech signal is obtained.
  • the embodiment of the present application obtains a depth dictionary that can characterize the depth features of the clean speech signal by decomposing the clean speech, and can learn a deeper representation of the clean speech.
  • Fig. 4 is a flowchart of step S2 in Fig. 3 in an exemplary embodiment.
  • In some embodiments, deep semi-NMF may be used to process the clean speech sample to obtain a first sparse matrix that can characterize the depth characteristics of the clean speech sample.
  • Using semi-NMF to solve the clean speech can, on the one hand, handle speech signals with a huge amount of data; on the other hand, negative numbers in the decomposed sparse matrix would have no real meaning in actual problems, while semi-NMF ensures that the values in the finally obtained sparse matrix are all positive, which makes them easy to interpret in practical applications. In addition, deep semi-NMF decomposition of the speech signal can also capture its depth characteristics and thus describe the speech signal better.
  • The deep semi-non-negative matrix factorization can be described by formula (2).
  • the deep semi-non-negative matrix solution process may include an initialization process and an iterative update process.
  • Formula (3) describes the initialization process of the deep semi-NMF:

    Y^± ≈ Z_1^± H_1^+,  H_1^± ≈ Z_2^± H_2^+,  …,  H_(m−1)^± ≈ Z_m^± H_m^+,  (3)

    where Y^± is the clean speech sample (that is, the visible layer in the RBM), H_m^+ is the sparse matrix (that is, the hidden layer in the RBM), the superscript ± indicates that the values in a matrix may be positive or negative, and the superscript + indicates that the values in a matrix are restricted to positive numbers.
  • the initialization process of the deep semi-non-negative matrix solution may include the following steps.
  • the m base matrices may include a first base matrix and a second base matrix
  • decomposing the clean speech sample to obtain a sparse matrix and m base matrices may include the following steps:
  • step S21 a semi-non-negative matrix solution is performed on the clean speech sample to determine the first base matrix and the first target matrix.
  • step S22 a semi-non-negative matrix solution is performed on the first target matrix to initialize the second base matrix and the second target matrix.
  • the optimal solution may be determined through iterative update to serve as the first sparse matrix.
  • In some embodiments, the error function shown in formula (4) can be established according to the above-mentioned deep semi-NMF:

    C_deep = ‖Y − Z_1 Z_2 … Z_m H_m‖²,  (4)

    where C_deep represents the error function, Y represents the clean speech sample, Z_1, …, Z_m represent the m basis matrices, and H_m represents the first sparse matrix, m being a positive integer greater than or equal to 1.
  • In the update rules below, the matrix superscript † represents the generalized Moore-Penrose inverse; the matrix superscript pos means keeping all the positive elements in the matrix and setting all the negative elements to 0, and the matrix superscript neg means keeping all the negative elements in the matrix and setting all the positive elements to 0.
  • the update iteration process of solving the depth semi-non-negative matrix can include the following steps.
  • step S23 a base matrix variable is determined according to the first base matrix and the second base matrix.
  • step S24 the basis matrix variables, the clean speech samples, and the second target matrix are processed by a basis matrix update function, and the second basis matrix is updated.
  • each base matrix Z i can be updated iteratively according to formula (5).
  • step S25 the base matrix variables and the clean speech samples are processed by a sparse matrix update function, and the second target matrix is updated, where the second target matrix is the first sparse matrix.
  • In some embodiments, each target matrix H_i can be updated iteratively according to formula (6) until the number of iterations reaches a preset value or the error function is less than the preset error value, and the target matrix H_m is then output as the first sparse matrix.
  • In the embodiments of the present application, the clean speech samples may be processed through deep semi-NMF to determine a first sparse matrix that can characterize the depth characteristics of the clean speech.
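  • The following is a simplified two-layer sketch of the initialize-then-iterate scheme of steps S21 to S25, assuming a standard semi-NMF multiplicative update (Ding et al. style) in place of the application's exact formulas (5) and (6); all sizes and data are illustrative.

```python
import numpy as np

# A simplified two-layer deep semi-NMF sketch: Y ~ Z1 Z2 H2, following the
# layer-wise initialization of formula (3) and an iterative update per layer.
rng = np.random.default_rng(3)
M, N, k1, k2 = 32, 100, 16, 8
Y = rng.standard_normal((M, N))

def semi_nmf(A, k, iters=50):
    """Factor A ~ Z H with Z unconstrained and H restricted to nonnegative values."""
    H = np.abs(rng.standard_normal((k, A.shape[1])))
    for _ in range(iters):
        Z = A @ np.linalg.pinv(H)                  # unconstrained basis update
        G = Z.T @ A
        pos, neg = np.maximum(G, 0), np.maximum(-G, 0)
        ZZ = Z.T @ Z
        ZZp, ZZn = np.maximum(ZZ, 0), np.maximum(-ZZ, 0)
        # Multiplicative update keeping H nonnegative (Ding et al. style).
        H *= np.sqrt((pos + ZZn @ H) / (neg + ZZp @ H + 1e-12))
    return Z, H

Z1, H1 = semi_nmf(Y, k1)   # step S21: first basis matrix and first target matrix
Z2, H2 = semi_nmf(H1, k2)  # step S22: decompose the first target matrix again
print(np.linalg.norm(Y - Z1 @ Z2 @ H2) / np.linalg.norm(Y))
```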
  • Fig. 5 is a flowchart of step S3 in Fig. 3 in an exemplary embodiment.
  • the target neural network may be a restricted Boltzmann machine, and the restricted Boltzmann machine may include a first bias term a.
  • step S3 may include the following steps.
  • In step S31, the visible-layer conditional probability of the target neural network is determined according to the first sparse matrix, the weight matrix of the target neural network, and the first bias term.
  • In some embodiments, the visible-layer conditional probability of the target neural network can be determined according to formula (7):

    p(v_i = 1 | h) = logistic(a_i + Σ_j W_ij h_j),  (7)

    where logistic(·) can be an activation function, for example logistic(x) = 1/(1 + e^(−x)); p(v_i = 1 | h) represents the conditional probability of the visible layer given the hidden layer; v_i represents the state variable of the i-th visible-layer neuron; h represents the hidden-layer neuron state variable; a_i represents the i-th first bias term of the visible layer; W_ij represents the value in the i-th row and j-th column of the weight matrix; and h_j represents the j-th value of the hidden-layer neuron state variable of the target neural network (which is also the depth dictionary).
  • In step S32, the state variable of the visible-layer neuron is determined according to the visible-layer conditional probability.
  • In some embodiments, the state variable of the visible-layer neuron may be determined from the visible-layer conditional probability by random sampling. For example, a random number r_i is generated on [0, 1], and the state variable v* of the visible-layer neuron is determined according to formula (8):

    v*_i = 1 if p(v_i = 1 | h) ≥ r_i, and v*_i = 0 otherwise.  (8)

  • In the embodiments of the present application, the state variables of the neurons in the visible layer may be determined inversely from the state variables of the neurons in the hidden layer based on the visible-layer conditional probability.
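  • A small sketch of the sampling in formulas (7) and (8) follows; the weight matrix W, bias a, and hidden state h are illustrative placeholders rather than trained values.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Reconstruct visible units from a hidden state, per formulas (7)-(8).
rng = np.random.default_rng(1)
n_visible, n_hidden = 6, 4
W = rng.standard_normal((n_visible, n_hidden)) * 0.1
a = np.zeros(n_visible)
h = rng.integers(0, 2, size=n_hidden)

p_v = logistic(a + W @ h)          # formula (7): p(v_i = 1 | h)
r = rng.uniform(size=n_visible)    # random numbers r_i on [0, 1]
v_star = (p_v >= r).astype(float)  # sampled visible state v*, per formula (8)
print(p_v, v_star)
```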
  • Fig. 6 is a flowchart of step S4 in Fig. 3 in an exemplary embodiment.
  • the target neural network may be a restricted Boltzmann machine, and the restricted Boltzmann machine may include a second bias term b.
  • step S4 may include the following steps.
  • step S41 the first hidden layer conditional probability of the target neural network is determined according to the weight matrix, the clean speech sample, and the second bias term.
  • In some embodiments, the first hidden-layer conditional probability may be determined according to formula (9):

    p(h_j = 1 | v) = logistic(b_j + Σ_i W_ij v_i),  (9)

    where logistic(·) can be an activation function and p(h_j = 1 | v) represents the conditional probability of the j-th hidden-layer neuron given the visible-layer state v.
  • In step S42, the second hidden-layer conditional probability of the target neural network is determined according to the weight matrix, the visible-layer neuron state vector, and the second bias term.
  • In some embodiments, the second hidden-layer conditional probability may be determined according to formula (10):

    p(h_j = 1 | v*) = logistic(b_j + Σ_i W_ij v*_i),  (10)

    where p(h_j = 1 | v*) represents the second hidden-layer conditional probability; logistic(·) can be an activation function; v* represents the reconstructed visible-layer neuron state variable; h_j represents the state variable of the j-th hidden-layer neuron; b_j represents the j-th second bias term of the hidden layer; and W_ij represents the value in the i-th row and j-th column of the weight matrix.
  • In step S43, the weight matrix is updated according to the first hidden-layer conditional probability, the second hidden-layer conditional probability, the clean speech sample, and the visible-layer neuron state vector.
  • In some embodiments, the weight matrix W may be updated according to formula (11):

    W ← W + ε ( v · p(h = 1 | v)^T − v* · p(h = 1 | v*)^T ),  (11)

    where p(h = 1 | v) represents the first hidden-layer conditional probability, p(h = 1 | v*) represents the second hidden-layer conditional probability, v and v* are the visible-layer neuron state variable and its reconstruction, the superscript T denotes transposition, and ε represents the learning rate.
  • Fig. 7 is a flowchart of step S4 in Fig. 3 in an exemplary embodiment.
  • the above-mentioned step S4 may include the following steps.
  • In step S44, the first bias term is updated according to the clean speech sample and the visible-layer neuron state vector.
  • In some embodiments, the first bias term a may be updated according to formula (12):

    a ← a + ε (v − v*),  (12)

    where ε represents the learning rate, v represents the visible-layer neuron state variable determined according to the clean speech sample, and v* represents the reconstructed visible-layer neuron state variable.
  • step S45 the second bias term is updated according to the first hidden layer conditional probability and the second hidden layer conditional probability.
  • In some embodiments, the second bias term b may be updated according to formula (13):

    b ← b + ε ( p(h = 1 | v) − p(h = 1 | v*) ),  (13)

    where ε represents the learning rate, p(h = 1 | v) represents the first hidden-layer conditional probability, and p(h = 1 | v*) represents the second hidden-layer conditional probability.
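  • The updates of formulas (9) to (13) together form one contrastive-divergence-style training step; the following sketch applies them to a single illustrative visible vector, with sizes, data, and the learning rate ε chosen arbitrarily.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# One contrastive-divergence-style step implementing formulas (9)-(13).
# v stands in for one frame of a clean-speech spectrogram (binarized here).
rng = np.random.default_rng(2)
nv, nh, eps = 8, 5, 0.05
W, a, b = 0.1 * rng.standard_normal((nv, nh)), np.zeros(nv), np.zeros(nh)
v = rng.integers(0, 2, size=nv).astype(float)

p_h_v = logistic(b + W.T @ v)                  # formula (9): p(h_j = 1 | v)
h = (p_h_v >= rng.uniform(size=nh)).astype(float)
v_star = (logistic(a + W @ h) >= rng.uniform(size=nv)).astype(float)  # reconstruction v*
p_h_vstar = logistic(b + W.T @ v_star)         # formula (10): p(h_j = 1 | v*)

W += eps * (np.outer(v, p_h_v) - np.outer(v_star, p_h_vstar))  # formula (11)
a += eps * (v - v_star)                                        # formula (12)
b += eps * (p_h_v - p_h_vstar)                                 # formula (13)
```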
  • Fig. 8 is a flowchart of a voice enhancement method shown in an embodiment of the present application.
  • the method provided in the embodiments of the present application can be processed by any electronic device with computing and processing capabilities, such as the server 105 and/or the terminal devices 102 and 103 in the embodiment of FIG. 1 above.
  • In the following description, the server 105 is taken as the execution body by way of example.
  • The voice enhancement method provided by the embodiment of the present application may include the following steps.
  • step S01 the speech to be enhanced is obtained.
  • the speech to be enhanced may refer to a speech signal including noise.
  • step S02 a depth dictionary for speech enhancement is obtained.
  • a depth dictionary that can be used for speech enhancement can be obtained by the above-mentioned speech enhancement method.
  • step S03 depth expansion is performed on the speech to be enhanced according to the depth dictionary, and a second sparse matrix of the speech to be enhanced is determined.
  • In some embodiments, the second sparse matrix may be determined by solving formula (14), based on the model Y_n = DX + n:

    min_X ‖Y_n − DX‖² + λ‖X‖_1,  (14)

    where Y_n is the speech to be enhanced, D is the depth dictionary, X is the second sparse matrix, n is the noise, and λ is a preset parameter.
  • Formula (14) may be solved, for example, by the Iterative Soft-Thresholding Algorithm (ISTA), the Learned Iterative Soft-Thresholding Algorithm (LISTA), or the Trained Iterative Soft-Thresholding Algorithm (TISTA).
  • step S04 the clean speech of the speech to be enhanced is determined according to the second sparse matrix and the depth dictionary.
  • the noisy speech can be sparsely learned on the dictionary D learned from the clean speech to obtain the second sparse matrix X, and then the obtained DX can be used as the final noise-reduced speech.
  • In some embodiments, the clean speech Y* of the speech to be enhanced can be determined by formula (15):

    Y* = DX,  (15)

    where D represents the depth dictionary and X represents the second sparse matrix of the speech to be enhanced.
  • In the embodiments of the application, the sparse matrix of the noisy speech is determined accurately, and based on this sparse matrix and the depth dictionary of the clean speech, the clean speech is accurately restored from the noisy speech.
  • This method can be applied to different speech signals with strong generalization ability.
  • Fig. 9 is a flowchart of step S03 in Fig. 8 in an exemplary embodiment.
  • In some embodiments, the soft-threshold iterative algorithm, which was designed to solve the Lasso problem in linear regression, may be used to determine the sparse matrix of the noisy speech.
  • The Lasso problem is mainly used to describe the constrained optimization problem in linear regression; it constrains linear regression with the l1 norm.
  • step S03 may include the following steps.
  • In some embodiments, the ISTA algorithm may be used to solve the above formula (14); for example, formula (16) may be iterated:

    x_(k+1) = soft_(λ/L)( x_k + (1/L) D^T (y − D x_k) ),  (16)

    where L is the largest eigenvalue of D^T D, λ is a preset parameter, and D is the depth dictionary.
  • step S031 a second sparse matrix of the speech to be enhanced is obtained.
  • In some embodiments, the second sparse matrix of the speech to be enhanced can be initialized as x_k in the first iteration; for example, the variables in x_k can be randomly assigned.
  • step S032 a first soft threshold is determined according to the depth dictionary and the second sparse matrix.
  • step S033 the second sparse matrix is updated according to the second sparse matrix, the depth dictionary, and the first soft threshold.
  • Updating the second sparse matrix according to the second sparse matrix, the depth dictionary D, and the first soft threshold may include the following steps:
  • Step 1: Determine the first soft threshold according to the depth dictionary and the second sparse matrix. Step 2: Determine x_(k+1) according to formula (16).
  • a soft threshold iterative algorithm is used to determine a sparse matrix from noisy speech, and this method solves the Lasso problem in the regression process.
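  • A minimal ISTA sketch of formula (16) follows; D, y, and λ are illustrative stand-ins for the depth dictionary, a noisy spectrogram frame, and the preset parameter.

```python
import numpy as np

# ISTA iteration per formula (16) on illustrative data.
rng = np.random.default_rng(4)
M, K = 40, 120
D = rng.standard_normal((M, K)) / np.sqrt(M)
y = rng.standard_normal(M)
lam = 0.1

def soft(u, thresh):
    """Soft-thresholding operator: sign(u) * max(|u| - thresh, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

L_const = np.linalg.eigvalsh(D.T @ D).max()  # largest eigenvalue of D^T D
x = np.zeros(K)                              # initialized second sparse matrix
for _ in range(200):
    x = soft(x + (D.T @ (y - D @ x)) / L_const, lam / L_const)  # formula (16)
print(np.count_nonzero(x), "nonzero coefficients")
```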
  • Fig. 10 is a flowchart of step S03 in Fig. 8 in an exemplary embodiment.
  • the above-mentioned step S03 may include the following steps.
  • In some embodiments, the LISTA algorithm can be used to solve the above formula (14); for example, formula (18) can be iterated:

    x_(k+1) = soft_(θ^k)( W_1^k y + W_2^k x_k ),  k = 1, …, K,  (18)

    where K is a positive integer greater than or equal to 1.
  • Solving using the LISTA algorithm can include the following steps:
  • step S034 the first training speech and its corresponding sparse matrix are acquired.
  • sparse decomposition may be performed on any speech (for example, noisy speech or clean speech) to obtain the first training speech and its corresponding sparse matrix.
  • step S035 the second sparse matrix of the speech to be enhanced is initialized.
  • the second sparse matrix can be initialized, that is, the second sparse matrix can be assigned at will.
  • In step S036, the first training speech, the sparse matrix corresponding to the first training speech, and the initialized second sparse matrix are used to train the target feedforward neural network through a backpropagation algorithm, so as to determine the first target parameter, the second target parameter, and the second soft threshold of the feedforward neural network.
  • Backpropagation algorithm is a supervised learning algorithm.
  • The backpropagation algorithm mainly consists of two phases, excitation propagation and weight update, which iterate repeatedly until the response of the target network to the input reaches the predetermined target range.
  • The excitation propagation phase can include two steps:
  • Forward propagation stage: input training samples into the target network to obtain an excitation response.
  • Back-propagation stage: compute the difference between the excitation response and the expected response corresponding to the training sample to obtain the response error.
  • The weight update process can include the following two steps: the response error is multiplied by the input excitation to obtain the gradient of the weight, and a proportion of this gradient (scaled by the learning rate) is then subtracted from the weight.
  • In some embodiments, the parameters in the LISTA algorithm can be determined through a feedforward neural network as shown in FIG. 11. It is understandable that in the LISTA algorithm learned weight matrices replace the parameters set in ISTA, and the number of iterations is truncated, that is, the algorithm is unrolled into a fixed number of feedforward neural network layers.
  • In some embodiments, the feedforward neural network includes a first target parameter W_1^k, a second target parameter W_2^k, and a second soft threshold θ^k, all of which can be learned through the training of the feedforward neural network.
  • updating the parameters in the feedforward neural network to determine the second sparse matrix of the speech to be enhanced may include the following steps.
  • Step 2 Use the sparse matrix of the first training speech as x k+1 .
  • Step 3: Determine the parameters W_1^k and W_2^k of the feedforward neural network through the backpropagation algorithm.
  • Step 4: Repeat steps 1 to 4 until the number of iterations reaches the preset threshold, or until the error between the network output and x* falls below a preset value, where x* can refer to the clean speech corresponding to the first training speech y, and x_1 can refer to the second sparse matrix after initialization.
  • step S037 the speech to be enhanced is processed according to the first target parameter, the second target parameter and the second soft threshold of the feedforward neural network to determine the second sparse matrix of the speech to be enhanced.
  • In some embodiments, the initialized second sparse matrix can be used as x_1, and the speech to be enhanced can be input as y into the structure corresponding to LISTA to determine the optimal sparse matrix X of the speech to be enhanced.
  • This sparse matrix is then used with formula (15) to determine the clean speech of the speech to be enhanced.
  • In this embodiment, the sparse matrix is determined from the noisy speech by the learned soft-threshold iterative algorithm. On the one hand, it solves the Lasso problem in the regression process; on the other hand, because the method learns its parameters with a neural network, it converges relatively quickly.
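  • The following sketch unrolls formula (18) for a fixed number of layers; for brevity, the parameters W_1, W_2, and the threshold are merely initialized from their ISTA values and shared across layers, rather than being trained per layer by backpropagation as in step S036.

```python
import numpy as np

# Unrolled-LISTA sketch per formula (18): x_{k+1} = soft(W1 @ y + W2 @ x_k, theta).
rng = np.random.default_rng(5)
M, Kdim, layers = 40, 120, 8
D = rng.standard_normal((M, Kdim)) / np.sqrt(M)
y = rng.standard_normal(M)

def soft(u, thresh):
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

L_const = np.linalg.eigvalsh(D.T @ D).max()
W1 = D.T / L_const                       # first target parameter
W2 = np.eye(Kdim) - (D.T @ D) / L_const  # second target parameter
theta = 0.1 / L_const                    # second soft threshold

x = np.zeros(Kdim)                       # initialized second sparse matrix x_1
for _ in range(layers):                  # K unrolled layers
    x = soft(W1 @ y + W2 @ x, theta)
print(np.count_nonzero(x), "nonzero coefficients")
```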
  • Fig. 12 is a flowchart of step S03 in Fig. 8 in an exemplary embodiment.
  • the parameters in the TISTA algorithm can be determined through the neural network structure shown in FIG. 13.
  • The neural network structure may include a linear estimation unit r_k and a minimum mean square error (MMSE) estimation unit η_MMSE.
  • In some embodiments, formula (14) can be solved, for example, by iterating formula (19):

    x_(k+1) = η_MMSE(r_k; τ_k²),  (19)

    where the r_k in formula (19) is determined according to formula (20), its error variance τ_k² according to formula (21), and the η_MMSE function according to formula (22):

    r_k = x_k + γ_k W (y − D x_k),  (20)
    τ_k² = (v_k²/N)(N + (γ_k² − 2γ_k)M) + (γ_k² σ²/N) tr(W W^T),  (21)
    η_MMSE(r; τ²) = (r α²/ξ) · p F(r; ξ) / ( (1 − p) F(r; τ²) + p F(r; ξ) ),  (22)

    where γ_k is the parameter to be learned, y is the input second training speech, x_k is the sparse matrix to be learned, D is the depth dictionary, and W = D^T (D D^T)^(−1).
  • M and N refer to the numbers of rows and columns of the depth dictionary D, respectively. σ² represents the variance of the speech disturbance signal n, and p represents the probability of a non-zero element.
  • The quantity v_k² in formula (21) can be expressed by formula (23), the ξ in formula (22) can be expressed by formula (24), and the function F can be expressed by formula (25):

    v_k² = max( (‖y − D x_k‖² − M σ²) / tr(D^T D), ε ),  (23)
    ξ = α² + τ²,  (24)
    F(x; v) = (1/√(2π v)) exp( −x²/(2v) ),  (25)

    where ε is a set error value; for example, ε can be e^(−9). The α² in formula (24) represents the variance of the non-zero elements.
  • the parameter ⁇ k in the neural network structure may be determined through multiple training speeches.
  • In some embodiments, the process of determining the second sparse matrix of the speech to be enhanced by the TISTA algorithm may include the following steps.
  • In step S038, the second training speech and its corresponding sparse matrix are acquired.
  • sparse decomposition may be performed on any speech (for example, noisy speech or clean speech) to obtain the second training speech and its corresponding sparse matrix.
  • step S039 a second sparse matrix of the speech to be enhanced is obtained.
  • In step S0310, a linear estimation unit is determined according to the depth dictionary, the second training speech, and the sparse matrix corresponding to the second training speech.
  • In some embodiments, the linear estimation unit r_k may be determined according to formula (20); it can be understood that the linear estimation unit contains a parameter γ_k to be learned.
  • In step S0311, a minimum mean square error estimation unit is determined according to the depth dictionary and the second training speech.
  • In some embodiments, the minimum mean square error estimation unit may be determined according to formulas (21) and (22), in which there is a parameter γ_k to be learned.
  • In step S0312, the second sparse matrix, the second training speech, the sparse matrix of the second training speech, the linear estimation unit, and the minimum mean square error estimation unit are processed by a backpropagation algorithm to determine the target parameters in the linear estimation unit and the minimum mean square error estimation unit.
  • In some embodiments, the second training speech may be used as y, the initialized second sparse matrix as x_k, and the sparse matrix corresponding to the second training speech as x_(k+1), to train the neural network shown in FIG. 13 and determine the target parameters in the linear estimation unit and the minimum mean square error estimation unit.
  • step S0313 the speech to be enhanced is processed according to the target parameters in the linear estimation unit and the minimum mean square error estimation unit to determine the clean speech of the speech to be enhanced.
  • In some embodiments, the initialized second sparse matrix can be input as x_k and the noisy speech as the speech signal y, whereby the optimal sparse matrix of the speech to be enhanced is determined.
  • the technical solution provided in this embodiment can be represented by the following cyclic process.
  • Step 2: Obtain the second training speech y and its corresponding sparse matrix x_(k+1).
  • Step 3: Determine the parameters γ_k, p, σ of the feedforward neural network through a backpropagation algorithm.
  • Step 4: Update x_k according to the γ_k, p, σ determined in the present cycle to obtain the x_k of the next cycle. Repeat steps 2 to 4 until the number of iterations reaches the preset threshold, or until the expected error between the network output and the target sparse matrix falls below a preset value.
  • the noisy speech can be used as the input speech y input value in the neural network structure to determine the optimal sparse matrix (ie, the second sparse matrix) of the speech to be enhanced.
  • This embodiment uses the trained soft-threshold iterative algorithm to determine the sparse matrix from the noisy speech. On the one hand, it solves the Lasso problem in the regression process; on the other hand, because the method learns its parameters with a neural network, it converges relatively quickly. In addition, since this method has fewer trainable parameters, its training is easier and more stable.
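  • A compact TISTA-style sketch of formulas (19) to (25) under a Bernoulli-Gaussian prior follows; γ_k is held fixed here instead of being learned, and D, y, σ², p, and α² are illustrative assumptions rather than values from this application.

```python
import numpy as np

# TISTA-style iteration implementing formulas (19)-(25) on illustrative data.
rng = np.random.default_rng(6)
M, N, p, alpha2, sigma2, eps = 40, 120, 0.05, 1.0, 0.01, 1e-9
D = rng.standard_normal((M, N)) / np.sqrt(M)
y = rng.standard_normal(M)
W = D.T @ np.linalg.inv(D @ D.T)  # W = D^T (D D^T)^(-1)

def gauss(x, v):                  # formula (25): Gaussian density F(x; v)
    return np.exp(-x**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def eta_mmse(r, tau2):            # formula (22): MMSE shrinkage function
    xi = alpha2 + tau2            # formula (24)
    num = p * gauss(r, xi) * r * alpha2 / xi
    den = (1 - p) * gauss(r, tau2) + p * gauss(r, xi)
    return num / den

x, gamma = np.zeros(N), 1.0       # gamma_k would normally be learned per layer
trDtD, trWWt = np.trace(D.T @ D), np.trace(W @ W.T)
for _ in range(10):
    res = y - D @ x
    v2 = max((res @ res - M * sigma2) / trDtD, eps)   # formula (23)
    r = x + gamma * (W @ res)                         # formula (20)
    tau2 = ((v2 / N) * (N + (gamma**2 - 2 * gamma) * M)
            + (gamma**2 * sigma2 / N) * trWWt)        # formula (21)
    x = eta_mmse(r, tau2)                             # formula (19)
print(np.count_nonzero(np.abs(x) > 1e-3), "significant coefficients")
```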
  • Since the speech signal to be enhanced is a time-sequence signal, in some embodiments the speech to be enhanced can be Fourier transformed to obtain the first spectrogram of the speech to be enhanced, so that depth expansion is performed on the first spectrogram.
  • In some embodiments, the phase information of the speech to be enhanced can be obtained; the second sparse matrix and the depth dictionary are used to determine the second spectrogram of the speech to be enhanced; and finally the phase information and the second spectrogram are superimposed to obtain the second spectrogram including the phase information.
  • the second spectrogram including phase information may also be subjected to inverse Fourier transform to determine the clean speech of the speech to be enhanced in the time domain.
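  • The following sketch shows this magnitude/phase pipeline using scipy's STFT; enhance_magnitude is a hypothetical stand-in for the sparse-coding step (solving formula (14) and applying formula (15)), not the application's implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance_magnitude(mag):
    # Placeholder: a real implementation would solve formula (14) per frame
    # against the depth dictionary D and return D @ X per formula (15).
    return np.maximum(mag - 0.01, 0.0)  # illustrative spectral floor only

fs = 16000
noisy = np.random.default_rng(7).standard_normal(fs)  # 1 s of stand-in "noisy speech"

f, t, Y = stft(noisy, fs=fs, nperseg=512)
mag, phase = np.abs(Y), np.angle(Y)          # first spectrogram + phase information
enhanced = enhance_magnitude(mag)            # second spectrogram (amplitude only)
Y_hat = enhanced * np.exp(1j * phase)        # superimpose the noisy phase
_, clean = istft(Y_hat, fs=fs, nperseg=512)  # inverse transform to the time domain
```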
  • Fig. 14 is a schematic diagram showing an effect of speech enhancement according to an exemplary embodiment.
  • the voice enhancement method provided by the above embodiment can perform voice enhancement processing on the voice signal sent by the sender (or sent by the receiver) to remove environmental noise, so that the receiver and sender can perform high-quality voice communication.
  • Fig. 15 is a schematic diagram of a game speech engine according to an exemplary embodiment.
  • the voice enhancement method provided in the embodiments of the present application may be applied to the game field, and the application process may include the following steps.
  • First, a clean speech database is generated: the recorders cover boys of different ages and girls of different ages; common texts in the game are selected and recorded in a quiet environment (environmental noise below 30 decibels). Short-time Fourier transform is performed on the data of the clean speech database one by one, and only the amplitude-spectrum information is retained, to obtain the two-dimensional spectrogram Y ∈ R^(M×N). Then, using the restricted Boltzmann machine (RBM) combined with the deep semi-non-negative matrix factorization method, the speech enhancement method provided by the embodiments of this application finally generates a depth dictionary D suitable for the game speech scene.
  • the microphone starts to collect sound.
  • the module will load the generated depth dictionary D, and Short-time Fourier transform is performed on the noisy speech, and the two-dimensional amplitude spectrum Y ⁇ R M ⁇ N after Fourier transform is used as the noisy speech signal, and the three schemes proposed in the embodiments of this application (Scheme 1 ISTA, Scheme 2 LISTA, Scheme 3 TISA) confirm the best sparse matrix X from the noisy speech signal Y on the dictionary D, and finally DX is the enhanced amplitude spectrum, combined with the phase spectrum of the noisy speech Y
  • the enhanced voice can be obtained and transmitted to the next processing module of the game voice engine. After being encoded, it is finally sent to the receiving end through the network, and the final voice received by the game receiver Clean, clear and understandable.
  • Fig. 16 is a block diagram of a speech enhancement apparatus shown in an embodiment of this application.
  • The speech enhancement apparatus 1600 provided by the embodiment of this application may include: a sample acquisition module 1601, a decomposition module 1602, a visible-layer reconstruction module 1603, and a depth dictionary acquisition module 1604.
  • The sample acquisition module 1601 may be configured to obtain clean speech samples; the decomposition module 1602 may be configured to decompose the clean speech samples to obtain a first sparse matrix and m basis matrices, wherein the values in the first sparse matrix are all positive and m is a positive integer greater than 1.
  • In some embodiments, the target neural network includes a first bias term.
  • The visible-layer reconstruction module 1603 may include: a visible-layer conditional probability determination unit and a visible-layer neuron state variable determination unit.
  • The visible-layer conditional probability determination unit may be configured to determine the visible-layer conditional probability of the target neural network according to the first sparse matrix, the weight matrix of the target neural network, and the first bias term;
  • the visible-layer neuron state variable determination unit may be configured to determine the visible-layer neuron state variable according to the visible-layer conditional probability.
  • In some embodiments, the target neural network further includes a second bias term.
  • The depth dictionary acquisition module 1604 may include: a first conditional probability determination unit, a second conditional probability determination unit, and a weight update unit.
  • The first conditional probability determination unit may be configured to determine the first hidden-layer conditional probability of the target neural network according to the weight matrix, the clean speech sample, and the second bias term;
  • the second conditional probability determination unit may be configured to determine the second hidden-layer conditional probability of the target neural network according to the weight matrix, the visible-layer neuron state vector, and the second bias term;
  • the weight update unit may be configured to update the weight matrix according to the first hidden-layer conditional probability, the second hidden-layer conditional probability, the clean speech sample, and the visible-layer neuron state vector.
  • The depth dictionary acquisition module 1604 may further include: a first bias term update unit and a second bias term update unit.
  • The first bias term update unit may be configured to update the first bias term according to the clean speech sample and the visible-layer neuron state vector;
  • the second bias term update unit may be configured to update the second bias term according to the first hidden-layer conditional probability and the second hidden-layer conditional probability.
  • In some embodiments, the m basis matrices include a first basis matrix and a second basis matrix.
  • The decomposition module 1602 may include: a first decomposition unit, a second decomposition unit, a basis matrix variable determination unit, a first update unit, and a second update unit.
  • The first decomposition unit may be configured to determine the first basis matrix and a first target matrix by performing semi-non-negative matrix factorization on the clean speech sample;
  • the second decomposition unit may be configured to initialize the second basis matrix and a second target matrix by performing semi-non-negative matrix factorization on the first target matrix;
  • the basis matrix variable determination unit may be configured to determine a basis matrix variable according to the first basis matrix and the second basis matrix;
  • the first update unit may be configured to process the basis matrix variable, the clean speech sample, and the second target matrix through a basis matrix update function, to update the second basis matrix;
  • the second update unit may be configured to process the basis matrix variable and the clean speech sample through a sparse matrix update function, to update the second target matrix, where the second target matrix is the first sparse matrix.
  • The embodiments of this application also provide another speech enhancement apparatus, which may include: a to-be-enhanced speech acquisition module, a depth dictionary determination module, a second sparse matrix acquisition module, and a clean speech acquisition module.
  • The to-be-enhanced speech acquisition module may be configured to obtain the speech to be enhanced;
  • the depth dictionary determination module may be configured to obtain a depth dictionary for speech enhancement according to any one of the above-mentioned methods;
  • the second sparse matrix acquisition module may be configured to perform deep unfolding on the speech to be enhanced according to the depth dictionary, to determine a second sparse matrix of the speech to be enhanced;
  • the clean speech acquisition module may be configured to determine the clean speech of the speech to be enhanced according to the second sparse matrix and the depth dictionary.
  • In some embodiments, the second sparse matrix acquisition module may include: an initialization unit, a first soft threshold determination unit, and a second sparse matrix update unit.
  • The initialization unit may be configured to obtain a second sparse matrix of the speech to be enhanced;
  • the first soft threshold determination unit may be configured to determine a first soft threshold according to the depth dictionary and the second sparse matrix;
  • the second sparse matrix update unit may be configured to update the second sparse matrix according to the second sparse matrix, the depth dictionary, and the first soft threshold.
  • In some embodiments, the second sparse matrix acquisition module may include: a first training speech acquisition unit, a first initialization unit, a first back-propagation unit, and a first determination unit.
  • The first training speech acquisition unit may be configured to obtain a first training speech and its corresponding sparse matrix; the first initialization unit may be configured to initialize a second sparse matrix of the speech to be enhanced; the first back-propagation unit may be configured to train the target feedforward neural network through a back-propagation algorithm on the first training speech, the sparse matrix corresponding to the first training speech, and the initialized second sparse matrix, to determine the first target parameter, the second target parameter, and the second soft threshold of the feedforward neural network; the first determination unit may be configured to process the speech to be enhanced according to the first target parameter, the second target parameter, and
  • the second soft threshold of the feedforward neural network, to determine a second sparse matrix of the speech to be enhanced.
  • In some embodiments, the second sparse matrix acquisition module may include: a second training speech acquisition unit, a second initialization unit, a linear estimation unit determination unit, a minimum mean square error estimation unit, a second back-propagation unit, and a second determination unit.
  • The second training speech acquisition unit may be configured to obtain a second training speech and its corresponding sparse matrix; the second initialization unit may be configured to obtain a second sparse matrix of the speech to be enhanced; the linear estimation unit determination unit may be configured to determine a linear estimation unit according to the depth dictionary, the second training speech, and the sparse matrix corresponding to the second training speech; the minimum mean square error estimation unit may be configured to be determined according to the depth dictionary and
  • the second training speech;
  • the second back-propagation unit may be configured to process, through a back-propagation algorithm, the second sparse matrix, the second training speech, the
  • sparse matrix of the second training speech, the linear estimation unit, and the minimum mean square error estimation unit, to determine the target parameters in the linear estimation unit and the minimum mean square error estimation unit;
  • the second determination unit may be configured to process the speech to be enhanced according to the target parameters in the linear estimation unit and the minimum mean square error estimation unit, to determine the clean speech of the speech to be enhanced.
  • In some embodiments, the second sparse matrix acquisition module further includes a Fourier transform unit.
  • The Fourier transform unit may be configured to perform a Fourier transform on the speech to be enhanced to obtain the first spectrogram of the speech to be enhanced, so that deep unfolding is performed on this first spectrogram.
  • In some embodiments, the clean speech acquisition module may include: a phase information acquisition module, a second spectrogram acquisition unit, and a phase superposition unit.
  • The phase information acquisition module may be configured to obtain phase information of the speech to be enhanced;
  • the second spectrogram acquisition unit may be configured to determine the second spectrogram of the speech to be enhanced according to the second sparse matrix and the depth dictionary;
  • the phase superposition unit may be configured to superimpose the phase information on the second spectrogram, to obtain a second spectrogram including phase information.
  • In some embodiments, the clean speech acquisition module may further include an inverse Fourier transform unit.
  • The inverse Fourier transform unit may be configured to perform an inverse Fourier transform on the second spectrogram including phase information, to determine the clean speech of the speech to be enhanced.
  • The example embodiments described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solutions of the embodiments of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, USB flash drive, removable hard disk, etc.) and includes several instructions for causing a computing device (which may be a personal computer, a server, a mobile terminal, or a smart device, etc.) to execute the method according to the embodiments of this application, such as one or more of the steps shown in Fig. 3.

Abstract

A speech enhancement method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: obtaining a clean speech sample (S1); decomposing the clean speech sample to obtain a first sparse matrix and m basis matrices, where the values in the first sparse matrix are all positive and m is a positive integer greater than 1 (S2); obtaining a visible-layer neuron state vector of a target neural network according to the first sparse matrix and a weight matrix of the target neural network (S3); and updating the weight matrix according to the visible-layer neuron state vector and the clean speech sample, to obtain a depth dictionary for speech enhancement (S4).

Description

Speech enhancement method and apparatus, electronic device, and computer-readable storage medium
Cross-Reference to Related Applications
This application is based on and claims priority to Chinese Patent Application No. 202010085323.0, filed on February 10, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of speech noise reduction, and in particular to a speech enhancement method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Speech is one of the most convenient and natural tools of human communication: it removes the distance barrier in person-to-person communication and improves the efficiency of human-machine interaction. However, the ubiquitous noise of real environments degrades the quality of voice communication to varying degrees. For example, in rich game scenarios, when a user uses game voice in a noisy environment the microphone picks up all kinds of environmental noise, and in multi-player team voice, noise interference on one member's side affects the call quality of the whole team.
Traditional speech noise-reduction algorithms are based on statistical analysis and assume that the noise signal changes slowly relative to the speech. However, because noise in complex environments is so varied, the algorithms do not perform as expected when the assumption does not match reality.
Summary
The embodiments of this application provide a speech enhancement method and apparatus, an electronic device, and a computer-readable storage medium, which can provide a depth dictionary that characterizes the deep features of clean speech, so as to better perform speech enhancement on noisy speech.
An embodiment of this application provides a speech enhancement method, including: obtaining a clean speech sample; decomposing the clean speech sample to obtain a first sparse matrix and m basis matrices, where the values in the first sparse matrix are all positive and m is a positive integer greater than 1; obtaining a visible-layer neuron state vector of a target neural network according to the first sparse matrix and a weight matrix of the target neural network; and updating the weight matrix according to the visible-layer neuron state vector and the clean speech sample, to obtain a depth dictionary for speech enhancement.
An embodiment of this application provides another speech enhancement method, including: obtaining speech to be enhanced; obtaining a depth dictionary for speech enhancement according to any one of the above methods; performing deep unfolding on the speech to be enhanced according to the depth dictionary, to determine a second sparse matrix of the speech to be enhanced; and determining clean speech of the speech to be enhanced according to the second sparse matrix and the depth dictionary.
An embodiment of this application further provides a speech enhancement apparatus, including: a sample acquisition module, a decomposition module, a visible-layer reconstruction module, and a depth dictionary acquisition module.
The sample acquisition module may be configured to obtain a clean speech sample; the decomposition module may be configured to decompose the clean speech sample to obtain a first sparse matrix and m basis matrices, where the values in the first sparse matrix are all positive and m is a positive integer greater than 1; the visible-layer reconstruction module may be configured to obtain a visible-layer neuron state vector of a target neural network according to the first sparse matrix and a weight matrix of the target neural network; and the depth dictionary acquisition module may be configured to update the weight matrix according to the visible-layer neuron state vector and the clean speech sample, to obtain a depth dictionary for speech enhancement.
An embodiment of this application further provides an electronic device, including: one or more processors; and a memory configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any one of the speech enhancement methods described above.
An embodiment of this application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements any one of the speech enhancement methods described above.
With the speech enhancement method, apparatus, electronic device, and computer-readable storage medium provided by the embodiments of this application, a first sparse matrix that represents clean speech as concisely as possible is obtained through deep decomposition of the clean speech, and this first sparse matrix is then introduced into the hidden layer of a target neural network to complete its training, yielding a depth dictionary that characterizes the deep information of the clean speech signal. On the one hand, compared with training a neural network in the related art to determine the mapping between noisy and clean signals, this solution needs only clean speech signals to obtain the depth dictionary for speech enhancement, so it generalizes better; on the other hand, by decomposing clean speech, the technical solution of this application obtains a depth dictionary that characterizes the deep features of the clean speech signal and can learn deeper representations of clean speech.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the system architecture of a speech enhancement system provided by an embodiment of this application.
Fig. 2 is a schematic structural diagram of a computer system applied to a speech enhancement apparatus provided by an embodiment of this application.
Fig. 3 is a flowchart of a speech enhancement method provided by an embodiment of this application.
Fig. 4 is a flowchart of step S2 of Fig. 3 in an exemplary embodiment.
Fig. 5 is a flowchart of step S3 of Fig. 3 in an exemplary embodiment.
Fig. 6 is a flowchart of step S4 of Fig. 3 in an exemplary embodiment.
Fig. 7 is a flowchart of step S4 of Fig. 3 in an exemplary embodiment.
Fig. 8 is a flowchart of a speech enhancement method provided by an embodiment of this application.
Fig. 9 is a flowchart of step S03 of Fig. 8 in an exemplary embodiment.
Fig. 10 is a flowchart of step S03 of Fig. 8 in an exemplary embodiment.
Fig. 11 is a schematic structural diagram of a learned iterative soft-thresholding algorithm provided by an embodiment of this application.
Fig. 12 is a flowchart of step S03 of Fig. 8 in an exemplary embodiment.
Fig. 13 is a schematic structural diagram of a trainable iterative thresholding algorithm provided by an embodiment of this application.
Fig. 14 is a schematic diagram of a speech enhancement effect provided by an embodiment of this application.
Fig. 15 shows a game speech engine provided by an embodiment of this application.
Fig. 16 is a block diagram of a speech enhancement apparatus provided by an embodiment of this application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in many forms and should not be construed as limited to those set forth herein; rather, these embodiments are provided so that this application will be thorough and complete and will fully convey the concepts of the example embodiments to those skilled in the art. The same reference numerals in the figures denote the same or similar parts, and repeated descriptions of them are omitted.
The features, structures, or characteristics described in this application may be combined in one or more embodiments in any suitable manner. In the following description, many details are provided to give a full understanding of the embodiments of this application. However, those skilled in the art will realize that the technical solutions of this application may be practiced while omitting one or more of the specific details, or other methods, components, apparatuses, steps, and the like may be adopted. In other cases, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring aspects of this application.
The accompanying drawings are merely schematic illustrations of this application; the same reference numerals in the figures denote the same or similar parts, and repeated descriptions of them are omitted. Some of the block diagrams shown in the figures do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.
The flowcharts shown in the figures are only exemplary illustrations; they need not include all contents and steps, nor be executed in the order described. For example, some steps can be further decomposed while others can be merged or partially merged, so the actual execution order may change according to the actual situation.
In this specification, the terms "one", "a", "the", "said", and "at least one" indicate the presence of one or more elements/components/etc.; the terms "comprise", "include", and "have" are open-ended and mean that additional elements/components/etc. may exist besides those listed; and the terms "first", "second", "third", etc. are used only as labels, not as limits on the number of their objects.
Example embodiments of this application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of the system architecture 100 of a speech enhancement system provided by an embodiment of this application.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104, and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links or fiber-optic cables.
Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. The terminal devices 101, 102, 103 may be various electronic devices with display screens that support web browsing, including but not limited to smartphones, tablets, laptops, desktop computers, wearable devices, virtual-reality devices, smart-home devices, and so on.
The server 105 may be a server providing various services, for example a back-office management server that supports the apparatuses operated by users with the terminal devices 101, 102, 103. The back-office management server may analyze and otherwise process received data such as requests and feed the processing results back to the terminal devices.
The server 105 may, for example, obtain a clean speech sample; decompose the clean speech sample to obtain a first sparse matrix and m basis matrices, where the values in the first sparse matrix are all positive and m is a positive integer greater than 1; obtain a visible-layer neuron state vector of a target neural network according to the first sparse matrix and a weight matrix of the target neural network; and update the weight matrix according to the visible-layer neuron state vector and the clean speech sample, to obtain a depth dictionary for speech enhancement.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. The server 105 may be a physical server, such as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud-computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain-name services, security services, and big-data and artificial-intelligence platforms.
Referring now to Fig. 2, it shows a schematic structural diagram of a computer system 200 suitable for implementing a terminal device of an embodiment of this application. The terminal device shown in Fig. 2 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of this application.
As shown in Fig. 2, the computer system 200 includes a central processing unit (CPU) 201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 202 or a program loaded from a storage portion 208 into a random access memory (RAM) 203. The RAM 203 also stores various programs and data required for the operation of the system 200. The CPU 201, the ROM 202, and the RAM 203 are connected to one another through a bus 204. An input/output (I/O) interface 205 is also connected to the bus 204.
The following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, etc.; an output portion 207 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, etc.; a storage portion 208 including a hard disk, etc.; and a communication portion 209 including a network interface card such as a LAN card or a modem. The communication portion 209 performs communication processing via a network such as the Internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 210 as needed, so that the computer program read from it is installed into the storage portion 208 as needed.
The processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of this application include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from the network through the communication portion 209, and/or installed from the removable medium 211. When the computer program is executed by the central processing unit (CPU) 201, the above functions defined in the system of this application are performed.
It should be noted that the computer-readable storage medium shown in this application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. Examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this application, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable storage medium may be transmitted by any appropriate medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the above.
The flowcharts and block diagrams in the figures illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. Note also that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules and/or units described in the embodiments of this application may be implemented in software or in hardware. The described modules and/or units may also be provided in a processor; for example, a processor may be described as including a sending unit, an obtaining unit, a determining unit, and a first processing unit. The names of these modules and/or units do not, in some cases, limit the modules and/or units themselves.
An embodiment of this application further provides a computer-readable storage medium, which may be included in the device described in the above embodiments or may exist separately without being assembled into the device. The computer-readable storage medium carries one or more programs which, when executed by the device, enable the device to implement functions including: obtaining a clean speech sample; decomposing the clean speech sample to obtain a first sparse matrix and m basis matrices, where the values in the first sparse matrix are all positive and m is a positive integer greater than 1; obtaining a visible-layer neuron state vector of a target neural network according to the first sparse matrix and a weight matrix of the target neural network; and updating the weight matrix according to the visible-layer neuron state vector and the clean speech sample, to obtain a depth dictionary for speech enhancement.
Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, AI is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, giving machines the functions of perception, reasoning, and decision-making.
AI technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning / deep learning.
The key technologies of speech technology are automatic speech recognition (ASR), text-to-speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is one of the most promising modes of human-computer interaction.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. ML is the core of AI and the fundamental way to make computers intelligent; its applications span all fields of AI. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The solutions provided by the embodiments of this application involve AI technologies such as speech enhancement and machine learning, and are described through the following embodiments.
Fig. 3 is a flowchart of a speech enhancement method provided by this application. The method provided by the embodiments of this application can be processed by any electronic device with computing capability, such as the server 105 and/or the terminal devices 102 and 103 in the embodiment of Fig. 1 above; in the following embodiments, the server 105 is taken as the execution subject for illustration.
Referring to Fig. 3, the speech enhancement method provided by the embodiments of this application may include the following steps.
Speech enhancement refers to the technology of extracting a useful speech signal from the noise background, and suppressing and reducing the noise interference, when the speech signal is disturbed or even drowned by various kinds of noise. In short, speech enhancement extracts the original speech, as clean as possible, from noisy speech.
In the related art, to achieve speech enhancement of noisy speech, a neural network model is usually trained in a supervised fashion with noisy speech and its corresponding clean speech, to determine the mapping between the noisy speech and the clean speech. However, this speech enhancement method requires collecting a large number of clean speech signals and noise signals during training, and the collection of noise signals is laborious, which hinders the efficiency of speech enhancement. In addition, the generalization ability of a neural network model trained on noisy and clean speech is mediocre: when the noise in the test set differs greatly from the noise in the training set, the noise-reduction ability of the model drops significantly. Moreover, neural networks are poorly interpretable and cannot suitably explain the speech enhancement process.
In step S1, a clean speech sample is obtained.
Clean speech may refer to pure speech signals that contain no noise.
In some embodiments, some clean speech may be obtained as training samples of the target neural network. For example, original in-game speech (without environmental noise) may be obtained as training samples of the target neural network.
In the field of speech technology, a speech signal can usually be expressed by formula (1):

Y = DX + n          (1)

where Y ∈ R^{M×N} is the observed speech, X ∈ R^{M×N} is the sparse matrix, D ∈ R^{M×N} is the dictionary, n is noise, M is the number of rows of the speech signal, N is the number of columns of the speech, M and N are both positive integers greater than or equal to 1, and R denotes the real number field.
In some embodiments, the speech signal can be sparsely decomposed through formula (1). Sparse decomposition of a speech signal may refer to expressing most or all of the original signal Y with a linear combination of relatively few basic signals, according to a given overcomplete dictionary D, thereby obtaining a more concise representation of the signal; the above basic signals may be called atoms.
Signal sparse decomposition based on an overcomplete dictionary is a new theory of signal representation. It replaces the traditional orthogonal basis functions with an overcomplete redundant function system and provides great flexibility for signal-adaptive sparse expansion. Sparse decomposition achieves efficient data compression and, more importantly, can use the redundancy of the dictionary to capture the intrinsic essential features of the signal.
In step S2, the clean speech sample is decomposed to obtain a first sparse matrix and m basis matrices, where the values in the first sparse matrix are all positive and m is a positive integer greater than 1.
To overcome the drawback that traditional dictionary learning can only learn the shallow features of a speech signal, this embodiment creatively proposes to extract, from clean speech, a depth dictionary that characterizes the deep features of the speech signal, so as to enhance noisy speech. For example, a deep semi-non-negative matrix factorization (deep semi-NMF) can be introduced into the hidden layer of a restricted Boltzmann machine (RBM) to train a depth dictionary that characterizes the features of the speech signal.
The deep semi-non-negative matrix factorization can be expressed by the following formula:

Y^± ≈ Z_1^± Z_2^± ⋯ Z_m^± H_m^+          (2)

where Y^± is the clean speech sample (usable as the visible-layer observed neuron state variable in the RBM), Z_1^±, Z_2^±, …, Z_m^± are the basis matrices, and H_m^+ is the sparse matrix (usable as the hidden-layer neuron state variable in the RBM). The superscript ± indicates that the values in the matrix may be positive or negative, the superscript + indicates that the values in the matrix are restricted to positive numbers, and m is a positive integer greater than or equal to 1.
In some embodiments, the clean speech sample can be deeply decomposed through formula (2) to obtain the first sparse matrix H_m^+ and the m basis matrices Z_1^±, …, Z_m^±. It is understood that a deep non-negative matrix factorization may likewise be applied to deeply decompose the clean speech sample to obtain the first sparse matrix and the m basis matrices.
In step S3, a visible-layer neuron state vector of the target neural network is obtained according to the first sparse matrix and the weight matrix of the target neural network.
In some embodiments, the target neural network may be a restricted Boltzmann machine, which may include a visible layer and a hidden layer. A single element w_ij of the weight matrix of the restricted Boltzmann machine can specify the weight of the edge between hidden-layer unit h_j and visible-layer unit v_i; in addition, each visible-layer unit v_i may have a first bias term a_i, and each hidden-layer unit h_j may have a second bias term b_j.
In some embodiments, the restricted Boltzmann machine may be used to process the clean speech sample to obtain a depth dictionary h that characterizes the intrinsic essential features of the clean speech sample (which may, for example, be any one of H_1, H_2, H_3, …, H_m in formula (2)), and the depth dictionary h may be used as the neuron state variable of the hidden layer of the restricted Boltzmann machine.
In some embodiments, before the restricted Boltzmann machine is trained, its weight matrix W, first bias term a, and second bias term b may first be initialized, and the visible-layer neuron state vector v* of the target neural network may be reconstructed according to the initialized W, a, and b and the depth dictionary h, so that the parameters of the target neural network model can be updated according to the reconstructed visible-layer neuron state variable v* and the initial visible-layer neuron state variable v determined from the clean speech sample.
In step S4, the weight matrix is updated according to the visible-layer neuron state vector and the clean speech sample, to obtain a depth dictionary for speech enhancement.
In some embodiments, a short-time Fourier transform may be applied to the clean speech sample to obtain its spectrogram, and the spectrogram of the clean speech sample may be taken as the visible-layer observed neuron state variable v of the restricted Boltzmann machine.
In some embodiments, the weight matrix W, the first bias term a, and the second bias term b of the target neural network can be updated backward through the visible-layer observed neuron state variable v and the visible-layer neuron state variable v*.
In some embodiments, different clean speech samples may be used to train the target neural network until the training criterion is met.
In some embodiments, the weight matrix W of the target neural network obtained after training can be the finally determined depth dictionary D.
The speech enhancement method provided by the embodiments of this application obtains, through deep decomposition of clean speech, a first sparse matrix that represents the clean speech as concisely as possible, and then introduces this first sparse matrix into the hidden layer of the target neural network to complete its training, obtaining a depth dictionary that characterizes the deep information of the clean speech signal. On the one hand, compared with the supervised speech enhancement method in the related art, which collects a large amount of noisy speech and its corresponding clean speech to train a neural network and determine the mapping between the noisy and clean signals, this solution needs only clean speech signals to obtain the depth dictionary for speech enhancement, so it generalizes better; and since this solution only needs to collect clean speech, it can also improve the efficiency of speech enhancement compared with the above supervised method. On the other hand, by decomposing clean speech, the embodiments of this application obtain a depth dictionary that characterizes the deep features of the clean speech signal and can learn deeper representations of clean speech.
Fig. 4 is a flowchart of step S2 of Fig. 3 in an exemplary embodiment.
In some embodiments, deep semi-non-negative matrix factorization may be used to process the clean speech sample, to obtain a first sparse matrix that characterizes the deep features of the clean speech sample.
Solving the clean speech with a semi-non-negative matrix factorization can, on the one hand, reduce the dimensionality of the huge volume of speech data; on the other hand, if the sparse matrix obtained after decomposition contained negative numbers, no real meaning could be attached to them in practical problems, whereas the semi-non-negative solution guarantees that the values in the final sparse matrix are all positive, which makes them easy to interpret in practical applications. In addition, decomposing the speech signal with a deep semi-non-negative matrix factorization also yields the deep features of the speech signal and can describe the speech signal better.
In some embodiments, the deep semi-non-negative matrix factorization can be described by formula (2). In some embodiments, the solution process of the deep semi-non-negative matrix factorization may include an initialization process and an iterative update process.
Formula (3) describes the initialization process of the deep semi-non-negative matrix factorization:

Y^± ≈ Z_1^± H_1^+,  H_1^+ ≈ Z_2^± H_2^+,  …,  H_{m−1}^+ ≈ Z_m^± H_m^+          (3)

where Y^± is the clean speech sample (i.e., the visible layer in the RBM), the Z_i^± are the basis matrices, and H_m^+ is the sparse matrix (i.e., the hidden layer in the RBM); the superscript ± indicates that the values in the matrix may be positive or negative, the superscript + indicates that the values are restricted to positive numbers, and m is a positive integer greater than or equal to 1.
In some embodiments, the initialization process of the deep semi-non-negative matrix factorization may include the following steps: decompose Y^± by the semi-non-negative matrix factorization method to obtain Z_1^± and H_1^+; decompose H_1^+ by the semi-non-negative matrix factorization method to obtain Z_2^± and H_2^+; and so on, until all Z_i^± and H_i^+ are obtained (i being a positive integer greater than or equal to 1 and less than or equal to m).
In some embodiments, assuming that m equals 2, the m basis matrices may include a first basis matrix and a second basis matrix, and decomposing the clean speech sample to obtain the sparse matrix and the m basis matrices may include the following steps.
In step S21, the first basis matrix and a first target matrix are determined by performing semi-non-negative matrix factorization on the clean speech sample.
In step S22, the second basis matrix and a second target matrix are initialized by performing semi-non-negative matrix factorization on the first target matrix.
In some embodiments, after the m basis matrices of the deep semi-non-negative matrix factorization and the corresponding target matrices have been initialized, the optimal solution can be determined through iterative updates and taken as the first sparse matrix.
In some embodiments, the error function shown in formula (4) can be established from the above deep semi-non-negative matrix factorization:

C_deep = ½ ‖ Y − Z_1 Z_2 ⋯ Z_m H_m ‖_F²          (4)

where C_deep denotes the error function, Y denotes the clean speech sample, Z_1, Z_2, …, Z_m denote the m basis matrices, and H_m denotes the first sparse matrix, m being a positive integer greater than or equal to 1.
Taking the partial derivative of the error function with respect to Z_i (1 ≤ i ≤ m), the point where the partial derivative is 0 is the best solution of Z_i. Setting ∂C_deep/∂Z_i = 0 and solving gives

Z_i = ψ_i^† Y H_i^†          (5)

where ψ_i = Z_1 Z_2 ⋯ Z_{i−1}, and the matrix superscript † denotes the Moore-Penrose generalized inverse (written as a superscript + in the original formulas).
Using convex optimization theory, the update rule of the sparse matrix H_i can be obtained:

H_i ← H_i ⊙ √( ([φ_i^T Y]^pos + [φ_i^T φ_i]^neg H_i) ⊘ ([φ_i^T Y]^neg + [φ_i^T φ_i]^pos H_i) ),  with φ_i = ψ_i Z_i          (6)

where ⊙ and ⊘ denote element-wise multiplication and division; the matrix superscript pos means keeping all positive elements of the matrix and setting all negative elements to 0, and the superscript neg means keeping the magnitudes of all negative elements and setting all positive elements to 0, so that A = A^pos − A^neg.
It can thus be seen that the update-iteration process for solving the deep semi-non-negative matrix factorization may include the following steps.
In step S23, a basis matrix variable is determined according to the first basis matrix and the second basis matrix.
In some embodiments, each basis matrix variable ψ_i can be determined according to the formula ψ_i = Z_1 Z_2 ⋯ Z_{i−1}.
In step S24, the basis matrix variable, the clean speech sample, and the second target matrix are processed through the basis matrix update function to update the second basis matrix.
In some embodiments, each basis matrix Z_i can be iteratively updated according to formula (5).
In step S25, the basis matrix variable and the clean speech sample are processed through the sparse matrix update function to update the second target matrix, the second target matrix being the first sparse matrix.
In some embodiments, each target matrix H_i can be iteratively updated according to formula (6) until the number of iterations reaches a preset value or the error function is smaller than a preset error value, and the target matrix H_m is then output as the first sparse matrix.
The embodiments of this application can process the clean speech sample through the deep semi-non-negative matrix factorization to determine a first sparse matrix that characterizes the deep features of the clean speech.
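For reference, the initialization and update loop described by formulas (3)-(6) can be summarized in code. The following NumPy sketch is illustrative only: the SVD-based initialization stands in for a proper semi-NMF initializer, and all function names are assumptions of this sketch rather than part of the original disclosure.

```python
import numpy as np

def pos(A):
    """Positive part [A]^pos: keep positive entries, zero out the rest."""
    return np.maximum(A, 0.0)

def neg(A):
    """Magnitude of the negative part [A]^neg, so that A = pos(A) - neg(A)."""
    return np.maximum(-A, 0.0)

def chain(mats, n):
    """Product of a list of matrices; identity of size n if the list is empty."""
    out = np.eye(n)
    for M in mats:
        out = out @ M
    return out

def deep_semi_nmf(Y, ranks, n_iter=100, eps=1e-8):
    """Sketch of formulas (3)-(6): Y ~ Z1 Z2 ... Zm Hm with Hm positive."""
    # initialization (formula (3)): peel the layers off one by one
    Zs, H = [], Y
    for r in ranks:
        U, s, Vt = np.linalg.svd(H, full_matrices=False)  # crude semi-NMF stand-in
        Zs.append(U[:, :r] * s[:r])
        H = pos(Vt[:r, :]) + eps
    # alternating updates (formulas (5) and (6))
    for _ in range(n_iter):
        for i in range(len(Zs)):
            psi = chain(Zs[:i], Y.shape[0])               # psi_i = Z1 ... Z_{i-1}
            Hi = chain(Zs[i + 1:], Zs[i].shape[1]) @ H    # deeper reconstruction of H_i
            Zs[i] = np.linalg.pinv(psi) @ Y @ np.linalg.pinv(Hi)   # formula (5)
        phi = chain(Zs, Y.shape[0])                       # Z1 Z2 ... Zm
        PtY, PtP = phi.T @ Y, phi.T @ phi
        H = H * np.sqrt((pos(PtY) + neg(PtP) @ H) /       # formula (6)
                        (neg(PtY) + pos(PtP) @ H + eps))
    return Zs, H  # H is the first sparse matrix
```

In this sketch `ranks` would be the sizes of the hidden representations at each of the m layers; stopping on the error function of formula (4) instead of a fixed iteration count would follow step S25 above.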
Fig. 5 is a flowchart of step S3 of Fig. 3 in an exemplary embodiment.
In some embodiments, the target neural network may be a restricted Boltzmann machine, which may include a first bias term a.
Referring to Fig. 5, step S3 above may include the following steps.
In step S31, the visible-layer conditional probability of the target neural network is determined according to the first sparse matrix, the weight matrix of the target neural network, and the first bias term.
In some embodiments, the visible-layer conditional probability of the target neural network can be determined according to formula (7):

p(v_i | h) = logistic( a_i + Σ_j W_ij h_j )          (7)

where logistic() may be an activation function, for example logistic(x) = 1 / (1 + e^{−x}); p(v_i | h) denotes the conditional probability of the visible layer given the hidden layer; v_i may denote the i-th visible-layer neuron state variable; h denotes the hidden-layer neuron state variable; a_i denotes the i-th first bias term of the visible layer; W_ij denotes the value in row i, column j of the weight matrix; and h_j denotes the j-th value of the hidden-layer neuron state variable of the target neural network (which is also the depth dictionary).
In step S32, the visible-layer neuron state variable is determined according to the visible-layer conditional probability.
In some embodiments, the visible-layer neuron state variable can be determined from the visible-layer conditional probability by random sampling. For example, generate random numbers r_i on [0, 1] and determine the visible-layer neuron state variable v* according to formula (8):

v_i^* = 1 if r_i < p(v_i | h);  v_i^* = 0 otherwise          (8)

The embodiments of this application can reversely determine the visible-layer neuron state variable from the hidden-layer neuron state variable based on the visible-layer conditional probability.
Fig. 6 is a flowchart of step S4 of Fig. 3 in an exemplary embodiment.
In some embodiments, the target neural network may be a restricted Boltzmann machine, which may include a second bias term b.
Referring to Fig. 6, step S4 above may include the following steps.
In step S41, the first hidden-layer conditional probability of the target neural network is determined according to the weight matrix, the clean speech sample, and the second bias term.
In some embodiments, the first hidden-layer conditional probability can be determined according to formula (9):

p(h_j | v) = logistic( b_j + Σ_i w_ij v_i )          (9)

where p(h_j | v) denotes the first hidden-layer conditional probability; logistic() may be an activation function, for example logistic(x) = 1 / (1 + e^{−x}); p(h_j | v) may denote the conditional probability of the hidden layer given the visible layer; v may denote the visible-layer neuron state variable; h_j denotes the j-th hidden-layer neuron state variable; b_j denotes the j-th second bias term; W_ij denotes the value in row i, column j of the weight matrix; and v_i denotes the i-th value of the visible-layer neuron state variable of the target neural network.
In step S42, the second hidden-layer conditional probability of the target neural network is determined according to the weight matrix, the visible-layer neuron state vector, and the second bias term.
In some embodiments, the second hidden-layer conditional probability can be determined according to formula (10):

p(h_j | v*) = logistic( b_j + Σ_i w_ij v_i^* )          (10)

where p(h_j | v*) denotes the second hidden-layer conditional probability; logistic() may be an activation function, for example logistic(x) = 1 / (1 + e^{−x}); p(h_j | v*) may denote the conditional probability of the hidden layer given the visible layer; v* may denote the reconstructed visible-layer neuron state variable; h_j denotes the j-th hidden-layer neuron state variable; b_j denotes the j-th second bias term; W_ij denotes the value in row i, column j of the weight matrix; and v_i^* denotes the i-th value of the reconstructed visible-layer neuron state variable.
In step S43, the weight matrix is updated according to the first hidden-layer conditional probability, the second hidden-layer conditional probability, the clean speech sample, and the visible-layer neuron state vector.
In some embodiments, the weight matrix W can be updated according to formula (11):

W ← W + ε × ( p(h = 1 | v) v^T − p(h = 1 | v*) v*^T )          (11)

where p(h = 1 | v) denotes the first hidden-layer conditional probability, p(h = 1 | v*) denotes the second hidden-layer conditional probability, v^T denotes the transpose of the visible-layer neuron state variable determined from the clean speech sample, v*^T denotes the transpose of the reconstructed visible-layer neuron state variable, h denotes the hidden-layer neuron state variable, and ε denotes the learning rate.
Fig. 7 is a flowchart of step S4 of Fig. 3 in an exemplary embodiment. Referring to Fig. 7, step S4 above may include the following steps.
In step S44, the first bias term is updated according to the clean speech sample and the visible-layer neuron state vector.
In some embodiments, the first bias term a can be updated according to formula (12):

a ← a + ε × ( v − v* )          (12)

where ε denotes the learning rate, v denotes the visible-layer neuron state variable determined from the clean speech sample, and v* may denote the reconstructed visible-layer neuron state variable.
In step S45, the second bias term is updated according to the first hidden-layer conditional probability and the second hidden-layer conditional probability.
In some embodiments, the second bias term b can be updated according to formula (13):

b ← b + ε [ p(h = 1 | v) − p(h = 1 | v*) ]          (13)

where ε denotes the learning rate, p(h = 1 | v) denotes the first hidden-layer conditional probability, and p(h = 1 | v*) denotes the second hidden-layer conditional probability.
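Formulas (7)-(13) amount to one contrastive update of the RBM parameters. A minimal NumPy sketch follows; the variable names and the per-column treatment of the spectrogram are assumptions of this sketch, not part of the original disclosure.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_step(v, h, W, a, b, lr=0.01, rng=np.random.default_rng(0)):
    """One in-place update of W, a, b following formulas (7)-(13).

    v -- visible state taken from a column of the clean-speech spectrogram
    h -- hidden state taken from the first sparse matrix (used directly as
         the hidden-layer neuron state variable, per step S3)
    """
    p_v = logistic(a + W @ h)                                   # formula (7)
    v_star = (rng.random(p_v.shape) < p_v).astype(float)        # formula (8)
    p_h = logistic(b + W.T @ v)                                 # formula (9)
    p_h_star = logistic(b + W.T @ v_star)                       # formula (10)
    W += lr * (np.outer(v, p_h) - np.outer(v_star, p_h_star))   # formula (11)
    a += lr * (v - v_star)                                      # formula (12)
    b += lr * (p_h - p_h_star)                                  # formula (13)
    return W, a, b
```

After looping such steps over different clean speech samples until the training criterion is met, the learned weight matrix W plays the role of the depth dictionary D, as described above.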
Fig. 8 is a flowchart of a speech enhancement method shown in an embodiment of this application. The method provided by the embodiments of this application can be processed by any electronic device with computing capability, such as the server 105 and/or the terminal devices 102 and 103 in the embodiment of Fig. 1 above; in the following embodiments, the server 105 is taken as the execution subject for illustration.
Referring to Fig. 8, the speech enhancement method provided by the embodiments of this application may include the following steps.
In step S01, the speech to be enhanced is obtained.
In some embodiments, the speech to be enhanced may refer to a speech signal that includes noise.
In step S02, a depth dictionary for speech enhancement is obtained.
In some embodiments, a depth dictionary usable for speech enhancement can be obtained through the speech enhancement method described above.
In step S03, deep unfolding is performed on the speech to be enhanced according to the depth dictionary, to determine a second sparse matrix of the speech to be enhanced.
In some embodiments, obtaining the second sparse matrix X from a noisy speech Y_n can be expressed by the formula Y_n = DX + n, and the sparse representation problem can be expressed by formula (14):

min_X ½ ‖ Y_n − D X ‖_2² + λ ‖ X ‖_1          (14)

where Y_n = DX + n, Y_n denotes the speech to be enhanced, D denotes the depth dictionary, X denotes the second sparse matrix X, and λ is a preset parameter.
In some embodiments, to solve the least absolute shrinkage and selection operator (Lasso) problem, the iterative soft-thresholding algorithm (ISTA), the learned iterative soft-thresholding algorithm (LISTA), or the trained iterative soft-thresholding algorithm (TISTA) can be used to determine the sparse matrix of the noisy speech.
In step S04, the clean speech of the speech to be enhanced is determined according to the second sparse matrix and the depth dictionary.
In some embodiments, sparse learning can be performed on the noisy speech over the dictionary D learned from clean speech to obtain the second sparse matrix X, and the resulting DX is then taken as the final denoised speech.
In some embodiments, the clean speech Y* of the speech to be enhanced can be determined through formula (15):

Y* = D X          (15)

where D denotes the depth dictionary and X denotes the second sparse matrix of the speech to be enhanced.
The embodiments of this application accurately determine the sparse matrix of the noisy speech and, based on the sparse matrix and the depth dictionary of clean speech, accurately recover the clean speech from the noisy speech. The method is applicable to different speech signals and has strong generalization ability.
Fig. 9 is a flowchart of step S03 of Fig. 8 in an exemplary embodiment.
In some embodiments, to solve the Lasso problem in the linear regression process, the iterative soft-thresholding algorithm, the learned iterative soft-thresholding algorithm, or the trained iterative soft-thresholding algorithm can be used to determine the sparse matrix of the noisy speech. The Lasso problem mainly describes the constrained optimization problem in linear regression; it restricts linear regression with the l1 norm.
Referring to Fig. 9, step S03 above may include the following steps.
In some embodiments, the ISTA algorithm can be used to solve formula (14), for example iteratively through formula (16):

x_{k+1} = η_{λ/L}( x_k + (1/L) D^T ( y − D x_k ) )          (16)

where η_θ(x) = sign(x) · max(|x| − θ, 0) is the first soft threshold (the soft-thresholding function), and sign(x) can be defined as:

sign(x) = 1 if x > 0;  0 if x = 0;  −1 if x < 0          (17)

L is the largest eigenvalue of D^T D, λ is a preset parameter, and D is the depth dictionary.
In step S031, the second sparse matrix of the speech to be enhanced is obtained.
In some embodiments, the second sparse matrix of the speech to be enhanced can be initialized as the x_k of the first iteration; for example, the variables in x_k can be assigned arbitrarily.
In step S032, the first soft threshold is determined according to the depth dictionary and the second sparse matrix.
In some embodiments, the first soft threshold η_{λ/L} can be determined from the soft-thresholding function above, where L is the largest eigenvalue of D^T D (D being the depth dictionary).
In step S033, the second sparse matrix is updated according to the second sparse matrix, the depth dictionary, and the first soft threshold.
Updating the second sparse matrix according to the second sparse matrix, the depth dictionary D, and the first soft threshold may include the following steps:
Step 1: Initialize the second sparse matrix to obtain x_k, with k = 1.
Step 2: Determine x_{k+1} according to formula (16).
Step 3: Set k = k + 1 and return to Step 2, until the number of iterations reaches a preset threshold, or |x_k − x_{k+1}| < ε.
This embodiment determines the sparse matrix from the noisy speech through the iterative soft-thresholding algorithm, which solves the Lasso problem in the regression process.
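Steps 1-3 above map directly to code. Below is a minimal NumPy sketch of the ISTA iteration of formula (16) for a single observation vector y; applying it column by column to the magnitude spectrogram yields the second sparse matrix. The function and variable names are illustrative assumptions of this sketch.

```python
import numpy as np

def soft_threshold(x, theta):
    """eta_theta(x) = sign(x) * max(|x| - theta, 0), per formulas (16)-(17)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def ista(y, D, lam=0.1, max_iter=500, tol=1e-6):
    """Solve min_x 0.5*||y - D x||^2 + lam*||x||_1 (formula (14)) with ISTA."""
    L = np.linalg.eigvalsh(D.T @ D).max()       # largest eigenvalue of D^T D
    x = np.zeros(D.shape[1])                    # Step 1: initialize x_k
    for _ in range(max_iter):
        x_new = soft_threshold(x + (D.T @ (y - D @ x)) / L, lam / L)  # (16)
        if np.abs(x - x_new).max() < tol:       # Step 3 stopping rule
            return x_new
        x = x_new
    return x
```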
Fig. 10 is a flowchart of step S03 of Fig. 8 in an exemplary embodiment. Referring to Fig. 10, step S03 above may include the following steps.
In some embodiments, the LISTA algorithm can be used to solve formula (14), for example iteratively through formula (18):

x_{k+1} = η_{θ_k}( W_k^1 y + W_k^2 x_k ),  k = 1, 2, …, K          (18)

where K is a positive integer greater than or equal to 1.
Solving with the LISTA algorithm may include the following steps.
In step S034, a first training speech and its corresponding sparse matrix are obtained.
In some embodiments, any speech (which may be, for example, noisy speech or clean speech) can be sparsely decomposed to obtain the first training speech and its corresponding sparse matrix.
In step S035, the second sparse matrix of the speech to be enhanced is initialized.
In some embodiments, before the optimal sparse matrix of the speech to be enhanced has been determined, the second sparse matrix can be initialized, that is, assigned arbitrary values.
In step S036, the target feedforward neural network is trained through a back-propagation algorithm on the first training speech, the sparse matrix corresponding to the first training speech, and the initialized second sparse matrix, to determine the first target parameter, the second target parameter, and the second soft threshold of the feedforward neural network.
The back-propagation algorithm is a supervised learning algorithm. It repeatedly cycles through two phases, excitation propagation and weight update, until the response of the target network to the input reaches the predetermined target range.
The excitation propagation phase may include two steps:
1. Forward propagation: input the training sample into the target network to obtain the excitation response.
2. Back propagation: take the difference between the expected excitation response and the excitation response corresponding to the training sample, to obtain the response error.
The weight update phase may include the following two steps:
1. Multiply the excitation response corresponding to the training sample by the above response error, to obtain the gradient of the weights of the target network.
2. Weight this gradient, negate it, and add it to the weights before the update, to obtain the updated weights.
In some embodiments, the parameters of the LISTA algorithm can be determined through the feedforward neural network shown in Fig. 11. It is understood that the LISTA algorithm replaces the preset parameters of ISTA with learned weight matrices W, and the number of iterations is truncated to the number of layers into which the algorithm is unfolded as a feedforward neural network.
As shown in Fig. 11, the first target parameter W_k^1, the second target parameter W_k^2, and the second soft threshold θ_k of the feedforward neural network can all be parameters learned during the training of the feedforward neural network.
As shown in Fig. 11, updating the parameters in the feedforward neural network to determine the second sparse matrix of the speech to be enhanced may include the following steps.
Step 1: Initialize the second sparse matrix of the speech to be enhanced to obtain x_k, with k = 1.
Step 2: Take the sparse matrix of the first training speech as x_{k+1}.
Step 3: Determine the parameters W_k^1 and W_k^2 and the threshold θ_k of the feedforward neural network through the back-propagation algorithm.
Step 4: Repeat Steps 1 to 4 until the number of iterations reaches a preset threshold, or E‖x_T(W^1, W^2, θ) − x*‖ < ε, where x* may refer to the clean sparse matrix corresponding to the first training speech y, and x_1 may refer to the initialized second sparse matrix.
In step S037, the speech to be enhanced is processed according to the first target parameter, the second target parameter, and the second soft threshold of the feedforward neural network, to determine the second sparse matrix of the speech to be enhanced.
In some embodiments, after the parameters in the structure diagram corresponding to LISTA have been determined, the initialized second sparse matrix can be taken as x_1 and the speech to be enhanced input as y into the structure corresponding to LISTA, to determine the optimal sparse matrix X of the speech to be enhanced, and the clean speech of the speech to be enhanced is then determined through formula (15).
This embodiment determines the sparse matrix from the noisy speech through the learned iterative soft-thresholding algorithm. On the one hand, it solves the Lasso problem in the regression process; on the other hand, because the parameters are trained through a neural network, the method converges quickly.
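As a complement to Fig. 11, here is a small sketch of the unfolded network of formula (18) written with PyTorch. The layer count, the tying of W^1 and W^2 across layers, the ISTA-style initialization, and the MSE loss against the reference sparse code x* are all assumptions of this sketch, not part of the original disclosure.

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """Unfolding of formula (18): x_{k+1} = eta_{theta_k}(W1 y + W2 x_k)."""
    def __init__(self, D, K=16, lam=0.1):
        super().__init__()
        D = torch.as_tensor(D, dtype=torch.float32)
        L = torch.linalg.eigvalsh(D.T @ D).max()
        self.W1 = nn.Parameter(D.T / L)                              # first target parameter
        self.W2 = nn.Parameter(torch.eye(D.shape[1]) - D.T @ D / L)  # second target parameter
        self.theta = nn.Parameter(torch.full((K,), float(lam / L)))  # second soft thresholds
        self.K = K

    def forward(self, y):
        x = y.new_zeros(y.shape[:-1] + (self.W1.shape[0],))  # initialized x_1
        for k in range(self.K):
            z = y @ self.W1.T + x @ self.W2.T
            x = torch.sign(z) * torch.relu(z.abs() - self.theta[k])  # soft threshold
        return x

# Step 3 (back-propagation) against a reference sparse code x_star, sketched:
# model = LISTA(D)
# loss = ((model(y) - x_star) ** 2).mean()
# loss.backward()
```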
Fig. 12 is a flowchart of step S03 of Fig. 8 in an exemplary embodiment.
In some embodiments, the parameters of the TISTA algorithm can be determined through the neural network structure shown in Fig. 13. As shown in Fig. 13, the neural network structure may include a linear estimation unit r_k and a minimum mean square error estimation unit η_MMSE.
In some embodiments, as can be seen from Fig. 13, formula (14) can be solved, for example, through formula (19):

x_{k+1} = η_MMSE( r_k ; τ_k² )          (19)

The r_k in formula (19) can be determined according to formulas (20) and (21), and the η_MMSE function according to formula (22):

r_k = x_k + γ_k W ( y − D x_k )          (20)

where γ_k is the parameter to be learned, y is the input second training speech, x_k may be the sparse matrix to be learned, D is the depth dictionary, and W = D^T ( D D^T )^{−1};

τ_k² = (v_k² / N) ( N + (γ_k² − 2 γ_k) M ) + (γ_k² σ² / N) tr( W W^T )          (21)

where N and M may refer to the number of rows and the number of columns of the depth dictionary D;

η_MMSE( y ; τ² ) = ( (ξ − τ²) y / ξ ) · p F( y ; ξ ) / ( (1 − p) F( y ; τ² ) + p F( y ; ξ ) )          (22)

where σ² may denote the variance of the speech perturbation signal n, and p denotes the probability of a nonzero element.
In formula (21), v_k² can be expressed by formula (23); the ξ in formula (22) can be expressed by formula (24), and the function F by formula (25):

v_k² = max( ( ‖ y − D x_k ‖_2² − M σ² ) / tr( D^T D ) , ε )          (23)

where ε is a preset error value; for example, ε may be e^{−9};

ξ = α² + τ²          (24)

F( y ; v ) = ( 1 / √(2πv) ) exp( − y² / (2v) )          (25)

where σ² in the formulas above may denote the variance of the speech perturbation signal n, and α² may denote the variance of its nonzero elements.
In some embodiments, the parameter γ_k in the neural network structure can be determined through multiple training speeches.
In some embodiments, the process of determining the second sparse matrix of the speech to be enhanced through the TISTA algorithm may include the following steps.
In step S038, a second training speech and its corresponding sparse matrix are obtained.
In some embodiments, any speech (which may be, for example, noisy speech or clean speech) can be sparsely decomposed to obtain the second training speech and its corresponding sparse matrix.
In step S039, the second sparse matrix of the speech to be enhanced is obtained.
In step S3010, the linear estimation unit is determined according to the depth dictionary, the second training speech, and the sparse matrix corresponding to the second training speech.
In some embodiments, the linear estimation unit r_k can be determined according to formula (20); it is understood that the linear estimation unit contains the parameter γ_k to be learned.
In step S3011, the minimum mean square error estimation unit is determined according to the depth dictionary and the second training speech.
In some embodiments, the minimum mean square error estimation unit η_MMSE can be determined according to formulas (21) and (22), in which the parameter γ_k to be learned also appears.
In step S0312, the second sparse matrix, the second training speech, the sparse matrix of the second training speech, the linear estimation unit, and the minimum mean square error estimation unit are processed through a back-propagation algorithm, to determine the target parameters in the linear estimation unit and the minimum mean square error estimation unit.
In some embodiments, the neural network shown in Fig. 13 can be trained with the second training speech as y, the initialized second sparse matrix as x_k, and the sparse matrix corresponding to the second training speech as x_{k+1}, to determine the target parameters in the linear estimation unit and the minimum mean square error estimation unit.
In step S0313, the speech to be enhanced is processed according to the target parameters in the linear estimation unit and the minimum mean square error estimation unit, to determine the clean speech of the speech to be enhanced.
In some embodiments, after the target parameters in the linear estimation unit and the minimum mean square error estimation unit have been determined, the initialized second sparse matrix can be input as x_k and the noisy speech as the speech signal y into the neural network shown in Fig. 13, to determine the optimal sparse matrix of the speech to be enhanced.
In some embodiments, the technical solution provided by this embodiment can be represented by the following cyclic process.
Step 1: Initialize the second sparse matrix of the speech to be enhanced to obtain x_k, with k = 1.
Step 2: Obtain the second training speech y and its corresponding sparse matrix x_{k+1}.
Step 3: Determine the parameters γ_k, p and α of the feedforward neural network through the back-propagation algorithm.
Step 4: Determine the x_k of the next cycle according to the x_k of the current cycle and the updated γ_k, p and α. Repeat Steps 2 to 4 until the number of iterations reaches a preset threshold, or E‖x_T(γ_k, p, α) − x*‖ < ε, where x* may refer to the optimal sparse matrix of the speech to be enhanced.
After training, the noisy speech can be input as the input speech y into the neural network structure, to determine the optimal sparse matrix of the speech to be enhanced (i.e., the second sparse matrix).
This embodiment determines the sparse matrix from the noisy speech through the trainable soft-threshold iterative algorithm. On the one hand, it solves the Lasso problem in the regression process; on the other hand, because the parameters are trained through a neural network, the method converges quickly; in addition, since the method has few training parameters, its training is easier to stabilize.
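One iteration of formulas (19)-(25) can be sketched as follows. In this NumPy sketch, p, α² and σ² are treated as known constants of the sparse prior, γ is the per-layer parameter learned in Steps 2-4 above, D is assumed to have shape (M, N), and ξ is taken as α² + τ², the variance implied by the formulas above; all names are illustrative assumptions.

```python
import numpy as np

def gauss(y, v):
    """F(y; v) of formula (25): zero-mean Gaussian density with variance v."""
    return np.exp(-y ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def tista_step(x, y, D, W, gamma, p, alpha2, sigma2, eps=1e-9):
    """One iteration of formulas (19)-(24); gamma is the learned parameter."""
    M, N = D.shape
    r = x + gamma * (W @ (y - D @ x))                          # formula (20)
    v2 = max((np.linalg.norm(y - D @ x) ** 2 - M * sigma2)
             / np.trace(D.T @ D), eps)                          # formula (23)
    tau2 = (v2 / N) * (N + (gamma ** 2 - 2 * gamma) * M) \
           + (gamma ** 2 * sigma2 / N) * np.trace(W @ W.T)      # formula (21)
    xi = alpha2 + tau2                                          # formula (24)
    num = p * gauss(r, xi)
    den = (1 - p) * gauss(r, tau2) + num
    return (xi - tau2) / xi * r * num / den                     # eta_MMSE, formula (22)

# W = D.T @ np.linalg.inv(D @ D.T) is the linear-estimation matrix of formula (20)
```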
In some embodiments, the speech signal to be enhanced is a time-series signal; a Fourier transform can be applied to the speech to be enhanced to obtain the first spectrogram of the speech to be enhanced, so that deep unfolding is performed on the first spectrogram of the speech to be enhanced.
In some embodiments, since the clean speech determined according to formula (15) does not include phase information, the phase information of the speech to be enhanced can be obtained, and the second spectrogram of the speech to be enhanced can be determined according to the second sparse matrix and the depth dictionary; finally, the phase information is superimposed on the second spectrogram to obtain a second spectrogram including phase information.
In some embodiments, an inverse Fourier transform can further be performed on the second spectrogram including phase information, to determine the clean speech of the speech to be enhanced in the time domain.
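The round trip described in these paragraphs (Fourier transform, deep unfolding on the magnitude spectrogram, phase superposition, inverse transform) can be sketched as follows. `solve_sparse` stands for any of the three schemes (ISTA, LISTA, TISTA) and, like the librosa calls and parameter choices, is an illustrative assumption of this sketch.

```python
import numpy as np
import librosa

def enhance(wave, D, solve_sparse, n_fft=512, hop=128):
    """Sketch of the enhancement round trip around formula (15): Y* = D X."""
    S = librosa.stft(wave, n_fft=n_fft, hop_length=hop)  # first spectrogram
    mag, phase = np.abs(S), np.angle(S)                  # keep the phase aside
    X = solve_sparse(mag, D)                             # second sparse matrix
    mag_clean = D @ X                                    # formula (15)
    S_clean = mag_clean * np.exp(1j * phase)             # superimpose the phase
    return librosa.istft(S_clean, hop_length=hop, length=len(wave))
```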
Fig. 14 is a schematic diagram of a speech enhancement effect shown according to an exemplary embodiment.
As shown in Fig. 14, the speech enhancement method provided by the above embodiments can perform enhancement processing on the speech signal sent by the sending end (or by the receiving end) to remove environmental noise, so that the receiving end and the sending end can carry out high-quality voice communication.
Fig. 15 shows a game speech engine according to an exemplary embodiment.
In some embodiments, the speech enhancement method provided by the embodiments of this application can be applied to the game field; the application process may include the following steps.
The devices used in games are surveyed, the hundred most common models are selected, and each model is set to media mode and voice mode respectively. A group of recorders is chosen, covering boys of different ages and girls of different ages. Common in-game texts are selected and recorded in a quiet environment (ambient noise below 30 decibels), finally generating a clean speech database. A short-time Fourier transform is applied to the data of the clean speech database item by item, and only the information of the magnitude spectrum is retained, yielding the two-dimensional spectrogram Y ∈ R^{M×N}. Combining the restricted Boltzmann machine (RBM) with the deep semi-non-negative matrix factorization method, according to the speech enhancement method provided in the embodiments of this application, a depth dictionary D suitable for the game speech scenario is finally generated.
As shown in Fig. 15, in the noisy-speech enhancement stage, after the microphone in the game interface is turned on, the microphone starts to collect sound. When the sound passes through the game speech engine module, the module loads the already generated depth dictionary D and performs a short-time Fourier transform on the noisy speech. The two-dimensional magnitude spectrum Y ∈ R^{M×N} after the Fourier transform is taken as the noisy speech signal, and any one of the three schemes proposed in the embodiments of this application (Scheme 1 ISTA, Scheme 2 LISTA, Scheme 3 TISTA) is used to determine the best sparse matrix X of the noisy speech signal Y on the dictionary D. DX is then the enhanced magnitude spectrum; combined with the phase spectrum of the noisy speech Y and after an inverse short-time Fourier transform, the enhanced speech is obtained and transmitted to the next processing module of the game speech engine. After encoding, it is finally sent to the receiving end through the network, and the speech finally received by the game receiver is clean, clear, and highly intelligible.
Fig. 16 is a block diagram of a speech enhancement apparatus shown in an embodiment of this application. Referring to Fig. 16, the speech enhancement apparatus 1600 provided by the embodiment of this application may include: a sample acquisition module 1601, a decomposition module 1602, a visible-layer reconstruction module 1603, and a depth dictionary acquisition module 1604.
The sample acquisition module 1601 may be configured to obtain a clean speech sample; the decomposition module 1602 may be configured to decompose the clean speech sample to obtain a first sparse matrix and m basis matrices, where the values in the first sparse matrix are all positive and m is a positive integer greater than 1; the visible-layer reconstruction module 1603 may be configured to obtain the visible-layer neuron state vector of the target neural network according to the first sparse matrix and the weight matrix of the target neural network; and the depth dictionary acquisition module 1604 may be configured to update the weight matrix according to the visible-layer neuron state vector and the clean speech sample, to obtain a depth dictionary for speech enhancement.
In some embodiments, the target neural network includes a first bias term.
In some embodiments, the visible-layer reconstruction module 1603 may include: a visible-layer conditional probability determination unit and a visible-layer neuron state variable determination unit.
The visible-layer conditional probability determination unit may be configured to determine the visible-layer conditional probability of the target neural network according to the first sparse matrix, the weight matrix of the target neural network, and the first bias term; the visible-layer neuron state variable determination unit may be configured to determine the visible-layer neuron state variable according to the visible-layer conditional probability.
In some embodiments, the target neural network further includes a second bias term.
In some embodiments, the depth dictionary acquisition module 1604 may include: a first conditional probability determination unit, a second conditional probability determination unit, and a weight update unit.
The first conditional probability determination unit may be configured to determine the first hidden-layer conditional probability of the target neural network according to the weight matrix, the clean speech sample, and the second bias term; the second conditional probability determination unit may be configured to determine the second hidden-layer conditional probability of the target neural network according to the weight matrix, the visible-layer neuron state vector, and the second bias term; the weight update unit may be configured to update the weight matrix according to the first hidden-layer conditional probability, the second hidden-layer conditional probability, the clean speech sample, and the visible-layer neuron state vector.
The depth dictionary acquisition module 1604 may further include: a first bias term update unit and a second bias term update unit.
The first bias term update unit may be configured to update the first bias term according to the clean speech sample and the visible-layer neuron state vector; the second bias term update unit may be configured to update the second bias term according to the first hidden-layer conditional probability and the second hidden-layer conditional probability.
In some embodiments, the m basis matrices include a first basis matrix and a second basis matrix.
In some embodiments, the decomposition module 1602 may include: a first decomposition unit, a second decomposition unit, a basis matrix variable determination unit, a first update unit, and a second update unit.
The first decomposition unit may be configured to determine the first basis matrix and a first target matrix by performing semi-non-negative matrix factorization on the clean speech sample; the second decomposition unit may be configured to initialize the second basis matrix and a second target matrix by performing semi-non-negative matrix factorization on the first target matrix; the basis matrix variable determination unit may be configured to determine a basis matrix variable according to the first basis matrix and the second basis matrix; the first update unit may be configured to process the basis matrix variable, the clean speech sample, and the second target matrix through a basis matrix update function, to update the second basis matrix; the second update unit may be configured to process the basis matrix variable and the clean speech sample through a sparse matrix update function, to update the second target matrix, the second target matrix being the first sparse matrix.
The embodiments of this application further provide another speech enhancement apparatus, which may include: a to-be-enhanced speech acquisition module, a depth dictionary determination module, a second sparse matrix acquisition module, and a clean speech acquisition module.
The to-be-enhanced speech acquisition module may be configured to obtain the speech to be enhanced; the depth dictionary determination module may be configured to obtain a depth dictionary for speech enhancement according to any one of the above methods; the second sparse matrix acquisition module may be configured to perform deep unfolding on the speech to be enhanced according to the depth dictionary, to determine a second sparse matrix of the speech to be enhanced; and the clean speech acquisition module may be configured to determine the clean speech of the speech to be enhanced according to the second sparse matrix and the depth dictionary.
In some embodiments, the second sparse matrix acquisition module may include: an initialization unit, a first soft threshold determination unit, and a second sparse matrix update unit.
The initialization unit may be configured to obtain the second sparse matrix of the speech to be enhanced; the first soft threshold determination unit may be configured to determine a first soft threshold according to the depth dictionary and the second sparse matrix; the second sparse matrix update unit may be configured to update the second sparse matrix according to the second sparse matrix, the depth dictionary, and the first soft threshold.
In some embodiments, the second sparse matrix acquisition module may include: a first training speech acquisition unit, a first initialization unit, a first back-propagation unit, and a first determination unit.
The first training speech acquisition unit may be configured to obtain a first training speech and its corresponding sparse matrix; the first initialization unit may be configured to initialize the second sparse matrix of the speech to be enhanced; the first back-propagation unit may be configured to train the target feedforward neural network through a back-propagation algorithm on the first training speech, the sparse matrix corresponding to the first training speech, and the initialized second sparse matrix, to determine the first target parameter, the second target parameter, and the second soft threshold of the feedforward neural network; the first determination unit may be configured to process the speech to be enhanced according to the first target parameter, the second target parameter, and the second soft threshold of the feedforward neural network, to determine the second sparse matrix of the speech to be enhanced.
In some embodiments, the second sparse matrix acquisition module may include: a second training speech acquisition unit, a second initialization unit, a linear estimation unit determination unit, a minimum mean square error estimation unit, a second back-propagation unit, and a second determination unit.
The second training speech acquisition unit may be configured to obtain a second training speech and its corresponding sparse matrix; the second initialization unit may be configured to obtain the second sparse matrix of the speech to be enhanced; the linear estimation unit determination unit may be configured to determine the linear estimation unit according to the depth dictionary, the second training speech, and the sparse matrix corresponding to the second training speech; the minimum mean square error estimation unit may be configured to be determined according to the depth dictionary and the second training speech; the second back-propagation unit may be configured to process, through a back-propagation algorithm, the second sparse matrix, the second training speech, the sparse matrix of the second training speech, the linear estimation unit, and the minimum mean square error estimation unit, to determine the target parameters in the linear estimation unit and the minimum mean square error estimation unit; the second determination unit may be configured to process the speech to be enhanced according to the target parameters in the linear estimation unit and the minimum mean square error estimation unit, to determine the clean speech of the speech to be enhanced.
In some embodiments, the second sparse matrix acquisition module further includes a Fourier transform unit. The Fourier transform unit may be configured to perform a Fourier transform on the speech to be enhanced to obtain the first spectrogram of the speech to be enhanced, so that deep unfolding is performed on the first spectrogram of the speech to be enhanced.
In some embodiments, the clean speech acquisition module may include: a phase information acquisition module, a second spectrogram acquisition unit, and a phase superposition unit.
The phase information acquisition module may be configured to obtain the phase information of the speech to be enhanced; the second spectrogram acquisition unit may be configured to determine the second spectrogram of the speech to be enhanced according to the second sparse matrix and the depth dictionary; the phase superposition unit may be configured to superimpose the phase information on the second spectrogram, to obtain a second spectrogram including phase information.
In some embodiments, the clean speech acquisition module may further include an inverse Fourier transform unit.
The inverse Fourier transform unit may be configured to perform an inverse Fourier transform on the second spectrogram including phase information, to determine the clean speech of the speech to be enhanced.
Since the functional modules of the speech enhancement apparatus of the embodiments of this application correspond to the steps of the example embodiments of the speech enhancement method described above, they are not repeated here.
Through the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution of the embodiments of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, USB flash drive, removable hard disk, etc.) and includes several instructions for causing a computing device (which may be a personal computer, a server, a mobile terminal, or a smart device, etc.) to execute the method according to the embodiments of this application, such as one or more of the steps shown in Fig. 3.
In addition, the above drawings are only schematic illustrations of the processes included in the methods of the exemplary embodiments of this application and are not intended to be limiting. It is easy to understand that the processes shown in the drawings do not indicate or limit their chronological order; these processes may be executed, for example, synchronously or asynchronously in multiple modules.

Claims (14)

  1. A speech enhancement method, the method being performed by an electronic device and comprising:
    obtaining a clean speech sample;
    decomposing the clean speech sample to obtain a first sparse matrix and m basis matrices, wherein the values in the first sparse matrix are all positive and m is a positive integer greater than 1;
    obtaining a visible-layer neuron state vector of a target neural network according to the first sparse matrix and a weight matrix of the target neural network;
    updating the weight matrix according to the visible-layer neuron state vector and the clean speech sample, to obtain a depth dictionary for speech enhancement.
  2. The method according to claim 1, wherein the target neural network comprises a first bias term, and obtaining the visible-layer neuron state vector of the target neural network according to the first sparse matrix and the weight matrix of the target neural network comprises:
    determining a visible-layer conditional probability of the target neural network according to the first sparse matrix, the weight matrix of the target neural network, and the first bias term;
    determining the visible-layer neuron state variable according to the visible-layer conditional probability.
  3. The method according to claim 2, wherein the target neural network further comprises a second bias term, and updating the weight matrix according to the visible-layer neuron state vector and the clean speech sample, to obtain the depth dictionary for speech enhancement, comprises:
    determining a first hidden-layer conditional probability of the target neural network according to the weight matrix, the clean speech sample, and the second bias term;
    determining a second hidden-layer conditional probability of the target neural network according to the weight matrix, the visible-layer neuron state vector, and the second bias term;
    updating the weight matrix according to the first hidden-layer conditional probability, the second hidden-layer conditional probability, the clean speech sample, and the visible-layer neuron state vector.
  4. The method according to claim 3, wherein updating the weight matrix according to the visible-layer neuron state vector and the clean speech sample, to obtain the depth dictionary for speech enhancement, further comprises:
    updating the first bias term according to the clean speech sample and the visible-layer neuron state vector;
    updating the second bias term according to the first hidden-layer conditional probability and the second hidden-layer conditional probability.
  5. The method according to claim 1, wherein the m basis matrices comprise a first basis matrix and a second basis matrix, and decomposing the clean speech sample to obtain the sparse matrix and the m basis matrices comprises:
    determining the first basis matrix and a first target matrix by performing semi-non-negative matrix factorization on the clean speech sample;
    initializing the second basis matrix and a second target matrix by performing semi-non-negative matrix factorization on the first target matrix;
    determining a basis matrix variable according to the first basis matrix and the second basis matrix;
    processing the basis matrix variable, the clean speech sample, and the second target matrix through a basis matrix update function, to update the second basis matrix;
    processing the basis matrix variable and the clean speech sample through a sparse matrix update function, to update the second target matrix, the second target matrix being the first sparse matrix.
  6. A speech enhancement method, the method being performed by an electronic device and comprising:
    obtaining speech to be enhanced;
    obtaining a depth dictionary for speech enhancement according to the method of any one of claims 1-5;
    performing deep unfolding on the speech to be enhanced according to the depth dictionary, to determine a second sparse matrix of the speech to be enhanced;
    determining clean speech of the speech to be enhanced according to the second sparse matrix and the depth dictionary.
  7. The method according to claim 6, wherein the method further comprises:
    obtaining the second sparse matrix of the speech to be enhanced;
    determining a first soft threshold according to the depth dictionary and the second sparse matrix;
    updating the second sparse matrix according to the second sparse matrix, the depth dictionary, and the first soft threshold.
  8. The method according to claim 6, wherein performing deep unfolding on the speech to be enhanced according to the depth dictionary, to determine the second sparse matrix of the speech to be enhanced, comprises:
    obtaining a first training speech and its corresponding sparse matrix;
    initializing the second sparse matrix of the speech to be enhanced;
    training a target feedforward neural network through a back-propagation algorithm on the first training speech, the sparse matrix corresponding to the first training speech, and the initialized second sparse matrix, to determine a first target parameter, a second target parameter, and a second soft threshold of the feedforward neural network;
    processing the speech to be enhanced according to the first target parameter, the second target parameter, and the second soft threshold of the feedforward neural network, to determine the second sparse matrix of the speech to be enhanced.
  9. The method according to claim 6, wherein determining the clean speech of the speech to be enhanced according to the second sparse matrix and the depth dictionary comprises:
    obtaining a second training speech and its corresponding sparse matrix, and the second sparse matrix of the speech to be enhanced;
    determining a linear estimation unit according to the second training speech, the depth dictionary, and the second sparse matrix, and determining a minimum mean square error estimation unit according to the depth dictionary and the second training speech;
    processing the linear estimation unit and the minimum mean square error estimation unit through a back-propagation algorithm, to determine target parameters in the linear estimation unit and the minimum mean square error estimation unit;
    processing the speech to be enhanced according to the target parameters in the linear estimation unit and the minimum mean square error estimation unit, to determine the clean speech of the speech to be enhanced.
  10. The method according to any one of claims 6-9, wherein performing deep unfolding on the speech to be enhanced according to the depth dictionary, to determine the second sparse matrix of the speech to be enhanced, comprises:
    performing a Fourier transform on the speech to be enhanced, to obtain a first spectrogram of the speech to be enhanced;
    performing deep unfolding on the first spectrogram of the speech to be enhanced according to the depth dictionary, to determine the second sparse matrix of the speech to be enhanced.
  11. The method according to claim 10, wherein determining the clean speech of the speech to be enhanced according to the second sparse matrix and the depth dictionary comprises:
    obtaining phase information of the speech to be enhanced;
    determining a second spectrogram of the speech to be enhanced according to the second sparse matrix and the depth dictionary;
    superimposing the phase information on the second spectrogram, to obtain a second spectrogram including phase information;
    performing an inverse Fourier transform on the second spectrogram including phase information, to determine the clean speech of the speech to be enhanced.
  12. A speech enhancement apparatus, comprising:
    a sample acquisition module, configured to obtain a clean speech sample;
    a decomposition module, configured to decompose the clean speech sample to obtain a first sparse matrix and m basis matrices, wherein the values in the first sparse matrix are all positive and m is a positive integer greater than 1;
    a visible-layer reconstruction module, configured to obtain a visible-layer neuron state vector of a target neural network according to the first sparse matrix and a weight matrix of the target neural network;
    a depth dictionary acquisition module, configured to update the weight matrix according to the visible-layer neuron state vector and the clean speech sample, to obtain a depth dictionary for speech enhancement.
  13. An electronic device, comprising:
    one or more processors;
    a memory configured to store one or more programs,
    which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-11.
  14. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-11.
PCT/CN2020/126345 2020-02-10 2020-11-04 Speech enhancement method and apparatus, electronic device and computer-readable storage medium WO2021159772A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20918152.8A EP4002360A4 (en) 2020-02-10 2020-11-04 LANGUAGE ENHANCEMENT METHOD AND DEVICE, ELECTRONIC DEVICE AND COMPUTER READABLE STORAGE MEDIUM
US17/717,620 US20220262386A1 (en) 2020-02-10 2022-04-11 Speech enhancement method and apparatus, electronic device, and computer- readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010085323.0A 2020-02-10 2020-02-10 Speech enhancement method and apparatus, electronic device and computer-readable storage medium
CN202010085323.0 2020-02-10

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/717,620 Continuation US20220262386A1 (en) 2020-02-10 2022-04-11 Speech enhancement method and apparatus, electronic device, and computer- readable storage medium

Publications (1)

Publication Number Publication Date
WO2021159772A1 true WO2021159772A1 (zh) 2021-08-19

Family

ID=71150935

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/126345 WO2021159772A1 (zh) 2020-02-10 2020-11-04 语音增强方法及装置、电子设备和计算机可读存储介质

Country Status (4)

Country Link
US (1) US20220262386A1 (zh)
EP (1) EP4002360A4 (zh)
CN (1) CN111312270B (zh)
WO (1) WO2021159772A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312270B (zh) * 2020-02-10 2022-11-22 腾讯科技(深圳)有限公司 语音增强方法及装置、电子设备和计算机可读存储介质
CN113823291A (zh) * 2021-09-07 2021-12-21 广西电网有限责任公司贺州供电局 一种应用于电力作业中的声纹识别的方法及系统

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413551A (zh) * 2013-07-16 2013-11-27 清华大学 基于稀疏降维的说话人识别方法
CN103559888A (zh) * 2013-11-07 2014-02-05 航空电子系统综合技术重点实验室 基于非负低秩和稀疏矩阵分解原理的语音增强方法
JPWO2015060375A1 (ja) * 2013-10-23 2017-03-09 国立大学法人 長崎大学 生体音信号処理装置、生体音信号処理方法および生体音信号処理プログラム
JP6220701B2 (ja) * 2014-02-27 2017-10-25 日本電信電話株式会社 サンプル列生成方法、符号化方法、復号方法、これらの装置及びプログラム
CN108322858A (zh) * 2018-01-25 2018-07-24 中国科学技术大学 基于张量分解的多麦克风语音增强方法
CN108615533A (zh) * 2018-03-28 2018-10-02 天津大学 一种基于深度学习的高性能语音增强方法
CN108899045A (zh) * 2018-06-29 2018-11-27 中国航空无线电电子研究所 基于约束低秩与稀疏分解的子空间语音增强方法
CN108986834A (zh) * 2018-08-22 2018-12-11 中国人民解放军陆军工程大学 基于编解码器架构与递归神经网络的骨导语音盲增强方法
CN109087664A (zh) * 2018-08-22 2018-12-25 中国科学技术大学 语音增强方法
CN110164465A (zh) * 2019-05-15 2019-08-23 上海大学 一种基于深层循环神经网络的语音增强方法及装置
CN110634502A (zh) * 2019-09-06 2019-12-31 南京邮电大学 基于深度神经网络的单通道语音分离算法
CN111312270A (zh) * 2020-02-10 2020-06-19 腾讯科技(深圳)有限公司 语音增强方法及装置、电子设备和计算机可读存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9972315B2 (en) * 2015-01-14 2018-05-15 Honda Motor Co., Ltd. Speech processing device, speech processing method, and speech processing system
CN107563497B (zh) * 2016-01-20 2021-03-19 中科寒武纪科技股份有限公司 用于稀疏人工神经网络的计算装置和运算方法
CN110164469B (zh) * 2018-08-09 2023-03-10 腾讯科技(深圳)有限公司 一种多人语音的分离方法和装置
CN110189761B (zh) * 2019-05-21 2021-03-30 哈尔滨工程大学 一种基于贪婪深度字典学习的单信道语音去混响方法
CN112712096A (zh) * 2019-10-25 2021-04-27 中国科学院声学研究所 基于深度递归非负矩阵分解的音频场景分类方法及系统


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
See also references of EP4002360A4 *
XU GENPENG: "Dictionary Learning Algorithms and Its Application in Speech Enhancement", GUANGDONG UNIVERSITY OF TECHNOLOGY MASTER’S THESES, 31 December 2017 (2017-12-31), XP055835293 *

Also Published As

Publication number Publication date
EP4002360A1 (en) 2022-05-25
EP4002360A4 (en) 2022-11-16
US20220262386A1 (en) 2022-08-18
CN111312270B (zh) 2022-11-22
CN111312270A (zh) 2020-06-19

Similar Documents

Publication Publication Date Title
US20220004870A1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
US11645835B2 (en) Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications
JP5509488B2 (ja) 形状を認識する方法および形状を認識する方法を実施するシステム
US20220262386A1 (en) Speech enhancement method and apparatus, electronic device, and computer- readable storage medium
Du et al. Independent component analysis
Jain et al. Blind source separation and ICA techniques: a review
CN113611323A (zh) 一种基于双通道卷积注意力网络的语音增强方法及系统
CN110164465A (zh) 一种基于深层循环神经网络的语音增强方法及装置
Marban et al. Estimation of interaction forces in robotic surgery using a semi-supervised deep neural network model
CN113256508A (zh) 一种改进的小波变换与卷积神经网络图像去噪声的方法
JP6099032B2 (ja) 信号処理装置、信号処理方法及びコンピュータプログラム
CN112232395A (zh) 一种基于联合训练生成对抗网络的半监督图像分类方法
Şimşekli et al. Non-negative tensor factorization models for Bayesian audio processing
CN111714124B (zh) 磁共振电影成像方法、装置、成像设备及存储介质
Das et al. ICA methods for blind source separation of instantaneous mixtures: A case study
CN115601257A (zh) 一种基于局部特征和非局部特征的图像去模糊方法
Li et al. Automatic Modulation Recognition Based on a New Deep K-SVD Denoising Algorithm
JP7047665B2 (ja) 学習装置、学習方法及び学習プログラム
Srikotr et al. Vector quantization of speech spectrum based on the vq-vae embedding space learning by gan technique
CN117253287B (zh) 基于域泛化的动作预测模型训练方法、相关方法及产品
Zhao et al. Local-and-Nonlocal Spectral Prior Regularized Tensor Recovery for Cauchy Noise Removal
Mashhadi et al. Interpolation of sparse graph signals by sequential adaptive thresholds
Liu et al. One to multiple mapping dual learning: Learning multiple signals from one mixture
US11783847B2 (en) Systems and methods for unsupervised audio source separation using generative priors
Finkelstein et al. Transfer learning promotes robust parametric mapping of diffusion encoded MR fingerprinting

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20918152

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020918152

Country of ref document: EP

Effective date: 20220215

NENP Non-entry into the national phase

Ref country code: DE