CN115376501B - Voice enhancement method and device, storage medium and electronic equipment


Info

Publication number
CN115376501B
CN115376501B (application CN202211317370.9A)
Authority
CN
China
Prior art keywords
voice
phase
original
speech
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211317370.9A
Other languages
Chinese (zh)
Other versions
CN115376501A (en)
Inventor
黄仁杰
刘金成
洪秀贞
刘代琴
Current Assignee
Shenzhen Raixun Information Technology Co ltd
Original Assignee
Shenzhen Raixun Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Raixun Information Technology Co ltd filed Critical Shenzhen Raixun Information Technology Co ltd
Priority to CN202211317370.9A priority Critical patent/CN115376501B/en
Publication of CN115376501A publication Critical patent/CN115376501A/en
Application granted granted Critical
Publication of CN115376501B publication Critical patent/CN115376501B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
        • G10L15/08 — Speech recognition: speech classification or search
        • G10L15/063 — Speech recognition: training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
        • G10L15/16 — Speech classification or search using artificial neural networks
        • G10L21/0224 — Noise filtering characterised by the method used for estimating noise: processing in the time domain
        • G10L21/0232 — Noise filtering characterised by the method used for estimating noise: processing in the frequency domain
        • G10L21/0332 — Speech enhancement by changing the amplitude: details of processing involving modification of waveforms

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a speech enhancement method and apparatus, a storage medium, and an electronic device. The method comprises: acquiring original speech, where the original speech contains a useful audio signal; phase-encoding the original speech with an autoencoder to obtain enhanced speech; and classifying and discriminating the enhanced speech with a conditional generative adversarial network, outputting clean speech of the original speech, where the clean speech contains the useful audio signal. The invention solves the technical problem of poor performance when speech enhancement is performed on a frequency basis in the related art: by making full use of the time-domain and phase information of speech, it can effectively improve speech quality and intelligibility.

Description

Voice enhancement method and device, storage medium and electronic equipment
Technical Field
The invention relates to the field of computers, in particular to a voice enhancement method and device, a storage medium and electronic equipment.
Background
In various intelligent voice interaction scenarios, sound input and output are often subject to interference. The mainstream processing approaches perform audio enhancement and noise cancellation through auditory perception modeling based on psycho-physiological models.
In the related art, many speech signal processing systems, such as speech recognition and speaker identification, use features derived from the amplitude spectrum or energy spectrum and usually ignore phase characteristics, because human hearing has traditionally been considered phase-insensitive. However, recent studies indicate that the physiological structure of the cochlea and its sound-processing pathway do process and perceive phase. Moreover, when speech enhancement relies on frequency characteristics alone, the uncertainty and randomness of environmental sound mean that the frequency characteristics of interfering audio may alias with, or even coincide with, those of the target human voice, resulting in poor enhancement.
In view of the above problems in the related art, no effective solution has been found at present.
Disclosure of Invention
The embodiment of the invention provides a voice enhancement method and device, a storage medium and electronic equipment.
According to one aspect of an embodiment of the present invention, there is provided a speech enhancement method, comprising: acquiring original speech, where the original speech contains a useful audio signal; phase-encoding the original speech with an autoencoder to obtain enhanced speech; and classifying and discriminating the enhanced speech with a conditional generative adversarial network, outputting clean speech of the original speech, where the clean speech contains the useful audio signal.
Optionally, phase-encoding the original speech with the autoencoder to obtain the enhanced speech includes: performing a first encoding operation on the original speech with an encoder to generate phase information; and performing a second encoding operation on the phase information with a generator to generate the enhanced speech, where the autoencoder comprises the encoder and the generator.
Optionally, performing the first encoding operation on the original speech with the encoder to generate the phase information includes: extracting the time-domain waveform signal of the original speech; applying a short-time Fourier transform to the waveform signal to generate a frequency-domain signal of the original speech; and taking the phase angle of the frequency-domain signal to obtain the phase spectrum of the original speech, the phase information being generated from the phase spectrum.
Optionally, performing the second encoding operation on the phase information with the generator to generate the enhanced speech includes: determining a sensitive phase interval of the generator, where the sensitive phase interval is the phase interval within which the cochlea perceives sound; filtering the phase information with the sensitive phase interval to obtain an intermediate phase spectrum, where the intermediate phase spectrum comprises a first phase spectrum corresponding to an interfering audio signal and a second phase spectrum corresponding to the useful audio signal, the original speech further containing the interfering audio signal; acquiring a target feature spectrum of the original speech, where the target feature spectrum comprises an amplitude spectrum and/or an energy spectrum corresponding to the vocal cords of a specified speaker; and encoding the intermediate phase spectrum and the target feature spectrum to generate the enhanced speech.
Optionally, acquiring the target feature spectrum of the original speech includes: analyzing a fixed tone map in the original speech and determining it to be the tone map of the useful audio signal; extracting, from the voiced segments of the original speech, a reference speech segment corresponding to the fixed tone map; and filtering the pitch components out of the reference speech, then extracting from those pitch components the amplitude spectrum and/or energy spectrum corresponding to the vocal cords of the specified speaker.
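As a minimal, non-authoritative sketch of what an amplitude spectrum and energy spectrum are (the patent does not give formulas; the framing, window, and function name below are illustrative assumptions): the amplitude spectrum is the magnitude of each DFT bin of a speech frame, and the energy spectrum is its square.

```python
import numpy as np

def feature_spectra(frame: np.ndarray):
    """Compute the amplitude spectrum |X(f)| and energy spectrum |X(f)|^2
    of one speech frame. Window choice and framing are illustrative, not
    taken from the patent."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    amplitude = np.abs(spectrum)   # amplitude (magnitude) spectrum
    energy = amplitude ** 2        # energy spectrum
    return amplitude, energy

# A 200 Hz tone sampled at 8 kHz over a 256-sample frame should peak
# near bin 200 / (8000 / 256) = 6.4, i.e. bin 6 or 7.
sr, n = 8000, 256
t = np.arange(n) / sr
amp, en = feature_spectra(np.sin(2 * np.pi * 200 * t))
peak_bin = int(np.argmax(amp))
```

The energy spectrum carries the same spectral-envelope information as the amplitude spectrum, just on a squared scale, which is why the claim treats them as interchangeable alternatives.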
Optionally, classifying and discriminating the enhanced speech with the conditional generative adversarial network and outputting the clean speech of the original speech includes: feeding the enhanced speech synchronously into the classifier and the discriminator of the conditional generative adversarial network; obtaining a plurality of speech categories output by the classifier and a truth degree output by the discriminator for each, one truth degree per category; and determining the category with the maximum truth degree to be the target speech category, selecting the time-domain component of the target speech category from the enhanced speech, and outputting that component as first clean speech of the original speech, the clean speech comprising the first clean speech.
Optionally, classifying and discriminating the enhanced speech with the conditional generative adversarial network and outputting the clean speech of the original speech includes: feeding the enhanced speech synchronously into the classifier and the discriminator of the conditional generative adversarial network; obtaining the speech category output by the classifier and the truth degree output by the discriminator; judging whether the speech category matches the configured category of the autoencoder's generator, where the configured category indicates the signal category of the useful audio signal, and whether the truth degree exceeds a preset threshold; and if both conditions hold, outputting the enhanced speech as second clean speech of the original speech, the clean speech comprising the second clean speech.
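The two output strategies above can be sketched as plain selection logic. This is a hedged illustration only: the category names, threshold value, and function are hypothetical stand-ins, and the truth degrees would in practice come from the trained discriminator, not a hard-coded list.

```python
import numpy as np

def select_clean_speech(categories, truths, configured_category, threshold=0.5):
    """Illustrative selection logic (not the patent's trained networks).
    Strategy 1: pick the category with the maximum truth degree.
    Strategy 2: accept the enhanced speech only if that category matches
    the generator's configured class AND its truth exceeds a threshold."""
    best = int(np.argmax(truths))
    target_category = categories[best]                      # strategy 1
    accepted = (target_category == configured_category
                and truths[best] > threshold)               # strategy 2
    return target_category, accepted

# Hypothetical categories for the customer-service example in the description.
cats = ["user", "agent", "background"]
target, ok = select_clean_speech(cats, [0.2, 0.9, 0.1], "agent")
```

Strategy 1 yields the "first clean speech" (a time-domain component of the winning category would then be extracted); strategy 2 gates the whole enhanced signal as "second clean speech".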
According to another aspect of the embodiments of the present invention, there is provided a speech enhancement apparatus, comprising: an acquisition module for acquiring original speech, where the original speech contains a useful audio signal; an encoding module for phase-encoding the original speech with an autoencoder to obtain enhanced speech; and an output module for classifying and discriminating the enhanced speech with a conditional generative adversarial network and outputting clean speech of the original speech, where the clean speech contains the useful audio signal.
Optionally, the encoding module includes: a first encoding unit for performing a first encoding operation on the original speech with an encoder to generate phase information; and a second encoding unit for performing a second encoding operation on the phase information with a generator to generate the enhanced speech, where the autoencoder comprises the encoder and the generator.
Optionally, the first encoding unit includes: an extraction subunit for extracting the time-domain waveform signal of the original speech; a transformation subunit for applying a short-time Fourier transform to the waveform signal to generate a frequency-domain signal of the original speech; and an acquisition subunit for taking the phase angle of the frequency-domain signal, obtaining the phase spectrum of the original speech, and generating the phase information from the phase spectrum.
Optionally, the second encoding unit includes: a determining subunit for determining a sensitive phase interval of the generator, where the sensitive phase interval is the phase interval within which the cochlea perceives sound; a filtering subunit for filtering the phase information with the sensitive phase interval to obtain an intermediate phase spectrum, where the intermediate phase spectrum comprises a first phase spectrum corresponding to an interfering audio signal and a second phase spectrum corresponding to the useful audio signal, the original speech further containing the interfering audio signal; an acquiring subunit for acquiring a target feature spectrum of the original speech, where the target feature spectrum comprises an amplitude spectrum and/or an energy spectrum corresponding to the vocal cords of a specified speaker; and a generating subunit for encoding the intermediate phase spectrum and the target feature spectrum to generate the enhanced speech.
Optionally, the acquiring subunit is further configured to: analyze a fixed tone map in the original speech and determine it to be the tone map of the useful audio signal; extract, from the voiced segments of the original speech, a reference speech segment corresponding to the fixed tone map; and filter the pitch components out of the reference speech, then extract from those pitch components the amplitude spectrum and/or energy spectrum corresponding to the vocal cords of the specified speaker.
Optionally, the output module includes: an input unit for feeding the enhanced speech synchronously into the classifier and the discriminator of the conditional generative adversarial network; a first obtaining unit for obtaining a plurality of speech categories output by the classifier and a truth degree output by the discriminator for each, one truth degree per category; and a first output unit for determining the category with the maximum truth degree to be the target speech category, selecting the time-domain component of the target speech category from the enhanced speech, and outputting that component as first clean speech of the original speech, the clean speech comprising the first clean speech.
Optionally, the output module includes: an input unit for feeding the enhanced speech synchronously into the classifier and the discriminator of the conditional generative adversarial network; a second obtaining unit for obtaining the speech category output by the classifier and the truth degree output by the discriminator; a judging unit for judging whether the speech category matches the configured category of the autoencoder's generator, where the configured category indicates the signal category of the useful audio signal, and whether the truth degree exceeds a preset threshold; and a second output unit for outputting, if both conditions hold, the enhanced speech as second clean speech of the original speech, the clean speech comprising the second clean speech.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program which, when executed, performs the above steps.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device comprising a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory stores a computer program, and the processor performs the steps of the method by running the program stored in the memory.
Embodiments of the present invention also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the steps of the above method.
According to the invention, original speech is acquired, where the original speech contains a useful audio signal; the original speech is phase-encoded with an autoencoder to obtain enhanced speech; and the enhanced speech is classified and discriminated with a conditional generative adversarial network, outputting clean speech of the original speech, where the clean speech contains the useful audio signal. By phase-encoding the original speech and constraining the result through the classification and discrimination of the conditional generative adversarial network, a phase-based noise-cancellation scheme is realized. This solves the technical problem of poor performance when speech enhancement is performed on a frequency basis in the related art, makes full use of the time-domain and phase information of speech, and can effectively improve speech quality and intelligibility.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not limit the invention. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a sound pickup apparatus according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a method of speech enhancement according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the architecture of the autoencoder and conditional generative adversarial network in an embodiment of the present invention;
FIG. 4 is a waveform diagram illustrating enhancement of speech according to an embodiment of the present invention;
FIG. 5 is a block diagram of a speech enhancement apparatus according to an embodiment of the present invention;
fig. 6 is a block diagram of an electronic device implementing an embodiment of the invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
The method provided by the embodiments of the invention can be executed on a server, a mobile phone, a recording/sound-pickup device, a tablet, a computer, a processor, or a similar voice processing device. Taking execution on a sound pickup apparatus as an example, fig. 1 is a block diagram of the hardware configuration of a sound pickup apparatus according to an embodiment of the present invention. As shown in fig. 1, the sound pickup apparatus may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is merely illustrative and does not limit the structure of the sound pickup apparatus. For example, the apparatus may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store software programs and modules of application software, such as the program corresponding to the voice enhancement method in an embodiment of the present invention; the processor 102 executes various functional applications and data processing, thereby implementing the method described above, by running the programs stored in the memory 104. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the sound pickup apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the sound pickup apparatus. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In the present embodiment, a speech enhancement method is provided, and fig. 2 is a flowchart of a speech enhancement method according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, obtaining original voice, wherein the original voice comprises useful audio signals;
the original speech of the present embodiment includes an interfering audio signal and a useful audio signal. Wherein the useful audio signal is a voice signal emitted by a designated speaker. The interfering audio signal may be ambient sound or noise, or may be a voice signal from an unrelated speaker. For example, in a customer service scenario, the audio signals of interest are the voice signal of the user who dialed the hotline and the voice signal of the customer service utterance that answers the user. The disturbing audio signal may be other than the speech signal associated with the user and the customer service answering the user, such as the speaking sound of other customer services, or a noisy sound of the environment, etc.
Optionally, the original speech is a speech signal collected by a sound pickup device (e.g. a microphone).
Step S204, carrying out phase coding on the original voice by adopting an automatic encoder to obtain enhanced voice;
the auto-encoder (AE) of the present embodiment includes two networks: respectively an encoding (encoder) network and a decoding (decoder) network. An encoding network for encoding the sound signal into sound features, denoted in the model by E; the decoding network is also called a Generator (Generator) for decoding the sound signal based on the sound features, and is used for G representation in the model. The principle of the encoding network is a process of compressing information, and the principle of the decoding network is a process of decompressing.
Step S206, classifying and judging the enhanced voice by adopting a conditional generation countermeasure network, and outputting pure voice of original voice, wherein the pure voice comprises a useful audio signal;
the Conditional generation countermeasure network (CGAN) of this embodiment includes a Classifier (Classifier, C) and a Discriminator (Discriminator, D), where the Classifier inputs a voice x and outputs a category C to which the voice x belongs, and the Discriminator inputs the voice x and determines the degree of truth of the category C.
Through the above steps, original speech containing a useful audio signal is acquired; an autoencoder phase-encodes the original speech to obtain enhanced speech; and a conditional generative adversarial network classifies and discriminates the enhanced speech, outputting clean speech of the original speech that contains the useful audio signal. By phase-encoding the original speech and constraining the result through classification and discrimination, a phase-based noise-cancellation scheme is realized, which solves the technical problem of poor performance of frequency-based speech enhancement in the related art; the time-domain and phase information of speech are fully utilized, effectively improving speech quality and intelligibility.
In an embodiment, the phase-coding the original speech with an automatic encoder to obtain the enhanced speech includes:
s11, performing a first coding operation on the original voice by adopting a coder to generate phase information;
in one embodiment of this embodiment, performing the first encoding operation on the original speech using the encoder, and generating the phase information includes: extracting a waveform signal of an original voice in a time domain; carrying out short-time Fourier transform on the waveform signal to generate a frequency domain signal of the original voice; and carrying out phase angle acquisition operation on the frequency domain signal to acquire a phase spectrum of the original voice.
Spectral features are the signal characteristics of a sound in the frequency domain, and the short-time Fourier transform (STFT) converts the time-domain signal into the frequency domain. In this embodiment, after the waveform signal of the original speech has been transformed into the frequency domain, the phase spectrum is obtained by taking the phase angle of the frequency-domain signal.
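The first encoding operation (waveform, STFT, phase angle) can be sketched with plain numpy. This is a hedged illustration of the signal-processing steps only, not the patent's encoder network; the frame length, hop size, and window are assumed parameters.

```python
import numpy as np

def phase_spectrum(waveform, frame_len=256, hop=128):
    """Frame a time-domain waveform, apply an STFT, and return the phase
    spectrum (frames x frequency bins). Framing parameters are illustrative."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    stft = np.fft.rfft(frames, axis=1)   # frequency-domain signal
    return np.angle(stft)                # phase angle of each bin, in (-pi, pi]

sr = 8000
t = np.arange(sr) / sr
phases = phase_spectrum(np.sin(2 * np.pi * 440 * t))
```

`np.angle` is exactly the "phase angle acquisition operation" named in the text: the magnitude of the same STFT would give the amplitude spectrum, and the angle gives the phase spectrum from which the phase information is generated.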
And S12, performing a second coding operation on the phase information by adopting the generator to generate enhanced voice.
In one implementation of this embodiment, performing the second encoding operation on the phase information with the generator to generate the enhanced speech includes: determining a sensitive phase interval of the generator, where the sensitive phase interval is the phase interval within which the cochlea perceives sound; filtering the phase information with the sensitive phase interval to obtain an intermediate phase spectrum, where the intermediate phase spectrum comprises a first phase spectrum corresponding to the interfering audio signal and a second phase spectrum corresponding to the useful audio signal, the original speech also containing the interfering audio signal; acquiring a target feature spectrum of the original speech, where the target feature spectrum comprises an amplitude spectrum and/or an energy spectrum corresponding to the vocal cords of a specified speaker; and encoding the intermediate phase spectrum and the target feature spectrum to generate the enhanced speech.
Because the original speech is noisy, its total phase interval comprises both the interval in which the cochlea perceives sound and the interval in which it does not. Filtering the phase information of the original speech down to the sensitive phase interval removes, by phase, one part of the interfering audio signals; what remains are the cochlea-perceivable signals, namely the other part of the interfering audio (the audio of the first phase spectrum) and the useful audio (the audio of the second phase spectrum).
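The filtering step can be sketched as a simple mask over the phase values. The interval bounds below are placeholders (the patent does not specify numeric bounds for the sensitive phase interval), and zeroing masked-out components is an assumed, simplified stand-in for whatever the generator actually does with them.

```python
import numpy as np

def filter_sensitive_phase(phase, low=-np.pi / 2, high=np.pi / 2):
    """Keep only phase components inside an assumed sensitive interval;
    bounds are placeholders, not values from the patent. Components outside
    the interval are masked to zero, mimicking the filtering step."""
    mask = (phase >= low) & (phase <= high)
    return np.where(mask, phase, 0.0), mask

phase = np.array([-3.0, -1.0, 0.2, 1.2, 2.8])
kept, mask = filter_sensitive_phase(phase)
# only the values within the assumed +/- pi/2 interval survive
```

In the method itself the surviving components would then be split into the first phase spectrum (remaining interference) and the second phase spectrum (useful signal) before being combined with the target feature spectrum.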
The phase of a sound is measured in degrees. For a sine wave, 0° marks the origin of the wave, the first peak occurs at 90°, the waveform goes negative at 180°, and a full cycle completes at 360°. Two sinusoids of the same frequency are in phase when they oscillate together; when such waves are combined, their intensity doubles.
If one wave starts half a cycle after the other, the waves are inverted with respect to each other and their phase difference is 180°; when they are superimposed, their peaks cancel each other out. If the phase difference between two waves is larger or smaller than 180°, their superposition affects only some frequencies, degrading the sound quality.
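The doubling and cancellation described above can be checked numerically. The following sketch (illustrative only) superimposes two equal-frequency sinusoids once in phase and once 180° out of phase:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1000, endpoint=False)
f = 5.0  # Hz, arbitrary toy frequency

in_phase = np.sin(2 * np.pi * f * t) + np.sin(2 * np.pi * f * t)
opposed = np.sin(2 * np.pi * f * t) + np.sin(2 * np.pi * f * t + np.pi)

# Two in-phase waves double the peak amplitude...
assert np.isclose(in_phase.max(), 2.0, atol=1e-3)
# ...while a 180° phase difference cancels the signal almost entirely.
assert np.max(np.abs(opposed)) < 1e-9
```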
The perception of sound rests on the human auditory nerve. For a single-frequency sine wave above roughly 2 kHz, there is essentially no phase locking between the cell potential and the audio: the hair cells no longer change their potential with the phase of the sine wave, so the auditory system stops encoding the phase of the sound and encodes only its amplitude. The phase of a sine wave above 2 kHz is therefore essentially meaningless to the human ear and does not belong to the sensitive phase interval of the human body. Moreover, in a multi-channel scene, it is the phase relationship between channels (within about 1 kHz) that affects subjective perception, especially sound-field localization and focusing effects, for example the phase matching between a subwoofer and a main speaker, or in active noise-cancelling earphones. For an ordinary stereo speaker or earphone, shifting the phase of both channels synchronously usually has no significant audible effect, and the absolute phase of a single channel is barely audible unless there is a phase difference between the left and right channels. Hence the phase of a multi-channel sine wave above 1 kHz is essentially meaningless to the human ear and does not belong to the sensitive phase interval of the human body.
Although audio signals outside the sensitive phase interval carry no meaning for the human auditory nerve, a sound may contain several simultaneous signals carrying the same information. When two audio signals containing the same source are mixed (for example, the sound emitted by a loudspeaker together with the noise the loudspeaker makes while operating, or the signals of several microphones or a guitar amplifier picking up a jazz drum kit), a phase difference between them causes phase cancellation, which can remove certain frequencies from the signal, or even the whole signal, degrading the speech quality.
Optionally, obtaining the target feature spectrum of the original speech includes: analyzing a fixed tone map in the original speech and determining the fixed tone map as the tone map of the useful audio signal; intercepting, from a speech segment of the original speech, a reference speech corresponding to the fixed tone map; and filtering pitch components out of the reference speech and extracting, from the pitch components, the magnitude spectrum and/or the energy spectrum corresponding to the vocal cords of the specified object. The specified object is the object (a person, an animal, and so on) whose voice is to be extracted, and the clean speech is the speech produced by the sound source of the specified object.
In this embodiment, the reference speech may be parsed from the original speech in real time, or a test sound may serve as the reference speech in a test stage before speech acquisition. For real-time parsing: when the vocal tract of a fixed object produces sound, its timbre characteristics are fixed (each person's timbre is different), so the tone map of that sound is also fixed, whereas other noises in the environment are unfixed random sounds whose tone maps are scattered. The noisy components of the original speech can therefore be filtered out through their scattered tone maps, yielding comparatively clean pitch components, that is, the pitch components of the useful audio signal, from which the useful target feature spectrum is obtained.
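For illustration only, one heuristic way to separate a fixed (stable) spectral signature from scattered noise is to keep frequency bins whose magnitude varies little across frames. The function, thresholding scheme, and sample rate below are assumptions, not the patent's method:

```python
import numpy as np

def stable_timbre_mask(frames, keep_ratio=0.5):
    # A fixed vocal timbre shows up as low relative variance of the
    # per-bin magnitude over time, while scattered environmental noise
    # varies from frame to frame. (Heuristic illustration only.)
    mags = np.abs(np.fft.rfft(frames, axis=1))
    rel_var = mags.std(axis=0) / (mags.mean(axis=0) + 1e-12)
    cutoff = np.quantile(rel_var, keep_ratio)
    return rel_var <= cutoff

rng = np.random.default_rng(0)
t = np.arange(256) / 8000.0                     # 8 kHz sample rate (assumed)
tone = np.sin(2 * np.pi * 440 * t)              # fixed-pitch component
frames = np.stack([tone + 0.3 * rng.standard_normal(256) for _ in range(8)])
mask = stable_timbre_mask(frames)
# The bin nearest 440 Hz (round(440*256/8000) = 14) is judged stable.
assert mask[14]
assert not mask.all()
```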
In one example of this embodiment, classifying and discriminating the enhanced speech with the conditional generative adversarial network and outputting the clean speech of the original speech includes: synchronously inputting the enhanced speech into the classifier and the judger of the conditional generative adversarial network; acquiring a plurality of speech categories output by the classifier and the speech truth degrees output by the judger, wherein each speech category corresponds to one speech truth degree; and selecting, from the plurality of speech categories, the target speech category with the largest speech truth degree, selecting the time-domain component of the target speech category from the enhanced speech, and outputting that time-domain component as the first clean speech of the original speech.
In some scenarios the interfering audio is comparable to the useful audio, or the interference has several sources; in such cases it is not easy to determine which speech category is the useful audio signal to be output in the end (such as the sound emitted by a specific user). The judger (discriminator) rates the truth degree of the speech categories output by the classifier; its output may be binary (0, 1) or lie on a scale such as 0 to 100.
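A minimal sketch of the selection rule described above, with hypothetical category names and scores (the discriminator scale is taken here as 0 to 1):

```python
def pick_target_category(categories, truth_degrees):
    # Select the speech category whose truth degree, as rated by the
    # discriminator, is the largest.
    best = max(range(len(categories)), key=lambda i: truth_degrees[i])
    return categories[best]

# e.g. three candidate sources scored by the discriminator
cats = ["speaker_a", "fan_noise", "speaker_b"]
truth = [0.91, 0.15, 0.42]
assert pick_target_category(cats, truth) == "speaker_a"
```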
In another example of this embodiment, classifying and discriminating the enhanced speech with the conditional generative adversarial network and outputting the clean speech of the original speech includes: synchronously inputting the enhanced speech into the classifier and the judger of the conditional generative adversarial network; acquiring the speech category output by the classifier and the speech truth degree output by the judger; judging whether the speech category is the same as the configuration category of the generator of the autoencoder and whether the speech truth degree is greater than a preset threshold, wherein the configuration category indicates the signal category of the useful audio signal; and, if the speech category is the same as the configuration category of the generator of the autoencoder and the speech truth degree is greater than the preset threshold, outputting the enhanced speech as the second clean speech of the original speech.
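By way of illustration, the gating described above reduces to a two-condition check; the threshold value and names below are assumptions:

```python
def accept_enhanced(category, truth, config_class, threshold=0.8):
    # Output the enhanced speech only if the classifier's category matches
    # the generator's configured class AND the discriminator's truth
    # degree exceeds the preset threshold (threshold is illustrative).
    return category == config_class and truth > threshold

assert accept_enhanced("speaker_a", 0.93, "speaker_a")
assert not accept_enhanced("speaker_a", 0.5, "speaker_a")
assert not accept_enhanced("fan_noise", 0.93, "speaker_a")
```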
The autoencoder of this embodiment may also be given a preset configuration class c. The encoder (E) takes speech x as input and outputs a code z; given the class c, the generated z is of higher quality, that is, more random, because the information already contained in c can be removed from it. The generator (G) takes the code z as input and outputs speech; given the class c, it generates speech belonging to class c. Under normal circumstances, the class output by the classifier of the conditional generative adversarial network should be the same as the configuration class of the generator of the autoencoder.
Fig. 3 is a schematic structural diagram of the autoencoder-conditional generative adversarial network in an embodiment of the present invention. It comprises four components, corresponding to four neural networks, described here:
Component E: the encoder; it takes speech x as input and outputs the code z. If a class c is also given, the generated z is of higher quality, that is, more random, because the information already contained in c can be removed.
Component G: the generator; it takes the code z as input and outputs speech x'. If a class c is also given, it generates speech belonging to class c.
Component C: the classifier; it takes speech x as input and outputs the category c to which x belongs.
Component D: the discriminator; it takes speech x as input and judges its truth degree, 0 or 1.
In this embodiment the autoencoder and the GAN work together to enhance the original speech. G is required to: recover, from the z generated from x, an x' close to x (close at the phoneme level); generate speech that D identifies as true speech; and generate speech that C identifies as belonging to class c.
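The data flow among the four components can be sketched with toy linear stand-ins (the actual networks are convolutional; all dimensions, weights, and the one-hot conditioning scheme below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
DIM_X, DIM_Z, N_CLASSES = 64, 16, 4

# Toy linear stand-ins for encoder E and generator G.
W_e = rng.standard_normal((DIM_X + N_CLASSES, DIM_Z)) * 0.1
W_g = rng.standard_normal((DIM_Z + N_CLASSES, DIM_X)) * 0.1

def one_hot(c):
    v = np.zeros(N_CLASSES)
    v[c] = 1.0
    return v

def E(x, c):
    # Encode speech x, conditioned on class c.
    return np.concatenate([x, one_hot(c)]) @ W_e

def G(z, c):
    # Reconstruct speech from code z, conditioned on class c.
    return np.concatenate([z, one_hot(c)]) @ W_g

x = rng.standard_normal(DIM_X)   # a toy "speech" vector
c = 2                            # configured class
x_hat = G(E(x, c), c)
assert x_hat.shape == x.shape    # x' has the same shape as x
```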
This embodiment proposes a speech enhancement algorithm based on an autoencoding generative adversarial network, which combines an autoencoder with a generative adversarial network (GAN) into a conditional GAN for speech enhancement. In this algorithm, the generator network G performs the enhancement task. The input to the G network is the noisy speech signal x_c together with a latent representation z, and its output is the enhanced signal G(z, x_c). Both the G network and the discriminator network of this design are fully convolutional. To address the limited constraining ability of the discriminator D, two auxiliary terms are added to the loss function of the G network, minimizing the distance between the enhanced speech generated by G and the clean speech signal. In addition, a binary classifier C is trained, with the same structure as D, to distinguish noisy speech from clean speech. The enhanced speech generated by G is also fed into C, and the loss back-propagated from C to G is denoted L_GC; its exact expression appears only as an image in the original text, where x_c is the original speech, G(z, x_c) is the enhanced speech, C(·) is the classifier, and λ_2 is a user-defined hyperparameter, for example λ_2 = 1/2.
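Since the exact expression for L_GC appears only as an image in the original text, the least-squares form below is an assumption, chosen only to show how the classifier score on the enhanced speech and the hyperparameter λ_2 could enter such a loss:

```python
import numpy as np

def l_gc(classifier_scores, lam2=0.5):
    # Assumed auxiliary loss back-propagated from classifier C to
    # generator G: push C's score on the enhanced speech toward the
    # "clean" label 1, weighted by the hyperparameter lambda_2.
    # (The patent's exact formula is not reproduced here.)
    scores = np.asarray(classifier_scores, dtype=float)
    return lam2 * np.mean((scores - 1.0) ** 2)

# When C already rates every output as clean, the loss vanishes.
assert l_gc([1.0, 1.0]) == 0.0
```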
FIG. 4 is a waveform diagram illustrating speech enhancement according to an embodiment of the present invention; the waveform of the clean speech is visibly cleaner and the sound quality clearer.
This embodiment provides a speech enhancement algorithm based on an autoencoding generative adversarial network, combining an autoencoder with a GAN into an autoencoder-conditional GAN for speech enhancement. The input of the model is the original speech signal. Under the joint supervision of the discriminator D and the classifier C, the autoencoder automatically extracts features from the original waveform and learns, in a supervised fashion, the complex mapping between noisy speech and clean speech; its output is the enhanced speech waveform. The algorithm makes full use of the time-domain information of the speech, effectively overcomes the usual neglect of phase, and can effectively improve speech quality and intelligibility.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method according to the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
In this embodiment, a speech enhancement apparatus is further provided, which is used to implement the foregoing embodiments and preferred implementations; details that have already been described are not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 5 is a block diagram of a speech enhancement apparatus according to an embodiment of the present invention, as shown in fig. 5, the apparatus including: an acquisition module 50, an encoding module 52, an output module 54, wherein,
an obtaining module 50, configured to obtain original speech, where the original speech includes a useful audio signal;
an encoding module 52, configured to perform phase encoding on the original speech by using an automatic encoder to obtain an enhanced speech;
and the output module 54 is configured to employ a conditional generation countermeasure network to perform classification and discrimination on the enhanced speech, and output pure speech of the original speech, where the pure speech includes the useful audio signal.
Optionally, the encoding module includes: a first encoding unit, configured to perform a first encoding operation on the original speech by using an encoder, and generate phase information; a second encoding unit for performing a second encoding operation on the phase information with the generator to generate an enhanced speech; wherein the auto-encoder comprises the encoder and the generator.
Optionally, the first encoding unit includes: an extraction subunit, configured to extract a waveform signal of the original speech in a time domain; the transformation subunit is used for carrying out short-time Fourier transform on the waveform signal to generate a frequency domain signal of the original voice; and the acquisition subunit is used for carrying out phase angle acquisition operation on the frequency domain signal, acquiring a phase spectrum of the original voice and generating the phase information based on the phase spectrum.
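For illustration only, the extract → short-time Fourier transform → phase-angle pipeline performed by these subunits can be sketched as follows (frame length, hop size, and window choice are assumed values, not taken from the patent):

```python
import numpy as np

def phase_spectrum(signal, frame_len=256, hop=128):
    # Frame the time-domain waveform, apply a Hann window, take the
    # per-frame FFT (frequency-domain signal), and return the phase
    # spectrum via the phase-angle operation np.angle (radians).
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    return np.angle(spectrum)

sig = np.sin(2 * np.pi * np.arange(1024) / 64.0)  # toy waveform signal
phases = phase_spectrum(sig)
assert phases.shape == (7, 129)                   # (frames, frequency bins)
assert np.all(np.abs(phases) <= np.pi)
```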
Optionally, the second encoding unit includes: a determining subunit, configured to determine a sensitive phase interval of the generator, wherein the sensitive phase interval is a phase interval of a cochlear perceived sound; a filtering subunit, configured to filter the phase information using the sensitive phase interval to obtain an intermediate phase spectrum, where the intermediate phase spectrum includes a first phase spectrum and a second phase spectrum, where the first phase spectrum corresponds to the interfering audio signal, and the second phase spectrum corresponds to the useful audio signal, where the original speech further includes the interfering audio signal; the acquiring subunit is configured to acquire a target feature spectrum of the original speech, where the target feature spectrum includes a magnitude spectrum and/or an energy spectrum corresponding to a vocal cord of a specified object; a generating subunit, configured to generate an enhanced speech by using the intermediate phase spectrum and the target feature spectrum coding.
Optionally, the obtaining subunit is further configured to: analyzing a fixed tone map in the original voice, and determining the fixed tone map as a tone map of the useful audio signal; intercepting reference voice corresponding to the fixed tone map from a voice section of the original voice; and filtering out fundamental tone components from the reference voice, and extracting the amplitude spectrum and/or the energy spectrum corresponding to the vocal cords of the specified object from the fundamental tone components.
Optionally, the output module includes: an input unit, for inputting the enhanced voice into the classifier and judger of the condition generation countermeasure network synchronously; the first obtaining unit is used for obtaining a plurality of voice categories output by the classifier and obtaining voice truth output by the judger, wherein one voice category corresponds to one voice truth; a first output unit, configured to determine a speech category corresponding to the maximum speech truth degree as a target speech category, select a time domain component of the target speech category from the enhanced speech, and output the time domain component as a first pure speech of the original speech, where the pure speech includes the first pure speech.
Optionally, the output module includes: an input unit for synchronously inputting the enhanced voice into the classifier and judger of the condition generation countermeasure network; a second obtaining unit, configured to obtain a voice category output by the classifier, and obtain a voice truth degree output by the judger; the judging unit is used for judging whether the voice category is the same as the configuration category of a generator of the automatic encoder and whether the voice truth is greater than a preset threshold value, wherein the configuration category is used for indicating the signal category of the useful audio signal; and the second output unit is used for outputting the enhanced voice to be second pure voice of the original voice if the voice type is the same as the configuration type of the generator of the automatic encoder and the voice truth is greater than a preset threshold value, wherein the pure voice comprises the second pure voice.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 3
Fig. 6 is a structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 6, the electronic device includes a processor 61, a communication interface 62, a memory 63, and a communication bus 64, where the processor 61, the communication interface 62, and the memory 63 complete mutual communication through the communication bus 64, and the memory 63 is used for storing a computer program; the processor 61 is configured to implement the following steps when executing the program stored in the memory 63: acquiring original voice, wherein the original voice comprises a useful audio signal; carrying out phase coding on the original voice by adopting an automatic encoder to obtain enhanced voice; and adopting a conditional generation countermeasure network to classify and judge the enhanced voice, and outputting pure voice of the original voice, wherein the pure voice comprises the useful audio signal.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other devices.
The Memory may include a Random Access Memory (RAM), and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the Integrated Circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided by the present application, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the speech enhancement method described in any of the above embodiments.
In yet another embodiment provided herein, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the speech enhancement method of any of the above embodiments.
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), among others.
The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the scope of protection of the present application.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A method for speech enhancement, comprising:
acquiring original voice, wherein the original voice comprises a useful audio signal and also comprises an interference audio signal;
carrying out phase coding on the original voice by adopting an automatic encoder to obtain enhanced voice, wherein the automatic encoder comprises an encoder and a generator;
adopting a conditional generation countermeasure network to classify and judge the enhanced voice and outputting pure voice of the original voice, wherein the pure voice comprises the useful audio signal;
wherein the performing phase encoding on the original voice by adopting an automatic encoder to obtain an enhanced voice includes: performing a first encoding operation on the original voice by using an encoder to generate phase information; performing a second encoding operation on the phase information with a generator to generate an enhanced speech;
performing, with the generator, a second encoding operation on the phase information, the generating of the enhanced speech including: determining a sensitive phase interval of the generator, wherein the sensitive phase interval is a phase interval of a cochlear perceived sound; filtering the phase information by using the sensitive phase interval to obtain an intermediate phase spectrum, wherein the intermediate phase spectrum comprises a first phase spectrum and a second phase spectrum, the first phase spectrum corresponds to the interference audio signal, and the second phase spectrum corresponds to the useful audio signal; and acquiring a target characteristic spectrum of the original voice, and encoding by adopting the intermediate phase spectrum and the target characteristic spectrum to generate enhanced voice.
2. The method of claim 1, wherein performing a first encoding operation on the original speech using an encoder, and wherein generating phase information comprises:
extracting a waveform signal of the original voice in a time domain;
performing short-time Fourier transform on the waveform signal to generate a frequency domain signal of the original voice;
and carrying out phase angle acquisition operation on the frequency domain signal to acquire a phase spectrum of the original voice, and generating the phase information based on the phase spectrum.
3. The method of claim 1, wherein the obtaining the target feature spectrum of the original speech comprises:
analyzing a fixed tone map in the original voice, and determining the fixed tone map as a tone map of the useful audio signal;
intercepting reference voice corresponding to the fixed tone map from a voice section of the original voice;
filtering out fundamental tone components from the reference voice, and extracting a magnitude spectrum and/or an energy spectrum corresponding to a vocal cord of a specified object from the fundamental tone components;
and generating the target characteristic spectrum based on the amplitude spectrum and/or the energy spectrum corresponding to the vocal cords of the specified object.
4. The method of claim 1, wherein the clean speech comprises a first clean speech; the adopting a conditional generation countermeasure network to classify and judge the enhanced voice, and outputting the pure voice of the original voice comprises:
synchronously inputting the enhanced voice into a classifier and a judger of the condition generation countermeasure network;
acquiring a plurality of voice categories output by the classifier and voice truth degrees output by the judger, wherein one voice category corresponds to one voice truth degree;
and determining the voice category corresponding to the maximum voice truth as a target voice category, selecting a time domain component of the target voice category from the enhanced voice, and outputting the time domain component as the first pure voice of the original voice.
5. The method of claim 1, wherein the clean speech comprises a second clean speech; the adopting the conditional generation countermeasure network to classify and judge the enhanced speech, and outputting the pure speech of the original speech comprises:
synchronously inputting the enhanced voice into a classifier and a judger of the condition generation countermeasure network;
acquiring the voice category output by the classifier and the voice truth output by the judger;
judging whether the voice category is the same as a configuration category of a generator of the automatic encoder and whether the voice truth is greater than a preset threshold value, wherein the configuration category is used for indicating a signal category of the useful audio signal;
and if the voice type is the same as the configuration type of a generator of the automatic encoder and the voice truth degree is greater than a preset threshold value, outputting the enhanced voice as a second pure voice of the original voice.
6. A speech enhancement apparatus, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring original voice, and the original voice comprises a useful audio signal and an interference audio signal;
the coding module is used for carrying out phase coding on the original voice by adopting an automatic coder to obtain enhanced voice;
the output module is used for adopting a conditional generation countermeasure network to classify and judge the enhanced voice and outputting pure voice of the original voice, wherein the pure voice comprises the useful audio signal;
the encoding module includes: a first encoding unit, configured to perform a first encoding operation on the original speech by using an encoder, and generate phase information; a second encoding unit for performing a second encoding operation on the phase information with the generator to generate an enhanced speech; wherein the auto-encoder comprises the encoder and the generator;
the second encoding unit includes: a determining subunit, configured to determine a sensitive phase interval of the generator, wherein the sensitive phase interval is a phase interval of a cochlear perceived sound; a filtering subunit, configured to filter the phase information using the sensitive phase interval to obtain an intermediate phase spectrum, where the intermediate phase spectrum includes a first phase spectrum and a second phase spectrum, where the first phase spectrum corresponds to the interfering audio signal, and the second phase spectrum corresponds to the useful audio signal, where the original speech further includes the interfering audio signal; the acquiring subunit is configured to acquire a target feature spectrum of the original speech, where the target feature spectrum includes a magnitude spectrum and/or an energy spectrum corresponding to a vocal cord of a specified object; and the generating subunit is used for generating enhanced voice by adopting the intermediate phase spectrum and the target characteristic spectrum code.
7. A computer-readable storage medium, comprising a stored program, wherein the program is operative to perform the method steps of any of the preceding claims 1 to 5.
8. An electronic device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus; wherein:
a memory for storing a computer program;
a processor for performing the method steps of any of claims 1 to 5 by executing a program stored on a memory.
CN202211317370.9A 2022-10-26 2022-10-26 Voice enhancement method and device, storage medium and electronic equipment Active CN115376501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211317370.9A CN115376501B (en) 2022-10-26 2022-10-26 Voice enhancement method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN115376501A CN115376501A (en) 2022-11-22
CN115376501B true CN115376501B (en) 2023-02-14

Family

ID=84072603


Country Status (1)

Country Link
CN (1) CN115376501B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390950A (en) * 2019-08-17 2019-10-29 杭州派尼澳电子科技有限公司 A kind of end-to-end speech Enhancement Method based on generation confrontation network
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN110739002A (en) * 2019-10-16 2020-01-31 中山大学 Complex domain speech enhancement method, system and medium based on generation countermeasure network
CN110781433A (en) * 2019-10-11 2020-02-11 腾讯科技(深圳)有限公司 Data type determination method and device, storage medium and electronic device
CN111653288A (en) * 2020-06-18 2020-09-11 南京大学 Target person voice enhancement method based on conditional variation self-encoder

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147804A (en) * 2018-06-05 2019-01-04 安克创新科技股份有限公司 A kind of acoustic feature processing method and system based on deep learning
ES2928295T3 (en) * 2020-02-14 2022-11-16 System One Noc & Dev Solutions S A Method for improving telephone voice signals based on convolutional neural networks
US20210012767A1 (en) * 2020-09-25 2021-01-14 Intel Corporation Real-time dynamic noise reduction using convolutional networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Generative adversarial network speech enhancement method based on a deep fully convolutional neural network
CN110390950A (en) * 2019-08-17 2019-10-29 杭州派尼澳电子科技有限公司 An end-to-end speech enhancement method based on generative adversarial networks
CN110781433A (en) * 2019-10-11 2020-02-11 腾讯科技(深圳)有限公司 Data type determination method and device, storage medium and electronic device
CN110739002A (en) * 2019-10-16 2020-01-31 中山大学 Complex-domain speech enhancement method, system and medium based on generative adversarial networks
CN111653288A (en) * 2020-06-18 2020-09-11 南京大学 Target-speaker voice enhancement method based on a conditional variational autoencoder

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xu Chunqiu et al.; "Research on speech enhancement methods based on autoencoders and generative adversarial networks"; Computer Engineering and Design (《计算机工程与设计》); 2019-09-16; Vol. 49, No. 9; pp. 2578-2583 *

Also Published As

Publication number Publication date
CN115376501A (en) 2022-11-22

Similar Documents

Publication Publication Date Title
KR100643310B1 (en) Method and apparatus for disturbing voice data using a disturbing signal whose formant is similar to that of the voice signal
US10841688B2 (en) Annoyance noise suppression
EP3005362B1 (en) Apparatus and method for improving a perception of a sound signal
CN112185410B (en) Audio processing method and device
CN109493883A (en) Audio time-delay calculation method and apparatus for a smart device, and the smart device
CN107342097A (en) Recording method, recording device, intelligent terminal and computer-readable storage medium
CN110992967A (en) Voice signal processing method and device, hearing aid and storage medium
CN111385688A (en) Active noise reduction method, device and system based on deep learning
CN110428835A (en) Adjusting method and device for a voice device, storage medium, and voice device
CN114333865A (en) Model training and tone conversion method, device, equipment and medium
CN112614504A (en) Single-channel speech noise reduction method, system, equipment and readable storage medium
CN114338623A (en) Audio processing method, device, equipment, medium and computer program product
CN113949955A (en) Noise reduction processing method and device, electronic equipment, earphone and storage medium
CN110232909A (en) Audio processing method, device, equipment and readable storage medium
US11996114B2 (en) End-to-end time-domain multitask learning for ML-based speech enhancement
KR20200028852A (en) Method and apparatus for blind signal separation, and electronic device
CN116132875B (en) Multi-mode intelligent control method, system and storage medium for hearing-aid earphone
CN115376501B (en) Voice enhancement method and device, storage medium and electronic equipment
US11551707B2 (en) Speech processing method, information device, and computer program product
Li et al. Speech enhancement algorithm based on sound source localization and scene matching for binaural digital hearing aids
CN115223584B (en) Audio data processing method, device, equipment and storage medium
US20230209283A1 (en) Method for audio signal processing on a hearing system, hearing system and neural network for audio signal processing
CN108899041B (en) Voice signal noise adding method, device and storage medium
CN111009259A (en) Audio processing method and device
CN107197403A (en) Terminal audio parameter management method, apparatus and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant