CN112382282B - Voice denoising processing method and device, electronic equipment and storage medium - Google Patents


Publication number
CN112382282B
Authority
CN
China
Prior art keywords
audio
noise
voice data
frequency
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011232882.6A
Other languages
Chinese (zh)
Other versions
CN112382282A (en)
Inventor
穆文斌
李晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing 58 Information Technology Co Ltd
Original Assignee
Beijing 58 Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing 58 Information Technology Co Ltd
Priority to CN202011232882.6A
Publication of CN112382282A
Application granted
Publication of CN112382282B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering

Abstract

The invention provides a voice denoising processing method, a voice denoising processing device, electronic equipment and a storage medium. The method comprises the following steps: acquiring original voice data to be processed; acquiring audio segmentation points of the original voice data according to the frequency distribution of the original voice data, and performing audio segmentation on the original voice data at the audio segmentation points to obtain a plurality of audio segments; determining, according to the frequency variation waveform of each audio segment, the target audio segments in the original voice data that do not belong to noise; and splicing the target audio segments in the order in which they occur in the original voice data to obtain the denoised voice data. The method can effectively improve the precision of noise identification during denoising, and thereby the accuracy of the speaker recognition result.

Description

Voice denoising processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech denoising processing method and apparatus, an electronic device, and a storage medium.
Background
Voice recognition is an important means of identifying a speaker. Existing speaker voiceprint identification acquires the speaker's voice data, denoises it, extracts voice features, and performs recognition through a preset voice recognition model. The accuracy of the voice denoising result is therefore closely related to the accuracy of the voice recognition result.
In the related art, the complete corpus data is generally used directly for model training of speech recognition. However, since the corpus data may contain noise data of non-target speakers, the recognition accuracy of the training model based on such corpus data is affected.
Disclosure of Invention
The embodiment of the invention provides a voice denoising processing method and device, electronic equipment and a storage medium, and aims to solve the problems that the accuracy of the existing voice denoising processing mode is poor and the accuracy of a voice recognition result is influenced.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a speech denoising processing method, including:
acquiring original voice data to be processed;
acquiring audio segmentation points of the original voice data according to the frequency distribution condition of the original voice data, and performing audio segmentation on the original voice data according to the audio segmentation points to obtain a plurality of audio segments;
determining a target audio segment which does not belong to noise in the original voice data according to the frequency variation waveform of each audio segment;
and sequentially splicing each target audio segment according to the time sequence of each target audio segment in the original voice data to obtain the denoised voice data.
Optionally, the step of obtaining an audio segmentation point of the original speech data according to a frequency distribution of the original speech data, and performing audio segmentation on the original speech data according to the audio segmentation point to obtain a plurality of audio segments includes:
acquiring a reference voice section of which the frequency is lower than a first frequency value and/or higher than a second frequency value in the original voice data according to the frequency distribution condition of the original voice data;
and taking the position of the central point of each reference voice segment in the original voice data as an audio segmentation point, and performing audio segmentation on the original voice data according to the audio segmentation point to obtain a plurality of audio segments.
Optionally, the first frequency value is 20 hertz and the second frequency value is 20000 hertz.
Optionally, the step of determining a target audio segment not belonging to noise in the original speech data according to the frequency variation waveform of each audio segment includes:
identifying each audio segment through a preset noise identification model to obtain the audio segments belonging to noise in the audio segments;
and/or acquiring an audio segment belonging to noise in the audio segment according to the similarity between the frequency variation waveform of each audio segment and the frequency variation waveform of each noise sample in a preset noise sample set;
wherein the noise identification model is a machine learning model trained by a plurality of audio samples determined to be noise.
Optionally, the noise identification model includes at least one of a first noise identification model for identifying an environmental background sound and a second noise identification model for identifying a color ring back tone, where the first noise identification model is obtained by training audio samples of a plurality of environmental background sounds, and the second noise identification model is obtained by training audio samples of a plurality of color ring back tones; and the noise sample set comprises audio samples of a plurality of prompt voices as the noise samples.
In a second aspect, an embodiment of the present invention provides a speech denoising processing apparatus, including:
the voice data acquisition module is used for acquiring original voice data to be processed;
the voice audio cutting module is used for acquiring audio cutting points of the original voice data according to the frequency distribution condition of the original voice data and performing audio cutting on the original voice data according to the audio cutting points to obtain a plurality of audio segments;
the noise identification module is used for determining a target audio segment which does not belong to noise in the original voice data according to the frequency variation waveform of each audio segment;
and the voice splicing module is used for sequentially splicing each target audio segment according to the time sequence of each target audio segment in the original voice data to obtain the denoised voice data.
Optionally, the voice audio cutting module includes:
the reference voice segment obtaining submodule is used for obtaining, according to the frequency distribution condition of the original voice data, a reference voice segment of which the frequency is lower than a first frequency value and/or higher than a second frequency value in the original voice data;
and the voice audio cutting submodule is used for taking the position of the central point of each reference voice segment in the original voice data as an audio cutting point and carrying out audio cutting on the original voice data according to the audio cutting point to obtain a plurality of audio segments.
Optionally, the first frequency value is 20 hertz and the second frequency value is 20000 hertz.
Optionally, the noise identification module includes:
the first noise identification submodule is used for identifying each audio segment through a preset noise identification model to obtain the audio segments belonging to noise in the audio segments;
and/or the second noise identification submodule is used for acquiring the audio segments belonging to the noise in the audio segments according to the similarity between the frequency variation waveform of each audio segment and the frequency variation waveform of each noise sample in a preset noise sample set;
wherein the noise identification model is a machine learning model trained by a plurality of audio samples determined to be noise.
Optionally, the noise identification model includes at least one of a first noise identification model for identifying an environmental background sound and a second noise identification model for identifying a color ring back tone, where the first noise identification model is obtained by training audio samples of a plurality of environmental background sounds, and the second noise identification model is obtained by training audio samples of a plurality of color ring back tones; and the noise sample set comprises audio samples of a plurality of prompt voices as the noise samples.
In a third aspect, an embodiment of the present invention additionally provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the speech denoising processing method according to the first aspect.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the steps of the speech denoising processing method according to the first aspect.
In the embodiment of the invention, the original voice data is cut into audio segments, the segments that are not the target speaker's voice, namely the noise, are located and removed, and the remaining audio segments are spliced together again, thereby denoising the voice data. This can effectively improve the precision of noise identification during denoising, and further improve the accuracy of the speaker recognition result.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without inventive labor.
FIG. 1 is a flowchart illustrating a speech denoising method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of another speech denoising method according to an embodiment of the present invention;
FIG. 3 is a waveform distribution diagram of original voice data containing a color ring in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech denoising processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another speech denoising processing apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a hardware structure of an electronic device in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of steps of a speech denoising processing method according to an embodiment of the present invention is shown.
Step 110, obtaining original voice data to be processed.
Step 120, acquiring an audio segmentation point of the original voice data according to the frequency distribution condition of the original voice data, and performing audio segmentation on the original voice data according to the audio segmentation point to obtain a plurality of audio segments.
Step 130, determining a target audio segment which does not belong to noise in the original voice data according to the frequency variation waveform of each audio segment.
Step 140, sequentially splicing each target audio segment according to the time sequence of each target audio segment in the original voice data to obtain the denoised voice data.
In the embodiment of the present invention, the original voice data to be processed may be obtained in any available manner, and the embodiment of the present invention is not limited thereto. For example, in an application scenario of identifying the speaker in a recording of a calling party, the source of the original voice data may be call recording data of a call service platform such as a call forwarding platform. The original voice data may also be recording data stored in a recording apparatus, or the like. In different application scenarios, the specific content of the original voice data and the manner of acquiring it can be customized as required, and the embodiment of the present invention is not limited thereto.
In practical applications, the obtained original voice data is generally a continuous recording. The speaker may pause briefly during the recording, and the noise and speech at different times may vary; that is, the frequency of the original voice data may change continuously, and the noise at different positions in the original voice data also affects its frequency distribution to some extent. Therefore, in the embodiment of the present invention, the original voice data may be segmented into a plurality of audio segments, and each audio segment is then examined to determine whether it is noise.
Moreover, when the audio segmentation is performed on the original voice data, the audio segmentation point of the original voice data can be obtained according to the frequency distribution condition of the original voice data, and the condition that the frequency at the audio segmentation point needs to meet can be set in a user-defined manner according to the requirement, which is not limited in the embodiment of the present invention.
For example, each inflection point at which the frequency trend changes from falling to rising may be taken as an audio segmentation point; or, each reference speech segment in the original speech data whose frequency lies within a specified frequency range may be obtained, and then the center point of each reference speech segment, or the position of the lowest or highest frequency in each reference speech segment, is taken as an audio segmentation point; or, a time length may be specified as a time unit and the original voice data cut by that time unit starting from its initial position, so that the time distance between two consecutive audio segmentation points is one time unit; and so on.
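Two of the options listed above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the recording has already been reduced to one frequency estimate per frame (the list `freq_track`, a hypothetical name), and the function names are likewise assumptions.

```python
from typing import List


def valley_split_points(freq_track: List[float]) -> List[int]:
    """Return frame indices where the frequency stops falling and starts rising.

    Only strict local minima are reported; flat plateaus are ignored
    in this sketch.
    """
    points = []
    for i in range(1, len(freq_track) - 1):
        falling_into = freq_track[i] < freq_track[i - 1]
        rising_out = freq_track[i] < freq_track[i + 1]
        if falling_into and rising_out:
            points.append(i)
    return points


def fixed_unit_split_points(n_frames: int, unit: int) -> List[int]:
    """Return split points spaced `unit` frames apart, counted from frame 0."""
    if unit <= 0:
        raise ValueError("unit must be a positive number of frames")
    return list(range(unit, n_frames, unit))
```

For instance, `valley_split_points([5, 3, 1, 2, 4, 3, 2, 6])` reports the two valleys at frames 2 and 6, while `fixed_unit_split_points(10, 4)` places split points every four frames.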
After the audio segmentation points are determined, the original speech data may be audio segmented according to the audio segmentation points, and the original speech data may be segmented into a plurality of audio segments. Furthermore, in the embodiment of the present invention, the original voice data may be subjected to audio segmentation in any available manner, and the embodiment of the present invention is not limited thereto.
In practical application, environmental noise, call prompt tone, polyphonic ringtone recording played when a telephone is not connected, and the like all belong to noise, frequency variation waveforms of different types of noise have certain difference relative to the frequency variation waveform of a speaker voice, and generally speaking, the frequency variation waveforms of different types of noise also have certain characteristics. Therefore, in the embodiment of the present invention, it is possible to determine whether each audio segment belongs to noise according to the frequency variation waveform of each audio segment, thereby obtaining a target audio segment that does not belong to noise in the original voice data. The waveform characteristics of the frequency variations of different noises can be obtained in any available manner, so as to classify each audio segment and determine whether the audio segment is noise, which is not limited in this embodiment of the present invention.
And sequentially splicing each target audio segment according to the time sequence of each target audio segment in the original voice data to obtain the denoised voice data. The target audio segment may be spliced in any available manner, and the embodiment of the present invention is not limited thereto.
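Steps 110 to 140 can be summarized in a short sketch. It assumes the audio is a plain sequence of samples, that the segmentation points are sample indices, and that some noise test `is_noise` is supplied by the caller; all three names are hypothetical and stand in for whichever concrete manner an embodiment uses.

```python
from typing import Callable, List, Sequence


def cut(samples: Sequence[float], split_points: Sequence[int]) -> List[List[float]]:
    """Cut `samples` at the given indices into consecutive, non-empty segments."""
    inner = sorted({p for p in split_points if 0 < p < len(samples)})
    bounds = [0] + inner + [len(samples)]
    return [list(samples[a:b]) for a, b in zip(bounds, bounds[1:]) if b > a]


def denoise(
    samples: Sequence[float],
    split_points: Sequence[int],
    is_noise: Callable[[List[float]], bool],
) -> List[float]:
    """Drop the segments judged to be noise and splice the rest in time order."""
    segments = cut(samples, split_points)
    targets = [seg for seg in segments if not is_noise(seg)]
    # `cut` yields segments in their original order, so concatenating the
    # survivors preserves the time sequence of the original recording.
    return [x for seg in targets for x in seg]
```

Because the segments are produced in time order and filtering does not reorder them, no explicit sort is needed before splicing.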
Referring to fig. 2, in the embodiment of the present invention, the step 120 may further include:
step 121, obtaining a reference voice segment with a frequency lower than a first frequency value and/or a frequency higher than a second frequency value in the original voice data according to the frequency distribution condition of the original voice data;
and step 122, taking the position of the central point of each reference voice segment in the original voice data as an audio segmentation point, and performing audio segmentation on the original voice data according to the audio segmentation point to obtain a plurality of audio segments.
In practical application, during the recording or production of voice data, a short pause generally occurs between noise and the speaker's voice and between different noises, during which human ears hear nothing or cannot hear clearly, so that each pause can be taken as an audio segmentation point. Moreover, when the audio frequency is lower than a certain value or higher than a certain value, the audibility of the corresponding audio is affected, and human ears likewise cannot hear it clearly or at all.
Therefore, in the embodiment of the present invention, when determining the audio segmentation point, reference speech segments with frequencies lower than a first frequency value and/or frequencies higher than a second frequency value in the original speech data may be obtained according to a frequency distribution condition of the original speech data, and a position of a center point of each of the reference speech segments in the original speech data is further used as an audio segmentation point, and the original speech data is subjected to audio segmentation according to the audio segmentation point to obtain a plurality of audio segments. Of course, in the embodiment of the present invention, the position of any point in each reference speech segment in the original speech data may also be used as an audio segmentation point, for example, the start position or the end position in each reference speech segment may be used as an audio segmentation point, or the lowest frequency position in the reference speech segment with a frequency lower than the first frequency value may also be used as an audio segmentation point, the highest frequency position in the reference speech segment with a frequency higher than the second frequency value may also be used as an audio segmentation point, and so on; or, each reference speech segment may be directly used as an audio segmentation point, which is not limited in the embodiment of the present invention.
Fig. 3 shows a waveform distribution diagram of an original voice data containing a color ring, where a first audio segment is a location of a color ring audio, each black mark location is an audio division point, and a frequency at each audio division point is lower than a first frequency value.
Optionally, in an embodiment of the present invention, the first frequency value is 20 Hz, and the second frequency value is 20000 Hz.
In practical applications, the human ear can hear audio signals from 20 Hz (hertz) to 20 kHz (kilohertz), so that when performing speech recognition, the first frequency value can be set to 20 Hz and the second frequency value can be set to 20000 Hz, in keeping with the range audible to the human ear. Of course, in the embodiment of the present invention, specific values of the first frequency value and the second frequency value may also be set in a user-defined manner according to requirements in a specific application scenario, which is not limited in the embodiment of the present invention.
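Steps 121 and 122 can be illustrated with a small sketch. It assumes a per-frame frequency estimate is already available (`freq_track`, a hypothetical input; the patent does not say how the frequency is estimated) and uses the optional 20 Hz and 20000 Hz values as defaults.

```python
from typing import List, Sequence, Tuple


def reference_segments(
    freq_track: Sequence[float], low: float = 20.0, high: float = 20000.0
) -> List[Tuple[int, int]]:
    """Return [start, end) runs of frames whose frequency is below `low` or above `high`."""
    runs = []
    start = None
    for i, f in enumerate(freq_track):
        out_of_range = f < low or f > high
        if out_of_range and start is None:
            start = i  # a new reference segment begins here
        elif not out_of_range and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(freq_track)))  # run extends to the end
    return runs


def center_split_points(
    freq_track: Sequence[float], low: float = 20.0, high: float = 20000.0
) -> List[int]:
    """Use the center frame of each reference segment as an audio segmentation point."""
    return [(s + e) // 2 for s, e in reference_segments(freq_track, low, high)]
```

Swapping `(s + e) // 2` for `s` or `e - 1` would give the start-position or end-position variants that the description also allows.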
Referring to fig. 2, in an embodiment of the present invention, the step 130 may further include:
step 131, recognizing each audio segment through a preset noise recognition model to obtain an audio segment belonging to noise in the audio segments, wherein the noise recognition model is a machine learning model obtained through training of a plurality of audio samples determined to be noise;
and/or,
and step 132, acquiring the audio segments belonging to the noise in the audio segments according to the similarity between the frequency variation waveform of each audio segment and the frequency variation waveform of each noise sample in a preset noise sample set.
In practical application, in order to classify each audio segment and determine whether the audio segment is noise, at least one noise recognition model for recognizing noise may be trained in advance, and each audio segment may be further recognized by the noise recognition model to obtain an audio segment belonging to noise in the audio segment.
The noise recognition model may be any machine learning model, and may be trained by a plurality of audio samples determined to be noise. In addition, since the noise may be further subdivided into various types of noise, such as environmental noise, color ring noise, and prompt tone noise, in order to improve the accuracy of the noise identification model, different types of noise may be identified, and when the noise identification model is trained, the adopted audio samples determined as noise may include audio samples in different noise types, for example, the audio samples determined as environmental noise, the audio samples determined as color ring noise, the audio samples determined as prompt tone noise, and the like.
As described above, the following noise recordings generally occur in the recorded data: 1. the operator's prompt recording played when the telephone cannot be connected; 2. the color ring back tone recording played when the telephone is not connected. There may also be environmental background noise. Such noise cannot be removed from the recording by methods such as VAD (voice activity detection).
In addition, in practical applications, selectable polyphonic ringtone, prompt tone, etc. are generally limited, and the types of noises (such as wind noise, rain noise, car noise, animal cry, non-talking voice, etc.) existing in the environment are generally limited, so that a noise sample set including at least one noise sample can be constructed according to the noises that may occur. And further, according to the similarity between the frequency variation waveform of each audio segment and the frequency variation waveform of each noise sample in a preset noise sample set, the audio segment belonging to the noise in the audio segment can be obtained. For example, for any audio segment, if the similarity between the frequency variation waveform of at least one noise sample in the noise sample set and the frequency variation waveform of the audio segment is higher than a preset value (e.g. 0.9, 0.8, etc.), the audio segment may be determined to belong to the noise, and if the similarity between the frequency variation waveform of the noise sample in the noise sample set and the frequency variation waveform of the audio segment is not higher than the preset value, the audio segment may be determined not to belong to the noise.
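The threshold test described above might look like the following sketch. The patent does not name a similarity measure, so the Pearson correlation used here is an assumption, as are the function names and the requirement that both frequency contours have already been resampled to the same length.

```python
import math
from typing import Sequence


def contour_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length frequency contours (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("contours must have the same length")
    if not a:
        return 0.0
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    dot = sum(x * y for x, y in zip(da, db))
    norm = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    # A flat contour has zero variance; treat it as dissimilar to everything.
    return dot / norm if norm > 0 else 0.0


def is_noise_by_samples(
    contour: Sequence[float],
    noise_samples: Sequence[Sequence[float]],
    threshold: float = 0.9,
) -> bool:
    """A segment is noise if at least one noise sample exceeds the preset similarity."""
    return any(contour_similarity(contour, s) > threshold for s in noise_samples)
```

The default of 0.9 mirrors one of the example preset values in the text; a deployment would tune it against labeled recordings.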
Moreover, in the embodiment of the present invention, when two manners of the above-mentioned steps 131 and 132 are simultaneously adopted to determine whether each audio segment belongs to noise, if the determination results of the two manners are not consistent for the same audio segment, the priorities of the two manners may be preset, and the determination result of the manner with the higher priority is the final determination result of the corresponding audio segment, or a third manner is additionally provided, and in case that the determination results of the two manners are not consistent, the third manner is used to perform the re-determination, and the determination result of the third manner is the final determination result of the corresponding audio segment; and so on. The third determination method may be set by self-definition according to requirements, and the embodiment of the present invention is not limited thereto. For example, the third determination method may be a manual determination method, or a machine learning model different from the noise recognition model, which is not limited in this embodiment of the present invention.
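The two conflict-handling options above can be sketched in a single function. The names `prefer` and `third` are hypothetical, and the rule that a supplied third manner takes precedence over the preset priority is an assumption made only for this illustration.

```python
from typing import Callable, Optional


def resolve(
    model_says_noise: bool,
    samples_say_noise: bool,
    prefer: str = "model",
    third: Optional[Callable[[], bool]] = None,
) -> bool:
    """Combine the results of step 131 (model) and step 132 (sample similarity)."""
    if model_says_noise == samples_say_noise:
        return model_says_noise  # both manners agree, nothing to resolve
    if third is not None:
        # Assumed here: a supplied third manner (for example a manual check
        # or a different model) decides whenever the two results disagree.
        return third()
    if prefer == "model":
        return model_says_noise
    if prefer == "samples":
        return samples_say_noise
    raise ValueError("prefer must be 'model' or 'samples'")
```

Passing the third manner as a callable means it only runs when the two results actually disagree, which matters if it is expensive, as a manual review would be.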
Optionally, in the embodiment of the present invention, the noise identification model includes at least one of a first noise identification model for identifying an environmental background sound and a second noise identification model for identifying a color ring back tone, where the first noise identification model is obtained by training audio samples of a plurality of environmental background sounds, and the second noise identification model is obtained by training audio samples of a plurality of color ring back tones; and audio samples containing a plurality of prompt voices in the noise sample set are used as the noise samples.
In practical application, compared with prompt tones, the selectable range of color rings is wider and environmental noise is more variable. If the noise sample set were built for color rings and environmental noise, then to keep the set complete, all selectable color rings and all possible environmental noises would need to be considered, making the data volume of the noise sample set large. Computing the similarity between the frequency variation waveform of an audio segment and the frequency variation waveform of each noise sample in the preset noise sample set would then require relatively more calculation, which easily affects the processing efficiency of the denoising process.
Therefore, in the embodiment of the present invention, in order to further improve the processing efficiency of the denoising process, the environmental background sound and the polyphonic ringtone audio in the audio segments, that is, the environmental noise and the polyphonic ringtone noise, may be recognized through the noise recognition model, while the prompt voice, that is, the prompt tone noise, in the audio segments is recognized through the noise sample set that contains audio samples of a plurality of prompt voices as the noise samples. Moreover, in order to improve the accuracy of the recognition results, two models can be trained separately, namely a first noise recognition model for recognizing the environmental background sound and a second noise recognition model for recognizing the polyphonic ringtone audio. Alternatively, a single noise recognition model can be set as required, being either the first noise recognition model or the second noise recognition model. The first noise recognition model is obtained by training on audio samples of a plurality of environmental background sounds, and the second noise recognition model is obtained by training on audio samples of a plurality of polyphonic ringtone audios.
Then, when determining the target audio segments in the original voice data that do not belong to noise, it may be determined for each audio segment, through the first noise recognition model and the second noise recognition model, whether the audio segment is an environmental background sound or polyphonic ringtone audio; meanwhile, it may also be determined whether the audio segment is a prompt voice by obtaining the similarity between the audio segment and the audio sample of each prompt voice in the noise sample set. In this way, it can be accurately determined whether the audio segment is noise and, further, what type of noise it is.
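The division of labor between the two models and the prompt-voice sample set can be sketched as below. The three checks are passed in as plain callables standing in for the first noise recognition model, the second noise recognition model, and the similarity test against the prompt-voice samples; their names and the order in which they are tried are assumptions of this sketch.

```python
from typing import Callable, List, Optional, Sequence

Segment = Sequence[float]
Check = Callable[[Segment], bool]


def noise_type(
    segment: Segment, is_background: Check, is_ring_tone: Check, matches_prompt: Check
) -> Optional[str]:
    """Return the kind of noise a segment is, or None if it is not noise."""
    if is_background(segment):  # stand-in for the first noise recognition model
        return "environmental background sound"
    if is_ring_tone(segment):  # stand-in for the second noise recognition model
        return "color ring back tone"
    if matches_prompt(segment):  # stand-in for the sample-set similarity test
        return "prompt voice"
    return None


def target_segments(
    segments: Sequence[Segment],
    is_background: Check,
    is_ring_tone: Check,
    matches_prompt: Check,
) -> List[Segment]:
    """Keep only the segments that none of the three checks flags as noise."""
    return [
        s
        for s in segments
        if noise_type(s, is_background, is_ring_tone, matches_prompt) is None
    ]
```

Returning the noise type rather than a bare boolean keeps the extra information the paragraph mentions, namely which kind of noise a rejected segment was.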
In the embodiment of the invention, the original voice data is cut, the portions of the cut audio segments that are not the target speaker's voice, namely the noise portions, are located and removed, and the remaining audio segments are spliced back together, thereby denoising the voice data. This can effectively improve the precision of noise removal during denoising and, in turn, the accuracy of the speaker recognition result.
Fig. 4 is a schematic structural diagram illustrating a speech denoising processing apparatus according to an embodiment of the present invention.
The speech denoising processing device of the embodiment of the invention comprises: a voice data acquisition module 210, a voice audio cutting module 220, a noise recognition module 230, and a voice splicing module 240.
The functions of the modules and the interaction relationship between the modules are described in detail below.
a voice data acquisition module 210, configured to obtain original voice data to be processed;
a voice audio cutting module 220, configured to obtain audio segmentation points of the original voice data according to a frequency distribution condition of the original voice data, and perform audio segmentation on the original voice data according to the audio segmentation points to obtain a plurality of audio segments;
a noise identification module 230, configured to determine, according to the frequency variation waveform of each audio segment, a target audio segment that does not belong to noise in the original speech data;
and the voice splicing module 240 is configured to splice each target audio segment in sequence according to the time sequence of each target audio segment in the original voice data, so as to obtain the denoised voice data.
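The interaction of modules 220 to 240 can be sketched as follows. This is only an illustration under stated assumptions: the `Segment` type, the function names `cut` and `denoise`, and the `is_noise` callback are hypothetical, and the voice data is modeled as a plain list of samples.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Samples = List[float]


@dataclass
class Segment:
    start: int        # index of the segment's first sample in the original data
    samples: Samples


def cut(data: Samples, cut_points: Sequence[int]) -> List[Segment]:
    """Module 220 (sketch): split the data at the given sample indices."""
    inner = sorted({p for p in cut_points if 0 < p < len(data)})
    bounds = [0, *inner, len(data)]
    return [Segment(a, data[a:b]) for a, b in zip(bounds, bounds[1:]) if b > a]


def denoise(
    data: Samples,
    cut_points: Sequence[int],
    is_noise: Callable[[Samples], bool],
) -> Samples:
    """Cut, drop noise segments, and splice the rest in time order."""
    segments = cut(data, cut_points)
    # Module 230 (sketch): keep only segments not judged to be noise.
    targets = [s for s in segments if not is_noise(s.samples)]
    # Module 240 (sketch): splice by position in the original voice data.
    targets.sort(key=lambda s: s.start)
    return [x for s in targets for x in s.samples]
```

For example, with the data `[1, 1, 9, 9, 2, 2]`, cut points `[2, 4]`, and a callback that flags any segment whose maximum exceeds 5, the middle segment is dropped and the remaining two are joined in their original order.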
Referring to fig. 5, in an embodiment of the present invention, the voice audio cutting module 220 may further include:
the reference voice segment obtaining submodule 221 is configured to obtain, according to a frequency distribution condition of the original voice data and with reference to audio frequency, a reference voice segment in the original voice data whose frequency is lower than a first frequency value and/or higher than a second frequency value;
the voice audio cutting submodule 222 is configured to use the position of the center point of each reference voice segment in the original voice data as an audio segmentation point, and perform audio segmentation on the original voice data according to the audio segmentation points to obtain a plurality of audio segments.
Optionally, the first frequency value is 20 hertz and the second frequency value is 20000 hertz.
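One way to picture how submodules 221 and 222 derive segmentation points is sketched below. It assumes the frequency distribution is available as one representative frequency value per frame, which the specification does not state; the function name `cut_points` and the constants are illustrative only.

```python
from typing import List, Optional, Sequence

FIRST_FREQ_HZ = 20.0       # first frequency value from the text
SECOND_FREQ_HZ = 20000.0   # second frequency value from the text


def cut_points(frame_freqs: Sequence[float]) -> List[int]:
    """Return the center frame index of every run of out-of-range frames.

    Each maximal run of frames whose frequency is below the first value or
    above the second value is treated as one reference voice segment.
    """
    points: List[int] = []
    run_start: Optional[int] = None
    for i, freq in enumerate(frame_freqs):
        out_of_range = freq < FIRST_FREQ_HZ or freq > SECOND_FREQ_HZ
        if out_of_range and run_start is None:
            run_start = i                            # a reference segment begins
        elif not out_of_range and run_start is not None:
            points.append((run_start + i - 1) // 2)  # center of the finished run
            run_start = None
    if run_start is not None:                        # run reaching the end of data
        points.append((run_start + len(frame_freqs) - 1) // 2)
    return points
```

The returned frame indices would still need to be converted to sample or time positions before cutting, a step this sketch leaves out.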
Referring to fig. 5, in an embodiment of the present invention, the noise identification module 230 may further include:
the first noise identification submodule 231 is configured to identify each audio segment through a preset noise identification model, and acquire an audio segment belonging to noise in the audio segments;
and/or,
the second noise identification submodule 232 is configured to obtain an audio segment belonging to noise in the audio segment according to a similarity between the frequency variation waveform of each audio segment and the frequency variation waveform of each noise sample in a preset noise sample set;
wherein the noise identification model is a machine learning model trained by a plurality of audio samples determined to be noise.
Optionally, the noise identification model includes at least one of a first noise identification model for identifying an environmental background sound and a second noise identification model for identifying a color ring back tone, where the first noise identification model is obtained by training audio samples of a plurality of environmental background sounds, and the second noise identification model is obtained by training audio samples of a plurality of color ring back tones; and the noise sample set comprises audio samples of a plurality of prompt voices as the noise samples.
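The specification does not say how the similarity between frequency variation waveforms is computed. As one possible choice, assumed here purely for illustration, the sketch below uses the Pearson correlation of two equal-length waveforms and compares it with a hypothetical threshold; the function names are likewise not from the source.

```python
import math
from typing import Sequence


def waveform_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length waveforms, in [-1, 1]."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("waveforms must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # A constant waveform has no variation to compare; report no similarity.
    return 0.0 if norm == 0 else sum(x * y for x, y in zip(da, db)) / norm


def matches_noise_sample(
    segment: Sequence[float],
    noise_samples: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed value; the specification gives none
) -> bool:
    """True if the segment is similar enough to any same-length noise sample."""
    return any(
        waveform_similarity(segment, sample) >= threshold
        for sample in noise_samples
        if len(sample) == len(segment)
    )
```

In practice the waveforms would likely need to be aligned or resampled to a common length first; this sketch simply skips samples of a different length.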
The speech denoising processing device provided by the embodiment of the present invention can implement each process implemented in the method embodiments of fig. 1 to fig. 2, and is not described herein again to avoid repetition.
Preferably, an embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor. When executed by the processor, the computer program implements each process of the above speech denoising processing method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements each process of the embodiment of the speech denoising processing method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Fig. 6 is a schematic diagram of a hardware structure of an electronic device implementing various embodiments of the present invention.
The electronic device 500 includes, but is not limited to: a radio frequency unit 501, a network module 502, an audio output unit 503, an input unit 504, a sensor 505, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, a processor 510, and a power supply 511. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 6 does not constitute a limitation of the electronic device, and that the electronic device may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 501 may be used for receiving and sending signals while sending and receiving information or during a call. Specifically, it receives downlink data from a base station and forwards the data to the processor 510 for processing; in addition, it transmits uplink data to the base station. In general, the radio frequency unit 501 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 501 can also communicate with a network and other devices through a wireless communication system.
The electronic device provides wireless broadband internet access to the user via the network module 502, such as assisting the user in sending and receiving e-mails, browsing web pages, and accessing streaming media.
The audio output unit 503 may convert audio data received by the radio frequency unit 501 or the network module 502 or stored in the memory 509 into an audio signal and output as sound. Also, the audio output unit 503 may also provide audio output related to a specific function performed by the electronic apparatus 500 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 503 includes a speaker, a buzzer, a receiver, and the like.
The input unit 504 is used to receive an audio or video signal. The input unit 504 may include a Graphics Processing Unit (GPU) 5041 and a microphone 5042. The graphics processor 5041 processes image data of a still picture or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 506. The image frames processed by the graphics processor 5041 may be stored in the memory 509 (or other storage medium) or transmitted via the radio frequency unit 501 or the network module 502. The microphone 5042 may receive sound and process it into audio data. In the phone call mode, the processed audio data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 501 and then output.
The electronic device 500 also includes at least one sensor 505, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 5061 according to the brightness of ambient light, and a proximity sensor that can turn off the display panel 5061 and/or a backlight when the electronic device 500 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of an electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 505 may also include fingerprint sensors, pressure sensors, iris sensors, molecular sensors, gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc., which are not described in detail herein.
The display unit 506 is used to display information input by the user or information provided to the user. The display unit 506 may include a display panel 5061, and the display panel 5061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 507 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 507 includes a touch panel 5071 and other input devices 5072. The touch panel 5071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations by a user on or near the touch panel 5071 using a finger, stylus, or any suitable object or attachment). The touch panel 5071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 510, and receives and executes commands sent by the processor 510. In addition, the touch panel 5071 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 5071, the user input unit 507 may include other input devices 5072. In particular, the other input devices 5072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
Further, the touch panel 5071 may be overlaid on the display panel 5061, and when the touch panel 5071 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 510 to determine the type of the touch event, and then the processor 510 provides a corresponding visual output on the display panel 5061 according to the type of the touch event. Although in fig. 6, the touch panel 5071 and the display panel 5061 are two independent components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 5071 and the display panel 5061 may be integrated to implement the input and output functions of the electronic device, and is not limited herein.
The interface unit 508 is an interface for connecting an external device to the electronic apparatus 500. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 508 may be used to receive input (e.g., data information, power, etc.) from external devices and transmit the received input to one or more elements within the electronic apparatus 500 or may be used to transmit data between the electronic apparatus 500 and external devices.
The memory 509 may be used to store software programs as well as various data. The memory 509 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the electronic device, and the like. Further, the memory 509 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The processor 510 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 509 and calling data stored in the memory 509, thereby performing overall monitoring of the electronic device. Processor 510 may include one or more processing units; preferably, the processor 510 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 510.
The electronic device 500 may further include a power supply 511 (e.g., a battery) for supplying power to various components, and preferably, the power supply 511 may be logically connected to the processor 510 via a power management system, so as to implement functions of managing charging, discharging, and power consumption via the power management system.
In addition, the electronic device 500 includes some functional modules that are not shown, and are not described in detail herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A speech denoising processing method is characterized by comprising the following steps:
acquiring original voice data to be processed;
acquiring audio segmentation points of the original voice data according to the frequency distribution condition of the original voice data and the frequency range which can be heard by human ears, or taking each pause moment in the original voice data as the audio segmentation points of the original voice data and performing audio segmentation on the original voice data according to the audio segmentation points to obtain a plurality of audio segments;
according to the frequency variation waveform of each audio segment, determining a target audio segment which does not belong to noise and an audio segment which belongs to noise in original voice data, and rejecting a noise audio segment in a plurality of audio segments;
and sequentially splicing each target audio segment according to the time sequence of each target audio segment in the original voice data to obtain the denoised voice data.
2. The method of claim 1, wherein the step of obtaining an audio segmentation point of the original speech data according to the frequency distribution of the original speech data and the frequency range that can be heard by human ears, and performing audio segmentation on the original speech data according to the audio segmentation point to obtain a plurality of audio segments comprises:
acquiring a reference voice segment with the frequency lower than a first frequency value and/or the frequency higher than a second frequency value in the original voice data according to the frequency distribution condition of the original voice data and the frequency range which can be heard by human ears;
and taking the position of the central point of each reference voice segment in the original voice data as an audio segmentation point, and performing audio segmentation on the original voice data according to the audio segmentation point to obtain a plurality of audio segments.
3. The method of claim 2, wherein the first frequency value is 20 hertz and the second frequency value is 20000 hertz.
4. A method as claimed in any one of claims 1-3, wherein the step of determining a target audio segment not belonging to noise and an audio segment belonging to noise in the original speech data based on the frequency variation waveform of each of the audio segments comprises:
identifying each audio segment through a preset noise identification model to obtain the audio segments belonging to noise in the audio segments;
and/or acquiring an audio segment belonging to noise in the audio segment according to the similarity between the frequency variation waveform of each audio segment and the frequency variation waveform of each noise sample in a preset noise sample set;
wherein the noise identification model is a machine learning model trained by a plurality of audio samples determined to be noise.
5. The method according to claim 4, wherein the noise identification model comprises at least one of a first noise identification model for identifying an environmental background sound and a second noise identification model for identifying a color ring back tone, wherein the first noise identification model is obtained by training audio samples of a plurality of environmental background sounds, and the second noise identification model is obtained by training audio samples of a plurality of color ring back tones; and the noise sample set comprises audio samples of a plurality of prompt voices as the noise samples.
6. A speech denoising processing apparatus, comprising:
the voice data acquisition module is used for acquiring original voice data to be processed;
the voice audio cutting module is used for acquiring audio cutting points of the original voice data according to the frequency distribution condition of the original voice data and the frequency range which can be heard by human ears, or taking each pause moment in the original voice data as the audio cutting point of the original voice data and performing audio cutting on the original voice data according to the audio cutting points to obtain a plurality of audio segments;
the noise identification module is used for determining a target audio segment which does not belong to noise and an audio segment which belongs to noise in original voice data according to the frequency variation waveform of each audio segment, and rejecting a noise audio segment in a plurality of audio segments;
and the voice splicing module is used for sequentially splicing each target audio segment according to the time sequence of each target audio segment in the original voice data to obtain the denoised voice data.
7. The apparatus of claim 6, wherein the voice audio cutting module comprises:
the reference voice segment obtaining submodule is used for obtaining, according to the frequency distribution condition of the original voice data and the frequency range which can be heard by human ears, a reference voice segment in the original voice data whose frequency is lower than a first frequency value and/or higher than a second frequency value;
and the voice audio cutting submodule is used for taking the position of the central point of each reference voice segment in the original voice data as an audio cutting point and carrying out audio cutting on the original voice data according to the audio cutting point to obtain a plurality of audio segments.
8. The apparatus of claim 7, wherein the first frequency value is 20Hz and the second frequency value is 20000 Hz.
9. The apparatus according to any one of claims 6-8, wherein the noise identification module comprises:
the first noise identification submodule is used for identifying each audio segment through a preset noise identification model to obtain the audio segments belonging to noise in the audio segments;
and/or the second noise identification submodule is used for acquiring the audio segments belonging to the noise in the audio segments according to the similarity between the frequency variation waveform of each audio segment and the frequency variation waveform of each noise sample in a preset noise sample set;
wherein the noise identification model is a machine learning model trained by a plurality of audio samples determined to be noise.
10. The apparatus of claim 9, wherein the noise recognition model comprises at least one of a first noise recognition model for recognizing an environmental background sound and a second noise recognition model for recognizing a color ring back tone, the first noise recognition model is obtained by training audio samples of a plurality of environmental background sounds, and the second noise recognition model is obtained by training audio samples of a plurality of color ring back tones; and the noise sample set comprises audio samples of a plurality of prompt voices as the noise samples.
11. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the speech denoising processing method according to any one of claims 1 to 5.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech denoising processing method according to any one of claims 1 to 5.
CN202011232882.6A 2020-11-06 2020-11-06 Voice denoising processing method and device, electronic equipment and storage medium Active CN112382282B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011232882.6A CN112382282B (en) 2020-11-06 2020-11-06 Voice denoising processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011232882.6A CN112382282B (en) 2020-11-06 2020-11-06 Voice denoising processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112382282A CN112382282A (en) 2021-02-19
CN112382282B true CN112382282B (en) 2022-02-11

Family

ID=74578809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011232882.6A Active CN112382282B (en) 2020-11-06 2020-11-06 Voice denoising processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112382282B (en)

Also Published As

Publication number Publication date
CN112382282A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN109558512B (en) Audio-based personalized recommendation method and device and mobile terminal
CN107799125A (en) A kind of audio recognition method, mobile terminal and computer-readable recording medium
CN108511002B (en) Method for recognizing sound signal of dangerous event, terminal and computer readable storage medium
CN107919138B (en) Emotion processing method in voice and mobile terminal
CN109065060B (en) Voice awakening method and terminal
CN107886969B (en) Audio playing method and audio playing device
CN109215683B (en) Prompting method and terminal
CN110097872B (en) Audio processing method and electronic equipment
CN108391008B (en) Message reminding method and mobile terminal
CN110012143B (en) Telephone receiver control method and terminal
CN110225195B (en) Voice communication method and terminal
CN111182118B (en) Volume adjusting method and electronic equipment
CN113452845B (en) Method for identifying abnormal telephone number and electronic equipment
CN110830368A (en) Instant messaging message sending method and electronic equipment
CN111401463A (en) Method for outputting detection result, electronic device, and medium
CN111738100A (en) Mouth shape-based voice recognition method and terminal equipment
CN111835522A (en) Audio processing method and device
CN110995921A (en) Call processing method, electronic device and computer readable storage medium
CN108093119B (en) Strange incoming call number marking method and mobile terminal
CN108597495B (en) Method and device for processing voice data
CN112382282B (en) Voice denoising processing method and device, electronic equipment and storage medium
CN111292727B (en) Voice recognition method and electronic equipment
CN110427149B (en) Terminal operation method and terminal
CN109361804B (en) Incoming call processing method and mobile terminal
CN115312036A (en) Model training data screening method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant