CN112992190B - Audio signal processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112992190B
CN112992190B CN112992190A CN202110145224.1A CN202110145224A
Authority
CN
China
Prior art keywords
audio signal
sub
target audio
region
determining
Prior art date
Legal status
Active
Application number
CN202110145224.1A
Other languages
Chinese (zh)
Other versions
CN112992190A
Inventor
宗博文
杨晶生
苗天时
Current Assignee
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202110145224.1A
Publication of CN112992190A
Application granted
Publication of CN112992190B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Abstract

The disclosure provides an audio signal processing method, an audio signal processing apparatus, an electronic device and a storage medium. One embodiment of the method comprises: acquiring a target audio signal; determining an activation region of the target audio signal based on a time domain diagram of the target audio signal and a preset volume threshold; determining an echo howling region of the target audio signal based on the spectrum centroid sequence of the target audio signal and a preset sequence distribution threshold; determining a human voice region and a noise region of the target audio signal based on a Mel frequency cepstrum of the target audio signal and a pre-trained machine learning model; and determining a target human voice region corresponding to the target audio signal according to the activation region, the echo howling region, the human voice region and the noise region. The embodiment can obtain high-quality, clear human voice, which helps improve the accuracy of speech recognition or language identification.

Description

Audio signal processing method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of audio signal processing, in particular to a method and a device for processing an audio signal, electronic equipment and a storage medium.
Background
Automatic Speech Recognition (ASR) is a technology that converts human speech into text. Speech recognition is widely used; speech interaction and speech input are common applications.
Language Identification (LID) is a technology in which a computer system automatically analyzes a speaker's speech signal to determine the language of the speech. With LID, the identification result can be fed to the speech recognition model of the corresponding language, enabling an automatic multi-language interaction experience.
Generally, an audio signal contains noise in addition to human voice, and this noise degrades the performance of speech recognition or language identification. A technical solution for obtaining clear human voice is therefore needed.
Disclosure of Invention
The embodiment of the disclosure provides an audio signal processing method and device, an electronic device and a storage medium.
In a first aspect, the present disclosure provides a method for processing an audio signal, including:
acquiring a target audio signal;
determining an activation region of the target audio signal based on the time domain diagram of the target audio signal and a preset volume threshold;
determining an echo howling region of the target audio signal based on a frequency spectrum centroid sequence of the target audio signal and a preset sequence distribution threshold, wherein the frequency spectrum centroid sequence is obtained by performing window sliding processing on a time domain graph of the target audio signal and taking a frequency spectrum centroid of a sub-time domain graph in each window;
determining a human voice region and a noise region of the target audio signal based on a Mel frequency cepstrum of the target audio signal and a pre-trained machine learning model;
and determining a target human voice area corresponding to the target audio signal according to the activation area, the echo howling area, the human voice area and the noise area.
In some optional embodiments, the method further comprises:
and determining the language corresponding to the target audio signal according to the target human voice area.
In some optional embodiments, the determining an echo howling region of the target audio signal based on the spectrum centroid sequence of the target audio signal and a preset sequence distribution threshold includes:
generating a frequency spectrum centroid sequence of the target audio signal according to the time domain diagram of the target audio signal;
determining a plurality of sub-spectrum centroid sequences corresponding to the spectrum centroid sequence based on a window sliding method;
for each sub-spectrum centroid sequence, determining a sequence distribution parameter of the sub-spectrum centroid sequence, and comparing the sequence distribution parameter of the sub-spectrum centroid sequence with the sequence distribution threshold to determine a detection result of the sub-spectrum centroid sequence, wherein the detection result indicates whether an audio signal region corresponding to the sub-spectrum centroid sequence belongs to an echo howling region;
and determining an echo howling area of the target audio signal according to the detection result of each sub-spectrum centroid sequence.
In some alternative embodiments, the sequence distribution threshold comprises a standard deviation threshold and a correlation coefficient threshold; and
the determining the sequence distribution parameters of the sub-spectrum centroid sequence includes:
carrying out normalization processing on the sub-spectrum centroid sequence, wherein the normalization processing is used for mapping the numerical value of the spectrum centroid to a preset range; determining a maximum value point of the sub-spectrum centroid sequence; determining the standard deviation of the maximum value point of the sub-spectrum centroid sequence; determining the time interval mean value of the maximum value point of the sub-spectrum centroid sequence; respectively removing data with the length of the time interval mean value from the head and the tail of the sub-spectrum centroid sequence to obtain a corresponding first segment and a second segment, and calculating correlation coefficients of the first segment and the second segment;
the comparing the sequence distribution parameter of the sub-spectrum centroid sequence with the sequence distribution threshold to determine the detection result of the sub-spectrum centroid sequence includes:
and under the condition that the standard deviation of the maximum value point of the sub-spectrum centroid sequence is smaller than the standard deviation threshold value and the correlation coefficient of the first segment and the second segment is larger than the correlation coefficient threshold value, determining that the audio signal region corresponding to the sub-spectrum centroid sequence belongs to an echo howling region.
In some optional embodiments, the generating a sequence of spectral centroids of the target audio signal according to the time domain diagram of the target audio signal includes:
determining a plurality of sub-time domain graphs corresponding to the time domain graph of the target audio signal based on a window sliding method;
calculating the spectral centroid of each of the sub-time domain graphs to form a sequence of spectral centroids of the target audio signal.
In some optional embodiments, for each of the plurality of sub-time domain graphs, the spectral centroid of the sub-time domain graph is determined by:
determining the frequency weight corresponding to each point in the sub time domain graph according to the amplitude of each point in the sub time domain graph and the sum of the amplitudes of each point in the sub time domain graph;
determining the frequency spectrum centroid component corresponding to each point in the sub-time domain graph according to the frequency weight of each point in the sub-time domain graph and the corresponding fast Fourier transform frequency;
and summing the frequency spectrum centroid components corresponding to each point in the sub-time domain graph to obtain the frequency spectrum centroid of the sub-time domain graph.
In some alternative embodiments, the machine learning model is trained by:
acquiring a training sample set, wherein the training sample set comprises at least one training sample formed by mixing human voice and noise and a sample label used for indicating the position of the human voice and the position of the noise in the training sample;
and training a convolutional neural network through the training sample set until a preset convergence condition is reached to obtain the machine learning model, wherein the input of the convolutional neural network comprises a Mel frequency cepstrum of the training sample, and the output of the convolutional neural network comprises a start point and a stop point of a sound region, a foreground confidence coefficient, a human voice confidence coefficient and a noise confidence coefficient.
In some optional embodiments, the determining the human voice region and the noise region of the target audio signal based on the mel-frequency cepstrum of the target audio signal and a pre-trained machine learning model includes:
inputting the Mel frequency cepstrum of the target audio signal into the machine learning model to obtain at least one target sound region and corresponding human voice confidence and noise confidence;
determining the target sound area as a human sound area under the condition that the human sound confidence corresponding to the target sound area is greater than a preset human sound confidence threshold; determining the target sound region as a noise region under the condition that the noise confidence corresponding to the target sound region is greater than a preset noise confidence threshold;
combining the obtained multiple voice areas based on a non-maximum suppression algorithm to obtain the voice area of the target audio signal; and carrying out merging processing on the obtained multiple noise regions based on a non-maximum suppression algorithm to obtain the noise region of the target audio signal.
In some optional embodiments, the determining the active region of the target audio signal based on the time domain map of the target audio signal and a preset volume threshold includes:
carrying out absolute value processing on the time domain graph of the target audio signal to obtain a corresponding volume graph;
determining a plurality of sub-volume maps corresponding to the volume maps based on a window sliding method;
for each sub-volume map, determining the volume average value corresponding to the sub-volume map; and determining that the audio signal region corresponding to the sub-volume map belongs to the active region when the volume average value is larger than the volume threshold value.
In some optional embodiments, the determining a target human voice region corresponding to the target audio signal according to the active region, the echo howling region, the human voice region, and the noise region includes:
taking the intersection of the human voice region and the activation region and performing dilation-erosion processing to obtain a first audio signal region;
taking the union of the noise region and the echo howling region and performing erosion-dilation processing to obtain a second audio signal region;
and removing the intersection of the first audio signal region and the second audio signal region from the first audio signal region to obtain a target human voice region corresponding to the target audio signal.
In a second aspect, the present disclosure provides an apparatus for processing an audio signal, comprising:
an acquisition unit configured to acquire a target audio signal;
the first detection unit is used for determining an activation region of the target audio signal based on a time domain diagram of the target audio signal and a preset volume threshold;
a second detection unit, configured to determine an echo howling region of the target audio signal based on a spectrum centroid sequence of the target audio signal and a preset sequence distribution threshold, where the spectrum centroid sequence is obtained by performing window sliding processing on a time domain diagram of the target audio signal and taking a spectrum centroid of a sub-time domain diagram in each window;
a third detection unit, configured to determine a human voice region and a noise region of the target audio signal based on a mel-frequency cepstrum of the target audio signal and a machine learning model trained in advance;
and the processing unit is used for determining a target human voice area corresponding to the target audio signal according to the activation area, the echo howling area, the human voice area and the noise area.
In some optional embodiments, the apparatus further comprises a language identification unit, and the language identification unit is configured to:
and determining the language corresponding to the target audio signal according to the target human voice area.
In some optional embodiments, the second detecting unit includes:
a spectrum centroid sequence generating unit, configured to generate a spectrum centroid sequence of the target audio signal according to a time domain diagram of the target audio signal;
the sub-spectrum centroid sequence generating unit is used for determining a plurality of sub-spectrum centroid sequences corresponding to the spectrum centroid sequence based on a window sliding method;
the detection unit is used for determining a sequence distribution parameter of the sub-spectrum centroid sequence and comparing the sequence distribution parameter of the sub-spectrum centroid sequence with the sequence distribution threshold value to determine a detection result of the sub-spectrum centroid sequence, wherein the detection result indicates whether an audio signal region corresponding to the sub-spectrum centroid sequence belongs to an echo howling region;
and the merging unit is used for determining an echo howling area of the target audio signal according to the detection result of each sub-spectrum centroid sequence.
In some alternative embodiments, the sequence distribution threshold comprises a standard deviation threshold and a correlation coefficient threshold;
the second detection unit is further configured to: normalizing the sub-spectrum centroid sequence, wherein the normalization is used for mapping the numerical value of the spectrum centroid to the range of [0,1 ]; determining a maximum value point of the sub-spectrum centroid sequence; determining the standard deviation of the maximum value point of the sub-spectrum centroid sequence; determining the time interval mean value of the maximum value point of the sub-spectrum centroid sequence; respectively removing data with the length of the time interval mean value from the head and the tail of the sub-spectrum centroid sequence to obtain a corresponding first segment and a second segment, and calculating correlation coefficients of the first segment and the second segment;
and under the condition that the standard deviation of the maximum value point of the sub-spectrum centroid sequence is smaller than the standard deviation threshold value and the correlation coefficient of the first segment and the second segment is larger than the correlation coefficient threshold value, determining that the audio signal region corresponding to the sub-spectrum centroid sequence belongs to an echo howling region.
In some optional embodiments, the generating unit is further configured to:
determining a plurality of sub-time domain graphs corresponding to the time domain graph of the target audio signal based on a window sliding method;
calculating the spectral centroid of each of the sub-time domain graphs to form a sequence of spectral centroids of the target audio signal.
In some optional embodiments, for each of the plurality of sub-time domain graphs, the spectral centroid of the sub-time domain graph is determined by:
determining the frequency weight corresponding to each point in the sub time domain graph according to the amplitude of each point in the sub time domain graph and the sum of the amplitudes of each point in the sub time domain graph;
determining the frequency spectrum centroid component corresponding to each point in the sub-time domain graph according to the frequency weight of each point in the sub-time domain graph and the corresponding fast Fourier transform frequency;
and summing the frequency spectrum centroid components corresponding to each point in the sub-time domain graph to obtain the frequency spectrum centroid of the sub-time domain graph.
In some alternative embodiments, the machine learning model is trained by:
acquiring a training sample set, wherein the training sample set comprises at least one training sample formed by mixing human voice and noise and a sample label used for indicating the position of the human voice and the position of the noise in the training sample;
and training a convolutional neural network through the training sample set until a preset convergence condition is reached to obtain the machine learning model, wherein the input of the convolutional neural network comprises a Mel frequency cepstrum of the training sample, and the output of the convolutional neural network comprises a start point and a stop point of a sound region, a foreground confidence coefficient, a human voice confidence coefficient and a noise confidence coefficient.
In some optional embodiments, the third detecting unit is further configured to:
inputting the Mel frequency cepstrum of the target audio signal into the machine learning model to obtain at least one target sound region and corresponding human voice confidence and noise confidence;
determining the target sound area as a human sound area under the condition that the human sound confidence corresponding to the target sound area is greater than a preset human sound confidence threshold; determining the target sound region as a noise region under the condition that the noise confidence corresponding to the target sound region is greater than a preset noise confidence threshold;
combining the obtained multiple voice areas based on a non-maximum suppression algorithm to obtain the voice area of the target audio signal; and carrying out merging processing on the obtained multiple noise regions based on a non-maximum suppression algorithm to obtain the noise region of the target audio signal.
In some optional embodiments, the first detecting unit is further configured to:
carrying out absolute value processing on the time domain graph of the target audio signal to obtain a corresponding volume graph;
determining a plurality of sub-volume maps corresponding to the volume maps based on a window sliding method;
for each sub-volume map, determining the volume average value corresponding to the sub-volume map; and determining that the audio signal region corresponding to the sub-volume map belongs to the active region when the volume average value is larger than the volume threshold value.
In some optional embodiments, the processing unit is further configured to:
taking the intersection of the human voice region and the activation region and performing dilation-erosion processing to obtain a first audio signal region;
taking the union of the noise region and the echo howling region and performing erosion-dilation processing to obtain a second audio signal region;
and removing the intersection of the first audio signal region and the second audio signal region from the first audio signal region to obtain a target human voice region corresponding to the target audio signal.
In a third aspect, the present disclosure provides an electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in any embodiment of the first aspect of the disclosure.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method as described in any one of the embodiments of the first aspect of the present disclosure.
According to the audio signal processing method and apparatus, the electronic device and the storage medium provided by the embodiments of the disclosure, the activation region, the echo howling region, the human voice region and the noise region of the target audio signal are respectively determined, and then these types of audio signal regions are combined to obtain the target human voice region, so that high-quality, clear human voice can be obtained and the accuracy of speech recognition or language identification can be improved.
In the embodiment of the disclosure, a frequency spectrum centroid sequence is obtained by performing window sliding processing on a time domain diagram of a target audio signal and taking a frequency spectrum centroid of a sub-time domain diagram in each window, and then an echo howling area is determined according to the frequency spectrum centroid sequence, so that information contained in the target audio signal is fully utilized, and a detection result of the echo howling area is accurate and reliable.
In addition, in this embodiment, detection of each sound type may be implemented based on rules or a lightweight neural network. The algorithms are light and fast, so the audio processing scheme can run in real time and can also be deployed in a CPU-only (Central Processing Unit) hardware environment.
Drawings
Other features, objects, and advantages of the disclosure will become apparent from a reading of the following detailed description of non-limiting embodiments which proceeds with reference to the accompanying drawings. The drawings are only for purposes of illustrating the particular embodiments and are not to be construed as limiting the invention. In the drawings:
fig. 1 is a system architecture diagram of one embodiment of an audio signal processing system according to the present disclosure;
FIG. 2 is a flow diagram of one embodiment of an audio signal processing method according to the present disclosure;
FIG. 3 is a schematic diagram of determining a sub-volume map based on a window sliding method according to the present disclosure;
FIG. 4 is a schematic block diagram of one embodiment of an audio signal processing apparatus according to the present disclosure;
FIG. 5 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the audio signal processing method, apparatus, terminal device and storage medium of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a voice interaction application, a video conference application, a short video social application, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a microphone and a speaker, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide an audio signal processing service) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, such as a background server providing processing services for audio signals captured on the terminal devices 101, 102, 103. The background server can perform corresponding processing on the received audio signals and the like.
In some cases, the audio signal processing method provided by the present disclosure may be performed by the terminal devices 101, 102, 103 and the server 105 together, for example, the step of "acquiring the target audio signal" may be performed by the terminal devices 101, 102, 103, and the step of "determining the target human voice region corresponding to the target audio signal according to the activation region, the echo howling region, the human voice region and the noise region" may be performed by the server 105. The present disclosure is not limited thereto. Accordingly, the audio signal processing means may be provided in the terminal devices 101, 102, 103 and the server 105, respectively.
In some cases, the audio signal processing method provided by the present disclosure may be executed by the terminal devices 101, 102, and 103, and accordingly, the audio signal processing apparatus may also be disposed in the terminal devices 101, 102, and 103, and in this case, the system architecture 100 may not include the server 105.
In some cases, the audio signal processing method provided by the present disclosure may be executed by the server 105, and accordingly, the audio signal processing apparatus may also be disposed in the server 105, and in this case, the system architecture 100 may also not include the terminal devices 101, 102, and 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No specific limitation is imposed here.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continuing reference to fig. 2, a flow 200 of an embodiment of an audio signal processing method according to the present disclosure is shown, applied to the terminal device or the server in fig. 1, the flow 200 including the steps of:
step 201, a target audio signal is obtained.
In the case where the execution subject is a terminal device, a sound wave may be converted into an electric signal by a microphone in the terminal device, thereby obtaining a target audio signal. The terminal device can also receive audio signals sent by other terminal devices or the server through the network, so as to obtain the target audio signal.
In the case where the execution subject is a server, an audio signal transmitted by a terminal device or another server may be received by the server through a network, thereby obtaining a target audio signal.
In the case where the execution subject is a terminal device and a server, an audio signal may be collected by the terminal device through a microphone and transmitted to the server for processing.
In this embodiment, the target audio signal may be a series of data sampled at a specific sampling frequency. The target audio signal may be represented as a series of discrete data points in a time domain plot. Here, the abscissa of the time domain diagram is time, and the ordinate is the amplitude of the audio signal.
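For illustration only (the following code is not part of the original disclosure), a target audio signal of this kind could be loaded as a sequence of sampled data points roughly as follows, using Python's standard wave module and NumPy; the file name and the mono 16-bit PCM format are assumptions.

```python
import wave
import numpy as np

def load_target_audio(path="meeting.wav"):
    # Read a mono 16-bit PCM WAV file and return (timestamps, samples),
    # i.e. the discrete data points of the time domain diagram.
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32)
    samples /= 32768.0                       # normalize amplitude to [-1, 1]
    times = np.arange(len(samples)) / sample_rate
    return times, samples
```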
In an example of a video conference scenario, the target audio signal may be an audio signal collected in real time, which may include human voice, echo howling, and noise consisting of the noise floor, electric current noise, keyboard and mouse clicks, various tapping sounds, and the like.
In an acoustic scenario, howling tends to occur when a closed acoustic feedback loop is formed: the sound signal picked up by the microphone contains the sound amplified by the loudspeaker, this signal is repeatedly superimposed and amplified in the feedback loop, and the resulting positive feedback produces oscillation, which is heard as howling.
In this embodiment, echo, howling, and similar feedback interference are collectively referred to as echo howling.
Step 202, determining an activation region of the target audio signal based on the time domain diagram of the target audio signal and a preset volume threshold.
In this embodiment, the volume may be determined according to the amplitude of the audio signal in the time domain diagram, and a preset volume threshold may be used as a division criterion of the mute region and the active region. Here, the mute region may be a region where the amplitude of the audio signal is smaller than a preset volume threshold, corresponding to a mute state. The activation region may be a region where the amplitude of the audio signal is greater than or equal to a preset volume threshold, containing audible sounds that can be heard by the human ear.
In one example, step 202 may proceed as follows:
firstly, absolute value processing is carried out on a time domain graph of a target audio signal to obtain a corresponding volume graph.
For example, for each data point corresponding to the target audio signal in the time domain diagram, the abscissa of the data point is kept unchanged, and the absolute value of the ordinate of the data point is taken to obtain the volume diagram of the target audio signal. The abscissa of the above-mentioned volume map is time, and the ordinate is the absolute value of the amplitude (i.e., volume) of the audio signal.
Next, a plurality of sub-volume maps corresponding to the volume map are determined based on a window sliding method.
In this embodiment, the window sliding method is to slide on the data sequence by a certain step length using a sliding window with a certain width, and use the data intercepted by the sliding window at each position as the corresponding sub-data sequence.
In this embodiment, a sliding window with a certain width may be used to slide on the volume map in a certain step length, so as to obtain a plurality of sub-volume maps corresponding to the volume map.
Fig. 3 is a schematic diagram of determining a sub-volume map based on the window sliding method. In fig. 3, the abscissa of the volume map is time, with values t0, t1, t2, ..., t10 in order, and the ordinate is the volume, with values 7, 2, 1, ..., 5 in order. Fig. 3 shows each position of a sliding window with a width of 6 time units sliding over the volume map in steps of 2 time units, where the sliding window is drawn as a dashed box. In this way, three sub-volume maps are obtained, corresponding to the time ranges t1 to t6, t3 to t8, and t5 to t10, respectively.
And finally, for each sub-volume map, determining a volume average value corresponding to the sub-volume map, and comparing the volume average value with a preset volume threshold value.
In case the volume average is larger than the volume threshold, it may be determined that the audio signal region to which the sub-volume map corresponds belongs to the active region. In the case where the volume average value is not greater than the volume threshold value, it may be determined that the audio signal region corresponding to the sub-volume map belongs to a mute region.
In the example shown in fig. 3, for the sub-volume map with the time range of t3 to t8, the average value of all volume values in the sub-volume map is 5.83, and assuming that the preset volume threshold is 5, it is known that the average value of the volume corresponding to the sub-volume map is greater than the volume threshold, so that it can be determined that the audio signal region corresponding to the sub-volume map (i.e. the audio signal region with the time range of t3 to t 8) belongs to the active region.
In the above embodiment, a plurality of sub volume maps are generated based on the window sliding method, and the volume average value corresponding to the sub volume map is compared with a preset volume threshold value to determine whether the sub volume map belongs to the mute region or the active region. The embodiment makes full use of the volume information contained in the target audio signal, so that the volume detection result is more accurate and reasonable.
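To make the sliding-window volume detection concrete, the following is a minimal Python/NumPy sketch (not part of the original disclosure); the window width, step size, and volume threshold are placeholder values.

```python
import numpy as np

def detect_active_regions(samples, win=1024, step=512, volume_threshold=0.02):
    """Return a boolean mask over the samples marking the activation region."""
    volume = np.abs(samples)                  # volume map: absolute amplitude
    active = np.zeros(len(samples), dtype=bool)
    for start in range(0, len(samples) - win + 1, step):
        sub_volume = volume[start:start + win]        # one sub-volume map
        if sub_volume.mean() > volume_threshold:      # compare mean volume with the threshold
            active[start:start + win] = True          # window belongs to the active region
    return active
```

A mask like this can later be intersected with the human voice region, as described in step 205.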
And 203, determining an echo howling region of the target audio signal based on the frequency spectrum centroid sequence of the target audio signal and a preset sequence distribution threshold, wherein the frequency spectrum centroid sequence is obtained by performing window sliding processing on the time domain diagram of the target audio signal and taking the frequency spectrum centroid of the sub-time domain diagram in each window.
In this embodiment, the detection of the echo howling region may be performed based on a sequence of spectral centroids of the target audio signal. In one example, the sequence of spectral centroids of the target audio signal may be obtained by:
firstly, a plurality of sub-time domain graphs corresponding to the time domain graph of the target audio signal are determined based on a window sliding method.
Referring to the example shown in fig. 3, a sliding window with a certain width is used to slide on the time domain map of the target audio signal with a certain step size, so as to obtain a plurality of corresponding sub-time domain maps.
Secondly, the spectral centroid of each sub-time domain map is calculated to form a sequence of spectral centroids of the target audio signal.
Generally, the spectral centroid is one of the important physical parameters describing the timbre property, is the center of gravity of the frequency components, is the frequency averaged by energy weighting over a certain frequency range, and has the unit of Hz. It is important information of frequency distribution and energy distribution of an audio signal. In the subjective perception field, the spectral centroid describes the brightness of sound, sound with dull and low quality tends to have more low-frequency content, the spectral centroid is relatively low, most of the sound with bright and cheerful quality is concentrated on high frequency, and the spectral centroid is relatively high.
For each sub-time domain graph, the spectral centroid of the sub-time domain graph may be determined by:
firstly, according to the amplitude of each point in the time domain sub-diagram and the sum of the amplitudes of each point in the time domain sub-diagram, determining the frequency weight corresponding to each point in the time domain sub-diagram.
And secondly, determining the frequency spectrum centroid component corresponding to each point in the sub-time domain graph according to the frequency weight of each point in the sub-time domain graph and the corresponding fast Fourier transform frequency.
Here, the Fast Fourier Transform (FFT) is the general name for efficient algorithms that compute the Discrete Fourier Transform (DFT) on a computer.
And finally, summing the frequency spectrum centroid components corresponding to each point in the sub-time domain graph to obtain the frequency spectrum centroid of the sub-time domain graph.
In one example, for each sub-time domain graph, the spectral centroid of the sub-time domain graph can be calculated by:
centroid[t] = Σ_{k=1}^{n} ( S[k, t] / Σ_{j=1}^{n} S[j, t] ) × F[k]
wherein t is the index of the sub-time domain graph, centroid[t] is the spectral centroid of the t-th sub-time domain graph, S[k, t] is the amplitude of the k-th point in the t-th sub-time domain graph, S[j, t] is the amplitude of the j-th point in the t-th sub-time domain graph, F[k] is the frequency corresponding to the k-th point after the fast Fourier transform of the time domain graph of the target audio signal, n is the total number of data points in the t-th sub-time domain graph, and t, j, k and n are integers. Here, S[k, t] / Σ_{j=1}^{n} S[j, t] is the frequency weight corresponding to the k-th point, and ( S[k, t] / Σ_{j=1}^{n} S[j, t] ) × F[k] is the spectral centroid component corresponding to the k-th point.
In one example, the time domain plot S includes data points (t1, S1), (t2, S2), (t3, S3), (t4, S4), (t5, S5), (t6, S6). After the time domain graph S is subjected to fast fourier transform, the frequencies corresponding to the data points are f1, f2, f3, f4, f5 and f6 in sequence. By sliding the time domain graph S with a sliding window having a width of 3 time units and a step size of 1 time unit, 4 sub-time domain graphs can be obtained, where the 1 st sub-time domain graph S1 includes data points (t1, S1), (t2, S2), (t3, S3). For the sub-time domain graph S1, the spectrum centroid [1] of the sub-time domain graph is:
centroid[1] = ( s1 / (s1 + s2 + s3) ) × f1 + ( s2 / (s1 + s2 + s3) ) × f2 + ( s3 / (s1 + s2 + s3) ) × f3
according to the method, the spectral centroids centroid [2], centroid [3] and centroid [4] of other sub-time domain graphs can be determined, so as to obtain the spectral centroid sequences [ centroid [1], centroid [2], centroid [3] and centroid [4] ] of the target audio signal, wherein the time corresponding to each spectral centroid is t2, t3, t4 and t5 in sequence (here, the time average value of each data point in the sub-time domain graph is taken as the time of the corresponding spectral centroid, and the time of the spectral centroid can also be determined in other manners).
In one example, step 203 may be performed as follows:
firstly, a frequency spectrum centroid sequence of the target audio signal is generated according to a time domain diagram of the target audio signal.
The generation process of the spectrum centroid sequence can refer to the description in the foregoing.
Secondly, determining a plurality of sub-spectrum centroid sequences corresponding to the spectrum centroid sequence based on a window sliding method.
For example, for the spectrum centroid sequence [centroid[1], centroid[2], centroid[3], centroid[4]] described above, the sub-spectrum centroid sequences corresponding to the spectrum centroid sequence can be determined based on a window sliding method, wherein one sub-spectrum centroid sequence is [centroid[2], centroid[3], centroid[4]], and the times corresponding to the spectrum centroids in this sub-spectrum centroid sequence are t3, t4, t5 in sequence.
Then, for each sub-spectrum centroid sequence, determining a sequence distribution parameter of the sub-spectrum centroid sequence, and comparing the sequence distribution parameter of the sub-spectrum centroid sequence with a sequence distribution threshold to determine a detection result of the sub-spectrum centroid sequence, where the detection result indicates whether an audio signal region corresponding to the sub-spectrum centroid sequence belongs to an echo howling region.
Here, the sequence distribution parameters of the sub-spectrum centroid sequence can be used to describe the distribution characteristics of the frequencies in the sub-spectrum centroid sequence. The sequence distribution threshold may be used as a basis for detecting an echo howling area.
In one example, the sequence distribution parameters include a standard deviation and a correlation coefficient, and the sequence distribution threshold includes a standard deviation threshold and a correlation coefficient threshold.
The standard deviation and the correlation coefficient may be determined as follows: normalizing the sub-spectrum centroid sequence, wherein the normalization is used for mapping the numerical value of the spectrum centroid to the range of [0,1 ]; determining a maximum value point of the sub-spectrum centroid sequence; determining the standard deviation of the maximum value point of the sub-spectrum centroid sequence; determining the time interval mean value of the maximum value point of the sub-spectrum centroid sequence; and respectively removing data with the length of the time interval mean value from the head and the tail of the sub-spectrum centroid sequence to obtain a corresponding first segment and a second segment, and calculating the correlation coefficients of the first segment and the second segment.
The normalization process is, for example: assuming that the maximum value in the sub-spectrum centroid sequence is max and the minimum value is min, the normalization result of the spectrum centroid x in the sub-spectrum centroid sequence is
(x - min) / (max - min)
Through normalization processing, the influence of the spectrum centroid absolute value can be eliminated, the relative change of the spectrum centroid value in the sub-spectrum centroid sequence is reflected, and the subsequent analysis on the distribution characteristics of the frequency in the sub-spectrum centroid sequence is facilitated.
When the correlation coefficient is calculated, assuming that the mean time interval between the maximum value points of the sub-spectrum centroid sequence h is g, the first segment is h[g:], i.e., the values from position g to the end of h; the second segment is h[:-g], i.e., the values from the beginning of h up to the position g before the end of h.
In the above example, the standard deviation of the maxima points of the sub-spectrum centroid sequence can describe the difference of the maxima points, and the correlation coefficient of the first segment and the second segment can describe the periodicity or self-similarity of the distribution of the sub-spectrum centroid sequence. The distribution characteristics of the frequencies in the sub-spectrum centroid sequence are accurately and comprehensively described through the two indexes.
In the above example, the detection result of the sub-spectrum centroid sequence can be determined as follows: and under the condition that the standard deviation of the maximum value point of the sub-spectrum centroid sequence is smaller than a standard deviation threshold value and the correlation coefficient of the first segment and the second segment is larger than a correlation coefficient threshold value, determining that the audio signal region corresponding to the sub-spectrum centroid sequence belongs to an echo howling region.
For example, in the example described above, for the sub-spectrum centroid sequence [centroid[2], centroid[3], centroid[4]], assuming that the standard deviation of its maximum value points is smaller than the standard deviation threshold and the correlation coefficient of the first segment and the second segment is larger than the correlation coefficient threshold, it can be determined that the audio signal region corresponding to this sub-spectrum centroid sequence, i.e., the audio signal region with the time range t3 to t5, belongs to the echo howling region.
And finally, determining an echo howling area of the target audio signal according to the detection result of each sub-spectrum centroid sequence.
For example, the echo howling regions corresponding to the respective sub-spectrum centroid sequences are merged to obtain the echo howling region of the target audio signal.
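The per-window echo howling test can be sketched as follows (an illustration, not the original implementation). Whether the standard deviation is taken over the values or the positions of the maximum value points is not spelled out above, so the normalized values are used here; scipy.signal.find_peaks and numpy.corrcoef stand in for the maxima search and the correlation computation, and the thresholds are placeholders.

```python
import numpy as np
from scipy.signal import find_peaks

def is_echo_howling(sub_centroids, std_threshold=0.05, corr_threshold=0.9):
    """Decide whether the audio region behind one sub-spectrum centroid sequence
    looks like echo/howling, based on its sequence distribution parameters."""
    h = np.asarray(sub_centroids, dtype=float)
    # Normalize the centroid values to [0, 1].
    h = (h - h.min()) / (h.max() - h.min() + 1e-12)
    peaks, _ = find_peaks(h)                        # indices of maximum value points
    if len(peaks) < 2:
        return False
    peak_std = np.std(h[peaks])                     # standard deviation of the maxima (values)
    g = int(round(np.mean(np.diff(peaks))))         # mean time interval between maxima
    if g <= 0 or g >= len(h):
        return False
    first, second = h[g:], h[:-g]                   # head-trimmed and tail-trimmed segments
    corr = np.corrcoef(first, second)[0, 1]         # correlation coefficient of the two segments
    if not np.isfinite(corr):
        return False
    return peak_std < std_threshold and corr > corr_threshold
```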
In the embodiment, the frequency spectrum centroid sequence is obtained by a window sliding method, echo howling detection is performed according to the frequency spectrum centroid sequence, information contained in a target audio signal is fully utilized, and the echo howling detection precision is improved.
And step 204, determining a human voice region and a noise region of the target audio signal based on the Mel frequency cepstrum of the target audio signal and a pre-trained machine learning model.
In the field of sound processing, Mel-Frequency Cepstrum (Mel-Frequency Cepstrum) is a linear transformation of the logarithmic energy spectrum based on the nonlinear Mel Scale (Mel Scale) of sound frequencies. It is readily understood that the mel-frequency cepstrum of the target audio signal may be generated from a time domain plot of the target audio signal.
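For illustration, the Mel frequency cepstral representation could be computed with a library such as librosa (the disclosure does not prescribe a particular toolkit); the number of coefficients is a placeholder.

```python
import numpy as np
import librosa

def mel_cepstrum(samples, sample_rate, n_mfcc=40):
    # Mel-frequency cepstral coefficients, shape (n_mfcc, n_frames),
    # used here as the input feature map of the machine learning model.
    return librosa.feature.mfcc(y=np.asarray(samples, dtype=np.float32),
                                sr=sample_rate, n_mfcc=n_mfcc)
```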
In one example, the machine learning model is trained by:
firstly, a training sample set is obtained, wherein the training sample set comprises at least one training sample formed by mixing human voice and noise, and a sample label used for indicating the position of the human voice and the position of the noise in the training sample.
The voice may include voices of various languages, genders and ages, and the noise may include various types of background noise, current sounds, keyboard and mouse knocking sounds, various types of knocking sounds, and the like. The human voice and the noise can be mixed randomly or non-randomly to obtain a training sample.
Secondly, training the convolutional neural network through a training sample set until a preset convergence condition is reached to obtain a machine learning model, wherein the input of the convolutional neural network comprises a Mel frequency cepstrum of the training sample, and the output of the convolutional neural network comprises a start point and a stop point of a sound region, a foreground confidence coefficient, a human voice confidence coefficient and a noise confidence coefficient.
The foreground confidence has a value range of [0,1]; the larger the value, the more likely it is that the corresponding audio signal is not silence. The human voice confidence has a value range of [0,1]; the larger the value, the more likely it is that the corresponding audio signal is human voice. The noise confidence has a value range of [0,1]; the larger the value, the more likely it is that the corresponding audio signal is noise.
In the training, the loss functions of the foreground confidence, the human voice confidence and the noise confidence may be cross entropy functions, and the loss functions of the start point and the end point of the voice region may be CIoU functions.
In this embodiment, the training samples of the machine learning model can be synthesized based on a small amount of strong label data, so that the labeling cost can be saved.
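The disclosure does not fix a concrete network architecture. Purely as an illustrative sketch, a lightweight 1-D convolutional network producing the stated outputs (start and stop offsets plus foreground, human voice, and noise confidences per frame) might look as follows in PyTorch; all layer sizes and the per-frame prediction scheme are assumptions.

```python
import torch
import torch.nn as nn

class SoundRegionNet(nn.Module):
    """Toy detector: for each time frame it predicts (start offset, stop offset,
    foreground confidence, human voice confidence, noise confidence)."""
    def __init__(self, n_mfcc=40):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv1d(64, 5, kernel_size=1)    # 2 boundary offsets + 3 confidences

    def forward(self, mel_cepstrum):                   # (batch, n_mfcc, n_frames)
        features = self.backbone(mel_cepstrum)
        out = self.head(features)                      # (batch, 5, n_frames)
        boundaries = out[:, :2, :]                     # start / stop regression outputs
        confidences = torch.sigmoid(out[:, 2:, :])     # foreground, human voice, noise in [0, 1]
        return boundaries, confidences
```

During training, the confidence outputs would be compared with the sample labels using cross-entropy losses and the boundary outputs with an interval-overlap loss such as CIoU, as described above.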
In one example, step 204 may be implemented as follows:
firstly, inputting the Mel frequency cepstrum of a target audio signal into a machine learning model to obtain at least one target sound region and corresponding human voice confidence coefficient and noise confidence coefficient.
Similar to the above description, the confidence level of the voice here is in the range of [0,1], and a larger value indicates that the corresponding audio signal is more likely to be the voice. The noise confidence coefficient has a value range of [0,1], and a larger value indicates that the corresponding audio signal is more likely to be noise.
Secondly, determining the target sound area as a voice area under the condition that the voice confidence corresponding to the target sound area is greater than a preset voice confidence threshold; and under the condition that the noise confidence corresponding to the target sound region is greater than a preset noise confidence threshold, determining the target sound region as a noise region.
In this embodiment, thresholds are set for the human voice area and the noise area, respectively, and are determined separately.
Finally, merging the obtained multiple voice areas based on a non-maximum suppression algorithm to obtain the voice area of the target audio signal; and carrying out merging processing on the obtained multiple noise regions based on a non-maximum suppression algorithm to obtain the noise region of the target audio signal.
Here, Non-Maximum Suppression is abbreviated as NMS. The idea of the NMS algorithm is to search for local maxima and suppress the non-maximum elements.
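As an illustration of the merging step, a standard one-dimensional non-maximum suppression over (start, end, confidence) regions is sketched below; the IoU threshold is a placeholder, and the exact merging rule used in practice may differ.

```python
def nms_1d(regions, iou_threshold=0.5):
    """regions: list of (start, end, confidence). Keep the highest-confidence
    regions and suppress overlapping lower-confidence ones."""
    regions = sorted(regions, key=lambda r: r[2], reverse=True)
    kept = []
    for start, end, conf in regions:
        suppressed = False
        for ks, ke, _ in kept:
            inter = max(0.0, min(end, ke) - max(start, ks))
            union = (end - start) + (ke - ks) - inter
            if union > 0 and inter / union > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            kept.append((start, end, conf))
    return kept
```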
In this embodiment, the detection of the human voice region and the noise region can be realized based on a lightweight convolutional neural network, and the calculation speed is high and the algorithm is lightweight.
And step 205, determining a target human voice area corresponding to the target audio signal according to the activation area, the echo howling area, the human voice area and the noise area.
In one example, step 205 may proceed as follows:
Firstly, the intersection of the human voice region and the activation region is taken and dilation-erosion processing is performed to obtain a first audio signal region.
Generally, dilation and erosion are known as morphological operations. They are typically performed on binary images, similar to contour detection. The dilation operation enlarges bright white areas in an image by adding pixels to the perceived boundaries of objects. The erosion operation, in contrast, removes pixels along object boundaries and shrinks the objects.
Dilation and erosion can be applied to one-dimensional signals as well, such as audio signals. Similar to the two-dimensional case, performing dilation on an audio signal region expands the range of that region, and performing erosion on an audio signal region narrows it.
In this embodiment, the audio signal is filtered by erosion and dilation processing.
In one example, at least two rounds of processing may be performed on the intersection of the human voice region and the activation region: the first round dilates and then erodes, which removes small gaps between audio segments and avoids splitting speech with normal pauses into two segments; the second round erodes and then dilates, which removes burrs in the audio signal.
Secondly, the union of the noise region and the echo howling region is taken and erosion-dilation processing is performed to obtain a second audio signal region.
In one example, at least two rounds of erosion-dilation processing may be performed on the union of the noise region and the echo howling region, where the first round extracts the main noise regions and the second round removes fine noise.
Finally, the intersection of the first audio signal region and the second audio signal region is removed from the first audio signal region to obtain the target human voice region corresponding to the target audio signal.
Through erosion and dilation processing of the audio signals, high-quality human voice can be extracted without splitting speech with normal pauses into two segments, and the extracted human voice better meets the requirements of language identification, so the language identification performance is significantly improved.
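Putting the pieces together, the region combination can be carried out on boolean masks over the time axis; the following sketch is an illustration only, with scipy.ndimage assumed available and the structuring-element lengths chosen arbitrarily, and it follows the intersection, union, and difference logic described above.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def target_voice_mask(voice, active, noise, howling, close_len=40, open_len=10):
    """All inputs are boolean masks over the same time axis."""
    k_close = np.ones(close_len, dtype=bool)
    k_open = np.ones(open_len, dtype=bool)

    first = voice & active                                              # human voice ∩ activation
    first = binary_erosion(binary_dilation(first, k_close), k_close)    # dilate then erode: fill small gaps
    first = binary_dilation(binary_erosion(first, k_open), k_open)      # erode then dilate: remove burrs

    second = noise | howling                                            # noise ∪ echo howling
    second = binary_dilation(binary_erosion(second, k_close), k_close)  # erode then dilate: keep the noise body
    second = binary_dilation(binary_erosion(second, k_open), k_open)    # second pass: drop fine noise

    return first & ~second                                              # remove the overlap from the first region
```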
In this embodiment, an activation region, an echo howling region, a human voice region, and a noise region of a target audio signal are respectively determined, and then these types of audio signal regions are combined to obtain the target human voice region, so that high-quality, clear human voice can be obtained and the accuracy of speech recognition or language identification can be improved.
In addition, in the embodiment, the detection of each sound type can be realized based on a rule or a light weight neural network, and the algorithm is light and fast, so that the audio processing scheme can be detected in real time and can be deployed in a pure CPU hardware environment.
In one example, the target human voice region may be input into a preset language detection module to determine the language corresponding to the target audio signal.
The audio processing method in the embodiment can obtain high-quality clear human voice, so that the accuracy of language identification can be effectively improved.
With further reference to fig. 4, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an audio signal processing apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various terminal devices.
As shown in fig. 4, the audio signal processing apparatus 400 of the present embodiment includes: an acquisition unit 401, a first detection unit 402, a second detection unit 403, a third detection unit 404, and a processing unit 405. The acquisition unit 401 is configured to acquire a target audio signal; the first detection unit 402 is configured to determine an activation region of the target audio signal based on a time domain diagram of the target audio signal and a preset volume threshold; the second detection unit 403 is configured to determine an echo howling region of the target audio signal based on a spectrum centroid sequence of the target audio signal and a preset sequence distribution threshold, where the spectrum centroid sequence is obtained by performing window sliding processing on the time domain diagram of the target audio signal and taking the spectrum centroid of the sub-time domain diagram in each window; the third detection unit 404 is configured to determine a human voice region and a noise region of the target audio signal based on a Mel frequency cepstrum of the target audio signal and a pre-trained machine learning model; and the processing unit 405 is configured to determine a target human voice region corresponding to the target audio signal according to the activation region, the echo howling region, the human voice region, and the noise region.
In this embodiment, specific processing of the obtaining unit 401, the first detecting unit 402, the second detecting unit 403, the third detecting unit 404, and the processing unit 405 of the audio signal processing apparatus 400 and technical effects thereof may refer to related descriptions of step 201, step 202, step 203, step 204, and step 205 in the corresponding embodiment of fig. 2, which are not repeated herein.
In some optional embodiments, the apparatus further comprises a language identification unit (not shown), and the language identification unit is configured to: and determining the language corresponding to the target audio signal according to the target human voice area.
In some optional embodiments, the second detecting unit 403 includes: a spectrum centroid sequence generating unit, configured to generate a spectrum centroid sequence of the target audio signal according to a time domain diagram of the target audio signal; the sub-spectrum centroid sequence generating unit is used for determining a plurality of sub-spectrum centroid sequences corresponding to the spectrum centroid sequence based on a window sliding method; the detection unit is used for determining a sequence distribution parameter of the sub-spectrum centroid sequence and comparing the sequence distribution parameter of the sub-spectrum centroid sequence with the sequence distribution threshold value to determine a detection result of the sub-spectrum centroid sequence, wherein the detection result indicates whether an audio signal region corresponding to the sub-spectrum centroid sequence belongs to an echo howling region; and the merging unit is used for determining an echo howling area of the target audio signal according to the detection result of each sub-spectrum centroid sequence.
In some alternative embodiments, the sequence distribution threshold comprises a standard deviation threshold and a correlation coefficient threshold; the second detecting unit 403 is further configured to: normalizing the sub-spectrum centroid sequence, wherein the normalization is used for mapping the numerical value of the spectrum centroid to the range of [0, 1]; determining a maximum value point of the sub-spectrum centroid sequence; determining the standard deviation of the maximum value point of the sub-spectrum centroid sequence; determining the time interval mean value of the maximum value point of the sub-spectrum centroid sequence; respectively removing data with the length of the time interval mean value from the head and the tail of the sub-spectrum centroid sequence to obtain a corresponding first segment and a second segment, and calculating correlation coefficients of the first segment and the second segment; and under the condition that the standard deviation of the maximum value point of the sub-spectrum centroid sequence is smaller than the standard deviation threshold value and the correlation coefficient of the first segment and the second segment is larger than the correlation coefficient threshold value, determining that the audio signal region corresponding to the sub-spectrum centroid sequence belongs to an echo howling region.
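The sequence-distribution check described in this paragraph could be sketched as follows; the thresholds, the use of scipy.signal.find_peaks for the maximum value points, and the interpretation of the head/tail segments as a lag-shifted pair are assumptions rather than details fixed by the disclosure.

```python
import numpy as np
from scipy.signal import find_peaks

def is_echo_howling(sub_centroids, std_threshold=0.05, corr_threshold=0.9):
    """Check one sub-sequence of spectral centroids for echo/howling."""
    c = np.asarray(sub_centroids, dtype=float)

    # Normalize the centroid values to [0, 1].
    span = c.max() - c.min()
    c = (c - c.min()) / span if span > 0 else np.zeros_like(c)

    # Maximum value points of the normalized sub-sequence.
    peaks, _ = find_peaks(c)
    if len(peaks) < 2:
        return False
    peak_std = c[peaks].std()                          # std of the maxima
    interval_mean = int(round(np.diff(peaks).mean()))  # mean peak spacing

    if interval_mean <= 0 or 2 * interval_mean >= len(c):
        return False
    # Remove `interval_mean` samples from the head and from the tail to form
    # two equally long segments, then correlate them.
    first_segment = c[interval_mean:]
    second_segment = c[:-interval_mean]
    corr = np.corrcoef(first_segment, second_segment)[0, 1]
    if np.isnan(corr):
        return False

    return peak_std < std_threshold and corr > corr_threshold
```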
In some optional embodiments, the generating unit is further configured to: determining a plurality of sub-time domain graphs corresponding to the time domain graph of the target audio signal based on a window sliding method; calculating the spectral centroid of each of the sub-time domain graphs to form a sequence of spectral centroids of the target audio signal.
In some optional embodiments, for each of the plurality of sub-time domain graphs, the spectral centroid of the sub-time domain graph may be determined by:
determining the frequency weight corresponding to each point in the sub time domain graph according to the amplitude of each point in the sub time domain graph and the sum of the amplitudes of each point in the sub time domain graph;
determining the frequency spectrum centroid component corresponding to each point in the sub-time domain graph according to the frequency weight of each point in the sub-time domain graph and the corresponding fast Fourier transform frequency;
and summing the frequency spectrum centroid components corresponding to each point in the sub-time domain graph to obtain the frequency spectrum centroid of the sub-time domain graph.
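Under a conventional reading of the three steps above (frequency weights taken from the FFT magnitudes of each windowed frame), the spectral centroid sequence could be computed roughly as follows; the frame length, hop size and sample rate are illustrative assumptions.

```python
import numpy as np

def spectral_centroid_sequence(signal, frame_len=1024, hop=512, sample_rate=16000):
    """Slide a window over the time-domain signal and return one spectral
    centroid per window."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)  # FFT frequencies
    centroids = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        magnitude = np.abs(np.fft.rfft(frame))
        total = magnitude.sum()
        if total == 0.0:
            centroids.append(0.0)
            continue
        weights = magnitude / total           # frequency weight of each point
        centroids.append(float(np.sum(weights * freqs)))  # weighted sum
    return np.asarray(centroids)
```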
In some alternative embodiments, the machine learning model is trained by: acquiring a training sample set, wherein the training sample set comprises at least one training sample formed by mixing human voice and noise and a sample label used for indicating the position of the human voice and the position of the noise in the training sample; and training a convolutional neural network through the training sample set until a preset convergence condition is reached to obtain the machine learning model, wherein the input of the convolutional neural network comprises a Mel frequency cepstrum of the training sample, and the output of the convolutional neural network comprises a start point and a stop point of a sound region, a foreground confidence coefficient, a human voice confidence coefficient and a noise confidence coefficient.
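As a hedged sketch only, a lightweight convolutional network with the described inputs and outputs might look like the following PyTorch fragment; the layer sizes, the per-frame output parameterization and the loss terms are assumptions, not the architecture actually used in the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoundRegionNet(nn.Module):
    """Lightweight 1-D CNN over Mel-cepstral frames.

    For every frame it predicts a start offset, a stop offset, and foreground,
    human voice and noise confidences (5 channels per frame).
    """

    def __init__(self, n_cepstra=40, channels=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(n_cepstra, channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Output channels: [start, stop, foreground, voice, noise]
        self.head = nn.Conv1d(channels, 5, kernel_size=1)

    def forward(self, cepstrum):                    # (batch, n_cepstra, frames)
        return self.head(self.backbone(cepstrum))   # (batch, 5, frames)

def train_step(model, optimizer, cepstrum, targets):
    """One training step: regression on the offsets, BCE on the confidences."""
    out = model(cepstrum)
    loss = (F.smooth_l1_loss(out[:, :2], targets[:, :2]) +
            F.binary_cross_entropy_with_logits(out[:, 2:], targets[:, 2:]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```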
In some optional embodiments, the third detecting unit 404 is further configured to: inputting the Mel frequency cepstrum of the target audio signal into the machine learning model to obtain at least one target sound region and corresponding human voice confidence and noise confidence; determining the target sound area as a human sound area under the condition that the human sound confidence corresponding to the target sound area is greater than a preset human sound confidence threshold; determining the target sound region as a noise region under the condition that the noise confidence corresponding to the target sound region is greater than a preset noise confidence threshold; combining the obtained multiple voice areas based on a non-maximum suppression algorithm to obtain the voice area of the target audio signal; and carrying out merging processing on the obtained multiple noise regions based on a non-maximum suppression algorithm to obtain the noise region of the target audio signal.
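A minimal sketch of the confidence thresholding and one-dimensional non-maximum suppression described above might look like this; the IoU and confidence thresholds are illustrative assumptions.

```python
import numpy as np

def nms_1d(regions, scores, iou_threshold=0.5):
    """Suppress overlapping (start, end) regions by non-maximum suppression."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        inter = np.maximum(0.0,
                           np.minimum(regions[rest, 1], regions[best, 1]) -
                           np.maximum(regions[rest, 0], regions[best, 0]))
        union = (regions[rest, 1] - regions[rest, 0]) + \
                (regions[best, 1] - regions[best, 0]) - inter
        iou = inter / np.maximum(union, 1e-9)
        order = rest[iou < iou_threshold]
    return regions[keep], scores[keep]

def select_regions(regions, voice_conf, noise_conf,
                   voice_thresh=0.5, noise_thresh=0.5):
    """Split candidate regions into voice and noise by confidence thresholds,
    then merge each group with 1-D NMS."""
    regions = np.asarray(regions, dtype=float)
    voice_conf = np.asarray(voice_conf)
    noise_conf = np.asarray(noise_conf)
    voice_mask = voice_conf > voice_thresh
    noise_mask = noise_conf > noise_thresh
    voice = nms_1d(regions[voice_mask], voice_conf[voice_mask])
    noise = nms_1d(regions[noise_mask], noise_conf[noise_mask])
    return voice, noise
```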
In some optional embodiments, the first detecting unit 402 is further configured to: carrying out absolute value processing on the time domain graph of the target audio signal to obtain a corresponding volume graph; determining a plurality of sub-volume maps corresponding to the volume maps based on a window sliding method; for each sub-volume map, determining the volume average value corresponding to the sub-volume map; and determining that the audio signal region corresponding to the sub-volume map belongs to the active region when the volume average value is larger than the volume threshold value.
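The volume-threshold detection described here could be sketched as below; the window length, hop size and threshold value are assumptions for illustration.

```python
import numpy as np

def activation_windows(signal, window=1024, hop=512, volume_threshold=0.02):
    """Mark windows whose mean absolute amplitude exceeds a volume threshold.

    Returns a boolean mask with one entry per window.
    """
    volume = np.abs(signal)   # "volume map" of the time-domain signal
    n_windows = max(0, (len(volume) - window) // hop + 1)
    active = np.zeros(n_windows, dtype=bool)
    for i in range(n_windows):
        mean_volume = volume[i * hop:i * hop + window].mean()
        active[i] = mean_volume > volume_threshold
    return active
```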
In some optional embodiments, the processing unit 405 is further configured to: take the intersection of the human voice region and the activation region and perform dilation-erosion processing to obtain a first audio signal region; take the union of the noise region and the echo howling region and perform erosion-dilation processing to obtain a second audio signal region; and remove the intersection of the first audio signal region and the second audio signal region from the first audio signal region to obtain the target human voice region corresponding to the target audio signal.
It should be noted that, for details of implementation and technical effects of each unit in the audio signal processing apparatus provided in the embodiments of the present disclosure, reference may be made to descriptions of other embodiments in the present disclosure, and details are not repeated herein.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing the terminal devices of the present disclosure is shown. The computer system 500 shown in fig. 5 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 5, computer system 500 may include a processing device (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage device 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the computer system 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, and the like; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication device 509 may allow the computer system 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates a computer system 500 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the audio signal processing method as shown in the embodiment shown in fig. 2 and its alternative embodiments.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Here, the name of the unit does not constitute a limitation of the unit itself in some cases, and for example, the acquisition unit may also be described as "a unit that acquires a target audio signal".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.

Claims (12)

1. A method of processing an audio signal, comprising:
acquiring a target audio signal;
determining an activation region of the target audio signal based on the time domain diagram of the target audio signal and a preset volume threshold;
determining an echo howling region of the target audio signal based on a frequency spectrum centroid sequence of the target audio signal and a preset sequence distribution threshold, wherein the frequency spectrum centroid sequence is obtained by performing window sliding processing on a time domain graph of the target audio signal and taking a frequency spectrum centroid of a sub-time domain graph in each window;
determining a human voice region and a noise region of the target audio signal based on a Mel frequency cepstrum of the target audio signal and a pre-trained machine learning model;
determining a target human voice region corresponding to the target audio signal according to the activation region, the echo howling region, the human voice region and the noise region, including: taking an intersection of the human voice region and the activation region and performing dilation-erosion processing to obtain a first audio signal region; taking a union of the noise region and the echo howling region and performing erosion-dilation processing to obtain a second audio signal region; and removing the intersection of the first audio signal region and the second audio signal region from the first audio signal region to obtain the target human voice region corresponding to the target audio signal.
2. The method of claim 1, wherein the method further comprises:
and determining the language corresponding to the target audio signal according to the target human voice area.
3. The method of claim 1, wherein the determining an echo howling region of the target audio signal based on the spectral centroid sequence of the target audio signal and a preset sequence distribution threshold comprises:
generating a sequence of spectral centroids of the target audio signal from a time domain map of the target audio signal;
determining a plurality of sub-spectrum centroid sequences corresponding to the spectrum centroid sequence based on a window sliding method;
for each sub-spectrum centroid sequence, determining a sequence distribution parameter of the sub-spectrum centroid sequence, and comparing the sequence distribution parameter of the sub-spectrum centroid sequence with the sequence distribution threshold to determine a detection result of the sub-spectrum centroid sequence, wherein the detection result indicates whether an audio signal region corresponding to the sub-spectrum centroid sequence belongs to an echo howling region;
and determining an echo howling area of the target audio signal according to the detection result of each sub-spectrum centroid sequence.
4. The method of claim 3, wherein the sequence distribution threshold comprises a standard deviation threshold and a correlation coefficient threshold; and
the determining of the sequence distribution parameters of the sub-spectrum centroid sequence includes:
carrying out normalization processing on the sub-spectrum centroid sequence, wherein the normalization processing is used for mapping the numerical value of the spectrum centroid to a preset range; determining a maximum value point of the sub-spectrum centroid sequence; determining the standard deviation of the maximum value point of the sub-spectrum centroid sequence; determining the time interval mean value of the maximum value point of the sub-spectrum centroid sequence; respectively removing data with the length of the time interval mean value from the head and the tail of the sub-spectrum centroid sequence to obtain a corresponding first segment and a second segment, and calculating correlation coefficients of the first segment and the second segment;
the comparing the sequence distribution parameter of the sub-spectrum centroid sequence with the sequence distribution threshold value to determine the detection result of the sub-spectrum centroid sequence includes:
and under the condition that the standard deviation of the maximum value point of the sub-spectrum centroid sequence is smaller than the standard deviation threshold value and the correlation coefficient of the first segment and the second segment is larger than the correlation coefficient threshold value, determining that the audio signal region corresponding to the sub-spectrum centroid sequence belongs to an echo howling region.
5. The method of claim 3, wherein the generating a sequence of spectral centroids of the target audio signal from the time domain map of the target audio signal comprises:
determining a plurality of sub-time domain graphs corresponding to the time domain graph of the target audio signal based on a window sliding method;
calculating the spectral centroid of each of the sub-time domain graphs to form a sequence of spectral centroids of the target audio signal.
6. The method of claim 5, wherein, for each of the plurality of sub-time domain graphs, a spectral centroid of the sub-time domain graph is determined by:
determining the frequency weight corresponding to each point in the sub time domain graph according to the amplitude of each point in the sub time domain graph and the sum of the amplitudes of each point in the sub time domain graph;
determining the frequency spectrum centroid component corresponding to each point in the sub-time domain graph according to the frequency weight of each point in the sub-time domain graph and the corresponding fast Fourier transform frequency;
and summing the frequency spectrum centroid components corresponding to each point in the sub-time domain graph to obtain the frequency spectrum centroid of the sub-time domain graph.
7. The method of claim 1, wherein the machine learning model is trained by:
acquiring a training sample set, wherein the training sample set comprises at least one training sample formed by mixing human voice and noise, and a sample label used for indicating the position of the human voice and the position of the noise in the training sample;
and training a convolutional neural network through the training sample set until a preset convergence condition is reached to obtain the machine learning model, wherein the convolutional neural network inputs include Mel frequency cepstrum of the training samples, and the convolutional neural network outputs include a start point and a stop point of a sound region, a foreground confidence coefficient, a human voice confidence coefficient and a noise confidence coefficient.
8. The method of claim 7, wherein the determining human and noise regions of the target audio signal based on the mel-frequency cepstrum of the target audio signal and a pre-trained machine learning model comprises:
inputting the Mel frequency cepstrum of the target audio signal into the machine learning model to obtain at least one target sound region and corresponding human voice confidence and noise confidence;
determining the target sound area as a human sound area under the condition that the human sound confidence corresponding to the target sound area is greater than a preset human sound confidence threshold; determining the target sound region as a noise region under the condition that the noise confidence corresponding to the target sound region is greater than a preset noise confidence threshold;
combining the obtained multiple voice areas based on a non-maximum suppression algorithm to obtain the voice area of the target audio signal; and combining the obtained multiple noise regions based on a non-maximum suppression algorithm to obtain the noise region of the target audio signal.
9. The method according to any one of claims 1-8, wherein the determining the activation region of the target audio signal based on the time domain map of the target audio signal and a preset volume threshold comprises:
carrying out absolute value processing on the time domain graph of the target audio signal to obtain a corresponding volume graph;
determining a plurality of sub-volume maps corresponding to the volume map based on a window sliding method;
for each sub-volume map, determining the volume average value corresponding to the sub-volume map; and determining that the audio signal region corresponding to the sub-volume map belongs to the activation region under the condition that the volume average value is larger than the volume threshold value.
10. An apparatus for processing an audio signal, comprising:
an acquisition unit configured to acquire a target audio signal;
the first detection unit is used for determining an activation region of the target audio signal based on a time domain diagram of the target audio signal and a preset volume threshold;
the second detection unit is used for determining an echo howling area of the target audio signal based on a frequency spectrum centroid sequence of the target audio signal and a preset sequence distribution threshold, wherein the frequency spectrum centroid sequence is obtained by performing window sliding processing on a time domain image of the target audio signal and taking the frequency spectrum centroid of a sub-time domain image in each window;
a third detection unit, configured to determine a human voice region and a noise region of the target audio signal based on a mel-frequency cepstrum of the target audio signal and a pre-trained machine learning model;
a processing unit, configured to determine a target human voice region corresponding to the target audio signal according to the activation region, the echo howling region, the human voice region, and the noise region, wherein the processing unit is further configured to: take an intersection of the human voice region and the activation region and perform dilation-erosion processing to obtain a first audio signal region; take a union of the noise region and the echo howling region and perform erosion-dilation processing to obtain a second audio signal region; and remove the intersection of the first audio signal region and the second audio signal region from the first audio signal region to obtain the target human voice region corresponding to the target audio signal.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-9.
12. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method of any one of claims 1-9.
CN202110145224.1A 2021-02-02 2021-02-02 Audio signal processing method and device, electronic equipment and storage medium Active CN112992190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110145224.1A CN112992190B (en) 2021-02-02 2021-02-02 Audio signal processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110145224.1A CN112992190B (en) 2021-02-02 2021-02-02 Audio signal processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112992190A CN112992190A (en) 2021-06-18
CN112992190B true CN112992190B (en) 2021-12-10

Family

ID=76346224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110145224.1A Active CN112992190B (en) 2021-02-02 2021-02-02 Audio signal processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112992190B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863957B (en) * 2023-09-05 2023-12-12 硕橙(厦门)科技有限公司 Method, device, equipment and storage medium for identifying operation state of industrial equipment
CN117079660B (en) * 2023-10-18 2023-12-19 广东图盛超高清创新中心有限公司 Panoramic sound real-time data noise reduction method for rebroadcasting vehicle

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1530929A (en) * 2003-02-21 2004-09-22 哈曼贝克自动系统-威美科公司 System for inhibitting wind noise
CN107928673A (en) * 2017-11-06 2018-04-20 腾讯科技(深圳)有限公司 Acoustic signal processing method, device, storage medium and computer equipment
WO2018221206A1 (en) * 2017-05-29 2018-12-06 株式会社トランストロン Echo suppression device, echo suppression method and echo suppression program
CN109600526A (en) * 2019-01-08 2019-04-09 上海上湖信息技术有限公司 Customer service quality determining method and device, readable storage medium storing program for executing
CN110097884A (en) * 2019-06-11 2019-08-06 大众问问(北京)信息科技有限公司 A kind of voice interactive method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2210427B1 (en) * 2007-09-26 2015-05-06 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus, method and computer program for extracting an ambient signal
FR2943875A1 (en) * 2009-03-31 2010-10-01 France Telecom METHOD AND DEVICE FOR CLASSIFYING BACKGROUND NOISE CONTAINED IN AN AUDIO SIGNAL.
US9672841B2 (en) * 2015-06-30 2017-06-06 Zte Corporation Voice activity detection method and method used for voice activity detection and apparatus thereof
CN106486131B (en) * 2016-10-14 2019-10-11 上海谦问万答吧云计算科技有限公司 A kind of method and device of speech de-noising
CN111149370B (en) * 2017-09-29 2021-10-01 杜比实验室特许公司 Howling detection in a conferencing system
CN109599120B (en) * 2018-12-25 2021-12-07 哈尔滨工程大学 Abnormal mammal sound monitoring method based on large-scale farm plant
CN111081246B (en) * 2019-12-24 2022-06-24 北京达佳互联信息技术有限公司 Method and device for awakening live broadcast robot, electronic equipment and storage medium
CN112017630B (en) * 2020-08-19 2022-04-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1530929A (en) * 2003-02-21 2004-09-22 哈曼贝克自动系统-威美科公司 System for inhibitting wind noise
WO2018221206A1 (en) * 2017-05-29 2018-12-06 株式会社トランストロン Echo suppression device, echo suppression method and echo suppression program
CN107928673A (en) * 2017-11-06 2018-04-20 腾讯科技(深圳)有限公司 Acoustic signal processing method, device, storage medium and computer equipment
CN109600526A (en) * 2019-01-08 2019-04-09 上海上湖信息技术有限公司 Customer service quality determining method and device, readable storage medium storing program for executing
CN110097884A (en) * 2019-06-11 2019-08-06 大众问问(北京)信息科技有限公司 A kind of voice interactive method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-level FrFT speech enhancement based on transform-domain sparsity measures; Fan Zhenyan; Computer Engineering and Design; 2020-09-30; Vol. 41, No. 9; pp. 2574-2584 *

Also Published As

Publication number Publication date
CN112992190A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN107945786B (en) Speech synthesis method and device
KR102118411B1 (en) Systems and methods for source signal separation
EP2828856B1 (en) Audio classification using harmonicity estimation
CN109545193B (en) Method and apparatus for generating a model
CN111161752A (en) Echo cancellation method and device
CN111415653B (en) Method and device for recognizing speech
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN108039181B (en) Method and device for analyzing emotion information of sound signal
CN111028845A (en) Multi-audio recognition method, device, equipment and readable storage medium
CN113257283B (en) Audio signal processing method and device, electronic equipment and storage medium
CN108877779B (en) Method and device for detecting voice tail point
CN107680584B (en) Method and device for segmenting audio
CN111341333B (en) Noise detection method, noise detection device, medium, and electronic apparatus
CN111863015A (en) Audio processing method and device, electronic equipment and readable storage medium
WO2013138122A2 (en) Automatic realtime speech impairment correction
CN108962226B (en) Method and apparatus for detecting end point of voice
JP6724290B2 (en) Sound processing device, sound processing method, and program
Jeon et al. Acoustic surveillance of hazardous situations using nonnegative matrix factorization and hidden Markov model
CN115696176A (en) Audio object-based sound reproduction method, device, equipment and storage medium
CN115083440A (en) Audio signal noise reduction method, electronic device, and storage medium
CN111383629B (en) Voice processing method and device, electronic equipment and storage medium
CN114627889A (en) Multi-sound-source sound signal processing method and device, storage medium and electronic equipment
CN109634554B (en) Method and device for outputting information
CN111624554B (en) Sound source positioning method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant