CN112750453B - Audio signal screening method, device, equipment and storage medium - Google Patents

Publication number
CN112750453B
Authority
CN
China
Prior art keywords
audio signal
frame
signal
energy
ratio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011549999.7A
Other languages
Chinese (zh)
Other versions
CN112750453A (en
Inventor
刘鲁鹏
元海明
李贝
王晓红
陈佳路
高强
夏龙
郭常圳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd
Priority to CN202011549999.7A
Publication of CN112750453A
Application granted
Publication of CN112750453B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The application relates to an audio signal screening method, an audio signal screening device, audio signal screening equipment and a storage medium. The method comprises the following steps: determining the signal energy and the signal-to-noise ratio of each frame of audio signal in the audio signal; determining two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and determining the ratio of the two frame numbers; and determining the audio signal as a target audio signal when the ratio is larger than a set ratio threshold. The scheme provided by the application can simply and effectively screen out target audio signals with low background noise, and has good universality.

Description

Audio signal screening method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for screening audio signals.
Background
In speech recognition, a branch of artificial intelligence, machine learning requires a large number of audio signal samples, and the quality of those samples directly affects the accuracy of the trained model. Audio signals collected in daily life contain a great deal of noise, which is detrimental to training speech models, so audio signals with less noise need to be screened out from a large number of audio signals. In the audio screening method of the related art, the features of the audio to be screened are compared with the features of a target audio (audio that meets the noise requirement), and if the comparison result meets a preset condition, the audio to be screened is used as available audio or as a training sample.
However, in the related art, features must be extracted from each audio signal before the comparison. Audio feature extraction is not easy, and errors in it can lower the screening accuracy. In addition, a dedicated feature extraction model has to be set up for each type or function of training requirement; such models have low universality and are complex to implement.
Disclosure of Invention
In order to overcome the problems in the related art, the application provides an audio signal screening method, an audio signal screening device, audio signal screening equipment and a storage medium.
A first aspect of the present application provides an audio signal screening method, including:
determining the signal energy of each frame of audio signal and determining the signal-to-noise ratio of each frame of audio signal in the audio signal;
determining two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and determining the ratio of the two frame numbers;
determining the audio signal as a target audio signal according to the fact that the ratio is larger than a set ratio threshold;
the determining two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and determining the ratio of the two frame numbers, comprises:
counting a first frame number, being the number of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold and whose signal energy is greater than a signal energy threshold, wherein the signal energy threshold is obtained by subtracting a set value from the signal energy of each frame of audio signal to obtain a set of values and then taking the maximum of those values through a max function;
counting a second frame number, being the number of frames whose signal energy is greater than the signal energy threshold;
determining a ratio of the first frame number to the second frame number.
In one mode, the counting of the first frame number includes:
traversing the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and counting the number of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold and whose signal energy is greater than the signal energy threshold;
the counting of the second frame number includes:
traversing the signal energy of each frame of audio signal, and counting the number of frames whose signal energy is greater than the signal energy threshold.
In one mode, the determining a signal-to-noise ratio of each frame of audio signals in the audio signals includes:
framing the audio signal;
carrying out noise reduction processing on each frame of audio signal to obtain each frame of audio signal subjected to noise reduction;
and determining the signal-to-noise ratio of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal after noise reduction and the noise energy of each frame of audio signal before noise reduction.
In one form, the framing the audio signal includes:
framing the audio signal according to a preset time length;
if the audio length of the audio signal does not meet the integral multiple of the preset time length, zero filling processing is carried out on the tail part of the audio signal so that the integral multiple of the preset time length is met, and then framing is carried out.
In one mode, the determining the signal-to-noise ratio of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal after noise reduction and the noise energy of each frame of audio signal before noise reduction includes:
obtaining the noise energy of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal before noise reduction and the signal energy of each frame of audio signal after noise reduction;
and carrying out logarithmic operation according to the ratio of the signal energy of each frame of audio signal after noise reduction to the noise energy, and determining the signal-to-noise ratio of each frame of audio signal before noise reduction.
The second aspect of the present application provides an audio signal screening apparatus, comprising:
the signal energy and signal-to-noise ratio module is used for determining the signal energy of each frame of audio signal and determining the signal-to-noise ratio of each frame of audio signal in the audio signals;
the ratio module is used for determining two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and determining the ratio of the two frame numbers;
the screening module is used for determining the audio signal as a target audio signal according to the fact that the ratio determined by the ratio module is larger than a set ratio threshold;
the ratio module comprises:
the first frame number counting submodule is used for counting a first frame number, being the number of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold and whose signal energy is greater than a signal energy threshold, wherein the signal energy threshold is obtained by subtracting a set value from the signal energy of each frame of audio signal to obtain a set of values and then taking the maximum of those values through a max function;
the second frame number counting submodule is used for counting a second frame number, being the number of frames whose signal energy is greater than the signal energy threshold;
and the ratio determining submodule is used for determining the ratio of the first frame number counted by the first frame number counting submodule to the second frame number counted by the second frame number counting submodule.
A third aspect of the present application provides an electronic device comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.
A fourth aspect of the present application provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform a method as described above.
The technical scheme provided by the application can comprise the following beneficial effects:
the technical scheme of the application firstly determines the signal energy of each frame of audio signal in the audio signal (namely the audio signal to be screened) and determines the signal-to-noise ratio of each frame of audio signal in the audio signal; determining two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and determining the ratio of the two frame numbers; then, according to the fact that the ratio is larger than a set ratio threshold, the audio signal is determined to be a target audio signal, and therefore the target audio signal with low background noise is screened out. The screening method is simple and effective, has strong universality, can effectively reduce the complexity of audio signal screening, and improves the screening efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application, as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
Fig. 1 is a schematic flowchart of an audio signal screening method according to an embodiment of the present application;
Fig. 2 is another schematic flowchart of an audio signal screening method according to an embodiment of the present application;
Fig. 3 is a schematic diagram illustrating a framing process of an audio signal according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an audio signal screening apparatus according to an embodiment of the present application;
Fig. 5 is another schematic structural diagram of an audio signal screening apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
Detailed Description
Preferred embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In speech recognition, a branch of artificial intelligence, model training requires a large number of audio signal samples. Audio signals collected in daily life contain a great deal of noise, which is detrimental to training speech models, so audio signals with less noise need to be screened out from a large number of audio signals. In the related art, the features of the audio to be screened are compared with the features of a target audio (audio meeting the noise requirement), and if the comparison result meets a preset condition, the audio to be screened can be used as available audio or as a training sample. Features must be extracted from each audio signal before the comparison; audio feature extraction is not easy, and errors in it lead to low screening accuracy and low screening efficiency.
In view of the above problems, embodiments of the present application provide an audio signal screening method, which can simply and effectively screen out a target audio signal with low background noise.
The technical solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of an audio signal screening method according to an embodiment of the present application.
Referring to fig. 1, an embodiment of an audio signal screening method in an embodiment of the present application includes:
In step S101, the signal energy and the signal-to-noise ratio of each frame of audio signal in the audio signal are determined.
The signal-to-noise ratio (SNR) refers to the ratio of signal to noise in an electronic device or system. In the embodiment of the present application, the signal-to-noise ratio of each frame of audio signal refers to the ratio of the effective sound signal to the background noise in that frame.
In this step, the audio signal may be framed; carrying out noise reduction processing on each frame of audio signal to obtain each frame of audio signal subjected to noise reduction; and determining the signal-to-noise ratio of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal after noise reduction and the noise energy of each frame of audio signal before noise reduction.
In this step, the signal energy of each frame of audio signal before noise reduction, the noise energy of each frame of audio signal before noise reduction, and the signal energy of each frame of audio signal after noise reduction may be determined according to a set algorithm.
In this embodiment of the application, the algorithm for performing noise reduction processing on the audio signal may be a minimum-tracking noise estimation algorithm, a Minima Controlled Recursive Averaging (MCRA) algorithm, or an Improved Minima Controlled Recursive Averaging (IMCRA) algorithm based on Wiener filtering.
It is to be understood that the noise reduction algorithm in the embodiment of the present application is not limited, and may be any algorithm capable of reducing the background noise in the audio signal.
In step S102, two frame numbers under different statistical conditions are determined according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and a ratio of the two frame numbers is determined.
In the step, a first frame number that the signal-to-noise ratio of each frame of audio signal is greater than a set signal-to-noise ratio threshold and the signal energy of each frame of audio signal is greater than a signal energy threshold can be counted; counting a second frame number of each frame of audio signal with the signal energy greater than the signal energy threshold; a ratio of the first frame number to the second frame number is determined.
Counting the first frame number includes: traversing the signal-to-noise ratio and the signal energy of each frame of audio signal, and counting the number of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold and whose signal energy is greater than the signal energy threshold.
Counting the second frame number includes: traversing the signal energy of each frame of audio signal, and counting the number of frames whose signal energy is greater than the signal energy threshold.
The set signal-to-noise ratio threshold is an empirical threshold, set in advance, for judging the level of background noise in each frame of audio signal. In practical applications, it may be set within the range of 15 to 25 dB according to actual requirements, for example 20 dB.
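The two counts and the ratio described in step S102 can be written compactly as below. This is a restatement for illustration, not an additional formula from the source: the symbols $p_1$, $p_2$, $r$, $\mathrm{snr}_i$ and $E_{x_i}$ follow the notation of the application example later in this description, and the set notation is an assumption introduced here.

```latex
\[
p_1 = \bigl|\{\, i : \mathrm{snr}_i > \mathrm{snr}_{\mathrm{thresh}} \ \text{and}\ E_{x_i} > E_{\mathrm{thresh}} \,\}\bigr|,
\qquad
p_2 = \bigl|\{\, i : E_{x_i} > E_{\mathrm{thresh}} \,\}\bigr|,
\qquad
r = \frac{p_1}{p_2}
\]
```

Every frame counted in $p_1$ is also counted in $p_2$, so $0 \le r \le 1$ whenever $p_2 > 0$.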
In step S103, the audio signal is determined as the target audio signal according to the ratio being greater than the set ratio threshold.
For example, assuming that the ratio threshold is set to be 0.8, if the ratio determined according to the foregoing steps is greater than 0.8, it indicates that the signal-to-noise ratio of the audio signal x for more than 80% of the time duration is greater than 20dB, i.e., the noise content of the audio signal x is low, and the audio signal x is clean audio, so that the audio signal x is screened out.
The technical scheme of the application comprises the steps of firstly determining the signal energy of each frame of audio signal in the audio signal (namely the audio signal to be screened) and determining the signal-to-noise ratio of each frame of audio signal in the audio signal; determining two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and determining the ratio of the two frame numbers; then, according to the fact that the ratio is larger than the set ratio threshold, the audio signal is determined to be the target audio signal, and therefore the target audio signal with low background noise is screened out. The screening method is simple and effective, has strong universality, can effectively reduce the complexity of audio signal screening, and improves the screening efficiency.
For convenience of understanding, an application example of the audio signal screening method is provided below for explanation, and an example of the audio signal screening method in this application example includes:
in the embodiment of the present application, it is assumed that a training model of speech recognition needs to recognize a speaker's voice with environmental sounds, and a training sample of the training model needs an audio signal of the speaker's voice with low background noise (or meets the requirement of low background noise). The background noise of the audio signal to be screened in the embodiment of the present application may be environmental sound, that is, the audio signal whose environmental sound meets the requirement needs to be screened out in the embodiment of the present application, and the audio signal is used as a training sample of a training model.
Fig. 2 is another schematic flow chart of the audio signal screening method according to the embodiment of the present application. Fig. 2 presents the solution of the present application in more detail with respect to fig. 1.
Referring to fig. 2, an embodiment of an audio signal screening method in an embodiment of the present application includes:
in step S201, the audio signal is framed.
In the embodiment of the present application, it is assumed that the audio signal is x, i.e., the audio signal to be screened.
This step may frame the audio signal by a preset time length; if the audio length of the audio signal does not meet the integral multiple of the preset time length, zero filling processing is carried out on the tail part of the audio signal so that the tail part of the audio signal meets the integral multiple of the preset time length, and then framing is carried out.
For example, the audio signal x is framed, each frame may have a preset time length, for example, 32ms, and if the audio length is less than an integer multiple of 32ms, the tail of the audio signal x may be padded with zeros first, so that the length of the audio signal x reaches the integer multiple of 32ms, and then framing is performed. As for the framing method, as shown in fig. 3, frames do not overlap with each other, and the audio signal of each frame after framing can be recorded as:
$x_i,\ i = 1, 2, \ldots, n$, where $n$ is the total number of frames of the audio signal x. Note that 32 ms is an empirical value and can be adjusted as needed.
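The framing rule of step S201 (non-overlapping frames, with zero padding at the tail) might be sketched as follows. The function name `frame_signal`, the 16 kHz sample rate and the plain-list representation of samples are assumptions for illustration; only the 32 ms frame length and the zero-padding rule come from the text.

```python
def frame_signal(samples, sample_rate, frame_ms=32):
    """Split samples into non-overlapping frames, zero-padding the tail.

    Hypothetical helper: the name and signature are not from the source.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    padded = list(samples)
    remainder = len(padded) % frame_len
    if remainder:
        # Pad with zeros so the length becomes an integer multiple of frame_len.
        padded.extend([0.0] * (frame_len - remainder))
    return [padded[k:k + frame_len] for k in range(0, len(padded), frame_len)]


# 1000 samples at an assumed 16 kHz: each frame holds 512 samples,
# so the tail is padded with 24 zeros and two frames result.
frames = frame_signal([0.1] * 1000, sample_rate=16000)
print(len(frames), len(frames[0]))
```

Padding before slicing keeps every frame the same length, which means the per-frame quantities computed afterwards are all over the same number of sampling points M.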
In step S202, each frame of audio signal is subjected to noise reduction processing to obtain each frame of audio signal subjected to noise reduction.
In this step, noise reduction is performed on $x_i$ to obtain each noise-reduced frame of audio signal $s_i$.
In the embodiment of the present application, the algorithm for performing noise reduction processing on the audio signal may be a minimum-tracking noise estimation algorithm, a Minima Controlled Recursive Averaging (MCRA) algorithm, or an Improved Minima Controlled Recursive Averaging (IMCRA) algorithm based on Wiener filtering.
It should be noted that the algorithm selected for performing the noise reduction processing on the audio signal is not limited, that is, the noise reduction algorithm is not limited, as long as the background noise in the audio signal can be eliminated.
In step S203, the signal energy of each frame of audio signal is calculated both before and after noise reduction.
In the embodiment of the application, the M sampling points of each frame of audio signal $x_i$ before noise reduction can be determined, and the signal energy of $x_i$ can be calculated from the sampling values at those M sampling points. For example, the signal energy $E_{x_i}$ of each frame of audio signal $x_i$ before noise reduction can be calculated according to the following formula:
[Formula image GDA0004045216070000091 not reproduced: $E_{x_i}$ as a function of the sampling values $x_{i,j}$, $j = 1, 2, \ldots, M$]
where $E_{x_i}$ is the signal energy of each frame of audio signal $x_i$ before noise reduction, $M$ is the total number of sampling points in $x_i$, and $x_{i,j}$ is the value of the $j$-th sampling point of $x_i$.
In the embodiment of the present application, the M sampling points of each noise-reduced frame of audio signal $s_i$ whose positions correspond to those of the frame $x_i$ before noise reduction can be determined, and the signal energy of $s_i$ can be calculated from the sampling values at those M sampling points. For example, the signal energy $E_{s_i}$ of each noise-reduced frame of audio signal $s_i$ can be calculated according to the following formula:
[Formula image GDA0004045216070000092 not reproduced: $E_{s_i}$ as a function of the sampling values $s_{i,j}$, $j = 1, 2, \ldots, M$]
where $E_{s_i}$ is the signal energy of each noise-reduced frame of audio signal $s_i$, $M$ is the total number of sampling points in $s_i$, and $s_{i,j}$ is the value of the $j$-th sampling point of $s_i$.
It will be appreciated that in practical applications, the calculation of the audio signal energy may be implemented by other methods, and the above algorithm description is only exemplary and should not be taken as the only limitation of the calculation of the audio signal energy.
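As one possible reading of the energy calculation in step S203, the sketch below uses the sum of squared sample values over the M sampling points of a frame. This definition is an assumption: the source gives its formula only as a figure, and, as the preceding paragraph notes, other energy calculations are permitted. The name `frame_energy` is hypothetical.

```python
def frame_energy(frame):
    """Energy of one frame, assumed here to be the sum of squared samples.

    The exact formula in the source is given as a figure; this definition
    is an illustrative assumption, not the source's confirmed formula.
    """
    return sum(value * value for value in frame)


x_i = [1.0, -2.0, 3.0]   # a toy frame before noise reduction (M = 3)
s_i = [1.0, -2.0, 2.0]   # the same frame after a hypothetical denoiser
print(frame_energy(x_i), frame_energy(s_i))
```

The same function is applied to the frame before and after noise reduction, so that both energies are taken over sampling points at corresponding positions.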
In step S204, the noise energy of each frame of audio signal before noise reduction is obtained according to the signal energy of each frame of audio signal before noise reduction and the signal energy of each frame of audio signal after noise reduction.
Subtracting the signal energy of each frame of audio signal after noise reduction from the signal energy of each frame of audio signal before noise reduction to obtain the noise energy of each frame of audio signal before noise reduction.
This step calculates the noise energy of each frame of audio signal before noise reduction, i.e., the noise energy $E_{n_i}$ of $x_i$.
Illustratively, the noise energy $E_{n_i}$ of $x_i$ may be calculated according to the following formula:
$E_{n_i} = E_{x_i} - E_{s_i}$
where $E_{n_i}$ is the noise energy of $x_i$, $E_{s_i}$ is the signal energy of each noise-reduced frame of audio signal $s_i$, and $E_{x_i}$ is the signal energy of each frame of audio signal $x_i$ before noise reduction.
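The subtraction in step S204 is a one-line computation. The sketch below adds a small positive floor so that a denoiser that happens to increase a frame's energy cannot produce a zero or negative noise energy; that floor, and the name `noise_energy`, are assumptions not found in the source.

```python
def noise_energy(e_x, e_s, floor=1e-12):
    """Noise energy of a frame: energy before minus energy after noise reduction.

    The floor is an added safeguard (an assumption), not part of the source formula.
    """
    return max(e_x - e_s, floor)


print(noise_energy(14.0, 9.0))   # plain subtraction applies
print(noise_energy(5.0, 6.0))    # the floor applies
```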
In step S205, a signal-to-noise ratio of each frame of audio signal before noise reduction is determined according to the signal energy of each frame of audio signal after noise reduction and the noise energy of each frame of audio signal before noise reduction.
According to the ratio of the signal energy and the noise energy of each frame of audio signal after noise reduction, carrying out logarithmic operation to determine the signal-to-noise ratio of each frame of audio signal before noise reduction.
Denote the signal-to-noise ratio of each frame of audio signal $x_i$ before noise reduction as $\mathrm{snr}_i$. Illustratively, the signal-to-noise ratio may be calculated according to the following formula:
$\mathrm{snr}_i = 10 \log_{10}(E_{s_i} / E_{n_i})$
where $\mathrm{snr}_i$ is the signal-to-noise ratio of each frame of audio signal $x_i$ before noise reduction, $E_{s_i}$ is the signal energy of each noise-reduced frame of audio signal, and $E_{n_i}$ is the noise energy of each frame of audio signal before noise reduction.
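The per-frame signal-to-noise ratio of step S205 can be sketched as below. The function name `frame_snr_db` and the `eps` guard against division by zero or the logarithm of zero are assumptions added for robustness; the formula itself is the one given in the text.

```python
import math


def frame_snr_db(e_s, e_n, eps=1e-12):
    """SNR in dB of one frame: 10 * log10(E_s / E_n).

    eps is an assumed guard for silent or noise-free frames.
    """
    return 10.0 * math.log10(max(e_s, eps) / max(e_n, eps))


# A frame whose denoised energy is 100 times its noise energy has an SNR of 20 dB.
print(frame_snr_db(100.0, 1.0))
```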
In step S206, a signal energy threshold is determined.
In this step, a signal energy threshold $E_{\mathrm{thresh}}$ is determined. Illustratively, the signal energy threshold may be calculated according to the following formula:
$E_{\mathrm{thresh}} = \max(E_{x_i} - 30),\ i = 1, 2, \ldots, n$
where 30 is an empirical value and can be adjusted as required. That is, a set value such as 30 is subtracted from the signal energy of each frame of audio signal $x_i$ to obtain a set of values, and the maximum of those values is then obtained through a max function and used as the signal energy threshold.
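The threshold of step S206 can be sketched as follows. The function name `energy_threshold` and the example energies are hypothetical; the source does not state the units of the energies or of the set value 30, so the numbers below are illustrative only.

```python
def energy_threshold(energies, offset=30.0):
    """Signal energy threshold: the maximum over frames of (E_x_i - offset)."""
    if not energies:
        raise ValueError("at least one frame energy is required")
    return max(e - offset for e in energies)


frame_energies = [50.0, 72.0, 65.0]   # hypothetical per-frame energies
print(energy_threshold(frame_energies))
```

Since the same set value is subtracted from every frame, the result equals the largest frame energy minus the set value, so the threshold tracks the loudest frame of the audio signal.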
In step S207, a first frame number is counted, namely the number of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold and whose signal energy is greater than the signal energy threshold.
In the embodiment of the present application, it is assumed that the signal-to-noise ratio threshold $\mathrm{snr}_{thresh}$ is set to 20 dB. It should be noted that setting the signal-to-noise ratio threshold to 20 dB is only illustrative and not limiting, and the value can be adjusted as needed.
In this step, the signal-to-noise ratio $\mathrm{snr}_i$ and the signal energy $E_{x_i}$ of each frame of audio signal can be traversed, and the number of frames for which $\mathrm{snr}_i$ is greater than $\mathrm{snr}_{thresh}$ and $E_{x_i}$ is greater than $E_{thresh}$ is counted and denoted as p1.
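The traversal that produces p1 can be sketched as follows. The function name `count_first_frames` and its argument names are assumptions made for this illustration, and the sample values are made up.

```python
def count_first_frames(snr, e_x, snr_thresh, e_thresh):
    """Count frames with snr_i > snr_thresh AND E_x_i > e_thresh (p1)."""
    if len(snr) != len(e_x):
        raise ValueError("snr and e_x must have one entry per frame")
    return sum(1 for s, e in zip(snr, e_x) if s > snr_thresh and e > e_thresh)


snr = [25.0, 15.0, 30.0, 22.0]   # per-frame signal-to-noise ratios in dB
e_x = [60.0, 55.0, 10.0, 70.0]   # per-frame signal energies
# Frame 1 fails the SNR test and frame 2 fails the energy test,
# so only frames 0 and 3 are counted.
print(count_first_frames(snr, e_x, snr_thresh=20.0, e_thresh=40.0))  # 2
```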
In step S208, a second frame number is counted, namely the number of frames whose signal energy is greater than the signal energy threshold.
In this step, the signal energy $E_{x_i}$ of each frame of audio signal can be traversed, and the number of frames for which $E_{x_i}$ is greater than $E_{thresh}$ is counted and denoted as p2.
In step S209, the ratio of the first frame number to the second frame number is determined, and the audio signal is determined to be the target audio signal when the ratio is greater than the set ratio threshold.
The ratio of the first frame number to the second frame number is r = p1/p2. For example, assuming that the ratio threshold is set to 0.8, a ratio r greater than 0.8 indicates that the signal-to-noise ratio exceeds 20 dB for more than 80% of the effective duration of the audio signal x; that is, the noise content of the audio signal x is low and the audio signal x is clean audio. The audio signal x is then determined to be a target audio signal and may be selected into a sample library for training a speech recognition model. Otherwise, the audio signal x is discarded. The effective duration here refers to the audio frames whose energy $E_{x_i}$ is greater than $E_{thresh}$. It should be noted that setting the ratio threshold to 0.8 is only illustrative and not limiting, and the value may be adjusted as needed; for example, the ratio threshold may be set to a value between 0.7 and 0.9.
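Steps S206 to S209 can be combined into one hedged Python sketch. The function name `is_target_audio`, the parameter names, and the guard for p2 = 0 are assumptions made here; the default values 20 dB, 0.8, and 30 are the illustrative figures from the description.

```python
def is_target_audio(snr, e_x, snr_thresh=20.0, ratio_thresh=0.8, offset=30.0):
    """Return True when r = p1 / p2 exceeds ratio_thresh.

    snr: per-frame signal-to-noise ratios (dB) before noise reduction.
    e_x: per-frame signal energies before noise reduction.
    """
    if not e_x or len(snr) != len(e_x):
        raise ValueError("snr and e_x must be non-empty and equally long")
    e_thresh = max(e - offset for e in e_x)                 # step S206
    p1 = sum(1 for s, e in zip(snr, e_x)
             if s > snr_thresh and e > e_thresh)            # step S207
    p2 = sum(1 for e in e_x if e > e_thresh)                # step S208
    if p2 == 0:
        # Defensive guard (an assumption): no effective frames at all.
        return False
    return p1 / p2 > ratio_thresh                           # step S209


e_x = [100.0, 95.0, 90.0, 98.0, 20.0, 92.0]   # frame 4 is below the threshold
clean = [25.0, 30.0, 22.0, 21.0, 5.0, 24.0]
noisy = [25.0, 10.0, 22.0, 21.0, 5.0, 24.0]
print(is_target_audio(clean, e_x))  # True:  p1 = 5, p2 = 5, r = 1.0
print(is_target_audio(noisy, e_x))  # False: p1 = 4, p2 = 5, r = 0.8
```

Note that the comparison is strict, so a ratio exactly equal to the threshold does not qualify, which matches the "greater than" wording of the step.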
Therefore, the target audio signal with low background noise can be screened out by utilizing the signal-to-noise ratio and the signal energy of each frame of audio signal. The screening method is simple and effective, has strong universality, can effectively reduce the complexity of audio signal screening, and improves the screening efficiency.
In the embodiment of the application, it is assumed that a sample voice library needs to be constructed. The sample voice library can contain historical voice data uttered by surrounding users at different distances and different orientations relative to a target user, together with the historical text data corresponding to that voice data. The historical voice data can comprise voice data of commonly used communication phrases, and the historical text data comprises text data of commonly used communication phrases. The commonly used communication phrases include names, appellations, common chat phrases between surrounding users and target users, common calling phrases between surrounding users and target users, and the like. The audio signals in the sample voice library have all been screened by the audio signal screening method in the embodiment of the application and therefore have low background noise, so that a better training effect can be obtained when the sample voice library is used for model training.
Corresponding to the foregoing method embodiments, the application also provides an audio signal screening apparatus, an electronic device, and corresponding embodiments.
Fig. 4 is a schematic structural diagram of an audio signal screening apparatus according to an embodiment of the present application.
Referring to fig. 4, the audio signal screening apparatus includes: a signal energy and signal-to-noise ratio module 401, a ratio module 402, and a screening module 403.
The signal energy and signal-to-noise ratio module 401 is configured to determine a signal energy of each frame of audio signal and determine a signal-to-noise ratio of each frame of audio signal in the audio signal.
The ratio module 402 is configured to determine two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and determine a ratio of the two frame numbers.
The ratio module 402 may count a first frame number, namely the number of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold and whose signal energy is greater than a signal energy threshold; count a second frame number, namely the number of frames whose signal energy is greater than the signal energy threshold; and determine the ratio of the first frame number to the second frame number.
The screening module 403 is configured to determine that the audio signal is the target audio signal when the ratio determined by the ratio module 402 is greater than a set ratio threshold.
When the ratio is greater than the set ratio threshold, the screening module 403 may determine that the audio signal is the target audio signal, that is, a clean audio signal with low background noise. For example, assuming that the ratio threshold is set to 0.8, a ratio greater than 0.8 indicates that the signal-to-noise ratio exceeds 20 dB for more than 80% of the effective duration of the audio signal x; that is, the noise content of the audio signal x is low and the audio signal x is clean audio, so the audio signal x is screened out as a target audio signal.
The technical scheme of the application firstly determines the signal energy of each frame of audio signal in the audio signal (namely the audio signal to be screened) and determines the signal-to-noise ratio of each frame of audio signal in the audio signal; determining two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and determining the ratio of the two frame numbers; then, according to the fact that the ratio is larger than the set ratio threshold, the audio signal is determined to be the target audio signal, and therefore the target audio signal with low background noise is screened out. The screening method is simple and effective, has strong universality, can effectively reduce the complexity of audio signal screening, and improves the screening efficiency.
Fig. 5 is another schematic structural diagram of an audio signal screening apparatus according to an embodiment of the present application.
Referring to fig. 5, the audio signal screening apparatus includes: a signal energy and signal-to-noise ratio module 401, a ratio module 402, and a screening module 403.
The functions of the signal energy and signal-to-noise ratio module 401, the ratio module 402, and the screening module 403 may refer to the description of fig. 4, and are not described herein again.
The ratio module 402 may further include: a first frame number counting sub-module 4021, a second frame number counting sub-module 4022, and a ratio determination sub-module 4023.
The first frame number counting sub-module 4021 is configured to count a first frame number, namely the number of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold and whose signal energy is greater than a signal energy threshold.
The second frame number counting sub-module 4022 is configured to count a second frame number, namely the number of frames whose signal energy is greater than the signal energy threshold.
The ratio determining sub-module 4023 is configured to determine the ratio of the first frame number counted by the first frame number counting sub-module 4021 to the second frame number counted by the second frame number counting sub-module 4022.
The signal energy and signal-to-noise ratio module 401 may further include: a framing sub-module 4011, a noise reduction sub-module 4012, and a determination sub-module 4013.
The framing sub-module 4011 is configured to frame the audio signal.
The framing sub-module 4011 frames the audio signal according to a preset time length. If the length of the audio signal is not an integral multiple of the preset time length, the tail of the audio signal is zero-padded so that its length becomes an integral multiple of the preset time length, and framing is then carried out.
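The framing rule with tail zero-padding can be sketched in Python as follows. The function name `frame_signal` is an assumption, and the frame length is expressed here in samples (the preset time length multiplied by a sample rate), which is also an assumption of this illustration.

```python
def frame_signal(samples, frame_len):
    """Split samples into frames of frame_len, zero-padding the tail.

    samples: sequence of audio samples.
    frame_len: hypothetical frame length in samples.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    padded = list(samples)
    remainder = len(padded) % frame_len
    if remainder:
        # Pad the tail so the length is an integral multiple of frame_len.
        padded.extend([0.0] * (frame_len - remainder))
    return [padded[i:i + frame_len] for i in range(0, len(padded), frame_len)]


print(frame_signal([1.0, 2.0, 3.0, 4.0, 5.0], 2))
# [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
```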
The noise reduction sub-module 4012 is configured to perform noise reduction processing on each frame of audio signal obtained by the framing sub-module 4011 to obtain each frame of audio signal after noise reduction.
The algorithm selected by the noise reduction sub-module 4012 for performing noise reduction processing on the audio signal is not limited, that is, the noise reduction algorithm is not limited, as long as the background noise in the audio signal can be eliminated.
The determining submodule 4013 is configured to determine, according to the signal energy of each frame of audio signal after noise reduction obtained by the noise reducing submodule 4012 and the noise energy of each frame of audio signal before noise reduction, a signal-to-noise ratio of each frame of audio signal before noise reduction.
The determining sub-module 4013 may obtain the noise energy of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal before noise reduction and the signal energy of each frame of audio signal after noise reduction, and may then perform a logarithmic operation on the ratio of the signal energy of each frame of audio signal after noise reduction to the noise energy to determine the signal-to-noise ratio of each frame of audio signal before noise reduction.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a schematic structural diagram of an electronic device shown in an embodiment of the present application. The electronic device may be a mobile terminal device or a server device, etc.
Referring to fig. 6, an electronic device 600 includes a memory 610 and a processor 620.
Processor 620 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 610 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 620 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the permanent storage. In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at runtime. In addition, the memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic and/or optical disks may also be employed. In some embodiments, the memory 610 may include a removable storage device that is readable and/or writable, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, micro-SD card, etc.), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 610 has stored thereon executable code that, when processed by the processor 620, causes the processor 620 to perform some or all of the methods described above.
The aspects of the present application have been described in detail hereinabove with reference to the accompanying drawings. In the above embodiments, the description of each embodiment has its own emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments. Those skilled in the art should also appreciate that the acts and modules referred to in the specification are not necessarily required by the present application. In addition, it can be understood that the steps in the method of the embodiment of the present application may be reordered, combined, or deleted according to actual needs, and the modules in the device of the embodiment of the present application may be combined, divided, or deleted according to actual needs.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing some or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or a server, etc.), causes the processor to perform part or all of the steps of the above-described method according to the present application.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the applications disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (8)

1. A method for audio signal screening, comprising:
determining the signal energy of each frame of audio signal and determining the signal-to-noise ratio of each frame of audio signal in the audio signal;
determining two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and determining the ratio of the two frame numbers;
determining the audio signal as a target audio signal according to the fact that the ratio is larger than a set ratio threshold;
the determining two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal and determining the ratio of the two frame numbers comprises:
counting a first frame number, being the number of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold and whose signal energy is greater than a signal energy threshold; wherein the signal energy of each frame of audio signal is respectively subjected to a difference operation with a set value to obtain respective values, and the maximum of the respective values is obtained through a max function as the signal energy threshold;
counting a second frame number, being the number of frames whose signal energy is greater than the signal energy threshold;
determining a ratio of the first frame number to the second frame number.
2. The method of claim 1, wherein:
the counting of the first frame number comprises:
traversing the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and counting, as the first frame number, the number of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold and whose signal energy is greater than the signal energy threshold;
the counting of the second frame number comprises:
traversing the signal energy of each frame of audio signal, and counting, as the second frame number, the number of frames whose signal energy is greater than the signal energy threshold.
3. The method of claim 1, wherein determining the signal-to-noise ratio of each frame of the audio signal comprises:
framing the audio signal;
carrying out noise reduction processing on each frame of audio signal to obtain each frame of audio signal subjected to noise reduction;
and determining the signal-to-noise ratio of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal after noise reduction and the noise energy of each frame of audio signal before noise reduction.
4. The method of claim 3, wherein the framing the audio signal comprises:
framing the audio signal according to a preset time length;
if the audio length of the audio signal is not an integral multiple of the preset time length, zero-padding the tail of the audio signal so that its length is an integral multiple of the preset time length, and then framing.
5. The method as claimed in claim 3, wherein the determining the signal-to-noise ratio of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal after noise reduction and the noise energy of each frame of audio signal before noise reduction comprises:
obtaining the noise energy of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal before noise reduction and the signal energy of each frame of audio signal after noise reduction;
and carrying out logarithmic operation according to the ratio of the signal energy of each frame of audio signal after noise reduction to the noise energy, and determining the signal-to-noise ratio of each frame of audio signal before noise reduction.
6. An audio signal screening apparatus, comprising:
the signal energy and signal-to-noise ratio module is used for determining the signal energy of each frame of audio signal and determining the signal-to-noise ratio of each frame of audio signal in the audio signal;
the ratio module is used for determining two frame numbers under different statistical conditions according to the signal-to-noise ratio of each frame of audio signal and the signal energy of each frame of audio signal, and determining the ratio of the two frame numbers;
the screening module is used for determining the audio signal as a target audio signal according to the fact that the ratio determined by the ratio module is larger than a set ratio threshold;
the ratio module comprises:
the first frame number counting submodule is used for counting a first frame number, being the number of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold and whose signal energy is greater than a signal energy threshold; wherein the signal energy of each frame of audio signal is respectively subjected to a difference operation with a set value to obtain respective values, and the maximum of the respective values is obtained through a max function as the signal energy threshold;
the second frame number counting submodule is used for counting a second frame number, being the number of frames whose signal energy is greater than the signal energy threshold;
and the ratio determining submodule is used for determining the ratio of the first frame number of the first frame number counting submodule to the second frame number of the second frame number counting submodule.
7. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-5.
8. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-5.
CN202011549999.7A 2020-12-24 2020-12-24 Audio signal screening method, device, equipment and storage medium Active CN112750453B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011549999.7A CN112750453B (en) 2020-12-24 2020-12-24 Audio signal screening method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011549999.7A CN112750453B (en) 2020-12-24 2020-12-24 Audio signal screening method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112750453A CN112750453A (en) 2021-05-04
CN112750453B true CN112750453B (en) 2023-03-14

Family

ID=75647447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011549999.7A Active CN112750453B (en) 2020-12-24 2020-12-24 Audio signal screening method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112750453B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597498A (en) * 2018-04-10 2018-09-28 广州势必可赢网络科技有限公司 A kind of multi-microphone voice acquisition method and device
CN108847217A (en) * 2018-05-31 2018-11-20 平安科技(深圳)有限公司 A kind of phonetic segmentation method, apparatus, computer equipment and storage medium
CN110265052A (en) * 2019-06-24 2019-09-20 秒针信息技术有限公司 The signal-to-noise ratio of radio equipment determines method, apparatus, storage medium and electronic device
CN111477243A (en) * 2020-04-16 2020-07-31 维沃移动通信有限公司 Audio signal processing method and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10433076B2 (en) * 2016-05-30 2019-10-01 Oticon A/S Audio processing device and a method for estimating a signal-to-noise-ratio of a sound signal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant