CN115762569A - Signal processing method, device, equipment and storage medium - Google Patents


Info

Publication number: CN115762569A
Application number: CN202211333219.4A
Authority: CN (China)
Legal status: Pending (the listed status is an assumption, not a legal conclusion)
Prior art keywords: sub-frame, signal, audio signal, determining, voice
Other languages: Chinese (zh)
Inventor: 李向阳 (Li Xiangyang)
Current and original assignee: Huizhou Desay SV Automotive Co Ltd
Application filed by: Huizhou Desay SV Automotive Co Ltd
Priority claimed from: CN202211333219.4A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the invention provides a signal processing method, apparatus, device and storage medium. The method includes: framing an original audio signal to obtain sub-frame audio signals; for each sub-frame audio signal, performing feature extraction to determine its energy feature vector; and determining the audio type of the sub-frame audio signal from the energy feature vector in combination with a set comprehensive decision rule. With this method, the energy feature vector is obtained by processing the original audio signal, and each frame of the original audio signal can be classified as a speech signal or a non-speech signal according to the energy feature vector and the set comprehensive decision rule. Compared with the prior-art hard decision against a fixed threshold, this scheme classifies the audio type of each frame more accurately and generalizes better.

Description

Signal processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech technology, and in particular, to a signal processing method, apparatus, device, and storage medium.
Background
In the field of speech technology, an audio signal must be processed before any speech task is performed; the core idea is to extract the speech information from the original audio as accurately and cleanly as possible while filtering out noise such as non-speech environmental sounds. Among signal processing methods, voice endpoint detection is an important one: its goal is to extract the portions of an audio signal that contain speech, and because speech is continuous in the time domain, this involves detecting both the starting endpoint and the trailing endpoint. An effective voice endpoint detection method can extract the speech information accurately and completely while separating out the speech-free background noise, and is therefore widely used in fields such as noise suppression, speech enhancement and speech recognition.
Existing endpoint detection methods make a hard decision against a specific threshold based on energy and similar measures. Such a decision is too rigid and direct: it ignores the frequency-domain characteristics of the signal as well as the differing characteristics and proportions of signal and noise across environments, and therefore suffers from low accuracy and poor adaptability.
Disclosure of Invention
The embodiments of the invention provide a signal processing method, apparatus, device and storage medium for accurately determining the audio type of an original audio signal.
In a first aspect, the present embodiment provides a signal processing method, including:
performing framing processing on an original audio signal to obtain each sub-frame audio signal;
for each subframe audio signal, performing feature extraction on the subframe audio signal, and determining an energy feature vector of the subframe audio signal;
and determining the audio type of the sub-frame audio signal according to the energy characteristic vector and by combining a set comprehensive judgment rule.
In a second aspect, the present embodiment provides a signal processing apparatus, including:
the framing processing module is used for framing the original audio signal to obtain each sub-frame audio signal;
the characteristic determining module is used for extracting the characteristics of the sub-frame audio signals aiming at each sub-frame audio signal and determining the energy characteristic vector of the sub-frame audio signals;
and the judging module is used for determining the audio type of the sub-frame audio signal according to the energy characteristic vector and by combining a set comprehensive judging rule.
In a third aspect, the present embodiment provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the signal processing method according to any of the embodiments of the present invention.
In a fourth aspect, this embodiment provides a computer-readable storage medium storing a computer program which, when executed by at least one processor, causes the at least one processor to execute the signal processing method according to any embodiment of the present invention.
The embodiment of the invention provides a signal processing method, a device, equipment and a storage medium, wherein the method comprises the following steps: performing framing processing on an original audio signal to obtain each sub-frame audio signal; for each subframe audio signal, performing feature extraction on the subframe audio signal, and determining an energy feature vector of the subframe audio signal; and determining the audio type of the sub-frame audio signal according to the energy characteristic vector and by combining a set comprehensive judgment rule. By using the method, the energy characteristic vector of the audio signal is obtained by processing the original audio signal, and whether each frame of signal in the original audio signal is a voice signal or a non-voice signal can be determined according to the energy characteristic vector and based on the set comprehensive judgment rule. Compared with the hard decision method based on the threshold value in the prior art, the technical scheme can more accurately decide the audio type of each frame of signal in the original audio signal, and has better generalization and effect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a signal processing method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a signal processing method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a signal processing apparatus according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "original", "target", and the like in the description and claims of the present invention and the drawings described above are used for distinguishing similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a signal processing method according to an embodiment of the present invention. The method is applicable to determining the audio type of an original audio signal, and may be executed by a signal processing apparatus that can be implemented in hardware and/or software and is typically integrated in an electronic device.
As shown in fig. 1, a signal processing method provided in this embodiment may specifically include the following steps:
s101, framing processing is carried out on the original audio signal to obtain each sub-frame audio signal.
It can be understood that, in the field of speech technology, front-end signal processing has a crucial influence on the downstream task; detecting whether each part of the audio is a speech or non-speech signal is particularly important to the processing performance, latency and accuracy of the whole interactive system. Before a speech task is performed, the audio signal must be processed; the core idea is to extract speech information from the original audio as accurately and cleanly as possible while filtering out noise such as non-speech environmental sounds.
On the one hand, extracting the speech information accurately and completely while separating out the speech-free background noise makes the method widely applicable to fields such as noise suppression, speech enhancement and speech recognition. On the other hand, in a system such as a speech recognition service, signal processing allows only useful information to be transmitted to the server, which reduces the amount of invalid information transmitted, improves the efficiency and performance of speech recognition, relieves transmission pressure and shortens processing time.
The original audio signal is the raw captured audio, which may contain both speech and non-speech signals; it must be acquired before signal processing. This embodiment places no particular limit on its source: it may, for example, be raw audio data collected by a recording device such as any type of microphone, and the data may include a speech signal and a non-speech noise signal.
A speech signal is short-time stationary: it can be regarded as stationary over windows of 10 ms to 30 ms. Many signal processing methods, such as those based on Gaussian or Markov models, presuppose a stationary signal, so the original audio signal must be processed with a suitable frame length. The analog signal is therefore framed first; common frame lengths are 10 ms, 20 ms and 30 ms, and this embodiment preferably frames with a 10 ms frame length.
Specifically, the original audio signal is framed according to the set frame length to obtain the sub-frame audio signals, and subsequent processing of the original audio signal is applied to each sub-frame audio signal separately.
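The framing step above can be sketched as follows. This is a minimal illustration (the function name and the non-overlapping frame layout are assumptions for illustration; the embodiment only fixes the 10 ms frame length):

```python
import numpy as np

def frame_signal(x: np.ndarray, sample_rate: int, frame_ms: int = 10) -> np.ndarray:
    """Split a 1-D audio signal into non-overlapping frames of frame_ms milliseconds.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

# 1 second of audio at 8 kHz -> 100 frames of 80 samples each
frames = frame_signal(np.zeros(8000), sample_rate=8000, frame_ms=10)
```

Each row of the returned array is then processed independently as one sub-frame audio signal.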
S102, aiming at each subframe audio signal, performing feature extraction on the subframe audio signal, and determining an energy feature vector of the subframe audio signal.
In this embodiment, after the original audio signal is framed into sub-frame audio signals, each sub-frame must be sampled so that it is converted from an analog quantity into a digital signal. Sampling is performed at a given sampling rate, commonly 8 kHz, 16 kHz or 48 kHz. When audio signals with different sampling rates are processed, their rates must first be made consistent, which involves signal reconstruction such as down-sampling and up-sampling. Sampling each sub-frame audio signal yields a digital audio signal.
Common features of speech audio signals include time-domain features and frequency-domain features, and the frequency-domain features often play the larger role in signal analysis and processing. After the band-limited signal is obtained, its total energy can be calculated; for example, a total energy feature value is obtained with a logarithmic energy calculation.
Further, to obtain finer signal features, the band-limited speech signal is divided into sub-bands, illustratively the 6 sub-bands 80 Hz–250 Hz, 250 Hz–500 Hz, 500 Hz–1 kHz, 1 kHz–2 kHz, 2 kHz–3 kHz and 3 kHz–4 kHz. In the specific division, the sub-bands are separated by filtering through a series of all-pass, low-pass and high-pass filters.
After the digital audio signal has been divided into sub-bands, the signal energy of each sub-band is calculated. The energy calculation is not specifically limited here; various processing and calculation methods exist, for example logarithmic energy and Mel-spectrum energy. Illustratively, the sub-band energy is computed as a base-10 logarithm, also called logarithmic energy, and the logarithmic energies are assembled into a vector, denoted the energy feature vector in this embodiment. Note that the total energy feature value and the sub-band energy feature values should be calculated in the same way.
Specifically, for each subframe audio signal, data processing and feature extraction are performed on the subframe audio signal, and an energy feature vector of the subframe audio signal is determined. The above can be understood as a process of preprocessing an original audio signal to obtain an energy feature vector. The energy feature vector serves as the basis data for subsequent determination of the signal type.
S103, determining the audio type of the sub-frame audio signal according to the energy characteristic vector and by combining a set comprehensive judgment rule.
Here, the set comprehensive decision rule means that the audio type is decided twice, in a set order of priority. The first decision is based on the total energy feature value in the energy feature vector: the value is compared with a set threshold, and if it is below the threshold, the energy of the input signal is too low, the frame is judged to be a non-speech signal, and no subsequent processing is performed.
In this embodiment, when the first decision finds the total energy feature value to be greater than or equal to the set threshold, a second decision on the speech type is made. Before the second decision, the likelihood of each sub-band can be calculated from its sub-energy feature value using a pre-constructed speech statistical model, and the ratio of the speech-signal likelihood to the non-speech-signal likelihood of each sub-band then gives a likelihood ratio.
Specifically, the likelihood ratio of speech to non-speech in each sub-band serves as that sub-band's local likelihood ratio. In parallel, the sub-band likelihood ratios are weighted and combined to obtain the global likelihood ratio of the signal. With the local and global likelihood ratios in hand, the decision is made: each is compared with the set speech-signal threshold, and if any one of them exceeds the threshold for judging a speech signal, the input data is judged to be a speech signal; otherwise it is a non-speech signal.
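The two-stage rule described above can be sketched as a small skeleton. The function name, the energy floor, the weights and the likelihood-ratio threshold below are illustrative placeholders, not values from the patent:

```python
import numpy as np

def classify_frame(total_log_energy, subband_llrs, energy_floor=-60.0,
                   weights=None, llr_threshold=0.5):
    """Two-stage decision: an energy gate, then local/global likelihood-ratio tests.

    subband_llrs: log-likelihood ratios log p(e|speech) - log p(e|noise), one per sub-band.
    Returns True for speech, False for non-speech.
    """
    # First decision: frames below the energy floor are non-speech outright.
    if total_log_energy < energy_floor:
        return False
    subband_llrs = np.asarray(subband_llrs, dtype=float)
    if weights is None:
        weights = np.full(len(subband_llrs), 1.0 / len(subband_llrs))
    # Second decision: speech if ANY local LLR, or the weighted global LLR, passes.
    global_llr = float(np.dot(weights, subband_llrs))
    return bool(np.any(subband_llrs > llr_threshold) or global_llr > llr_threshold)
```

A single strong sub-band is enough to trigger the local decision, while consistent moderate evidence across all sub-bands triggers the global one.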
The embodiment of the invention provides a signal processing method including: first, framing an original audio signal to obtain sub-frame audio signals; then, for each sub-frame audio signal, performing feature extraction to determine its energy feature vector; and finally, determining the audio type of the sub-frame audio signal from the energy feature vector in combination with a set comprehensive decision rule. With this method, the energy feature vector is obtained by processing the original audio signal, and each frame of the original audio signal can be classified as a speech signal or a non-speech signal according to the energy feature vector and the set comprehensive decision rule. Compared with the prior-art hard decision against a fixed threshold, this scheme classifies the audio type of each frame more accurately and generalizes better.
Example two
Fig. 2 is a schematic flow diagram of a signal processing method according to a second embodiment of the present invention, which further refines the foregoing embodiment. In this embodiment, "performing feature extraction on the sub-frame audio signal and determining an energy feature vector of the sub-frame audio signal" is further specified as: performing sampling-rate conversion on the sub-frame audio signal to obtain a digital audio signal; dividing each digital audio signal into frequency bands to obtain sub-band audio signals; and performing energy processing on the sub-frame audio signal and each sub-band audio signal to obtain the energy feature vector of the sub-frame audio signal.
Likewise, "determining the audio type of the sub-frame audio signal from the energy feature vector in combination with a set comprehensive decision rule" is further specified as: extracting the total energy feature value from the energy feature vector; judging whether the total energy feature value is greater than a first set threshold; if so, determining the audio type of the sub-frame audio signal based on the sub-energy feature values in the energy feature vector; otherwise, determining the sub-frame audio signal to be a non-speech signal.
As shown in fig. 2, the second embodiment provides a signal processing method, which specifically includes the following steps:
s201, framing processing is carried out on the original audio signal, and audio signals of all sub-frames are obtained.
S202, performing sampling rate conversion on the subframe audio signals to obtain each digital audio signal.
In this step, sampling is performed at a given sampling rate, commonly 8 kHz, 16 kHz or 48 kHz. To process audio signals captured at different sampling rates, the rates must be made consistent, which involves signal reconstruction such as down-sampling and up-sampling. In this embodiment, given the frequency range of speech audio signals, down-sampling is usually adopted; for example, audio at any sampling frequency is uniformly converted by down-sampling into audio data at an 8 kHz sampling rate. Many conversion algorithms exist; for instance, down-sampling from a high frequency to the target frequency may proceed in several steps, with the intermediate signals processed by the associated filters.
Since common sampling frequencies are 8 kHz, 16 kHz, 32 kHz, 48 kHz and so on, a uniform rate of 8 kHz is adopted so that the signal information can be recovered without distortion. Audio data captured at other sampling frequencies is unified to 8 kHz by up- or down-sampling; in practice this may be done in a single step, or the target frequency may be reached through several conversions with the help of filters and the like.
Specifically, the sub-frame audio signals are subjected to sampling-rate conversion, converting them from analog signals into digital form and yielding the digital audio signals.
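A 2:1 down-sampling step (e.g. 16 kHz to 8 kHz) can be sketched with a windowed-sinc low-pass filter followed by decimation. This is a NumPy-only illustration; in practice one would typically reach for a library polyphase resampler such as `scipy.signal.resample_poly`, and the tap count here is an arbitrary choice:

```python
import numpy as np

def downsample_2x(x: np.ndarray, num_taps: int = 63) -> np.ndarray:
    """Halve the sampling rate (e.g. 16 kHz -> 8 kHz): low-pass at the new
    Nyquist frequency with a windowed-sinc FIR filter, then keep every 2nd sample."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(n * 0.5) * np.hamming(num_taps)  # cutoff at 1/4 of the old rate
    h /= h.sum()                                 # unity gain at DC
    filtered = np.convolve(x, h, mode="same")
    return filtered[::2]

x = np.ones(1600)       # 100 ms of a DC signal at 16 kHz
y = downsample_2x(x)    # 800 samples at 8 kHz
```

The low-pass stage removes content above the new Nyquist frequency so that decimation does not alias it back into the band of interest.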
And S203, performing frequency band division on each digital audio signal to obtain each sub-band audio signal.
In this embodiment, since the frequency range of a speech signal generally lies between 0 and 4 kHz, where the speech energy is mainly concentrated, and taking the frequency of electrical noise and the like into account, the audio signal is limited to the range 80 Hz–4 kHz, the signal in this range being obtained with a band-pass filter. To obtain finer signal features, in this step the 80 Hz–4 kHz signal is divided into 6 sub-bands: 80 Hz–250 Hz, 250 Hz–500 Hz, 500 Hz–1 kHz, 1 kHz–2 kHz, 2 kHz–3 kHz and 3 kHz–4 kHz. Specifically, the sub-bands may be separated with corresponding filters to obtain the sub-band audio signals.
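The split into these 6 sub-bands can be illustrated with a simple FFT-bin masking scheme. Note this is an approximation for illustration only: the embodiment describes a filter-bank decomposition (all-pass, low-pass and high-pass filters), whereas masking FFT bins is merely the easiest way to show the band edges at work:

```python
import numpy as np

# Sub-band edges in Hz, as listed above.
SUBBANDS = [(80, 250), (250, 500), (500, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]

def subband_signals(frame: np.ndarray, sample_rate: int = 8000):
    """Split one frame into the 6 sub-bands by zeroing FFT bins outside each band."""
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    out = []
    for lo, hi in SUBBANDS:
        masked = np.where((freqs >= lo) & (freqs < hi), spectrum, 0.0)
        out.append(np.fft.irfft(masked, n=len(frame)))
    return out
```

For example, a pure 1.5 kHz tone lands almost entirely in the fourth sub-band (1 kHz–2 kHz).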
S204, performing energy processing on the sub-frame audio signals and the sub-band audio signals to obtain energy characteristic vectors of the sub-frame audio signals.
In this step, the signal energy of each of the divided sub-bands is calculated. Various processing and calculation methods exist for signal energy, such as logarithmic energy and Mel-spectrum energy.
Specifically, after the 80 Hz–4 kHz band signal is obtained, its total energy is calculated first; in this embodiment a logarithmic energy calculation is preferably used to obtain the total energy feature value. After the sub-bands are divided, the same logarithmic energy calculation is applied to each sub-band audio signal, and the logarithmic energy of each sub-band is recorded as its sub-band energy feature value. The total energy feature value and the sub-band energy feature values may be stored together in one multi-dimensional feature vector, or the total energy feature value may be stored separately with the sub-band energy feature values in a 6-dimensional feature vector; this embodiment preferably stores the total energy feature value and the sub-band energy feature values together in one multi-dimensional feature vector.
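Assembling the feature vector can then be sketched as below. The 7-dimensional layout (total log-energy first, then the 6 sub-band log-energies) is one plausible packing of the multi-dimensional feature vector described above, and the small epsilon guarding the logarithm is an implementation detail of this sketch:

```python
import numpy as np

def log_energy(x: np.ndarray, eps: float = 1e-12) -> float:
    """Base-10 logarithmic energy of a signal segment."""
    return float(np.log10(np.sum(x ** 2) + eps))

def energy_feature_vector(frame: np.ndarray, subbands) -> np.ndarray:
    """7-dim vector: total log-energy of the frame, then one log-energy per sub-band."""
    return np.array([log_energy(frame)] + [log_energy(b) for b in subbands])
```

The same `log_energy` function is used for both the total and the sub-band values, matching the requirement that both feature values be computed the same way.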
And S205, extracting total energy characteristic values from the energy characteristic vectors.
Specifically, the total energy feature value corresponding to the 80 Hz–4 kHz band signal is extracted from the energy feature vector.
And S206, judging whether the total energy characteristic value is larger than a first set threshold value, if so, executing a step S207, and if not, executing a step S208.
The first set threshold is the minimum energy that a speech signal should exhibit, and it may be obtained from historical empirical data. In this step, checking whether the total energy feature value exceeds the first set threshold constitutes the first decision. The total energy feature value is examined first: if it is greater than the first set threshold, step S207 is executed to continue with the subsequent decision; that is, when the total energy exceeds the minimum threshold, global and local decisions are made next. If the total energy is at or below the threshold, step S208 is executed and the frame is confirmed to be a non-speech signal.
And S207, if so, determining the audio type of the sub-frame audio signal based on each sub-energy characteristic value in the energy characteristic vector.
Specifically, if the first decision finds the total energy feature value to be greater than the first set threshold, global and local decisions are made next: the likelihood ratio of speech to non-speech in each sub-band is determined from the sub-energy feature values in the energy feature vector and compared with a set threshold to determine the audio type of the sub-frame audio signal.
And S208, if not, determining the subframe audio signal as a non-voice signal.
Specifically, if the total energy characteristic value is less than or equal to the first set threshold, the sub-frame audio signal is considered to be a non-speech signal.
In this embodiment, the original audio signal is pre-processed, the sampling frequency is unified to the required rate, and the audio signal is then divided into several sub-bands in the frequency domain. Features are then extracted from the signal globally and per sub-band, and a decision is made from the sub-band and global likelihood ratios based on a pre-constructed speech detection model, thereby inferring whether the current signal is a speech signal. With this method, a good result is obtained through the series of processing algorithms; the likelihood ratio of a statistical probability model generalizes better than the common threshold-based hard decision, and unlike neural-network-based methods the model parameters can be updated automatically and iteratively in the real operating environment, giving better adaptation to that environment.
As an optional embodiment of the present invention, on the basis of the above embodiment, the step of determining the audio type of the subframe audio signal based on each sub-energy feature value in the energy feature vector may be specifically expressed as:
a1) For each sub-energy feature value, determine the speech-signal probability and the non-speech-signal probability by combining it with a pre-constructed speech statistical model.
Specifically, each sub-energy feature value is used as input to the speech statistical model, from which the likelihoods of the signal being a speech signal and a non-speech signal are calculated respectively.
b1) Determine the sub-band likelihood ratio of speech to non-speech for each sub-energy feature value based on the speech-signal probability and the non-speech-signal probability.
Specifically, the speech-signal probability is divided by the non-speech-signal probability to obtain the likelihood ratio (LLR) of speech to non-speech for each sub-energy feature value, referred to in this embodiment as the sub-band likelihood ratio.
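The division of the two probabilities is usually carried out in the log domain, where it becomes a subtraction. The sketch below uses a single Gaussian per class for brevity (the embodiment models each class with a Gaussian mixture); function names and parameters are illustrative:

```python
import math

def gaussian_log_pdf(x: float, mean: float, var: float) -> float:
    """Log of the univariate normal density."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def subband_llr(energy: float, speech_mean: float, speech_var: float,
                noise_mean: float, noise_var: float) -> float:
    """Log-likelihood ratio log p(e|speech) - log p(e|noise) for one sub-band energy."""
    return (gaussian_log_pdf(energy, speech_mean, speech_var)
            - gaussian_log_pdf(energy, noise_mean, noise_var))
```

An energy close to the speech model's mean yields a positive LLR, and one close to the noise model's mean yields a negative LLR.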
c1) Determine the audio type of the sub-frame audio signal based on the sub-band likelihood ratios.
Specifically, each sub-band likelihood ratio is compared with a set threshold to determine whether the sub-frame audio signal is a speech signal or a non-speech signal.
Further, the step of determining the audio type of the sub-frame audio signal from the likelihood ratios may include: weighting the sub-band likelihood ratios to obtain a global likelihood ratio; judging whether at least one of the sub-band likelihood ratios and the global likelihood ratio is greater than a second set threshold; if so, determining the sub-frame audio signal to be a speech signal; otherwise, determining it to be a non-speech signal.
The second set threshold may be understood as the minimum threshold for deciding that a sub-band contains speech information, and it may be obtained from historical data. Specifically, the likelihood ratio of each sub-band is compared with the second set threshold as a local decision. The logarithmically weighted sum of the likelihood ratios of all sub-bands is taken as the global likelihood ratio, and the global likelihood ratio is compared with the second set threshold as a global decision. When either the local decision result or the global decision result is greater than the second set threshold, the current frame is determined to be a speech signal; if both are less than or equal to the second set threshold, the current frame is determined to be a non-speech signal.
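The local/global decision described above can be sketched as follows. This is a minimal illustration assuming single-Gaussian sub-band models; the per-sub-band parameters, the use of one threshold for both decisions, and the averaging inside the global log-likelihood are assumptions for illustration, not details fixed by the scheme.

```python
import math

def gaussian_pdf(x, mean, var):
    """Evaluate a one-dimensional Gaussian density at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def decide_frame(sub_energies, speech_params, noise_params, threshold):
    """Local/global likelihood-ratio decision for one sub-frame.

    sub_energies: per-sub-band energy feature values.
    speech_params / noise_params: assumed (mean, variance) per sub-band.
    Returns True if the frame is decided to be a speech signal.
    """
    ratios = []
    for x, (ms, vs), (mn, vn) in zip(sub_energies, speech_params, noise_params):
        p_speech = gaussian_pdf(x, ms, vs)
        p_noise = gaussian_pdf(x, mn, vn)
        ratios.append(p_speech / max(p_noise, 1e-12))  # sub-band likelihood ratio

    # Local decision: any single sub-band ratio exceeds the threshold.
    local = any(r > threshold for r in ratios)
    # Global decision: log-weighted combination of all sub-band ratios.
    global_llr = sum(math.log(max(r, 1e-12)) for r in ratios) / len(ratios)
    return local or global_llr > threshold
```

A frame whose sub-band energies match the speech model is accepted by the local decision alone; a frame matching the noise model fails both decisions.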
As an alternative embodiment of the present invention, on the basis of the above embodiment, the step of constructing the speech statistical model may be expressed as: determining target parameters of the speech statistical model from labeled speech training samples in combination with the expectation-maximization algorithm; and constructing the speech statistical model according to the target parameters.
In this embodiment, a statistics-based model is adopted. There are various mathematical-statistical models that can be used to model the distribution of noise signals and speech signals in an audio signal; for example, the distribution may be characterized by a Laplacian distribution with specific characteristics, a Gamma distribution, a bilateral Gamma distribution, and the like. In this embodiment, the noise and the speech signal are modeled with the widely used Gaussian Mixture Model (GMM). A Gaussian mixture model is a mixture obtained by the weighted summation of several Gaussian components and belongs to the family of probabilistic graphical models.
The theoretical idea behind the GMM is to assume that both the speech signal and the non-speech signal follow Gaussian distributions, and that the non-speech signal is more stationary than the speech signal and has lower energy, i.e., the mean and variance of the non-speech signal are smaller than those of the speech signal. Under these assumptions, two Gaussian components can be used to fit and separate the speech and non-speech parts of the signal. The GMM is modeled by the following formula:
p(x) = w₁·N(x; μ₁, σ₁²) + w₂·N(x; μ₂, σ₂²), where N(x; μ, σ²) = (1/√(2πσ²))·exp(−(x − μ)²/(2σ²))
wherein w₁ represents the weight of the speech signal component, μ₁ its mean, and σ₁ its variance; w₂ represents the weight of the non-speech signal component, μ₂ its mean, and σ₂ its variance. The formula is a GMM with two Gaussian components, modeling and describing the speech signal and the non-speech noise respectively.
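Evaluating the two-component mixture density above can be sketched directly from the formula; the parameter values here are illustrative placeholders, not values from the patent.

```python
import math

def normal_pdf(x, mean, var):
    """One-dimensional Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_pdf(x, w1, mu1, var1, w2, mu2, var2):
    """Two-component GMM density: speech component (w1, mu1, var1)
    plus non-speech component (w2, mu2, var2), as in the formula above."""
    return w1 * normal_pdf(x, mu1, var1) + w2 * normal_pdf(x, mu2, var2)
```

With w₂ = 0 the mixture reduces to a single Gaussian, which gives a quick sanity check of the implementation.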
In this embodiment, the GMM parameters used for inference are obtained through model training. Specifically, the model is trained on an a-priori known, label-aligned corpus using the Expectation-Maximization (EM) algorithm, finally yielding the six parameters above and thereby fixing the model; in the inference stage, the GMM-based statistical model is initialized with the trained parameters. It should be noted that, since the audio signal is divided into sub-bands in the above embodiment, the GMM statistical model may also be applied per sub-band, that is, the energy features of each sub-band are likewise assumed to obey a GMM probability distribution.
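The EM training step can be sketched for the one-dimensional, two-component case as follows. This is a minimal illustration under stated assumptions: the initialization by splitting the sorted data, the fixed iteration count, and the variance floor are choices made for the sketch, not details from the patent.

```python
import math

def em_gmm_1d(data, iters=50):
    """Fit a two-component 1-D GMM with the EM algorithm (minimal sketch).
    Returns (w1, mu1, var1, w2, mu2, var2); component 1 is initialized on
    the lower-energy half (non-speech), component 2 on the higher half."""
    data = sorted(data)
    half = len(data) // 2
    lo, hi = data[:half], data[half:]
    mu = [sum(lo) / len(lo), sum(hi) / len(hi)]
    var = [max(sum((x - mu[0]) ** 2 for x in lo) / len(lo), 1e-6),
           max(sum((x - mu[1]) ** 2 for x in hi) / len(hi), 1e-6)]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample.
        resp = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p) or 1e-12
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return w[0], mu[0], var[0], w[1], mu[1], var[1]
```

On well-separated synthetic data the fitted means converge to the two cluster centers within a few iterations.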
In this alternative embodiment, the statistical model models the signal from the perspective of containing speech and non-speech components, and each sub-band is modeled separately. Deep-neural-network-based methods in the prior art require a large amount of corpus for training and learning, and the parameters of the model are completely frozen after training and cannot be adjusted to the environment during use, which greatly limits their adaptability; in contrast, the modeling method of this technical scheme can continuously update the model parameters and better adapts to signal processing in various environments.
As an alternative embodiment of the present invention, on the basis of the above embodiment, after determining the audio type of the subframe audio signal, the method further includes: and updating the voice statistical model and a second set threshold according to the energy characteristic vector and the corresponding audio type.
In this embodiment, after the audio type of the sub-frame audio signal is determined, the decision result may be stored. It can be understood that, after the audio type of each sub-frame audio signal is determined, the decision result can serve as input to downstream tasks such as speech recognition and speech enhancement.
In addition, the decision result can be used as a sample to update the parameters of the speech detection model. The input speech signals and decision results are continuously recorded and stored; on this basis, once a certain number of audio signals and decision results have accumulated, statistical analysis is performed to obtain statistics of the speech and non-speech signals respectively, and the parameters of the speech detection model are updated according to these statistics. Real-time updating of the speech detection model ensures that the model can learn and adaptively adjust to different environments, increasing its adaptability and generalization capability.
In this embodiment, the corresponding data is stored after the decision for updating the model, and the specific updating manner is not specifically limited. Illustratively, the minimum of each sub-band feature is tracked: for each feature, 16 minima are kept over a window of 100 frames, and if the current feature value ranks among these 16 minima, the median of the five smallest values is calculated and returned. The minimum obtained here is later used to update the noise estimate; the speech signal is updated by a similar method, updating parameter values such as the mean and variance in the corresponding speech statistical model. In this way, the model can be continuously updated and iterated during actual use.
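The minimum-tracking update described above can be sketched as follows; the window length and minima count are the values mentioned in the text, while the class structure and return convention are assumptions for illustration.

```python
from collections import deque
import statistics

class MinimumTracker:
    """Track minima of one sub-band feature over a sliding window
    (a sketch of the update rule described above)."""

    def __init__(self, window=100, n_minima=16):
        self.history = deque(maxlen=window)  # last `window` feature values
        self.n_minima = n_minima

    def update(self, value):
        """Add a new feature value. If it ranks among the stored minima,
        return the median of the five smallest values, else None."""
        self.history.append(value)
        minima = sorted(self.history)[:self.n_minima]
        if value in minima:
            return statistics.median(minima[:5])
        return None
```

The returned median would then feed the noise-parameter update (mean, variance) in the speech statistical model.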
Regardless of whether the final decision for the sub-frame signal in the two decisions is speech or non-speech, the decision result and the corresponding energy feature value can be used as input to signal smoothing. The core function of signal smoothing is to smooth and update the threshold: using the statistical characteristics of the energy features, a smoothing function and a smoothing factor are introduced to update the threshold, thereby realizing threshold adaptivity. The second set threshold may specifically be used for the decision on the smoothed speech signal, which is not specifically limited here.
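One common way to realize such a smoothed, adaptive threshold is exponential smoothing. This is a sketch under stated assumptions: the patent does not specify the smoothing function, so the exponential form and the smoothing factor value here are illustrative.

```python
def smooth_threshold(old_threshold, feature_value, alpha=0.95):
    """Exponentially smoothed threshold update: blend the previous
    threshold with the latest energy feature value. `alpha` is the
    smoothing factor (an assumed value, close to 1 for slow adaptation)."""
    return alpha * old_threshold + (1.0 - alpha) * feature_value
```

Applied frame by frame, the threshold drifts toward the recent feature statistics, so the decision adapts to a changing acoustic environment.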
In the optional embodiment, iterative automatic model parameter updating can be performed in a real use environment, so that a better environment adaptation effect is achieved.
Example three
Fig. 3 is a schematic structural diagram of a signal processing apparatus according to a third embodiment of the present invention, which is applicable to the case of determining the audio type of an original audio signal; the signal processing apparatus may be implemented in the form of hardware and/or software and is generally integrated in an electronic device. As shown in fig. 3, the apparatus includes a framing processing module 31, a feature determination module 32 and a discrimination module 33, wherein:
a framing processing module 31, configured to perform framing processing on the original audio signal to obtain each sub-frame audio signal;
the feature determination module 32 is configured to perform feature extraction on the subframe audio signals for each subframe audio signal, and determine an energy feature vector of the subframe audio signal;
and the judging module 33 is configured to determine the audio type of the sub-frame audio signal according to the energy feature vector and by combining a set comprehensive judging rule.
An embodiment of the present invention provides a signal processing apparatus, including: a framing processing module, configured to perform framing processing on an original audio signal to obtain each sub-frame audio signal; a feature determination module, configured to perform feature extraction on each sub-frame audio signal and determine its energy feature vector; and a discrimination module, configured to determine the audio type of the sub-frame audio signal according to the energy feature vector in combination with a set comprehensive discrimination rule. With this apparatus, the energy feature vector of the audio signal is obtained by processing the original audio signal, and, according to the energy feature vector and based on the set comprehensive discrimination rule, each frame of the original audio signal can be determined to be a speech signal or a non-speech signal. Compared with the threshold-based hard-decision method in the prior art, this technical scheme can decide the audio type of each frame of the original audio signal more accurately and has better generalization and effect.
Optionally, the feature determining module 32 may be specifically configured to:
carrying out sampling rate conversion on the subframe audio signals to obtain digital audio signals;
dividing frequency bands of the digital audio signals to obtain sub-band audio signals;
and carrying out energy processing on the sub-frame audio signals and the sub-band audio signals to obtain energy characteristic vectors of the sub-frame audio signals.
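The three steps performed by the feature determination module can be sketched as follows. This is an illustration under stated assumptions: the framing parameters, the sub-band count, and the naive DFT are chosen for the sketch (the patent does not specify the transform, and the sampling-rate conversion step is omitted here).

```python
import cmath
import math

def frame_signal(samples, frame_len=64, hop=32):
    """Split a 1-D signal into overlapping sub-frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def subband_energies(frame, n_subbands=4):
    """Energy feature vector: total spectral energy followed by the energy
    of each sub-band. Uses a naive DFT for clarity; a real implementation
    would use an FFT."""
    n = len(frame)
    spec = [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2)]  # keep the non-negative frequencies
    band = len(spec) // n_subbands
    subs = [sum(spec[k * band:(k + 1) * band]) for k in range(n_subbands)]
    return [sum(spec)] + subs
```

For a pure low-frequency tone, nearly all of the energy lands in the lowest sub-band, matching the intuition behind the per-sub-band decision.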
Optionally, the determining module 33 may specifically include:
the first extraction unit is used for extracting a total energy characteristic value from the energy characteristic vector;
the first judging unit is used for judging whether the total energy characteristic value is larger than a first set threshold value or not;
a first speech determining unit, configured to, if so, determine the audio type of the sub-frame audio signal based on each sub-energy characteristic value in the energy characteristic vector;
a first non-speech determining unit, configured to determine that the subframe audio signal is a non-speech signal if not.
Optionally, the first speech determination unit may be specifically configured to:
determining the voice signal probability and the non-voice signal probability of each sub-energy characteristic value by combining each sub-energy characteristic value with a pre-constructed voice statistical model;
determining the sub-band likelihood ratio of the voice signal/the non-voice signal in each sub-energy characteristic value according to the voice signal probability and the non-voice signal probability;
and determining the audio type of the sub-frame audio signal according to the likelihood ratio of each sub-band.
Optionally, the first speech determining unit is configured to perform a step of determining an audio type of the subframe audio signal according to each likelihood ratio, and may specifically include:
weighting each sub-band likelihood ratio to obtain a global likelihood ratio;
judging whether at least one likelihood ratio in each likelihood ratio and the global likelihood ratio is larger than a second set threshold value or not;
if so, determining the subframe audio signal as a voice signal;
otherwise, determining the subframe audio signal as a non-voice signal.
Optionally, the apparatus further includes a model building module, and the model building module may specifically be configured to:
determining target parameters of a voice statistical model based on the marked voice training samples and combining a maximum expectation algorithm;
and constructing a voice statistical model according to the target parameters.
Optionally, the apparatus further includes an update module, where the update module is configured to:
and updating the voice statistical model and a second set threshold according to the energy feature vector and the audio type.
The signal processing device provided by the embodiment of the invention can execute the signal processing method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 40 includes at least one processor 41, and a memory communicatively connected to the at least one processor 41, such as a Read Only Memory (ROM) 42, a Random Access Memory (RAM) 43, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 41 may perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 42 or the computer program loaded from a storage unit 48 into the Random Access Memory (RAM) 43. In the RAM 43, various programs and data necessary for the operation of the electronic apparatus 40 can also be stored. The processor 41, the ROM 42, and the RAM 43 are connected to each other via a bus 44. An input/output (I/O) interface 45 is also connected to bus 44.
A number of components in the electronic device 40 are connected to the I/O interface 45, including: an input unit 46 such as a keyboard, a mouse, etc.; an output unit 47 such as various types of displays, speakers, and the like; a storage unit 48 such as a magnetic disk, an optical disk, or the like; and a communication unit 49 such as a network card, modem, wireless communication transceiver, etc. The communication unit 49 allows the electronic device 40 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processor 41 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 41 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 41 performs the various methods and processes described above, such as signal processing methods.
In some embodiments, the signal processing method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 48. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 40 via the ROM 42 and/or the communication unit 49. When the computer program is loaded into the RAM 43 and executed by the processor 41, one or more steps of the signal processing method described above may be performed. Alternatively, in other embodiments, the processor 41 may be configured to perform the signal processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system, which overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS (Virtual Private Server) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A signal processing method, comprising:
performing framing processing on an original audio signal to obtain each sub-frame audio signal;
for each subframe audio signal, performing feature extraction on the subframe audio signal, and determining an energy feature vector of the subframe audio signal;
and determining the audio type of the sub-frame audio signal according to the energy characteristic vector and by combining a set comprehensive judgment rule.
2. The method of claim 1, wherein the performing feature extraction on the subframe audio signal and determining an energy feature vector of the subframe audio signal comprises:
carrying out sampling rate conversion on the subframe audio signals to obtain digital audio signals;
performing frequency band division on each digital audio signal to obtain each sub-band audio signal;
and carrying out energy processing on the sub-frame audio signals and each sub-band audio signal to obtain energy characteristic vectors of the sub-frame audio signals.
3. The method according to claim 1, wherein the determining the audio type of the sub-frame audio signal according to the energy feature vector and in combination with a set comprehensive discriminant rule comprises:
extracting a total energy characteristic value from the energy characteristic vector;
judging whether the total energy characteristic value is larger than a first set threshold value or not;
if yes, determining the audio type of the sub-frame audio signal based on each sub-energy characteristic value in the energy characteristic vector;
otherwise, determining the subframe audio signal as a non-voice signal.
4. The method of claim 3, wherein the determining the audio type of the subframe audio signal based on each sub-energy characteristic value in the energy characteristic vector comprises:
determining the voice signal probability and the non-voice signal probability of each sub-energy characteristic value by combining each sub-energy characteristic value with a pre-constructed voice statistical model;
determining a sub-band likelihood ratio of a voice signal/a non-voice signal in each sub-energy characteristic value according to the voice signal probability and the non-voice signal probability;
and determining the audio type of the sub-frame audio signal according to the sub-band likelihood ratio.
5. The method of claim 4, wherein determining the audio type of the sub-frame audio signal based on each of the likelihood ratios comprises:
weighting each sub-band likelihood ratio to obtain a global likelihood ratio;
judging whether at least one likelihood ratio in each likelihood ratio and the global likelihood ratio is larger than a second set threshold value or not;
if so, determining the subframe audio signal as a voice signal;
otherwise, determining the subframe audio signal as a non-voice signal.
6. The method of claim 4, wherein the step of constructing the speech statistical model comprises:
determining target parameters of a voice statistical model based on the marked voice training samples and combining a maximum expectation algorithm;
and constructing the voice statistical model according to the target parameters.
7. The method of claim 5, further comprising, after the determining the audio type of the sub-frame audio signal:
and updating the voice statistical model and the second set threshold according to the energy feature vector and the corresponding audio type.
8. A signal processing apparatus, characterized by comprising:
the framing processing module is used for framing the original audio signal to obtain each subframe audio signal;
the characteristic determining module is used for extracting the characteristics of the sub-frame audio signals aiming at each sub-frame audio signal and determining the energy characteristic vector of the sub-frame audio signals;
and the judging module is used for determining the audio type of the sub-frame audio signal according to the energy characteristic vector and by combining a set comprehensive judging rule.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the signal processing method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a processor to implement the signal processing method of any one of claims 1 to 7 when executed.
Publications (1)

Publication Number Publication Date
CN115762569A true CN115762569A (en) 2023-03-07



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination