CN116682447A

CN116682447A - Speech processing method, device, storage medium and computer equipment

Info

Publication number: CN116682447A
Application number: CN202210164531.9A
Authority: CN
Inventors: 鄢聪
Original assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd; Guangzhou Shiyuan Artificial Intelligence Innovation Research Institute Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd; Guangzhou Shiyuan Artificial Intelligence Innovation Research Institute Co Ltd
Priority date: 2022-02-22
Filing date: 2022-02-22
Publication date: 2023-09-01

Abstract

The application discloses a voice processing method, a device, a storage medium and computer equipment, wherein the method comprises the following steps: acquiring an initial voice frame in a voice signal, acquiring an initial frequency spectrum corresponding to the initial voice frame and a global voice existence probability, and acquiring an initial power spectrum based on the initial frequency spectrum; determining each target frequency point in the noise fade-in stage based on the initial power spectrum, and acquiring the local voice existence probability corresponding to each frequency point; based on the global voice existence probability, carrying out probability correction on the local voice existence probability corresponding to each target frequency point to obtain the target voice existence probability corresponding to each target frequency point; acquiring gain factors corresponding to all frequency points based on the target voice existence probability corresponding to all the target frequency points and the local voice existence probabilities corresponding to other frequency points; and performing gain processing on the initial frequency spectrum based on the gain factors corresponding to the frequency points to obtain a target frequency spectrum, and generating a target voice frame corresponding to the initial voice frame based on the target frequency spectrum. By adopting the method and the device, the definition of the target voice frame is improved.

Description

Speech processing method, device, storage medium and computer equipment

Technical Field

The present application relates to the field of speech technology, and in particular, to a speech processing method, apparatus, storage medium, and computer device.

Background

With the vigorous development of voice technology, more and more electronic devices implement related functions, such as voice recognition, voice communication and voice control, based on voice technology. Since various sounds exist in daily life, the voice frames collected by an electronic device inevitably contain a certain amount of noise, and the noise affects the related functions of the electronic device to a certain extent. Therefore, in the prior art, after the electronic device collects a voice frame, noise reduction processing is performed on the voice frame.

Disclosure of Invention

The application provides a voice processing method, a voice processing device, a storage medium and computer equipment, which can solve the technical problem of how to improve the definition of a target voice frame.

In a first aspect, an embodiment of the present application provides a method for processing speech, including:

acquiring an initial voice frame in a voice signal, acquiring an initial frequency spectrum corresponding to the initial voice frame and a global voice existence probability, and acquiring an initial power spectrum based on the initial frequency spectrum, wherein the initial power spectrum comprises a plurality of frequency points and power values of all frequency points in the plurality of frequency points;

based on the initial power spectrum, determining each target frequency point meeting the noise fade-in stage in the plurality of frequency points, and acquiring the local voice existence probability corresponding to each frequency point;

based on the global voice existence probability, probability correction is carried out on the local voice existence probability corresponding to each target frequency point, and the target voice existence probability corresponding to each target frequency point is obtained;

acquiring gain factors corresponding to the frequency points based on the target voice existence probabilities corresponding to the target frequency points and the local voice existence probabilities corresponding to other frequency points, wherein the other frequency points are frequency points which are not in a noise fade-in stage in the plurality of frequency points;

and performing gain processing on the initial frequency spectrum based on the gain factors corresponding to the frequency points to obtain a target frequency spectrum, and generating a target voice frame corresponding to the initial voice frame based on the target frequency spectrum.

In a second aspect, an embodiment of the present application provides a speech processing apparatus, including:

the power spectrum acquisition module is used for acquiring an initial voice frame in a voice signal, acquiring an initial frequency spectrum corresponding to the initial voice frame and a global voice existence probability, and acquiring an initial power spectrum based on the initial frequency spectrum, wherein the initial power spectrum comprises a plurality of frequency points and power values of all frequency points in the plurality of frequency points;

the frequency point determining module is used for determining each target frequency point meeting the noise fade-in stage in the plurality of frequency points based on the initial power spectrum;

the probability acquisition module is used for acquiring the local voice existence probability corresponding to each frequency point;

the probability correction module is used for carrying out probability correction on the local voice existence probability corresponding to each target frequency point based on the global voice existence probability to obtain the target voice existence probability corresponding to each target frequency point;

the factor obtaining module is used for obtaining gain factors corresponding to the frequency points based on the target voice existence probability corresponding to the target frequency points and the local voice existence probability corresponding to other frequency points, wherein the other frequency points are frequency points which are not in a noise fade-in stage in the plurality of frequency points;

and the voice frame generation module is used for carrying out gain processing on the initial frequency spectrum based on the gain factors corresponding to the frequency points to obtain a target frequency spectrum, and generating a target voice frame corresponding to the initial voice frame based on the target frequency spectrum.

In a third aspect, embodiments of the present application provide a storage medium storing a computer program adapted to be loaded by a processor and to perform the steps of the above method.

In a fourth aspect, embodiments of the present application provide a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method described above when the program is executed.

In the embodiment of the application, the target frequency points in the noise fade-in stage are first identified in the power spectrum, and the local voice existence probability corresponding to each target frequency point is then corrected based on the global voice existence probability of the initial voice frame. This improves the accuracy of the local voice existence probability corresponding to each frequency point, and hence the accuracy of the gain factors calculated from those probabilities. Residual noise and voice distortion in the target voice frame obtained from the initial voice frame are thereby reduced, and the definition of the target voice frame is improved.

Drawings

In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the application, and that other drawings can be obtained from these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flow chart of a voice processing method according to an embodiment of the present application;

Fig. 2 is a schematic flow chart of a voice processing method according to an embodiment of the present application;

Fig. 3 is an exemplary diagram of a current speech period according to an embodiment of the present application;

Fig. 4 is an exemplary schematic diagram of a first frequency point sequence according to an embodiment of the present application;

Fig. 5 is an exemplary schematic diagram of a second frequency bin sequence according to an embodiment of the present application;

Fig. 6 is a schematic flow chart of a voice processing method according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a voice processing device according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a speech processing device according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the features and advantages of the present application more comprehensible, the embodiments of the present application are described in detail below with reference to the accompanying figures. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The existing voice noise reduction method comprises the following steps: obtaining each frequency point corresponding to a current voice frame; determining the minimum power value corresponding to each frequency point; calculating an a posteriori signal-to-noise ratio, an a priori signal-to-noise ratio and a local voice existence probability based on the minimum power value; obtaining a noise power spectrum corresponding to the current voice frame; obtaining gain factors corresponding to each frequency point based on the obtained parameters; and finally performing gain processing on the current voice frame based on the gain factors corresponding to each frequency point to obtain a target voice frame.

In the voice interaction process, due to various sudden external noises, such as motor vibration and other external noises generated by a washing machine and an air conditioner, a large amount of noise signals remain in a target voice frame obtained based on the noise reduction process, and problems such as voice distortion and the like are caused.

The following describes in detail the voice processing method provided in the embodiment of the present application with reference to fig. 1 to 6.

Referring to fig. 1, a flow chart of a voice processing method is provided in an embodiment of the present application. As shown in fig. 1, the method may include the following steps S101 to S105.

S101, acquiring an initial voice frame in a voice signal, acquiring an initial frequency spectrum corresponding to the initial voice frame and a global voice existence probability, and acquiring an initial power spectrum based on the initial frequency spectrum, wherein the initial power spectrum comprises a plurality of frequency points and power values of all frequency points in the plurality of frequency points.

In one embodiment, the voice signal includes a plurality of continuous voice frames. When the voice processing device performs voice processing on the voice signal, it acquires the voice frames one by one in frame order and processes the currently acquired voice frame; the initial voice frame is thus the voice frame acquired by the voice processing device at the current moment while voice frames are being acquired continuously. The global voice existence probability refers to the probability that speech is present in one complete initial voice frame. The initial spectrum is a frequency distribution curve, i.e. a spectral density, obtained by applying signal processing such as the short-time Fourier transform (STFT) to the initial voice frame; it includes a plurality of frequency points and the amplitude value of each of them. The initial power spectrum is obtained based on the initial spectrum and includes a plurality of frequency points and the power value of each of them. For example, if the amplitude value of frequency a in the initial spectrum is $|Y_a|$, the power value of frequency a in the initial power spectrum is $|Y_a|^2$.

The voice processing device comprises a voice acquisition device which is used for acquiring voice signals in the environment where the voice processing device is located. The voice processing device acquires a voice signal through the voice acquisition device and acquires an initial voice frame in the voice signal. Voice preprocessing is then performed on the initial voice frame, and voice features such as pitch, Bark-frequency cepstral coefficients (BFCC) and Mel-frequency cepstral coefficients (MFCC) are extracted from the preprocessed voice signal. The global voice existence probability corresponding to the initial voice frame is then obtained based on the voice features of the initial voice frame.

The voice processing device may further obtain an initial power spectrum corresponding to the initial voice frame based on the voice signal after voice preprocessing.

S102, determining each target frequency point in a noise fade-in stage in the plurality of frequency points based on the initial power spectrum, and acquiring the local voice existence probability corresponding to each frequency point.

In one embodiment, the noise fade-in phase refers to a phase in which the noise energy gradually increases, and alternatively, the noise fade-in phase may be a non-stationary noise phase in which the noise is faded in, i.e., the noise energy gradually increases, but the magnitude of the increase is not stable. The local voice existence probability refers to the existence probability of voice in a voice segment corresponding to a frequency point in the initial power spectrum.

The voice processing device sequentially acquires power values of all frequency points in the plurality of frequency points aiming at the plurality of frequency points in the initial power spectrum, judges whether the currently acquired frequency point is in a noise fading-in stage or not based on the power value of the currently acquired frequency point, and takes the currently acquired frequency point as a target frequency point when the currently acquired frequency point is in the noise fading-in stage. Meanwhile, the local voice existence probability of the currently acquired frequency point is calculated based on the power value of the currently acquired frequency point. Thereby, the target frequency point in the initial power spectrum is identified, and the local voice existence probability corresponding to a plurality of frequency points in the initial power spectrum is obtained.

And S103, carrying out probability correction on the local voice existence probability corresponding to each target frequency point based on the global voice existence probability to obtain the target voice existence probability corresponding to each target frequency point.

In one embodiment, the voice processing device obtains a correction factor, where the correction factor is a parameter corresponding to the voice processing device and is used to control a probability correction degree of the local voice existence probability. Optionally, the correction factor may be a preset parameter, or may be an empirical value of the speech processing apparatus, that is, an empirical value for performing probability correction with respect to the local speech existence probability, and further, the value range of the correction factor is (0, 1).

For example, assume that the local voice existence probability corresponding to the kth frequency point in the initial power spectrum is $p(k,l)$, the global voice existence probability corresponding to the initial voice frame is $p_{global}$, and the correction factor is $\alpha$, where $l$ represents the frame index of the initial voice frame in the voice signal. The corrected local voice existence probability of the kth frequency point in the initial power spectrum is then the target voice existence probability of that frequency point. The correction process of the local voice existence probability can be as follows:
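The correction formula itself is not reproduced in this text. Purely as an illustration, and not as the expression used in the application, the sketch below assumes a linear blend of the local probability toward the global probability, weighted by the correction factor. The blend form and the function names `correct_probability` and `correct_target_points` are assumptions.

```python
def correct_probability(p_local, p_global, alpha):
    """Blend a local voice existence probability toward the global one.

    ASSUMPTION: linear interpolation; the source formula is not available.
    alpha must lie in (0, 1) and controls the degree of correction.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    return alpha * p_global + (1.0 - alpha) * p_local


def correct_target_points(p_local, target_points, p_global, alpha):
    """Return a copy of p_local in which only target frequency points are corrected."""
    corrected = list(p_local)
    for k in target_points:
        corrected[k] = correct_probability(p_local[k], p_global, alpha)
    return corrected


# Hypothetical values: frequency point 1 is in the noise fade-in stage.
p_local = [0.9, 0.8, 0.1]
p_target = correct_target_points(p_local, [1], p_global=0.2, alpha=0.5)
```

With these hypothetical numbers, only the probability at index 1 changes; the other frequency points keep their local voice existence probability, as the method requires.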

S104, obtaining gain factors corresponding to the frequency points based on the target voice existence probability corresponding to the target frequency points and the local voice existence probabilities corresponding to other frequency points, wherein the other frequency points are frequency points which are not in a noise fade-in stage in the plurality of frequency points.

In one embodiment, the gain factor is a parameter that applies a gain to the power value of a frequency point. The voice processing device divides the plurality of frequency points in the initial power spectrum into target frequency points and other frequency points, wherein the target frequency points are the frequency points in the noise fade-in stage, and the other frequency points are the frequency points not in the noise fade-in stage.

The voice processing device obtains gain factors corresponding to the frequency points based on the target voice existence probability corresponding to the target frequency points and the local voice existence probability corresponding to other frequency points.

S105, performing gain processing on the initial frequency spectrum based on the gain factors corresponding to the frequency points to obtain a target frequency spectrum, and generating a target voice frame corresponding to the initial voice frame based on the target frequency spectrum.
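The text does not state how a gain factor is derived from a probability. As a sketch only, the Python below assumes an exponential mapping in which a probability of 1 leaves the frequency point unchanged and a probability of 0 attenuates it to a floor `g_min`. The mapping, the floor value and the function names are assumptions, not the application's formula. It does show the selection described in S104: target frequency points use the target voice existence probability, and all other frequency points use the local one.

```python
def gain_factors(p_local, p_target, g_min=0.1):
    """Compute one gain factor per frequency point.

    p_local  : local voice existence probability of every frequency point
    p_target : dict mapping target frequency point index -> corrected probability
    ASSUMPTION: gain = g_min ** (1 - p), so p = 1 gives 1.0 and p = 0 gives g_min.
    """
    gains = []
    for k, p in enumerate(p_local):
        p_used = p_target.get(k, p)  # corrected value only for target points
        gains.append(g_min ** (1.0 - p_used))
    return gains


def apply_gain(spectrum, gains):
    """Scale each complex spectral value by the gain of its frequency point."""
    return [g * y for g, y in zip(gains, spectrum)]


# Hypothetical three-point spectrum; frequency point 1 is a target point.
spectrum = [2 + 0j, 4j, -6 + 0j]
gains = gain_factors([1.0, 0.8, 0.0], {1: 0.0}, g_min=0.1)
target_spectrum = apply_gain(spectrum, gains)
```

Generating the target voice frame from the target spectrum would additionally require an inverse transform, which this sketch leaves out.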

It can be appreciated that, since the initial power spectrum obtained based on the initial speech frame includes a plurality of frequency points, the relevant processing of the initial power spectrum can be implemented by processing each frequency point. The voice processing method corresponding to each frequency point will be described in detail with reference to fig. 2-4.

Referring to fig. 2, a flow chart of a voice processing method is provided in an embodiment of the application. As shown in fig. 2, the method may include the following steps S201 to S211.

S201, collecting initial voice frames in the voice signals.

In one embodiment, the voice signal includes a plurality of continuous voice frames, and when the voice processing device performs voice processing on the voice signal, the voice processing device acquires the voice frames frame by frame according to the sequence of the voice frames, and performs voice processing on the currently acquired voice frames, it can be understood that the initial voice frame is one voice frame acquired by the voice processing device at the current moment in the process of continuously acquiring the voice frames.

The voice processing device comprises a voice acquisition device which is used for acquiring voice signals in the environment where the voice processing device is located. The voice processing device acquires a voice signal based on the voice acquisition device and acquires an initial voice frame in the voice signal.

S202, acquiring the global voice existence probability of the initial voice frame by adopting a probability estimation model.

In one embodiment, the global voice existence probability refers to the probability that speech is present in one complete initial speech frame. The probability estimation model is a neural network model used to obtain the global voice existence probability of an initial speech frame; it may be a gated recurrent unit (GRU) model, a long short-term memory (LSTM) model, a time delay neural network (TDNN) model, or the like, which is not limited herein. Further, the probability estimation model is trained on a plurality of training speech frames and their corresponding global voice existence probabilities.

After the voice processing device collects the initial voice frame, the initial voice frame is input into a probability estimation model, so that the global voice existence probability of the initial voice frame is obtained through the probability estimation model.

Optionally, the probability estimation model performs the relevant processing on the initial speech frame to obtain the speech features corresponding to the initial speech frame, such as pitch, Bark-frequency cepstral coefficients (BFCC) and Mel-frequency cepstral coefficients (MFCC), and calculates the voice existence probability of the initial speech frame based on the extracted speech features.

S203, performing Fourier transform on the initial voice frame to obtain an initial frequency spectrum corresponding to the initial voice frame, wherein the initial frequency spectrum comprises a plurality of frequency points and amplitude values corresponding to the frequency points.

In one embodiment, the fourier transform may be a short-time fourier transform (STFT), and the initial spectrum includes a plurality of bins and amplitude values corresponding to the plurality of bins.

After the voice processing device collects the initial voice frame, signal processing such as the short-time Fourier transform is performed on the initial voice frame to obtain an initial frequency spectrum containing noise. The amplitude value of the kth frequency point in the initial frequency spectrum is $|Y(k,l)|$, wherein $l$ represents the frame index of the initial voice frame in the voice signal.

S204, based on the initial frequency spectrum, acquiring an initial power spectrum corresponding to the initial voice frame.

In one embodiment, the initial power spectrum includes a plurality of frequency bins and power values for each of the plurality of frequency bins.

The voice processing device calculates the power value of each frequency point based on the amplitude value of each frequency point in the initial spectrum, and generates the initial power spectrum corresponding to the initial voice frame based on the power value of each frequency point.

Illustratively, if the amplitude value of the kth frequency point in the initial spectrum is $|Y(k,l)|$, the power value of the kth frequency point in the initial power spectrum is $|Y(k,l)|^2$; that is, the amplitude value of the kth frequency point is squared, and the resulting square is taken as the power value of the kth frequency point.
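As a minimal sketch of steps S203 and S204, the following Python transforms one frame into a spectrum and squares each amplitude to obtain the power values. The naive `dft` helper and the 8-sample cosine frame are assumptions made for illustration; a practical short-time Fourier transform would also apply a window to the frame and would typically rely on an FFT library.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(spectrum):
    """Power value of each frequency point: the squared amplitude |Y(k, l)|^2."""
    return [abs(y) ** 2 for y in spectrum]


# Hypothetical frame: a cosine that completes exactly one cycle in 8 samples.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
spectrum = dft(frame)           # initial spectrum (complex values)
power = power_spectrum(spectrum)  # initial power spectrum
```

For this frame the energy concentrates in frequency points 1 and 7 (the positive and negative frequency of the cosine), and every other power value is zero up to rounding error.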

In one embodiment, after the speech processing device obtains the initial power spectrum $|Y(k,l)|^2$, the initial power spectrum can be smoothed in time and in dimension to obtain a new $|Y(k,l)|^2$, that is, the smoothed power spectrum. The smoothing over time is as follows:

the smoothing in dimensions is shown by the following formula:

$$S(k,l) = \alpha_s(k,l)\,S(k,l-1) + S_f(k,l)$$

where $b(i)$ is a normalized window function, $2w+1$ is the window length, and $|Y(k-i,l)|^2$ represents the power value of the short-time Fourier transform of the noisy speech in the time-frequency domain.
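One of the two smoothing formulas is missing from this text. The sketch below therefore follows the form commonly used in minima-controlled recursive averaging, which is an assumption rather than a reproduction of the application's formulas: a windowed sum over neighbouring frequency points, $S_f(k,l) = \sum_{i=-w}^{w} b(i)\,|Y(k-i,l)|^2$, followed by a first-order recursion over frames. Two details are assumed: the recursion weights $S_f$ by $(1-\alpha_s)$, whereas the formula as printed above shows $S_f$ unweighted, and $\alpha_s$ is a constant rather than a function of $k$ and $l$.

```python
def smooth_frequency(power, window):
    """S_f(k) = sum over i in [-w, w] of b(i) * power[k - i].

    window holds b(-w) ... b(w) and should sum to 1 (normalized).
    Neighbours that fall outside the spectrum are skipped (an assumption).
    """
    w = len(window) // 2
    n = len(power)
    smoothed = []
    for k in range(n):
        acc = 0.0
        for i in range(-w, w + 1):
            if 0 <= k - i < n:
                acc += window[i + w] * power[k - i]
        smoothed.append(acc)
    return smoothed


def smooth_time(prev_s, s_f, alpha_s):
    """ASSUMED recursion: S(k, l) = alpha_s * S(k, l-1) + (1 - alpha_s) * S_f(k, l)."""
    return [alpha_s * p + (1.0 - alpha_s) * f for p, f in zip(prev_s, s_f)]


# Hypothetical power spectrum of one frame and a window of length 2w + 1 = 3.
power = [0.0, 4.0, 0.0, 0.0]
window = [0.25, 0.5, 0.25]
s_f = smooth_frequency(power, window)
s = smooth_time([0.0, 0.0, 0.0, 0.0], s_f, alpha_s=0.8)
```

The windowed sum spreads the isolated peak at frequency point 1 onto its neighbours, and the recursion then lets the smoothed value rise gradually across frames instead of jumping.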

The initial voice frame is input into the trained probability estimation model, so that the global voice existence probability of the initial voice frame is directly obtained, the complicated probability estimation process is reduced, and the voice processing efficiency is improved.

S205, acquiring a first frequency point in the initial power spectrum, wherein the first frequency point is any frequency point in the frequency points.

In one embodiment, since the initial power spectrum includes a plurality of frequency points, when the voice processing device performs the relevant voice processing on the initial voice frame based on the initial power spectrum, the voice processing device sequentially acquires the first frequency points in the initial power spectrum until the initial power spectrum is traversed. The first frequency point is any one of the frequency points of the initial power spectrum.

S206, if the audio stage corresponding to the first frequency point is a noise fade-in stage, determining that the first frequency point is a target frequency point meeting the noise fade-in stage, wherein the audio stage corresponding to the first frequency point is determined by a voice frame position of a historical voice frame to which a historical frequency point of a minimum power value in a historical voice period belongs, the historical voice period is a last voice period adjacent to a current voice period in which the initial voice frame is located, and the frequency of the historical frequency point is the same as that of the first frequency point.

In one embodiment, the audio phase corresponding to each frequency point in the current speech period is stored in the speech processing device, and the audio phase corresponding to each frequency point is updated every period. Specifically, the audio stage corresponding to each frequency point in the current voice period is determined based on the voice frame position of the historical voice frame to which the historical frequency point corresponding to the minimum power value belongs in the historical voice period, wherein the historical voice period is the last voice period adjacent to the current voice period in which the initial voice frame is positioned, and the frequency of the historical frequency point is the same as that of the first frequency point. It is understood that the speech frame position refers to the position of the history speech frame in the history speech period.

The voice processing device directly acquires an audio stage corresponding to the first frequency point, and determines the first frequency point as a target frequency point meeting the noise fade-in stage when the audio stage corresponding to the first frequency point is the noise fade-in stage.

The voice processing device obtains a first frequency point in the initial power spectrum, obtains a first frequency corresponding to the first frequency point, determines a storage frequency point consistent with the first frequency of the first frequency point in the storage medium based on the first frequency, and takes an audio stage corresponding to the storage frequency point as an audio stage corresponding to the first frequency point.
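The lookup described above can be sketched as a table keyed by frequency. The stage labels, the dictionary layout and the default for frequencies with no stored entry are all assumptions made for this illustration.

```python
NOISE_FADE_IN = "noise_fade_in"  # hypothetical label for the noise fade-in stage
OTHER = "other"                  # hypothetical label for any other audio stage


def find_target_points(frequencies, stored_stage):
    """Return the indices of frequency points whose stored stage is noise fade-in.

    frequencies  : frequency of each point in the initial power spectrum
    stored_stage : dict mapping frequency -> audio stage saved from the
                   previous (historical) voice period
    ASSUMPTION: a frequency without a stored entry is treated as OTHER.
    """
    return [
        k
        for k, freq in enumerate(frequencies)
        if stored_stage.get(freq, OTHER) == NOISE_FADE_IN
    ]


# Hypothetical stored stages for three frequencies, given in Hz.
stored_stage = {125: NOISE_FADE_IN, 250: OTHER}
targets = find_target_points([0, 125, 250], stored_stage)
```

Because the stage is read from storage rather than recomputed, classifying a frequency point in the current frame costs a single lookup.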

The audio stage of each frequency point is read from the audio stages stored in the voice processing device, and the target frequency points meeting the noise fade-in stage are thereby determined in the initial power spectrum. The stored audio stage of each frequency point is identified from the preceding historical voice period. Because the audio stage corresponding to each frequency point is updated every period, a new noise fade-in stage formed by sudden noise in the voice signal can be identified, which further improves the accuracy of the audio stage corresponding to each frequency point.

S207, if the initial voice frame is the last frame of the current voice period, acquiring the first frequency of the first frequency point.

In one embodiment, the voice signal includes a plurality of voice periods, the voice periods include a plurality of voice frames arranged according to an acquisition sequence, and the voice period in which the initial voice frame is located is taken as a current voice period. The initial power spectrum comprises a plurality of first frequency points with different frequencies and power values of the first frequency points.

And the voice processing device acquires the first frequency of the first frequency point when the initial voice frame is the last frame in the current voice period.

S208, acquiring a first frequency point sequence corresponding to the first frequency in the current voice period, wherein all frequency points in the first frequency point sequence are arranged according to the acquisition sequence of voice frames in the current voice period.

In one embodiment, the first frequency point sequence includes a plurality of frequency points arranged according to an acquisition sequence of a voice frame in a current voice period, and frequencies corresponding to the frequency points in the first frequency point sequence are all first frequencies.

By way of example, fig. 3 shows an exemplary schematic of a current speech cycle. As shown in fig. 3, the current speech period C1 includes a plurality of speech frames VF arranged according to the acquisition order, each speech frame VF has a plurality of frequency points FQ corresponding to the speech frame VF, frequencies of the frequency points FQ in the plurality of frequency points FQ corresponding to the speech frame VF are different, and the frequency points FQ with the same frequency form a first frequency point sequence L1 corresponding to the frequency.

The voice processing device firstly acquires an initial power spectrum corresponding to each voice frame in a current voice period, then acquires a plurality of frequency points corresponding to first frequency in a plurality of frequency points corresponding to each voice frame based on the initial power spectrum corresponding to each voice frame, and then acquires a first frequency point sequence corresponding to the first frequency based on the acquisition sequence of each voice frame and the frequency points corresponding to the first frequency.
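Assembling a first frequency point sequence amounts to reading one column from the per-frame power spectra of the period. A minimal sketch, assuming the period is held as a list of power spectra in acquisition order (the data layout and the function name are assumptions):

```python
def frequency_point_sequence(period_power_spectra, k):
    """Power values of frequency point k across all frames of one voice period.

    period_power_spectra : list of per-frame power spectra, ordered by the
                           acquisition order of the frames
    k                    : index of the frequency point (the first frequency)
    """
    return [frame_power[k] for frame_power in period_power_spectra]


# Hypothetical period of three frames with two frequency points each.
period = [[5.0, 9.0], [1.0, 8.0], [3.0, 7.0]]
sequence_0 = frequency_point_sequence(period, 0)
sequence_1 = frequency_point_sequence(period, 1)
```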

S209, a second frequency point corresponding to the first minimum power value is obtained from the first frequency point sequence, and the voice frame position of the voice frame to which the second frequency point belongs in the current voice period is obtained.

In one embodiment, the voice processing device compares the power values of the frequency points in the first frequency point sequence one by one, obtains the minimum value, namely the first minimum power value, of the power values of the frequency points, determines the second frequency point corresponding to the first minimum power value in the first frequency point sequence, determines the voice frame to which the second frequency point belongs, and finally obtains the voice frame position of the voice frame to which the second frequency point belongs in the current voice period.

By way of example, fig. 4 shows an exemplary schematic diagram of a first frequency point sequence. As shown in fig. 4, the first frequency point sequence L1 includes a plurality of frequency points FQ (12 frequency points are shown in fig. 4). Assuming that the second frequency point FQ_min corresponding to the first minimum power value is the frequency point FQ of frame 2, the speech frame position, in the current speech period C1, of frame 2 to which FQ_min belongs is 2.
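Step S209 can be sketched as a minimum search that also reports a 1-based frame position, matching the numbering used in the fig. 4 example. The power values below are invented for illustration, and taking the first occurrence when several frames share the minimum is an assumption.

```python
def min_power_frame_position(sequence):
    """Return (first minimum power value, 1-based position of its frame).

    sequence : power values of one frequency across the frames of a period,
               in acquisition order
    ASSUMPTION: on ties, the earliest frame is chosen.
    """
    min_index = min(range(len(sequence)), key=sequence.__getitem__)
    return sequence[min_index], min_index + 1


# Hypothetical 12-frame sequence whose minimum lies in the second frame.
sequence = [5.0, 1.0, 3.0, 4.0, 6.0, 7.0, 5.5, 6.5, 8.0, 9.0, 7.5, 6.0]
min_power, frame_position = min_power_frame_position(sequence)
```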

S210, determining an audio stage corresponding to a third frequency point indicated by the first frequency in a target voice period based on a voice frame position of a voice frame to which the second frequency point belongs and a set position range in the current voice period, wherein the target voice period is a next voice period adjacent to the current voice period.

In one embodiment, the set position range is the position range used to determine that the noise fade-in stage is satisfied in the target speech period, and it is determined based on the period length of the speech period. The target speech period is the next speech period adjacent to the current speech period. Optionally, the voice period is customized by the voice processing device; after determining the period length of the voice period, the voice processing device obtains the set position range corresponding to the voice period. For example, if the voice period length is A and the upper range threshold of the set position range is X, then the upper limit of the set position range is AX and the lower limit is 1, i.e., the set position range is [1, AX]. As a further example, if the voice period length is 12 and the upper range threshold is 0.25, the set position range is [1, 0.25 × 12], i.e., [1, 3]. If the value of AX has a fractional part, the integer part of AX is taken as the upper limit of the set position range; for example, if AX = 13 × 0.25 = 3.25, the upper limit is 3, i.e., the set position range is [1, 3]. It will be appreciated that the set position range lies at the front end of the speech period.
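The arithmetic of the set position range can be sketched as below. The function name `set_position_range` is hypothetical; the rule itself (lower limit 1, upper limit equal to the integer part of AX) follows the description above.

```python
import math

def set_position_range(period_length, upper_threshold):
    """Return (lower, upper) of the set position range [1, floor(A * X)]."""
    upper = math.floor(period_length * upper_threshold)  # integer part of AX
    return 1, upper

range_12 = set_position_range(12, 0.25)  # AX = 3.0
range_13 = set_position_range(13, 0.25)  # AX = 3.25, fractional part dropped
```

Both calls reproduce the two worked examples in the text, where the range is [1, 3] in each case.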

The voice processing device determines an audio stage corresponding to a third frequency point indicated by the first frequency in the target voice period based on the voice frame position of the voice frame to which the second frequency point belongs and the set position range in the current voice period, wherein the third frequency point indicated by the first frequency in the target voice period refers to all frequency points with the first frequency in the target voice period.

By acquiring the second frequency point corresponding to the minimum power value in the current voice period and determining the voice frame position, in the current voice period, of the voice frame to which that second frequency point belongs, whether a noise fade-in stage exists in the voice period is judged according to whether the voice frame with the minimum power value lies at the front end of the voice period. Based on this judgment, the frequency points corresponding to the noise fade-in stage are determined for the next voice period; that is, the frequency points satisfying the noise fade-in stage are dynamically updated, and probability correction is then dynamically applied to them. This improves the recognition accuracy of the noise fade-in stage and the accuracy of the probabilities of the target frequency points.

In one embodiment, when determining the audio stage corresponding to the third frequency point, the speech processing apparatus may first determine whether the speech frame position is within the set position range in the current speech period; if so, it determines that the audio stage corresponding to the third frequency point corresponding to the first frequency in the target speech period is a noise fade-in stage.

The voice processing device first judges whether the voice frame position is within the set position range in the current voice period, that is, it compares the voice frame position with the upper and lower limits of the set position range. If the voice frame position is less than or equal to the upper limit and greater than or equal to the lower limit, the voice frame position is judged to be within the set position range in the current voice period, and the audio stages corresponding to the plurality of third frequency points corresponding to the first frequency in the target voice period are all determined to be noise fade-in stages.

It can be understood that if the voice frame position is not within the set position range in the current voice period, it can be determined that the audio phases corresponding to the plurality of third frequency points corresponding to the first frequency in the target voice period are all non-noise fade-in phases.
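The decision in S210 reduces to an inclusive range check. A minimal sketch follows; the stage labels and the function name `audio_stage_for_next_period` are illustrative assumptions, not identifiers from the application.

```python
NOISE_FADE_IN = "noise_fade_in"          # hypothetical label
NON_NOISE_FADE_IN = "non_noise_fade_in"  # hypothetical label

def audio_stage_for_next_period(frame_position, position_range):
    """Classify all frequency points of one frequency in the target (next) period.

    `frame_position` is the 1-based position of the minimum-power frame in the
    current period; `position_range` is the inclusive (lower, upper) range.
    """
    lower, upper = position_range
    if lower <= frame_position <= upper:
        return NOISE_FADE_IN
    return NON_NOISE_FADE_IN

stage_front = audio_stage_for_next_period(2, (1, 3))  # minimum near the front
stage_late = audio_stage_for_next_period(7, (1, 3))   # minimum later in the period
```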

By judging whether the voice frame position lies within the set position range, the audio stage of the frequency points corresponding to each frequency in the target voice period is obtained. The voice existence probability in the next voice period is thus corrected in a targeted manner based on the voice condition of the current voice period, which further improves the accuracy of the local voice existence probability corresponding to each target frequency point in the next voice period.

S211, acquiring a second frequency point sequence corresponding to the first frequency of the first frequency point, wherein the second frequency point sequence comprises a fourth frequency point corresponding to the first frequency in each historical voice frame of the historical voice period and a fifth frequency point corresponding to the first frequency in the acquired voice frame in the current voice period.

In one embodiment, the first frequency is the frequency of the first frequency point currently traversed to. The historical speech period is the last speech period adjacent to the current speech period to which the initial speech frame belongs. By way of example, fig. 5 shows an exemplary schematic diagram of a second frequency point sequence. As shown in fig. 5, the second frequency point sequence L2 includes the fourth frequency points FQ₄ corresponding to the first frequency in each historical voice frame of the historical voice period C2, and the fifth frequency points FQ₅ corresponding to the first frequency in the acquired voice frames of the current voice period C1.

The voice processing device acquires the fourth frequency point corresponding to the first frequency in each historical voice frame of the historical voice period, acquires the fifth frequency point corresponding to the first frequency in each acquired voice frame of the current voice period, and arranges these frequency points according to the acquisition order of the historical voice frames and the acquired voice frames to obtain the second frequency point sequence corresponding to the first frequency.

S212, obtaining a second minimum power value based on the power value of each frequency point in the second frequency point sequence.

In one embodiment, the voice processing device compares the power values of the frequency points in the second frequency point sequence one by one, and obtains the minimum value, namely the second minimum power value, of the power values of the frequency points.
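As a small sketch of S211 and S212, the second frequency point sequence can be formed by concatenating the power values of the historical period with those of the frames acquired so far in the current period, and the second minimum power value is the minimum of that sequence. The function name and sample values are assumptions for illustration.

```python
import numpy as np

def second_minimum_power(history_powers, current_powers):
    """Minimum power of one frequency over the historical period plus the
    frames acquired so far in the current period (both in acquisition order)."""
    sequence = np.concatenate([
        np.asarray(history_powers, dtype=float),   # fourth frequency points
        np.asarray(current_powers, dtype=float),   # fifth frequency points
    ])
    return float(sequence.min())

# Hypothetical values: the minimum lies in the historical period.
s_min = second_minimum_power([0.5, 0.3, 0.8], [0.6, 0.4])
```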

S213, obtaining the local voice existence probability corresponding to the first frequency point based on the power value of the first frequency point and the second minimum power value.

In one embodiment, the voice processing device obtains the local voice existence probability corresponding to the first frequency point based on the power value of the first frequency point and the second minimum power value.

By taking the minimum over both the current voice period and the historical voice period, the accuracy of the minimum value parameter used in calculating the local voice existence probability is improved, which in turn improves the accuracy of the local voice existence probability corresponding to each frequency point in the initial voice frame.

In an embodiment, when obtaining the local voice existence probability, the voice processing device may first obtain a local voice non-existence probability corresponding to the first frequency point based on the power value of the first frequency point and the second minimum power value, and then obtain the local voice existence probability corresponding to the first frequency point based on the power value of the first frequency point and the local voice non-existence probability.

The local voice existence probability refers to the existence probability of voice in a voice segment corresponding to a frequency point in the initial power spectrum. The voice processing device firstly obtains the local voice non-existence probability corresponding to the first frequency point based on the power value of the first frequency point and the second minimum power value, and then obtains the local voice existence probability corresponding to the first frequency point based on the local voice non-existence probability corresponding to the first frequency point.

The first frequency points are acquired from the initial power spectrum in sequence, and each currently acquired first frequency point is processed in turn, so that the initial power spectrum is processed frequency point by frequency point. In this way, the target frequency points in the initial power spectrum are identified and their local voice existence probabilities are acquired, so that the voice processing device can correct the local voice existence probability corresponding to each target frequency point in a targeted manner, which improves the accuracy of the local voice existence probability corresponding to each frequency point.

In one embodiment, the voice processing device may obtain the local voice non-existence probability corresponding to the first frequency point based on the power value of the first frequency point and the second minimum power value; acquiring a sixth frequency point adjacent to the first frequency point in the second frequency point sequence, and acquiring a historical noise power value, a historical posterior signal-to-noise ratio and a historical gain factor corresponding to the sixth frequency point; and acquiring the local voice existence probability corresponding to the first frequency point based on the historical noise power value, the historical posterior signal-to-noise ratio, the historical gain factor, the power value of the first frequency point and the local voice non-existence probability.

In one embodiment, the voice processing device obtains the local voice non-existence probability corresponding to the first frequency point based on the power value of the first frequency point and the second minimum power value. The method for obtaining the local voice non-existence probability corresponding to the first frequency point is shown in the following formula:

In this formula, $q(k, l)$ denotes the local voice non-existence probability corresponding to the k-th frequency point, where $l$ is the frame index of the initial voice frame in the voice signal; the second minimum power value is the one corresponding to the k-th frequency point; $B_{\min}$ is the bias of the minimum noise estimate, an empirical value of the voice processing device that can be adjusted according to the application scenario; and $\gamma_1$ is a threshold constant, usually set to $\gamma_1 = 3.0$.

The voice processing device acquires the sixth frequency point adjacent to the first frequency point in the second frequency point sequence. It should be noted that, because the first frequency point belongs to the newly acquired initial voice frame, it is necessarily located at the very end of the corresponding second frequency point sequence, so there is only one sixth frequency point adjacent to the first frequency point.

The voice processing device acquires a historical noise power value, a historical posterior signal-to-noise ratio and a historical gain factor corresponding to the sixth frequency point in the storage module, and acquires the local voice existence probability corresponding to the first frequency point based on the historical noise power value, the historical posterior signal-to-noise ratio, the historical gain factor, the power value and the local voice non-existence probability. It can be understood that the historical voice frame corresponding to the sixth frequency point is the last initial voice frame, and the related parameter obtained by the voice processing device is the related parameter generated in the last voice frame processing process and corresponding to the first frequency.

The local voice existence probability corresponding to the first frequency point is calculated from the calculation-related parameters corresponding to the first frequency in the previous initial voice frame. This increases the correlation between the voice frames in the voice signal and avoids the errors that would arise from calculating the local voice existence probability of each frequency point in the initial voice frame in isolation, thereby improving the accuracy of the local voice existence probability.

In one embodiment, the voice processing device may obtain a posterior signal-to-noise ratio corresponding to the first frequency point based on the historical noise power value and the power value of the first frequency point; acquiring a priori signal-to-noise ratio corresponding to the first frequency point based on the historical posterior signal-to-noise ratio and the historical gain factor; and acquiring the local voice existence probability corresponding to the first frequency point based on the prior signal-to-noise ratio, the posterior signal-to-noise ratio and the local voice non-existence probability.

In one embodiment, the manner of obtaining the posterior signal-to-noise ratio corresponding to the first frequency point is shown in the following formula:

where $\lambda_d(k, l-1)$ is the noise power value corresponding to the k-th frequency point of the $(l-1)$-th frame (i.e., the sixth frequency point), namely the historical noise power value; $|Y(k, l)|^2$ is the power value of the k-th frequency point of the initial voice frame; and $\gamma(k, l)$ is the posterior signal-to-noise ratio corresponding to the k-th frequency point.

The prior signal-to-noise ratio corresponding to the first frequency point is obtained by the following formula:

where $\xi(k, l)$ is the a priori signal-to-noise ratio corresponding to the k-th frequency point; the historical gain factor is the gain factor corresponding to the k-th frequency point of the $(l-1)$-th frame (i.e., the sixth frequency point); and $\gamma(k, l-1)$ is the posterior signal-to-noise ratio corresponding to the k-th frequency point of the $(l-1)$-th frame (i.e., the sixth frequency point), namely the historical posterior signal-to-noise ratio.

The acquisition mode of the local voice existence probability corresponding to the first frequency point is shown in the following formula:

where $p(k, l)$ is the local voice existence probability corresponding to the k-th frequency point, and $q(k, l)$ is the local voice non-existence probability corresponding to the k-th frequency point.
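To make the chain of quantities concrete, the sketch below computes a posterior SNR, an a priori SNR and a speech presence probability from the inputs the text names. It is an assumption-laden illustration: the expressions are the conventional forms from the OM-LSA family of noise suppressors (a ratio for the posterior SNR, a decision-directed estimate for the a priori SNR, and the usual presence-probability expression), and the application's own formulas may differ. The weight `alpha`, its default value and all function names are hypothetical.

```python
import numpy as np

def posterior_snr(power, noise_prev):
    """gamma(k, l) = |Y(k, l)|^2 / lambda_d(k, l-1) (conventional definition)."""
    return power / np.maximum(noise_prev, 1e-12)  # guard against division by zero

def prior_snr(gain_prev, gamma_prev, gamma_now, alpha=0.92):
    """Decision-directed estimate (assumed form, not confirmed by the source).

    Uses the historical gain factor and historical posterior SNR, plus the
    current posterior SNR; `alpha` is a hypothetical weighting constant.
    """
    return alpha * gain_prev ** 2 * gamma_prev + (1.0 - alpha) * np.maximum(gamma_now - 1.0, 0.0)

def speech_presence_probability(xi, gamma, q):
    """p = 1 / (1 + q/(1-q) * (1+xi) * exp(-v)), v = gamma*xi/(1+xi) (assumed form)."""
    q = np.clip(q, 0.0, 1.0 - 1e-12)  # keep q/(1-q) finite when q == 1
    v = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * np.exp(-v))

gamma = posterior_snr(np.array([4.0]), np.array([2.0]))
xi = prior_snr(np.array([1.0]), np.array([2.0]), gamma, alpha=0.5)
p = speech_presence_probability(xi, gamma, np.array([0.0]))
```

With a non-existence probability of 0 the presence probability is 1, and it falls toward 0 as the non-existence probability approaches 1, which matches the complementary roles the text assigns to the two probabilities.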

In one embodiment, the voice processing device may further generate a noise power value corresponding to the first frequency point, specifically as shown in the following formula:

$$\lambda_d(k, l) = \alpha_d(k, l)\,\lambda_d(k, l-1) + [1 - \alpha_d(k, l)]\,|Y(k, l)|^2$$

where the time-varying smoothing factor is $\alpha_d(k, l) = \alpha_d + (1 - \alpha_d)\,p(k, l)$, and $\alpha_d$ is a smoothing constant in the range (0, 1), usually taken as 0.9.
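The noise power update can be sketched in a few lines, assuming the time-varying smoothing factor is α_d + (1 − α_d)·p(k, l) with α_d = 0.9. The function name `update_noise_power` is hypothetical.

```python
def update_noise_power(noise_prev, power, p, alpha_d=0.9):
    """lambda_d(k, l) = a * lambda_d(k, l-1) + (1 - a) * |Y(k, l)|^2,
    with a = alpha_d + (1 - alpha_d) * p(k, l)."""
    a = alpha_d + (1.0 - alpha_d) * p  # a -> 1 when speech is certainly present
    return a * noise_prev + (1.0 - a) * power

noise_when_speech = update_noise_power(2.0, 4.0, p=1.0)  # estimate is frozen
noise_when_silent = update_noise_power(2.0, 4.0, p=0.0)  # estimate tracks the frame
```

The design point is that a high speech presence probability freezes the noise estimate, so speech energy is not absorbed into it, while a low probability lets the estimate follow the current frame at the rate set by α_d.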

By determining each calculation related parameter corresponding to the first frequency in the previous initial voice frame, the local voice existence probability corresponding to the first frequency point is calculated, the relevance of each voice frame in the voice signal is increased, errors caused by independently measuring and calculating the local voice existence probability corresponding to each frequency point in the initial voice frame are avoided, and therefore the accuracy of the local voice existence probability is improved.

S214, based on the global voice existence probability, probability correction is carried out on the local voice existence probability corresponding to each target frequency point, and the target voice existence probability corresponding to each target frequency point is obtained.

S215, if the first frequency point is a target frequency point, acquiring a gain factor corresponding to the first frequency point based on the target voice existence probability corresponding to the first frequency point.

The voice processing device firstly determines the frequency point type of the first frequency point, and if the first frequency point is a target frequency point, the gain factor corresponding to the first frequency point is obtained based on the target voice existence probability, the priori signal-to-noise ratio and the posterior signal-to-noise ratio corresponding to the first frequency point.

For example, the process of obtaining the gain factor corresponding to each frequency point may be as follows:

where the lower gain limit may be taken as 0.02.

When the first frequency point is a target frequency point, $p(k, l)$ in the formula is the target voice existence probability corresponding to the first frequency point.

S216, if the first frequency point is other frequency points, obtaining a gain factor corresponding to the first frequency point based on the local voice existence probability corresponding to the first frequency point.

In one embodiment, the voice processing device determines a frequency point type of a first frequency point, and if the first frequency point is another frequency point, obtains a gain factor corresponding to the first frequency point based on a local voice existence probability, a priori signal-to-noise ratio and a posterior signal-to-noise ratio corresponding to the first frequency point. When the first frequency point is other frequency points, p (k, l) in the formula is the local voice existence probability corresponding to the first frequency point.

By determining the frequency point type of the first frequency point, the calculation parameters for the gain factor are selected according to that type: the target voice existence probability is used for target frequency points, and the local voice existence probability is used for other frequency points. This improves the accuracy of the calculated gain factors, further reduces residual noise and voice distortion in the target voice frame obtained from the initial voice frame, and improves the definition of the target voice frame.
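The selection between the two probabilities in S215 and S216 can be sketched as a per-frequency-point choice. The function name, the boolean mask `is_target` and the sample values are assumptions; the gain formula itself is not reproduced here.

```python
import numpy as np

def probability_for_gain(local_p, target_p, is_target):
    """Pick the probability fed into the gain formula for each frequency point:
    the corrected (target) probability for target frequency points, and the
    local probability for all other frequency points."""
    return np.where(is_target, target_p, local_p)

local_p = np.array([0.2, 0.6, 0.9])
target_p = np.array([0.1, 0.5, 0.8])          # only meaningful where is_target is True
is_target = np.array([True, False, True])
p_used = probability_for_gain(local_p, target_p, is_target)
```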

S217, performing gain processing on the initial frequency spectrum based on the gain factors corresponding to the frequency points to obtain a target frequency spectrum, and generating a target voice frame corresponding to the initial voice frame based on the target frequency spectrum.

In one embodiment, the target speech frame is a low noise and clean speech frame obtained after speech processing the initial speech frame.

The voice processing device applies the gain factor corresponding to each frequency point to the amplitude value of that frequency point in the initial frequency spectrum to obtain the target amplitude value of each frequency point, obtains the target frequency spectrum from these target amplitude values, and then obtains the target voice frame from the target frequency spectrum.

For example, assuming that the gain factor corresponding to the k-th frequency point is $G(k, l)$ and the value of the k-th frequency point in the initial spectrum is $Y(k, l)$, the value $Y'(k, l)$ of the k-th frequency point in the target spectrum may be obtained based on the following formula:

$$Y'(k, l) = G(k, l)\,Y(k, l)$$

where $l$ is the frame index, in the speech signal, of the initial speech frame corresponding to the initial spectrum.

The speech processing apparatus obtains the target spectrum based on $Y'(k, l)$ corresponding to each frequency point; specifically, it may generate the target spectrum from each frequency point and its gained amplitude value, and then process the target spectrum to generate the corresponding target speech frame. Optionally, when generating the target speech frame from the target spectrum, the inverse of the process used to generate the initial spectrum from the initial speech frame may be used, or another speech frame generation method may be used; no limitation is imposed here.
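A minimal sketch of S217, assuming the initial spectrum was produced by a real FFT of a single frame without windowing or overlap-add (a simplification made for the example; the source leaves the transform open). The function names are hypothetical.

```python
import numpy as np

def apply_gain(initial_spectrum, gain):
    """Y'(k, l) = G(k, l) * Y(k, l): scale each frequency point by its gain factor."""
    return gain * initial_spectrum

def frame_from_spectrum(target_spectrum, frame_length):
    """Inverse of the assumed forward transform (real FFT of one frame)."""
    return np.fft.irfft(target_spectrum, n=frame_length)

frame = np.sin(2 * np.pi * 5 * np.arange(64) / 64)   # hypothetical initial frame
spectrum = np.fft.rfft(frame)                        # initial spectrum
unity_gain = np.ones(spectrum.shape)
restored = frame_from_spectrum(apply_gain(spectrum, unity_gain), frame.size)
```

With a gain of 1 at every frequency point the frame is recovered unchanged; smaller gains at noise-dominated frequency points attenuate those components in the target frame.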

Referring to fig. 6, a flowchart of a voice processing method is provided in an embodiment of the present application. As shown in fig. 6, the method may include the following steps.

S1, collecting voice signals.

The voice processing device continuously collects the environmental voice of the environment where the voice processing device is located through the voice collecting device so as to collect a plurality of voice frames, each voice frame in the plurality of voice frames is arranged according to the collection sequence to form a voice signal, and when one voice frame is collected, S2 is executed.

S2, acquiring an initial voice frame.

The voice processing device acquires the currently acquired voice frame, takes the voice frame as an initial voice frame, and then respectively executes S3 and S4.

S3, obtaining the global voice existence probability.

The voice processing device obtains the global voice existence probability corresponding to the initial voice frame through the probability estimation model. S9 is performed.

S4, acquiring an initial frequency spectrum.

The voice processing device performs signal processing such as a short-time Fourier transform (STFT) on the initial voice frame to obtain an initial frequency spectrum containing noise, where the initial frequency spectrum includes a plurality of frequency points and the amplitude value of each frequency point in the plurality of frequency points, and S5 is executed.

S5, acquiring an initial power spectrum.

The voice processing device obtains power values of all frequency points in the plurality of frequency points in the initial power spectrum based on amplitude values of all frequency points in the plurality of frequency points in the initial frequency spectrum, and generates the initial power spectrum, wherein the initial power spectrum comprises the plurality of frequency points and the power values of all frequency points in the plurality of frequency points. S6 is performed.

S6, determining a target frequency point.

The voice processing device acquires the audio stage corresponding to each frequency point, and then takes each frequency point whose audio stage is the noise fade-in stage as a target frequency point. S9 is performed.

S7, obtaining the local voice non-existence probability corresponding to each frequency point.

The voice processing apparatus calculates a local voice non-existence probability corresponding to each frequency point based on the power value and the minimum power value corresponding to each frequency point, and then executes S8.

S8, obtaining the local voice existence probability corresponding to each frequency point.

The voice processing device obtains the historical noise power value, the historical posterior signal-to-noise ratio and the historical gain factor corresponding to each frequency point in the storage module, calculates the local voice existence probability corresponding to each frequency point based on the historical noise power value, the historical posterior signal-to-noise ratio, the historical gain factor, the initial power value and the local voice non-existence probability corresponding to each frequency point, and then executes S9.

S9, obtaining the target voice existence probability corresponding to the target frequency point.

The voice processing device corrects the local voice existence probability corresponding to the target frequency point based on the global voice existence probability to obtain the target voice existence probability corresponding to the target frequency point, and S10 is executed.

S10, calculating gain factors corresponding to all the frequency points.

The voice processing device calculates gain factors corresponding to the frequency points according to the frequency point types corresponding to the frequency points. The method comprises the steps of calculating a gain factor corresponding to a target frequency point based on the target voice existence probability, the priori signal-to-noise ratio and the posterior signal-to-noise ratio corresponding to the target frequency point; and based on the local voice existence probability, the priori signal-to-noise ratio and the posterior signal-to-noise ratio corresponding to other frequency points, calculating gain factors corresponding to other frequency points, and executing S11.

S11, generating a target frequency spectrum.

The voice processing device performs gain processing on the initial amplitude value of each frequency point based on the gain factor corresponding to each frequency point to obtain a target amplitude value after gain, and then generates a target frequency spectrum based on the target amplitude value corresponding to each frequency point, and S12 is executed.

S12, generating a target voice frame.

The speech processing apparatus generates a target speech frame based on the target spectrum.
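Steps S4 and S5 of the flow above can be sketched as follows. The Hann window and the use of a single-frame real FFT are assumptions standing in for the STFT-style processing mentioned in S4; the function names are hypothetical.

```python
import numpy as np

def initial_spectrum(frame):
    """S4 (sketch): windowed real FFT of one frame; one complex value per frequency point."""
    frame = np.asarray(frame, dtype=float)
    window = np.hanning(len(frame))  # assumption: the source does not name a window
    return np.fft.rfft(frame * window)

def initial_power_spectrum(spectrum):
    """S5 (sketch): power value of each frequency point, |Y(k, l)|^2."""
    return np.abs(spectrum) ** 2

spectrum = initial_spectrum(np.ones(8))
power = initial_power_spectrum(spectrum)
```

An 8-sample frame yields 5 frequency points under a real FFT, and every power value is non-negative.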

The following describes in detail a speech processing device according to an embodiment of the present application with reference to figs. 7 to 8. It should be noted that the speech processing apparatus of figs. 7-8 is used to execute the method of the embodiments shown in figs. 1-6. For convenience of explanation, only the parts relevant to the embodiment of the present application are shown; for technical details not disclosed here, please refer to the embodiments shown in figs. 1-6.

Referring to fig. 7, a schematic structural diagram of a speech processing device is provided in an embodiment of the present application. As shown in fig. 7, the voice processing apparatus 1 according to the embodiment of the present application may include: a power spectrum acquisition module 11, a frequency point determination module 12, a probability acquisition module 13, a probability correction module 14, a factor acquisition module 15 and a voice frame generation module 16.

The power spectrum acquisition module 11 is configured to acquire an initial voice frame in a voice signal, acquire an initial frequency spectrum corresponding to the initial voice frame and a global voice existence probability, and acquire an initial power spectrum based on the initial frequency spectrum, where the initial power spectrum includes a plurality of frequency points and power values of each frequency point in the plurality of frequency points;

a frequency point determining module 12, configured to determine, based on the initial power spectrum, each target frequency point that satisfies a noise fade-in stage from the plurality of frequency points;

The probability obtaining module 13 is configured to obtain a local voice existence probability corresponding to each frequency point;

the probability correction module 14 is configured to perform probability correction on the local voice existence probabilities corresponding to the target frequency points based on the global voice existence probabilities, so as to obtain target voice existence probabilities corresponding to the target frequency points;

the factor obtaining module 15 is configured to obtain a gain factor corresponding to each frequency point based on the target voice existence probability corresponding to each target frequency point and the local voice existence probabilities corresponding to other frequency points, where the other frequency points are frequency points that are not in a noise fade-in stage in the plurality of frequency points;

the voice frame generating module 16 is configured to perform gain processing on the initial spectrum based on the gain factors corresponding to the frequency points, obtain a target spectrum, and generate a target voice frame corresponding to the initial voice frame based on the target spectrum.

In one embodiment, the frequency point determining module 12 is specifically configured to:

acquiring a first frequency point from the initial power spectrum, wherein the first frequency point is any frequency point in the frequency points;

if the audio stage corresponding to the first frequency point is a noise fade-in stage, determining that the first frequency point is a target frequency point meeting the noise fade-in stage, wherein the audio stage corresponding to the first frequency point is determined by a voice frame position of a historical voice frame to which a historical frequency point of a minimum power value in a historical voice period belongs, the historical voice period is a last voice period adjacent to a current voice period in which the initial voice frame is located, and the frequency of the historical frequency point is the same as that of the first frequency point.

In the embodiment of the application, the audio stage corresponding to each frequency point is read from the audio stages stored in the voice processing device, so that the target frequency points satisfying the noise fade-in stage are determined in the initial power spectrum. The stored audio stage of each frequency point is obtained from the identification performed on the last historical voice period. By periodically updating the audio stage corresponding to each frequency point, the target frequency points satisfying the noise fade-in stage in the current voice period are determined and a new noise fade-in stage formed by rising noise in the voice signal is identified, which further improves the accuracy of the audio stage corresponding to each frequency point.

In another embodiment, referring to fig. 8, a schematic structural diagram of a speech processing device is provided in an embodiment of the present disclosure. As shown in fig. 8, the speech processing apparatus 1 of the embodiment of the present specification may further include: a phase determination module 17.

The stage determination module 17 is specifically configured to:

if the initial voice frame is the last frame of the current voice period, acquiring a first frequency of the first frequency point;

acquiring a first frequency point sequence corresponding to the first frequency in the current voice period, wherein all frequency points in the first frequency point sequence are arranged according to the acquisition sequence of voice frames in the current voice period;

acquiring a second frequency point corresponding to a first minimum power value from the first frequency point sequence, and acquiring a voice frame position of a voice frame to which the second frequency point belongs in the current voice period;

and determining an audio stage corresponding to a third frequency point indicated by the first frequency in a target voice period based on the voice frame position of the voice frame to which the second frequency point belongs and the set position range in the current voice period, wherein the target voice period is the next voice period adjacent to the current voice period.

In the embodiment of the application, the second frequency point corresponding to the minimum power value in the current voice period is obtained, and the position, within the current voice period, of the voice frame to which the second frequency point belongs is determined. Whether a noise fade-in stage exists in the voice period is then judged by checking whether the voice frame corresponding to the minimum power value is located at the front end of the voice period. Based on this result, the frequency points that are in the noise fade-in stage in the next voice period are determined. In other words, the frequency points that are in the noise fade-in stage are dynamically updated, and the probability correction is dynamically applied to them, which improves the recognition accuracy of the noise fade-in stage and the accuracy of the probabilities of the target frequency points.

In one embodiment, the stage determination module 17 is specifically configured to:

if the voice frame position is within the set position range in the current voice period, determining that the audio stage corresponding to the third frequency point corresponding to the first frequency in the target voice period is the noise fade-in stage.

In the embodiment of the application, the audio stage of the frequency point corresponding to each frequency in the target voice period is obtained by judging whether the voice frame position is positioned in the set position range, so that the voice existence probability in the next voice period is corrected in a targeted manner based on the voice condition of the current voice period, and the accuracy of the local voice existence probability corresponding to each target frequency point in the next voice period is further improved.
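
The stage determination above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name, the string labels, and the inclusive `(start, end)` form of the set position range are assumptions, and the label returned when the position lies outside the range is also an assumption, since the text only states the in-range case.

```python
def next_period_stage(power_sequence, position_range):
    """Decide the audio stage of one frequency for the next voice period.

    power_sequence: power values of that frequency in the current voice period,
                    arranged in the acquisition order of the voice frames.
    position_range: (start, end) frame positions, inclusive, treated here as
                    the set position range at the front end of the period.
    """
    if not power_sequence:
        raise ValueError("power_sequence must contain at least one frame")
    # Frame position of the minimum power value (first occurrence on ties).
    min_position = min(range(len(power_sequence)),
                       key=power_sequence.__getitem__)
    start, end = position_range
    if start <= min_position <= end:
        return "noise_fade_in"
    return "other"  # assumed label for every case the text does not cover


# Power rises steadily over the period, so the minimum sits in the first frame.
rising = [0.2, 0.5, 0.9, 1.4, 1.8]
stage_rising = next_period_stage(rising, (0, 1))
```

With a sequence whose minimum lies in the middle of the period, for example `[1.0, 0.8, 0.3, 0.9]`, the same call returns the assumed `"other"` label.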

In one embodiment, the probability obtaining module 12 is specifically configured to:

acquiring a second frequency point sequence corresponding to the first frequency of the first frequency point, wherein the second frequency point sequence comprises a fourth frequency point corresponding to the first frequency in each historical voice frame of a historical voice period and a fifth frequency point corresponding to the first frequency in the acquired voice frame in the current voice period;

acquiring a second minimum power value based on the power value of each frequency point in the second frequency point sequence;

and acquiring the local voice existence probability corresponding to the first frequency point based on the power value of the first frequency point and the second minimum power value.

In the embodiment of the application, the minimum power value is taken over both the current voice period and the historical voice period, which improves the accuracy of the minimum value parameter used in calculating the local voice existence probability and further improves the accuracy of the local voice existence probability corresponding to each frequency point in the initial voice frame.
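
A minimal sketch of taking the minimum over both periods is given below. The function names are hypothetical, and the power-to-minimum ratio is shown only as one common way of relating a power value to a tracked minimum; this passage does not state that the application uses this exact quantity.

```python
def second_minimum_power(history_powers, current_powers):
    """Minimum power of one frequency over the historical voice period and
    the voice frames already acquired in the current voice period."""
    combined = list(history_powers) + list(current_powers)
    if not combined:
        raise ValueError("at least one power value is required")
    return min(combined)


def power_to_minimum_ratio(power, min_power, eps=1e-12):
    """Ratio of the current power to the tracked minimum (assumed indicator)."""
    return power / max(min_power, eps)


history = [0.4, 0.3, 0.6]   # fourth frequency points (historical voice period)
current = [0.5, 0.2]        # fifth frequency points (current voice period)
p_min = second_minimum_power(history, current)
ratio = power_to_minimum_ratio(0.6, p_min)
```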

In one embodiment, the probability obtaining module 12 is specifically configured to:

based on the power value of the first frequency point and the second minimum power value, obtaining the local voice non-existence probability corresponding to the first frequency point;

and acquiring the local voice existence probability corresponding to the first frequency point based on the power value of the first frequency point and the local voice non-existence probability.
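
The two steps above can be illustrated as follows. This is a sketch under stated assumptions and not the formula of the application: the non-existence probability is assumed to fall linearly as the power-to-minimum ratio rises between two hypothetical thresholds `low` and `high`, and the existence probability is assumed to follow the conditional speech presence form used in OM-LSA style estimators, which additionally requires a noise power estimate and a prior signal-to-noise ratio `xi` that are not defined in this passage.

```python
import math


def speech_absence_probability(power, min_power, low=1.0, high=3.0):
    """Assumed mapping: 1 at or below `low`, 0 at or above `high`, linear between."""
    ratio = power / max(min_power, 1e-12)
    if ratio <= low:
        return 1.0
    if ratio >= high:
        return 0.0
    return (high - ratio) / (high - low)


def speech_presence_probability(power, noise_power, q, xi):
    """Assumed form: p = 1 / (1 + q/(1-q) * (1+xi) * exp(-v)), v = gamma*xi/(1+xi)."""
    if q >= 1.0:
        return 0.0
    if q <= 0.0:
        return 1.0
    gamma = power / max(noise_power, 1e-12)  # posterior signal-to-noise ratio
    v = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + (q / (1.0 - q)) * (1.0 + xi) * math.exp(-v))


q = speech_absence_probability(power=2.0, min_power=1.0)
p = speech_presence_probability(power=2.0, noise_power=1.0, q=q, xi=1.0)
```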

In the embodiment of the application, the first frequency points are acquired one by one from the initial power spectrum, and each currently acquired first frequency point is processed in turn, so that the initial power spectrum is processed frequency point by frequency point. The target frequency points in the initial power spectrum are identified and their local voice existence probabilities are acquired, so that the voice processing device can correct the local voice existence probabilities corresponding to the target frequency points in a targeted manner, which improves the accuracy of the local voice existence probability corresponding to each frequency point.

In one embodiment, the factor obtaining module 14 is specifically configured to:

if the first frequency point is a target frequency point, acquiring a gain factor corresponding to the first frequency point based on the target voice existence probability, the priori signal-to-noise ratio and the posterior signal-to-noise ratio corresponding to the first frequency point;

and if the first frequency point is other frequency points, acquiring a gain factor corresponding to the first frequency point based on the local voice existence probability, the priori signal-to-noise ratio and the posterior signal-to-noise ratio corresponding to the first frequency point.

In the embodiment of the application, the frequency point type of the first frequency point is determined, and the calculation parameters for the gain factor are selected according to that type: the target voice existence probability is used for a target frequency point, and the local voice existence probability is used for the other frequency points. This improves the accuracy of the calculated gain factors, further reduces residual noise and voice distortion in the target voice frame obtained from the initial voice frame, and improves the definition of the target voice frame.
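
The selection of the probability by frequency point type can be sketched as follows. The gain formula here is a simplified stand-in and an assumption: it combines a Wiener-like gain `xi / (1 + xi)` with a hypothetical gain floor `g_min` in the log domain and uses only the prior signal-to-noise ratio, whereas the application also uses the posterior signal-to-noise ratio. The function and parameter names are hypothetical.

```python
def gain_factor(is_target, target_prob, local_prob, prior_snr, g_min=0.1):
    """Gain for one frequency point, choosing the probability by point type."""
    # Target frequency points use the corrected (target) probability;
    # all other frequency points use the local probability.
    p = target_prob if is_target else local_prob
    p = min(max(p, 0.0), 1.0)
    g_speech = prior_snr / (1.0 + prior_snr)  # Wiener-like gain (assumption)
    # Geometric weighting between the speech gain and the gain floor.
    return (g_speech ** p) * (g_min ** (1.0 - p))


g_target = gain_factor(True, target_prob=1.0, local_prob=0.0, prior_snr=1.0)
g_other = gain_factor(False, target_prob=1.0, local_prob=0.0, prior_snr=1.0)
```

In this example the same frequency point receives the speech gain when it is treated as a target frequency point and the gain floor when it is not, which shows why the choice of probability matters for residual noise and distortion.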

In one embodiment, the power spectrum acquisition module 11 is specifically configured to:

collecting an initial voice frame in a voice signal;

a probability estimation model is adopted to obtain the global voice existence probability of the initial voice frame;

performing Fourier transform on the initial voice frame to obtain an initial frequency spectrum corresponding to the initial voice frame, wherein the initial frequency spectrum comprises a plurality of frequency points and amplitude values corresponding to the frequency points;

and acquiring an initial power spectrum corresponding to the initial voice frame based on the initial frequency spectrum.

In the embodiment of the application, the initial voice frame is input into the trained probability estimation model, so that the global voice existence probability of the initial voice frame is directly obtained, the complicated probability estimation process is reduced, and the voice processing efficiency is improved.
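
The acquisition steps above can be sketched with the standard library only. This is an illustration under stated assumptions: a direct discrete Fourier transform is used for brevity instead of a fast transform, the power of each frequency point is taken as the squared magnitude of its spectrum value, and `probability_model` is a hypothetical callable standing in for the trained probability estimation model.

```python
import cmath


def dft(frame):
    """Direct discrete Fourier transform of one voice frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def power_spectrum(spectrum):
    """Squared magnitude of each frequency point (assumed power definition)."""
    return [abs(x) ** 2 for x in spectrum]


def analyze_frame(frame, probability_model):
    """Return (initial spectrum, initial power spectrum, global probability)."""
    spectrum = dft(frame)
    return spectrum, power_spectrum(spectrum), probability_model(frame)


# Toy frame and a constant stub in place of the trained model.
frame = [1.0, 0.0, -1.0, 0.0]
spectrum, power, global_prob = analyze_frame(frame, lambda f: 0.5)
```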

The embodiment of the present application further provides a storage medium, where the storage medium may store a plurality of program instructions, where the program instructions are adapted to be loaded by a processor and execute the method steps of the embodiment shown in fig. 1 to fig. 6, and the specific execution process may refer to the specific description of the embodiment shown in fig. 1 to fig. 6, which is not repeated herein.

Referring to fig. 9, a schematic structural diagram of a computer device is provided in an embodiment of the present application. As shown in fig. 9, the computer device 1000 may include: at least one processor 1001, at least one communication bus 1002, at least one input/output interface 1003, at least one network interface 1004, and at least one memory 1005. The processor 1001 may include one or more processing cores. The processor 1001 connects the various parts of the computer device 1000 using various interfaces and lines, and performs the various functions of the computer device 1000 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 1005 and by calling the data stored in the memory 1005. The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. Optionally, the network interface 1004 may include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The communication bus 1002 is used to enable connection and communication between these components. As shown in fig. 9, the memory 1005, which is a type of computer storage medium, may include an operating system, a network communication module, an input/output interface module, and a voice processing program.

In the computer device 1000 shown in fig. 9, the input/output interface 1003 is mainly used for providing an input interface for a user and an access device, and acquiring data input by the user and the access device.

In one embodiment, the processor 1001 may be configured to call a speech processing program stored in the memory 1005, and specifically perform the following operations:

Optionally, when the processor 1001 determines, based on the initial power spectrum, each target frequency point that satisfies the noise fade-in phase among the plurality of frequency points, the following operations are specifically performed:

Optionally, the processor 1001 may be further configured to invoke a speech processing program stored in the memory 1005, and specifically perform the following operations:

Optionally, when the processor 1001 determines, based on the voice frame position of the voice frame to which the second frequency point belongs and the set position range in the current voice period, an audio phase corresponding to the third frequency point indicated by the first frequency in the target voice period, the following operations are specifically performed:

Optionally, when the processor 1001 performs the obtaining the local voice existence probabilities corresponding to the frequency points, the following operations are specifically performed:

Optionally, when the processor 1001 obtains the local voice existence probability corresponding to the first frequency point based on the power value of the first frequency point and the second minimum power value, the following operations are specifically executed:

Optionally, when executing the obtaining the gain factor corresponding to each frequency point based on the target voice existence probability corresponding to each target frequency point and the local voice existence probabilities corresponding to other frequency points, the processor 1001 specifically executes the following operations:

if the first frequency point is a target frequency point, acquiring a gain factor corresponding to the first frequency point based on the target voice existence probability corresponding to the first frequency point;

and if the first frequency point is other frequency points, acquiring a gain factor corresponding to the first frequency point based on the local voice existence probability corresponding to the first frequency point.

Optionally, when the processor 1001 acquires an initial speech frame in the speech signal, acquires an initial spectrum corresponding to the initial speech frame and a global speech existence probability, and acquires an initial power spectrum based on the initial spectrum, where the initial power spectrum includes a plurality of frequency points and power values of each of the plurality of frequency points, the processor specifically performs the following operations:

collecting an initial voice frame in a voice signal;

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in another order or simultaneously in accordance with the present application. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules referred to are not necessarily all required by the present application.

In the foregoing embodiments, the description of each embodiment has its own emphasis, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

The foregoing describes the speech processing method, apparatus, storage medium, and computer device provided by the present application. Those skilled in the art should not construe the foregoing description as limiting the scope of the present application.

Claims

1. A method of speech processing, the method comprising:

2. The method of claim 1, wherein the determining, based on the initial power spectrum, each target frequency bin of the plurality of frequency bins that meets a noise fade-in phase comprises:

3. The method as recited in claim 2, further comprising:

4. The method of claim 3, wherein the determining the audio phase corresponding to the third frequency point indicated by the first frequency in the target voice period based on the voice frame position of the voice frame to which the second frequency point belongs and the set position range in the current voice period includes:

5. The method of claim 2, wherein the obtaining the local voice existence probabilities corresponding to the frequency points includes:

6. The method of claim 5, wherein the obtaining the local voice existence probability corresponding to the first frequency point based on the power value of the first frequency point and the second minimum power value comprises:

7. The method according to claim 1, wherein the obtaining the gain factor corresponding to each frequency point based on the target voice existence probability corresponding to each target frequency point and the local voice existence probabilities corresponding to other frequency points includes:

8. The method of claim 1, wherein the acquiring an initial speech frame in a speech signal, acquiring an initial spectrum corresponding to the initial speech frame and a global speech existence probability, and acquiring an initial power spectrum based on the initial spectrum, wherein the initial power spectrum includes a plurality of frequency points and power values of each frequency point in the plurality of frequency points, includes:

collecting an initial voice frame in a voice signal;

9. A speech processing apparatus, comprising:

10. A storage medium having stored thereon a computer program, which when executed by a processor implements the speech processing method of any of claims 1-8.

11. A computer device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the steps of the speech processing method according to any of claims 1-8.