CN114387989B - Voice signal processing method, device, system and storage medium - Google Patents

Publication number
CN114387989B
Authority
CN
China
Prior art keywords
frequency
signal
voiced
amplitude
domain signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210286433.2A
Other languages
Chinese (zh)
Other versions
CN114387989A
Inventor
Zhang Bin
Yi Xin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huijin Chunhua Technology Co ltd
Original Assignee
Beijing Huijin Chunhua Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huijin Chunhua Technology Co ltd
Priority to CN202210286433.2A
Publication of CN114387989A
Application granted
Publication of CN114387989B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain

Abstract

An embodiment of the present application provides a voice signal processing method comprising: converting a historical voice signal of preset length into a first frequency-domain signal; determining candidate pitch frequency bins in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes; when candidate pitch bins exist in the first frequency-domain signal, setting to zero the amplitudes of all bins other than the candidate pitch bins to obtain a second frequency-domain signal; determining a target voiced signal based on the second frequency-domain signal; and generating a speech compensation signal from the target voiced signal. The method analyzes the signal in the frequency domain: exploiting the quasi-periodicity of voiced speech, it identifies the bins that carry the voiced signal from the voiced amplitude of each bin and filters out bins that contain no voiced energy or only weak voiced energy, removing noise and unvoiced components as far as possible. The resulting voiced signal is thus closer to the real one, the accuracy of voiced-signal estimation is improved, the computational load stays light, and the problem of the large packet-loss-compensation workload of existing speech algorithms is solved.

Description

Voice signal processing method, device, system and storage medium
Technical Field
The present application relates to the field of speech signal processing technologies, and in particular, to a speech signal processing method, apparatus, system, and storage medium.
Background
For network voice communication, if there is no data to be played in the voice buffer, packet-loss compensation is needed, that is, current data is generated from historical data; otherwise the peer clearly hears the sound break up. In a typical packet-loss-compensation scheme, the pitch period is calculated from the historical signal, the voiced signal at a historical time is used as the current voiced signal, and an unvoiced signal and background noise are superimposed on it to produce the output speech signal.
The data used when calculating the pitch period also contains unvoiced signals and noise, which affects the result to some extent; moreover, the correlation computation is expensive and adds system overhead.
In addition, when the compensated data packets are assembled, a combination of a voiced signal, an unvoiced signal, and comfortable background noise is desired. In practice, however, the historical segment taken for the combination almost never contains only voiced sound: it contains voiced sound, unvoiced sound, and background noise together, or only background noise, resulting in a poor compensation effect.
Disclosure of Invention
The embodiments of the present application provide a voice signal processing method, apparatus, system, and storage medium, which can solve the problem of low reception accuracy at the receiving end caused by the large system overhead and poor voice-compensation effect of existing schemes.
A first aspect of an embodiment of the present application provides a speech signal processing method, including:
converting a historical voice signal of preset length into a first frequency-domain signal;
determining candidate pitch frequency bins in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes;
when candidate pitch bins exist in the first frequency-domain signal, setting to zero the amplitudes of all bins in the first frequency-domain signal other than the candidate pitch bins to obtain a second frequency-domain signal;
determining a target voiced signal based on the second frequency-domain signal;
and generating a speech compensation signal from the target voiced signal.
Optionally, the method further comprises:
by the formula

T_i(k) = α · T_{i-1}(k) + (1 - α) · A_i(k)

determining the corresponding threshold amplitudes T_i(k) for different pitch frequencies, where α is a smoothing factor and A_i(k) is the amplitude of the kth frequency bin obtained by FFT in the ith iteration.
Optionally, the determining of candidate pitch frequency bins in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes comprises:
selecting, as candidate pitch frequency bins, those pitch bins whose voiced amplitude and/or whose integer-multiple (harmonic) bin voiced amplitude is greater than the corresponding threshold amplitude.
Optionally, the selecting of the candidate pitch frequency bins comprises:
when three or more pitch bins whose voiced amplitude and/or integer-multiple bin voiced amplitude is greater than the corresponding threshold amplitude are found, selecting the three pitch bins with the largest voiced amplitude as the candidate pitch bins;
when fewer than three such pitch bins are found, sorting the remaining pitch bins by voiced amplitude in descending order and appending them, in that order, to the candidates until three candidate pitch bins are obtained.
Optionally, the method further comprises:
and generating the speech compensation signal from background comfort noise when no candidate pitch frequency bin exists in the first frequency-domain signal.
Optionally, the determining a target voiced sound signal based on the second frequency-domain signal includes:
converting the second frequency-domain signal into a candidate time-domain signal;
and taking from the candidate time-domain signal a segment whose length equals the longest candidate pitch period as the target voiced signal.
Optionally, the method further includes:
when the speech compensation signal is generated from background comfort noise, generating meaning-expression word options for the whole speech segment based on adjacent speech information;
obtaining the user's selection among the meaning-expression word options;
and sending the meaning-expression words corresponding to the selection to the receiving end of the voice message;
and/or,
when the speech compensation signal is generated from background comfort noise and reception of the sending end's voice information has finished, playing the voice information including the speech compensation signal at the sending end.
A second aspect of the embodiments of the present application provides a speech signal processing apparatus, including:
a conversion unit, configured to convert a historical voice signal of preset length into a first frequency-domain signal;
a determining unit, configured to determine candidate pitch frequency bins in the first frequency-domain signal based on the bin amplitudes of different pitch frequencies;
an obtaining unit, configured to, when candidate pitch bins exist in the first frequency-domain signal, set to zero the amplitudes of all bins in the first frequency-domain signal other than the candidate pitch bins to obtain a second frequency-domain signal;
the determining unit being further configured to determine a target voiced signal based on the second frequency-domain signal;
and a generating unit, configured to generate a speech compensation signal from the target voiced signal.
A third aspect of the embodiments of the present application provides an electronic system, which includes a memory and a processor, where the processor is configured to implement the steps of the above speech signal processing method when executing a computer program stored in the memory.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above-mentioned speech signal processing method.
In summary, the voice signal processing method provided by the embodiments of the present application converts a historical voice signal of preset length into a first frequency-domain signal; determines candidate pitch frequency bins in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes; when candidate pitch bins exist, sets to zero the amplitudes of all bins other than the candidate pitch bins to obtain a second frequency-domain signal; determines a target voiced signal based on the second frequency-domain signal; and generates a speech compensation signal from the target voiced signal. The method analyzes the signal in the frequency domain: exploiting the quasi-periodicity of voiced speech, it identifies the bins that carry the voiced signal from the voiced amplitude of each bin and filters out bins that contain no voiced energy or only weak voiced energy, removing noise and unvoiced components as far as possible. The resulting voiced signal is thus closer to the real one, the accuracy of voiced-signal estimation is improved, the computational load stays light, and the problem of the large packet-loss-compensation workload of existing speech algorithms is solved.
Accordingly, the speech signal processing apparatus, the electronic system, and the computer-readable storage medium provided by the embodiments of the present application have the same technical effects.
Drawings
Fig. 1 is a schematic flowchart of a possible speech signal processing method according to an embodiment of the present application;
fig. 2 is a schematic block diagram of a possible speech signal processing apparatus according to an embodiment of the present application;
fig. 3 is a schematic hardware structure diagram of a possible speech signal processing apparatus according to an embodiment of the present disclosure;
FIG. 4 is a schematic block diagram of a possible electronic system according to an embodiment of the present disclosure;
fig. 5 is a schematic structural block diagram of a possible computer-readable storage medium provided in an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a voice signal processing method, apparatus, system, and storage medium, which can solve the problem of low reception accuracy at the receiving end caused by the large system overhead and poor voice-compensation effect of existing schemes.
The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the above-described drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
In some examples, when a voice packet is lost for the first time, pitch-period detection may be performed on the historical data (30 ms plus a small amount of extra data). Detection computes correlation values and then fine-tunes the result; the pitch period obtained is stored, and algorithm parameters are computed from the historical signal. The AR filter coefficients can be found by the Levinson-Durbin algorithm from the most recent 20 ms of historical data. According to the pitch period, the most recent historical segment of the same length as the pitch period is taken as the voiced signal. In addition, random noise can be generated and passed through the AR filter to be output as the unvoiced signal, and comfort background noise may also be generated. Finally, the unvoiced and voiced signals are weighted and summed, the result is superimposed with the comfort background noise, and the sum is used as the packet-loss-compensation output. If this is not the first packet loss, these steps are skipped and the existing pitch period is used directly for compensation.
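For illustration only, the Levinson-Durbin step mentioned above can be sketched in Python. This is a generic textbook recursion for deriving AR (LPC) coefficients from autocorrelation values, not the patent's actual implementation; the function name, order, and inputs are assumptions.

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for AR (LPC) coefficients.

    r:     autocorrelation values r[0..order] of the historical signal
    order: AR model order
    Returns (a, err): a[0] = 1 and a[1..order] are the AR coefficients,
    err is the final prediction-error power.
    """
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for this order
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        # update coefficients a[1..i]
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

Driving white noise through the all-pole filter 1/A(z) defined by these coefficients would then produce the unvoiced-signal estimate described above.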
The above example has two problems:
1) The pitch period is calculated by a correlation method, exploiting the fact that the voiced signal is quasi-periodic; but the data used in the calculation also contains unvoiced signals and noise, which affects the compensation result to some extent. Moreover, because unvoiced signals and noise are included, the correlation computation is expensive and adds system overhead.
2) When performing packet-loss compensation, a combination of a voiced signal, an unvoiced signal, and comfortable background noise is desired. In practice, however, the historical segment taken for the combination almost never contains only voiced sound: it contains voiced sound, unvoiced sound, and background noise together, or only background noise.
Referring to fig. 1, an embodiment of the present application provides a method that addresses the low reception accuracy caused by large system overhead and a poor voice-compensation effect; the flow may specifically include steps S110-S150.
S110, converting a historical voice signal with a preset length into a first frequency domain signal;
For example, the preset length of the historical speech signal may be any length convenient for the subsequent calculation. For instance, 256 × fs_mult samples of historical speech are windowed with a Hann window, down-sampled to 8000 Hz, and transformed by FFT; the length of the down-sampled data is 256.
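For illustration only, step S110 might be sketched as follows with NumPy. The naive decimation (taking every fs_mult-th sample without an anti-aliasing filter) and the function name to_first_freq_domain are simplifying assumptions; the source does not specify the down-sampling method.

```python
import numpy as np

def to_first_freq_domain(history, fs_mult):
    """Window the last 256 * fs_mult historical samples with a Hann window,
    down-sample to 8000 Hz, and FFT to 256 bins (spacing 8000/256 = 31.25 Hz).
    Plain decimation is used here for brevity; a real implementation would
    low-pass filter before decimating."""
    n = 256 * fs_mult
    frame = np.asarray(history[-n:], dtype=float)
    frame = frame * np.hanning(n)      # Hann window
    down = frame[::fs_mult]            # naive decimation to 8000 Hz -> 256 samples
    return np.fft.fft(down)            # 256-point spectrum

# example: a 100 Hz tone sampled at 16 kHz (fs_mult = 2)
spec = to_first_freq_domain(np.sin(2 * np.pi * 100 * np.arange(512) / 16000), 2)
```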
S120, determining alternative fundamental tone frequency points in the first frequency domain signal based on the voiced sound amplitude values of the frequency points with different fundamental tone frequencies and the corresponding threshold amplitude values;
For example, a candidate pitch frequency bin is a bin that carries a valid voiced signal; there may be several candidate pitch bins.
S130, when candidate pitch bins exist in the first frequency-domain signal, setting to zero the amplitudes of all bins in the first frequency-domain signal other than the candidate pitch bins to obtain a second frequency-domain signal;
Illustratively, if candidate pitch bins exist in the first frequency-domain signal (that is, the signal contains a valid voiced component), the amplitudes of all bins other than the candidate pitch bins are set to zero, which clears the bins that contain no voiced signal or only a weak one.
S140, determining a target voiced sound signal based on the second frequency-domain signal;
For example, a relatively pure voiced signal, i.e., the target voiced signal, may be obtained by converting the second frequency-domain signal back to the time domain.
And S150, generating a voice compensation signal according to the target voiced sound signal.
Illustratively, random noise may be generated, passed through an AR filter, and output as the unvoiced signal; comfort background noise may also be generated. Finally, the unvoiced signal and the target voiced signal are weighted and summed, the result is superimposed with the comfort background noise, and the sum is used as the packet-loss-compensation output.
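For illustration only, the final mixing step can be sketched as follows. The weight values w_voiced and w_unvoiced are hypothetical; the source states that the signals are weighted and summed but does not give the weights.

```python
def mix_compensation(voiced, unvoiced, comfort_noise,
                     w_voiced=0.7, w_unvoiced=0.3):
    """Weighted sum of the voiced and unvoiced estimates, superimposed with
    comfort background noise, as the packet-loss-compensation output.
    The weights are illustrative placeholders, not values from the patent."""
    return [w_voiced * v + w_unvoiced * u + n
            for v, u, n in zip(voiced, unvoiced, comfort_noise)]
```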
According to the voice signal processing method provided by the above embodiment, a historical voice signal of preset length is converted into a first frequency-domain signal; candidate pitch frequency bins are determined in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes; when candidate pitch bins exist, the amplitudes of all bins other than the candidate pitch bins are set to zero to obtain a second frequency-domain signal; a target voiced signal is determined based on the second frequency-domain signal; and a speech compensation signal is generated from the target voiced signal. The method analyzes the signal in the frequency domain: exploiting the quasi-periodicity of voiced speech, it identifies the bins that carry the voiced signal from the voiced amplitude of each bin and filters out bins that contain no voiced energy or only weak voiced energy, removing noise and unvoiced components as far as possible. The resulting voiced signal is thus closer to the real one, the accuracy of voiced-signal estimation is improved, the computational load stays light, and the problem of the large packet-loss-compensation workload of existing speech algorithms is solved.
According to some embodiments, the method further comprises:
by the formula

T_i(k) = α · T_{i-1}(k) + (1 - α) · A_i(k)    (1)

determining the corresponding threshold amplitudes T_i(k) for different pitch frequencies, where α is a smoothing factor and A_i(k) is the amplitude of the kth frequency bin obtained by FFT in the ith iteration.
For example, the threshold amplitude of each bin may be calculated as follows. Taking a frame duration of 10 ms as an example, one frame contains 80 × fs_mult samples, where fs_mult is the ratio of the current sampling rate to 8000 Hz: at a sampling rate of 16000 Hz, fs_mult = 2, and at 48000 Hz, fs_mult = 6. In the initial stage, consecutive frames are accumulated until 256 × fs_mult samples are available for the threshold-amplitude estimate. The estimate is an iterative process, and each pass is recorded as one iteration value: the data are windowed, down-sampled to 8000 Hz (signal length 256 at this point), and transformed by FFT to obtain the amplitude of each bin, the bin spacing being 8000/256 = 31.25 Hz. Let T_i(k) denote the voiced-signal threshold of the kth bin in the ith iteration. The corresponding threshold amplitudes T_i(k) for different pitch frequencies can then be determined using equation (1) above. The smoothing factor may take the value 0.7, and the total number of iterations may be 20. The thresholds are estimated during the first 64 frames after system initialization, i.e., the initial 640 milliseconds.
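For illustration only, one iteration of the threshold update in equation (1) can be sketched as follows; the function name and the list-based representation are assumptions, while alpha = 0.7 follows the value suggested in the text.

```python
def update_thresholds(prev_thresholds, magnitudes, alpha=0.7):
    """One iteration of the threshold-amplitude estimate, equation (1):
    T_i(k) = alpha * T_{i-1}(k) + (1 - alpha) * A_i(k)."""
    return [alpha * t + (1.0 - alpha) * a
            for t, a in zip(prev_thresholds, magnitudes)]

# Iterating 20 times over constant bin magnitudes converges toward them.
t = [0.0] * 4
for _ in range(20):
    t = update_thresholds(t, [1.0, 2.0, 3.0, 4.0])
```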
According to some embodiments, the determining alternative pitch bins in the first frequency domain signal based on the bin voiced amplitudes of different pitch frequencies and corresponding threshold amplitudes comprises:
selecting, as candidate pitch frequency bins, those pitch bins whose voiced amplitude and/or whose integer-multiple (harmonic) bin voiced amplitude is greater than the corresponding threshold amplitude.
Illustratively, owing to the quasi-periodicity of the voiced signal, if the pitch frequency is fB, larger amplitudes appear at the positions fB, 2fB, 3fB, ..., k·fB on the spectrogram. Meanwhile, to avoid the influence of noise and unvoiced data, those components must be filtered out as far as possible. For finer filtering, a threshold amplitude can be set for each bin: a bin whose amplitude exceeds its threshold is considered to contain a valid voiced signal; otherwise it does not, and it is filtered out. Since the human pitch period ranges roughly from 2.5 ms to 15 ms, corresponding to a pitch frequency of 67 Hz to 400 Hz, the computed candidate pitch frequencies all lie in this range. The pitch frequency is selected by frequency-domain analysis and must satisfy two conditions: first, the amplitude at the pitch-frequency bin is larger than that bin's voiced threshold; second, if the amplitude at bin fB is large, the amplitudes at the integer-multiple bins of fB should also be large.
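For illustration only, the two selection conditions above can be sketched as a per-bin test. The number of harmonics checked (three) is an assumption; the source only requires that the integer-multiple bins of fB also have large amplitudes.

```python
def candidate_pitch_bins(magnitudes, thresholds, bin_hz=31.25,
                         f_min=67.0, f_max=400.0, harmonics=3):
    """A bin in the 67-400 Hz pitch range qualifies when its amplitude and
    the amplitudes of its first few integer-multiple (harmonic) bins all
    exceed the per-bin voiced thresholds."""
    lo = int(round(f_min / bin_hz))          # 67/31.25  = 2.144 -> bin 2
    hi = int(round(f_max / bin_hz))          # 400/31.25 = 12.8  -> bin 13
    candidates = []
    for k in range(lo, hi + 1):
        idxs = [k * h for h in range(1, harmonics + 1)]
        if all(i < len(magnitudes) and magnitudes[i] > thresholds[i]
               for i in idxs):
            candidates.append(k)
    return candidates
```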
According to some embodiments, the selecting a pitch frequency point with a voiced sound amplitude of a frequency point and/or an integer multiple of the voiced sound amplitude of the frequency point larger than the corresponding threshold amplitude as the candidate pitch frequency point includes:
when three or more pitch bins whose voiced amplitude and/or integer-multiple bin voiced amplitude is greater than the corresponding threshold amplitude are found, selecting the three pitch bins with the largest voiced amplitude as the candidate pitch bins;
when fewer than three such pitch bins are found, sorting the remaining pitch bins by voiced amplitude in descending order and appending them, in that order, to the candidates until three candidate pitch bins are obtained.
Illustratively, the interval between two adjacent bins may be 8000/256 = 31.25 Hz. An array may be used to record whether the amplitude of each bin is greater than that bin's voiced threshold: if it is, the corresponding position is recorded as 1, otherwise as 0; this array is denoted array. Since the down-sampled signal has 256 samples, array has 256 elements, each 0 or 1. A counting array pitch_array[16] is initialized to 0; all elements of array are traversed, and pitch_array and the magnitude array pitch_array_magn are updated according to equations (2) and (3):

pitch_array[k] = 1 if A(k) > T(k), else 0    (2)

pitch_array_magn[k] = FFT(k)    (3)

In equation (3), FFT(k) is the value of the kth bin after the FFT. The position corresponding to the minimum pitch frequency is 67/31.25 = 2.144 and that of the maximum pitch frequency is 400/31.25 = 12.8; rounding, the pitch frequencies are taken to lie at positions 2 to 13 of the array. pitch_array is traversed starting from index 2 to find each run of consecutive 1s, recording the position of its first element as s and of its last element as e. There may be several such runs, or the values of pitch_array from 2 to 13 may all be 0. If they are all 0, the segment is considered to contain no voiced signal. Otherwise, each run of 1s corresponds to one pitch frequency, calculated as:

fB = 31.25 · Σ(k=s..e) k · pitch_array_magn[k] / Σ(k=s..e) pitch_array_magn[k]    (4)

In equation (4), s is the start position of the run of 1s and e its end position; the formula is a weighted average from s to e with weights pitch_array_magn. The corresponding pitch period is 1/fB, and the corresponding number of samples is fs/fB at sampling rate fs. This yields several pitch periods, from which the final three candidate pitch periods are selected. The screening principle is: if there are three or more pitch periods, the three whose runs have the largest summed pitch_array_magn are taken as the candidates; if there are fewer than three, the existing pitch periods are all kept as candidates, and the remaining slots are filled with the pitch periods corresponding to the largest remaining pitch_array_magn values.
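For illustration only, the run-scanning and the weighted average of equation (4) can be sketched as follows; the function name and the flat-list representation of pitch_array and pitch_array_magn are assumptions.

```python
def pitch_from_runs(flags, magn, lo=2, hi=13, bin_hz=31.25):
    """Scan pitch_array (flags) over bins lo..hi, find each run of
    consecutive 1s (start s, end e), and compute that run's pitch
    frequency as the magnitude-weighted average bin index times the bin
    spacing, per equation (4). Returns pitch frequencies in Hz."""
    freqs = []
    k = lo
    while k <= hi:
        if flags[k]:
            s = k
            while k <= hi and flags[k]:
                k += 1
            e = k - 1                              # run covers bins s..e
            num = sum(j * magn[j] for j in range(s, e + 1))
            den = sum(magn[j] for j in range(s, e + 1))
            freqs.append(bin_hz * num / den)       # equation (4)
        else:
            k += 1
    return freqs
```

The corresponding pitch period of each returned frequency fB is 1/fB, i.e., fs/fB samples at sampling rate fs.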
According to some embodiments, further comprising:
and generating a voice compensation signal according to the background comfort noise under the condition that the first frequency domain signal does not have the alternative fundamental tone frequency point.
Illustratively, if the calculation finds that the signal contains no voiced component, the background comfort noise is used as the extended-signal estimate.
According to some embodiments, the determining a target voiced sound signal based on the second frequency-domain signal comprises:
converting the second frequency-domain signal into a candidate time-domain signal;
and taking from the candidate time-domain signal a segment whose length equals the longest candidate pitch period as the target voiced signal.
For example, if the calculation finds that a voiced signal is contained, 16 milliseconds of data, namely 128 × fs_mult samples, are selected from the historical data, a Hanning window is applied, and the windowed history is downsampled to 8000 Hz before the FFT. The length of the downsampled data is 128, so the frequency bin spacing after the FFT is 8000/128 = 62.5 Hz. The amplitudes of all frequency bins other than the three alternative pitch frequencies are set to 0, an IFFT is then performed, and a span of the IFFT output equal to the longest alternative pitch period is taken as the voiced signal. The unvoiced signal is generated by a random noise generator; the background comfort noise is then added, and the three components are fused to obtain the packet-loss-compensated signal.
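The reconstruction step can be sketched as follows. This is a hedged illustration, not the patented implementation: the candidate pitch bins, the noise levels, and the naive decimation-by-stride downsampling are placeholder assumptions.

```python
import numpy as np

def synthesize_concealment(history, fs=16000, cand_bins=(2, 3, 4),
                           noise_level=0.01):
    """Window 16 ms of history, downsample to 8 kHz, take a 128-point FFT
    (62.5 Hz bins), zero every bin except the candidate pitch bins, IFFT
    back, then mix the voiced estimate with random noise (unvoiced part)
    and comfort noise."""
    fs_mult = fs // 8000
    seg = history[-128 * fs_mult:] * np.hanning(128 * fs_mult)
    seg8k = seg[::fs_mult]                    # naive downsample to 8000 Hz
    spec = np.fft.rfft(seg8k, n=128)          # 8000/128 = 62.5 Hz per bin
    mask = np.zeros_like(spec)
    for b in cand_bins:                       # keep only candidate pitch bins
        mask[b] = spec[b]
    voiced = np.fft.irfft(mask, n=128)
    unvoiced = np.random.uniform(-noise_level, noise_level, 128)
    comfort = np.random.uniform(-noise_level / 4, noise_level / 4, 128)
    return voiced + unvoiced + comfort        # fused concealment frame
```

A production implementation would use a proper anti-aliased resampler and take only the span of the IFFT output matching the longest candidate pitch period, as the text describes.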
according to some embodiments, the method further comprises:
under the condition that the voice compensation signal is generated according to the background comfort noise, generating meaning expression word options of the whole voice based on adjacent voice information;
acquiring a selection result of the user based on the meaning expression word option;
sending the meaning expression words corresponding to the selection result to a receiving end of the voice message.
For example, when the speech compensation signal is generated according to the background comfort noise, the recent historical speech information used in the calculation contains no voiced signal, so packet loss compensation is prone to distortion or inaccuracy, and the sender may not know whether the receiver has accurately understood the information actually intended. Meaning expression word options for the whole speech segment may therefore be generated through context prediction based on adjacent speech information, so that the sender can select the predicted wording, further ensuring the accuracy of information transfer.
According to some embodiments, the method further comprises:
and under the condition that the voice compensation signal is generated according to the background comfort noise and the receiving of the voice information of the sending end is finished, the voice information comprising the voice compensation signal is played at the sending end.
For example, when the voice compensation signal is generated according to the background comfort noise and reception of the voice information from the sending end has finished, packet loss compensation is likely to be distorted or inaccurate, and the sending end may not know whether the receiving end has accurately understood the information actually intended. The voice information including the voice compensation signal may therefore be played back at the sending end, so that the sender can judge, from the automatically played compensated voice information, whether the information received by the receiving end is accurate, further ensuring the accuracy of information transmission.
Referring to fig. 2, an embodiment of the speech signal processing apparatus described in the embodiment of the present application may include:
a converting unit 201, configured to convert a preset-length historical speech signal into a first frequency domain signal;
a determining unit 202, configured to determine alternative fundamental tone frequency points in the first frequency domain signal based on frequency point amplitudes of different fundamental tone frequencies;
an obtaining unit 203, configured to set zero to a frequency point amplitude corresponding to the first frequency domain signal except for the alternative fundamental tone frequency point to obtain a second frequency domain signal when the alternative fundamental tone frequency point exists in the first frequency domain signal;
the determining unit 202 is further configured to determine a target voiced sound signal based on the second frequency-domain signal;
a generating unit 204, configured to generate a speech compensation signal according to the target voiced sound signal.
According to the voice signal processing method provided by this embodiment, a historical voice signal of a preset length is converted into a first frequency domain signal; alternative fundamental tone frequency points in the first frequency domain signal are determined based on the frequency point voiced amplitudes of different fundamental tone frequencies and the corresponding threshold amplitudes; when alternative fundamental tone frequency points exist in the first frequency domain signal, the frequency point amplitudes in the first frequency domain signal other than those corresponding to the alternative fundamental tone frequency points are set to zero to obtain a second frequency domain signal; a target voiced signal is determined based on the second frequency domain signal; and a speech compensation signal is generated from the target voiced signal. The signal is analyzed at the frequency domain level: according to the quasi-periodicity of voiced signals, the frequency points of the voiced signal are determined from the voiced amplitude of each frequency point, and frequency points that contain no voiced signal, or only a weak one, are filtered out. Noise signals and unvoiced signals are thereby removed as far as possible, so that the finally obtained voiced signal is closer to the real situation, the accuracy of voiced signal estimation is improved, and the amount of calculation remains light, solving the problem that packet loss compensation of voice signals in existing algorithms is computationally heavy.
Fig. 2 above describes the speech signal processing apparatus in the embodiment of the present application from the perspective of a modular functional entity, and the following describes the speech signal processing apparatus in the embodiment of the present application in detail from the perspective of hardware processing, referring to fig. 3, an embodiment of the speech signal processing apparatus 300 in the embodiment of the present application includes:
an input device 301, an output device 302, a processor 303 and a memory 304, wherein the number of the processor 303 may be one or more, and one processor 303 is taken as an example in fig. 3. In some embodiments of the present application, the input device 301, the output device 302, the processor 303 and the memory 304 may be connected by a bus or other means, wherein fig. 3 illustrates the connection by the bus.
Wherein, by calling the operation instruction stored in the memory 304, the processor 303 is configured to perform the following steps:
converting a historical voice signal with a preset length into a first frequency domain signal;
determining alternative fundamental tone frequency points in the first frequency domain signal based on the voiced amplitude values of the frequency points of different fundamental tone frequencies and the corresponding threshold amplitude values;
under the condition that the alternative fundamental tone frequency points exist in the first frequency domain signal, setting the frequency point amplitudes in the first frequency domain signal other than those corresponding to the alternative fundamental tone frequency points to zero to obtain a second frequency domain signal;
determining a target voiced sound signal based on the second frequency-domain signal;
generating a speech compensation signal from the target voiced sound signal.
Optionally, the method further comprises:
determining, by the formula

thr_i(k) = α · thr_{i-1}(k) + (1 - α) · |X_i(k)|,

the corresponding threshold amplitudes thr_i(k) for different pitch frequencies, wherein α is a smoothing factor and |X_i(k)| is the amplitude of the kth frequency bin after the FFT of the ith iteration value.
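A one-line sketch of this recursive threshold update; the value of the smoothing factor is an assumed illustration:

```python
def update_threshold(prev_thr, magnitude, alpha=0.9):
    """One step of the recursive per-bin threshold update:
    thr_i(k) = alpha * thr_{i-1}(k) + (1 - alpha) * |X_i(k)|."""
    return alpha * prev_thr + (1.0 - alpha) * magnitude
```

Applied per bin across iterations, this keeps each threshold tracking a smoothed history of that bin's FFT amplitude.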
Optionally, the determining the candidate pitch bins in the first frequency domain signal based on the voiced amplitude of the bins with different pitch frequencies and the corresponding threshold amplitude includes:
and selecting the pitch frequency points with the frequency point voiced sound amplitude and/or the integer multiple frequency point voiced sound amplitude larger than the corresponding threshold amplitude as alternative pitch frequency points.
Optionally, the selecting, as the candidate pitch frequency points, the pitch frequency points whose voiced sound amplitudes of the frequency points and/or integer-times frequency point voiced sound amplitudes are greater than the corresponding threshold amplitudes includes:
selecting, in the case where three or more fundamental tone frequency points have a frequency point voiced amplitude and/or an integer-multiple frequency point voiced amplitude greater than the corresponding threshold amplitude, the three fundamental tone frequency points with the largest frequency point voiced amplitudes as the alternative fundamental tone frequency points;
and, in the case where fewer than three such fundamental tone frequency points are selected, sorting the frequency point voiced amplitudes of the remaining fundamental tone frequency points in descending order and supplementing the remaining fundamental tone frequency points to the alternative fundamental tone frequency points in that order until three alternative fundamental tone frequency points are obtained.
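The two selection cases above can be sketched together; the amplitude and threshold arrays are illustrative inputs:

```python
def pick_candidate_bins(amps, thrs, n=3):
    """Bins whose voiced amplitude exceeds their threshold are preferred,
    largest first; if fewer than n qualify, the remaining bins are ranked
    by amplitude (descending) and used to fill up to n candidates."""
    above = [k for k, (a, t) in enumerate(zip(amps, thrs)) if a > t]
    above.sort(key=lambda k: amps[k], reverse=True)
    if len(above) >= n:
        return above[:n]
    rest = [k for k in range(len(amps)) if k not in above]
    rest.sort(key=lambda k: amps[k], reverse=True)
    return above + rest[:n - len(above)]
```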
Optionally, the method further comprises:
and generating a voice compensation signal according to the background comfort noise under the condition that the first frequency domain signal does not have the alternative fundamental tone frequency point.
Optionally, the determining a target voiced sound signal based on the second frequency-domain signal includes:
converting the second frequency domain signal into an alternative time domain signal;
and selecting the signal with the longest pitch period from the alternative time domain signals as a target voiced sound signal.
Optionally, the method further includes:
under the condition that the voice compensation signal is generated according to the background comfort noise, generating meaning expression word options of the whole voice based on adjacent voice information;
acquiring a selection result of the user based on the meaning expression word option;
sending the meaning expression words corresponding to the selection result to a receiving end of the voice message;
and/or,
and under the condition that the voice compensation signal is generated according to the background comfort noise and the receiving of the voice information of the sending end is finished, the voice information comprising the voice compensation signal is played at the sending end.
The processor 303 is also configured to perform any of the methods in the corresponding embodiments of fig. 1 by calling the operation instructions stored in the memory 304.
Referring to fig. 4, fig. 4 is a schematic view of an embodiment of an electronic system according to the present application.
As shown in fig. 4, the embodiment of the present application provides an electronic system 400, which includes a memory 410, a processor 420, and a computer program 411 stored in the memory 410 and runnable on the processor 420; when the processor 420 executes the computer program 411, the following steps are implemented:
converting a historical voice signal with a preset length into a first frequency domain signal;
determining alternative fundamental tone frequency points in the first frequency domain signal based on the voiced amplitude values of the frequency points of different fundamental tone frequencies and the corresponding threshold amplitude values;
under the condition that the alternative fundamental tone frequency points exist in the first frequency domain signal, setting the frequency point amplitudes in the first frequency domain signal other than those corresponding to the alternative fundamental tone frequency points to zero to obtain a second frequency domain signal;
determining a target voiced sound signal based on the second frequency-domain signal;
and generating a voice compensation signal according to the target voiced sound signal.
Optionally, the method further comprises:
determining, by the formula

thr_i(k) = α · thr_{i-1}(k) + (1 - α) · |X_i(k)|,

the corresponding threshold amplitudes thr_i(k) for different fundamental frequencies, wherein α is a smoothing factor and |X_i(k)| is the amplitude of the kth frequency bin after the FFT of the ith iteration value.
Optionally, the determining the candidate pitch bins in the first frequency domain signal based on the voiced amplitude of the bins with different pitch frequencies and the corresponding threshold amplitude includes:
and selecting the pitch frequency points with the frequency point voiced sound amplitude and/or the integer multiple frequency point voiced sound amplitude larger than the corresponding threshold amplitude as alternative pitch frequency points.
Optionally, the selecting, as the candidate pitch frequency points, the pitch frequency points whose voiced sound amplitudes of the frequency points and/or integer-times frequency point voiced sound amplitudes are greater than the corresponding threshold amplitudes includes:
selecting, in the case where three or more fundamental tone frequency points have a frequency point voiced amplitude and/or an integer-multiple frequency point voiced amplitude greater than the corresponding threshold amplitude, the three fundamental tone frequency points with the largest frequency point voiced amplitudes as the alternative fundamental tone frequency points;
and, in the case where fewer than three such fundamental tone frequency points are selected, sorting the frequency point voiced amplitudes of the remaining fundamental tone frequency points in descending order and supplementing the remaining fundamental tone frequency points to the alternative fundamental tone frequency points in that order until three alternative fundamental tone frequency points are obtained.
Optionally, the method further comprises:
and generating a voice compensation signal according to the background comfort noise under the condition that the first frequency domain signal does not have the alternative fundamental tone frequency point.
Optionally, the determining a target voiced sound signal based on the second frequency-domain signal includes:
converting the second frequency domain signal into an alternative time domain signal;
and selecting the signal with the longest pitch period from the alternative time domain signals as a target voiced sound signal.
Optionally, the method further includes:
under the condition that the voice compensation signal is generated according to the background comfort noise, generating meaning expression word options of the whole voice based on adjacent voice information;
acquiring a selection result of the user based on the meaning expression word option;
sending the meaning expression words corresponding to the selection result to a receiving end of the voice message;
and/or,
and under the condition that the voice compensation signal is generated according to the background comfort noise and the receiving of the voice information of the sending end is finished, the voice information comprising the voice compensation signal is played at the sending end.
In a specific implementation, when the processor 420 executes the computer program 411, any of the embodiments corresponding to fig. 1 may be implemented.
Since the electronic system described in this embodiment is a device for implementing the speech signal processing apparatus of the embodiments of the present application, based on the methods described herein, those skilled in the art can understand the specific implementation of this electronic system and its variations. How the electronic system implements the methods of the embodiments of the present application is therefore not described in detail here; any device that a person skilled in the art uses to implement the methods of the embodiments of the present application falls within the scope of protection intended by the present application.
Referring to fig. 5, fig. 5 is a schematic diagram of an embodiment of a computer-readable storage medium according to an embodiment of the present application.
As shown in fig. 5, the present embodiment provides a computer-readable storage medium 500, on which a computer program 511 is stored, the computer program 511 implementing the following steps when executed by a processor:
converting a historical voice signal with a preset length into a first frequency domain signal;
determining alternative fundamental tone frequency points in the first frequency domain signal based on the voiced sound amplitude values of the frequency points of different fundamental tone frequencies and the corresponding threshold amplitude values;
under the condition that the alternative fundamental tone frequency points exist in the first frequency domain signal, setting the frequency point amplitudes in the first frequency domain signal other than those corresponding to the alternative fundamental tone frequency points to zero to obtain a second frequency domain signal;
determining a target voiced sound signal based on the second frequency-domain signal;
generating a speech compensation signal from the target voiced sound signal.
Optionally, the method further comprises:
determining, by the formula

thr_i(k) = α · thr_{i-1}(k) + (1 - α) · |X_i(k)|,

the corresponding threshold amplitudes thr_i(k) for different fundamental frequencies, wherein α is a smoothing factor and |X_i(k)| is the amplitude of the kth frequency bin after the FFT of the ith iteration value.
Optionally, the determining the candidate pitch bins in the first frequency domain signal based on the voiced amplitude of the bins with different pitch frequencies and the corresponding threshold amplitude includes:
and selecting the pitch frequency points with the voiced sound amplitude of the frequency point and/or the voiced sound amplitude of the integral multiple frequency points larger than the corresponding threshold amplitude as alternative pitch frequency points.
Optionally, the selecting, as the candidate pitch frequency points, the pitch frequency points whose voiced sound amplitudes of the frequency points and/or integer-times frequency point voiced sound amplitudes are greater than the corresponding threshold amplitudes includes:
selecting, in the case where three or more fundamental tone frequency points have a frequency point voiced amplitude and/or an integer-multiple frequency point voiced amplitude greater than the corresponding threshold amplitude, the three fundamental tone frequency points with the largest frequency point voiced amplitudes as the alternative fundamental tone frequency points;
and, in the case where fewer than three such fundamental tone frequency points are selected, sorting the frequency point voiced amplitudes of the remaining fundamental tone frequency points in descending order and supplementing the remaining fundamental tone frequency points to the alternative fundamental tone frequency points in that order until three alternative fundamental tone frequency points are obtained.
Optionally, the method further comprises:
and generating a voice compensation signal according to the background comfort noise under the condition that the alternative fundamental tone frequency point does not exist in the first frequency domain signal.
Optionally, the determining a target voiced sound signal based on the second frequency-domain signal includes:
converting the second frequency domain signal into an alternative time domain signal;
and selecting the signal with the longest pitch period from the alternative time domain signals as a target voiced sound signal.
Optionally, the method further includes:
under the condition that the voice compensation signal is generated according to the background comfort noise, generating meaning expression word options of the whole voice based on adjacent voice information;
acquiring a selection result of the user based on the meaning expression word option;
sending the meaning expression words corresponding to the selection result to a receiving end of the voice message;
and/or,
and under the condition that the voice compensation signal is generated according to the background comfort noise and the receiving of the voice information of the sending end is finished, the voice information comprising the voice compensation signal is played at the sending end.
In a specific implementation, the computer program 511 may implement any of the embodiments corresponding to fig. 1 when executed by a processor.
It should be noted that, in the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to relevant descriptions of other embodiments for parts that are not described in detail in a certain embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Embodiments of the present application further provide a computer program product, which includes computer software instructions, when the computer software instructions are run on a processing device, the processing device is caused to execute the flow in the speech signal processing method in the corresponding embodiment of fig. 1.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, which are essential or part of the technical solutions contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (9)

1. A speech signal processing method, comprising:
converting a historical voice signal with a preset length into a first frequency domain signal;
determining alternative fundamental tone frequency points in the first frequency domain signal based on the voiced amplitude values of the frequency points of different fundamental tone frequencies and the corresponding threshold amplitude values;
under the condition that the alternative fundamental tone frequency points exist in the first frequency domain signal, setting the frequency point amplitudes in the first frequency domain signal other than those corresponding to the alternative fundamental tone frequency points to zero to obtain a second frequency domain signal;
converting the second frequency domain signal into an alternative time domain signal;
selecting a signal with the longest pitch period from the alternative time domain signals as a target voiced sound signal;
and generating a voice compensation signal according to the target voiced sound signal.
2. The method of claim 1, further comprising:
determining, by the formula

thr_i(k) = α · thr_{i-1}(k) + (1 - α) · |X_i(k)|,

the corresponding threshold amplitudes thr_i(k) for different pitch frequencies, wherein α is a smoothing factor and |X_i(k)| is the amplitude of the kth frequency bin after the FFT of the ith iteration value.
3. The method of claim 1, wherein determining alternative pitch bins in the first frequency domain signal based on bin voiced amplitudes for different pitch frequencies and corresponding threshold amplitudes comprises:
and selecting the pitch frequency points with the frequency point voiced sound amplitude and/or the integer multiple frequency point voiced sound amplitude larger than the corresponding threshold amplitude as alternative pitch frequency points.
4. The method of claim 3, wherein selecting the pitch frequency bins whose bin voiced amplitude and/or integer-multiple bin voiced amplitude is greater than the corresponding threshold amplitude as candidate pitch frequency bins comprises:
in a case where at least three pitch frequency bins have a bin voiced amplitude and/or integer-multiple bin voiced amplitude greater than or equal to the corresponding threshold amplitude, selecting the three pitch frequency bins with the largest bin voiced amplitudes as the candidate pitch frequency bins;
and, in a case where fewer than three pitch frequency bins have a bin voiced amplitude and/or integer-multiple bin voiced amplitude greater than the corresponding threshold amplitude, sorting the remaining pitch frequency bins by bin voiced amplitude in descending order and adding them, in that order, to the candidate pitch frequency bins until three candidate pitch frequency bins are obtained.
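Claims 3 and 4 together describe a keep-the-top-three selection with a top-up rule when fewer than three bins clear their thresholds. A minimal sketch (all names are assumptions; ties and the harmonic-amplitude bookkeeping are simplified) is:

```python
def select_candidate_bins(voiced_amp, harmonic_amp, thresholds, k=3):
    """Pick candidate pitch bins per claims 3-4 (illustrative sketch).

    A bin qualifies when its voiced amplitude and/or its integer-multiple
    (harmonic) amplitude reaches the per-bin threshold; keep the k largest
    qualifiers, topping up from the remaining bins in descending voiced
    amplitude so exactly k candidates are returned."""
    n = len(voiced_amp)
    qualifying = [i for i in range(n)
                  if voiced_amp[i] >= thresholds[i] or harmonic_amp[i] >= thresholds[i]]
    qualifying.sort(key=lambda i: voiced_amp[i], reverse=True)
    candidates = qualifying[:k]
    if len(candidates) < k:
        rest = sorted((i for i in range(n) if i not in candidates),
                      key=lambda i: voiced_amp[i], reverse=True)
        candidates += rest[:k - len(candidates)]
    return candidates
```

With five bins of voiced amplitude `[5, 1, 4, 3, 2]` and a uniform threshold, the function returns the indices of the three strongest bins regardless of how many clear the threshold, matching the claim's two cases.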
5. The method of claim 1, further comprising:
generating the speech compensation signal from background comfort noise in a case where no candidate pitch frequency bin exists in the first frequency-domain signal.
6. The method of claim 5, further comprising:
in a case where the speech compensation signal is generated from background comfort noise, generating wording options expressing the meaning of the whole utterance based on adjacent speech information;
obtaining the user's selection among the wording options;
sending the wording corresponding to the selection result to the receiving end of the voice message;
and/or,
in a case where the speech compensation signal is generated from background comfort noise and reception of the voice information from the sending end has finished, playing the voice information including the speech compensation signal at the sending end.
7. A speech signal processing apparatus, comprising:
a conversion unit configured to convert a historical speech signal of a preset length into a first frequency-domain signal;
a determining unit configured to determine candidate pitch frequency bins in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes;
an obtaining unit configured to, in a case where candidate pitch frequency bins exist in the first frequency-domain signal, set to zero the bin amplitudes in the first frequency-domain signal other than the candidate pitch frequency bins, to obtain a second frequency-domain signal;
the determining unit being further configured to convert the second frequency-domain signal into candidate time-domain signals and to select, from the candidate time-domain signals, the signal with the longest pitch period as the target voiced signal;
and a generating unit configured to generate a speech compensation signal from the target voiced signal.
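The claimed units map naturally onto a small class. The skeleton below is illustrative only (the generating unit is stubbed as simple repetition, and pitch-period selection is omitted); method and parameter names are assumptions:

```python
import numpy as np

class SpeechCompensator:
    """Sketch of the claimed apparatus; unit names follow claim 7."""

    def convert(self, history):
        # conversion unit: historical speech -> first frequency-domain signal
        return np.fft.rfft(history)

    def determine(self, spectrum, thresholds):
        # determining unit: bins whose amplitude reaches the per-bin threshold
        amps = np.abs(spectrum)
        return [i for i, a in enumerate(amps) if a >= thresholds[i]]

    def obtain(self, spectrum, candidates, frame_len):
        # obtaining unit: zero every bin except the candidates, back to time domain
        kept = np.zeros_like(spectrum)
        kept[candidates] = spectrum[candidates]
        return np.fft.irfft(kept, n=frame_len)

    def generate(self, target_voiced, repeats=2):
        # generating unit (stub): extend the target voiced signal by repetition
        return np.tile(target_voiced, repeats)
```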
8. An electronic system comprising a memory and a processor, wherein the processor is configured to carry out the steps of the speech signal processing method according to any one of claims 1 to 6 when executing a computer program stored in the memory.
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the speech signal processing method according to any one of claims 1 to 6.
CN202210286433.2A 2022-03-23 2022-03-23 Voice signal processing method, device, system and storage medium Active CN114387989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210286433.2A CN114387989B (en) 2022-03-23 2022-03-23 Voice signal processing method, device, system and storage medium

Publications (2)

Publication Number Publication Date
CN114387989A CN114387989A (en) 2022-04-22
CN114387989B true CN114387989B (en) 2022-07-01

Family

ID=81205899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210286433.2A Active CN114387989B (en) 2022-03-23 2022-03-23 Voice signal processing method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN114387989B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2398980B * 2003-02-27 2005-09-14 Motorola Inc Speech communication unit and method for synthesising speech therein
CN1901431A * 2006-07-04 2007-01-24 Huawei Technologies Co., Ltd. Lost frame hiding method and device
CN101887723A * 2007-06-14 2010-11-17 Huawei Device Co., Ltd. Fine tuning method and device for pitch period
WO2017084545A1 * 2015-11-19 2017-05-26 China Academy of Telecommunications Technology Method and system for voice packet loss concealment
CN106856093A * 2017-02-23 2017-06-16 Hisense Group Co., Ltd. Audio information processing method, intelligent terminal and voice control terminal
CN109346109A * 2018-12-05 2019-02-15 Baidu Online Network Technology (Beijing) Co., Ltd. Fundamental frequency extraction method and device
CN111653285A * 2020-06-01 2020-09-11 Beijing Yuanli Weilai Technology Co., Ltd. Packet loss compensation method and device
CN113421584A * 2021-07-05 2021-09-21 Ping An Technology (Shenzhen) Co., Ltd. Audio noise reduction method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
MX2018010754A (en) * 2016-03-07 2019-01-14 Fraunhofer Ges Forschung Error concealment unit, audio decoder, and related method and computer program fading out a concealed audio frame out according to different damping factors for different frequency bands.

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
E. Gunduzhan et al., "Linear prediction based packet loss concealment algorithm for PCM coded speech", IEEE Transactions on Speech and Audio Processing, 2001, Vol. 9, No. 8. *
Yang Shunliao, "Application of homomorphic deconvolution in pitch detection", Computer Engineering and Applications, 2013, Vol. 49, No. 24. *

Also Published As

Publication number Publication date
CN114387989A (en) 2022-04-22

Similar Documents

Publication Publication Date Title
KR101831078B1 (en) Voice Activation Detection Method and Device
US7286980B2 (en) Speech processing apparatus and method for enhancing speech information and suppressing noise in spectral divisions of a speech signal
JP6559741B2 (en) Audio signal resampling for low-delay encoding / decoding
KR101737824B1 (en) Method and Apparatus for removing a noise signal from input signal in a noisy environment
CN100520913C (en) Method of enhancing quality of speech and apparatus thereof
JPH0820878B2 (en) Parallel processing type pitch detector
KR102012325B1 (en) Estimation of background noise in audio signals
JP2016541004A5 (en)
KR20130095726A (en) Controlling a noise-shaping feedback loop in a digital audio signal encoder
WO2021093808A1 (en) Detection method and apparatus for effective voice signal, and device
US7630432B2 (en) Method for analysing the channel impulse response of a transmission channel
JP4551817B2 (en) Noise level estimation method and apparatus
CN114387989B (en) Voice signal processing method, device, system and storage medium
JP6728142B2 (en) Method and apparatus for identifying and attenuating pre-echo in a digital audio signal
US8280725B2 (en) Pitch or periodicity estimation
US7343284B1 (en) Method and system for speech processing for enhancement and detection
CN106415718B (en) Linear prediction analysis device, method and recording medium
JP2013205831A (en) Voice quality objective evaluation device and method
EP2230664A1 (en) Method and apparatus for attenuating noise in an input signal
WO2014018662A1 (en) Method of extracting zero crossing data from full spectrum signals
JP5152800B2 (en) Noise suppression evaluation apparatus and program
WO2006127968A1 (en) Restoring audio signals corrupted by impulsive noise
CN108848435B (en) Audio signal processing method and related device
JPH0844395A (en) Voice pitch detecting device
KR101176207B1 (en) Audio communication system and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant