CN114387989B - Voice signal processing method, device, system and storage medium - Google Patents

Publication number
CN114387989B
Authority
CN
China
Prior art keywords
frequency
signal
voiced
amplitude
domain signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210286433.2A
Other languages
Chinese (zh)
Other versions
CN114387989A
Inventor
Zhang Bin
Yi Xin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huijin Chunhua Technology Co ltd
Original Assignee
Beijing Huijin Chunhua Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huijin Chunhua Technology Co ltd
Priority to CN202210286433.2A
Publication of CN114387989A
Application granted
Publication of CN114387989B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain

Abstract

An embodiment of the present application provides a voice signal processing method comprising: converting a historical voice signal of preset length into a first frequency-domain signal; determining candidate pitch frequency bins in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes; when candidate pitch bins exist in the first frequency-domain signal, setting to zero the amplitudes of all bins other than the candidate pitch bins to obtain a second frequency-domain signal; determining a target voiced signal based on the second frequency-domain signal; and generating a speech compensation signal from the target voiced signal. The method analyzes the signal in the frequency domain: exploiting the quasi-periodicity of voiced speech, it identifies the bins that carry the voiced signal from the voiced amplitude of each bin and filters out bins that contain no voiced energy or only weak voiced energy, removing noise and unvoiced components as far as possible. The resulting voiced signal is thus closer to the real one, the accuracy of voiced-signal estimation is improved, the computational load stays light, and the problem of the large packet-loss-compensation workload of existing speech algorithms is solved.

Description

Voice signal processing method, device, system and storage medium
Technical Field
The present application relates to the field of speech signal processing technologies, and in particular, to a speech signal processing method, apparatus, system, and storage medium.
Background
For network voice communication, if there is no data to be played in the voice buffer, packet-loss compensation is needed, that is, current data is generated from historical data; otherwise the peer clearly hears the sound break up. In a typical packet-loss-compensation scheme, the pitch period is calculated from the historical signal, the voiced signal at a historical time is used as the current voiced signal, and an unvoiced signal and background noise are superimposed on it to produce the output speech signal.
The data used when calculating the pitch period also contains unvoiced signals and noise, which affects the result to some extent; moreover, the correlation computation is expensive and adds system overhead.
In addition, when the compensated data packets are assembled, a combination of a voiced signal, an unvoiced signal, and comfortable background noise is desired. In practice, however, the historical segment taken for the combination almost never contains only voiced sound: it contains voiced sound, unvoiced sound, and background noise together, or only background noise, resulting in a poor compensation effect.
Disclosure of Invention
The embodiments of the present application provide a voice signal processing method, apparatus, system, and storage medium, which can solve the problem of low reception accuracy at the receiving end caused by the large system overhead and poor voice-compensation effect of existing schemes.
A first aspect of an embodiment of the present application provides a speech signal processing method, including:
converting a historical voice signal of preset length into a first frequency-domain signal;
determining candidate pitch frequency bins in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes;
when candidate pitch bins exist in the first frequency-domain signal, setting to zero the amplitudes of all bins in the first frequency-domain signal other than the candidate pitch bins to obtain a second frequency-domain signal;
determining a target voiced signal based on the second frequency-domain signal;
and generating a speech compensation signal from the target voiced signal.
Optionally, the method further comprises:
by the formula

T_i(k) = α · T_{i-1}(k) + (1 - α) · A_i(k)

determining the corresponding threshold amplitudes T_i(k) for different pitch frequencies, where α is a smoothing factor and A_i(k) is the amplitude of the kth frequency bin obtained by FFT in the ith iteration.
Optionally, the determining of candidate pitch frequency bins in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes comprises:
selecting, as candidate pitch frequency bins, those pitch bins whose voiced amplitude and/or whose integer-multiple (harmonic) bin voiced amplitude is greater than the corresponding threshold amplitude.
Optionally, the selecting of the candidate pitch frequency bins comprises:
when three or more pitch bins whose voiced amplitude and/or integer-multiple bin voiced amplitude is greater than the corresponding threshold amplitude are found, selecting the three pitch bins with the largest voiced amplitude as the candidate pitch bins;
when fewer than three such pitch bins are found, sorting the remaining pitch bins by voiced amplitude in descending order and appending them, in that order, to the candidates until three candidate pitch bins are obtained.
Optionally, the method further comprises:
and generating the speech compensation signal from background comfort noise when no candidate pitch frequency bin exists in the first frequency-domain signal.
Optionally, the determining a target voiced sound signal based on the second frequency-domain signal includes:
converting the second frequency-domain signal into a candidate time-domain signal;
and taking from the candidate time-domain signal a segment whose length equals the longest candidate pitch period as the target voiced signal.
Optionally, the method further includes:
when the speech compensation signal is generated from background comfort noise, generating meaning-expression word options for the whole speech segment based on adjacent speech information;
obtaining the user's selection among the meaning-expression word options;
and sending the meaning-expression words corresponding to the selection to the receiving end of the voice message;
and/or,
when the speech compensation signal is generated from background comfort noise and reception of the sending end's voice information has finished, playing the voice information including the speech compensation signal at the sending end.
A second aspect of the embodiments of the present application provides a speech signal processing apparatus, including:
a conversion unit, configured to convert a historical voice signal of preset length into a first frequency-domain signal;
a determining unit, configured to determine candidate pitch frequency bins in the first frequency-domain signal based on the bin amplitudes of different pitch frequencies;
an obtaining unit, configured to, when candidate pitch bins exist in the first frequency-domain signal, set to zero the amplitudes of all bins in the first frequency-domain signal other than the candidate pitch bins to obtain a second frequency-domain signal;
the determining unit being further configured to determine a target voiced signal based on the second frequency-domain signal;
and a generating unit, configured to generate a speech compensation signal from the target voiced signal.
A third aspect of the embodiments of the present application provides an electronic system, which includes a memory and a processor, where the processor is configured to implement the steps of the above speech signal processing method when executing a computer program stored in the memory.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above-mentioned speech signal processing method.
In summary, the voice signal processing method provided by the embodiments of the present application converts a historical voice signal of preset length into a first frequency-domain signal; determines candidate pitch frequency bins in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes; when candidate pitch bins exist, sets to zero the amplitudes of all bins other than the candidate pitch bins to obtain a second frequency-domain signal; determines a target voiced signal based on the second frequency-domain signal; and generates a speech compensation signal from the target voiced signal. The method analyzes the signal in the frequency domain: exploiting the quasi-periodicity of voiced speech, it identifies the bins that carry the voiced signal from the voiced amplitude of each bin and filters out bins that contain no voiced energy or only weak voiced energy, removing noise and unvoiced components as far as possible. The resulting voiced signal is thus closer to the real one, the accuracy of voiced-signal estimation is improved, the computational load stays light, and the problem of the large packet-loss-compensation workload of existing speech algorithms is solved.
Accordingly, the speech signal processing apparatus, the electronic system, and the computer-readable storage medium provided by the embodiments of the present application have the same technical effects.
Drawings
Fig. 1 is a schematic flowchart of a possible speech signal processing method according to an embodiment of the present application;
fig. 2 is a schematic block diagram of a possible speech signal processing apparatus according to an embodiment of the present application;
fig. 3 is a schematic hardware structure diagram of a possible speech signal processing apparatus according to an embodiment of the present disclosure;
FIG. 4 is a schematic block diagram of a possible electronic system according to an embodiment of the present disclosure;
fig. 5 is a schematic structural block diagram of a possible computer-readable storage medium provided in an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a voice signal processing method, apparatus, system, and storage medium, which can solve the problem of low reception accuracy at the receiving end caused by the large system overhead and poor voice-compensation effect of existing schemes.
The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the above-described drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
In some examples, when a voice packet is lost for the first time, pitch-period detection may be performed on the historical data (30 ms plus a small amount of extra data). Detection computes correlation values and then fine-tunes the result; the pitch period obtained is stored, and algorithm parameters are computed from the historical signal. The AR filter coefficients can be found by the Levinson-Durbin algorithm from the most recent 20 ms of historical data. According to the pitch period, the most recent historical segment of the same length as the pitch period is taken as the voiced signal. In addition, random noise can be generated and passed through the AR filter to be output as the unvoiced signal, and comfort background noise may also be generated. Finally, the unvoiced and voiced signals are weighted and summed, the result is superimposed with the comfort background noise, and the sum is used as the packet-loss-compensation output. If this is not the first packet loss, these steps are skipped and the existing pitch period is used directly for compensation.
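For illustration only, the Levinson-Durbin step mentioned above can be sketched in Python. This is a generic textbook recursion for deriving AR (LPC) coefficients from autocorrelation values, not the patent's actual implementation; the function name, order, and inputs are assumptions.

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for AR (LPC) coefficients.

    r:     autocorrelation values r[0..order] of the historical signal
    order: AR model order
    Returns (a, err): a[0] = 1 and a[1..order] are the AR coefficients,
    err is the final prediction-error power.
    """
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for this order
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        # update coefficients a[1..i]
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

Driving white noise through the all-pole filter 1/A(z) defined by these coefficients would then produce the unvoiced-signal estimate described above.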
The above example has two problems:
1) The pitch period is calculated by a correlation method, exploiting the fact that the voiced signal is quasi-periodic; but the data used in the calculation also contains unvoiced signals and noise, which affects the compensation result to some extent. Moreover, because unvoiced signals and noise are included, the correlation computation is expensive and adds system overhead.
2) When performing packet-loss compensation, a combination of a voiced signal, an unvoiced signal, and comfortable background noise is desired. In practice, however, the historical segment taken for the combination almost never contains only voiced sound: it contains voiced sound, unvoiced sound, and background noise together, or only background noise.
Referring to fig. 1, an embodiment of the present application provides a method that addresses the low reception accuracy caused by large system overhead and a poor voice-compensation effect; the flow may specifically include steps S110-S150.
S110, converting a historical voice signal with a preset length into a first frequency domain signal;
For example, the preset length of the historical speech signal may be any length convenient for the subsequent calculation. For instance, 256 × fs_mult samples of historical speech are windowed with a Hann window, down-sampled to 8000 Hz, and transformed by FFT; the length of the down-sampled data is 256.
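For illustration only, step S110 might be sketched as follows with NumPy. The naive decimation (taking every fs_mult-th sample without an anti-aliasing filter) and the function name to_first_freq_domain are simplifying assumptions; the source does not specify the down-sampling method.

```python
import numpy as np

def to_first_freq_domain(history, fs_mult):
    """Window the last 256 * fs_mult historical samples with a Hann window,
    down-sample to 8000 Hz, and FFT to 256 bins (spacing 8000/256 = 31.25 Hz).
    Plain decimation is used here for brevity; a real implementation would
    low-pass filter before decimating."""
    n = 256 * fs_mult
    frame = np.asarray(history[-n:], dtype=float)
    frame = frame * np.hanning(n)      # Hann window
    down = frame[::fs_mult]            # naive decimation to 8000 Hz -> 256 samples
    return np.fft.fft(down)            # 256-point spectrum

# example: a 100 Hz tone sampled at 16 kHz (fs_mult = 2)
spec = to_first_freq_domain(np.sin(2 * np.pi * 100 * np.arange(512) / 16000), 2)
```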
S120, determining alternative fundamental tone frequency points in the first frequency domain signal based on the voiced sound amplitude values of the frequency points with different fundamental tone frequencies and the corresponding threshold amplitude values;
For example, a candidate pitch frequency bin is a bin that carries a valid voiced signal; there may be several candidate pitch bins.
S130, when candidate pitch bins exist in the first frequency-domain signal, setting to zero the amplitudes of all bins in the first frequency-domain signal other than the candidate pitch bins to obtain a second frequency-domain signal;
Illustratively, if candidate pitch bins exist in the first frequency-domain signal (that is, the signal contains a valid voiced component), the amplitudes of all bins other than the candidate pitch bins are set to zero, which clears the bins that contain no voiced signal or only a weak one.
S140, determining a target voiced sound signal based on the second frequency-domain signal;
For example, a relatively pure voiced signal, i.e., the target voiced signal, may be obtained by converting the second frequency-domain signal back to the time domain.
And S150, generating a voice compensation signal according to the target voiced sound signal.
Illustratively, random noise may be generated, passed through an AR filter, and output as the unvoiced signal; comfort background noise may also be generated. Finally, the unvoiced signal and the target voiced signal are weighted and summed, the result is superimposed with the comfort background noise, and the sum is used as the packet-loss-compensation output.
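For illustration only, the final mixing step can be sketched as follows. The weight values w_voiced and w_unvoiced are hypothetical; the source states that the signals are weighted and summed but does not give the weights.

```python
def mix_compensation(voiced, unvoiced, comfort_noise,
                     w_voiced=0.7, w_unvoiced=0.3):
    """Weighted sum of the voiced and unvoiced estimates, superimposed with
    comfort background noise, as the packet-loss-compensation output.
    The weights are illustrative placeholders, not values from the patent."""
    return [w_voiced * v + w_unvoiced * u + n
            for v, u, n in zip(voiced, unvoiced, comfort_noise)]
```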
According to the voice signal processing method provided by the above embodiment, a historical voice signal of preset length is converted into a first frequency-domain signal; candidate pitch frequency bins are determined in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes; when candidate pitch bins exist, the amplitudes of all bins other than the candidate pitch bins are set to zero to obtain a second frequency-domain signal; a target voiced signal is determined based on the second frequency-domain signal; and a speech compensation signal is generated from the target voiced signal. The method analyzes the signal in the frequency domain: exploiting the quasi-periodicity of voiced speech, it identifies the bins that carry the voiced signal from the voiced amplitude of each bin and filters out bins that contain no voiced energy or only weak voiced energy, removing noise and unvoiced components as far as possible. The resulting voiced signal is thus closer to the real one, the accuracy of voiced-signal estimation is improved, the computational load stays light, and the problem of the large packet-loss-compensation workload of existing speech algorithms is solved.
According to some embodiments, the method further comprises:
by the formula

T_i(k) = α · T_{i-1}(k) + (1 - α) · A_i(k)    (1)

determining the corresponding threshold amplitudes T_i(k) for different pitch frequencies, where α is a smoothing factor and A_i(k) is the amplitude of the kth frequency bin obtained by FFT in the ith iteration.
For example, the threshold amplitude of each bin may be calculated as follows. Taking a frame duration of 10 ms as an example, one frame contains 80 × fs_mult samples, where fs_mult is the ratio of the current sampling rate to 8000 Hz: at a sampling rate of 16000 Hz, fs_mult = 2, and at 48000 Hz, fs_mult = 6. In the initial stage, consecutive frames are accumulated until 256 × fs_mult samples are available for the threshold-amplitude estimate. The estimate is an iterative process, and each pass is recorded as one iteration value: the data are windowed, down-sampled to 8000 Hz (signal length 256 at this point), and transformed by FFT to obtain the amplitude of each bin, the bin spacing being 8000/256 = 31.25 Hz. Let T_i(k) denote the voiced-signal threshold of the kth bin in the ith iteration. The corresponding threshold amplitudes T_i(k) for different pitch frequencies can then be determined using equation (1) above. The smoothing factor may take the value 0.7, and the total number of iterations may be 20. The thresholds are estimated during the first 64 frames after system initialization, i.e., the initial 640 milliseconds.
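For illustration only, one iteration of the threshold update in equation (1) can be sketched as follows; the function name and the list-based representation are assumptions, while alpha = 0.7 follows the value suggested in the text.

```python
def update_thresholds(prev_thresholds, magnitudes, alpha=0.7):
    """One iteration of the threshold-amplitude estimate, equation (1):
    T_i(k) = alpha * T_{i-1}(k) + (1 - alpha) * A_i(k)."""
    return [alpha * t + (1.0 - alpha) * a
            for t, a in zip(prev_thresholds, magnitudes)]

# Iterating 20 times over constant bin magnitudes converges toward them.
t = [0.0] * 4
for _ in range(20):
    t = update_thresholds(t, [1.0, 2.0, 3.0, 4.0])
```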
According to some embodiments, the determining alternative pitch bins in the first frequency domain signal based on the bin voiced amplitudes of different pitch frequencies and corresponding threshold amplitudes comprises:
selecting, as candidate pitch frequency bins, those pitch bins whose voiced amplitude and/or whose integer-multiple (harmonic) bin voiced amplitude is greater than the corresponding threshold amplitude.
Illustratively, owing to the quasi-periodicity of the voiced signal, if the pitch frequency is fB, larger amplitudes appear at the positions fB, 2fB, 3fB, ..., k·fB on the spectrogram. Meanwhile, to avoid the influence of noise and unvoiced data, those components must be filtered out as far as possible. For finer filtering, a threshold amplitude can be set for each bin: a bin whose amplitude exceeds its threshold is considered to contain a valid voiced signal; otherwise it does not, and it is filtered out. Since the human pitch period ranges roughly from 2.5 ms to 15 ms, corresponding to a pitch frequency of 67 Hz to 400 Hz, the computed candidate pitch frequencies all lie in this range. The pitch frequency is selected by frequency-domain analysis and must satisfy two conditions: first, the amplitude at the pitch-frequency bin is larger than that bin's voiced threshold; second, if the amplitude at bin fB is large, the amplitudes at the integer-multiple bins of fB should also be large.
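For illustration only, the two selection conditions above can be sketched as a per-bin test. The number of harmonics checked (three) is an assumption; the source only requires that the integer-multiple bins of fB also have large amplitudes.

```python
def candidate_pitch_bins(magnitudes, thresholds, bin_hz=31.25,
                         f_min=67.0, f_max=400.0, harmonics=3):
    """A bin in the 67-400 Hz pitch range qualifies when its amplitude and
    the amplitudes of its first few integer-multiple (harmonic) bins all
    exceed the per-bin voiced thresholds."""
    lo = int(round(f_min / bin_hz))          # 67/31.25  = 2.144 -> bin 2
    hi = int(round(f_max / bin_hz))          # 400/31.25 = 12.8  -> bin 13
    candidates = []
    for k in range(lo, hi + 1):
        idxs = [k * h for h in range(1, harmonics + 1)]
        if all(i < len(magnitudes) and magnitudes[i] > thresholds[i]
               for i in idxs):
            candidates.append(k)
    return candidates
```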
According to some embodiments, the selecting a pitch frequency point with a voiced sound amplitude of a frequency point and/or an integer multiple of the voiced sound amplitude of the frequency point larger than the corresponding threshold amplitude as the candidate pitch frequency point includes:
when three or more pitch bins whose voiced amplitude and/or integer-multiple bin voiced amplitude is greater than the corresponding threshold amplitude are found, selecting the three pitch bins with the largest voiced amplitude as the candidate pitch bins;
when fewer than three such pitch bins are found, sorting the remaining pitch bins by voiced amplitude in descending order and appending them, in that order, to the candidates until three candidate pitch bins are obtained.
Illustratively, the interval between two adjacent bins may be 8000/256 = 31.25 Hz. An array may be used to record whether the amplitude of each bin is greater than that bin's voiced threshold: if it is, the corresponding position is recorded as 1, otherwise as 0; this array is denoted array. Since the down-sampled signal has 256 samples, array has 256 elements, each 0 or 1. A counting array pitch_array[16] is initialized to 0; all elements of array are traversed, and pitch_array and the magnitude array pitch_array_magn are updated according to equations (2) and (3):

pitch_array[k] = 1 if A(k) > T(k), else 0    (2)

pitch_array_magn[k] = FFT(k)    (3)

In equation (3), FFT(k) is the value of the kth bin after the FFT. The position corresponding to the minimum pitch frequency is 67/31.25 = 2.144 and that of the maximum pitch frequency is 400/31.25 = 12.8; rounding, the pitch frequencies are taken to lie at positions 2 to 13 of the array. pitch_array is traversed starting from index 2 to find each run of consecutive 1s, recording the position of its first element as s and of its last element as e. There may be several such runs, or the values of pitch_array from 2 to 13 may all be 0. If they are all 0, the segment is considered to contain no voiced signal. Otherwise, each run of 1s corresponds to one pitch frequency, calculated as:

fB = 31.25 · Σ(k=s..e) k · pitch_array_magn[k] / Σ(k=s..e) pitch_array_magn[k]    (4)

In equation (4), s is the start position of the run of 1s and e its end position; the formula is a weighted average from s to e with weights pitch_array_magn. The corresponding pitch period is 1/fB, and the corresponding number of samples is fs/fB at sampling rate fs. This yields several pitch periods, from which the final three candidate pitch periods are selected. The screening principle is: if there are three or more pitch periods, the three whose runs have the largest summed pitch_array_magn are taken as the candidates; if there are fewer than three, the existing pitch periods are all kept as candidates, and the remaining slots are filled with the pitch periods corresponding to the largest remaining pitch_array_magn values.
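For illustration only, the run-scanning and the weighted average of equation (4) can be sketched as follows; the function name and the flat-list representation of pitch_array and pitch_array_magn are assumptions.

```python
def pitch_from_runs(flags, magn, lo=2, hi=13, bin_hz=31.25):
    """Scan pitch_array (flags) over bins lo..hi, find each run of
    consecutive 1s (start s, end e), and compute that run's pitch
    frequency as the magnitude-weighted average bin index times the bin
    spacing, per equation (4). Returns pitch frequencies in Hz."""
    freqs = []
    k = lo
    while k <= hi:
        if flags[k]:
            s = k
            while k <= hi and flags[k]:
                k += 1
            e = k - 1                              # run covers bins s..e
            num = sum(j * magn[j] for j in range(s, e + 1))
            den = sum(magn[j] for j in range(s, e + 1))
            freqs.append(bin_hz * num / den)       # equation (4)
        else:
            k += 1
    return freqs
```

The corresponding pitch period of each returned frequency fB is 1/fB, i.e., fs/fB samples at sampling rate fs.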
According to some embodiments, further comprising:
and generating a voice compensation signal according to the background comfort noise under the condition that the first frequency domain signal does not have the alternative fundamental tone frequency point.
Illustratively, if the calculation finds that the signal contains no voiced component, the background comfort noise is used as the extended-signal estimate.
According to some embodiments, the determining a target voiced sound signal based on the second frequency-domain signal comprises:
converting the second frequency-domain signal into a candidate time-domain signal;
and taking from the candidate time-domain signal a segment whose length equals the longest candidate pitch period as the target voiced signal.
For example, if the calculation finds that a voiced signal is contained, 16 milliseconds of data, namely 128 × fs_mult samples, are selected from the historical data, a Hanning window is applied, and the windowed history is downsampled to 8000 Hz before the FFT. The length of the downsampled data is 128, so the frequency bin spacing after the FFT is 8000/128 = 62.5 Hz. The amplitudes of all frequency bins other than the three alternative pitch frequencies are set to 0, an IFFT is then performed, and a span of the IFFT output equal to the longest alternative pitch period is taken as the voiced signal. The unvoiced signal is generated by a random noise generator; the background comfort noise is then added, and the three components are fused to obtain the packet-loss-compensated signal.
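The reconstruction step can be sketched as follows. This is a hedged illustration, not the patented implementation: the candidate pitch bins, the noise levels, and the naive decimation-by-stride downsampling are placeholder assumptions.

```python
import numpy as np

def synthesize_concealment(history, fs=16000, cand_bins=(2, 3, 4),
                           noise_level=0.01):
    """Window 16 ms of history, downsample to 8 kHz, take a 128-point FFT
    (62.5 Hz bins), zero every bin except the candidate pitch bins, IFFT
    back, then mix the voiced estimate with random noise (unvoiced part)
    and comfort noise."""
    fs_mult = fs // 8000
    seg = history[-128 * fs_mult:] * np.hanning(128 * fs_mult)
    seg8k = seg[::fs_mult]                    # naive downsample to 8000 Hz
    spec = np.fft.rfft(seg8k, n=128)          # 8000/128 = 62.5 Hz per bin
    mask = np.zeros_like(spec)
    for b in cand_bins:                       # keep only candidate pitch bins
        mask[b] = spec[b]
    voiced = np.fft.irfft(mask, n=128)
    unvoiced = np.random.uniform(-noise_level, noise_level, 128)
    comfort = np.random.uniform(-noise_level / 4, noise_level / 4, 128)
    return voiced + unvoiced + comfort        # fused concealment frame
```

A production implementation would use a proper anti-aliased resampler and take only the span of the IFFT output matching the longest candidate pitch period, as the text describes.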
according to some embodiments, the method further comprises:
under the condition that the voice compensation signal is generated according to the background comfort noise, generating meaning expression word options of the whole voice based on adjacent voice information;
acquiring a selection result of the user based on the meaning expression word option;
sending the meaning expression words corresponding to the selection result to a receiving end of the voice message.
For example, when the speech compensation signal is generated according to the background comfort noise, the recent historical speech information used in the calculation contains no voiced signal, so packet loss compensation is prone to distortion or inaccuracy, and the sender may not know whether the receiver has accurately understood the information actually intended. Meaning expression word options for the whole speech segment may therefore be generated through context prediction based on adjacent speech information, so that the sender can select the predicted wording, further ensuring the accuracy of information transfer.
According to some embodiments, the method further comprises:
and under the condition that the voice compensation signal is generated according to the background comfort noise and the receiving of the voice information of the sending end is finished, the voice information comprising the voice compensation signal is played at the sending end.
For example, when the voice compensation signal is generated according to the background comfort noise and reception of the voice information from the sending end has finished, packet loss compensation is likely to be distorted or inaccurate, and the sending end may not know whether the receiving end has accurately understood the information actually intended. The voice information including the voice compensation signal may therefore be played back at the sending end, so that the sender can judge, from the automatically played compensated voice information, whether the information received by the receiving end is accurate, further ensuring the accuracy of information transmission.
Referring to fig. 2, an embodiment of the speech signal processing apparatus described in the embodiment of the present application may include:
a converting unit 201, configured to convert a preset-length historical speech signal into a first frequency domain signal;
a determining unit 202, configured to determine alternative fundamental tone frequency points in the first frequency domain signal based on frequency point amplitudes of different fundamental tone frequencies;
an obtaining unit 203, configured to set zero to a frequency point amplitude corresponding to the first frequency domain signal except for the alternative fundamental tone frequency point to obtain a second frequency domain signal when the alternative fundamental tone frequency point exists in the first frequency domain signal;
the determining unit 202 is further configured to determine a target voiced sound signal based on the second frequency-domain signal;
a generating unit 204, configured to generate a speech compensation signal according to the target voiced sound signal.
According to the voice signal processing method provided by this embodiment, a historical voice signal of a preset length is converted into a first frequency domain signal; alternative fundamental tone frequency points in the first frequency domain signal are determined based on the frequency point voiced amplitudes of different fundamental tone frequencies and the corresponding threshold amplitudes; when alternative fundamental tone frequency points exist in the first frequency domain signal, the frequency point amplitudes in the first frequency domain signal other than those corresponding to the alternative fundamental tone frequency points are set to zero to obtain a second frequency domain signal; a target voiced signal is determined based on the second frequency domain signal; and a speech compensation signal is generated from the target voiced signal. The signal is analyzed at the frequency domain level: according to the quasi-periodicity of voiced signals, the frequency points of the voiced signal are determined from the voiced amplitude of each frequency point, and frequency points that contain no voiced signal, or only a weak one, are filtered out. Noise signals and unvoiced signals are thereby removed as far as possible, so that the finally obtained voiced signal is closer to the real situation, the accuracy of voiced signal estimation is improved, and the amount of calculation remains light, solving the problem that packet loss compensation of voice signals in existing algorithms is computationally heavy.
Fig. 2 above describes the speech signal processing apparatus in the embodiment of the present application from the perspective of a modular functional entity, and the following describes the speech signal processing apparatus in the embodiment of the present application in detail from the perspective of hardware processing, referring to fig. 3, an embodiment of the speech signal processing apparatus 300 in the embodiment of the present application includes:
an input device 301, an output device 302, a processor 303 and a memory 304, wherein the number of the processor 303 may be one or more, and one processor 303 is taken as an example in fig. 3. In some embodiments of the present application, the input device 301, the output device 302, the processor 303 and the memory 304 may be connected by a bus or other means, wherein fig. 3 illustrates the connection by the bus.
Wherein, by calling the operation instruction stored in the memory 304, the processor 303 is configured to perform the following steps:
converting a historical voice signal with a preset length into a first frequency domain signal;
determining alternative fundamental tone frequency points in the first frequency domain signal based on the voiced amplitude values of the frequency points of different fundamental tone frequencies and the corresponding threshold amplitude values;
under the condition that the alternative fundamental tone frequency points exist in the first frequency domain signal, setting the frequency point amplitudes in the first frequency domain signal other than those corresponding to the alternative fundamental tone frequency points to zero to obtain a second frequency domain signal;
determining a target voiced sound signal based on the second frequency-domain signal;
generating a speech compensation signal from the target voiced sound signal.
Optionally, the method further comprises:
determining, by the formula

thr_i(k) = α · thr_{i-1}(k) + (1 - α) · |X_i(k)|,

the corresponding threshold amplitudes thr_i(k) for different pitch frequencies, wherein α is a smoothing factor and |X_i(k)| is the amplitude of the kth frequency bin after the FFT of the ith iteration value.
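A one-line sketch of this recursive threshold update; the value of the smoothing factor is an assumed illustration:

```python
def update_threshold(prev_thr, magnitude, alpha=0.9):
    """One step of the recursive per-bin threshold update:
    thr_i(k) = alpha * thr_{i-1}(k) + (1 - alpha) * |X_i(k)|."""
    return alpha * prev_thr + (1.0 - alpha) * magnitude
```

Applied per bin across iterations, this keeps each threshold tracking a smoothed history of that bin's FFT amplitude.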
Optionally, the determining the candidate pitch bins in the first frequency domain signal based on the voiced amplitude of the bins with different pitch frequencies and the corresponding threshold amplitude includes:
and selecting the pitch frequency points with the frequency point voiced sound amplitude and/or the integer multiple frequency point voiced sound amplitude larger than the corresponding threshold amplitude as alternative pitch frequency points.
Optionally, the selecting, as the candidate pitch frequency points, the pitch frequency points whose voiced sound amplitudes of the frequency points and/or integer-times frequency point voiced sound amplitudes are greater than the corresponding threshold amplitudes includes:
selecting, in the case where three or more fundamental tone frequency points have a frequency point voiced amplitude and/or an integer-multiple frequency point voiced amplitude greater than the corresponding threshold amplitude, the three fundamental tone frequency points with the largest frequency point voiced amplitudes as the alternative fundamental tone frequency points;
and, in the case where fewer than three such fundamental tone frequency points are selected, sorting the frequency point voiced amplitudes of the remaining fundamental tone frequency points in descending order and supplementing the remaining fundamental tone frequency points to the alternative fundamental tone frequency points in that order until three alternative fundamental tone frequency points are obtained.
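The two selection cases above can be sketched together; the amplitude and threshold arrays are illustrative inputs:

```python
def pick_candidate_bins(amps, thrs, n=3):
    """Bins whose voiced amplitude exceeds their threshold are preferred,
    largest first; if fewer than n qualify, the remaining bins are ranked
    by amplitude (descending) and used to fill up to n candidates."""
    above = [k for k, (a, t) in enumerate(zip(amps, thrs)) if a > t]
    above.sort(key=lambda k: amps[k], reverse=True)
    if len(above) >= n:
        return above[:n]
    rest = [k for k in range(len(amps)) if k not in above]
    rest.sort(key=lambda k: amps[k], reverse=True)
    return above + rest[:n - len(above)]
```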
Optionally, the method further comprises:
and generating a voice compensation signal according to the background comfort noise under the condition that the first frequency domain signal does not have the alternative fundamental tone frequency point.
Optionally, the determining a target voiced sound signal based on the second frequency-domain signal includes:
converting the second frequency domain signal into an alternative time domain signal;
and selecting the signal with the longest pitch period from the alternative time domain signals as a target voiced sound signal.
Optionally, the method further includes:
under the condition that the voice compensation signal is generated according to the background comfort noise, generating meaning expression word options of the whole voice based on adjacent voice information;
acquiring a selection result of the user based on the meaning expression word option;
sending the meaning expression words corresponding to the selection result to a receiving end of the voice message;
and/or,
and under the condition that the voice compensation signal is generated according to the background comfort noise and the receiving of the voice information of the sending end is finished, the voice information comprising the voice compensation signal is played at the sending end.
The processor 303 is also configured to perform any of the methods in the corresponding embodiments of fig. 1 by calling the operation instructions stored in the memory 304.
Referring to fig. 4, fig. 4 is a schematic view of an embodiment of an electronic system according to the present application.
As shown in fig. 4, the embodiment of the present application provides an electronic system 400, which includes a memory 410, a processor 420, and a computer program 411 stored in the memory 410 and runnable on the processor 420; when the processor 420 executes the computer program 411, the following steps are implemented:
converting a historical voice signal with a preset length into a first frequency domain signal;
determining alternative fundamental tone frequency points in the first frequency domain signal based on the voiced amplitude values of the frequency points of different fundamental tone frequencies and the corresponding threshold amplitude values;
under the condition that the alternative fundamental tone frequency points exist in the first frequency domain signal, setting the frequency point amplitudes in the first frequency domain signal other than those corresponding to the alternative fundamental tone frequency points to zero to obtain a second frequency domain signal;
determining a target voiced sound signal based on the second frequency-domain signal;
and generating a voice compensation signal according to the target voiced sound signal.
Optionally, the method further comprises:
determining, by the formula

thr_i(k) = α · thr_{i-1}(k) + (1 - α) · |X_i(k)|,

the corresponding threshold amplitudes thr_i(k) for different fundamental frequencies, wherein α is a smoothing factor and |X_i(k)| is the amplitude of the kth frequency bin after the FFT of the ith iteration value.
Optionally, the determining the candidate pitch bins in the first frequency domain signal based on the voiced amplitude of the bins with different pitch frequencies and the corresponding threshold amplitude includes:
and selecting the pitch frequency points with the frequency point voiced sound amplitude and/or the integer multiple frequency point voiced sound amplitude larger than the corresponding threshold amplitude as alternative pitch frequency points.
Optionally, the selecting, as the candidate pitch frequency points, the pitch frequency points whose voiced sound amplitudes of the frequency points and/or integer-times frequency point voiced sound amplitudes are greater than the corresponding threshold amplitudes includes:
selecting, in the case where three or more fundamental tone frequency points have a frequency point voiced amplitude and/or an integer-multiple frequency point voiced amplitude greater than the corresponding threshold amplitude, the three fundamental tone frequency points with the largest frequency point voiced amplitudes as the alternative fundamental tone frequency points;
and, in the case where fewer than three such fundamental tone frequency points are selected, sorting the frequency point voiced amplitudes of the remaining fundamental tone frequency points in descending order and supplementing the remaining fundamental tone frequency points to the alternative fundamental tone frequency points in that order until three alternative fundamental tone frequency points are obtained.
Optionally, the method further comprises:
and generating a voice compensation signal according to the background comfort noise under the condition that the first frequency domain signal does not have the alternative fundamental tone frequency point.
Optionally, the determining a target voiced sound signal based on the second frequency-domain signal includes:
converting the second frequency domain signal into an alternative time domain signal;
and selecting the signal with the longest pitch period from the alternative time domain signals as a target voiced sound signal.
Optionally, the method further includes:
under the condition that the voice compensation signal is generated according to the background comfort noise, generating meaning expression word options of the whole voice based on adjacent voice information;
acquiring a selection result of the user based on the meaning expression word option;
sending the meaning expression words corresponding to the selection result to a receiving end of the voice message;
and/or,
and under the condition that the voice compensation signal is generated according to the background comfort noise and the receiving of the voice information of the sending end is finished, the voice information comprising the voice compensation signal is played at the sending end.
In a specific implementation, when the processor 420 executes the computer program 411, any of the embodiments corresponding to fig. 1 may be implemented.
Since the electronic system described in this embodiment is a device for implementing the speech signal processing apparatus of the embodiments of the present application, based on the methods described herein, those skilled in the art can understand the specific implementation of this electronic system and its variations. How the electronic system implements the methods of the embodiments of the present application is therefore not described in detail here; any device that a person skilled in the art uses to implement the methods of the embodiments of the present application falls within the scope of protection intended by the present application.
Referring to fig. 5, fig. 5 is a schematic diagram of an embodiment of a computer-readable storage medium according to an embodiment of the present application.
As shown in fig. 5, the present embodiment provides a computer-readable storage medium 500, on which a computer program 511 is stored, the computer program 511 implementing the following steps when executed by a processor:
converting a historical voice signal with a preset length into a first frequency domain signal;
determining alternative fundamental tone frequency points in the first frequency domain signal based on the voiced sound amplitude values of the frequency points of different fundamental tone frequencies and the corresponding threshold amplitude values;
under the condition that the alternative fundamental tone frequency points exist in the first frequency domain signal, setting the frequency point amplitudes in the first frequency domain signal other than those corresponding to the alternative fundamental tone frequency points to zero to obtain a second frequency domain signal;
determining a target voiced sound signal based on the second frequency-domain signal;
generating a speech compensation signal from the target voiced sound signal.
Optionally, the method further comprises:
determining, by the formula

thr_i(k) = α · thr_{i-1}(k) + (1 - α) · |X_i(k)|,

the corresponding threshold amplitudes thr_i(k) for different fundamental frequencies, wherein α is a smoothing factor and |X_i(k)| is the amplitude of the kth frequency bin after the FFT of the ith iteration value.
Optionally, the determining the candidate pitch bins in the first frequency domain signal based on the voiced amplitude of the bins with different pitch frequencies and the corresponding threshold amplitude includes:
and selecting the pitch frequency points with the voiced sound amplitude of the frequency point and/or the voiced sound amplitude of the integral multiple frequency points larger than the corresponding threshold amplitude as alternative pitch frequency points.
Optionally, the selecting, as the candidate pitch frequency points, the pitch frequency points whose voiced sound amplitudes of the frequency points and/or integer-times frequency point voiced sound amplitudes are greater than the corresponding threshold amplitudes includes:
selecting, in the case where three or more fundamental tone frequency points have a frequency point voiced amplitude and/or an integer-multiple frequency point voiced amplitude greater than the corresponding threshold amplitude, the three fundamental tone frequency points with the largest frequency point voiced amplitudes as the alternative fundamental tone frequency points;
and, in the case where fewer than three such fundamental tone frequency points are selected, sorting the frequency point voiced amplitudes of the remaining fundamental tone frequency points in descending order and supplementing the remaining fundamental tone frequency points to the alternative fundamental tone frequency points in that order until three alternative fundamental tone frequency points are obtained.
Optionally, the method further comprises:
and generating a voice compensation signal according to the background comfort noise under the condition that the alternative fundamental tone frequency point does not exist in the first frequency domain signal.
Optionally, the determining a target voiced sound signal based on the second frequency-domain signal includes:
converting the second frequency domain signal into an alternative time domain signal;
and selecting the signal with the longest pitch period from the alternative time domain signals as a target voiced sound signal.
Optionally, the method further includes:
under the condition that the voice compensation signal is generated according to the background comfort noise, generating meaning expression word options of the whole voice based on adjacent voice information;
acquiring a selection result of the user based on the meaning expression word option;
sending the meaning expression words corresponding to the selection result to a receiving end of the voice message;
and/or,
and under the condition that the voice compensation signal is generated according to the background comfort noise and the receiving of the voice information of the sending end is finished, the voice information comprising the voice compensation signal is played at the sending end.
In a specific implementation, the computer program 511 may implement any of the embodiments corresponding to fig. 1 when executed by a processor.
It should be noted that, in the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to relevant descriptions of other embodiments for parts that are not described in detail in a certain embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Embodiments of the present application further provide a computer program product, which includes computer software instructions, when the computer software instructions are run on a processing device, the processing device is caused to execute the flow in the speech signal processing method in the corresponding embodiment of fig. 1.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, which are essential or part of the technical solutions contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (9)

1. A speech signal processing method, comprising:
converting a historical voice signal with a preset length into a first frequency domain signal;
determining alternative fundamental tone frequency points in the first frequency domain signal based on the voiced amplitude values of the frequency points of different fundamental tone frequencies and the corresponding threshold amplitude values;
under the condition that the alternative fundamental tone frequency points exist in the first frequency domain signal, setting the frequency point amplitudes in the first frequency domain signal other than those corresponding to the alternative fundamental tone frequency points to zero to obtain a second frequency domain signal;
converting the second frequency domain signal into an alternative time domain signal;
selecting a signal with the longest pitch period from the alternative time domain signals as a target voiced sound signal;
and generating a voice compensation signal according to the target voiced sound signal.
2. The method of claim 1, further comprising:
determining, by the formula

thr_i(k) = α · thr_{i-1}(k) + (1 - α) · |X_i(k)|,

the corresponding threshold amplitudes thr_i(k) for different pitch frequencies, wherein α is a smoothing factor and |X_i(k)| is the amplitude of the kth frequency bin after the FFT of the ith iteration value.
3. The method of claim 1, wherein determining alternative pitch bins in the first frequency domain signal based on bin voiced amplitudes for different pitch frequencies and corresponding threshold amplitudes comprises:
and selecting the pitch frequency points with the frequency point voiced sound amplitude and/or the integer multiple frequency point voiced sound amplitude larger than the corresponding threshold amplitude as alternative pitch frequency points.
4. The method of claim 3, wherein selecting the pitch frequency bins whose bin voiced amplitude and/or integer-multiple bin voiced amplitude is greater than the corresponding threshold amplitude as candidate pitch frequency bins comprises:
in a case where at least three pitch frequency bins have a bin voiced amplitude and/or integer-multiple bin voiced amplitude greater than or equal to the corresponding threshold amplitude, selecting the three pitch frequency bins with the largest bin voiced amplitudes as the candidate pitch frequency bins;
and, in a case where fewer than three pitch frequency bins have a bin voiced amplitude and/or integer-multiple bin voiced amplitude greater than the corresponding threshold amplitude, sorting the remaining pitch frequency bins by bin voiced amplitude in descending order and adding them, in that order, to the candidate pitch frequency bins until three candidate pitch frequency bins are obtained.
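Claims 3 and 4 together describe a keep-the-top-three selection with a top-up rule when fewer than three bins clear their thresholds. A minimal sketch (all names are assumptions; ties and the harmonic-amplitude bookkeeping are simplified) is:

```python
def select_candidate_bins(voiced_amp, harmonic_amp, thresholds, k=3):
    """Pick candidate pitch bins per claims 3-4 (illustrative sketch).

    A bin qualifies when its voiced amplitude and/or its integer-multiple
    (harmonic) amplitude reaches the per-bin threshold; keep the k largest
    qualifiers, topping up from the remaining bins in descending voiced
    amplitude so exactly k candidates are returned."""
    n = len(voiced_amp)
    qualifying = [i for i in range(n)
                  if voiced_amp[i] >= thresholds[i] or harmonic_amp[i] >= thresholds[i]]
    qualifying.sort(key=lambda i: voiced_amp[i], reverse=True)
    candidates = qualifying[:k]
    if len(candidates) < k:
        rest = sorted((i for i in range(n) if i not in candidates),
                      key=lambda i: voiced_amp[i], reverse=True)
        candidates += rest[:k - len(candidates)]
    return candidates
```

With five bins of voiced amplitude `[5, 1, 4, 3, 2]` and a uniform threshold, the function returns the indices of the three strongest bins regardless of how many clear the threshold, matching the claim's two cases.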
5. The method of claim 1, further comprising:
generating the speech compensation signal from background comfort noise in a case where no candidate pitch frequency bin exists in the first frequency-domain signal.
6. The method of claim 5, further comprising:
in a case where the speech compensation signal is generated from background comfort noise, generating wording options expressing the meaning of the whole utterance based on adjacent speech information;
obtaining the user's selection among the wording options;
sending the wording corresponding to the selection result to the receiving end of the voice message;
and/or,
in a case where the speech compensation signal is generated from background comfort noise and reception of the voice information from the sending end has finished, playing the voice information including the speech compensation signal at the sending end.
7. A speech signal processing apparatus, comprising:
a conversion unit configured to convert a historical speech signal of a preset length into a first frequency-domain signal;
a determining unit configured to determine candidate pitch frequency bins in the first frequency-domain signal based on the voiced amplitudes of bins at different pitch frequencies and the corresponding threshold amplitudes;
an obtaining unit configured to, in a case where candidate pitch frequency bins exist in the first frequency-domain signal, set to zero the bin amplitudes in the first frequency-domain signal other than the candidate pitch frequency bins, to obtain a second frequency-domain signal;
the determining unit being further configured to convert the second frequency-domain signal into candidate time-domain signals and to select, from the candidate time-domain signals, the signal with the longest pitch period as the target voiced signal;
and a generating unit configured to generate a speech compensation signal from the target voiced signal.
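The claimed units map naturally onto a small class. The skeleton below is illustrative only (the generating unit is stubbed as simple repetition, and pitch-period selection is omitted); method and parameter names are assumptions:

```python
import numpy as np

class SpeechCompensator:
    """Sketch of the claimed apparatus; unit names follow claim 7."""

    def convert(self, history):
        # conversion unit: historical speech -> first frequency-domain signal
        return np.fft.rfft(history)

    def determine(self, spectrum, thresholds):
        # determining unit: bins whose amplitude reaches the per-bin threshold
        amps = np.abs(spectrum)
        return [i for i, a in enumerate(amps) if a >= thresholds[i]]

    def obtain(self, spectrum, candidates, frame_len):
        # obtaining unit: zero every bin except the candidates, back to time domain
        kept = np.zeros_like(spectrum)
        kept[candidates] = spectrum[candidates]
        return np.fft.irfft(kept, n=frame_len)

    def generate(self, target_voiced, repeats=2):
        # generating unit (stub): extend the target voiced signal by repetition
        return np.tile(target_voiced, repeats)
```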
8. An electronic system comprising a memory and a processor, wherein the processor is configured to carry out the steps of the speech signal processing method according to any one of claims 1 to 6 when executing a computer program stored in the memory.
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the speech signal processing method according to any one of claims 1 to 6.
CN202210286433.2A 2022-03-23 2022-03-23 Voice signal processing method, device, system and storage medium Active CN114387989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210286433.2A CN114387989B (en) 2022-03-23 2022-03-23 Voice signal processing method, device, system and storage medium

Publications (2)

Publication Number Publication Date
CN114387989A CN114387989A (en) 2022-04-22
CN114387989B true CN114387989B (en) 2022-07-01

Family

ID=81205899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210286433.2A Active CN114387989B (en) 2022-03-23 2022-03-23 Voice signal processing method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN114387989B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2398980B * 2003-02-27 2005-09-14 Motorola Inc Speech communication unit and method for synthesising speech therein
CN1901431A * 2006-07-04 2007-01-24 Huawei Technologies Co., Ltd. Lost frame hiding method and device
CN101887723A * 2007-06-14 2010-11-17 Huawei Device Co., Ltd. Fine tuning method and device for pitch period
WO2017084545A1 * 2015-11-19 2017-05-26 China Academy of Telecommunications Technology Method and system for voice packet loss concealment
CN106856093A * 2017-02-23 2017-06-16 Hisense Group Co., Ltd. Audio information processing method, intelligent terminal and voice control terminal
CN109346109A * 2018-12-05 2019-02-15 Baidu Online Network Technology (Beijing) Co., Ltd. Fundamental frequency extraction method and device
CN111653285A * 2020-06-01 2020-09-11 Beijing Yuanli Weilai Technology Co., Ltd. Packet loss compensation method and device
CN113421584A * 2021-07-05 2021-09-21 Ping An Technology (Shenzhen) Co., Ltd. Audio noise reduction method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
MX2018010754A (en) * 2016-03-07 2019-01-14 Fraunhofer Ges Forschung Error concealment unit, audio decoder, and related method and computer program fading out a concealed audio frame out according to different damping factors for different frequency bands.

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
E. Gunduzhan et al., "Linear prediction based packet loss concealment algorithm for PCM coded speech", IEEE Transactions on Speech and Audio Processing, 2001, Vol. 9, No. 8. *
Yang Shunliao, "Application of homomorphic deconvolution in pitch detection", Computer Engineering and Applications, 2013, Vol. 49, No. 24. *

Also Published As

Publication number Publication date
CN114387989A (en) 2022-04-22

Similar Documents

Publication Publication Date Title
KR101831078B1 (en) Voice Activation Detection Method and Device
US7286980B2 (en) Speech processing apparatus and method for enhancing speech information and suppressing noise in spectral divisions of a speech signal
JP6559741B2 (en) Audio signal resampling for low-delay encoding / decoding
KR101737824B1 (en) Method and Apparatus for removing a noise signal from input signal in a noisy environment
CN100520913C (en) Method of enhancing quality of speech and apparatus thereof
JPH0820878B2 (en) Parallel processing type pitch detector
KR102012325B1 (en) Estimation of background noise in audio signals
JP2016541004A5 (en)
KR20130095726A (en) Controlling a noise-shaping feedback loop in a digital audio signal encoder
WO2021093808A1 (en) Detection method and apparatus for effective voice signal, and device
US7630432B2 (en) Method for analysing the channel impulse response of a transmission channel
JP4551817B2 (en) Noise level estimation method and apparatus
CN114387989B (en) Voice signal processing method, device, system and storage medium
JP6728142B2 (en) Method and apparatus for identifying and attenuating pre-echo in a digital audio signal
US8280725B2 (en) Pitch or periodicity estimation
US7343284B1 (en) Method and system for speech processing for enhancement and detection
CN106415718B (en) Linear prediction analysis device, method and recording medium
JP2013205831A (en) Voice quality objective evaluation device and method
EP2230664A1 (en) Method and apparatus for attenuating noise in an input signal
WO2014018662A1 (en) Method of extracting zero crossing data from full spectrum signals
JP5152800B2 (en) Noise suppression evaluation apparatus and program
WO2006127968A1 (en) Restoring audio signals corrupted by impulsive noise
CN108848435B (en) Audio signal processing method and related device
JPH0844395A (en) Voice pitch detecting device
KR101176207B1 (en) Audio communication system and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant