CN106788876B - Method and system for compensating voice packet loss - Google Patents

Method and system for compensating voice packet loss

Info

Publication number
CN106788876B
CN106788876B
Authority
CN
China
Prior art keywords
voice
sequence
pitch
speech
data
Prior art date
Legal status
Active
Application number
CN201510802586.8A
Other languages
Chinese (zh)
Other versions
CN106788876A (en)
Inventor
邹莹 (Zou Ying)
Current Assignee
China Academy of Telecommunications Technology CATT
Original Assignee
China Academy of Telecommunications Technology CATT
Priority date
Filing date
Publication date
Application filed by China Academy of Telecommunications Technology CATT filed Critical China Academy of Telecommunications Technology CATT
Priority to CN201510802586.8A priority Critical patent/CN106788876B/en
Priority to PCT/CN2016/105684 priority patent/WO2017084545A1/en
Publication of CN106788876A publication Critical patent/CN106788876A/en
Application granted granted Critical
Publication of CN106788876B publication Critical patent/CN106788876B/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00 Arrangements for detecting or preventing errors in the information received
    • H04L1/004 Arrangements for detecting or preventing errors in the information received by using forward error control
    • H04L1/0056 Systems characterized by the type of code used
    • H04L1/0061 Error detection codes
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005 Correction of errors induced by the transmission channel, if related to the coding algorithm
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M11/00 Telephonic communication systems specially adapted for combination with other electrical systems
    • H04M11/06 Simultaneous speech and data transmission, e.g. telegraphic transmission over the same conductors

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a method and a system for compensating voice packet loss, wherein the method comprises the following steps: when a lost voice packet is found for the first time, generating a first voice sequence vec1 according to historical voice data before the lost voice packet; if the next packet of voice data after the voice packet loss can be obtained, generating a second voice sequence vec2 according to the next packet of voice data after the voice packet loss, and performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2 to improve the quality of synthesized voice; and if the next voice packet after the lost packet cannot be acquired, performing packet loss compensation by using the first voice sequence vec1.

Description

Method and system for compensating voice packet loss
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method and a system for compensating for voice packet loss.
Background
Voice over IP (VoIP) is a technology and service for carrying voice over an IP network: voice data is packetized and transmitted over the IP network. Because IP networks are widely deployed, VoIP telephony is inexpensive, and voice and data can be integrated into multimedia communication, VoIP has developed rapidly in recent years and is gradually replacing the conventional circuit-switched network (e.g., the PSTN, the public switched telephone network). However, IP networks are currently used primarily for data traffic and use a best-effort, connectionless transmission model, so they offer no corresponding quality-of-service guarantee. When the network is congested, packet loss and delay jitter occur at the receiving end, which severely degrades the received voice quality. Therefore, the receiving end of a VoIP system generally adopts a packet loss compensation technique to reconstruct lost voice packets, together with an anti-jitter technique to eliminate the adverse effects of delay jitter.
The purpose of a PLC (Packet Loss Compensation) algorithm is to generate synthesized speech that can substitute for the lost speech; ideally, the synthesized speech has the same pitch and spectral characteristics as the lost speech. The most common method is to generate a reasonable approximation of the lost speech from the historical data preceding the loss. If the lost segment is not too long and does not fall in a region where the speech changes rapidly, the speech sounds natural after compensation by the PLC algorithm. Several existing standards specify receiver-based PLC techniques, such as ANSI T1.521 (Annex B), ITU-T Rec. G.711 (Appendix I), and ITU-T Rec. G.722 (Appendix III).
The PLC algorithms in the above standards share two elements: pitch period detection and speech synthesis (as shown in fig. 1). Pitch period detection is performed only on the first compensation after a packet loss is found; if compensation continues over consecutive losses, the pitch period detected the first time is reused. Because pitch period detection is computationally expensive, it is generally implemented in two steps: a rough pitch period estimate is first obtained from down-sampled historical data (the original signal or the residual signal), and a more accurate pitch period is then searched for in the vicinity of the rough estimate. A frame of speech similar to the original is then synthesized according to this pitch period length, with smooth transitions made between real and synthesized speech data, and between successive synthesized segments, to avoid sharp noise; meanwhile, during consecutive losses the energy is gradually decreased to reduce the correlation between each synthesized frame and the preceding one.
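For illustration, a minimal sketch of such a history-only (one-sided) PLC loop follows. The lag search bounds, the single-pass full-rate search (the standards use the two-step coarse/fine search just described), and the 0.9 energy-decay factor are illustrative assumptions rather than values taken from the cited standards.

```python
import numpy as np

def detect_pitch(history, lo=40, hi=160):
    # Illustrative single-pass autocorrelation pitch search over lags lo..hi.
    x = history[-2 * hi:]
    corrs = [np.dot(x[:len(x) - k], x[k:]) for k in range(lo, hi + 1)]
    return lo + int(np.argmax(corrs))

def plc_one_sided(history, frame_len, state):
    # Synthesize one lost frame from history alone (classic one-sided PLC).
    if state.get("pitch") is None:        # detect the pitch only on the first loss
        state["pitch"] = detect_pitch(history)
        state["gain"] = 1.0
    lag = state["pitch"]
    reps = -(-frame_len // lag)           # ceiling division
    synth = np.tile(history[-lag:], reps)[:frame_len] * state["gain"]
    state["gain"] *= 0.9                  # decay energy on continued loss
    return synth
```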
The existing PLC methods compensate the lost voice packet using only the historical data preceding the loss, so the optimal compensation effect cannot be achieved; in particular, the compensation effect is not ideal at positions where the pitch is in transition.
Disclosure of Invention
In view of the above technical problems, the present invention provides a method and a system for compensating for voice packet loss, which address the unsatisfactory compensation quality of voice packet loss compensation in the prior art.
According to an aspect of the present invention, there is provided a method for compensating for a voice packet loss, the method including: when a lost voice packet is found for the first time (namely, when packet loss compensation is carried out for the first time), generating a first voice sequence vec1 according to historical voice data before the lost voice packet; if the next packet of voice data after the voice packet loss can be acquired, generating a second voice sequence vec2 according to the next packet of voice data after the voice packet loss; and performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2.
Optionally, if the next packet of voice data cannot be obtained, the method further includes: performing packet loss compensation according to the first voice sequence vec1.
Optionally, the generating a first voice sequence according to historical voice data before the voice packet is lost specifically includes: carrying out rough pitch estimation on the historical voice data to obtain a rough pitch estimation result; according to the rough pitch estimation result, performing a fine pitch search to determine a first pitch period Lag1; and generating the first voice sequence vec1 from the historical voice data according to the first pitch period Lag1.
Optionally, the generating a second voice sequence vec2 according to the next packet of voice data after the voice packet is lost is specifically: finding a second voice sequence vec2, phase-aligned with the first voice sequence vec1, in the next packet of voice data after the missing voice packet.
Optionally, the finding out a second voice sequence vec2 phase-aligned with the first voice sequence vec1 from the next packet of voice data after the missing voice packet specifically includes: according to the first pitch period Lag1, performing a fine pitch search in the next packet of voice data to determine a second pitch period Lag2 of the next packet of voice data; finding the position in the next packet of voice data with the strongest correlation with the first voice sequence vec1;
and determining the second voice sequence vec2 from the next packet of voice data according to the position with the strongest correlation and the second pitch period Lag2.
Optionally, before performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2, the method further includes: calculating a normalized cross-correlation value of the first voice sequence vec1 and the second voice sequence vec2; comparing the normalized cross-correlation value with a set threshold value; if the normalized cross-correlation value is larger than the set threshold value, performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2; and if the normalized cross-correlation value is smaller than or equal to the set threshold value, performing packet loss compensation according to the first voice sequence vec1.
Optionally, the performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2 specifically includes: calculating the number of pitch periods of data to be filled according to the first voice sequence vec1 and the second voice sequence vec2; calculating the length of each pitch period of data to be filled according to that number; smoothing the first voice sequence vec1 and the second voice sequence vec2 respectively; and synthesizing, from the smoothed first voice sequence vec1 and second voice sequence vec2, a voice sequence one pitch period long according to the length of each pitch period of data.
Optionally, the calculating the number of pitch periods of data to be filled for the first voice sequence vec1 and the second voice sequence vec2 specifically includes: calculating the number numPitch of pitch periods of data to be filled from the length d of the lost voice packet, the length rd determined after the phases of the first voice sequence vec1 and the second voice sequence vec2 are aligned, the first pitch period Lag1 and the second pitch period Lag2, using the following formula: numPitch = (rd + (avgLag/2))/avgLag, where avgLag denotes the average pitch period length and avgLag = (Lag1 + Lag2)/2.
Optionally, the performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2 further includes: calculating the ratio, to the average pitch period length avgLag, of the difference between the length obtained by filling numPitch pitch periods of length avgLag and the actual length to be filled; judging whether pitch doubling exists according to the ratio; if yes, regenerating the second voice sequence vec2 according to the corrected pitch period length; and if not, calculating the length of each pitch period of data to be filled according to the number of pitch periods to be filled.
Optionally, the synthesizing a voice sequence one pitch period long from the smoothed first voice sequence vec1 and second voice sequence vec2 according to the length of each pitch period of data specifically includes: determining the synthesis weight w1 of the first voice sequence vec1 and the synthesis weight w2 of the second voice sequence vec2 according to the distances from the position of the voice to be synthesized to the first voice sequence vec1 and to the second voice sequence vec2, respectively; and synthesizing a voice sequence one pitch period long from the smoothed sequences according to the first voice sequence vec1 and its weight w1 and the second voice sequence vec2 and its weight w2. If the lost data packet has not arrived after the synthesized voice sequence of one pitch period is filled in, further voice sequences are synthesized in turn, again using only the smoothed first voice sequence vec1 and second voice sequence vec2. Each time one pitch period of data has been synthesized, it is checked whether the lost data packet has arrived; if so, smooth concatenation is performed, and if not, compensation continues with the above method.
According to another aspect of the present invention, there is also provided a system for compensating for voice packet loss, the system comprising: a first voice sequence module, configured to generate a first voice sequence vec1 according to historical voice data before a lost voice packet when the lost voice packet is found for the first time; a second voice sequence module, configured to generate a second voice sequence vec2 according to the next packet of voice data after the voice packet is lost, if that next packet can be obtained; and a compensation module, configured to perform packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2.
Optionally, the compensation module is further configured to perform packet loss compensation according to the first voice sequence vec1 if the next packet of voice data cannot be obtained.
Optionally, the first voice sequence module is specifically configured to perform rough pitch estimation on the historical voice data to obtain a rough pitch estimation result; according to the rough pitch estimation result, perform a fine pitch search to determine a first pitch period Lag1; and generate the first voice sequence vec1 from the historical voice data according to the first pitch period Lag1.
Optionally, the second voice sequence module is specifically configured to: find a second voice sequence vec2, phase-aligned with the first voice sequence vec1, in the next packet of voice data after the missing voice packet.
Optionally, the second speech sequence module specifically includes:
a first unit, configured to perform a fine pitch search in the next packet of voice data according to the first pitch period Lag1, and determine a second pitch period Lag2 of the next packet of voice data;
a second unit, configured to find the position in the next packet of voice data with the strongest correlation with the first voice sequence vec1;
a third unit, configured to determine the second voice sequence vec2 from the next packet of voice data according to the position with the strongest correlation and the second pitch period Lag2.
Optionally, the second speech sequence module further includes:
a fourth unit, configured to calculate a normalized cross-correlation value between the first voice sequence vec1 and the second voice sequence vec2;
a fifth unit, configured to compare the normalized cross-correlation value with a set threshold value; if the normalized cross-correlation value is larger than the set threshold value, trigger the compensation module to perform packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2; and if the normalized cross-correlation value is smaller than or equal to the set threshold value, trigger the compensation module to perform packet loss compensation according to the first voice sequence vec1.
Optionally, the compensation module specifically includes:
a sixth unit, configured to calculate the number of pitch periods of data to be filled according to the first voice sequence vec1 and the second voice sequence vec2;
a seventh unit, configured to calculate the length of each pitch period of data to be filled according to the number of pitch periods to be filled;
an eighth unit, configured to smooth the first voice sequence vec1 and the second voice sequence vec2 respectively;
a ninth unit, configured to synthesize a voice sequence one pitch period long from the smoothed first voice sequence vec1 and second voice sequence vec2 according to the length of each pitch period of data.
Optionally, the sixth unit is specifically configured to: calculate the number numPitch of pitch periods of data to be filled from the length d of the lost voice packet, the length rd determined after the phases of the first voice sequence vec1 and the second voice sequence vec2 are aligned, the first pitch period Lag1 and the second pitch period Lag2, using the following formula: numPitch = (rd + (avgLag/2))/avgLag, where avgLag denotes the average pitch period length and avgLag = (Lag1 + Lag2)/2.
Optionally, the compensation module further comprises:
a tenth unit, configured to calculate the ratio, to the average pitch period length avgLag, of the difference between the length obtained by filling numPitch pitch periods of length avgLag and the actual length to be filled; judge whether pitch doubling exists according to the ratio; if yes, trigger the second voice sequence module to regenerate the second voice sequence vec2 according to the corrected pitch period; and if not, trigger the seventh unit to calculate the length of each pitch period of data to be filled according to the number of pitch periods to be filled.
Optionally, the ninth unit is further configured to: determine the synthesis weight w1 of the first voice sequence vec1 and the synthesis weight w2 of the second voice sequence vec2 according to the distances from the position of the voice to be synthesized to the first voice sequence vec1 and to the second voice sequence vec2, respectively; and synthesize a voice sequence one pitch period long from the smoothed first voice sequence vec1 and second voice sequence vec2 according to the first voice sequence vec1 and its weight w1 and the second voice sequence vec2 and its weight w2. If the lost voice data packet has not arrived after the synthesized voice sequence of one pitch period is filled in, further voice sequences are synthesized in turn, again using only the smoothed first voice sequence vec1 and second voice sequence vec2; each time one pitch period of data has been synthesized, it is checked whether the lost data packet has arrived; if so, smooth concatenation is performed, and otherwise compensation continues with this method.
The invention has the beneficial effects that: in the embodiments of the invention, when the current data packet is lost, if the next voice packet after the lost packet cannot be obtained, packet loss compensation is performed using the historical data before the loss; if the next voice packet can be obtained, packet loss compensation is performed using the data on both sides of (bidirectional to) the lost packet, which improves the quality of the synthesized voice. In addition, the pitch doubling phenomenon can be detected and corrected at low cost while packet loss compensation is carried out. Once a lost packet is found, each invocation of the method compensates only one pitch period of data; this fine compensation granularity means that if the corresponding delayed packet arrives during compensation, its data can be used to compensate the loss more faithfully.
Drawings
Fig. 1 is a schematic diagram of a packet loss compensation algorithm in the prior art;
fig. 2 is a flowchart of a method for compensating for voice packet loss according to a first embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for compensating for voice packet loss according to a second embodiment of the present invention;
FIG. 4 is a diagram illustrating a compensation method according to a second embodiment of the present invention;
FIG. 5 is a diagram illustrating smoothing of vec1 according to a second embodiment of the present invention;
FIG. 6 is a diagram illustrating a cosine window function waveform according to a second embodiment of the present invention;
FIG. 7 is a diagram illustrating speech synthesis in a second embodiment of the present invention, wherein the left side is extended and the right side is compressed;
FIG. 8 is a schematic diagram of a waveform without compensation for a loss of 20ms of data;
FIG. 9 is a schematic diagram of a waveform with 20ms of missing data compensated using only historical data;
FIG. 10 is a schematic diagram of a waveform with 20ms data lost compensated for with bi-directional data;
FIG. 11 is a diagram of an original waveform without packet loss;
fig. 12 is a flowchart illustrating voice packet loss compensation performed by the voice packet loss compensation system according to the present invention;
fig. 13 is a system block diagram of voice packet loss compensation in the third embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
First embodiment
Referring to fig. 2, a method for compensating for a voice packet loss in a first embodiment is shown, and the specific steps are as follows:
step S201, when a lost voice packet is found for the first time, a first voice sequence vec1 is generated according to the historical voice data before the lost voice packet, and then the process proceeds to step S203.
Specifically, in step S201, when a lost voice packet is found for the first time (i.e., when packet loss compensation is performed for the first time), rough pitch estimation is performed on the historical voice data to obtain a rough pitch estimation result; according to the rough pitch estimation result, a fine pitch search is performed to determine a first pitch period Lag1; and the first voice sequence vec1 is generated from the historical voice data according to the first pitch period Lag1.
In step S203, if the next packet of voice data after the missing voice packet can be acquired, a second voice sequence vec2 is generated from the next packet of voice data after the missing voice packet, and the process then proceeds to step S205.
Specifically, in step S203, finding a second voice sequence vec2 phase-aligned with the first voice sequence vec1 from the next packet of voice data after the missing voice packet includes: according to the first pitch period Lag1, performing a fine pitch search in the next packet of voice data to determine a second pitch period Lag2 of the next packet of voice data; finding the position in the next packet of voice data with the strongest correlation with the first voice sequence vec1; and determining the second voice sequence vec2 from the next packet of voice data according to the position with the strongest correlation and the second pitch period Lag2.
Optionally, if the next packet of voice data cannot be obtained, packet loss compensation data is generated and output according to the first voice sequence vec1 alone.
Step S205: performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2.
Specifically, in step S205, the number of pitch periods of data to be filled is calculated from the first voice sequence vec1 and the second voice sequence vec2; the length of each pitch period of data to be filled is calculated from that number; the first voice sequence vec1 and the second voice sequence vec2 are smoothed respectively; and a voice sequence one pitch period long is synthesized from the smoothed first voice sequence vec1 and second voice sequence vec2 according to the length of each pitch period of data, preferably by waveform interpolation. It is understood, of course, that the specific interpolation algorithm is not limited in this embodiment.
It should be noted that the calculating of the number of pitch periods of data to be filled according to the first voice sequence vec1 and the second voice sequence vec2 specifically includes: calculating the number numPitch of pitch periods of data to be filled from the length d of the lost voice packet, the length rd determined after the phases of the first voice sequence vec1 and the second voice sequence vec2 are aligned, the first pitch period Lag1 and the second pitch period Lag2, using the following formula: numPitch = (rd + (avgLag/2))/avgLag, where avgLag denotes the average pitch period length and avgLag = (Lag1 + Lag2)/2.
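As a worked example with assumed values: Lag1 = 80 samples and Lag2 = 84 samples give avgLag = (80 + 84)/2 = 82; if the aligned length is rd = 340 samples, then numPitch = (340 + 41)/82 = 4 under integer division, so four pitch periods of data are filled.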
It should be noted that the synthesizing of a voice sequence one pitch period long according to the length of each pitch period of data specifically includes: determining the synthesis weight w1 of the first voice sequence vec1 and the synthesis weight w2 of the second voice sequence vec2 according to the distances from the position of the voice to be synthesized to the first voice sequence vec1 and to the second voice sequence vec2, respectively; and synthesizing a voice sequence one pitch period long from the smoothed first voice sequence vec1 and second voice sequence vec2 (for example, by waveform interpolation) according to the first voice sequence vec1 and its weight w1 and the second voice sequence vec2 and its weight w2. If the missing voice packet has not arrived after the synthesized voice sequence of one pitch period is filled in, further voice sequences are synthesized in turn using the smoothed first voice sequence vec1 and second voice sequence vec2 (for example, by waveform interpolation); each time one pitch period of data has been synthesized, it is checked whether the missing voice packet has arrived; if so, smooth concatenation is performed, and otherwise compensation continues with the above method.
In the embodiments of the invention, when the current data packet is lost, if the next voice packet after the lost packet cannot be obtained, packet loss compensation is performed using the historical data before the loss; if the next voice packet can be obtained, packet loss compensation is performed using the data on both sides of (bidirectional to) the lost packet, which improves the quality of the synthesized voice. In addition, the pitch doubling phenomenon can be detected and corrected at low cost while packet loss compensation is carried out. Once a lost packet is found, each invocation of the method compensates only one pitch period of data; this fine compensation granularity means that if the corresponding delayed packet arrives during compensation, its data can be used to compensate the loss more faithfully.
Second embodiment
Referring to fig. 3, a flowchart of an algorithm for packet loss compensation using bidirectional data is shown, and the specific steps are as follows:
Step 1: when packet loss compensation is performed for the first time, a rough pitch estimate is first obtained from down-sampled history data; the purpose of down-sampling is to reduce the amount of computation. The estimate is obtained by the autocorrelation method, where the data may be down-sampled to 4 kHz, and the autocorrelation function is calculated as:

R(k) = Σ_{i=0}^{N−1−k} x(i)·x(i+k), k = 1, 2, …, p (4-1)

where N is the length of the sequence x(i) and p is the number of autocorrelation values to be calculated.
Three maxima are found among the p autocorrelation values, each corresponding to a candidate pitch period. A fine pitch search is then performed on the original data near the three candidate pitch periods (the fine search also uses the autocorrelation method), and the final pitch period, Lag1, is determined. A voice sequence vec1 is generated from the history based on the pitch period found (vec1 is obtained by tracing back Lag1 samples from the latest sample in the history) and is used to synthesize the final output speech.
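A minimal sketch of this two-step (coarse, then fine) pitch search follows; the plain decimation (a real implementation would low-pass filter before down-sampling, and this assumes the rate is a multiple of 4 kHz) and the helper names are illustrative assumptions.

```python
import numpy as np

def autocorr(x, p):
    # Formula 4-1: R(k) = sum over i of x(i) * x(i + k), for k = 1..p.
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(1, p + 1)])

def coarse_then_fine_pitch(history, rate, p=50):
    # Step 1a: rough estimate on data decimated to 4 kHz.
    step = rate // 4000
    r = autocorr(history[::step], p)
    candidates = np.argsort(r)[-3:] + 1          # lags of the three largest values
    # Step 1b: fine search on the original data, +/- 0.5 ms around each candidate.
    radius = max(1, int(0.0005 * rate))
    best_lag, best_val = 1, -np.inf
    for c in candidates:
        center = int(c) * step                   # map the lag back to the full rate
        for lag in range(max(1, center - radius), center + radius + 1):
            v = np.dot(history[:len(history) - lag], history[lag:])
            if v > best_val:
                best_lag, best_val = lag, v
    return best_lag                              # Lag1; vec1 = history[-best_lag:]
```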
Step 2: if the next packet after the lost packet is not available, the output voice sequence is synthesized directly from the voice sequence generated in the previous step. If the next packet is available, a fine pitch search is performed on it: the search for the new pitch period is locked near Lag1, centered at Lag1 with a range of plus or minus Lag1/2, and the maximum autocorrelation value found in this range by the autocorrelation method determines the new pitch period, Lag2.
Step 3: a voice sequence vec2, phase-aligned with vec1, is found in the next packet after the lost packet (hereinafter called the new data). If the historical data are stored in a speech buffer, vec1 is obtained by tracing back Lag1 sampling points from the latest data in the speech buffer, and its length is Lag1. Preferably, the position k with the strongest correlation with vec1 — that is, the position corresponding to the maximum of corr(k) — is found in the new data ("Next packet data" in fig. 4) by the cross-correlation method (for the calculation of the cross-correlation function, see formula 4-2), and a segment of length Lag2 is extracted from that position; this sequence is recorded as vec2. The purpose of locating the starting point of vec2 by the strongest-correlation principle is to align the phases of the two sequences.
corr(k) = Σ_{i=0}^{N−1} vec1(i)·y(k+i) (4-2)

where y denotes the new data, k is the candidate starting position, and N is the correlation length (in the experiments below, N = MIN(Lag1, LEN − Lag2), with LEN the data packet length).
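The alignment step might look as follows; this is a sketch under the assumptions above, with illustrative function and variable names.

```python
import numpy as np

def find_vec2(vec1, new_data, lag2):
    # Slide over Lag2 candidate start positions and keep the best-correlated
    # one (formula 4-2); N = MIN(Lag1, LEN - Lag2), as in the experiments.
    n = min(len(vec1), len(new_data) - lag2)
    best_k, best_corr = 0, -np.inf
    for k in range(lag2):
        c = np.dot(vec1[:n], new_data[k:k + n])
        if c > best_corr:
            best_k, best_corr = k, c
    return new_data[best_k:best_k + lag2]        # vec2, phase-aligned with vec1
```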
Step 4: the normalized cross-correlation value of vec1 and vec2 is calculated and compared with a set threshold, where the normalized cross-correlation function (formula 4-3) is

bestCorr = Σ_{i=0}^{N−1} vec1(i)·vec2(i) / sqrt(Σ_{i=0}^{N−1} vec1(i)² · Σ_{i=0}^{N−1} vec2(i)²) (4-3)

If the correlation between vec1 and vec2 is strong (larger than the threshold), packet loss compensation is performed with bidirectional data; otherwise it is performed with historical data only.
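A sketch of this gate, assuming the standard normalized cross-correlation above and the 0.6 threshold reported in the experiments below:

```python
import numpy as np

def normalized_xcorr(vec1, vec2):
    # Formula 4-3: inner product over the common length, normalized by the
    # energies of the two sequences.
    n = min(len(vec1), len(vec2))
    a, b = vec1[:n], vec2[:n]
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return float(np.dot(a, b) / denom) if denom > 0.0 else 0.0

def use_bidirectional(vec1, vec2, threshold=0.6):
    # True -> compensate with bidirectional data; False -> history only.
    return normalized_xcorr(vec1, vec2) > threshold
```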
Step 5: the number of pitch periods to be filled is calculated from the length d of the lost data (shown in fig. 3) and the lengths rd, Lag1 and Lag2 determined after phase alignment, and this number is used to judge whether the previously determined pitch period length is reasonable. The method is as follows:
1. Calculate the average pitch period length avgLag = (Lag1 + Lag2)/2;
2. Calculate the number of pitch periods to be filled, numPitch = (rd + (avgLag/2))/avgLag;
3. Calculate the ratio, to avgLag, of the difference between the length filled by numPitch periods of length avgLag and the actual length to be filled: ratio = abs(numPitch × avgLag − rd)/avgLag.
If the ratio falls between 0.4 and 0.6, the previously determined pitch period length is likely a doubled pitch period. To confirm this, the three candidate pitch period values mentioned in step 1 are checked for one close to half the average pitch period, avgLag/2. If such a value exists, Lag1 is updated with it and the program jumps back to step 2 for re-execution; if no such value exists, the pitch period search or the correlation detection is considered problematic, and to keep the problem from worsening, packet loss compensation is performed with historical data only. If the ratio falls within [0.2, 0.4] or [0.6, 0.8], the pitch period search or the correlation detection is likewise considered problematic, and again only historical data is used for packet loss compensation. If the ratio falls within [0, 0.2] or [0.8, 1], the pitch period search is considered correct, and the following procedure continues.
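A sketch of this decision, with integer arithmetic assumed and the band boundaries taken as written above:

```python
def plan_padding(rd, lag1, lag2):
    # Step 5: number of pitch periods to fill, plus the pitch-doubling check.
    avg_lag = (lag1 + lag2) // 2
    num_pitch = (rd + avg_lag // 2) // avg_lag        # rounded integer count
    ratio = abs(num_pitch * avg_lag - rd) / avg_lag   # mismatch, in pitch periods
    if 0.4 <= ratio <= 0.6:
        verdict = "suspect_doubling"      # retry with a candidate near avgLag / 2
    elif 0.2 <= ratio < 0.4 or 0.6 < ratio <= 0.8:
        verdict = "history_only"          # pitch search / correlation unreliable
    else:
        verdict = "ok"                    # ratio in [0, 0.2] or [0.8, 1]
    return num_pitch, verdict
```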
Step 6: the length of each pitch period to be inserted is calculated. Since the pitch period length changes gradually, the pitch period lengths before and after the loss are likely to differ; to follow the actual change of the period length more closely, the filled pitch period lengths are also varied, with the sum of the varied lengths equal to rd. The method is as follows (a code sketch of this procedure is given after the list below):
● Take Lag1 as the initial length of each of the numPitch pitch periods to be filled, and count the total length len filled in this way. If len equals rd, the current filling is already optimal; continue to step 7. Otherwise, perform the following steps.
● If len is greater than rd, it is handled in two cases:
Case 1: Lag2 ≥ Lag1. Starting from the first filled pitch period, reduce each length by 1 in turn; if len still does not equal rd after the last period has been reduced, repeat the process from the first period until len equals rd.
Case 2: Lag2 < Lag1. Starting from the last filled pitch period and moving backward, reduce each length by 1 in turn; if len still does not equal rd after the first period has been reduced, repeat the process from the last period until len equals rd.
● If len is less than rd, it is handled in two cases:
Case 1: Lag2 ≥ Lag1. Starting from the last filled pitch period and moving backward, increase each length by 1 in turn; if len still does not equal rd after the first period has been increased, repeat the process from the last period until len equals rd.
Case 2: Lag2 < Lag1. Starting from the first filled pitch period, increase each length by 1 in turn; if len still does not equal rd after the last period has been increased, repeat the process from the first period until len equals rd.
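The following sketch implements the length-distribution procedure above; it assumes numPitch ≥ 1 and that the lengths stay positive (the text does not address degenerate cases).

```python
def pitch_lengths(num_pitch, rd, lag1, lag2):
    # Step 6: per-period fill lengths, starting from Lag1, that sum exactly to rd.
    lengths = [lag1] * num_pitch
    if sum(lengths) == rd:
        return lengths
    delta = 1 if sum(lengths) < rd else -1
    # Choose the end to adjust from, per the four cases in the text.
    from_last = (delta == 1) == (lag2 >= lag1)
    order = list(range(num_pitch - 1, -1, -1)) if from_last else list(range(num_pitch))
    i = 0
    while sum(lengths) != rd:
        lengths[order[i % num_pitch]] += delta    # adjust by 1, cycling in passes
        i += 1
    return lengths
```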
Step 7: smoothing vec1 and vec2. Because the pitch period data to be inserted later are interpolated from vec1 and vec2, a discontinuity will appear between the interpolated pitch periods unless smoothing is applied between the beginning and the end of vec1 (and of vec2); a serious discontinuity produces sharp noise. The smoothing is described below, taking vec1 as an example.
As shown in fig. 5, a segment of data immediately preceding vec1 is taken and called o1, and the latest segment of vec1 is taken and called o2; o2 and o1 are equal in length. The left half of a triangular window is applied to o1, the right half of the triangular window is applied to o2, and the two windowed signals are then overlap-added; the result replaces the o2 region.
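A sketch of this overlap-add, reading fig. 5 as a crossfade of vec1's tail toward the data that originally preceded vec1's first sample (so that tiling vec1 wraps smoothly); the buffer layout and the linear ramps are assumptions.

```python
import numpy as np

def smooth_vec1_tail(buf, lag1, region):
    # buf is a float array of history with vec1 = buf[-lag1:]; region is the
    # smoothing length (5 * samplerate / 8000 samples in the experiments below).
    o1 = buf[-lag1 - region:-lag1]             # piece ending where vec1 begins
    o2 = buf[-region:]                         # latest samples of vec1
    up = np.linspace(0.0, 1.0, region)         # left half of a triangular window
    buf[-region:] = o1 * up + o2 * (1.0 - up)  # overlap-add updates the o2 region
    return buf
```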
Step 8: generating one pitch period of data. Each time the PLC algorithm is called, new data are synthesized sequentially according to the pitch period lengths calculated in step 6. The synthesis preferably uses waveform interpolation, as follows:
● The lengths of vec1 and vec2 are adjusted to the target length N (the length of the pitch period to be synthesized). The adjustment is as follows: let the original length be L and the target length be N. When L equals N, no adjustment is needed. When L differs from N, the adjustment cannot be done by resampling, which would alter the pitch of the original voice; instead, stretching or compression is performed by applying cosine windows. Referring to fig. 6, the cosine window (a raised cosine) of length M is of the form:

w(n) = 0.5 − 0.5·cos(2πn/(M − 1)), n = 0, 1, …, M − 1

with M = 2×L for stretching and M = 2×N for compression.
Stretching (when L is less than N): a cosine window of length 2×L is generated; the left (right) half of the window is aligned with the left (right) boundary of the original sequence and applied to it, giving data p1 (p2). When synthesizing, the left boundary of p1 is aligned with the left boundary of the target sequence (fig. 7, left a), the right boundary of p2 is aligned with the right boundary of the target sequence (fig. 7, left b), and p1 and p2 are then added point by point (fig. 7, left c) to obtain the final synthesized speech.
Compression (when L is greater than N): a cosine window of length 2×N is generated; the left (right) half of the window is aligned with the left (right) boundary of the original sequence and applied to it, giving data p1 (p2). When synthesizing, the left boundary of p1 is aligned with the left boundary of the target sequence (fig. 7, right a), the right boundary of p2 is aligned with the right boundary of the target sequence (fig. 7, right b), and p1 and p2 are then added point by point (fig. 7, right c) to obtain the final synthesized speech.
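A sketch of this length adjustment under the assumptions above (raised-cosine half windows, oriented so that each copy keeps full weight at the boundary it anchors; for stretching it further assumes N < 2×L so the two windowed copies overlap):

```python
import numpy as np

def adjust_length(seq, target):
    # Stretch or compress seq to `target` samples with half cosine windows.
    seq = np.asarray(seq, dtype=float)
    L, N = len(seq), int(target)
    if L == N:
        return seq.copy()
    half = min(L, N)                       # half-window length: L stretch, N compress
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(2 * half) / (2 * half - 1))
    rise, fall = w[:half], w[half:]        # rising and falling half cosine windows
    p1, p2 = np.zeros(N), np.zeros(N)
    if L < N:                              # stretch: window the whole sequence
        p1[:L] = seq * fall                # left-aligned copy, fading out
        p2[-L:] = seq * rise               # right-aligned copy, fading in
    else:                                  # compress: window N-sample edge segments
        p1[:] = seq[:N] * fall             # left boundary of seq, fading out
        p2[:] = seq[-N:] * rise            # right boundary of seq, fading in
    return p1 + p2
```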
● After vec1 and vec2 have been adjusted to the target length as described above, the respective weights w1 and w2 are calculated. The weights are based on the distances from the position of the speech to be synthesized to vec1 and to vec2: the closer the synthesized speech is to a sequence, the closer its waveform should be to that sequence. When the number of pitch periods to be inserted (from step 6) is numPitch and the data of the i-th (i = 1, 2, …, numPitch) pitch period is being synthesized, the corresponding weights are w2 = i/(numPitch + 1) and w1 = 1 − w2.
● Synthesize speech one pitch period long: vec1 is multiplied by weight w1, vec2 is multiplied by weight w2, and the two are added to obtain the final synthesized speech. When packet loss compensation continues, only step 8 is repeated, generating the numPitch synthesized segments in sequence.
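Putting the pieces together, a sketch of the interpolation loop (batched here for illustration — the actual algorithm synthesizes one period per call; adjust_length() is the helper sketched above):

```python
import numpy as np

def synthesize_periods(vec1, vec2, lengths):
    # Step 8: waveform interpolation between the smoothed vec1 and vec2;
    # `lengths` is the per-period fill plan from step 6.
    out = []
    num_pitch = len(lengths)
    for i, n in enumerate(lengths, start=1):
        w2 = i / (num_pitch + 1)          # nearer vec2 -> heavier vec2 weight
        w1 = 1.0 - w2
        a = adjust_length(np.asarray(vec1, dtype=float), n)
        b = adjust_length(np.asarray(vec2, dtype=float), n)
        out.append(w1 * a + w2 * b)       # one pitch period of synthesized speech
    return np.concatenate(out)
```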
Referring to fig. 8 to 11, experimental data are shown for packet loss compensation when the data before and after the loss are strongly correlated; the experimental conditions are a packet length of 20 ms and a packet loss rate of 5%. As shown in fig. 8 (middle part), when a packet is lost and the algorithm detects the loss, packet loss compensation is performed according to the following steps:
(1) 30 ms of data are taken from the BUFFER and down-sampled to 4 kHz, and 50 autocorrelation values are computed for the down-sampled data using formula 4-1, with N taken as 60 sampling points (60 samples correspond to the maximum pitch period length at a 4 kHz sampling rate). The three largest values are found; their k values correspond to the numbers of sampling points of the roughly estimated pitch periods. The three k values are mapped to the corresponding positions k' at the original sampling rate, and autocorrelation values are preferably computed with formula 4-1 in the vicinity of k' (within 0.5 ms to the left and right); the position of the maximum in this range is denoted k''. The largest of the three candidate k'' values is taken as the final pitch period, denoted Lag1, and the voice sequence vec1 is taken out by tracing back Lag1 samples from the latest sample in the BUFFER.
(2) When the next packet after the lost packet is available, the new pitch period Lag2 is found for the data in that packet according to the method of step 2 above, preferably again computing autocorrelation values with formula 4-1, with N equal to Lag1.
(3) According to step 3, vec1 and vec2 are phase-aligned: vec1 is slid point by point over Lag2 positions in the new data, and at each position the cross-correlation between vec1 and the corresponding segment is calculated. The Lag2 cross-correlation values are computed with formula 4-2, where N = MIN(Lag1, LEN − Lag2) and LEN is the data packet length. The largest of the Lag2 cross-correlation values is then found, and Lag2 sample points taken from the corresponding position form the final sequence vec2.
(4) The normalized cross-correlation value bestCorr of vec1 and vec2 is calculated as described in step 4 above, with N = MIN(Lag1, LEN − Lag2) and k = 0. Based on test experience, the comparison threshold is 0.6: when bestCorr is greater than 0.6, the correlation between the two sequences is considered strong; otherwise it is considered weak.
(5) The number of pitch periods to be filled is calculated according to step 5, and it is judged whether the previously determined pitch period length is reasonable.
(6) The length of each pitch to be inserted is calculated as described above in step 6.
(7) The sequences are smoothed according to the method of step 7, with a smoothing region length of 5 × samplerate/8000 samples, where samplerate is the sampling rate of the data.
(8) Whenever data must be synthesized, one pitch period of data is synthesized according to step 8. The data synthesized by the above steps are shown in fig. 10; it can be seen that the waveform compensated by this algorithm is closer to the original waveform than that of fig. 9.
Third embodiment
Referring to fig. 13, a block diagram of a system for compensating for a voice packet loss in the third embodiment is shown, where the system 13 includes:
a first voice sequence module 131, configured to generate a first voice sequence vec1 according to historical voice data before a lost voice packet when the lost voice packet is found for the first time;
a second voice sequence module 133, configured to generate a second voice sequence vec2 according to the next packet of voice data after the missing voice packet if the next packet of voice data after the missing voice packet can be obtained;
a compensation module 135, configured to perform packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2.
It should be noted that the compensation module 135 is further configured to perform packet loss compensation according to the first voice sequence vec1 if the next packet of voice data cannot be obtained.
It should be noted that the first voice sequence module 131 is specifically configured to perform rough pitch estimation on the historical voice data to obtain a rough pitch estimation result; according to the rough pitch estimation result, perform a fine pitch search to determine a first pitch period Lag1; and generate the first voice sequence vec1 from the historical voice data according to the first pitch period Lag1.
It should be noted that the second voice sequence module 133 is specifically configured to: find a second voice sequence vec2, phase-aligned with the first voice sequence vec1, in the next packet of voice data after the missing voice packet.
It should be noted that the second voice sequence module 133 specifically includes: a first unit, configured to perform a fine pitch search in the next packet of voice data according to the first pitch period Lag1, and determine a second pitch period Lag2 of the next packet of voice data; a second unit, configured to find the position in the next packet of voice data with the strongest correlation with the first voice sequence vec1; and a third unit, configured to determine the second voice sequence vec2 from the next packet of voice data according to the position with the strongest correlation and the second pitch period Lag2.
It should be noted that the second voice sequence module further includes: a fourth unit, configured to calculate a normalized cross-correlation value between the first voice sequence vec1 and the second voice sequence vec2; and a fifth unit, configured to compare the normalized cross-correlation value with a set threshold value; if the normalized cross-correlation value is larger than the set threshold value, trigger the compensation module to perform packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2; and if the normalized cross-correlation value is smaller than or equal to the set threshold value, trigger the compensation module to perform packet loss compensation according to the first voice sequence vec1.
It should be noted that the compensation module 135 specifically includes: a sixth unit, configured to calculate the number of pitch periods of data to be filled according to the first voice sequence vec1 and the second voice sequence vec2; a seventh unit, configured to calculate the length of each pitch period of data to be filled according to the number of pitch periods to be filled; an eighth unit, configured to smooth the first voice sequence vec1 and the second voice sequence vec2 respectively; and a ninth unit, configured to synthesize a voice sequence one pitch period long from the smoothed first voice sequence vec1 and second voice sequence vec2 (for example, by waveform interpolation) according to the length of each pitch period of data. It should be noted that the sixth unit is specifically configured to: calculate the number numPitch of pitch periods of data to be filled from the length d of the lost voice packet, the length rd determined after the phases of the first voice sequence vec1 and the second voice sequence vec2 are aligned, the first pitch period Lag1 and the second pitch period Lag2, using the following formula: numPitch = (rd + (avgLag/2))/avgLag, where avgLag denotes the average pitch period length and avgLag = (Lag1 + Lag2)/2.
It should be noted that the compensation module 135 further includes: a tenth unit, configured to calculate the ratio, to the average pitch period length avgLag, of the difference between the length obtained by filling numPitch pitch periods of length avgLag and the actual length to be filled; judge whether pitch doubling exists according to the ratio; if yes, trigger the second voice sequence module to regenerate the second voice sequence vec2; and if not, trigger the seventh unit to calculate the length of each pitch period of data to be filled according to the number of pitch periods to be filled.
It should be noted that the ninth unit is further configured to: determine the synthesis weight w1 of the first voice sequence vec1 and the synthesis weight w2 of the second voice sequence vec2 according to the distances from the position of the voice to be synthesized to the first voice sequence vec1 and to the second voice sequence vec2, respectively; and synthesize a voice sequence one pitch period long according to the first voice sequence vec1 and its weight w1 and the second voice sequence vec2 and its weight w2. Specifically, a voice sequence one pitch period long is synthesized from the smoothed first voice sequence vec1 and second voice sequence vec2 (for example, by waveform interpolation). If the missing voice data packet has not arrived after the synthesized voice sequence of one pitch period is filled in, further voice sequences are synthesized in turn using the smoothed first voice sequence vec1 and second voice sequence vec2 (for example, by waveform interpolation); each time one pitch period of data has been synthesized, it is checked whether the missing data packet has arrived; if so, smooth concatenation is performed, and otherwise compensation continues with the above method.
With reference to fig. 12, the packet loss compensation system is applied in a play thread that plays 10 ms of voice data every 10 ms. A voice packet buffer is provided in the system for buffering received voice packets that have not yet been played. Whether the current packet is lost is judged from the relation between timestamps: WantedTS denotes the timestamp of the data packet currently required for playing, and CurrTS denotes the smallest valid timestamp present in the voice packet buffer. When WantedTS equals CurrTS, the currently required voice packet exists; its data are decoded and the state updated. In this case it is also judged whether the previous operation was packet loss compensation: if so, the position of strongest correlation between the new data and the historical data (including compensated data) is determined, smooth splicing is performed at that position to ensure a smooth transition at the boundary, and the processed data are copied to the play buffer for playing; if the previous operation was not packet loss compensation, the decoded data are copied directly to the play buffer for playing. When WantedTS is smaller than CurrTS, the currently required data packet is absent from the voice packet buffer — it may be genuinely lost, or merely delayed beyond its scheduled time. Either way, because playback is real-time, the packet loss compensation algorithm is called, compensating one pitch period of data per call. After each compensation it is judged whether the voice buffer holds enough data for 10 milliseconds; if so, playing proceeds, and otherwise the program jumps back to the beginning to check by timestamp whether the required voice packet has arrived. If it has, a smooth transition is made using the delayed packet as described above; if not, packet loss compensation continues until 10 milliseconds of data in the voice buffer can be played.
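A sketch of this play-thread flow; the jitter_buffer, player and plc objects are assumed interfaces invented for illustration, not APIs from the document.

```python
def play_loop(jitter_buffer, player, plc, frame_ms=10):
    # Plays 10 ms of audio per step; when the wanted packet is absent it
    # compensates one pitch period at a time, re-checking for late arrivals.
    wanted_ts = jitter_buffer.min_valid_timestamp()      # initial WantedTS
    last_was_plc = False
    while True:
        curr_ts = jitter_buffer.min_valid_timestamp()    # CurrTS
        if wanted_ts == curr_ts:                         # wanted packet present
            pcm = jitter_buffer.pop_and_decode(wanted_ts)
            if last_was_plc:                             # splice at the position
                pcm = plc.splice_at_best_correlation(pcm)  # of strongest correlation
            player.write(pcm)
            last_was_plc = False
            wanted_ts += frame_ms
        else:                                            # lost or late packet
            plc.compensate_one_pitch_period()
            if plc.buffered_ms() >= frame_ms:            # enough for 10 ms?
                player.write(plc.take(frame_ms))
                last_was_plc = True
                wanted_ts += frame_ms
            # else: loop back and re-check whether the wanted packet arrived
```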
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the preferred embodiments of the present invention have been described, it should be understood that modifications and adaptations to those embodiments may occur to one skilled in the art without departing from the principles of the present invention and are within the scope of the present invention.

Claims (14)

1. A method for compensating for voice packet loss, the method comprising:
when a lost voice packet is found for the first time, generating a first voice sequence vec1 according to historical voice data before the lost voice packet;
if the next packet of voice data after the voice packet loss can be acquired, generating a second voice sequence vec2 according to the next packet of voice data after the voice packet loss; performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2;
generating a first voice sequence according to historical voice data before the voice packet is lost, specifically comprising:
carrying out rough pitch estimation on the historical voice data to obtain a rough pitch estimation result;
according to the rough pitch estimation result, performing a fine pitch search to determine a first pitch period Lag1;
generating the first voice sequence vec1 from the historical voice data according to the first pitch period Lag1;
the generating a second voice sequence vec2 according to the next packet of voice data after the voice packet is lost specifically includes:
finding a second voice sequence vec2 phase-aligned with said first voice sequence vec1 from the next packet of voice data after the missing voice packet;
the finding out a second voice sequence vec2 phase-aligned with the first voice sequence vec1 from the next packet of voice data after the missing voice packet specifically includes:
according to the first pitch period Lag1, performing a fine pitch search in the next packet of voice data to determine a second pitch period Lag2 of the next packet of voice data;
finding the position in the next packet of voice data with the strongest correlation with the first voice sequence vec1;
and determining the second voice sequence vec2 from the next packet of voice data according to the position with the strongest correlation and the second pitch period Lag2.
2. The method of claim 1, wherein if the next packet of voice data cannot be obtained, the method further comprises:
and performing packet loss compensation according to the first voice sequence vec1.
3. The method according to claim 1, characterized in that before performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2, the method further comprises:
calculating a normalized cross-correlation value of the first voice sequence vec1 and the second voice sequence vec2;
comparing the normalized cross-correlation value with a set threshold;
if the normalized cross-correlation value is greater than the set threshold, performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2;
if the normalized cross-correlation value is less than or equal to the set threshold, performing packet loss compensation according to the first voice sequence vec1 only.
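A sketch of the similarity gate in claim 3 follows. The threshold value 0.6 and the function names are illustrative assumptions; the claim only requires comparison against "a set threshold".

import numpy as np

def normalized_xcorr(vec1, vec2):
    # Normalized cross-correlation in [-1, 1] over the common length.
    n = min(len(vec1), len(vec2))
    a = np.asarray(vec1[:n], dtype=float)
    b = np.asarray(vec2[:n], dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def compensation_mode(vec1, vec2, threshold=0.6):
    # Two-sided interpolation only when vec1 and vec2 are similar enough;
    # otherwise fall back to one-sided extrapolation from vec1 alone.
    return "two_sided" if normalized_xcorr(vec1, vec2) > threshold else "one_sided"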
4. The method according to claim 3, wherein performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2 specifically comprises:
calculating the number of pitch-period segments to be filled according to the first voice sequence vec1 and the second voice sequence vec2;
calculating the length of each segment to be filled according to that number;
smoothing the first voice sequence vec1 and the second voice sequence vec2 respectively;
synthesizing a voice sequence of one pitch-period length from the smoothed first voice sequence vec1 and second voice sequence vec2 according to the length of each segment.
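Claim 4 does not fix a smoothing method. As one plausible reading, a short moving-average filter applied to each sequence before synthesis would look like this; the 3-tap length is an assumption.

import numpy as np

def smooth(seq, taps=3):
    # Light low-pass smoothing before synthesis; a short moving average
    # is one simple choice, not mandated by the claim.
    kernel = np.ones(taps) / taps
    return np.convolve(np.asarray(seq, dtype=float), kernel, mode="same")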
5. The method according to claim 4, wherein calculating the number of pitch-period segments to be filled according to the first voice sequence vec1 and the second voice sequence vec2 specifically comprises:
calculating the number numPitch of pitch-period segments to be filled from the length d of the lost voice packet, the length rd determined after the phases of the first voice sequence vec1 and the second voice sequence vec2 are aligned, the first pitch period Lag1 and the second pitch period Lag2, by using the following formula:
numPitch = (rd + (avgLag/2)) / avgLag, where avgLag denotes the average pitch period length and avgLag = (Lag1 + Lag2)/2.
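A worked instance of the claim 5 formula, with illustrative numbers (none of the values below come from the patent). Note that adding avgLag/2 before the integer division rounds to the nearest whole number of pitch periods instead of truncating.

# Illustrative values: Lag1 = 78 and Lag2 = 82 samples, and rd = 320
# samples remaining to fill after phase alignment of vec1 and vec2.
lag1, lag2, rd = 78, 82, 320

avg_lag = (lag1 + lag2) // 2                 # avgLag = (78 + 82)/2 = 80
num_pitch = (rd + avg_lag // 2) // avg_lag   # (320 + 40) // 80 = 4

print(num_pitch)  # 4 pitch-period segments to fill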
6. The method according to claim 5, wherein performing packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2 further comprises:
calculating the ratio of the difference between the length obtained by filling numPitch segments of pitch period length avgLag and the actual length to be filled, to the average pitch period length avgLag;
judging from this ratio whether pitch doubling exists;
if so, regenerating the second voice sequence vec2 according to the corrected pitch;
if not, calculating the length of each segment to be filled according to the number of segments to be filled.
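The pitch-doubling test of claim 6 compares the length the segments would fill against the length actually required. A sketch follows, under the assumption that a large relative mismatch triggers the correction; the tolerance value 0.3 is illustrative.

def pitch_doubling_suspected(rd, num_pitch, avg_lag, tol=0.3):
    # Ratio of (filled length - required length) to the average pitch
    # period; a large mismatch suggests the estimator locked onto a
    # multiple of the true pitch, so vec2 should be regenerated.
    ratio = abs(num_pitch * avg_lag - rd) / avg_lag
    return ratio > tol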
7. The method according to claim 5, wherein synthesizing a voice sequence of one pitch-period length using the smoothed first voice sequence vec1 and second voice sequence vec2 according to the length of each segment specifically comprises:
determining the weight w1 of the first voice sequence vec1 and the weight w2 of the second voice sequence vec2 used in synthesis according to the distance from the position of the voice to be synthesized to the first voice sequence vec1 and to the second voice sequence vec2, respectively;
synthesizing the smoothed first voice sequence vec1 and second voice sequence vec2 into a voice sequence of one pitch-period length according to the first voice sequence vec1 with its weight w1 and the second voice sequence vec2 with its weight w2; if the lost voice packet has still not arrived after the synthesized sequence is filled in, further voice sequences are synthesized in turn from the smoothed vec1 and vec2 in the same way, and after each pitch-period length of data is synthesized, whether the lost packet has arrived is checked.
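Claim 7's distance-weighted synthesis amounts to a gradual cross-fade from vec1 toward vec2 across the gap. The linear weighting and the helper names below are assumptions; the claim only requires that w1 and w2 depend on distance and that arrival of the late packet is checked after each synthesized pitch period.

import numpy as np

def synth_period(vec1, vec2, pos, total, length):
    # Weight w2 grows as the synthesis position approaches vec2 (the
    # packet after the gap); w1 is its complement, so w1 + w2 = 1.
    w2 = pos / max(total, 1)
    w1 = 1.0 - w2
    a = np.resize(np.asarray(vec1, dtype=float), length)  # tile to length
    b = np.resize(np.asarray(vec2, dtype=float), length)
    return w1 * a + w2 * b

def fill_gap(vec1, vec2, num_pitch, seg_len, packet_arrived=lambda: False):
    # Synthesize one pitch period at a time; stop early if the lost
    # packet finally arrives, as the claim requires.
    out = []
    for i in range(num_pitch):
        if packet_arrived():
            break
        out.append(synth_period(vec1, vec2, i + 1, num_pitch + 1, seg_len))
    return np.concatenate(out) if out else np.zeros(0)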
8. A system for voice packet loss compensation, the system comprising:
a first voice sequence module, configured to generate a first voice sequence vec1 according to historical voice data before a lost voice packet when the lost voice packet is first detected;
a second voice sequence module, configured to generate a second voice sequence vec2 according to the next packet of voice data after the lost voice packet if that packet can be obtained;
a compensation module, configured to perform packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2;
wherein the first voice sequence module is specifically configured to: perform coarse pitch estimation on the historical voice data to obtain a coarse pitch estimate; perform a fine pitch search according to the coarse pitch estimate to determine a first pitch period Lag1; and generate the first voice sequence vec1 from the historical voice data according to the first pitch period Lag1;
the second voice sequence module is specifically configured to: find, in the next packet of voice data after the lost voice packet, a second voice sequence vec2 that is phase-aligned with the first voice sequence vec1;
the second voice sequence module specifically comprises:
a first unit, configured to perform a fine pitch search in the next packet of voice data according to the first pitch period Lag1 and determine a second pitch period Lag2 of the next packet of voice data;
a second unit, configured to find the position in the next packet of voice data having the strongest correlation with the first voice sequence vec1;
a third unit, configured to determine the second voice sequence vec2 from the next packet of voice data according to the position with the strongest correlation and the second pitch period Lag2.
9. The system according to claim 8, wherein the compensation module is further configured to perform packet loss compensation according to the first voice sequence vec1 only if the next packet of voice data cannot be obtained.
10. The system of claim 8, wherein the second voice sequence module further comprises:
a fourth unit, configured to calculate a normalized cross-correlation value between the first voice sequence vec1 and the second voice sequence vec2;
a fifth unit, configured to compare the normalized cross-correlation value with a set threshold; if the normalized cross-correlation value is greater than the set threshold, trigger the compensation module to perform packet loss compensation according to the first voice sequence vec1 and the second voice sequence vec2; and if the normalized cross-correlation value is less than or equal to the set threshold, trigger the compensation module to perform packet loss compensation according to the first voice sequence vec1 only.
11. The system according to claim 8, wherein the compensation module specifically comprises:
a sixth unit, configured to calculate the number of pitch-period segments to be filled according to the first voice sequence vec1 and the second voice sequence vec2;
a seventh unit, configured to calculate the length of each segment to be filled according to that number;
an eighth unit, configured to smooth the first voice sequence vec1 and the second voice sequence vec2 respectively;
a ninth unit, configured to synthesize a voice sequence of one pitch-period length from the smoothed first voice sequence vec1 and second voice sequence vec2 according to the length of each segment.
12. The system according to claim 11, wherein the sixth unit is specifically configured to: calculate the number numPitch of pitch-period segments to be filled from the length d of the lost voice packet, the length rd determined after the phases of the first voice sequence vec1 and the second voice sequence vec2 are aligned, the first pitch period Lag1 and the second pitch period Lag2, by using the following formula: numPitch = (rd + (avgLag/2)) / avgLag, where avgLag denotes the average pitch period length and avgLag = (Lag1 + Lag2)/2.
13. The system of claim 12, wherein the compensation module further comprises:
a tenth unit, configured to calculate the ratio of the difference between the length obtained by filling numPitch segments of pitch period length avgLag and the actual length to be filled, to the average pitch period length avgLag; judge from this ratio whether pitch doubling exists; if so, trigger the second voice sequence module to regenerate the second voice sequence vec2 according to the corrected pitch; and if not, trigger the seventh unit to calculate the length of each segment to be filled according to the number of segments to be filled.
14. The system of claim 12, wherein the ninth unit is further configured to: determine the weight w1 of the first voice sequence vec1 and the weight w2 of the second voice sequence vec2 used in synthesis according to the distance from the position of the voice to be synthesized to the first voice sequence vec1 and to the second voice sequence vec2, respectively; and synthesize the smoothed first voice sequence vec1 and second voice sequence vec2 into a voice sequence of one pitch-period length according to the first voice sequence vec1 with its weight w1 and the second voice sequence vec2 with its weight w2, wherein if the lost voice packet has still not arrived after the synthesized sequence is filled in, further voice sequences are synthesized in turn from the smoothed vec1 and vec2 in the same way, and after each pitch-period length of data is synthesized, whether the lost packet has arrived is checked.
CN201510802586.8A 2015-11-19 2015-11-19 Method and system for compensating voice packet loss Active CN106788876B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510802586.8A CN106788876B (en) 2015-11-19 2015-11-19 Method and system for compensating voice packet loss
PCT/CN2016/105684 WO2017084545A1 (en) 2015-11-19 2016-11-14 Method and system for voice packet loss concealment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510802586.8A CN106788876B (en) 2015-11-19 2015-11-19 Method and system for compensating voice packet loss

Publications (2)

Publication Number Publication Date
CN106788876A CN106788876A (en) 2017-05-31
CN106788876B true CN106788876B (en) 2020-01-21

Family

ID=58717321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510802586.8A Active CN106788876B (en) 2015-11-19 2015-11-19 Method and system for compensating voice packet loss

Country Status (2)

Country Link
CN (1) CN106788876B (en)
WO (1) WO2017084545A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110148401B (en) * 2019-07-02 2023-12-15 腾讯科技(深圳)有限公司 Speech recognition method, device, computer equipment and storage medium
CN110661716B (en) * 2019-09-16 2022-07-19 锐捷网络股份有限公司 Network packet loss notification method, monitoring device, switch and storage medium
CN112669858A (en) * 2019-10-14 2021-04-16 上海华为技术有限公司 Data processing method and related device
CN111326166B (en) * 2020-02-25 2023-04-14 网易(杭州)网络有限公司 Voice processing method and device, computer readable storage medium and electronic equipment
CN111554309A (en) * 2020-05-15 2020-08-18 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium
CN112634912B (en) * 2020-12-18 2024-04-09 北京猿力未来科技有限公司 Packet loss compensation method and device
CN113763974B (en) * 2021-08-31 2023-05-16 易兆微电子(杭州)股份有限公司 Packet loss compensation method and device, electronic equipment and storage medium
CN114387989B (en) * 2022-03-23 2022-07-01 北京汇金春华科技有限公司 Voice signal processing method, device, system and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012158159A1 (en) * 2011-05-16 2012-11-22 Google Inc. Packet loss concealment for audio codec

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100571314C * 2006-04-18 2009-12-16 华为技术有限公司 Method for compensating lost speech service data frames
CN101364854B (en) * 2007-08-10 2011-01-26 北京理工大学 Dropped voice packet recovery technique based on edge information
CN103489448A (en) * 2013-09-03 2014-01-01 广州日滨科技发展有限公司 Processing method and system of voice data
CN104751851B * 2013-12-30 2018-04-27 联芯科技有限公司 Frame loss error concealment method and system based on forward-backward joint estimation
CN104978966B * 2014-04-04 2019-08-06 腾讯科技(深圳)有限公司 Frame loss compensation implementation method and device in an audio stream

Also Published As

Publication number Publication date
CN106788876A (en) 2017-05-31
WO2017084545A1 (en) 2017-05-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant