WO2002095731A1

WO2002095731A1 - Voice signal processor

Info

Publication number: WO2002095731A1
Application number: PCT/JP2001/004266
Authority: WO
Inventors: Yasutaka Kanayama; Teruyuki Sato
Original assignee: Fujitsu Limited
Priority date: 2001-05-22
Filing date: 2001-05-22
Publication date: 2002-11-28
Also published as: JPWO2002095731A1; JP4426186B2

Abstract

An auto-correlation value is determined approximately over a cycle with regard to a voice waveform generated from linear PCM data, and a pitch cycle of the voice waveform is extracted on the basis of the auto-correlation value. In the vicinity of a sampled value of an objective voice waveform, the voice waveform from a cycle before the sampled value of the objective voice waveform is used as a predicted waveform to determine the correlation value between the actual voice waveform and the predicted waveform, and a discontinuous point in the actual voice waveform, if any, is detected from the magnitude of the correlation between the predicted waveform and the actual voice waveform. If a discontinuous point is detected, a corrected voice waveform which is similar to the predicted waveform in the vicinity of the discontinuous point and gradually approaches the actual waveform is formed by the interpolation between the predicted waveform and the actual voice waveform.

Description

Audio signal processing device

Technical field

The present invention relates to an audio signal processing device for digital audio data such as linear PCM audio data in a communication network or a terminal.

Background art

In today's information and communication society, various types of information are exchanged via networks, and the data handled has become far larger and more diverse than it was a short time ago. This trend is expected to continue.

Networks must cope with this ever-increasing amount of information, and keywords such as “broadband” and “IP” have recently come into frequent use.

"Broadbanding" means increasing the transmission capacity of the communication path so that huge amounts of data can be transmitted quickly, and "moving to IP" means sending data in units of IP packets. Since packet switching does not occupy the line, it allows a pay-as-you-go service based on the amount of data, and it is a very important method today for dealing with huge amounts of data.

Voice information, on the other hand, is at present transmitted by circuit switching, and charges are based on the time for which the line is occupied. Since the line is occupied, very high quality is required during that period, and the quality actually achieved is reasonably high.

Given the current trend, voice is also expected to be transmitted using IP packets, and a service called “VoIP” is expected to start in the near future. In other words, voice data will be exchanged by packet switching like other data.

In that case, since audio data is very small compared with other kinds of data, it is expected to be transmitted without particular compression, in the G.711 PCM format used in current ATM networks.

However, IP packet switching is a transmission method suited to data for which a packet can be retransmitted if an error occurs, and some quality degradation is expected for real-time information that cannot be retransmitted, such as voice data.

It is well known that discontinuities in the audio waveform caused by such degradation can greatly reduce auditory quality. Discontinuities in audio waveforms, however, arise from a variety of causes.

For example, audio codecs used in recent mobile communications mainly use the CELP system, which processes linear PCM data in frame units. Parameters such as spectrum envelope information and sound source information are extracted from each frame, enabling encoding at a high compression rate. However, when data encoded in frame units is decoded, discontinuities are likely to occur at the boundaries between frames. To avoid such discontinuities, typical parameters (such as the pitch cycle) are used, and weighting is applied to interpolate the audio waveform near the frame boundary.

In addition, a method of performing a filtering process to improve auditory sound quality is known. Discontinuous points also occur due to the loss of coded data frames (packets) in the wireless section and due to data errors. In that case, an external check is used to signal the occurrence of the error, and processing such as lowering the level of the audio data is performed to reduce the deterioration in auditory quality. Examples of such techniques are Japanese Patent Application Laid-Open Nos. 7-106566-1995 and 6-326266-2.

When interpolation is performed at a frame boundary as described above, or when a data error occurs, the processing is performed knowing in advance where a discontinuity has occurred or may occur. It is mainly performed together with speech coding and decoding. However, in an ATM network or IP network that transmits PCM data in packet units, if a packet is lost or an unknown bit error occurs, the resulting discontinuity is not checked at any point and is passed on as a cause of quality degradation.

In particular, since the packet transmission route in an IP network is variable, a packet issued later may, depending on the routing conditions, overtake a packet issued earlier, and in that case discontinuities arise.

FIG. 1 is a diagram showing a state of packet routing in an IP network. The figure shows a case where three packets are transmitted in order. Even though the second and third packets are transmitted sequentially after the first packet, the second packet reaches the VoIP router 1 only after passing through the VoIP router 2. The third packet, on the other hand, is transmitted directly to the VoIP router 1, so the third packet, although transmitted later, overtakes the second packet and arrives at the destination first.

In the mobile communication network for IMT-2000, the use of a method called TFO (Tandem Free Operation) is being studied for connections between terminals. The purpose of this method is to avoid the quality deterioration caused by tandem connection. When switching from a tandem connection to TFO or vice versa, discontinuities may arise from the system itself, and no technology exists to detect and correct them.

Disclosure of the invention

An object of the present invention is to provide an audio signal processing device that examines digital audio data to detect discontinuous points of an audio waveform that occur at unspecified positions and compensates for the quality degradation they cause. In particular, an object of the present invention is to provide an audio signal processing device that checks linear PCM data, detects a discontinuous point, immediately corrects the portion judged to be discontinuous, and thereby avoids deterioration in auditory quality.

An audio signal processing device according to the present invention is an audio signal processing device for processing digital audio data in a communication network, comprising: waveform prediction means for detecting a period of an input waveform and predicting, from the period, a waveform to be received; discontinuous point detecting means for detecting a discontinuous point of the waveform from a correlation value between the predicted waveform and the actually received waveform; and correction waveform generating means for generating, when the discontinuous point is detected, a correction waveform having no discontinuous points by using the predicted waveform and the actually received waveform.

According to the present invention, the presence or absence of a discontinuity is detected by directly examining the received waveform. Therefore, even if a discontinuity occurs due to an unpredictable cause, the discontinuity can be found and a corrected waveform can be generated. It is thus possible to compensate for the deterioration in voice quality not only when a discontinuity occurs at a position predictable from the system configuration, such as a joint between frames, but also when a discontinuity occurs at an arbitrary position in the waveform.

As a result, according to the present invention, high-quality voice communication can be provided even when voice is transmitted and received via a communication network based on the packet switching system.

Brief description of the figures

FIG. 1 is a diagram showing a state of packet routing in an IP network.

FIG. 2 is a diagram (part 1) for explaining the principle of the embodiment of the present invention.

FIG. 3 is a diagram (part 2) for explaining the principle of the embodiment of the present invention.

FIG. 4 is a diagram (part 3) for explaining the principle of the embodiment of the present invention.

FIG. 5 is a diagram (part 4) for explaining the principle of the embodiment of the present invention.

FIG. 6 is a processing block diagram of the audio signal processing device according to the embodiment of the present invention.

FIG. 7 is a diagram showing an overall processing flow of the audio signal processing device according to the embodiment of the present invention.

FIG. 8 is a diagram illustrating an application portion and a network of the audio signal processing device according to the embodiment of the present invention.

Best mode for carrying out the invention

The present invention comprises means for calculating a period from past input data, means for predicting a future voice waveform from the obtained period, means for comparing the predicted waveform with the actual waveform to determine whether correction is necessary, and means for correcting the waveform, for example by weighting, at a discontinuous point that requires correction.

FIGS. 2 to 5 are diagrams illustrating the principle of the embodiment of the present invention.

When a speech waveform is observed, it is known that in voiced parts a similar waveform appears repeatedly with a certain period. This is called the pitch, and it is an important parameter used in recent speech coding methods as one of the parameters for high compression of speech. In the embodiment of the present invention, this pitch period is used for correcting a target speech waveform. FIG. 2 shows an example of a speech waveform, where k corresponds to the pitch period.

The pitch period can be extracted by a method such as calculating an autocorrelation coefficient. If the autocorrelation value is reasonably high, the future waveform (predicted waveform) can be predicted within a certain error range by using the pitch period. In FIG. 2, if the pitch period is determined to be k, the predicted waveform can be obtained by using the value of the linear PCM data k samples earlier as the current value.
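The prediction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `predict_from_pitch` and the toy signal are hypothetical, and the pitch period k is assumed to be already known.

```python
def predict_from_pitch(samples, k, start, length):
    """Predict samples[start:start + length] from the waveform k samples earlier.

    k is the pitch period in samples (assumed to be known already).
    """
    if k <= 0 or start - k < 0:
        raise ValueError("need at least k past samples before start")
    # The predicted value for index i is simply the value at index i - k.
    return [samples[i - k] for i in range(start, start + length)]


# A perfectly periodic toy signal with period 4 (hypothetical data).
signal = [0, 3, 0, -3] * 5
predicted = predict_from_pitch(signal, k=4, start=8, length=4)
print(predicted)  # [0, 3, 0, -3]
```

For a truly periodic signal the prediction equals the actual waveform, so any large deviation between the two is a candidate discontinuity.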

If the pitch period appears properly in the normal voice waveform as shown in Fig. 2, the actual waveform does not deviate significantly from the predicted waveform. However, as shown in Fig. 3, if the actual waveform is significantly different from the predicted waveform, it will be a discontinuity on the audio waveform, which may result in a significant loss of audio quality. Therefore, in the present embodiment, the discontinuous point is detected by comparing the actual waveform and the predicted waveform in each sample, and the vicinity of the discontinuous point is interpolated using the predicted waveform.

As a method of comparing the actual waveform with the predicted waveform, there is a method of calculating a local correlation coefficient or the like. FIG. 4 shows an enlarged view of the vicinity of point $a_0$ in FIG. 3. In FIG. 4, whether the waveform is significantly disturbed can be checked by computing the local correlation between the neighborhood of $a_0$ ($a_{-2}, \ldots, a_2$) and the neighborhood of $b_0$ ($b_{-2}, \ldots, b_2$).

A sample determined to be a discontinuous point is corrected, and one possible correction method uses weighting. The present embodiment is described using this weighting method. FIG. 5 shows how the output is gradually moved from the predicted waveform to the actual waveform by weighting. That is, if there is a discontinuity d in the actual waveform shown by the solid line, interpolation is performed with the predicted waveform shown by the broken line. The interpolation is such that the corrected waveform, shown by the bold line, is close to the predicted waveform near the discontinuity d and gradually approaches the actual waveform.

An embodiment of the present invention will be described with reference to the drawings.

Let the linear PCM sample values be a(-i), ···, a(0), ···, a(4). Here a(0) is the sample for which it is determined whether or not to correct, and a(-1) is the sample one before it. a(1), ···, a(4) are the sample values after a(0). In the present embodiment, the calculation described below requires the sample values up to four samples after a(0), so in actual processing the values up to four samples after the sample to be corrected are read before the processing is performed.

First, in order to check whether or not the waveform of the portion including the sample a(0) has periodicity, the period detection unit 10 in FIG. 6 takes several tens of samples before a(0) (here, a segment of 40 samples) and performs the following calculation.

The several tens of samples before a(0) are past linear PCM data from the input that have been stored in the storage unit 14, and the data is read from there into the period detection unit 10. The number of samples used for detecting the periodicity is set to 40 here, but in practice it should be chosen so that about one pitch period of the audio data is available for period detection. Normally, about 40 samples are sufficient for detecting the pitch period of audio data. If the sampling frequency is different, an appropriate number of samples should be used according to that frequency.

$$S = \frac{\left( \sum_{i=1}^{40} a(-i)\, a(-i-k) \right)^{2}}{\sum_{i=1}^{40} a(-i)^{2} \cdot \sum_{i=1}^{40} a(-i-k)^{2}}, \qquad k = 1, 2, \ldots, 50$$

In this calculation, the value of k that maximizes S and the corresponding value of S are obtained. However, to prevent waveforms of opposite phase or waveforms with low power from affecting the correction, S is considered only when the expression inside the parentheses of the numerator is positive and each of the two terms multiplied together in the denominator exceeds a threshold.

The numerator itself is always positive because it is squared, but the expression inside its parentheses takes a large positive value when the two waveforms have the same shape and a negative value with a large absolute value when they are in opposite phase. If the waveforms are in opposite phase, S would therefore be large even though the waveforms do not match; to eliminate this case, the calculation is limited to cases where the expression inside the parentheses of the numerator is positive.

Each term of the denominator is an expression for the power of the voice, and requiring each term to be at least a predetermined threshold removes voice waveforms with low power. A low-power voice waveform is easily affected by noise, so the calculation could indicate by chance that the waveforms match even though the actual waveform differs from the past waveform; removing low-power waveforms avoids this case. The threshold should be determined appropriately, for example experimentally, by those skilled in the art who use the present embodiment.

Next, it is determined whether or not S exceeds a certain threshold. If it does, the waveform is judged to be periodic, and the value of the period k is fixed and sent to the prediction unit 11 in FIG. 6. If it does not, the waveform is judged not to be periodic, and the processing of the prediction unit 11, the comparison determination unit 12, and the correction unit 13 is not performed. The threshold for S should also be set appropriately by those skilled in the art through experiments and the like.
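The period detection can be sketched in Python as below. This is only an illustration under stated assumptions: S is taken to be the squared cross-correlation between the 40 most recent samples and the same segment shifted back by k, divided by the product of the two segment powers, which is how the text describes the numerator and denominator. The function name `detect_pitch` and the threshold values `power_min` and `s_min` are hypothetical, since the text leaves the thresholds to experiment.

```python
import math


def detect_pitch(history, seg_len=40, max_lag=50, power_min=1.0, s_min=0.5):
    """Return (k, S) for the lag with the largest S, or None if not periodic.

    history[-1] is a(-1), the newest past sample. power_min and s_min are
    placeholder thresholds chosen only for this example.
    """
    n = len(history)
    if n < seg_len + max_lag:
        return None
    recent = history[n - seg_len:]
    power_recent = sum(x * x for x in recent)
    best = None
    for k in range(1, max_lag + 1):
        lagged = history[n - seg_len - k:n - k]
        cross = sum(x * y for x, y in zip(recent, lagged))
        power_lagged = sum(y * y for y in lagged)
        # Skip opposite-phase candidates (cross <= 0) and low-power segments.
        if cross <= 0 or power_recent < power_min or power_lagged < power_min:
            continue
        s = cross * cross / (power_recent * power_lagged)
        if best is None or s > best[1]:
            best = (k, s)
    if best is None or best[1] < s_min:
        return None
    return best


# Hypothetical test signal: a sine wave with a period of 30 samples.
history = [1000.0 * math.sin(2 * math.pi * i / 30) for i in range(90)]
result = detect_pitch(history)
print(result[0])  # 30
```

Silence or very quiet input returns `None` because both power terms fall below `power_min`, which mirrors the low-power exclusion described above.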

The prediction unit 11 predicts that the neighborhood of a(0) looks like the neighborhood of a(-k). Here, the neighborhood of a(0) is a(-2), ···, a(4), and the neighborhood of a(-k) is a(-k-2), ···, a(-k+4). The prediction unit 11 sends the predicted waveform to the comparison determination unit 12. The predicted waveform is the waveform consisting of the samples a(-k-2) to a(-k+4) near a(-k), which is judged to be similar to the vicinity of a(0). Then the following calculation is performed on the predicted waveform (near a(-k)) and the actual waveform (near a(0)) over a short section.

The calculation here is performed on seven samples near each of a(0) and a(-k). The neighborhood is chosen to be sufficiently smaller than one cycle of the speech waveform, yet large enough to average out noise-like changes in individual samples. If too many samples are used, a local discontinuity of the waveform cannot be detected; if too few are used, the calculation becomes too sensitive to changes in sample values, so that a noise-like change in a single sample is judged to be a discontinuous point. About seven samples was judged to be appropriate. However, the present embodiment is not limited to seven samples, and the number should be determined appropriately by those skilled in the art through experiments and the like.

$$T = \frac{\left( \sum_{i=-2}^{4} a(i)\, a(i-k) \right)^{2}}{\sum_{i=-2}^{4} a(i)^{2} \cdot \sum_{i=-2}^{4} a(i-k)^{2}}$$

Next, it is determined whether or not T exceeds a certain threshold. If it does not, it is judged that the waveform is significantly disturbed at that point, and the comparison determination unit 12 issues a correction instruction to the correction unit 13. In this case as well, cases where either of the two terms multiplied together in the denominator is smaller than a certain threshold are excluded, which removes cases where the sound power is small. If the value inside the parentheses of the numerator is negative, T is replaced by -T, so that its value is negative and does not exceed the threshold. In other words, a negative value inside the parentheses of the numerator, that is, waveforms in opposite phase, is never treated as a match. Each of the above thresholds should be determined appropriately by those skilled in the art through experiments and the like.
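The comparison step can be sketched as follows, assuming T has the same form as S but is evaluated over the seven samples a(-2) to a(4) and their counterparts one period earlier. The names `local_correlation` and `needs_correction` and the thresholds `power_min` and `t_min` are hypothetical; the text leaves the thresholds to experiment.

```python
import math


def local_correlation(samples, pos, k, before=2, after=4, power_min=1.0):
    """Return T for the window around samples[pos], or None for low power.

    The window covers samples[pos - before .. pos + after]; the predicted
    window is the same range shifted back by the pitch period k.
    """
    actual = samples[pos - before:pos + after + 1]
    predicted = samples[pos - before - k:pos + after + 1 - k]
    cross = sum(x * y for x, y in zip(actual, predicted))
    power_actual = sum(x * x for x in actual)
    power_predicted = sum(y * y for y in predicted)
    if power_actual < power_min or power_predicted < power_min:
        return None  # low-power case is excluded from the decision
    t = cross * cross / (power_actual * power_predicted)
    # Opposite phase: make T negative so it can never exceed the threshold.
    return -t if cross < 0 else t


def needs_correction(samples, pos, k, t_min=0.5):
    """True when T is available and does not reach the threshold t_min."""
    t = local_correlation(samples, pos, k)
    return t is not None and t < t_min


# Hypothetical data: a period-8 wave, and a copy with seven samples inverted.
clean = [round(1000 * math.sin(2 * math.pi * i / 8)) for i in range(40)]
damaged = list(clean)
for i in range(18, 25):
    damaged[i] = -damaged[i]

print(needs_correction(clean, 20, 8))    # False
print(needs_correction(damaged, 20, 8))  # True
```

In the damaged case the window is exactly out of phase with the prediction, so the cross term is negative and T is forced below the threshold.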

Upon receiving the correction instruction, the correction unit 13 performs interpolation by weighting as described below and outputs s (the corrected audio data sample value). Once a correction instruction has been given, correction continues for n samples (the value of n should be set appropriately by those skilled in the art so that the corrected waveform is sufficiently smooth and almost coincides with the actual waveform by the end), and during this period the period detection, prediction, and comparison determination functions are stopped.

$$s = \frac{l}{n}\, a(0) + \frac{n-l}{n} \left\{ a(-k) + \frac{\mathrm{offset} \times (k-m)}{k} \right\}, \qquad l = 1, 2, \ldots, n-1, \quad m = 0, 1, \ldots, k$$

Here, offset is the value of a(-1) - a(-k-1) at the time the correction instruction is issued. It is the amount required to connect the value one cycle (k samples) earlier (the predicted waveform) smoothly to the corrected waveform when correction is performed.

If no correction instruction is given, the output is

s = a (0)
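The weighted interpolation can be sketched as below. It assumes the formula reads s = (l/n) a(0) + ((n - l)/n) {a(-k) + offset (k - m)/k}, and it further assumes that m counts the samples since the correction instruction, so m = l - 1; the text does not state how l and m are related. The name `blend_correction` and the sample data are hypothetical.

```python
def blend_correction(actual, predicted, offset, k, n):
    """Blend the predicted waveform into the actual one over n - 1 samples.

    actual[j] and predicted[j] play the roles of a(0) and a(-k) at step
    l = j + 1. Assumption for this sketch: m = l - 1.
    """
    out = []
    for j in range(min(n - 1, len(actual))):
        l = j + 1
        m = min(j, k)
        # Shift the prediction by the offset, which decays over one period.
        shifted = predicted[j] + offset * (k - m) / k
        # Same as (l/n) * actual + ((n - l)/n) * shifted.
        out.append((l * actual[j] + (n - l) * shifted) / n)
    return out


# Hypothetical data: the actual value is 100, the prediction is 0.
actual = [100.0] * 4
predicted = [0.0] * 4
print(blend_correction(actual, predicted, offset=0.0, k=10, n=5))
# [20.0, 40.0, 60.0, 80.0]
```

The weight on the actual sample grows from 1/n to (n - 1)/n, so the output starts near the prediction and ends near the received waveform, as FIG. 5 is described to show.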

After the processing of the correction unit is completed, the storage unit 14 updates the stored values as a(4) → a(3), a(3) → a(2), and in general a(i) → a(i-1). Note that s → a(-1), so the correction result is reflected in the past waveform data stored in the storage unit 14. In the configuration of FIG. 6, the linear PCM data is input sequentially from the input one sample at a time, and the latest sample value is input directly to the comparison determination unit 12 and the correction unit 13. The storage unit 14 outputs a predetermined number (for example, about 40) of past sample values preceding the latest sample value. In the above example, a(4) is input directly from the input to the comparison determination unit 12 and the correction unit 13, and a(3) to a(-40) are input from the storage unit 14 to each unit.

FIG. 7 is a flowchart showing the overall processing of the embodiment of the present invention.

First, in step S1, an autocorrelation coefficient is calculated. This corresponds to the calculation of S in the above description. Then, in step S2, it is determined whether or not there is periodicity. As described above, periodicity is judged by whether or not the value of S is greater than a predetermined threshold, and the period k is determined in the process. k is the length of one cycle of the audio waveform expressed as a number of samples. If it is determined that there is no periodicity, the process proceeds to step S7. In this case, in step S7, s = a(0), and the sample value of the audio waveform is output without any correction. Then, in step S8, one new sample value is stored in the storage unit 14 and the oldest sample value is discarded.

If it is determined in step S2 that there is periodicity, then in step S3 waveform prediction is performed, that is, the past waveform one cycle earlier is obtained as the predicted waveform, and in step S4 the current waveform is compared with the predicted waveform. The calculation in step S4 is the calculation of T described above: for a small number of samples in the vicinity of the target sample, the correlation value between the current waveform and the predicted waveform is obtained, and it is judged whether the correlation value is larger or smaller than a predetermined threshold. This processing of step S4 is referred to as "comparison". The "comparison" in step S4 thus determines whether or not there is a discontinuity in the current waveform.

Then, in step S5, it is determined whether or not the waveform needs to be corrected, according to whether the comparison in step S4 found a discontinuity in the current voice waveform. If there is no discontinuity in the voice waveform, it is determined that correction is not necessary, and the process proceeds to steps S7 and S8, where the same processing is performed as when no periodicity is found in step S2. If it is determined in step S5 that correction is necessary, the sample value of the audio waveform is corrected in step S6 by the weighting operation described above, and the result is output in step S7. In step S8, the corrected sample value is stored in the storage unit 14, and the oldest sample value is discarded.
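The flow of steps S1 to S8 can be put together in a single sketch. It makes the same assumptions as the description suggests but does not confirm: S and T are squared normalized correlations made negative for opposite phase, the interpolation uses m = l - 1, and all thresholds, the run length n, and the names `signed_corr` and `process` are hypothetical choices for this example only.

```python
import math


def signed_corr(x, y, power_min):
    """Squared normalized correlation, negated for opposite phase.

    Returns None when either window has too little power.
    """
    cross = sum(a * b for a, b in zip(x, y))
    px = sum(a * a for a in x)
    py = sum(b * b for b in y)
    if px < power_min or py < power_min:
        return None
    value = cross * cross / (px * py)
    return value if cross > 0 else -value


def process(samples, seg_len=40, max_lag=50, n=8, before=2, after=4,
            s_min=0.5, t_min=0.5, power_min=1.0):
    """Return a copy of samples with detected discontinuities smoothed."""
    out = list(samples)  # past entries hold corrected values, like the storage unit
    run = None           # (l, k, offset) while a correction run is active
    for pos in range(len(samples)):
        if run is None and seg_len + max_lag <= pos < len(samples) - after:
            # S1/S2: search for the lag with the largest S above s_min.
            recent = out[pos - seg_len:pos]
            best_k, best_s = None, s_min
            for k in range(1, max_lag + 1):
                s = signed_corr(recent, out[pos - seg_len - k:pos - k], power_min)
                if s is not None and s > best_s:
                    best_k, best_s = k, s
            if best_k is not None:
                # S3/S4: compare the local window with the one a period earlier.
                t = signed_corr(out[pos - before:pos + after + 1],
                                out[pos - before - best_k:pos + after + 1 - best_k],
                                power_min)
                if t is not None and t < t_min:
                    # S5: start a correction run; offset is a(-1) - a(-k-1).
                    run = (1, best_k, out[pos - 1] - out[pos - best_k - 1])
        if run is not None:
            # S6: weighted interpolation (assumes m = l - 1).
            l, k, offset = run
            shifted = out[pos - k] + offset * (k - min(l - 1, k)) / k
            out[pos] = (l * samples[pos] + (n - l) * shifted) / n
            run = (l + 1, k, offset) if l + 1 < n else None
        # S7/S8: out[pos] is the emitted sample and stays in the buffer.
    return out


# Hypothetical data: a period-20 sine, and a copy with seven samples inverted.
clean = [1000.0 * math.sin(2 * math.pi * i / 20) for i in range(300)]
damaged = list(clean)
for i in range(143, 150):
    damaged[i] = -damaged[i]
corrected = process(damaged)
print(process(clean) == clean)  # True
print(corrected != damaged)     # True
```

A clean periodic signal passes through unchanged, while the damaged signal is modified only from the first position whose seven-sample window reaches the inverted region.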

FIG. 8 is a diagram illustrating an application portion and a network of the audio signal processing device according to the embodiment of the present invention.

The public line network 22 is connected to the mobile network 23 via the network 20. The mobile network 23 may be another public network, or the public network 22 may be another mobile network. The network 20 is a network based on a packet switching system, such as an IP network like the Internet. In this case, a method called VoIP is adopted in order to transmit and receive voice via the network 20. A gateway 21 is provided as a boundary device between the network 20 and the public network 22. Similarly, a gateway 21 is provided as a boundary device between the mobile network 23 and the network 20.

The audio signal processing device according to the embodiment of the present invention is mounted on the gateways 21 serving as these boundary devices. That is, for example, an audio signal input to the gateway 21 from the public line network 22 is converted into linear PCM data, subjected to the audio signal processing of the embodiment of the present invention, and then sent to the network 20 in IP format. The gateway 21 that receives the voice data transmitted over the network 20 converts the received voice signal into linear PCM data, performs the audio signal processing according to the embodiment of the present invention, and finally sends the result to the mobile network 23. The same applies when a voice signal is transmitted from the mobile network 23 to the public network 22. In the above description, the gateway is given as the place where the audio signal processing device according to the embodiment of the present invention is applied, but the present invention is not limited to this. That is, when received voice is reproduced in a mobile device such as a mobile terminal of the mobile network 23, in a base station of the mobile network 23, or in a telephone of the public network 22, the audio signal processing according to the embodiment of the present invention can also be performed on the audio signal while it is in the form of linear PCM data.

Industrial applicability

As described above, according to the present invention, auditory quality deterioration can be suppressed irrespective of the cause of the occurrence of a discontinuity in the audio waveform. Also, processing can be performed without significant delay.

Claims

The scope of the claims

1. An audio signal processing device that processes digital audio data in a communication network, comprising:

waveform prediction means for detecting a cycle of the input waveform and predicting, from the cycle, a waveform to be received;

discontinuous point detecting means for detecting a discontinuous point of the waveform from a correlation value between the predicted waveform and the actually received waveform; and

correction waveform generating means for generating a correction waveform having no discontinuity using the predicted waveform and the actually received waveform when the discontinuous point is detected.

2. The audio signal processing device according to claim 1, wherein the period of the input waveform is detected by detecting that an autocorrelation value of the input waveform is equal to or greater than a predetermined value.

3. The audio signal processing device according to claim 2, wherein the autocorrelation value is calculated for substantially one cycle of the input waveform.

4. The audio signal processing device according to claim 1, wherein the prediction of the waveform to be received from now on is performed using a waveform one cycle before the waveform to be predicted as a predicted waveform.

5. The audio signal processing device according to claim 1, wherein the detection of the discontinuous point is performed by calculating a correlation value between the predicted waveform and the actually received waveform for several sample points before and after the sample point at which it is determined whether or not a discontinuous point exists.

6. The audio signal processing device according to claim 1, wherein the correction waveform is generated by performing a weighted interpolation operation on the sample value of the predicted waveform and the sample value of the actually received waveform.

7. The audio signal processing device according to claim 6, wherein the weighted interpolation operation is performed by adding an offset amount to a sample value of the predicted waveform, so that the corrected waveform and a waveform actually received in the past are continuously connected.

8. The audio signal processing device according to claim 7, wherein the offset amount is calculated based on two sample values calculated from a cycle of the input waveform.

9. The audio signal processing device according to claim 1, wherein the communication network transmits the audio signal by a packet switching method.

10. The audio signal processing device according to claim 9, wherein the communication network is an ATM network or an IP network.

11. The audio signal processing device according to claim 1, wherein the digital audio data is linear PCM data.

12. An audio signal processing method for processing digital audio data in a communication network, comprising:

a waveform prediction step of detecting a cycle of the input waveform and predicting, from the cycle, a waveform to be received;

a discontinuous point detecting step of detecting a discontinuous point of the waveform from a correlation value between the predicted waveform and the actually received waveform; and

a correction waveform generation step of generating a correction waveform by using the predicted waveform and the actually received waveform when the discontinuous point is detected.

13. The audio signal processing method according to claim 12, wherein the cycle of the input waveform is detected by detecting that an autocorrelation value of the input waveform is equal to or greater than a predetermined value.

14. The audio signal processing method according to claim 13, wherein the autocorrelation value is calculated for substantially one cycle of the input waveform.

15. The audio signal processing method according to claim 12, wherein the waveform to be received from now on is predicted using a waveform one cycle before the waveform to be predicted as a predicted waveform.

16. The audio signal processing method according to claim 12, wherein the detection of the discontinuous point is performed by calculating a correlation value between the predicted waveform and the actually received waveform for several sample points before and after the sample point at which it is determined whether or not a discontinuous point exists.

17. The audio signal processing method according to claim 12, wherein the corrected waveform is generated by performing a weighted interpolation operation on the sample value of the predicted waveform and the sample value of the actually received waveform.

18. The audio signal processing method according to claim 17, wherein the weighted interpolation operation is performed by adding an offset amount to a sample value of the predicted waveform, so that the corrected waveform and a waveform actually received in the past are continuously connected.

19. The audio signal processing method according to claim 18, wherein the offset amount is calculated based on two sample values calculated from a cycle of the input waveform.

20. The audio signal processing method according to claim 12, wherein the communication network transmits the audio signal by a packet switching method.

21. The audio signal processing method according to claim 20, wherein the communication network is an ATM network or an IP network.

22. The audio signal processing method according to claim 12, wherein the digital audio data is linear PCM data.