WO2021135340A1

WO2021135340A1 - Voice signal processing method, system and apparatus, computer device, and storage medium

Info

Publication number: WO2021135340A1
Application number: PCT/CN2020/113219
Authority: WO
Inventors: 许慎愉; 林绪虹; 陈建峰
Original assignee: 广州华多网络科技有限公司
Priority date: 2019-12-31
Filing date: 2020-09-03
Publication date: 2021-07-08
Also published as: CN111063361B; CN111063361A

Abstract

A voice signal processing method. The method comprises: an encoder obtains a voice residual signal from an original voice signal, and shunts same to obtain multiple sub-voice residual signals (S101), obtains compensation information of the sub-voice residual signals on the basis of a preset compensation configuration (S102), and sends a code stream comprising the sub-voice residual signals and corresponding compensation information to a decoder, the code stream being used for instructing the decoder to decode according to the sub-voice residual signals and the corresponding compensation information (S103). The method can effectively improve the anti-packet loss performance of a speech encoder. Also provided are a voice signal processing system and apparatus, a computer device, and a storage medium.

Description

Voice signal processing method, system, device, computer equipment and storage medium

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office, the application number is 201911422259.4, and the invention title is "voice signal processing methods, systems, devices, computer equipment and storage media" on December 31, 2019, and its entire contents Incorporated in this application by reference.

Technical field

This application relates to the technical field of audio and video coding and decoding, and in particular to a voice signal processing method, system, device, computer equipment, and storage medium.

Background technique

Current speech encoders generally use parameter encoding, that is, according to the human voice model, the speech signal is converted into channel parameters and excitation parameters, quantized and coded to generate a code stream, and then the code stream is sent to the channel for transmission. After receiving the code stream, the receiver decodes the channel parameters and excitation parameters, and then re-synthesizes the speech signal according to the utterance model.

In practical applications, packet loss often occurs when the code stream is transmitted. Based on this reality, people have developed many anti-packet loss strategies, which are mainly divided into two categories. One is transmission-oriented. The main idea is to retransmit under low latency and forward error correction (FEC) under high latency. Among them, transmission-oriented FEC, retransmission and other anti-packet strategies are no longer applicable under extremely weak networks (such as 20kbps or even lower). Therefore, another type of anti-packet strategy needs to be adopted, which is to improve the encoder itself. , Also known as anti-packet speech encoder.

However, the anti-packet loss performance of existing speech encoders is generally poor.

Summary of the invention

Based on this, it is necessary to provide a voice signal processing method, system, device, computer equipment, and storage medium in response to the above technical problems.

In the first aspect, an embodiment of the present application provides a voice signal processing method, which includes:

Obtain the speech residual signal, and shunt the speech residual signal to obtain multiple sub-speech residual signals; the speech residual signal is the uncorrelated signal or weakly correlated signal obtained after processing the original speech signal;

Obtain the compensation information of each sub-speech residual signal based on the preset compensation configuration;

Send a code stream including each sub-speech residual signal and corresponding compensation information to the decoder; the code stream is used to instruct the decoder to decode according to each sub-speech residual signal and corresponding compensation information.

In a second aspect, an embodiment of the present application provides a voice signal processing method, which includes:

Receive the code stream sent by the encoder; the code stream includes multiple sub-speech residual signals and corresponding compensation information; each sub-speech residual signal is obtained by splitting from the speech residual signal; the compensation information is determined based on the preset compensation configuration;

Decode according to each sub-speech residual signal in the bitstream and the corresponding compensation information.

In a third aspect, an embodiment of the present application provides a voice signal processing system, which includes: an encoder and a decoder; an encoder, used to implement any one of the voice signal processing provided by the embodiments of the first aspect and the second aspect The steps of the method; the decoder is used to implement the steps of any one of the voice signal processing methods provided by the embodiments of the first aspect and the second aspect.

In a fourth aspect, an embodiment of the present application provides a voice signal processing device, and the device includes:

The shunt module is used to obtain the voice residual signal and shunt the voice residual signal to obtain multiple sub-voice residual signals; the voice residual signal is an uncorrelated signal or a weakly correlated signal obtained after processing the original voice signal;

The acquisition module is used to acquire the compensation information of each sub-speech residual signal based on the preset compensation configuration;

The sending module is used to send a code stream including each sub-speech residual signal and corresponding compensation information to the decoder; the code stream is used to instruct the decoder to decode according to each sub-speech residual signal and corresponding compensation information.

In a fifth aspect, an embodiment of the present application provides a voice signal processing device, the device includes:

The receiving module is used to receive the code stream sent by the encoder; the code stream includes multiple sub-speech residual signals and corresponding compensation information; each sub-speech residual signal is obtained by splitting from the speech residual signal; the compensation information is based on preset compensation The configuration is determined;

The decoding module is used for decoding according to each sub-speech residual signal in the code stream and the corresponding compensation information.

In a sixth aspect, an embodiment of the present application provides a computer device, including a memory and a processor, the memory stores a computer program, and the processor executes the computer program to implement any one of the voice signals provided by the embodiments of the first aspect and the second aspect. Processing method steps.

In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements any one of the voice signal processing provided by the embodiments of the first aspect and the second aspect. Method steps.

In the voice signal processing method, system, device, computer equipment and storage medium provided by the embodiments of the present application, after the encoder obtains the voice residual signal from the original voice signal, it splits the voice residual signal to obtain multiple sub-voice residual signals; And based on the preset compensation configuration, the compensation information of each sub-speech residual signal is obtained; then a code stream including each sub-speech residual signal and the corresponding compensation information is sent to the decoder, where the code stream is used to instruct the decoder according to each sub-speech The residual signal and the corresponding compensation information are decoded. In this method, by splitting the voice residual signal, it is equivalent to sending the decoder after multiple descriptions of the voice encoder parameters, and when describing each split, compensation information is added. This compensation information can be effectively used when the decoder decodes. A better voice signal can be recovered. In this way, the decoder can recover a better voice signal even if packet loss occurs during the transmission process by describing the residual voice signal. Therefore, this method can effectively improve the voice encoder Anti-packet loss performance.

Description of the drawings

Fig. 1 is a block diagram of a voice signal processing system provided by an embodiment;

FIG. 2 is a schematic flowchart of a voice signal processing method provided by an embodiment;

FIG. 3 is an interaction diagram of an encoder and a decoder of a speech signal processing method provided by an embodiment;

FIG. 4 is a schematic flowchart of a voice signal processing method provided by an embodiment;

FIG. 5 is a schematic flowchart of a voice signal processing method provided by an embodiment;

FIG. 6 is a schematic flowchart of a voice signal processing method provided by an embodiment;

FIG. 7 is a schematic flowchart of a voice signal processing method provided by an embodiment;

FIG. 8 is a schematic flowchart of a voice signal processing method provided by an embodiment;

FIG. 9 is a structural block diagram of a voice signal processing device provided by an embodiment;

FIG. 10 is a structural block diagram of a voice signal processing device provided by an embodiment;

Fig. 11 is a diagram of the internal structure of a computer device in an embodiment.

Detailed ways

In order to make the purpose, technical solutions, and advantages of this application clearer and clearer, the following further describes the application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application.

In order to better understand the voice signal processing method provided by the embodiment of the present application, an application environment to which the embodiment of the present application is applicable is provided. Please refer to FIG. 1, a voice signal processing method provided by this application can be applied to the voice signal processing system shown in FIG. 1. The system includes an encoder 01 and a decoder 02, where the encoder 01 and the decoder 02 can perform data transmission. The encoder 01 includes, but is not limited to, a contact encoder, a non-contact encoder, an incremental encoder, an absolute encoder, etc. The embodiment of the present application does not specifically limit the type of the encoder. The decoder 02 includes, but is not limited to, a hardware decoder, a wireless decoder, a software decoder, a multi-channel decoder, a single-channel decoder, etc. The type of the decoder is not specifically limited in this embodiment.

Usually under extremely weak networks (such as 20kbps or even lower), the transmission-oriented anti-packet strategy is no longer applicable. At this time, it is necessary to develop an anti-lost voice packet encoder to improve the anti-packet ability of the voice encoder itself. Among them, the split multi-description is an implementation of the anti-packet loss voice encoder, and the split multi-description here refers to the way in which the voice code stream to be transmitted is transmitted in a split mode.

Take the SILK encoder as an example. Generally, in the voice signal, the voice residual signal occupies the largest traffic in the code stream of the SILK encoder. Therefore, it is necessary to consider splitting the voice residual signal in the anti-packet voice encoder. Among them, the speech residual signal means that the speech encoder removes the short and long-term correlation of the original speech signal, performs gain control, and the remaining uncorrelated or weakly correlated signal after noise shaping. The speech residual signal is generally a random segment Pulse sequence. Based on this, the embodiments of the present application provide a voice signal processing method, system, device, computer equipment, and storage medium, which improve the anti-packet loss performance of the voice encoder by shunting the voice residual signal.

Hereinafter, the technical solution of the present application and how the technical solution of the present application solves the above-mentioned technical problems will be described in detail through the embodiments and the accompanying drawings. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. It should be noted that, in the voice signal processing method provided by the present application, the execution body of FIGS. 2 to 5 is an encoder, and the execution body of FIGS. 6 to 8 is a decoder, where the execution body may also be a voice signal. A processing device, where the device can be implemented as part or all of the encoder through software, hardware, or a combination of software and hardware.

In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments It is a part of the embodiments of the present application, but not all of the embodiments.

The following describes an embodiment where the execution body is an encoder.

In one embodiment, FIG. 2 provides a method for processing a voice signal. This embodiment relates to that after an encoder obtains a voice residual signal from an original voice signal, it splits the voice residual signal and adds a voice encoder to the split voice signal. The specific process of compensating information and sending it to the decoder is shown in Figure 2. The method includes:

S101: Obtain a voice residual signal, and shunt the voice residual signal to obtain multiple sub-voice residual signals; the voice residual signal is an uncorrelated signal or a weakly correlated signal obtained after processing the original voice signal.

Among them, the speech residual signal speech is the uncorrelated or weakly correlated signal remaining after the encoder removes the short- and long-term correlation of the original speech signal, and performs gain control and noise shaping.

Among them, the acquisition of the speech residual signal can be understood as the encoder (hereinafter also referred to as the speech encoder) after receiving a segment of the original speech signal, the original speech signal is divided into the speech residual signal and other parameters, where the other parameters are the original The general term after the speech residual signal is removed from the speech signal, that is to say, the other parameters include not only one parameter, but multiple parameters. As to which parameters are specifically included, this embodiment does not limit it.

In this step, after the encoder obtains the voice residual signal from the original voice signal, it shunts the voice residual signal. It is understandable that because the voice signal is actually a bit stream after entering the encoder, the essence of the signal is a signal sequence, so to split the voice residual signal is to split the entire code stream of the voice residual signal sequence, and split into multiple streams. Article signal sequence.

In this embodiment, the voice residual signal can be divided into two code streams, or it can be divided into other numbers of code streams, which is not limited in this embodiment. For example, split into two, that is, split the voice residual signal into a first sub-voice residual signal and a second sub-voice residual signal. What needs to be explained here is that after the speech residual signal is split into the first sub speech residual signal and the second sub speech residual signal, other parameters in the original speech signal except for the speech residual signal will be copied and saved in each sub speech at the same time. In the residual signal, each bar code stream formed after the final split includes the sub-voice residual signal, and also carries complete other parameters. In this way, if the voice residual signal is recovered at the decoding end, it can be restored in combination with other parameters. Out the original voice signal.

S102: Acquire compensation information of each sub-speech residual signal based on a preset compensation configuration.

The compensation configuration indicates the configuration of the compensation information, and the compensation information is the preset compensation information for each sub-speech residual signal, and the decoder can compensate the extra information according to the compensation information according to each sub-speech residual signal. The additional information of compensation can enable the decoder to better recover the speech signal during decoding. For example, the compensation configuration may include code rate configuration and compensation parameters. The code rate configuration determines the upper limit of the traffic of each transmission packet when transmitting the voice signal code stream. The change of the compensation parameter will cause the proportion of the compensation information in a packet of data to change. For another example, the compensation parameter may include the number of small frames dividing each sub-speech residual signal, and the number of non-zero pulses in each small frame. Among them, the compensation configuration information can be determined by the preset packet loss rate. Generally, under the same average code rate, the lower the packet loss rate, the less compensation information; conversely, the higher the packet loss rate, the more compensation information. In extreme cases, such as no packet loss, the compensation information size is 0.

In practical applications, when obtaining the compensation information of each sub-voice residual signal, it needs to be determined based on the compensation configuration, and the compensation configuration needs to be determined in advance. Generally, when the voice residual signal is transmitted, the compensation configuration needs to be preset, and the preset can be determined according to historical big data and in combination with actual conditions, which is not limited in this embodiment. Specifically, based on the preset compensation configuration, the encoder can obtain the compensation information of each sub-speech residual signal through a preset algorithm or a pre-trained neural network model, and directly determine the corresponding compensation using the compensation configuration information as input. information. Of course, it can also be in other ways, which is not limited in this embodiment.

S103: Send a code stream including each sub-speech residual signal and corresponding compensation information to the decoder; the code stream is used to instruct the decoder to decode according to each sub-speech residual signal and corresponding compensation information.

After determining each sub-speech residual signal from the speech residual signal and obtaining the compensation information of each sub-speech residual signal, when the encoder transmits the code stream to the decoder, each code stream includes each sub-speech residual signal and the corresponding The compensation information, of course, also needs to include the other parameters mentioned above. In this embodiment, only the speech residual signal is described, and the other parameters will not be repeated in some embodiments.

It can be understood that the transmission of the code stream from the encoder to the decoder is used to instruct the decoder to decode and recover the speech residual signal according to the sub-speech residual signals and corresponding compensation information in the code stream.

Illustratively, as shown in FIG. 3, a schematic diagram of an encoder sending a split code stream to a decoder is provided. Among them, the main decoder and side decoder in Figure 3 can be considered as one decoder. After the decoder receives the code stream, it will adopt different decoding methods according to the number of received code streams, that is, The main decoder and the side decoder can be regarded as sub-decoders that implement different decoding methods in one decoder. For the specific decoding process of the decoder, please refer to the description in the embodiment with the decoder as the execution subject, and will not be repeated here.

In the speech signal processing method provided in this embodiment, after the encoder obtains the speech residual signal from the original speech signal, it splits the speech residual signal to obtain multiple sub-speech residual signals; and obtains each sub-speech residual signal based on a preset compensation configuration Signal compensation information; and then send to the decoder a code stream including each sub-speech residual signal and corresponding compensation information, where the code stream is used to instruct the decoder to decode according to each sub-speech residual signal and corresponding compensation information. In this method, by splitting the voice residual signal, it is equivalent to sending the decoder after multiple descriptions of the voice encoder parameters, and when describing each split, compensation information is added. This compensation information can be effectively used when the decoder decodes. A better voice signal can be recovered. In this way, the decoder can recover a better voice signal even if packet loss occurs during the transmission process by describing more voice residual signals. Therefore, this method uses more voice residual signals. The described method can effectively improve the anti-packet performance of the speech encoder.

On the basis of the above embodiment, the embodiment of the present application also provides a voice signal processing method, which relates to the specific process of the voice encoder shunting the voice residual signal into two sub-voice residual signals. In one embodiment, If the multiple sub-speech residual signals include a first sub-speech residual signal and a second sub-speech residual signal; as shown in FIG. 4, the above S101 step includes:

S201: quantize the speech residual signal to obtain a quantized sequence corresponding to the speech residual signal.

The speech encoder in this embodiment is described by taking the SILK encoder as an example.

If the voice residual signal is split into the first sub-voice residual signal and the second sub-voice residual signal, the splitting method that can be used is odd-even splitting. The SILK encoder needs to quantize the speech residual signal before shunting the speech residual signal to obtain the quantized sequence corresponding to the speech residual signal.

For example, define the speech residual signal before quantization as: r[n],n=0,1,...,L-1; then the quantized speech residual signal sequence can be expressed as q[n], n=0, 1,...,L-1.

S202: Perform odd-even splitting on the quantized sequence to obtain an odd quantized sequence and an even quantized sequence.

Based on the quantized sequence of the quantized speech residual signal, the quantized sequence is odd-even split, and after the split, the final odd quantized sequence and the even quantized sequence need to be further determined based on the algorithm of the SILK encoder.

For example, if the sequence q[n],n=0,1,...,L-1, the parity split sequence can be expressed as:

Based on the random seed sequence and symbol function of the SILK encoder itself, the determined symbol function is:

Determine the random seed pair sequence:

And determine the odd sequence of random seeds:

Among them, seed_init will keep a copy in odd and even streams, and the size is generated by the SILK encoder.

Further, based on the above determined symbol function and parity random seed sequence, and the parity split sequence after the speech residual signal quantization, the final odd quantization sequence and even quantization sequence can be determined as:

q _e [n]=Q(r[2*n]*sign(s _e [n])-offset)

q _o [n]=Q(r[2*n+1]*sign(s _o [n])-offset)

Among them, Q represents the quantization algorithm in the SILK encoder, which is provided by the SILK encoder. Offset is obtained by looking up the table according to the small frame type, which is also provided by the SILK encoder.

S203: Determine the odd quantization sequence as the first sub-speech residual signal, and determine the even quantization sequence as the second sub-speech residual signal.

Based on the above-determined parity quantization sequence, the encoder determines the odd quantization sequence as the first sub-speech residual signal and the even quantization sequence as the second sub-speech residual signal. Of course, the even quantization sequence can also be determined as the first sub-speech The residual signal, and the odd quantization sequence is determined to be the second sub-speech residual signal, the correspondence between the first and second quantization sequences and the odd and even quantization sequence is not limited in this embodiment.

In this embodiment, the parity split is performed based on the quantized residual speech signal quantization sequence, and combined with the algorithm in the speech encoder, the final parity quantization sequence is determined. In this way, the parity quantization sequence is transmitted as the final encoder. The code stream facilitates the transmission of voice residual signals.

In an embodiment, the compensation parameters in the above compensation configuration are used, and the compensation parameters include the number of small frames into which each sub-speech residual signal is divided, and the number of non-zero pulses in each small frame N2; N1 is a positive integer, and N2 is Taking a non-negative integer as an example, the process of obtaining the compensation information of each sub-speech residual signal will be described. The compensation parameter is preset in practice and can be determined according to the packet loss rate to ensure the reasonableness of the set N1 and N2. As shown in Figure 5, the above S102 step includes:

S301: Obtain the compensation gain, position sequence, and symbol sequence corresponding to each sub-speech residual signal; the length of the compensation gain is N1, and the length of the position sequence and the symbol sequence are both N2.

Among them, the length of the compensation gain, position sequence and symbol sequence are all N2, which means that the compensation gain, position sequence and symbol sequence are all for each small frame in the sub-speech residual signal, that is to say, the encoder obtains in this step It is the compensation gain, position sequence and symbol sequence of each small frame in each sub-speech residual signal.

For example, take N1=cfc, N2=nz as an example

The position sequence x _i can be expressed as x _i =MAX_POS _nz (ABS(rq _i -crq _i )); where (rq _i represents the optimal sequence recovered when all the substreams received by the decoder are received The i-th small frame, crq _i ) represents the i-th small frame in the sequence recovered based on the compensation information when the decoder receives a single bar code stream in the shunted code stream, i can take from 0 to cfc-1 Value; the function ABS represents the absolute value of the items in the speech signal sequence, and the items in the speech signal sequence are determined by rq _i and crq _i ). Among them, the MAX_POSx function returns the position sequence of the first nz largest item.

Among them, the compensation gain can be determined based on the position sequence, for example, the compensation gain g _i can be expressed as

Similarly, the symbol sequence can be determined based on the position sequence, and the symbol sequence s _i can be expressed as: s _i =sign(rq _i [ _xi ]-crq _i [ _xi ]).

S302: Construct a compensation sequence of each sub-speech residual signal according to the N1 position sequences and symbol sequences corresponding to each sub-speech residual signal;

Based on the position sequence and symbol sequence of each sub-speech residual signal determined by the above encoder, all small frames are spliced to form a complete sequence, that is, N1 position sequences and symbol sequences of each small frame are constructed to construct a complete sequence. The completed sequence is each sub-frame. Compensation sequence of speech residual signal. The compensation sequence is represented by cq. Since N1 is the number of small frames divided by each sub-speech residual signal, that is, the length of each sub-speech residual signal, the length of cq of each sub-speech residual signal is L/2, and L is The length of the entire speech residual signal.

S303: Determine the compensation information of each sub-speech residual signal according to the compensation sequence of each sub-speech residual signal and the compensation gain of each sub-speech residual signal.

The compensation sequence and the compensation gain of each sub-speech residual signal determined above are determined, and the compensation sequence and the compensation gain of each sub-speech residual signal are determined as the final compensation information.

When determining the compensation information in this embodiment, each sub-speech residual signal is first divided into multiple small frames, and then the gain value and compensation sequence of each small frame are obtained, and the final compensation sequence is determined based on the entire compensation sequence. Compensation information. In this way, the determined compensation information can fully and completely decode the decoder and provide additional compensation information to ensure the quality of the speech signal recovered by the decoder.

In addition, the above compensation configuration also includes a code rate configuration, which is a parameter that determines the upper limit of the transmission packet flow, and the code rate configuration can be determined by a preset packet loss rate. In one embodiment, the method further includes : Determine the space size of the compensation information of each sub-voice residual signal according to the bit rate configuration; the space size is used to indicate the space capacity for storing the compensation information when the code stream is sent.

After the compensation information is determined, the capacity of the space for storing the compensation information needs to be further determined. In this way, the storage space of each divided bit stream is divided into a similar size, which is beneficial to the subsequent transmission performance test, and in the transmitted bit stream After the compensation information is added, the size of the code stream will increase to a certain extent. By determining the storage space capacity of the compensation information, the factors that affect the efficiency of the code stream transmission can be effectively detected.

The following describes an embodiment where the execution subject is a decoder. It should be noted that although this application divides the embodiment in which the decoder is the main body of execution and the embodiment in which the encoder is the main body of execution, in reality, the decoder and the encoder interact with each other to complete speech signal processing. Therefore, The process descriptions in the embodiment in which the encoder is the execution subject and the embodiment in which the decoder is the execution subject may refer to each other, rather than as a limitation of the execution range of the two.

As shown in Figure 6, in one embodiment, a method for processing a voice signal is provided. This embodiment relates to a specific process of decoding after a decoder receives a code stream sent by an encoder. The method includes:

S401: Receive a code stream sent by an encoder; the code stream includes multiple sub-speech residual signals and corresponding compensation information; each sub-speech residual signal is obtained by splitting from the speech residual signal; the compensation information is determined based on a preset compensation configuration .

S402: Perform decoding according to each sub-speech residual signal in the bitstream and corresponding compensation information.

The principle process of the code stream and compensation configuration in this embodiment can be referred to the description in the embodiment in which the execution body is the encoder, which will not be repeated here.

Among them, when the decoder receives the code stream sent by the encoder, it either receives all or part of the code stream, that is, packet loss occurs. For two different situations, the decoder adopts different decoding methods. For restoring the residual voice signal, refer to the description in the following embodiment for the specific process.

In the speech signal processing method provided by this embodiment, after the decoder receives the code stream sent by the encoder, it decodes according to the sub-speech residual signal and corresponding compensation information carried in each code stream. After the voice residual signal is obtained from the signal, the voice residual signal is shunted to obtain multiple sub-voice residual signals, and based on the preset compensation configuration, the compensation information of each sub-voice residual signal is obtained, and then sent to the decoder including each sub-voice residual signal The code stream of the signal and the corresponding compensation information. In this method, the voice residual signal is split at the encoder end, which is equivalent to sending the decoder after multiple descriptions of the voice encoder parameters, and compensation information is added when each split is described, and the compensation information can be used in the decoder During decoding, a better voice signal can be recovered effectively. In this way, the decoder can recover a better voice signal even if packet loss occurs during the transmission process by describing the residual voice signal. Therefore, this method can effectively improve Anti-packet loss performance of the speech encoder.

In the following, two embodiments are used to describe the decoding process in the case where the decoder receives all the code streams and the case where the decoder only receives a single code stream. In the following, the multiple sub-speech residual signals include the first sub-speech residual signal and the second sub-speech residual signal as an example for description.

In an embodiment, if the multiple sub-speech residual signals include a first sub-speech residual signal and a second sub-speech residual signal; and the received code stream is the first sub-speech residual signal and corresponding compensation information, and, the second The sub-speech residual signal and the corresponding compensation information; then, as shown in FIG. 7, the above step S402 includes:

S501: Restore a corresponding even voice residual signal according to the first sub-speech residual signal, and restore a corresponding odd voice residual signal according to the second sub-speech residual signal.

Among them, the difference between the even voice residual signal and the first sub-speech residual signal is: the first sub-speech residual signal is the sub-signal sequence that the encoder side quantizes and shunts the voice residual signal, while the even voice residual signal is the decoder side according to the sub-signal sequence. The voice residual information recovered from the signal sequence; the odd voice residual signal has the same difference with the second sub-voice residual signal.

In this step, the first sub-speech residual signal is an even quantization sequence, and the second sub-speech residual signal is an odd quantization sequence. In practical applications, the correspondence between the two can also be exchanged, because the first The second itself is only used to distinguish the sub-speech residual signals, which is not limited in this embodiment.

For example, if you define odd quantization sequence and even quantization sequence as:

q _e [n]=Q(r[2*n]*sign(s _e [n])-offset)

q _o [n]=Q(r[2*n+1]*sign(s _o [n])-offset)

Then, the even voice residual signal and the odd voice residual signal can be expressed as:

It can be seen that q(n) is the quantized sequence quantized from r(n), and rq(n) represents the speech residual signal recovered from q(n).

For example, in this embodiment, the process of the decoder recovering rq(n) from q(n) can be performed by using some commonly used decoding algorithms, which is not limited in this embodiment.

S502: Perform interleaving and interpolation on the even voice residual signal and the odd voice residual signal to determine the voice residual signal.

Based on the restored even voice residual signal and odd voice residual signal, the decoder performs interleaving and interpolation on the even voice residual signal and the odd voice residual signal, that is, interleaving and inserting the odd and even items respectively to obtain a complete voice residual signal. As mentioned earlier, when the code stream is sent in practice, other parameters that were separated at the beginning will be carried. After the decoder restores the speech residual signal, it can restore the original speech signal by combining other parameters carried in the code stream.

In this embodiment, since all the code streams received by the decoder, that is to say, all the code streams sent by the encoder are received, the even voice residual signal and the odd voice residual signal are interleaved sample by sample to recover the optimal voice residual signal. Signal, which can restore the original voice signal with higher sound quality.

In another embodiment, if the multiple sub-speech residual signals include a first sub-speech residual signal and a second sub-speech residual signal; and the received code stream is the first sub-speech residual signal and corresponding compensation information, or the second sub-speech residual signal The sub-voice residual signal and the corresponding compensation information; then, as shown in FIG. 8, the above step S402 includes:

S601: Restore the corresponding even voice residual signal according to the first sub-speech residual signal, or restore the corresponding odd voice residual signal according to the second sub-speech residual signal.

In this embodiment, only one sub-speech residual signal is received, for example, only the first sub-speech residual signal is received or only the second sub-speech residual signal is received. Correspondingly, the encoder recovers only the even voice residual signal or the odd voice residual signal, that is, which sub voice residual signal is received, and what is restored is the voice residual signal corresponding to the sub voice residual signal.

S602: Restore a similar voice residual signal based on the even voice residual signal, or restore a similar voice residual signal based on the odd voice residual signal.

Based on the even voice residual signal or odd voice residual signal obtained above, the similar voice residual signal is restored, where the similar voice residual signal represents the voice residual signal recovered based on the compensation information, which is represented by crq(n), and the similar voice residual signal is equal to rq There is a small amount of error between (n), so it is called similar speech residual signal.

The even speech residual signal recovered in the above steps can be expressed as rq _e , the odd speech residual signal can be expressed as rq _o , and the similar speech residual signal is expressed as crq.

Then the method for the decoder to _{recover crq according to rq e} or rq _o can be implemented by the following formula:

The above formula indicates, is determined based on rq _e crq _e, determined based CRQ rq _o is _O, and the sequence of the above formula is 0 ~ L-1 in the fully covered, it is considered that the recovered CRQ _{The lengths of e} and crq _o are both L, so crq _e and crq _o can be collectively referred to as crq, that is, similar speech residual signals.

S603: Determine the target similar voice residual signal based on the compensation information and the similar voice residual signal corresponding to the first sub-voice residual signal, or determine the target similar voice residual signal according to the compensation information and the similar voice residual signal corresponding to the second sub-voice residual signal signal.

When determining the similar voice residual signal, compensation information has not been considered, because the purpose of compensation is to use additional information to make the crq sequence closer to the rq sequence. Therefore, in order to have a better quality of the finally recovered speech residual signal, the compensation information carried in each sub-speech residual signal is merged into the similar speech residual signal to obtain the final target similar speech residual signal.

For example, first obtain the target similar voice residual signal of each small frame in the voice residual signal, multiply the compensation gain of each small frame in the compensation information by the compensation sequence of each small frame, and then add the similar voice residual signal, that is, the target crq _i = crq _i +g _i *cq _i . Based on the target similar voice residual signal of each small frame, the target similar voice residual signal of the entire voice residual signal can be determined.

S604: Determine a voice residual signal according to the target similar voice residual signal.

The target determined in the above steps is similar to the voice residual signal to the determined voice residual signal. It is understandable that although there is a certain small error between the target similar voice residual signal and the optimal voice residual signal determined in this embodiment, this implementation is based on the voice residual signal recovered when the decoder receives a single code stream. signal. In other words, in the case of packet loss, the decoder can recover the target similar voice residual signal.

In practice, the sound quality verification is performed on the target similar voice residual signal obtained by the method provided in this embodiment, as shown in Table 1 below, to use the same code stream ch_f1.wav, compare it under the same packet loss strategy and similar actual traffic This method is divided into MOS of SILK encoder.

Table 1

From the MOS score in the above table, it can be seen that even when the packet loss rate is high, medium sound quality can be restored, and if the packet loss rate is not high, higher sound quality can be restored. Therefore, the encoder provided by the embodiments of the present application has strong anti-packet loss performance by splitting and multiple descriptions of the parameters of the encoder. When packet loss occurs during transmission, the decoder can even receive only one packet. Solve the medium sound quality, if the decoder receives two packets in time, it can restore the higher sound quality.

It should be understood that although the various steps in the flowcharts of FIGS. 2-8 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless specifically stated in this article, the execution of these steps is not strictly limited in order, and these steps can be executed in other orders. Moreover, at least some of the steps in Figures 2-8 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times. These sub-steps or stages The execution order of is not necessarily performed sequentially, but may be performed alternately or alternately with at least a part of other steps or sub-steps or stages of other steps.

In addition, the embodiment of the present application also provides a voice signal processing system, which can be referred to as shown in Figure 1. The system includes: an encoder and a decoder; wherein the encoder is used to implement all the previous implementations of the encoder as the main body. The process in the embodiment; the decoder is used to implement the processes in all the previous embodiments where the decoder is the main body of execution.

The implementation principle and technical effect of a voice signal processing system provided by the foregoing embodiment are similar to those of the foregoing voice signal processing method embodiment, and will not be repeated here.

In addition, a virtual device corresponding to the above-mentioned voice signal processing method is also provided. As shown in FIG. 9, in one embodiment, a voice signal processing device is provided. The device includes: a shunt module 10, an acquisition module 11, and a sending module 12, of which,

The shunting module 10 is used to obtain the speech residual signal and shunt the speech residual signal to obtain multiple sub-speech residual signals; the speech residual signal is an uncorrelated signal or a weakly correlated signal obtained after processing the original speech signal;

The obtaining module 11 is configured to obtain compensation information of each sub-speech residual signal based on a preset compensation configuration;

The sending module 12 is used to send a code stream including each sub-speech residual signal and corresponding compensation information to the decoder; the code stream is used to instruct the decoder to decode according to each sub-speech residual signal and corresponding compensation information.

In one embodiment, a voice signal processing device is provided. If the multiple sub-voice residual signals include a first sub-voice residual signal and a second sub-voice residual signal, the above-mentioned shunt module 10 includes:

The quantization unit is used to quantize the speech residual signal to obtain a quantized sequence corresponding to the speech residual signal;

The shunting unit is used to perform odd-even shunting of the quantized sequence to obtain the odd quantized sequence and the even quantized sequence;

The sub-signal determining unit is configured to determine the odd quantization sequence as the first sub-speech residual signal, and determine the even quantization sequence as the second sub-speech residual signal.

In one embodiment, a voice signal processing device is provided, the above compensation configuration includes compensation parameters, and the compensation parameters include the number of small frames into which each sub-speech residual signal is divided, N1, and the number of non-zero pulses in each small frame, N2; N1 is a positive integer and N2 is a non-negative integer; then the above-mentioned obtaining module 11 includes:

The acquiring unit is used to acquire the compensation gain, position sequence and symbol sequence corresponding to each sub-speech residual signal; the length of the compensation gain is N1, and the length of the position sequence and symbol sequence are both N2;

The construction unit is used to construct the compensation sequence of each sub-speech residual signal according to the N1 position sequences and symbol sequences corresponding to each sub-speech residual signal;

The compensation information determining unit is used to determine the compensation information of each sub-speech residual signal according to the compensation sequence of each sub-speech residual signal and the compensation gain of each sub-speech residual signal.

In one embodiment, a voice signal processing device is provided, and the compensation configuration further includes a bit rate configuration; the bit rate configuration is determined according to a preset packet loss rate; the device further includes: a space determining module, configured to configure according to the bit rate Determine the space size of the compensation information of each sub-voice residual signal; the space size is used to indicate the space capacity for storing the compensation information when the code stream is sent.

In one embodiment, as shown in FIG. 10, a voice signal processing device is provided, and the device includes:

The receiving module 13 is used to receive the code stream sent by the encoder; the code stream includes multiple sub-voice residual signals and corresponding compensation information; each sub-voice residual signal is obtained by splitting from the voice residual signal; the compensation information is based on a preset The compensation configuration is determined;

The decoding module 14 is used for decoding according to each sub-speech residual signal in the bitstream and the corresponding compensation information.

In one embodiment, a voice signal processing device is provided. If the plurality of sub-voice residual signals include a first sub-voice residual signal and a second sub-voice residual signal; and the received code stream is the first sub-voice residual signal and the corresponding The compensation information of, and, the second sub-speech residual signal and the corresponding compensation information; then the above-mentioned decoding module 14 includes:

The first restoration unit is configured to restore the corresponding even voice residual signal according to the first sub-voice residual signal, and restore the corresponding odd voice residual signal according to the second sub-voice residual signal;

The first voice signal determining unit is used to perform interleaving and interpolation on the even voice residual signal and the odd voice residual signal to determine the voice residual signal.

In one embodiment, a voice signal processing device is provided. If the plurality of sub-voice residual signals include a first sub-voice residual signal and a second sub-voice residual signal; and the received code stream is the first sub-voice residual signal and the corresponding Or, the second sub-speech residual signal and the corresponding compensation information; the decoding module 14 includes:

The second restoration unit is configured to restore the corresponding even voice residual signal according to the first sub-voice residual signal, or restore the corresponding odd voice residual signal according to the second sub-voice residual signal;

The third restoration unit is configured to restore similar voice residual signals based on even voice residual signals, or restore similar voice residual signals based on odd voice residual signals;

The target similar signal unit is used to determine the target similar voice residual signal based on the compensation information and the similar voice residual signal corresponding to the first sub-voice residual signal, or, according to the compensation information and the similar voice residual signal corresponding to the second sub-voice residual signal, Determine the target similar voice residual signal.

The second voice signal determining unit is used to determine the voice residual signal according to the target similar voice residual signal.

The implementation principles and technical effects of all the voice signal processing devices provided in the foregoing embodiments are similar to those of the foregoing voice signal processing method embodiments, and will not be repeated here.

For the specific limitation of the voice signal processing device, please refer to the above limitation on the voice signal processing method, which will not be repeated here. Each module in the above-mentioned speech signal processing device can be implemented in whole or in part by software, hardware, and a combination thereof. The above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.

In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 11. The computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program is executed by the processor to realize a voice signal processing method. The display screen of the computer device can be a liquid crystal display or an electronic ink display screen, and the input device of the computer device can be a touch layer covered on the display screen, or it can be a button, trackball or touchpad set on the computer device shell , It can also be an external keyboard, touchpad, or mouse.

Those skilled in the art can understand that the structure shown in FIG. 11 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. The specific computer device may Including more or fewer parts than shown in the figure, or combining some parts, or having a different arrangement of parts.

In one embodiment, a computer device is provided, including a memory and a processor, a computer program is stored in the memory, and the processor implements the following steps when the processor executes the computer program:

Or, when the processor executes the computer program, the following steps are implemented:

The implementation principle and technical effect of a computer device provided by the foregoing embodiment are similar to those of the foregoing method embodiment, and will not be repeated here.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:

Or, when the computer program is executed by the processor, the following steps are implemented:

The foregoing embodiment provides a computer-readable storage medium, and its implementation principle and technical effect are similar to those of the foregoing method embodiment, and will not be repeated here.

A person of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through a computer program. The computer program can be stored in a non-volatile computer readable storage. In the medium, when the computer program is executed, it may include the processes of the above-mentioned method embodiments. Wherein, any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. As an illustration and not a limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Channel (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

The technical features of the above embodiments can be combined arbitrarily. In order to make the description concise, all possible combinations of the technical features in the above embodiments are not described. However, as long as there is no contradiction in the combination of these technical features, they should be It is considered as the range described in this specification.

The above-mentioned embodiments only express several implementation manners of the present application, and the description is relatively specific and detailed, but it should not be understood as a limitation on the scope of the invention patent. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of this application, several modifications and improvements can be made, and these all fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.

Claims

A voice signal processing method, characterized in that the method includes:

Acquiring a voice residual signal, and shunting the voice residual signal to obtain a plurality of sub voice residual signals; the voice residual signal is an uncorrelated signal or a weakly correlated signal obtained after processing the original voice signal;

Acquiring compensation information of each of the sub-speech residual signals based on a preset compensation configuration;

A code stream including each of the sub-speech residual signals and corresponding compensation information is sent to the decoder; the code stream is used to instruct the decoder to decode according to each of the sub-speech residual signals and the corresponding compensation information.
The voice signal processing method according to claim 1, wherein if the plurality of sub-speech residual signals includes a first sub-speech residual signal and a second sub-speech residual signal;

Then the shunting the voice residual signal to obtain multiple sub-voice residual signals includes:

Quantizing the speech residual signal to obtain a quantization sequence corresponding to the speech residual signal;

Performing odd-even splitting on the quantization sequence to obtain an odd quantization sequence and an even quantization sequence;

The odd quantization sequence is determined as the first sub-speech residual signal, and the even quantization sequence is determined as the second sub-speech residual signal.
The voice signal processing method according to claim 2, wherein the compensation configuration includes compensation parameters, and the compensation parameters include the number of small frames N1 into which each of the sub-speech residual signals is divided, and the number of small frames in each small frame. The number of zero pulses N2; said N1 is a positive integer, and N2 is a non-negative integer;

Then the acquiring compensation information of each of the sub-speech residual signals based on the preset compensation configuration includes:

Acquiring a compensation gain, a position sequence, and a symbol sequence corresponding to each of the sub-speech residual signals; the length of the compensation gain is N1, and the length of the position sequence and the symbol sequence are both N2;

Constructing a compensation sequence of each sub-speech residual signal according to the N1 position sequences and symbol sequences corresponding to each of the sub-speech residual signals;

The compensation information of each of the sub-speech residual signals is determined according to the compensation sequence of each of the sub-speech residual signals and the compensation gain of each of the sub-speech residual signals.
The voice signal processing method according to claim 3, wherein the compensation configuration further comprises a code rate configuration; the code rate configuration is determined according to a preset packet loss rate;

The method further includes:

The space size of the compensation information of each of the sub-speech residual signals is determined according to the code rate configuration; the space size is used to indicate the space capacity for storing the compensation information when the code stream is sent.
A voice signal processing method, characterized in that the method includes:

Receive the code stream sent by the encoder; the code stream includes a plurality of sub-speech residual signals and corresponding compensation information; each of the sub-speech residual signals is obtained by splitting from the speech residual signal; the compensation information is based on a preset The compensation configuration is determined;

Decoding is performed according to each of the sub-speech residual signals in the code stream and the corresponding compensation information.
The speech signal processing method according to claim 5, wherein if the plurality of sub speech residual signals include a first sub speech residual signal and a second sub speech residual signal; and the received bit stream is the first sub A speech residual signal and corresponding compensation information, and, the second sub-speech residual signal and corresponding compensation information;

Then the decoding according to each of the sub-speech residual signals in the code stream and the corresponding compensation information includes:

Recovering the corresponding even voice residual signal according to the first sub-speech residual signal, and recovering the corresponding odd voice residual signal according to the second sub-speech residual signal;

Perform interleaving and interpolation on the even voice residual signal and the odd voice residual signal to determine the voice residual signal.
The speech signal processing method according to claim 5, wherein if the plurality of sub speech residual signals include a first sub speech residual signal and a second sub speech residual signal; and the received bit stream is the first sub A voice residual signal and corresponding compensation information, or the second sub-voice residual signal and corresponding compensation information;

Then the decoding according to each of the sub-speech residual signals in the code stream and the corresponding compensation information includes:

Restore the corresponding even voice residual signal according to the first sub-speech residual signal, or restore the corresponding odd voice residual signal according to the second sub-speech residual signal;

Recovering a similar voice residual signal according to the even voice residual signal, or recovering the similar voice residual signal according to the odd voice residual signal;

Determine the target similar voice residual signal based on the compensation information corresponding to the first sub-speech residual signal and the similar voice residual signal, or, according to the compensation information corresponding to the second sub-speech residual signal and the similar voice residual signal , Determine the target similar voice residual signal;

Determine the voice residual signal according to the target similar voice residual signal.
A speech signal processing system, characterized in that the system includes: an encoder and a decoder;

The encoder is used to implement the steps of the speech signal processing method in any one of claims 1-4;

The decoder is used to implement the steps of the speech signal processing method of any one of claims 5-7.
A voice signal processing device, characterized in that the device includes:

The splitting module is used to obtain the voice residual signal and split the voice residual signal to obtain multiple sub voice residual signals; the voice residual signal is an uncorrelated signal or weak correlation signal obtained after processing the original voice signal signal;

An obtaining module, configured to obtain compensation information of each of the sub-speech residual signals based on a preset compensation configuration;

The sending module is used to send a code stream including each of the sub-speech residual signals and corresponding compensation information to the decoder; the code stream is used to instruct the decoder to decode according to each of the sub-speech residual signals and corresponding compensation information .
A voice signal processing device, characterized in that the device includes:

The receiving module is configured to receive the code stream sent by the encoder; the code stream includes a plurality of sub-speech residual signals and corresponding compensation information; each of the sub-speech residual signals is obtained by splitting from the speech residual signal; the compensation information It is determined based on the preset compensation configuration;

The decoding module is used for decoding according to each of the sub-speech residual signals in the code stream and the corresponding compensation information.
A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the voice signal processing method according to any one of claims 1 to 7 when the processor executes the computer program A step of.
A computer-readable storage medium having a computer program stored thereon, wherein the computer program implements the steps of the voice signal processing method according to any one of claims 1 to 7 when the computer program is executed by a processor.