CN112669857B - Voice processing method, device and equipment

Info

Publication number: CN112669857B
Application number: CN202110284182.XA
Authority: CN (China)
Prior art keywords: data, voice, target, coded, voice data
Legal status: Active (granted)
Other versions: CN112669857A (Chinese)
Inventor: 梁俊斌 (Liang Junbin)
Current assignee: Tencent Technology Shenzhen Co Ltd
Original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202110284182.XA; publication of CN112669857A; application granted; publication of CN112669857B.

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the application discloses a voice processing method, apparatus, and device, where the voice processing method comprises the following steps: acquiring voice data to be coded, and performing linear prediction analysis on the voice data to be coded to obtain linear prediction parameters; determining, according to the voice data to be coded, a target code vector in a target adaptive codebook, the index of the target code vector, and the gain corresponding to the target code vector; and sending the linear prediction parameters, the index of the target code vector, and the gain corresponding to the target code vector to a voice decoding end as the encoded data corresponding to the voice data to be coded. With the embodiment of the application, no fixed codebook data needs to be generated while encoding the voice data to be coded, which reduces the storage space occupied by fixed codebook data, improves the overall compression performance and voice quality of the voice coding, reduces the channel bandwidth required to transmit the encoded data, improves transmission performance, and thereby improves the effect of the voice coding.

Description

Voice processing method, device and equipment
Technical Field
The present application relates to the field of computer technologies, in particular to the field of speech coding, and more particularly to a voice processing method, apparatus, and device.
Background
Speech coding is widely used in daily communications. Speech coding reduces the channel bandwidth required for speech transmission while preserving the quality of the transmitted speech. For example, in a voice call application, a transmitting end collects voice data, encodes it using an encoder, and transmits the encoded data to a receiving end, so that the receiving end can regenerate the voice data through a decoder and play the sound.
At present, speech coding techniques fall mainly into three categories: waveform coding, parametric coding, and hybrid coding. Waveform coding treats voice data as general waveform data, so that the reconstructed voice waveform keeps the original waveform shape. Parametric coding extracts and encodes the characteristic parameters of the voice data, so that the reconstructed voice keeps the semantics of the original voice. Hybrid coding combines waveform coding and parametric coding and carries both voice characteristic parameters and waveform coding information. Practice shows that the data produced by current coders requires a large transmission bandwidth, compresses poorly, and transmits with low performance.
Disclosure of Invention
The embodiment of the application provides a voice processing method, apparatus, and device, which can reduce the storage space occupied by fixed-codebook-related data during the encoding of voice data to be coded. On the one hand, this improves the overall compression performance and voice quality of the voice coding; on the other hand, it reduces the channel bandwidth required to transmit the encoded data corresponding to the voice data to be coded, improving transmission performance and the voice coding effect.
In one aspect, an embodiment of the present application provides a speech processing method, where the speech processing method is applied to a speech encoding end, and the speech processing method includes:
acquiring voice data to be coded, and performing linear prediction analysis on the voice data to be coded to obtain linear prediction parameters;
determining a target code vector in a target self-adaptive codebook, an index of the target code vector and a gain corresponding to the target code vector according to the voice data to be coded;
and sending the linear prediction parameters, the index of the target code vector and the gain corresponding to the target code vector as the coded data corresponding to the voice data to be coded to a voice decoding end.
In another aspect, an embodiment of the present application provides a speech processing method, where the speech processing method is applied to a speech decoding end, and the method includes:
receiving encoded data, sent by a voice encoding end, corresponding to voice data to be coded, wherein the encoded data includes: the linear prediction parameters corresponding to the voice data to be coded, the index of a target code vector, and the gain corresponding to the target code vector;
determining self-adaptive codebook excitation data according to the index of the target code vector and the gain corresponding to the target code vector;
determining target prediction data corresponding to the voice data to be coded, and performing data analysis on the target prediction data through a fixed codebook prediction model to determine fixed codebook excitation data corresponding to the voice data to be coded;
and synthesizing the adaptive codebook excitation data and the fixed codebook excitation data according to the linear prediction parameters to obtain decoding data corresponding to the voice data to be coded.
On the other hand, an embodiment of the present application provides a speech processing apparatus, which is applied to a speech encoding side, and the speech processing apparatus includes:
an acquisition unit, configured to acquire voice data to be coded and perform linear prediction analysis on the voice data to be coded to obtain linear prediction parameters;
the determining unit is used for determining a target code vector in a target self-adaptive codebook, an index of the target code vector and a gain corresponding to the target code vector according to the voice data to be coded;
and the sending unit is used for sending the linear prediction parameters, the index of the target code vector and the gain corresponding to the target code vector to the voice decoding end as the coded data corresponding to the voice data to be coded.
In one implementation, the speech processing apparatus further includes:
the acquisition unit is further used for acquiring a target code vector of the previous frame of voice data of the voice data to be coded, a gain corresponding to the target code vector of the previous frame of voice data and fixed codebook excitation data corresponding to the previous frame of voice data;
and the updating unit is used for updating the historical adaptive codebook according to the target code vector of the previous frame of voice data of the voice data to be coded, the gain corresponding to the target code vector of the previous frame of voice data and the excitation data of the fixed codebook corresponding to the previous frame of voice data to obtain the target adaptive codebook.
In one implementation, the obtaining unit, when obtaining fixed codebook excitation data corresponding to a previous frame of speech data, is specifically configured to:
determining target prediction data of previous frame voice data of the voice data to be coded;
and performing data analysis on target prediction data of the previous frame of voice data through a fixed codebook prediction model, and determining fixed codebook excitation data corresponding to the previous frame of voice data.
In one implementation, the updating unit updates the historical adaptive codebook according to a target codevector of a previous frame of speech data of the speech data to be encoded, a gain corresponding to the target codevector of the previous frame of speech data, and excitation data of a fixed codebook corresponding to the previous frame of speech data to obtain a target adaptive codebook, and is specifically configured to:
determining self-adaptive codebook excitation data of the previous frame of voice data according to a target code vector of the previous frame of voice data of the voice data to be coded and a gain corresponding to the target code vector of the previous frame of voice data;
and updating the historical adaptive codebook according to the sum of the adaptive codebook excitation data of the previous frame of voice data and the fixed codebook excitation data corresponding to the previous frame of voice data to obtain the target adaptive codebook.
In one implementation, the speech processing apparatus further includes:
the acquisition unit is further used for acquiring a voice training sample set, and the voice training sample set comprises a plurality of voice training samples;
and the training unit is used for carrying out iterative training on the initial fixed codebook prediction model according to the voice training sample set to obtain a fixed codebook prediction model, and the fixed codebook prediction model is used for determining fixed codebook excitation data corresponding to the input voice data.
In one implementation, when the training unit performs iterative training on the initial fixed codebook prediction model according to the speech training sample set to obtain the fixed codebook prediction model, the training unit is specifically configured to:
acquiring a target voice training sample from the voice training sample set, and performing linear prediction analysis on the target voice training sample to obtain a training linear prediction parameter of the target voice training sample, wherein the target voice training sample is any one of the voice training sample set;
acquiring decoded data corresponding to the previous frame of voice data of the target voice training sample, a training target code vector of the previous frame of voice data of the target voice training sample, and the gain corresponding to the training target code vector;
and performing iterative training on the initial fixed codebook prediction model with the training linear prediction parameters, the decoded data corresponding to the previous frame of voice data of the target voice training sample, the training target code vector of the previous frame of voice data of the target voice training sample, and the gain corresponding to the training target code vector, to obtain the fixed codebook prediction model.
In one implementation, the speech processing apparatus further includes:
the high-pass filtering unit is used for performing high-pass filtering on the voice data to be coded to obtain high-pass filtered voice data to be coded;
when the obtaining unit performs linear prediction analysis on the voice data to be coded to obtain the linear prediction parameters, the obtaining unit is specifically configured to:
perform linear prediction analysis on the high-pass filtered voice data to be coded to obtain the linear prediction parameters corresponding to the voice data to be coded.
On the other hand, an embodiment of the present application provides a speech processing apparatus, which is applied to a speech decoding side, and the speech processing apparatus includes:
a receiving unit, configured to receive encoded data, sent by a voice encoding end, corresponding to voice data to be coded, where the encoded data includes: the linear prediction parameters corresponding to the voice data to be coded, the index of a target code vector, and the gain corresponding to the target code vector;
a determining unit, configured to determine adaptive codebook excitation data according to the index of the target codevector and the gain corresponding to the target codevector;
the determining unit is further configured to determine target prediction data corresponding to the to-be-encoded voice data, perform data analysis on the target prediction data through a fixed codebook prediction model, and determine fixed codebook excitation data corresponding to the to-be-encoded voice data;
and the synthesis unit is used for synthesizing the adaptive codebook excitation data and the fixed codebook excitation data according to the linear prediction parameters to obtain decoding data corresponding to the voice data to be coded.
In one implementation, when determining target prediction data corresponding to voice data to be encoded, the determining unit is specifically configured to:
and if the voice data to be coded is the voice data of the initial frame, determining the target value as target prediction data corresponding to the voice data to be coded.
In one implementation, the target prediction data corresponding to the speech data to be encoded includes one or more of the following:
the method comprises the steps of linear prediction parameters, decoding data corresponding to the previous frame of voice data of the voice data to be coded, and adaptive codebook excitation data obtained by decoding the previous frame of voice data.
In one implementation, when the determining unit performs data analysis on the target prediction data through the fixed codebook prediction model to determine fixed codebook excitation data corresponding to the speech data to be encoded, the determining unit is specifically configured to:
performing a first data analysis on the target prediction data through the fixed codebook prediction model to obtain first fixed codebook excitation data corresponding to the voice data to be coded, wherein the first fixed codebook excitation data is partial data of the fixed codebook excitation data;
performing second data analysis on the target prediction data and the first fixed codebook excitation data through a fixed codebook prediction model to obtain second fixed codebook excitation data corresponding to the voice data to be coded;
and if the first fixed codebook excitation data and the second fixed codebook excitation data meet the target condition, determining fixed codebook excitation data corresponding to the voice data to be coded according to the first fixed codebook excitation data and the second fixed codebook excitation data.
In one implementation, the fixed codebook prediction model includes a spectral feature extraction module and an excitation generation module; the determining unit performs data analysis on the target prediction data through the fixed codebook prediction model, and when determining fixed codebook excitation data corresponding to the speech data to be encoded, the determining unit is specifically configured to:
acquiring, from the target prediction data corresponding to the voice data to be coded, the linear prediction parameters, the decoded data corresponding to the previous frame of voice data of the voice data to be coded, and the adaptive codebook excitation data obtained by decoding the previous frame of voice data;
extracting the spectral features of the voice data to be coded according to the linear prediction parameters through the spectral feature extraction module;
and generating, through the excitation generation module, the fixed codebook excitation data corresponding to the voice data to be coded according to the spectral features, the decoded data corresponding to the previous frame of voice data of the voice data to be coded, and the adaptive codebook excitation data obtained by decoding the previous frame of voice data.
In another aspect, an embodiment of the present application provides a speech processing apparatus, including:
a processor adapted to implement one or more instructions; and
a computer-readable storage medium storing one or more instructions adapted to be loaded by the processor to execute the voice processing method described above.
In another aspect, embodiments of the present application provide a computer-readable storage medium, which stores one or more instructions, where the one or more instructions are adapted to be loaded by a processor and execute the above-mentioned voice processing method.
In another aspect, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the voice processing method described above.
In the embodiment of the application, the fixed codebook excitation data is generated by the fixed codebook prediction model during the speech coding, so that the occupation of the storage space by the related data of the fixed codebook can be reduced in the coding process of the speech data to be coded, and the improvement of the integral compression performance and the speech quality of the speech coding is facilitated. Moreover, the voice coding end does not need to send the index of the fixed codebook and the excitation of the fixed codebook to the voice decoding end, so that the channel bandwidth required for transmitting the coded data corresponding to the voice data to be coded can be reduced, the transmission performance is improved, and the voice coding effect is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 illustrates an architectural diagram of a speech processing system provided by an exemplary embodiment of the present application;
FIG. 2 is a flow chart illustrating a method of speech processing provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart illustrating speech encoding according to an exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of speech decoding provided by an exemplary embodiment of the present application;
FIG. 5 is a flow chart illustrating a method of speech processing provided by an exemplary embodiment of the present application;
FIG. 6 illustrates a training diagram of a fixed codebook prediction model provided by an exemplary embodiment of the present application;
FIG. 7 is a flow chart illustrating a method of speech processing provided by an exemplary embodiment of the present application;
FIG. 8 is a diagram illustrating a structure of a fixed codebook prediction model according to an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram illustrating a scenario of a speech processing method according to an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a voice processing apparatus according to an exemplary embodiment of the present application;
FIG. 11 is a block diagram illustrating another speech processing apparatus according to another exemplary embodiment of the present application;
FIG. 12 is a schematic diagram illustrating a speech processing apparatus according to an exemplary embodiment of the present application;
fig. 13 is a schematic structural diagram of another speech processing device according to an exemplary embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The present application relates to speech coding and speech decoding. Speech coding converts the analog speech signal into a digital signal and compresses the speech using the redundancy present in speech data and the characteristics of human hearing, after which the speech encoding end transmits the encoded data to the speech decoding end. Speech decoding receives the encoded data, decodes it, regenerates the speech digital signal, and plays back the sound.
Voice data can be classified into unvoiced and voiced sound. Voiced sound exhibits significant periodicity in the time domain, while unvoiced sound, produced as air creates turbulence through the constricted glottis, is more noise-like. Speech data exhibits two types of correlation: short-term correlation (dominant in unvoiced portions), which is correlation between adjacent samples, and long-term correlation (dominant in voiced portions), which is correlation between corresponding samples in adjacent periods, since voiced sound is periodic. Both correlations produce a certain redundancy, so the redundant information generated by the short-term and long-term correlation must be removed during speech coding to obtain the coded information. At present, during the encoding of voice data to be coded, fixed codebook excitation data is mainly used to approximate the short-time correlation of the voice data to be coded.
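To make the two kinds of redundancy concrete, the following sketch (a toy signal and parameter values assumed for illustration, not taken from the patent) computes the normalized autocorrelation of a synthetic voiced signal: adjacent samples correlate strongly (short-term correlation), and samples one pitch period apart correlate strongly as well (long-term correlation).

```python
# Toy illustration (assumed signal and parameters, not from the patent):
# normalized autocorrelation of a synthetic "voiced" signal is near 1 at
# lag 1 (short-term correlation between adjacent samples) and high at one
# pitch period (long-term correlation between adjacent periods).
import numpy as np

fs = 8000                        # sample rate in Hz (assumed)
pitch_hz = 100                   # synthetic pitch, period = 80 samples
n = np.arange(fs // 10)          # a 100 ms excerpt
x = np.sin(2 * np.pi * pitch_hz * n / fs) \
    + 0.5 * np.sin(2 * np.pi * 300 * n / fs)   # pitch plus a harmonic

ac = np.correlate(x, x, mode="full")[len(x) - 1:]
ac /= ac[0]                      # normalize so lag 0 equals 1

print("lag 1  (short-term):", round(float(ac[1]), 3))
print("lag 80 (one pitch period, long-term):", round(float(ac[80]), 3))
```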
For example, when voice data to be coded is encoded in this manner, using fixed codebook excitation data to approximate its short-time correlation, the bit allocation of the resulting encoded data can be as shown in Table 1:
TABLE 1
Encoded data                               Codewords              Bits per frame
Linear prediction parameters (LSP)         L0, L1, L2, L3         18
Index of target code vector (PITCH)        P0, P1, P2             14
Fixed codebook (CODE)                      C1, S1, C2, S2         34
Gain (GAIN)                                GA1, GB1, GA2, GB2     14
As can be seen, the fixed codebook (CODE) parameters occupy 34 bits of the encoded data per frame. The more bits the fixed codebook parameters occupy, the larger the corresponding storage space, which affects the compression performance and voice quality of the speech coding, and the larger the channel bandwidth required to transmit the fixed codebook parameters, which affects transmission performance.
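A quick computation over Table 1 makes this imbalance explicit; the 10 ms frame duration used for the bitrate figure is an assumption for illustration, not stated in the table.

```python
# Arithmetic over Table 1 (the 10 ms frame duration is an assumption used
# only to turn bits/frame into a bitrate).
bits = {"LSP": 18, "PITCH": 14, "CODE": 34, "GAIN": 14}
total = sum(bits.values())                            # 80 bits per frame
print("total:", total, "bits/frame")
print("fixed codebook share:", bits["CODE"] / total)  # 0.425, i.e. 42.5%
print("bitrate at 10 ms frames:", total / 0.010, "bit/s")  # 8000.0
```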
Based on this, the embodiment of the present application proposes a speech processing scheme, which may include the processes of speech encoding, transmitting the encoded data, and speech decoding. In the scheme adopted by the embodiment of the application, the fixed codebook excitation data is generated by a fixed codebook prediction model during speech coding, which reduces the storage space occupied by fixed-codebook-related data during the encoding of the voice data to be coded and thus helps improve the overall compression performance and voice quality of the speech coding. Moreover, the voice encoding end does not need to send a fixed codebook index or fixed codebook excitation to the voice decoding end, which reduces the channel bandwidth required to transmit the encoded data corresponding to the voice data to be coded, improves transmission performance, and improves the speech coding effect.
The speech processing scheme provided by the embodiment of the application involves technologies such as artificial intelligence and machine learning, where:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operating (interactive) systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning (deep learning) and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and Deep Learning (DL) generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal learning.
To better understand the speech processing process, an embodiment of the present application provides a speech processing system. Referring to fig. 1, fig. 1 shows a schematic architecture diagram of a speech processing system according to an exemplary embodiment of the present application. As shown in fig. 1, the speech processing system includes a speech processing device 101 (i.e., a speech encoding end) and a speech processing device 102 (i.e., a speech decoding end). The speech processing device 101 and the speech processing device 102 may be directly or indirectly connected by wired or wireless communication. The speech processing device 101 is the encoding end of the voice data and the speech processing device 102 is the decoding end; they are referred to below as the speech encoding end 101 and the speech decoding end 102.
It should be noted that the number and the form of the devices shown in fig. 1 are for example and are not to be construed as limiting the embodiments of the present application. In practical applications, the speech processing system provided in the embodiment of the present application may include more than two speech processing devices, and specifically, may include more than one encoding end and more than one decoding end. The speech processing system provided by the embodiment of the present application may further include only one speech processing device, where the speech processing device is both a speech decoding end and a speech encoding end.
The speech processing system shown in fig. 1 is described using the speech encoding end 101 and the speech decoding end 102 as an example. The speech encoding end 101 may encode the original voice data and transmit the encoded data to the speech decoding end 102; the speech decoding end 102 may decode the received encoded data and regenerate the voice data based on the decoded data, thereby playing the speech.
The speech encoding end 101 and the speech decoding end 102 may be devices capable of performing speech processing, or the speech encoding end 101 and the speech decoding end 102 include an application program capable of performing speech processing, for example, the application program may be an instant speech communication application program. The speech encoding side 101 and the speech decoding side 102 may also be respectively disposed in any one of computer devices involved in speech processing.
Specifically, the speech encoding end 101 and the speech decoding end 102 may be servers, either independent servers or a server cluster. The speech encoding end 101 and the speech decoding end 102 may also be terminal devices, specifically intelligent devices such as a computer, a mobile phone, a tablet computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), a vehicle-mounted terminal, or a wearable device.
In one implementation, the speech processing system may be deployed based on a blockchain network, that is, speech processing devices of the speech encoding end 101 and the speech decoding end 102 may be both deployed in the blockchain network, or the speech processing device of the speech encoding end 101 may be deployed outside the blockchain network, the speech processing device of the speech decoding end 102 may be deployed in the blockchain network, and so on. The voice processing device of the voice encoding terminal 101 and the voice processing device of the voice decoding terminal 102 may serve as nodes in a block chain network. If the voice processing device is a server cluster or a distributed system formed by a plurality of physical servers, each physical server can be used as a node in the block chain network.
The blockchain mentioned here is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. It is essentially a decentralized database: a chain of data blocks linked using cryptographic methods. In the speech processing method disclosed in the present application, the relevant data (e.g., the encoded data obtained by the speech encoding end from the data to be coded, and the decoded data obtained by the speech decoding end from the encoded data) may be stored on a blockchain.
Based on the voice processing system, the scheme of the application can be applied to voice call scenarios. Voice call scenario: when user A and user B make a voice call, the scheme of the application can be used to encode, transmit, and decode the voice data of user A and of user B, thereby realizing the voice call between them. In a specific implementation, when user A sends voice data to user B, the voice processing device of user A is the encoding end and the voice processing device of user B is the decoding end. For example, take the voice processing device of user A to be a vehicle-mounted terminal and the voice processing device of user B to be a mobile phone. After the vehicle-mounted terminal of user A and the mobile phone of user B establish a communication connection, the vehicle-mounted terminal can collect the voice data of user A and encode it to obtain encoded data, which is sent to the mobile phone of user B for decoding and playback. Similarly, the mobile phone of user B can send encoded data to the vehicle-mounted terminal of user A, which decodes it to regenerate the voice data, thereby carrying out the voice call.
The scheme can also be applied to video call scenarios: when user A and user B make a video call, the voice data of user A and of user B can be encoded, transmitted, and decoded using the scheme of the application, so that voice data is transmitted during the video call.
Referring to fig. 2, which shows a flowchart of a speech processing method, provided by an exemplary embodiment of the present application, for the speech processing system shown in fig. 1. The speech processing method is realized through the interaction of a speech encoding end 201 and a speech decoding end 202, and may include the following steps 201 to 206:
step 201, the voice encoding end obtains voice data to be encoded, and performs linear prediction analysis on the voice data to be encoded to obtain linear prediction parameters.
In an implementation manner, the manner in which the voice encoding end acquires the voice data to be encoded may be original voice acquired by a microphone of the voice encoding end, and the voice data to be encoded is converted by an analog-to-digital conversion circuit, may also be voice data sent by other voice processing devices, and may also be voice data acquired in a network, which is not limited in this application.
After acquiring the voice data to be coded, the voice encoding end can encode it. The encoding flow at the voice encoding end can be as shown in fig. 3, which shows an encoding flowchart provided by an exemplary embodiment of the present application. As shown in fig. 3, the first step of encoding may be to perform Linear Predictive Analysis (LPA) on the voice data to be coded, s(n), to obtain the linear prediction parameters. Taking the linear prediction parameters as the prediction coefficients of a linear prediction filter, the sum of the adaptive codebook excitation data (based on the target adaptive codebook) and the fixed codebook excitation data (based on the fixed codebook prediction model) is passed through the linear prediction filter 1/A(z) to obtain the synthesized voice data s'(n). The difference between the voice data to be coded and the synthesized voice data is passed through a perceptual weighting filter, and the optimal target code vector in the target adaptive codebook and the gain (Ga) corresponding to the target code vector are found using the minimum mean square perceptually weighted error as the search criterion. This completes the encoding of the voice data to be coded and yields the encoded data.
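The two filters named in this flow can be sketched as follows. The synthesis filter 1/A(z) is defined by the linear prediction parameters; for the perceptual weighting filter, the conventional CELP form W(z) = A(z/g1)/A(z/g2) is assumed here, since the patent names the filter without giving its transfer function. All coefficients and signals are toy values.

```python
# Sketch of the two filters in fig. 3 (toy coefficients and signals; the
# W(z) = A(z/g1)/A(z/g2) form of the perceptual weighting filter is a
# conventional CELP choice assumed here, not given by the patent).
import numpy as np
from scipy.signal import lfilter

def synthesize(excitation, a):
    """Linear prediction (synthesis) filter 1/A(z), A(z) = 1 - sum a_k z^-k."""
    return lfilter([1.0], np.concatenate(([1.0], -a)), excitation)

def perceptual_weight(error, a, g1=0.9, g2=0.6):
    """Perceptual weighting filter W(z) = A(z/g1) / A(z/g2)."""
    k = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], -a * g1 ** k))
    den = np.concatenate(([1.0], -a * g2 ** k))
    return lfilter(num, den, error)

a = np.array([1.2, -0.5])          # toy prediction coefficients (stable)
s = np.random.randn(160)           # voice data to be coded s(n), one frame
excitation = np.random.randn(160)  # adaptive + fixed codebook excitation
s_syn = synthesize(excitation, a)  # synthesized voice data s'(n)
weighted_err = perceptual_weight(s - s_syn, a)
print("perceptually weighted error energy:", float(np.sum(weighted_err ** 2)))
```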
Step 202, the speech encoding end determines a target code vector in the target adaptive codebook, an index of the target code vector and a gain corresponding to the target code vector according to the speech data to be encoded.
In one implementation, a code vector in the target adaptive codebook is a pitch parameter, which includes a pitch delay and a gain. The voice encoding end finds a target code vector in the target adaptive codebook to describe the pitch period information of the voice data to be coded. The product of a candidate code vector found in the target adaptive codebook and the gain corresponding to that code vector gives candidate adaptive codebook excitation data.
This is added to the fixed codebook excitation data generated by the fixed codebook prediction model to obtain the excitation data of the voice data to be coded, which is then passed through the linear prediction filter to obtain the synthesized voice data of the voice data to be coded.
The synthesized voice data is subtracted from the voice data to be coded, and the resulting difference is passed through the perceptual weighting filter shown in fig. 3, yielding a perceptually weighted error. The Minimum Squared Prediction Error (MSPE) criterion is used as the measure for searching the target code vector: the candidate code vector whose perceptually weighted error has the minimum mean square value is determined to be the target code vector; its index is the index of the target code vector, and its gain is the gain corresponding to the target code vector.
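A minimal sketch of this search criterion: each candidate code vector is scaled by its least-squares optimal gain, and the candidate minimizing the mean squared error against the weighted target is kept. The exhaustive loop and the sizes are illustrative assumptions; practical codecs restrict the searched pitch-lag range.

```python
# Sketch of the MSPE search (exhaustive loop and sizes are assumptions).
import numpy as np

def search_adaptive_codebook(target, candidates):
    """Return (index, gain, error) of the candidate code vector that,
    scaled by its least-squares optimal gain, minimizes the mean squared
    error against the (perceptually weighted) target subframe."""
    best = (None, 0.0, np.inf)
    for idx, v in enumerate(candidates):
        energy = float(np.dot(v, v))
        if energy == 0.0:
            continue
        gain = float(np.dot(target, v)) / energy      # optimal Ga
        err = float(np.mean((target - gain * v) ** 2))
        if err < best[2]:
            best = (idx, gain, err)
    return best

target = np.random.randn(40)                          # 5 ms subframe (assumed)
codebook = [np.random.randn(40) for _ in range(128)]  # candidate code vectors
idx, gain, err = search_adaptive_codebook(target, codebook)
print("target code vector index:", idx, "gain:", round(gain, 3))
```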
Step 203, the voice encoding end sends the encoded data corresponding to the voice data to be encoded to the voice decoding end.
The encoded data sent from the speech encoding end to the speech decoding end may include the line spectrum pair parameter, the index of the target code vector, and the gain corresponding to the target code vector.
Step 204, the speech decoding end determines the adaptive codebook excitation data according to the index of the target code vector and the gain corresponding to the target code vector.
In one implementation, the decoding flow for the voice data to be coded at the speech decoding end can be as shown in fig. 4, which shows a decoding flowchart provided by an exemplary embodiment of the present application. The encoded data obtained through the encoding flow shown in fig. 3 may be decoded through the decoding flow shown in fig. 4 to obtain the decoded data corresponding to the encoded data.
Specifically, the speech decoding end may determine, according to the index of the target code vector of the voice data to be coded, the target code vector and its corresponding gain from the same target adaptive codebook as the speech encoding end. The adaptive codebook excitation data may then be determined as the product of the target code vector and the gain corresponding to the target code vector. The fixed codebook prediction model then outputs the fixed codebook excitation data, and the sum of the adaptive codebook excitation data and the fixed codebook excitation data is passed through a synthesis filter to obtain the synthesized voice data, i.e., the decoded data.
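A decoder-side sketch of this paragraph, under stated assumptions: `predict_fixed` stands in for the patent's fixed codebook prediction model, and the codebook, coefficients, and sizes are toy values.

```python
# Decoder-side sketch: adaptive codebook excitation from index and gain,
# plus model-generated fixed codebook excitation, through 1/A(z).
import numpy as np
from scipy.signal import lfilter

def decode_subframe(index, gain, adaptive_codebook, a, predict_fixed):
    v = adaptive_codebook[index]          # target code vector
    adaptive_exc = gain * v               # adaptive codebook excitation
    fixed_exc = predict_fixed()           # model output, never transmitted
    excitation = adaptive_exc + fixed_exc
    # synthesis filter 1/A(z), identical to the encoder's prediction filter
    return lfilter([1.0], np.concatenate(([1.0], -a)), excitation)

codebook = [np.random.randn(40) for _ in range(128)]  # shared with encoder
a = np.array([1.2, -0.5])                             # decoded LPC (toy)
stub_model = lambda: 0.1 * np.random.randn(40)        # placeholder model
decoded = decode_subframe(17, 0.8, codebook, a, stub_model)
```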
Step 205, the voice decoding end determines target prediction data corresponding to the voice data to be coded, and performs data analysis on the target prediction data through a fixed codebook prediction model to determine fixed codebook excitation data corresponding to the voice data to be coded.
In an implementation manner, the speech decoding end may determine input data, that is, target prediction data, input into the fixed codebook prediction model, and then the fixed codebook prediction model may perform data analysis on the target prediction data to obtain fixed codebook excitation data corresponding to the speech data to be encoded. The decoded data may then be generated by fixed codebook excitation data.
And step 206, the voice decoding end carries out synthesis processing on the adaptive codebook excitation data and the fixed codebook excitation data according to the linear prediction parameters to obtain decoding data corresponding to the voice data to be coded.
In one implementation, the excitation data of the voice data to be coded is obtained by adding the fixed codebook excitation data produced by the data analysis of the fixed codebook prediction model to the adaptive codebook excitation data; this excitation data is then passed through the synthesis filter to obtain the synthesized voice data, which is the decoded data corresponding to the voice data to be coded. It should be noted that the synthesis filter in the decoding flow shown in fig. 4 is the same as the linear prediction filter shown in fig. 3: when encoding the voice data to be coded, the speech encoding end simulates the decoding process at the speech decoding end.
In the embodiment of the application, the fixed codebook excitation data is generated by the fixed codebook prediction model during the speech coding, so that the occupation of the storage space by the related data of the fixed codebook can be reduced in the coding process of the speech data to be coded, and the improvement of the integral compression performance and the speech quality of the speech coding is facilitated. Moreover, the voice coding end does not need to send the index of the fixed codebook and the excitation of the fixed codebook to the voice decoding end, so that the channel bandwidth required for transmitting the coded data corresponding to the voice data to be coded can be reduced, the transmission performance is improved, and the voice coding effect is improved.
Based on the above description, please refer to fig. 5, fig. 5 shows a flowchart of a speech processing method provided in an exemplary embodiment of the present application, where the speech processing method can be executed by the speech encoding terminal 201 in the embodiment shown in fig. 2, and the speech processing method can include the following steps 501 to 503:
step 501, obtaining voice data to be coded, and performing linear prediction analysis on the voice data to be coded to obtain linear prediction parameters.
In one implementation, before encoding the voice data, the voice encoding end performs analog-to-digital conversion on the sound signal of the original voice, so as to convert the analog original voice into the voice data, and further encode the voice data. It should be noted that the speech data to be encoded acquired by the speech encoding end may be a frame of speech data, and the frame length of the frame of speech data may be 10 milliseconds (ms) or 20 ms.
In an implementation manner, before performing linear prediction analysis on the voice data to be encoded, the voice encoding end may perform high-pass filtering on the voice data to be encoded, so as to remove a direct-current component in the voice data to be encoded, and obtain the voice data to be encoded after the high-pass filtering. And further carrying out linear prediction analysis on the high-pass filtered voice data to be coded to obtain linear prediction parameters.
The principle of linear prediction analysis is to approximate the current speech sample using a weighted linear combination of several past speech samples; the weighting coefficients in the linear combination are the linear prediction parameters. When the mean square error between the samples of the voice data to be coded and the linearly predicted samples reaches its minimum, the computed coefficients are taken as the linear prediction parameters of the voice data to be coded; this computation process is the linear prediction analysis. Where necessary, the linear prediction parameters are computed once per frame of speech data. The linear prediction parameters obtained by the linear prediction analysis are then used as the prediction coefficients of the linear prediction filter.
It will be appreciated that the number of weighting coefficients in the linear combination is the order of the linear prediction parameter. For example, if a weighted linear combination of the past 10 speech samples is used to approximate the current speech sample, the order of the linear prediction parameter is 10.
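A minimal sketch of such an analysis, assuming the standard autocorrelation method with the Levinson-Durbin recursion (the patent does not prescribe a particular algorithm): the recursion finds the order-p coefficients minimizing the mean square prediction error over the frame.

```python
# Sketch of linear prediction analysis via the autocorrelation method and
# the Levinson-Durbin recursion (an assumed, standard algorithm choice).
import numpy as np

def lpc(frame, order=10):
    """Return A(z) coefficients [1, a1, ..., ap] minimizing the mean square
    error of the predictor s[n] ~= -(a1*s[n-1] + ... + ap*s[n-p])."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                    # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k                # remaining prediction error
    return a, err

frame = np.random.randn(160)              # one 20 ms frame at 8 kHz (assumed)
coeffs, residual_energy = lpc(frame, order=10)   # order 10, as in the example
```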
Step 502, determining a target code vector in the target adaptive codebook, an index of the target code vector and a gain corresponding to the target code vector according to the voice data to be encoded.
In one implementation, the target code vector is determined per subframe, i.e., a frame of the voice data to be coded is divided into several subframes. For example, if the voice data to be coded is one 20 ms frame of speech, it is divided into 4 subframes of 5 ms each. After the linear prediction analysis of the voice data to be coded, a target code vector and the gain corresponding to it are determined for each subframe split from the voice data to be coded. The target adaptive codebook is used to approximate the long-term periodic structure of the voice data to be coded. The determined target code vector indicates the pitch delay parameter of the voice data to be coded, and the pitch delay parameter is the index of the target code vector. The target adaptive codebook may be searched by open-loop plus closed-loop pitch analysis or by other search methods, which this application does not limit.
In one implementation, the historical adaptive codebook is updated before the target code vector and corresponding gain of each subframe are determined. The content of the historical adaptive codebook is the excitation data prior to the current subframe. That is, before the target code vector of the voice data to be coded is determined, update data may be obtained and used to update the historical adaptive codebook. The update data may be the target code vector of the voice data of the subframe preceding the current subframe, the gain corresponding to that target code vector, and the fixed codebook excitation data corresponding to the preceding subframe.
After the update data is obtained, the adaptive codebook excitation data of the voice data of the previous subframe may be determined as the product of the target code vector of the previous subframe and the gain corresponding to it. The excitation data of the voice data of the previous subframe is then obtained as the sum of the adaptive codebook excitation data of the previous subframe and the fixed codebook excitation data corresponding to the previous subframe, and the historical adaptive codebook is updated with this excitation data to obtain the updated adaptive codebook, i.e., the target adaptive codebook. The target adaptive codebook may be a register structure: the update shifts the excitation data of the previous subframe into the historical adaptive codebook and shifts the same number of its oldest elements out, yielding the target adaptive codebook.
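A minimal sketch of this register-style update, with the history and subframe sizes assumed for illustration:

```python
# Sketch of the shift-register update (history and subframe sizes assumed).
import numpy as np

SUBFRAME = 40            # 5 ms subframe at 8 kHz (assumed)
history = np.zeros(320)  # historical adaptive codebook (assumed length)

def update_adaptive_codebook(history, prev_codevector, prev_gain, prev_fixed_exc):
    """Shift the previous subframe's total excitation in; drop the oldest."""
    prev_excitation = prev_gain * prev_codevector + prev_fixed_exc
    return np.concatenate((history[SUBFRAME:], prev_excitation))

v_prev = np.random.randn(SUBFRAME)       # previous target code vector
fixed_prev = np.random.randn(SUBFRAME)   # previous fixed codebook excitation
target_codebook = update_adaptive_codebook(history, v_prev, 0.8, fixed_prev)
```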
The fixed codebook excitation data of the voice data of the previous subframe may be obtained by determining the target prediction data of the voice data of the previous subframe and performing data analysis on that target prediction data through the fixed codebook prediction model. The target prediction data of the previous subframe may include the linear prediction parameters of the voice data of the previous subframe, the decoded data corresponding to the subframe before the previous subframe, and the adaptive codebook excitation data of the voice data of the previous subframe.
In one implementation, the fixed codebook prediction model is trained before the voice data to be coded is encoded. The training process of the fixed codebook prediction model can be as shown in fig. 6, which shows a training diagram of a fixed codebook prediction model according to an exemplary embodiment of the present application. The training flow shown in fig. 6 is the same as the encoding flow shown in fig. 3, except that in fig. 6 the model used is the initial fixed codebook prediction model and the input voice data to be coded is each speech training sample in the speech training sample set.
In one implementation, a set of speech training samples is obtained, the set of speech training samples including a plurality of speech training samples. The obtaining manner of each voice training sample in the voice training sample set is not limited in this application, such as voice data obtained from the internet based on big data and the like. The voice data of the voice training sample may be a frame of voice data.
In a possible implementation manner, as shown in fig. 6, the speech encoding end performs iterative training on the initial fixed codebook prediction model according to the speech training sample set to obtain the fixed codebook prediction model. For the training of the fixed codebook prediction model, the speech training samples in the speech training sample set need to be processed to obtain the data for training the fixed codebook prediction model.
Specifically, a target voice training sample is obtained from the voice training sample set, where the target voice training sample is any sample in the set, and linear prediction analysis is performed on it to obtain the linear prediction parameters of the target voice training sample. The decoded data corresponding to the voice data of the subframe preceding the current subframe in the target voice training sample, the training target code vector of that preceding subframe, and the gain corresponding to the training target code vector are obtained, and the adaptive codebook excitation data of the preceding subframe in the target voice training sample is obtained from the training target code vector of the preceding subframe and the gain corresponding to it. The initial fixed codebook prediction model is then iteratively trained with the linear prediction parameters, the decoded data corresponding to the preceding subframe in the target voice training sample, and the adaptive codebook excitation data of the preceding subframe as input data, to obtain the fixed codebook prediction model.
In one implementation, each speech training sample carries a fixed codebook excitation label, and the fixed codebook excitation label carried by the target speech training sample is used to indicate the fixed codebook excitation data corresponding to the target speech training sample. The value of the fixed codebook excitation label carried by each speech training sample may be the difference between the perceptual weighting error of the current subframe in the target speech training sample and the adaptive codebook excitation data. If the perceptual weighting error of the current subframe in the target speech training sample is denoted e(n) and the adaptive codebook excitation data is denoted p(n), then the label value for the current subframe in the target speech training sample may be e(n) - p(n).
In a possible implementation, the training of the initial fixed codebook prediction model may compute the loss using a cross-entropy loss function or a relative-entropy loss function, which this application does not limit. Based on the obtained loss, the parameters of the initial fixed codebook prediction model are adjusted until the loss satisfies the training end condition, and the initial fixed codebook prediction model at the point where the loss satisfies the training end condition is determined to be the fixed codebook prediction model. Optionally, the training end condition may be that the computed loss reaches a minimum.
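One training step might look like the following sketch. The linear stand-in model, the mean squared error surrogate loss, and all sizes are assumptions for illustration; the patent names cross entropy or relative entropy as loss options and does not fix the model architecture.

```python
# One training step with a toy linear stand-in for the prediction model and
# an MSE surrogate loss (model, loss, and sizes are assumptions).
import numpy as np

def train_step(W, inputs, e, p, lr=1e-3):
    """inputs: target prediction data for the subframe, as one vector.
    e, p: perceptual weighting error and adaptive codebook excitation."""
    label = e - p                     # fixed codebook excitation label e(n)-p(n)
    pred = W @ inputs                 # predicted fixed codebook excitation
    diff = pred - label
    loss = float(np.mean(diff ** 2))
    W -= lr * (2.0 / len(diff)) * np.outer(diff, inputs)  # MSE gradient step
    return loss

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(40, 64))  # 40-sample subframe, 64-dim input (assumed)
x, e, p = rng.normal(size=64), rng.normal(size=40), rng.normal(size=40)
print("loss:", train_step(W, x, e, p))
```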
Step 503, the voice encoding end sends the encoded data corresponding to the voice data to be encoded to the voice decoding end.
In one implementation, the encoded data includes: the linear prediction parameters, the index of the target code vector, and the gain corresponding to the target code vector. Since the linear prediction parameters obtained by linear prediction analysis are not stable, they may be converted into Line Spectrum Pair (LSP) parameters and quantized, giving the LSP parameters.
Illustratively, the bit allocation of the encoded data may be as shown in table 2:
TABLE 2
Encoded data                               Codewords          Bits per frame
Linear prediction parameters (LSP)         L0, L1, L2, L3     18
Index of target code vector (PITCH)        P0, P1, P2         14
Gain corresponding to target code vector   GA1, GA2           7
As can be seen from Table 2, the speech encoding end only needs to send the linear prediction parameters, the index of the target code vector, and the gain of the target code vector to the speech decoding end; it does not need to send a fixed code vector index determined in a fixed codebook or the gain corresponding to the fixed code vector. This improves the compression performance of the speech coding, reduces the channel bandwidth required for transmission, and improves the compression effect. Note that the bit allocation shown in Table 2 covers only part of the encoded data; the encoded data may include other data.
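Comparing Table 2 against Table 1 quantifies the saving; the 10 ms frame duration used for the bitrate figure is, again, an assumption for illustration.

```python
# Saving relative to Table 1 (10 ms frames assumed for the bitrate figure).
table1 = 18 + 14 + 34 + 14   # LSP + PITCH + CODE + GAIN = 80 bits/frame
table2 = 18 + 14 + 7         # LSP + PITCH + gain       = 39 bits/frame
print("bits saved per frame:", table1 - table2)             # 41
print("bitrate: %.1f -> %.1f kbit/s" % (table1 / 10.0, table2 / 10.0))
```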
In the embodiment of the application, the fixed codebook excitation data is generated by the fixed codebook prediction model during the speech coding, so that the occupation of the storage space by the related data of the fixed codebook can be reduced in the coding process of the speech data to be coded, and the improvement of the integral compression performance and the speech quality of the speech coding is facilitated. Moreover, the voice coding end does not need to send the index of the fixed codebook and the excitation of the fixed codebook to the voice decoding end, so that the channel bandwidth required for transmitting the coded data corresponding to the voice data to be coded can be reduced, the transmission performance is improved, and the voice coding effect is improved.
Further, referring to fig. 7, fig. 7 is a flowchart illustrating a speech processing method according to an exemplary embodiment of the present application, where the speech processing method can be executed by the speech decoding end 202 in the embodiment illustrated in fig. 2, and the speech processing method includes the following steps 701 to 704:
step 701, receiving encoded data corresponding to-be-encoded voice data sent by a voice encoding end.
In one implementation, the encoded data includes the linear prediction parameters of the voice data to be coded, the index of the target code vector, and the gain corresponding to the target code vector. The linear prediction parameters are LSP parameters, from which the prediction coefficients of the synthesis filter shown in fig. 4 can be obtained by interpolation.
Step 702, determining adaptive codebook excitation data according to the index of the target codevector and the gain corresponding to the target codevector.
Specifically, the voice decoding end decodes the encoded data to regenerate the voice data.
As shown in fig. 4, based on the index of the target code vector, the pitch lag of the speech data to be encoded, and hence the target code vector, can be determined. The adaptive codebook excitation data is then the product of the target code vector and the gain corresponding to the target code vector.
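A minimal sketch of this step, assuming integer pitch lags and a flat NumPy buffer of past excitation (function and variable names are illustrative, not from the application):

```python
import numpy as np

def adaptive_codebook_excitation(past_excitation, pitch_lag, gain, subframe_len):
    # The target code vector is the segment of past excitation located
    # pitch_lag samples back; for lags shorter than the subframe the
    # vector wraps onto itself. Fractional pitch lags are omitted here.
    L = len(past_excitation)
    codevector = []
    for n in range(subframe_len):
        if n < pitch_lag:
            codevector.append(past_excitation[L - pitch_lag + n])
        else:
            codevector.append(codevector[n - pitch_lag])
    return gain * np.asarray(codevector)
```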
Step 703, determining target prediction data corresponding to the voice data to be encoded, and performing data analysis on the target prediction data through a fixed codebook prediction model to determine fixed codebook excitation data corresponding to the voice data to be encoded.
Specifically, the excitation data input to the synthesis filter at the speech decoding end includes adaptive codebook excitation data and fixed codebook excitation data, and the fixed codebook excitation data may be generated based on a fixed codebook prediction model. The speech decoding end may determine the target prediction data to be input to the fixed codebook prediction model, and then obtain the fixed codebook excitation data output by the model.
In one implementation, if the speech data to be encoded is start frame speech data, the target value is determined as the target prediction data corresponding to the speech data to be encoded. Illustratively, if the first subframe is the start frame speech data in the speech to be decoded, the target prediction data corresponding to the first subframe is a target value, which may be 0 or a small random value, and the fixed codebook prediction model outputs the fixed codebook excitation data of the first subframe based on this input.
In one implementation, when the first subframe is not the start frame speech data in the speech to be decoded, the target prediction data corresponding to the first subframe may include one or more of: the linear prediction parameters, the decoded data corresponding to the previous subframe of speech data, and the adaptive codebook excitation data obtained by decoding that previous subframe. Both cases are sketched below.
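The feature sizes and the zero/random choice for the start frame in this sketch are assumptions for illustration:

```python
import numpy as np

def build_target_prediction_data(is_start_frame, lsp_params=None,
                                 prev_decoded=None, prev_adaptive_exc=None,
                                 size=30):
    # Start frame: no decoded history exists, so a target value is used
    # instead -- zeros here, though a small random vector also works.
    if is_start_frame:
        return np.zeros(size, dtype=np.float32)
    # Otherwise concatenate the feature groups named above (any subset
    # of them may be used, per the text).
    return np.concatenate([lsp_params, prev_decoded,
                           prev_adaptive_exc]).astype(np.float32)
```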
In one implementation, the fixed codebook prediction model does not output all fixed codebook excitation data at once. After the target prediction data is input into the fixed codebook prediction model, the model performs a first data analysis on the target prediction data to obtain first fixed codebook excitation data corresponding to the voice data to be encoded, where the first fixed codebook excitation data is partial data of the fixed codebook excitation data. Further, a second data analysis is performed on the target prediction data and the first fixed codebook excitation data through the fixed codebook prediction model to obtain second fixed codebook excitation data corresponding to the voice data to be encoded.
And if the first fixed codebook excitation data and the second fixed codebook excitation data meet the target condition, determining fixed codebook excitation data corresponding to the voice data to be coded according to the first fixed codebook excitation data and the second fixed codebook excitation data. And if the first fixed codebook excitation data and the second fixed codebook excitation data do not meet the target condition, performing third-time data analysis on the target prediction data and the second fixed codebook excitation data through a fixed codebook prediction model to obtain third fixed codebook excitation data corresponding to the voice data to be coded.
It is then judged again whether the first, second, and third fixed codebook excitation data meet the target condition; if so, the fixed codebook excitation data is generated from the first, second, and third fixed codebook excitation data. If the target condition is still not met, the fixed codebook prediction model again generates further partial data of the fixed codebook excitation data, until the accumulated partial data meets the target condition, at which point the fixed codebook excitation data corresponding to the voice data to be encoded is generated from it.
Illustratively, take the sampling rate of the voice data to be encoded as 8 kHz, where 8 kHz means 8000 samples per second of voice data, and take the voice data to be encoded as one 30 ms frame. The voice data to be encoded then has 240 samples and includes 3 subframes, each subframe being 10 ms of voice data with 80 samples.
Taking one subframe as an example: after the target prediction data is input into the fixed codebook prediction model, the model performs a first data analysis on the target prediction data to obtain first fixed codebook excitation data, for example excitation data corresponding to 8 sampling points. Then the target prediction data and the first fixed codebook excitation data are fed into the model for a second data analysis, producing second fixed codebook excitation data, again corresponding to 8 sampling points, for example. The target condition may be that the total number of sampling points covered by the fixed codebook excitation data output over the successive data analyses equals the number of sampling points in one subframe. At this point the first and second fixed codebook excitation data cover 8 sampling points each, which does not meet the target condition, so the second fixed codebook excitation data and the target prediction data are analyzed again by the model to obtain third fixed codebook excitation data. Since each data analysis outputs excitation data for 8 sampling points, after the 10th output the accumulated data covers 80 sampling points and the target condition is met. The 10 partial outputs are then assembled into the fixed codebook excitation data of the subframe, completing the iterative generation of the fixed codebook excitation data by the fixed codebook prediction model, as the sketch below illustrates.
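This loop, matching the 8-samples-per-pass, 80-samples-per-subframe figures above, could look as follows (the model call signature is an assumption):

```python
import numpy as np

CHUNK = 8       # excitation samples produced per data analysis
SUBFRAME = 80   # one 10 ms subframe at 8 kHz

def generate_fixed_codebook_excitation(model, target_prediction_data):
    # Each pass also receives the chunk produced by the previous pass;
    # iteration stops once the accumulated samples cover one subframe
    # (the target condition).
    chunks, prev_chunk = [], None
    while CHUNK * len(chunks) < SUBFRAME:
        prev_chunk = model(target_prediction_data, prev_chunk)  # -> CHUNK samples
        chunks.append(prev_chunk)
    return np.concatenate(chunks)  # 10 chunks of 8 samples -> 80 samples
```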
Referring to fig. 8, fig. 8 is a schematic diagram of a structure of a fixed codebook prediction model according to an exemplary embodiment of the present application. The fixed codebook prediction model may be a neural network model constructed from one or more of a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory model (LSTM), and a Gated Recurrent Unit (GRU); the structure of the initial fixed codebook prediction model may be chosen based on the actual application scenario, which is not limited in this application.
The structure of the fixed codebook prediction model shown in fig. 8 is only an example; it may include a spectral feature extraction module and an excitation generation module. From the target prediction data corresponding to the voice data to be encoded, the model obtains the linear prediction parameters, the decoded data corresponding to the previous frame of voice data, and the adaptive codebook excitation data obtained by decoding the previous frame. It should be noted that the previous frame here refers to the previous subframe within a frame of voice data.
The spectral feature extraction module extracts the spectral features of the voice data to be encoded according to the linear prediction parameters; as shown in fig. 8, it may include one fully-connected layer (fully-connected layer 1) and one GRU (gated recurrent unit 1). The excitation generation module then generates the fixed codebook excitation data corresponding to the voice data to be encoded according to the spectral features, the decoded data corresponding to the previous frame of voice data, and the adaptive codebook excitation data obtained by decoding the previous frame. As shown in fig. 8, the excitation generation module may include two fully-connected layers, fully-connected layer 2 (DENSE 2) and fully-connected layer 3 (DENSE 3), and two GRUs, gated recurrent unit 2 (GRU 2) and gated recurrent unit 3 (GRU 3).
In one implementation, with the structure of the fixed codebook prediction model as shown in fig. 8 and a sampling rate of 8 kHz for the voice data to be encoded, the linear prediction analysis may output 10 orders of data, the linear filter and the adaptive filter each also output 10 values, the fixed codebook prediction model outputs 10 values, and the fixed codebook prediction model takes 40 input values. In this case the numbers of neurons of fully-connected layer 1, fully-connected layer 2, and fully-connected layer 3 may be designed as 64, 64, and 10 respectively, and the numbers of neurons of GRU1, GRU2, and GRU3 as 64, 256, and 10 respectively.
In one implementation, when the sampling rate of the voice data to be encoded is 16 kHz or above, the linear prediction analysis may output 16 orders of data, the linear filter and the adaptive filter each also output 16 values, the fixed codebook prediction model outputs 16 values, and the fixed codebook prediction model takes 64 input values. In this case the numbers of neurons of fully-connected layer 1, fully-connected layer 2, and fully-connected layer 3 may be designed as 64, 64, and 16 respectively, and the numbers of neurons of GRU1, GRU2, and GRU3 as 64, 256, and 16 respectively.
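One plausible PyTorch realization of the 8 kHz configuration is sketched below. Fig. 8 does not fully specify the wiring or activations, so the layer order, the tanh nonlinearities, and the exact input split are assumptions:

```python
import torch
import torch.nn as nn

class FixedCodebookPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # Spectral feature extraction module: dense layer 1 + GRU 1.
        self.dense1 = nn.Linear(10, 64)            # 10th-order LPC input
        self.gru1 = nn.GRU(64, 64, batch_first=True)
        # Excitation generation module: dense layers 2/3 + GRUs 2/3.
        # Input: 64 spectral features + three 10-value history groups.
        self.dense2 = nn.Linear(64 + 10 + 10 + 10, 64)
        self.gru2 = nn.GRU(64, 256, batch_first=True)
        self.gru3 = nn.GRU(256, 10, batch_first=True)
        self.dense3 = nn.Linear(10, 10)            # excitation chunk output

    def forward(self, lpc, prev_decoded, prev_adaptive_exc, prev_fixed_exc):
        # Spectral features from the linear prediction parameters.
        spec, _ = self.gru1(torch.tanh(self.dense1(lpc)))
        # Excitation generation from spectral features plus decoder history.
        x = torch.cat([spec, prev_decoded, prev_adaptive_exc, prev_fixed_exc],
                      dim=-1)
        x = torch.tanh(self.dense2(x))
        x, _ = self.gru2(x)
        x, _ = self.gru3(x)
        return self.dense3(x)
```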
Step 704, synthesizing the adaptive codebook excitation data and the fixed codebook excitation data according to the linear prediction parameters to obtain the decoded data corresponding to the voice data to be encoded.
The voice decoding end adds the fixed codebook excitation data predicted by the fixed codebook prediction model to the adaptive codebook excitation data to obtain the excitation data, and then passes the excitation data through the synthesis filter to obtain the synthesized voice data, which is the decoded data corresponding to the voice data to be encoded. Note that the synthesis filter in the decoding flow shown in fig. 4 is the same filter as the linear prediction filter shown in fig. 3: the synthesized speech data output by the linear prediction filter during encoding corresponds to the decoded data output by the synthesis filter during decoding.
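This synthesis step can be sketched with a standard all-pole filter; the LPC sign convention below is an assumption, and post-processing is omitted:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(adaptive_exc, fixed_exc, lpc_coeffs):
    # Total excitation is the sum of the adaptive and fixed codebook
    # contributions; the synthesis filter is the all-pole filter 1/A(z)
    # with A(z) = 1 - sum_k a_k z^-k built from the prediction parameters.
    excitation = adaptive_exc + fixed_exc
    a = np.concatenate(([1.0], -np.asarray(lpc_coeffs)))
    return lfilter([1.0], a, excitation)  # decoded (synthesized) speech
```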
Optionally, the decoded data may be post-processed to improve the speech quality of the synthesized speech data.
In the embodiment of the application, the fixed codebook excitation data is generated by the fixed codebook prediction model during speech coding, so the storage space occupied by fixed codebook data can be reduced in the process of encoding the voice data to be encoded, which helps improve the overall compression performance and speech quality of speech coding. Moreover, the voice encoding end does not need to send a fixed codebook index or fixed codebook excitation to the voice decoding end, so the channel bandwidth required for transmitting the encoded data corresponding to the voice data to be encoded can be reduced, improving transmission performance and the effect of speech coding.
For better understanding of the above, the speech processing method described above is further explained below in conjunction with an in-vehicle application scenario:
referring to fig. 9, fig. 9 is a diagram of an application scenario of a speech processing method according to an exemplary embodiment of the present application. In the in-vehicle scenario shown in fig. 9, the speech encoding end is the vehicle-mounted terminal 90 of a smart car being used by user A, and the speech decoding end may be the mobile phone of user B, who is in voice communication with user A. As shown in fig. 9, user A is driving and establishes a voice call with user B through the vehicle-mounted terminal 90 of the smart car. A microphone of the vehicle-mounted terminal picks up the voice of the user, and an analog-to-digital conversion circuit converts the analog voice signal into a digital voice signal, i.e., voice data. The vehicle-mounted terminal encodes the voice data using the encoding flow provided by the embodiment of the application and transmits the encoded data to the mobile phone of user B. The mobile phone of user B likewise performs analog-to-digital conversion and encoding on the sound made by user B and sends the result to the vehicle-mounted terminal of user A. The vehicle-mounted terminal of user A thus decodes and regenerates the voice data of user B, the mobile phone of user B decodes and regenerates the voice data of user A, and the voice call is completed.
In the embodiment of the application, the fixed codebook excitation data is generated by the fixed codebook prediction model during speech coding, so the storage space occupied by fixed codebook data can be reduced in the process of encoding the voice data to be encoded, which helps improve the overall compression performance and speech quality of speech coding. Moreover, the voice encoding end does not need to send a fixed codebook index or fixed codebook excitation to the voice decoding end, so the channel bandwidth required for transmitting the encoded data corresponding to the voice data to be encoded can be reduced, improving transmission performance and the effect of speech coding.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a speech processing apparatus according to an exemplary embodiment of the present application, where the speech processing apparatus 100 is applied to a speech encoding side, and the speech processing apparatus 100 may be configured to execute corresponding steps in the speech processing methods shown in fig. 2 and fig. 5. Referring to fig. 10, the speech processing apparatus 100 includes the following units:
an obtaining unit 1001, configured to obtain voice data to be encoded, and perform linear prediction analysis on the voice data to be encoded to obtain a linear prediction parameter;
a determining unit 1002, configured to determine, according to the voice data to be encoded, a target code vector in a target adaptive codebook, an index of the target code vector, and a gain corresponding to the target code vector;
a sending unit 1003, configured to send the linear prediction parameter, the index of the target code vector, and the gain corresponding to the target code vector to a speech decoding end as encoded data corresponding to speech data to be encoded.
In one implementation, the speech processing apparatus 100 further includes:
the obtaining unit 1001 is further configured to obtain a target code vector of a previous frame of voice data of the voice data to be encoded, a gain corresponding to the target code vector of the previous frame of voice data, and fixed codebook excitation data corresponding to the previous frame of voice data;
an updating unit 1004, configured to update the historical adaptive codebook according to a target code vector of a previous frame of speech data of the speech data to be encoded, a gain corresponding to the target code vector of the previous frame of speech data, and fixed codebook excitation data corresponding to the previous frame of speech data, so as to obtain a target adaptive codebook.
In one implementation, the obtaining unit 1001, when obtaining the fixed codebook excitation data corresponding to the previous frame of speech data, is specifically configured to:
determining target prediction data of previous frame voice data of the voice data to be coded;
and performing data analysis on target prediction data of the previous frame of voice data through a fixed codebook prediction model, and determining fixed codebook excitation data corresponding to the previous frame of voice data.
In one implementation, the updating unit 1004 updates the historical adaptive codebook according to the target code vector of the previous frame of speech data of the speech data to be encoded, the gain corresponding to the target code vector of the previous frame of speech data, and the excitation data of the fixed codebook corresponding to the previous frame of speech data to obtain the target adaptive codebook, and is specifically configured to:
determining self-adaptive codebook excitation data of the previous frame of voice data according to a target code vector of the previous frame of voice data of the voice data to be coded and a gain corresponding to the target code vector of the previous frame of voice data;
and updating the historical adaptive codebook according to the sum of the adaptive codebook excitation data of the previous frame of voice data and the fixed codebook excitation data corresponding to the previous frame of voice data to obtain the target adaptive codebook.
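For illustration, this update can be sketched as follows; representing the adaptive codebook as a flat buffer of past total excitation is an assumption, and the names are illustrative:

```python
import numpy as np

def update_adaptive_codebook(history, prev_adaptive_exc, prev_fixed_exc):
    # Shift out the oldest samples and append the previous frame's total
    # excitation: adaptive codebook excitation plus the fixed codebook
    # excitation predicted by the model.
    total_exc = prev_adaptive_exc + prev_fixed_exc
    return np.concatenate([history[len(total_exc):], total_exc])
```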
In one implementation, the speech processing apparatus 100 further includes:
the obtaining unit 1001 is further configured to obtain a voice training sample set, where the voice training sample set includes a plurality of voice training samples;
a training unit 1005, configured to perform iterative training on the initial fixed codebook prediction model according to the speech training sample set to obtain a fixed codebook prediction model, where the fixed codebook prediction model is used to determine fixed codebook excitation data corresponding to the input speech data.
In one implementation, when iteratively training the initial fixed codebook prediction model according to the speech training sample set to obtain the fixed codebook prediction model, the training unit 1005 is specifically configured to:
acquiring a target voice training sample from the voice training sample set, and performing linear prediction analysis on the target voice training sample to obtain a training linear prediction parameter of the target voice training sample, wherein the target voice training sample is any one of the voice training sample set;
acquiring decoded data corresponding to the last frame of voice data of the target voice training sample, a training target code vector of the last frame of voice data of the target voice training sample and gain corresponding to the training target code vector;
and performing iterative training on the initial fixed codebook prediction model through the training linear prediction parameters, the decoded data corresponding to the last frame of voice data of the target voice training sample, the training target code vector of the last frame of voice data of the target voice training sample and the gain corresponding to the training target code vector to obtain the fixed codebook prediction model.
In one implementation, the speech processing apparatus 100 further includes:
a high-pass filtering unit 1006, configured to perform high-pass filtering on the data to be encoded to obtain the data to be encoded after the high-pass filtering;
when the obtaining unit 1001 performs linear prediction analysis on the to-be-encoded voice data to obtain a linear prediction parameter corresponding to the to-be-encoded voice data, it is specifically configured to:
and performing linear prediction analysis on the data to be coded after the high-pass filtering to obtain linear prediction parameters.
According to an embodiment of the present application, the units in the speech processing apparatus 100 shown in fig. 10 may be combined, individually or together, into one or several other units, or some unit(s) may be further split into multiple functionally smaller units, which can achieve the same operation without affecting the technical effect of the embodiment of the application. The above units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the speech processing apparatus 100 may also include other units, and in practical applications these functions may be realized with the assistance of other units and through the cooperation of multiple units. According to another embodiment of the present application, the speech processing apparatus 100 shown in fig. 10 may be constructed, and the speech processing method of the embodiment of the application implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2 and fig. 5 on a general-purpose computing device, such as a computer, that includes a Central Processing Unit (CPU), a random access memory (RAM), a read-only memory (ROM), and other processing and storage elements. The computer program may be recorded on, for example, a computer-readable storage medium, loaded into the speech processing apparatus of the speech encoding end 101 of the speech processing system shown in fig. 1 through the computer-readable storage medium, and run therein.
In the embodiment of the application, the fixed codebook excitation data is generated by the fixed codebook prediction model during speech coding, so the storage space occupied by fixed codebook data can be reduced in the process of encoding the voice data to be encoded, which helps improve the overall compression performance and speech quality of speech coding. Moreover, the voice encoding end does not need to send a fixed codebook index or fixed codebook excitation to the voice decoding end, so the channel bandwidth required for transmitting the encoded data corresponding to the voice data to be encoded can be reduced, improving transmission performance and the effect of speech coding.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a speech processing apparatus according to an exemplary embodiment of the present application, where the speech processing apparatus 110 is applied to a speech decoding end, and the speech processing apparatus 110 may be configured to execute corresponding steps in the speech processing methods shown in fig. 2 and fig. 7. Referring to fig. 11, the speech processing apparatus 110 includes the following units:
a receiving unit 1101, configured to receive encoded data corresponding to the voice data to be encoded sent by a voice encoding end, where the encoded data includes: linear prediction parameters corresponding to the voice data to be encoded, an index of a target code vector, and a gain corresponding to the target code vector;
a determining unit 1102, configured to determine adaptive codebook excitation data according to the index of the target codevector and the gain corresponding to the target codevector;
the determining unit 1102 is further configured to determine target prediction data corresponding to the to-be-encoded voice data, perform data analysis on the target prediction data through a fixed codebook prediction model, and determine fixed codebook excitation data corresponding to the to-be-encoded voice data;
a synthesizing unit 1103, configured to perform synthesis processing on the adaptive codebook excitation data and the fixed codebook excitation data according to the linear prediction parameter, so as to obtain decoded data corresponding to the to-be-encoded speech data.
In an implementation manner, when determining target prediction data corresponding to speech data to be encoded, the determining unit 1102 is specifically configured to:
and if the voice data to be coded is the voice data of the initial frame, determining the target value as target prediction data corresponding to the voice data to be coded.
In one implementation, the target prediction data corresponding to the speech data to be encoded includes one or more of the following:
the method comprises the steps of linear prediction parameters, decoding data corresponding to the previous frame of voice data of the voice data to be coded, and adaptive codebook excitation data obtained by decoding the previous frame of voice data.
In one implementation, when the determining unit performs data analysis on the target prediction data through the fixed codebook prediction model to determine fixed codebook excitation data corresponding to the speech data to be encoded, the determining unit is specifically configured to:
performing first data analysis on target prediction data through a fixed codebook prediction model to obtain first fixed codebook excitation data corresponding to-be-coded voice data, wherein the first fixed codebook excitation data are partial data in the fixed codebook excitation data;
performing second data analysis on the target prediction data and the first fixed codebook excitation data through a fixed codebook prediction model to obtain second fixed codebook excitation data corresponding to the voice data to be coded;
and if the first fixed codebook excitation data and the second fixed codebook excitation data meet the target condition, determining fixed codebook excitation data corresponding to the voice data to be coded according to the first fixed codebook excitation data and the second fixed codebook excitation data.
In one implementation, the fixed codebook prediction model includes a spectral feature extraction module and an excitation generation module; the determining unit 1102 is configured to perform data analysis on the target prediction data through the fixed codebook prediction model, and when determining fixed codebook excitation data corresponding to the to-be-encoded speech data, specifically:
acquiring linear prediction parameters, decoding data corresponding to the previous frame of voice data of the data to be coded and adaptive codebook excitation data obtained by decoding the previous frame of voice data from target prediction data corresponding to the voice data to be coded;
extracting the spectral characteristics of the voice data to be coded according to the linear prediction parameters through a spectral characteristic extraction module;
and generating fixed codebook excitation data corresponding to the voice data to be coded by an excitation generation module according to the frequency spectrum characteristics, the decoded data corresponding to the last frame of voice data of the voice data to be coded and the self-adaptive codebook excitation data obtained by decoding the last frame of voice data.
According to an embodiment of the present application, the units in the speech processing apparatus 110 shown in fig. 11 may be combined, individually or together, into one or several other units, or some unit(s) may be further split into multiple functionally smaller units, which can achieve the same operation without affecting the technical effect of the embodiment of the application. The above units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the speech processing apparatus 110 may also include other units, and in practical applications these functions may be realized with the assistance of other units and through the cooperation of multiple units. According to another embodiment of the present application, the speech processing apparatus 110 shown in fig. 11 may be constructed, and the speech processing method of the embodiment of the application implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2 and fig. 7 on a general-purpose computing device, such as a computer, that includes a Central Processing Unit (CPU), a random access memory (RAM), a read-only memory (ROM), and other processing and storage elements. The computer program may be recorded on, for example, a computer-readable storage medium, loaded into the speech processing apparatus of the speech decoding end 102 of the speech processing system shown in fig. 1 through the computer-readable storage medium, and run therein.
In the embodiment of the application, the fixed codebook excitation data is generated by the fixed codebook prediction model during speech coding, so the storage space occupied by fixed codebook data can be reduced in the process of encoding the voice data to be encoded, which helps improve the overall compression performance and speech quality of speech coding. Moreover, the voice encoding end does not need to send a fixed codebook index or fixed codebook excitation to the voice decoding end, so the channel bandwidth required for transmitting the encoded data corresponding to the voice data to be encoded can be reduced, improving transmission performance and the effect of speech coding.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a speech processing device according to an exemplary embodiment of the present application, where the speech processing device 120 includes at least a processor 1201 and a computer-readable storage medium 1202, which may be connected by a bus or in other manners. The computer-readable storage medium 1202 is configured to store a computer program comprising computer instructions, and the processor 1201 is configured to execute the computer instructions stored in the computer-readable storage medium 1202. The processor 1201 (or CPU, Central Processing Unit) is the computing core and control core of the speech processing device 120; it is adapted to implement one or more computer instructions, and specifically to load and execute one or more computer instructions so as to realize the corresponding method flow or corresponding function.
An embodiment of the present application also provides a computer-readable storage medium (memory), which is a memory device in the speech processing device 120 and is used for storing programs and data. It is understood that the computer-readable storage medium 1202 here may include both a built-in storage medium of the speech processing device 120 and an extended storage medium supported by the speech processing device 120. The computer-readable storage medium provides storage space that stores the operating system of the speech processing device 120. One or more computer instructions, which may be one or more computer programs (including program code), are also stored in the storage space and are suitable for being loaded and executed by the processor 1201. The computer-readable storage medium 1202 here may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory; optionally, it may also be at least one computer-readable storage medium located remotely from the processor 1201.
The speech processing apparatus 120 may be the speech processing apparatus (speech encoding side) 101 in the speech processing system shown in fig. 1; the computer-readable storage medium 1202 has stored therein one or more computer instructions; one or more computer instructions stored in the computer-readable storage medium 1202 are loaded and executed by the processor 1201 to implement the corresponding steps in the above-described speech processing method embodiments; in particular implementations, one or more computer instructions in the computer-readable storage medium 1202 is loaded and executed by the processor 1201 to perform the steps of:
acquiring voice data to be coded, and performing linear prediction analysis on the voice data to be coded to obtain linear prediction parameters;
determining a target code vector in a target self-adaptive codebook, an index of the target code vector and a gain corresponding to the target code vector according to the voice data to be coded;
and sending the linear prediction parameters, the index of the target code vector and the gain corresponding to the target code vector as the coded data corresponding to the voice data to be coded to a voice decoding end.
In one implementation, one or more computer instructions in the computer-readable storage medium 1202 are loaded and executed by the processor 1201 to perform the steps of:
acquiring a target code vector of the previous frame of voice data of the voice data to be coded, a gain corresponding to the target code vector of the previous frame of voice data and fixed codebook excitation data corresponding to the previous frame of voice data;
and updating the historical self-adaptive codebook according to the target code vector of the previous frame of voice data of the voice data to be coded, the gain corresponding to the target code vector of the previous frame of voice data and the excitation data of the fixed codebook corresponding to the previous frame of voice data to obtain the target self-adaptive codebook.
In one implementation, one or more computer instructions in the computer-readable storage medium 1202 are loaded and executed by the processor 1201 to perform the steps of:
determining target prediction data of previous frame voice data of the voice data to be coded;
and performing data analysis on target prediction data of the previous frame of voice data through a fixed codebook prediction model, and determining fixed codebook excitation data corresponding to the previous frame of voice data.
In one implementation, one or more computer instructions in the computer-readable storage medium 1202 are loaded and executed by the processor 1201 to perform the steps of:
determining self-adaptive codebook excitation data of the previous frame of voice data according to a target code vector of the previous frame of voice data of the voice data to be coded and a gain corresponding to the target code vector of the previous frame of voice data;
and updating the historical adaptive codebook according to the sum of the adaptive codebook excitation data of the previous frame of voice data and the fixed codebook excitation data corresponding to the previous frame of voice data to obtain the target adaptive codebook.
In one implementation, the adaptive codebook includes a plurality of codevectors; one or more computer instructions in computer-readable storage medium 1202 are loaded by processor 1201 and perform the steps of:
acquiring a voice training sample set, wherein the voice training sample set comprises a plurality of voice training samples;
and performing iterative training on the initial fixed codebook prediction model according to the voice training sample set to obtain a fixed codebook prediction model, wherein the fixed codebook prediction model is used for determining fixed codebook excitation data corresponding to the input voice data.
In one implementation, one or more computer instructions in the computer-readable storage medium 1202 are loaded and executed by the processor 1201 to perform the steps of:
acquiring a target voice training sample from the voice training sample set, and performing linear prediction analysis on the target voice training sample to obtain a training linear prediction parameter of the target voice training sample, wherein the target voice training sample is any one of the voice training sample set;
acquiring decoded data corresponding to the last frame of voice data of the target voice training sample, a training target code vector of the last frame of voice data of the target voice training sample and gain corresponding to the training target code vector;
and performing iterative training on the initial fixed codebook prediction model through the training linear prediction parameters, the decoded data corresponding to the last frame of voice data of the target voice training sample, the training target code vector of the last frame of voice data of the target voice training sample and the gain corresponding to the training target code vector to obtain the fixed codebook prediction model.
In one implementation, one or more computer instructions in the computer-readable storage medium 1202 are loaded and executed by the processor 1201 to perform the steps of:
carrying out high-pass filtering on data to be coded to obtain the data to be coded after the high-pass filtering;
the method for performing linear prediction analysis on the voice data to be coded to obtain linear prediction parameters comprises the following steps:
and performing linear prediction analysis on the data to be coded after the high-pass filtering to obtain linear prediction parameters corresponding to the voice data to be coded.
In the embodiment of the application, the fixed codebook excitation data is generated by the fixed codebook prediction model during speech coding, so the storage space occupied by fixed codebook data can be reduced in the process of encoding the voice data to be encoded, which helps improve the overall compression performance and speech quality of speech coding. Moreover, the voice encoding end does not need to send a fixed codebook index or fixed codebook excitation to the voice decoding end, so the channel bandwidth required for transmitting the encoded data corresponding to the voice data to be encoded can be reduced, improving transmission performance and the effect of speech coding.
Referring to fig. 13, fig. 13 is a schematic structural diagram of a speech processing device according to an exemplary embodiment of the present application, where the speech processing device 130 includes at least a processor 1301 and a computer-readable storage medium 1302, which may be connected by a bus or in other manners. The computer-readable storage medium 1302 is configured to store a computer program comprising computer instructions, and the processor 1301 is configured to execute the computer instructions stored in the computer-readable storage medium 1302. The processor 1301 (or CPU, Central Processing Unit) is the computing core and control core of the speech processing device 130; it is adapted to implement one or more computer instructions, and specifically to load and execute one or more computer instructions so as to realize the corresponding method flow or corresponding function.
An embodiment of the present application further provides a computer-readable storage medium (memory), which is a memory device in the speech processing device 130 and is used for storing programs and data. It is understood that the computer-readable storage medium 1302 here may include both a built-in storage medium of the speech processing device 130 and an extended storage medium supported by the speech processing device 130. The computer-readable storage medium provides storage space that stores the operating system of the speech processing device 130. One or more computer instructions, which may be one or more computer programs (including program code), are also stored in the storage space and are suitable for being loaded and executed by the processor 1301. The computer-readable storage medium 1302 here may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory; optionally, it may also be at least one computer-readable storage medium located remotely from the processor 1301.
The speech processing device 130 may be the speech processing device (speech decoding side) 102 in the speech processing system shown in fig. 1; the computer-readable storage medium 1302 has stored therein one or more computer instructions; one or more computer instructions stored in the computer-readable storage medium 1302 are loaded and executed by the processor 1301 to implement the corresponding steps in the above-described voice processing method embodiments; in particular implementations, one or more computer instructions in the computer-readable storage medium 1302 is loaded and executed by the processor 1301 to perform the steps of:
receiving coded data corresponding to the voice data to be coded sent by a voice coding end, wherein the coded data comprises: linear prediction parameters corresponding to the voice data to be coded, an index of a target code vector, and a gain corresponding to the target code vector;
determining self-adaptive codebook excitation data according to the index of the target code vector and the gain corresponding to the target code vector;
determining target prediction data corresponding to the voice data to be coded, and performing data analysis on the target prediction data through a fixed codebook prediction model to determine fixed codebook excitation data corresponding to the voice data to be coded;
and synthesizing the adaptive codebook excitation data and the fixed codebook excitation data according to the linear prediction parameters to obtain decoding data corresponding to the voice data to be coded.
In one implementation, one or more computer instructions in the computer-readable storage medium 1302 are loaded and executed by the processor 1301 to perform the steps of:
and if the voice data to be coded is the voice data of the initial frame, determining the target value as target prediction data corresponding to the voice data to be coded.
In one implementation, the target prediction data corresponding to the speech data to be encoded includes one or more of the following:
the method comprises the steps of linear prediction parameters, decoding data corresponding to the previous frame of voice data of data to be coded, and adaptive codebook excitation data obtained by decoding the previous frame of voice data.
In one implementation, one or more computer instructions in the computer-readable storage medium 1302 are loaded and executed by the processor 1301 to perform the steps of:
performing first data analysis on target prediction data through a fixed codebook prediction model to obtain first fixed codebook excitation data corresponding to-be-coded voice data, wherein the first fixed codebook excitation data are partial data in the fixed codebook excitation data;
performing second data analysis on the target prediction data and the first fixed codebook excitation data through a fixed codebook prediction model to obtain second fixed codebook excitation data corresponding to the voice data to be coded;
and if the first fixed codebook excitation data and the second fixed codebook excitation data meet the target condition, determining fixed codebook excitation data corresponding to the voice data to be coded according to the first fixed codebook excitation data and the second fixed codebook excitation data.
In the embodiment of the application, the fixed codebook excitation data is generated by the fixed codebook prediction model during speech coding, so the storage space occupied by fixed codebook data can be reduced in the process of encoding the voice data to be encoded, which helps improve the overall compression performance and speech quality of speech coding. Moreover, the voice encoding end does not need to send a fixed codebook index or fixed codebook excitation to the voice decoding end, so the channel bandwidth required for transmitting the encoded data corresponding to the voice data to be encoded can be reduced, improving transmission performance and the effect of speech coding.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the voice processing method provided in the above-described various alternatives.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method for processing speech, the method being applied to a speech coder, the method comprising:
acquiring voice data to be coded, and performing linear prediction analysis on the voice data to be coded to obtain linear prediction parameters;
determining a target code vector in a target self-adaptive codebook, an index of the target code vector and a gain corresponding to the target code vector according to the voice data to be coded;
and sending the linear prediction parameters, the index of the target code vector and the gain corresponding to the target code vector as the coded data corresponding to the voice data to be coded to a voice decoding end.
2. The method of claim 1, wherein prior to determining a target codevector in a target adaptive codebook from the speech data to be encoded, the method further comprises:
acquiring a target code vector of the previous frame of voice data of the voice data to be coded, a gain corresponding to the target code vector of the previous frame of voice data and fixed codebook excitation data corresponding to the previous frame of voice data;
and updating a historical adaptive codebook according to a target code vector of the previous frame of voice data of the voice data to be coded, the gain corresponding to the target code vector of the previous frame of voice data and the excitation data of the fixed codebook corresponding to the previous frame of voice data to obtain the target adaptive codebook.
3. The method of claim 2, wherein said obtaining fixed codebook excitation data corresponding to the previous frame of speech data comprises:
determining target prediction data of the previous frame of voice data of the voice data to be coded;
and performing data analysis on target prediction data of the previous frame of voice data through a fixed codebook prediction model, and determining fixed codebook excitation data corresponding to the previous frame of voice data.
4. The method of claim 3, wherein the updating a historical adaptive codebook according to a target codevector of a previous frame of speech data of the speech data to be encoded, a gain corresponding to the target codevector of the previous frame of speech data, and fixed codebook excitation data corresponding to the previous frame of speech data to obtain the target adaptive codebook comprises:
determining adaptive codebook excitation data of the previous frame of voice data according to a target code vector of the previous frame of voice data of the voice data to be coded and a gain corresponding to the target code vector of the previous frame of voice data;
and updating a historical adaptive codebook according to the sum of the adaptive codebook excitation data of the previous frame of voice data and the fixed codebook excitation data corresponding to the previous frame of voice data to obtain the target adaptive codebook.
5. The method of claim 4, further comprising:
acquiring a voice training sample set, wherein the voice training sample set comprises a plurality of voice training samples;
and performing iterative training on the initial fixed codebook prediction model according to the voice training sample set to obtain a fixed codebook prediction model, wherein the fixed codebook prediction model is used for determining fixed codebook excitation data corresponding to the input voice data.
6. The method of claim 5, wherein iteratively training an initial fixed codebook prediction model according to the speech training sample set to obtain a fixed codebook prediction model comprises:
acquiring a target voice training sample from the voice training sample set, and performing linear prediction analysis on the target voice training sample to obtain a training linear prediction parameter of the target voice training sample, wherein the target voice training sample is any one of the voice training sample set;
acquiring decoded data corresponding to the last frame of voice data of the target voice training sample, a training target code vector of the last frame of voice data of the target voice training sample and a gain corresponding to the training target code vector;
and performing iterative training on the initial fixed codebook prediction model through the training linear prediction parameters, the decoded data corresponding to the last frame of voice data of the target voice training sample, the training target code vector of the last frame of voice data of the target voice training sample and the gain corresponding to the training target code vector to obtain a fixed codebook prediction model.
7. The method according to any one of claims 1-6, further comprising:
carrying out high-pass filtering on the voice data to be coded to obtain high-pass filtered voice data to be coded;
wherein, the performing linear prediction analysis on the voice data to be coded to obtain linear prediction parameters includes:
and performing linear prediction analysis on the high-pass filtered voice data to be coded to obtain linear prediction parameters corresponding to the voice data to be coded.
8. A speech processing method, applied to a speech decoding end, the method comprising:
receiving coded data corresponding to the voice data to be coded sent by a voice coding end, wherein the coded data comprises: linear prediction parameters corresponding to the voice data to be coded, an index of a target code vector, and a gain corresponding to the target code vector;
determining self-adaptive codebook excitation data according to the index of the target code vector and the gain corresponding to the target code vector;
determining target prediction data corresponding to the voice data to be coded, and performing data analysis on the target prediction data through a fixed codebook prediction model to determine fixed codebook excitation data corresponding to the voice data to be coded;
and synthesizing the adaptive codebook excitation data and the fixed codebook excitation data according to the linear prediction parameters to obtain decoding data corresponding to the voice data to be coded.
9. The method according to claim 8, wherein the determining the target prediction data corresponding to the speech data to be encoded comprises:
and if the voice data to be coded is the voice data of the initial frame, determining the target value as the target prediction data corresponding to the voice data to be coded.
10. The method according to claim 8, wherein the target prediction data corresponding to the speech data to be encoded comprises one or more of:
the linear prediction parameters, the decoded data corresponding to the previous frame of voice data of the voice data to be coded, and the adaptive codebook excitation data obtained by decoding the previous frame of voice data.
11. The method of claim 8, wherein the performing data analysis on the target prediction data through a fixed codebook prediction model to determine fixed codebook excitation data corresponding to the speech data to be encoded comprises:
performing first data analysis on the target prediction data through the fixed codebook prediction model to obtain first fixed codebook excitation data corresponding to the voice data to be coded, wherein the first fixed codebook excitation data are partial data in the fixed codebook excitation data;
performing second data analysis on the target prediction data and the first fixed codebook excitation data through the fixed codebook prediction model to obtain second fixed codebook excitation data corresponding to the voice data to be coded;
and if the first fixed codebook excitation data and the second fixed codebook excitation data meet a target condition, determining fixed codebook excitation data corresponding to the voice data to be coded according to the first fixed codebook excitation data and the second fixed codebook excitation data.
12. The method of claim 10, wherein the fixed codebook prediction model comprises a spectral feature extraction module and an excitation generation module; the data analysis of the target prediction data through a fixed codebook prediction model to determine fixed codebook excitation data corresponding to the to-be-coded voice data includes:
acquiring the linear prediction parameter, decoding data corresponding to a previous frame of voice data of the voice data to be coded and adaptive codebook excitation data obtained by decoding the previous frame of voice data from target prediction data corresponding to the voice data to be coded;
extracting the spectral feature of the voice data to be coded according to the linear prediction parameter through the spectral feature extraction module;
and generating fixed codebook excitation data corresponding to the voice data to be coded by the excitation generating module according to the frequency spectrum characteristics, decoded data corresponding to the last frame of voice data of the voice data to be coded and adaptive codebook excitation data obtained by decoding the last frame of voice data.
13. A speech processing apparatus, applied to a speech encoding side, comprising:
the device comprises an acquisition unit, a coding unit and a decoding unit, wherein the acquisition unit is used for acquiring voice data to be coded and carrying out linear prediction analysis on the voice data to be coded to obtain linear prediction parameters;
the determining unit is used for determining a target code vector in a target self-adaptive codebook, an index of the target code vector and a gain corresponding to the target code vector according to the voice data to be coded;
and the sending unit is used for sending the linear prediction parameters, the index of the target code vector and the gain corresponding to the target code vector as the coded data corresponding to the voice data to be coded to a voice decoding end.
14. A speech processing apparatus, applied to a speech decoding side, comprising:
a receiving unit, configured to receive coded data corresponding to the voice data to be coded sent by a voice coding end, wherein the coded data comprises: linear prediction parameters corresponding to the voice data to be coded, an index of a target code vector, and a gain corresponding to the target code vector;
a determining unit, configured to determine adaptive codebook excitation data according to an index of the target codevector and a gain corresponding to the target codevector;
the determining unit is further configured to determine target prediction data corresponding to the to-be-encoded voice data, perform data analysis on the target prediction data through a fixed codebook prediction model, and determine fixed codebook excitation data corresponding to the to-be-encoded voice data;
and the synthesis unit is used for synthesizing the adaptive codebook excitation data and the fixed codebook excitation data according to the linear prediction parameters to obtain decoding data corresponding to the voice data to be coded.
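On the decoding side of claim 14, the synthesis step is conventional CELP synthesis: the adaptive and fixed codebook excitations are summed and passed through the LPC synthesis filter 1/A(z). The sketch below additionally assumes the transmitted index is a pitch lag into the excitation history; the claims themselves only call it the index of the target code vector:

    import numpy as np
    from scipy.signal import lfilter

    def adaptive_excitation_from_index(past_excitation, index, gain, frame_len):
        # Assumption: the index is read as a pitch lag into past excitation.
        if index >= frame_len:
            seg = past_excitation[-index:len(past_excitation) - index + frame_len]
        else:
            seg = np.resize(past_excitation[-index:], frame_len)
        return gain * seg

    def synthesize_frame(lpc_params, adaptive_excitation, fixed_excitation):
        # The summed excitation drives the LPC synthesis filter 1/A(z).
        excitation = adaptive_excitation + fixed_excitation
        a = np.concatenate([[1.0], np.asarray(lpc_params, dtype=np.float64)])
        decoded = lfilter([1.0], a, excitation)
        return decoded, excitation  # keep excitation to extend the history buffer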
15. A computer-readable storage medium, having stored thereon one or more first instructions adapted to be loaded by a processor to perform the speech processing method according to any one of claims 1 to 12.
CN202110284182.XA 2021-03-17 2021-03-17 Voice processing method, device and equipment Active CN112669857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110284182.XA CN112669857B (en) 2021-03-17 2021-03-17 Voice processing method, device and equipment

Publications (2)

Publication Number Publication Date
CN112669857A (en) 2021-04-16
CN112669857B (en) 2021-05-18

Family

ID=75399589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110284182.XA Active CN112669857B (en) 2021-03-17 2021-03-17 Voice processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN112669857B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1488135A (en) * 2000-11-30 2004-04-07 Matsushita Electric Industrial Co., Ltd. Vector quantizing device for LPC parameters
US6996522B2 (en) * 2001-03-13 2006-02-07 Industrial Technology Research Institute CELP-based speech coding for fine grain scalability by altering sub-frame pitch-pulse
CN102726034A (en) * 2011-07-25 2012-10-10 华为技术有限公司 A device and method for controlling echo in parameter domain
CN103383846A (en) * 2006-12-26 2013-11-06 华为技术有限公司 Speech coding system to improve packet loss repairing quality

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102138320B1 (en) * 2011-10-28 2020-08-11 한국전자통신연구원 Apparatus and method for codec signal in a communication system

Similar Documents

Publication Publication Date Title
Zhen et al. Cascaded cross-module residual learning towards lightweight end-to-end speech coding
CN112767954A (en) Audio encoding and decoding method, device, medium and electronic equipment
CN102341844B (en) Encoding method, decoding method, encoding device, decoding device
CN113763973A (en) Audio signal enhancement method, audio signal enhancement device, computer equipment and storage medium
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
JP3628268B2 (en) Acoustic signal encoding method, decoding method and apparatus, program, and recording medium
CN115631275B (en) Multi-mode driven human body action sequence generation method and device
JP2024527536A (en) Compressing Audio Waveforms Using Neural Networks and Vector Quantizers
CN114550732A (en) Coding and decoding method and related device for high-frequency audio signal
JP3266372B2 (en) Audio information encoding method and apparatus
JP3590071B2 (en) Predictive partition matrix quantization of spectral parameters for efficient speech coding
JP2000155597A (en) Voice coding method to be used in digital voice encoder
CN117789680B (en) Method, device and storage medium for generating multimedia resources based on large model
CN117423348B (en) Speech compression method and system based on deep learning and vector prediction
CN112669857B (en) Voice processing method, device and equipment
JPH0944195A (en) Voice encoding device
Shin et al. Audio coding based on spectral recovery by convolutional neural network
WO2023175197A1 (en) Vocoder techniques
CN101609681A (en) Coding method, scrambler, coding/decoding method and demoder
Xue et al. Low-latency speech enhancement via speech token generation
CN114203151A (en) Method, device and equipment for training speech synthesis model
JP3916934B2 (en) Acoustic parameter encoding, decoding method, apparatus and program, acoustic signal encoding, decoding method, apparatus and program, acoustic signal transmitting apparatus, acoustic signal receiving apparatus
JPH0720897A (en) Method and apparatus for quantization of spectral parameter in digital coder
KR100341398B1 (en) Codebook searching method for CELP type vocoder
JPH11219196A (en) Speech synthesizing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40041970
Country of ref document: HK