CN111312264A - Voice transmission method, system, device, computer readable storage medium and equipment - Google Patents

Voice transmission method, system, device, computer readable storage medium and equipment

Info

Publication number
CN111312264A
CN111312264A (application CN202010104612.0A; granted as CN111312264B)
Authority
CN
China
Prior art keywords
voice
current
packet loss
coded data
data
Prior art date
Legal status
Granted
Application number
CN202010104612.0A
Other languages
Chinese (zh)
Other versions
CN111312264B (en)
Inventor
梁俊斌 (Liang Junbin)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010104612.0A
Publication of CN111312264A
Application granted
Publication of CN111312264B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04: Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L 19/08: Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/60: Speech or voice analysis techniques specially adapted for measuring the quality of voice signals

Abstract

The application relates to a voice transmission method, system, apparatus, computer-readable storage medium, and device. The method includes: acquiring the current encoded data in a speech-encoded code stream; obtaining, through a machine-learning-based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to a first speech coding characteristic parameter corresponding to the current encoded data and a second speech coding characteristic parameter corresponding to the previous encoded data; and judging, according to the packet loss recovery capability, whether redundancy multiple-sending processing is needed. If so, the current encoded data is transmitted to the receiving end after redundancy multiple-sending processing; if not, it is transmitted to the receiving end directly. The scheme provided by the application effectively improves the utilization of network bandwidth while preserving the packet loss resistance of the transmission network.

Description

Voice transmission method, system, device, computer readable storage medium and equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, a system, an apparatus, a computer-readable storage medium, and a computer device for voice transmission.
Background
The Internet is an unreliable transmission network, and the main challenge for Internet-based voice transmission is resisting packet loss: because the transmission network is unstable, data packets are lost in transit. To resist network packet loss, a redundancy multi-sending mechanism is usually adopted, in which multiple copies of each data packet are sent to the receiving end to increase the probability that the receiving end receives the packet, thereby achieving packet loss resistance.
However, the redundancy multi-sending mechanism multiplies bandwidth consumption and uses excessive network bandwidth resources; in particular, in bandwidth-limited scenarios it easily causes problems such as network congestion, which in turn may lead to even more packet loss.
Disclosure of Invention
Based on this, it is necessary to provide a voice transmission method, apparatus, system, computer-readable storage medium, and computer device to solve the technical problem that, in the prior art, redundant multi-sending of every data packet consumes excessive network bandwidth resources and causes network congestion, leading to more packet loss.
A method of voice transmission, comprising:
acquiring current coded data in a voice coded code stream;
obtaining packet loss recovery capability corresponding to current coded data according to a first voice coding characteristic parameter corresponding to the current coded data and a second voice coding characteristic parameter corresponding to previous coded data of the current coded data through a machine learning-based packet loss recovery capability prediction model;
judging whether redundancy multiple-sending processing is needed or not according to the packet loss recovery capability;
if so, performing redundancy multiple-sending processing on the current coded data and then transmitting the current coded data to a receiving end;
if not, the current coded data is directly transmitted to a receiving end.
A voice transmission system comprising a transmitting end and a receiving end, wherein:
the sending end is used for obtaining current encoding data in a voice encoding code stream, and obtaining packet loss recovery capability corresponding to the current encoding data according to a first voice encoding characteristic parameter corresponding to the current encoding data and a second voice encoding characteristic parameter corresponding to previous encoding data of the current encoding data through a machine learning-based packet loss recovery capability prediction model;
the sending end is also used for judging whether redundancy multiple sending processing is needed or not according to the packet loss recovery capability; if so, performing redundancy multiple-sending processing on the current coded data and then transmitting the current coded data to a receiving end; if not, directly transmitting the current coded data to a receiving end;
the receiving end is used for filtering out repeated data packets and then decoding the data packets to obtain a voice signal corresponding to the current coded data when receiving the current coded data or redundant multi-transmission packets corresponding to the current coded data;
the receiving end is further configured to, when the current encoded data and the redundant multi-transmission packet corresponding to the current encoded data are not received, perform packet loss recovery processing on the current encoded data to obtain a recovery packet corresponding to the current encoded data, and decode the recovery packet to obtain a voice signal corresponding to the current encoded data.
A voice transmission apparatus, the apparatus comprising:
the acquisition module is used for acquiring current encoded data in the voice encoded code stream;
the prediction module is used for obtaining packet loss recovery capability corresponding to current coded data according to a first voice coding characteristic parameter corresponding to the current coded data and a second voice coding characteristic parameter corresponding to previous coded data of the current coded data through a packet loss recovery capability prediction model based on machine learning;
the redundancy judging module is used for judging whether redundancy multiple-sending processing is required according to the packet loss recovery capability; if so, performing redundancy multiple-sending processing on the current coded data and then transmitting the current coded data to a receiving end; if not, transmitting the current coded data directly to the receiving end.
A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of the above-mentioned speech transmission method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the above-mentioned voice transmission method.
With the above voice transmission method, system, apparatus, computer-readable storage medium, and computer device, before the current encoded data is transmitted to the receiving end, a machine-learning-based packet loss recovery capability prediction model predicts the receiving end's packet loss recovery capability for the current encoded data according to the first speech coding characteristic parameter corresponding to the current encoded data and the second speech coding characteristic parameter corresponding to the previous encoded data. Whether to perform redundancy multiple-sending processing on the current encoded data is then judged according to this packet loss recovery capability: if redundancy is needed, the necessary network bandwidth is consumed for redundancy multiple-sending processing; otherwise, the current encoded data is transmitted to the receiving end directly, avoiding the consumption of excessive network bandwidth resources. Overall, the utilization of network bandwidth is effectively improved while the packet loss resistance of the transmission network is guaranteed.
Drawings
FIG. 1 is a diagram of an application environment of a voice transmission method in one embodiment;
FIG. 2 is a diagram of an application environment of a voice transmission method in another embodiment;
FIG. 3 is a flow diagram of a method for voice transmission in one embodiment;
FIG. 4 is a schematic block diagram of voice transmission using a redundancy multiple-sending mechanism in one embodiment;
FIG. 5 is a flowchart illustrating a training procedure of a packet loss recovery capability prediction model in an embodiment;
FIG. 6 is a block diagram illustrating the training of a packet loss recovery capability prediction model in an embodiment;
FIG. 7 is a block flow diagram of a method of voice transmission in one embodiment;
FIG. 8 is a flow diagram of a method for voice transmission in an exemplary embodiment;
FIG. 9 is a block diagram showing the construction of a voice transmission apparatus according to an embodiment;
FIG. 10 is a block diagram showing a configuration of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is a diagram of an application environment of a voice transmission method according to an embodiment. Referring to fig. 1, the voice transmission method is applied to a voice transmission system. The voice transmission system includes a transmitting end 110 and a receiving end 120. The transmitting end 110 and the receiving end 120 are connected through a network. The sending end 110 and the receiving end 120 may both be terminals, and the terminals may specifically be desktop terminals or mobile terminals, and the mobile terminals may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. In other embodiments, the sender 110 and the receiver 120 may also be a server or a server cluster.
As shown in fig. 2, in a specific application scenario, an application program supporting a voice transmission function is run on both the sending end 110 and the receiving end 120, the server 130 can provide a computing capability and a storage capability for the application program, and both the sending end 110 and the receiving end 120 can be connected to the server 130 through a network, so that voice transmission at both ends is realized based on the server 130. The server 130 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
In one embodiment, the transmitting end 110 may obtain the current encoded data in a speech-encoded code stream; obtain, through a machine-learning-based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to the first speech coding characteristic parameter corresponding to the current encoded data and the second speech coding characteristic parameter corresponding to the previous encoded data; and judge, according to the packet loss recovery capability, whether redundancy multiple-sending processing is needed. If so, the current encoded data is transmitted to the receiving end 120 after redundancy multiple-sending processing; if not, it is transmitted to the receiving end 120 directly. In this way the utilization of network bandwidth can be effectively improved on the whole while the packet loss resistance of the transmission network is guaranteed.
In one embodiment, as shown in fig. 3, a method of voice transmission is provided. The embodiment is mainly illustrated by applying the method to the transmitting end 110 in fig. 1 or fig. 2. Referring to fig. 3, the voice transmission method specifically includes the following steps S302 to S308:
s302, obtaining the current coding data in the voice coding code stream.
The speech-encoded code stream is the original code stream obtained by speech-encoding a speech signal, and it comprises a group of encoded data to be transmitted. The encoded data may be an encoded data frame obtained by the transmitting end's speech encoder encoding the speech signal at a specific frame length; the transmitting end may transmit the encoded data frames in the code stream to the receiving end through the network. The encoded data may also be an encoded data packet assembled from several encoded data frames, in which case the transmitting end transmits the encoded data packets of the code stream to the receiving end through the network. For example, an encoder at the transmitting end acquires a 60 ms speech signal, divides it into 4 frames of 15 ms each, and encodes the 4 frames in sequence to obtain 4 encoded data frames; the transmitting end may transmit these frames to the receiving end one by one, or may assemble the 4 encoded data frames into one encoded data packet and transmit that packet through the network.
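By way of illustration only (the sampling rate, the stub encoder, and the packing format are assumptions, not specified by the application), the 60 ms / four-frame example above can be sketched as:

```python
# Hypothetical sketch: a 60 ms signal at an assumed 16 kHz rate is split
# into four 15 ms frames, each "encoded" by a stub encoder, and the four
# encoded frames are packed into a single encoded data packet.
SAMPLE_RATE = 16000                               # assumed sampling rate
FRAME_MS = 15                                     # frame length from the example
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000    # samples per frame

def encode_frame(samples):
    """Stand-in for a real speech encoder; returns opaque bytes."""
    return len(samples).to_bytes(2, "big")

def frame_and_encode(signal):
    frames = [signal[i:i + FRAME_SAMPLES]
              for i in range(0, len(signal), FRAME_SAMPLES)]
    return [encode_frame(f) for f in frames]

signal = [0] * (SAMPLE_RATE * 60 // 1000)         # 60 ms of silence
encoded_frames = frame_and_encode(signal)         # 4 encoded data frames
packet = b"".join(encoded_frames)                 # packed into one packet
```

The frames can then be sent individually, or `packet` can be sent as one encoded data packet, matching the two options described above.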
Generally, to counter packet loss in the transmission network, as shown in fig. 4, before the transmitting end transmits the speech-encoded code stream to the receiving end, a redundancy multi-sending mechanism is directly adopted: multiple copies of each encoded data in the code stream are made, arranged and combined in a certain order, and sent to the receiving end. The receiving end receives each encoded data and the corresponding redundant packets through the network, filters out the duplicate encoded data, and decodes the filtered code stream to obtain the speech signal. In the embodiments provided by this application, after the transmitting end encodes the original speech to obtain the speech-encoded code stream, and before each encoded data in the code stream is sent to the receiving end, the transmitting end predicts in sequence the receiving end's packet loss recovery capability for each encoded data. The transmitting end therefore obtains the encoded data in the code stream in sequence, and the current encoded data is the encoded data currently to be transmitted to the receiving end.
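A minimal sketch of the plain redundancy mechanism of fig. 4, under assumed names and a fixed copy count: the sender duplicates every packet, and the receiver filters the duplicates by sequence number.

```python
# Illustrative only: unconditional redundancy multiplies the data on the
# wire by REDUNDANCY, which is exactly the bandwidth cost the application
# seeks to avoid.
REDUNDANCY = 3   # assumed copy count

def redundant_send(stream):
    """Yield (seq, payload) tuples, each packet repeated REDUNDANCY times."""
    for seq, payload in enumerate(stream):
        for _ in range(REDUNDANCY):
            yield (seq, payload)

def receive(packets):
    """Keep the first copy of each sequence number, filter the rest."""
    seen, out = set(), []
    for seq, payload in packets:
        if seq not in seen:
            seen.add(seq)
            out.append(payload)
    return out

stream = [b"f0", b"f1", b"f2"]
wire = list(redundant_send(stream))   # 9 packets for 3 frames
recovered = receive(wire)             # duplicates filtered out
```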
It can be understood that "current encoded data" in this application describes the encoded data currently being processed by the transmitting end, and "previous encoded data" describes encoded data that precedes the current encoded data in the speech-encoded code stream. The previous encoded data may be the single encoded data immediately before the current encoded data, or several pieces of encoded data before it, for example the two encoded data immediately preceding it. After the transmitting end finishes processing the current encoded data F(i), the next encoded data F(i+1) in the code stream becomes the new current encoded data, and F(i) becomes the previous encoded data of the new current encoded data F(i+1).
In one embodiment, the voice transmission method further includes: acquiring an original voice signal; segmenting an original voice signal to obtain an original voice sequence; and sequentially carrying out voice coding on the voice segments in the original voice sequence to obtain a voice coding code stream.
For example, an original voice signal acquired by a sending end is a voice of 2 seconds, the voice signal is divided by taking 20 milliseconds as a unit to obtain an original voice sequence consisting of 100 voice segments, and then each voice segment in the original voice sequence is subjected to voice coding in sequence to obtain coded data corresponding to each voice segment, so as to generate a voice coding code stream corresponding to the original voice signal.
In one embodiment, the voice transmission method further includes: acquiring voice coding characteristic parameters corresponding to voice segments in an original voice sequence; carrying out voice coding on the corresponding voice segments according to the voice coding characteristic parameters, and obtaining voice coding code streams after generating corresponding coded data; and caching the voice coding characteristic parameters adopted by each coded data in the voice coding process.
Specifically, in the speech encoding process, the transmitting end extracts the speech coding characteristic parameters of the speech segments in the original speech sequence, encodes the extracted parameters, and generates the encoded data corresponding to each speech segment. For example, the encoder of the transmitting end extracts the speech coding characteristic parameters of a speech segment through speech signal processing models (such as a filter or a feature extractor), then encodes the parameters (for example with entropy coding) and packs them in a certain data format to obtain the corresponding encoded data. It should be noted that the transmitting end may generate the current encoded data for the current speech segment jointly from the speech coding characteristic parameters of the current speech segment and those of the previous speech segment, or jointly from the parameters of the current speech segment and those of the subsequent speech segment. The speech coding characteristic parameters may be parameters extracted by signal processing of a speech segment, such as the line spectral frequencies (LSF), the pitch period, the adaptive codebook gain, and the fixed codebook gain.
Further, when the sending end generates the coded data corresponding to each voice segment, the voice coding characteristic parameters of each voice segment in the coding process, that is, the voice coding characteristic parameters adopted when generating each coded data, are cached and used for predicting packet loss recovery capability corresponding to each coded data based on the cached voice coding characteristic parameters.
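The caching of per-frame parameters can be sketched as follows; the stand-in encoder and all parameter values are invented for illustration.

```python
# Illustrative sketch: a stub encoder caches the speech coding
# characteristic parameters (LSF, pitch period, gains) used for each
# frame, indexed by frame number, so the predictor can later look up the
# current and previous frames' parameters.
feature_cache = {}

def encode_segment(index, segment):
    params = {
        "lsf": [0.1 * index] * 10,   # placeholder line spectral frequencies
        "pitch": 100 + index,        # placeholder pitch period
        "adaptive_gain": 0.5,
        "fixed_gain": 0.3,
    }
    feature_cache[index] = params    # cache the parameters of this frame
    return b"encoded"                # stand-in for the real bitstream

for i, seg in enumerate([b"s0", b"s1", b"s2"]):
    encode_segment(i, seg)

# the predictor can now fetch the current and previous frames' parameters
current, previous = feature_cache[2], feature_cache[1]
```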
S304, obtaining the packet loss recovery capability corresponding to the current coded data according to the first voice coding characteristic parameter corresponding to the current coded data and the second voice coding characteristic parameter corresponding to the previous coded data of the current coded data through a machine learning-based packet loss recovery capability prediction model.
The packet loss recovery capability is a prediction result that reflects the speech quality of the recovery packet that the receiving end would obtain by performing packet loss recovery processing after the current encoded data is lost. The prediction result indicates whether or not the receiving end can recover the lost current encoded data well. The packet loss recovery processing here is PLC (Packet Loss Concealment), and the packet loss recovery capability is the recovery capability of the PLC.
When there is an abrupt change in the speech coding characteristic parameters of the encoded data, the receiving end's packet loss recovery capability is limited; for example, when adjacent or nearby encoded data exhibit fundamental-frequency (pitch) jumps, LSF mutations, and the like, the receiving end cannot conceal a loss well. In such cases, enabling the redundancy multiple-sending mechanism at the transmitting end effectively improves resistance to packet loss and safeguards the speech quality at the receiving end. When the speech coding characteristic parameters of adjacent encoded data fluctuate only mildly, the receiving end generally has good packet loss recovery capability, and the transmitting end does not need to enable the redundancy multiple-sending mechanism. The packet loss recovery capability corresponding to the current encoded data is therefore linked to its speech coding characteristic parameters, and a machine learning model, after being trained on a large number of training samples, can learn to predict the packet loss recovery capability of a data packet from these parameters.
Specifically, the sending end may obtain a first speech coding characteristic parameter corresponding to the cached current encoded data and a second speech coding characteristic parameter corresponding to the previous encoded data, and predict the packet loss recovery capability corresponding to the current encoded data according to the first speech coding characteristic parameter and the second speech coding characteristic parameter through a packet loss recovery capability prediction model trained in advance.
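As an illustrative sketch only (the features, weights, and sigmoid scoring are invented, not the application's trained model), the prediction step can look like this: the cached parameters of the current and previous frames are combined into one feature vector and scored, with a higher score meaning the loss is easier to conceal.

```python
import math

# Hypothetical hand-set linear layer standing in for the trained
# packet loss recovery capability prediction model.
WEIGHTS = [0.2, -0.1, 0.05, 0.05]
BIAS = 0.0

def predict_recovery(first_params, second_params):
    """Score in (0, 1): higher means the PLC can conceal a loss better."""
    features = [
        # pitch jump between the two frames (abrupt jumps limit PLC)
        abs(first_params["pitch"] - second_params["pitch"]) / 100.0,
        # first LSF coefficient change between the frames
        first_params["lsf"][0] - second_params["lsf"][0],
        first_params["adaptive_gain"],
        first_params["fixed_gain"],
    ]
    z = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))   # logistic squashing

cur = {"pitch": 105, "lsf": [0.20], "adaptive_gain": 0.5, "fixed_gain": 0.3}
prev = {"pitch": 100, "lsf": [0.19], "adaptive_gain": 0.5, "fixed_gain": 0.3}
score = predict_recovery(cur, prev)
```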
In other embodiments, the transmitting end may obtain, through the packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to the first speech coding characteristic parameter corresponding to the current encoded data and a third speech coding characteristic parameter corresponding to subsequent encoded data of the current encoded data; or according to the second and/or third speech coding characteristic parameters alone. The subsequent encoded data describes encoded data following the current encoded data in the speech-encoded code stream; it may be the single encoded data immediately after the current encoded data, or several pieces of encoded data after it, for example the two encoded data immediately following it.
It can be understood that the voice coding characteristic parameters corresponding to which coded data are used by the sending end as the input of the prediction model of the packet loss recovery capability depend on the algorithm rules adopted by the sending end when performing voice coding or the algorithm rules adopted by the receiving end when performing voice decoding, and the coding and decoding rules are mutually corresponding. For example, if a sending end needs to predict packet loss resilience corresponding to current encoded data according to a speech coding characteristic parameter corresponding to previous encoded data when generating the current encoded data, the speech coding characteristic parameter adopted by the previous encoded data needs to be used as input of a prediction model of the packet loss resilience; if the sending end needs to predict the packet loss recovery capability corresponding to the current encoded data according to the speech coding feature parameters adopted by the next encoded data when generating the current encoded data, the speech coding feature parameters adopted by the next encoded data need to be used as the input of the prediction model of the packet loss recovery capability.
The packet loss recovery capability prediction model is a computer model based on machine learning and can be realized by adopting a neural network model. The machine learning model can learn through the samples, and therefore has specific capabilities. In this embodiment, the packet loss recovery capability prediction model is a model with a capability of predicting packet loss recovery, which is obtained by training in advance.
In an embodiment, the transmitting end may set the model structure of the machine learning model in advance to obtain an initial machine learning model, and train it through a large number of sample speech signals and packet loss simulation tests to obtain the model parameters. When voice needs to be transmitted over the network, the transmitting end can fetch the pre-trained model parameters, import them into the initial machine learning model to obtain the packet loss recovery capability prediction model, predict through this model the packet loss recovery capability corresponding to each encoded data in the speech-encoded code stream, and determine according to the predicted capability whether to enable the redundancy multiple-sending mechanism for the current encoded data.
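The resulting per-frame decision loop might be sketched as follows; the threshold, copy count, and helper signatures are illustrative assumptions, not taken from the application.

```python
# Hypothetical sketch: redundancy multiple-sending is enabled only for
# frames whose predicted recovery capability falls below a threshold,
# saving bandwidth on frames the receiver's PLC can conceal by itself.
THRESHOLD = 0.5    # assumed decision threshold
COPIES = 3         # assumed redundancy factor

def transmit(stream, predict, send):
    """stream: iterable of (params, payload); predict(cur, prev) -> score."""
    prev_params = None
    for params, payload in stream:
        if prev_params is None:
            score = 1.0                      # no context yet: assume recoverable
        else:
            score = predict(params, prev_params)
        copies = COPIES if score < THRESHOLD else 1
        for _ in range(copies):
            send(payload)
        prev_params = params

sent = []
stream = [({"p": 1}, b"f0"), ({"p": 2}, b"f1")]
# constant low score: the second frame is judged hard to recover
transmit(stream, lambda cur, prev: 0.2, sent.append)
```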
Fig. 5 is a schematic flowchart illustrating the training procedure of the packet loss recovery capability prediction model in an embodiment. It should be noted that the training steps may be executed by any computer device to obtain the trained packet loss recovery capability prediction model, which is then imported into a transmitting end that needs to perform voice transmission; the computer device may also be the transmitting end in fig. 1 or fig. 2, i.e., the training steps may be performed directly by the transmitting end to obtain the trained model. The following training steps are described with a computer device as the executing subject, and specifically include:
and S502, acquiring a sample voice sequence in the training set.
Specifically, the computer device may acquire a large number of speech signals and segment the speech signals to obtain a large number of speech signal sequences composed of speech segments as sample speech sequences for training the machine learning model.
S504, carrying out voice coding on the sample voice sequence to obtain a sample voice coding code stream.
Specifically, for each sample voice sequence, the computer device extracts the voice coding characteristic parameters corresponding to each voice segment, generates the coded data corresponding to each voice segment according to the extracted voice coding characteristic parameters, and obtains the sample voice coding code stream corresponding to each sample voice sequence. The computer device may buffer speech coding characteristic parameters used for each coded data in the coding process.
S506, extracting a first voice coding characteristic parameter adopted by the current coding data in the sample voice coding code stream and a second voice coding characteristic parameter adopted by the previous coding data of the current coding data.
As mentioned above, the packet loss recovery capability corresponding to encoded data is associated with the speech coding characteristic parameters of that encoded data, and may also be associated with the parameters of the previous and/or subsequent encoded data, so the computer device may use the speech coding characteristic parameters as the input of the machine learning model during training. In this embodiment, the computer device extracts the first speech coding characteristic parameter corresponding to the currently processed encoded data and the second speech coding characteristic parameter corresponding to its previous encoded data as the input of the machine learning model. As mentioned before, the previous encoded data may be the single encoded data immediately before the current encoded data, or several pieces of encoded data before it.
It should be noted that each training object is one encoded data, and each sample speech-encoded code stream contains multiple encoded data, so each sample code stream can be used for multiple rounds of training. For example, during training the computer device may extract the speech coding characteristic parameters corresponding to the i-th encoded data and to the (i-1)-th encoded data in the sample code stream S, and may also extract those corresponding to the (i+1)-th encoded data and to the i-th encoded data.
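A minimal sketch of this sliding extraction, with made-up parameter dictionaries standing in for the cached speech coding characteristic parameters:

```python
# Illustrative sketch: each adjacent (previous, current) pair of cached
# parameter vectors in one sample code stream becomes one training input,
# so a stream of n frames yields n - 1 training samples.
def training_inputs(param_stream):
    """Return (second_params, first_params) pairs for frames 1..n-1."""
    return [(param_stream[i - 1], param_stream[i])
            for i in range(1, len(param_stream))]

params = [{"pitch": 100}, {"pitch": 101}, {"pitch": 140}]
pairs = training_inputs(params)   # 2 samples from a 3-frame stream
```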
S508, directly decoding the sample voice coding code stream to obtain a first voice signal, and then obtaining a first voice quality score determined based on the first voice signal.
In order to obtain the target output of the machine learning model in the training process, the computer device needs to execute steps S508 to S512. The computer device may directly decode the sample voice coding code stream obtained after encoding to obtain a first voice signal, and then measure a first voice quality score corresponding to the first voice signal with a voice quality testing tool. Because the first voice signal is obtained by directly decoding the sample voice coding code stream, no encoded data is lost; the resulting first voice signal is therefore very close to the original sample voice sequence and may be called a lossless voice signal, and the corresponding first voice quality score may be called a lossless voice quality score.
In one embodiment, the voice quality testing tool may be PESQ (Perceptual Evaluation of Speech Quality), which objectively evaluates the quality of a voice signal according to metrics that correlate closely with human perception of voice quality, thereby providing a fully quantifiable voice quality measure. The first voice quality score obtained in this way may be denoted MOS_UNLOSS.
S510, obtaining a recovery packet obtained by performing simulated packet loss recovery processing on the current coded data, decoding the recovery packet to obtain a second voice signal, and then determining a second voice quality score based on the second voice signal.
Then, the computer device may treat the current encoded data as a lost data packet, simulate the packet loss recovery processing that a decoder at the receiving end would perform on it to obtain a corresponding recovery packet, decode the recovery packet to obtain a corresponding second voice signal, splice the second voice signal with the other voice segments of the original sample voice sequence, and then perform voice quality scoring to obtain a second voice quality score. Because the second voice signal is decoded from a recovery packet produced under simulated packet loss, and the recovery packet is lossy with respect to the lost current encoded data, the second voice signal is likewise lossy with respect to the voice segment corresponding to the current encoded data. The second voice signal may therefore be referred to as a lossy voice signal, and the determined second voice quality score as a lossy voice quality score, denoted MOS_LOSS.
S512, determining the real packet loss recovery capability corresponding to the current coded data according to the score difference between the first voice quality score and the second voice quality score.
Specifically, the real packet loss recovery capability corresponding to the current encoded data may be measured by the score difference between the first voice quality score and the second voice quality score; that is, MOS_UNLOSS - MOS_LOSS may be used as the real packet loss recovery capability corresponding to the current encoded data, i.e., the target output of the machine learning model. The real packet loss recovery capability is inversely related to this score difference: the smaller the difference, the better the voice quality of the recovery packet obtained by simulated packet loss recovery after the current encoded data is lost, and the stronger the real packet loss recovery capability; conversely, the larger the difference, the worse that voice quality and the weaker the real packet loss recovery capability.
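The target-output computation above can be sketched in a few lines; the function name is an assumption for illustration:

```python
def true_recovery_capability_target(mos_unloss, mos_loss):
    """Target output for one training sample: the quality-score
    difference between lossless decoding (MOS_UNLOSS) and decoding a
    recovery packet after simulated packet loss (MOS_LOSS). A smaller
    difference indicates stronger real packet loss recovery capability
    for the current encoded data."""
    return mos_unloss - mos_loss
```

For example, a packet whose simulated loss drops the score from 4.2 to 4.0 (difference 0.2) recovers better than one whose loss drops it to 3.1 (difference 1.1).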
S514, inputting the first voice coding characteristic parameter and the second voice coding characteristic parameter into a machine learning model, and outputting the predicted packet loss recovery capability corresponding to the current coding data through the machine learning model.
After obtaining the target output of the training process, the computer device may input the obtained first speech coding feature parameter and second speech coding feature parameter into the machine learning model, which outputs, through its internal network processing, the predicted packet loss recovery capability corresponding to the current encoded data. S514 may also be executed before step S508; the execution order of this step is not limited in this embodiment.
S516, after the model parameters of the machine learning model are adjusted according to the difference between the real packet loss recovery capability and the predicted packet loss recovery capability, returning to the step of obtaining the sample voice sequence in the training set to continue training until the training end condition is met.
Specifically, the computer device may construct a loss function from the obtained real packet loss recovery capability and the predicted packet loss recovery capability output by the machine learning model, take the model parameters obtained when the loss function is minimized as the latest model parameters of the machine learning model, and continue to the next round of training with a further sample voice sequence until the machine learning model converges or the number of training iterations reaches a preset number, thereby obtaining a trained packet loss recovery capability prediction model.
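As a minimal, self-contained sketch of this training loop, the following uses a linear model and squared-error loss as stand-ins for whatever network and loss function the actual machine learning model uses; all names are assumptions:

```python
def train_prediction_model(samples, lr=0.05, epochs=2000):
    """Gradient-descent sketch of the training loop (S502-S516).

    samples: (features, target) pairs, where features would concatenate
    the first and second speech coding feature parameters and target is
    the real score difference MOS_UNLOSS - MOS_LOSS.
    """
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y                     # derivative of squared loss, up to a factor 2
            for j in range(dim):
                w[j] -= lr * 2 * err * x[j]    # gradient step on the weights
            b -= lr * 2 * err                  # gradient step on the bias
    return w, b

def predict_score_difference(model, features):
    """Predicted score difference for one packet's feature vector."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, features)) + b
```

Each epoch here plays the role of one pass over the sample voice sequences; training stops after a preset number of iterations, matching the termination condition described above.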
Fig. 6 is a schematic diagram of a framework for training a machine learning model to obtain a packet loss recovery capability prediction model in one embodiment, showing the flow of a single training iteration. The computer device acquires a sample voice sequence and performs voice coding on it to obtain a sample voice coding code stream. First, the sample voice coding code stream is directly decoded under the condition that the current encoded data suffers no packet loss, and MOS_UNLOSS is obtained with PESQ; then packet loss recovery processing is performed under the simulated condition that the current encoded data is lost, and MOS_LOSS is obtained with PESQ. The speech coding feature parameters of the current encoded data and of the previous encoded data are taken as the input of the machine learning model to obtain the predicted packet loss recovery capability, and MOS_UNLOSS - MOS_LOSS is taken as the target output of the machine learning model, i.e., the real packet loss recovery capability. The model parameters of the machine learning model are then adjusted according to the predicted and real packet loss recovery capabilities, completing one training iteration.
In one embodiment, in step S304, obtaining, through a machine learning-based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to a first speech coding feature parameter corresponding to the current encoded data and a second speech coding feature parameter corresponding to the previous encoded data of the current encoded data includes: inputting the first speech coding feature parameter corresponding to the current encoded data and the second speech coding feature parameter corresponding to the previous encoded data into the packet loss recovery capability prediction model; outputting, through the packet loss recovery capability prediction model and according to the first and second speech coding feature parameters, a score difference between a first voice quality score determined by directly decoding the current encoded data and a second voice quality score determined by decoding the current encoded data after packet loss recovery processing; and determining the packet loss recovery capability corresponding to the current encoded data according to the score difference, the packet loss recovery capability being inversely related to the score difference.
In this embodiment, before the sending end sends the current encoded data in the voice coding code stream to the receiving end, it may predict the packet loss recovery capability corresponding to the current encoded data through the pre-trained packet loss recovery capability prediction model. Specifically, the first speech coding feature parameter corresponding to the current encoded data and the second speech coding feature parameter corresponding to the previous encoded data are taken as the input of the packet loss recovery capability prediction model. The output of the model is the score difference between a first voice quality score determined by directly decoding the current encoded data and a second voice quality score determined by decoding the current encoded data after packet loss recovery processing. This score difference reflects how well the receiving end could recover from the loss of the current encoded data, i.e., the magnitude of the packet loss recovery capability, and the two are inversely related. When the score difference is large, i.e., the packet loss recovery capability is smaller than a preset threshold, the quality of the voice signal the receiving end would obtain by packet loss recovery processing after losing the current encoded data is poor; conversely, when the score difference is small, i.e., the packet loss recovery capability is larger than the preset threshold, that quality is within an acceptable range.
S306, judging whether redundancy multiple processing is needed or not according to the packet loss recovery capability; if yes, executing step S308, and transmitting the current coded data to the receiving end after performing redundancy multiple-sending processing on the current coded data; if not, step S310 is executed to directly transmit the current encoded data to the receiving end.
Specifically, after the sending end obtains the packet loss recovery capability corresponding to the current encoded data through the packet loss recovery capability prediction model, whether redundancy multiple processing is performed on the current encoded data is judged according to the predicted packet loss recovery capability.
In an embodiment, the packet loss resilience output by the packet loss resilience prediction model is a value within a range of values, and the sending end may compare the packet loss resilience with a preset threshold, and determine whether to perform redundancy multiple processing on the current encoded data according to a comparison result.
Specifically, when the packet loss recovery capability is smaller than the preset threshold, the current encoded data is transmitted to the receiving end after redundancy multiple processing. A packet loss recovery capability below the threshold indicates that, if the current encoded data were lost, the quality of the voice signal obtained by packet loss recovery processing at the receiving end would be poor, so the packet loss problem of the transmission network needs to be countered by redundancy multiple sending; that is, the current encoded data needs to undergo redundancy multiple processing before being transmitted to the receiving end. When the packet loss recovery capability is larger than the preset threshold, the current encoded data is transmitted to the receiving end directly. A packet loss recovery capability above the threshold indicates that the quality of the voice signal obtained by packet loss recovery processing at the receiving end after the current encoded data is lost is within an acceptable range, so for this encoded data the sending end does not need redundancy multiple sending as an anti-packet-loss strategy; it can transmit the current encoded data directly, and if the data is lost, the packet loss recovery algorithm built into the decoder at the receiving end performs packet loss recovery processing on it.
In one embodiment, the packet loss recovery capability output by the packet loss recovery capability prediction model takes one of two values. When it is a first value, this indicates that, were the current encoded data lost, the quality of the voice signal obtained by packet loss recovery processing at the receiving end would be poor, so the sending end needs to transmit the current encoded data to the receiving end after redundancy multiple processing. When it is a second value, this indicates that the quality of the voice signal obtained by packet loss recovery processing after loss would be within an acceptable range, so the sending end may transmit the current encoded data to the receiving end directly, and if the data is lost, the packet loss recovery algorithm built into the decoder at the receiving end performs packet loss recovery processing on it. For example, the first value may be 1 and the second value 0; alternatively, the first value may be 0 and the second value 1.
In one embodiment, performing redundancy multiple processing on the current encoded data before transmitting it to the receiving end includes: acquiring packet loss state information fed back by the receiving end; determining a redundancy multiple parameter corresponding to the current encoded data according to the packet loss state information; and copying the current encoded data according to the redundancy multiple parameter and transmitting the copies to the receiving end.
Specifically, the receiving end may determine packet loss state information from the data packets it receives and feed this information back to the sending end. The packet loss state information may be represented by the current packet loss rate: the receiving end may encapsulate the packet loss rate in a feedback message and send it to the sending end, and the sending end parses the received message to obtain the packet loss rate. The redundancy multiple parameter may be a redundancy multiple, i.e., the number n of copies of the current encoded data to send, and is adjusted based on the packet loss state information fed back by the receiving end: n is larger at a high packet loss rate and smaller at a low one. For example, when the packet loss rate is 10%, the redundancy multiple n may be set to 1; when the packet loss rate is 20%, n may be set to 2.
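The threshold decision and the mapping from packet loss rate to redundancy multiple can be sketched as follows; the 10% → 1 and 20% → 2 points follow the example above, while the exact breakpoints and function names are assumptions:

```python
def redundancy_multiple(packet_loss_rate):
    """Map the packet loss rate fed back by the receiving end to a
    redundancy multiple n (number of extra copies of the packet)."""
    if packet_loss_rate >= 0.20:
        return 2
    if packet_loss_rate >= 0.10:
        return 1
    return 0

def packets_to_send(encoded_data, capability, threshold, packet_loss_rate):
    """One sender decision (S306): duplicate the packet when the
    predicted packet loss recovery capability is below the preset
    threshold, otherwise send a single copy."""
    if capability < threshold:
        return [encoded_data] * (1 + redundancy_multiple(packet_loss_rate))
    return [encoded_data]
```

A weak-capability packet at a 20% loss rate is thus sent three times in total, while a strong-capability packet is sent once regardless of the loss rate.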
In one embodiment, the voice transmission method further comprises: when the receiving end receives the current coding data or the redundant multi-transmission packet corresponding to the current coding data, the receiving end filters out repeated data packets and then decodes the data packets to obtain the voice signal corresponding to the current coding data.
Specifically, the receiving end may receive the current encoded data and the corresponding redundant multi-transmission packets over the network, filter out duplicates and reorder the packets by packet sequence number, and decode one copy of the encoded data to obtain the corresponding voice signal.
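The receiver-side filtering and reordering step can be sketched as follows (a minimal illustration; the tuple representation of received packets is an assumption):

```python
def dedupe_and_reorder(received):
    """Filter out duplicate redundant copies and restore packet order.

    received: (sequence_number, encoded_data) tuples as they arrive,
    possibly duplicated by redundancy multiple sending and out of order.
    Keeps the first copy seen for each sequence number.
    """
    by_seq = {}
    for seq, data in received:
        by_seq.setdefault(seq, data)   # drop later duplicates
    return [by_seq[seq] for seq in sorted(by_seq)]
```

The decoder then consumes the deduplicated, in-order packets exactly as if no redundancy had been used.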
In one embodiment, the voice transmission method further comprises:
when the receiving end does not receive the current encoding data and the redundancy multi-transmission packet corresponding to the current encoding data, the receiving end carries out packet loss recovery processing on the current encoding data to obtain a recovery packet corresponding to the current encoding data, and the recovery packet is decoded to obtain a voice signal corresponding to the current encoding data.
Specifically, while the receiving end decodes the encoded data one by one, if it receives the current encoded data or a corresponding redundant multi-transmission packet, it reconstructs the voice signal following the normal decoding flow. If it receives neither the current encoded data nor a corresponding redundant multi-transmission packet, or receives neither within a certain time, it determines that the current encoded data is lost, and may perform packet loss recovery processing on it through the PLC algorithm built into the decoder. A common approach is to approximate the current encoded data as a recovery packet by pitch-synchronous repetition based on the decoded information of the previous frame, and then decode the recovery packet to obtain the voice signal.
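The pitch-synchronous repetition mentioned above can be sketched on decoded samples as follows; real decoders additionally attenuate successive copies and smooth the joins, both omitted here, and the function name is an assumption:

```python
def plc_pitch_repeat(prev_frame, pitch_period, frame_len):
    """Approximate a lost frame by repeating the last pitch period of
    the previously decoded frame.

    prev_frame: decoded samples of the previous frame;
    pitch_period: estimated pitch period, in samples;
    frame_len: number of samples the lost frame should contain.
    """
    tail = prev_frame[-pitch_period:]   # last pitch period of the previous frame
    out = []
    while len(out) < frame_len:
        out.extend(tail)                # tile it across the lost frame
    return out[:frame_len]
```

Because the substitute frame continues the pitch cycle of the previous frame, short losses remain largely inaudible, which is exactly the property the prediction model tries to estimate per packet.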
According to the above voice transmission method, before the current encoded data is transmitted to the receiving end, the packet loss recovery capability of the receiving end for the current encoded data is predicted, through the machine learning-based packet loss recovery capability prediction model, from the first speech coding feature parameter corresponding to the current encoded data and the second speech coding feature parameter corresponding to the previous encoded data. Whether to perform redundancy multiple processing on the current encoded data is then judged according to the predicted packet loss recovery capability: if so, the necessary network bandwidth resources are spent on redundancy multiple processing; if not, the current encoded data is transmitted to the receiving end directly, avoiding the consumption of excess network bandwidth resources. This effectively improves the overall utilization of network bandwidth while still guaranteeing the anti-packet-loss capability of the transmission network.
Fig. 7 is a block diagram of a voice transmission method according to an embodiment. Referring to fig. 7, the sending end acquires an original voice signal and performs voice coding on it to obtain a voice coding code stream. The sending end then predicts, through the machine learning-based packet loss recovery capability prediction model, the packet loss recovery capability of the receiving end for each piece of encoded data in the voice coding code stream, and judges according to the prediction whether to enable the redundancy multiple mechanism for the current encoded data. If the redundancy multiple mechanism is enabled, the sending end sets the redundancy multiple parameter according to the packet loss state information fed back by the receiving end, makes the corresponding number of copies of the current encoded data, and transmits them to the receiving end. If the mechanism is not enabled, the current encoded data is transmitted to the receiving end directly.
If the receiving end receives the current encoded data, it reconstructs the voice signal following the normal decoding flow. If it receives neither the current encoded data nor a corresponding redundant multi-transmission packet, or receives neither within a certain time, it determines that the current encoded data is lost; the receiving end may then perform packet loss recovery processing on the current encoded data through the PLC algorithm built into the decoder and decode the result to obtain the voice signal.
Fig. 8 is a flowchart illustrating a voice transmission method according to an embodiment. Referring to fig. 8, the following steps are included:
s802, acquiring an original voice signal.
S804, the original voice signal is segmented to obtain an original voice sequence.
S806, the voice segments in the original voice sequence are sequentially subjected to voice coding to obtain a voice coding code stream.
S808, caching the voice coding characteristic parameters adopted by each coded data in the voice coding process.
And S810, acquiring the current coded data in the voice coded code stream.
S812, inputting the first voice coding characteristic parameter corresponding to the current coding data and the second voice coding characteristic parameter corresponding to the previous coding data of the current coding data into the packet loss recovery capability prediction model.
S814, outputting a score difference between a first voice quality score determined by directly decoding the current coded data and a second voice quality score determined by decoding the current coded data after packet loss recovery processing according to the first voice coding characteristic parameter and the second voice coding characteristic parameter through the packet loss recovery capability prediction model.
S816, determining the packet loss recovery capability corresponding to the current coded data according to the score difference.
S818, when the packet loss recovery capability is smaller than a preset threshold, acquiring packet loss state information fed back by the receiving end; determining a redundancy multiple parameter corresponding to the current encoded data according to the packet loss state information; copying the current encoded data according to the redundancy multiple parameter and transmitting the copies to the receiving end; and filtering out repeated data packets at the receiving end and then decoding to obtain the voice signal corresponding to the current encoded data.
And S820, when the packet loss recovery capability is larger than a preset threshold value, directly transmitting the current encoding data to a receiving end.
And S822, if the receiving end does not receive the current coded data and the redundant multi-transmission packet corresponding to the current coded data, performing packet loss recovery processing on the current coded data through the receiving end to obtain a recovery packet corresponding to the current coded data, and decoding the recovery packet to obtain a voice signal corresponding to the current coded data.
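The sender-side portion of this flow (steps S810 to S820) can be tied together in a short sketch; the frame representation, the fixed single extra copy, and the toy prediction callable are all assumptions for illustration:

```python
def transmit(frames, predict_capability, threshold):
    """Sender loop for S810-S820: for each packet, predict its packet
    loss recovery capability from the current and previous speech
    coding feature parameters, then either duplicate it (redundancy
    multiple processing) or send it once.

    frames: (encoded_data, feature_params) per packet;
    predict_capability(current, previous): stand-in for the trained
    packet loss recovery capability prediction model.
    """
    sent = []
    prev_params = None
    for data, params in frames:
        capability = predict_capability(params, prev_params)
        copies = 2 if capability < threshold else 1   # assumed: one extra copy
        sent.extend([data] * copies)
        prev_params = params
    return sent
```

A packet predicted to recover poorly is sent twice, while one that the decoder's PLC could reconstruct acceptably is sent once, mirroring the S818/S820 branch.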
It should be understood that, although the steps in the flowcharts of fig. 3, 5, and 8 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in fig. 3, 5, and 8 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, a voice transmission system is provided, which may be the voice transmission system shown in fig. 1 or fig. 2, and includes a transmitting end 110 and a receiving end 120, where:
the sending end 110 is configured to obtain current encoded data in a speech encoding code stream, and obtain packet loss recovery capability corresponding to the current encoded data according to a first speech encoding characteristic parameter corresponding to the current encoded data and a second speech encoding characteristic parameter corresponding to previous encoded data of the current encoded data through a machine learning-based packet loss recovery capability prediction model;
the sending end 110 is further configured to determine whether redundancy multiple-occurrence processing is required according to the packet loss recovery capability; if so, performing redundancy multiple-sending processing on the current coded data and then transmitting the current coded data to a receiving end; if not, directly transmitting the current coded data to a receiving end;
the receiving end 120 is configured to, when receiving the current encoded data or the redundant multicast packet corresponding to the current encoded data, filter out repeated data packets through the receiving end and then decode the data packets to obtain a voice signal corresponding to the current encoded data;
the receiving end 120 is further configured to, when the current encoded data and the redundant multi-transmission packet corresponding to the current encoded data are not received, perform packet loss recovery processing on the current encoded data to obtain a recovery packet corresponding to the current encoded data, and decode the recovery packet to obtain a voice signal corresponding to the current encoded data.
In one embodiment, the transmitting end 110 is further configured to obtain an original voice signal; segmenting an original voice signal to obtain an original voice sequence; and sequentially carrying out voice coding on the voice segments in the original voice sequence to obtain a voice coding code stream.
In an embodiment, the sending end 110 is further configured to obtain speech coding characteristic parameters corresponding to respective speech segments in the original speech sequence; carrying out voice coding on the corresponding voice segments according to the voice coding characteristic parameters, and obtaining voice coding code streams after generating corresponding coded data; and caching the voice coding characteristic parameters adopted by each coded data in the voice coding process.
In one embodiment, the sending end 110 is further configured to input a first speech coding feature parameter corresponding to the current encoded data and a second speech coding feature parameter corresponding to the previous encoded data of the current encoded data into the packet loss recovery capability prediction model; output, through the packet loss recovery capability prediction model and according to the first and second speech coding feature parameters, a score difference between a first voice quality score determined by directly decoding the current encoded data and a second voice quality score determined by decoding the current encoded data after packet loss recovery processing; and determine the packet loss recovery capability corresponding to the current encoded data according to the score difference, the packet loss recovery capability being inversely related to the score difference.
In an embodiment, the sending end 110 is further configured to obtain packet loss status information fed back by the receiving end; determining redundancy multiple parameters corresponding to the current encoding data according to the packet loss state information; and copying the current coded data according to the redundancy multi-parameter and transmitting the current coded data to a receiving end.
In an embodiment, the receiving end 120 is further configured to, when the current encoded data and the redundant multi-transmission packet corresponding to the current encoded data are not received, perform packet loss recovery processing on the current encoded data to obtain a recovery packet corresponding to the current encoded data, and decode the recovery packet to obtain a voice signal corresponding to the current encoded data.
In one embodiment, the transmitting end 110 is further configured to obtain a sample voice sequence in a training set; carrying out voice coding on the sample voice sequence to obtain a sample voice coding code stream; extracting a first voice coding characteristic parameter adopted by current coding data in a sample voice coding code stream and a second voice coding characteristic parameter adopted by previous coding data of the current coding data; obtaining a first voice quality score determined based on a first voice signal after directly decoding a sample voice coding code stream and obtaining the first voice signal; acquiring a recovery packet obtained by performing simulated packet loss recovery processing on current coded data, decoding the recovery packet and obtaining a second voice signal, and then determining a second voice quality score based on the second voice signal; determining the real packet loss recovery capability corresponding to the current coded data according to the grading difference between the first voice quality grade and the second voice quality grade; inputting the first voice coding characteristic parameter and the second voice coding characteristic parameter into a machine learning model, and outputting predicted packet loss recovery capability corresponding to current coding data through the machine learning model; and after the model parameters of the machine learning model are adjusted according to the difference between the real packet loss recovery capability and the predicted packet loss recovery capability, returning to the step of obtaining the sample voice sequence in the training set to continue training until the training end condition is met.
According to the above voice transmission system, before the sending end transmits the current encoded data to the receiving end, it predicts, through the machine learning-based packet loss recovery capability prediction model, the packet loss recovery capability of the receiving end for the current encoded data from the first speech coding feature parameter corresponding to the current encoded data and the second speech coding feature parameter corresponding to the previous encoded data. Whether to perform redundancy multiple processing on the current encoded data is then judged according to the predicted packet loss recovery capability: if so, the necessary network bandwidth resources are spent on redundancy multiple processing; if not, the current encoded data is transmitted to the receiving end directly, avoiding the consumption of excess network bandwidth resources. This effectively improves the overall utilization of network bandwidth while still guaranteeing the anti-packet-loss capability of the transmission network.
In one embodiment, as shown in fig. 9, a voice transmission apparatus 900 is provided, which may be implemented by software, hardware, or a combination of both as all or part of a sending end. The apparatus comprises an acquisition module 902, a prediction module 904, and a redundancy multiple-occurrence judging module 906, wherein:
an obtaining module 902, configured to obtain the current encoded data in a voice-encoded bitstream;
a prediction module 904, configured to obtain, through a machine-learning-based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to a first voice coding feature parameter corresponding to the current encoded data and a second voice coding feature parameter corresponding to the encoded data preceding it;
a redundancy multiple-sending decision module 906, configured to decide, according to the packet loss recovery capability, whether redundancy multiple-sending processing is required; if so, to perform redundancy multiple-sending processing on the current encoded data before transmitting it to the receiving end; if not, to transmit the current encoded data directly to the receiving end.
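The decision made by module 906 can be sketched as a simple threshold rule. The numeric capability scale, the threshold value, and the fixed copy count below are illustrative assumptions; the patent leaves these to the implementation.

```python
def packets_to_send(encoded_frame: bytes, capability: float,
                    threshold: float = 0.5, copies: int = 3) -> list:
    """Return the list of packets to transmit for one encoded frame.

    If the predicted packet loss recovery capability is high enough,
    the frame is sent once (direct transmission); otherwise it is
    duplicated (redundancy multiple-sending processing).
    """
    if capability >= threshold:
        return [encoded_frame]        # direct transmission
    return [encoded_frame] * copies   # redundant multi-send
```

A frame the receiver could recover well on its own is thus sent once, and bandwidth is spent only on frames whose loss would be hard to conceal.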
In one embodiment, the voice transmission apparatus 900 further includes a voice encoding module configured to obtain an original voice signal, segment the original voice signal into an original voice sequence, and sequentially perform voice encoding on the voice segments in the original voice sequence to obtain a voice-encoded bitstream.
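Segmentation into an original voice sequence can be sketched as fixed-length framing of the PCM samples. The 16 kHz sample rate and 20 ms frame length are illustrative assumptions; the patent does not prescribe them.

```python
def segment(signal, sample_rate=16000, frame_ms=20):
    """Split a PCM sample list into fixed-length voice segments.

    The final partial frame is zero-padded so every segment has the
    same length, keeping the encoder's input size constant.
    """
    frame_len = sample_rate * frame_ms // 1000   # 320 samples at 16 kHz / 20 ms
    segments = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        frame = frame + [0] * (frame_len - len(frame))   # zero-pad last frame
        segments.append(frame)
    return segments
```

Each segment is then encoded in order to produce the voice-encoded bitstream.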
In one embodiment, the voice transmission apparatus 900 further includes a voice encoding module and a buffer module. The voice encoding module is configured to obtain the voice coding feature parameters corresponding to the respective voice segments in the original voice sequence, and to encode each voice segment according to its voice coding feature parameters, generating the corresponding encoded data that forms the voice-encoded bitstream. The buffer module is configured to cache the voice coding feature parameters used by each piece of encoded data during encoding.
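The buffer module can be sketched as a small cache keyed by frame index, so that when frame n is about to be transmitted the parameters of frame n-1 are available as the "second" feature input to the prediction model. The class and parameter names here are illustrative, not taken from the patent.

```python
class FeatureCache:
    """Caches the voice coding feature parameters used for each frame."""

    def __init__(self):
        self._params = {}

    def store(self, frame_index, params):
        # Called by the encoder as each frame is encoded.
        self._params[frame_index] = params

    def pair_for(self, frame_index):
        """Return (first, second) feature parameters for prediction:
        those of the current frame and of the preceding frame."""
        first = self._params[frame_index]
        second = self._params.get(frame_index - 1)   # None for the first frame
        return first, second
```

In practice old entries would also be evicted; the sketch omits that for brevity.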
In one embodiment, the prediction module 904 is further configured to input the first voice coding feature parameter corresponding to the current encoded data and the second voice coding feature parameter corresponding to the preceding encoded data into the packet loss recovery capability prediction model; the model outputs, according to the first and second voice coding feature parameters, the score difference between a first voice quality score determined by directly decoding the current encoded data and a second voice quality score determined by decoding the current encoded data after packet loss recovery processing; the packet loss recovery capability corresponding to the current encoded data is then determined according to this score difference, to which it is inversely related.
In one embodiment, the redundancy multiple-sending decision module 906 is further configured to obtain packet loss state information fed back by the receiving end when the packet loss recovery capability is below a preset threshold; to determine a redundancy multiple parameter corresponding to the current encoded data according to the packet loss state information; and to copy the current encoded data according to the redundancy multiple parameter before transmitting it to the receiving end.
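Determining the redundancy multiple parameter from the fed-back packet loss state can be sketched as a lookup on the reported loss rate. The rate bands and copy counts below are illustrative assumptions; the patent leaves the mapping to the implementation.

```python
def redundancy_multiple(loss_rate: float) -> int:
    """Map the receiving end's reported packet loss rate to a copy count."""
    if loss_rate < 0.05:
        return 2          # light loss: one extra copy
    if loss_rate < 0.20:
        return 3          # moderate loss: two extra copies
    return 4              # heavy loss: three extra copies

def multi_send(encoded_frame: bytes, loss_rate: float) -> list:
    """Copy the current encoded data according to the redundancy multiple."""
    return [encoded_frame] * redundancy_multiple(loss_rate)
```

The receiving end later filters out the duplicate packets before decoding, so a higher multiple only costs bandwidth, not correctness.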
In one embodiment, the voice transmission apparatus 900 further includes a model training module configured to: obtain a sample voice sequence from a training set; perform voice encoding on the sample voice sequence to obtain a sample voice-encoded bitstream; extract the first voice coding feature parameter used by the current encoded data in the sample bitstream and the second voice coding feature parameter used by the encoded data preceding it; decode the sample bitstream directly to obtain a first voice signal and determine a first voice quality score based on it; perform simulated packet loss recovery processing on the current encoded data to obtain a recovery packet, decode the recovery packet to obtain a second voice signal, and determine a second voice quality score based on it; determine the true packet loss recovery capability corresponding to the current encoded data according to the score difference between the first and second voice quality scores; input the first and second voice coding feature parameters into a machine learning model, which outputs the predicted packet loss recovery capability corresponding to the current encoded data; and, after adjusting the model parameters of the machine learning model according to the difference between the true and predicted packet loss recovery capabilities, return to the step of obtaining a sample voice sequence from the training set and continue training until a training-end condition is met.
Before transmitting the current encoded data to the receiving end, the voice transmission apparatus 900 predicts, through the machine-learning-based packet loss recovery capability prediction model, the receiving end's packet loss recovery capability for the current encoded data according to the first voice coding feature parameter corresponding to the current encoded data and the second voice coding feature parameter corresponding to the preceding encoded data. It then decides, according to this capability, whether to apply redundancy multiple-sending processing to the current encoded data: if so, the necessary network bandwidth is spent on redundancy multiple-sending processing; if not, the current encoded data is transmitted directly to the receiving end, avoiding unnecessary bandwidth consumption. Overall this effectively improves network bandwidth utilization while still guaranteeing the transmission network's resistance to packet loss.
FIG. 10 is a diagram illustrating the internal structure of a computer device in one embodiment. The computer device may specifically be the transmitting end 110 in fig. 1. As shown in fig. 10, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the voice transmission method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the voice transmission method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the present application and does not limit the computer devices to which the present application applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the voice transmission apparatus 900 provided in the present application can be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 10. The memory of the computer device may store the program modules constituting the voice transmission apparatus 900, such as the obtaining module 902, the prediction module 904, and the redundancy multiple-sending decision module 906 shown in fig. 9. The computer program constituted by these program modules causes the processor to execute the steps of the voice transmission methods of the embodiments of the present application described in this specification.
For example, the computer device shown in fig. 10 may execute step S302 through the obtaining module 902 of the voice transmission apparatus 900 shown in fig. 9, step S304 through the prediction module 904, and steps S306, S308 and S310 through the redundancy multiple-sending decision module 906.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the voice transmission method described above. Here, the steps may be those of the voice transmission methods of the respective embodiments above.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the voice transmission method described above. Here, the steps may be those of the voice transmission methods of the respective embodiments above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage. Volatile memory can include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described, but any combination that contains no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application; their description is specific and detailed, but should not be construed as limiting the scope of the application. A person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A method of voice transmission, comprising:
acquiring current coded data in a voice coded code stream;
obtaining packet loss recovery capability corresponding to current coded data according to a first voice coding characteristic parameter corresponding to the current coded data and a second voice coding characteristic parameter corresponding to previous coded data of the current coded data through a machine learning-based packet loss recovery capability prediction model;
judging whether redundancy multiple-sending processing is needed according to the packet loss recovery capability;
if so, performing redundancy multiple-sending processing on the current coded data and then transmitting it to a receiving end;
if not, transmitting the current coded data directly to the receiving end.
2. The method of claim 1, further comprising:
acquiring an original voice signal;
segmenting an original voice signal to obtain an original voice sequence;
and sequentially carrying out voice coding on the voice segments in the original voice sequence to obtain a voice coding code stream.
3. The method of claim 1, further comprising:
acquiring voice coding characteristic parameters corresponding to voice segments in an original voice sequence;
carrying out voice coding on the corresponding voice segments according to the voice coding characteristic parameters, and obtaining voice coding code streams after generating corresponding coded data;
and caching the voice coding characteristic parameters adopted by each coded data in the voice coding process.
4. The method of claim 1, wherein obtaining the packet loss recovery capability corresponding to current encoded data according to a first speech coding feature parameter corresponding to the current encoded data and a second speech coding feature parameter corresponding to previous encoded data of the current encoded data through a machine learning-based packet loss recovery capability prediction model comprises:
inputting a first voice coding characteristic parameter corresponding to the current coding data and a second voice coding characteristic parameter corresponding to previous coding data of the current coding data into a packet loss recovery capability prediction model;
outputting a score difference between a first voice quality score determined by directly decoding the current coded data and a second voice quality score determined by decoding the current coded data after packet loss recovery processing according to the first voice coding characteristic parameter and the second voice coding characteristic parameter through the packet loss recovery capability prediction model;
determining the packet loss recovery capability corresponding to the current coded data according to the score difference;
wherein the packet loss recovery capability corresponding to the current coded data is inversely related to the score difference.
5. The method of claim 1, wherein the performing redundant multi-sending processing on the current encoded data before transmitting to a receiving end comprises:
acquiring packet loss state information fed back by a receiving end;
determining a redundancy multiple parameter corresponding to the current encoded data according to the packet loss state information;
and copying the current encoded data according to the redundancy multiple parameter and transmitting the copies to the receiving end.
6. The method of claim 5, further comprising:
and when the receiving end receives the current encoded data or a redundant multi-transmission packet corresponding to the current encoded data, filtering out repeated data packets at the receiving end and then decoding to obtain a voice signal corresponding to the current encoded data.
7. The method of claim 1, further comprising:
and when the receiving end receives neither the current encoded data nor a redundant multi-transmission packet corresponding to the current encoded data, performing packet loss recovery processing for the current encoded data at the receiving end to obtain a recovery packet corresponding to the current encoded data, and decoding the recovery packet to obtain a voice signal corresponding to the current encoded data.
8. The method according to any of claims 1 to 7, wherein the packet loss recovery capability prediction model is determined by:
acquiring a sample voice sequence in a training set;
carrying out voice coding on the sample voice sequence to obtain a sample voice coding code stream;
extracting a first voice coding characteristic parameter adopted by current coded data in the sample voice coding code stream and a second voice coding characteristic parameter adopted by previous coded data of the current coded data;
directly decoding the sample voice coding code stream to obtain a first voice signal, and obtaining a first voice quality score determined based on the first voice signal;
performing simulated packet loss recovery processing on the current coded data to obtain a recovery packet, decoding the recovery packet to obtain a second voice signal, and determining a second voice quality score based on the second voice signal;
determining the real packet loss recovery capability corresponding to the current coded data according to the score difference between the first voice quality score and the second voice quality score;
inputting the first voice coding characteristic parameter and the second voice coding characteristic parameter into a machine learning model, and outputting predicted packet loss recovery capability corresponding to the current coding data through the machine learning model;
and after adjusting the model parameters of the machine learning model according to the difference between the real packet loss recovery capability and the predicted packet loss recovery capability, returning to the step of obtaining the sample voice sequence in the training set to continue training until a training end condition is met.
9. A voice transmission system comprising a transmitting end and a receiving end, wherein:
the sending end is configured to obtain current encoded data in a voice coding code stream, and to obtain packet loss recovery capability corresponding to the current encoded data through a machine learning-based packet loss recovery capability prediction model according to a first voice coding characteristic parameter corresponding to the current encoded data and a second voice coding characteristic parameter corresponding to previous encoded data of the current encoded data;
the sending end is further configured to judge whether redundancy multiple-sending processing is needed according to the packet loss recovery capability; if so, to perform redundancy multiple-sending processing on the current encoded data and then transmit it to the receiving end; if not, to transmit the current encoded data directly to the receiving end;
the receiving end is configured to, when receiving the current encoded data or a redundant multi-transmission packet corresponding to the current encoded data, filter out repeated data packets and then decode to obtain a voice signal corresponding to the current encoded data;
the receiving end is further configured to, when the current encoded data and the redundant multi-transmission packet corresponding to the current encoded data are not received, perform packet loss recovery processing on the current encoded data to obtain a recovery packet corresponding to the current encoded data, and decode the recovery packet to obtain a voice signal corresponding to the current encoded data.
10. The system according to claim 9, wherein the sending end is further configured to obtain speech coding feature parameters corresponding to respective speech segments in an original speech sequence; carrying out voice coding on the corresponding voice segments according to the voice coding characteristic parameters, and obtaining voice coding code streams after generating corresponding coded data; and caching the voice coding characteristic parameters adopted by each coded data in the voice coding process.
11. The system according to claim 9, wherein the sending end is further configured to input a first voice coding characteristic parameter corresponding to the current encoded data and a second voice coding characteristic parameter corresponding to previous encoded data of the current encoded data into the packet loss recovery capability prediction model; to output, through the packet loss recovery capability prediction model and according to the first and second voice coding characteristic parameters, a score difference between a first voice quality score determined by directly decoding the current encoded data and a second voice quality score determined by decoding the current encoded data after packet loss recovery processing; and to determine the packet loss recovery capability corresponding to the current encoded data according to the score difference, the packet loss recovery capability being inversely related to the score difference.
12. The system according to claim 9, wherein the sending end is further configured to obtain packet loss state information fed back by the receiving end; to determine a redundancy multiple parameter corresponding to the current encoded data according to the packet loss state information; and to copy the current encoded data according to the redundancy multiple parameter and transmit the copies to the receiving end.
13. A voice transmission apparatus, the apparatus comprising:
the acquisition module is used for acquiring current encoded data in the voice encoded code stream;
the prediction module is used for obtaining packet loss recovery capability corresponding to current coded data according to a first voice coding characteristic parameter corresponding to the current coded data and a second voice coding characteristic parameter corresponding to previous coded data of the current coded data through a packet loss recovery capability prediction model based on machine learning;
the redundancy multiple-sending decision module is configured to judge whether redundancy multiple-sending processing is required according to the packet loss recovery capability; if so, to perform redundancy multiple-sending processing on the current encoded data and then transmit it to a receiving end; if not, to transmit the current encoded data directly to the receiving end.
14. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 8.
15. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 8.
CN202010104612.0A 2020-02-20 2020-02-20 Voice transmission method, system, device, computer readable storage medium and apparatus Active CN111312264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010104612.0A CN111312264B (en) 2020-02-20 2020-02-20 Voice transmission method, system, device, computer readable storage medium and apparatus

Publications (2)

Publication Number Publication Date
CN111312264A true CN111312264A (en) 2020-06-19
CN111312264B CN111312264B (en) 2023-04-21

Family

ID=71148037


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899729A (en) * 2020-08-17 2020-11-06 广州市百果园信息技术有限公司 Voice model training method and device, server and storage medium
CN111953596A (en) * 2020-08-26 2020-11-17 北京奥特维科技有限公司 Double-network-port zero-delay hot standby method and device for distributed coding and decoding system
CN112532349A (en) * 2020-11-24 2021-03-19 广州技象科技有限公司 Data processing method and device based on decoding abnormity
CN112769524A (en) * 2021-04-06 2021-05-07 腾讯科技(深圳)有限公司 Voice transmission method, device, computer equipment and storage medium
CN115146125A (en) * 2022-05-27 2022-10-04 北京科技大学 Receiving end data filtering method and device under semantic communication multi-address access scene
EP4012705A4 (en) * 2020-02-20 2022-12-28 Tencent Technology (Shenzhen) Company Limited Speech transmission method, system, and apparatus, computer readable storage medium, and device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1758254A1 (en) * 2005-08-23 2007-02-28 Thomson Licensing Improved erasure correction scheme based on XOR operations for packet transmission
CN101119319A (en) * 2007-09-19 2008-02-06 腾讯科技(深圳)有限公司 Method, transmitting/receiving device and system against lost packet in data transmission process
CN101604523A (en) * 2009-04-22 2009-12-16 网经科技(苏州)有限公司 In voice coding G.711, hide the method for redundant information
CN101741584A (en) * 2008-11-20 2010-06-16 盛乐信息技术(上海)有限公司 Method for reducing packet loss of streaming media
CN101777960A (en) * 2008-11-17 2010-07-14 华为终端有限公司 Audio encoding method, audio decoding method, related device and communication system
CN101834700A (en) * 2010-05-12 2010-09-15 北京邮电大学 Unidirectional reliable transmission method and transceiving device based on data packets
CN102263606A (en) * 2010-05-28 2011-11-30 华为技术有限公司 Channel data coding and decoding method and device
US20120163221A1 (en) * 2010-12-28 2012-06-28 Brother Kogyo Kabushiki Kaisha Communication Device, Communication Method and Computer Program Product
CN107196746A (en) * 2016-03-15 2017-09-22 中兴通讯股份有限公司 Anti-dropout methods, devices and systems in real-time Communication for Power
CN109218083A (en) * 2018-08-27 2019-01-15 广州爱拍网络科技有限公司 A kind of voice data transmission method and device
CN109862440A (en) * 2019-02-22 2019-06-07 深圳市凯迪仕智能科技有限公司 Audio video transmission forward error correction, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code: HK; legal event code: DE; document number: 40024144
GR01 Patent grant