CN112769524B - Voice transmission method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN112769524B
Authority
CN
China
Prior art keywords
coding
voice
code stream
packet loss
receiving end
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110366404.2A
Other languages
Chinese (zh)
Other versions
CN112769524A (en)
Inventor
梁俊斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110366404.2A priority Critical patent/CN112769524B/en
Publication of CN112769524A publication Critical patent/CN112769524A/en
Application granted granted Critical
Publication of CN112769524B publication Critical patent/CN112769524B/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00Arrangements for detecting or preventing errors in the information received
    • H04L1/004Arrangements for detecting or preventing errors in the information received by using forward error control
    • H04L1/0041Arrangements at the transmitter end
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/18Vocoders using multiple modes
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00Arrangements for detecting or preventing errors in the information received
    • H04L1/004Arrangements for detecting or preventing errors in the information received by using forward error control
    • H04L1/0045Arrangements at the receiver end

Abstract

The present application relates to a voice transmission method, system, apparatus, computer device, and storage medium in the field of voice data transmission. The method includes: acquiring a speech signal to be encoded; performing speech encoding on the speech signal according to a first encoding parameter to obtain a first encoded code stream; performing speech encoding on the speech signal according to a second encoding parameter to obtain a second encoded code stream whose speech encoding quality is lower than that of the first encoded code stream; and transmitting the first and second encoded code streams to a receiving end respectively, so that when the receiving end determines that packet loss has occurred in the first encoded code stream, it obtains a decoded speech signal corresponding to the packet loss position from the second encoded code stream. This method improves both the packet loss resistance and the speech encoding quality of voice data transmission.

Description

Voice transmission method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a voice transmission method and apparatus, a computer device, and a storage medium.
Background
Speech encoding effectively reduces the bandwidth required to transmit speech signals, which is essential for keeping transmission costs down and preserving the integrity of speech information over the network. However, because transmission networks are unstable, packet loss occurs during voice transmission.
In conventional speech encoders, packet loss resistance relies mainly on built-in in-band FEC (Forward Error Correction): the encoded data of each frame carries redundant information about the previous frame, so that when a packet is lost, the data at the loss position can be recovered from the redundancy carried in the next frame's encoded data.
However, because each encoded frame carries redundancy only for the immediately preceding frame, this approach can recover only a single lost frame; when more than one consecutive frame is lost, the later losses cannot be repaired, so its packet loss resistance is weak. Moreover, at a fixed encoding bitrate, the previous frame's redundancy competes for bits with the current frame's speech encoded data, and the more bits the redundancy occupies, the more noticeably the speech encoding quality degrades.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a voice transmission method, apparatus, computer device, and storage medium capable of improving both packet loss resistance and speech encoding quality.
A voice transmission method, the method comprising:
acquiring a speech signal to be encoded;
performing speech encoding on the speech signal according to a first encoding parameter to obtain a first encoded code stream;
performing speech encoding on the speech signal according to a second encoding parameter to obtain a second encoded code stream whose speech encoding quality is lower than that of the first encoded code stream; and
transmitting the first encoded code stream and the second encoded code stream to a receiving end respectively, so that when the receiving end determines that packet loss has occurred in the first encoded code stream, it obtains a decoded speech signal corresponding to the packet loss position from the second encoded code stream.
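The dual-encoding idea above can be sketched in miniature. In the illustrative Python sketch below, all names and values are hypothetical (this is not the patent's actual codec): a coarser quantization step stands in for the lower-quality second encoding parameter, and the receiver falls back to the second stream wherever a frame of the first stream was lost.

```python
def encode(frames, step):
    """Quantize each sample with the given step; a larger step models
    a lower-bitrate, lower-quality encoding (illustrative only)."""
    return [[round(s / step) * step for s in f] for f in frames]

def receive(primary, secondary, lost):
    """Rebuild the frame sequence: use the high-quality primary frame
    when it arrived, otherwise fall back to the low-quality copy."""
    return [secondary[i] if i in lost else primary[i]
            for i in range(len(primary))]

# Hypothetical 3-frame signal (sample values); frame 1 is lost in transit.
frames = [[0.11, 0.52], [0.33, 0.91], [0.74, 0.28]]
first_stream = encode(frames, step=0.01)   # first encoding parameter: fine
second_stream = encode(frames, step=0.1)   # second encoding parameter: coarse
decoded = receive(first_stream, second_stream, lost={1})
```

The received signal keeps full quality everywhere except at the loss position, where the coarse copy fills the gap instead of silence.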
A voice transmission system, the system comprising a transmitting end and a receiving end, wherein:
the transmitting end is configured to acquire a speech signal to be encoded; perform speech encoding on the speech signal according to a first encoding parameter to obtain a first encoded code stream; perform speech encoding on the speech signal according to a second encoding parameter to obtain a second encoded code stream whose speech encoding quality is lower than that of the first encoded code stream; and transmit the first and second encoded code streams to the receiving end respectively;
the receiving end is configured to receive the first and second encoded code streams, and, when packet loss occurs in the first encoded code stream, obtain a decoded speech signal corresponding to the packet loss position from the second encoded code stream.
In one embodiment, the transmitting end is configured to input the speech signal into a first speech encoder, which performs speech encoding according to the first encoding parameter to obtain the first encoded code stream corresponding to the speech signal, and into a second speech encoder, which performs speech encoding according to the second encoding parameter to obtain the second encoded code stream corresponding to the speech signal, the speech encoding quality of which is lower than that of the first encoded code stream.
In one embodiment, the transmitting end is configured to obtain a speech quality indicator corresponding to a target scene and determine the first encoding parameter according to the speech quality indicator.
In one embodiment, the transmitting end is configured to obtain current network bandwidth information of the voice transmission link; determine a first packetization mode according to the network bandwidth information; pack the first encoded frames corresponding to the speech segments in the first encoded code stream according to the first packetization mode to obtain at least one first encoded data packet; and transmit the at least one first encoded data packet to the receiving end.
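Bandwidth-driven packetization of this kind might look as follows; the thresholds and frames-per-packet values below are hypothetical, chosen only to illustrate that lower bandwidth favors packing more frames per packet to reduce per-packet header overhead.

```python
def first_packetization_mode(bandwidth_kbps):
    """Choose how many encoded frames go into each data packet from
    the current link bandwidth (illustrative thresholds)."""
    if bandwidth_kbps < 32:
        return 4          # low bandwidth: amortize headers over 4 frames
    if bandwidth_kbps < 64:
        return 2
    return 1              # ample bandwidth: one frame per packet

def pack(frames, per_packet):
    """Group encoded frames into packets of `per_packet` frames each."""
    return [frames[i:i + per_packet]
            for i in range(0, len(frames), per_packet)]

packets = pack(["f0", "f1", "f2", "f3"], first_packetization_mode(24))
```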
In one embodiment, the transmitting end is configured to determine a second encoding parameter corresponding to the current speech segment to be encoded in the speech signal, where the speech encoding quality corresponding to the second encoding parameter is lower than that corresponding to the first encoding parameter; perform speech encoding on the current speech segment according to the second encoding parameter to obtain a second encoded frame corresponding to the current speech segment; and obtain the second encoded code stream from the second encoded frames corresponding to the speech segments of the speech signal.
In one embodiment, the transmitting end is configured to perform voice activity detection on the current speech segment to be encoded to obtain the voice activity of the current segment, and determine the second encoding parameter for the current segment according to that voice activity.
In one embodiment, the transmitting end is configured to perform voice activity detection on the current speech segment to obtain its voice activity; determine the previous speech segment corresponding to the current segment from the speech signal; and determine the second encoding parameter for the current segment according to the voice activity of both the current segment and the previous segment.
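A minimal sketch of this activity-driven choice of the second encoding parameter, assuming an energy-based detector and two illustrative bitrates (the threshold, rates, and function names are assumptions, not the patent's specification):

```python
def voice_activity(segment, threshold=0.05):
    """Crude energy-based VAD: a segment counts as 'active' when its
    mean absolute amplitude exceeds the threshold (illustrative)."""
    return sum(abs(s) for s in segment) / len(segment) > threshold

def second_coding_rate(current_active, previous_active,
                       high=16000, low=8000):
    """Pick the redundant stream's bitrate from the activity of the
    current and previous segments: spend bits only where speech is
    (or was just) present."""
    return high if (current_active or previous_active) else low

prev_active = voice_activity([0.2, -0.3, 0.25])       # speech-like
curr_active = voice_activity([0.001, -0.002, 0.001])  # silence-like
rate = second_coding_rate(curr_active, prev_active)   # still high: tail of speech
```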
In one embodiment, the transmitting end is configured to obtain packet loss state information fed back by the receiving end, and determine the second encoding parameter for the current speech segment according to the packet loss state information and the voice activity of the current speech segment to be encoded in the speech signal.
In one embodiment, the transmitting end is configured to obtain packet loss state information fed back by the receiving end; determine a second packetization mode according to the packet loss state information; pack the second encoded frames corresponding to the speech segments in the second encoded code stream according to the second packetization mode to obtain at least one second encoded data packet; and transmit the at least one second encoded data packet to the receiving end.
In one embodiment, when the transmitting end transmits the first and second encoded code streams to the receiving end over different voice transmission links, it obtains packet loss state information about the first encoded code stream fed back by the receiving end; when it transmits both code streams over the same voice transmission link, it obtains the common packet loss state information of the two code streams fed back by the receiving end.
In one embodiment, the transmitting end is configured to transmit the first encoded code stream and the second encoded code stream to the receiving end respectively; the receiving end is configured to receive both code streams and, when packet loss occurs in the first encoded code stream, determine the second encoded frame corresponding to the packet loss position from the second encoded code stream, determine the encoded frame adjacent to the packet loss position in the received first encoded code stream, and obtain the decoded speech signal corresponding to the packet loss position from the second encoded frame and the adjacent encoded frame.
In one embodiment, the receiving end is configured to determine a packet loss compensation frame for the packet loss position from the adjacent encoded frame, perform speech decoding on the compensation frame to obtain a compensation speech signal, perform speech decoding on the second encoded frame corresponding to the packet loss position to obtain a second decoded speech signal, and perform windowed splicing of the compensation speech signal and the second decoded speech signal to obtain the decoded speech signal corresponding to the packet loss position.
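The windowed splicing step can be illustrated with a simple crossfade. The sketch below is one assumed way such splicing might be done, fading the concealment signal out while fading the decoded redundant signal in with a rising Hanning-shaped gain whose complementary fades sum to one; the patent does not prescribe these exact formulas.

```python
import math

def rising_hann(n):
    """Rising half of a Hanning window: goes from 0 to 1 over n samples."""
    return [0.5 - 0.5 * math.cos(math.pi * k / (n - 1)) for k in range(n)]

def splice(compensated, redundant):
    """Windowed splice at the packet-loss position: fade the
    concealment (compensation) signal out while fading the decoded
    redundant signal in; the two gains sum to 1 at every sample."""
    fade_in = rising_hann(len(compensated))
    return [c * (1 - g) + r * g
            for c, r, g in zip(compensated, redundant, fade_in)]

# Constant test signals make the crossfade visible: 2.0 fades to 0.0.
out = splice([2.0, 2.0, 2.0, 2.0], [0.0, 0.0, 0.0, 0.0])
```

The splice starts exactly on the compensation signal and ends exactly on the redundant signal, avoiding an audible discontinuity at the boundary.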
In one embodiment, when no packet loss occurs in the first encoded code stream, the receiving end sequentially performs speech decoding on the first encoded frames in the first encoded code stream to obtain the decoded speech signal.
A voice transmission apparatus, the apparatus comprising:
an acquisition module configured to acquire a speech signal to be encoded;
a first encoding module configured to perform speech encoding on the speech signal according to a first encoding parameter to obtain a first encoded code stream;
a second encoding module configured to perform speech encoding on the speech signal according to a second encoding parameter to obtain a second encoded code stream whose speech encoding quality is lower than that of the first encoded code stream; and
a transmission module configured to transmit the first and second encoded code streams to a receiving end respectively, so that the receiving end, upon determining that packet loss has occurred in the first encoded code stream, obtains a decoded speech signal corresponding to the packet loss position from the second encoded code stream.
In one embodiment, the first encoding module is configured to input the speech signal into a first speech encoder, which performs speech encoding according to the first encoding parameter to obtain the first encoded code stream corresponding to the speech signal.
In one embodiment, the second encoding module is configured to input the speech signal into a second speech encoder, which performs speech encoding according to the second encoding parameter to obtain the second encoded code stream corresponding to the speech signal, the speech encoding quality of which is lower than that of the first encoded code stream.
In one embodiment, the voice transmission apparatus further includes a first encoding parameter obtaining module configured to obtain a speech quality indicator corresponding to a target scene and determine the first encoding parameter according to the speech quality indicator.
In one embodiment, the transmission module is further configured to obtain current network bandwidth information of the voice transmission link; determine a first packetization mode according to the network bandwidth information; pack the first encoded frames corresponding to the speech segments in the first encoded code stream according to the first packetization mode to obtain at least one first encoded data packet; and transmit the at least one first encoded data packet to the receiving end.
In one embodiment, the second encoding module is further configured to determine a second encoding parameter corresponding to the current speech segment to be encoded in the speech signal, where the speech encoding quality corresponding to the second encoding parameter is lower than that corresponding to the first encoding parameter; perform speech encoding on the current speech segment according to the second encoding parameter to obtain a second encoded frame corresponding to the current speech segment; and obtain the second encoded code stream from the second encoded frames corresponding to the speech segments of the speech signal.
In one embodiment, the voice transmission apparatus further includes a second encoding parameter obtaining module configured to perform voice activity detection on the current speech segment to be encoded to obtain its voice activity, and determine the second encoding parameter for the current segment according to that voice activity.
In one embodiment, the second encoding parameter obtaining module is configured to perform voice activity detection on the current speech segment to obtain its voice activity; determine the previous speech segment corresponding to the current segment from the speech signal; and determine the second encoding parameter for the current segment according to the voice activity of both the current segment and the previous segment.
In one embodiment, the second encoding parameter obtaining module is configured to obtain packet loss state information fed back by the receiving end, and determine the second encoding parameter for the current speech segment according to the packet loss state information and the voice activity of the current speech segment to be encoded in the speech signal.
In one embodiment, the transmission module is further configured to obtain packet loss state information fed back by the receiving end; determine a second packetization mode according to the packet loss state information; pack the second encoded frames corresponding to the speech segments in the second encoded code stream according to the second packetization mode to obtain at least one second encoded data packet; and transmit the at least one second encoded data packet to the receiving end.
In one embodiment, the voice transmission apparatus further includes a packet loss state information obtaining module configured to obtain, when the first and second encoded code streams are transmitted to the receiving end over different voice transmission links, the packet loss state information about the first encoded code stream fed back by the receiving end; and, when both code streams are transmitted over the same voice transmission link, the common packet loss state information of the two code streams fed back by the receiving end.
In one embodiment, the transmission module is further configured to transmit the first and second encoded code streams to the receiving end, where the receiving end, upon determining that packet loss has occurred in the first encoded code stream, determines the second encoded frame corresponding to the packet loss position from the second encoded code stream, determines the encoded frame adjacent to the packet loss position in the received first encoded code stream, and obtains the decoded speech signal corresponding to the packet loss position from the second encoded frame and the adjacent encoded frame.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the following steps:
acquiring a speech signal to be encoded;
performing speech encoding on the speech signal according to a first encoding parameter to obtain a first encoded code stream;
performing speech encoding on the speech signal according to a second encoding parameter to obtain a second encoded code stream whose speech encoding quality is lower than that of the first encoded code stream; and
transmitting the first encoded code stream and the second encoded code stream to a receiving end respectively, so that when the receiving end determines that packet loss has occurred in the first encoded code stream, it obtains a decoded speech signal corresponding to the packet loss position from the second encoded code stream.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
acquiring a speech signal to be encoded;
performing speech encoding on the speech signal according to a first encoding parameter to obtain a first encoded code stream;
performing speech encoding on the speech signal according to a second encoding parameter to obtain a second encoded code stream whose speech encoding quality is lower than that of the first encoded code stream; and
transmitting the first encoded code stream and the second encoded code stream to a receiving end respectively, so that when packet loss occurs in the first encoded code stream, the receiving end obtains a decoded speech signal corresponding to the packet loss position from the second encoded code stream.
A computer program comprising computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the storage medium and executes them, causing the computer device to perform the steps of the above voice transmission method.
The voice transmission method, apparatus, computer device, and storage medium adopt a dual-encoding scheme in which the speech encoding quality corresponding to the first encoding parameter is higher than that corresponding to the second encoding parameter: encoding the speech signal with the first parameter produces the higher-quality stream, and encoding it with the second parameter produces the lower-quality stream. Because two encoding paths are used, the lower-quality second encoded code stream serves as redundant information for the first encoded code stream; when packet loss occurs in the first stream, the speech information of the lost frames can be repaired from the second stream. Compared with the FEC approach, this resists packet loss more strongly. In addition, since neither stream needs extra FEC redundancy, and the second stream is transmitted alongside the first as its redundancy, the degradation of speech encoding quality caused by competition for network bandwidth is avoided.
Drawings
FIG. 1 is a diagram of an application environment of a voice transmission method in one embodiment;
FIG. 2 is a diagram of an application environment of a voice transmission method in another embodiment;
FIG. 3 is a flow diagram of a method for voice transmission in one embodiment;
FIG. 4 is a diagram illustrating an in-band FEC speech coding scheme in one embodiment;
FIG. 5 is a flowchart illustrating speech encoding of a speech signal according to a second encoding parameter to obtain a second encoded code stream according to an embodiment;
FIG. 6 is a flow diagram illustrating the determination of a second encoding parameter in one embodiment;
FIG. 7 is a flowchart illustrating the determination of a second encoding parameter according to another embodiment;
FIG. 8 is a diagram illustrating a dual encoding scheme in one embodiment;
FIG. 9 is a flowchart of dual encoding in a specific embodiment;
FIG. 10 is a schematic diagram of two encoded code streams in one embodiment;
FIG. 11 is a diagram of a Hanning window in one embodiment;
FIG. 12 is a block diagram showing the construction of a voice transmission apparatus according to an embodiment;
FIG. 13 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voice transmission method provided by the present application can be applied in the environment shown in FIG. 1. The voice transmission system includes a transmitting end 110 and a receiving end 120 connected via a network. In one embodiment, the transmitting end 110 acquires a speech signal to be encoded, performs speech encoding on it according to a first encoding parameter to obtain a first encoded code stream, and performs speech encoding on it according to a second encoding parameter to obtain a second encoded code stream whose speech encoding quality is lower than that of the first. The transmitting end 110 then transmits both code streams to the receiving end 120, and when packet loss occurs in the first encoded code stream, the receiving end 120 obtains the decoded speech signal corresponding to the packet loss position from the second encoded code stream.
In one embodiment, the transmitting end 110 may transmit the first and second encoded code streams to a cloud server, which forwards them to the receiving end 120. As shown in FIG. 2, in a typical application scenario, an application supporting voice transmission, such as an instant voice communication client, runs on both the transmitting end 110 and the receiving end 120. The cloud server 130 provides computing and storage capability for the application; both ends connect to the cloud server 130 over the network, so that voice transmission between the two instant voice communication clients is realized through the cloud server 130.
The transmitting end 110 and the receiving end 120 may be terminals, which may be, but are not limited to, personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and may also be vehicle-mounted devices such as in-vehicle audio/video, display, and monitoring devices. The transmitting end 110 and the receiving end 120 may also be a server or a server cluster. The cloud server 130 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The cloud server 130 may also be a node in a blockchain network; for example, in scenarios requiring voiceprint recognition, the voice data sent by the transmitting end may be stored on the blockchain node for subsequent voiceprint verification of the user's identity. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
In one embodiment, as shown in FIG. 3, a voice transmission method is provided. The method is described here as applied to the transmitting end in FIG. 1 and includes the following steps:
Step 302: acquire a speech signal to be encoded.
The speech signal to be encoded is a speech signal on which speech encoding is to be performed. It may be an original analog speech signal, which the transmitting end samples at a set sampling rate to obtain a digital speech signal; or it may already be a digital speech signal obtained by sampling the original analog signal, from which the transmitting end extracts speech features at a set bitrate to obtain encoded frames.
Specifically, the transmitting end obtains the speech signal to be encoded, which may be a real-time speech signal, for example one that must be transmitted in real time in an instant call scenario, or a non-real-time signal, for example an audio signal to be played in synchronization with video in a video playback scenario.
Step 304: perform speech encoding on the speech signal according to the first encoding parameter to obtain a first encoded code stream.
Step 306: perform speech encoding on the speech signal according to the second encoding parameter to obtain a second encoded code stream whose speech encoding quality is lower than that of the first encoded code stream.
Generally, in a voice call scenario, a sound signal reaches the receiving end through the following steps: the transmitting end captures the sound in real time through a microphone; an analog-to-digital conversion circuit converts the captured analog signal into a digital speech signal; the transmitting end's speech encoder encodes the digital signal, which is then packed according to the transmission format and protocol of the communication network and transmitted to the receiving end; the receiving end unpacks the received data packets to obtain the encoded speech code stream and decodes it with a speech decoder to regenerate the digital decoded speech signal; finally the decoded signal is played through a loudspeaker.
For example, an encoder at the sending end acquires a 60 ms digital speech signal to be encoded in real time, divides it into 4 speech segments with a frame length of 15 ms, and performs speech coding on the 4 segments in sequence to obtain a coded code stream composed of 4 coded frames; in some cases it may also pack the 4 coded frames into data packets before transmitting them to the receiving end.
In order to combat the above problem of network packet loss during voice transmission, in-band FEC is usually used to transmit voice encoded data. Fig. 4 is a schematic diagram of an inband FEC speech coding method in an embodiment. Referring to part (a) of fig. 4, after a transmitting end performs speech coding on a speech signal through a speech coder to obtain a speech coded code stream, before the speech coded code stream is transmitted to a receiving end, an in-band FEC combines coded data of a previous frame with coded data of a present frame by buffering the previous coded frame, and then transmits the combined coded data to the receiving end. Referring to part (b) of fig. 4, when a network packet loss occurs, the receiving end may recover the encoded frame at the packet loss position by using the encoded data of the previous frame carried in the encoded speech data of the next frame at the packet loss position, and when no network packet loss occurs, perform speech decoding using the encoded data of each frame in the received encoded speech code stream to obtain a speech decoding signal.
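The in-band FEC behavior described above can be sketched in a few lines (illustrative Python; the frame contents and packet structure are hypothetical stand-ins for real codec data):

```python
def fec_pack(frames):
    # In-band FEC: each transmitted packet carries the current frame's coded
    # data plus a redundant copy of the previous frame's coded data.
    packets, prev = [], None
    for frame in frames:
        packets.append({"main": frame, "redundant": prev})
        prev = frame
    return packets

def fec_recover(packets, lost_index):
    # A lost frame can only be recovered from the redundant copy carried by
    # the immediately following packet.
    nxt = lost_index + 1
    if nxt < len(packets) and packets[nxt] is not None:
        return packets[nxt]["redundant"]
    return None  # the next packet was also lost: unrecoverable

packets = fec_pack(["f0", "f1", "f2", "f3"])
packets[2] = None                    # simulate loss of frame 2
recovered = fec_recover(packets, 2)  # recovered from packet 3's redundancy
```

A burst of two consecutive losses leaves `fec_recover` with nothing to read, which illustrates the weak resistance to consecutive losses discussed in the text.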
In the above in-band FEC scheme, each coded frame carries redundant information only for the immediately preceding frame, so only a single-frame loss can be repaired; if more than one consecutive frame is lost, the packet-loss resistance of in-band FEC is weak. Moreover, at a preset fixed coding rate, the bits spent on FEC for the previous frame compete with the bits available for coding the current frame: the more bits FEC consumes, the fewer bits remain for the current frame's speech, and the lower the quality of the normally decoded speech of the current frame. The inventor found through experiments that, for the same audio signal at the same coding-rate setting, the speech quality produced by the speech encoder drops noticeably when in-band FEC is switched on compared with when it is off, precisely because in-band FEC takes bits away from the audio coding once it is enabled.
Therefore, in the embodiment of the application, a dual-coding mode is adopted, so that the problem of weak in-band FEC packet loss resistance can be solved, and the problem of voice coding quality reduction caused by network bandwidth competition is also avoided.
The coding parameters are parameters for performing speech coding, and determine speech coding quality. The coding parameter may be a code rate or a sampling rate of the speech coding, or other coding parameters. The sampling rate is the sampling precision used to sample an analog sound signal to obtain a digital speech signal, and the code rate is the amount of data of a code stream transmitted per second. It can be understood that the higher the sampling rate, the more truly the sound details in the original sound signal can be preserved, and the higher the speech quality of the decoded speech signal obtained after decoding. Similarly, the higher the code rate, the more voice details in the digital voice signal can be preserved, and the higher the voice quality of the decoded voice signal obtained after decoding. The first coding parameter and the second coding parameter may be different sampling rates or different code rates.
The voice coding of the first path adopts a first coding parameter, the voice coding of the second path adopts a second coding parameter, and the corresponding voice coding quality of the first coding parameter is higher than that of the second coding parameter. For example, the coding parameter is a code rate, and a first code rate used by the first path is greater than a second code rate used by the second path. For another example, the coding parameter is a sampling rate, and a first sampling rate used by the first path is greater than a second sampling rate used by the second path.
The sending end performs speech coding on the voice signal according to the first coding parameter to obtain the first coded code stream, which is the high-quality stream, and codes the same signal according to the second coding parameter to obtain the second coded code stream, whose speech coding quality is lower than that of the first. When packets of the first code stream are lost, even more than one consecutive frame, the speech information of the lost frames can be repaired from the second code stream, which overcomes the weak packet-loss resistance of the in-band coding mode. In addition, the two code streams use independent coding parameters rather than sharing one bit budget, so the two coding paths have no network bandwidth competition and the second path cannot degrade the speech quality of the first, avoiding the quality loss caused by bandwidth competition.
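The dual-coding idea can be illustrated with a minimal sketch (Python; `encode` is a stand-in for a real speech codec, and the bitrates are assumed example values):

```python
def encode(segments, bitrate):
    # Stand-in for a speech codec: tags each segment with the bitrate used.
    return [{"seq": i, "bitrate": bitrate, "data": s}
            for i, s in enumerate(segments)]

def repair(first_stream, second_stream):
    # Fill packet-loss gaps in the first stream with frames from the second.
    return [f if f is not None else second_stream[i]
            for i, f in enumerate(first_stream)]

segments = ["s0", "s1", "s2", "s3"]
first = encode(segments, 24000)   # first path: higher-quality parameters
second = encode(segments, 8000)   # second path: lower-quality parameters
first[1] = first[2] = None        # two consecutive losses in the first stream
decoded = repair(first, second)   # lost frames repaired from the second stream
```

Unlike in-band FEC, the two consecutive lost frames are both recoverable here, because the second stream is a complete independent encoding of the same signal.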
In one embodiment, step 304 includes: inputting a speech signal into a first speech encoder; and performing voice coding on the voice signals according to the first coding parameters through a first voice coder to obtain first coding code streams corresponding to the voice signals.
The first encoder may be referred to as a master encoder, and the master encoder is configured to obtain a high-quality speech encoding code stream. In an embodiment, the first encoder at the transmitting end may extract the speech coding feature parameters of each speech segment in the speech signal according to the first coding parameters, encode the extracted speech coding feature parameters, and generate the coded data corresponding to each speech segment, for example, the first encoder at the transmitting end extracts the speech coding feature parameters of each speech segment through some speech signal processing models (such as a filter, a feature extractor, and the like), and then encodes the speech coding feature parameters to obtain the first coded code stream.
In one embodiment, step 306 includes: inputting the speech signal into a second speech coder; and performing voice coding on the voice signals according to the second coding parameters through a second voice coder to obtain a second coding code stream corresponding to the voice signals, wherein the voice coding quality of the second coding code stream is lower than that of the first coding code stream.
Correspondingly, the second encoder may be referred to as a spare encoder, and the second encoder is configured to generate a second encoded code stream with a speech encoding quality slightly lower than that of the first encoded code stream, and the second encoded code stream is configured to resist network packet loss occurring in a network transmission process of the first encoded code stream. Similarly, in an embodiment, the second encoder at the transmitting end extracts the speech coding feature parameter of each speech segment in the speech signal according to the second coding parameter, encodes the extracted speech coding feature parameter, generates coded data corresponding to each speech segment, and obtains a second coded code stream.
In the related art there is also packet-level out-of-band FEC coding: n-k redundant packets are generated from the preceding k original data packets, and the n-k redundant packets together with the k data packets form a block of n packets that is transmitted to the receiving end. The more redundant packets there are, the longer the bursts of consecutive loss that can be covered, so covering most packet-loss scenarios requires configuring substantial redundancy. In a conventional network scenario, however, packet loss is random, it is difficult to determine whether a given packet will be lost, and the lost packets account for only a small fraction of all packets transmitted; most of the out-of-band FEC redundancy is therefore meaningless when no loss occurs, needlessly consuming network resources.
Therefore, the embodiment of the application is further improved, and the first coding parameter and the second coding parameter can be flexibly set, so that the effective utilization rate of network bandwidth resources is improved through the dynamic adjustment of double coding and coding parameters.
In one embodiment, the method further comprises: acquiring a voice quality index corresponding to a target scene; the first encoding parameter is determined according to the voice quality index.
In this embodiment, the first encoding is mainly responsible for voice data transmission, so the first encoding parameter may be set based on an actual service, that is, if the requirement on the voice quality is high, the voice encoding quality corresponding to the set first encoding parameter is higher, and if the requirement on the voice quality is not high, the voice encoding quality corresponding to the set first encoding parameter is relatively lower, and the first encoding parameter may be flexibly set according to the voice quality index corresponding to the target scene. The first encoding parameter in the first path of speech encoding process may be fixedly set, or may be dynamically adjusted.
In one embodiment, as shown in FIG. 5, step 306, performing speech coding on the voice signal according to the second coding parameter to obtain a second coded code stream whose speech coding quality is lower than that of the first coded code stream, includes the following steps:
step 502, determining a second encoding parameter corresponding to a current speech segment to be encoded in the speech signal, wherein the speech encoding quality corresponding to the second encoding parameter is lower than the speech encoding quality corresponding to the first encoding parameter.
In this embodiment, the second path of speech coding mainly serves to counter packet loss in the first path and repair its lost data; it contributes little when the first coded code stream suffers no loss. Therefore, to save network bandwidth resources, the sending end does not require high speech coding quality for the second path, that is, the second coding parameter it adopts may correspond to a speech coding quality lower than that of the first coding parameter. In addition, the second coding parameter can be adjusted flexibly per speech segment, so for the current segment to be coded the sending end first needs to determine the corresponding second coding parameter.
And step 504, performing speech coding on the current speech segment according to the second coding parameter to obtain a second coding frame corresponding to the current speech segment.
Step 506, obtaining a second coding code stream according to the second coding frame corresponding to each voice segment in the voice signal.
In this embodiment, the sending end determines a second coding parameter corresponding to the speech segment in the speech signal, and performs speech coding on the speech segment according to the corresponding second coding parameter, so as to obtain a second coding code stream with a speech coding quality lower than that of the first coding code stream.
In one embodiment, as shown in fig. 6, step 502, determining a second encoding parameter corresponding to a current speech segment to be encoded in the speech signal includes:
step 602, performing voice activity detection on a current voice segment to be coded in the voice signal to obtain a voice activity corresponding to the current voice segment.
Voice Activity Detection (VAD), also called voice endpoint detection or voice boundary detection, is used to identify low-activity noise or silence segments in a speech signal. For the current segment to be coded, the sending end can run VAD to obtain the segment's voice activity. In principle, a higher activity means the segment contains voice content and should be coded with a second coding parameter corresponding to higher speech coding quality; conversely, a low activity (for example a VAD result of 0) means the segment contains no voice content and can be coded with a second coding parameter corresponding to lower quality, saving network transmission bandwidth overall. In one embodiment, the sending end performs voice activity detection on every segment to be coded in the voice signal and determines each segment's second coding parameter from its activity.
Step 604, determining a previous voice segment corresponding to the current voice segment from the voice signal.
The previous speech segments corresponding to the current segment are the already encoded segments adjacent to and before it, for example the 4 consecutive segments immediately preceding the current one. Since each segment's voice activity is detected during its own coding, the sending end can directly reuse the activity values of those 4 segments when coding the current segment.
Step 606, determining a second coding parameter corresponding to the current speech segment according to the speech activity corresponding to the current speech segment and the speech activity corresponding to the previous speech segment.
In this embodiment, in addition to detecting the voice activity corresponding to the current voice segment to be encoded, the sending end may also check a previous voice segment corresponding to the current voice segment, that is, the voice activity corresponding to the historical voice segment, and determine the second encoding parameter corresponding to the current voice segment according to the voice activity corresponding to the current voice segment and the voice activity corresponding to the previous voice segment. Optionally, when both the voice activity corresponding to the previous voice segment and the voice activity corresponding to the current voice segment are lower than the threshold, that is, the voice activity corresponding to a plurality of consecutive voice segments is not high, the second encoding parameter may be adjusted, that is, the voice segment is voice-encoded using the second encoding parameter corresponding to the lower voice encoding quality, so as to save the network transmission bandwidth as a whole.
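One way this selection rule might look, assuming binary VAD flags and hypothetical example rates (a sketch, not the claimed implementation):

```python
def second_coding_param(current_activity, prev_activities,
                        low_quality_param, high_quality_param, threshold=1):
    # Use the lower-quality second coding parameter only when the current
    # segment AND all of its preceding segments fall below the activity
    # threshold (sustained silence or noise), saving transmission bandwidth.
    if current_activity < threshold and all(a < threshold
                                            for a in prev_activities):
        return low_quality_param
    return high_quality_param

# Binary VAD flags for the 4 preceding segments plus the current one:
rate = second_coding_param(0, [0, 0, 0, 0], 8000, 16000)  # sustained silence
```

A single active segment anywhere in the window keeps the higher-quality parameter, so a brief pause mid-sentence does not immediately drop the coding quality.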
In one embodiment, as shown in fig. 7, step 502, determining a second encoding parameter corresponding to a current speech segment to be encoded in the speech signal includes:
step 702, obtaining the packet loss state information fed back by the receiving end.
Step 704, determining a second coding parameter corresponding to the current voice segment according to the packet loss state information and the voice activity corresponding to the current voice segment to be coded in the voice signal.
The packet loss state information is information reflecting a current packet loss state of the receiving end, and may be a packet loss rate. In this embodiment, in addition to detecting the voice activity corresponding to the current voice segment to be encoded, the sending end may also check packet loss state information fed back by the current receiving end, and dynamically adjust the second encoding parameter according to the packet loss state information. For example, the higher the packet loss rate is, the less the decoding end obtains the speech information, in order to ensure the speech quality of the subsequent decoding, the second encoding parameter corresponding to the higher speech encoding quality needs to be used to perform the second path of speech encoding on the speech segment, so that the second encoded frame contains more speech information, and thus, the better speech quality can be obtained when the second encoded frame is subsequently used to repair the lost first encoded frame. On the contrary, the lower the packet loss rate is, the less the receiving end uses the second encoding code stream to repair the first encoding code stream, so that in order to save network bandwidth resources, the second encoding parameter corresponding to the lower speech encoding quality can be used to perform speech encoding on the speech segment.
In an embodiment, when the voice activity of the current segment is above the threshold (for example, an activity of 1, indicating that the segment carries important voice content), a higher packet loss rate calls for coding the segment with a second coding parameter corresponding to higher speech coding quality, while a lower packet loss rate allows a second coding parameter corresponding to lower quality. When the activity of the current segment is below the threshold (for example, an activity of 0), the segment can be coded directly with the second coding parameter corresponding to lower quality.
In an embodiment, the sending end may also determine the second encoding parameter corresponding to the current speech segment only according to the packet loss state information after obtaining the packet loss state information fed back by the receiving end.
The voice transmission links of the first coding code stream and the second coding code stream can be in the same link or different links, and the anti-packet loss capability is stronger when the transmission links are different. In one embodiment, the obtaining packet loss state information fed back by the receiving end includes: when the first coding code stream and the second coding code stream are transmitted to a receiving end through different voice transmission links, packet loss state information which is fed back by the receiving end and related to the first coding code stream is obtained; when the first code stream and the second code stream are transmitted to the receiving end through the same voice transmission network link, the common packet loss state information of the first code stream and the second code stream fed back by the receiving end is obtained.
In this embodiment, when the first encoded code stream and the second encoded code stream are transmitted to the receiving end through different voice transmission links, the transmitting end dynamically adjusts the second encoding parameter according to the packet loss state information of the first encoded code stream, and when the first encoded code stream and the second encoded code stream are transmitted to the receiving end through the same voice transmission link, the transmitting end adjusts the second encoding parameter according to the packet loss state information integrated on the voice transmission link.
In one embodiment, the second coding parameter corresponding to the current speech segment is determined by the following formulas:

when VAD = 1, BitRate_2 = B_min + (BitRate_1 - B_min) × min(1, a × LostRate);

when VAD = 0, BitRate_2 = B_min.

Here VAD may be the voice activity of the current segment alone, or the combined activity of the current segment and several preceding segments. BitRate_1 and BitRate_2 are the first and second coding parameters respectively, for example coding rates. B_min is the set minimum speech coding parameter, for example a minimum speech coding rate. LostRate is the packet loss state information currently fed back by the receiving end, for example the packet loss rate of the first coded code stream counted by the receiving end, or the common packet loss rate of the first and second coded code streams. a is a weighting coefficient, which may range from 3 to 10.
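The formulas above translate directly into code (a sketch; the weighting coefficient a is assumed to be 5, within the stated range of 3 to 10, and the rates are example values):

```python
def second_bitrate(vad, bitrate_1, b_min, lost_rate, a=5):
    # BitRate_2 per the formulas above: the packet loss rate scales the
    # second rate between B_min and BitRate_1, saturating at BitRate_1.
    if vad == 1:
        return b_min + (bitrate_1 - b_min) * min(1, a * lost_rate)
    return b_min

r_heavy_loss = second_bitrate(1, 24000, 6000, 0.20)  # saturates at BitRate_1
r_light_loss = second_bitrate(1, 24000, 6000, 0.05)  # between B_min and BitRate_1
r_inactive = second_bitrate(0, 24000, 6000, 0.50)    # floor: B_min
```

With a = 5, any loss rate of 20% or more drives the second stream all the way up to the first stream's rate, while silence always stays at the minimum rate regardless of loss.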
And 308, respectively transmitting the first coding code stream and the second coding code stream to a receiving end, and when the receiving end confirms that the first coding code stream has packet loss, obtaining a decoding voice signal corresponding to the packet loss position according to the second coding code stream.
The packet loss position is the position of a lost frame in the first coded code stream. Specifically, when the receiving end receives the first coded code stream transmitted over the network, it performs packet loss detection on it. When packet loss is found in the first code stream, the receiving end further checks whether the second coded frame corresponding to the loss position has been received in the second coded code stream; if so, that second coded frame is used to obtain the decoded voice signal corresponding to the loss position.
In an embodiment, when the first encoded code stream received by the receiving end has no packet loss, the receiving end directly and sequentially performs speech decoding on the first encoded frames in the received first encoded code stream to obtain a decoded speech signal.
In an embodiment, the receiving end may check whether packet sequence numbers corresponding to data packets in the received first encoded code stream are continuous when the receiving end receives the first encoded code stream, and determine that a data packet corresponding to a missing packet sequence number is lost when the packet sequence numbers corresponding to the received data packets are discontinuous, where the data packet corresponding to the packet sequence number is a packet loss position. Similarly, when receiving the second encoded code stream, the receiving end may check whether the packet sequence numbers corresponding to the data packets in the received second encoded code stream are consecutive, and when the packet sequence numbers corresponding to the received data packets are not consecutive, determine that the data packets corresponding to the missing packet sequence numbers are lost, where the data packets corresponding to the packet sequence numbers are the packet loss positions.
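The sequence-number check described here can be sketched as follows (illustrative Python; real data packets would carry the sequence number in a header field):

```python
def detect_losses(received_seqs):
    # Packet sequence numbers with gaps mark the packet-loss positions.
    seen = set(received_seqs)
    lo, hi = min(seen), max(seen)
    return [s for s in range(lo, hi + 1) if s not in seen]

losses = detect_losses([0, 1, 3, 4, 6])  # packets 2 and 5 are missing
```

The same routine applies to both code streams; comparing the two loss lists tells the receiving end which lost first-stream positions have a surviving second-stream counterpart.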
It should be noted that in the embodiment of the present application the first and second coded code streams are two independent streams. In conventional network transmission, the lost packets of each path account for only a small proportion of the data packets sent, and when a loss occurs at some position in the first coded code stream, the probability that the second stream loses a packet at the same position is comparatively low, so the dual-coding mode provided in this embodiment delivers fairly stable packet-loss resistance.
In an embodiment, the sending end may respectively transmit the first encoded code stream and the second encoded code stream to the receiving end by using the same voice transmission link, or may respectively transmit the first encoded code stream and the second encoded code stream to the receiving end by using different voice transmission links.
The packing modes of the first and second coded code streams can also be set flexibly; both streams are packed before transmission to the receiving end, so the effective utilization of network bandwidth resources is further improved by dynamically adjusting both the dual coding and the packing modes. Specifically, after performing speech coding on the same voice signal according to different coding parameters to obtain the first and second coded code streams, the sending end may pack them in different grouping modes, for example transmitting 2 coded frames per data packet or 4 coded frames per data packet.
In one embodiment, the step of transmitting the first encoded code stream to the receiving end includes: acquiring current network bandwidth information of a voice transmission link; determining a first group of packet modes according to the network bandwidth information; packing first coding frames corresponding to each voice fragment in a first coding code stream according to a first group of packet mode to obtain at least one first coding data packet; at least one first encoded data packet is transmitted to a receiving end.
In this embodiment, since the first encoding channel is mainly responsible for voice data transmission, the group packing mode of the first encoding code stream may be set based on actual services. For example, if the voice data transmission is required to be fast or the current transmission network is in good condition, each encoded data packet may include a larger number of first encoded frames. The group packing mode of the first encoding code stream may be fixedly set according to the network bandwidth information, or may be dynamically adjusted according to one of the network bandwidth information, the packet loss state information, or the actual service requirement.
In one embodiment, the step of transmitting the second encoded code stream to the receiving end includes: acquiring packet loss state information fed back by a receiving end; determining a second group of packet modes according to the packet loss state information; packing second coding frames corresponding to each voice segment in a second coding code stream according to a second group of packet modes to obtain at least one second coding data packet; and transmitting the at least one second coded data packet to a receiving end.
In this embodiment, the second path of speech coding mainly functions to resist the packet loss condition of the first path of speech coding and repair the packet loss data of the first path of speech coding, so that the packet packing mode of the second coding code stream can be flexibly adjusted to improve the packet loss resistance. Specifically, the sending end may adjust a packet packing mode of the second encoded code stream according to packet loss state information fed back by the receiving end, for example, a packet loss rate or a network packet loss type.
The network packet-loss type is determined from the consecutive-loss counts gathered by the receiving end. For example, if, across several adjacent time segments, bursts of M consecutive lost packets are the most common, the second coded code stream can be packed with M+1 second coded frames per packet before transmission. Say there are 4 adjacent time segments: a burst of N packets is lost in the 1st segment and a burst of M packets in each of the 2nd, 3rd and 4th segments. Bursts of N then account for 1/4 and bursts of M for 3/4, so the network packet-loss type is that bursts of M dominate, and to improve packet-loss resistance the packing mode for the second coded code stream is set to M+1 second coded frames per data packet before transmission to the receiving end.
Of course, in the definition of the network packet loss type, the number of adjacent time segments and the duration of each time segment may be set according to actual needs.
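The selection of the packing mode from the dominant burst length might be sketched like this (illustrative Python; the per-segment loss flags are hypothetical inputs, with 1 marking a lost packet):

```python
from collections import Counter

def burst_lengths(loss_flags):
    # Lengths of runs of consecutively lost packets (1 = lost) in one segment.
    runs, run = [], 0
    for lost in loss_flags:
        if lost:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return runs

def frames_per_packet(segments):
    # Most frequent burst length M across the adjacent time segments
    # -> pack M + 1 second coded frames into each data packet.
    counts = Counter(l for seg in segments for l in burst_lengths(seg))
    m = counts.most_common(1)[0][0]
    return m + 1

# 4 adjacent segments: one burst of 3, then three bursts of 2 -> pack 3 frames.
mode = frames_per_packet([[1, 1, 1, 0], [0, 1, 1, 0],
                          [1, 1, 0, 0], [0, 0, 1, 1]])
```

Packing M+1 frames per packet means that even if a whole burst of M packets is lost, at least one surviving packet in the neighborhood still carries each affected frame.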
In one embodiment, when a receiving end receives a first coded code stream, a first decoder corresponding to a first encoder is used for performing speech decoding on the first coded code stream to obtain a corresponding decoded speech signal. Meanwhile, the receiving end receives the second coding code stream, a second decoder corresponding to the second encoder is adopted to perform voice decoding on the second coding code stream to obtain a corresponding decoding voice signal, and the decoding voice signal obtained by decoding of the second encoder is utilized to recover the decoding voice signal corresponding to the packet loss position of the first coding code stream.
In an embodiment, the receiving end may determine a packet loss position in the first encoded code stream according to the received first encoded code stream, and when the second encoded code stream is received, only the second encoded frame corresponding to the packet loss position in the first encoded code stream is subjected to speech decoding, so as to obtain a speech decoding signal corresponding to the packet loss position, which does not require speech decoding on the entire second encoded code stream, and thus saves computational resources.
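The selective decoding described here can be sketched as follows (illustrative Python; `decode_1` and `decode_2` are stand-ins for the first and second speech decoders):

```python
def decode_with_repair(first_stream, second_stream, decode_1, decode_2):
    # Decode the first stream frame by frame; only at packet-loss positions
    # is the matching second-stream frame decoded, so the bulk of the second
    # stream never needs decoding.
    out = []
    for i, frame in enumerate(first_stream):
        if frame is not None:
            out.append(decode_1(frame))
        elif i < len(second_stream) and second_stream[i] is not None:
            out.append(decode_2(second_stream[i]))
        else:
            out.append(None)  # lost in both streams: left for concealment
    return out

hq = lambda f: ("hq", f)   # stand-in first decoder
lq = lambda f: ("lq", f)   # stand-in second decoder
signal = decode_with_repair(["a", None, "c"], ["a", "b", "c"], hq, lq)
```

Only the frame at the single loss position passes through the second decoder, matching the computational saving claimed in this embodiment.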
Fig. 8 is a schematic diagram of a dual coding scheme in an embodiment. Referring to fig. 8, the voice signal to be encoded acquired by the sending end is speech-coded by a first encoder according to the first coding parameter to obtain the first coded code stream, and by a second encoder according to the second coding parameter to obtain the second coded code stream; both streams are transmitted to the receiving end over a voice transmission link. The receiving end performs speech decoding on the received first coded code stream with a first decoder and on the received second coded code stream with a second decoder, performs packet loss detection on both streams, and, when packet loss occurs in the first stream, repairs the decoded voice signal at the loss positions using the second coded code stream to obtain the final decoded voice signal.
Fig. 9 is a schematic flow chart of dual coding in a specific embodiment. Referring to fig. 9, the sending end obtains the voice signal to be encoded and performs voice activity detection on it to obtain the activity of each speech segment. Coding rates are then configured for the two encoders: the first coding rate of the first encoder can be set, or dynamically adjusted, according to actual service requirements, while the second coding rate of the second encoder is dynamically adjusted according to the packet loss rate fed back by the receiving end, being raised when the loss rate is high and lowered when it is low. The first encoder then performs speech coding at the configured first coding rate to obtain the first coded code stream; the first packing mode for that stream can likewise be set or dynamically adjusted according to actual service requirements, and the stream is packed accordingly into the first coded data packets to be transmitted. The second encoder performs speech coding at the configured second coding rate to obtain the second coded code stream, whose packing mode is dynamically adjusted according to the network packet-loss type fed back by the receiving end, so that the stream is packed in the second packing mode into the second coded data packets to be transmitted. The first and second coded data packets may be transmitted to the receiving end over the same voice transmission link or over different links.
After obtaining the first coding data packets and the second coding data packets, the receiving end unpacks them to recover the first coding code stream and the second coding code stream respectively, and performs packet loss detection on the two received code streams to obtain network packet loss state information, including the packet loss rate, the network packet loss type, and the like. When detecting that packet loss occurs in the first coding code stream, the receiving end repairs the first decoded voice signal corresponding to the packet loss position by using the second decoded voice signal, and outputs the final decoded voice signal after the repair.
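The receiver-side choice between the two streams can be illustrated with a toy sketch. The frame maps and labels here are hypothetical; in a real system each stream's packets carry sequence numbers, and a missing number indicates a lost frame.

```python
def select_frames(first_stream, second_stream, n_frames):
    """first_stream / second_stream map frame index -> encoded payload;
    a missing key means that frame's packet was lost. Returns, per frame,
    which source the decoder should use."""
    out = []
    for k in range(n_frames):
        if k in first_stream:
            out.append(('primary', first_stream[k]))     # high-quality path
        elif k in second_stream:
            out.append(('redundant', second_stream[k]))  # repair from 2nd stream
        else:
            out.append(('plc', None))  # neither arrived: concealment only
    return out
```

Frames marked `'redundant'` are the ones the smoothing logic of the later embodiments applies to.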
The voice transmission method adopts a dual-coding mode in which the voice coding quality corresponding to the first coding parameter is higher than that corresponding to the second coding parameter: coding the voice signal according to the first coding parameter yields the higher-quality stream, and coding it according to the second coding parameter yields the lower-quality one. Because two coding paths are used, the second coding code stream with lower voice coding quality serves as redundant information for the first coding code stream, and when the first coding code stream loses a packet, the voice information of the lost frame can be repaired from the second coding code stream; compared with an in-band FEC mode, the anti-packet-loss performance is stronger. In addition, because neither coding code stream needs additional FEC redundant information, and the second coding code stream is coded and transmitted independently of the first coding code stream while serving as its redundant information, the reduction of voice coding quality caused by competition for network bandwidth is avoided.
In an embodiment, the receiving end may directly use the decoded speech signal obtained by speech decoding the second coding frame corresponding to the packet loss position to compensate for the decoded speech signal at the packet loss position of the first coding code stream.
In some embodiments, because the first coding code stream and the second coding code stream adopt different coding parameters, the decoded voice signal obtained by the first decoder at the receiving end from the first coding code stream and the decoded voice signal obtained by the second decoder from the second coding code stream are similar in voice content but differ noticeably in sound quality. Therefore, when packet loss occurs in the first coding code stream, the decoded voice signal at the packet loss position needs smooth switching processing in order to keep the voice continuous and ensure the integrity and continuity of the finally obtained decoded voice signal.
In some embodiments, the receiving end may approximately reconstruct the current lost frame from the decoded information of the frames adjacent to the packet loss position using pitch-synchronous repetition, so as to achieve packet loss compensation.
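A minimal sketch of pitch-synchronous repetition, under the assumption of a simple autocorrelation pitch search over the most recent decoded samples and a fixed attenuation factor; both are illustrative choices, not the patent's concrete method.

```python
import numpy as np

def plc_pitch_repeat(history, frame_len, min_lag=32, max_lag=160):
    """Estimate the pitch period of the most recent decoded samples by a
    brute-force autocorrelation search, then tile the last pitch cycle to
    synthesize the lost frame, slightly attenuated to soften repetition
    artifacts. `history` must contain at least 2 * max_lag samples."""
    seg = np.asarray(history, dtype=float)[-2 * max_lag:]
    best_lag, best_corr = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        a = seg[-lag:]                 # most recent `lag` samples
        b = seg[-2 * lag:-lag]         # the preceding `lag` samples
        c = float(np.dot(a, b))        # unnormalized autocorrelation
        if c > best_corr:
            best_corr, best_lag = c, lag
    cycle = np.asarray(history, dtype=float)[-best_lag:]
    reps = int(np.ceil(frame_len / best_lag))
    return np.tile(cycle, reps)[:frame_len] * 0.9  # mild attenuation
```

For a periodic input the repeated cycle continues the waveform almost seamlessly; for longer loss runs, real codecs additionally fade the output toward silence.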
In one embodiment, respectively transmitting the first coding code stream and the second coding code stream to the receiving end, and obtaining a decoded speech signal corresponding to the packet loss position according to the second coding code stream when the receiving end determines that packet loss occurs in the first coding code stream, includes: transmitting the first coding code stream and the second coding code stream to the receiving end respectively; and when the receiving end determines that packet loss occurs in the first coding code stream, determining, by the receiving end, a second coding frame corresponding to the packet loss position from the second coding code stream, determining a coding frame adjacent to the packet loss position in the received first coding code stream, and obtaining the decoded speech signal corresponding to the packet loss position according to the second coding frame corresponding to the packet loss position and the adjacent coding frame.
The coding frame adjacent to the packet loss position in the first coding code stream is a frame near the packet loss position that was received without loss. Optionally, the packet loss position may be the first lost frame of a segment of continuous packet loss, or the position of a single lost frame.
Specifically, when determining that packet loss occurs in the first coding code stream, the receiving end determines the packet loss position, where the packet loss position may be the first lost frame of a segment of continuous packet loss, or the position of a single lost frame. The receiving end repairs the voice information at the packet loss position by using a first coding frame adjacent to the packet loss position in the first coding code stream and the second coding frame corresponding to the packet loss position in the received second coding code stream. Compared with repairing directly with the voice information obtained by voice decoding the second coding frame corresponding to the packet loss position, this approach also draws on the high-quality voice information adjacent to the packet loss position in the first coding code stream, so the finally decoded voice is more natural and continuous and the voice quality is higher. For example: fig. 10 is a schematic diagram of two encoded code streams in an embodiment. Referring to fig. 10, a voice signal is divided into a plurality of voice segments. A first encoder at the transmitting end performs voice coding on each voice segment to obtain a first coding frame corresponding to that segment, and the first coding frames constitute the first coding code stream. A second encoder performs voice coding on the same voice segments to obtain a second coding frame corresponding to each segment, and the second coding frames constitute the second coding code stream; the first coding code stream and the second coding code stream are respectively sent to the receiving end.
Referring to fig. 10, suppose that in the first coding code stream received by the receiving end, the (k-2)th and (k-1)th coding frames have no packet loss, the kth, (k+1)th, and (k+2)th coding frames have packet loss, and the (k+3)th coding frame has no packet loss, while in the second coding code stream received by the receiving end, the kth coding frame has no packet loss. To repair the lost voice signal corresponding to the kth coding frame in the first coding code stream, the receiving end may use the (k-1)th coding frame in the first coding code stream as the frame for packet loss compensation processing, perform voice decoding on that frame to obtain a compensation voice signal corresponding to the packet loss position, perform voice decoding on the kth coding frame in the second coding code stream to obtain the corresponding second decoded voice signal, and obtain the decoded voice signal corresponding to the packet loss position by combining the second decoded voice signal with the compensation voice signal.
Of course, the receiving end may also reconstruct the compensation voice signal corresponding to the kth position by using the (k+3)th coding frame in the first coding code stream, or by using the (k-2)th and (k-1)th coding frames together, or by using the (k-1)th and (k+3)th coding frames together.
In some embodiments, the transmitting end may pack the coding frames and sequentially transmit the resulting packets to the receiving end. For example, the (k-2)th and (k-1)th coding frames form the (i-1)th data packet, and the kth and (k+1)th coding frames form the ith data packet. If the (i-1)th data packet of the first coding code stream is not lost, the ith data packet is lost, and the ith data packet of the second coding code stream is not lost, the receiving end can repair the voice signal corresponding to the lost ith data packet of the first coding code stream by using the (i-1)th data packet of the first coding code stream and the ith data packet of the second coding code stream.
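The grouping of consecutive frames into packets can be sketched as below; the frames-per-packet value is an assumed parameter (the patent leaves the group packet mode configurable).

```python
def pack(frames, frames_per_packet):
    """Group consecutive encoded frames into transmission packets.
    One lost packet then costs up to `frames_per_packet` frames, which is
    why the packing mode is tuned to the observed loss pattern."""
    return [frames[i:i + frames_per_packet]
            for i in range(0, len(frames), frames_per_packet)]
```

With `frames_per_packet=2`, losing packet i removes frames 2i and 2i+1 from one stream, and the other stream's copy of those frames is used for repair.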
In one embodiment, obtaining the decoded speech signal corresponding to the packet loss position according to the second coding frame corresponding to the packet loss position and the adjacent coding frame includes: performing, by the receiving end, voice decoding according to the coding frame adjacent to the packet loss position to obtain an adjacent voice signal; performing packet loss compensation processing on the adjacent voice signal to obtain a compensation voice signal corresponding to the packet loss position; performing voice decoding on the second coding frame corresponding to the packet loss position to obtain a second decoded voice signal; and performing windowing splicing processing on the compensation voice signal and the second decoded voice signal to obtain the decoded voice signal corresponding to the packet loss position.
For example: referring to fig. 10, suppose that in the first coding code stream received by the receiving end, the (k-2)th and (k-1)th coding frames have no packet loss and the kth to (k+2)th coding frames have packet loss, while in the second coding code stream received by the receiving end, the kth coding frame has no packet loss. To repair the voice signal corresponding to the kth coding frame lost in the first coding code stream, the receiving end may use the (k-1)th coding frame in the first coding code stream as the frame for packet loss compensation processing: it performs voice decoding on that frame to obtain the corresponding decoded voice signal, and then obtains a compensation voice signal X1 corresponding to the kth coding frame, namely the packet loss position, by a packet loss compensation method such as pitch reconstruction. The receiving end performs voice decoding on the kth coding frame in the second coding code stream to obtain the corresponding second decoded voice signal X2, and obtains the decoded voice signal of the first packet loss frame after performing windowing splicing processing on X1 and X2. The windowing splicing processing can be implemented with window functions such as a Hanning window, a rectangular window, or a Hamming window.
FIG. 11 is a diagram illustrating a Hanning window in one embodiment. The windowing splicing process for the compensated speech signal X1 and the second decoded speech signal X2 may employ the following formula:
X(n) = w(n) · X1(n) + (1 − w(n)) · X2(n), 0 ≤ n < N

wherein X1(n) is the compensation voice signal corresponding to the kth coding frame, namely the packet loss position, in the first coding code stream received by the receiving end; X2(n) is the second decoded voice signal obtained by performing voice decoding on the kth coding frame in the second coding code stream; X(n) is the decoded voice signal corresponding to the packet loss position after windowing splicing processing; N is the frame length in samples; and w(n) is a window weight that decays from 1 to 0 across the frame, for example the decaying half of a Hanning window.
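A sketch of this windowed cross-fade, assuming a raised-cosine (Hanning-shaped) fade-out ramp for w(n); the exact window is an implementation choice.

```python
import numpy as np

def splice(x1, x2):
    """Cross-fade from the compensation signal x1 into the second-stream
    decode x2: w(n) falls from 1 toward 0 across the frame, so the output
    starts as pure x1 and ends as (almost) pure x2."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = len(x1)
    w = 0.5 * (1.0 + np.cos(np.pi * np.arange(n) / n))  # raised-cosine ramp
    return w * x1 + (1.0 - w) * x2
```

Because the two weights sum to 1 at every sample, splicing a signal with itself returns it unchanged, which is the property that keeps the transition artifact-free.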
In some embodiments, obtaining the decoded speech signal corresponding to the packet loss position according to the second coding frame corresponding to the packet loss position and the adjacent coding frame includes the following. After the receiving end determines a plurality of continuous packet loss frames in the first coding code stream, it performs voice decoding on the first coding frame adjacent to the first packet loss frame of the run to obtain an adjacent voice signal, performs packet loss compensation processing on the adjacent voice signal to obtain a compensation voice signal corresponding to the position of the first packet loss frame, performs voice decoding on the second coding frame corresponding to that position to obtain a second decoded voice signal, and performs windowing splicing processing on the compensation voice signal and the second decoded voice signal to obtain the decoded voice signal corresponding to the position of the first packet loss frame. For at least one packet loss frame after the position of the first packet loss frame, the receiving end performs voice decoding on the second coding frame at the corresponding position and replaces the voice signal of the lost frame with the decoded result. For the first coding frame normally received after the run of continuous packet loss frames in the first coding code stream, the receiving end performs voice decoding on the second coding frame at the preceding position in the second coding code stream to obtain an adjacent voice signal, performs packet loss compensation on it to obtain a compensation voice signal, performs voice decoding on the normally received first coding frame to obtain a first decoded voice signal, and performs windowing splicing processing on the compensation voice signal and the first decoded voice signal to obtain the decoded voice signal corresponding to the position of that first coding frame.
That is to say, if continuous packet loss occurs after the kth lost frame in the first coding code stream while the second coding code stream has no packet loss, the other continuous packet loss positions after the kth lost frame can directly use the decoded voice signals at the corresponding positions of the second coding code stream; and when a normally received frame arrives in the first coding code stream again, the smooth splicing processing needs to be performed in a similar manner, so that the decoded voice content is more natural and continuous.
For example, referring to fig. 10, if continuous packet loss occurs after the kth lost frame in the first coding code stream, that is, the (k+1)th and (k+2)th first coding frames are lost and the (k+3)th first coding frame is not lost, the decoded voice signals corresponding to the (k+1)th and (k+2)th first coding frames may each be replaced by the decoded voice signal of the second coding frame at the corresponding position in the second coding code stream.
Although the (k+3)th first coding frame in the first coding code stream is not lost, its decoded voice signal cannot be used directly, in consideration of the natural connection of the voice. The decoded voice signal corresponding to the (k+3)th first coding frame needs to be obtained as follows: perform voice decoding on the (k+2)th second coding frame in the second coding code stream to obtain an adjacent voice signal; perform packet loss compensation on the adjacent voice signal to obtain a compensation voice signal; perform voice decoding on the (k+3)th first coding frame in the first coding code stream to obtain a first decoded voice signal; and perform windowing splicing processing on the compensation voice signal and the first decoded voice signal to obtain the decoded voice signal corresponding to the (k+3)th coding frame. The recovery processing of the continuous packet loss from the kth frame to the (k+2)th frame then ends at the (k+3)th frame.
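The frame-by-frame stitching decisions described above can be sketched as a small planner. The action labels are invented for illustration: 'splice_in' means cross-fading concealment into the redundant decode at the start of a loss run, 'redundant' means directly substituting the second-stream decode, and 'splice_out' means cross-fading back into the first good primary frame.

```python
def repair_plan(first_ok, second_ok):
    """Given per-frame arrival flags for the primary and redundant streams,
    decide how the decoder stitches the two streams together."""
    plan, in_run = [], False
    for ok1, ok2 in zip(first_ok, second_ok):
        if ok1:
            # first good primary frame after a loss run still needs smoothing
            plan.append('splice_out' if in_run else 'primary')
            in_run = False
        elif ok2:
            plan.append('redundant' if in_run else 'splice_in')
            in_run = True
        else:
            plan.append('plc')  # both streams lost: concealment only
            in_run = True
    return plan
```

For the fig. 10 example (frames k to k+2 lost on the primary stream, all present on the redundant one), the plan is splice-in at frame k, substitution at k+1 and k+2, and splice-out at k+3.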
In the above embodiment, by smoothly splicing between the second coding code stream and the first coding code stream of the dual coded streams, even if continuous multi-frame packet loss occurs in the first coding code stream, the second coding code stream can be used for replacement and repair while keeping the repaired voice signal continuous and natural, so the voice decoding quality is improved, which works well in real-time voice call scenarios.
In a specific embodiment, the voice transmission method comprises the following steps:
1. a sending end acquires a voice signal to be coded;
2. the sending end inputs the voice signal into a first voice coder;
3. a sending end obtains a voice quality index corresponding to a target scene;
4. the sending end determines a first coding parameter according to the voice quality index;
5. the sending end carries out voice coding on the voice signals according to the first coding parameters through a first voice coder to obtain a first coding code stream corresponding to the voice signals;
6. a sending end acquires the current network bandwidth information of a voice transmission link;
7. the sending end determines a first group of packet modes according to the network bandwidth information;
8. a sending end packs first coding frames corresponding to each voice segment in a first coding code stream according to a first pack mode to obtain at least one first coding data pack;
9. the sending end transmits at least one first coded data packet to the receiving end;
10. a sending end carries out voice activity detection on a current voice segment to be coded in a voice signal to obtain voice activity corresponding to the current voice segment;
11. a sending end determines a previous voice segment corresponding to a current voice segment from a voice signal;
12. the sending end determines a second coding parameter corresponding to the current voice segment according to the packet loss rate fed back by the receiving end, the voice activity corresponding to the current voice segment and the voice activity corresponding to the previous voice segment, wherein the voice coding quality corresponding to the second coding parameter is lower than the voice coding quality corresponding to the first coding parameter;
13. the sending end inputs the voice signal into a second voice coder;
14. the sending end carries out voice coding on the voice signals according to the second coding parameters through a second voice coder to obtain a second coding code stream corresponding to the voice signals, wherein the voice coding quality of the second coding code stream is lower than that of the first coding code stream;
15. a sending end obtains a network packet loss type fed back by a receiving end;
16. the sending end determines a second group of packet modes according to the network packet loss type;
17. the sending end packs second coding frames corresponding to each voice segment in the second coding code stream according to the second group packet mode to obtain at least one second coding data packet;
18. the transmitting end transmits at least one second coded data packet to the receiving end;
19. the sending end respectively transmits the first coding code stream and the second coding code stream to the receiving end;
20. when the first coding code stream has packet loss at a receiving end, determining coding frames adjacent to the packet loss position in the received first coding code stream, performing voice decoding according to the coding frames adjacent to the packet loss position to obtain adjacent voice signals, and performing packet loss compensation processing on the adjacent voice signals to obtain compensation voice signals corresponding to the packet loss position;
21. the receiving end determines a second coding frame corresponding to the packet loss position from the second coding code stream, and performs voice decoding on the second coding frame corresponding to the packet loss position to obtain a second decoding voice signal;
22. the receiving end performs windowing splicing processing on the compensation voice signal and the second decoded voice signal to obtain a decoded voice signal corresponding to the packet loss position.
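As a rough illustration of why the redundant second stream helps, here is a toy simulation with two independently lossy links; the loss model and parameters are assumptions for the example, not taken from the patent.

```python
import random

def simulate(n_frames, loss_p, seed=0):
    """Send every frame on both streams over two independently lossy links;
    count frames rescued by the redundant stream versus lost on both."""
    rng = random.Random(seed)
    rescued = unrecoverable = 0
    for _ in range(n_frames):
        lost_primary = rng.random() < loss_p
        lost_redundant = rng.random() < loss_p
        if lost_primary and not lost_redundant:
            rescued += 1          # repairable from the second code stream
        elif lost_primary and lost_redundant:
            unrecoverable += 1    # only concealment (PLC) remains
    return rescued, unrecoverable
```

With independent losses at rate p, a frame is unrecoverable only with probability p squared, which is the intuition behind sending the second, lower-rate copy.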
It should be understood that, although the steps in the above-described flowcharts are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time and may be performed at different times; the order of performing these sub-steps or stages is not necessarily sequential, and they may be performed in turn with, or alternately with, other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, a voice transmission system is provided, which may be the voice transmission system shown in fig. 1, and includes a transmitting end and a receiving end, where:
the sending end is used for obtaining a voice signal to be coded; carrying out voice coding on a voice signal according to a first coding parameter to obtain a first coding code stream; carrying out voice coding on the voice signal according to the second coding parameter to obtain a second coding code stream with the voice coding quality lower than that of the first coding code stream; respectively transmitting the first coding code stream and the second coding code stream to a receiving end;
the receiving end is used for receiving the first coding code stream and the second coding code stream, and when packet loss occurs in the first coding code stream, a decoding voice signal corresponding to a packet loss position is obtained according to the second coding code stream.
In one embodiment, the transmitting end is used for inputting a voice signal into a first voice coder; performing voice coding on the voice signals according to the first coding parameters through a first voice coder to obtain first coding code streams corresponding to the voice signals; inputting the speech signal into a second speech coder; and performing voice coding on the voice signals according to the second coding parameters through a second voice coder to obtain a second coding code stream corresponding to the voice signals, wherein the voice coding quality of the second coding code stream is lower than that of the first coding code stream.
In one embodiment, the sending end is used for obtaining a voice quality index corresponding to a target scene; the first encoding parameter is determined according to the voice quality index.
In one embodiment, the sending end is configured to obtain current network bandwidth information of the voice transmission link; determining a first group of packet modes according to the network bandwidth information; packing first coding frames corresponding to each voice fragment in a first coding code stream according to a first group of packet mode to obtain at least one first coding data packet; at least one first encoded data packet is transmitted to a receiving end.
In one embodiment, the sending end is configured to determine a second encoding parameter corresponding to a current speech segment to be encoded in the speech signal, where a speech encoding quality corresponding to the second encoding parameter is lower than a speech encoding quality corresponding to the first encoding parameter; performing voice coding on the current voice segment according to the second coding parameter to obtain a second coding frame corresponding to the current voice segment; and obtaining a second coding code stream according to the second coding frame corresponding to each voice segment in the voice signal.
In one embodiment, the sending end is configured to perform voice activity detection on a current voice segment to be coded in a voice signal to obtain a voice activity corresponding to the current voice segment; and determining a second coding parameter corresponding to the current voice segment according to the voice activity corresponding to the current voice segment.
In one embodiment, the sending end is configured to perform voice activity detection on a current voice segment to be coded in a voice signal to obtain a voice activity corresponding to the current voice segment; determining a previous voice segment corresponding to the current voice segment from the voice signal; and determining a second coding parameter corresponding to the current voice segment according to the voice activity corresponding to the current voice segment and the voice activity corresponding to the previous voice segment.
In one embodiment, the sending end is configured to obtain packet loss state information fed back by the receiving end; and determining a second coding parameter corresponding to the current voice segment according to the packet loss state information and the voice activity corresponding to the current voice segment to be coded in the voice signal.
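One possible mapping from voice activity and the fed-back loss rate to the second coding parameter is sketched below. The function name, thresholds, and bitrates are assumptions for illustration; the patent only requires that the parameter depend on both inputs.

```python
def secondary_params(activity, loss_rate, base_bps=8000):
    """Pick the redundant encoder's bitrate from voice activity (0..1) and
    the receiver-reported loss rate: spend bits on active speech under
    heavy loss, almost none on silence. All numbers are illustrative."""
    if activity < 0.1:            # treated as silence / comfort noise
        return 2000
    bps = base_bps * (1.0 + min(loss_rate, 0.3))  # cap the loss boost
    return int(bps)
```

Silent segments get a minimal rate because losing them is barely audible, while active segments get extra redundancy when the network is lossy.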
In one embodiment, the sending end is configured to obtain packet loss state information fed back by the receiving end; determining a second group of packet modes according to the packet loss state information; packing second coding frames corresponding to each voice segment in a second coding code stream according to a second group of packet modes to obtain at least one second coding data packet; and transmitting the at least one second coded data packet to a receiving end.
In one embodiment, when the sending end transmits the first coding code stream and the second coding code stream to the receiving end over different voice transmission links, it obtains the packet loss state information about the first coding code stream fed back by the receiving end; when the sending end transmits the two code streams to the receiving end over the same voice transmission link, it obtains the common packet loss state information of the first coding code stream and the second coding code stream fed back by the receiving end.
In one embodiment, the transmitting end is configured to transmit a first encoded code stream and a second encoded code stream to the receiving end respectively; the receiving end is used for receiving the first coding code stream and the second coding code stream, when packet loss occurs in the first coding code stream, determining a second coding frame corresponding to a packet loss position from the second coding code stream, determining a coding frame adjacent to the packet loss position in the received first coding code stream, and obtaining a decoding voice signal corresponding to the packet loss position according to the second coding frame corresponding to the packet loss position and the adjacent coding frame.
In an embodiment, the receiving end is configured to determine a packet loss compensation frame corresponding to the packet loss position according to the encoded frames adjacent to the packet loss position, perform speech decoding on the packet loss compensation frame to obtain a compensation speech signal, perform speech decoding on a second encoded frame corresponding to the packet loss position to obtain a second decoded speech signal, and perform windowing splicing processing on the compensation speech signal and the second decoded speech signal to obtain a decoded speech signal corresponding to the packet loss position.
In an embodiment, the receiving end is configured to, when the first encoded code stream does not have a packet loss, sequentially perform speech decoding on first encoded frames in the first encoded code stream to obtain a decoded speech signal.
The voice transmission system adopts a dual-coding mode in which the voice coding quality corresponding to the first coding parameter is higher than that corresponding to the second coding parameter: coding the voice signal according to the first coding parameter yields the higher-quality stream, and coding it according to the second coding parameter yields the lower-quality one. Because two coding paths are used, the second coding code stream with lower voice coding quality serves as redundant information for the first coding code stream, and when the first coding code stream loses a packet, the voice information of the lost frame can be repaired from the second coding code stream; compared with an in-band FEC mode, the anti-packet-loss performance is stronger. In addition, because neither coding code stream needs additional FEC redundant information, and the second coding code stream is coded and transmitted independently of the first coding code stream while serving as its redundant information, the reduction of voice coding quality caused by competition for network bandwidth is avoided.
In one embodiment, as shown in fig. 12, there is provided a speech transmission apparatus 1200, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: an obtaining module 1202, a first encoding module 1204, a second encoding module 1206, and a transmitting module 1208, wherein:
an obtaining module 1202, configured to obtain a speech signal to be encoded;
the first coding module 1204 is configured to perform speech coding on the speech signal according to the first coding parameter to obtain a first coding code stream;
the second coding module 1206 is configured to perform speech coding on the speech signal according to the second coding parameter, so as to obtain a second coding code stream with lower speech coding quality than the first coding code stream;
a transmission module 1208, configured to transmit the first encoded code stream and the second encoded code stream to a receiving end, where the receiving end obtains a decoded speech signal corresponding to a packet loss position according to the second encoded code stream when it is determined that the first encoded code stream has a packet loss.
In one embodiment, the first encoding module 1204 is configured to input a speech signal into a first speech encoder; and performing voice coding on the voice signals according to the first coding parameters through a first voice coder to obtain first coding code streams corresponding to the voice signals.
In one embodiment, the second encoding module 1206 is used to input the speech signal into a second speech encoder; and performing voice coding on the voice signals according to the second coding parameters through a second voice coder to obtain a second coding code stream corresponding to the voice signals, wherein the voice coding quality of the second coding code stream is lower than that of the first coding code stream.
In an embodiment, the speech transmission apparatus 1200 further includes a first encoding parameter obtaining module, configured to obtain a speech quality indicator corresponding to a target scene; the first encoding parameter is determined according to the voice quality index.
In one embodiment, the transmission module 1208 is further configured to obtain current network bandwidth information of the voice transmission link; determining a first group of packet modes according to the network bandwidth information; packing first coding frames corresponding to each voice fragment in a first coding code stream according to a first group of packet mode to obtain at least one first coding data packet; at least one first encoded data packet is transmitted to a receiving end.
In one embodiment, the second encoding module 1206 is further configured to determine a second encoding parameter corresponding to a current speech segment to be encoded in the speech signal, where a speech encoding quality corresponding to the second encoding parameter is lower than a speech encoding quality corresponding to the first encoding parameter; performing voice coding on the current voice segment according to the second coding parameter to obtain a second coding frame corresponding to the current voice segment; and obtaining a second coding code stream according to the second coding frame corresponding to each voice segment in the voice signal.
In an embodiment, the voice transmission apparatus 1200 further includes a second encoding parameter obtaining module, configured to perform voice activity detection on a current voice segment to be encoded in the voice signal, so as to obtain a voice activity corresponding to the current voice segment; and determining a second coding parameter corresponding to the current voice segment according to the voice activity corresponding to the current voice segment.
In an embodiment, the voice transmission apparatus 1200 further includes a second encoding parameter obtaining module, configured to perform voice activity detection on a current voice segment to be encoded in the voice signal, so as to obtain a voice activity corresponding to the current voice segment; determining a previous voice segment corresponding to the current voice segment from the voice signal; and determining a second coding parameter corresponding to the current voice segment according to the voice activity corresponding to the current voice segment and the voice activity corresponding to the previous voice segment.
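A minimal sketch of the activity-driven rate selection above, assuming voice activity is reported as a value in [0, 1]: the 0.5 activity threshold and the two rates (in kbps) are hypothetical example values, not figures from the patent.

```python
def second_rate_from_activity(curr_activity, prev_activity=None):
    """Pick a second (redundant-stream) coding rate from voice activity.

    Illustrative only: the threshold (0.5) and the rates are
    hypothetical. Considering the previous segment's activity, as the
    embodiment describes, protects transitions into and out of speech.
    """
    active = curr_activity > 0.5
    if prev_activity is not None:
        active = active or prev_activity > 0.5
    # Spend more redundancy bits on active speech, fewer on silence.
    return 6.0 if active else 2.0
```

In this sketch a silent segment preceded by active speech still gets the higher redundant rate, so the tail of a word is protected against packet loss.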
In an embodiment, the voice transmission apparatus 1200 further includes a second encoding parameter obtaining module, configured to obtain packet loss status information fed back by the receiving end; and determining a second coding parameter corresponding to the current voice segment according to the packet loss state information and the voice activity corresponding to the current voice segment to be coded in the voice signal.
In an embodiment, the transmission module 1208 is further configured to obtain packet loss state information fed back by the receiving end; determining a second group of packet modes according to the packet loss state information; packing second coding frames corresponding to each voice segment in a second coding code stream according to a second group of packet modes to obtain at least one second coding data packet; and transmitting the at least one second coded data packet to a receiving end.
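The second grouping mode, driven by packet loss feedback rather than bandwidth, might look like the following sketch. The loss-rate thresholds are hypothetical; the design idea is that under heavy loss, fewer redundant frames per packet means one lost packet removes less of the redundancy.

```python
def choose_second_grouping_mode(loss_rate):
    """Pick frames-per-packet for the redundant stream from feedback.

    Hypothetical policy: heavier observed loss -> fewer frames per
    packet, spreading the redundancy across more packets.
    """
    if loss_rate > 0.10:
        return 1
    elif loss_rate > 0.03:
        return 2
    return 4

def pack_second_stream(frames, loss_rate):
    n = choose_second_grouping_mode(loss_rate)
    return [frames[i:i + n] for i in range(0, len(frames), n)]
```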
In one embodiment, the voice transmission apparatus 1200 further includes a packet loss status information obtaining module, configured to obtain packet loss status information about the first encoded code stream fed back by the receiving end when the first encoded code stream and the second encoded code stream are transmitted to the receiving end through different voice transmission links; when the first code stream and the second code stream are transmitted to the receiving end through the same voice transmission network link, the common packet loss state information of the first code stream and the second code stream fed back by the receiving end is obtained.
In an embodiment, the transmission module 1208 is further configured to transmit the first encoded code stream and the second encoded code stream to the receiving end respectively; when it is determined that packet loss occurs in the first encoded code stream at the receiving end, the receiving end determines a second encoded frame corresponding to the packet loss position from the second encoded code stream, determines an encoded frame adjacent to the packet loss position in the received first encoded code stream, and obtains the decoded speech signal corresponding to the packet loss position according to the second encoded frame corresponding to the packet loss position and the adjacent encoded frame.

The voice transmission apparatus 1200 described above employs a dual coding scheme. The voice coding quality corresponding to the first coding parameter is higher than that corresponding to the second coding parameter, so encoding the voice signal according to the first coding parameter produces a high-quality code stream, while encoding it according to the second coding parameter produces a lower-quality code stream. Because two encodings are used, the second encoded code stream serves as redundant information for the first: when packet loss occurs in the first encoded code stream, the voice information of the lost frame can be repaired from the second encoded code stream. Compared with the in-band FEC approach, this offers stronger resistance to packet loss. In addition, since neither code stream needs to carry extra FEC redundant information, and the second encoded code stream is transmitted alongside the first as its redundancy, the reduction in voice coding quality caused by network bandwidth competition is avoided.
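The receiver-side repair described above can be sketched as follows, under stated assumptions: frames of the two streams are aligned one-to-one, lost first-stream frames are represented as `None`, and `decode_first`/`decode_second` stand in for the two decoders. This is a simplified illustration; the patent's embodiments additionally use adjacent first-stream frames in the reconstruction.

```python
def recover_stream(first_frames, second_frames, decode_first, decode_second):
    """Decode the primary stream, substituting redundant-stream
    frames at packet-loss positions (None entries)."""
    out = []
    for i, frame in enumerate(first_frames):
        if frame is not None:
            out.append(decode_first(frame))                  # normal path
        elif second_frames[i] is not None:
            out.append(decode_second(second_frames[i]))      # repair from redundancy
        else:
            out.append(None)  # both lost: fall back to concealment
    return out
```

With identity decoders this shows the substitution directly: a lost first-stream frame is replaced by the decode of the corresponding second-stream frame.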
For specific limitations of the voice transmission apparatus 1200, reference can be made to the above limitations on the voice transmission method, which are not repeated here. The various modules in the voice transmission apparatus 1200 may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or be independent of, a processor in the computer device, or can be stored in software form in a memory of the computer device, so that the processor can invoke them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 13. The computer device includes a processor, a memory, a communication interface, a display screen, a microphone, and a speaker connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program; the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (Near Field Communication), or other technologies. The computer program, when executed by the processor, implements a voice transmission method. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, a key, a trackball, or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 13 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction among them, any such combination should be considered within the scope of the present specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (27)

1. A method for voice transmission, the method comprising:
acquiring a voice signal to be coded;
performing voice coding on the voice signal according to a first coding code rate to obtain a first coding code stream, wherein the first coding code stream is a high-quality coding code stream for voice data transmission;
performing voice coding on the voice signal according to a second coding rate to obtain a second coding stream with voice coding quality lower than that of the first coding stream, wherein the first coding rate is higher than the second coding rate, the second coding stream is used for resisting network packet loss of the first coding stream in a network transmission process, and the second coding rate corresponding to a voice segment in the voice signal is dynamically adjusted according to voice activity corresponding to the voice segment and packet loss state information currently fed back by a receiving end;
respectively transmitting the first coding code stream and the second coding code stream to the receiving end, determining a second coding frame corresponding to a packet loss position from the second coding code stream when the receiving end confirms that the first coding code stream has packet loss, determining a coding frame adjacent to the packet loss position in the received first coding code stream, and performing voice decoding on the second coding frame corresponding to the packet loss position and the adjacent coding frame to obtain a decoding voice signal corresponding to the packet loss position.
2. The method of claim 1, wherein said speech coding the speech signal at a first coding rate to obtain a first coded stream, comprises:
inputting the speech signal into a first speech encoder;
and carrying out voice coding on the voice signal according to the first coding rate through the first voice coder to obtain a first coding code stream corresponding to the voice signal.
3. The method of claim 2, wherein said performing speech coding on the speech signal at a second coding rate to obtain a second coded stream with a lower speech coding quality than the first coded stream comprises:
inputting the speech signal into a second speech encoder;
and performing voice coding on the voice signal according to the second coding code rate through the second voice coder to obtain a second coding code stream corresponding to the voice signal, wherein the voice coding quality of the second coding code stream is lower than that of the first coding code stream.
4. The method of claim 1, further comprising:
acquiring a voice quality index corresponding to a target scene;
and determining the first coding rate according to the voice quality index.
5. The method of claim 1, wherein the step of transmitting the first encoded code stream to a receiving end comprises:
acquiring current network bandwidth information of a voice transmission link;
determining a first group of packet modes according to the network bandwidth information;
packing the first coding frames corresponding to the voice segments in the first coding code stream according to the first group of packet modes to obtain at least one first coding data packet;
and transmitting the at least one first coded data packet to the receiving end.
6. The method of claim 1, wherein said performing speech coding on the speech signal at a second coding rate to obtain a second coded stream with a lower speech coding quality than the first coded stream comprises:
determining a second coding rate corresponding to a current voice segment to be coded in the voice signal, wherein the voice coding quality corresponding to the second coding rate is lower than the voice coding quality corresponding to the first coding rate;
performing voice coding on the current voice segment according to the second coding rate to obtain a second coding frame corresponding to the current voice segment;
and acquiring a second coding code stream according to a second coding frame corresponding to each voice segment in the voice signal.
7. The method of claim 6, wherein the determining the second coding rate corresponding to the current speech segment to be coded in the speech signal comprises:
performing voice activity detection on a current voice segment to be coded in the voice signal to obtain voice activity corresponding to the current voice segment;
and determining a second coding rate corresponding to the current voice fragment according to the voice activity corresponding to the current voice fragment.
8. The method of claim 6, wherein the determining the second coding rate corresponding to the current speech segment to be coded in the speech signal comprises:
performing voice activity detection on a current voice segment to be coded in the voice signal to obtain voice activity corresponding to the current voice segment;
determining a previous voice segment corresponding to the current voice segment from the voice signal;
and determining a second coding rate corresponding to the current voice fragment according to the voice activity corresponding to the current voice fragment and the voice activity corresponding to the previous voice fragment.
9. The method of claim 6, wherein the determining the second coding rate corresponding to the current speech segment to be coded in the speech signal comprises:
acquiring packet loss state information fed back by the receiving end;
and determining a second coding rate corresponding to the current voice segment according to the packet loss state information and the voice activity corresponding to the current voice segment to be coded in the voice signal.
10. The method of claim 1, wherein the step of transmitting the second encoded code stream to the receiving end comprises:
acquiring packet loss state information fed back by the receiving end;
determining a second group of packet modes according to the packet loss state information;
packing second coding frames corresponding to the voice segments in the second coding code stream according to the second group of packet modes to obtain at least one second coding data packet;
and transmitting the at least one second coded data packet to the receiving end.
11. The method according to claim 10, wherein the obtaining packet loss status information fed back by the receiving end includes:
when the first coding code stream and the second coding code stream are transmitted to the receiving end through different voice transmission links, packet loss state information which is fed back by the receiving end and is related to the first coding code stream is obtained;
when the first coding code stream and the second coding code stream are transmitted to the receiving end through the same voice transmission network link, obtaining common packet loss state information of the first coding code stream and the second coding code stream fed back by the receiving end.
12. The method according to any one of claims 1 to 11, wherein obtaining the decoded speech signal corresponding to the packet loss position after performing speech decoding on the second encoded frame corresponding to the packet loss position and the adjacent encoded frame comprises:
and performing voice decoding according to the coding frame adjacent to the packet loss position through the receiving end to obtain an adjacent voice signal, performing packet loss compensation processing on the adjacent voice signal to obtain a compensation voice signal corresponding to the packet loss position, performing voice decoding on a second coding frame corresponding to the packet loss position to obtain a second decoding voice signal, and performing windowing splicing processing on the compensation voice signal and the second decoding voice signal to obtain a decoding voice signal corresponding to the packet loss position.
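The windowed splicing step in the claim above (combining the concealment-based compensation signal with the decode of the redundant frame) can be illustrated with a simple linear cross-fade. This is a sketch under stated assumptions: real codecs typically use overlap-add windows matched to the codec's frame structure, and the two signals here are plain sample lists of equal length.

```python
def windowed_splice(compensated, decoded_second):
    """Cross-fade (windowed splicing) of two reconstructions of a
    lost frame: fade out the concealment-based signal while fading
    in the redundant-stream decode."""
    n = len(compensated)
    out = []
    for i in range(n):
        w = i / (n - 1) if n > 1 else 1.0  # ramp weight from 0 to 1
        out.append((1.0 - w) * compensated[i] + w * decoded_second[i])
    return out
```

The fade avoids the audible discontinuity that an abrupt switch between the two reconstructions would cause at the frame boundary.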
13. A voice transmission system, the system comprising a transmitting end and a receiving end, wherein:
the sending end is used for acquiring a voice signal to be coded; performing voice coding on the voice signal according to a first coding code rate to obtain a first coding code stream, wherein the first coding code stream is a high-quality coding code stream for voice data transmission; performing voice coding on the voice signal according to a second coding rate to obtain a second coding stream with voice coding quality lower than that of the first coding stream, wherein the first coding rate is higher than the second coding rate, the second coding stream is used for resisting network packet loss of the first coding stream in a network transmission process, and the second coding rate corresponding to a voice segment in the voice signal is dynamically adjusted according to voice activity corresponding to the voice segment and packet loss state information currently fed back by a receiving end; respectively transmitting the first coding code stream and the second coding code stream to the receiving end;
the receiving end is used for receiving the first coding code stream and the second coding code stream, determining a second coding frame corresponding to a packet loss position from the second coding code stream when the first coding code stream is confirmed to have packet loss, determining a coding frame adjacent to the packet loss position in the received first coding code stream, and performing voice decoding on the second coding frame corresponding to the packet loss position and the adjacent coding frame to obtain a decoding voice signal corresponding to the packet loss position.
14. A voice transmission apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a voice signal to be coded;
the first coding module is used for carrying out voice coding on the voice signal according to a first coding code rate to obtain a first coding code stream, and the first coding code stream is a high-quality coding code stream used for voice data transmission;
the second coding module is configured to perform voice coding on the voice signal according to a second coding rate to obtain a second coding stream with voice coding quality lower than that of the first coding stream, where the first coding rate is greater than the second coding rate, the second coding stream is used to resist network packet loss occurring in a network transmission process of the first coding stream, and the second coding rate corresponding to a voice segment in the voice signal is dynamically adjusted according to voice activity corresponding to the voice segment and packet loss state information currently fed back by a receiving end;
the transmission module is configured to transmit the first encoded code stream and the second encoded code stream to the receiving end, respectively, and when the receiving end determines that packet loss occurs in the first encoded code stream, the receiving end determines a second encoded frame corresponding to a packet loss position from the second encoded code stream, determines an encoded frame adjacent to the packet loss position in the received first encoded code stream, and performs voice decoding on the second encoded frame corresponding to the packet loss position and the adjacent encoded frame to obtain a decoded voice signal corresponding to the packet loss position.
15. The apparatus of claim 14, wherein the first encoding module is configured to input the speech signal to a first speech encoder; and carrying out voice coding on the voice signal according to the first coding rate through the first voice coder to obtain a first coding code stream corresponding to the voice signal.
16. The apparatus of claim 15, wherein the second encoding module is configured to input the speech signal to a second speech encoder; and performing voice coding on the voice signal according to the second coding code rate through the second voice coder to obtain a second coding code stream corresponding to the voice signal, wherein the voice coding quality of the second coding code stream is lower than that of the first coding code stream.
17. The apparatus according to claim 14, wherein the voice transmission apparatus further includes a first encoding parameter obtaining module, configured to obtain a voice quality indicator corresponding to a target scene; and determining the first coding rate according to the voice quality index.
18. The apparatus of claim 14, wherein the transmission module is further configured to obtain current network bandwidth information of the voice transmission link; determining a first group of packet modes according to the network bandwidth information; packing the first coding frames corresponding to the voice segments in the first coding code stream according to the first group of packet modes to obtain at least one first coding data packet; and transmitting the at least one first coded data packet to the receiving end.
19. The apparatus of claim 14, wherein the second encoding module is further configured to determine a second encoding rate corresponding to a current speech segment to be encoded in the speech signal, and a speech encoding quality corresponding to the second encoding rate is lower than a speech encoding quality corresponding to the first encoding rate; performing voice coding on the current voice segment according to the second coding rate to obtain a second coding frame corresponding to the current voice segment; and acquiring a second coding code stream according to a second coding frame corresponding to each voice segment in the voice signal.
20. The apparatus according to claim 19, wherein the voice transmission apparatus further includes a second encoding parameter obtaining module, configured to perform voice activity detection on a current voice segment to be encoded in the voice signal, so as to obtain a voice activity corresponding to the current voice segment; and determining a second coding rate corresponding to the current voice fragment according to the voice activity corresponding to the current voice fragment.
21. The apparatus according to claim 19, wherein the voice transmission apparatus further includes a second encoding parameter obtaining module, configured to perform voice activity detection on a current voice segment to be encoded in the voice signal, so as to obtain a voice activity corresponding to the current voice segment; determining a previous voice segment corresponding to the current voice segment from the voice signal; and determining a second coding rate corresponding to the current voice fragment according to the voice activity corresponding to the current voice fragment and the voice activity corresponding to the previous voice fragment.
22. The apparatus according to claim 19, wherein the voice transmission apparatus further includes a second encoding parameter obtaining module, configured to obtain packet loss status information fed back by the receiving end; and determining a second coding rate corresponding to the current voice segment according to the packet loss state information and the voice activity corresponding to the current voice segment to be coded in the voice signal.
23. The apparatus according to claim 14, wherein the transmission module is further configured to obtain packet loss status information fed back by the receiving end; determining a second group of packet modes according to the packet loss state information; packing second coding frames corresponding to the voice segments in the second coding code stream according to the second group of packet modes to obtain at least one second coding data packet; and transmitting the at least one second coded data packet to the receiving end.
24. The apparatus according to claim 23, wherein the voice transmission apparatus further includes a packet loss status information obtaining module, configured to obtain packet loss status information about the first encoded code stream fed back by the receiving end when the first encoded code stream and the second encoded code stream are transmitted to the receiving end through different voice transmission links; when the first code stream and the second code stream are transmitted to the receiving end through the same voice transmission network link, obtaining the common packet loss state information of the first code stream and the second code stream fed back by the receiving end.
25. The apparatus according to any one of claims 14 to 24, wherein the transmission module is further configured to transmit the first encoded code stream and the second encoded code stream to a receiving end, respectively; when it is determined that packet loss occurs in the first encoded code stream at the receiving end, the receiving end determines a second encoded frame corresponding to the packet loss position from the second encoded code stream, determines an encoded frame adjacent to the packet loss position in the received first encoded code stream, and obtains the decoded speech signal corresponding to the packet loss position according to the second encoded frame corresponding to the packet loss position and the adjacent encoded frame.
26. A computer device comprising a memory storing a computer program and a processor implementing the steps of the method according to any one of claims 1 to 12 when executing the computer program.
27. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 12.
CN202110366404.2A 2021-04-06 2021-04-06 Voice transmission method, device, computer equipment and storage medium Active CN112769524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110366404.2A CN112769524B (en) 2021-04-06 2021-04-06 Voice transmission method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110366404.2A CN112769524B (en) 2021-04-06 2021-04-06 Voice transmission method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112769524A CN112769524A (en) 2021-05-07
CN112769524B true CN112769524B (en) 2021-06-22

Family

ID=75691146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110366404.2A Active CN112769524B (en) 2021-04-06 2021-04-06 Voice transmission method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112769524B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114301884B (en) * 2021-08-27 2023-12-05 腾讯科技(深圳)有限公司 Audio data transmitting method, receiving method, device, terminal and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101388214A (en) * 2007-09-14 2009-03-18 向为 Speed changing vocoder and coding method thereof
CN106357415A (en) * 2015-07-16 2017-01-25 华为技术有限公司 Anti-packet-loss processing method and anti-packet-loss processing device
CN111312264A (en) * 2020-02-20 2020-06-19 腾讯科技(深圳)有限公司 Voice transmission method, system, device, computer readable storage medium and equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004274212A (en) * 2003-03-06 2004-09-30 Renesas Technology Corp Picture encoding device
CN105187167B (en) * 2015-09-28 2018-11-06 广州市百果园网络科技有限公司 A kind of voice data communication method and device
CN108881780B (en) * 2018-07-17 2021-02-12 聚好看科技股份有限公司 Method and server for dynamically adjusting definition mode in video call
CN110225347A (en) * 2019-06-24 2019-09-10 北京大米科技有限公司 Method of transmitting video data, device, electronic equipment and storage medium
CN110970039A (en) * 2019-11-28 2020-04-07 北京蜜莱坞网络科技有限公司 Audio transmission method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101388214A (en) * 2007-09-14 2009-03-18 向为 Speed changing vocoder and coding method thereof
CN106357415A (en) * 2015-07-16 2017-01-25 华为技术有限公司 Anti-packet-loss processing method and anti-packet-loss processing device
CN111312264A (en) * 2020-02-20 2020-06-19 腾讯科技(深圳)有限公司 Voice transmission method, system, device, computer readable storage medium and equipment

Also Published As

Publication number Publication date
CN112769524A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
US11227612B2 (en) Audio frame loss and recovery with redundant frames
CN110838894B (en) Speech processing method, device, computer readable storage medium and computer equipment
CN111314335B (en) Data transmission method, device, terminal, storage medium and system
JP7383138B2 (en) Audio transmission method, its system, device, computer program, and computer equipment
JP4944250B2 (en) System and method for providing AMR-WBDTX synchronization
US10121484B2 (en) Method and apparatus for decoding speech/audio bitstream
US9385750B2 (en) Split gain shape vector coding
EP3899928A1 (en) Conditional forward error correction for network data
CN112769524B (en) Voice transmission method, device, computer equipment and storage medium
US9819448B2 (en) Redundancy scheme
EP1982328A1 (en) Variable frame offset coding
CN111464262A (en) Data processing method, device, medium and electronic equipment
CN111063361B (en) Voice signal processing method, system, device, computer equipment and storage medium
US20120095760A1 (en) Apparatus, a method and a computer program for coding
WO2023071822A1 (en) Packet loss protection method and apparatus, electronic device, and storage medium
KR101904422B1 (en) Method of Setting Configuration of Codec and Codec using the same
CN112954255B (en) Video conference code stream transmission method, device, computer equipment and storage medium
JP5053712B2 (en) Radio terminal and audio playback method for radio terminal
US20220392459A1 (en) Audio packet loss concealment via packet replication at decoder input
WO2023202250A1 (en) Audio transmission method and apparatus, terminal, storage medium and program product
CN117640015A (en) Speech coding and decoding method and device, electronic equipment and storage medium
KR101814607B1 (en) Method of Setting Configuration of Codec and Codec using the same
KR101645294B1 (en) Method of Setting Configuration of Codec and Codec using the same
CN114333862A (en) Audio encoding method, decoding method, device, equipment, storage medium and product
CN115881140A (en) Encoding and decoding method, device, equipment, storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40043819

Country of ref document: HK