CN111312264B - Voice transmission method, system, device, computer readable storage medium and apparatus - Google Patents


Info

Publication number: CN111312264B
Application number: CN202010104612.0A
Authority: CN (China)
Prior art keywords: voice, coding, current, data, packet loss
Legal status: Active; anticipated expiration listed (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN111312264A
Inventor: 梁俊斌
Current assignee: Tencent Technology (Shenzhen) Co., Ltd. (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Tencent Technology (Shenzhen) Co., Ltd.
Events: application filed by Tencent Technology (Shenzhen) Co., Ltd.; priority to CN202010104612.0A; publication of CN111312264A; application granted; publication of CN111312264B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals using source-filter models or psychoacoustic analysis
    • G10L 19/04: Coding or decoding of speech or audio signals using predictive techniques
    • G10L 19/08: Determination or coding of the excitation function; determination or coding of the long-term prediction parameters
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 to G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/60: Speech or voice analysis techniques specially adapted for measuring the quality of voice signals


Abstract

The present application relates to a voice transmission method, system, apparatus, computer-readable storage medium, and device. The method includes: acquiring the current encoded data in a voice coding code stream; obtaining, through a machine-learning-based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to a first voice coding feature parameter corresponding to the current encoded data and a second voice coding feature parameter corresponding to the encoded data preceding it; and judging, according to the packet loss recovery capability, whether redundant multi-send processing (transmitting multiple copies of the same packet) is needed. If so, the current encoded data is transmitted to the receiving end after redundant multi-send processing; if not, it is transmitted to the receiving end directly. The scheme provided by the present application can effectively improve the utilization of network bandwidth while ensuring the packet-loss resistance of the transmission network.

Description

Voice transmission method, system, device, computer readable storage medium and apparatus
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, a system, an apparatus, a computer readable storage medium, and a computer device for voice transmission.
Background
The Internet is an unreliable transmission network, so the main challenge for Internet-based voice transmission is resisting packet loss: because the transmission network is unstable, data packets can be lost in transit. To resist network packet loss, a redundant multi-send mechanism is generally adopted, in which multiple copies of each data packet are transmitted to the receiving end. This increases the probability that the receiving end receives the packet and thereby resists packet loss.
However, the redundant multi-send mechanism multiplies the required bandwidth and consumes excessive network bandwidth resources. In bandwidth-limited scenarios it easily causes network congestion and other problems, which in turn lead to even more packet loss.
Disclosure of Invention
In view of this, it is necessary to provide a voice transmission method, apparatus, system, computer-readable storage medium, and computer device that address the technical problems of the prior art, in which applying redundant multi-send processing to every data packet consumes excessive network bandwidth resources and causes congestion-induced packet loss.
A voice transmission method, comprising:
acquiring the current encoded data in a voice coding code stream;
obtaining, through a machine-learning-based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to a first voice coding feature parameter corresponding to the current encoded data and a second voice coding feature parameter corresponding to the encoded data preceding the current encoded data;
judging whether redundant multi-send processing is needed according to the packet loss recovery capability;
if so, transmitting the current encoded data to a receiving end after redundant multi-send processing;
if not, transmitting the current encoded data to the receiving end directly.
A voice transmission system comprising a transmitting end and a receiving end, wherein:
the transmitting end is configured to acquire the current encoded data in a voice coding code stream, and to obtain, through a machine-learning-based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to a first voice coding feature parameter corresponding to the current encoded data and a second voice coding feature parameter corresponding to the preceding encoded data;
the transmitting end is further configured to judge whether redundant multi-send processing is needed according to the packet loss recovery capability; if so, to transmit the current encoded data to the receiving end after redundant multi-send processing; if not, to transmit the current encoded data to the receiving end directly;
the receiving end is configured to, when it receives the current encoded data or any of its redundant copies, filter out the duplicate data packets and decode the result, obtaining the voice signal corresponding to the current encoded data;
the receiving end is further configured to, when neither the current encoded data nor any of its redundant copies is received, perform packet loss recovery processing to obtain a recovery packet corresponding to the current encoded data, and decode the recovery packet to obtain the voice signal corresponding to the current encoded data.
A voice transmission apparatus, the apparatus comprising:
an acquisition module configured to acquire the current encoded data in a voice coding code stream;
a prediction module configured to obtain, through a machine-learning-based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to a first voice coding feature parameter corresponding to the current encoded data and a second voice coding feature parameter corresponding to the preceding encoded data;
a redundant multi-send decision module configured to judge whether redundant multi-send processing is needed according to the packet loss recovery capability; if so, to transmit the current encoded data to a receiving end after redundant multi-send processing; if not, to transmit the current encoded data to the receiving end directly.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above-described voice transmission method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above-described voice transmission method.
According to the above voice transmission method, system, apparatus, computer-readable storage medium, and computer device, before the current encoded data is transmitted to the receiving end, the machine-learning-based packet loss recovery capability prediction model predicts the receiving end's packet loss recovery capability from the first voice coding feature parameter of the current encoded data and the second voice coding feature parameter of the preceding encoded data. Whether to apply redundant multi-send processing to the current encoded data is then decided according to this capability: if redundancy is needed, the necessary network bandwidth resources are spent on redundant multi-send processing; otherwise the current encoded data is transmitted to the receiving end directly, avoiding unnecessary bandwidth consumption. Overall, this effectively improves the utilization of network bandwidth while ensuring the packet-loss resistance of the transmission network.
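The send-side decision flow summarized above can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: `predict_plc_recovery` merely stands in for the trained machine-learning model, and the threshold and copy count are assumed values.

```python
def predict_plc_recovery(curr_features, prev_features):
    # Placeholder for the trained prediction model: here we simply treat
    # large jumps between adjacent feature parameters as "hard to recover".
    max_jump = max(abs(c - p) for c, p in zip(curr_features, prev_features))
    return 1.0 / (1.0 + max_jump)  # higher score = easier to recover

def packets_to_send(frame, curr_features, prev_features,
                    threshold=0.5, copies=3):
    """Return the list of packets to transmit for one encoded frame."""
    recovery = predict_plc_recovery(curr_features, prev_features)
    if recovery < threshold:
        return [frame] * copies   # redundant multi-send
    return [frame]                # send once, save bandwidth
```

With stable feature parameters the frame is sent once; with an abrupt jump (low predicted recovery capability) three copies are sent.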
Drawings
FIG. 1 is a diagram of an application environment for a voice transmission method in one embodiment;
FIG. 2 is a diagram of an application environment of a voice transmission method according to another embodiment;
FIG. 3 is a flow chart of a voice transmission method according to an embodiment;
FIG. 4 is a schematic block diagram of voice transmission using a redundant multiple emission mechanism in one embodiment;
fig. 5 is a flowchart illustrating a training step of the packet loss recovery capability prediction model in one embodiment;
FIG. 6 is a training block diagram of a packet loss recovery capability prediction model in one embodiment;
FIG. 7 is a flow diagram of a method of voice transmission in one embodiment;
FIG. 8 is a flow chart of a voice transmission method in one embodiment;
FIG. 9 is a block diagram of a voice transmission device in one embodiment;
FIG. 10 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Fig. 1 is an application environment diagram of a voice transmission method in one embodiment. Referring to fig. 1, the voice transmission method is applied to a voice transmission system. The voice transmission system includes a transmitting end 110 and a receiving end 120. The transmitting end 110 and the receiving end 120 are connected through a network. The transmitting end 110 and the receiving end 120 may be terminals, and the terminals may be specifically desktop terminals or mobile terminals, and the mobile terminals may be specifically at least one of mobile phones, tablet computers, notebook computers, and the like. In other embodiments, the sender 110 and the receiver 120 may be servers or server clusters.
As shown in fig. 2, in a specific application scenario, the sending end 110 and the receiving end 120 are both running an application program supporting a voice transmission function, and the server 130 may provide computing capability and storage capability for the application program, and the sending end 110 and the receiving end 120 may be connected to the server 130 through a network, so that voice transmission between the two ends is achieved based on the server 130. The server 130 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, the transmitting end 110 may acquire the current encoded data in a voice coding code stream, and obtain, through a machine-learning-based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to a first voice coding feature parameter corresponding to the current encoded data and a second voice coding feature parameter corresponding to the preceding encoded data. It then judges whether redundant multi-send processing is needed according to the packet loss recovery capability: if so, the current encoded data is transmitted to the receiving end 120 after redundant multi-send processing; if not, it is transmitted to the receiving end 120 directly. This effectively improves the overall utilization of network bandwidth while ensuring the packet-loss resistance of the transmission network.
As shown in fig. 3, in one embodiment, a voice transmission method is provided. This embodiment is described mainly by taking the method as applied to the transmitting end 110 in fig. 1 or fig. 2 as an example. Referring to fig. 3, the voice transmission method specifically includes the following steps S302 to S308:
S302, the current encoded data in the voice coding code stream is obtained.
The voice coding code stream is the original code stream obtained by performing voice coding on a voice signal; it comprises a sequence of encoded data units awaiting transmission. A unit of encoded data may be an encoded data frame produced by the transmitting end's speech encoder at a fixed frame length, in which case the transmitting end transmits the encoded data frames in the voice coding code stream to the receiving end over the network one by one. It may also be an encoded data packet synthesized from several encoded data frames, in which case the transmitting end transmits the encoded data packets in the voice coding code stream to the receiving end over the network. For example, if the encoder at the transmitting end obtains a 60 ms voice signal, it may divide it into 4 frames with a frame length of 15 ms and encode them in sequence to obtain 4 encoded data frames; the transmitting end may then transmit these frames to the receiving end one by one, or synthesize the 4 frames into one encoded data packet and transmit that packet to the receiving end over the network.
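The framing in the example above (60 ms of speech split into 15 ms frames) can be illustrated with a short sketch; the 16 kHz sample rate and the helper name are assumptions for illustration, not values from the patent.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=15):
    """Split a PCM sample sequence into fixed-length frames of frame_ms each."""
    n = sample_rate * frame_ms // 1000          # samples per frame
    return [samples[i:i + n] for i in range(0, len(samples), n)]

# 60 ms of audio at 16 kHz is 960 samples, yielding 4 frames of 240 samples.
frames = split_into_frames([0] * 960)
```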
To cope with packet loss in the transmission network, as shown in fig. 4, a conventional transmitting end directly applies the redundant multi-send mechanism: before transmitting the voice coding code stream, it copies each piece of encoded data several times, arranges the copies in a certain order, and sends them all to the receiving end. The receiving end receives each piece of encoded data together with its redundant copies over the network, filters out the duplicates, and decodes the filtered voice coding code stream to obtain the voice signal. In the embodiments provided by the present application, after the transmitting end encodes the original voice information into a voice coding code stream, and before sending each piece of encoded data to the receiving end, it predicts in turn the receiving end's packet loss recovery capability for each piece of encoded data. The transmitting end therefore obtains the encoded data in the voice coding code stream in sequence, and the current encoded data is the piece that is about to be transmitted to the receiving end.
It will be appreciated that, as used herein, "current encoded data" denotes the encoded data currently being processed by the transmitting end, and "previous encoded data" denotes encoded data preceding the current encoded data in the voice coding code stream. The previous encoded data may be the immediately preceding piece of encoded data, or a piece several positions earlier, for example the piece two positions before the current encoded data. In addition, "current encoded data" is a relative notion: after the transmitting end finishes processing the current encoded data F(i), the next piece F(i+1) in the voice coding code stream becomes the new current encoded data, and F(i) becomes the previous encoded data of F(i+1).
In one embodiment, the voice transmission method further includes: acquiring an original voice signal; dividing the original voice signal to obtain an original speech sequence; and sequentially performing voice coding on the speech segments in the original speech sequence to obtain the voice coding code stream.
For example, if the original voice signal acquired by the transmitting end is 2 seconds of speech, the signal is divided into 20-millisecond units to obtain an original speech sequence of 100 speech segments; each speech segment is then voice-coded in turn to obtain the encoded data corresponding to each segment, thereby generating the voice coding code stream corresponding to the original voice signal.
In one embodiment, the voice transmission method further includes: acquiring the voice coding feature parameters corresponding to each speech segment in the original speech sequence; performing voice coding on the corresponding speech segments according to these feature parameters to generate the corresponding encoded data and obtain the voice coding code stream; and caching the voice coding feature parameters used by each piece of encoded data during the voice coding process.
Specifically, during voice coding, the transmitting end extracts the voice coding feature parameters of each speech segment in the original speech sequence, encodes the extracted parameters, and generates the encoded data corresponding to each segment. For example, the encoder at the transmitting end extracts the voice coding feature parameters of a speech segment through speech signal processing models (such as filters and feature extractors), encodes the parameters (for example, by entropy coding), and packs the result in a certain data format to obtain the corresponding encoded data. It should be noted that the transmitting end may generate the current encoded data for the current speech segment jointly from the voice coding feature parameters of the current segment and those of a preceding segment, or jointly from the parameters of the current segment and those of a following segment. The voice coding feature parameters may be parameters extracted from the speech segment by signal processing, such as Line Spectral Frequencies (LSF), pitch (obtained by pitch detection), adaptive codebook gain, and fixed codebook gain.
Further, when the transmitting end generates the encoded data corresponding to each speech segment, it also caches the voice coding feature parameters used in generating each piece of encoded data, so that the packet loss recovery capability corresponding to each piece of encoded data can later be predicted from the cached parameters.
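The per-frame caching described above might look like the following sketch; the `FeatureCache` class and the feature names in the usage example (`lsf`, `pitch`) are illustrative assumptions, not structures from the patent.

```python
class FeatureCache:
    """Cache of voice coding feature parameters, keyed by frame index."""

    def __init__(self):
        self._store = {}

    def put(self, frame_index, features):
        # Store a copy so later mutation of the caller's dict is harmless.
        self._store[frame_index] = dict(features)

    def pair(self, frame_index):
        """Return (current, previous) feature parameters for the predictor,
        or None if the previous frame has not been cached yet."""
        curr = self._store.get(frame_index)
        prev = self._store.get(frame_index - 1)
        if curr is None or prev is None:
            return None
        return curr, prev
```

For example, after caching frames 0 and 1, `pair(1)` yields the (first, second) feature-parameter pair the prediction model needs.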
S304, obtaining the packet loss recovery capability corresponding to the current encoded data according to the first voice coding feature parameter corresponding to the current encoded data and the second voice coding feature parameter corresponding to the preceding encoded data, through a machine-learning-based packet loss recovery capability prediction model.
The packet loss recovery capability is a prediction result that reflects the voice quality of the recovery packet the receiving end would obtain by performing packet loss recovery processing after the current encoded data is lost; that is, the prediction indicates whether or not the receiving end can recover the lost current encoded data well. Here, packet loss recovery processing means PLC (Packet Loss Concealment), and the packet loss recovery capability is the recovery capability of the PLC.
When the voice coding feature parameters of the encoded data change abruptly, the packet loss recovery capability of the receiving end is limited. For example, when adjacent or nearby encoded data exhibit fundamental frequency (pitch) jumps, LSF discontinuities, and the like, the receiving end cannot recover lost data well; in such cases, enabling the redundant multi-send mechanism at the transmitting end effectively counteracts packet loss and safeguards the voice quality at the receiving end. Conversely, when the feature parameter values of adjacent encoded data fluctuate only mildly, the receiving end generally has good packet loss recovery capability, and the transmitting end need not enable the redundant multi-send mechanism. The packet loss recovery capability corresponding to a piece of encoded data is therefore related to its voice coding feature parameters, and a machine learning model, after being trained on a large number of training samples, learns how to predict the packet loss recovery capability of a data packet from the voice coding feature parameters.
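The intuition above, that abrupt jumps in pitch or LSF between nearby frames make PLC recovery hard, can be sketched as a simple hand-written rule; the threshold values are illustrative assumptions, not values from the patent, and a trained model would learn such boundaries rather than hard-code them.

```python
def has_abrupt_change(prev, curr, pitch_jump=20.0, lsf_jump=0.3):
    """Flag a frame pair whose pitch or LSF values jump sharply,
    i.e. a case where PLC at the receiving end would likely struggle."""
    return (abs(curr["pitch"] - prev["pitch"]) > pitch_jump
            or abs(curr["lsf"] - prev["lsf"]) > lsf_jump)
```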
Specifically, the transmitting end may acquire the cached first voice coding feature parameter corresponding to the current encoded data and the cached second voice coding feature parameter corresponding to the preceding encoded data, and predict the packet loss recovery capability corresponding to the current encoded data from these parameters through the pre-trained packet loss recovery capability prediction model.
In other embodiments, the transmitting end may obtain the packet loss recovery capability corresponding to the current encoded data through the prediction model from the first voice coding feature parameter corresponding to the current encoded data and a third voice coding feature parameter corresponding to encoded data following the current encoded data, or from the second and/or third voice coding feature parameters alone. "Following encoded data" describes encoded data after the current encoded data in the voice coding code stream; it may be the immediately following piece of encoded data or several following pieces, for example the two pieces after the current encoded data.
It can be understood that which voice coding feature parameters the transmitting end feeds into the packet loss recovery capability prediction model depends on the algorithmic rules adopted by the transmitting end for voice encoding, or equivalently those adopted by the receiving end for voice decoding, since the encoding and decoding rules correspond to each other. For example, if generating the current encoded data depends on the voice coding feature parameters of the preceding encoded data, then those parameters need to be part of the model input; if generating the current encoded data depends on the voice coding feature parameters of the following encoded data, then those parameters need to be part of the model input instead.
The packet loss recovery capability prediction model is a machine-learning-based computer model and may be implemented as a neural network. A machine learning model learns a specific capability from samples; in this embodiment, the packet loss recovery capability prediction model has been trained in advance to predict packet loss recovery capability.
In one embodiment, the transmitting end may preset the model structure of the machine learning model to obtain an initial machine learning model, and train the initial model on a large number of sample voices and packet loss simulation tests to obtain the model parameters. Accordingly, when voice needs to be transmitted over the network, the transmitting end may acquire the pre-trained model parameters and import them into the initial machine learning model to obtain the packet loss recovery capability prediction model, then use this model to predict the packet loss recovery capability corresponding to each piece of encoded data in the voice coding code stream, so as to decide, according to the predicted capability, whether to enable the redundant multi-send mechanism for the current encoded data.
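The two-phase flow described above (train offline to obtain model parameters, then import them into an initial model before transmission) might be sketched as follows. The linear scoring model and its weights are assumptions for illustration only and are far simpler than a real trained model.

```python
class RecoveryPredictor:
    """Initial model shell into which pre-trained parameters are imported."""

    def __init__(self):
        self.weights = None           # untrained "initial model"

    def load_params(self, weights):
        self.weights = list(weights)  # import pre-trained model parameters

    def predict(self, feature_diffs):
        """Map feature-parameter differences to a recovery score in (0, 1]."""
        if self.weights is None:
            raise RuntimeError("model parameters not loaded")
        score = sum(w * d for w, d in zip(self.weights, feature_diffs))
        return 1.0 / (1.0 + max(score, 0.0))  # higher = easier to recover
```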
Fig. 5 is a flowchart illustrating the training procedure of the packet loss recovery capability prediction model in one embodiment. It should be noted that the training steps may be executed by any computer device, after which the trained packet loss recovery capability prediction model is imported into a transmitting end that needs to perform voice transmission; the computer device may also be the transmitting end in fig. 1 or fig. 2, that is, the transmitting end may execute the training steps directly and obtain the trained model itself. The following description of the training steps takes a computer device as the executing subject; the steps specifically include:
S502, acquiring a sample voice sequence in a training set.
Specifically, the computer device may obtain a large number of speech signals and divide the speech signals to obtain a large number of speech signal sequences composed of speech segments as sample speech sequences for training the machine learning model.
S504, performing voice coding on the sample voice sequence to obtain a sample voice coding code stream.
Specifically, for each sample speech sequence, the computer device extracts the voice coding feature parameters corresponding to each speech segment and generates the encoded data corresponding to each segment from the extracted parameters, obtaining the sample voice coding code stream corresponding to that sample speech sequence. The computer device may cache the voice coding feature parameters used by each piece of encoded data during the encoding process.
S506, extracting the first voice coding feature parameter used by the current encoded data in the sample voice coding code stream and the second voice coding feature parameter used by the encoded data preceding it.
As mentioned above, the packet loss recovery capability corresponding to the encoded data is associated with the speech coding feature parameter corresponding to the encoded data, and possibly also with the speech coding feature parameter corresponding to the preceding encoded data and/or the following encoded data, so that, during training, the computer device may train with the speech coding feature parameter as an input of the machine learning model. In this embodiment, the transmitting end may extract, as input of the machine learning model, a first speech coding feature parameter corresponding to the currently processed currently encoded data and a second speech coding feature parameter corresponding to the previously encoded data of the currently encoded data. As mentioned before, the previous encoded data is the previous encoded data of the current encoded data, but may also be the previous plurality of encoded data of the current encoded data.
It should be noted that each training object is one piece of encoded data, and each sample speech encoded code stream includes a plurality of pieces of encoded data, so each sample speech encoded code stream can be used for multiple rounds of training. For example, the transmitting end may extract the speech coding feature parameters corresponding to the i-th and (i-1)-th encoded data in a sample speech coding code stream S for one round, and the speech coding feature parameters corresponding to the (i+1)-th and i-th encoded data in the same code stream S for another.
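As a minimal illustration of how such training inputs might be assembled, each cached per-frame parameter vector can be paired with the vector of the frame before it. The function name and the list-of-vectors representation below are assumptions for this sketch, not part of the described method:

```python
def build_training_pairs(cached_params):
    """cached_params: per-frame speech coding feature vectors, cached in
    encoding order. Returns one (first_param, second_param) pair per frame
    i >= 1, pairing the i-th frame's parameters with the (i-1)-th frame's."""
    pairs = []
    for i in range(1, len(cached_params)):
        first = cached_params[i]       # parameters of the current encoded data
        second = cached_params[i - 1]  # parameters of the previous encoded data
        pairs.append((first, second))
    return pairs
```

A code stream of N frames thus yields N-1 training examples, matching the observation that one sample code stream supports multiple rounds of training.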
S508, after the sample voice coding code stream is directly decoded and the first voice signal is obtained, a first voice quality score determined based on the first voice signal is obtained.
In order to obtain the target output of the machine learning model during training, the transmitting end needs to execute steps S508 to S512. The computer device may directly decode the sample speech encoded code stream to obtain a first speech signal, and then measure a first speech quality score for that signal using a speech quality testing tool. Because the first voice signal is obtained by decoding the sample voice coding code stream directly, with no encoded data lost, it is very close to the original sample voice sequence; it may therefore be called a lossless voice signal, and the corresponding first voice quality score a lossless voice quality score.
In one embodiment, the speech quality testing tool may be PESQ (Perceptual Evaluation of Speech Quality), which objectively evaluates the quality of a speech signal based on metrics that track human perception of speech quality, thereby providing a fully quantifiable speech quality measure. The first speech quality score obtained in this way may be denoted MOS_UNLOSS.
S510, obtaining a recovery packet by carrying out simulated packet loss recovery processing on the current coded data, decoding the recovery packet and obtaining a second voice signal, and determining a second voice quality score based on the second voice signal.
Then, the computer device may treat the current encoded data as a lost data packet, simulate the decoder at the receiving end performing packet loss recovery processing on it to obtain a corresponding recovery packet, decode the recovery packet to obtain a corresponding second voice signal, splice the other voice fragments of the original sample voice sequence with the second voice signal, and score the result to obtain a second voice quality score. Since the second speech signal is decoded from a recovery packet obtained under simulated packet loss, there is a loss between the recovery packet and the lost current encoded data, and hence between the second speech signal and the speech segment corresponding to the current encoded data. The second speech signal may therefore be called a lossy speech signal, and the second speech quality score a lossy speech quality score, denoted MOS_LOSS.
And S512, determining the actual packet loss recovery capacity corresponding to the current encoded data according to the score difference between the first voice quality score and the second voice quality score.
Specifically, the actual packet loss recovery capability corresponding to the current encoded data may be measured by the score difference between the first and second speech quality scores; that is, MOS_UNLOSS - MOS_LOSS may be used as the actual packet loss recovery capability corresponding to the current encoded data, which serves as the target output of the machine learning model. The actual packet loss recovery capability is inversely related to the score difference: the smaller the difference, the better the voice quality of the recovery packet obtained by simulating loss of the current encoded data and performing packet loss recovery, and the stronger the actual packet loss recovery capability; conversely, the larger the difference, the worse the voice quality of the recovery packet, and the weaker the actual packet loss recovery capability.
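The labeling rule above reduces to a one-line computation; the function name here is an assumption for illustration:

```python
def actual_recovery_label(mos_unloss, mos_loss):
    """Target output for one training example: the speech quality score
    difference. A smaller difference means the recovery packet is close to
    lossless quality, i.e. a stronger actual packet loss recovery capability."""
    return mos_unloss - mos_loss
```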
S514, the first voice coding characteristic parameter and the second voice coding characteristic parameter are input into a machine learning model, and the predicted packet loss recovery capacity corresponding to the current coding data is output through the machine learning model.
After the target output of the training process is obtained, the computer device may input the obtained first and second voice coding characteristic parameters into the machine learning model, which processes them through its internal network and outputs the predicted packet loss recovery capability corresponding to the current coding data. It should be noted that S514 may also be performed before step S508; this embodiment does not limit the execution order of these steps.
S516, after the model parameters of the machine learning model are adjusted according to the difference between the actual packet loss recovery capability and the predicted packet loss recovery capability, the process returns to the step of obtaining a sample voice sequence in the training set to continue training until a training end condition is met.
Specifically, the computer device may construct a loss function from the actual packet loss recovery capability and the predicted packet loss recovery capability output by the machine learning model, take the model parameters that minimize the loss function as the latest model parameters of the machine learning model, and continue with the next round of training on the sample voice sequences until the machine learning model converges or the number of training iterations reaches a preset number, thereby obtaining a trained packet loss recovery capability prediction model.
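The parameter-adjustment loop of S514–S516 can be sketched with a deliberately simplified stand-in model. Everything below is an assumption for illustration only: a linear model trained by per-sample gradient descent on a squared-error loss, where a real implementation would use a richer machine learning model:

```python
def train_step(weights, features, target, lr=0.1):
    """One gradient step: adjust model parameters to shrink the difference
    between the predicted and the actual packet loss recovery capability."""
    predicted = sum(w * x for w, x in zip(weights, features))
    error = predicted - target          # predicted minus actual capability
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    return new_weights, error ** 2      # squared-error loss for this sample

# Toy training set: (first + second coding feature parameters, actual capability).
samples = [([1.0, 0.5], 0.8), ([0.2, 1.0], 0.3)]
weights = [0.0, 0.0]
for _ in range(500):                    # until a training end condition is met
    for feats, target in samples:
        weights, loss = train_step(weights, feats, target)
```

After enough iterations the loss shrinks toward zero and the model's predicted capability tracks the actual capability computed from the score difference.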
Fig. 6 is a schematic diagram of a framework for training a machine learning model to obtain a packet loss recovery capability prediction model in one embodiment, showing the flow of a single training round. The computer device obtains a sample voice sequence and performs voice coding on it to obtain a sample voice coding code stream. First, the sample voice coding code stream is decoded directly, assuming the current coded data is not lost, and MOS_UNLOSS is obtained with PESQ; then loss of the current coded data is simulated, packet loss recovery processing is performed, and after decoding, MOS_LOSS is obtained with PESQ. The voice coding characteristic parameters of the current coded data and of the previous coded data are used as the input of the machine learning model to obtain the predicted packet loss recovery capability, while MOS_UNLOSS - MOS_LOSS is used as the target output of the machine learning model, i.e. the actual packet loss recovery capability. The model parameters of the machine learning model are then adjusted according to the difference between the predicted and actual packet loss recovery capabilities, completing one training round.
In one embodiment, step S304, obtaining, through a machine-learning-based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to the first speech encoding feature parameter corresponding to the current encoded data and the second speech encoding feature parameter corresponding to the previous encoded data, includes: inputting the first voice coding characteristic parameter corresponding to the current coding data and the second voice coding characteristic parameter corresponding to the previous coding data into the packet loss recovery capability prediction model; outputting, through the packet loss recovery capability prediction model and according to the first and second voice coding characteristic parameters, the score difference between a first voice quality score determined by directly decoding the current coding data and a second voice quality score determined by decoding the current coding data after packet loss recovery processing; and determining the packet loss recovery capability corresponding to the current coded data according to the score difference, where the packet loss recovery capability corresponding to the current coded data is inversely related to the score difference.
In this embodiment, before the transmitting end transmits the current encoded data in the speech encoded code stream to the receiving end, the packet loss recovery capability corresponding to the current encoded data may be predicted by a pre-trained packet loss recovery capability prediction model. Specifically, the first voice coding characteristic parameter corresponding to the current coding data and the second voice coding characteristic parameter corresponding to the previous coding data are used as the input of the packet loss recovery capability prediction model; its output is the score difference between a first voice quality score determined by directly decoding the current coding data and a second voice quality score determined by decoding the current coding data after packet loss recovery processing. The score difference reflects how well the receiving end can recover after the current coding data is lost, i.e. the magnitude of the packet loss recovery capability, which is inversely related to the score difference. When the score difference is large, i.e. the packet loss recovery capability is smaller than a preset threshold, the quality of the voice signal obtained by the receiving end through packet loss recovery processing after the current coded data is lost will be poor; conversely, when the score difference is small, i.e. the packet loss recovery capability is greater than the preset threshold, that quality will be within an acceptable range.
S306, judging whether redundant multiple processing is needed according to the packet loss recovery capability; if yes, executing step S308, and transmitting the current coded data to a receiving end after redundant multiple processing; if not, step S310 is executed to directly transmit the current encoded data to the receiving end.
Specifically, after the transmitting end obtains the packet loss recovery capacity corresponding to the current encoded data through the packet loss recovery capacity prediction model, whether redundancy multiple processing is performed on the current encoded data is judged according to the predicted packet loss recovery capacity.
In one embodiment, the packet loss recovery capability output by the packet loss recovery capability prediction model is a value within a range of values, and the transmitting end may compare the packet loss recovery capability with a preset threshold value, and determine whether redundancy multiple processing needs to be performed on the current encoded data according to the comparison result.
Specifically, when the packet loss recovery capability is smaller than a preset threshold, the current encoded data is transmitted to the receiving end after redundancy multiple processing: a capability below the threshold indicates that the quality of the voice signal the receiving end would obtain through packet loss recovery processing after the current encoded data is lost is poor, so redundancy multiple processing is needed to counter packet loss on the transmission network. When the packet loss recovery capability is greater than the preset threshold, the current coding data is transmitted directly to the receiving end: a capability above the threshold indicates that the quality of the voice signal obtained by the receiving end through packet loss recovery processing after the current coding data is lost is within an acceptable range, so the transmitting end does not need to use redundancy multiple as an anti-packet-loss strategy for this coded data. In that case, if the current coding data is lost, packet loss recovery processing can be performed directly using the packet loss recovery algorithm built into the decoder of the receiving end.
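The comparison above amounts to a simple predicate at the transmitting end; a sketch with assumed names:

```python
def needs_redundancy(recovery_capability, threshold):
    """True when the predicted packet loss recovery capability falls below
    the preset threshold, i.e. the receiver's built-in packet loss recovery
    alone would yield poor quality, so redundant copies should be sent."""
    return recovery_capability < threshold
```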
In one embodiment, the packet loss recovery capability output by the packet loss recovery capability prediction model takes one of two values. When it is the first value, the quality of the voice signal obtained by packet loss recovery processing after the current coded data is lost would be poor, and the sending end needs to transmit the current coded data to the receiving end after redundancy multiple processing; when it is the second value, that quality would be within an acceptable range, so the sending end can transmit the current coded data directly to the receiving end, and if the data is lost, packet loss recovery processing is performed directly by the packet loss recovery algorithm built into the decoder of the receiving end. For example, the first value may be 1 and the second value 0; alternatively, the first value may be 0 and the second value 1.
In one embodiment, the method for transmitting the current encoded data to the receiving end after performing redundancy multiple processing includes: acquiring packet loss state information fed back by a receiving end; determining redundancy multiple parameters corresponding to current coded data according to the packet loss state information; and copying the current coded data according to the redundancy multiple parameters and transmitting the current coded data to a receiving end.
Specifically, the receiving end may determine packet loss status information from the data packets it has received and feed that information back to the transmitting end. The packet loss status information may be represented by the current packet loss rate: the receiving end packages the packet loss rate into a control message and sends it to the transmitting end, which parses the received message to obtain the packet loss rate. The redundancy multiple parameter may be the redundancy multiple, i.e. the number of copies n of the current encoded data, and is adjusted based on the packet loss status information fed back by the receiving end: n is larger at a high packet loss rate and smaller at a low one. For example, when the packet loss rate is 10%, the redundancy multiple n may be set to 1; when the packet loss rate is 20%, n may be set to 2.
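Following the example mapping in the text (10% loss rate → n = 1, 20% → n = 2), the redundancy multiple parameter and the resulting packet list might be sketched as below; the cut-off values and function names are illustrative assumptions:

```python
def redundancy_multiple(packet_loss_rate):
    """Number of redundant copies n, adjusted by the fed-back loss rate:
    a larger n at a high packet loss rate, a smaller n at a low one."""
    if packet_loss_rate >= 0.20:
        return 2
    if packet_loss_rate >= 0.10:
        return 1
    return 0

def packets_to_send(encoded_frame, packet_loss_rate):
    """The original packet plus n identical redundant multi-send copies."""
    n = redundancy_multiple(packet_loss_rate)
    return [encoded_frame] * (1 + n)
```

For example, at a 20% loss rate one frame is sent three times in total (the original plus two redundant copies).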
In one embodiment, the voice transmission method further comprises: when the receiving end receives the current coding data or the redundant multi-sending packet corresponding to the current coding data, the receiving end filters the repeated data packet and then decodes the repeated data packet to obtain the voice signal corresponding to the current coding data.
Specifically, the receiving end may receive the current encoded data and the corresponding redundant multi-send packets over the network, filter out the redundant copies, sort the encoded data by packet sequence number, and decode one copy of each piece of encoded data to obtain the corresponding voice signal.
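The receiver-side filtering and ordering step can be sketched as follows; the (sequence number, payload) tuple representation is an assumption for this sketch:

```python
def filter_and_order(received):
    """received: (seq_no, payload) tuples, possibly containing duplicates
    from redundant multi-send and arriving out of order. Keep one payload
    per sequence number and return them ordered for decoding."""
    by_seq = {}
    for seq_no, payload in received:
        by_seq.setdefault(seq_no, payload)   # drop repeated data packets
    return [by_seq[seq_no] for seq_no in sorted(by_seq)]
```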
In one embodiment, the voice transmission method further comprises:
when the receiving end receives neither the current coding data nor a redundant multi-send packet corresponding to it, the receiving end performs packet loss recovery processing on the current coding data to obtain a recovery packet, and decodes the recovery packet to obtain the voice signal corresponding to the current coding data.
Specifically, when the receiving end decodes the encoded data one by one, if it receives the current encoded data or a corresponding redundant multi-send packet, it reconstructs the voice signal according to the normal decoding flow; if it receives neither the current encoded data nor a corresponding redundant multi-send packet, or does not receive them within a certain time, it judges that the current encoded data is lost and may perform packet loss recovery processing on it through the PLC algorithm built into the decoder. In particular, the current encoded data is generally approximated by a pitch-synchronous repetition method based on the decoded information of the previous frame; the result serves as the recovery packet, which is decoded to obtain the speech signal.
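A heavily simplified sketch of such repetition-style concealment follows. The attenuation factor and operating directly on decoded samples are assumptions for illustration; real PLC algorithms repeat at the pitch period inside the codec state and smooth the frame boundaries:

```python
def conceal_lost_frame(previous_frame, attenuation=0.8):
    """Approximate the lost frame by repeating the previously decoded frame,
    slightly attenuated so that repeated concealment fades out rather than
    producing an audible buzz."""
    return [attenuation * sample for sample in previous_frame]
```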
According to the above voice transmission method, before the current coded data is transmitted to the receiving end, the packet loss recovery capability of the receiving end for the current coded data is predicted, through a machine-learning-based packet loss recovery capability prediction model, from the first voice coding characteristic parameter corresponding to the current coded data and the second voice coding characteristic parameter corresponding to the previous coded data. Whether to apply redundancy multiple processing to the current coded data is then judged according to the packet loss recovery capability: if so, the necessary network bandwidth resources are consumed for redundancy multiple processing; otherwise, redundancy multiple processing is not needed and the current coded data is transmitted directly to the receiving end. This avoids consuming excessive network bandwidth resources while still guaranteeing the packet loss resistance of the transmission network.
FIG. 7 is a flow diagram of a voice transmission method in one embodiment. Referring to fig. 7, the transmitting end obtains an original voice signal and performs voice encoding on it to obtain a voice encoded code stream. Then, through the machine-learning-based packet loss recovery capability prediction model, the transmitting end predicts the receiving end's packet loss recovery capability for each piece of encoded data in the code stream, and judges according to the predicted capability whether to enable the redundancy multiple mechanism for the current coded data. If the redundancy multiple mechanism is enabled, the redundancy multiple parameter is set according to the packet loss status information fed back by the receiving end, and the current coded data is copied accordingly and transmitted to the receiving end. If the redundancy multiple mechanism is not enabled, the current coded data is transmitted directly to the receiving end.
If the receiving end receives the current coding data, it reconstructs the voice signal according to the normal decoding flow. If it receives neither the current coding data nor a corresponding redundant multi-send packet, or does not receive them within a certain time, it judges that the current coding data is lost, performs packet loss recovery processing on it through the PLC algorithm built into the decoder, and then decodes the result to obtain the voice signal.
Fig. 8 is a flowchart of a voice transmission method in one embodiment. Referring to fig. 8, the method comprises the steps of:
s802, acquiring an original voice signal.
S804, dividing the original voice signal to obtain an original voice sequence.
S806, the voice fragments in the original voice sequence are sequentially subjected to voice coding, and a voice coding code stream is obtained.
S808, buffering voice coding characteristic parameters adopted by each coded data in the voice coding process.
S810, current coding data in the voice coding code stream is obtained.
S812, the first voice coding characteristic parameters corresponding to the current coding data and the second voice coding characteristic parameters corresponding to the previous coding data of the current coding data are input into the packet loss recovery capacity prediction model.
S814, outputting, through the packet loss recovery capability prediction model and according to the first and second voice coding characteristic parameters, the score difference between a first voice quality score determined by directly decoding the current coding data and a second voice quality score determined by decoding the current coding data after packet loss recovery processing.
S816, determining the packet loss recovery capability corresponding to the current coded data according to the score difference.
S818, when the packet loss recovery capability is smaller than a preset threshold value, acquiring packet loss state information fed back by the receiving end; determining redundancy multiple parameters corresponding to current coded data according to the packet loss state information; copying the current coded data according to the redundancy multiple parameters and then transmitting the current coded data to a receiving end; and filtering out repeated data packets through a receiving end, and then decoding to obtain a voice signal corresponding to the current coded data.
S820, when the packet loss recovery capability is greater than a preset threshold, the current encoded data is directly transmitted to the receiving end.
S822, if the receiving end does not receive the current coding data and the redundant multi-sending packet corresponding to the current coding data, the receiving end performs packet loss recovery processing on the current coding data to obtain a recovery packet corresponding to the current coding data, decodes the recovery packet, and obtains a voice signal corresponding to the current coding data.
It should be understood that, although the steps in the flowcharts of fig. 3, 5, and 8 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 3, 5, and 8 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, a voice transmission system is provided, which may be a voice transmission system as shown in fig. 1 or fig. 2, including a transmitting end 110 and a receiving end 120, wherein:
the sending end 110 is configured to obtain current encoded data in the speech encoded code stream, and obtain, through a prediction model based on a packet loss recovery capability of machine learning, a packet loss recovery capability corresponding to the current encoded data according to a first speech encoding feature parameter corresponding to the current encoded data and a second speech encoding feature parameter corresponding to previous encoded data of the current encoded data;
the transmitting end 110 is further configured to determine whether redundancy multiple processing is required according to the packet loss recovery capability; if yes, performing redundancy multiple processing on the current coded data and then transmitting the current coded data to a receiving end; if not, directly transmitting the current coded data to a receiving end;
the receiving end 120 is configured to, when receiving the current encoded data or a redundant multi-send packet corresponding to the current encoded data, filter out the repeated data packets and then decode to obtain the speech signal corresponding to the current encoded data;
the receiving end 120 is further configured to, when the current encoded data and the redundant multi-packet corresponding to the current encoded data are not received, perform packet loss recovery processing on the current encoded data to obtain a recovery packet corresponding to the current encoded data, and decode the recovery packet to obtain a speech signal corresponding to the current encoded data.
In one embodiment, the transmitting end 110 is further configured to obtain an original voice signal; dividing an original voice signal to obtain an original voice sequence; and sequentially performing voice coding on voice fragments in the original voice sequence to obtain a voice coding code stream.
In one embodiment, the transmitting end 110 is further configured to obtain speech coding feature parameters corresponding to each of the speech segments in the original speech sequence; performing voice coding on the corresponding voice fragments according to the voice coding characteristic parameters, and obtaining a voice coding code stream after generating corresponding coding data; and buffering voice coding characteristic parameters adopted by each coded data in the voice coding process.
In one embodiment, the transmitting end 110 is further configured to input the first speech coding feature parameter corresponding to the current encoded data and the second speech coding feature parameter corresponding to the previous encoded data into the packet loss recovery capability prediction model; output, through the packet loss recovery capability prediction model and according to the first and second speech coding feature parameters, the score difference between a first voice quality score determined by directly decoding the current coding data and a second voice quality score determined by decoding the current coding data after packet loss recovery processing; and determine the packet loss recovery capability corresponding to the current coded data according to the score difference, the packet loss recovery capability being inversely related to the score difference.
In one embodiment, the sending end 110 is further configured to obtain packet loss status information fed back by the receiving end; determining redundancy multiple parameters corresponding to current coded data according to the packet loss state information; and copying the current coded data according to the redundancy multiple parameters and transmitting the current coded data to a receiving end.
In one embodiment, the receiving end 120 is further configured to perform packet loss recovery processing on the current encoded data when the current encoded data and the redundant multi-packet corresponding to the current encoded data are not received, obtain a recovery packet corresponding to the current encoded data, and decode the recovery packet to obtain a speech signal corresponding to the current encoded data.
In one embodiment, the transmitting end 110 is further configured to obtain a sample voice sequence in the training set; performing voice coding on the sample voice sequence to obtain a sample voice coding code stream; extracting a first voice coding characteristic parameter adopted by current coding data in a sample voice coding code stream and a second voice coding characteristic parameter adopted by previous coding data of the current coding data; after the sample voice coding code stream is directly decoded and a first voice signal is obtained, a first voice quality score determined based on the first voice signal is obtained; obtaining a recovery packet by carrying out simulated packet loss recovery processing on the current coded data, decoding the recovery packet and obtaining a second voice signal, and determining a second voice quality score based on the second voice signal; determining the actual packet loss recovery capacity corresponding to the current encoded data according to the score difference between the first voice quality score and the second voice quality score; inputting the first voice coding characteristic parameters and the second voice coding characteristic parameters into a machine learning model, and outputting predicted packet loss recovery capacity corresponding to current coding data through the machine learning model; and after the model parameters of the machine learning model are adjusted according to the difference between the actual packet loss recovery capability and the predicted packet loss recovery capability, returning to the step of obtaining the sample voice sequence in the training set for continuous training until the training ending condition is met.
According to the above voice transmission system, before the transmitting end transmits the current encoded data to the receiving end, it predicts the receiving end's packet loss recovery capability, through the machine-learning-based packet loss recovery capability prediction model, from the first voice encoding characteristic parameter corresponding to the current encoded data and the second voice encoding characteristic parameter corresponding to the previous encoded data, and judges according to that capability whether to apply redundancy multiple processing to the current encoded data. If so, the necessary network bandwidth resources are consumed for redundancy multiple processing; otherwise, the current encoded data is transmitted directly to the receiving end without redundancy multiple processing, avoiding excessive consumption of network bandwidth resources. This effectively improves the overall utilization of network bandwidth while also guaranteeing the packet loss resistance of the transmission network.
In one embodiment, as shown in fig. 9, a voice transmission apparatus 900 is provided, which may be implemented as all or part of a receiving end by software, hardware, or a combination of both. The apparatus includes an acquisition module 902, a prediction module 904, and a redundancy multiple decision module 906, wherein:
An obtaining module 902, configured to obtain current encoded data in a speech encoded code stream;
the prediction module 904 is configured to obtain, according to a first speech coding feature parameter corresponding to current coded data and a second speech coding feature parameter corresponding to previous coded data of the current coded data, a packet loss recovery capability corresponding to the current coded data by using a packet loss recovery capability prediction model based on machine learning;
a redundancy multiple decision module 906, configured to decide, according to the packet loss recovery capability, whether redundancy multiple processing is required; if yes, perform redundancy multiple processing on the current coded data before transmitting it to the receiving end; if not, transmit the current coded data directly to the receiving end.
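The decision flow of module 906 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the threshold value, the fixed redundancy multiple, and the function names are all assumptions.

```python
def decide_transmission(encoded_packet, recovery_capability, threshold=0.5):
    """Decide whether redundancy multiple (duplicate) sending is needed.

    A high predicted packet loss recovery capability means the receiving
    end can likely conceal a loss of this packet, so a single copy is sent.
    """
    if recovery_capability < threshold:
        # Weak recovery capability: duplicate the packet before sending.
        redundancy_factor = 3  # illustrative fixed multiple
        return [encoded_packet] * redundancy_factor
    # Strong recovery capability: transmit the packet directly, once.
    return [encoded_packet]
```

In practice the redundancy factor would come from the packet loss status feedback described below rather than a constant.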
In one embodiment, the voice transmission apparatus 900 further includes a voice encoding module configured to: obtain an original voice signal; divide the original voice signal into segments to obtain an original voice sequence; and sequentially perform voice coding on the voice segments in the original voice sequence to obtain a voice coding code stream.
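The segmentation step can be sketched as below. The frame size and sampling rate are illustrative assumptions (the patent does not fix them); 320 samples corresponds to a common 20 ms frame at 16 kHz.

```python
def divide_into_segments(signal, frame_size=320):
    """Split a raw speech signal (a sequence of samples) into fixed-length
    voice segments; e.g. 320 samples = 20 ms at a 16 kHz sampling rate.
    The final segment may be shorter than frame_size."""
    return [signal[i:i + frame_size] for i in range(0, len(signal), frame_size)]
```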
In one embodiment, the voice transmission apparatus 900 further includes a voice encoding module and a buffer module. The voice encoding module is configured to obtain the voice coding characteristic parameters corresponding to each voice segment in the original voice sequence, and to perform voice coding on the corresponding voice segments according to those parameters, generating the corresponding coded data to obtain the voice coding code stream. The buffer module is configured to buffer the voice coding characteristic parameters adopted by each piece of coded data during voice coding.
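One way to realize the buffer module is a bounded history keyed by frame index, so the predictor can fetch the previous frame's parameters. This is a sketch under assumed names; the patent does not prescribe a data structure or the parameter fields shown.

```python
from collections import deque

class FeatureParameterBuffer:
    """Cache the speech coding feature parameters used for each encoded
    frame, so that the prediction step can look up the parameters of the
    frame preceding the current one."""

    def __init__(self, maxlen=100):
        self._buf = deque(maxlen=maxlen)  # bounded history of (index, params)

    def push(self, frame_index, params):
        self._buf.append((frame_index, params))

    def previous(self, frame_index):
        """Return the parameters cached for the frame before frame_index,
        or None if that frame is no longer (or was never) buffered."""
        for idx, params in reversed(self._buf):
            if idx == frame_index - 1:
                return params
        return None
```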
In one embodiment, the prediction module 904 is further configured to: input the first voice coding characteristic parameter corresponding to the current coded data and the second voice coding characteristic parameter corresponding to the previous coded data into the packet loss recovery capability prediction model; output, through the model and from the first and second voice coding characteristic parameters, the score difference between a first voice quality score determined by directly decoding the current coded data and a second voice quality score determined by decoding the current coded data after packet loss recovery processing; and determine the packet loss recovery capability corresponding to the current coded data according to the score difference, the capability being inversely related to the score difference.
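The inverse relation between score difference and recovery capability is stated by the patent; the concrete mapping below is only one plausible choice (a linear map over an assumed MOS-style score range of 0 to 4), shown to make the relation concrete.

```python
def capability_from_score_difference(score_difference, max_score_drop=4.0):
    """Map the predicted quality-score difference (direct decoding vs.
    decoding after packet loss recovery) to a capability in [0, 1].

    A small quality drop after recovery means the packet is easy to
    recover, so capability is high; a large drop means it is hard to
    recover, so capability is low (inverse relation)."""
    clamped = max(0.0, min(score_difference, max_score_drop))
    return 1.0 - clamped / max_score_drop
```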
In one embodiment, the redundancy multiple decision module 906 is further configured to: obtain the packet loss status information fed back by the receiving end when the packet loss recovery capability is less than a preset threshold; determine the redundancy multiple parameter corresponding to the current coded data according to the packet loss status information; and copy the current coded data according to the redundancy multiple parameter before transmitting it to the receiving end.
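Mapping the fed-back loss state to a redundancy multiple might look like the following. The loss-rate tiers and multiples are illustrative assumptions; the patent only requires that the multiple be derived from the packet loss status information.

```python
def redundancy_multiple(loss_rate):
    """Choose a redundancy multiple parameter from the packet loss rate
    fed back by the receiving end: heavier loss -> more copies."""
    if loss_rate < 0.05:
        return 1          # light loss: single copy suffices
    if loss_rate < 0.20:
        return 2          # moderate loss: one duplicate
    return 3              # heavy loss: two duplicates

def send_with_redundancy(packet, loss_rate):
    """Copy the current coded data according to the redundancy multiple."""
    return [packet] * redundancy_multiple(loss_rate)
```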
In one embodiment, the voice transmission apparatus 900 further includes a model training module configured to: obtain a sample voice sequence from the training set; perform voice coding on the sample voice sequence to obtain a sample voice coding code stream; extract a first voice coding characteristic parameter adopted by current coded data in the sample voice coding code stream and a second voice coding characteristic parameter adopted by the previous coded data of the current coded data; directly decode the sample voice coding code stream to obtain a first voice signal, and obtain a first voice quality score determined based on the first voice signal; perform simulated packet loss recovery processing on the current coded data to obtain a recovery packet, decode the recovery packet to obtain a second voice signal, and determine a second voice quality score based on the second voice signal; determine the actual packet loss recovery capability corresponding to the current coded data according to the score difference between the first voice quality score and the second voice quality score; input the first and second voice coding characteristic parameters into a machine learning model, and output the predicted packet loss recovery capability corresponding to the current coded data through the machine learning model; and, after adjusting the model parameters of the machine learning model according to the difference between the actual and predicted packet loss recovery capabilities, return to the step of obtaining a sample voice sequence from the training set to continue training until the training end condition is met.
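The training loop above can be sketched with a toy model. The patent does not specify the model family; the linear model, scalar feature encoding, learning rate, and stopping condition below are assumptions, and the "actual capability" labels are assumed to come from the simulated-loss scoring steps just described.

```python
def train_capability_predictor(samples, epochs=500, lr=0.05):
    """Fit a toy linear model predicting the actual recovery capability
    from (first, second) speech coding feature parameters.

    samples: iterable of ((f1, f2), actual_capability) pairs, where the
    label was derived from the score difference between direct decoding
    and decoding after simulated packet loss recovery."""
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (f1, f2), target in samples:
            pred = w1 * f1 + w2 * f2 + b
            err = pred - target           # predicted minus actual capability
            # adjust model parameters by gradient descent on squared error
            w1 -= lr * err * f1
            w2 -= lr * err * f2
            b -= lr * err
    return lambda f1, f2: w1 * f1 + w2 * f2 + b
```

A production system would use a richer model and stop on a validation criterion rather than a fixed epoch count.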
In the above voice transmission apparatus 900, before the current coded data is transmitted to the receiving end, the machine-learning-based packet loss recovery capability prediction model predicts, from the first voice coding characteristic parameter corresponding to the current coded data and the second voice coding characteristic parameter corresponding to the previous coded data, the receiving end's packet loss recovery capability for the current coded data. Whether to perform redundancy multiple processing on the current coded data is then decided according to that capability: if so, the necessary network bandwidth is spent on redundancy multiple processing; if not, the current coded data is transmitted directly to the receiving end, avoiding unnecessary bandwidth consumption. This effectively improves overall network bandwidth utilization while preserving the transmission network's resistance to packet loss.
FIG. 10 illustrates an internal block diagram of a computer device in one embodiment. The computer device may specifically be the transmitting end 110 in fig. 1. As shown in fig. 10, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium of the computer device stores an operating system, and may also store a computer program that, when executed by the processor, causes the processor to implement the voice transmission method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the voice transmission method.
It will be appreciated by those skilled in the art that the structure shown in fig. 10 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the voice transmission apparatus 900 provided herein may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 10. The memory of the computer device may store various program modules that make up the speech transmission apparatus 900, such as the acquisition module 902, the prediction module 904, and the redundancy multiple decision module 906 shown in fig. 9. The computer program constituted by the respective program modules causes the processor to execute the steps in the voice transmission method of the respective embodiments of the present application described in the present specification.
For example, the computer apparatus shown in fig. 10 may perform step S302 through the acquisition module 902 in the voice transmission apparatus 900 shown in fig. 9. The computer device may perform step S304 through the prediction module 904. The computer device may perform steps S306, S308, and S310 through the redundancy multiple decision module 906.
In one embodiment, a computer device is provided that includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the above-described voice transmission method. The steps of the voice transmission method herein may be the steps of the voice transmission method of the above-described respective embodiments.
In one embodiment, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above-described voice transmission method. The steps of the voice transmission method herein may be the steps of the voice transmission method of the above-described respective embodiments.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by a computer program stored on a non-transitory computer readable storage medium which, when executed, may include the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration, and not limitation, RAM is available in a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features have been described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only a few implementations of the present application; their description is relatively specific and detailed, but is not to be construed as limiting the scope of the application. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Accordingly, the scope of protection of the present application is determined by the appended claims.

Claims (21)

1. A voice transmission method, comprising:
acquiring current coding data in a voice coding code stream;
acquiring packet loss recovery capacity corresponding to current coding data according to a trained packet loss recovery capacity prediction model based on machine learning and a first voice coding characteristic parameter corresponding to the current coding data and a second voice coding characteristic parameter corresponding to the previous coding data of the current coding data;
Judging whether redundant multiple processing is needed according to the packet loss recovery capability;
if yes, performing redundancy multiple processing on the current coded data and then transmitting the current coded data to a receiving end;
if not, directly transmitting the current coded data to a receiving end;
the packet loss recovery capability prediction model is determined through the following steps:
acquiring a sample voice sequence in a training set;
performing voice coding on the sample voice sequence to obtain a sample voice coding code stream;
extracting a first voice coding characteristic parameter adopted by current coding data in the sample voice coding code stream and a second voice coding characteristic parameter adopted by previous coding data of the current coding data;
directly decoding the sample voice coding code stream and obtaining a first voice signal, and determining a first voice quality score based on the first voice signal;
performing simulated packet loss recovery processing on current coded data in the sample voice coded code stream to obtain a recovery packet, decoding the recovery packet and obtaining a second voice signal, and determining a second voice quality score based on the second voice signal;
determining the actual packet loss recovery capability corresponding to the current coding data in the sample voice coding code stream according to the score difference between the first voice quality score and the second voice quality score;
Inputting the first voice coding characteristic parameters and the second voice coding characteristic parameters into a machine learning model, and outputting predicted packet loss recovery capacity corresponding to current coding data in the sample voice coding code stream through the machine learning model;
and after the model parameters of the machine learning model are adjusted according to the difference between the actual packet loss recovery capability and the predicted packet loss recovery capability, returning to the step of obtaining the sample voice sequence in the training set to continue training until the training end condition is met, so as to obtain the trained packet loss recovery capability prediction model based on machine learning.
2. The method according to claim 1, wherein the method further comprises:
acquiring an original voice signal;
dividing an original voice signal to obtain an original voice sequence;
and sequentially performing voice coding on the voice fragments in the original voice sequence to obtain a voice coding code stream.
3. The method according to claim 1, wherein the method further comprises:
acquiring voice coding characteristic parameters corresponding to voice fragments in an original voice sequence;
performing voice coding on the corresponding voice fragments according to the voice coding characteristic parameters, and obtaining a voice coding code stream after generating corresponding coding data;
And buffering voice coding characteristic parameters adopted by each coded data in the voice coding process.
4. The method according to claim 1, wherein the obtaining, by the trained machine learning based packet loss recovery capability prediction model, the packet loss recovery capability corresponding to the current encoded data according to the first speech coding feature parameter corresponding to the current encoded data and the second speech coding feature parameter corresponding to the previous encoded data of the current encoded data, includes:
inputting the first voice coding characteristic parameters corresponding to the current coding data and the second voice coding characteristic parameters corresponding to the previous coding data of the current coding data into a packet loss recovery capacity prediction model;
outputting a scoring difference between a first voice quality score determined by directly decoding the current encoded data and a second voice quality score determined by decoding the current encoded data after packet loss recovery processing according to the first voice encoding characteristic parameter and the second voice encoding characteristic parameter through the packet loss recovery capability prediction model;
determining the packet loss recovery capacity corresponding to the current coded data according to the scoring difference;
And the packet loss recovery capacity corresponding to the current coded data is inversely related to the scoring difference.
5. The method of claim 1, wherein the redundancy multiple processing is performed on the current encoded data and then transmitted to a receiving end, and the method comprises:
acquiring packet loss state information fed back by a receiving end;
determining redundancy multiple parameters corresponding to the current coded data according to the packet loss state information;
and copying the current coded data according to the redundancy multiple parameters and transmitting the current coded data to the receiving end.
6. The method of claim 5, wherein the method further comprises:
when the receiving end receives the current coding data or the redundant multi-sending packet corresponding to the current coding data, the receiving end filters the repeated data packet and then decodes the data packet to obtain the voice signal corresponding to the current coding data.
7. The method according to claim 1, wherein the method further comprises:
when the receiving end does not receive the current coding data and the redundant multi-sending packet corresponding to the current coding data, the receiving end carries out packet loss recovery processing on the current coding data to obtain a recovery packet corresponding to the current coding data, and decodes the recovery packet to obtain a voice signal corresponding to the current coding data.
8. A voice transmission system comprising a transmitting end and a receiving end, wherein:
the sending end is used for obtaining current coding data in a voice coding code stream, and obtaining packet loss recovery capacity corresponding to the current coding data according to a first voice coding characteristic parameter corresponding to the current coding data and a second voice coding characteristic parameter corresponding to the previous coding data of the current coding data through a trained packet loss recovery capacity prediction model based on machine learning;
the sending end is also used for judging whether redundant multiple processing is needed according to the packet loss recovery capability; if yes, performing redundancy multiple processing on the current coded data and then transmitting the current coded data to a receiving end; if not, directly transmitting the current coded data to a receiving end;
the receiving end is used for, when receiving the current coding data or a redundant multi-sending packet corresponding to the current coding data, filtering repeated data packets and then decoding, so as to obtain the voice signal corresponding to the current coding data;
the receiving end is further used for carrying out packet loss recovery processing on the current coding data to obtain a recovery packet corresponding to the current coding data when the current coding data and the redundant multi-sending packet corresponding to the current coding data are not received, and decoding the recovery packet to obtain a voice signal corresponding to the current coding data;
The sending end is also used for obtaining a sample voice sequence in the training set; performing voice coding on the sample voice sequence to obtain a sample voice coding code stream; extracting a first voice coding characteristic parameter adopted by current coding data in the sample voice coding code stream and a second voice coding characteristic parameter adopted by previous coding data of the current coding data; directly decoding the sample voice coding code stream and obtaining a first voice signal, and determining a first voice quality score based on the first voice signal; performing simulated packet loss recovery processing on current coded data in the sample voice coded code stream to obtain a recovery packet, decoding the recovery packet and obtaining a second voice signal, and determining a second voice quality score based on the second voice signal; determining the actual packet loss recovery capability corresponding to the current coding data in the sample voice coding code stream according to the score difference between the first voice quality score and the second voice quality score; inputting the first voice coding characteristic parameters and the second voice coding characteristic parameters into a machine learning model, and outputting predicted packet loss recovery capability corresponding to current coding data in the sample voice coding code stream through the machine learning model; and after the model parameters of the machine learning model are adjusted according to the difference between the actual packet loss recovery capability and the predicted packet loss recovery capability, returning to the step of obtaining a sample voice sequence in the training set to continue training until the training end condition is met, so as to obtain a trained packet loss recovery capability prediction model based on machine learning.
9. The system of claim 8, wherein the transmitting end is further configured to obtain an original voice signal; dividing an original voice signal to obtain an original voice sequence; and sequentially performing voice coding on the voice fragments in the original voice sequence to obtain a voice coding code stream.
10. The system of claim 8, wherein the transmitting end is further configured to obtain speech coding feature parameters corresponding to each of the speech segments in the original speech sequence; performing voice coding on the corresponding voice fragments according to the voice coding characteristic parameters, and obtaining a voice coding code stream after generating corresponding coding data; and buffering voice coding characteristic parameters adopted by each coded data in the voice coding process.
11. The system of claim 8, wherein the transmitting end is further configured to input a first speech coding feature parameter corresponding to the current encoded data and a second speech coding feature parameter corresponding to the previous encoded data of the current encoded data to a packet loss recovery capability prediction model; outputting a scoring difference between a first voice quality score determined by directly decoding the current encoded data and a second voice quality score determined by decoding the current encoded data after packet loss recovery processing according to the first voice encoding characteristic parameter and the second voice encoding characteristic parameter through the packet loss recovery capability prediction model; determining the packet loss recovery capacity corresponding to the current coded data according to the scoring difference; and the packet loss recovery capacity corresponding to the current coded data is inversely related to the scoring difference.
12. The system of claim 8, wherein the transmitting end is further configured to obtain packet loss status information fed back by the receiving end; determining redundancy multiple parameters corresponding to the current coded data according to the packet loss state information; and copying the current coded data according to the redundancy multiple parameters and transmitting the current coded data to the receiving end.
13. The system of claim 12, wherein the receiving end is further configured to, when receiving the current encoded data or the redundant multi-sending packet corresponding to the current encoded data, filter repeated data packets and then decode, so as to obtain the speech signal corresponding to the current encoded data.
14. The system of claim 8, wherein the receiving end is further configured to perform packet loss recovery processing on the current encoded data to obtain a recovery packet corresponding to the current encoded data when the current encoded data and the redundant multi-packet corresponding to the current encoded data are not received, and decode the recovery packet to obtain a speech signal corresponding to the current encoded data.
15. A voice transmission apparatus, the apparatus comprising:
The model training module is used for acquiring a sample voice sequence in a training set; performing voice coding on the sample voice sequence to obtain a sample voice coding code stream; extracting a first voice coding characteristic parameter adopted by current coding data in the sample voice coding code stream and a second voice coding characteristic parameter adopted by previous coding data of the current coding data; directly decoding the sample voice coding code stream and obtaining a first voice signal, and determining a first voice quality score based on the first voice signal; performing simulated packet loss recovery processing on current coded data in the sample voice coded code stream to obtain a recovery packet, decoding the recovery packet and obtaining a second voice signal, and determining a second voice quality score based on the second voice signal; determining the actual packet loss recovery capability corresponding to the current coding data in the sample voice coding code stream according to the score difference between the first voice quality score and the second voice quality score; inputting the first voice coding characteristic parameters and the second voice coding characteristic parameters into a machine learning model, and outputting predicted packet loss recovery capability corresponding to current coding data in the sample voice coding code stream through the machine learning model; and after the model parameters of the machine learning model are adjusted according to the difference between the actual packet loss recovery capability and the predicted packet loss recovery capability, returning to the step of obtaining the sample voice sequence in the training set to continue training until the training end condition is met, and obtaining a trained packet loss recovery capability prediction model based on machine learning;
The acquisition module is used for acquiring current coding data in the voice coding code stream;
the prediction module is used for obtaining the packet loss recovery capacity corresponding to the current coding data according to the first voice coding characteristic parameter corresponding to the current coding data and the second voice coding characteristic parameter corresponding to the previous coding data of the current coding data through a trained packet loss recovery capacity prediction model based on machine learning;
the redundancy multiple-shot judging module is used for judging whether redundancy multiple-shot processing is needed according to the packet loss recovery capability; if yes, performing redundancy multiple processing on the current coded data and then transmitting the current coded data to a receiving end; if not, the current coded data is directly transmitted to a receiving end.
16. The apparatus of claim 15, wherein the apparatus further comprises a speech encoding module configured to: obtain an original speech signal; divide the original speech signal to obtain an original speech sequence; and sequentially perform voice coding on the voice fragments in the original voice sequence to obtain a voice coding code stream.
17. The apparatus of claim 15, wherein the apparatus further comprises:
the voice coding module is used for acquiring voice coding characteristic parameters corresponding to voice fragments in the original voice sequence; performing voice coding on the corresponding voice fragments according to the voice coding characteristic parameters, and obtaining a voice coding code stream after generating corresponding coding data;
And the buffer module is used for buffering voice coding characteristic parameters adopted by each coded data in the voice coding process.
18. The apparatus of claim 15, wherein the prediction module is further configured to input a first speech coding feature parameter corresponding to the current encoded data and a second speech coding feature parameter corresponding to the previously encoded data of the current encoded data into a packet loss recovery capability prediction model; outputting a scoring difference between a first voice quality score determined by directly decoding the current encoded data and a second voice quality score determined by decoding the current encoded data after packet loss recovery processing according to the first voice encoding characteristic parameter and the second voice encoding characteristic parameter through the packet loss recovery capability prediction model; determining the packet loss recovery capacity corresponding to the current coded data according to the scoring difference; and the packet loss recovery capacity corresponding to the current coded data is inversely related to the scoring difference.
19. The apparatus of claim 15, wherein the redundancy multiple decision module is further configured to obtain packet loss status information fed back by the receiving end; determining redundancy multiple parameters corresponding to the current coded data according to the packet loss state information; and copying the current coded data according to the redundancy multiple parameters and transmitting the current coded data to the receiving end.
20. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of any one of claims 1 to 7.
21. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 7.
CN202010104612.0A 2020-02-20 2020-02-20 Voice transmission method, system, device, computer readable storage medium and apparatus Active CN111312264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010104612.0A CN111312264B (en) 2020-02-20 2020-02-20 Voice transmission method, system, device, computer readable storage medium and apparatus

Publications (2)

Publication Number Publication Date
CN111312264A CN111312264A (en) 2020-06-19
CN111312264B true CN111312264B (en) 2023-04-21


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112820306B (en) * 2020-02-20 2023-08-15 腾讯科技(深圳)有限公司 Voice transmission method, system, device, computer readable storage medium and apparatus
CN111899729B (en) * 2020-08-17 2023-11-21 广州市百果园信息技术有限公司 Training method and device for voice model, server and storage medium
CN111953596A (en) * 2020-08-26 2020-11-17 北京奥特维科技有限公司 Double-network-port zero-delay hot standby method and device for distributed coding and decoding system
CN112532349B (en) * 2020-11-24 2022-02-18 广州技象科技有限公司 Data processing method and device based on decoding abnormity
CN112769524B (en) * 2021-04-06 2021-06-22 腾讯科技(深圳)有限公司 Voice transmission method, device, computer equipment and storage medium
CN114333862B (en) * 2021-11-10 2024-05-03 腾讯科技(深圳)有限公司 Audio encoding method, decoding method, device, equipment, storage medium and product
CN115146125B (en) * 2022-05-27 2023-02-03 北京科技大学 Receiving end data filtering method and device under semantic communication multi-address access scene

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1758254A1 (en) * 2005-08-23 2007-02-28 Thomson Licensing Improved erasure correction scheme based on XOR operations for packet transmission
CN101119319A (en) * 2007-09-19 2008-02-06 腾讯科技(深圳)有限公司 Method, transmitting/receiving device and system against lost packet in data transmission process
CN101604523A (en) * 2009-04-22 2009-12-16 网经科技(苏州)有限公司 Method for hiding redundant information in G.711 voice coding
CN101741584A (en) * 2008-11-20 2010-06-16 盛乐信息技术(上海)有限公司 Method for reducing packet loss of streaming media
CN101777960A (en) * 2008-11-17 2010-07-14 华为终端有限公司 Audio encoding method, audio decoding method, related device and communication system
CN101834700A (en) * 2010-05-12 2010-09-15 北京邮电大学 Unidirectional reliable transmission method and transceiving device based on data packets
CN102263606A (en) * 2010-05-28 2011-11-30 华为技术有限公司 Channel data coding and decoding method and device
CN107196746A (en) * 2016-03-15 2017-09-22 中兴通讯股份有限公司 Packet loss resistance method, device and system for real-time communication
CN109218083A (en) * 2018-08-27 2019-01-15 广州爱拍网络科技有限公司 Voice data transmission method and device
CN109862440A (en) * 2019-02-22 2019-06-07 深圳市凯迪仕智能科技有限公司 Forward error correction method for audio and video transmission, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5187389B2 (en) * 2010-12-28 2013-04-24 ブラザー工業株式会社 COMMUNICATION DEVICE, COMMUNICATION METHOD, AND COMMUNICATION PROGRAM

Also Published As

Publication number Publication date
CN111312264A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111312264B (en) Voice transmission method, system, device, computer readable storage medium and apparatus
CN112820306B (en) Voice transmission method, system, device, computer readable storage medium and apparatus
US11849128B2 (en) Dynamic control for a machine learning autoencoder
US20200234725A1 (en) Speech coding using discrete latent representations
US10594338B1 (en) Adaptive quantization
WO2023056808A1 (en) Encrypted malicious traffic detection method and apparatus, storage medium and electronic apparatus
WO2022213787A1 (en) Audio encoding method, audio decoding method, apparatus, computer device, storage medium, and computer program product
CN116233445B (en) Video encoding and decoding processing method and device, computer equipment and storage medium
EP3076390A1 (en) Method and device for decoding speech and audio streams
CN114333862B (en) Audio encoding method, decoding method, device, equipment, storage medium and product
CN113660488B (en) Method and device for carrying out flow control on multimedia data and training flow control model
CN114842857A (en) Voice processing method, device, system, equipment and storage medium
CN112769524B (en) Voice transmission method, device, computer equipment and storage medium
CN114298199A (en) Transcoding parameter model training method, video transcoding method and device
CN112634868B (en) Voice signal processing method, device, medium and equipment
CN113724716B (en) Speech processing method and speech processing device
CN112669857B (en) Voice processing method, device and equipment
CN116996622B (en) Voice data transmission method, device, equipment, medium and program product
CN116188648B (en) Virtual person action generation method and device based on non-driving source and electronic equipment
US20240078411A1 (en) Information processing system, encoding device, decoding device, model learning device, information processing method, encoding method, decoding method, model learning method, and program storage medium
US20240127848A1 (en) Quality estimation model for packet loss concealment
CN116580716A (en) Audio encoding method, device, storage medium and computer equipment
CN115206330A (en) Audio processing method, audio processing apparatus, electronic device, and storage medium
CN115359398A (en) Voice video positioning model and construction method, device and application thereof
CN117640015A (en) Speech coding and decoding method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40024144
Country of ref document: HK

GR01 Patent grant