CN115171707A - Voice stream packet loss compensation method and device, equipment, medium and product thereof - Google Patents

Voice stream packet loss compensation method and device, equipment, medium and product thereof

Info

Publication number
CN115171707A
CN115171707A (application number CN202210804024.7A)
Authority
CN
China
Prior art keywords
voice
frame
vocoder
subsequent
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210804024.7A
Other languages
Chinese (zh)
Inventor
王汉超
林伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bigo Technology Pte Ltd
Original Assignee
Bigo Technology Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bigo Technology Pte Ltd filed Critical Bigo Technology Pte Ltd
Priority to CN202210804024.7A priority Critical patent/CN115171707A/en
Publication of CN115171707A publication Critical patent/CN115171707A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application relates to a voice stream packet loss compensation method and a corresponding device, equipment, medium and product. The method comprises the following steps: determining a current voice frame whose subsequent voice frame is missing in a voice stream; acquiring a voice frame sequence containing the current voice frame from the voice stream, and extracting acoustic features of the voice frame sequence; extracting global feature information and local feature information of the acoustic features with a conditional network in a preset vocoder to construct comprehensive feature information; and generating the subsequent voice frame from the comprehensive feature information with a recurrent network in the vocoder, taking the global feature information as reference. The method and the device can realize packet loss compensation for the voice stream and generate the missing subsequent voice frames; the generated subsequent voice frames restore the original speech with a high restoration degree, effectively avoiding repeated-sound and mechanical-sound artifacts in the voice after packet loss compensation.

Description

Voice stream packet loss compensation method and device, equipment, medium and product thereof
Technical Field
The present application relates to the field of audio transmission technologies, and in particular, to a method, an apparatus, a device, a medium, and a product for compensating for packet loss of a voice stream.
Background
In network voice call scenarios, voice packets are often lost during transmission, which degrades call quality and must be prevented or remedied by technical means.
Traditional packet loss compensation methods rely on signal processing: the segment that best matches the waveform before the loss occurred is searched for, the lost segment is predicted by waveform substitution, linear predictive coding or other interpolation methods, and the signal is then reconstructed by fusing the predicted segment with the received voice through overlap-add, fade-in/fade-out and similar techniques.
With conventional packet loss compensation, the audio synthesized for losses exceeding 40 ms exhibits repeated-sound and mechanical-sound artifacts, so other, more effective approaches need to be found.
Disclosure of Invention
The present application aims to solve the above problems and provide a voice stream packet loss compensation method, and a corresponding apparatus, device, non-volatile readable storage medium, and computer program product.
According to an aspect of the present application, a method for compensating for packet loss of a voice stream is provided, which includes the following steps:
determining a current voice frame missing a subsequent voice frame in a voice stream;
acquiring a voice frame sequence containing a current voice frame from a voice stream, and extracting acoustic characteristics of the voice frame sequence;
respectively extracting global feature information and local feature information of the acoustic features by adopting a conditional network in a preset vocoder to construct comprehensive feature information;
and generating the subsequent voice frame according to the comprehensive feature information by adopting a recurrent network in the vocoder and taking the global feature information as reference.
According to another aspect of the present application, there is provided an apparatus for compensating for packet loss of a voice stream, including:
a current frame processing module, configured to determine a current speech frame missing a subsequent speech frame in the speech stream;
the sequence processing module is arranged to acquire a voice frame sequence containing a current voice frame from a voice stream and extract the acoustic characteristics of the voice frame sequence;
the feature construction module is arranged to adopt a conditional network in a preset vocoder to respectively extract the global feature information and the local feature information of the acoustic features to construct comprehensive feature information;
and the voice frame generating module is arranged to adopt a recurrent network in the vocoder, take the global feature information as reference, and generate the subsequent voice frame according to the comprehensive feature information.
According to another aspect of the present application, there is provided a voice stream packet loss compensation device, including a central processing unit and a memory, where the central processing unit is configured to invoke and run a computer program stored in the memory to perform the steps of the voice stream packet loss compensation method described in the present application.
According to another aspect of the present application, a non-transitory readable storage medium is provided, which stores, in the form of computer readable instructions, a computer program implemented according to the voice stream packet loss compensation method, where the computer program is invoked by a computer to execute the steps included in the method.
According to another aspect of the present application, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method described in any one of the embodiments of the present application.
Compared with the prior art, after a current voice frame lacking its subsequent voice frame is determined, the acoustic features of a voice frame sequence formed by the current voice frame and several preceding voice frames are used: the conditional network of the vocoder extracts global feature information and local feature information and combines them into comprehensive feature information, and the recurrent network of the vocoder then generates the subsequent voice frame from the comprehensive feature information while referring to the global feature information. Because the comprehensive feature information is obtained by the conditional network extracting salient features of different scales from the acoustic features of the voice frame sequence, slowly varying information such as phonemes and prosody is captured effectively; with the global feature information as reference, the recurrent network generates a subsequent voice frame that effectively inherits the phonemes, prosody and other information of the temporally adjacent preceding frames. The generated subsequent voice frame therefore has a high restoration degree, repeated-sound and mechanical-sound artifacts in the voice after packet loss compensation can be effectively avoided, and the compensated voice stream can obtain a high subjective quality score.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of a network architecture corresponding to a voice call environment applied in the present application;
fig. 2 is a schematic diagram illustrating a network architecture of a vocoder used in the voice stream packet loss compensation method according to the present application;
fig. 3 is a flowchart illustrating a voice stream packet loss compensation method according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a voice stream packet loss compensation method according to another embodiment of the present application;
FIG. 5 is a flow chart illustrating accessing a subsequent voice frame into a voice stream according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of acquiring acoustic features of a sequence of speech frames in an embodiment of the present application;
FIG. 7 is a schematic flow chart illustrating a process of obtaining comprehensive feature information according to acoustic features in an embodiment of the present application;
FIG. 8 is a schematic flow chart illustrating a process of generating a subsequent speech frame according to the comprehensive characteristic information in the embodiment of the present application;
FIG. 9 is a flowchart illustrating the training of a vocoder according to an embodiment of the present application;
FIG. 10 is a flow chart illustrating the training of a vocoder using a mask according to an embodiment of the present application;
fig. 11 is a schematic block diagram of a voice stream packet loss compensation apparatus according to the present application;
fig. 12 is a schematic structural diagram of a voice stream packet loss compensation apparatus used in the present application.
Detailed Description
The models referred to, or possibly referred to, in this application include traditional machine learning models and deep learning models. Unless explicitly stated otherwise, they can be deployed on a remote server and invoked remotely from a client, or deployed on a client with sufficient device capability for direct invocation.
Those skilled in the art will appreciate that, although the various methods of the present application are described based on the same concept so as to be common to one another, they may be performed independently unless otherwise specified. Likewise, every embodiment disclosed in this application is proposed based on the same inventive concept; therefore, concepts expressed identically, and concepts whose expressions differ only for convenience, should be understood equally.
Referring to fig. 1, a network architecture adopted by an exemplary application scenario of the present application may be used to deploy a voice call service that supports real-time voice communication; the encoding and decoding of its voice streams may be implemented by running a computer program product obtained according to any one of the embodiments of the present application. The application server 81 shown in fig. 1 may be used to support the operation of the voice call service, and the media server 82 may be used to decode and re-encode the voice streams pushed by each user so as to relay them. Terminal devices such as the computer 83 and the mobile phone 84 are generally provided as clients for end users and may be used to send or receive voice streams. In addition, when a voice stream needs to be encoded at a terminal device, the computer program product obtained in the embodiments of the present application can also be deployed at the terminal device, so that the method of any one of the embodiments can be applied to perform packet loss compensation on a received or transmitted voice stream. The application scenarios disclosed above are only examples; the voice stream packet loss compensation method of the present application is applicable to all scenarios that require packet loss compensation for a voice stream, for example packet loss compensation for a voice stream in a live broadcast service scenario.
Fig. 2 shows the network architecture of an exemplary vocoder employed in the present application. The vocoder includes a conditional network and a recurrent network: the conditional network is mainly used for feature extraction from the voice data in the voice stream, and the recurrent network is mainly used for generating the subsequent voice frames of that voice data which are needed for packet loss compensation.
Within the conditional network, a residual network extracts global feature information of the voice data, and an upsampling network extracts local feature information of the voice data at a plurality of scales using different scaling factors; a splicing layer then concatenates the global feature information and the local feature information into comprehensive feature information. Extraction of deep voice information is thereby realized, and the comprehensive feature information provides a feature representation of slowly varying information such as phonemes and prosody in the voice data. The global feature information produced by the residual network is additionally routed along multiple paths to provide reference information while the recurrent network processes the comprehensive feature information.
The recurrent network is implemented as an RNN (Recurrent Neural Network). Internally it uses two unidirectional gated recurrent units (GRU, Gated Recurrent Unit); each gated recurrent unit processes the comprehensive feature information to obtain corresponding deep semantic information, which is concatenated with the global feature information output by the residual network and then fed into the next node for processing. A classification network at the end of the recurrent network performs classification mapping on the combination of the gated recurrent unit output and the global feature information so as to reconstruct the subsequent voice frame.
In one embodiment, the vocoder may be a WaveRNN (a recurrent neural network vocoder) or a variant thereof such as SC-WaveRNN (Speaker Conditional WaveRNN). WaveRNN is a recurrent-network model suited to processing audio data in sequence form; it was designed for high-speed sequence generation, and its authors used techniques such as model simplification, sparsification and parallel sequence generation to significantly increase generation speed, to the point that real-time speech synthesis can even be achieved on a CPU.
The vocoder can be trained to convergence in advance before use. Its network structure reduces overall complexity and has fewer parameters; because the global feature information supplies temporal context, the vocoder is easier to train to convergence and can make more accurate predictions.
Referring to fig. 3, in an embodiment of a method for compensating for packet loss of a voice stream according to an aspect of the present application, the method includes the following steps:
step S1100, determining a current voice frame missing a subsequent voice frame in a voice stream;
During voice communication, the voice stream passes through multiple equipment nodes along the whole transmission channel from the sender's terminal device, through the server, to the receiver's terminal device, and voice stream packet loss compensation can be performed at any designated node, so the voice stream packet loss compensation method can be applied there. Specifically, the server receives the voice stream submitted by the sender, decodes it to obtain the voice frames, and checks whether the timestamps carried by the received voice frames are continuous within a preset time range; in this way the loss of voice frames, i.e. packet loss, can be detected and packet loss compensation can be performed for the missing frames. A terminal device, in particular the receiving-side terminal device, likewise determines whether to perform packet loss compensation on the voice stream by detecting whether voice frames are missing from the received stream.
The voice stream is composed of a plurality of voice frames. In one embodiment, when a voice frame is detected as lost, the preceding voice frame, whose timestamp is earlier than that of the lost frame, may be determined as the current voice frame. For the current voice frame, the voice frame that should follow it in the timing is missing, and it is this missing frame, i.e. the subsequent voice frame, that needs to be compensated. The current voice frame and its subsequent voice frame are consecutive in the timing determined by the timestamps. In another embodiment, several subsequent voice frames after the current voice frame may be lost consecutively; in this case, a subsequent voice frame obtained through packet loss compensation can serve as the new current voice frame for recovering the next subsequent voice frame in the timing.
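By way of illustration only, and not as part of the claimed embodiments, a simplified Python sketch of the timestamp-continuity check described above might look as follows; the frame structure, the fixed 20 ms frame duration and the function name are assumptions made for the example.

```python
# Illustrative sketch (assumption, not the patent's reference code): scan a buffer of
# decoded frames sorted by timestamp and report every frame whose successor is missing.
def find_current_frames(frames, frame_ms=20):
    """frames: list of dicts like {'ts': timestamp_ms, 'pcm': samples}, sorted by 'ts'."""
    current = []
    for prev, nxt in zip(frames, frames[1:]):
        if nxt['ts'] - prev['ts'] > frame_ms:   # a gap means the frame after `prev` was lost
            current.append(prev)                # `prev` is a current frame missing its successor
    return current
```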
Step S1200, acquiring a voice frame sequence containing the current voice frame from the voice stream, and extracting acoustic features of the voice frame sequence;
the voice stream is generally stored in a memory buffer, and the voice frame in the voice stream can be obtained from the memory buffer. To form the input of the vocoder, in one embodiment, after determining that a current speech frame of a subsequent speech frame is missing, a plurality of speech frames whose timing is continuous are obtained from the speech stream based on the timestamp of the current speech frame, with the current speech frame as the last speech frame in timing, and a sequence of speech frames is formed from the speech frames. It is to be understood that in a sequence of speech frames, the current speech frame is the most time-sequenced speech frame, the other speech frames are all speech frames with a time stamp earlier than the time stamp of the current speech frame, and the speech frames are consecutive in time stamp.
For the voice frame sequence, the voice preprocessing means can be used to perform audio preprocessing on each voice frame in the voice frame sequence to obtain the acoustic characteristics of each voice frame, so as to construct the acoustic characteristics of the whole voice frame sequence.
The acoustic features play a role in describing relevant information of features with relatively stable style in the speech frames, such as phonemes, prosody and the like, and may be any one of logarithmic Mel spectrum information, time-frequency spectrum information and CQT filtering information.
Those skilled in the art will appreciate that the above acoustic features may be computed with a corresponding algorithm. In the process, the speech signal undergoes conventional processing such as pre-emphasis, framing and windowing before time-domain or frequency-domain analysis, i.e. speech signal analysis, is performed. The purpose of pre-emphasis is to boost the high-frequency part of the speech signal and smooth the spectrum; typically it is implemented with a first-order high-pass filter. Before analysis, the speech signal is divided into frames, each usually 20 ms long, with a 10 ms overlap between adjacent frames provided by the frame shift. Framing is implemented by windowing the speech signal; different window choices affect the result of the analysis, and it is common to apply a Hamming window function for the windowing operation.
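As a non-limiting illustration of the conventional preprocessing just described, the following Python sketch applies first-order pre-emphasis, 20 ms framing with a 10 ms shift, and a Hamming window; the 0.97 pre-emphasis coefficient and the 16 kHz sampling rate are common defaults rather than values prescribed by this disclosure.

```python
import numpy as np

def preprocess(signal, sr=16000, frame_ms=20, shift_ms=10, alpha=0.97):
    # first-order high-pass pre-emphasis to boost the high-frequency part
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    flen, fshift = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
    window = np.hamming(flen)                               # Hamming window for each frame
    n_frames = 1 + (len(emphasized) - flen) // fshift       # assumes at least one full frame
    frames = np.stack([emphasized[i * fshift: i * fshift + flen] * window
                       for i in range(n_frames)])
    return frames                                           # shape: (n_frames, frame_length)
```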
In one embodiment, for the time-frequency spectrum information, the time-domain voice data of each voice frame is pre-emphasized, framed and windowed, and then transformed into the frequency domain by a short-time Fourier transform (STFT) to obtain the data of a spectrogram, thereby forming the time-frequency spectrum information.
In another embodiment, the logarithmic Mel spectrum may be obtained by filtering the time-frequency spectrum information with a Mel-scale filter bank and then taking the logarithm of the filtered result.
In another embodiment, for the CQT filtering information, the Constant-Q Transform (CQT) uses a filter bank whose center frequencies are exponentially distributed and whose bandwidths differ, with the ratio of center frequency to bandwidth being a constant Q. Unlike the Fourier transform, the frequency axis of its spectrum is not linear but follows a log2 scale, and the filter window length can vary with the spectral line frequency for better performance.
Any of the above acoustic features can be used as input to the vocoder of the present application. To facilitate processing by the vocoder, in one embodiment the acoustic features are constructed according to a predetermined format: the acoustic features corresponding to each voice frame are organized into a row vector, and for the whole encoded voice frame sequence the row vectors of the voice frames are stacked vertically in time order, yielding a two-dimensional matrix that serves as the acoustic features of the entire voice frame sequence.
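For illustration, the per-frame row vectors mentioned above could be stacked into the two-dimensional input matrix as in the following minimal numpy sketch (the function name is hypothetical):

```python
import numpy as np

def build_feature_matrix(per_frame_features):
    """per_frame_features: 1-D feature vectors, one per speech frame, in timestamp order."""
    # each frame's features become one row; rows are stacked in time order
    return np.vstack([np.asarray(f).reshape(1, -1) for f in per_frame_features])
```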
Step S1300, extracting comprehensive feature information from the acoustic features by adopting a conditional network in a preset vocoder, wherein the comprehensive feature information comprises global feature information and local feature information of the acoustic features;
the vocoder learns the capability of generating the last speech frame in the speech frame sequence, namely the subsequent speech frame of the current speech frame according to the acoustic characteristics of the input speech frame sequence by pre-training to convergence.
When the acoustic features of a voice frame sequence are input into the vocoder, they enter the conditional network of the vocoder so that the deep semantic information within them can be extracted. Specifically, the acoustic features are split into two paths, which are input into a residual network and an upsampling network in the conditional network, respectively.
The residual network performs residual convolution operations on the acoustic features input to it and extracts their deep semantic information at the full scale, thereby obtaining the global feature information corresponding to the acoustic features. This global feature information is used not only within the conditional network but is also referenced by the recurrent network, for which it provides reference information.
The upsampling network, with scaling coefficients preconfigured for a plurality of scales, performs feature sampling on the acoustic features at the different scales, thereby obtaining local feature information corresponding to each scale.
In the conditional network, the global feature information obtained by the residual network and the local feature information obtained by the upsampling network are spliced into comprehensive feature information; the comprehensive feature information thus contains the relatively stable stylistic features, such as phonemes and prosody, obtained by extracting deep semantic information from the acoustic features at different scales.
Step S1400, generating the subsequent voice frame according to the comprehensive feature information by adopting a recurrent network in the vocoder and taking the global feature information as reference.
The recurrent network in the vocoder takes the comprehensive feature information output by the conditional network as input and uses two or more gated recurrent units (GRUs) for feature extraction, obtaining the feature information used to generate the subsequent voice frame. In this process, the output of each gated recurrent unit is further concatenated with the global feature information output by the residual network of the conditional network and used as the input of the next processing node. Finally, in the classification network at the end of the recurrent network, full connection is applied to the summary feature information formed from the output of the last gated recurrent unit and the global feature information so as to realize classification mapping, and the subsequent voice frame is constructed from the classification mapping result. This subsequent voice frame is the next voice frame obtained by performing packet loss compensation for the current voice frame.
According to the above embodiments, after the current voice frame lacking its subsequent voice frame is determined, the acoustic features of the voice frame sequence formed by the current voice frame and several preceding voice frames are used: the conditional network of the vocoder extracts global feature information and local feature information and combines them into comprehensive feature information, and the recurrent network of the vocoder then generates the subsequent voice frame from the comprehensive feature information while referring to the global feature information. Because the comprehensive feature information is obtained by the conditional network extracting salient features of different scales from the acoustic features of the voice frame sequence, slowly varying information such as phonemes and prosody is captured effectively; with the global feature information as reference, the recurrent network generates a subsequent voice frame that effectively inherits the phonemes, prosody and other information of the temporally adjacent preceding frames. The generated subsequent voice frame therefore has a high restoration degree, artifacts such as repeated sound and mechanical sound after packet loss compensation can be effectively avoided, and the compensated voice can be expected to obtain a high quality score.
On the basis of any of the above embodiments, referring to fig. 4, after the step of generating the subsequent speech frame according to the comprehensive characteristic information, the method includes:
and S1500, taking the generated subsequent voice frame as a new current voice frame, and continuing to iteratively generate a new subsequent voice frame until the generated subsequent voice frame reaches the maximum compensation quantity or all the subsequent voice frames which are continuously missed are completed.
When a plurality of voice frames are missing consecutively from the voice stream, after the vocoder performs packet loss compensation based on the current voice frame to obtain one subsequent voice frame, packet loss compensation for the next subsequent voice frame can be achieved by iterating steps S1100 to S1400 of the present application.
In particular, the subsequent voice frame recovered by the vocoder is appended, as the new current voice frame, to the end of the voice frame sequence, and in one embodiment the first frame of the sequence is also deleted so that the length of the sequence stays constant. Then, on the basis of the new voice frame sequence, the loop from step S1100 to step S1400 is executed again to obtain the next subsequent voice frame. By extension, a plurality of subsequent voice frames can be determined one after another, so that packet loss compensation for several missing subsequent voice frames is completed.
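The sliding-window iteration described above can be illustrated by the following hedged Python sketch, in which `vocoder_generate` and `extract_features` are placeholder names for the vocoder inference and the acoustic feature extraction:

```python
def compensate(frame_sequence, vocoder_generate, extract_features, max_frames=6):
    # frame_sequence: list of time-continuous voice frames ending with the current frame
    generated = []
    for _ in range(max_frames):                               # maximum compensation number
        features = extract_features(frame_sequence)
        next_frame = vocoder_generate(features)               # predicted subsequent voice frame
        generated.append(next_frame)
        frame_sequence = frame_sequence[1:] + [next_frame]    # slide the window forward
    return generated
```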
In one embodiment, the maximum compensation number of subsequent voice frames may be preset to control the loop iteration of the vocoder; when the iterations reach the maximum compensation number, the iteration is terminated, so that, starting from the original current voice frame, a number of time-continuous subsequent voice frames corresponding to the maximum compensation number is generated. Generally, the maximum compensation number can be set to 4 to 8, which keeps the small differences introduced by the compensated subsequent voice frames hard to perceive by the human ear and maintains excellent sound quality; for example, if the maximum compensation number is set to 6 and each voice frame is 20 ms long, 120 ms of subsequent voice frames can be generated correspondingly.
In another embodiment, the number of iterations needed to compensate all consecutively missing subsequent voice frames can be determined dynamically by detecting the total number of consecutively missing voice frames in the voice stream and using that total as the maximum compensation number.
In another embodiment, the maximum compensation number may be determined from a preset compensation duration together with the duration of a voice frame. In general, so that the change introduced by frame compensation remains hard to perceive acoustically, the compensation duration may be set to no more than 180 ms; the maximum compensation number is then obtained by dividing 180 ms by the frame duration, for example 20 ms, and rounding down, giving 180/20 = 9.
In another embodiment, the vocoder may take into account the duration of the missing voice frames in the voice stream when controlling the loop iteration. If the duration is short, for example 40 ms, the iteration may be terminated after 2 subsequent voice frames are generated; if the duration is longer, for example 180 ms, the maximum compensation number may be limited so that the iteration terminates after 6 subsequent voice frames are generated, and for the remaining missing 60 ms a silence replacement flag is inserted so that silence is kept during playback, thereby reasonably controlling the effective duration of packet loss compensation.
The various means adopted in the prior art readily produce mechanical sound and repeated sound once the compensation exceeds 40 ms. In contrast, actual measurements show that the present application can obtain good sound quality even when 120 ms of compensated voice frames are generated consecutively, avoiding degradation of the sound quality by mechanical sound and repeated sound.
On the basis of any of the above embodiments, referring to fig. 5, after the step of generating the subsequent voice frame according to the comprehensive feature information, specifically after step S1400, or after step S1400 has been executed for the last time through one or more iterations of step S1500, the method includes:
step S2100, splicing the generated subsequent voice frames to obtain a compensation frame sequence;
for the subsequent voice frames generated by the vocoder, no matter a single subsequent voice frame is generated or a plurality of subsequent voice frames are generated, the subsequent voice frames can be processed in a centralized way and sequentially spliced according to the time stamps to form a compensation frame sequence.
Step S2200, adjusting the volume corresponding to the compensation frame sequence to make it not exceed the volume of the voice frame sequence;
To unify the volume effect, a preset limiter, specifically a volume limiter, can be invoked to limit the volume of each subsequent voice frame in the compensation frame sequence. Taking the volume of the voice frame sequence as reference, any excessive volume in the subsequent voice frames is reduced so that the volume of the frames in the compensation frame sequence does not exceed that of the frames in the voice frame sequence; the volume of the voice frames obtained by packet loss compensation is thus kept at a reasonable amplitude and the consistency of the voice quality is maintained.
Step S2300, smoothly connecting the compensation frame sequence to the voice stream where the voice frame sequence is located.
Connecting the compensation frame sequence into the voice stream completes the packet loss compensation of the voice stream. So that the compensation frame sequence obtained by the vocoder remains acoustically smooth after it is spliced into the voice stream, it may be smoothly connected in a fade-in/fade-out manner.
In one embodiment, after the compensation frame sequence is connected into the voice stream, fading out starts at 20 ms, i.e. from the second subsequent voice frame, and continues until complete silence is reached 20 ms after the end of the packet loss or 120 ms after its start; in addition, the voice frames of the voice stream following the compensation frame sequence are faded in within a 20 ms time window. The time settings for the fade-out and fade-in can be adjusted flexibly according to actual requirements and are not limited to the above example.
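A simplified numpy illustration of steps S2100 to S2300 is given below; the gain logic and the fixed 20 ms fade window are illustrative simplifications of the limiter and fade-in/fade-out behaviour described above, not a reference implementation.

```python
import numpy as np

def splice_compensation(reference, compensation, following, sr=16000, fade_ms=20):
    reference = np.asarray(reference, dtype=float)
    compensation = np.asarray(compensation, dtype=float).copy()
    following = np.asarray(following, dtype=float).copy()
    ref_peak = np.max(np.abs(reference)) + 1e-9
    comp_peak = np.max(np.abs(compensation)) + 1e-9
    if comp_peak > ref_peak:                      # keep the compensated volume within the reference level
        compensation *= ref_peak / comp_peak
    n = int(sr * fade_ms / 1000)
    compensation[-n:] *= np.linspace(1.0, 0.0, n) # fade the compensated tail out
    following[:n] *= np.linspace(0.0, 1.0, n)     # fade the received audio back in
    return np.concatenate([reference, compensation, following])
```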
According to the above embodiments, it can be understood that the subsequent speech frame recovered by the vocoder implementing packet loss compensation can be smoothly accessed into the original speech stream, so that the original speech stream can keep smoothness in hearing and obtain good sound quality.
On the basis of any of the above embodiments, please refer to fig. 6, the obtaining a speech frame sequence including a current speech frame from a speech stream and extracting an acoustic feature of the speech frame sequence includes:
step S1210, obtaining a current voice frame and a previous voice frame with continuous time sequence from a voice stream according to preset duration to form a voice frame sequence;
When a missing voice frame after the current voice frame is detected, the voice frame sequence containing the current voice frame needs to be determined. To fit the unified specification of the vocoder, a preset duration can be used: the time-continuous voice frames corresponding to this preset duration are obtained from the memory buffer of the voice stream, with the current voice frame as the last of these frames.
The preset duration may be set as needed; in one embodiment it may be set between 200 ms and 400 ms, for example 300 ms. Such a setting obtains enough voice frames to provide sufficient audio information and effectively helps the vocoder generate the subsequent voice frames.
Step S1220, performing short-time Fourier transform on the voice frame sequence to obtain frequency spectrum information;
the speech frame sequence already contains sufficient speech frame data, and therefore, a Short-Term Fourier Transform (STFT) may be performed on the speech frame sequence to convert it from time domain information to frequency domain information, thereby obtaining spectral information corresponding to the speech frame sequence.
In one embodiment, the short-time Fourier transform is performed using the following equation:
\mathrm{STFT}\{x(t)\}(\tau,\omega) = \int_{-\infty}^{\infty} x(t)\, w(t-\tau)\, e^{-j\omega t}\, dt
where x(t) is the input signal, i.e. the voice frame sequence, and w(t) is a window function, for which the Hamming window is recommended; STFT{x(t)}(τ, ω) denotes the short-time Fourier transform of x(t)w(t − τ).
Step S1230, applying a mel filter bank to convert the spectrum information into a logarithmic mel spectrum as the acoustic feature of the sequence of voice frames.
The frequency spectrum information of the voice frame sequence is on a linear scale. Because the loudness perceived by the human ear is not linearly proportional to the frequency of the sound, the Mel frequency scale is better suited to the auditory characteristics of the human ear. A Mel filter bank is therefore applied to convert the spectrum information to the Mel scale, and the logarithm is then taken to obtain the corresponding logarithmic Mel spectrum, which captures the contour information in the acoustic features. The logarithmic transformation may apply the following formula:
\hat{S} = \log(S)

wherein S represents the spectrum information obtained by conversion to the Mel scale, and \hat{S} represents the logarithmic Mel spectrum obtained after the conversion.
According to the above embodiments, the speech frame sequence extracted from the speech stream is converted from the time domain to the frequency domain for representation, and the logarithm is obtained to obtain the corresponding logarithm mel spectrum, which can effectively represent the contour features in the speech frame sequence, so that the preliminary effective representation of the audio feature information in the speech frame sequence can be realized as the acoustic features of the speech frame sequence.
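By way of example only, steps S1220 and S1230 could be realized with a toolkit such as librosa, as in the following sketch; the FFT size, hop length and number of Mel bands are illustrative parameters, not values prescribed by this disclosure.

```python
import numpy as np
import librosa

def log_mel_features(samples, sr=16000, n_fft=512, hop=160, n_mels=80):
    y = np.asarray(samples, dtype=np.float32)     # concatenated samples of the voice frame sequence
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop,
                                         window='hamming', n_mels=n_mels)
    return np.log(mel + 1e-6)                     # log-Mel spectrum, shape (n_mels, time)
```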
On the basis of any of the above embodiments, referring to fig. 7, the extracting comprehensive feature information of acoustic features by using a conditional network in a preset vocoder includes:
step S1310, extracting the global feature information from the acoustic features based on a residual network in the conditional network;
in this embodiment, SC-WaveRNN may be used as a prototype to obtain the network structure of the vocoder in the present application, and as shown in fig. 2, a Speaker Encoder in the prototype network may be omitted, as compared with the prototype network provided by the original author, and of course, in another embodiment, the Encoder may also be used.
The speaker encoder in the SC-WaveRNN prototype network is not necessary from the perspective of the speech synthesis performed in the present application. The speaker encoder is an important contribution of the SC-WaveRNN paper, and its authors measured its positive gain in all cases with PESQ (Perceptual Evaluation of Speech Quality); measured with the same index, however, the contribution of a speaker encoder is not evident for the task of packet loss compensation. The reason is that SC-WaveRNN targets TTS (Text to Speech), where the model input contains a complete Mel spectrum and it matters that the speaker encoder maps the Mel spectrum to speaker characteristics; in the packet loss compensation scenario of the present application, the speaker characteristics can only affect the first frame of the compensated voice, while the added speaker encoder contains an LSTM (Long Short-Term Memory), so the computational complexity is high and the benefit is not obvious. Thus, those skilled in the art may configure the vocoder with or without a speaker encoder in accordance with the principles disclosed herein.
The acoustic features of the voice frame sequence obtained as described above, from which the subsequent voice frame is to be generated, are input into the conditional network of the vocoder, with one path fed into the residual network of the conditional network. The residual network performs residual convolution operations on the acoustic features and extracts their deep semantic information at the global scale of the voice frame sequence, thereby obtaining the corresponding global feature information and realizing a global representation of the acoustic features of the voice frame sequence.
Step S1320, multi-scale sampling is carried out on the acoustic features based on an up-sampling network in a conditional network, and the local feature information is obtained;
The acoustic features of the voice frame sequence are input along the second path into the upsampling network of the conditional network. The upsampling network is preset with a plurality of scaling scales, for example three, and extracts deep semantic information from the acoustic features at each of these scales while the information granularity is progressively refined; local feature information corresponding to the different granularities is thereby obtained, realizing a local representation of the acoustic features of the voice frame sequence.
And step S1330, performing feature splicing on the global feature information and the local feature information based on a splicing layer in the conditional network to obtain the comprehensive feature information.
Finally, a splicing layer arranged in the conditional network performs feature splicing on the global feature information obtained by the residual network and the local feature information obtained by the upsampling network to construct the comprehensive feature information. The comprehensive feature information contains the global information of the acoustic features and their local information at different fine scales; it can comprehensively and completely represent the important characteristics of the acoustic features and helps guide the recurrent network to generate an effective subsequent voice frame.
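A minimal PyTorch-style sketch of such a conditional network is shown below for illustration; the channel widths, the number of residual blocks and the upsampling scale factors are assumptions, and the module is not the reference implementation of this disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                           # x: (batch, channels, frames)
        return x + self.conv2(F.relu(self.conv1(x)))

class ConditionalNetwork(nn.Module):
    def __init__(self, n_mels=80, channels=128, scales=(2, 4, 5)):
        super().__init__()
        self.inp = nn.Conv1d(n_mels, channels, kernel_size=1)
        self.residual = nn.Sequential(*[ResidualBlock(channels) for _ in range(3)])
        self.scales = scales                        # illustrative upsampling scale factors

    def forward(self, mels):                        # mels: (batch, n_mels, frames)
        x = self.inp(mels)
        g = self.residual(x)                        # global feature information
        locals_ = [F.interpolate(x, scale_factor=s, mode='nearest') for s in self.scales]
        target_len = max(f.shape[-1] for f in locals_)
        g_up = F.interpolate(g, size=target_len, mode='nearest')
        locals_up = [F.interpolate(f, size=target_len, mode='nearest') for f in locals_]
        combined = torch.cat([g_up] + locals_up, dim=1)   # comprehensive feature information
        return combined, g_up                       # g_up is reused as the global reference
```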
According to the above embodiments, it can be understood that, by synthesizing the important global and local characteristics of the acoustic features, the vocoder achieves an effective feature representation of the voice frame sequence, which is the basis for generating the subsequent voice frames; adopting a vocoder with this preferred network structure improves its working efficiency and yields good gains.
On the basis of any of the above embodiments, referring to fig. 8, the generating the subsequent voice frame according to the comprehensive feature information by using the recurrent network in the vocoder and taking the global feature information as reference includes:
step S1410, after the comprehensive feature information is fully connected, performing context collation through a plurality of gated recurrent units with reference to the global feature information, and then fully connecting the output to obtain predicted feature information;
First, the comprehensive feature information output from the conditional network passes through a first fully connected layer in the recurrent network to further synthesize the features.
The fully connected comprehensive feature information is then input into a first gated recurrent unit for feature extraction, so that important features are selected within it and first gating feature information is obtained. The first gating feature information is further concatenated with the global feature information obtained by the residual network and then input into a second gated recurrent unit. The second gated recurrent unit likewise performs feature extraction on its input to obtain second gating feature information, which is concatenated with the global feature information obtained by the residual network and then output.
In one embodiment, the output obtained by concatenating the second gating feature information with the global feature information may be used directly as the predicted feature information. In another embodiment, the feature information obtained by that concatenation is further passed through a fully connected layer and then concatenated once more with the global feature information obtained by the residual network to obtain the predicted feature information. At each step, repeatedly referencing the global feature information obtained by the residual network provides context, which helps to accurately extract the important features in the acoustic features and makes the subsequent voice frames generated by the recurrent network more effective.
Step S1420, based on the classification network, the prediction characteristic information is classified and mapped to obtain the subsequent voice frame.
The predicted feature information is input into the classification network preset at the end of the recurrent network, and the classification mapping of the classification network determines the probability of each bit required for constructing the subsequent voice frame, so that the subsequent voice frame is constructed.
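For illustration, the recurrent network and its classification mapping could be sketched as follows in PyTorch; the hidden size, the 256-way output and the exact concatenation points are assumptions consistent with the description above rather than a reference implementation.

```python
import torch
import torch.nn as nn

class RecurrentNetwork(nn.Module):
    def __init__(self, feat_dim, global_dim, hidden=256, n_classes=256):
        super().__init__()
        self.fc_in = nn.Linear(feat_dim, hidden)
        self.gru1 = nn.GRU(hidden, hidden, batch_first=True)
        self.gru2 = nn.GRU(hidden + global_dim, hidden, batch_first=True)
        self.fc_out = nn.Linear(hidden + global_dim, n_classes)   # classification network

    def forward(self, combined, global_feat):
        # combined: (batch, steps, feat_dim); global_feat: (batch, steps, global_dim),
        # i.e. the conditional network outputs transposed to a time-major layout
        x = torch.relu(self.fc_in(combined))
        x, _ = self.gru1(x)
        x = torch.cat([x, global_feat], dim=-1)      # re-inject the global reference
        x, _ = self.gru2(x)
        x = torch.cat([x, global_feat], dim=-1)
        return self.fc_out(x)                        # logits for the classification mapping
```

In use, the logits produced by `fc_out` would be turned into a sampled output value by the temperature-based sampling described next.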
In one embodiment, in the classification network, in the process of constructing the subsequent speech frame according to the prediction feature information, the following temperature coefficient-based formula is applied for audio sampling:
P_i = \frac{\exp(y_i / T)}{\sum_j \exp(y_j / T)}

wherein T is the sampling temperature, y_i is the predicted value for the i-th candidate, and P_i is the probability of the i-th bit of the subsequent speech frame.
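As an illustration of such temperature-controlled sampling, the following numpy sketch scales the predicted values by 1/T before normalization and draws one output value from the resulting distribution; it is an example consistent with the formula above, not the reference code.

```python
import numpy as np

def sample_with_temperature(logits, T=1.0, rng=None):
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / T
    p = np.exp(scaled - scaled.max())          # subtract the max for numerical stability
    p /= p.sum()
    return rng.choice(len(p), p=p)             # index of the sampled output value
```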
According to the above embodiments, it can be understood that the recurrent network, guided by the global and local features of the acoustic features from the output of the conditional network, can effectively generate the subsequent voice frame of the current voice frame, thereby realizing effective packet loss compensation for the voice stream.
On the basis of any of the above embodiments, referring to fig. 9, before the step of determining a current speech frame missing a subsequent speech frame in a speech stream, the method includes:
s0100, performing first-stage training on the vocoder by adopting a first class of training samples in the data set, and training the vocoder to a convergence state;
two types of training samples, namely a first type of training sample and a second type of training sample, are prepared, and the two types of training samples can be stored in the same data set or different data sets.
The first class of training samples is mainly used for pre-training the vocoder, and the second class of training samples is mainly used for fine-tuning it; therefore, the first class of training samples may use material with relatively relaxed requirements on environmental noise, while the second class of training samples should use material with clearer foreground voice.
In one embodiment, the first type of training sample may be a public data set, where one public data set selected in actual measurement training in the present application is a voice data set that includes multiple languages and is formed by aggregating audio data provided by tens of thousands of contributors, where each voice data may be used as the first type of training sample.
In one embodiment, the second type of training sample can be acquired automatically from online user data. The online user data adopted in the actual-measurement training of the present application comprises audio data sampling segments from tens of thousands of online users; after background noise is removed from each original segment, a pure voice segment is extracted by Voice Activity Detection (VAD), finally forming training samples of 15-30 s.
The first class training sample and the second class training sample can be subjected to voice preprocessing in advance to determine each voice frame in the first class training sample and the second class training sample.
When the vocoder is trained in the first stage, each iteration takes a single first-class training sample from the data set and obtains the acoustic features of a voice frame sequence sampled from it, the frames being taken in units of, for example, 20 ms. The acoustic features of the voice frame sequence are input into the vocoder; the conditional network extracts from them comprehensive feature information representing information such as phonemes and prosody, and the comprehensive feature information is then input into the recurrent network to predict a subsequent voice frame. The voice frame that follows the last frame of the voice frame sequence is used as the supervision label, and a loss value of the predicted subsequent voice frame relative to this next voice frame is calculated. Whether the vocoder has converged is then decided according to the loss value; if not, back propagation is performed according to the loss value, the weight parameters of the conditional network and the recurrent network are updated by gradients, and the next first-class training sample is taken from the data set to continue iterative training until the vocoder converges.
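A condensed PyTorch-style sketch of one such training iteration is given below for illustration; `vocoder`, the feature tensors and the cross-entropy objective are assumptions chosen to match the description, not the disclosed reference code.

```python
import torch.nn.functional as F

def train_step(vocoder, optimizer, features, target_next_frame):
    # features: acoustic features of the sampled voice frame sequence
    # target_next_frame: class indices of the true next frame (the supervision label)
    logits = vocoder(features)                   # predicted subsequent frame, (batch, steps, classes)
    loss = F.cross_entropy(logits.transpose(1, 2), target_next_frame)
    optimizer.zero_grad()
    loss.backward()                              # back propagation according to the loss value
    optimizer.step()                             # gradient update of both sub-networks
    return loss.item()
```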
It can be seen that, during the pre-training of the vocoder, the voice frame in the first-class training sample whose timing immediately follows the last frame of the sampled voice frame sequence is used as the supervision label for the subsequent voice frame generated from that sequence, so that the loss value of the generated frame can be calculated; effective supervision of the vocoder's pre-training is thereby achieved.
Step S0200, using the second class of training sample in the data set to perform the second stage training to the vocoder, and training the vocoder to a convergence state;
when the vocoder is trained in the second stage, a single second class training sample is called from a data set to perform iterative training once to obtain the acoustic features of a voice frame sequence, the length of the voice frame sequence can be 20ms as a unit, the acoustic features of the voice frame sequence are input into the vocoder, an activation function is generated through a conditional network, comprehensive feature information representing information such as phonemes, prosody and the like is extracted from the acoustic features, the comprehensive feature information is input into a cyclic network to predict a subsequent voice frame, the loss value of the subsequent voice frame relative to the next voice frame is calculated by using the next voice frame of the last voice frame of the voice frame sequence as a supervision tag, whether the vocoder has converged or not is determined according to the loss value, when the vocoder has not converged, back propagation is performed according to the loss value, the weight parameters of the conditional network and the cyclic network are updated in gradient, and the next second class training sample is called from the data set to perform iterative training on the vocoder continuously until the vocoder reaches the convergence.
It can be seen that, in this training stage of the vocoder, the voice frame in the second-class training sample whose timing immediately follows the last frame of the sampled voice frame sequence is used as the supervision label for the subsequent voice frame generated from that sequence to calculate its loss value, thereby implementing effective supervision of the training process.
In the second-stage training, the vocoder is trained at a smaller learning rate than in the first-stage training, so that the parameters of the conditional network better match the data distribution of online users. Because the second class of training samples provided by online users consists of pure human-voice segments, training the vocoder on these samples is particularly effective and helps to further improve its ability to generate subsequent speech frames. An illustrative per-stage optimizer setup is sketched below.
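The following sketch only illustrates the idea of stage-dependent learning rates; the Adam optimizer and the concrete values are assumptions, not taken from the patent, and cond_net / recur_net are hypothetical module names.

    import torch

    params = list(cond_net.parameters()) + list(recur_net.parameters())
    stage1_optimizer = torch.optim.Adam(params, lr=1e-3)  # first stage: generic speech corpus
    stage2_optimizer = torch.optim.Adam(params, lr=1e-4)  # second stage: smaller rate, online-user clean speech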
Step S0300: freezing the weights of the conditional network of the vocoder, performing third-stage training of the vocoder using the second class of training samples in the data set, and training the vocoder to a convergence state so as to adjust the weights of the cyclic network in the vocoder;
the third-stage training is, in effect, a fine-tuning step applied after the vocoder has been pre-trained in the first two stages: it further adjusts the weights of the cyclic network so that the vocoder can efficiently produce subsequent speech frames for a given speech frame sequence.
To this end, before the third-stage training begins, the conditional network is considered to have met the desired requirements and its weights are frozen, so that they are not updated by gradients during the third stage. The weights of the cyclic network remain learnable and will be further revised during the third stage. The third-stage training of the vocoder can then begin, as sketched below.
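A minimal sketch of this freezing step, again assuming PyTorch modules named cond_net and recur_net (hypothetical names) and an illustrative learning rate:

    import torch

    for p in cond_net.parameters():
        p.requires_grad = False                            # weights solidified: no gradient updates

    stage3_optimizer = torch.optim.Adam(recur_net.parameters(), lr=1e-4)  # only the cyclic network is revised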
When the third-stage training is carried out on the vocoder, a single second-class training sample is fetched from the data set for each training iteration and the acoustic features of its speech frame sequence are obtained; the speech frames may again be taken in units of 20 ms. The acoustic features are input into the vocoder, where the conditional network generates activations and extracts comprehensive feature information representing information such as phonemes and prosody, and the comprehensive feature information is then input into the cyclic network to predict a subsequent speech frame. The speech frame immediately following the last speech frame of the speech frame sequence is used as the supervision label to calculate the loss value of the predicted subsequent speech frame. Whether the vocoder has converged is then decided from the loss value; if it has not, back propagation is performed according to the loss value, the weight parameters are updated by gradients (in this stage only those of the cyclic network, since the conditional network is frozen), and the next second-class training sample is fetched from the data set to continue the iterative training until the vocoder converges.
It can be seen that, in this stage as well, the speech frame in the second-class training sample whose time sequence immediately follows the sampled speech frame sequence is used as the supervision label for the subsequent speech frame generated from that sequence and for calculating its loss value, thereby providing effective supervision of the vocoder's training.
In the third-stage training, the vocoder is likewise trained at a smaller learning rate than in the first-stage training, so that its learnable parameters remain consistent with the data distribution of online users. Because the second class of training samples provided by online users consists of pure human-voice segments, training the vocoder on these samples further improves its ability to generate subsequent speech frames.
Throughout the third-stage training, the weights of the conditional network remain unchanged while the weights of the cyclic network are continuously revised until the convergence state is reached, at which point the entire training process of the vocoder can be terminated.
According to the above embodiments, it can be understood that the vocoder of the present application is trained in multiple stages: the first class of training samples is used for pre-training in the first stage; the second class of training samples is used in the second stage to refine the pre-training result at a smaller learning rate; and in the third stage, with the weights of the conditional network fixed, the second class of training samples is used to improve the ability of the cyclic network to generate subsequent speech frames. This comprehensive training yields a vocoder that can generate valid subsequent speech frames for a given speech frame sequence. The vocoder obtained by this training process has a small number of parameters and requires few floating-point operations, can perform inference in real time on a mobile terminal, and is therefore particularly suitable for deployment in terminals such as mobile phones and computers.
In another embodiment, the vocoder may be trained adversarially: the conditional network serves as the generator and the cyclic network as the discriminator, the two networks are assembled into a generative adversarial architecture, and the resulting adversarial vocoder is trained to a convergence state using the first and second classes of training samples in the data set.
With reference to fig. 10, based on any of the above embodiments, the third stage training of the vocoder using the second type of training samples in the data set includes:
step S0210: replacing, with a mask representation, the acoustic features of a preset number of following speech frames whose time sequence continues from the last speech frame of the speech frame sequence sampled from each second-class training sample;
during the third-stage training, the vocoder can be trained according to the number of subsequent speech frames it is expected to generate for a speech frame sequence, i.e., the maximum compensation number. To this end, in each training iteration a speech frame sequence is obtained from the fetched second-class training sample, and the acoustic features of a number of time-consecutive speech frames following the last speech frame of the sequence are replaced by a mask representation, the number being determined by the preset maximum compensation number. For example, if the maximum compensation number is set to 6 speech frames according to a duration of 120 ms, the acoustic features of the 6 time-consecutive speech frames following the speech frame sequence are replaced by the mask representation. The mask representation can be set flexibly, for example by replacing all feature values of the corresponding acoustic features with a constant such as 1 or 0.5, as sketched below.
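The sketch below illustrates this masking step; the tensor layout, the mask value of 0.5 and the default of 6 frames are illustrative assumptions.

    import torch

    def mask_subsequent_frames(acoustic, max_compensation=6, mask_value=0.5):
        """Replace the acoustic features of the trailing following frames with a mask.

        acoustic: [batch, num_frames, num_mel_bins]; the last max_compensation frames
        follow the speech frame sequence and are the ones the vocoder must predict.
        """
        masked = acoustic.clone()
        masked[:, -max_compensation:, :] = mask_value   # e.g. 6 frames for a 120 ms / 20 ms budget
        return masked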
Step S0220: the vocoder iteratively generates, based on the speech frame sequence of the training sample, subsequent speech frames corresponding to the masked following speech frames;
for a second-class training sample, the vocoder first generates a subsequent speech frame from the acoustic features of the initial speech frame sequence of the sample. Each generated subsequent speech frame is then appended as the last speech frame of the sequence to obtain a new speech frame sequence, the acoustic features of the new sequence are extracted, and the next subsequent speech frame is generated. The iteration continues until the number of generated subsequent speech frames reaches the maximum compensation number, as sketched below.
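The sketch below assumes a fixed-length sliding window over the speech frame sequence and hypothetical helpers cond_net, recur_net and extract_features; none of these names come from the patent.

    import torch

    def generate_subsequent_frames(cond_net, recur_net, frame_seq, extract_features, max_compensation=6):
        """Iteratively generate subsequent frames, feeding each one back into the sequence.

        frame_seq: [batch, frames, samples]; extract_features maps it to acoustic features.
        """
        generated = []
        for _ in range(max_compensation):
            acoustic = extract_features(frame_seq)     # acoustic features of the current sequence
            cond_info = cond_net(acoustic)             # comprehensive feature information
            next_frame = recur_net(cond_info)          # predicted subsequent speech frame [batch, samples]
            generated.append(next_frame)
            # the generated frame becomes the last frame of the new speech frame sequence
            frame_seq = torch.cat([frame_seq[:, 1:, :], next_frame.unsqueeze(1)], dim=1)
        return generated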
Step S0230: calculating, from the plurality of following speech frames in the second-class training sample, the loss values of the corresponding generated subsequent speech frames, and revising the weights of the cyclic network in the vocoder according to the loss values.
While the vocoder generates the plurality of subsequent speech frames for each second-class training sample, the following speech frame of the training sample whose timestamp corresponds to each generated subsequent speech frame is taken as that frame's supervision label, and its loss value is calculated. Back propagation is then performed through the vocoder according to the loss value and the weights of the cyclic network are updated by gradients, while the conditional network does not participate in the gradient update because its weights have been frozen.
According to the above embodiments, it can be readily understood that replacing the acoustic features of the following speech frames in the second-class training samples with a mask representation during the third-stage training guides the cyclic network to generate the subsequent speech frames corresponding to those following frames, and successive iterations generate a plurality of such frames according to the preset maximum compensation number. The vocoder thereby learns to compensate several consecutive speech frames, improving the efficiency of its packet loss compensation.
Referring to fig. 11, an embodiment of a speech stream packet loss compensation apparatus according to an aspect of the present disclosure includes a current frame processing module 1100, a sequence processing module 1200, a feature construction module 1300, and a speech frame generation module 1400, where the current frame processing module 1100 is configured to determine a current speech frame missing a subsequent speech frame in a speech stream; a sequence processing module 1200 configured to obtain a speech frame sequence including a current speech frame from a speech stream, and extract an acoustic feature of the speech frame sequence; a feature construction module 1300 configured to extract global feature information and local feature information of the acoustic features respectively by using a conditional network in a preset vocoder, and construct the global feature information and the local feature information as comprehensive feature information; the speech frame generating module 1400 is configured to generate the subsequent speech frame according to the comprehensive feature information by using a loop network in the vocoder and taking the global feature information as a reference.
On the basis of any of the above embodiments, the speech frame generating module 1400 includes an iteration decision module configured to take each generated subsequent speech frame as the new current speech frame and continue generating new subsequent speech frames iteratively until the number of generated subsequent speech frames reaches the maximum compensation quantity or all of the consecutively missing subsequent speech frames have been complemented.
On the basis of any of the above embodiments, the speech frame generating module 1400 includes: a frame splicing module configured to splice the generated subsequent speech frames to obtain a compensation frame sequence; a volume control module configured to adjust the volume of the compensation frame sequence so that it does not exceed the volume of the speech frame sequence; and a compensation access module configured to smoothly attach the compensation frame sequence to the voice stream in which the speech frame sequence is located. A sketch of this post-processing is given below.
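The sketch below is a minimal numpy illustration of this post-processing. The peak-based volume cap and the 5 ms linear crossfade are assumptions; the patent only requires that the compensation be no louder than the speech frame sequence and be attached smoothly.

    import numpy as np

    def splice_and_attach(reference, generated_frames, sample_rate=16000, fade_ms=5):
        """Splice generated frames, cap their volume, and crossfade them onto the stream tail.

        reference:        1-D waveform of the existing speech frame sequence.
        generated_frames: list of 1-D waveforms for the compensated frames.
        """
        comp = np.concatenate(generated_frames)                 # compensation frame sequence

        # volume control: keep the compensation no louder than the reference sequence
        ref_peak, comp_peak = np.max(np.abs(reference)), np.max(np.abs(comp))
        if comp_peak > ref_peak and comp_peak > 0:
            comp = comp * (ref_peak / comp_peak)

        # smooth access: short linear crossfade between the reference tail and the compensation head
        n = int(sample_rate * fade_ms / 1000)
        fade = np.linspace(0.0, 1.0, n)
        joined = reference[-n:] * (1.0 - fade) + comp[:n] * fade
        return np.concatenate([reference[:-n], joined, comp[n:]])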
On the basis of any of the above embodiments, the sequence processing module 1200 includes: a sequence acquisition unit configured to acquire, according to a preset duration, the current speech frame and the time-sequence-consecutive preceding speech frames from the voice stream to form the speech frame sequence; a spectrum transformation unit configured to perform a short-time Fourier transform on the speech frame sequence to obtain spectrum information; and a spectrum conversion unit configured to apply a Mel filter bank to convert the spectrum information into a logarithmic Mel spectrum serving as the acoustic features of the speech frame sequence. A sketch of this feature extraction follows.
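The sketch below assumes librosa (the patent does not name a library); the frame and filter-bank sizes are illustrative, with a hop of 320 samples corresponding to 20 ms frames at 16 kHz.

    import numpy as np
    import librosa

    def log_mel_features(frame_sequence, sr=16000, n_fft=512, hop=320, n_mels=80):
        """Short-time Fourier transform, Mel filter bank, then logarithm."""
        spec = np.abs(librosa.stft(frame_sequence, n_fft=n_fft, hop_length=hop))  # spectrum information
        mel = librosa.feature.melspectrogram(S=spec ** 2, sr=sr, n_mels=n_mels)   # Mel filter bank
        return np.log(mel + 1e-6)                                                 # logarithmic Mel spectrum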
On the basis of any of the above embodiments, the feature construction module 1300 includes: a global extraction unit configured to process the acoustic features with a residual network in the conditional network to obtain the global feature information; a local extraction unit configured to perform multi-scale sampling of the acoustic features with an up-sampling network in the conditional network to obtain the local feature information; and a feature splicing unit configured to splice the global feature information and the local feature information through a splicing layer in the conditional network to obtain the comprehensive feature information. A simplified sketch of such a conditional network follows.
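The sketch below is a simplified PyTorch illustration of such a conditional network; all layer sizes are illustrative, and the single up-sampling layer stands in for the multi-scale sampling described above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConditionalNetwork(nn.Module):
        """Residual branch for global features, up-sampling branch for local features, then concatenation."""

        def __init__(self, n_mels=80, hidden=128, upsample_factor=4):
            super().__init__()
            self.res_block = nn.Sequential(                      # residual branch (global feature information)
                nn.Conv1d(n_mels, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv1d(hidden, n_mels, 3, padding=1))
            self.upsample = nn.Upsample(scale_factor=upsample_factor)  # up-sampling branch (local feature information)

        def forward(self, acoustic):                             # acoustic: [batch, n_mels, frames]
            global_info = acoustic + self.res_block(acoustic)    # residual connection
            local_info = self.upsample(acoustic)
            # align time lengths before the splicing (concatenation) layer
            global_info = F.interpolate(global_info, size=local_info.shape[-1])
            return torch.cat([global_info, local_info], dim=1)   # comprehensive feature information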
On the basis of any of the above embodiments, the speech frame generating module 1400 includes: a prediction execution unit configured to fully connect the comprehensive feature information, perform context arrangement through a plurality of gated recurrent units with reference to the global feature information, and then fully connect and output the result to obtain prediction feature information; and a speech frame generating unit configured to perform classification mapping of the prediction feature information through a classification network to obtain the subsequent speech frame. A simplified sketch of such a cyclic network follows.
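The sketch below is a simplified PyTorch illustration of such a cyclic network; the layer sizes, the two-layer GRU and the 256-way classification over quantised sample values are assumptions. The input dimension of 160 matches the concatenated output of the conditional-network sketch above.

    import torch
    import torch.nn as nn

    class RecurrentNetwork(nn.Module):
        """Full connection, gated recurrent units, full connection, then classification mapping."""

        def __init__(self, feat_dim=160, hidden=256, n_classes=256):
            super().__init__()
            self.fc_in = nn.Linear(feat_dim, hidden)
            self.gru = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)  # gated recurrent units
            self.fc_out = nn.Linear(hidden, hidden)
            self.classifier = nn.Linear(hidden, n_classes)                     # classification network

        def forward(self, comprehensive, global_state=None):
            # comprehensive: [batch, time, feat_dim]; global_state may carry the global
            # feature information as the initial GRU hidden state [2, batch, hidden]
            x = torch.relu(self.fc_in(comprehensive))
            x, _ = self.gru(x, global_state)
            x = torch.relu(self.fc_out(x))
            logits = self.classifier(x)                  # prediction feature information -> class logits
            return logits.argmax(dim=-1)                 # quantised samples of the subsequent speech frame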
On the basis of any of the above embodiments, the apparatus further comprises, operating prior to the current frame processing module 1100: a first training module configured to perform first-stage training of the vocoder using the first class of training samples in the data set, training the vocoder to a convergence state; a second training module configured to perform second-stage training of the vocoder using the second class of training samples in the data set, training the vocoder to a convergence state; and a third training module configured to freeze the weights of the conditional network of the vocoder and perform third-stage training of the vocoder using the second class of training samples in the data set, training the vocoder to a convergence state so as to adjust the weights of the cyclic network in the vocoder. The training samples are audio data comprising a plurality of time-sequence-consecutive speech frames; a speech frame that is time-consecutive with the preceding speech frames is used to calculate the loss value of the subsequent speech frame generated by the vocoder from those preceding frames, and the second class of training samples is audio data corresponding to pure human-voice segments.
On the basis of any of the above embodiments, the third training module includes: a mask representation unit configured to replace, with a mask representation, the acoustic features of a preset number of following speech frames whose time sequence continues from the last speech frame of the speech frame sequence sampled from each second-class training sample; an iteration generating unit configured such that the vocoder iteratively generates the subsequent speech frames corresponding to the plurality of following speech frames based on the speech frame sequence of the training sample; and a weight modifying unit configured to calculate the loss values of the corresponding subsequent speech frames from the plurality of following speech frames in the second-class training sample and to revise the weights of the cyclic network in the vocoder according to the loss values.
Another embodiment of the present application further provides a voice stream packet loss compensation device, whose internal structure is schematically illustrated in fig. 12. The device comprises a processor, a non-transitory computer-readable storage medium, a memory and a network interface connected through a system bus. The non-transitory computer-readable storage medium of the device stores an operating system, a database and computer-readable instructions; the database may store sequences of information, and the computer-readable instructions, when executed by the processor, cause the processor to implement the voice stream packet loss compensation method.
The processor of the voice stream packet loss compensation device provides computation and control capabilities and supports the operation of the entire device. The memory of the device may store computer-readable instructions which, when executed by the processor, cause the processor to perform the voice stream packet loss compensation method of the present application. The network interface of the device is used to connect to and communicate with terminals.
Those skilled in the art will understand that the structure shown in fig. 12 is only a block diagram of the part of the structure relevant to the present application and does not limit the voice stream packet loss compensation device to which the present application is applied; a specific device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
In this embodiment, the processor is configured to execute the specific functions of the modules in fig. 11, and the memory stores the program codes and the data required to execute those modules or sub-modules. The network interface is used for data transmission to user terminals or servers. The non-volatile readable storage medium in this embodiment stores the program codes and data required to execute all of the modules of the voice stream packet loss compensation apparatus of the present application, and the server can invoke these program codes and data to perform the functions of all of the modules.
The present application further provides a non-transitory readable storage medium storing computer readable instructions, which when executed by one or more processors, cause the one or more processors to perform the steps of the voice stream packet loss compensation method according to any embodiment of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program, which may be stored in a non-volatile readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
In summary, the present application enables packet loss compensation for a voice stream by generating the missing subsequent speech frames. The generated subsequent speech frames restore the original speech with a high degree of fidelity, artifacts such as repeated or mechanical-sounding speech after packet loss compensation are effectively avoided, and the compensated voice stream obtains a higher subjective quality score.

Claims (12)

1. A voice stream packet loss compensation method is characterized by comprising the following steps:
determining a current voice frame lacking a subsequent voice frame in the voice stream;
acquiring a voice frame sequence containing a current voice frame from a voice stream, and extracting acoustic characteristics of the voice frame sequence;
respectively extracting global characteristic information and local characteristic information of the acoustic characteristics by adopting a conditional network in a preset vocoder to construct comprehensive characteristic information;
and generating the subsequent voice frame according to the comprehensive characteristic information by adopting a circulating network in the vocoder and taking the global characteristic information as reference.
2. The method for compensating for the packet loss of the voice stream according to claim 1, wherein the step of generating the subsequent voice frame according to the comprehensive characteristic information comprises:
taking the generated subsequent speech frame as a new current speech frame, and continuing to iteratively generate new subsequent speech frames until the number of generated subsequent speech frames reaches the maximum compensation quantity or all of the consecutively missing subsequent speech frames have been supplemented.
3. The voice stream packet loss compensation method according to claim 1 or 2, wherein the step of generating the subsequent speech frame according to the comprehensive feature information includes:
splicing the generated subsequent voice frames to obtain a compensation frame sequence;
adjusting the volume corresponding to the compensation frame sequence to be not more than the volume of the voice frame sequence;
and smoothly accessing the compensation frame sequence to the voice stream in which the voice frame sequence is positioned.
4. The method of claim 1, wherein the step of obtaining a speech frame sequence including a current speech frame from a speech stream and extracting an acoustic feature of the speech frame sequence comprises:
acquiring a current voice frame and a previous voice frame with continuous time sequence from a voice stream according to preset time length to form a voice frame sequence;
performing short-time Fourier transform on the voice frame sequence to obtain frequency spectrum information;
applying a Mel filter bank to convert the spectral information into a logarithmic Mel spectrum as the acoustic features of the sequence of speech frames.
5. The voice stream packet loss compensation method according to claim 1, wherein the extracting, by using a conditional network in a preset vocoder, comprehensive feature information of acoustic features includes:
extracting the acoustic features based on a residual error network in a conditional network to obtain the global feature information;
carrying out multi-scale sampling on the acoustic features based on an up-sampling network in a conditional network to obtain the local feature information;
and performing feature splicing on the global feature information and the local feature information based on a splicing layer in the conditional network to obtain the comprehensive feature information.
6. The method of claim 1, wherein the generating the subsequent speech frame according to the comprehensive feature information by using a circulating network in the vocoder and taking the global feature information as a reference comprises:
after the comprehensive characteristic information is fully connected, the comprehensive characteristic information is subjected to context arrangement by referring to the global characteristic information through a plurality of gate control circulation units, and then the comprehensive characteristic information is fully connected and output to obtain predicted characteristic information;
and carrying out classification mapping on the prediction characteristic information based on a classification network to obtain a subsequent voice frame.
7. The method of claim 1, wherein the step of determining a current speech frame in the speech stream without a subsequent speech frame is preceded by:
performing a first stage of training on the vocoder by using a first type of training sample in the data set to train the vocoder to a convergence state;
performing a second stage of training on the vocoder by using a second class of training samples in the data set to train the vocoder to a convergence state;
solidifying the weight of the conditional network of the vocoder, performing a third stage training on the vocoder by adopting a second type training sample in the data set, and training the vocoder to a convergence state so as to adjust the weight of the circulating network in the vocoder;
the training samples are audio data comprising a plurality of time-sequence-consecutive speech frames; a speech frame that is time-consecutive with the preceding speech frames is used for calculating the loss value of the subsequent speech frame generated by the vocoder from those preceding speech frames; and the second class of training samples is audio data corresponding to pure human-voice segments.
8. The method of claim 7, wherein the performing a third stage training of the vocoder using the second type of training samples in the data set comprises:
replacing, with a mask representation, the acoustic features of a preset number of subsequent speech frames whose time sequence continues from the last speech frame of the speech frame sequence sampled from each second-class training sample;
the vocoder iteratively generates a subsequent speech frame corresponding to the plurality of subsequent speech frames based on the sequence of speech frames of the training sample;
and calculating the loss value of the corresponding subsequent speech frame according to the plurality of subsequent speech frames in the second class of training samples, and correcting the weight of the circulating network in the vocoder according to the loss value.
9. A voice stream packet loss compensation apparatus, comprising:
a current frame processing module, configured to determine a current speech frame missing a subsequent speech frame in the speech stream;
the sequence processing module is set to acquire a voice frame sequence containing a current voice frame from the voice stream and extract the acoustic characteristics of the voice frame sequence;
the feature construction module is set to adopt a conditional network in a preset vocoder to respectively extract the global feature information and the local feature information of the acoustic features to construct comprehensive feature information;
and the voice frame generating module is arranged to adopt a circulating network in the vocoder, take the global feature information as reference and generate the subsequent voice frame according to the comprehensive feature information.
10. A voice stream packet loss compensation device comprising a central processing unit and a memory, wherein the central processing unit is configured to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 8.
11. A non-transitory readable storage medium storing, in the form of computer-readable instructions, a computer program implementing the method of any one of claims 1 to 8, the computer program, when invoked by a computer, performing the steps of the corresponding method.
12. A computer program product comprising computer programs/instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 8.
CN202210804024.7A 2022-07-07 2022-07-07 Voice stream packet loss compensation method and device, equipment, medium and product thereof Pending CN115171707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210804024.7A CN115171707A (en) 2022-07-07 2022-07-07 Voice stream packet loss compensation method and device, equipment, medium and product thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210804024.7A CN115171707A (en) 2022-07-07 2022-07-07 Voice stream packet loss compensation method and device, equipment, medium and product thereof

Publications (1)

Publication Number Publication Date
CN115171707A true CN115171707A (en) 2022-10-11

Family

ID=83493516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210804024.7A Pending CN115171707A (en) 2022-07-07 2022-07-07 Voice stream packet loss compensation method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN115171707A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116248229A (en) * 2022-12-08 2023-06-09 南京龙垣信息科技有限公司 Packet loss compensation method for real-time voice communication
CN116248229B (en) * 2022-12-08 2023-12-01 南京龙垣信息科技有限公司 Packet loss compensation method for real-time voice communication

Similar Documents

Publication Publication Date Title
JP7427723B2 (en) Text-to-speech synthesis in target speaker's voice using neural networks
US8571871B1 (en) Methods and systems for adaptation of synthetic speech in an environment
CN112712812B (en) Audio signal generation method, device, equipment and storage medium
KR20230156121A (en) Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
CN112185363B (en) Audio processing method and device
KR20230084229A (en) Parallel tacotron: non-autoregressive and controllable TTS
CN111508470A (en) Training method and device of speech synthesis model
CN112908294B (en) Speech synthesis method and speech synthesis system
Oyamada et al. Non-native speech conversion with consistency-aware recursive network and generative adversarial network
CN112530400A (en) Method, system, device and medium for generating voice based on text of deep learning
CN115171707A (en) Voice stream packet loss compensation method and device, equipment, medium and product thereof
CN113129864A (en) Voice feature prediction method, device, equipment and readable storage medium
CN112908293B (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
Song et al. AdaVITS: Tiny VITS for low computing resource speaker adaptation
Lőrincz et al. Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis
CN113948062B (en) Data conversion method and computer storage medium
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN114999440A (en) Avatar generation method, apparatus, device, storage medium, and program product
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product
CN113314101A (en) Voice processing method and device, electronic equipment and storage medium
Xie et al. A new high quality trajectory tiling based hybrid TTS in real time
Ding A Systematic Review on the Development of Speech Synthesis
CN115188362A (en) Speech synthesis model generation method and device, equipment, medium and product thereof
CN113066472A (en) Synthetic speech processing method and related device
KR102584481B1 (en) Method and system for synthesizing multi speaker speech using artificial neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination