CN114866856B - Audio signal processing method, audio generation model training method and device - Google Patents


Info

Publication number
CN114866856B
Authority
CN
China
Prior art keywords
audio
target
sample
frame
audio frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210486101.9A
Other languages
Chinese (zh)
Other versions
CN114866856A
Inventor
李楠
郑羲光
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210486101.9A
Publication of CN114866856A
Application granted
Publication of CN114866856B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N 21/63 Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N 21/647 Control signaling between network components and server or clients; Network processes for video distribution between server and clients, e.g. controlling the quality of the video stream, by dropping packets, protecting content from unauthorised alteration within the network, monitoring of network load, bridging between two different networks, e.g. between IP and wireless
    • H04N 21/64784 Data processing by the network
    • H04N 21/64792 Controlling the complexity of the content stream, e.g. by dropping packets
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/0017 Lossless audio signal coding; Perfect reconstruction of coded audio signal by transmission of coding error
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Abstract

The disclosure relates to an audio signal processing method, an audio generation model training method and apparatus, an electronic device, and a storage medium, and belongs to the technical field of audio. The method includes: when an audio frame is missing from an audio signal, acquiring a historical audio frame preceding the missing audio frame and a future audio frame following it; synthesizing a target audio frame based on the historical audio frame and the future audio frame, the phonemes and semantics of the target audio frame being similar to those of the missing audio frame; and synthesizing, based on the historical audio frame, the target audio frame, and the future audio frame, a compensation signal associated with the audio signal, in which the missing audio frame is filled with the target audio frame. The disclosure can at least improve the audio quality when a receiving end performs audio packet loss compensation.

Description

Audio signal processing method, audio generation model training method and device
Technical Field
The disclosure relates to the technical field of audio, and in particular to an audio signal processing method, an audio generation model training method and apparatus, an electronic device, and a storage medium.
Background
With the development of audio technology, users can conduct real-time audio and video calls through a terminal, or watch live broadcasts at any time. Both audio and video call scenarios and real-time live broadcast scenarios depend on audio signal transmission technology.
Because of network fluctuation or other faults, audio packets are often lost while an audio signal is being transmitted. To mitigate this, when a receiving end detects that one or more audio frames are missing, it copies an adjacent audio frame that was not lost and fills the missing frames with the copy. This copy-based packet loss compensation is prone to noise and artifacts in high packet loss rate scenarios, which adversely affects the audio quality of the audio signal at the receiving end.
Disclosure of Invention
The disclosure provides an audio signal processing method, an audio generation model training method and apparatus, an electronic device, and a storage medium, so as to at least improve the audio quality when a receiving end performs audio packet loss compensation. The technical solution of the present disclosure is as follows:
according to an aspect of the embodiments of the present disclosure, there is provided a method for processing an audio signal, including:
Under the condition that an audio frame is missing in an audio signal, acquiring a historical audio frame before the audio frame and a future audio frame after the audio frame;
synthesizing a target audio frame based on the historical audio frame and the future audio frame, wherein phonemes and semantics of the target audio frame are similar to those of the audio frame;
based on the historical audio frames, the target audio frames, and the future audio frames, a compensation signal associated with the audio signal is synthesized, the compensation signal filling the missing audio frames with the target audio frames.
In some embodiments, the synthesizing the target audio frame based on the historical audio frame and the future audio frame comprises:
determining an audio clip comprised of the historical audio frame and the future audio frame;
acquiring deletion indication information of the audio fragment, wherein the deletion indication information is used for indicating whether any audio frame in the audio fragment is deleted or not;
and synthesizing the target audio frame based on the audio fragment and the deletion indication information.
In some embodiments, the synthesizing the target audio frame based on the audio clip and the deletion indication information includes:
Fusing the audio fragment and the deletion indication information to obtain extended audio data;
encoding the extended audio data to obtain audio encoding characteristics of the extended audio data;
and decoding the audio coding features to obtain the target audio frame.
In some embodiments, the encoding the extended audio data to obtain audio encoding characteristics of the extended audio data includes:
inputting the extended audio data into an audio generation model, wherein the audio generation model is used for synthesizing a target audio frame missing between a historical audio frame and a future audio frame;
and encoding the extended audio data through an audio encoding layer of the audio generation model to obtain the audio encoding characteristics.
In some embodiments, the decoding the audio encoding feature to obtain the target audio frame comprises:
compressing the audio coding features through a quantization compression layer of the audio generation model to obtain audio compression features;
and decoding the audio compression characteristic through an audio decoding layer of the audio generation model to obtain the target audio frame.
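For illustration only, the encode, compress, and decode flow described above can be sketched as follows; the module names, layer types, and hyperparameters are assumptions made for the example and are not specified by this embodiment.

```python
import torch
import torch.nn as nn

class AudioGenerationModel(nn.Module):
    """Minimal sketch of an audio generation model with an audio encoding layer,
    a quantization compression layer, and an audio decoding layer.
    All layer choices and sizes are illustrative assumptions."""
    def __init__(self, hidden=128, compressed=64):
        super().__init__()
        # Audio encoding layer: maps two-channel extended audio data to audio coding features.
        self.encoder = nn.Sequential(
            nn.Conv1d(2, hidden, kernel_size=7, padding=3),
            nn.ELU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3),
        )
        # Quantization compression layer: reduces the feature dimension.
        self.compress = nn.Conv1d(hidden, compressed, kernel_size=1)
        # Audio decoding layer: reconstructs a single-channel waveform.
        self.decoder = nn.Sequential(
            nn.Conv1d(compressed, hidden, kernel_size=7, padding=3),
            nn.ELU(),
            nn.Conv1d(hidden, 1, kernel_size=7, padding=3),
        )

    def forward(self, extended_audio):           # shape (batch, 2, samples)
        features = self.encoder(extended_audio)  # audio coding features
        compressed = self.compress(features)     # audio compression features
        return self.decoder(compressed)          # synthesized waveform
```

In use, the samples produced at the positions of the missing frame would be taken as the synthesized target audio frame.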
In some embodiments, the audio segment is an audio frame sequence composed of a plurality of audio frames, and the deletion indication information is a parameter sequence composed of deletion indication parameters of the plurality of audio frames;
the fusing the audio clip and the deletion instruction information to obtain extended audio data comprises the following steps:
for any audio frame in the audio frame sequence, splicing the audio frame with the deletion indication parameter of the audio frame in the parameter sequence to obtain dual-channel data of the audio frame;
the extended audio data composed of two-channel data of a plurality of audio frames is acquired.
In some embodiments, for any audio frame in the sequence of audio frames, the deletion indication parameter of the audio frame is assigned to 1 when the audio frame is deleted, and the deletion indication parameter of the audio frame is assigned to 0 when the audio frame is not deleted.
In some embodiments, the number of frames of the future audio frame is a target number of frames; or, the frame length of the future audio frame is the target frame length; or, the playing duration of the future audio frame is the target duration.
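The fusion of the audio frame sequence with its deletion indication parameters into two-channel extended audio data can be sketched as follows; treating the indication as a per-sample channel and zero-filling missing frames are assumptions made for the example.

```python
import numpy as np

def build_extended_audio(frames, missing_flags):
    """Fuse an audio frame sequence with its deletion indication parameters.

    frames:        list of 1-D arrays, one per audio frame (missing frames zero-filled).
    missing_flags: list of 0/1 values, 1 when the corresponding frame is missing.
    Returns a (2, total_samples) array: channel 0 is the waveform, channel 1 the
    per-sample deletion indication. This layout is an illustrative assumption.
    """
    waveform = np.concatenate(frames)
    mask = np.concatenate([np.full(len(f), flag, dtype=np.float32)
                           for f, flag in zip(frames, missing_flags)])
    return np.stack([waveform, mask])            # two-channel extended audio data

# Example: four 160-sample frames, the third one missing.
frames = [np.random.randn(160).astype(np.float32) for _ in range(4)]
frames[2] = np.zeros(160, dtype=np.float32)
extended = build_extended_audio(frames, [0, 0, 1, 0])   # shape (2, 640)
```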
According to another aspect of the embodiments of the present disclosure, there is provided a training method of an audio generation model, including:
Obtaining a target audio fragment associated with a sample audio fragment through an audio generation model, wherein a missing audio frame exists in the sample audio fragment, and the missing audio frame is filled with a synthesized target audio frame in the target audio fragment;
acquiring respective audio discrimination parameters of the sample audio fragment and the target audio fragment through an audio discrimination model, wherein the audio discrimination parameters are used for representing the likelihood, as judged by the audio discrimination model, that the input audio fragment is a machine-synthesized signal;
iteratively adjusting parameters of the audio generation model based on the audio discrimination parameters, the sample audio piece and the target audio piece.
In some embodiments, iteratively adjusting parameters of the audio generation model based on the audio discrimination parameters, the sample audio piece, and the target audio piece comprises:
determining a discrimination loss term of the audio generation model based on the audio discrimination parameters, wherein the discrimination loss term is used for representing whether a target audio frame synthesized by the audio generation model can be accurately identified by the audio discrimination model;
determining a reconstruction loss term of the audio generation model based on the target audio segment and the sample audio segment, wherein the reconstruction loss term is used for representing the difference degree between a target audio frame in the target audio segment and a missing audio frame in the sample audio segment;
Iteratively adjusting parameters of the audio generation model based on the discrimination loss term and the reconstruction loss term.
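A minimal sketch of one such training iteration is shown below, assuming a least-squares adversarial objective and equal weighting between the discrimination loss term and the reconstruction loss term; these choices, the optimizers, and the function names are illustrative assumptions rather than the embodiment's prescribed implementation.

```python
import torch

def train_step(generator, discriminator, g_opt, d_opt,
               sample_clip, masked_clip, recon_loss_fn, lambda_recon=1.0):
    """One illustrative adversarial training step for the audio generation model.
    `masked_clip` is the sample clip with the missing frame zeroed out plus its
    deletion indication channel; weighting and optimizers are assumptions."""
    # 1. Update the audio discrimination model.
    target_clip = generator(masked_clip).detach()
    d_loss = (torch.mean((discriminator(sample_clip) - 1.0) ** 2) +
              torch.mean(discriminator(target_clip) ** 2))   # least-squares GAN variant (assumption)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2. Update the audio generation model.
    target_clip = generator(masked_clip)
    disc_loss = torch.mean((discriminator(target_clip) - 1.0) ** 2)   # discrimination loss term
    recon_loss = recon_loss_fn(sample_clip, target_clip)              # reconstruction loss term
    g_loss = disc_loss + lambda_recon * recon_loss
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```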
In some embodiments, the determining a reconstruction loss term for the audio generation model based on the target audio piece and the sample audio piece comprises:
acquiring at least one of a spectral loss term, a pronunciation loss term or a semantic loss term of the audio generation model based on the sample audio piece and the target audio piece;
adding at least one of a spectrum loss term, a pronunciation loss term or a semantic loss term of the audio generation model to obtain a reconstruction loss term;
the spectrum loss term is used for representing the difference degree of the sample audio fragment and the target audio fragment in a frequency domain space, the pronunciation loss term is used for representing the difference degree of the sample audio fragment and the target audio fragment in a phoneme characteristic space, and the semantic loss term is used for representing the difference degree of the sample audio fragment and the target audio fragment in a semantic characteristic space.
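Written out, the reconstruction loss term is the sum of whichever of the three terms are used; the equal weighting shown below is an assumption:

$$\mathcal{L}_{\mathrm{recon}} = \mathcal{L}_{\mathrm{spectral}} + \mathcal{L}_{\mathrm{pronunciation}} + \mathcal{L}_{\mathrm{semantic}}$$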
In some embodiments, the obtaining a spectral loss term of the audio generation model based on the sample audio piece and the target audio piece comprises:
Under different sampling rates, performing time-frequency conversion on the sample audio fragment and the target audio fragment to obtain a sample frequency signal of the sample audio fragment and a target frequency signal of the target audio fragment;
determining a time-frequency loss term and a signal-to-noise loss term based on the sample frequency signal and the target frequency signal at different sampling rates, wherein the time-frequency loss term is used for representing the difference degree of the sample frequency signal and the target frequency signal in frequency amplitude, and the signal-to-noise loss term is used for representing the signal-to-noise ratio of the target frequency signal;
and acquiring the frequency spectrum loss term based on the time-frequency loss term and the signal-to-noise loss term.
In some embodiments, the determining a time-frequency loss term based on the sample frequency signal and the target frequency signal at different sampling rates includes:
determining a target frequency component corresponding to any sample frequency component in the target frequency signal for any sample frequency component of any sample audio frame in the sample frequency signal at any sampling rate;
acquiring an L1 norm between the amplitude of the sample frequency component and the amplitude of the target frequency component;
Acquiring an L2 norm between a natural logarithm of the amplitude of the sample frequency component and a natural logarithm of the amplitude of the target frequency component;
the time-frequency loss term is obtained based on the L1 norm and the L2 norm of a plurality of sample audio frames at a plurality of sampling rates, respectively, over a plurality of sample frequency components.
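A rough sketch of such a time-frequency loss term is given below; it accumulates an L1 norm of the magnitude differences and an L2 norm of the log-magnitude differences over several STFT resolutions, which stand in here for the multiple sampling rates described above (a simplifying assumption).

```python
import numpy as np
from scipy.signal import stft

def time_frequency_loss(sample_wav, target_wav, fft_sizes=(256, 512, 1024), eps=1e-7):
    """Illustrative time-frequency loss: L1 between magnitudes plus L2 between
    natural-log magnitudes, accumulated over several STFT resolutions."""
    loss = 0.0
    for n_fft in fft_sizes:
        _, _, S = stft(sample_wav, nperseg=n_fft)
        _, _, T = stft(target_wav, nperseg=n_fft)
        s_mag, t_mag = np.abs(S), np.abs(T)
        l1 = np.sum(np.abs(s_mag - t_mag))                  # L1 norm of magnitude differences
        l2 = np.sqrt(np.sum((np.log(s_mag + eps) - np.log(t_mag + eps)) ** 2))  # L2 norm of log-magnitude differences
        loss += l1 + l2
    return loss
```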
In some embodiments, the determining a signal-to-noise loss term based on the sample frequency signal and the target frequency signal at different sampling rates comprises:
determining a target frequency component corresponding to any sample frequency component in the target frequency signal for any sample frequency component of any sample audio frame in the sample frequency signal at any sampling rate;
dividing the amplitude of the target frequency component by the cosine value of a frequency characteristic included angle to obtain signal information of the target frequency signal, wherein the frequency characteristic included angle is the characteristic included angle of the sample frequency component and the target frequency component in a frequency domain space;
obtaining noise information of the target frequency signal based on a difference between the target frequency component and the signal information;
the signal-to-noise loss term is obtained based on the signal information and the noise information for a plurality of sample audio frames at a plurality of sample rates, each over a plurality of sample frequency components.
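For reference, the conventional scale-invariant SNR on which this kind of signal-to-noise loss term is typically based can be sketched as follows; the per-frequency-component cosine formulation described above is a variant of the same signal/noise decomposition, so this sketch is an approximation rather than the claimed computation.

```python
import numpy as np

def si_snr_loss(sample_wav, target_wav, eps=1e-8):
    """Conventional scale-invariant SNR between a reference (sample) and an
    estimate (target), returned as a negative value so that lower is better."""
    s = sample_wav - np.mean(sample_wav)
    t = target_wav - np.mean(target_wav)
    s_target = (np.dot(t, s) / (np.dot(s, s) + eps)) * s   # projection of the estimate onto the reference
    e_noise = t - s_target                                  # residual treated as noise
    si_snr = 10 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))
    return -si_snr
```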
In some embodiments, the obtaining a pronunciation loss term for the audio generation model based on the sample audio piece and the target audio piece comprises:
inputting the sample audio fragment into a phoneme feature extraction model to obtain sample phoneme features, wherein the sample phoneme features are used for representing the pronunciation features of phonemes of audio frames in the sample audio fragment;
inputting the target audio segment into the phoneme feature extraction model to obtain target phoneme features, wherein the target phoneme features are used for representing the pronunciation features of phonemes of audio frames in the target audio segment;
and acquiring the pronunciation loss term based on the sample phoneme characteristic and the target phoneme characteristic.
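A minimal sketch of such a pronunciation loss term, assuming a pretrained waveform-to-phoneme-feature extractor (for example a wav2vec-style model) and an L1 distance between the features, is:

```python
import torch
import torch.nn.functional as F

def pronunciation_loss(sample_clip, target_clip, phoneme_model):
    """Illustrative pronunciation loss: distance between the phoneme features of
    the sample clip and the target clip. `phoneme_model` stands in for a frozen,
    pretrained phoneme feature extraction model (an assumption)."""
    sample_phonemes = phoneme_model(sample_clip).detach()  # reference features, no gradient needed
    target_phonemes = phoneme_model(target_clip)           # gradients flow back to the generator
    return F.l1_loss(target_phonemes, sample_phonemes)
```

The semantic loss term described next can be computed analogously, with a semantic feature extraction model in place of the phoneme feature extraction model.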
In some embodiments, the obtaining the semantic loss term of the audio generation model based on the sample audio piece and the target audio piece comprises:
inputting the sample audio fragment into a semantic feature extraction model to obtain sample semantic features, wherein the sample semantic features are used for representing the semantic features of audio frames in the sample audio fragment;
inputting the target audio segment into the semantic feature extraction model to obtain target semantic features, wherein the target semantic features are used for representing semantic features of audio frames in the target audio segment;
And acquiring the semantic loss item based on the sample semantic features and the target semantic features.
According to another aspect of the embodiments of the present disclosure, there is provided an audio signal processing apparatus including:
an acquisition unit configured to perform, in the event of an audio frame missing in an audio signal, acquisition of a history audio frame preceding the audio frame and a future audio frame following the audio frame;
a first synthesizing unit configured to perform synthesizing a target audio frame based on the history audio frame and the future audio frame, the target audio frame having phonemes and semantics similar to the audio frame;
a second synthesizing unit configured to perform synthesizing a compensation signal associated with the audio signal based on the historical audio frame, the target audio frame, and the future audio frame, the compensation signal filling the missing audio frame with the target audio frame.
In some embodiments, the first synthesis unit comprises:
a determining subunit configured to perform determining an audio clip constituted by the historical audio frame and the future audio frame;
an acquisition subunit configured to perform acquisition of deletion instruction information of the audio piece, the deletion instruction information being used to instruct whether any audio frame in the audio piece is deleted;
A synthesizing subunit configured to perform synthesizing the target audio frame based on the audio clip and the deletion instruction information.
In some embodiments, the synthesis subunit comprises:
a fusion subunit configured to perform fusion of the audio clip and the deletion instruction information to obtain extended audio data;
a coding subunit configured to perform coding of the extended audio data to obtain audio coding features of the extended audio data;
and a decoding subunit configured to perform decoding of the audio coding feature to obtain the target audio frame.
In some embodiments, the encoding subunit is configured to perform:
inputting the extended audio data into an audio generation model, wherein the audio generation model is used for synthesizing a target audio frame missing between a historical audio frame and a future audio frame;
and encoding the extended audio data through an audio encoding layer of the audio generation model to obtain the audio encoding characteristics.
In some embodiments, the decoding subunit is configured to perform:
compressing the audio coding features through a quantization compression layer of the audio generation model to obtain audio compression features;
And decoding the audio compression characteristic through an audio decoding layer of the audio generation model to obtain the target audio frame.
In some embodiments, the audio segment is an audio frame sequence composed of a plurality of audio frames, and the deletion indication information is a parameter sequence composed of deletion indication parameters of the plurality of audio frames;
the fusion subunit is configured to perform:
for any audio frame in the audio frame sequence, splicing the audio frame with the deletion indication parameter of the audio frame in the parameter sequence to obtain dual-channel data of the audio frame;
the extended audio data composed of two-channel data of a plurality of audio frames is acquired.
In some embodiments, for any audio frame in the sequence of audio frames, the deletion indication parameter of the audio frame is assigned to 1 when the audio frame is deleted, and the deletion indication parameter of the audio frame is assigned to 0 when the audio frame is not deleted.
In some embodiments, the number of frames of the future audio frame is a target number of frames; or, the frame length of the future audio frame is the target frame length; or, the playing duration of the future audio frame is the target duration.
According to another aspect of the embodiments of the present disclosure, there is provided a training apparatus of an audio generation model, including:
a first acquisition unit configured to perform acquisition of a target audio piece associated with a sample audio piece in which a missing audio frame exists, by an audio generation model, the missing audio frame being filled with a synthesized target audio frame in the target audio piece;
a second acquisition unit configured to perform acquisition of respective audio discrimination parameters of the sample audio piece and the target audio piece by an audio discrimination model, the audio discrimination parameters being used to characterize the likelihood, as judged by the audio discrimination model, that an input audio piece is a machine-synthesized signal;
and a parameter adjustment unit configured to perform iterative adjustment of parameters of the audio generation model based on the audio discrimination parameters, the sample audio piece, and the target audio piece.
In some embodiments, the parameter adjustment unit comprises:
a first determination subunit configured to perform determining, based on the audio discrimination parameters, a discrimination loss term of the audio generation model, the discrimination loss term being used to characterize whether a target audio frame synthesized by the audio generation model can be accurately identified by the audio discrimination model;
A second determination subunit configured to perform determining a reconstruction loss term of the audio generation model based on the target audio piece and the sample audio piece, the reconstruction loss term being used to characterize a degree of difference between a target audio frame in the target audio piece and a missing audio frame in the sample audio piece;
a parameter adjustment subunit configured to perform iterative adjustment of parameters of the audio generation model based on the discrimination loss term and the reconstruction loss term.
In some embodiments, the second determination subunit comprises:
an acquisition subunit configured to perform acquisition of at least one of a spectral loss term, a pronunciation loss term, or a semantic loss term of the audio generation model based on the sample audio piece and the target audio piece;
an adder subunit configured to perform adding at least one of a spectral loss term, a pronunciation loss term, or a semantic loss term of the audio generation model to obtain the reconstruction loss term;
the spectrum loss term is used for representing the difference degree of the sample audio fragment and the target audio fragment in a frequency domain space, the pronunciation loss term is used for representing the difference degree of the sample audio fragment and the target audio fragment in a phoneme characteristic space, and the semantic loss term is used for representing the difference degree of the sample audio fragment and the target audio fragment in a semantic characteristic space.
In some embodiments, the acquisition subunit comprises:
a transformation subunit configured to perform time-frequency transformation on the sample audio segment and the target audio segment at different sampling rates, to obtain a sample frequency signal of the sample audio segment and a target frequency signal of the target audio segment;
a determining subunit configured to perform determining, based on the sample frequency signal and the target frequency signal at different sampling rates, a time-frequency loss term for characterizing a degree of difference in frequency amplitude of the sample frequency signal and the target frequency signal, and a signal-to-noise loss term for characterizing a signal-to-noise ratio of the target frequency signal;
an acquisition sub-unit configured to perform acquisition of the spectrum loss term based on the time-frequency loss term and the signal-to-noise loss term.
In some embodiments, the determining subunit is configured to perform:
determining a target frequency component corresponding to any sample frequency component in the target frequency signal for any sample frequency component of any sample audio frame in the sample frequency signal at any sampling rate;
Acquiring an L1 norm between the amplitude of the sample frequency component and the amplitude of the target frequency component;
acquiring an L2 norm between a natural logarithm of the amplitude of the sample frequency component and a natural logarithm of the amplitude of the target frequency component;
the time-frequency loss term is obtained based on the L1 norm and the L2 norm of a plurality of sample audio frames at a plurality of sampling rates, respectively, over a plurality of sample frequency components.
In some embodiments, the determining subunit is configured to perform:
determining a target frequency component corresponding to any sample frequency component in the target frequency signal for any sample frequency component of any sample audio frame in the sample frequency signal at any sampling rate;
dividing the amplitude of the target frequency component by the cosine value of a frequency characteristic included angle to obtain signal information of the target frequency signal, wherein the frequency characteristic included angle is the characteristic included angle of the sample frequency component and the target frequency component in a frequency domain space;
obtaining noise information of the target frequency signal based on a difference between the target frequency component and the signal information;
the signal-to-noise loss term is obtained based on the signal information and the noise information for a plurality of sample audio frames at a plurality of sample rates, each over a plurality of sample frequency components.
In some embodiments, the acquisition subunit is configured to perform:
inputting the sample audio fragment into a phoneme feature extraction model to obtain sample phoneme features, wherein the sample phoneme features are used for representing the pronunciation features of phonemes of audio frames in the sample audio fragment;
inputting the target audio segment into the phoneme feature extraction model to obtain target phoneme features, wherein the target phoneme features are used for representing the pronunciation features of phonemes of audio frames in the target audio segment;
and acquiring the pronunciation loss term based on the sample phoneme characteristic and the target phoneme characteristic.
In some embodiments, the acquisition subunit is configured to perform:
inputting the sample audio fragment into a semantic feature extraction model to obtain sample semantic features, wherein the sample semantic features are used for representing the semantic features of audio frames in the sample audio fragment;
inputting the target audio segment into the semantic feature extraction model to obtain target semantic features, wherein the target semantic features are used for representing semantic features of audio frames in the target audio segment;
and acquiring the semantic loss item based on the sample semantic features and the target semantic features.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including:
one or more processors;
one or more memories for storing the one or more processor-executable instructions;
wherein the one or more processors are configured to perform the method of processing an audio signal or the method of training an audio generation model in any of the possible implementations of the above aspect.
According to another aspect of the disclosed embodiments, there is provided a computer-readable storage medium, at least one instruction of which, when executed by one or more processors of an electronic device, enables the electronic device to perform the method of processing an audio signal or the method of training an audio generation model in any one of the possible implementations of the above aspect.
According to another aspect of the disclosed embodiments, there is provided a computer program product comprising one or more instructions executable by one or more processors of an electronic device, such that the electronic device is capable of performing the method of processing an audio signal or the method of training an audio generation model in any of the possible implementations of the above aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
When an audio frame is lost, the context of the missing frame, namely the historical audio frame and the future audio frame, is used to synthesize a target audio frame that is similar to the missing audio frame in frequency, phonemes, and semantics, and the missing audio frame is filled with the target audio frame, thereby implementing a packet loss compensation mechanism for the audio signal. This mechanism does not simply copy a historical or future audio frame; instead, it synthesizes a more natural, smooth, and high-quality target audio frame, which avoids the noise and artifacts that readily occur under conventional copy-based packet loss compensation, and thus avoids the adverse effect of packet loss compensation on the audio quality of the audio signal at the receiving end; that is, the compensation signal obtained after packet loss compensation has higher audio quality and a better playback effect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is an environmental schematic diagram of an audio signal processing method according to an embodiment of the disclosure;
fig. 2 is a flowchart of a method of processing an audio signal according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a method of processing an audio signal according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a training method for an audio generation model provided by an embodiment of the present disclosure;
FIG. 5 is a flowchart of the training and inference phases of an audio generation model provided by an embodiment of the present disclosure;
fig. 6 is a logical block diagram of an audio signal processing apparatus according to an embodiment of the present disclosure;
FIG. 7 is a logical block diagram of a training apparatus of an audio generation model, shown in an embodiment of the present disclosure;
FIG. 8 shows a block diagram of an electronic device provided by an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
In some embodiments, "A and/or B" covers three cases: A alone, B alone, and both A and B.
The user information referred to in the present disclosure may be information authorized by the user or sufficiently authorized by each party. It should be noted that, information (including, but not limited to, device information, behavior information, personal information, etc. of the user), data (including, but not limited to, data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant country and region. For example, the audio signals referred to in this application are all acquired with sufficient authorization.
Hereinafter, terms related to the embodiments of the present disclosure will be explained.
IP-based audio transmission (Voice over Internet Protocol, VoIP): VoIP is an audio telephony technology that enables audio calls and multimedia conferences over the Internet Protocol (IP), that is, audio communication via the internet. VoIP is also known as IP telephony, internet telephony, or broadband telephony service. VoIP technology can be used on a variety of internet access devices, including VoIP phones, smartphones, and personal computers, and supports audio calls (or audio-video calls) and short messages over cellular networks and WiFi (Wireless Fidelity) networks.
Packet loss concealment (Packet Loss Concealment, PLC): also called a packet loss compensation mechanism, PLC is a compensation mechanism used by media engines to cope with network packet loss. When a media engine receives a series of media stream packets (e.g., audio stream packets), it cannot be guaranteed that all packets are received, because packets may be lost in the network. Taking a VoIP call scenario as an example, a transmitting end continuously transmits a series of audio stream data packets to a receiving end; if the receiving end finds that one or more data packets are lost, the PLC mechanism steps in to compensate for or recover the audio frames in the lost data packets. PLC is not standardized; the media engine and media codec are allowed to implement and extend it as the situation requires.
Generative adversarial network (Generative Adversarial Networks, GAN): GAN is an important generative model in the field of deep learning; it is an unsupervised deep learning model used to have a computer generate data (e.g., audio data).
Under the GAN architecture, a generator (i.e., a generation model) and a discriminator (i.e., a discrimination model) are involved. The generator generates data by machine, and its optimization goal is to "fool" the discriminator as much as possible so that the discriminator cannot tell whether the input data is machine-generated; the discriminator judges whether the input data is real data or data produced by the generator, and its optimization goal is to detect the fake data forged by the generator as far as possible.
During the dynamic adversarial (game-playing) process of a GAN, the generator and the discriminator are trained simultaneously and compete in a minimax game, learning through this game to produce good outputs. As training proceeds, the data produced by the generator becomes more and more similar to real data, and the discriminator becomes better and better at telling real from generated data; ideally, at the end of training the generator can produce data that passes for real, while it becomes difficult for the discriminator to determine whether data produced by the generator is real.
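For reference, the standard GAN minimax objective that this game corresponds to is

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right],$$

where $G$ is the generator and $D$ is the discriminator.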
The adversarial approach of GANs avoids some of the difficulties that traditional generative models face in practical applications, cleverly approximates otherwise intractable loss functions through adversarial learning, and is widely applied to the generation of speech, music, images, video, natural language, and other data.
Optimized Scale-Invariant Signal-to-Noise Ratio (OSISNR): a method of obtaining a signal-to-noise ratio that can measure the signal-to-noise ratio of a signal under test at different scales; when the signal under test is an audio signal, OSISNR can measure its signal-to-noise ratio at different sampling rates.
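For reference, the scale-invariant SNR that this kind of measure builds on is conventionally defined as

$$\mathrm{SI\text{-}SNR}(\hat{s}, s) = 10\log_{10}\frac{\lVert s_{\mathrm{target}}\rVert^2}{\lVert \hat{s}-s_{\mathrm{target}}\rVert^2},\qquad s_{\mathrm{target}}=\frac{\langle \hat{s},s\rangle}{\lVert s\rVert^2}\,s,$$

where $s$ is the reference signal and $\hat{s}$ is the signal under test; the multi-scale (multi-sampling-rate) averaging of OSISNR is assumed to be applied on top of a per-scale quantity of this kind.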
Phoneme enhanced perceptual loss (Phone-Fortified Perceptual Loss, PFPL): for any audio signal, the audio signal is converted into corresponding phoneme features by an audio waveform-to-vector model, and the degree of difference between the phoneme features of different audio signals is the PFPL. The waveform-to-vector model converts audio signals from the time-domain space into a phoneme vector space, so the phoneme features can represent the pronunciation characteristics of the associated audio signal and are comparable to a human listener's perceptual features of that signal; the degree of difference between the phoneme features of different audio signals therefore reflects their loss in the phoneme-enhanced perceptual dimension.
Acoustic speech recognition (Acoustic Speech Recognition, ASR): also known as automatic speech recognition, a technology that converts human speech into text. Speech recognition is a multidisciplinary, cross-cutting field closely coupled with acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science, and many other disciplines.
With the development of audio technology, users can conduct real-time audio and video calls through a terminal, or watch live broadcasts at any time. Both audio and video call scenarios and real-time live broadcast scenarios depend on audio signal transmission technology.
During VoIP-based audio transmission, a transmitting end collects a series of audio data packets and continuously transmits them to a receiving end over the IP protocol, thereby streaming the audio signal. However, because of poor network conditions such as network fluctuation and weak network signals, packet loss frequently occurs during network transmission, and the receiving end cannot guarantee that all audio data packets sent by the transmitting end are received. The audio data packets lost in transmission seriously affect the fluency of audio playback at the receiving end; in particular, in scenarios such as real-time communication and interactive live broadcasting, poor audio fluency or audio stuttering greatly degrades the user's call experience.
At present, to deal with audio packet loss, when the receiving end detects that one or more audio frames are missing, it copies an adjacent audio frame that was not lost and fills the missing frames with the copy. However, under a high packet loss rate, a large amount of noise and artifacts remain even after the audio clip is smoothed, which seriously degrades the audio quality played at the receiving end.
In view of this, the embodiments of the present disclosure provide a high-quality, low-delay neural-network-based audio PLC method that can compensate for lost audio with high quality while keeping the delay of the audio PLC algorithm within 20 ms (milliseconds); that is, the target audio frame used for compensation is synthesized using at most 20 ms of future audio frames after the lost audio frame. Packet loss compensation can therefore be performed with high quality while introducing only a very low delay, which is of great significance for scenarios with high real-time requirements such as real-time communication and interactive live broadcasting.
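As a quick worked example of what the 20 ms bound implies (the sampling rate and frame duration below are assumptions, not values specified by the disclosure):

```python
# At a 10 ms frame duration and 16 kHz sampling, a 20 ms look-ahead budget
# allows at most 2 future audio frames (320 samples) to be used for synthesis.
sample_rate_hz = 16_000          # assumed sampling rate
frame_ms = 10                    # assumed frame duration
lookahead_ms = 20                # delay budget stated in the text
max_future_frames = lookahead_ms // frame_ms                 # -> 2
lookahead_samples = sample_rate_hz * lookahead_ms // 1000    # -> 320
```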
The system architecture of the embodiments of the present disclosure is described below.
Fig. 1 is an implementation environment schematic diagram of a processing method of an audio signal according to an embodiment of the disclosure. Referring to fig. 1, in this implementation environment, a first terminal 120, a server 140, and a second terminal 160 are included.
The first terminal 120 installs and runs an application program supporting a VoIP service. The VoIP service includes, for example, a VoIP-based multiparty real-time audio or audio-video call, a VoIP-based audio or audio-video conference, a VoIP-based webcast, and the like; the embodiments of the present disclosure do not specifically limit the type of VoIP service. Optionally, the application program supporting the VoIP service includes: live streaming applications, short video applications, audio-video applications, content sharing applications, content generation applications, teleconferencing applications, remote consultation applications, social applications, IP telephony applications, and the like; the type of application program is not particularly limited by the disclosed embodiments.
The first terminal 120 and the second terminal 160 are directly or indirectly connected to the server 140 by wired or wireless communication.
Server 140 includes at least one of a server, a plurality of servers, a cloud computing platform, or a virtualization center. Server 140 is used to provide background services for applications supporting VoIP services. Optionally, the server 140 performs primary audio processing, and the first terminal 120 and the second terminal 160 perform secondary audio processing; alternatively, the server 140 performs a secondary audio processing operation, and the first terminal 120 and the second terminal 160 perform a primary audio processing operation; alternatively, the server 140, the first terminal 120 and the second terminal 160 cooperate to perform audio processing by using a distributed computing architecture.
Optionally, the server 140 is a stand-alone physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
The second terminal 160 installs and runs an application program supporting a VoIP service. The VoIP service includes, for example, a VoIP-based multiparty real-time audio or audio-video call, a VoIP-based audio or audio-video conference, a VoIP-based webcast, and the like; the embodiments of the present disclosure do not specifically limit the type of VoIP service. Optionally, the application program supporting the VoIP service includes: live streaming applications, short video applications, audio-video applications, content sharing applications, content generation applications, teleconferencing applications, remote consultation applications, social applications, IP telephony applications, and the like; the type of application program is not particularly limited by the disclosed embodiments.
Illustratively, taking a webcast scenario as an example, the first terminal 120 is the terminal used by an anchor user. The anchor user starts a live streaming application on the first terminal 120, logs in to an anchor account in the application, and starts broadcasting on the live platform via the broadcast initiation control. The first terminal 120 collects the anchor user's live data stream, which includes a live audio stream and a live image stream, and pushes it to the server 140, which then stores the anchor account in association with the live data stream. The second terminal 160 is the terminal used by an audience user. The audience user starts the live streaming application on the second terminal 160 and uses it to browse live rooms the audience user has subscribed to or that are recommended by the server 140. When the audience user clicks to enter the anchor user's live room, the second terminal 160 sends the server 140 a resource loading request carrying the anchor account corresponding to that live room; the server 140 looks up the live data stream stored in association with the anchor account and pushes the queried live data stream to the second terminal 160.
Because of poor network conditions such as network fluctuation and weak network signals, the live audio stream may suffer packet loss while being pushed from the first terminal 120 to the server 140 or while being pulled by the second terminal 160 from the server 140, so the live audio stream received by the second terminal 160 is likely to contain missing audio frames. With the audio signal processing method provided by the embodiments of the present disclosure, a target audio frame highly similar to the missing audio frame can be synthesized using the audio generation model, which effectively suppresses the noise and artifacts that frame-copying PLC techniques may produce and realizes a high-quality audio packet loss concealment (PLC) mechanism. Furthermore, because only some historical audio frames and future audio frames not exceeding a certain number of frames or frame length are used when synthesizing the target audio frame, the delay of the PLC mechanism can be kept within 20 ms. A high-quality, low-delay PLC mechanism is thus realized on the receiving end (i.e., the second terminal 160) side, improving the fluency of live playback and the viewing experience of users.
Illustratively, taking a real-time two-person voice call scenario as an example, the first terminal 120 is the terminal used by a first user. The first user starts a social application on the first terminal 120 and logs in to a first account. Based on the call option in the chat interface with a second account, the first user triggers the first terminal 120 to send the server 140 a call request for the second account, the call request requesting the second account to join a two-person voice call. The server 140 forwards the call request to the second terminal 160 on which the second account is logged in; if the second account agrees to join the two-person voice call, the first terminal 120 and the second terminal 160 can communicate by voice online based on VoIP technology. Two terminals are used here only for illustration; the embodiments of the present disclosure are equally applicable to voice call scenarios with three or more participants, which are not described again here.
Because of poor network conditions such as network fluctuation and weak network signals, in a two-person voice call the audio signal sent by the first terminal 120 to the second terminal 160 through the server 140, or by the second terminal 160 to the first terminal 120 through the server 140, may suffer packet loss, so the audio signal received by the first terminal 120 or the second terminal 160 is likely to contain missing audio frames. With the audio signal processing method provided by the embodiments of the present disclosure, a target audio frame highly similar to the missing audio frame can be synthesized using the audio generation model, which effectively suppresses the noise and artifacts that frame-copying PLC techniques may produce and realizes a high-quality audio PLC mechanism. Furthermore, because only some historical audio frames and future audio frames not exceeding a certain number of frames or frame length are used when synthesizing the target audio frame, the delay of the PLC mechanism can be kept within 20 ms. A high-quality, low-delay PLC mechanism is thus realized on the receiving end (which may be the first terminal 120 or the second terminal 160), improving the playback quality of the received voice signal and the call experience of both users.
Illustratively, taking a multi-person teleconference scenario as an example, the first terminal 120 is the terminal used by the conference host. The conference host starts a teleconference application on the first terminal 120, creates a new web conference, and specifies its start time; the server 140 assigns a conference number to the web conference. After the start time is reached, the conference host enters the conference number in the teleconference application to access the web conference. Similarly, the second terminal 160 is the terminal used by any participant of the web conference, and the participant enters the conference number in the teleconference application to access the web conference.
Because of poor network conditions such as network fluctuation and weak network signals, in a web conference the audio signal spoken by the conference host or a participant into the microphone may suffer packet loss while being synchronized to each participating terminal through the server 140; that is, missing audio frames may exist in the audio signal received by the first terminal 120 or the second terminal 160.
Alternatively, the applications installed on the first terminal 120 and the second terminal 160 are the same, or the applications installed on the two terminals are the same type of application of different operating system platforms, or the applications installed on the two terminals are different versions of the same type of application developed for different models of terminals, for example, the first terminal 120 is a desktop computer and installs a PC (Personal Computer ) end application, and the second terminal 160 is a smartphone and installs a mobile end application.
The first terminal 120 may refer broadly to one of a plurality of terminals and the second terminal 160 may refer broadly to one of a plurality of terminals, with the embodiments of the present disclosure being illustrated with only the first terminal 120 and the second terminal 160. The device types of the first terminal 120 and the second terminal 160 are the same or different, and include: at least one of a smart phone, a tablet computer, a smart speaker, a smart watch, a notebook computer, or a desktop computer, but is not limited thereto. For example, the first terminal 120 may be a desktop computer and the second terminal 160 may be a smart phone, or both the first terminal 120 and the second terminal 160 may be smart phones or other handheld portable communication devices.
Those skilled in the art will recognize that the number of terminals may be greater or lesser. Such as the above-mentioned terminals may be only one, or the above-mentioned terminals may be several tens or hundreds, or more. The embodiment of the present disclosure does not limit the number of terminals and the type of devices.
Fig. 2 is a flowchart of an audio signal processing method according to an embodiment of the present disclosure. Referring to fig. 2, the method is performed by an electronic device. The following description takes the electronic device being a terminal as an example, for example the receiving terminal of the audio signal in the implementation environment described above.
In step 201, the terminal acquires a history audio frame before the audio frame and a future audio frame after the audio frame in the case that the audio frame is missing in the audio signal.
In step 202, the terminal synthesizes a target audio frame based on the historical audio frame and the future audio frame, the phonemes and semantics of the target audio frame being similar to the audio frame.
In step 203, the terminal synthesizes a compensation signal associated with the audio signal based on the historical audio frame, the target audio frame and the future audio frame, the compensation signal filling the missing audio frame with the target audio frame.
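A minimal sketch wiring steps 201 to 203 together is shown below; the zero-filling of missing frames, the frame length, and the assumption that the model maps two-channel extended audio data to a same-length waveform are all illustrative and not specified by the disclosure.

```python
import numpy as np

def conceal_packet_loss(frames, missing_flags, model, frame_len=160):
    """Illustrative wiring of steps 201-203. `frames` holds the received audio
    frames with missing frames given as None; `model` is assumed to map the
    two-channel extended audio data to a synthesized waveform of equal length."""
    # Step 201: assemble historical and future audio frames (missing frames zero-filled).
    waveform = np.concatenate([f if f is not None else np.zeros(frame_len, np.float32)
                               for f in frames])
    mask = np.concatenate([np.full(frame_len, float(flag), np.float32)
                           for flag in missing_flags])
    extended = np.stack([waveform, mask])              # two-channel extended audio data

    # Step 202: synthesize a waveform containing the target audio frame(s).
    synthesized = model(extended)

    # Step 203: fill only the missing positions to obtain the compensation signal.
    compensation = np.where(mask > 0.5, synthesized, waveform)
    return compensation
```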
According to the method provided by the embodiments of the disclosure, when an audio frame is lost, the context of the missing frame, namely the historical audio frame and the future audio frame, is used to synthesize a target audio frame that is similar to the missing audio frame in frequency, phonemes, and semantics, and the missing audio frame is filled with the target audio frame, thereby implementing a packet loss compensation mechanism for the audio signal. This mechanism does not simply copy a historical or future audio frame; instead, it synthesizes a more natural, smooth, and high-quality target audio frame, which avoids the noise and artifacts that readily occur under conventional copy-based packet loss compensation, and thus avoids the adverse effect of packet loss compensation on the audio quality of the audio signal at the receiving end; that is, the compensation signal obtained after packet loss compensation has higher audio quality and a better playback effect.
In some embodiments, synthesizing the target audio frame based on the historical audio frame and the future audio frame includes:
determining an audio clip comprised of the historical audio frame and the future audio frame;
acquiring deletion indication information of the audio fragment, wherein the deletion indication information is used for indicating whether any audio frame in the audio fragment is deleted or not;
The target audio frame is synthesized based on the audio clip and the deletion indication information.
In some embodiments, synthesizing the target audio frame based on the audio clip and the deletion indication information includes:
fusing the audio fragment and the deletion indication information to obtain extended audio data;
encoding the extended audio data to obtain audio encoding characteristics of the extended audio data;
and decoding the audio coding feature to obtain the target audio frame.
In some embodiments, encoding the extended audio data to obtain audio encoding characteristics of the extended audio data includes:
inputting the expanded audio data into an audio generation model for synthesizing missing target audio frames between historical audio frames and future audio frames;
and encoding the extended audio data through an audio encoding layer of the audio generation model to obtain the audio encoding characteristics.
In some embodiments, decoding the audio encoding feature to obtain the target audio frame comprises:
compressing the audio coding feature through a quantization compression layer of the audio generation model to obtain an audio compression feature;
and decoding the audio compression characteristic through an audio decoding layer of the audio generation model to obtain the target audio frame.
In some embodiments, the audio segment is an audio frame sequence composed of a plurality of audio frames, and the deletion indication information is a parameter sequence composed of deletion indication parameters of the plurality of audio frames;
fusing the audio clip and the deletion instruction information to obtain extended audio data, including:
for any audio frame in the audio frame sequence, splicing the audio frame with the missing indication parameter of the audio frame in the parameter sequence to obtain the dual-channel data of the audio frame;
the extended audio data composed of two-channel data of a plurality of audio frames is acquired.
In some embodiments, for any audio frame in the sequence of audio frames, the deletion indication parameter of the audio frame is assigned to 1 when the audio frame is deleted, and the deletion indication parameter of the audio frame is assigned to 0 when the audio frame is not deleted.
In some embodiments, the number of frames of the future audio frame is a target number of frames; or, the frame length of the future audio frame is the target frame length; or, the playing duration of the future audio frame is the target duration.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
Fig. 3 is a flowchart of an audio signal processing method according to an embodiment of the present disclosure. Referring to fig. 3, the audio signal processing method is performed by an electronic device. The electronic device is described here by taking a terminal as an example, for instance the receiving terminal of the audio signal in the implementation environment described above.
In step 301, the terminal receives an audio signal.
The terminal is any electronic device that receives an audio signal, and an application program supporting a VoIP service is installed and run on the terminal. Optionally, the VoIP service includes: a multiparty real-time audio call or audio-video call based on VoIP, an audio conference or audio-video conference based on VoIP, a live webcast based on VoIP, and the like; the embodiments of the present disclosure do not specifically limit the type of VoIP service.
Optionally, the application program supporting VoIP service includes: live applications, short video applications, audio-video applications, content sharing applications, content generation applications, teleconferencing applications, remote consultation applications, social applications, IP telephony applications, etc., the type of application program is not particularly limited by the disclosed embodiments.
In some embodiments, the terminal starts the application program in response to a user's start operation on the application program; for example, the user touches an icon of the application program on the desktop of the terminal, or the user inputs a start instruction for the application program to an intelligent assistant, where the start instruction includes a voice instruction or a text instruction, and the type of the start instruction is not specifically limited in the embodiments of the present disclosure. Optionally, when the user has set an automatic start condition for the application program, the operating system of the terminal automatically starts the application program upon detecting that the automatic start condition is met; for example, the automatic start condition is starting automatically at boot, or starting automatically at a fixed time, such as 5 minutes before the start of a specified audio-video conference. The embodiments of the present disclosure do not specifically limit the automatic start condition of the application program.
After the application program is started, a main interface of the application program is displayed, in which an account login option is displayed. The user performs a trigger operation on the account login option, logs in to the user's own account in the application program, and returns to the main interface after login is completed, so that the user can access the VoIP service in the main interface; for example, the user enters a designated audio-video conference, answers an audio-video call, or opens a live broadcast room associated with a certain anchor account, which is not specifically limited in the embodiments of the present disclosure. It should be noted that the user does not have to log in to an account in advance in order to access the VoIP service. For example, in a live webcast scenario, a user who has not logged in can still watch the live broadcast as a guest, but only after logging in can the user perform interactive actions such as giving a virtual gift to the anchor, publishing bullet comments, or following the anchor account.
In some embodiments, after the terminal accesses the VoIP service in the application program, the terminal can receive the audio signal sent by the sender terminal and forwarded by the server. Optionally, in the audio signal transmission process of the VoIP service, after the sender terminal collects the audio signal, it encodes and compresses the original audio signal with a preset audio compression algorithm to obtain a compressed audio signal, then packages the compressed audio signal according to the TCP (Transmission Control Protocol)/IP standard to obtain a packaged audio signal, and then sends the packaged audio signal to the server through the IP network, which forwards it to the receiver terminal (i.e., the execution subject of step 301).
Optionally, when the packetized audio signal is transmitted through the IP network, the packetized audio signal is usually transmitted in the form of audio data packets, and in VoIP service, the audio data packets are usually continuously streamed, so that the terminal receives a series of audio data packets, and after the terminal parses and decompresses the audio data packets, the terminal can recover the original audio signal.
Optionally, in some network live broadcast scenarios, the audio signal is not streamed in units of audio data packets, but is transmitted at a frame level in units of media frames (including audio frames and picture frames), and the terminal parses the received media frames, so that the original media frames can be recovered.
Whether the audio signal is streamed by taking the audio data packet as a unit or transmitted at the frame level by taking the media frame as a unit, the audio data packet (equivalent to a plurality of audio frames after being packed) or the media frame is very likely to be lost in the transmission process of the IP network due to poor network environment and the like, so that the receiving terminal cannot receive all the audio frames sent by the sending terminal.
In step 302, in the case that an audio frame is missing in the audio signal, the terminal acquires historical audio frames preceding the missing audio frame and future audio frames following the missing audio frame.
In some embodiments, the terminal detects whether there are missing audio frames in the audio signal. For example, after recovering the audio signal, the terminal checks whether the timestamps of adjacent audio frames in the audio signal are continuous; if the timestamps of a pair of adjacent audio frames jump, one or more audio frames are missing between that pair of adjacent audio frames. It should be noted that, for any pair of adjacent audio frames whose timestamps are discontinuous, all audio frames missing between them can be complemented using the historical audio frames and the future audio frames through steps 303-307 described below; for example, when N (N is greater than or equal to 1) audio frames are missing, N target audio frames can be complemented. Further, if there are multiple timestamp discontinuities between adjacent audio frames in the audio signal, there are multiple missing segments in the audio signal, each of which may span one or more frames; steps 303-307 are then performed for each missing segment so as to complement all of them. The embodiments of the present disclosure only take the complementing of any one missing segment as an example.
In some embodiments, when the terminal detects that the time stamps of adjacent audio frames in the audio signal are discontinuous, indicating that there are one or more audio frames missing between the pair of adjacent audio frames, then determining a first audio frame missing and a last audio frame missing from the one or more audio frames missing, wherein the first audio frame missing refers to the audio frame with the smallest time stamp in the one or more audio frames missing, and the last audio frame missing refers to the audio frame with the largest time stamp in the one or more audio frames missing. Next, one or more un-missing historical audio frames preceding the missing first audio frame and one or more un-missing future audio frames following the missing last audio frame are obtained.
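Purely as an illustration of the timestamp check described above, the following Python sketch locates timestamp jumps and gathers the surrounding un-missing historical and future frames; the Frame structure, the helper names and the default frame counts are hypothetical and are not taken from the present disclosure.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Frame:
    timestamp: int        # consecutive integer timestamp assigned by the sender
    samples: list         # time-domain samples carried by this audio frame

def find_gaps(frames: List[Frame]) -> List[Tuple[int, int]]:
    """Return (index_of_last_frame_before_gap, number_of_missing_frames)
    for every timestamp jump between adjacent received frames."""
    gaps = []
    for i in range(len(frames) - 1):
        jump = frames[i + 1].timestamp - frames[i].timestamp
        if jump > 1:                      # timestamps are discontinuous here
            gaps.append((i, jump - 1))    # jump - 1 frames were lost
    return gaps

def collect_context(frames: List[Frame], gap_index: int,
                    n_history: int = 4, n_future: int = 2):
    """Pick the un-missing historical frames before the gap and the
    un-missing future frames after it."""
    history = frames[max(0, gap_index - n_history + 1): gap_index + 1]
    future = frames[gap_index + 1: gap_index + 1 + n_future]
    return history, future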
In some embodiments, when acquiring the historical audio frames, a specified number of historical audio frames are acquired, such as N1 historical audio frames, where N1 is an integer greater than or equal to 1; or, historical audio frames with a specified frame length are acquired, such as historical audio frames with a frame length of 128 bytes; alternatively, historical audio frames with a specified play duration are acquired, for example historical audio frames with a play duration of 40 ms (milliseconds). The selection manner of the historical audio frames is not specifically limited in the embodiments of the present disclosure.
In some embodiments, when acquiring the future audio frames, a specified number of future audio frames are acquired, such as N2 future audio frames; in other words, the number of future audio frames is a target number of frames N2, where N2 is an integer greater than or equal to 1. Or, future audio frames with a specified frame length are acquired, for example future audio frames with a frame length of 128 bytes; in other words, the frame length of the future audio frames is a target frame length, where the target frame length is a value greater than 0. Alternatively, future audio frames with a specified play duration are acquired, for example future audio frames with a play duration of 40 ms (milliseconds); in other words, the play duration of the future audio frames is a target duration, where the target duration is any value greater than 0. The selection manner of the future audio frames is not specifically limited in the embodiments of the present disclosure.
Since a target audio frame that is as similar as possible to the missing audio frame must be synthesized using the historical audio frames before the missing audio frame and the future audio frames after it, if the terminal has not yet received the future audio frames when the loss is detected (i.e., the future audio frames have not yet been generated or have not yet been transmitted to the receiving terminal), the terminal needs to wait for the future audio frames to arrive before it can synthesize and play the target audio frame. In other words, how many future audio frames are needed to synthesize the target audio frame determines the delay with which the terminal plays the audio signal.
In step 303, the terminal determines an audio clip consisting of the historical audio frame and the future audio frame.
In some embodiments, the terminal forms the one or more historical audio frames, the one or more missing audio frames and the one or more future audio frames obtained in step 302 into an audio clip. For example, since the missing audio frames are not available, the terminal fills their positions in the audio clip with blank audio frames, which is equivalent to initializing the one or more missing audio frames as blank audio frames; the historical audio frames and the future audio frames are then used to predict which frequencies and amplitudes the blank audio frames should have, so that a target audio frame is synthesized, and the machine-synthesized target audio frame is kept as similar as possible to the missing audio frame. In other words, the audio clip is an audio frame sequence composed of a plurality of audio frames, the audio frame sequence includes the historical audio frames, the blank audio frames and the future audio frames, and the timestamps of adjacent audio frames in the audio frame sequence are all continuous.
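As a minimal sketch of this clip construction, assuming fixed-length frames represented as numpy arrays (the zero-filled arrays stand for the blank audio frames; names are illustrative only):

import numpy as np

def build_audio_clip(history, future, n_missing, frame_len):
    """Assemble the audio frame sequence: historical frames, zero-valued blank
    frames standing in for the missing frames, then future frames."""
    blanks = [np.zeros(frame_len, dtype=np.float32) for _ in range(n_missing)]
    return list(history) + blanks + list(future)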
In step 304, the terminal obtains deletion indication information of the audio segment, where the deletion indication information is used to indicate whether any audio frame in the audio segment is deleted.
The deletion indication information is a parameter sequence formed by deletion indication parameters of a plurality of audio frames.
In some embodiments, the terminal may assign a deletion indication parameter to each audio frame in the audio frame sequence based on the audio clip, i.e., the audio frame sequence obtained in step 303, where the deletion indication parameters of each audio frame may form a parameter sequence according to the time stamp order, and this parameter sequence is the deletion indication information of the audio clip.
In some embodiments, it is assumed that the missing indication parameter of each audio frame is binary data, where binary data refers to a data type whose value is either 1 or 0. In this case, for any audio frame in the audio frame sequence, when the audio frame is missing, that is, when the audio frame is a blank audio frame, the missing indication parameter of the audio frame is assigned the value 1; when the audio frame is not missing, that is, when the audio frame is not a blank audio frame, the missing indication parameter of the audio frame is assigned the value 0.

In some embodiments, it is assumed that the missing indication parameter of each audio frame is Boolean data, where Boolean data refers to a data type whose value is either True or False. In this case, for any audio frame in the audio frame sequence, when the audio frame is missing, i.e., when the audio frame is a blank audio frame, the missing indication parameter of the audio frame is assigned the value True; when the audio frame is not missing, i.e., when the audio frame is not a blank audio frame, the missing indication parameter of the audio frame is assigned the value False.
In some embodiments, when the loss indication parameter is binary data, it may be defined that a value of 1 represents that the audio frame is not lost and a value of 0 represents that the audio frame is lost, or when the loss indication parameter is boolean data, it may be defined that a value of True represents that the audio frame is not lost and a value of False represents that the audio frame is lost, and a technician may configure loss indication parameters of different types and different assignment rules according to requirements.
In step 305, the terminal fuses the audio clip and the deletion instruction information to obtain extended audio data.
In some embodiments, since the audio clip is essentially an audio frame sequence, for any audio frame in the sequence, which may be a historical audio frame, a blank audio frame or a future audio frame, the terminal finds the missing indication parameter of that audio frame in the missing indication information, i.e., in the parameter sequence, and then concatenates the audio frame with its missing indication parameter to obtain the dual-channel data of the audio frame. In other words, one channel of the dual-channel data carries the time-domain sound signal of the audio frame itself, and the other channel carries the missing indication parameter of the audio frame. The terminal obtains one piece of dual-channel data for each audio frame in the audio frame sequence, and finally obtains the extended audio data composed of the dual-channel data of the plurality of audio frames.
Schematically, s(t) is used to represent the audio frame with timestamp t in the audio frame sequence, and S_n is used to represent the nth audio packet. Assume that S_n is lost during audio transmission; then all audio frames contained in S_n are missing. Assuming that L is the number of audio frames contained in S_n (i.e., in an audio signal with a play duration of 1 second), then S_n = [s(t_n), ..., s(t_n + L - 1)]; for example, for an audio signal with a sampling rate of 16000 Hz (hertz), L = 16. Furthermore, l(t) is used to represent the missing indication parameter of the audio frame with timestamp t in the audio frame sequence; in the case that l(t) is binary data, l(t) may be defined as: l(t) = 1 when the audio frame s(t) is missing, and l(t) = 0 when the audio frame s(t) is not missing.
For the audio frame s(t) with timestamp t in the audio frame sequence, the audio frame s(t) and its missing indication parameter l(t) are spliced and combined into dual-channel data; after this operation is performed on all audio frames in the audio frame sequence, the extended audio data composed of the dual-channel data of all the audio frames is obtained.
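The following sketch shows one possible way to perform this fusion, assuming each audio frame is a numpy array of samples and the missing indication information is a 0/1 mask with one entry per frame; the function name and shapes are illustrative assumptions, not part of the disclosure.

import numpy as np

def fuse_clip_and_mask(frame_sequence, loss_mask):
    """Splice each frame s(t) with its missing indication parameter l(t):
    channel 0 holds the time-domain samples, channel 1 holds l(t) broadcast
    over the frame, and the per-frame results are concatenated in time."""
    per_frame = []
    for frame, missing in zip(frame_sequence, loss_mask):
        indicator = np.full_like(frame, float(missing))         # l(t) for this frame
        per_frame.append(np.stack([frame, indicator], axis=0))  # shape (2, frame_len)
    return np.concatenate(per_frame, axis=1)                    # shape (2, total_samples)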
In step 306, the terminal encodes the extended audio data to obtain audio encoding characteristics of the extended audio data.
In some embodiments, the encoding operation of this step 306 and the decoding operation of the following step 307 are performed by a trained audio generation model; the training method of the audio generation model will be described in detail in the next embodiment and is not repeated here. Optionally, the audio generation model is a model trained based on a GAN architecture. The GAN architecture includes a generator and a discriminator: the generator tries to generate target audio frames that can deceive the discriminator, while the discriminator tries to distinguish which audio frames are real and which are synthesized by the generator. When the stop condition of training is met, the trained generator obtained is the audio generation model referred to in the embodiments of the present disclosure.
In some embodiments, the audio generation model is used to synthesize the missing target audio frame between the historical audio frame and the future audio frame, the input signal of the audio generation model is the extended audio data obtained in the step 305, the output signal may be the target audio frame, or the output signal may also be the compensation signal obtained after the missing audio frame (i.e. the blank audio frame after initialization) is replaced by the target audio frame, and the output signal is not specifically limited in the embodiments of the present disclosure.
In some embodiments, after the audio generation model is trained by the server, the model parameter set of the audio generation model is pruned and compressed, embedded into the SDK (Software Development Kit) of an application program supporting the VoIP service, and delivered through the SDK to each terminal on which the application program is installed, so that an audio packet loss compensation mechanism on the mobile side can be realized by deploying the audio generation model on the mobile terminal.
In some embodiments, the audio generation model may be any machine learning model based on a neural network architecture. For example, the model architecture of the audio generation model includes: a SoundStream model, a U-Net model, a DRNN (Deep Recurrent Neural Network), and the like; the embodiments of the present disclosure do not specifically limit the model architecture of the audio generation model.
Illustratively, taking the audio generation model as a SoundStream model as an example, the SoundStream model includes one or more audio coding layers, one or more quantization compression layers and one or more audio decoding layers. The audio coding layers are used for audio-encoding the input signal to obtain an encoded signal, the quantization compression layers are used for performing quantization compression of feature vectors on the encoded signal to obtain a compressed signal, and the audio decoding layers are used for decoding and recovering the compressed signal to obtain the target audio frame. It should be noted that the audio coding layers and the audio decoding layers generally need to be symmetrical (i.e., their numbers of layers are consistent).
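To make this encoder-quantizer-decoder arrangement concrete, here is a deliberately tiny PyTorch sketch. The layer counts, kernel sizes, strides and channel widths are placeholders and do not reflect the actual SoundStream configuration, and the quantization compression stage is left out here (it is sketched separately below).

import torch
import torch.nn as nn

class TinyCodecLikeModel(nn.Module):
    """Toy arrangement of audio coding layers, a (here omitted) quantization
    compression stage, and symmetric audio decoding layers."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(       # two audio coding layers
            nn.Conv1d(2, channels, kernel_size=7, stride=2, padding=3), nn.ELU(),
            nn.Conv1d(channels, channels, kernel_size=7, stride=2, padding=3), nn.ELU(),
        )
        self.decoder = nn.Sequential(       # two audio decoding layers (symmetric)
            nn.ConvTranspose1d(channels, channels, kernel_size=8, stride=2, padding=3), nn.ELU(),
            nn.ConvTranspose1d(channels, 1, kernel_size=8, stride=2, padding=3),
        )

    def forward(self, extended_audio: torch.Tensor) -> torch.Tensor:
        # extended_audio: (batch, 2, samples) -- waveform channel + mask channel
        encoded = self.encoder(extended_audio)   # audio coding features
        compressed = encoded                     # quantization compression omitted here
        return self.decoder(compressed)          # reconstructed waveform

# Usage with a 2-channel input of 64 samples:
# model = TinyCodecLikeModel()
# out = model(torch.randn(1, 2, 64))             # out has shape (1, 1, 64)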
In some embodiments, the terminal inputs the extended audio data into an audio generation model, and encodes the extended audio data through one or more audio encoding layers of the audio generation model to obtain audio encoding characteristics of the extended audio data.
Optionally, the one or more audio coding layers in the audio generation model are connected in series with each other. Except that the first audio coding layer takes the extended audio data as its input signal, each of the remaining audio coding layers takes the feature vector output by the previous audio coding layer as its input signal; each audio coding layer performs audio coding on its input signal and feeds the resulting feature vector into the next audio coding layer, and this operation is repeated until the last audio coding layer outputs the audio coding feature.
In step 307, the terminal decodes the audio coding feature to obtain a target audio frame, where the phonemes and semantics of the target audio frame are similar to those of the missing audio frame.
In some embodiments, for any audio generation model with a codec architecture, after the audio coding features are acquired through the audio coding layer, the audio coding features can be directly decoded by using the audio decoding layer to acquire the final recovered target audio frame, so that the synthesis flow of the target audio frame can be simplified.
Illustratively, taking the audio generation model as the SoundStream model as an example, since the audio coding layer of the SoundStream model can generate feature vectors (i.e., audio coding features) that take an unlimited number of values, in order to transfer the audio coding features to the audio decoding layer using a finite number of bits, the original audio coding features must be replaced with nearby vectors from a finite set (called a codebook), a process referred to as quantization compression of the feature vectors. In view of this, the SoundStream model provides a quantization compression layer (also referred to as a residual vector quantizer, Residual Vector Quantization, RVQ) between the audio coding layer and the audio decoding layer, and usually provides a plurality of quantization compression layers: the first quantization compression layer quantizes the code vector at a medium resolution, and each subsequent quantization compression layer processes the residual of the previous layer, where the residual is the residual value obtained by splicing the input vector and the output vector of the previous quantization compression layer. By adding or removing quantization compression layers of the SoundStream model, the bit rate can easily be increased or decreased, which gives the SoundStream model high controllability over the bit rate of the synthesized target audio frame.
In some embodiments, the terminal inputs the audio coding feature into the quantization compression layers of the audio generation model, and compresses the audio coding feature through the quantization compression layers to obtain the audio compression feature. For the SoundStream model, the audio coding feature output by the last audio coding layer is input into the first quantization compression layer, where it is quantized and compressed at a medium resolution to obtain a compression vector; the compression vector and the audio coding feature are spliced to obtain the residual of the first quantization compression layer, and this residual is input into the second quantization compression layer for similar processing. Starting from the second quantization compression layer, each subsequent quantization compression layer processes the residual of the previous layer, and the last quantization compression layer finally outputs the audio compression feature.
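The sketch below illustrates the residual idea in numpy; here the residual is computed as the difference between a stage's input and its quantized output, which is the usual residual vector quantization formulation, and the codebooks are arbitrary placeholder arrays rather than anything taken from the disclosure.

import numpy as np

def residual_vector_quantize(feature, codebooks):
    """Multi-stage quantization: each stage quantizes, against its own codebook,
    the residual left over by the previous stage, and the per-stage codewords
    are summed to form the compressed feature.
    feature: (dim,) array; codebooks: list of (num_codes, dim) arrays."""
    residual = np.asarray(feature, dtype=np.float32)
    quantized = np.zeros_like(residual)
    for codebook in codebooks:
        distances = np.linalg.norm(codebook - residual, axis=1)  # nearest codeword
        code = codebook[np.argmin(distances)]
        quantized = quantized + code
        residual = residual - code                               # passed to the next stage
    return quantized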
In some embodiments, the terminal inputs the audio compression feature into the audio decoding layers of the audio generation model, and decodes the audio compression feature through the audio decoding layers to obtain the target audio frame. For the SoundStream model, the audio compression feature output by the last quantization compression layer is input into the first audio decoding layer, which audio-decodes it and feeds the resulting feature vector into the second audio decoding layer for similar processing. In other words, the one or more audio decoding layers in the audio generation model are connected in series with each other: except that the first audio decoding layer takes the audio compression feature as its input signal, each of the remaining audio decoding layers takes the feature vector output by the previous audio decoding layer as its input signal, and each audio decoding layer performs the above operation until the last audio decoding layer outputs the target audio frame.
In the above steps 305-307, the terminal synthesizes a target audio frame similar to the missing audio frame based on the audio clip and the missing indication information. The similarity between the target audio frame and the missing audio frame is ensured by the training and tuning of the audio generation model, and covers not only similarity in frequency and amplitude but also similarity in pronunciation and semantics of the audio signal, so that the synthesized target audio frame can achieve high-quality audio packet loss compensation.
In the above steps 303-307, the terminal synthesizes the target audio frame based on the historical audio frame and the future audio frame. Because both the historical audio frame (i.e., the preceding context of the missing audio frame) and the future audio frame (i.e., the following context of the missing audio frame) are used as context information, a target audio frame that is not a simple copy of the historical audio frame or the future audio frame can be synthesized accurately, and the target audio frame has a high similarity to the missing audio frame, which greatly improves the audio quality of audio packet loss compensation.
In step 308, the terminal synthesizes a compensation signal associated with the audio signal based on the historical audio frame, the target audio frame and the future audio frame, the compensation signal being obtained by filling the missing audio frame with the target audio frame.
In some embodiments, the terminal uses the target audio frames predicted in step 307 to fill the missing audio frames in the audio signal. For example, if N (N ≥ 1) audio frames are missing in the audio signal, N target audio frames are predicted and used to fill the N missing audio frames, so as to obtain the final compensation signal (i.e., a complete audio signal) after audio packet loss compensation, and the terminal can then play the compensation signal.
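A minimal sketch of this filling step, assuming the frames are numpy arrays, the missing indication information is a 0/1 mask, and the synthesized target frames are supplied in timestamp order (names are illustrative):

import numpy as np

def fill_missing(frame_sequence, loss_mask, target_frames):
    """Replace each frame marked as missing with the next synthesized target
    frame and concatenate everything into the compensated signal."""
    targets = iter(target_frames)
    repaired = [next(targets) if missing else frame
                for frame, missing in zip(frame_sequence, loss_mask)]
    return np.concatenate(repaired)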
According to the method provided by the embodiments of the present disclosure, when an audio frame is lost, the context information of that audio frame, namely the historical audio frame and the future audio frame, is used to synthesize a target audio frame similar to the missing audio frame in the dimensions of frequency, phonemes and semantics, and the target audio frame is used to fill the missing audio frame, thereby realizing a packet loss compensation mechanism for the audio signal. This mechanism is not a simple copy of the historical audio frame or the future audio frame; instead, a more natural, smooth and higher-quality target audio frame is synthesized, which avoids the noise and artifacts that easily appear under traditional packet loss compensation mechanisms and avoids the adverse effect of packet loss compensation on the audio quality of the audio signal at the receiving end. In other words, the compensated signal obtained through packet loss compensation has higher audio quality and a better playing effect, thereby optimizing the product experience of users accessing the VoIP service.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
The above embodiment describes in detail how to use the trained audio generation model to perform audio packet loss compensation. The following embodiment of the present disclosure describes in detail a training manner of the audio generation model, taking training of the audio generation model based on a GAN architecture as an example: the audio generation model serves as the generator under the GAN architecture, the audio discrimination model serves as the discriminator, and the two play an adversarial game against each other, so that the trained audio generation model is obtained when the stop condition is met.
Fig. 4 is a flowchart of a training method of an audio generation model according to an embodiment of the present disclosure. As shown in fig. 4, the execution subject of this embodiment is an electronic device, which is described here by taking a server as an example.
In step 401, the server obtains a sample audio piece and missing indication information of the sample audio piece.
In some embodiments, the server obtains sample audio from a sample audio set, segments the sample audio to obtain one or more sample audio segments, and the embodiments of the present disclosure illustrate a single sample audio segment processing flow. The sample audio set may be locally stored by a server or may be downloaded from a cloud database, and the source of the sample audio set in the embodiments of the present disclosure is not specifically limited.
Optionally, the sample audio contained in the sample audio set includes, but is not limited to: clean voice data of different genders and different languages, noise data in different noise scenarios, and music data or song data of different genres. It should be noted that, when the sample audio involves clean voice data of relevant users, the collection, use and analysis of the clean voice data are fully authorized and individually agreed to by the relevant users, and the relevant laws, regulations and standards of the relevant countries and regions must be complied with.
In some embodiments, after the sample audio is segmented to obtain the sample audio segments, each sample audio segment is a complete audio signal (i.e., no packet loss has occurred). Since audio packet loss needs to be simulated for the sample audio segment used as the input signal of the model training stage, the server can use a packet loss simulation model to simulate which audio frames in the sample audio segment are lost; in other words, the server inputs the sample audio segment into the packet loss simulation model, and the packet loss simulation model predicts the missing indication information of the sample audio segment.
In some embodiments, the packet loss simulation model includes, but is not limited to: a Markov channel model, a Gilbert-Elliott channel model, a packet loss model obtained from real audio transmission scenarios, and the like; the embodiments of the present disclosure do not specifically limit the model type of the packet loss simulation model.
Illustratively, for a sample audio segment containing 96 frames, the packet loss simulation model predicts that the 17 th to 32 th frames and the 49 th to 64 th frames of the sample audio segment are lost, but the server does not need to truly discard the 17 th to 32 th frames and the 49 th to 64 th frames of the sample audio segment, but only needs to mark the deletion indication parameters of the 17 th to 32 th frames and the 49 th to 64 th frames as 1 in the deletion indication information of the sample audio segment (assuming that the deletion indication parameters are marked as 1 to represent the audio frame loss).
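For instance, the 96-frame example above could be encoded as the following missing indication mask (a sketch only; the array layout is an assumption):

import numpy as np

# Hypothetical encoding of the 96-frame example: the packet loss simulation
# marks frames 17-32 and 49-64 (1-based) as lost, so only the missing
# indication parameters are set to 1; the frames themselves are kept.
loss_mask = np.zeros(96, dtype=np.int32)
loss_mask[16:32] = 1   # frames 17-32
loss_mask[48:64] = 1   # frames 49-64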
In step 402, the server fuses the sample audio segment and the deletion instruction information to obtain sample extension data.
Step 402 is similar to step 305 and will not be described here.
In step 403, the server inputs the sample extension data to an audio generation model, and obtains a target audio clip associated with the sample audio clip through the audio generation model.
The "missing audio frame" in the training stage refers to an audio frame marked as missing by the missing indication information, that is, an audio frame with a missing indication parameter of 1 predicted by the packet loss simulation model, and these audio frames are not actually lost in the real network transmission, but are all saved in the server for later obtaining the relevant loss terms in the loss function values.
Wherein the missing audio frame is filled with the synthesized target audio frame in the target audio clip.
In the training stage, the server inputs the sample extension data into the audio generation model, encodes the sample extension data through the audio generation model to obtain the audio coding feature of the sample extension data, decodes the audio coding feature through the audio generation model to obtain target audio frames similar in phonemes and semantics to the missing audio frames, and replaces the audio frames marked as missing in the sample audio segment (i.e., those whose missing indication parameter is 1) with the target audio frames to obtain the target audio segment.
The process of predicting the target audio frame using the audio generation model in step 403 is similar to steps 306-308, and will not be described in detail.
In step 404, the server inputs the sample audio segment and the target audio segment into an audio discrimination model, and obtains the respective audio discrimination parameters of the sample audio segment and the target audio segment through the audio discrimination model, where the audio discrimination parameter characterizes the likelihood that the input audio segment is a machine-synthesized signal.
In some embodiments, the server inputs the sample audio segment into the audio discrimination model, extracts the discrimination feature of the sample audio segment through the audio discrimination model, and performs exponential normalization on the discrimination feature to obtain the audio discrimination parameter of the sample audio segment. For example, the audio discrimination parameter is characterized as the predicted probability that the sample audio segment is a machine-synthesized signal: the closer the predicted probability is to 1, the more the audio discrimination model regards the sample audio segment as a machine-synthesized signal; the closer the predicted probability is to 0, the more the audio discrimination model regards the sample audio segment as an audio signal from a real scene (rather than a machine-synthesized signal); and when the predicted probability is close to 0.5, the audio discrimination model finds the sample audio segment difficult to distinguish.

Similarly, the server inputs the target audio segment into the audio discrimination model, extracts the discrimination feature of the target audio segment through the audio discrimination model, and performs exponential normalization on the discrimination feature to obtain the audio discrimination parameter of the target audio segment. For example, the audio discrimination parameter is characterized as the predicted probability that the target audio segment is a machine-synthesized signal: the closer the predicted probability is to 1, the more the audio discrimination model regards the target audio segment as a machine-synthesized signal; the closer the predicted probability is to 0, the more the audio discrimination model regards the target audio segment as an audio signal from a real scene (rather than a machine-synthesized signal); and when the predicted probability is close to 0.5, the audio discrimination model finds the target audio segment difficult to distinguish.
It should be noted that the server pre-trains an audio discrimination model using audio signals from real scenes and some machine-synthesized signals, and then puts the audio discrimination model into GAN adversarial learning. Both the audio discrimination model and the audio generation model are continuously optimized during adversarial learning, but the audio discrimination model is only used to jointly optimize the audio generation model in the training stage and is not used in the actual inference stage (i.e., it is not actually deployed for audio packet loss compensation).
In some embodiments, the audio discrimination model may be any audio classification model based on a neural network architecture; the embodiments of the present disclosure do not specifically limit the model architecture of the audio discrimination model.
Illustratively, the audio discrimination model includes one or more residual convolution layers, a fully connected layer and an exponential normalization (Softmax) layer. Taking an audio discrimination model that includes a plurality of residual convolution layers as an example, the server inputs the target audio segment into the first residual convolution layer, performs a convolution operation on the target audio segment through that residual convolution layer to obtain a feature vector, and splices the feature vector with the target audio segment to obtain the residual of the first residual convolution layer; the residual of the first residual convolution layer is input into the second residual convolution layer for similar processing, and from the second residual convolution layer onward, each subsequent residual convolution layer processes the residual of the previous layer, until the last residual convolution layer outputs the discrimination feature of the target audio segment. The discrimination feature is then input into the fully connected layer for full connection processing to obtain a fully connected feature, the fully connected feature is input into the Softmax layer for exponential normalization, and the Softmax layer outputs the audio discrimination parameter. It should be noted that this example only describes the processing flow of the target audio segment; the processing flow of the sample audio segment is similar and is not repeated here.
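A toy PyTorch sketch of such a discriminator follows; the number of residual convolution layers, the channel width and the clip length are placeholders, and the residual is taken here as an additive skip connection, which is a common simplification rather than the exact combination described above.

import torch
import torch.nn as nn

class TinyAudioDiscriminator(nn.Module):
    """Toy discriminator: stacked residual 1-D convolutions, a fully connected
    layer, and a Softmax over {real, machine-synthesized}."""
    def __init__(self, channels: int = 16, clip_len: int = 256):
        super().__init__()
        self.input_proj = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        self.res_convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=3, padding=1) for _ in range(3)]
        )
        self.fc = nn.Linear(channels * clip_len, 2)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 1, clip_len) time-domain audio segment
        x = self.input_proj(clip)
        for conv in self.res_convs:
            x = x + torch.relu(conv(x))              # residual convolution block
        logits = self.fc(x.flatten(start_dim=1))     # fully connected layer
        return torch.softmax(logits, dim=-1)         # audio discrimination parameters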
In step 405, the server determines, based on the respective audio discrimination parameters of the sample audio piece and the target audio piece, a discrimination loss term for the audio generation model, where the discrimination loss term is used to characterize whether the target audio frame synthesized by the audio generation model can be accurately identified by the audio discrimination model.
In some embodiments, the server performs the above steps 401-403 for a plurality of sample audio segments, so that a target audio segment is predicted for each sample audio segment. Then, through the above step 404, each sample audio segment is input into the audio discrimination model to obtain the audio discrimination parameter of the sample audio segment, and similarly each target audio segment is input into the audio discrimination model to obtain the audio discrimination parameter of the target audio segment; this is equivalent to using the audio discrimination model to judge, for each sample audio segment and its associated target audio segment, whether it is a machine-synthesized signal. After all sample audio segments of one iteration have been traversed, the audio discrimination parameters of the plurality of sample audio segments and of their associated target audio segments in that iteration are used, and the cross entropy of the audio discrimination parameters in that iteration is taken as the discrimination loss term of that iteration.
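One possible reading of this cross entropy is sketched below in PyTorch, under the assumptions that the discriminator outputs two-class Softmax probabilities and that real sample clips are labelled 0 and synthesized target clips 1; the labelling convention and function name are not taken from the disclosure.

import torch
import torch.nn.functional as F

def discrimination_loss(p_sample: torch.Tensor, p_target: torch.Tensor) -> torch.Tensor:
    """Cross entropy over the audio discrimination parameters of one iteration.
    p_sample / p_target: (batch, 2) Softmax outputs for the real sample clips
    and the generator-synthesized target clips."""
    probs = torch.cat([p_sample, p_target], dim=0)
    labels = torch.cat([torch.zeros(p_sample.shape[0], dtype=torch.long),
                        torch.ones(p_target.shape[0], dtype=torch.long)])
    return F.nll_loss(torch.log(probs + 1e-8), labels)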
In step 406, the server obtains a spectral loss term for the audio generation model based on the sample audio piece and the target audio piece, the spectral loss term being used to characterize a degree of difference in frequency domain space between the sample audio piece and the target audio piece.
In some embodiments, since the sample audio segment and the target audio segment are both time-domain signals, in order to measure the signal difference in the frequency domain for each sample audio segment and target audio segment, the server may perform an FFT (Fast Fourier Transform) on each pair of sample audio segment and target audio segment, so that the sample audio segment and the target audio segment can each be converted from a time-domain signal into a frequency-domain signal.
In some embodiments, the frequency-domain signal obtained by the FFT is affected by the sampling rate, and the sampling rate of the FFT determines the resolution of the frequency-domain signal. In other words, at different sampling rates, performing the FFT on the same sample audio segment or target audio segment yields frequency-domain signals with different resolutions. In this case, the server may perform time-frequency transformation on both the sample audio segment and the target audio segment through the FFT at different sampling rates to obtain the sample frequency signal of the sample audio segment and the target frequency signal of the target audio segment.
It should be noted that, for the same sample audio segment, a sample frequency signal is obtained at each sampling rate, so that a plurality of sample frequency signals are obtained at a plurality of sampling rates, and similarly, for the same target audio segment, a target frequency signal is obtained at each sampling rate, so that a plurality of target frequency signals are obtained at a plurality of sampling rates.
In some embodiments, for any pair of a sample audio segment and a target audio segment, the server performs FFT at a plurality of sampling rates, respectively, to obtain a plurality of sample frequency signals obtained by converting the sample audio segment at the plurality of sampling rates, respectively, and a plurality of target frequency signals obtained by converting the target audio segment at the plurality of sampling rates, respectively.
In some embodiments, the server is capable of determining a time-frequency loss term and a signal-to-noise loss term of the audio generation model based on the sample frequency signal and the target frequency signal at different sampling rates, respectively, wherein the time-frequency loss term is used for representing a degree of difference in frequency amplitude of the sample frequency signal and the target frequency signal, and the signal-to-noise loss term is used for representing a signal-to-noise ratio of the target frequency signal.
Next, a description will be given of a process of acquiring a time-frequency loss term of the audio generation model.
In some embodiments, for any sample frequency component of any sample audio frame in the sample frequency signal at any sample rate, the server determines a target frequency component in the target frequency signal that corresponds to the sample frequency component. Optionally, because each sample audio segment has an association relationship with one target audio segment, under the same sampling rate, there is also an association relationship between the sample frequency signal of the sample audio segment and the target frequency signal of the target audio segment, and for any sample frequency signal, a corresponding target frequency signal can be found.
Illustratively, assume that S(n, k) is used to represent the frequency amplitude of the kth sample frequency component of the nth audio frame in the sample frequency signal at any sampling rate, and that the sample frequency signal contains K sample frequency components in total (K also represents the frame length of the FFT and is an integer power of 2), where k is an integer greater than or equal to 1 and less than or equal to K. Likewise, S'(n, k) is used to represent the frequency amplitude of the kth target frequency component of the nth audio frame in the target frequency signal associated with the sample frequency signal. Obviously, S(n, k) and S'(n, k) are the respective frequency amplitudes of a pair of corresponding frequency components.
In some embodiments, the server obtains an L1 norm between the frequency magnitude of the sample frequency component and the frequency magnitude of the target frequency component, and further obtains an L2 norm between the natural logarithm of the frequency magnitude of the sample frequency component and the natural logarithm of the frequency magnitude of the target frequency component.
Schematically, for the respective frequency amplitudes S(n, k) and S'(n, k) of any pair of corresponding frequency components, the L1 norm of the pair of frequency components in frequency amplitude, ||S(n, k) - S'(n, k)||_1, is obtained, as well as the L2 norm of the natural logarithms of the frequency amplitudes of the pair of frequency components, ||ln(S(n, k)) - ln(S'(n, k))||_2.
In some embodiments, the time-frequency loss term is obtained based on the L1 norms and L2 norms of a plurality of sample audio frames over a plurality of sample frequency components at a plurality of sampling rates. Optionally, at the same sampling rate, the L1 norms of the same audio frame over all frequency components are added to obtain the L1 norm sum of that audio frame, and the L1 norm sums of all audio frames in the same sample audio segment are added to obtain the L1 norm sum of the whole sample audio segment; similarly, the L2 norms of the same audio frame over all frequency components are added to obtain the L2 norm sum of that audio frame, and the L2 norm sums of all audio frames in the same sample audio segment are added to obtain the L2 norm sum of the whole sample audio segment. Then, the L1 norm sum and the L2 norm sum of the sample audio segment are weighted and summed to obtain the spectral loss component at the current sampling rate, and the spectral loss components at all sampling rates are summed to obtain the final time-frequency loss term.
Illustratively, assume that L_TF is used to represent the time-frequency loss term; then the time-frequency loss term L_TF can be expressed as the following formula:

L_TF = Σ_K Σ_n Σ_k ( ||S(n, k) - S'(n, k)||_1 + α · ||ln(S(n, k)) - ln(S'(n, k))||_2 )

where K represents the frame length of the FFT, n represents the timestamp sequence number of the audio frame in the sample audio segment, k represents the sequence number of the frequency component, k is an integer greater than or equal to 1 and less than or equal to K, S(n, k) represents the frequency amplitude of the kth sample frequency component of the nth audio frame at the current sampling rate, S'(n, k) represents the frequency amplitude of the kth target frequency component of the nth audio frame at the current sampling rate, ln represents the natural logarithm, ||·||_1 represents taking the L1 norm, ||·||_2 represents taking the L2 norm, and α represents the weight factor for combining the terms at multiple resolutions (i.e., multiple sampling rates). α is positively correlated with the parameter K, so that when K, i.e., the frame length of the FFT, is larger, the total number of frequency components of the frequency-domain signal is larger and the resolution of the frequency-domain signal is higher, and the value of the weight factor α is increased accordingly, giving more weight to the L2 norm sum term under the high-resolution condition.
From the above formula, it can be seen that the time-frequency loss term L_TF includes an L1 norm sum and an L2 norm sum. The L1 norm sum represents the absolute degree of difference of the frequency amplitudes of the audio frames on the same frequency component, while the L2 norm sum, owing to the natural logarithm ln, better fits the perceived auditory difference of the frequency amplitudes on the same frequency component, because the amplitude difference perceived by human hearing does not vary linearly with the frequency amplitudes of different frequency components; adding the L2 norm sum to fit the auditory difference therefore greatly improves the accuracy of the time-frequency loss term.
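Under a per-component reading of the formula above, the time-frequency loss term could be computed as in the following numpy sketch; the frame segmentation, the set of FFT sizes and the concrete choice alpha = sqrt(K/2) are assumptions made only for illustration and are not taken from the disclosure.

import numpy as np

def time_frequency_loss(sample_clip, target_clip, fft_sizes=(256, 512, 1024)):
    """Multi-resolution time-frequency loss: per-component L1 magnitude
    differences plus alpha-weighted log-magnitude differences, summed over
    components, frames and FFT sizes."""
    eps = 1e-7
    loss = 0.0
    for K in fft_sizes:
        alpha = np.sqrt(K / 2.0)                    # example of a weight growing with K
        n_frames = min(len(sample_clip), len(target_clip)) // K
        for n in range(n_frames):
            s = np.abs(np.fft.rfft(sample_clip[n * K:(n + 1) * K]))      # S(n, k)
            s_hat = np.abs(np.fft.rfft(target_clip[n * K:(n + 1) * K]))  # S'(n, k)
            loss += np.sum(np.abs(s - s_hat))                            # L1 terms
            loss += alpha * np.sum(np.abs(np.log(s + eps) - np.log(s_hat + eps)))  # log terms
    return loss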
Next, a description will be given of an acquisition process of a signal-to-noise loss term of the audio generation model.
In some embodiments, for any sample frequency component of any sample audio frame in the sample frequency signal at any sample rate, the server determines a target frequency component in the target frequency signal that corresponds to the sample frequency component. Optionally, because each sample audio segment has an association relationship with one target audio segment, under the same sampling rate, there is also an association relationship between the sample frequency signal of the sample audio segment and the target frequency signal of the target audio segment, and for any sample frequency signal, a corresponding target frequency signal can be found.
Illustratively, assume that S(n, k) is used to represent the frequency amplitude of the kth sample frequency component of the nth audio frame in the sample frequency signal at any sampling rate, and that the sample frequency signal contains K sample frequency components in total (K also represents the frame length of the FFT and is an integer power of 2), where k is an integer greater than or equal to 1 and less than or equal to K. Likewise, S'(n, k) is used to represent the frequency amplitude of the kth target frequency component of the nth audio frame in the target frequency signal associated with the sample frequency signal. Obviously, S(n, k) and S'(n, k) are the respective frequency amplitudes of a pair of corresponding frequency components.
In some embodiments, the server divides the frequency amplitude of the target frequency component by the cosine of a frequency characteristic included angle to obtain the signal information of the target frequency signal, where the frequency characteristic included angle is the characteristic included angle between the sample frequency component and the target frequency component in the frequency-domain space.
Illustratively, for the respective frequency magnitudes S (n, k) and S' (n, k) of any pair of frequency components having a correspondence, S is used target (n, k) represents signal information, then signal information S target (n, k) is defined as follows:
where n represents the time stamp sequence number of the audio frame in the sample audio segment, K represents the sequence number of the frequency component, K is an integer greater than or equal to 1 and less than or equal to K, K represents the frame length of the FFT change, S (n, K) represents the frequency amplitude of the kth sample frequency component of the nth audio frame at the current sampling rate, S' (n, K) represents the frequency amplitude of the kth target frequency component of the nth audio frame at the current sampling rate, |·| 2 Representing the modular aspect of the vector of values,<·>representing the inner product of the two vectors, cos<·>Representing the cosine of the angle between the two vectors.
In some embodiments, the server subtracts the signal information from the frequency amplitude of the target frequency component to obtain the noise information of the target frequency signal.
Illustratively, for the respective frequency amplitudes S(n, k) and S'(n, k) of any pair of corresponding frequency components, E_noise(n, k) is used to represent the noise information; the noise information E_noise(n, k) is then defined as follows:

E_noise(n, k) = S'(n, k) - S_target(n, k)

where n represents the timestamp sequence number of the audio frame in the sample audio segment, k represents the sequence number of the frequency component, k is an integer greater than or equal to 1 and less than or equal to K, K represents the frame length of the FFT, S'(n, k) represents the frequency amplitude of the kth target frequency component of the nth audio frame at the current sampling rate, and S_target(n, k) is the signal information calculated by the above formula.
In some embodiments, the server obtains the signal-to-noise loss term based on the signal information and the noise information of a plurality of sample audio frames over a plurality of sample frequency components at a plurality of sampling rates. Optionally, at the same sampling rate, the modulus of the signal information and the modulus of the noise information are first obtained; the modulus of the signal information is divided by the modulus of the noise information to obtain the signal-to-noise ratio, which is then logarithmically transformed with base 10 to obtain the logarithmic signal-to-noise ratio; and the logarithmic signal-to-noise ratios over all frequency components, all audio frames and all resolutions are summed to obtain the final signal-to-noise loss term.
Illustratively, assume that L_MR-OSISNR is used to represent the signal-to-noise loss term, which is also referred to as the multi-resolution signal-to-noise loss term, since the optimized logarithmic signal-to-noise ratios at different resolutions (embodied by different sampling rates) are taken into account. The signal-to-noise loss term L_MR-OSISNR can be expressed as the following formula:

L_MR-OSISNR = Σ_K Σ_n Σ_k log_10 ( ||S_target(n, k)||_2 / ||E_noise(n, k)||_2 )

where K represents the frame length of the FFT, n represents the timestamp sequence number of the audio frame in the sample audio segment, k represents the sequence number of the frequency component, k is an integer greater than or equal to 1 and less than or equal to K, log_10 refers to the logarithmic transformation with base 10, S_target(n, k) is the signal information calculated by the above formula, E_noise(n, k) is the noise information calculated by the above formula, and ||·||_2 represents taking the modulus of a vector.
From the above formula, it can be seen that the signal information S_target(n, k) is a value independent of the frequency amplitude S(n, k) of the sample frequency component in the sample frequency signal; that is, S_target(n, k) is affected only by the frequency amplitude S′(n, k) of the target frequency component in the target frequency signal and by the frequency characteristic angle. The noise information is jointly determined by S′(n, k) and S_target(n, k), so it is likewise a value independent of S(n, k). Constructing the signal-to-noise loss term L_MR-OSISNR in this way strips out the influence of S(n, k) while retaining the influence of the frequency characteristic angle, so that the signal-to-noise differences of the target frequency signal at different resolutions (i.e., different sampling rates) can be measured well.
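For illustration, the following is a minimal NumPy sketch of the multi-resolution signal-to-noise loss described above, assuming that complex STFT matrices of the sample and target segments are available at each resolution. The function name, the eps smoothing, the use of the absolute value of the cosine, and the sign convention of the final sum are assumptions of this sketch rather than details given in the disclosure.

    import numpy as np

    def multires_osisnr_loss(spectra_pairs, eps=1e-8):
        # spectra_pairs: list of (S, S_prime) complex STFT matrices of shape
        # (num_frames, K), one pair per resolution (sampling rate).
        total = 0.0
        for S, S_prime in spectra_pairs:
            # View each complex bin as a 2-D real vector (real, imag) so that the
            # inner product and the angle between S(n, k) and S'(n, k) are defined.
            a = np.stack([S.real, S.imag], axis=-1)
            b = np.stack([S_prime.real, S_prime.imag], axis=-1)
            dot = np.sum(a * b, axis=-1)
            norm_a = np.linalg.norm(a, axis=-1)   # |S(n, k)|
            norm_b = np.linalg.norm(b, axis=-1)   # |S'(n, k)|
            cos_angle = dot / (norm_a * norm_b + eps)
            # Signal information: target amplitude divided by the cosine of the
            # frequency-characteristic angle (see the definition of S_target above).
            s_target = norm_b / (np.abs(cos_angle) + eps)
            # Noise information: residual between the target amplitude and the
            # signal information (modulus of E_noise).
            e_noise = np.abs(norm_b - s_target)
            # Base-10 log of the per-bin signal-to-noise ratio, summed over all
            # frequency components and audio frames at this resolution.
            total += np.sum(np.log10(s_target / (e_noise + eps) + eps))
        return total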
In some embodiments, after the server obtains the time-frequency loss term and the signal-to-noise loss term in the above manner, the server obtains the spectrum loss term (Spectrum Loss) based on the time-frequency loss term and the signal-to-noise loss term. Optionally, the server adds the time-frequency loss term and the signal-to-noise loss term to obtain the spectrum loss term.
Illustratively, L_S is used to represent the spectrum loss term; the spectrum loss term L_S is then defined as follows:
L_S = L_TF + L_MR-OSISNR
where L_TF represents the time-frequency loss term and L_MR-OSISNR represents the signal-to-noise loss term.
In this step 406, by obtaining the spectrum loss term L_S, the frequency-domain difference of the audio signal can be taken into account in the training stage of the audio generation model; by introducing the spectrum loss term L_S, the target audio segment and the sample audio segment can be constrained to be as similar as possible in the waveform and amplitude of the signal itself.
In step 407, the server obtains a pronunciation loss term of the audio generation model based on the sample audio piece and the target audio piece, where the pronunciation loss term is used to characterize a degree of difference between the sample audio piece and the target audio piece in a phoneme feature space.
In some embodiments, the server inputs the sample audio segment into a phoneme feature extraction model to obtain sample phoneme features that characterize phonetically the phonemes of the audio frames in the sample audio segment.
Optionally, the phoneme feature extraction model is a waveform-to-vector (wav2vec) model. The server inputs the sample audio segment into the wav2vec model and extracts the sample phoneme features of the sample audio segment through the wav2vec model. The sample phoneme features reflect the feature representation of the sample audio segment in the phoneme space, that is, they indicate which pronunciation features the sample audio segment has.
In some embodiments, the wav2vec model includes an encoder network and a context network. The sample audio segment is input into the encoder network, which extracts a latent vector for each audio frame in the sample audio segment; the latent vectors are low-frequency feature representations of the audio frames in the sample audio segment. The latent vectors of the audio frames are then input into the context network, which combines the context information of the latent vectors to extract deep feature representations fused with the context information, and finally the context network outputs the sample phoneme features. The encoder network and the context network each comprise a plurality of causal convolution blocks, each causal convolution block including a causal convolution layer, a batch normalization layer, and a ReLU nonlinear layer; the encoder network and the context network have different convolution kernel parameters.
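For illustration, the following PyTorch sketch shows the kind of causal convolution block and encoder/context structure described above. It is not the actual wav2vec implementation; the number of blocks, channel counts, kernel sizes, and strides are assumptions.

    import torch
    import torch.nn as nn

    class CausalConvBlock(nn.Module):
        # Causal convolution block: causal convolution -> batch norm -> ReLU.
        def __init__(self, in_ch, out_ch, kernel_size, stride=1):
            super().__init__()
            self.pad = kernel_size - 1          # left-only padding => causal
            self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride)
            self.bn = nn.BatchNorm1d(out_ch)
            self.relu = nn.ReLU()

        def forward(self, x):
            x = nn.functional.pad(x, (self.pad, 0))   # pad only on the left
            return self.relu(self.bn(self.conv(x)))

    class Wav2VecLikeExtractor(nn.Module):
        # Illustrative encoder + context networks with different kernel parameters.
        def __init__(self):
            super().__init__()
            # Encoder network: raw waveform -> per-frame latent vectors.
            self.encoder = nn.Sequential(
                CausalConvBlock(1, 64, kernel_size=10, stride=5),
                CausalConvBlock(64, 128, kernel_size=8, stride=4),
                CausalConvBlock(128, 256, kernel_size=4, stride=2),
            )
            # Context network: fuses context information across latent vectors.
            self.context = nn.Sequential(
                CausalConvBlock(256, 256, kernel_size=3),
                CausalConvBlock(256, 256, kernel_size=3),
            )

        def forward(self, waveform):                  # waveform: (batch, 1, samples)
            latents = self.encoder(waveform)          # per-frame latent vectors
            return self.context(latents)              # sample phoneme features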
In some embodiments, the server inputs the target audio segment into the phoneme feature extraction model to obtain target phoneme features, which characterize the pronunciation features of the phonemes of the audio frames in the target audio segment. The target phoneme features are extracted in a manner similar to the sample phoneme features, which is not described here again.
In some embodiments, the server obtains the pronunciation loss term based on the sample phoneme feature and the target phoneme feature. Optionally, the server may obtain the Wasserstein distance between the sample phoneme feature and the target phoneme feature as the pronunciation loss term, or may instead use a cosine similarity or the reciprocal of a Euclidean distance as the pronunciation loss term; the manner of obtaining the pronunciation loss term is not specifically limited in the embodiments of the present disclosure.
Illustratively, s(t) is used to represent the audio frame with time stamp t in the sample audio segment, and s′(t) is used to represent the audio frame with time stamp t in the target audio segment; the pronunciation loss term L_PFP can then be expressed as:
L_PFP = WD[wav2vec[s(t)], wav2vec[s′(t)]]
where wav2vec[·] represents the mapping applied by the wav2vec model and WD[·, ·] represents the Wasserstein distance calculated between the two vectors.
In this step 407, by obtaining the pronunciation loss term L_PFP, the training stage of the audio generation model can take into account not only the frequency-domain difference of the audio signal itself, represented by the spectrum loss term involved in step 406, but also the pronunciation difference of the audio signal, because even signals that are similar in the frequency domain may differ greatly in pronunciation. By introducing the pronunciation loss term L_PFP, the target audio segment and the sample audio segment can be further constrained to be as similar as possible in pronunciation.
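For illustration, one possible way to compute such a feature-space distance is sketched below in Python. The disclosure does not specify how the Wasserstein distance is evaluated over high-dimensional phoneme features; treating each feature dimension as a one-dimensional distribution and averaging the per-dimension distances is an assumption of this sketch, which uses scipy.stats.wasserstein_distance for illustration.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def pronunciation_loss(sample_feats, target_feats):
        # sample_feats, target_feats: (num_frames, feature_dim) arrays produced by
        # the phoneme feature extraction model for the sample and target segments.
        dims = sample_feats.shape[1]
        # Average the 1-D Wasserstein distances computed per feature dimension.
        return float(np.mean([
            wasserstein_distance(sample_feats[:, d], target_feats[:, d])
            for d in range(dims)
        ]))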
In step 408, the server obtains a semantic loss term for the audio generation model based on the sample audio piece and the target audio piece, the semantic loss term being used to characterize the degree of difference between the sample audio piece and the target audio piece in a semantic feature space.
In some embodiments, the server inputs the sample audio segment into a semantic feature extraction model to obtain sample semantic features that are used to characterize semantically the audio frames in the sample audio segment.
In some embodiments, the semantic feature extraction model is the encoder (Encoder) of an automatic speech recognition (ASR) model. Since, in the ASR model, the semantics of the audio are extracted by the encoder and then converted by the decoder into text corresponding to those semantics, the sample semantic features of the sample audio segment can be extracted well by the encoder of the ASR model.
Illustratively, the encoder of the ASR model includes one or more serially connected semantic coding layers, the server inputs the sample audio segments into the encoder of the ASR model, the sample audio segments are encoded by the one or more serially connected semantic coding layers, and the last semantic coding layer outputs sample semantic features that reflect the feature representation of the sample audio segments in semantic space, i.e., represent which features the sample audio segments have semantically.
In some embodiments, the server inputs the target audio segment into the semantic feature extraction model to obtain target semantic features that are used to characterize semantically the audio frames in the target audio segment. The extraction manner of the target semantic features is similar to that of the sample semantic features, and is not described here in detail.
In some embodiments, the server obtains the semantic loss term based on the sample semantic feature and the target semantic feature. Optionally, the server may obtain the Wasserstein distance between the sample semantic feature and the target semantic feature as the semantic loss term, or may instead use a cosine similarity or the reciprocal of a Euclidean distance as the semantic loss term; the manner of obtaining the semantic loss term is not specifically limited in the embodiments of the present disclosure.
Illustratively, s(t) is used to represent the audio frame with time stamp t in the sample audio segment, and s′(t) is used to represent the audio frame with time stamp t in the target audio segment; the semantic loss term L_ASR can then be expressed as:
L_ASR = WD[ASRenc[s(t)], ASRenc[s′(t)]]
where ASRenc[·] represents the mapping applied by the encoder of the ASR model and WD[·, ·] represents the Wasserstein distance calculated between the two vectors.
In this step 408, by obtaining the semantic loss term L_ASR, the training stage of the audio generation model can take the semantic difference of the audio signals into account in addition to the frequency-domain loss term involved in step 406 and the pronunciation loss term involved in step 407. This is because some signals that are dissimilar in the frequency domain and dissimilar in pronunciation may still be highly similar semantically, so that a machine's understanding of them when converting speech to text is unaffected. Therefore, by introducing the semantic loss term L_ASR, the target audio segment and the sample audio segment can be further constrained to be as semantically similar as possible, so that, when recognized by the ASR model, the target audio segment and the sample audio segment converge toward text with the same semantics.
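For illustration, and reusing the pronunciation_loss sketch above, the semantic loss term can be computed in the same way on the ASR-encoder features; this reuse is an assumption of the sketch, since the disclosure also allows a cosine similarity or a Euclidean-distance-based measure.

    def semantic_loss(sample_semantic_feats, target_semantic_feats):
        # Apply the same distance to the ASR-encoder features of the two segments;
        # a cosine similarity or inverse Euclidean distance could be substituted.
        return pronunciation_loss(sample_semantic_feats, target_semantic_feats)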
The above steps 406-408 show one possible implementation in which the server obtains at least one of the spectrum loss term, the pronunciation loss term, or the semantic loss term of the audio generation model based on the sample audio segment and the target audio segment, namely the case in which the server obtains all three terms. In other embodiments, the server may omit the spectrum loss term, the pronunciation loss term, or the semantic loss term in order to simplify the training procedure of the audio generation model; this is not specifically limited in the embodiments of the present disclosure.
In step 409, the server adds at least one of a spectral loss term, a voicing loss term, or a semantic loss term of the audio generation model to obtain a reconstruction loss term of the audio generation model, where the reconstruction loss term is used to characterize a degree of difference between the target audio frame in the target audio segment and the missing audio frame in the sample audio segment.
In some embodiments, in the case where the server obtains the spectrum loss term L_S, the pronunciation loss term L_PFP, and the semantic loss term L_ASR, the server may add the spectrum loss term L_S, the pronunciation loss term L_PFP, and the semantic loss term L_ASR to obtain the reconstruction loss term L_Generator, that is:
L_Generator = L_S + L_PFP + L_ASR
In some embodiments, the server may also weight the spectrum loss term L_S, the pronunciation loss term L_PFP, and the semantic loss term L_ASR with different weight parameters, and then add the weighted terms to obtain the reconstruction loss term L_Generator.
In some embodiments, if the server does not obtain the pronunciation loss term, the reconstruction loss term may consist of only the spectrum loss term and the semantic loss term; if the server does not obtain the semantic loss term, the reconstruction loss term may consist of only the spectrum loss term and the pronunciation loss term; and if the server obtains neither the pronunciation loss term nor the semantic loss term, the reconstruction loss term is the spectrum loss term itself. This is not specifically limited by the embodiments of the present disclosure.
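For illustration, a small Python helper for combining whichever loss terms were computed is sketched below; the weight parameters and their default values are assumptions, and with all weights equal to 1 the helper reduces to the plain sum L_S + L_PFP + L_ASR.

    def reconstruction_loss(l_s, l_pfp=None, l_asr=None,
                            w_s=1.0, w_pfp=1.0, w_asr=1.0):
        # Weighted sum of whichever loss terms were actually computed; with all
        # weights equal to 1 this reduces to the plain sum of the three terms.
        total = w_s * l_s
        if l_pfp is not None:
            total += w_pfp * l_pfp
        if l_asr is not None:
            total += w_asr * l_asr
        return total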
The above steps 406-409 show one possible implementation in which the server determines the reconstruction loss term of the audio generation model based on the target audio segment and the sample audio segment. In other embodiments, the server may instead obtain the reconstruction loss term as follows: after predicting an associated target audio segment for each sample audio segment, for each pair of sample audio segment and target audio segment, a signal difference value between the audio frames marked as missing (i.e., the audio frames whose missing indication parameter is 1) and the corresponding target audio frames is obtained; the type of the signal difference value is not particularly limited. After all sample audio segments of one iteration round have been traversed, the signal difference values between the plurality of sample audio segments of that round and their respective associated target audio segments are added to obtain the reconstruction loss term of that round, which can simplify the training procedure of the audio generation model.
In step 410, the server obtains a loss function value of the audio generation model based on the discrimination loss term and the reconstruction loss term.
In some embodiments, the server adds the discrimination loss term and the reconstruction loss term to obtain the loss function value of the audio generation model in the current iteration. Alternatively, the server may weight the discrimination loss term and the reconstruction loss term with different weight parameters and add the weighted terms to obtain the loss function value of the current iteration; the manner of obtaining the loss function value is not specifically limited in the embodiments of the present disclosure.
In step 411, the server iteratively adjusts parameters of the audio generation model when the number of iterations and the loss function value do not meet the stop condition.
In some embodiments, the stop condition includes at least one of: the number of iterations exceeds a number threshold, or the loss function value is less than a loss threshold, where the number threshold is any integer greater than 1 and the loss threshold is a value greater than or equal to 0 and less than or equal to 1.
In some embodiments, when the number of iterations does not exceed the number threshold and the loss function value is not less than the loss threshold, a back-propagation algorithm is used to iteratively adjust the model parameters of both the audio generation model and the audio discrimination model, and the next iteration is started based on the adjusted audio generation model and audio discrimination model, that is, steps 401 to 410 are executed again. This continues until the number of iterations exceeds the number threshold or the loss function value is less than the loss threshold, at which point the stop condition is met and step 412 below is entered.
The above steps 410-411 show one possible implementation in which the server iteratively adjusts the parameters of the audio generation model based on the discrimination loss term and the reconstruction loss term. In some embodiments, only the number of iterations exceeding the number threshold may be used as the stop condition, or only the loss function value being smaller than the loss threshold may be used as the stop condition, or other stop conditions may be configured by the technician as needed; this is not specifically limited in the embodiments of the present disclosure.
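For illustration, the iteration and stop-condition logic of steps 410-412 can be sketched as the following Python training loop; the compute_loss callable, the optimizers, and the default thresholds are assumptions supplied by the caller rather than details given in the disclosure.

    def train(generator, discriminator, data_loader, g_opt, d_opt,
              compute_loss, max_iters=10000, loss_threshold=0.01):
        # compute_loss is assumed to return the sum of the discrimination loss
        # term and the reconstruction loss term for one batch (a torch scalar).
        step = 0
        for batch in data_loader:
            step += 1
            loss = compute_loss(generator, discriminator, batch)
            g_opt.zero_grad()
            d_opt.zero_grad()
            loss.backward()            # back-propagation
            g_opt.step()
            d_opt.step()
            if step >= max_iters or loss.item() < loss_threshold:
                break                  # stop condition met
        return generator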
In step 412, the server outputs the trained audio generation model when either the number of iterations or the loss function value meets the stop condition.
In some embodiments, the stop condition is met when the number of iterations exceeds the number threshold or the loss function value is smaller than the loss threshold. At this point, training of the audio generation model is stopped; the audio discrimination model is not put into practical use. After the trained audio generation model passes testing, the server can prune and compress it, embed the pruned and compressed audio generation model into the SDK of the application program, and deliver it to each client on which the application program is installed by means of a cold update or a hot update. In this way, mobile-side deployment of the audio generation model can be achieved, and high-quality compensation for audio packet loss can be realized through the audio signal processing manner described in the foregoing embodiments.
The above steps 405-412 show one possible implementation of iteratively adjusting the parameters of the audio generation model based on the audio discrimination parameters, the sample audio segment, and the target audio segment. By using the audio discrimination parameters to obtain the discrimination loss term and additionally constructing the reconstruction loss term, the loss function becomes more accurate, which constrains the similarity between the predicted target audio frame and the lost audio frame and further enhances the effect of audio packet loss compensation when the audio generation model is applied.
According to the method provided by the embodiments of the present disclosure, the audio generation model and the audio discrimination model are trained adversarially under a GAN architecture. In the process of iteratively adjusting the parameters, the audio generation model, constrained by the audio discrimination parameters, continually synthesizes target audio frames that are harder for the audio discrimination model to distinguish, which improves the prediction accuracy of the audio generation model and makes the target audio frames more natural, more accurate, and of better sound quality. At the same time, because noise and artifacts are easily identified by the audio discrimination model, the trained audio generation model avoids introducing noise and artifacts into the synthesized target audio frames, so that when the audio generation model is put into an audio packet loss compensation application, the audio quality of the compensation signal can be greatly improved.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
Fig. 5 is a flowchart of a training and reasoning phase of an audio generation model according to an embodiment of the present disclosure, where, as shown in fig. 5, the left part represents a training phase 510 based on GAN architecture between an audio generation model 511 and an audio discrimination model 512, and the right part represents a reasoning phase 520 for putting the trained audio generation model 511 into practical use.
In the training stage 510, the obtained sample audio segment 501 and the corresponding missing indication information 502 are fused through a Concat (splicing) layer to obtain the input signal of the audio generation model 511. The audio generation model 511 predicts the target audio frames of the missing part and fills the missing part with the target audio frames, which can be regarded as finally outputting a complete target audio segment 503 without missing audio frames. Then, the complete target audio segment 503 and the complete sample audio signal 504 (i.e., the complete signal not masked by the missing indication information 502) are input into the audio discrimination model 512, which outputs audio discrimination parameters for the complete target audio segment 503 and the complete sample audio signal 504 respectively. The discrimination loss term is obtained from the respective audio discrimination parameters; in addition, a spectrum loss term, a pronunciation loss term, and a semantic loss term are constructed and added to obtain the reconstruction loss term. The discrimination loss term and the reconstruction loss term are added to obtain the loss function value of the current iteration. When the stop condition is not met, the model parameters of the audio generation model 511 and the audio discrimination model 512 are iteratively adjusted; when the stop condition is met, iteration stops and the trained audio generation model 511 is output.
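For illustration, the Concat (splicing) layer in the training stage can be sketched as the following PyTorch helper, assuming the sample audio segment and its missing indication information are aligned sample by sample; the shapes and the function name are assumptions.

    import torch

    def build_generator_input(sample_segment, missing_mask):
        # sample_segment: (batch, num_samples) waveform of the sample audio segment;
        # missing_mask: same shape, 1.0 where the sample belongs to a missing
        # (masked) audio frame and 0.0 where it was received.
        return torch.stack([sample_segment, missing_mask], dim=1)  # (batch, 2, num_samples)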
In the inference stage 520, the following are involved: a historical audio frame 521 (History Packet), a current audio frame 522 to be output (Interested Packet), future audio frames 523 (Covered/Future Packets), a newly received audio frame 524 (Current Received Packet), and a compensated target audio frame 525. For the audio frame 522 currently to be output, the packet loss determination module first determines whether the audio frame 522 has been lost. If it has not been lost, the received audio frame 522 is output directly. If it has been lost, meaning that the terminal has not received the audio frame 522 locally, both the historical audio frame 521 and the future audio frames 523 (future audio frames with a duration of at most 18 ms are used) are input into the trained audio generation model 511, which synthesizes the machine-generated target audio frame 525 to compensate for the lost audio frame 522; the target audio frame 525 is then output (i.e., played) in place of the missing audio frame 522 that was never received. If the audio frame is replaced by an audio data packet with a duration of 1 ms, then in VoIP transmission the audio frames contained in each audio data packet are either all received or all lost; in this case future audio frames with a duration of at most 18 ms are likewise used, so that the total delay at the receiving end can be controlled within 20 ms while a low-delay, high-quality concealment effect is achieved.
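For illustration, the packet-loss decision and compensation flow of the inference stage can be sketched as follows in Python; the dictionary of received frames, the history and future lengths, and the generator call signature are assumptions of this sketch, with the future context limited to at most 18 ms of playback in the described system.

    def conceal_packet_loss(received, t, generator, history_len=5, future_len=2):
        # received: dict mapping frame timestamps to audio frames (None if lost);
        # generator: the trained audio generation model 511; history_len and
        # future_len (in frames) are illustrative, with the future frames limited
        # to at most 18 ms of playback duration in the described system.
        if received.get(t) is not None:
            return received[t]                          # frame arrived: play it as-is
        history = [received[i] for i in range(t - history_len, t)]
        future = [received[i] for i in range(t + 1, t + 1 + future_len)]
        return generator(history, future)               # synthesised target frame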
In the embodiments of the present disclosure, by providing a GAN-architecture-based training manner, the trained audio generation model, when put into use, can achieve a low-delay, high-quality packet loss concealment effect for the problem of audio packet loss. Even when audio packet loss occurs, compensating a lost audio data packet to obtain a high-quality compensation signal requires at most 18 ms of future audio frames in playback duration, so the total algorithm delay of audio packet loss compensation can be controlled within 20 ms. The audio generation model is therefore easy to deploy on mobile terminals and has great practical significance in various VoIP scenarios.
Fig. 6 is a logical block diagram of an audio signal processing apparatus according to an embodiment of the present disclosure. Referring to fig. 6, the apparatus includes an acquisition unit 601, a first synthesizing unit 602, and a second synthesizing unit 603.
An acquisition unit 601 configured to perform, in the event of an audio frame missing in the audio signal, acquisition of a history audio frame preceding the audio frame and a future audio frame following the audio frame;
a first synthesizing unit 602 configured to perform synthesizing a target audio frame based on the history audio frame and the future audio frame, the phonemes and semantics of the target audio frame being similar to the audio frame;
A second synthesizing unit 603 configured to perform synthesizing a compensation signal associated with the audio signal based on the historical audio frame, the target audio frame and the future audio frame, the compensation signal filling the missing audio frame with the target audio frame.
According to the apparatus provided by the embodiments of the present application, when an audio frame is lost, the context information of the audio frame, namely the historical audio frame and the future audio frame, is used to synthesize a target audio frame similar to the missing audio frame in the dimensions of frequency, phonemes, and semantics, and the target audio frame is used to fill the missing audio frame, thereby realizing a packet loss compensation mechanism for the audio signal. This mechanism is not a simple copy of the historical audio frame or the future audio frame; instead, it synthesizes a target audio frame that is more natural, smoother, and of higher sound quality, which avoids the noise and artifacts that readily occur under conventional packet loss compensation mechanisms and prevents packet loss compensation from adversely affecting the audio quality of the audio signal at the receiving end. In other words, the compensation signal obtained after packet loss compensation has higher audio quality and a better playback effect.
In some embodiments, based on the apparatus composition of fig. 6, the first synthesis unit 602 includes:
A determining subunit configured to perform determining an audio clip constituted by the historical audio frame and the future audio frame;
an acquisition subunit configured to perform acquisition of deletion instruction information of the audio piece, the deletion instruction information being used to instruct whether any one of the audio frames in the audio piece is deleted;
and a synthesizing subunit configured to perform synthesizing the target audio frame based on the audio piece and the deletion instruction information.
In some embodiments, based on the apparatus composition of fig. 6, the synthesis subunit comprises:
a fusion subunit configured to perform fusion of the audio clip and the deletion instruction information to obtain extended audio data;
a coding subunit configured to perform coding of the extended audio data to obtain audio coding features of the extended audio data;
a decoding subunit configured to perform decoding of the audio encoding feature to obtain the target audio frame.
In some embodiments, the encoding subunit is configured to perform:
inputting the expanded audio data into an audio generation model for synthesizing missing target audio frames between historical audio frames and future audio frames;
And encoding the extended audio data through an audio encoding layer of the audio generation model to obtain the audio encoding characteristics.
In some embodiments, the decoding subunit is configured to perform:
compressing the audio coding feature through a quantization compression layer of the audio generation model to obtain an audio compression feature;
and decoding the audio compression characteristic through an audio decoding layer of the audio generation model to obtain the target audio frame.
In some embodiments, the audio segment is an audio frame sequence composed of a plurality of audio frames, and the deletion indication information is a parameter sequence composed of deletion indication parameters of the plurality of audio frames;
the fusion subunit is configured to perform:
for any audio frame in the audio frame sequence, splicing the audio frame with the deletion indication parameter of the audio frame in the parameter sequence to obtain the dual-channel data of the audio frame;
the extended audio data composed of two-channel data of a plurality of audio frames is acquired.
In some embodiments, for any audio frame in the sequence of audio frames, the deletion indication parameter of the audio frame is assigned to 1 when the audio frame is deleted, and the deletion indication parameter of the audio frame is assigned to 0 when the audio frame is not deleted.
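For illustration, the assignment of the deletion indication parameters and the construction of the dual-channel data can be sketched as follows in NumPy; the array shapes and the broadcasting of each frame's 0/1 parameter across the frame length are assumptions of this sketch.

    import numpy as np

    def make_two_channel_data(frames, lost_indices):
        # frames: (num_frames, frame_len) array of audio frames (lost frames may be
        # zero-filled); lost_indices: indices of the frames that were not received.
        miss = np.zeros(len(frames), dtype=np.float32)
        miss[list(lost_indices)] = 1.0                  # 1 = deleted, 0 = received
        # Splice each frame with its deletion indication parameter, broadcast to
        # the frame length, to form the dual-channel data of that frame.
        mask = np.repeat(miss[:, None], frames.shape[1], axis=1)
        return np.stack([frames, mask], axis=1)         # (num_frames, 2, frame_len)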
In some embodiments, the number of frames of the future audio frame is a target number of frames; or, the frame length of the future audio frame is the target frame length; or, the playing duration of the future audio frame is the target duration.
With respect to the apparatus in the above-described embodiments, the specific manner in which the respective units perform the operations has been described in detail in the embodiments regarding the processing method of the audio signal, and will not be described in detail herein.
Fig. 7 is a logical block diagram of a training apparatus of an audio generation model according to an embodiment of the present disclosure. Referring to fig. 7, the apparatus includes a first acquisition unit 701, a second acquisition unit 702, and a parameter adjustment unit 703.
A first obtaining unit 701 configured to obtain, by an audio generation model, a target audio piece associated with a sample audio piece in which a missing audio frame exists, the missing audio frame being filled with a synthesized target audio frame;
a second obtaining unit 702 configured to obtain, by an audio discrimination model, respective audio discrimination parameters of the sample audio piece and the target audio piece, the audio discrimination parameters being used to characterize a possibility that the audio discrimination model discriminates whether the input audio piece is a machine synthesized signal;
A parameter adjustment unit 703 configured to perform iterative adjustment of parameters of the audio generation model based on the audio discrimination parameters, the sample audio piece and the target audio piece.
According to the apparatus provided by the embodiments of the present application, the audio generation model and the audio discrimination model are trained adversarially under a GAN architecture, so that in the process of iteratively adjusting parameters the audio generation model continually synthesizes target audio frames that are harder for the audio discrimination model to distinguish. This improves the prediction accuracy of the audio generation model and makes the target audio frames more natural, more accurate, and of better sound quality. At the same time, because noise and artifacts are easily identified by the audio discrimination model, the trained audio generation model avoids introducing noise and artifacts into the synthesized target audio frames, so that when the audio generation model is put into an audio packet loss compensation application, the audio quality of the compensation signal can be greatly improved.
In some embodiments, based on the apparatus composition of fig. 7, the parameter adjustment unit 703 includes:
a first determination subunit configured to perform determination of a discrimination loss term of the audio generation model based on the audio discrimination parameter, the discrimination loss term being used to characterize whether a target audio frame synthesized by the audio generation model can be accurately identified by the audio discrimination model;
A second determination subunit configured to perform determining a reconstruction loss term of the audio generation model based on the target audio piece and the sample audio piece, the reconstruction loss term being used to characterize a degree of difference between a target audio frame in the target audio piece and a missing audio frame in the sample audio piece;
and a parameter adjustment subunit configured to perform iterative adjustment of parameters of the audio generation model based on the discrimination loss term and the reconstruction loss term.
In some embodiments, based on the apparatus composition of fig. 7, the second determination subunit comprises:
an acquisition subunit configured to perform acquiring at least one of a spectral loss term, a pronunciation loss term, or a semantic loss term of the audio generation model based on the sample audio piece and the target audio piece;
an adder subunit configured to perform adding at least one of a spectral loss term, a pronunciation loss term, or a semantic loss term of the audio generation model to obtain the reconstruction loss term;
the method comprises the steps of determining a frequency spectrum loss term, a pronunciation loss term and a semantic loss term, wherein the frequency spectrum loss term is used for representing the difference degree of the sample audio fragment and the target audio fragment in a frequency domain space, the pronunciation loss term is used for representing the difference degree of the sample audio fragment and the target audio fragment in a phoneme characteristic space, and the semantic loss term is used for representing the difference degree of the sample audio fragment and the target audio fragment in a semantic characteristic space.
In some embodiments, based on the apparatus composition of fig. 7, the acquisition subunit comprises:
a transform sub-unit configured to perform time-frequency transform on both the sample audio piece and the target audio piece at different sampling rates to obtain a sample frequency signal of the sample audio piece and a target frequency signal of the target audio piece;
a determining subunit configured to perform determining, based on the sample frequency signal and the target frequency signal at different sampling rates, a time-frequency loss term for characterizing a degree of difference in frequency amplitude of the sample frequency signal and the target frequency signal and a signal-to-noise loss term for characterizing a signal-to-noise ratio of the target frequency signal;
an acquisition sub-unit configured to perform acquisition of the spectral loss term based on the time-frequency loss term and the signal-to-noise loss term.
In some embodiments, the determining subunit is configured to perform:
determining a target frequency component corresponding to any sample frequency component in the target frequency signal for any sample frequency component of any sample audio frame in the sample frequency signal at any sample rate;
Acquiring an L1 norm between the amplitude of the sample frequency component and the amplitude of the target frequency component;
acquiring an L2 norm between a natural logarithm of the amplitude of the sample frequency component and a natural logarithm of the amplitude of the target frequency component;
the time-frequency loss term is obtained based on the L1 norm and the L2 norm of each of a plurality of sample audio frames at a plurality of sample frequencies.
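For illustration, the time-frequency loss term described in the preceding items can be sketched as follows in NumPy, assuming complex STFT matrices are available at each sampling rate; the eps smoothing and the equal weighting of the L1 and L2 contributions are assumptions of this sketch.

    import numpy as np

    def time_frequency_loss(spectra_pairs, eps=1e-8):
        # spectra_pairs: list of (S, S_prime) complex STFT matrices, one pair per
        # sampling rate, each of shape (num_frames, K).
        total = 0.0
        for S, S_prime in spectra_pairs:
            mag_s = np.abs(S)                            # sample amplitudes
            mag_t = np.abs(S_prime)                      # target amplitudes
            l1 = np.sum(np.abs(mag_s - mag_t))           # L1 norm of amplitude differences
            l2 = np.sqrt(np.sum((np.log(mag_s + eps) -
                                 np.log(mag_t + eps)) ** 2))  # L2 norm of log-amplitude differences
            total += l1 + l2
        return total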
In some embodiments, the determining subunit is configured to perform:
determining a target frequency component corresponding to any sample frequency component in the target frequency signal for any sample frequency component of any sample audio frame in the sample frequency signal at any sample rate;
dividing the amplitude of the target frequency component by the cosine value of a frequency characteristic included angle, which is the characteristic included angle between the sample frequency component and the target frequency component in a frequency domain space, to obtain the signal information of the target frequency signal;
subtracting the signal information from the target frequency component to obtain noise information of the target frequency signal;
the signal-to-noise loss term is obtained based on the signal information and the noise information for a plurality of sample audio frames at a plurality of sample rates, each over a plurality of sample frequency components.
In some embodiments, the acquisition subunit is configured to perform:
inputting the sample audio fragment into a phoneme feature extraction model to obtain sample phoneme features, wherein the sample phoneme features are used for representing the pronunciation features of phonemes of audio frames in the sample audio fragment;
inputting the target audio segment into the phoneme feature extraction model to obtain target phoneme features, wherein the target phoneme features are used for representing the pronunciation features of phonemes of audio frames in the target audio segment;
and acquiring the pronunciation loss term based on the sample phoneme characteristic and the target phoneme characteristic.
In some embodiments, the acquisition subunit is configured to perform:
inputting the sample audio fragment into a semantic feature extraction model to obtain sample semantic features, wherein the sample semantic features are used for representing the semantic features of audio frames in the sample audio fragment;
inputting the target audio fragment into the semantic feature extraction model to obtain target semantic features, wherein the target semantic features are used for representing the semantic features of the audio frames in the target audio fragment;
and acquiring the semantic loss item based on the sample semantic feature and the target semantic feature.
The specific manner in which the respective units perform the operations in the apparatus of the above embodiment has been described in detail in the embodiment of the training method concerning the audio generation model, and will not be described in detail here.
Fig. 8 shows a block diagram of an electronic device according to an embodiment of the disclosure, where the electronic device is taken to be a terminal 800 for illustration. The terminal 800 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 800 may also be referred to by other names such as user device, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 800 includes: a processor 801 and a memory 802.
Processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 801 may be implemented in at least one hardware form of DSP (Digital Signal Processing ), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array ). The processor 801 may also include a main processor, which is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit ), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 801 may integrate a GPU (Graphics Processing Unit, image processor) for rendering and rendering of content required to be displayed by the display screen. In some embodiments, the processor 801 may also include an AI (Artificial Intelligence ) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 802 is used to store at least one instruction for execution by processor 801 to implement the processing methods of audio signals or the training methods of audio generation models provided by the various embodiments in the present disclosure.
In some embodiments, the terminal 800 may further optionally include: a peripheral interface 803, and at least one peripheral. The processor 801, the memory 802, and the peripheral interface 803 may be connected by a bus or signal line. Individual peripheral devices may be connected to the peripheral device interface 803 by buses, signal lines, or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 804, a touch display 805, a camera assembly 806, audio circuitry 807, a positioning assembly 808, and a power supply 809.
Peripheral interface 803 may be used to connect at least one Input/Output (I/O) related peripheral to processor 801 and memory 802. In some embodiments, processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 804 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 804 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 804 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuitry 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuitry 804 may also include NFC (Near Field Communication, short range wireless communication) related circuitry, which is not limited by the present disclosure.
The display 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to collect touch signals at or above the surface of the display 805. The touch signal may be input as a control signal to the processor 801 for processing. At this time, the display 805 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, the display 805 may be one, providing a front panel of the terminal 800; in other embodiments, the display 805 may be at least two, respectively disposed on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display 805 may be a flexible display disposed on a curved surface or a folded surface of the terminal 800. Even more, the display 805 may be arranged in an irregular pattern other than rectangular, i.e., a shaped screen. The display 805 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 806 is used to capture images or video. Optionally, the camera assembly 806 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and virtual reality (VR) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
Audio circuitry 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and the environment, converting the sound waves into electric signals, inputting the electric signals to the processor 801 for processing, or inputting the electric signals to the radio frequency circuit 804 for voice communication. For stereo acquisition or noise reduction purposes, a plurality of microphones may be respectively disposed at different portions of the terminal 800. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuit 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic location of the terminal 800 to enable navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
A power supply 809 is used to power the various components in the terminal 800. The power supply 809 may be an alternating current, direct current, disposable battery, or rechargeable battery. When the power supply 809 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, optical sensor 814, and proximity sensor 815.
The acceleration sensor 811 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 801 may control the touch display screen 805 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 811. Acceleration sensor 811 may also be used for the acquisition of motion data of a game or user.
The gyro sensor 812 may detect a body direction and a rotation angle of the terminal 800, and the gyro sensor 812 may collect a 3D motion of the user to the terminal 800 in cooperation with the acceleration sensor 811. The processor 801 may implement the following functions based on the data collected by the gyro sensor 812: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 813 may be disposed at a side frame of the terminal 800 and/or at a lower layer of the touch display 805. When the pressure sensor 813 is disposed on a side frame of the terminal 800, a grip signal of the terminal 800 by a user may be detected, and the processor 801 performs left-right hand recognition or shortcut operation according to the grip signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at the lower layer of the touch display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 805. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 814 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the touch display 805 based on the ambient light intensity collected by the optical sensor 814. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 805 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 805 is turned down. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 814.
A proximity sensor 815, also known as a distance sensor, is typically provided on the front panel of the terminal 800. The proximity sensor 815 is used to collect the distance between the user and the front of the terminal 800. In one embodiment, when the proximity sensor 815 detects a gradual decrease in the distance between the user and the front face of the terminal 800, the processor 801 controls the touch display 805 to switch from the bright screen state to the off screen state; when the proximity sensor 815 detects that the distance between the user and the front surface of the terminal 800 gradually increases, the processor 801 controls the touch display 805 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 8 is not limiting and that more or fewer components than shown may be included or certain components may be combined or a different arrangement of components may be employed.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, and an electronic device is taken as a server 900 for illustration. The server 900 may have a relatively large difference due to different configurations or performances, and may include one or more processors (Central Processing Units, CPU) 901 and one or more memories 902, where at least one program code is stored in the memories 902, and the at least one program code is loaded and executed by the processor 901 to implement the audio signal processing method or the audio generation model training method provided in the above embodiments. Of course, the server 900 may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer readable storage medium is also provided, e.g. a memory, comprising at least one instruction executable by a processor in an electronic device to perform the method of processing an audio signal or the method of training an audio generation model in the above embodiments. Alternatively, the above-described computer-readable storage medium may be a non-transitory computer-readable storage medium, which may include, for example, ROM (Read-Only Memory), RAM (Random-Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, including one or more instructions executable by a processor of an electronic device to perform the method of processing an audio signal or the method of training an audio generation model provided in the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (17)

1. A method of processing an audio signal, comprising:
under the condition that an audio frame is missing in an audio signal, acquiring a historical audio frame before the audio frame and a future audio frame after the audio frame;
determining an audio fragment consisting of the historical audio frame and the future audio frame, wherein the audio fragment is an audio frame sequence consisting of a plurality of audio frames; acquiring deletion indication information of the audio fragment, wherein the deletion indication information is a parameter sequence formed by deletion indication parameters of a plurality of audio frames and is used for indicating whether any audio frame in the audio fragment is deleted, splicing the audio frame with the deletion indication parameters of the audio frame in the parameter sequence to obtain dual-channel data of the audio frame; acquiring extended audio data composed of dual-channel data of a plurality of audio frames; encoding the extended audio data to obtain audio encoding characteristics of the extended audio data; decoding the audio coding features to obtain a target audio frame, wherein the phonemes and the semantics of the target audio frame are similar to those of the audio frame;
Based on the historical audio frames, the target audio frames, and the future audio frames, a compensation signal associated with the audio signal is synthesized, the compensation signal filling the missing audio frames with the target audio frames.
2. The method of claim 1, wherein encoding the extended audio data to obtain audio encoding characteristics of the extended audio data comprises:
inputting the extended audio data into an audio generation model, wherein the audio generation model is used for synthesizing a target audio frame missing between a historical audio frame and a future audio frame;
and encoding the extended audio data through an audio encoding layer of the audio generation model to obtain the audio encoding characteristics.
3. The method of claim 2, wherein decoding the audio encoding feature to obtain the target audio frame comprises:
compressing the audio coding features through a quantization compression layer of the audio generation model to obtain audio compression features;
and decoding the audio compression characteristic through an audio decoding layer of the audio generation model to obtain the target audio frame.
4. The method of claim 1, wherein for any audio frame in the sequence of audio frames, the deletion indication parameter of the audio frame is assigned to 1 when the audio frame is deleted, and the deletion indication parameter of the audio frame is assigned to 0 when the audio frame is not deleted.
5. The method according to any one of claims 1 to 4, wherein the number of frames of the future audio frame is a target number of frames; or, the frame length of the future audio frame is the target frame length; or, the playing duration of the future audio frame is the target duration.
6. A method of training an audio generation model, comprising:
obtaining a target audio fragment associated with a sample audio fragment through an audio generation model, wherein a missing audio frame exists in the sample audio fragment and, in the target audio fragment, the missing audio frame is filled with a synthesized target audio frame, the synthesis process of the target audio frame comprising the following steps: acquiring a historical audio frame before the missing audio frame and a future audio frame after the missing audio frame; determining an audio fragment consisting of the historical audio frame and the future audio frame, wherein the audio fragment is an audio frame sequence consisting of a plurality of audio frames; acquiring deletion indication information of the audio fragment, wherein the deletion indication information is a parameter sequence formed by the deletion indication parameters of the plurality of audio frames and is used for indicating whether each audio frame in the audio fragment is deleted; for each audio frame in the audio frame sequence, splicing the audio frame with the deletion indication parameter of that audio frame in the parameter sequence to obtain dual-channel data of the audio frame; acquiring extended audio data composed of the dual-channel data of the plurality of audio frames; encoding the extended audio data to obtain audio encoding characteristics of the extended audio data; and decoding the audio encoding characteristics to obtain the target audio frame;
acquiring respective audio discrimination parameters of the sample audio fragment and the target audio fragment through an audio discrimination model, wherein an audio discrimination parameter is used for representing the likelihood, as estimated by the audio discrimination model, that the input audio fragment is a machine-synthesized signal;
iteratively adjusting parameters of the audio generation model based on the audio discrimination parameters, the sample audio piece and the target audio piece.
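The adversarial training described in claim 6 could look roughly like the sketch below; the least-squares GAN objectives, the optimizers and the placeholder L1 reconstruction term are assumptions standing in for the loss terms detailed in the dependent claims.

```python
import torch

def training_step(generator, discriminator, g_opt, d_opt, sample_clip, extended_input):
    """One illustrative adversarial update; losses, weights and optimizers are assumptions."""
    # 1. The audio generation model fills the missing frames -> target audio clip.
    target_clip = generator(extended_input)

    # 2. The audio discrimination model scores the real (sample) and synthesized clips.
    d_real = discriminator(sample_clip)
    d_fake = discriminator(target_clip.detach())
    d_loss = torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)  # LSGAN-style
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 3. The generator is updated from the discrimination parameters plus a
    #    placeholder L1 reconstruction term (see the dependent claims for the full loss).
    g_adv = torch.mean((discriminator(target_clip) - 1.0) ** 2)
    g_rec = torch.mean(torch.abs(target_clip - sample_clip))
    g_loss = g_adv + g_rec
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```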
7. The method of claim 6, wherein iteratively adjusting parameters of the audio generation model based on the audio discrimination parameters, the sample audio piece, and the target audio piece comprises:
determining a discrimination loss term of the audio generation model based on the audio discrimination parameters, wherein the discrimination loss term is used for representing the degree to which a target audio frame synthesized by the audio generation model can be accurately identified by the audio discrimination model;
determining a reconstruction loss term of the audio generation model based on the target audio segment and the sample audio segment, wherein the reconstruction loss term is used for representing the degree of difference between a target audio frame in the target audio segment and a missing audio frame in the sample audio segment;
and iteratively adjusting parameters of the audio generation model based on the discrimination loss term and the reconstruction loss term.
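A possible way to combine the two terms of claim 7 into a single generator objective is sketched below; the least-squares form of the discrimination loss and the weighting factors are assumptions.

```python
import torch

def generator_objective(disc_score_on_target, reconstruction_loss,
                        adv_weight=1.0, rec_weight=45.0):
    """Weighted sum of the discrimination loss term and the reconstruction loss term."""
    discrimination_loss = torch.mean((disc_score_on_target - 1.0) ** 2)
    return adv_weight * discrimination_loss + rec_weight * reconstruction_loss
```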
8. The method of claim 7, wherein the determining a reconstruction loss term for the audio generation model based on the target audio piece and the sample audio piece comprises:
acquiring at least one of a spectral loss term, a pronunciation loss term or a semantic loss term of the audio generation model based on the sample audio piece and the target audio piece;
summing the acquired ones of the spectral loss term, the pronunciation loss term and the semantic loss term to obtain the reconstruction loss term;
wherein the spectral loss term is used for representing the degree of difference between the sample audio fragment and the target audio fragment in a frequency-domain space, the pronunciation loss term is used for representing the degree of difference between the sample audio fragment and the target audio fragment in a phoneme feature space, and the semantic loss term is used for representing the degree of difference between the sample audio fragment and the target audio fragment in a semantic feature space.
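The composition of the reconstruction loss in claim 8 reduces to summing whichever of the three terms are used; a minimal sketch follows (the equal weighting of the terms is an assumption).

```python
def reconstruction_loss(sample_clip, target_clip,
                        spectral_fn, pronunciation_fn=None, semantic_fn=None):
    """Sum the available loss terms: spectral, pronunciation and/or semantic."""
    loss = spectral_fn(sample_clip, target_clip)
    if pronunciation_fn is not None:
        loss = loss + pronunciation_fn(sample_clip, target_clip)
    if semantic_fn is not None:
        loss = loss + semantic_fn(sample_clip, target_clip)
    return loss
```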
9. The method of claim 8, wherein the obtaining a spectral loss term for the audio generation model based on the sample audio segment and the target audio segment comprises:
performing, at a plurality of different sampling rates, time-frequency conversion on the sample audio fragment and the target audio fragment to obtain, at each sampling rate, a sample frequency signal of the sample audio fragment and a target frequency signal of the target audio fragment;
determining a time-frequency loss term and a signal-to-noise loss term based on the sample frequency signals and the target frequency signals at the different sampling rates, wherein the time-frequency loss term is used for representing the degree of difference in frequency amplitude between the sample frequency signal and the target frequency signal, and the signal-to-noise loss term is used for representing the signal-to-noise ratio of the target frequency signal;
and acquiring the spectral loss term based on the time-frequency loss term and the signal-to-noise loss term.
10. The method of claim 9, wherein the determining a time-frequency loss term based on the sample frequency signal and the target frequency signal at different sampling rates comprises:
for any sampling rate, any sample audio frame in the sample frequency signal and any sample frequency component of that sample audio frame, determining the target frequency component corresponding to the sample frequency component in the target frequency signal;
acquiring an L1 norm between the amplitude of the sample frequency component and the amplitude of the target frequency component;
acquiring an L2 norm between the natural logarithm of the amplitude of the sample frequency component and the natural logarithm of the amplitude of the target frequency component;
and obtaining the time-frequency loss term based on the L1 norms and the L2 norms of the plurality of sample audio frames over the plurality of sample frequency components at the plurality of sampling rates.
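Claims 9 and 10 describe what is essentially a multi-rate spectral loss: an L1 term on STFT magnitudes plus an L2-type term on their natural logarithms. A sketch follows; the sampling-rate set, FFT size, torchaudio resampling and the use of a mean squared log-magnitude error as the L2 term are assumptions.

```python
import torch
import torchaudio.functional as AF

def time_frequency_loss(sample, target, orig_sr=48000, rates=(8000, 16000, 24000), n_fft=1024):
    """L1 on STFT magnitudes plus a squared log-magnitude term, averaged over sampling rates."""
    total = 0.0
    for sr in rates:
        s = AF.resample(sample, orig_sr, sr)   # time-frequency conversion at this rate
        t = AF.resample(target, orig_sr, sr)
        window = torch.hann_window(n_fft, device=s.device)
        S = torch.stft(s, n_fft, window=window, return_complex=True).abs()
        T = torch.stft(t, n_fft, window=window, return_complex=True).abs()
        l1 = torch.mean(torch.abs(S - T))                                  # L1 of magnitude error
        l2 = torch.mean((torch.log(S + 1e-7) - torch.log(T + 1e-7)) ** 2)  # L2-type log-magnitude error
        total = total + l1 + l2
    return total / len(rates)
```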
11. The method of claim 9, wherein the determining a signal-to-noise loss term based on the sample frequency signal and the target frequency signal at different sampling rates comprises:
for any sampling rate, any sample audio frame in the sample frequency signal and any sample frequency component of that sample audio frame, determining the target frequency component corresponding to the sample frequency component in the target frequency signal;
dividing the amplitude of the target frequency component by the cosine of a frequency feature included angle to obtain signal information of the target frequency signal, wherein the frequency feature included angle is the included angle between the sample frequency component and the target frequency component in a frequency-domain space;
subtracting the signal information from the target frequency component to obtain noise information of the target frequency signal;
and obtaining the signal-to-noise loss term based on the signal information and the noise information of the plurality of sample audio frames over the plurality of sample frequency components at the plurality of sampling rates.
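For the signal-to-noise loss term of claim 11, a standard SI-SNR-style projection over STFT magnitudes is sketched below as a stand-in; the claim's exact cosine-angle formulation may differ, and treating the inputs as magnitude spectrograms is an assumption.

```python
import torch

def snr_loss(sample_mag, target_mag, eps=1e-8):
    """SI-SNR-style signal-to-noise term over STFT magnitudes (freq, frames)."""
    s = sample_mag.reshape(-1)
    t = target_mag.reshape(-1)
    # Project the generated (target) spectrum onto the reference (sample) spectrum;
    # the projection length equals |t| * cos(angle between s and t).
    signal = (torch.dot(t, s) / (torch.dot(s, s) + eps)) * s
    noise = t - signal
    snr = 10 * torch.log10(torch.sum(signal ** 2) / (torch.sum(noise ** 2) + eps))
    return -snr  # maximizing the SNR corresponds to minimizing the loss
```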
12. The method of claim 8, wherein the obtaining a pronunciation loss term for the audio generation model based on the sample audio segment and the target audio segment comprises:
inputting the sample audio fragment into a phoneme feature extraction model to obtain sample phoneme features, wherein the sample phoneme features are used for representing the pronunciation features of phonemes of audio frames in the sample audio fragment;
inputting the target audio segment into the phoneme feature extraction model to obtain target phoneme features, wherein the target phoneme features are used for representing the pronunciation features of phonemes of audio frames in the target audio segment;
and acquiring the pronunciation loss term based on the sample phoneme features and the target phoneme features.
13. The method of claim 8, wherein the obtaining semantic loss terms for the audio generation model based on the sample audio piece and the target audio piece comprises:
inputting the sample audio fragment into a semantic feature extraction model to obtain sample semantic features, wherein the sample semantic features are used for representing the semantic features of audio frames in the sample audio fragment;
Inputting the target audio segment into the semantic feature extraction model to obtain target semantic features, wherein the target semantic features are used for representing semantic features of audio frames in the target audio segment;
and acquiring the semantic loss term based on the sample semantic features and the target semantic features.
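Claims 12 and 13 both compute a distance between features extracted from the sample and target segments by a pretrained model (a phoneme-feature extraction model for the pronunciation loss term, a semantic-feature extraction model for the semantic loss term). A generic sketch follows; freezing the extractor and using an L1 distance are assumptions.

```python
import torch

def feature_space_loss(sample_clip, target_clip, extractor):
    """Distance between extractor features of the real and generated clips.

    `extractor` is a frozen, pretrained feature model: phoneme features for the
    pronunciation loss term, semantic features for the semantic loss term.
    """
    with torch.no_grad():
        sample_feat = extractor(sample_clip)   # e.g. phoneme posteriors or hidden states
    target_feat = extractor(target_clip)       # gradients flow back to the generator
    return torch.mean(torch.abs(sample_feat - target_feat))
```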
14. An audio signal processing apparatus, comprising:
an acquisition unit configured to acquire, in the event that an audio frame is missing from an audio signal, a historical audio frame preceding the missing audio frame and a future audio frame following the missing audio frame;
a first synthesis unit comprising:
a determining subunit configured to determine an audio clip composed of the historical audio frame and the future audio frame, the audio clip being an audio frame sequence composed of a plurality of audio frames;
an acquisition subunit configured to acquire deletion indication information of the audio clip, wherein the deletion indication information is a parameter sequence formed by the deletion indication parameters of the plurality of audio frames and is used for indicating whether each audio frame in the audio clip is deleted;
a synthesis subunit comprising: a fusion subunit configured to splice, for each audio frame in the audio frame sequence, the audio frame with the deletion indication parameter of that audio frame in the parameter sequence to obtain dual-channel data of the audio frame, and to acquire extended audio data composed of the dual-channel data of the plurality of audio frames; an encoding subunit configured to encode the extended audio data to obtain audio encoding characteristics of the extended audio data; and a decoding subunit configured to decode the audio encoding characteristics to obtain a target audio frame, wherein the phonemes and semantics of the target audio frame are similar to those of the missing audio frame;
and a second synthesis unit configured to synthesize, based on the historical audio frame, the target audio frame, and the future audio frame, a compensation signal associated with the audio signal, the compensation signal filling the missing audio frame with the target audio frame.
15. A training device for an audio generation model, comprising:
a first acquisition unit configured to obtain, through an audio generation model, a target audio segment associated with a sample audio segment, wherein a missing audio frame exists in the sample audio segment and, in the target audio segment, the missing audio frame is filled with a synthesized target audio frame, the synthesis process of the target audio frame comprising: acquiring a historical audio frame before the missing audio frame and a future audio frame after the missing audio frame; determining an audio segment consisting of the historical audio frame and the future audio frame, wherein the audio segment is an audio frame sequence consisting of a plurality of audio frames; acquiring deletion indication information of the audio segment, wherein the deletion indication information is a parameter sequence formed by the deletion indication parameters of the plurality of audio frames and is used for indicating whether each audio frame in the audio segment is deleted; for each audio frame in the audio frame sequence, splicing the audio frame with the deletion indication parameter of that audio frame in the parameter sequence to obtain dual-channel data of the audio frame; acquiring extended audio data composed of the dual-channel data of the plurality of audio frames; encoding the extended audio data to obtain audio encoding characteristics of the extended audio data; and decoding the audio encoding characteristics to obtain the target audio frame;
a second acquisition unit configured to acquire, through an audio discrimination model, respective audio discrimination parameters of the sample audio segment and the target audio segment, wherein an audio discrimination parameter is used for characterizing the likelihood, as estimated by the audio discrimination model, that an input audio segment is a machine-synthesized signal;
and a parameter adjustment unit configured to perform iterative adjustment of parameters of the audio generation model based on the audio discrimination parameters, the sample audio piece, and the target audio piece.
16. An electronic device, comprising:
one or more processors;
one or more memories for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to execute the instructions to implement the audio signal processing method of any one of claims 1 to 5, or the training method of an audio generation model of any one of claims 6 to 13.
17. A computer-readable storage medium, characterized in that, when at least one instruction in the computer-readable storage medium is executed by one or more processors of an electronic device, the electronic device is enabled to perform the audio signal processing method of any one of claims 1 to 5, or the training method of an audio generation model of any one of claims 6 to 13.
CN202210486101.9A 2022-05-06 2022-05-06 Audio signal processing method, audio generation model training method and device Active CN114866856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210486101.9A CN114866856B (en) 2022-05-06 2022-05-06 Audio signal processing method, audio generation model training method and device


Publications (2)

Publication Number Publication Date
CN114866856A CN114866856A (en) 2022-08-05
CN114866856B true CN114866856B (en) 2024-01-02

Family

ID=82636314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210486101.9A Active CN114866856B (en) 2022-05-06 2022-05-06 Audio signal processing method, audio generation model training method and device

Country Status (1)

Country Link
CN (1) CN114866856B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179911A (en) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 Target voice extraction method, device, equipment, medium and joint training method
CN111653285A (en) * 2020-06-01 2020-09-11 北京猿力未来科技有限公司 Packet loss compensation method and device
CN112435641A (en) * 2020-11-09 2021-03-02 腾讯科技(深圳)有限公司 Audio processing method and device, computer equipment and storage medium
CN113035207A (en) * 2021-03-03 2021-06-25 北京猿力未来科技有限公司 Audio processing method and device
CN113035205A (en) * 2020-12-28 2021-06-25 阿里巴巴(中国)有限公司 Audio packet loss compensation processing method and device and electronic equipment
CN113035228A (en) * 2021-03-23 2021-06-25 广州酷狗计算机科技有限公司 Acoustic feature extraction method, device, equipment and storage medium
CN113612808A (en) * 2021-10-09 2021-11-05 腾讯科技(深圳)有限公司 Audio processing method, related device, storage medium, and program product
WO2022012195A1 (en) * 2020-07-13 2022-01-20 腾讯科技(深圳)有限公司 Audio signal processing method and related apparatus

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109524015B (en) * 2017-09-18 2022-04-15 杭州海康威视数字技术股份有限公司 Audio coding method, decoding method, device and audio coding and decoding system
US20220059099A1 (en) * 2018-12-20 2022-02-24 Telefonaktiebolaget Lm Ericsson (Publ) Method and apparatus for controlling multichannel audio frame loss concealment




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant