CN117765954A - Audio processing method, device, electronic equipment and storage medium

Info

Publication number
CN117765954A
Authority
CN
China
Prior art keywords
audio
processing
training
model
spectrogram
Prior art date
Legal status
Pending
Application number
CN202311801737.9A
Other languages
Chinese (zh)
Inventor
张莉娜
魏光辉
宋莎莎
杨喜鹏
田锋平
吴玉虎
Current Assignee
Shanghai Mobvoi Information Technology Co ltd
Original Assignee
Shanghai Mobvoi Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Mobvoi Information Technology Co ltd filed Critical Shanghai Mobvoi Information Technology Co ltd
Priority claimed from application CN202311801737.9A
Publication of CN117765954A

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure provides an audio processing method, an audio processing device, electronic equipment and a storage medium. The audio processing method provided by the disclosure comprises the following steps: converting the audio to be processed to obtain a first spectrogram; inputting the first spectrogram into a pre-trained VQGAN model to obtain a first code; and processing the first code to obtain a first audio token.

Description

Audio processing method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to an audio processing method, an audio processing device, electronic equipment and a storage medium.
Background
With the development of artificial intelligence, audio processing technologies such as speech recognition, speech translation, and speech question answering have advanced rapidly. A key step in audio processing is reducing the dimensionality of a continuous audio signal and discretizing it into audio tokens (discretized representations of the audio). The prior art generally uses an audio encoder to implement this discretization.
However, the number and quality of tokens obtained by processing continuous audio with an audio encoder rarely meet the requirements of an audio processing model: either there are too many tokens, exceeding the model's input limit, or the tokens are too few or of too low quality to support audio processing, resulting in poor processing quality.
Disclosure of Invention
The disclosure provides an audio processing method, an audio processing device, electronic equipment and a storage medium.
According to one aspect of the present disclosure, there is provided an audio processing method including:
converting the audio to be processed to obtain a first spectrogram;
inputting the first spectrogram into a pre-trained VQGAN model to obtain a first code;
and processing the first code to obtain a first audio token.
An audio processing method according to at least one embodiment of the present disclosure further includes:
and inputting the first audio token into a pre-trained audio processing model to obtain a processing result.
According to an audio processing method of at least one embodiment of the present disclosure, the inputting the first audio token into a pre-trained audio processing model includes:
acquiring a first processing instruction of the audio to be processed;
inputting the first processing instruction and the first audio token into a pre-trained audio processing model.
According to an audio processing method of at least one embodiment of the present disclosure, before the inputting the first audio token into the pre-trained audio processing model, the audio processing method further includes:
acquiring a first training audio and a corresponding training processing result;
converting the first training audio to obtain a training spectrogram;
inputting the training spectrogram into the pre-trained VQGAN model to obtain a second code;
processing the second code to obtain a second audio token;
and training an initial processing model by using the second audio token and a corresponding training processing result to obtain the pre-trained audio processing model.
According to an audio processing method of at least one embodiment of the present disclosure, the training an initial processing model using the second audio token and a corresponding training processing result includes:
acquiring a training processing instruction corresponding to a training processing result;
and training an initial processing model by using the training processing instruction, the corresponding second audio token and the training processing result.
According to an audio processing method of at least one embodiment of the present disclosure, before the inputting the first spectrogram into a pre-trained VQGAN model, the audio processing method further includes: processing the first spectrogram according to a preset processing rule to obtain a second spectrogram;
the inputting the first spectrogram into a pre-trained VQGAN model includes: and inputting the second spectrogram into a pre-trained VQGAN model.
According to an audio processing method of at least one embodiment of the present disclosure, before the inputting the first spectrogram into a pre-trained VQGAN model, the audio processing method further includes:
randomly intercepting audio of a preset first duration from the second training audio to obtain a third training audio;
converting the third training audio to obtain a third spectrogram;
and training an initial VQGAN model by using a third spectrogram to obtain the pre-trained VQGAN model.
According to another aspect of the present disclosure, there is provided an audio processing apparatus including:
the speech spectrum conversion module is used for converting the audio to be processed to obtain a first spectrogram;
the discrete processing module is used for inputting the first spectrogram into a pre-trained VQGAN model to obtain a first code;
and the discrete conversion module is used for processing the first code to obtain a first audio token.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: a memory storing execution instructions; and a processor executing the execution instructions stored in the memory, causing the processor to execute the audio processing method according to any one of the embodiments of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a readable storage medium having stored therein execution instructions which, when executed by a processor, are to implement the audio processing method of any one of the embodiments of the present disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of an audio processing method provided in embodiment 1 of the present disclosure.
Fig. 2 is a second flowchart of an audio processing method provided in embodiment 1 of the present disclosure.
Fig. 3 is a third flowchart of an audio processing method provided in embodiment 1 of the present disclosure.
Fig. 4 is a flowchart of a model input method in the audio processing method shown in fig. 2.
Fig. 5 is a flowchart of a model training method in the audio processing method shown in fig. 3.
Fig. 6 is a fourth flowchart of an audio processing method provided in embodiment 1 of the present disclosure.
Fig. 7 is a fifth flowchart of an audio processing method provided in embodiment 1 of the present disclosure.
Fig. 8 is a schematic structural diagram of an audio processing apparatus according to embodiment 2 of the present disclosure.
Fig. 9 is a schematic diagram of a second configuration of the audio processing apparatus according to embodiment 2 of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and the embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant content and not limiting of the present disclosure. It should be further noted that, for convenience of description, only a portion relevant to the present disclosure is shown in the drawings.
In addition, embodiments of the present disclosure and features of the embodiments may be combined with each other without conflict. The technical aspects of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the exemplary implementations/embodiments shown are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Thus, unless otherwise indicated, features of the various implementations/embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concepts of the present disclosure.
The use of cross-hatching and/or shading in the drawings is typically used to clarify the boundaries between adjacent components. As such, the presence or absence of cross-hatching or shading does not convey or represent any preference or requirement for a particular material, material property, dimension, proportion, commonality between illustrated components, and/or any other characteristic, attribute, property, etc. of a component, unless indicated. In addition, in the drawings, the size and relative sizes of elements may be exaggerated for clarity and/or descriptive purposes. While the exemplary embodiments may be variously implemented, the specific process sequences may be performed in a different order than that described. For example, two consecutively described processes may be performed substantially simultaneously or in reverse order from that described. Moreover, like reference numerals designate like parts.
When an element is referred to as being "on" or "over", "connected to" or "coupled to" another element, it can be directly on, connected or coupled to the other element or intervening elements may be present. However, when an element is referred to as being "directly on," "directly connected to," or "directly coupled to" another element, there are no intervening elements present. For this reason, the term "connected" may refer to physical connections, electrical connections, and the like, with or without intermediate components.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to limit the scope of the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising," and variations thereof, are used in the present specification, the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof is described, but the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof is not precluded. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximation terms and not as degree terms, and as such, are used to explain the inherent deviations of measured, calculated, and/or provided values that would be recognized by one of ordinary skill in the art.
The audio processing method of the present disclosure may be applied to an audio processing apparatus of the present disclosure, which may be configured on a server.
The audio processing method of the present disclosure is described in detail below with reference to fig. 1 to 7.
Example 1:
fig. 1 is a flowchart of an audio processing method of one embodiment of the present disclosure.
Referring to fig. 1, an audio processing method M100 of the present embodiment includes:
step S110, converting the audio to be processed to obtain a first spectrogram.
Step S120, inputting the first spectrogram into a pre-trained VQGAN model to obtain a first code corresponding to the spectrogram.
Step S130, converting the first code to obtain a first audio token.
According to the audio processing method, after the audio to be processed is converted into the first spectrogram, the first spectrogram is compressed into the first code by the pre-trained VQGAN model, and the first code is processed to obtain the first audio token. Because the VQGAN model can represent data with fewer values, it effectively compresses continuous information in the latent space while maintaining reconstruction quality. VQGAN-based audio processing therefore reduces the number of audio tokens while preserving quality, so that the audio processing model can support continuous voice input. This solves the prior-art problem that the tokens produced by an audio encoder rarely meet the requirements of an audio processing model: either there are too many tokens, exceeding the model's limit, or they are too few or of too low quality to support audio processing, resulting in poor processing quality.
In some embodiments of the present disclosure, the audio to be processed in step S110 may be speech, singing, program audio, or the like. Step S110 may transform the audio to be processed using the Fourier transform, the fast Fourier transform (FFT), the short-time Fourier transform (STFT), or the like. To enrich the features of the first spectrogram, the conversion in step S110 may further be performed on the Mel scale, in which case the first spectrogram is a Mel spectrogram.
In some embodiments of the present disclosure, the conversion in step S110 may specifically include: first, preprocessing the audio to be processed, including pre-emphasis (e.g., passing the audio through a high-pass filter), framing (dividing the audio into small time blocks, i.e., frames), and windowing (applying a window function to each frame), to obtain a plurality of single-frame signals; transforming the single-frame signals to generate a plurality of spectra; mapping each spectrum to obtain a plurality of mapped spectra; and splicing the mapped spectra along the time dimension to form the complete first spectrogram. Specifically, the spectra are power spectra, and the first spectrogram obtained after passing them through a Mel filter bank is a Mel spectrogram.
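The preprocessing pipeline described above (pre-emphasis, framing, windowing, per-frame power spectrum, Mel filter bank, splicing along time) can be sketched in plain NumPy. The sample rate, FFT length, hop size, and mel-band count below are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=40):
    # Pre-emphasis: high-pass-like filtering of the raw signal
    emph = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # Framing + windowing: split into overlapping frames, apply Hann window
    n_frames = 1 + max(0, (len(emph) - n_fft) // hop)
    frames = np.stack([emph[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hanning(n_fft)
    # Power spectrum of each single-frame signal
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular Mel filter bank mapping each spectrum onto the Mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    # Splicing along time: columns are the mapped frame spectra in order
    return fbank @ power.T

audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
spec = mel_spectrogram(audio)
print(spec.shape)  # (40, 98)
```

The resulting (n_mels, n_frames) array is the "complete first spectrogram" the method feeds to the VQGAN model.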
In some embodiments of the present disclosure, a training process for an initial VQGAN (Vector Quantized Generative Adversarial Network) model may precede step S120. The initial VQGAN model may be trained with training audio; the training improves image reconstruction through the GAN loss function and perceptual error, with the compression model and the image generation model playing an adversarial game and continuously optimizing their respective parameters, so that the generated images approach the real images, finally yielding the pre-trained VQGAN model. The training audio clips may have the same or different lengths.
In some embodiments of the present disclosure, the pre-trained VQGAN model used in step S120 may specifically be the compression model within the pre-trained VQGAN. When the first spectrogram is input into the compression model, the compression model obtains an intermediate code after CNN encoder encoding, and for each position of the intermediate code, the code with the nearest distance is found in the codebook, generating a first code of the same dimension.
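The nearest-code lookup that turns the encoder's intermediate code into the first code can be sketched as below. The codebook size and code dimension are illustrative assumptions, and `quantize` is our name for the operation, not the patent's:

```python
import numpy as np

def quantize(z, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z: (num_positions, dim) continuous intermediate code from the CNN encoder.
    codebook: (num_codes, dim) learned VQGAN codebook.
    Returns (indices, quantized), with quantized the same shape as z.
    """
    # Squared Euclidean distance between every position and every codebook entry
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)       # nearest code per position
    return idx, codebook[idx]    # first code, same dimension as z

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4096, 64))  # assumed codebook size/dim
z = rng.normal(size=(10, 64))           # assumed 10 encoder positions
idx, zq = quantize(z, codebook)
print(idx.shape, zq.shape)  # (10,) (10, 64)
```

The integer indices are what step S130 later maps to discretized token representations.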
In some embodiments of the present disclosure, the processing of the first code in step S130 may be to look up, for each position of the first code, the corresponding discretized representation (token) in a preset index. After the first audio token is obtained in step S130, it may be used in fields such as speech recognition, speech question answering, speech translation, and speech synthesis.
Further, as shown in fig. 2, the audio processing method provided by the present disclosure may further include:
step S140, inputting the first audio token into a pre-trained audio processing model to obtain a processing result.
In some embodiments of the present disclosure, after the audio discretization of the method shown in fig. 1, the first audio token may be processed in step S140. Since the number of first audio tokens is reduced while quality is guaranteed, the audio processing model in step S140 can support processing of longer continuous audio. The audio processing model may comprise one or more processing-mode models, including speech recognition, speech question answering, speech translation, speech synthesis, and the like; the audio processing model may also be preset with a vocabulary mapping audio to corresponding tokens.
At this time, as shown in fig. 3, before step S140, it may further include:
step S131, obtaining a first training audio and a corresponding training processing result.
Step S132, converting the first training audio to obtain a training spectrogram.
Step S133, inputting the training spectrogram into a pre-trained VQGAN model to obtain a second code.
Step S134, the second code is processed to obtain a second audio token.
And step S135, training the initial processing model by using the second audio token and the corresponding training processing result to obtain a pre-trained audio processing model.
In some embodiments of the present disclosure, the conversion process in step S132 is similar to step S110, the second encoding process obtained in step S133 is similar to step S120, and the encoding process in step S134 is similar to step S130, which will not be described in detail.
In some embodiments of the present disclosure, when the audio processing model is a single processing model, the audio token may be processed directly. When the audio processing model comprises two or more processing models, processing may be performed based on a processing instruction. In this case, as shown in fig. 4, step S140 may include: step S1401, acquiring a first processing instruction for the audio to be processed; and step S1402, inputting the first processing instruction and the first audio token into the pre-trained audio processing model. After the audio processing model receives the first processing instruction and the first audio token, it can execute the processing corresponding to the first processing instruction on the first audio token. The first processing instruction may be the original processing instruction input by the user, or a processing identifier corresponding to the original processing instruction; in the latter case, the audio processing model may further be preset with a vocabulary mapping processing instructions to corresponding processing identifiers.
In some embodiments of the present disclosure, in order for the audio processing model to support multiple processing procedures, to enhance model generalization capability, and to train the initial processing model, as shown in fig. 5, step S135 may include: step S1351, obtaining a training processing instruction corresponding to the training processing result; step S1352, training the initial processing model using the training processing instructions, the corresponding second audio token, and the training processing results.
Further, as shown in fig. 6, to ensure the unified specification and integrity of the spectrogram, the method further includes, before step S120:
and step S111, processing the first spectrogram according to a preset processing rule to obtain a second spectrogram.
At this time, step S120 specifically includes: and inputting the second spectrogram into a pre-trained VQGAN model to obtain a first code.
In some embodiments of the present disclosure, the processing in step S111 may be a normalization procedure, such as normalizing the first spectrogram according to a preset dB value or the minimum dB value of the first spectrogram.
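One plausible form of the dB normalization of step S111 is sketched below; the -80 dB floor and max-referenced scaling are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def normalize_db(spec, ref_db=None, min_db=-80.0):
    """Normalize a power spectrogram into [0, 1] on a dB scale.

    ref_db: reference level; defaults to the spectrogram's own maximum.
    min_db: floor below which values are clipped (assumed -80 dB).
    """
    # Power -> dB, with a small epsilon to avoid log of zero
    db = 10.0 * np.log10(np.maximum(spec, 1e-10))
    ref = db.max() if ref_db is None else ref_db
    db = np.clip(db - ref, min_db, 0.0)
    return (db - min_db) / -min_db  # 0 at the floor, 1 at the reference

spec = np.abs(np.random.default_rng(1).normal(size=(40, 98))) ** 2
norm = normalize_db(spec)
print(norm.min() >= 0.0, norm.max() <= 1.0)  # True True
```

The unified [0, 1] range gives the VQGAN model the consistent input specification the paragraph above motivates.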
Further, in order to make the image generated by the VQGAN model more similar to the real image, as shown in fig. 7, before step S120, the method may further include:
Step S112, randomly intercepting audio of a preset first duration from the second training audio to obtain a third training audio.
Step S113, converting the third training audio to obtain a third spectrogram.
Step S114, training an initial VQGAN model by using the third spectrogram to obtain a pre-trained VQGAN model.
In some embodiments of the present disclosure, in step S112, the third training audio segments may be correlated with each other, or such correlation may be introduced by setting a suitable first duration, so that when the initial VQGAN model is trained, it can better learn the correlations within the audio, and the generated images are closer to real images. Here, two audio segments are correlated when they overlap in time.
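The random interception of step S112 can be sketched as follows; `random_crops`, the crop length, and the crop count are illustrative assumptions. With a crop length that is large relative to the audio, crops are likely to overlap, giving the correlated training segments mentioned above:

```python
import numpy as np

def random_crops(audio, crop_len, n_crops, rng=None):
    """Randomly intercept n_crops fixed-length segments from one training audio."""
    if rng is None:
        rng = np.random.default_rng()
    # Uniformly random start positions; segments may overlap (be correlated)
    starts = rng.integers(0, len(audio) - crop_len + 1, size=n_crops)
    return [audio[s : s + crop_len] for s in starts]

audio = np.arange(16000, dtype=np.float32)  # stand-in for 1 s of training audio
crops = random_crops(audio, crop_len=8000, n_crops=4,
                     rng=np.random.default_rng(42))
print(len(crops), len(crops[0]))  # 4 8000
```

Each crop would then be converted to a spectrogram (step S113) and used to train the initial VQGAN model (step S114).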
In some embodiments of the present disclosure, the conversion process in step S113 is similar to step S110, and the third spectrogram used in the training in step S114 may also be a processed spectrogram, and the processing process is similar to step S111, which is not described herein.
In some embodiments of the present disclosure, when a plurality of data items are input into a model, they may be spliced together before input in order to represent the association between them.
Taking automatic speech recognition (ASR) technology as an example, the data input into the pre-trained audio processing model includes a processing instruction and audio tokens, which may take the form:
{"text": "[ASR]<|690|><|690|><|690|><|1337|><|2396|><|1851|><|3765|><|434|><|1190|><|942|><|2263|><|858|><|2331|><|413|><|3157|><|4083|><|1107|><|1305|><|3524|><|3486|><|1440|><|3589|><|2407|><|815|><|858|><|868|><|1367|><|2322|><|3323|><|3109|><|193|><|727|><|3731|><|3535|><|3126|><|2159|><|694|><|2102|><|1236|>"}
In training, the training data used includes a processing instruction, audio tokens, and a processing result, which may take the form:
{"text": "[ASR]<|690|><|690|><|690|><|1337|><|2396|><|1851|><|3765|><|434|><|1190|><|942|><|2263|><|858|><|2331|><|413|><|3157|><|4083|><|1107|><|1305|><|3524|><|3486|><|1440|><|3589|><|2407|><|815|><|858|><|868|><|1367|><|2322|><|3323|><|3109|><|193|><|727|><|3731|><|3535|><|3126|><|2159|><|694|><|2102|><|1236|><|AS|>I'll send you back home"}
wherein <|AS|> is a connector between the audio tokens and the processing result.
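The prompt format above (an instruction tag followed by "<|id|>" audio tokens) can be reproduced with a small helper; `build_asr_prompt` is a hypothetical name for illustration, not part of the patent:

```python
def build_asr_prompt(token_ids, instruction="[ASR]"):
    """Splice a processing instruction with discretized audio tokens.

    token_ids: integer codebook indices from the VQGAN quantization step.
    instruction: the processing-instruction tag (assumed "[ASR]" here).
    """
    # Render each index in the "<|id|>" vocabulary form and concatenate
    return instruction + "".join(f"<|{t}|>" for t in token_ids)

prompt = build_asr_prompt([690, 690, 1337])
print(prompt)  # [ASR]<|690|><|690|><|1337|>
```

For training data, the processing result would be appended after a "<|AS|>" connector in the same string.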
Example 2:
fig. 8-9 illustrate exemplary diagrams of audio processing devices employing hardware implementations of processing systems.
The apparatus may include corresponding modules that perform the steps of the flowcharts described above. Thus, each step or several steps in the flowcharts described above may be performed by respective modules, and the apparatus may include one or more of these modules. A module may be one or more hardware modules specifically configured to perform the respective steps, or be implemented by a processor configured to perform the respective steps, or be stored within a computer-readable medium for implementation by a processor, or be implemented by some combination.
The hardware architecture may be implemented using a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. Bus 1100 connects together various circuits including one or more processors 1200, memory 1300, and/or hardware modules. Bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
Bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, only one connection line is shown in the figure, but this does not mean there is only one bus or one type of bus.
Any process or method description in a flowchart or otherwise described herein may be understood as: a module, segment, or portion of code, which comprises one or more executable instructions for implementing the steps of a specified logical function(s) or process (es). The scope of the preferred embodiments of the present disclosure may include other implementations in which functions may be performed out of the order described, for example, in a substantially simultaneous manner or in an opposite order depending on the function involved, as would be understood by one of skill in the art. The processor may be used to perform the various methods and processes described above. For example, method embodiments in the present disclosure may be implemented as a software program stored on a computer readable storage medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via memory and/or a communication interface. One or more of the steps of the methods described above may be performed when the software program is loaded and executed by a processor. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
Logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The readable storage medium may even be paper or another suitable medium on which the program is printed, as the program can be captured electronically, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the methods of the embodiments described above may be implemented by a program to instruct related hardware. The program may be stored in a readable storage medium. The program, when executed, includes one or a combination of steps for implementing the method.
Furthermore, each functional unit in each embodiment of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. The storage medium may be a read-only memory, a magnetic disk or optical disk, etc.
As shown in fig. 8, an audio processing apparatus 1000 according to the present disclosure may include:
the speech spectrum conversion module 1010 is configured to convert the audio to be processed to obtain a first spectrum chart.
The discrete processing module 1020 is configured to input the first spectrogram into a pre-trained VQGAN model to obtain a first code.
The discrete conversion module 1030 is configured to process the first code to obtain a first audio token.
Further, the audio processing apparatus provided by the present disclosure may further include:
the audio processing module 1040 is configured to input the first audio token into a pre-trained audio processing model, and obtain a processing result.
Wherein, the audio processing module 1040 may include:
the first instruction submodule 10401 is configured to obtain a first processing instruction of the audio to be processed.
A first processing submodule 10402 for inputting a first processing instruction and a first audio token into a pre-trained audio processing model.
At this time, the audio processing apparatus may further include:
the first obtaining module 1050 is configured to obtain the first training audio and the corresponding training processing result.
The first conversion module 1060 is configured to convert the first training audio to obtain a training spectrogram.
A first input module 1070 is configured to input the training spectrogram into a pre-trained VQGAN model to obtain a second code.
The first processing module 1080 is configured to process the second code to obtain a second audio token.
The first training module 1090 is configured to train the initial processing model by using the second audio token and the corresponding training processing result, so as to obtain a pre-trained audio processing model.
Wherein, the first training module 1090 may include:
training instruction submodule 10901 is configured to obtain a training processing instruction corresponding to the training processing result.
The processing training submodule 10902 is configured to train the initial processing model using the training processing instruction, the corresponding second audio token, and the training processing result.
Further, the audio processing apparatus provided by the present disclosure may further include:
The standard processing module 1091 is configured to process the first spectrogram according to a preset processing rule to obtain a second spectrogram.
At this time, the discrete processing module 1020 is specifically configured to input the second spectrogram into the pre-trained VQGAN model to obtain the first code.
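The disclosure does not specify the preset processing rule. One plausible rule is normalizing the spectrogram to a fixed number of time frames, so the VQGAN always receives a constant input shape; the target length below is an assumed value:

```python
import numpy as np

def standardize(spec: np.ndarray, target_frames: int = 128) -> np.ndarray:
    """Pad with zeros or crop along the time axis to a fixed frame count."""
    t = spec.shape[0]
    if t >= target_frames:
        return spec[:target_frames]
    pad = np.zeros((target_frames - t, spec.shape[1]), dtype=spec.dtype)
    return np.concatenate([spec, pad], axis=0)

short = standardize(np.ones((50, 80)))   # padded up to 128 frames
long_ = standardize(np.ones((200, 80)))  # cropped down to 128 frames
```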
Further, the audio processing apparatus provided by the present disclosure may further include:
the audio capturing module 1092 is configured to randomly capture audio from the second training audio according to a preset first time, so as to obtain a third training audio.
The second conversion module 1093 is configured to convert the third training audio to obtain a third spectrogram.
The second training module 1094 is configured to train the initial VQGAN model using the third spectrogram to obtain the pre-trained VQGAN model.
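The random interception step can be sketched as a random fixed-duration crop of the training waveform; the sample rate and one-second duration here are illustrative, not values from the disclosure:

```python
import numpy as np

def random_crop(audio: np.ndarray, sr: int, seconds: float,
                rng: np.random.Generator) -> np.ndarray:
    """Randomly intercept a segment of the given duration; audio shorter
    than the preset duration is returned unchanged."""
    n = int(sr * seconds)
    if len(audio) <= n:
        return audio
    start = rng.integers(0, len(audio) - n + 1)
    return audio[start:start + n]

rng = np.random.default_rng(42)
second_training_audio = np.arange(16000 * 3, dtype=np.float32)  # 3 s at 16 kHz
third_training_audio = random_crop(second_training_audio, 16000, 1.0, rng)
```

Random crops expose the VQGAN to varied segments of each recording, which is a common way to augment limited training audio.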
In some embodiments of the present disclosure, the process of implementing audio processing through the above modules/sub-modules is similar to that provided in Embodiment 1 and will not be repeated here. Fig. 9 shows the structure of the audio processing apparatus including all of the above modules/sub-modules; when the apparatus includes only some of the modules/sub-modules, its structure is similar to that shown in Fig. 9 and will not be repeated here.
After converting the audio to be processed into the first spectrogram, the audio processing apparatus compresses the first spectrogram into the first code through the pre-trained VQGAN model, and processes the first code to obtain the first audio token. Because the VQGAN model can represent data with fewer values, it effectively compresses continuous information in the latent space while maintaining reconstruction quality. The VQGAN-based audio processing therefore reduces the number of audio tokens while preserving quality, so that the audio processing model can support continuous voice input. The apparatus thereby addresses a shortcoming of the prior art, in which tokens obtained by processing continuous audio with a conventional audio encoder rarely satisfy the requirements of an audio processing model: either their number far exceeds the model's input limit, or their quality is too low, so that audio processing is unsupported or of poor quality.
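The claimed benefit, fewer tokens at comparable quality, can be illustrated with back-of-the-envelope arithmetic; every number below (sample rate, hop size, downsampling factor) is hypothetical, not taken from the disclosure:

```python
# 10 s of 16 kHz audio with a 10 ms (160-sample) hop → 1000 spectrogram frames.
frames = int(10 * 16000 / 160)

# A VQGAN encoder that downsamples the time axis by 8x emits 125 tokens,
# versus ~1000 if a plain frame-level audio encoder emitted one token per frame.
vqgan_tokens = frames // 8
plain_tokens = frames
ratio = plain_tokens / vqgan_tokens
```

Under these assumptions the token count drops by a factor of eight, which is the kind of reduction that lets a token-limited audio processing model accept longer continuous speech.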
The present disclosure also provides an electronic device, including: a memory storing execution instructions; and a processor or other hardware module that executes the execution instructions stored in the memory, so as to perform the audio processing method described above.
The present disclosure also provides a readable storage medium having stored therein execution instructions which, when executed by a processor, implement the audio processing method described above.
In the description of this specification, reference to the terms "one embodiment/example," "some embodiments/examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, persons skilled in the art may combine the various embodiments or examples described in this specification, and the features thereof, provided they do not contradict each other.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
It will be appreciated by those skilled in the art that the above-described embodiments are merely for clarity of illustration of the disclosure, and are not intended to limit the scope of the disclosure. Other variations or modifications will be apparent to persons skilled in the art from the foregoing disclosure, and such variations or modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. An audio processing method, comprising:
converting the audio to be processed to obtain a first spectrogram;
inputting the first spectrogram into a pre-trained VQGAN model to obtain a first code; and
processing the first code to obtain a first audio token.
2. The audio processing method according to claim 1, characterized by further comprising:
inputting the first audio token into a pre-trained audio processing model to obtain a processing result.
3. The audio processing method of claim 2, wherein said inputting the first audio token into a pre-trained audio processing model comprises:
acquiring a first processing instruction of the audio to be processed; and
inputting the first processing instruction and the first audio token into a pre-trained audio processing model.
4. The audio processing method according to claim 2 or 3, further comprising, prior to said inputting the first audio token into a pre-trained audio processing model:
acquiring a first training audio and a corresponding training processing result;
converting the first training audio to obtain a training spectrogram;
inputting the training spectrogram into the pre-trained VQGAN model to obtain a second code;
processing the second code to obtain a second audio token; and
training an initial processing model using the second audio token and the corresponding training processing result to obtain the pre-trained audio processing model.
5. The audio processing method of claim 4, wherein training an initial processing model using the second audio token and the corresponding training processing results comprises:
acquiring a training processing instruction corresponding to the training processing result; and
training the initial processing model using the training processing instruction, the corresponding second audio token, and the training processing result.
6. An audio processing method according to any one of claims 1 to 3, characterized in that,
before the inputting the first spectrogram into the pre-trained VQGAN model, the method further comprises: processing the first spectrogram according to a preset processing rule to obtain a second spectrogram;
the inputting the first spectrogram into a pre-trained VQGAN model includes: and inputting the second spectrogram into a pre-trained VQGAN model.
7. The audio processing method according to any one of claims 1 to 3, characterized by further comprising, before said inputting the first spectrogram into a pre-trained VQGAN model:
randomly intercepting audio from a second training audio according to a preset first duration to obtain a third training audio;
converting the third training audio to obtain a third spectrogram; and
training an initial VQGAN model using the third spectrogram to obtain the pre-trained VQGAN model.
8. An audio processing apparatus, comprising:
a speech spectrum conversion module configured to convert the audio to be processed to obtain a first spectrogram;
a discrete processing module configured to input the first spectrogram into a pre-trained VQGAN model to obtain a first code; and
a discrete conversion module configured to process the first code to obtain a first audio token.
9. An electronic device, comprising:
a memory storing execution instructions; and
a processor that executes the execution instructions stored in the memory to implement the audio processing method of any one of claims 1 to 7.
10. A readable storage medium having stored therein execution instructions which, when executed by a processor, implement the audio processing method of any one of claims 1 to 7.
CN202311801737.9A 2023-12-25 2023-12-25 Audio processing method, device, electronic equipment and storage medium Pending CN117765954A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311801737.9A CN117765954A (en) 2023-12-25 2023-12-25 Audio processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117765954A true CN117765954A (en) 2024-03-26

Family

ID=90310201

Country Status (1)

Country Link
CN (1) CN117765954A (en)

Similar Documents

Publication Publication Date Title
CN110197658B (en) Voice processing method and device and electronic equipment
CN111309883B (en) Man-machine dialogue method based on artificial intelligence, model training method and device
KR100391243B1 (en) System and method for generating and using context dependent sub-syllable models to recognize a tonal language
CN113658583B (en) Ear voice conversion method, system and device based on generation countermeasure network
CN113823264A (en) Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment
KR102272554B1 (en) Method and system of text to multiple speech
CN112542169A (en) Voice recognition processing method and device
KR20190135853A (en) Method and system of text to multiple speech
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
CN109741761B (en) Sound processing method and device
CN112163084A (en) Question feedback method, device, medium and electronic equipment
CN117765954A (en) Audio processing method, device, electronic equipment and storage medium
CN111863009A (en) Training method and system of context information prediction model
CN115132182B (en) Data identification method, device, equipment and readable storage medium
CN115240696B (en) Speech recognition method and readable storage medium
CN115497451A (en) Voice processing method and device, electronic equipment and storage medium
Diwan et al. Reduce and reconstruct: ASR for low-resource phonetic languages
CN112259110B (en) Audio encoding method and device and audio decoding method and device
CN112750469A (en) Method for detecting music in voice, voice communication optimization method and corresponding device
CN114400006B (en) Speech recognition method and device
CN117238277B (en) Intention recognition method, device, storage medium and computer equipment
CN112820313B (en) Model training method, voice separation method and device and electronic equipment
CN115132204B (en) Voice processing method, equipment, storage medium and computer program product
CN115346516A (en) End-to-end voice confidence estimation method and device in cross-domain voice recognition
CN116484314A (en) Multi-mode data processing method and language intelligent system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination