CN117765954A - Audio processing method, device, electronic equipment and storage medium

Info

Publication number
CN117765954A
Authority
CN
China
Prior art keywords
audio
processing
training
model
spectrogram
Prior art date
Legal status
Pending
Application number
CN202311801737.9A
Other languages
Chinese (zh)
Inventor
张莉娜
魏光辉
宋莎莎
杨喜鹏
田锋平
吴玉虎
Current Assignee
Shanghai Mobvoi Information Technology Co ltd
Original Assignee
Shanghai Mobvoi Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Mobvoi Information Technology Co ltd filed Critical Shanghai Mobvoi Information Technology Co ltd
Priority claimed from application CN202311801737.9A
Publication of CN117765954A

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure provides an audio processing method, an audio processing device, electronic equipment and a storage medium. The audio processing method provided by the disclosure comprises the following steps: converting the audio to be processed to obtain a first spectrogram; inputting the first spectrogram into a pre-trained VQGAN model to obtain a first code; and processing the first code to obtain a first audio token.

Description

Audio processing method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to an audio processing method, an audio processing device, electronic equipment and a storage medium.
Background
With the development of artificial intelligence, audio processing technologies such as speech recognition, speech translation, and speech question answering have advanced rapidly. A key step in audio processing is reducing the dimensionality of a continuous audio signal and discretizing it into audio tokens (discretized representations of the audio). The prior art generally uses an audio encoder to implement this discretization.
However, the number and quality of tokens obtained by processing continuous audio with an audio encoder rarely meet the requirements of an audio processing model: either there are too many tokens, exceeding the model's input limit, or the tokens are too few or of too low quality to support audio processing, resulting in poor processing quality.
Disclosure of Invention
The disclosure provides an audio processing method, an audio processing device, electronic equipment and a storage medium.
According to one aspect of the present disclosure, there is provided an audio processing method including:
converting the audio to be processed to obtain a first spectrogram;
inputting the first spectrogram into a pre-trained VQGAN model to obtain a first code;
and processing the first code to obtain a first audio token.
An audio processing method according to at least one embodiment of the present disclosure further includes:
and inputting the first audio token into a pre-trained audio processing model to obtain a processing result.
According to an audio processing method of at least one embodiment of the present disclosure, the inputting the first audio token into a pre-trained audio processing model includes:
acquiring a first processing instruction of the audio to be processed;
inputting the first processing instruction and the first audio token into a pre-trained audio processing model.
According to an audio processing method of at least one embodiment of the present disclosure, before the inputting the first audio token into the pre-trained audio processing model, the audio processing method further includes:
acquiring a first training audio and a corresponding training processing result;
converting the first training audio to obtain a training spectrogram;
inputting the training spectrogram into the pre-trained VQGAN model to obtain a second code;
processing the second code to obtain a second audio token;
and training an initial processing model by using the second audio token and a corresponding training processing result to obtain the pre-trained audio processing model.
According to an audio processing method of at least one embodiment of the present disclosure, the training an initial processing model using the second audio token and a corresponding training processing result includes:
acquiring a training processing instruction corresponding to a training processing result;
and training an initial processing model by using the training processing instruction, the corresponding second audio token and the training processing result.
According to an audio processing method of at least one embodiment of the present disclosure, before the inputting the first spectrogram into a pre-trained VQGAN model, the audio processing method further includes: processing the first spectrogram according to a preset processing rule to obtain a second spectrogram;
the inputting the first spectrogram into a pre-trained VQGAN model includes: and inputting the second spectrogram into a pre-trained VQGAN model.
According to an audio processing method of at least one embodiment of the present disclosure, before the inputting the first spectrogram into a pre-trained VQGAN model, the audio processing method further includes:
randomly intercepting audio of a preset first duration from the second training audio to obtain a third training audio;
converting the third training audio to obtain a third spectrogram;
and training an initial VQGAN model by using a third spectrogram to obtain the pre-trained VQGAN model.
According to another aspect of the present disclosure, there is provided an audio processing apparatus including:
the speech spectrum conversion module is used for converting the audio to be processed to obtain a first spectrogram;
the discrete processing module is used for inputting the first spectrogram into a pre-trained VQGAN model to obtain a first code;
and the discrete conversion module is used for processing the first code to obtain a first audio token.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: a memory storing execution instructions; and a processor executing the execution instructions stored in the memory, causing the processor to execute the audio processing method according to any one of the embodiments of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a readable storage medium having stored therein execution instructions which, when executed by a processor, are to implement the audio processing method of any one of the embodiments of the present disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of an audio processing method provided in embodiment 1 of the present disclosure.
Fig. 2 is a second flowchart of an audio processing method provided in embodiment 1 of the present disclosure.
Fig. 3 is a third flowchart of an audio processing method provided in embodiment 1 of the present disclosure.
Fig. 4 is a flowchart of a model input method in the audio processing method shown in fig. 2.
Fig. 5 is a flowchart of a model training method in the audio processing method shown in fig. 3.
Fig. 6 is a fourth flowchart of an audio processing method provided in embodiment 1 of the present disclosure.
Fig. 7 is a fifth flowchart of an audio processing method provided in embodiment 1 of the present disclosure.
Fig. 8 is a schematic structural diagram of an audio processing apparatus according to embodiment 2 of the present disclosure.
Fig. 9 is a schematic diagram of a second configuration of the audio processing apparatus according to embodiment 2 of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and the embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant content and not limiting of the present disclosure. It should be further noted that, for convenience of description, only a portion relevant to the present disclosure is shown in the drawings.
In addition, embodiments of the present disclosure and features of the embodiments may be combined with each other without conflict. The technical aspects of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the exemplary implementations/embodiments shown are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Thus, unless otherwise indicated, features of the various implementations/embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concepts of the present disclosure.
The use of cross-hatching and/or shading in the drawings is typically used to clarify the boundaries between adjacent components. As such, the presence or absence of cross-hatching or shading does not convey or represent any preference or requirement for a particular material, material property, dimension, proportion, commonality between illustrated components, and/or any other characteristic, attribute, property, etc. of a component, unless indicated. In addition, in the drawings, the size and relative sizes of elements may be exaggerated for clarity and/or descriptive purposes. While the exemplary embodiments may be variously implemented, the specific process sequences may be performed in a different order than that described. For example, two consecutively described processes may be performed substantially simultaneously or in reverse order from that described. Moreover, like reference numerals designate like parts.
When an element is referred to as being "on" or "over", "connected to" or "coupled to" another element, it can be directly on, connected or coupled to the other element or intervening elements may be present. However, when an element is referred to as being "directly on," "directly connected to," or "directly coupled to" another element, there are no intervening elements present. For this reason, the term "connected" may refer to physical connections, electrical connections, and the like, with or without intermediate components.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to limit the scope of the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising," and variations thereof, are used in the present specification, the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof is described, but the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof is not precluded. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximation terms and not as degree terms, and as such, are used to explain the inherent deviations of measured, calculated, and/or provided values that would be recognized by one of ordinary skill in the art.
The audio processing method of the present disclosure may be applied to an audio processing apparatus of the present disclosure, which may be configured on a server.
The audio processing method of the present disclosure is described in detail below with reference to fig. 1 to 7.
Example 1:
fig. 1 is a flowchart of an audio processing method of one embodiment of the present disclosure.
Referring to fig. 1, an audio processing method M100 of the present embodiment includes:
step S110, converting the audio to be processed to obtain a first spectrogram.
Step S120, inputting the first spectrogram into a pre-trained VQGAN model to obtain a first code corresponding to the spectrogram.
Step S130, converting the first code to obtain a first audio token.
According to the audio processing method, after the audio to be processed is converted into the first spectrogram, the first spectrogram is compressed into the first code by the pre-trained VQGAN model, and the first code is processed to obtain the first audio token. Because the VQGAN model can represent data with fewer values, it effectively compresses continuous information in the latent space while maintaining reconstruction quality. VQGAN-based audio processing therefore reduces the number of audio tokens while preserving quality, so that the audio processing model can support continuous voice input. This solves the prior-art problem that the tokens produced by an audio encoder rarely meet the requirements of an audio processing model: either there are too many tokens, exceeding the model's limit, or they are too few or of too low quality to support audio processing, resulting in poor processing quality.
In some embodiments of the present disclosure, the audio to be processed in step S110 may be speech, singing, program audio, or the like. Step S110 may transform the audio to be processed using the Fourier transform, the fast Fourier transform (FFT), the short-time Fourier transform (STFT), or the like. To enrich the features of the first spectrogram, the conversion in step S110 may further be performed on the Mel scale, in which case the first spectrogram is a Mel spectrogram.
In some embodiments of the present disclosure, the conversion in step S110 may specifically include: first, preprocessing the audio to be processed, including pre-emphasis (e.g., passing the audio through a high-pass filter), framing (dividing the audio into small time blocks, i.e., frames), and windowing (applying a window function to each frame), to obtain a plurality of single-frame signals; transforming the single-frame signals to generate a plurality of spectra; mapping each spectrum to obtain a plurality of mapped spectra; and splicing the mapped spectra along the time dimension to form the complete first spectrogram. Specifically, the spectra are power spectra, and the first spectrogram obtained after passing them through a Mel filter bank is a Mel spectrogram.
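The preprocessing pipeline described above (pre-emphasis, framing, windowing, per-frame power spectrum, Mel filter bank, splicing along time) can be sketched in plain NumPy. The sample rate, FFT length, hop size, and mel-band count below are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=40):
    # Pre-emphasis: high-pass-like filtering of the raw signal
    emph = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # Framing + windowing: split into overlapping frames, apply Hann window
    n_frames = 1 + max(0, (len(emph) - n_fft) // hop)
    frames = np.stack([emph[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hanning(n_fft)
    # Power spectrum of each single-frame signal
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular Mel filter bank mapping each spectrum onto the Mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    # Splicing along time: columns are the mapped frame spectra in order
    return fbank @ power.T

audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
spec = mel_spectrogram(audio)
print(spec.shape)  # (40, 98)
```

The resulting (n_mels, n_frames) array is the "complete first spectrogram" the method feeds to the VQGAN model.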
In some embodiments of the present disclosure, a training process for an initial VQGAN (Vector Quantized Generative Adversarial Network) model may precede step S120. The initial VQGAN model may be trained with training audio; the training improves image reconstruction through the GAN loss function and perceptual error, with the compression model and the image generation model playing an adversarial game and continuously optimizing their respective parameters, so that the generated images approach the real images, finally yielding the pre-trained VQGAN model. The training audio clips may have the same or different lengths.
In some embodiments of the present disclosure, the pre-trained VQGAN model used in step S120 may specifically be the compression model within the pre-trained VQGAN. When the first spectrogram is input into the compression model, the compression model obtains an intermediate code after CNN encoder encoding, and for each position of the intermediate code, the code with the nearest distance is found in the codebook, generating a first code of the same dimension.
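The nearest-code lookup that turns the encoder's intermediate code into the first code can be sketched as below. The codebook size and code dimension are illustrative assumptions, and `quantize` is our name for the operation, not the patent's:

```python
import numpy as np

def quantize(z, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z: (num_positions, dim) continuous intermediate code from the CNN encoder.
    codebook: (num_codes, dim) learned VQGAN codebook.
    Returns (indices, quantized), with quantized the same shape as z.
    """
    # Squared Euclidean distance between every position and every codebook entry
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)       # nearest code per position
    return idx, codebook[idx]    # first code, same dimension as z

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4096, 64))  # assumed codebook size/dim
z = rng.normal(size=(10, 64))           # assumed 10 encoder positions
idx, zq = quantize(z, codebook)
print(idx.shape, zq.shape)  # (10,) (10, 64)
```

The integer indices are what step S130 later maps to discretized token representations.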
In some embodiments of the present disclosure, the processing of the first code in step S130 may be to look up, for each position of the first code, the corresponding discretized representation (token) in a preset index. After the first audio token is obtained in step S130, it may be used in fields such as speech recognition, speech question answering, speech translation, and speech synthesis.
Further, as shown in fig. 2, the audio processing method provided by the present disclosure may further include:
step S140, inputting the first audio token into a pre-trained audio processing model to obtain a processing result.
In some embodiments of the present disclosure, after the audio discretization of the method shown in fig. 1, the first audio token may be processed in step S140. Since the number of first audio tokens is reduced while quality is guaranteed, the audio processing model in step S140 can support processing of longer continuous audio. The audio processing model may comprise one or more processing-mode models, including speech recognition, speech question answering, speech translation, speech synthesis, and the like; the audio processing model may also be preset with a vocabulary mapping audio to corresponding tokens.
At this time, as shown in fig. 3, before step S140, it may further include:
step S131, obtaining a first training audio and a corresponding training processing result.
Step S132, converting the first training audio to obtain a training spectrogram.
Step S133, inputting the training spectrogram into a pre-trained VQGAN model to obtain a second code.
Step S134, the second code is processed to obtain a second audio token.
And step S135, training the initial processing model by using the second audio token and the corresponding training processing result to obtain a pre-trained audio processing model.
In some embodiments of the present disclosure, the conversion process in step S132 is similar to step S110, the second encoding process obtained in step S133 is similar to step S120, and the encoding process in step S134 is similar to step S130, which will not be described in detail.
In some embodiments of the present disclosure, when the audio processing model is a single processing model, the audio token may be processed directly. When the audio processing model comprises two or more processing models, processing may be performed based on a processing instruction. In this case, as shown in fig. 4, step S140 may include: step S1401, acquiring a first processing instruction for the audio to be processed; and step S1402, inputting the first processing instruction and the first audio token into the pre-trained audio processing model. After the audio processing model receives the first processing instruction and the first audio token, it can execute the processing corresponding to the first processing instruction on the first audio token. The first processing instruction may be the original processing instruction input by the user, or a processing identifier corresponding to the original processing instruction; in the latter case, the audio processing model may further be preset with a vocabulary mapping processing instructions to corresponding processing identifiers.
In some embodiments of the present disclosure, in order for the audio processing model to support multiple processing procedures, to enhance model generalization capability, and to train the initial processing model, as shown in fig. 5, step S135 may include: step S1351, obtaining a training processing instruction corresponding to the training processing result; step S1352, training the initial processing model using the training processing instructions, the corresponding second audio token, and the training processing results.
Further, as shown in fig. 6, to ensure the unified specification and integrity of the spectrogram, the method further includes, before step S120:
and step S111, processing the first spectrogram according to a preset processing rule to obtain a second spectrogram.
At this time, step S120 specifically includes: and inputting the second spectrogram into a pre-trained VQGAN model to obtain a first code.
In some embodiments of the present disclosure, the processing in step S111 may be a normalization procedure, such as normalizing the first spectrogram according to a preset dB value or the minimum dB value of the first spectrogram.
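One plausible form of the dB normalization of step S111 is sketched below; the -80 dB floor and max-referenced scaling are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def normalize_db(spec, ref_db=None, min_db=-80.0):
    """Normalize a power spectrogram into [0, 1] on a dB scale.

    ref_db: reference level; defaults to the spectrogram's own maximum.
    min_db: floor below which values are clipped (assumed -80 dB).
    """
    # Power -> dB, with a small epsilon to avoid log of zero
    db = 10.0 * np.log10(np.maximum(spec, 1e-10))
    ref = db.max() if ref_db is None else ref_db
    db = np.clip(db - ref, min_db, 0.0)
    return (db - min_db) / -min_db  # 0 at the floor, 1 at the reference

spec = np.abs(np.random.default_rng(1).normal(size=(40, 98))) ** 2
norm = normalize_db(spec)
print(norm.min() >= 0.0, norm.max() <= 1.0)  # True True
```

The unified [0, 1] range gives the VQGAN model the consistent input specification the paragraph above motivates.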
Further, in order to make the image generated by the VQGAN model more similar to the real image, as shown in fig. 7, before step S120, the method may further include:
Step S112, randomly intercepting audio of a preset first duration from the second training audio to obtain a third training audio.
Step S113, converting the third training audio to obtain a third spectrogram.
Step S114, training an initial VQGAN model by using the third spectrogram to obtain a pre-trained VQGAN model.
In some embodiments of the present disclosure, in step S112, the third training audio segments may be correlated with each other, or such correlation may be introduced by setting a suitable first duration, so that when the initial VQGAN model is trained, it can better learn the correlations within the audio, and the generated images are closer to real images. Here, two audio segments are correlated when they overlap in time.
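The random interception of step S112 can be sketched as follows; `random_crops`, the crop length, and the crop count are illustrative assumptions. With a crop length that is large relative to the audio, crops are likely to overlap, giving the correlated training segments mentioned above:

```python
import numpy as np

def random_crops(audio, crop_len, n_crops, rng=None):
    """Randomly intercept n_crops fixed-length segments from one training audio."""
    if rng is None:
        rng = np.random.default_rng()
    # Uniformly random start positions; segments may overlap (be correlated)
    starts = rng.integers(0, len(audio) - crop_len + 1, size=n_crops)
    return [audio[s : s + crop_len] for s in starts]

audio = np.arange(16000, dtype=np.float32)  # stand-in for 1 s of training audio
crops = random_crops(audio, crop_len=8000, n_crops=4,
                     rng=np.random.default_rng(42))
print(len(crops), len(crops[0]))  # 4 8000
```

Each crop would then be converted to a spectrogram (step S113) and used to train the initial VQGAN model (step S114).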
In some embodiments of the present disclosure, the conversion process in step S113 is similar to step S110, and the third spectrogram used in the training in step S114 may also be a processed spectrogram, and the processing process is similar to step S111, which is not described herein.
In some embodiments of the present disclosure, when a plurality of data items are input into a model, they may be spliced together before input in order to represent the association between them.
Taking automatic speech recognition (ASR) technology as an example, the data input into the pre-trained audio processing model includes a processing instruction and audio tokens, which may take the form:
{"text": "[ASR]<|690|><|690|><|690|><|1337|><|2396|><|1851|><|3765|><|434|><|1190|><|942|><|2263|><|858|><|2331|><|413|><|3157|><|4083|><|1107|><|1305|><|3524|><|3486|><|1440|><|3589|><|2407|><|815|><|858|><|868|><|1367|><|2322|><|3323|><|3109|><|193|><|727|><|3731|><|3535|><|3126|><|2159|><|694|><|2102|><|1236|>"}
In training, the training data used includes a processing instruction, audio tokens, and a processing result, which may take the form:
{"text": "[ASR]<|690|><|690|><|690|><|1337|><|2396|><|1851|><|3765|><|434|><|1190|><|942|><|2263|><|858|><|2331|><|413|><|3157|><|4083|><|1107|><|1305|><|3524|><|3486|><|1440|><|3589|><|2407|><|815|><|858|><|868|><|1367|><|2322|><|3323|><|3109|><|193|><|727|><|3731|><|3535|><|3126|><|2159|><|694|><|2102|><|1236|><|AS|>I'll send you back home"}
wherein <|AS|> is a connector between the audio tokens and the processing result.
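The prompt format above (an instruction tag followed by "<|id|>" audio tokens) can be reproduced with a small helper; `build_asr_prompt` is a hypothetical name for illustration, not part of the patent:

```python
def build_asr_prompt(token_ids, instruction="[ASR]"):
    """Splice a processing instruction with discretized audio tokens.

    token_ids: integer codebook indices from the VQGAN quantization step.
    instruction: the processing-instruction tag (assumed "[ASR]" here).
    """
    # Render each index in the "<|id|>" vocabulary form and concatenate
    return instruction + "".join(f"<|{t}|>" for t in token_ids)

prompt = build_asr_prompt([690, 690, 1337])
print(prompt)  # [ASR]<|690|><|690|><|1337|>
```

For training data, the processing result would be appended after a "<|AS|>" connector in the same string.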
Example 2:
fig. 8-9 illustrate exemplary diagrams of audio processing devices employing hardware implementations of processing systems.
The apparatus may include corresponding modules that perform the steps of the flowcharts described above. Thus, each step or several steps in the flowcharts described above may be performed by respective modules, and the apparatus may include one or more of these modules. A module may be one or more hardware modules specifically configured to perform the respective steps, or be implemented by a processor configured to perform the respective steps, or be stored within a computer-readable medium for implementation by a processor, or be implemented by some combination.
The hardware architecture may be implemented using a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. Bus 1100 connects together various circuits including one or more processors 1200, memory 1300, and/or hardware modules. Bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
Bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, only one connection line is shown in the figure, but this does not mean there is only one bus or one type of bus.
Any process or method description in a flowchart or otherwise described herein may be understood as: a module, segment, or portion of code, which comprises one or more executable instructions for implementing the steps of a specified logical function(s) or process (es). The scope of the preferred embodiments of the present disclosure may include other implementations in which functions may be performed out of the order described, for example, in a substantially simultaneous manner or in an opposite order depending on the function involved, as would be understood by one of skill in the art. The processor may be used to perform the various methods and processes described above. For example, method embodiments in the present disclosure may be implemented as a software program stored on a computer readable storage medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via memory and/or a communication interface. One or more of the steps of the methods described above may be performed when the software program is loaded and executed by a processor. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
Logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The readable storage medium may even be paper or another suitable medium on which the program is printed, as the program can be captured electronically, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the methods of the embodiments described above may be implemented by a program to instruct related hardware. The program may be stored in a readable storage medium. The program, when executed, includes one or a combination of steps for implementing the method.
Furthermore, each functional unit in each embodiment of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. The storage medium may be a read-only memory, a magnetic disk or optical disk, etc.
As shown in fig. 8, an audio processing apparatus 1000 according to the present disclosure may include:
the speech spectrum conversion module 1010 is configured to convert the audio to be processed to obtain a first spectrum chart.
The discrete processing module 1020 is configured to input the first spectrogram into a pre-trained VQGAN model to obtain a first code.
The discrete conversion module 1030 is configured to process the first code to obtain a first audio token.
Further, the audio processing apparatus provided by the present disclosure may further include:
the audio processing module 1040 is configured to input the first audio token into a pre-trained audio processing model, and obtain a processing result.
Wherein, the audio processing module 1040 may include:
the first instruction submodule 10401 is configured to obtain a first processing instruction of the audio to be processed.
A first processing submodule 10402 for inputting a first processing instruction and a first audio token into a pre-trained audio processing model.
At this time, the audio processing apparatus may further include:
the first obtaining module 1050 is configured to obtain the first training audio and the corresponding training processing result.
The first conversion module 1060 is configured to convert the first training audio to obtain a training spectrogram.
A first input module 1070 is configured to input the training spectrogram into a pre-trained VQGAN model to obtain a second code.
The first processing module 1080 is configured to process the second code to obtain a second audio token.
The first training module 1090 is configured to train the initial processing model by using the second audio token and the corresponding training processing result, so as to obtain a pre-trained audio processing model.
Wherein, the first training module 1090 may include:
training instruction submodule 10901 is configured to obtain a training processing instruction corresponding to the training processing result.
The processing training submodule 10902 is configured to train the initial processing model using the training processing instruction, the corresponding second audio token, and the training processing result.
Further, the audio processing apparatus provided by the present disclosure may further include:
The standard processing module 1091 is configured to process the first spectrogram according to a preset processing rule to obtain a second spectrogram.
At this time, the discrete processing module 1020 is specifically configured to input the second spectrogram into the pre-trained VQGAN model to obtain the first code.
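The disclosure does not specify the preset processing rule. One plausible rule is normalizing the spectrogram to a fixed number of time frames, so the VQGAN always receives a constant input shape; the target length below is an assumed value:

```python
import numpy as np

def standardize(spec: np.ndarray, target_frames: int = 128) -> np.ndarray:
    """Pad with zeros or crop along the time axis to a fixed frame count."""
    t = spec.shape[0]
    if t >= target_frames:
        return spec[:target_frames]
    pad = np.zeros((target_frames - t, spec.shape[1]), dtype=spec.dtype)
    return np.concatenate([spec, pad], axis=0)

short = standardize(np.ones((50, 80)))   # padded up to 128 frames
long_ = standardize(np.ones((200, 80)))  # cropped down to 128 frames
```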
Further, the audio processing apparatus provided by the present disclosure may further include:
the audio capturing module 1092 is configured to randomly capture audio from the second training audio according to a preset first time, so as to obtain a third training audio.
The second conversion module 1093 is configured to convert the third training audio to obtain a third spectrogram.
The second training module 1094 is configured to train the initial VQGAN model using the third spectrogram to obtain the pre-trained VQGAN model.
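The random interception step can be sketched as a random fixed-duration crop of the training waveform; the sample rate and one-second duration here are illustrative, not values from the disclosure:

```python
import numpy as np

def random_crop(audio: np.ndarray, sr: int, seconds: float,
                rng: np.random.Generator) -> np.ndarray:
    """Randomly intercept a segment of the given duration; audio shorter
    than the preset duration is returned unchanged."""
    n = int(sr * seconds)
    if len(audio) <= n:
        return audio
    start = rng.integers(0, len(audio) - n + 1)
    return audio[start:start + n]

rng = np.random.default_rng(42)
second_training_audio = np.arange(16000 * 3, dtype=np.float32)  # 3 s at 16 kHz
third_training_audio = random_crop(second_training_audio, 16000, 1.0, rng)
```

Random crops expose the VQGAN to varied segments of each recording, which is a common way to augment limited training audio.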
In some embodiments of the present disclosure, the process of implementing audio processing through the above modules/sub-modules is similar to that provided in Embodiment 1 and will not be repeated here. Fig. 9 shows the structure of the audio processing apparatus including all of the above modules/sub-modules; when the apparatus includes only some of the modules/sub-modules, its structure is similar to that shown in Fig. 9 and will not be repeated here.
After converting the audio to be processed into the first spectrogram, the audio processing apparatus compresses the first spectrogram into the first code through the pre-trained VQGAN model, and processes the first code to obtain the first audio token. Because the VQGAN model can represent data with fewer values, it effectively compresses continuous information in the latent space while maintaining reconstruction quality. The VQGAN-based audio processing therefore reduces the number of audio tokens while preserving quality, so that the audio processing model can support continuous voice input. The apparatus thereby addresses a shortcoming of the prior art, in which tokens obtained by processing continuous audio with a conventional audio encoder rarely satisfy the requirements of an audio processing model: either their number far exceeds the model's input limit, or their quality is too low, so that audio processing is unsupported or of poor quality.
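The claimed benefit, fewer tokens at comparable quality, can be illustrated with back-of-the-envelope arithmetic; every number below (sample rate, hop size, downsampling factor) is hypothetical, not taken from the disclosure:

```python
# 10 s of 16 kHz audio with a 10 ms (160-sample) hop → 1000 spectrogram frames.
frames = int(10 * 16000 / 160)

# A VQGAN encoder that downsamples the time axis by 8x emits 125 tokens,
# versus ~1000 if a plain frame-level audio encoder emitted one token per frame.
vqgan_tokens = frames // 8
plain_tokens = frames
ratio = plain_tokens / vqgan_tokens
```

Under these assumptions the token count drops by a factor of eight, which is the kind of reduction that lets a token-limited audio processing model accept longer continuous speech.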
The present disclosure also provides an electronic device, including: a memory storing execution instructions; and a processor or other hardware module that executes the execution instructions stored in the memory, so as to perform the audio processing method described above.
The present disclosure also provides a readable storage medium having stored therein execution instructions which, when executed by a processor, implement the audio processing method described above.
In the description of this specification, reference to the terms "one embodiment/example," "some embodiments/examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, persons skilled in the art may combine the various embodiments or examples described in this specification, and the features thereof, provided they do not contradict each other.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
It will be appreciated by those skilled in the art that the above-described embodiments are merely for clarity of illustration of the disclosure, and are not intended to limit the scope of the disclosure. Other variations or modifications will be apparent to persons skilled in the art from the foregoing disclosure, and such variations or modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. An audio processing method, comprising:
converting the audio to be processed to obtain a first spectrogram;
inputting the first spectrogram into a pre-trained VQGAN model to obtain a first code; and
processing the first code to obtain a first audio token.
2. The audio processing method according to claim 1, characterized by further comprising:
inputting the first audio token into a pre-trained audio processing model to obtain a processing result.
3. The audio processing method of claim 2, wherein said inputting the first audio token into a pre-trained audio processing model comprises:
acquiring a first processing instruction of the audio to be processed; and
inputting the first processing instruction and the first audio token into a pre-trained audio processing model.
4. The audio processing method according to claim 2 or 3, further comprising, prior to said inputting the first audio token into a pre-trained audio processing model:
acquiring a first training audio and a corresponding training processing result;
converting the first training audio to obtain a training spectrogram;
inputting the training spectrogram into the pre-trained VQGAN model to obtain a second code;
processing the second code to obtain a second audio token; and
training an initial processing model using the second audio token and the corresponding training processing result to obtain the pre-trained audio processing model.
5. The audio processing method of claim 4, wherein training an initial processing model using the second audio token and the corresponding training processing results comprises:
acquiring a training processing instruction corresponding to the training processing result; and
training the initial processing model using the training processing instruction, the corresponding second audio token, and the training processing result.
6. An audio processing method according to any one of claims 1 to 3, characterized in that,
before the inputting the first spectrogram into the pre-trained VQGAN model, the method further comprises: processing the first spectrogram according to a preset processing rule to obtain a second spectrogram;
the inputting the first spectrogram into a pre-trained VQGAN model includes: and inputting the second spectrogram into a pre-trained VQGAN model.
7. The audio processing method according to any one of claims 1 to 3, characterized by further comprising, before said inputting the first spectrogram into a pre-trained VQGAN model:
randomly intercepting audio from a second training audio according to a preset first duration to obtain a third training audio;
converting the third training audio to obtain a third spectrogram; and
training an initial VQGAN model using the third spectrogram to obtain the pre-trained VQGAN model.
8. An audio processing apparatus, comprising:
a speech spectrum conversion module configured to convert the audio to be processed to obtain a first spectrogram;
a discrete processing module configured to input the first spectrogram into a pre-trained VQGAN model to obtain a first code; and
a discrete conversion module configured to process the first code to obtain a first audio token.
9. An electronic device, comprising:
a memory storing execution instructions; and
a processor that executes the execution instructions stored in the memory to implement the audio processing method of any one of claims 1 to 7.
10. A readable storage medium having stored therein execution instructions which, when executed by a processor, implement the audio processing method of any one of claims 1 to 7.
CN202311801737.9A 2023-12-25 2023-12-25 Audio processing method, device, electronic equipment and storage medium Pending CN117765954A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311801737.9A CN117765954A (en) 2023-12-25 2023-12-25 Audio processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117765954A true CN117765954A (en) 2024-03-26

Family

ID=90310201

Country Status (1)

Country Link
CN (1) CN117765954A (en)

Similar Documents

Publication Publication Date Title
CN110197658B (en) Voice processing method and device and electronic equipment
CN111309883B (en) Man-machine dialogue method based on artificial intelligence, model training method and device
KR100391243B1 (en) System and method for generating and using context dependent sub-syllable models to recognize a tonal language
CN113658583B (en) Ear voice conversion method, system and device based on generation countermeasure network
CN113823264A (en) Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment
KR102272554B1 (en) Method and system of text to multiple speech
CN112542169A (en) Voice recognition processing method and device
KR20190135853A (en) Method and system of text to multiple speech
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
CN109741761B (en) Sound processing method and device
CN112163084A (en) Question feedback method, device, medium and electronic equipment
CN117765954A (en) Audio processing method, device, electronic equipment and storage medium
CN111863009A (en) Training method and system of context information prediction model
CN115132182B (en) Data identification method, device, equipment and readable storage medium
CN115240696B (en) Speech recognition method and readable storage medium
CN115497451A (en) Voice processing method and device, electronic equipment and storage medium
Diwan et al. Reduce and reconstruct: ASR for low-resource phonetic languages
CN112259110B (en) Audio encoding method and device and audio decoding method and device
CN112750469A (en) Method for detecting music in voice, voice communication optimization method and corresponding device
CN114400006B (en) Speech recognition method and device
CN117238277B (en) Intention recognition method, device, storage medium and computer equipment
CN112820313B (en) Model training method, voice separation method and device and electronic equipment
CN115132204B (en) Voice processing method, equipment, storage medium and computer program product
CN115346516A (en) End-to-end voice confidence estimation method and device in cross-domain voice recognition
CN116484314A (en) Multi-mode data processing method and language intelligent system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination