CN117292705A - Audio processing method, device, electronic equipment and storage medium


Info

Publication number
CN117292705A
Authority
CN
China
Prior art keywords
audio data
processed
acoustic feature
model
audio processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210689173.3A
Other languages
Chinese (zh)
Inventor
汤本来
李忠豪
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202210689173.3A priority Critical patent/CN117292705A/en
Publication of CN117292705A publication Critical patent/CN117292705A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application provides an audio processing method, an audio processing apparatus, an electronic device, and a storage medium. The audio processing method performs audio processing based on a pre-trained acoustic feature extraction model and a pre-trained audio processing model. The acoustic feature extraction model can extract acoustic features from the user's audio data, and through the arrangement of its output mode, the extracted acoustic features retain certain pronunciation-mode features in addition to the speech content, so they can be better used for audio processing. The audio processing model performs audio processing based on the acoustic features extracted by the acoustic feature extraction model, thereby converting the audio data of the user's voice input into audio data with a target tone. According to this scheme, the user can use the audio processing service normally, effectively, and efficiently in an offline state.

Description

Audio processing method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an audio processing method, an audio processing device, an electronic device, and a storage medium.
Background
With the development of multimedia communication technology and artificial intelligence, speech synthesis and speech recognition have become key technologies for human-machine speech communication. In some application scenarios, the voice input by a user needs to be processed by an audio processing technique because of specific application requirements such as confidentiality and personalization.
In the related art, the above audio processing technique may be implemented by a machine learning model, and the model for implementing audio processing is often deployed on the network side, so the user must rely on a network communication service and use the audio processing service online. However, network communication services generally suffer from latency and instability, and in some scenarios the user may not be able to use a network communication service at all, so the user cannot use the audio processing service normally and effectively.
Disclosure of Invention
In view of the foregoing, an object of the present application is to provide an audio processing method, an apparatus, an electronic device, and a storage medium.
Based on the above object, the present application provides an audio processing method, including:
acquiring audio data to be processed;
inputting the audio data to be processed into a pre-trained acoustic feature extraction model to obtain acoustic features to be processed corresponding to the audio data to be processed; the acoustic features to be processed comprise pronunciation mode features of at least part of the audio data to be processed;
and inputting the acoustic features to be processed into a pre-trained audio processing model to obtain processed audio data which has a target tone and corresponds to the audio data to be processed.
Based on the same technical concept, the present application further provides an audio processing apparatus, including:
the acquisition module is configured to acquire audio data to be processed;
the extraction module is configured to input the audio data to be processed into a pre-trained acoustic feature extraction model to obtain acoustic features to be processed corresponding to the audio data to be processed; the acoustic features to be processed comprise pronunciation mode features of at least part of the audio data to be processed;
the processing module is configured to input the acoustic features to be processed into a pre-trained audio processing model to obtain processed audio data which has a target tone and corresponds to the audio data to be processed.
Based on the same technical concept, the application also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the method according to any one of the above.
Based on the same technical idea, the present application also provides a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as set forth in any one of the above.
From the above, it can be seen that the audio processing method, apparatus, electronic device, and storage medium provided by the present application address the requirements of the offline state by correspondingly constructing a lightweight acoustic feature extraction model and a lightweight audio processing model, so that the user can use the audio processing service normally, effectively, and efficiently in an offline state.
Drawings
In order to more clearly illustrate the technical solutions of the present application or the related art, the drawings required in the description of the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained from these drawings by those of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic flowchart of an audio processing method according to an embodiment of the present application;
Fig. 2 is a flowchart of a training method of an acoustic feature extraction model according to an embodiment of the present application;
Fig. 3 is a flowchart of a training method of an audio processing model according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
It can be appreciated that, before the technical solutions disclosed in the embodiments of the present application are used, the user should be informed of the type, the scope of use, the usage scenarios, and the like of the personal information involved, and the user's authorization should be obtained, in an appropriate manner in accordance with the relevant laws and regulations.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly inform the user that the requested operation will require acquiring and using the user's personal information. The user can thus autonomously decide, according to the prompt information, whether to provide personal information to the software or hardware, such as the electronic device, application, server, or storage medium, that performs the operations of the technical solution.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user, for example, in the form of a pop-up window, in which the prompt information may be presented as text. In addition, the pop-up window may carry a selection control allowing the user to choose "agree" or "disagree" to provide personal information to the electronic device.
It will be appreciated that the above process of notification and obtaining user authorization is merely illustrative and does not limit the implementation of the present application; other ways of satisfying the relevant laws and regulations may also be applied to the implementation of the present application.
It will be appreciated that the data involved in the present technical solution (including but not limited to the data itself and the acquisition or use of the data) should comply with the corresponding laws, regulations, and related provisions.
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings.
The principles and spirit of the present application will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present application and are not intended to limit the scope of the present application in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
According to an embodiment of the application, an audio processing method, an audio processing device, electronic equipment and a storage medium are provided.
In this document, it should be understood that any number of elements in the drawings is for illustration and not limitation, and that any naming is used only for distinction and not for any limitation.
The principles and spirit of the present application are explained in detail below with reference to several representative embodiments thereof.
In the related art, since the model for implementing audio processing is deployed on the network side (typically on a server located in the cloud), in some cases the user cannot use the audio processing service normally and effectively. For example, when the network communication service is delayed or unstable, the user cannot obtain the processed audio immediately; as another example, the user cannot use the audio processing service at all in an area not covered by the network communication service, or when the user has not purchased a network communication service.
To address the foregoing problems, the present application provides an audio processing scheme. The audio processing method of the embodiments of the present application performs audio processing based on a pre-trained acoustic feature extraction model and a pre-trained audio processing model. The acoustic feature extraction model is obtained by making a lightweight modification to a speech recognition model in the related art and can extract acoustic features from the user's voice data; through the arrangement of its output mode, the extracted acoustic features retain certain pronunciation-mode features in addition to the speech content, so they can be better used for audio processing, and the acoustic feature extraction model after the lightweight modification can meet the requirements of deployment and operation on a terminal device in an offline state. The audio processing model converts the audio data of the user's voice input into audio data with a target tone based on the acoustic features extracted by the acoustic feature extraction model, and thus, together with the acoustic feature extraction model, implements audio processing in an offline state. According to this scheme, the lightweight acoustic feature extraction model and audio processing model are provided for the requirements of the offline state, so that the user can use the audio processing service normally, effectively, and efficiently while offline.
First, an embodiment of the present application provides an audio processing method. The audio processing method is applied to a terminal device, which includes, but is not limited to, a desktop computer, a mobile phone, a laptop computer, a tablet computer, a media player, a smart wearable device, a personal digital assistant (PDA), or other electronic equipment capable of implementing the functions. In the embodiments of the present application, the terminal device is further provided with a pre-trained acoustic feature extraction model and a pre-trained audio processing model.
Referring to fig. 1, the audio processing method of the present embodiment may include the steps of:
step S101, obtaining audio data to be processed.
In a specific implementation, audio data input by the user is acquired through an audio acquisition component (such as a microphone) of the terminal device; in the embodiments of the present application, this audio data is referred to as the audio data to be processed. The tone of the audio data to be processed is the tone of the current user.
Step S102, inputting the audio data to be processed into a pre-trained acoustic feature extraction model to obtain acoustic features to be processed corresponding to the audio data to be processed; the acoustic features to be processed include a voicing mode feature of at least a portion of the audio data to be processed.
In a specific implementation, the acquired audio data to be processed is input into the pre-trained acoustic feature extraction model to obtain the acoustic features, output by the acoustic feature extraction model, that correspond to the audio data to be processed; in the embodiments of the present application, the acoustic features generated by the acoustic feature extraction model from the audio data to be processed are referred to as the acoustic features to be processed. Based on the characteristics of the acoustic feature extraction model of the embodiments of the present application, the acoustic features to be processed relatively completely characterize the speech content of the audio data to be processed while also including features reflecting pronunciation modes such as accent and the length of sounds. For other specific implementations of the acoustic feature extraction model, reference may be made to the following embodiments.
Step S103, inputting the acoustic features to be processed into a pre-trained audio processing model to obtain processed audio data which has a target tone and corresponds to the audio data to be processed.
In a specific implementation, the acoustic features to be processed are input into the pre-trained audio processing model to obtain the audio data, output by the audio processing model, that corresponds to the audio data to be processed. Based on the characteristics of the audio processing model of the embodiments of the present application, the speech content of the processed audio data is the same as that of the audio data to be processed, while its tone is converted from the tone of the current user into the target tone.
In a specific implementation, both the acoustic feature extraction model and the audio processing model are trained in advance and are lightweight, so they can match the computing resources of the terminal device; therefore, the audio processing method of the embodiments of the present application can work in an offline state and does not depend on network communication services.
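As an illustration only (not part of the patent disclosure), the following minimal sketch shows how the two-stage pipeline of steps S101 to S103 might be run entirely on a terminal device. It assumes both models have already been trained and exported as TorchScript files; all file names, tensor shapes, and the choice of PyTorch/torchaudio are assumptions.

```python
import torch
import torchaudio

# Hypothetical file names; both models are assumed to be pre-trained
# and exported to TorchScript for on-device (offline) inference.
feature_extractor = torch.jit.load("feature_extractor.pt").eval()
audio_processor = torch.jit.load("audio_processor.pt").eval()

def convert_timbre(input_wav: str, output_wav: str) -> None:
    # Step S101: acquire the audio data to be processed.
    waveform, sample_rate = torchaudio.load(input_wav)

    with torch.no_grad():
        # Step S102: extract the acoustic features to be processed,
        # which keep the speech content plus pronunciation-mode features.
        features = feature_extractor(waveform)

        # Step S103: generate processed audio data with the target tone
        # (assumed to come back as a (channels, samples) tensor).
        processed = audio_processor(features)

    torchaudio.save(output_wav, processed.cpu(), sample_rate)

convert_timbre("user_input.wav", "converted.wav")
```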
As an optional embodiment, the audio data to be processed may also be deleted after it has been input into the acoustic feature extraction model. Deleting the audio data to be processed immediately after use saves storage space on the terminal device and further reduces the resources occupied by the terminal device when the audio processing method of the embodiments of the present application is carried out.
With the audio processing method of this embodiment, based on the lightweight acoustic feature extraction model and audio processing model, the user can use the audio processing service normally, effectively, and efficiently in an offline state.
As an optional implementation manner, the embodiment of the application further includes a training method of the acoustic feature extraction model.
Referring to fig. 2, the training method of the acoustic feature extraction model of the present embodiment may include the following steps:
Step S201, acquiring first audio data.
In a specific implementation, in order to construct a training data set for training the acoustic feature extraction model, a certain amount of audio data whose content is human voice needs to be acquired; in this embodiment, this audio data is referred to as the first audio data.
Step S202, generating first acoustic features corresponding to the first audio data according to a predetermined speech recognition model; the first acoustic feature is output by any hidden layer of the speech recognition model that is close to the output layer.
In a specific implementation, any speech recognition model in the related art may be selected as the predetermined speech recognition model of the embodiments of the present application; for example, the speech recognition model may be DFSMN, Tacotron, FastSpeech, FastSpeech 2, FastPitch, or the like. The speech recognition model can perform speech recognition on audio data whose content is human voice to obtain the acoustic features corresponding to the audio data. The type of acoustic feature used may be chosen freely according to the selected speech recognition model or the implementation requirements; for example, it may be one or more of MFCC (Mel-frequency cepstral coefficients), FBANK (Mel filter-bank energy features), pitch (fundamental frequency), voicing flag features, or any other type of acoustic feature.
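For illustration only, conventional frame-level features of the kinds listed above can be computed from a waveform with the librosa library as shown below; the file name, sampling rate, and analysis parameters are arbitrary example values and are not part of the patent.

```python
import librosa

# Load a mono waveform (the path and sampling rate are placeholders).
y, sr = librosa.load("first_audio.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)              # MFCC
fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)   # Mel filter-bank energies
f0, voiced_flag, _ = librosa.pyin(y,
                                  fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C7"))  # pitch and voicing flags
```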
In a specific implementation, the first audio data obtained in the foregoing step is input into the speech recognition model and subjected to speech recognition processing, and the acoustic features obtained from the speech recognition model are referred to as the first acoustic features. A speech recognition model generally comprises an input layer, an output layer, and several hidden layers that are located between the input layer and the output layer and are sequentially fully connected. Each hidden layer includes a number of neurons for extracting deep features of the data. The output layer performs regression or classification on the feature data output by the hidden layers to obtain the finally output acoustic features.
In a specific implementation, the embodiments of the present application do not select the output of the final output layer of the speech recognition model as the first acoustic feature, but instead use the output of a hidden layer close to the output layer of the speech recognition model as the first acoustic feature. The design concept of this arrangement is as follows: the output of the final output layer of the speech recognition model often includes only the speech content features of the first audio data and no longer includes the pronunciation-mode features of the speech in the first audio data. For example, suppose the first audio data is speech with the content "hello" uttered in the tone of an adult man; after speech recognition by the speech recognition model, the features output by the output layer include only the feature data of the speech content "hello", and no longer include the feature data of the original adult man's tone or of pronunciation modes such as accent and the length of sounds. However, the audio processing service to be implemented in the embodiments of the present application only needs to change the tone of the audio data, while keeping the speech content and the pronunciation mode unchanged. Based on the characteristics of a general speech recognition model, the outputs of the several hidden layers close to the output layer include both speech content features and pronunciation-mode features. Therefore, in the embodiments of the present application, the output of a hidden layer close to the output layer of the speech recognition model is selected as the first acoustic feature.
In a specific implementation, the output of any hidden layer close to the output layer of the speech recognition model can be selected as the first acoustic feature according to the implementation requirements. With this arrangement, the first acoustic feature obtained in the embodiments of the present application can characterize the speech content more completely while also reflecting pronunciation modes such as accent and the length of sounds. As an optional implementation, considering that the output of the last hidden layer of the speech recognition model, i.e. the hidden layer adjacent to the output layer, can better take both the speech content features and the pronunciation-mode features into account, the step of generating the first acoustic feature may be: inputting the first audio data into the speech recognition model, and extracting the output of the hidden layer of the speech recognition model that is immediately adjacent to the output layer as the first acoustic feature corresponding to the first audio data.
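For illustration only, the sketch below shows one way such a hidden-layer tap could be implemented in PyTorch with a forward hook. The toy model structure, its layer names, and the tensor shapes are assumptions standing in for whichever speech recognition model is actually chosen; they are not part of the patent.

```python
import torch
import torch.nn as nn

class ToyASRModel(nn.Module):
    """A stand-in for the predetermined speech recognition model."""
    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=10, out_dim=5000):
        super().__init__()
        self.input_layer = nn.Linear(feat_dim, hidden_dim)
        self.hidden_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
             for _ in range(num_layers)]
        )
        self.output_layer = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        x = self.input_layer(x)
        for layer in self.hidden_layers:
            x = layer(x)
        return self.output_layer(x)

asr_model = ToyASRModel().eval()
captured = {}

# Tap the last hidden layer (the one adjacent to the output layer): its output
# keeps the speech content plus pronunciation-mode information.
def save_hidden_output(module, inputs, output):
    captured["first_acoustic_feature"] = output.detach()

asr_model.hidden_layers[-1].register_forward_hook(save_hidden_output)

# Frame-level representation of one piece of first audio data (placeholder values).
first_audio = torch.randn(1, 200, 80)
with torch.no_grad():
    asr_model(first_audio)

first_acoustic_feature = captured["first_acoustic_feature"]  # label data for training
```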
Step S203, generating a first training data set according to the first audio data and the first acoustic feature.
In a specific implementation, a training data set is generated based on the first audio data obtained in the previous step and the first acoustic features corresponding to the first audio data; this training data set is used later for training the acoustic feature extraction model and is referred to in this embodiment as the first training data set. The first training data set includes a certain number of training samples, and each training sample includes input data and corresponding label data. In the embodiments of the present application, the training samples of the first training data set take the first audio data as input data and the corresponding first acoustic features as label data, i.e. the sample form of the first training data set is "first audio data → first acoustic feature".
Step S204, obtaining an acoustic feature extraction model; the size of the acoustic feature extraction model is smaller than the size of the speech recognition model.
In a specific implementation, a pre-constructed acoustic feature extraction model is obtained; the acoustic feature extraction model can be obtained by making a lightweight modification to the speech recognition model selected in the embodiments of the present application. The model size of the acoustic feature extraction model after the lightweight modification is far smaller than that of the speech recognition model, the purpose being to allow the lightweight acoustic feature extraction model to match the computing resources of the terminal device, such as its memory and processor, so that it can work normally in an offline state relying only on the computing resources of the terminal device.
In a specific implementation, the main content of the lightweight modification of the acoustic feature extraction model is to reduce the number of hidden layers and the number of neurons per hidden layer while keeping the overall structural form of the model unchanged. The reduction in hidden layers and in the neurons per hidden layer can be set according to the computing resource configuration of the terminal device to which the method of the present application is applied, or according to other implementation requirements. As an optional embodiment, considering the computing resource configuration of most current terminal devices, the number of hidden layers of the acoustic feature extraction model may be set to no more than one fifth of the number of hidden layers of the speech recognition model; and for any hidden layer of the acoustic feature extraction model, the number of neurons it includes may be set to no more than one quarter of the number of neurons in any hidden layer of the speech recognition model.
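Continuing the illustrative sketch above (again an assumption, not the patent's concrete implementation), a lightweight acoustic feature extraction model that respects these ratios could be configured as follows, reusing the ToyASRModel structure with fewer and narrower hidden layers and an output layer sized to regress the first acoustic features.

```python
# Teacher (speech recognition model): 10 hidden layers of 512 neurons each.
# Student (acoustic feature extraction model): same overall structure, but
# 10 / 5 = 2 hidden layers and 512 / 4 = 128 neurons per hidden layer; the
# output layer is sized to 512 to regress the first acoustic features.
feature_extractor = ToyASRModel(feat_dim=80, hidden_dim=128, num_layers=2, out_dim=512)
```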
Step S205, training the acoustic feature extraction model according to the first training data set.
In a specific implementation, based on the first training data set constructed above and the lightweight acoustic feature extraction model obtained above, the acoustic feature extraction model is trained according to the first training data set. Specifically, a loss function is constructed, and the acoustic feature extraction model is trained iteratively through a predetermined training algorithm based on each training sample included in the first training data set; the model parameters of the acoustic feature extraction model are updated at each iteration, and the trained acoustic feature extraction model is obtained after a predetermined convergence condition is reached (generally, a predetermined number of iteration rounds). The choice of the loss function, the training algorithm used, the convergence condition, and so on may be any feasible option in the related art, and the embodiments of the present application do not limit the specific choice.
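A minimal training loop, sketched below under the assumption of an L2 regression loss, the Adam optimizer, and a fixed number of epochs (none of which are specified by the patent), illustrates the iterative parameter update; the random tensors stand in for the first training data set, with the first audio data represented as frame-level features as in the earlier sketch.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Placeholder first training data set: pairs of (first audio data, first acoustic feature).
first_training_set = TensorDataset(torch.randn(1000, 200, 80),    # first audio data (frames)
                                   torch.randn(1000, 200, 512))   # first acoustic features
loader = DataLoader(first_training_set, batch_size=16, shuffle=True)
optimizer = torch.optim.Adam(feature_extractor.parameters(), lr=1e-4)

for epoch in range(10):                      # predetermined number of iteration rounds
    for audio, target_feature in loader:
        predicted = feature_extractor(audio)
        loss = F.mse_loss(predicted, target_feature)   # assumed L2 regression loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```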
With the training method of the acoustic feature extraction model described above, an acoustic feature extraction model can be obtained by training. The acoustic feature extraction model is obtained by making a lightweight modification to a speech recognition model in the related art and can extract the acoustic features of the user's voice; through the arrangement of the output mode, the extracted acoustic features retain certain pronunciation-mode features in addition to the speech content, so they can be better used for audio processing, and the acoustic feature extraction model after the lightweight modification can meet the requirements of deployment and operation on a terminal device in an offline state.
As an alternative implementation manner, the embodiment of the application further includes a training method of the audio processing model.
Referring to fig. 3, the training method of the audio processing model of the present embodiment may include the following steps:
Step S301, acquiring second audio data with a target tone.
In a specific implementation, in order to construct a training data set for training the audio processing model, a certain amount of audio data whose content is human voice needs to be acquired; in this embodiment, this audio data is referred to as the second audio data. In the embodiments of the present application, all of the second audio data share the same tone characteristics, and this tone of the second audio data is referred to as the target tone. Tone color refers to the distinctive character that different sounds exhibit in their waveforms; that is, tone color can be understood as the acoustic uniqueness of different pronunciation subjects. The audio processing to be implemented by the methods of the embodiments of the present application converts the original tone of the user's audio data into another fixed tone without changing the language content of the speech uttered by the user. Therefore, the acquired second audio data is audio data that has the same target tone but different language content. The specific choice of the target tone may be set according to the specific implementation needs, and the embodiments of the present application do not limit it.
Step S302, inputting the second audio data into a pre-trained acoustic feature extraction model to obtain a second acoustic feature corresponding to the second audio data; the acoustic feature extraction model is trained based on the training method of the acoustic feature extraction model in any of the foregoing embodiments.
In a specific implementation, the second audio data is input into an acoustic feature extraction model obtained by training based on the training method of the acoustic feature extraction model in any of the foregoing embodiments. For any input second audio data, the acoustic feature extraction model outputs the corresponding acoustic features; in the embodiments of the present application, the acoustic features generated by the acoustic feature extraction model that correspond to the second audio data are referred to as the second acoustic features.
Step S303, generating a second training data set according to the second acoustic feature and the second audio data.
In a specific implementation, a training data set is generated based on the second audio data obtained in the previous step and the second acoustic features corresponding to the second audio data; this training data set is used later for training the audio processing model and is referred to in this embodiment as the second training data set. The second training data set includes a certain number of training samples, and each training sample includes input data and corresponding label data. In the embodiments of the present application, the training samples of the second training data set take the second acoustic features as input data and the second audio data as label data, i.e. the sample form of the second training data set is "second acoustic feature → second audio data".
Step S304, an audio processing model is obtained, and the audio processing model is trained according to the second training data set.
In a specific implementation, a pre-built audio processing model is obtained; the audio processing model may be based on any end-to-end audio processing model in the related art. In such a model, the input audio data is encoded by an encoder, and the result of the encoding is then decoded by a decoder to obtain the audio processing result. The audio processing result output by the decoder is then processed by a vocoder to obtain the corresponding waveform data. The waveform data output by the vocoder can be played as audio through an audio output device such as a speaker.
As an optional implementation, the model size of the audio processing model may be further reduced so that the audio processing model matches the computing resources of the terminal device, thereby ensuring normal operation in an offline state. Specifically, the encoder and the vocoder can be modeled jointly, i.e. the audio processing model in this embodiment comprises only an encoder and a vocoder. The encoder is configured to encode the input second acoustic features to obtain the corresponding encoding features; the vocoder is directly connected to the encoder and is configured to generate audio data with the target tone based on the encoding features output by the encoder. In a specific implementation, both the encoder and the vocoder can be implemented using any feasible model in the related art; for example, the encoder may be any of the speech recognition models exemplified above in the embodiments of the present application, and the vocoder may be MelGAN, WaveRNN, WaveGlow, LPCNet, or the like.
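As an illustration only, the joint encoder-vocoder structure described above could be organized as in the sketch below; the layer types, dimensions, and the simple frame-to-samples vocoder stand-in are assumptions rather than the MelGAN/WaveRNN-class vocoders named in the text.

```python
import torch
import torch.nn as nn

class JointAudioProcessingModel(nn.Module):
    """Encoder + vocoder only, with no separate decoder (illustrative sketch)."""
    def __init__(self, feat_dim=512, enc_dim=256, num_enc_layers=4, hop=256):
        super().__init__()
        # Encoder: maps the second acoustic features to encoding features.
        self.encoder = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim if i == 0 else enc_dim, enc_dim), nn.ReLU())
             for i in range(num_enc_layers)]
        )
        # Vocoder stand-in: maps each encoded frame to `hop` waveform samples.
        self.vocoder = nn.Linear(enc_dim, hop)

    def forward(self, acoustic_features):
        x = acoustic_features
        for layer in self.encoder:
            x = layer(x)
        frames = self.vocoder(x)               # (batch, frames, hop)
        return frames.flatten(start_dim=1)     # waveform samples with the target tone
```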
In a specific implementation, based on the second training data set constructed above and the audio processing model obtained above, the audio processing model is trained according to the second training data set. Specifically, a loss function is constructed, and the audio processing model is trained iteratively through a predetermined training algorithm based on each training sample included in the second training data set; the model parameters of the audio processing model are updated at each iteration, and the trained audio processing model is obtained after a predetermined convergence condition is reached (generally, a predetermined number of iteration rounds). The choice of the loss function, the training algorithm used, the convergence condition, and so on may be any feasible option in the related art, and the embodiments of the present application do not limit the specific choice.
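For completeness of the illustration (with the same caveat that the loss, optimizer, data shapes, and epoch count are assumptions), training the joint model on the second training data set could follow the same pattern as before, this time regressing waveforms of the second audio data from the second acoustic features.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Placeholder second training data set: (second acoustic feature, second audio data).
second_training_set = TensorDataset(torch.randn(500, 100, 512),    # second acoustic features
                                    torch.randn(500, 100 * 256))   # waveforms with the target tone
loader = DataLoader(second_training_set, batch_size=8, shuffle=True)
audio_model = JointAudioProcessingModel()
optimizer = torch.optim.Adam(audio_model.parameters(), lr=1e-4)

for epoch in range(10):
    for features, target_audio in loader:
        generated = audio_model(features)
        loss = F.l1_loss(generated, target_audio)   # assumed waveform regression loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```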
As an optional embodiment, in order to further reduce the model size of the audio processing model during training, parameter sharing may also be applied to several hidden layers included in the audio processing model. That is, the step of training the audio processing model may specifically include: determining at least two target hidden layers from the hidden layers included in the audio processing model; and sharing the parameters of the at least two target hidden layers. In a specific implementation, a hidden layer whose parameters are to be shared is referred to in the embodiments of the present application as a target hidden layer. There are at least two target hidden layers, and when the model parameters are updated by the training algorithm (such as a gradient descent algorithm) in any one or more iterations during the training of the audio processing model, the parameters of the determined target hidden layers are shared. This processing reduces the model size of the audio processing model and at the same time improves the training efficiency. The specific number of target hidden layers can be set according to the implementation requirements.
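Continuing the illustrative joint model above, one straightforward way to share parameters between two target hidden layers is to make both positions refer to the same module, so that the training algorithm updates a single set of weights for them; this is only one possible realization of the parameter-sharing step described here.

```python
import torch

model = JointAudioProcessingModel()

# Choose two target hidden layers of the encoder and share (tie) their parameters:
# both positions now refer to the same module, so the training algorithm updates a
# single set of weights for them, reducing the model size.
model.encoder[2] = model.encoder[1]

# model.parameters() yields each shared parameter only once, so an optimizer such as
# Adam (or plain gradient descent) naturally keeps the tied layers identical.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```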
With the training method of the audio processing model described above, an audio processing model can be obtained by training; taking the acoustic features obtained by the acoustic feature extraction model as input, the audio processing model generates the corresponding audio data with the target tone. Through implementations such as joint modeling of the encoder and the vocoder and parameter sharing, the model size of the audio processing model can be further reduced, so that the audio processing model of the embodiments of the present application can meet the requirements of deployment and operation on a terminal device in an offline state.
It should be noted that some embodiments of the present application are described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same technical concept, an embodiment of the present application further provides an audio processing apparatus. Referring to Fig. 4, the audio processing apparatus 400 includes:
An acquisition module 401 configured to acquire audio data to be processed;
the extraction module 402 is configured to input the audio data to be processed into a pre-trained acoustic feature extraction model to obtain acoustic features to be processed corresponding to the audio data to be processed; the acoustic features to be processed comprise pronunciation mode features of at least part of the audio data to be processed;
the processing module 403 is configured to input the acoustic feature to be processed into a pre-trained audio processing model, so as to obtain processed audio data which has a target tone and corresponds to the audio data to be processed.
In some alternative embodiments, the extraction module 402 is further configured to obtain the first audio data; generating a first acoustic feature corresponding to the first audio data according to a predetermined voice recognition model; the first acoustic feature is output by any hidden layer close to an output layer in the voice recognition model; generating a first training data set according to the first audio data and the first acoustic feature; acquiring an acoustic feature extraction model; the size of the acoustic feature extraction model is smaller than the size of the speech recognition model; and training the acoustic feature extraction model according to the first training data set.
In some alternative embodiments, the extracting module 402 is specifically configured to input the first audio data into the speech recognition model, and extract an output of one hidden layer in the speech recognition model, which is immediately adjacent to the output layer, as the first acoustic feature corresponding to the first audio data.
In some alternative embodiments, the acoustic feature extraction model includes no more than one fifth of the number of hidden layers that the speech recognition model includes; for any hidden layer of the acoustic feature extraction model, the number of neurons that the hidden layer comprises does not exceed one quarter of the number of neurons that any hidden layer of the speech recognition model comprises.
In some alternative embodiments, processing module 403 is further configured to obtain second audio data having a target timbre; inputting the second audio data into the acoustic feature extraction model to obtain a second acoustic feature corresponding to the second audio data; generating a second training data set according to the second acoustic feature and the second audio data; and acquiring an audio processing model, and training the audio processing model according to the second training data set.
In some alternative embodiments, the audio processing model includes an encoder and a vocoder, wherein the encoder is configured to encode the second acoustic features to obtain corresponding encoding features, and the vocoder is configured to generate audio data having the target tone according to the encoding features.
In some alternative embodiments, the processing module 403 is specifically configured to determine at least two target hidden layers from the hidden layers included in the audio processing model; and sharing the parameters of the at least two target hidden layers.
In some alternative embodiments, the processing module 403 is further configured to delete the audio data to be processed after inputting the audio data to be processed into a pre-trained acoustic feature extraction model.
For convenience of description, the above apparatus is described as being divided into various modules by function. Of course, when the present application is implemented, the functions of the modules may be implemented in the same piece or pieces of software and/or hardware.
The device of the foregoing embodiment is configured to implement the audio processing method of any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same technical concept, the embodiment of the application also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the audio processing method according to any embodiment.
Fig. 5 shows a more specific hardware structure of an electronic device according to this embodiment. The device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are communicatively connected to one another within the device via the bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided in the embodiments of the present specification.
The memory 1020 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and called and executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module for inputting and outputting information. The input/output module may be configured as a component within the device (not shown in the figure) or may be externally connected to the device to provide the corresponding functions. The input devices may include a keyboard, a mouse, a touch screen, a microphone, and various types of sensors, and the output devices may include a display, a speaker, a vibrator, indicator lights, and the like.
The communication interface 1040 is used to connect a communication module (not shown in the figure) to implement communication interaction between this device and other devices. The communication module may communicate in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, WiFi, or Bluetooth).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above device shows only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050, in a specific implementation the device may also include other components necessary for normal operation. In addition, it will be understood by those skilled in the art that the above device may include only the components necessary to implement the solutions of the embodiments of the present specification, and need not include all the components shown in the figure.
The electronic device of the foregoing embodiment is configured to implement the corresponding audio processing method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same technical concept, the embodiments of the present application also provide a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the audio processing method as described in any one of the embodiments above.
The computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.
The storage medium of the foregoing embodiments stores computer instructions for causing the computer to perform the audio processing method according to any one of the foregoing embodiments, and has the advantages of the corresponding method embodiments, which are not described herein.
According to one or more embodiments of the present disclosure, example 1 provides an audio processing method, comprising:
acquiring audio data to be processed;
inputting the audio data to be processed into a pre-trained acoustic feature extraction model to obtain acoustic features to be processed corresponding to the audio data to be processed; the acoustic features to be processed comprise pronunciation mode features of at least part of the audio data to be processed;
inputting the acoustic features to be processed into a pre-trained audio processing model to obtain processed audio data which has target tone and corresponds to the audio data to be processed.
In accordance with one or more embodiments of the present disclosure, example 2 provides the method of example 1, the method further comprising training the acoustic feature extraction model by:
acquiring first audio data;
generating a first acoustic feature corresponding to the first audio data according to a predetermined voice recognition model; the first acoustic feature is output by any hidden layer close to an output layer in the voice recognition model;
Generating a first training data set according to the first audio data and the first acoustic feature;
acquiring an acoustic feature extraction model; the size of the acoustic feature extraction model is smaller than the size of the speech recognition model;
and training the acoustic feature extraction model according to the first training data set.
According to one or more embodiments of the present disclosure, example 3 provides the method of example 2, the generating, according to a predetermined speech recognition model, a first acoustic feature corresponding to the first audio data, including:
and inputting the first audio data into the voice recognition model, and extracting the output of one hidden layer, which is close to the output layer, in the voice recognition model as a first acoustic feature corresponding to the first audio data.
In accordance with one or more embodiments of the present disclosure, example 4 provides the method of example 2, the acoustic feature extraction model comprising no more than one fifth of the number of hidden layers the speech recognition model comprises; for any hidden layer of the acoustic feature extraction model, the number of neurons that the hidden layer comprises does not exceed one quarter of the number of neurons that any hidden layer of the speech recognition model comprises.
In accordance with one or more embodiments of the present disclosure, example 5 provides the method of example 1, the method further comprising training the audio processing model by:
acquiring second audio data with a target tone;
inputting the second audio data into the acoustic feature extraction model to obtain a second acoustic feature corresponding to the second audio data;
generating a second training data set according to the second acoustic feature and the second audio data;
and acquiring an audio processing model, and training the audio processing model according to the second training data set.
In accordance with one or more embodiments of the present disclosure, example 6 provides the method of example 5, the audio processing model comprising: an encoder and a vocoder; wherein the encoder is configured to encode the second acoustic features to obtain corresponding encoding features; and the vocoder is configured to generate audio data having the target tone according to the encoding features.
In accordance with one or more embodiments of the present disclosure, example 7 provides the method of example 6, the training the audio processing model comprising:
determining at least two target hidden layers from hidden layers included in the audio processing model;
And sharing the parameters of the at least two target hidden layers.
According to one or more embodiments of the present disclosure, example 8 provides the method of example 1, the inputting the audio data to be processed into a pre-trained acoustic feature extraction model, further comprising, after:
and deleting the audio data to be processed.
According to one or more embodiments of the present disclosure, example 9 provides an audio processing apparatus, comprising:
the acquisition module is configured to acquire audio data to be processed;
the extraction module is configured to input the audio data to be processed into a pre-trained acoustic feature extraction model to obtain acoustic features to be processed corresponding to the audio data to be processed; the acoustic features to be processed comprise pronunciation mode features of at least part of the audio data to be processed;
the processing module is configured to input the acoustic features to be processed into a pre-trained audio processing model to obtain processed audio data which has a target tone and corresponds to the audio data to be processed.
In accordance with one or more embodiments of the present disclosure, example 10 provides the apparatus of example 9, the extraction module further configured to obtain the first audio data; generating a first acoustic feature corresponding to the first audio data according to a predetermined voice recognition model; the first acoustic feature is output by any hidden layer close to an output layer in the voice recognition model; generating a first training data set according to the first audio data and the first acoustic feature; acquiring an acoustic feature extraction model; the size of the acoustic feature extraction model is smaller than the size of the speech recognition model; and training the acoustic feature extraction model according to the first training data set.
According to one or more embodiments of the present disclosure, example 11 provides the apparatus of example 10, the extraction module is specifically configured to input the first audio data into the speech recognition model, and extract an output of one hidden layer in the speech recognition model that is immediately adjacent to the output layer as a first acoustic feature corresponding to the first audio data.
In accordance with one or more embodiments of the present disclosure, example 12 provides the apparatus of example 10, the acoustic feature extraction model comprising no more than one fifth of the number of hidden layers the speech recognition model comprises; for any hidden layer of the acoustic feature extraction model, the number of neurons that the hidden layer comprises does not exceed one quarter of the number of neurons that any hidden layer of the speech recognition model comprises.
Example 13 provides the apparatus of example 9, the processing module further configured to obtain second audio data having a target timbre, according to one or more embodiments of the present disclosure; inputting the second audio data into the acoustic feature extraction model to obtain a second acoustic feature corresponding to the second audio data; generating a second training data set according to the second acoustic feature and the second audio data; and acquiring an audio processing model, and training the audio processing model according to the second training data set.
Example 14 provides the apparatus of example 13, according to one or more embodiments of the disclosure, the audio processing model comprising: an encoder and a vocoder; wherein the encoder is configured to encode the second acoustic features to obtain corresponding encoding features; and the vocoder is configured to generate audio data having the target tone according to the encoding features.
Example 15 provides the apparatus of example 13, according to one or more embodiments of the present disclosure, the processing module specifically configured to determine at least two target hidden layers from hidden layers included in the audio processing model; and sharing the parameters of the at least two target hidden layers.
In accordance with one or more embodiments of the present disclosure, example 16 provides the apparatus of example 9, the processing module further configured to delete the audio data to be processed after inputting the audio data to be processed into a pre-trained acoustic feature extraction model.
According to one or more embodiments of the present disclosure, example 17 provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any one of examples 1 to 8 when the program is executed.
Example 18 provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of examples 1 to 8, according to one or more embodiments of the present disclosure.
It should be noted that unless otherwise defined, technical or scientific terms used in the embodiments of the present application should be given the ordinary meaning as understood by one of ordinary skill in the art to which the present application belongs. The terms "first," "second," and the like, as used in embodiments of the present application, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the application (including the claims) is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the present application, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
The present embodiments are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements, and the like made within the spirit and principles of the embodiments are intended to be included within the scope of the present application.

Claims (11)

1. An audio processing method, comprising:
acquiring audio data to be processed;
inputting the audio data to be processed into a pre-trained acoustic feature extraction model to obtain acoustic features to be processed corresponding to the audio data to be processed; the acoustic features to be processed comprise pronunciation mode features of at least part of the audio data to be processed;
inputting the acoustic features to be processed into a pre-trained audio processing model to obtain processed audio data which has a target tone and corresponds to the audio data to be processed.
2. The method of claim 1, further comprising training the acoustic feature extraction model by:
acquiring first audio data;
generating a first acoustic feature corresponding to the first audio data according to a predetermined speech recognition model; wherein the first acoustic feature is output by a hidden layer close to the output layer of the speech recognition model;
generating a first training data set according to the first audio data and the first acoustic feature;
acquiring an acoustic feature extraction model; the size of the acoustic feature extraction model is smaller than the size of the speech recognition model;
and training the acoustic feature extraction model according to the first training data set.
3. The method of claim 2, wherein generating the first acoustic feature corresponding to the first audio data according to a predetermined speech recognition model comprises:
inputting the first audio data into the speech recognition model, and extracting the output of a hidden layer close to the output layer of the speech recognition model as the first acoustic feature corresponding to the first audio data.
4. The method of claim 2, wherein the number of hidden layers included in the acoustic feature extraction model does not exceed one fifth of the number of hidden layers included in the speech recognition model; and, for any hidden layer of the acoustic feature extraction model, the number of neurons included in that hidden layer does not exceed one quarter of the number of neurons included in any hidden layer of the speech recognition model.
5. The method of claim 1, further comprising training the audio processing model by:
acquiring second audio data with a target tone;
inputting the second audio data into the acoustic feature extraction model to obtain a second acoustic feature corresponding to the second audio data;
generating a second training data set according to the second acoustic feature and the second audio data;
and acquiring an audio processing model and training the audio processing model according to the second training data set.
6. The method of claim 5, wherein the audio processing model comprises an encoder and a vocoder; wherein the encoder is configured to generate encoded features according to the second acoustic feature, and the vocoder is configured to generate audio data having the target tone according to the encoded features.
7. The method of claim 6, wherein training the audio processing model comprises:
determining at least two target hidden layers from hidden layers included in the audio processing model;
and sharing the parameters of the at least two target hidden layers.
8. The method of claim 1, wherein after the inputting of the audio data to be processed into the pre-trained acoustic feature extraction model, the method further comprises:
and deleting the audio data to be processed.
9. An audio processing apparatus, comprising:
the acquisition module is configured to acquire audio data to be processed;
the extraction module is configured to input the audio data to be processed into a pre-trained acoustic feature extraction model to obtain acoustic features to be processed corresponding to the audio data to be processed; the acoustic features to be processed comprise pronunciation mode features of at least part of the audio data to be processed;
the processing module is configured to input the acoustic features to be processed into a pre-trained audio processing model to obtain processed audio data which has a target tone and corresponds to the audio data to be processed.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 8 when executing the program.
11. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 8.
CN202210689173.3A 2022-06-16 2022-06-16 Audio processing method, device, electronic equipment and storage medium Pending CN117292705A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210689173.3A CN117292705A (en) 2022-06-16 2022-06-16 Audio processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210689173.3A CN117292705A (en) 2022-06-16 2022-06-16 Audio processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117292705A true CN117292705A (en) 2023-12-26

Family

ID=89255874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210689173.3A Pending CN117292705A (en) 2022-06-16 2022-06-16 Audio processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117292705A (en)

Similar Documents

Publication Publication Date Title
US11475881B2 (en) Deep multi-channel acoustic modeling
US10943606B2 (en) Context-based detection of end-point of utterance
WO2020253509A1 (en) Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium
CN112786007B (en) Speech synthesis method and device, readable medium and electronic equipment
JP2018026127A (en) Translation method, translation device, and computer program
CN108615525B (en) Voice recognition method and device
CN112489621B (en) Speech synthesis method, device, readable medium and electronic equipment
CN112786008B (en) Speech synthesis method and device, readable medium and electronic equipment
CN113330511B (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
US20140236597A1 (en) System and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
CN111081280A (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN114627856A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
US20230419957A1 (en) User profile linking
CN117373431A (en) Audio synthesis method, training method, device, equipment and storage medium
WO2014176489A2 (en) A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
US11393454B1 (en) Goal-oriented dialog generation using dialog template, API, and entity data
CN116597858A (en) Voice mouth shape matching method and device, storage medium and electronic equipment
CN117219052A (en) Prosody prediction method, apparatus, device, storage medium, and program product
CN117292705A (en) Audio processing method, device, electronic equipment and storage medium
CN111916095B (en) Voice enhancement method and device, storage medium and electronic equipment
CN114566140A (en) Speech synthesis model training method, speech synthesis method, equipment and product
CN112017662A (en) Control instruction determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination