WO2024037348A1 - Audio processing method, model training method, apparatus, device, medium and product - Google Patents

Audio processing method, model training method, apparatus, device, medium and product Download PDF

Info

Publication number
WO2024037348A1
Authority
WO
WIPO (PCT)
Prior art keywords
padding
target
audio processing
model
amount
Prior art date
Application number
PCT/CN2023/111004
Other languages
English (en)
French (fr)
Inventor
黄家鸿
马东鹏
项伟
Original Assignee
广州市百果园信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市百果园信息技术有限公司
Publication of WO2024037348A1 publication Critical patent/WO2024037348A1/zh

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 - Adapting to target pitch
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the embodiments of the present application relate to the field of audio processing technology, for example to audio processing methods, model training methods, apparatuses, devices, storage media and products.
  • timbre conversion is an important audio processing technology
  • timbre conversion solutions based on neural network models have been widely used in various fields such as audio content generation and entertainment audio production.
  • Timbre conversion is a technology that converts the timbre of the original audio into a target timbre while keeping the content information of the original audio unchanged.
  • audio processing technologies such as timbre conversion use offline inference solutions in most cases.
  • streaming inference solutions are needed for application scenarios with high timeliness requirements such as voice calls or live broadcasts.
  • it is difficult for related streaming inference solutions to balance low latency and conversion quality.
  • Embodiments of the present application provide audio processing methods, model training methods, devices, equipment, storage media and products, which can be better suited for processing audio streams.
  • an audio processing method which method includes:
  • determining a target right-side padding amount and a target left-side padding amount corresponding to the preset audio processing model, wherein the preset audio processing model includes a convolution layer, the target right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the target left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the target right-side padding amount is greater than 0, and for the same convolution layer the target right-side padding amount is smaller than the corresponding target left-side padding amount;
  • processing the audio stream to be processed based on the target right-side padding amount, the target left-side padding amount and the preset audio processing model to obtain a corresponding processed target audio stream.
  • a model training method which method includes:
  • determining a right-side padding amount and a left-side padding amount corresponding to the audio processing model, wherein the audio processing model contains a convolution layer, the right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the right-side padding amount is greater than 0, and for the same convolution layer the right-side padding amount is smaller than the corresponding left-side padding amount;
  • processing the sample audio stream based on the right-side padding amount, the left-side padding amount and the audio processing model to obtain a corresponding processed target sample audio stream, wherein the sample audio stream corresponds to a standard audio stream;
  • a target loss relationship is determined according to the target sample audio stream and the standard audio stream, and the audio processing model is trained based on the target loss relationship.
  • an audio processing device which device includes:
  • a padding amount determination module, configured to determine a target right-side padding amount and a target left-side padding amount corresponding to the preset audio processing model, wherein the preset audio processing model includes a convolution layer, the target right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the target left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the target right-side padding amount is greater than 0, and for the same convolution layer the target right-side padding amount is smaller than the corresponding target left-side padding amount;
  • the audio stream processing module is configured to process the audio stream to be processed based on the target right padding amount, the target left padding amount, and the preset audio processing model to obtain a corresponding processed target audio stream.
  • a model training device which device includes:
  • a quantity determination module, configured to determine a right-side padding amount and a left-side padding amount corresponding to the audio processing model, wherein the audio processing model contains a convolution layer, the right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the right-side padding amount is greater than 0, and for the same convolution layer the right-side padding amount is smaller than the corresponding left-side padding amount;
  • an audio processing module, configured to process the sample audio stream based on the right-side padding amount, the left-side padding amount and the audio processing model to obtain a corresponding processed target sample audio stream, wherein the sample audio stream corresponds to a standard audio stream;
  • a model training module configured to determine a target loss relationship based on the target sample audio stream and the standard audio stream, and to train the audio processing model based on the target loss relationship.
  • an electronic device including:
  • the memory stores a computer program executable by the at least one processor; the computer program is executed by the at least one processor so that the at least one processor can execute the audio processing method and/or model training method described in any embodiment of the present application.
  • a computer-readable storage medium stores a computer program which, when executed by a processor, implements the audio processing method and/or model training method described in any embodiment of the present application.
  • a computer program product includes a computer program which, when executed by a processor, implements the audio processing method and/or model training method described in any embodiment of the present application.
  • Figure 1 is a schematic flow chart of an audio processing method provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of a convolution method in related technologies
  • Figure 3 is a schematic diagram of a convolution method provided by an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of a preset audio processing model provided by an embodiment of the present application.
  • Figure 5 is a schematic flow chart of another audio processing method provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of the principle of an audio processing method provided by an embodiment of the present application.
  • Figure 7 is a schematic flow chart of a model training method provided by an embodiment of the present application.
  • Figure 8 is a structural block diagram of an audio processing device provided by an embodiment of the present application.
  • Figure 9 is a structural block diagram of a model training device provided by an embodiment of the present application.
  • Figure 10 is a structural block diagram of an electronic device provided by an embodiment of the present application.
  • The first is the autoregressive model solution. This type of solution uses an autoregressive framework: when predicting the information of the current step, it only uses the previous information and does not look ahead (that is, it does not use later information); during training, the size of the speech chunk needs to be set to limit the amount of look-ahead information.
  • Although the autoregressive architecture is suitable for streaming inference, since it does not use later information, it cannot use that information to help predict the output of the current step, making it difficult to ensure high quality of the converted audio.
  • the second type is the convolution model scheme.
  • This type of scheme combines the convolutions of different layers to form a model.
  • the convolution generally uses conventional convolution, such as causal convolution, traditional convolution or atrous convolution.
  • With a convolution-combination architecture, it is necessary to calculate the receptive field size of the convolutions within the entire framework, so that during streaming inference, the left and right sides (which can also be understood as front and back) are padded with an amount of data consistent with the receptive field size; that is, the amount of data padded on the left is the same as on the right.
  • When performing voice conversion, it is generally also necessary to add a vocoder, and convolution is often used inside the vocoder; if both the acoustic model and the vocoder use convolution, stacking the two models enlarges the receptive field, and a large receptive field increases the delay. A common remedy is to reduce the convolution kernels to shrink the receptive field and the delay, but shrinking the receptive field degrades the conversion quality.
  • The embodiments of this application provide a brand-new convolution method that pads data on the right side, with the amount of right-side padding being less than the left-side padding, i.e. more padding on the left than on the right; herein this convolution method is called left-biased convolution, sketched below.
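A minimal sketch of this left-biased convolution, assuming PyTorch (the patent does not name a framework; the class name and parameter choices below are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeftBiasedConv1d(nn.Module):
    """1-D convolution whose input is padded asymmetrically: more zeros on
    the left (past) than on the right (future), so only a small amount of
    future context is needed while the output length equals the input."""

    def __init__(self, channels: int, kernel_size: int, right_pad: int):
        super().__init__()
        # Left-biased split of the total padding kernel_size - 1.
        assert 0 < right_pad < kernel_size - 1 - right_pad
        self.left_pad = kernel_size - 1 - right_pad
        self.right_pad = right_pad
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); zero-pad past and future context.
        x = F.pad(x, (self.left_pad, self.right_pad))
        return self.conv(x)
```

With kernel_size = 5 and right_pad = 1, the layer pads three zeros on the left and one on the right, matching the left-biased example of Figure 3: only one future speech block must arrive before the layer can run.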
  • FIG. 1 is a schematic flow chart of an audio processing method provided by an embodiment of the present application.
  • This embodiment can be applied to processing audio streams, and can be specifically applied to various applications such as voice calls, audio and video live broadcasts, and multi-person online conferences.
  • the method can be executed by an audio processing device, which can be implemented in the form of hardware and/or software, and which can be configured in electronic equipment such as audio processing equipment.
  • the electronic device may be a mobile device such as a mobile phone, a smart watch, a tablet computer, or a personal digital assistant; it may also be another device such as a desktop computer.
  • the method includes:
  • Step 101: Determine a target right-side padding amount and a target left-side padding amount corresponding to the preset audio processing model, wherein the preset audio processing model includes a convolution layer, the target right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the target left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the target right-side padding amount is greater than 0, and for the same convolution layer the target right-side padding amount is smaller than the corresponding target left-side padding amount.
  • the preset audio processing model can be understood as a pre-trained neural network model used to process an audio stream to obtain a processed audio stream, where the audio processing corresponding to the preset audio processing model may include timbre conversion, language conversion (such as automatically converting Chinese speech to English speech), speech noise reduction, speech content replacement, spoken language recognition, and so on.
  • the model includes one or more convolutional layers, and the specific model structure can be trained using the model training method provided in the embodiments of this application.
  • the target right padding amount and the target left padding amount are first determined.
  • the audio stream to be processed can be understood as the audio stream that currently needs to be processed, which can be a real-time call voice stream, a live video voice stream or an online conference voice stream, etc.
  • the target right-side and left-side padding amounts can be determined by reading preset values, or dynamically according to the actual situation of the audio stream to be processed, the actual audio processing requirements, network quality and other related factors.
  • Different values of the target right-side and left-side padding amounts can correspond to different preset audio processing models; that is, the corresponding preset audio processing model can be selected based on the determined amounts. Alternatively, different values can correspond to the same general-purpose preset audio processing model, which can be trained with randomly determined right-side and left-side padding amounts so as to adapt to different value situations.
  • When processing an audio stream, it is generally necessary to divide the audio stream into blocks to obtain multiple speech blocks; the specific size can be set according to the actual situation, for example corresponding to 10 or 20 speech frames, and inference is performed in units of speech blocks.
  • After data undergoes the convolution operation of a convolution layer, the amount of output data is reduced. In order to keep the amount of input data consistent each time and reduce calculation errors, before inputting data to the convolution layer it is necessary to pad a certain amount of data on the left side (which can also be understood as the front) and the right side (which can also be understood as the back) of the input data, for example a certain number of steps with value 0.
  • The target right-side padding amount (also called the target rear padding amount) indicates the number of data items padded on the right side of the input data of a convolution layer. When the preset audio processing model contains multiple convolution layers, the target right-side padding amounts corresponding to different convolution layers can be the same or different; that is, multiple target right-side padding amounts can be determined, and their number can be consistent with the number of convolution layers.
  • The target left-side padding amount (also called the target front padding amount) indicates the number of data items padded on the left side of the input data of a convolution layer, and likewise the target left-side padding amounts corresponding to different convolution layers can be the same or different. All target right-side padding amounts can be greater than 0, and for the same convolution layer, the corresponding target right-side padding amount is smaller than the target left-side padding amount; that is, each convolution layer in the preset audio processing model uses the left-biased convolution method.
  • Figure 2 is a schematic diagram of a convolution method in the related technology; it shows a traditional convolution method. For example, with a convolution kernel size (kernel_size) of 5 and an input time step (time_step) of 3, two data items with step value 0 are padded on each of the left and right sides, i.e. the left padding amount equals the right padding amount, and the convolution is then computed; the receptive field size at this point is 5. When processing an audio stream, the data padded on the right side should be the two upcoming speech blocks, and the input data of the convolution layer can only be obtained after the next two speech blocks have been received, so the delay is the duration of two speech blocks (other factors are ignored here).
  • Figure 3 is a schematic diagram of a convolution method provided by an embodiment of the present application; it shows the new unbalanced convolution method provided by this application, left-biased convolution. When processing streaming speech, what affects the delay is the receptive field on the right side of the convolution; therefore, to balance delay and audio quality, the embodiment of this application uses left-biased convolution, i.e. the receptive field of the convolution is biased to the left and more data is padded on the left than on the right. For example, with a convolution kernel size of 5 and an input time step of 3, three data items with step value 0 are padded on the left and one on the right, and the receptive field size is still 5; the right-side padding then corresponds to only one upcoming speech block, so the delay is the duration of one speech block, one block less than the traditional convolution of Figure 2.
  • Step 102 Based on the target right padding amount, the target left padding amount, and the preset audio processing model, process the audio stream to be processed to obtain a corresponding processed target audio stream.
  • After the target right-side and left-side padding amounts have been determined, the audio stream to be processed can be input to the preset audio processing model; or, after the audio stream to be processed has been preprocessed, the preprocessing result data can be input to the preset audio processing model. While processing the input data, the preset audio processing model pads the data that is about to be input to each convolution layer based on that layer's target right-side and left-side padding amounts, then inputs it to the layer to complete the convolution calculation; finally, the preset audio processing model outputs the processed audio stream, and the target audio stream is determined from this output, for example by removing the audio data corresponding to the padding data.
  • In the audio processing method provided by the embodiments of this application, when the audio stream to be processed needs to be processed, the target right-side padding amount and the target left-side padding amount corresponding to the preset audio processing model are first determined, wherein the preset audio processing model contains a convolution layer, the target right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the target left-side padding amount indicates the number padded on the left side, and the target right-side padding amount is greater than 0 and smaller than the target left-side padding amount. Then, based on the target right-side padding amount, the target left-side padding amount and the preset audio processing model, the audio stream to be processed is processed to obtain the corresponding processed target audio stream. When padding the input data of the convolution layer, padding data on the right side ensures that later information is referenced, improving the audio quality after processing, while keeping the right-side padding amount smaller than the left-side amount effectively controls the delay, thus achieving both low latency and good processing quality.
  • In some embodiments, the preset audio processing model includes a preset timbre addition model (which can be understood as an acoustic model) and a preset vocoder model. Both the preset timbre addition model and the preset vocoder model contain convolution layers; the preset timbre addition model is configured to add timbre information to content information to obtain timbre content information, and the preset vocoder model is configured to convert the timbre content information into audio data.
  • the preset audio processing model can be considered as a preset timbre conversion model, which is configured to perform timbre conversion on the audio stream.
  • The content information can be understood as speech content information. Automatic Speech Recognition (ASR) technology can be used to preprocess the audio frames in the audio stream so as to convert the human speech information contained in the frames into text information, which serves as the above content information.
  • Both the preset timbre addition model and the preset vocoder model contain one or more convolutional layers.
  • Figure 4 is a schematic structural diagram of a preset audio processing model provided by an embodiment of the present application. As shown in Figure 4, the content information passes through the preset timbre addition model, where timbre information is added, and is then converted to audio data by upsampling in the preset vocoder model. For example, the preset timbre addition model can be an acoustic model (AM), specifically an acoustic model built from a one-dimensional convolutional residual network (conv1d resnet); the preset vocoder model (vocoder) can be a HiFi-GAN vocoder, which uses Generative Adversarial Networks (GAN) as the basic generative model and can ensure the sound quality of the generated audio. A toy sketch of this two-stage structure follows.
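As an illustration of how the two sub-models can be chained, the sketch below (PyTorch assumed; channel sizes, the upsampling factor and all names are our own, and the adversarial training of a real HiFi-GAN vocoder is omitted) applies left-biased padding before every convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConverter(nn.Module):
    """Toy acoustic model + vocoder stack; every conv gets left-biased padding."""

    def __init__(self, ppg_dim=144, hidden=64, kernel=5, right_pad=1):
        super().__init__()
        self.left_pad = kernel - 1 - right_pad   # more padding on the left
        self.right_pad = right_pad
        self.am = nn.ModuleList([nn.Conv1d(ppg_dim, hidden, kernel),
                                 nn.Conv1d(hidden, hidden, kernel)])
        self.up = nn.ConvTranspose1d(hidden, hidden, 4, stride=4)  # upsample
        self.out = nn.Conv1d(hidden, 1, kernel)

    def _pad(self, x):
        return F.pad(x, (self.left_pad, self.right_pad))

    def forward(self, ppg):                      # ppg: (batch, ppg_dim, T)
        h = ppg
        for conv in self.am:                     # timbre-addition stage
            h = torch.relu(conv(self._pad(h)))
        h = self.up(h)                           # vocoder upsampling, T -> 4T
        return self.out(self._pad(h))            # waveform: (batch, 1, 4T)
```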
  • In some embodiments, processing the audio stream to be processed based on the target right-side padding amount, the target left-side padding amount and the preset audio processing model to obtain a corresponding processed target audio stream includes: extracting the content information to be processed from the audio stream to be processed; and inputting the content information to be processed into the preset audio processing model, so that the preset audio processing model processes it based on the target right-side and left-side padding amounts to obtain the corresponding timbre-changed target audio stream. The benefit of this arrangement is that it accurately preserves the consistency of content information between the timbre-converted target audio stream and the audio stream to be processed.
  • For example, ASR technology is used to extract the content information to be processed from the audio stream to be processed, and this content information is input to the preset audio processing model for processing: the preset timbre addition model within the model adds target timbre information on top of the content information to obtain target timbre content information, which is then upsampled by the preset vocoder model and converted into a target audio stream whose timbre has been changed to the target timbre. Throughout this process, before data is input to each convolution layer, it is padded based on that layer's target right-side and left-side padding amounts and then input to the layer for convolution calculation.
  • In some embodiments, determining the target right-side padding amount and the target left-side padding amount corresponding to the preset audio processing model includes: determining the target right-side padding amount corresponding to the preset audio processing model; determining the convolution kernel size corresponding to the convolution layer; and determining the target left-side padding amount based on the kernel size and the target right-side padding amount. The benefit of this arrangement is that, for a fixed kernel size, the target right-side and left-side padding amounts can be determined quickly and accurately.
  • For the same convolution layer, the corresponding total padding amount can be determined according to the receptive field of the model; the receptive field can be determined by the convolution kernel size of the layer, and the total padding amount is generally the kernel size minus 1, i.e. kernel_size - 1. The target right-side padding amount (right_pad_num) can be determined first, and the target left-side padding amount (left_pad_num) is then determined from the difference between the total padding amount and the target right-side padding amount, i.e. left_pad_num = kernel_size - 1 - right_pad_num; a small sketch of this calculation follows.
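This split can be computed in a few lines (Python; the function name is ours, and the range check anticipates the constraint stated below that the right-side amount stays under (kernel_size - 1) / 2):

```python
def padding_amounts(kernel_size: int, right_pad_num: int) -> tuple[int, int]:
    """Split the total padding kernel_size - 1 into a left-biased pair."""
    # Valid left-biased range: 0 < right_pad_num < (kernel_size - 1) / 2,
    # which guarantees right_pad_num < left_pad_num.
    if not 0 < right_pad_num < (kernel_size - 1) / 2:
        raise ValueError("right_pad_num outside the left-biased range")
    left_pad_num = kernel_size - 1 - right_pad_num
    return left_pad_num, right_pad_num

print(padding_amounts(5, 1))  # (3, 1), as in the Figure 3 example
```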
  • the target right-side padding amount is determined based on the expected delay corresponding to the audio stream to be processed, wherein the target right-side padding amount and the expected delay are positively correlated.
  • The expected delay can be understood as the desired delay length, and can be determined based on the current tolerance for delay. For example, the higher the tolerance, the longer the expected delay and, correspondingly, the larger the target right-side padding amount can be; the lower the tolerance, the shorter the expected delay and the smaller the target right-side padding amount can be.
  • Optionally, a correspondence between sets of right-side padding amounts and delays can be established in advance to obtain an amount-delay mapping relationship, which is then queried according to the expected delay to obtain the target right-side padding amounts. The mapping can be established through experiments or other means, and each right-side padding amount set contains the right-side padding amount corresponding to each convolution layer.
  • Figure 5 is a schematic flow chart of another audio processing method provided by an embodiment of the present application, taking timbre conversion as the audio processing, and Figure 6 is a schematic principle diagram of an audio processing method provided by an embodiment of the present application; the embodiments of the present application can be understood with reference to Figures 5 and 6. Based on the above optional embodiments, the method may include:
  • Step 501 Determine the target right-side padding amount corresponding to the preset audio processing model according to the expected delay corresponding to the audio stream to be processed.
  • the preset audio processing model includes a preset timbre addition model and a preset vocoder model.
  • The preset timbre addition model includes a first preset number (denoted M, greater than or equal to 1) of convolution layers, and the preset vocoder model includes a second preset number (denoted N, greater than or equal to 1) of convolution layers.
  • Optionally, the different delays of the preset audio processing model when each of its convolution layers uses different right-side padding amounts can be determined in advance, for example through experiments, and a correspondence between right-side padding amount sets and delays can be established to obtain the amount-delay mapping relationship. For example, the right-side padding amounts corresponding to the M convolution layers are Mr1, Mr2, ..., Mrm, and those corresponding to the N convolution layers are Nr1, Nr2, ..., Nrm; each right-side padding amount takes a value greater than 0 and less than the corresponding kernel size minus 1 divided by 2, i.e. (kernel_size - 1) / 2. Different values of Mr1, Mr2, ..., Mrm, Nr1, Nr2, ..., Nrm are combined to obtain multiple right-side padding amount sets; each set contains one combination of these values (i.e. M + N values) and corresponds to one delay, yielding the amount-delay mapping relationship.
  • In this step, the amount-delay mapping relationship can be queried according to the expected delay to obtain the corresponding target right-side padding amount set; a lookup sketch follows.
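One way such a table could be consulted is sketched below (Python; the delays and per-layer amounts are made-up illustrative values, since the patent builds the mapping experimentally):

```python
# Hypothetical amount-delay mapping: delay in milliseconds -> one set of
# per-layer right-side padding amounts (M + N values; here M = N = 2).
AMOUNT_DELAY_MAP = {
    40:  [1, 1, 1, 1],   # least look-ahead per layer, lowest delay
    80:  [2, 2, 2, 2],
    120: [3, 3, 3, 3],
}

def target_right_pads(expected_delay_ms: int) -> list[int]:
    """Pick the padding set of the largest delay within the expectation,
    reflecting the positive correlation described above."""
    feasible = [d for d in AMOUNT_DELAY_MAP if d <= expected_delay_ms]
    if not feasible:
        raise ValueError("expected delay below the minimum achievable delay")
    return AMOUNT_DELAY_MAP[max(feasible)]

print(target_right_pads(100))  # [2, 2, 2, 2]
```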
  • Step 502 Determine the size of the convolution kernel corresponding to the convolution layer, and determine the amount of padding on the left side of the target based on the size of the convolution kernel and the amount of padding on the right side of the target.
  • the corresponding target left padding amount is determined according to the corresponding convolution kernel size of the convolution layer and the corresponding target right padding amount.
  • Step 503 Extract the content information to be processed in the audio stream to be processed.
  • this step can also be performed before step 501.
  • Step 504: Input the content information to be processed into the preset audio processing model, so that the preset audio processing model processes it based on the target right-side and left-side padding amounts and obtains the corresponding timbre-changed target audio stream.
  • For ease of description, assume that the determined target right-side padding amount is the same for every convolution layer in the preset audio processing model. As shown in Figure 6, inference is performed in units of speech chunks; assume 10 items of content information (PPg) constitute 1 speech block, where the unit of PPg can be a frame, the target right-side padding amount is 1 speech block (10 PPg), the kernel size is 5 speech blocks, and the target left-side padding amount is therefore 3 speech blocks (30 PPg). The first time, 50 PPg are obtained, of which the first 40 PPg serve as the input of the preset audio processing model (for convenience of description, the interior of the model is not expanded) and the last 10 PPg are equivalent to the right-side padding data; the left side is padded with 30 PPg of value 0. The preset audio processing model outputs the audio data (wav) corresponding to 80 PPg, whose length is 80 * expected size (hope_size), and the middle 40 * hope_size of wav (corresponding to the 31st through 70th PPg) is the target audio data to be provided to the user. The second time, another 40 PPg are obtained; the last 10 PPg from the previous round together with the first 30 PPg of this round serve as the model input, the last 10 PPg of this round are equivalent to the right-side padding data, and the middle 30 PPg from the previous round serve as the left-side padding data. After the model outputs the wav corresponding to 80 PPg, the wav corresponding to the padding is removed to obtain the target audio data. The third round and onwards proceed by analogy and are not detailed here; the buffering is sketched below.
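The buffering of this walkthrough can be sketched as follows (Python with NumPy; `model` is a stand-in callable mapping an 80-PPg window to 80 * hope_size audio samples, and end-of-stream tail handling is omitted for brevity):

```python
import numpy as np

BLOCK = 10            # PPg frames per speech block, as in the walkthrough
LEFT_PAD = 3 * BLOCK  # 30 PPg of left padding
RIGHT_PAD = BLOCK     # 10 PPg of right padding (the look-ahead)
BODY = 4 * BLOCK      # 40 PPg whose audio is kept each round
HOPE_SIZE = 256       # audio samples per PPg frame (illustrative value)

def stream_convert(ppg, model):
    """Yield the user-facing audio of each round, per the Figure 6 scheme."""
    dim = ppg.shape[1]
    # Round 1: 30 zero-valued PPg on the left, then 40 body + 10 look-ahead.
    window = np.concatenate([np.zeros((LEFT_PAD, dim)), ppg[:BODY + RIGHT_PAD]])
    pos = BODY + RIGHT_PAD
    while True:
        wav = model(window)  # audio for all 80 PPg, padding included
        # Keep only the audio of the middle 40 PPg (frames 31 through 70).
        yield wav[LEFT_PAD * HOPE_SIZE:(LEFT_PAD + BODY) * HOPE_SIZE]
        nxt = ppg[pos:pos + BODY]
        if len(nxt) < BODY:  # end of stream; tail handling omitted
            break
        # Slide forward by 40 PPg: the previous window's last 30 PPg become
        # the new left padding, and the 10 newest PPg act as right padding.
        window = np.concatenate([window[BODY:], nxt])
        pos += BODY
```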
  • The audio processing method provided by the embodiments of the present application can flexibly and accurately determine the amounts of left and right padding data for the input data of the convolution layers based on the current expected delay, thereby realizing a streaming voice-changing solution with adjustable delay. While guaranteeing a relatively low delay, it achieves a timbre conversion quality close to or even exceeding that of offline inference solutions, ensuring audio quality; for application scenarios such as live broadcasts, it enables real-time voice changing during the broadcast, satisfying users' real-time voice-changing needs and improving the user experience.
  • FIG. 7 is a schematic flowchart of a model training method provided by an embodiment of the present application.
  • This embodiment can be applied to training a model for audio stream processing; the model can specifically be applied to various application scenarios with high real-time requirements, such as voice calls, audio and video live broadcasts, and multi-person online conferences.
  • the method can be executed by a model training device, which can be implemented in the form of hardware and/or software, and which can be configured in electronic equipment such as model training equipment.
  • the electronic device may be a mobile device such as a mobile phone, a smart watch, a tablet computer, or a personal digital assistant; it may also be another device such as a desktop computer.
  • the audio processing model trained using the embodiments of this application can be applied to the audio processing method provided by any embodiment of this application.
  • the method includes:
  • Step 701: Determine a right-side padding amount and a left-side padding amount corresponding to the audio processing model, wherein the audio processing model includes a convolution layer, the right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the right-side padding amount is greater than 0, and for the same convolution layer the right-side padding amount is smaller than the corresponding left-side padding amount.
  • For example, the right-side and left-side padding amounts can be preset, so that the trained audio processing model is applicable to the speech processing situation corresponding to the set padding amounts; alternatively, they can be randomly determined within a specified value range, so that the trained audio processing model is applicable to speech processing situations corresponding to different right-side and left-side padding amounts, i.e. it can be a general-purpose audio processing model.
  • Step 702: Process the sample audio stream based on the right-side padding amount, the left-side padding amount and the audio processing model to obtain a corresponding processed target sample audio stream, wherein the sample audio stream corresponds to a standard audio stream.
  • For example, the sample audio stream can be processed in a manner that does not require high timeliness and has an ideal processing effect to obtain the corresponding standard audio stream; taking timbre conversion as an example, an offline timbre conversion model can be used to perform timbre conversion on the sample audio stream to obtain the standard audio stream.
  • the standard audio stream can also be processed to obtain a corresponding sample audio stream.
  • an offline timbre conversion model can be used to perform timbre conversion processing on the standard audio stream to obtain a sample audio stream.
  • Step 703 Determine a target loss relationship based on the target sample audio stream and the standard audio stream, and train the audio processing model based on the target loss relationship.
  • The standard audio stream can be understood as the audio stream with the ideal processing effect that the model output is intended to achieve.
  • the loss relationship can be used to characterize the difference between two types of data, which can be expressed as a loss value. Specifically, it can be calculated using a loss function.
  • the target loss relationship is used to characterize the difference between the target sample audio stream and the standard audio stream.
  • the specific loss function used can be set according to actual needs.
  • The audio processing model is trained according to the target loss relationship. During training, minimizing the target loss relationship can be taken as the objective, and training techniques such as backpropagation can be used to continuously optimize the weight parameter values in the audio processing model until a preset training stop condition is met.
  • Specific training cutoff conditions can be set according to actual needs, for example, they can be set based on the number of iterations, the degree of convergence of the loss value, or the model accuracy.
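A sketch of one such training step (PyTorch assumed; `model(ppg, left_pad, right_pad)` is a hypothetical interface in which the model pads each convolution's input accordingly, and the L1 waveform loss merely stands in for whatever loss function is actually chosen):

```python
import random
import torch.nn.functional as F

def train_step(model, sample_ppg, standard_wav, optimizer, kernel_size=7):
    """One training step with a randomly drawn left-biased padding split."""
    # Valid right-side amounts: integers r with 0 < r < kernel_size - 1 - r.
    right_pad = random.randint(1, (kernel_size - 2) // 2)
    left_pad = kernel_size - 1 - right_pad

    target_sample_wav = model(sample_ppg, left_pad, right_pad)
    # Target loss relationship: difference between the target sample audio
    # stream and the standard audio stream (L1 is an assumed choice).
    loss = F.l1_loss(target_sample_wav, standard_wav)

    optimizer.zero_grad()
    loss.backward()   # backpropagation, as described above
    optimizer.step()
    return loss.item()
```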
  • The audio processing model obtained after training can serve as the preset audio processing model in the audio processing method of the above embodiments.
  • In the model training method provided by the embodiments of the present application, the audio processing model contains a convolution layer. During training, when padding the input data of the convolution layer, padding data on the right side ensures that later information can be referenced, improving the audio quality after processing, while keeping the amount of right-side padding smaller than the left-side padding effectively controls the delay, so that the trained model can achieve both low latency and good processing quality.
  • In some embodiments, the audio processing model includes a timbre addition model and a vocoder model, both of which include a convolution layer; the timbre addition model is configured to add timbre information to content information to obtain timbre content information, and the vocoder model is configured to convert the timbre content information into audio data.
  • In some embodiments, processing the sample audio stream based on the right-side padding amount, the left-side padding amount and the audio processing model to obtain the corresponding processed target sample audio stream includes: extracting the sample content information from the sample audio stream; and inputting the sample content information into the audio processing model, so that the audio processing model processes it based on the right-side and left-side padding amounts to obtain the corresponding timbre-changed target sample audio stream.
  • In some embodiments, determining the right-side padding amount and the left-side padding amount corresponding to the audio processing model includes: determining them in a random manner. For example, this may include: determining the right-side padding amount corresponding to the audio processing model in a random manner; determining the convolution kernel size corresponding to the convolution layer; and determining the left-side padding amount according to the kernel size and the right-side padding amount. The benefit of this arrangement is that the trained model is applicable to speech processing situations corresponding to different right-side and left-side padding amounts; that is, a general-purpose audio processing model can be trained, which facilitates dynamically determining the right-side padding amount according to actual application requirements and achieves flexible adjustment of the delay.
  • FIG 8 is a structural block diagram of an audio processing device provided by an embodiment of the present application.
  • The device can be implemented in software and/or hardware, can generally be integrated into electronic equipment such as audio processing equipment, and processes audio streams by executing an audio processing method.
  • the device includes: a filling quantity determination module 801 and an audio stream processing module 802.
  • The padding amount determination module 801 is configured to determine a target right-side padding amount and a target left-side padding amount corresponding to the preset audio processing model, wherein the preset audio processing model includes a convolution layer, the target right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the target left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the target right-side padding amount is greater than 0, and for the same convolution layer the target right-side padding amount is smaller than the corresponding target left-side padding amount;
  • the audio stream processing module 802 is configured to process the audio stream to be processed based on the target right padding amount, the target left padding amount, and the preset audio processing model to obtain a corresponding processed target audio stream.
  • When the audio stream to be processed needs to be processed, the audio processing device first determines the target right-side padding amount and the target left-side padding amount corresponding to the preset audio processing model, where the preset audio processing model contains a convolution layer, the target right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the target left-side padding amount indicates the number padded on the left side, and the target right-side padding amount is greater than 0 and smaller than the target left-side padding amount. Then, based on the target right-side padding amount, the target left-side padding amount and the preset audio processing model, the audio stream to be processed is processed to obtain the corresponding processed target audio stream. When padding the input data of the convolution layer, padding on the right side ensures that later information is referenced, improving the audio quality after processing, while the smaller right-side padding amount effectively controls the delay, thus achieving both low latency and good processing quality.
  • In some embodiments, the preset audio processing model includes a preset timbre addition model and a preset vocoder model, both of which include a convolution layer; the preset timbre addition model is configured to add timbre information to content information to obtain timbre content information, and the preset vocoder model is configured to convert the timbre content information into audio data;
  • the audio stream processing module includes:
  • a content information extraction unit configured to extract content information to be processed in the audio stream to be processed
  • a timbre conversion unit, configured to input the content information to be processed into the preset audio processing model, so that the preset audio processing model processes the content information based on the target right-side padding amount and the target left-side padding amount to obtain the corresponding timbre-changed target audio stream.
  • the filling quantity determination module includes:
  • a right-side padding quantity determination unit is configured to determine the target right-side padding quantity corresponding to the preset audio processing model
  • a convolution kernel size determination unit configured to determine the convolution kernel size corresponding to the convolution layer
  • the left padding quantity determining unit is configured to determine the target left padding quantity based on the convolution kernel size and the target right padding quantity.
  • the target right-side padding amount is determined based on the expected delay corresponding to the audio stream to be processed, wherein the target right-side padding amount and the expected delay are positively correlated.
  • Figure 9 is a structural block diagram of a model training device provided by an embodiment of the present application.
  • the device can be implemented by software and/or hardware, and can generally be integrated in electronic equipment such as model training equipment. Model training can be performed by executing a model training method.
  • the device includes: a quantity determination module 901, an audio processing module 902 and a model training module 903.
  • The quantity determination module 901 is configured to determine a right-side padding amount and a left-side padding amount corresponding to the audio processing model, wherein the audio processing model includes a convolution layer, the right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the right-side padding amount is greater than 0, and for the same convolution layer the right-side padding amount is smaller than the corresponding left-side padding amount;
  • The audio processing module 902 is configured to process the sample audio stream based on the right-side padding amount, the left-side padding amount and the audio processing model to obtain a corresponding processed target sample audio stream, wherein the sample audio stream corresponds to a standard audio stream;
  • the model training module 903 is configured to determine a target loss relationship based on the target sample audio stream and the standard audio stream, and train the audio processing model based on the target loss relationship.
  • In the model training device, the audio processing model contains a convolution layer. During training, when padding the input data of the convolution layer, padding data on the right side ensures that later information can be referenced, improving the audio quality after processing, while keeping the amount of right-side padding smaller than the left-side padding effectively controls the delay, so that the trained model can achieve both low latency and good processing quality.
  • In some embodiments, the audio processing model includes a timbre addition model and a vocoder model, both of which include a convolution layer; the timbre addition model is configured to add timbre information to content information to obtain timbre content information, and the vocoder model is configured to convert the timbre content information into audio data;
  • the audio processing module includes:
  • an information extraction unit configured to extract sample content information in the sample audio stream
  • a timbre changing unit, configured to input the sample content information into the audio processing model, so that the audio processing model processes the sample content information based on the right-side padding amount and the left-side padding amount to obtain the corresponding timbre-changed target sample audio stream.
  • the quantity determination module includes:
  • the random determination unit is set to randomly determine the number of right padding corresponding to the audio processing model
  • a convolution kernel determination unit configured to determine the size of the convolution kernel corresponding to the convolution layer
  • the left side quantity determining unit is configured to determine the left side padding quantity based on the convolution kernel size and the right side padding quantity.
  • the embodiment of the present application provides an electronic device, in which the audio processing device and/or the model training device provided by the embodiment of the present application can be integrated.
  • Figure 10 is a structural block diagram of an electronic device provided by an embodiment of the present application.
  • The electronic device 1000 includes a processor 1001 and a memory 1002 communicatively connected to the processor 1001, where the memory 1002 stores a computer program executable by the processor 1001; the computer program is executed by the processor 1001 so that the processor 1001 can execute the audio processing method and/or model training method described in any embodiment of the present application.
  • the number of processors may be one or more. In Figure 10, one processor is taken as an example.
  • Embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the audio processing method and/or model training method described in any embodiment of the present application.
  • Embodiments of the present application also provide a computer program product; the computer program product includes a computer program which, when executed by a processor, implements the audio processing method and/or model training method provided by the embodiments of the present application.
  • The audio processing device, model training device, equipment, storage medium and product provided in the above embodiments can execute the audio processing method or model training method provided in any embodiment of the present application, and have the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiments, reference may be made to the audio processing method or model training method provided by any embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Stereophonic System (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

An audio processing method, model training method, apparatus, device, medium and product. The method includes: determining a target right-side padding amount and a target left-side padding amount corresponding to a preset audio processing model, wherein the target right-side padding amount indicates the number of data items padded on the right side of the input data of a convolution layer, the target left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, and the target right-side padding amount is greater than 0 and smaller than the target left-side padding amount; and processing an audio stream to be processed based on the target right-side padding amount, the target left-side padding amount and the preset audio processing model to obtain a corresponding processed target audio stream.

Description

Audio processing method, model training method, apparatus, device, medium and product
This application claims priority to Chinese patent application No. 202210986541.0, filed with the Chinese Patent Office on August 17, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of audio processing technology, for example to audio processing methods, model training methods, apparatuses, devices, storage media and products.
Background
With the rapid development of audio processing technology, neural network models have been widely applied to audio processing. For example, timbre conversion is an important audio processing technology, and timbre conversion solutions based on neural network models have been widely used in various fields such as audio content generation and entertainment audio production.
Timbre conversion is a technology that converts the timbre of original audio into a target timbre while keeping the content information of the original audio unchanged. At present, audio processing technologies such as timbre conversion mostly use offline inference solutions, whereas application scenarios with high timeliness requirements, such as voice calls or live broadcasts, require streaming inference solutions. It is difficult for related streaming inference solutions to balance low latency and conversion quality.
Summary
The embodiments of the present application provide audio processing methods, model training methods, apparatuses, devices, storage media and products, which can be better suited to processing audio streams.
According to one aspect of the present application, an audio processing method is provided, the method including:
determining a target right-side padding amount and a target left-side padding amount corresponding to a preset audio processing model, wherein the preset audio processing model includes a convolution layer, the target right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the target left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the target right-side padding amount is greater than 0, and for the same convolution layer the target right-side padding amount is smaller than the corresponding target left-side padding amount;
processing an audio stream to be processed based on the target right-side padding amount, the target left-side padding amount and the preset audio processing model to obtain a corresponding processed target audio stream.
According to another aspect of the present application, a model training method is provided, the method including:
determining a right-side padding amount and a left-side padding amount corresponding to an audio processing model, wherein the audio processing model contains a convolution layer, the right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the right-side padding amount is greater than 0, and for the same convolution layer the right-side padding amount is smaller than the corresponding left-side padding amount;
processing a sample audio stream based on the right-side padding amount, the left-side padding amount and the audio processing model to obtain a corresponding processed target sample audio stream, wherein the sample audio stream corresponds to a standard audio stream;
determining a target loss relationship according to the target sample audio stream and the standard audio stream, and training the audio processing model based on the target loss relationship.
According to another aspect of the present application, an audio processing apparatus is provided, the apparatus including:
a padding amount determination module, configured to determine a target right-side padding amount and a target left-side padding amount corresponding to a preset audio processing model, wherein the preset audio processing model includes a convolution layer, the target right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the target left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the target right-side padding amount is greater than 0, and for the same convolution layer the target right-side padding amount is smaller than the corresponding target left-side padding amount;
an audio stream processing module, configured to process an audio stream to be processed based on the target right-side padding amount, the target left-side padding amount and the preset audio processing model to obtain a corresponding processed target audio stream.
According to another aspect of the present application, a model training apparatus is provided, the apparatus including:
a quantity determination module, configured to determine a right-side padding amount and a left-side padding amount corresponding to an audio processing model, wherein the audio processing model contains a convolution layer, the right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the right-side padding amount is greater than 0, and for the same convolution layer the right-side padding amount is smaller than the corresponding left-side padding amount;
an audio processing module, configured to process a sample audio stream based on the right-side padding amount, the left-side padding amount and the audio processing model to obtain a corresponding processed target sample audio stream, wherein the sample audio stream corresponds to a standard audio stream;
a model training module, configured to determine a target loss relationship according to the target sample audio stream and the standard audio stream, and to train the audio processing model based on the target loss relationship.
According to another aspect of the present application, an electronic device is provided, the electronic device including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor so that the at least one processor can execute the audio processing method and/or model training method described in any embodiment of the present application.
According to another aspect of the present application, a computer-readable storage medium is provided; the computer-readable storage medium stores a computer program which, when executed by a processor, implements the audio processing method and/or model training method described in any embodiment of the present application.
According to another aspect of the present application, a computer program product is provided; the computer program product includes a computer program which, when executed by a processor, implements the audio processing method and/or model training method described in any embodiment of the present application.
Brief Description of the Drawings
The drawings required in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a schematic flow chart of an audio processing method provided by an embodiment of the present application;
Figure 2 is a schematic diagram of a convolution method in the related technology;
Figure 3 is a schematic diagram of a convolution method provided by an embodiment of the present application;
Figure 4 is a schematic structural diagram of a preset audio processing model provided by an embodiment of the present application;
Figure 5 is a schematic flow chart of another audio processing method provided by an embodiment of the present application;
Figure 6 is a schematic principle diagram of an audio processing method provided by an embodiment of the present application;
Figure 7 is a schematic flow chart of a model training method provided by an embodiment of the present application;
Figure 8 is a structural block diagram of an audio processing apparatus provided by an embodiment of the present application;
Figure 9 is a structural block diagram of a model training apparatus provided by an embodiment of the present application;
Figure 10 is a structural block diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second", etc. in the description and claims of the present application and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that includes a series of steps or units may include other steps or units that are not clearly listed or that are inherent to such a process, method, product or device.
To facilitate understanding of the embodiments of the present application, the related technology is first introduced, taking timbre conversion as the audio processing method.
At present, there are mainly the following two streaming timbre conversion solutions:
The first is the autoregressive model solution. This type of solution uses an autoregressive framework: when predicting the information of the current step, it only uses the previous information and does not look ahead (that is, it does not use later information); during training, the size of the speech chunk needs to be set to limit the amount of look-ahead information. Although the autoregressive architecture is suitable for streaming inference, since it does not use later information, it cannot use that information to help predict the output of the current step, and it is difficult to guarantee high quality of the timbre-converted audio.
The second is the convolution model solution. This type of solution combines convolutions of different layers into one model, where the convolutions are generally conventional convolutions such as causal convolution, traditional convolution or dilated convolution. With a convolution-combination architecture, it is necessary to calculate the receptive field size of the convolutions within the entire framework, so that during streaming inference, the left and right sides (which can also be understood as front and back) are padded with an amount of data consistent with the receptive field size, that is, the amounts of data padded on the left and on the right are the same. Since voice conversion generally also requires a vocoder, and convolution is often used inside the vocoder, if both the acoustic model and the vocoder use convolution, stacking the two models enlarges the receptive field, and a large receptive field increases the delay. A common practice at present is to reduce the convolution kernels so as to shrink the receptive field and the delay, but shrinking the receptive field degrades the conversion quality.
The embodiments of the present application provide a brand-new convolution method that pads data on the right side, with the amount of right-side padding being less than the amount of left-side padding, i.e. more padding on the left than on the right; herein this convolution method is called left-biased convolution. By changing the conventional convolutions in the model used for audio stream processing to left-biased convolutions, it is possible, without shrinking the receptive field by reducing the convolution kernels, to guarantee both that later information is referenced and that the delay is effectively controlled, thus achieving both low latency and good processing quality.
Figure 1 is a schematic flow chart of an audio processing method provided by an embodiment of the present application. This embodiment is applicable to processing audio streams, and specifically to various application scenarios with high real-time requirements such as voice calls, audio and video live broadcasts, and multi-person online conferences. The method can be executed by an audio processing apparatus, which can be implemented in the form of hardware and/or software and can be configured in electronic devices such as audio processing equipment. The electronic device may be a mobile device such as a mobile phone, smart watch, tablet computer or personal digital assistant, or another device such as a desktop computer. As shown in Figure 1, the method includes:
Step 101: Determine a target right-side padding amount and a target left-side padding amount corresponding to a preset audio processing model, wherein the preset audio processing model includes a convolution layer, the target right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the target left-side padding amount indicates the number of data items padded on the left side of the input data of the convolution layer, the target right-side padding amount is greater than 0, and for the same convolution layer the target right-side padding amount is smaller than the corresponding target left-side padding amount.
In the embodiments of the present application, the preset audio processing model can be understood as a pre-trained neural network model used to process an audio stream to obtain a processed audio stream, where the audio processing corresponding to the preset audio processing model may include timbre conversion, language conversion (such as automatically converting Chinese speech to English speech), speech noise reduction, speech content replacement, spoken language recognition, and so on. The model includes one or more convolution layers, and the specific model structure can be obtained through training with the model training method provided in the embodiments of the present application.
For example, before using the preset audio processing model to process the audio stream to be processed, the target right-side padding amount and the target left-side padding amount are first determined. The audio stream to be processed can be understood as the audio stream that currently needs to be processed, which can be a real-time call voice stream, a live video voice stream, an online conference voice stream, etc. The target right-side and left-side padding amounts can be determined by reading preset values, or dynamically according to the actual situation of the audio stream to be processed, the actual audio processing requirements, network quality and other related factors. Different values of the target right-side and left-side padding amounts can correspond to different preset audio processing models, i.e. the corresponding preset audio processing model can be selected based on the determined amounts; different values can also correspond to the same general-purpose preset audio processing model, which can be trained with randomly determined right-side and left-side padding amounts so as to adapt to different value situations.
For example, when processing an audio stream, it is generally necessary to divide the audio stream into blocks to obtain multiple speech blocks, whose specific size can be set according to the actual situation, for example corresponding to 10 or 20 speech frames, and inference is performed in units of speech blocks. After data undergoes the convolution operation of a convolution layer, the amount of output data is reduced. In order to keep the amount of input data consistent each time and reduce calculation errors, before inputting data to the convolution layer, a certain amount of data needs to be padded on the left side (which can also be understood as the front) and the right side (which can also be understood as the back) of the input data, for example a certain number of steps with value 0.
In the embodiments of the present application, the target right-side padding amount (also called the target rear padding amount) indicates the number of data items padded on the right side of the input data of a convolution layer. When the preset audio processing model contains multiple convolution layers, the target right-side padding amounts corresponding to different convolution layers can be the same or different; that is, multiple target right-side padding amounts can be determined, and their number can be consistent with the number of convolution layers. The target left-side padding amount (also called the target front padding amount) indicates the number of data items padded on the left side of the input data of a convolution layer; likewise, when the model contains multiple convolution layers, the target left-side padding amounts for different layers can be the same or different. All target right-side padding amounts can be greater than 0, and for the same convolution layer, the corresponding target right-side padding amount is smaller than the target left-side padding amount; that is, each convolution layer in the preset audio processing model uses the left-biased convolution method.
Figure 2 is a schematic diagram of a convolution method in the related technology; it shows a traditional convolution method. For example, with a convolution kernel size (kernel_size) of 5 and an input time step (time_step) of 3, two data items with step value 0 are padded on each of the left and right sides, i.e. the left padding amount equals the right padding amount, and the convolution is then computed; the receptive field size at this point is 5. When processing an audio stream, the data padded on the right side should be the two upcoming speech blocks, and the input data of the convolution layer can only be obtained after the next two speech blocks have been received, so the delay is the duration of two speech blocks (other factors are ignored here).
Figure 3 is a schematic diagram of a convolution method provided by an embodiment of the present application; it shows the new unbalanced convolution method provided by this application, left-biased convolution. When processing streaming speech, what affects the delay is the receptive field on the right side of the convolution; therefore, to balance delay and audio quality, the embodiment of this application uses left-biased convolution, i.e. the receptive field of the convolution is biased to the left and more data is padded on the left than on the right. For example, with a convolution kernel size of 5 and an input time step of 3, three data items with step value 0 are padded on the left and one on the right, i.e. the left padding amount is greater than the right padding amount, and the convolution is then computed; the receptive field size is still 5. When processing an audio stream, the data padded on the right side is one upcoming speech block, and the input data of the convolution layer can be obtained after the next speech block has been received, so the delay is the duration of one speech block (other factors ignored), which is one speech block less than the traditional convolution method shown in Figure 2.
Step 102: Based on the target right-side padding amount, the target left-side padding amount and the preset audio processing model, process the audio stream to be processed to obtain a corresponding processed target audio stream.
For example, after the target right-side and left-side padding amounts are determined, the audio stream to be processed can be input to the preset audio processing model; or, after preprocessing the audio stream to be processed, the preprocessing result data can be input to the preset audio processing model. While processing the input data, the preset audio processing model pads the data that is about to be input to each convolution layer based on that layer's target right-side and left-side padding amounts and then inputs it to the layer to complete the convolution calculation; finally, the preset audio processing model outputs the processed audio stream, and the target audio stream is determined from this output, for example by removing the audio data corresponding to the padding data to obtain the target audio stream.
In the audio processing method provided by the embodiments of the present application, when the audio stream to be processed needs to be processed, the target right-side padding amount and the target left-side padding amount corresponding to the preset audio processing model are first determined, wherein the preset audio processing model contains a convolution layer, the target right-side padding amount indicates the number of data items padded on the right side of the input data of the convolution layer, the target left-side padding amount indicates the number padded on the left side, and the target right-side padding amount is greater than 0 and smaller than the target left-side padding amount; then, based on the target right-side padding amount, the target left-side padding amount and the preset audio processing model, the audio stream to be processed is processed to obtain the corresponding processed target audio stream. With this audio processing method, when padding the input data of the convolution layer, padding data on the right side ensures that later information is referenced, improving the audio quality after processing, while keeping the amount of right-side padding smaller than the left-side padding effectively controls the delay, thus achieving both low latency and good processing quality.
在一些实施例中,所述预设音频处理模型中包括预设音色添加模型(可理解为声学模型)和预设声码器模型,所述预设音色添加模型和所述预设声码器模型中均包含卷积层,所述预设音色添加模型设置为为内容信息添加音色信息,得到音色内容信息,所述预设声码器模型设置为将音色内容信息转换为音频数据。此时,预设音频处理模型可认为是预设音色转换模型,设置为对音频流进行音色转换。其中,内容信息可以理解为语音内容信息,可以利用自动语音识别技术(Automatic Speech Recognition,ASR)对音频流中的音频帧进行预处理,以将音频帧中包含的人的语音信息转换为文本信息,将文本信息作为上述内容信息。预设音色添加模型和预设声码器模型中均包含一个或多个卷积层。
图4为本申请实施例提供的一种预设音频处理模型的结构示意图,如图4所示,内容信息经过预设音色添加模型后,被添加了音色信息,再经过预设声码器模型经过上采样转换为音频数据。示例性的,预设音色添加模型可以为声 学模型(Acoustic Model,AM),具体可以是一维卷积残差网络(conv1d resnet)的声学模型;预设声码器模型(vocoder),具体可以是HiFi-GAN声码器,该声码器采用生成对抗网络(Generative Adversial Networks,GAN)作为基础生成模型,可以保证生成音频的音质。
在一些实施例中,所述基于所述目标右侧填充数量、所述目标左侧填充数量和所述预设音频处理模型,对待处理音频流进行处理,得到对应的处理后的目标音频流,包括:提取待处理音频流中的待处理内容信息;将所述待处理内容信息输入至所述预设音频处理模型,以使所述预设音频处理模型基于所述目标右侧填充数量和所述目标左侧填充数量对所述待处理内容信息进行处理,得到对应的变化音色的目标音频流。这样设置的好处在于,可以准确地保证进行音色转换后的目标音频流与待处理音频流的内容信息的一致性。
示例性的,利用ASR技术从待处理音频流中提取出待处理内容信息,将待处理内容信息输入至预设音频处理模型进行处理,通过预设音频处理模型中预设音色添加模型,在待处理内容信息基础上添加目标音色信息,得到目标音色内容信息,目标音色内容信息经预设声码器模型上采样后转换为音色变更为目标音色的目标音频流,在上述处理过程中,数据输入至各卷积层之前,基于相应卷积层对应的目标右侧填充数量和目标左侧填充数量进行填充,再输入至相应卷积层进行卷积计算。
In some embodiments, determining the target right and left padding amounts corresponding to the preset audio processing model includes: determining the target right padding amount corresponding to the preset audio processing model; determining the kernel size corresponding to the convolutional layer; and determining the target left padding amount according to the kernel size and the target right padding amount. The benefit of this arrangement is that, with a fixed kernel size, the target right and left padding amounts can be determined quickly and accurately.
Exemplarily, for a given convolutional layer, the total padding amount can be determined from the model's receptive field, which is in turn determined by the layer's kernel size; the total padding amount is generally the kernel size minus 1 (i.e., kernel_size - 1). The target right padding amount (right_pad_num) can be determined first, and the target left padding amount (left_pad_num) is then the difference between the total padding amount and the target right padding amount, i.e., left_pad_num = kernel_size - 1 - right_pad_num.
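A one-function sketch of this computation follows; the validity check reflects the value range stated later in this description for right padding amounts (greater than 0 and less than (kernel_size - 1) / 2), and the function name is illustrative.

```python
def derive_paddings(kernel_size: int, right_pad_num: int) -> tuple[int, int]:
    # Total padding is fixed at kernel_size - 1; the right amount must stay
    # in (0, (kernel_size - 1) / 2) so the convolution remains left-biased.
    if not 0 < right_pad_num < (kernel_size - 1) / 2:
        raise ValueError("right padding amount out of range")
    left_pad_num = kernel_size - 1 - right_pad_num
    return left_pad_num, right_pad_num

# e.g. kernel_size = 5, right_pad_num = 1 -> left_pad_num = 3, as in Figure 3.
```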
In some embodiments, the target right padding amount is determined according to an expected delay corresponding to the to-be-processed audio stream, where the target right padding amount is positively correlated with the expected delay. The benefit of this arrangement is that the target right padding amount can be determined quickly and reasonably according to the actual delay requirement.
Here, the expected delay can be understood as the delay one wishes to achieve, and it can be determined from the current tolerance for delay. For example, the higher the tolerance, the longer the expected delay and, correspondingly, the larger the target right padding amount can be; the lower the tolerance, the shorter the expected delay and the smaller the target right padding amount. Optionally, a correspondence between sets of right padding amounts and delays can be established in advance, yielding an amount-delay mapping; the target right padding amount is then obtained by looking up this mapping with the expected delay. The mapping can be established through experiments or the like, and each set of right padding amounts contains the right padding amount of every convolutional layer.
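As an illustration of such a lookup, the sketch below assumes delays have been measured offline for a few candidate right-padding configurations; all numbers and names are hypothetical, not values from this disclosure.

```python
# Hypothetical amount-delay mapping measured offline: each entry maps a
# measured delay (ms) to one right padding amount per convolutional layer.
AMOUNT_DELAY_MAPPING = {
    100: (1, 1, 1),   # illustrative values only
    200: (2, 1, 2),
    300: (2, 2, 3),
}

def lookup_right_pads(expected_delay_ms):
    # Choose the configuration with the largest measured delay that still
    # fits within the expected delay (more right padding -> better quality).
    feasible = [d for d in AMOUNT_DELAY_MAPPING if d <= expected_delay_ms]
    if not feasible:
        raise ValueError("expected delay below the smallest measured delay")
    return AMOUNT_DELAY_MAPPING[max(feasible)]
```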
Figure 5 is a schematic flowchart of another audio processing method provided by an embodiment of the present application, taking timbre conversion as the audio processing task, and Figure 6 is a schematic diagram of the principle of an audio processing method provided by an embodiment of the present application. This embodiment can be understood with reference to Figures 5 and 6. Building on the optional embodiments above, the method may include:
Step 501: Determine the target right padding amount corresponding to the preset audio processing model according to the expected delay corresponding to the to-be-processed audio stream.
Here, the preset audio processing model includes a preset timbre-adding model containing a first preset number (denoted M, greater than or equal to 1) of convolutional layers, and a preset vocoder model containing a second preset number (denoted N, greater than or equal to 1) of convolutional layers.
Optionally, experiments or the like can be conducted in advance to determine the different delays of the preset audio processing model when its convolutional layers use different right padding amounts, and a correspondence between sets of right padding amounts and delays is established to obtain the amount-delay mapping. For example, let the right padding amounts of the M convolutional layers be Mr1, Mr2, ..., Mrm and those of the N convolutional layers be Nr1, Nr2, ..., Nrn. Each right padding amount takes a value greater than 0 and less than the corresponding kernel size minus 1 divided by 2, i.e., (kernel_size - 1) / 2. Combining the different values of Mr1, Mr2, ..., Mrm, Nr1, Nr2, ..., Nrn yields multiple sets of right padding amounts; each set contains one combination of values of Mr1, Mr2, ..., Mrm, Nr1, Nr2, ..., Nrn (i.e., M + N values) and corresponds to one delay, giving the amount-delay mapping.
In this step, the amount-delay mapping can be looked up with the expected delay to obtain the corresponding set of target right padding amounts.
Step 502: Determine the kernel size corresponding to each convolutional layer, and determine the target left padding amount according to the kernel size and the target right padding amount.
Exemplarily, for each convolutional layer, the corresponding target left padding amount is determined from that layer's kernel size and target right padding amount.
Step 503: Extract the to-be-processed content information from the to-be-processed audio stream.
This step may also be performed before step 501.
Step 504: Input the to-be-processed content information into the preset audio processing model so that the model processes it based on the target right and left padding amounts to obtain the corresponding timbre-changed target audio stream.
For ease of explanation, assume the target right padding amounts determined for all convolutional layers in the preset audio processing model are the same. As shown in Figure 6, inference proceeds chunk by chunk: assume 10 units of content information (PPG) form one speech chunk, where a PPG unit may be one frame, the target right padding amount is one chunk (10 PPGs), the kernel size spans five chunks, and the target left padding amount is therefore three chunks (30 PPGs). In the first pass, 50 PPGs are obtained: the first 40 serve as the model input (the model internals are not expanded here for brevity), the last 10 act as the right padding, and 30 zero-valued PPGs are padded on the left. The model outputs the audio data (wav) corresponding to 80 PPGs, with a length of 80 × hop size (hope_size), of which the middle 40 × hope_size wav samples (PPGs 31 to 70) are the target audio data to be delivered to the user. In the second pass, 40 more PPGs are obtained: the last 10 PPGs of the previous pass plus the first 30 of the current pass serve as the model input, the last 10 of the current pass act as the right padding, and the middle 30 of the previous pass act as the left padding; after the model outputs the wav corresponding to 80 PPGs, the wav corresponding to the padding is removed to obtain the target audio data. The third and subsequent passes proceed in the same manner and are not repeated here.
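The scheduling above amounts to an 80-frame window sliding forward 40 frames per inference pass. Below is a simplified sketch under the same assumptions as Figure 6 (10 PPG frames per chunk, one chunk of right look-ahead); the function signature, parameter names, and the shape conventions are illustrative assumptions.

```python
import numpy as np

CHUNK = 10             # PPG frames per speech chunk (Figure 6 assumption)
CORE = 4 * CHUNK       # frames whose audio is emitted each pass
LEFT = 3 * CHUNK       # left context: three chunks
RIGHT = 1 * CHUNK      # right look-ahead: one chunk -> one chunk of latency

def stream_infer(ppg, model, hop_size):
    """Yield the target wav per pass. `ppg` is a (T, dim) array of content
    features; `model` maps an (LEFT + CORE + RIGHT, dim) window to a
    waveform of (LEFT + CORE + RIGHT) * hop_size samples."""
    # Initial left padding: zero-valued PPG frames, as in the first pass.
    padded = np.concatenate([np.zeros((LEFT, ppg.shape[1]), ppg.dtype), ppg])
    win = LEFT + CORE + RIGHT
    for start in range(0, len(padded) - win + 1, CORE):
        wav = model(padded[start:start + win])          # 80 * hop_size samples
        # Strip the audio corresponding to context and look-ahead frames.
        yield wav[LEFT * hop_size:(LEFT + CORE) * hop_size]
```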
With the audio processing method provided by this embodiment of the present application, the amounts of left and right padding for a convolutional layer's input data can be determined flexibly and precisely according to the current expected delay, realizing a streaming voice-changing scheme with adjustable delay. While keeping the delay low, it achieves timbre conversion quality that approaches or even exceeds that of offline inference schemes and guarantees audio quality. For application scenarios such as live streaming, it enables real-time voice changing during a broadcast, meeting users' real-time voice-changing needs and improving the user experience.
Figure 7 is a schematic flowchart of a model training method provided by an embodiment of the present application. This embodiment is applicable to training a model for audio stream processing; the model is particularly suited to application scenarios with high real-time requirements such as voice calls, audio/video live streaming, and multi-party online conferences. The method may be executed by a model training apparatus, which may be implemented in hardware and/or software and configured in an electronic device such as a model training device. The electronic device may be a mobile device such as a mobile phone, smart watch, tablet computer, or personal digital assistant, or another device such as a desktop computer. An audio processing model trained according to this embodiment can be applied in the audio processing method provided by any embodiment of the present application.
As shown in Figure 7, the method includes:
Step 701: Determine a right padding amount and a left padding amount corresponding to an audio processing model, where the audio processing model contains convolutional layers, the right padding amount indicates the amount of data padded on the right side of a convolutional layer's input data, the left padding amount indicates the amount of data padded on the left side of the convolutional layer's input data, the right padding amount is greater than 0, and the right padding amount corresponding to the same convolutional layer is smaller than the corresponding left padding amount.
Exemplarily, the right and left padding amounts may be preset, in which case the trained audio processing model suits the speech processing scenario corresponding to those preset amounts; they may also be determined randomly within a prescribed value range, in which case the trained model suits the speech processing scenarios corresponding to different right and left padding amounts, i.e., it can serve as a general-purpose audio processing model.
Step 702: Process a sample audio stream based on the right padding amount, the left padding amount, and the audio processing model to obtain a corresponding processed target sample audio stream, where the sample audio stream has a corresponding standard audio stream.
Exemplarily, the sample audio stream can be processed with a method that has no strict timeliness requirement and near-ideal processing quality to obtain the corresponding standard audio stream. Taking timbre conversion as an example, an offline timbre conversion model can perform timbre conversion on the sample audio stream to obtain the standard audio stream.
Exemplarily, the standard audio stream can also be processed to obtain the corresponding sample audio stream; taking timbre conversion as an example, an offline timbre conversion model can perform timbre conversion on the standard audio stream to obtain the sample audio stream.
Step 703: Determine a target loss relationship from the target sample audio stream and the standard audio stream, and train the audio processing model based on the target loss relationship.
Exemplarily, the standard audio stream can be understood as the audio stream with the ideal processing quality that the model output should achieve. A loss relationship characterizes the difference between two kinds of data, can be expressed as a loss value, and can be computed with a loss function. The target loss relationship characterizes the difference between the target sample audio stream and the standard audio stream; the specific loss function used to compute it can be set according to actual needs. The audio processing model is trained according to the target loss relationship: during training, with the goal of minimizing the target loss relationship, the weight parameters of the model are continuously optimized using training techniques such as backpropagation until a preset training stop condition is met. The stop condition can be set according to actual needs, for example based on the number of iterations, the degree of loss convergence, or the model accuracy. The audio processing model obtained after training can serve as the preset audio processing model in the audio processing method of the embodiments above.
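A single optimization step under these definitions might look like the following sketch; the model signature, the choice of an L1 waveform loss, and all names are assumptions for illustration, not the disclosed training procedure.

```python
import torch

def train_step(model, optimizer, sample_content, standard_wav,
               right_pads, left_pads):
    # The model is assumed to pad each conv layer's input with the given
    # left/right amounts (left-biased: right < left) before convolving.
    optimizer.zero_grad()
    target_sample_wav = model(sample_content, right_pads, left_pads)
    # Target loss relationship: difference between the target sample stream
    # and the standard stream, here an L1 loss on waveforms as an example.
    loss = torch.nn.functional.l1_loss(target_sample_wav, standard_wav)
    loss.backward()    # backpropagation
    optimizer.step()   # update the weight parameters
    return loss.item()
```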
With the model training method provided by this embodiment of the present application, the audio processing model contains convolutional layers, and during training, when padding the input data of a convolutional layer, padding on the right ensures that subsequent information is taken into account, improving the audio quality after processing, while keeping the right padding amount smaller than the left padding amount effectively controls the delay, so that the trained model achieves both low latency and good processing quality.
In some embodiments, the audio processing model includes a timbre-adding model and a vocoder model, both containing convolutional layers; the timbre-adding model is configured to add timbre information to content information to obtain timbre-content information, and the vocoder model is configured to convert timbre-content information into audio data. Here, processing the sample audio stream based on the right padding amount, the left padding amount, and the audio processing model to obtain the corresponding processed target sample audio stream includes: extracting sample content information from the sample audio stream; and inputting the sample content information into the audio processing model so that the model processes it based on the right and left padding amounts to obtain the corresponding timbre-changed target sample audio stream. The benefit of this arrangement is that a timbre conversion audio processing model balancing low latency and conversion quality can be trained.
In some embodiments, determining the right and left padding amounts corresponding to the audio processing model includes: determining the right and left padding amounts corresponding to the audio processing model randomly. Optionally, this may include: determining the right padding amount corresponding to the audio processing model randomly; determining the kernel size corresponding to the convolutional layer; and determining the left padding amount according to the kernel size and the right padding amount. The benefit of this arrangement is that the trained model suits the speech processing scenarios corresponding to different right and left padding amounts, i.e., a general-purpose audio processing model is obtained, making it convenient to determine the right padding amount dynamically according to actual application needs and thereby adjust the delay flexibly. A sketch of this random selection follows.
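This sketch assumes the per-layer kernel sizes are known and at least 4, so that a valid right padding exists; sampling inside (0, (k - 1) / 2) keeps every layer left-biased. Names and the example values are illustrative.

```python
import random

def random_paddings(kernel_sizes):
    # For each conv layer, sample right_pad in (0, (k - 1) / 2); left_pad
    # then follows from the fixed total k - 1, so left_pad > right_pad.
    right_pads, left_pads = [], []
    for k in kernel_sizes:
        r = random.randint(1, (k - 2) // 2)  # largest integer < (k - 1) / 2
        right_pads.append(r)
        left_pads.append(k - 1 - r)
    return right_pads, left_pads

# e.g. kernel sizes (5, 5, 7) could yield right pads [1, 1, 2], left pads [3, 3, 4]
```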
Figure 8 is a structural block diagram of an audio processing apparatus provided by an embodiment of the present application. The apparatus can be implemented in software and/or hardware and is generally integrated in an electronic device such as an audio processing device; it processes audio streams by executing the audio processing method. As shown in Figure 8, the apparatus includes a padding amount determination module 801 and an audio stream processing module 802.
The padding amount determination module 801 is configured to determine the target right and left padding amounts corresponding to the preset audio processing model, where the preset audio processing model contains convolutional layers, the target right padding amount indicates the amount of data padded on the right side of a convolutional layer's input data, the target left padding amount indicates the amount of data padded on the left side of the convolutional layer's input data, the target right padding amount is greater than 0, and the target right padding amount corresponding to the same convolutional layer is smaller than the corresponding target left padding amount.
The audio stream processing module 802 is configured to process the to-be-processed audio stream based on the target right padding amount, the target left padding amount, and the preset audio processing model to obtain the corresponding processed target audio stream.
With the audio processing apparatus provided by this embodiment of the present application, when a to-be-processed audio stream needs processing, the target right and left padding amounts corresponding to the preset audio processing model are determined first, where the model contains convolutional layers, the target right padding amount indicates the amount of data padded on the right side of a convolutional layer's input data, the target left padding amount indicates the amount padded on the left side, and the target right padding amount is greater than 0 and smaller than the target left padding amount; the to-be-processed audio stream is then processed based on these amounts and the preset audio processing model to obtain the corresponding processed target audio stream. With this apparatus, when padding the input data of a convolutional layer, padding on the right ensures that subsequent information is taken into account, improving the audio quality after processing, while keeping the right padding amount smaller than the left padding amount effectively controls the delay, so that low latency and good processing quality are achieved at the same time.
Optionally, the preset audio processing model includes a preset timbre-adding model and a preset vocoder model, both containing convolutional layers; the preset timbre-adding model is configured to add timbre information to content information to obtain timbre-content information, and the preset vocoder model is configured to convert timbre-content information into audio data.
The audio stream processing module includes:
a content information extraction unit, configured to extract the to-be-processed content information from the to-be-processed audio stream; and
a timbre conversion unit, configured to input the to-be-processed content information into the preset audio processing model so that the model processes it based on the target right and left padding amounts to obtain the corresponding timbre-changed target audio stream.
Optionally, the padding amount determination module includes:
a right padding amount determination unit, configured to determine the target right padding amount corresponding to the preset audio processing model;
a kernel size determination unit, configured to determine the kernel size corresponding to the convolutional layer; and
a left padding amount determination unit, configured to determine the target left padding amount according to the kernel size and the target right padding amount.
Optionally, the target right padding amount is determined according to the expected delay corresponding to the to-be-processed audio stream, where the target right padding amount is positively correlated with the expected delay.
Figure 9 is a structural block diagram of a model training apparatus provided by an embodiment of the present application. The apparatus can be implemented in software and/or hardware and is generally integrated in an electronic device such as a model training device; it performs model training by executing the model training method. As shown in Figure 9, the apparatus includes an amount determination module 901, an audio processing module 902, and a model training module 903.
The amount determination module 901 is configured to determine the right and left padding amounts corresponding to the audio processing model, where the audio processing model contains convolutional layers, the right padding amount indicates the amount of data padded on the right side of a convolutional layer's input data, the left padding amount indicates the amount of data padded on the left side of the convolutional layer's input data, the right padding amount is greater than 0, and the right padding amount corresponding to the same convolutional layer is smaller than the corresponding left padding amount.
The audio processing module 902 is configured to process the sample audio stream based on the right padding amount, the left padding amount, and the audio processing model to obtain the corresponding processed target sample audio stream, where the sample audio stream has a corresponding standard audio stream.
The model training module 903 is configured to determine the target loss relationship from the target sample audio stream and the standard audio stream, and to train the audio processing model based on the target loss relationship.
With the model training apparatus provided by this embodiment of the present application, the audio processing model contains convolutional layers, and during training, when padding the input data of a convolutional layer, padding on the right ensures that subsequent information is taken into account, improving the audio quality after processing, while keeping the right padding amount smaller than the left padding amount effectively controls the delay, so that the trained model achieves both low latency and good processing quality.
Optionally, the audio processing model includes a timbre-adding model and a vocoder model, both containing convolutional layers; the timbre-adding model is configured to add timbre information to content information to obtain timbre-content information, and the vocoder model is configured to convert timbre-content information into audio data.
The audio processing module includes:
an information extraction unit, configured to extract sample content information from the sample audio stream; and
a timbre changing unit, configured to input the sample content information into the audio processing model so that the model processes it based on the right and left padding amounts to obtain the corresponding timbre-changed target sample audio stream.
Optionally, the amount determination module includes:
a random determination unit, configured to determine the right padding amount corresponding to the audio processing model randomly;
a kernel determination unit, configured to determine the kernel size corresponding to the convolutional layer; and
a left amount determination unit, configured to determine the left padding amount according to the kernel size and the right padding amount.
An embodiment of the present application provides an electronic device in which the audio processing apparatus and/or the model training apparatus provided by the embodiments of the present application can be integrated. Figure 10 is a structural block diagram of an electronic device provided by an embodiment of the present application. The electronic device 1000 includes a processor 1001 and a memory 1002 communicatively connected to the processor 1001, where the memory 1002 stores a computer program executable by the processor 1001; the computer program is executed by the processor 1001 so that the processor 1001 can perform the audio processing method and/or the model training method of any embodiment of the present application. There may be one or more processors; Figure 10 takes one processor as an example.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, where the computer program is used to cause a processor, when executing it, to implement the audio processing method of any embodiment of the present application.
An embodiment of the present application further provides a computer program product including a computer program which, when executed by a processor, implements the audio processing method provided by the embodiments of the present application.
The audio processing apparatus, model training apparatus, device, storage medium, and product provided in the embodiments above can execute the audio processing method or model training method provided by any embodiment of the present application, and have the functional modules and beneficial effects corresponding to executing the method. For technical details not exhaustively described in the embodiments above, refer to the audio processing method or model training method provided by any embodiment of the present application.

Claims (12)

  1. An audio processing method, comprising:
    determining a target right padding amount and a target left padding amount corresponding to a preset audio processing model, wherein the preset audio processing model contains convolutional layers, the target right padding amount indicates the amount of data padded on the right side of a convolutional layer's input data, the target left padding amount indicates the amount of data padded on the left side of the convolutional layer's input data, the target right padding amount is greater than 0, and the target right padding amount corresponding to the same convolutional layer is smaller than the corresponding target left padding amount; and
    processing a to-be-processed audio stream based on the target right padding amount, the target left padding amount, and the preset audio processing model to obtain a corresponding processed target audio stream.
  2. The method according to claim 1, wherein the preset audio processing model includes a preset timbre-adding model and a preset vocoder model, both containing convolutional layers, the preset timbre-adding model is configured to add timbre information to content information to obtain timbre-content information, and the preset vocoder model is configured to convert timbre-content information into audio data;
    wherein processing the to-be-processed audio stream based on the target right padding amount, the target left padding amount, and the preset audio processing model to obtain the corresponding processed target audio stream comprises:
    extracting to-be-processed content information from the to-be-processed audio stream; and
    inputting the to-be-processed content information into the preset audio processing model so that the preset audio processing model processes the to-be-processed content information based on the target right padding amount and the target left padding amount to obtain a corresponding timbre-changed target audio stream.
  3. The method according to claim 1, wherein determining the target right padding amount and the target left padding amount corresponding to the preset audio processing model comprises:
    determining the target right padding amount corresponding to the preset audio processing model;
    determining the kernel size corresponding to the convolutional layer; and
    determining the target left padding amount according to the kernel size and the target right padding amount.
  4. The method according to any one of claims 1 to 3, wherein the target right padding amount is determined according to an expected delay corresponding to the to-be-processed audio stream, and the target right padding amount is positively correlated with the expected delay.
  5. A model training method, comprising:
    determining a right padding amount and a left padding amount corresponding to an audio processing model, wherein the audio processing model contains convolutional layers, the right padding amount indicates the amount of data padded on the right side of a convolutional layer's input data, the left padding amount indicates the amount of data padded on the left side of the convolutional layer's input data, the right padding amount is greater than 0, and the right padding amount corresponding to the same convolutional layer is smaller than the corresponding left padding amount;
    processing a sample audio stream based on the right padding amount, the left padding amount, and the audio processing model to obtain a corresponding processed target sample audio stream, wherein the sample audio stream has a corresponding standard audio stream; and
    determining a target loss relationship from the target sample audio stream and the standard audio stream, and training the audio processing model based on the target loss relationship.
  6. The method according to claim 5, wherein the audio processing model includes a timbre-adding model and a vocoder model, both containing convolutional layers, the timbre-adding model is configured to add timbre information to content information to obtain timbre-content information, and the vocoder model is configured to convert timbre-content information into audio data;
    wherein processing the sample audio stream based on the right padding amount, the left padding amount, and the audio processing model to obtain the corresponding processed target sample audio stream comprises:
    extracting sample content information from the sample audio stream; and
    inputting the sample content information into the audio processing model so that the audio processing model processes the sample content information based on the right padding amount and the left padding amount to obtain a corresponding timbre-changed target sample audio stream.
  7. The method according to claim 5, wherein determining the right padding amount and the left padding amount corresponding to the audio processing model comprises:
    determining the right padding amount corresponding to the audio processing model randomly;
    determining the kernel size corresponding to the convolutional layer; and
    determining the left padding amount according to the kernel size and the right padding amount.
  8. An audio processing apparatus, comprising:
    a padding amount determination module, configured to determine a target right padding amount and a target left padding amount corresponding to a preset audio processing model, wherein the preset audio processing model contains convolutional layers, the target right padding amount indicates the amount of data padded on the right side of a convolutional layer's input data, the target left padding amount indicates the amount of data padded on the left side of the convolutional layer's input data, the target right padding amount is greater than 0, and the target right padding amount corresponding to the same convolutional layer is smaller than the corresponding target left padding amount; and
    an audio stream processing module, configured to process a to-be-processed audio stream based on the target right padding amount, the target left padding amount, and the preset audio processing model to obtain a corresponding processed target audio stream.
  9. A model training apparatus, comprising:
    an amount determination module, configured to determine a right padding amount and a left padding amount corresponding to an audio processing model, wherein the audio processing model contains convolutional layers, the right padding amount indicates the amount of data padded on the right side of a convolutional layer's input data, the left padding amount indicates the amount of data padded on the left side of the convolutional layer's input data, the right padding amount is greater than 0, and the right padding amount corresponding to the same convolutional layer is smaller than the corresponding left padding amount;
    an audio processing module, configured to process a sample audio stream based on the right padding amount, the left padding amount, and the audio processing model to obtain a corresponding processed target sample audio stream, wherein the sample audio stream has a corresponding standard audio stream; and
    a model training module, configured to determine a target loss relationship from the target sample audio stream and the standard audio stream, and to train the audio processing model based on the target loss relationship.
  10. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the audio processing method according to any one of claims 1 to 4 and/or the model training method according to any one of claims 5 to 7.
  11. A computer-readable storage medium storing a computer program, wherein the computer program is used to cause a processor, when executing it, to implement the audio processing method according to any one of claims 1 to 4 and/or the model training method according to any one of claims 5 to 7.
  12. A computer program product comprising a computer program which, when executed by a processor, implements the audio processing method according to any one of claims 1 to 4 and/or the model training method according to any one of claims 5 to 7.
PCT/CN2023/111004 2022-08-17 2023-08-03 Audio processing method, model training method, apparatus, device, medium and product WO2024037348A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210986541.0 2022-08-17
CN202210986541.0A CN115346543A (zh) 2022-08-17 2022-11-15 Audio processing method, model training method, apparatus, device, medium and product

Publications (1)

Publication Number Publication Date
WO2024037348A1 true WO2024037348A1 (zh) 2024-02-22

Family

ID=83952152

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/111004 WO2024037348A1 (zh) Audio processing method, model training method, apparatus, device, medium and product

Country Status (2)

Country Link
CN (1) CN115346543A (zh)
WO (1) WO2024037348A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115346543A (zh) 2022-08-17 2022-11-15 Guangzhou Baiguoyuan Information Technology Co., Ltd. Audio processing method, model training method, apparatus, device, medium and product

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11017761B2 (en) * 2017-10-19 2021-05-25 Baidu Usa Llc Parallel neural text-to-speech
US10796686B2 (en) * 2017-10-19 2020-10-06 Baidu Usa Llc Systems and methods for neural text-to-speech using convolutional sequence learning
KR20190064803A (ko) 2017-12-01 2019-06-11 Konkuk University Industry-Academic Cooperation Foundation Language model training method using zero padding and apparatus using the same
US10770063B2 (en) * 2018-04-13 2020-09-08 Adobe Inc. Real-time speaker-dependent neural vocoder
KR102358149B1 (ko) 2021-06-18 2022-02-08 Wista Co., Ltd. Padding method in a convolutional neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200110777A1 (en) * 2017-06-28 2020-04-09 Zhejiang University System and Method of Graph Feature Extraction Based on Adjacency Matrix
CN112435684A (zh) * 2020-11-03 2021-03-02 Zhongdian Jinxin Software Co., Ltd. Speech separation method and apparatus, computer device and storage medium
CN112906624A (zh) * 2021-03-12 2021-06-04 Hefei University of Technology Video data feature extraction method based on audio-visual multimodal temporal prediction
CN112700794A (zh) * 2021-03-23 2021-04-23 Beijing Dajia Internet Information Technology Co., Ltd. Audio scene classification method and apparatus, electronic device and storage medium
CN113823272A (zh) * 2021-06-02 2021-12-21 Tencent Technology (Shenzhen) Co., Ltd. Speech processing method and apparatus, electronic device, and storage medium
CN115346543A (zh) 2022-08-17 2022-11-15 Guangzhou Baiguoyuan Information Technology Co., Ltd. Audio processing method, model training method, apparatus, device, medium and product

Also Published As

Publication number Publication date
CN115346543A (zh) 2022-11-15

Similar Documents

Publication Publication Date Title
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
WO2024037348A1 (zh) Audio processing method, model training method, apparatus, device, medium and product
CN102226944B (zh) Audio mixing method and device
US9143862B2 (en) Correlation based filter adaptation
WO2021047201A9 (zh) Speech recognition method and apparatus
CN109285554B (zh) Echo cancellation method, server, terminal and system
CN111858909A (zh) System and method for generating abstractive summaries of interleaved text
CN110379411B (zh) Speech synthesis method and apparatus for a target speaker
US11972778B2 (en) Sound-picture matching method of video, related apparatus, and storage medium
WO2020221846A1 (en) Bandwidth extension of incoming data using neural networks
CN110060696A (zh) Audio mixing method and apparatus, terminal, and readable storage medium
CN111464876A (zh) Method, apparatus and device for streaming display of translated subtitles
CN112750444A (zh) Audio mixing method and apparatus, and electronic device
SG178364A1 (en) Frequency band scale factor determination in audio encoding based upon frequency band signal energy
CN117642814A (zh) Robust direct speech-to-speech translation
CN112233649A (zh) Method, apparatus and device for dynamically synthesizing output audio for machine simultaneous interpretation
WO2021147237A1 (zh) Speech signal processing method and apparatus, electronic device, and storage medium
CN112634912A (zh) Packet loss compensation method and apparatus
WO2023011125A1 (zh) Simultaneous interpretation method, apparatus, device and storage medium
CN114373473A (zh) Simultaneous noise reduction and dereverberation via low-latency deep learning
CN117133307A (zh) Low-power monaural speech noise reduction method, computer device and computer-readable storage medium
CN114743561A (zh) Speech separation apparatus and method, storage medium, and computer device
CN114743540A (zh) Speech recognition method, system, electronic device and storage medium
CN114023352A (zh) Speech enhancement method and apparatus based on energy spectrum depth modulation
CN115798453A (zh) Speech reconstruction method and apparatus, computer device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23854252

Country of ref document: EP

Kind code of ref document: A1