CN111985643A - Training method for generating network, audio data enhancement method and related device

Info

Publication number: CN111985643A (application CN202010849195.2A; granted publication CN111985643B)
Authority: CN (China)
Prior art keywords: audio data, data, network, audio, sub
Other languages: Chinese (zh)
Inventor: 徐东
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority: CN202010849195.2A
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The training method for the generation network performs information filtering on target audio data and reference audio data by determining the discrimination network input data, obtaining sub-target audio data that meets a transform-domain setting condition and sub-reference audio data that meets the transform-domain setting condition. On this basis, the sub-target audio data and the sub-reference audio data are input into the discrimination network, so that the discrimination performed by the discrimination network is more targeted. Based on the more targeted discrimination result, the updating of the internal parameters of the generation network is also more targeted, so that the trained generation network can generate audio data in a targeted manner, realizing targeted expansion of the audio data volume.

Description

Training method for generating network, audio data enhancement method and related device
Technical Field
The present application relates to the field of audio data processing technologies, and in particular, to a training method for generating a network, an audio data enhancement method, and a related apparatus.
Background
The development of artificial intelligence technology cannot proceed without the support of data. For example, the reliability of machine learning depends on the size of the data volume participating in machine learning. Generally, the larger the amount of data, the more sufficient the machine learning and the higher the reliability. Therefore, it is necessary to secure a large data volume.
At present, the data volume is often expanded by acquiring data with the same or similar distribution through data enhancement methods (such as rotation, scaling, translation, contrast transformation, noise disturbance, etc.).
However, in the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art: in the audio field, there is a lack of data enhancement methods to expand the amount of audio data.
Disclosure of Invention
In order to solve the above technical problems, embodiments of the present application provide a training method for generating a network, an audio data enhancement method, and a related device, so as to train the generation network in a targeted manner, ensure that the trained generation network can generate audio data in a targeted manner, and achieve targeted expansion of the audio data volume. The technical scheme is as follows:
one aspect of the present application provides a training method for generating a network, including:
selecting audio data to be processed and reference audio data from a data source;
inputting the audio data to be processed and the reference audio data into a generation network in a generative adversarial network to be trained to obtain target audio data;
determining discrimination network input data when a difference between the target audio data and the reference audio data in the generation network is within a difference threshold, the discrimination network input data comprising: sub-target audio data which is obtained from the target audio data and meets a transform-domain setting condition, and sub-reference audio data which is obtained from the reference audio data and meets the transform-domain setting condition;
inputting the discrimination network input data into a discrimination network in the generative adversarial network to be trained, and acquiring a discrimination result;
and when the discrimination result of the sub-target audio data is false, updating the internal parameters of the generation network.
The process for obtaining the sub-target audio data meeting the set condition of the transform domain includes:
transforming the target audio data based on a preset Fourier transform function to obtain first frequency domain data;
taking the frequency domain data in the set bandwidth range in the first frequency domain data as sub-target audio data meeting the set conditions of a transform domain;
the obtaining process of the sub-reference audio data that meets the transform domain setting condition includes:
transforming the reference audio data based on the preset Fourier transform function to obtain second frequency domain data;
and taking the frequency domain data within the set bandwidth range in the second frequency domain data as sub-reference audio data meeting the set conditions of the transform domain.
The step of using the frequency domain data in the set bandwidth range in the first frequency domain data as the sub-target audio data meeting the set condition of the transform domain includes:
extracting audio features of a set type from the first frequency domain data in frequency domain data within a set bandwidth range, and taking the extracted audio features as sub-target audio data meeting set conditions of a transform domain;
the determining, as sub-reference audio data meeting the transform domain setting condition, frequency domain data within the set bandwidth range in the second frequency domain data, includes:
and extracting the audio features of the set type from the second frequency domain data in the frequency domain data within the set bandwidth range, and taking the extracted audio features as sub-standard audio data meeting the set conditions of the transform domain.
The process for obtaining the sub-target audio data meeting the set condition of the transform domain includes:
transforming the target audio data based on a preset constant-Q transform function to obtain third frequency domain data;
selecting data of a set type from the third frequency domain data, and taking the selected data as sub-target audio data meeting the set conditions of the transform domain;
the obtaining process of the sub-reference audio data that meets the transform domain setting condition includes:
transforming the reference audio data based on the preset constant-Q transform function to obtain fourth frequency domain data;
and selecting the data of the set type from the fourth frequency domain data, and taking the selected data as sub-reference audio data meeting the set conditions of the transform domain.
Further comprising:
and when the difference between the target audio data and the reference audio data in the generating network is not within the range of the difference threshold, updating the internal parameters of the generating network, and returning to execute the step of selecting the audio data to be processed and the reference audio data from the data source.
When multiple groups of the audio data to be processed and the reference audio data are input into the generation network in the generative adversarial network to be trained, if some discrimination results of the sub-target audio data are false but the proportion of discrimination results that are false does not reach a preset ratio threshold, the discrimination network is trained.
The training of the discrimination network comprises:
updating internal parameters of the discrimination network;
discriminating, by the discrimination network, the training sub-target audio data and the training sub-reference audio data to obtain a discrimination result;
the determination process of the training sub-target audio data comprises the following steps: inputting the audio data to be processed required for training the discrimination network into the generation network to obtain training target audio data, obtaining data meeting the set conditions of the transform domain from the training target audio data, and using the obtained data as training sub-target audio data; the determination process of the training sub-reference audio data comprises the following steps: obtaining data meeting the set conditions of the transform domain from the reference audio data required for training the discrimination network, and taking the obtained data as training sub-reference audio data;
judging whether a discrimination network loss function value is within a preset threshold range, wherein the discrimination network loss function value represents the difference between the discrimination result and a preset discrimination result;
if not, returning to the step of updating the internal parameters of the discrimination network until the discrimination network loss function value is within the range of the preset threshold value.
The method further comprises the following steps:
after the training of the discrimination network, or before the updating of the internal parameters of the generation network when the discrimination result of the sub-target audio data is false, the method further includes:
judging whether the discrimination results of the sub-target audio data have converged to false;
if yes, judging whether the generated network loss function value has converged;
if the generated network loss function value has converged, ending the training;
if the generated network loss function value has not converged, updating the internal parameters of the generation network;
or, judging whether the number of times the sub-target audio data has been output to the discrimination network reaches a set number of times;
if the set number of times is not reached, updating the internal parameters of the generation network;
if the set number of times is reached, ending the training;
or, judging whether the number of times the sub-target audio data has been output to the discrimination network reaches the set number of times;
if the set number of times is not reached, judging whether the discrimination results of the sub-target audio data have converged to false;
if yes, judging whether the generated network loss function value has converged;
if the generated network loss function value has converged, ending the training;
and if the generated network loss function value has not converged, updating the parameters of the generation network.
The selecting of the audio data to be processed and the reference audio data from the data source includes:
selecting audio data which accords with a set data format from a data source, and taking the selected audio data as reference audio data;
and randomly selecting audio data with the same number as the reference audio data from the data source, and taking the selected audio data as audio data to be processed.
The selecting audio data conforming to a set data format from a data source and using the selected audio data as reference audio data includes:
selecting audio data which accords with a set data format from a data source;
performing audio enhancement on the audio data conforming to the set data format based on a signal processing method to obtain first audio enhancement data, wherein the audio attribute of the first audio enhancement data is the same as that of the audio data conforming to the set data format;
taking the audio data conforming to the set data format and the first audio enhancement data as reference audio data;
the randomly selecting audio data from the data source, and using the selected audio data as audio data to be processed includes:
randomly selecting audio data from the data source, and taking the selected audio data as random audio data;
performing audio enhancement on the random audio data based on a signal processing method to obtain second audio enhancement data;
and taking the random audio data and the second audio enhancement data as audio data to be processed.
The audio enhancement of the audio data conforming to the set data format or the random audio data based on the signal processing method includes:
detecting the energy of each audio frame in the audio data conforming to the set data format or the random audio data;
based on the energy of each audio frame, screening out a low-energy audio frame set from the audio data conforming to the set data format or the random audio data, wherein the low-energy audio frame set is composed of a set number of audio frames with the energy lower than a set energy threshold, and the set number of audio frames with the energy lower than the set energy threshold are continuously arranged audio frames;
taking the audio frames in the audio data conforming to the set data format or the random audio data except the audio frames in each low-energy audio frame set as effective audio frames;
and combining the effective audio frames to obtain effective audio data, and performing audio enhancement on the effective audio data based on a signal processing method.
Combining a plurality of effective audio frames to obtain effective audio data, and performing audio enhancement on the effective audio data based on a signal processing method, wherein the method comprises the following steps:
determining an average power of a plurality of the valid audio frames based on the power of each of the valid audio frames;
normalizing the average power, and taking the power obtained by normalization as a target power;
respectively multiplying each effective audio frame by the target power to obtain a target effective audio frame;
merging a plurality of target effective audio frames to obtain effective audio data;
performing inversion processing and/or reversal (flipping) processing on the effective audio data;
and performing phase-inversion processing and/or random cropping on the audio data obtained after the reversal processing.
A method of audio data enhancement, comprising:
acquiring audio data to be processed;
calling a generating network, and processing the audio data to be processed to obtain target audio data, wherein the generating network is obtained by training based on the training method of the generating network of any one of claims 1-14;
and taking the target audio data as audio data enhancement data.
Another aspect of the present application provides a training apparatus for generating a network, including:
the selection module is used for selecting audio data to be processed and reference audio data from a data source;
the first acquisition module is used for inputting the audio data to be processed and the reference audio data into the generation network in the generative adversarial network to be trained to acquire target audio data;
a determination module to determine discrimination network input data when a difference between the target audio data and the reference audio data in the generation network is within a difference threshold, the discrimination network input data comprising: sub-target audio data which is obtained from the target audio data and meets a transform-domain setting condition, and sub-reference audio data which is obtained from the reference audio data and meets the transform-domain setting condition;
the second acquisition module is used for inputting the discrimination network input data into the discrimination network in the generative adversarial network to be trained and acquiring a discrimination result;
and the updating module is used for updating the internal parameters of the generation network when the discrimination result of the sub-target audio data is false, and triggering the selection module to again select audio data to be processed and reference audio data from the data source.
A third aspect of the present application provides an electronic device comprising:
a memory for storing at least one set of instructions;
a processor, configured to call and execute the instruction set in the memory, and execute the steps of the training method for generating a network according to any one of the above items by executing the instruction set.
A computer storage medium having stored thereon a computer program for execution by a processor for carrying out the steps of the training method for generating a network as claimed in any one of the preceding claims.
Compared with the prior art, the beneficial effects of this application are:
in the method, information filtering of the target audio data and the reference audio data is realized by determining the discrimination network input data, so that sub-target audio data meeting the transform-domain setting condition and sub-reference audio data meeting the transform-domain setting condition are obtained. On this basis, the sub-target audio data and the sub-reference audio data are input into the discrimination network, so that the discrimination performed by the discrimination network is more targeted. Based on the more targeted discrimination result, the updating of the internal parameters of the generation network is also more targeted, the trained generation network can generate audio data in a targeted manner, and targeted expansion of the audio data volume is realized.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
Fig. 1 is a flowchart of an embodiment 1 of a training method for generating a network according to the present application;
FIG. 2 is a schematic diagram illustrating training of a generative adversarial network provided herein;
FIG. 3 is a flow chart of embodiment 2 of a training method for generating a network according to the present application;
FIG. 4 is a flow chart of embodiment 3 of a training method for generating a network according to the present application;
FIG. 5 is a flowchart of embodiment 4 of a training method for generating a network according to the present application;
FIG. 6 is a flow chart of embodiment 5 of a training method for generating a network according to the present application;
fig. 7 is a flowchart of an embodiment 1 of an audio data enhancement method provided in the present application;
fig. 8 is a schematic diagram of a logic structure of a training apparatus for generating a network according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart of a training method for generating a network provided in embodiment 1 of the present application is schematically illustrated, where the method may be applied to an electronic device, and the present application does not limit a product type of the electronic device, and as shown in fig. 1, the method may include, but is not limited to, the following steps:
step S11, selecting the audio data to be processed and the reference audio data from the data source.
In this embodiment, selecting the audio data to be processed and the reference audio data from the data source may include:
selecting a group of audio data to be processed and reference audio data from a data source;
or, a plurality of groups of audio data to be processed and reference audio data are selected from the data source.
Selecting a set of audio data to be processed and reference audio data from a data source may include:
s111, selecting audio data which accord with a set data format from a data source, and taking the selected audio data as reference audio data;
the set data format may be set as needed, and is not limited in this application.
And S112, selecting the audio data with the same number as the reference audio data from the data source, and taking the selected audio data as the audio data to be processed.
Selecting audio data with the same number as the reference audio data from the data source, and using the selected audio data as the audio data to be processed may include:
and selecting the audio data with the same number as the reference audio data from the data source, and taking the selected audio data as the audio data to be processed used in each training.
Selecting audio data equal in number to the reference audio data from the data source and reusing the selected audio data as the audio data to be processed in each training round reduces how often audio data to be processed must be selected, improving the efficiency of training the generation network.
Of course, selecting the same number of audio data as the reference audio data from the data source, and using the selected audio data as the audio data to be processed, may also include:
and randomly selecting audio data with the same number as the reference audio data from the data source, and taking the selected audio data as the audio data to be processed.
Randomly selecting audio data equal in number to the reference audio data from the data source ensures that the selected audio data may contain both reference audio data and non-reference audio data, or only non-reference audio data, so that the generation network is trained with diversified audio data to be processed, improving the accuracy of training the generation network.
When a plurality of sets of audio data to be processed and reference audio data are selected from the data source, the selection process of each set of audio data to be processed and reference audio data may refer to the above-described related process of selecting a set of audio data to be processed and reference audio data from the data source, and will not be described herein again.
And step S12, inputting the audio data to be processed and the reference audio data into the generation network in the generative adversarial network to be trained to obtain target audio data.
After the audio data to be processed and the reference audio data are input into the generation network in the generative adversarial network to be trained, the generation network processes the audio data to be processed to obtain target audio data.
Step S13, when the difference between the target audio data and the reference audio data in the generation network is within the difference threshold range, determining discrimination network input data, the discrimination network input data including: sub-target audio data that meets the transform-domain setting condition, obtained from the target audio data, and sub-reference audio data that meets the transform-domain setting condition, obtained from the reference audio data.
The difference between the target audio data and the reference audio data in the generated network is within a difference threshold, which can be understood as: and the generated network loss function value is within the range of the preset threshold value of the generated network.
Wherein the generated network loss function value characterizes a difference between the target audio data and the reference audio data.
When a set of audio data to be processed and reference audio data is selected from the data source, and the difference between the target audio data and the reference audio data in the generation network is within the difference threshold range, the target audio data generated by the generation network meets the output requirement, and the discrimination network input data can be determined. The discrimination network input data can be understood as: the data input to the discrimination network.
When multiple sets of audio data to be processed and reference audio data are selected from the data source, the generation network processes the audio data to be processed in each set and generates target audio data for each. When the difference between each target audio data in the generation network and the corresponding reference audio data is within the difference threshold range, each target audio data generated by the generation network meets the output requirement, and the discrimination network input data can be determined. In this case, determining the discrimination network input data can be understood as: determining discrimination network input data separately for each set of audio data to be processed and reference audio data. Thus, multiple discrimination network input data need to be determined.
In this embodiment, determining the discrimination network input data for a set of audio data to be processed and reference audio data may include:
acquiring sub-target audio data meeting the set condition of the transform domain from the target audio data, acquiring sub-reference audio data meeting the set condition of the transform domain from the reference audio data, and taking the acquired sub-target audio data and sub-reference audio data as the discrimination network input data.
The process of obtaining the sub-target audio data meeting the transform domain setting condition from the target audio data may include:
s131, transforming the target audio data based on a preset Fourier transform function to obtain first frequency domain data.
And S132, regarding the frequency domain data in the set bandwidth range in the first frequency domain data as the sub-target audio data meeting the set condition of the transform domain.
The set bandwidth range may be set as needed, and is not limited herein. For example, the set bandwidth range may be set to, but is not limited to: less than 3000 Hz.
In this embodiment, regarding the frequency domain data within the set bandwidth range in the first frequency domain data as the sub-target audio data meeting the set condition of the transform domain, the method may include:
and extracting the audio features of the set type from the frequency domain data within the set bandwidth range from the first frequency domain data, and taking the extracted audio features as sub-target audio data meeting the set conditions of the transform domain.
And extracting the audio features of the set type from the first frequency domain data in the frequency domain data within the set bandwidth range, and using the extracted audio features as sub-target audio data meeting the set conditions of the transform domain to realize further filtering of the audio data.
The setting type can be set according to the needs, and is not limited in the application.
Corresponding to steps S131 to S132, obtaining sub-reference audio data that meets the transform domain setting condition from the reference audio data may include:
s133, transform the reference audio data based on a preset fourier transform function to obtain second frequency domain data.
The preset Fourier transform function in this step is the same as the preset Fourier transform function in step S131.
And S134, taking the frequency domain data in the set bandwidth range in the second frequency domain data as the sub-reference audio data meeting the set conditions of the transform domain.
The set bandwidth range in this step is the same as the set bandwidth range in step S132.
In this embodiment, regarding the frequency domain data within the set bandwidth range in the second frequency domain data as the sub-reference audio data meeting the set condition of the transform domain, the method may include:
and extracting the audio features of the set type from the frequency domain data within the set bandwidth range from the second frequency domain data, and taking the extracted audio features as sub-standard audio data which accord with the set conditions of the transform domain.
The setting type in this step is the same as that in step S132.
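As an illustration of steps S131 to S134, the following is a minimal Python sketch of the Fourier-transform branch. The function name, the use of numpy's real FFT, the 44.1 kHz sampling rate, and the 3000 Hz cutoff (the example bandwidth given above) are assumptions for illustration, not the patent's prescribed implementation.

```python
import numpy as np

def sub_audio_fft(audio, sample_rate, max_freq=3000.0):
    # Steps S131/S133: transform the audio based on a preset Fourier
    # transform function to obtain frequency domain data.
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    # Steps S132/S134: keep only the frequency domain data within the
    # set bandwidth range (here, below 3000 Hz).
    return spectrum[freqs < max_freq]

# The same function is applied to both signals so that the sub-target
# and sub-reference audio data are directly comparable.
rng = np.random.default_rng(0)
target_audio = rng.standard_normal(44100)     # placeholder target audio data
reference_audio = rng.standard_normal(44100)  # placeholder reference audio data
sub_target = sub_audio_fft(target_audio, 44100)
sub_reference = sub_audio_fft(reference_audio, 44100)
```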
In this embodiment, the process of obtaining the sub-target audio data meeting the transform domain setting condition from the target audio data may also include:
s135, changing the target audio data based on a preset constant Q change function to obtain third frequency domain data;
s136, selecting data of a set type from the third frequency domain data, and taking the selected data as sub-target audio data meeting the set conditions of the transform domain.
In this embodiment, the setting type may be set according to needs, and is not limited herein. For example, the setting type may be set as a spectral peak.
Corresponding to steps S135 to S136, obtaining sub-reference audio data that meets the transform domain setting condition from the reference audio data may include:
and S137, changing the reference audio data based on a preset constant Q change function to obtain fourth frequency domain data.
The constant Q change function preset in this step is the same as the constant Q change function preset in step S135.
And S138, selecting data of a set type from the fourth frequency domain data, and taking the selected data as sub-reference audio data meeting set conditions of a transform domain.
The setting type in this step is the same as that in step S136.
In this embodiment, the target audio data and the reference audio data are processed in the same data processing manner (e.g., steps S131 to S132 and steps S133 to S134 are the same data processing manner, and steps S135 to S136 and steps S137 to S138 are the same data processing manner), so as to ensure the consistency of the processing, and further ensure that the obtained sub-target audio data and the sub-reference audio data can be reliably compared.
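Similarly, a hedged sketch of the constant-Q branch (steps S135 to S138) is shown below. The use of librosa for the constant-Q transform and the simple neighbour-comparison peak picker are assumptions; the text above only names "spectral peak" as an example of the set type.

```python
import numpy as np
import librosa  # assumed available; any constant-Q transform implementation works

def sub_audio_cqt(audio, sample_rate):
    # Steps S135/S137: transform the audio with a constant-Q transform
    # function to obtain frequency domain data (magnitude only here).
    cqt = np.abs(librosa.cqt(audio, sr=sample_rate))
    # Steps S136/S138: select data of the set type; here the set type is
    # "spectral peak", taken as bins louder than both frequency neighbours.
    peaks = (cqt[1:-1] > cqt[:-2]) & (cqt[1:-1] > cqt[2:])
    return cqt[1:-1][peaks]

y = np.random.default_rng(0).standard_normal(2 * 22050)  # placeholder audio
sub_data = sub_audio_cqt(y, 22050)
```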
When multiple sets of audio data to be processed and reference audio data are obtained from the data source, the detailed process of determining the discrimination network input data for each set can refer to the above description of determining the discrimination network input data for a single set of audio data to be processed and reference audio data, and is not repeated herein.
And step S14, inputting the discrimination network input data into the discrimination network in the generative adversarial network to be trained, and acquiring a discrimination result.
As shown in fig. 2, when the difference between the target audio data and the reference audio data of the generation network is within the difference threshold range, sub-target audio data meeting the set condition of the transform domain is obtained from the target audio data, sub-reference audio data meeting the set condition of the transform domain is obtained from the reference audio data, and the sub-target audio data and the sub-reference audio data are input to the discrimination network.
After receiving the discrimination network input data, the discrimination network discriminates the sub-target audio data in the discrimination network input data to obtain a discrimination result.
The process of discriminating the sub-target audio data in the discrimination network input data may include:
comparing whether the sub-target audio data in the discrimination network input data is consistent with the sub-reference audio data;
if not, determining that the sub-target audio data is false;
and if so, determining that the sub-target audio data is true.
When a set of audio data to be processed and reference audio data is acquired from the data source, correspondingly, one discrimination network input data is input into the discrimination network in the generative adversarial network to be trained, and one discrimination result is acquired.
When multiple sets of audio data to be processed and reference audio data are acquired from the data source, correspondingly, multiple discrimination network input data are input into the discrimination network in the generative adversarial network to be trained, and the discrimination network acquires a discrimination result for each discrimination network input data.
And step S15, when the discrimination result of the sub-target audio data is false, updating the internal parameters of the generation network.
When a set of audio data to be processed and reference audio data is acquired from the data source, one discrimination result is obtained; when that discrimination result of the sub-target audio data is false, the internal parameters of the generation network may be updated.
When multiple sets of audio data to be processed and reference audio data are acquired from the data source, multiple discrimination results are obtained; in this case, "the discrimination result of the sub-target audio data is false" can be understood as: the discrimination results of the sub-target audio data are all false.
When one discrimination result is obtained and the discrimination result of the sub-target audio data is true, the sub-target audio data can be considered consistent with the sub-reference audio data, and the internal parameters of the generation network need not be updated.
When multiple discrimination results are obtained and the discrimination results of the sub-target audio data are all true, all the sub-target audio data can be considered consistent with the corresponding sub-reference audio data, and the internal parameters of the generation network need not be updated further.
In this embodiment, information filtering is performed on the target audio data and the reference audio data by determining the discrimination network input data, obtaining sub-target audio data and sub-reference audio data that meet the set condition of the transform domain. On this basis, the sub-target audio data and the sub-reference audio data are input into the discrimination network, so that the discrimination performed by the discrimination network is more targeted. Based on the more targeted discrimination result, the updating of the internal parameters of the generation network is also more targeted, ensuring that the trained generation network generates audio data in a targeted manner and realizing targeted expansion of the audio data volume.
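To make the control flow of steps S11 to S15 concrete, the following PyTorch sketch ties the pieces together. The toy network architectures, the L1 difference measure, both 0.5 thresholds, and the FFT-based filter are all illustrative assumptions; the patent does not prescribe architectures, loss functions, or threshold values, and for brevity the sketch discriminates only the sub-target data.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 64), nn.Tanh())    # toy generation network
D = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())  # toy discrimination network
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()

def transform_domain_filter(x):
    # Stand-in for step S13: move to the frequency domain and keep only
    # the lower bins (the "set bandwidth range").
    return torch.fft.rfft(x).abs()[..., :16]

for step in range(200):
    to_process = torch.randn(8, 64)   # step S11: audio data to be processed
    reference = torch.randn(8, 64)    # step S11: reference audio data
    target = G(to_process)            # step S12: target audio data

    diff = nn.functional.l1_loss(target, reference)
    if diff.item() > 0.5:             # difference outside threshold: update G
        opt_g.zero_grad(); diff.backward(); opt_g.step()
        continue                      # and reselect data (cf. step S24)

    score = D(transform_domain_filter(target))   # steps S13-S14
    if (score < 0.5).all():           # step S15: discrimination result false
        g_loss = bce(score, torch.ones_like(score))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```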
As another alternative embodiment of the present application, as shown in fig. 3, a flowchart of an embodiment 2 of a training method for generating a network provided by the present application is shown, where this embodiment is mainly an extension of the training method for generating a network described in the above embodiment 1, and the method may include, but is not limited to, the following steps:
step S21, selecting audio data to be processed and reference audio data from a data source;
and step S22, inputting the audio data to be processed and the reference audio data into a generation network in the generation-type confrontation network to be trained to obtain target audio data.
The detailed procedures of steps S21-S22 can be found in the related descriptions of steps S11-S12 in embodiment 1, and are not repeated herein.
Step S23, determining whether the difference between the target audio data and the reference audio data in the generating network is within the difference threshold range.
In this embodiment, whether the difference between the target audio data and the reference audio data in the generation network is within the difference threshold range may be determined by determining whether the generation network loss function value is within the preset threshold range of the generation network.
Wherein the generated network loss function value characterizes a difference between the target audio data and the reference audio data.
When the difference between the target audio data and the reference audio data in the generation network is not within the difference threshold range, step S24 may be performed; when the difference between the target audio data and the reference audio data in the generation network is within the difference threshold range, step S25 may be performed.
Step S24, updating the internal parameters of the generated network, and returning to step S21.
Step S25, determining discrimination network input data, the discrimination network input data including: sub-target audio data that meets the transform-domain setting condition, obtained from the target audio data, and sub-reference audio data that meets the transform-domain setting condition, obtained from the reference audio data.
And step S26, inputting the discrimination network input data into the discrimination network in the generative adversarial network to be trained, and acquiring a discrimination result.
And step S27, judging whether the discrimination result of the sub-target audio data is false.
If false, step S24 is performed.
The detailed procedures of steps S25-S27 can be found in the related descriptions of steps S13-S15 in embodiment 1, and are not repeated herein.
In this embodiment, the generation network itself determines whether the difference between the target audio data and the reference audio data is within the difference threshold range, which provides a target for the training of the generation network and helps ensure the training performance of the generation network.
As another alternative embodiment of the present application, as shown in fig. 4, a flowchart of an embodiment 3 of a training method for generating a network provided by the present application is shown, where this embodiment is mainly a refinement scheme of the training method for generating a network described in the above embodiment 1, and the method may include, but is not limited to, the following steps:
step S31, selecting multiple sets of audio data to be processed and reference audio data from the data source.
And step S32, inputting the multiple groups of audio data to be processed and reference audio data into the generation network in the generative adversarial network to be trained to obtain multiple target audio data.
Step S33, when the difference between each target audio data in the generation network and the corresponding reference audio data is within the difference threshold range, determining multiple discrimination network input data, each discrimination network input data respectively including: sub-target audio data which is obtained from the target audio data and meets the set condition of the transform domain, and sub-reference audio data which is obtained from the reference audio data and meets the set condition of the transform domain;
Step S34, inputting the multiple discrimination network input data into the discrimination network in the generative adversarial network to be trained, and obtaining multiple discrimination results.
After the multiple discrimination network input data are input into the discrimination network in the generative adversarial network to be trained, the discrimination network discriminates each discrimination network input data to obtain a discrimination result. For example, when 128 sets of audio data to be processed and reference audio data are input to the generation network, the generation network obtains 128 target audio data and 128 discrimination network input data are determined; accordingly, the discrimination network obtains 128 discrimination results.
Step S35, determining whether, among the multiple discrimination results, there is a discrimination result in which the sub-target audio data is judged to be false.
If so, go to step S36.
Step S36, determining whether the proportion of discrimination results in which the sub-target audio data is judged to be false reaches a preset ratio threshold.
The proportion of discrimination results in which the sub-target audio data is judged to be false can be understood as: the ratio of the number of discrimination results in which the sub-target audio data is false to the total number of discrimination results.
If the preset ratio threshold is reached, executing step S37; if the ratio does not reach the preset ratio threshold, step S38 is executed.
The preset proportion threshold value can be set according to needs, and is not limited in the application. For example, the preset proportion threshold may be set as: 95%, 98% or 100%.
And step S37, updating the internal parameters of the generated network.
And step S38, training the discrimination network.
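A small sketch of this routing decision (steps S35 to S38) follows; the list-of-booleans representation and the 95% threshold (one of the example values given above) are assumptions.

```python
def route_training(results, ratio_threshold=0.95):
    # `results` holds one boolean per discrimination result; True means the
    # sub-target audio data was judged to be false (detected as generated).
    false_ratio = sum(results) / len(results)
    if false_ratio >= ratio_threshold:
        return "update generation network"   # step S37
    return "train discrimination network"    # step S38

# 126 of 128 results false: the ratio (98.4%) reaches the 95% threshold.
print(route_training([True] * 126 + [False] * 2))
```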
In this embodiment, the process of training the discrimination network may include:
S381, updating the internal parameters of the discrimination network.
S382, discriminating, by the discrimination network, the training sub-target audio data and the training sub-reference audio data to obtain a discrimination result.
The determination process of the training sub-target audio data may be:
inputting the audio data to be processed required for training the discrimination network into the generation network to obtain training target audio data, obtaining data meeting the set conditions of the transform domain from the training target audio data, and using the obtained data as training sub-target audio data.
The determination process of the training sub-reference audio data may be:
obtaining data meeting the set conditions of the transform domain from the reference audio data required for training the discrimination network, and taking the obtained data as training sub-reference audio data.
S383, judging whether the discrimination network loss function value is within a preset threshold range, wherein the discrimination network loss function value represents the difference between the discrimination result and a preset discrimination result.
If not, step S381 is executed again until the discrimination network loss function value is within the preset threshold range.
In this embodiment, the preset threshold range may be set as needed, and is not limited in this application.
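A hedged PyTorch sketch of this inner loop (steps S381 to S383) is given below. The toy discriminator, the binary cross-entropy loss with all-false / all-true preset discrimination results, the 0.5 threshold, and the random stand-in data are all assumptions.

```python
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())  # toy discrimination network
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

sub_target = torch.randn(8, 16)     # training sub-target audio data (generated)
sub_reference = torch.randn(8, 16)  # training sub-reference audio data (real)

for step in range(5000):
    pred_fake = D(sub_target)       # S382: discriminate both kinds of data
    pred_real = D(sub_reference)
    loss = (bce(pred_fake, torch.zeros_like(pred_fake)) +
            bce(pred_real, torch.ones_like(pred_real)))
    if loss.item() <= 0.5:          # S383: loss within the preset threshold
        break
    opt_d.zero_grad()               # S381: update the discrimination network's
    loss.backward()                 # internal parameters and loop again
    opt_d.step()
```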
In this embodiment, when the proportion of discrimination results in which the sub-target audio data is judged to be false does not reach the preset ratio threshold, the discrimination network is trained, improving its discrimination performance. Training the generation network on the basis of accurate discrimination results from the discrimination network can, in turn, improve the training performance of the generation network.
As another alternative embodiment of the present application, as shown in fig. 5, a flowchart of an embodiment 4 of a training method for generating a network provided by the present application is shown, where this embodiment is mainly an extension of the training method for generating a network described in the above embodiment 2, and the method may include, but is not limited to, the following steps:
step S41, selecting audio data to be processed and reference audio data from a data source;
and step S42, inputting the audio data to be processed and the reference audio data into a generation network in the generative confrontation network to be trained to obtain target audio data.
Step S43, determining whether the difference between the target audio data and the reference audio data in the generating network is within the difference threshold range.
When the difference between the target audio data and the reference audio data in the generation network is not within the difference threshold range, step S44 may be performed; when the difference between the target audio data and the reference audio data in the generation network is within the difference threshold range, step S45 may be performed.
Step S44, updating the internal parameters of the generated network, and returning to step S41.
Step S45, determining discrimination network input data, the discrimination network input data including: sub-target audio data that meets the transform-domain setting condition, obtained from the target audio data, and sub-reference audio data that meets the transform-domain setting condition, obtained from the reference audio data.
And step S46, inputting the discrimination network input data into the discrimination network in the generative adversarial network to be trained, and acquiring a discrimination result.
And step S47, judging whether the discrimination result of the sub-target audio data is false.
When the discrimination result of the sub-target audio data is false, executing step S44; when the discrimination result of the sub-target audio data is true, the sub-target audio data may be considered consistent with the sub-reference audio data, indicating that the discrimination network cannot distinguish the sub-target audio data and that the discrimination accuracy of the discrimination network needs to be improved, so step S48 may be executed.
And step S48, training the discrimination network.
The detailed procedures of steps S41 to S47 can be found in the related descriptions of steps S21 to S27 in embodiment 2, and step S48 corresponds to step S38 in embodiment 3; they are not described herein again.
Step S49, determining whether the generative adversarial network to be trained satisfies the training end condition.
In this embodiment, determining whether the generative adversarial network to be trained satisfies the training end condition may include:
s491, determining whether the result of the sub-target audio data discrimination is false or not.
In this embodiment, it is necessary to determine whether the discrimination result of the sub-target audio data is false or not among the discrimination results according to the discrimination results of multiple times.
Judging whether the discrimination result of the sub-target audio data is false or not convergence in the discrimination result, which can be understood as: in the continuous multiple authentication results, the authentication results of the sub-target audio data are all false.
If the convergence indicates that the accuracy of the network authentication reaches the set requirement, step S492 may be executed; if the network identification is not converged, the identification accuracy of the identification network does not meet the set requirement, and at least the identification network needs to be trained, so that the training end condition can be determined not to be met.
S492, judging whether the generated network loss function value is converged.
In this embodiment, whether the generated network loss function value converges or not may be determined according to the generated network loss function value obtained by the current calculation and the generated network loss function value obtained by the previous calculation.
If the generated network loss function value is converged, the condition of finishing training can be determined to be met; if the generated network loss function value does not converge, it may be determined that the training end condition is not satisfied.
In this embodiment, when the discrimination results of the sub-target audio data have converged to false and the generated network loss function value has converged, it is determined that the training end condition is satisfied, ensuring that training ends only when the performance of the generative adversarial network meets the requirement.
Of course, determining whether the generative adversarial network to be trained satisfies the training end condition may also include:
s493, determining whether the number of times of outputting the sub-target audio data to the authentication network reaches a set number of times.
If the set times are not reached, the condition that the training end condition is not met can be determined; if the number of times is reached, it may be determined that the training end condition is satisfied.
The set number of times can be set as required, and is not limited in the application.
Alternatively, determining whether the generative adversarial network to be trained satisfies the training end condition may include:
S494, judging whether the number of times the sub-target audio data has been output to the discrimination network reaches the set number of times;
if the set number of times is not reached, step S495 is executed.
The set number of times can be set as required, and is not limited in the application.
S495, judging whether the discrimination results of the sub-target audio data have converged to false.
If yes, go to step S496; if not, it may be determined that the training end condition is not satisfied.
And S496, judging whether the generated network loss function value is converged.
If the generated network loss function value is converged, the condition of finishing training can be determined to be met; if the generated network loss function value does not converge, it may be determined that the training end condition is not satisfied.
In this embodiment, when the set number of times has not been reached, if the discrimination results of the sub-target audio data have converged to false and the generated network loss function value has converged, it may be determined that the training end condition is satisfied. Training can thus end early, before the set number of times is reached, and the performance of the generative adversarial network can still be guaranteed while preserving training efficiency.
In this embodiment, if the training end condition is satisfied, step S410 is executed; if the training end condition is not satisfied, the process returns to step S41.
And step S410, finishing training.
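The three end-of-training variants (steps S491 to S496) can be condensed into one hedged helper; the history windows and the convergence tolerance are assumptions.

```python
def training_finished(false_history, g_losses, rounds, max_rounds=None,
                      window=5, tol=1e-4):
    # false_history: recent discrimination results (True = judged false);
    # g_losses: recent generated network loss function values.
    if max_rounds is not None and rounds >= max_rounds:        # S493/S494
        return True
    converged_to_false = (len(false_history) >= window and
                          all(false_history[-window:]))        # S491/S495
    loss_converged = (len(g_losses) >= 2 and
                      abs(g_losses[-1] - g_losses[-2]) < tol)  # S492/S496
    return converged_to_false and loss_converged

print(training_finished([True] * 6, [0.3100005, 0.3100004],
                        rounds=10, max_rounds=100))  # -> True
```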
In this embodiment, when the discrimination result of the sub-target audio data is true, the discrimination network is trained first, and it is then determined whether the generative adversarial network to be trained satisfies the training end condition, so as to end or continue the training. This ensures that both the generation network and the discrimination network are trained, improving the training precision of the generative adversarial network to be trained.
As another alternative embodiment of the present application, as shown in fig. 6, a flowchart of an embodiment 5 of a training method for generating a network provided by the present application is shown, where this embodiment is mainly a refinement scheme of the training method for generating a network described in the above embodiment 1, and the method may include, but is not limited to, the following steps:
step S51, selecting audio data conforming to the set data format from the data source.
Step S52, performing audio enhancement on the audio data conforming to the set data format based on the signal processing method, to obtain first audio enhancement data.
In this embodiment, the audio attribute of the audio data is not changed in the process of performing audio enhancement on the audio data conforming to the set data format based on the signal processing method, so that it is ensured that the audio attribute of the first audio enhancement data is the same as the audio attribute of the audio data conforming to the set data format.
Audio attributes may include, but are not limited to: format (e.g., mp3, wav, flac, or ogg), channel number (e.g., single channel, stereo, or 5.1 channels), or sampling rate (e.g., 44.1kHz, 48kHz, or 32 kHz).
In this embodiment, the process of performing audio enhancement on the audio data conforming to the set data format based on the signal processing method may include:
s521, detecting the energy of each audio frame in the audio data conforming to the set data format.
The process of detecting the energy of each audio frame in the audio data conforming to the set data format may include:
S5211, when the audio data conforming to the set data format is read, the values of a plurality of waveform sampling points are obtained through an audio reading algorithm.
The data formed by the acquired waveform sampling points is the audio data which accords with the set data format.
S5212, grouping the waveform sampling points to obtain a plurality of groups of waveform sampling points, each group of waveform sampling points forming an audio frame. The number of waveform sampling points in each group may be, but is not limited to, 1024, and the overlap rate between adjacent audio frames may be, but is not limited to, 50%.
If the value of each audio frame (i.e., the values of each group of waveform sampling points) is denoted frame, the audio data may be represented as frames = [frame1, frame2, …].
S5213, calculating the energy of each audio frame:
the sum of squares of the values of the waveform sampling points in each audio frame can be calculated separately and taken as the energy of each audio frame.
In this embodiment, the energy of the i-th audio frame may be denoted Ei, and the energy of all the frames may be represented as E = [E1, E2, …].
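As an illustrative sketch only (the reading library is an assumption, and the frame length and overlap are the values stated above), steps S5211-S5213 could be realized as follows:

```python
import numpy as np
import soundfile as sf  # assumed audio reading library

def frame_energies(path, frame_len=1024, overlap=0.5):
    """Read waveform sampling points, group them into overlapping audio
    frames, and compute each frame's energy as the sum of squares of its
    sample values (steps S5211-S5213)."""
    samples, sr = sf.read(path)           # values of the waveform sampling points
    if samples.ndim > 1:                  # mix multi-channel audio down to mono
        samples = samples.mean(axis=1)
    hop = int(frame_len * (1 - overlap))  # 50% overlap -> hop of 512 samples
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    energies = [float(np.sum(f ** 2)) for f in frames]  # Ei per audio frame
    return frames, energies
```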
S522, based on the energy of each audio frame, a low-energy audio frame set is screened from the audio data conforming to the set data format.
The low-energy audio frame set is composed of a set number of audio frames with energy lower than a set energy threshold, and the set number of audio frames with energy lower than the set energy threshold are continuously arranged audio frames.
The audio frames in a low-energy audio frame set may be understood as mute frames.
The set number can be set as required, and is not limited herein.
For example, if the energy of each audio frame from the 21 st frame to the 30 th frame is lower than the set energy threshold, the energy of each audio frame from the 31 st frame to the 40 th frame is lower than the set energy threshold, and the energy of each audio frame from the 101 st frame to the 110 th frame is lower than the set energy threshold, the set of the 21 st frame to the 30 th frame is a low-energy audio frame set, the set of the 31 st frame to the 40 th frame is a low-energy audio frame set, and the set of the 101 st frame to the 110 th frame is a low-energy audio frame set.
S523, the audio frames in the audio data conforming to the set data format except the audio frames in each low-energy audio frame set are taken as valid audio frames.
In this embodiment, the audio frames in the audio data conforming to the set data format except the audio frame in each low-energy audio frame set are used as effective audio frames, so that the accuracy of the effective audio frames can be improved.
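A minimal sketch of the screening in steps S522-S523, assuming an energy threshold and a run length of 10 frames as in the example above (both values are placeholders):

```python
def screen_valid_frames(frames, energies, energy_thresh, run_len=10):
    """Mark runs of at least `run_len` consecutive audio frames whose energy
    is below the threshold as low-energy (mute) frame sets, and return the
    remaining frames as valid audio frames (steps S522-S523)."""
    low = [e < energy_thresh for e in energies]
    mute = [False] * len(frames)
    i = 0
    while i < len(frames):
        if low[i]:
            j = i
            while j < len(frames) and low[j]:
                j += 1                      # extend the consecutive low-energy run
            if j - i >= run_len:            # a low-energy audio frame set
                mute[i:j] = [True] * (j - i)
            i = j
        else:
            i += 1
    return [f for f, m in zip(frames, mute) if not m]
```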
And S524, combining the plurality of effective audio frames to obtain effective audio data, and performing audio enhancement on the effective audio data based on a signal processing method.
The process of merging the multiple valid audio frames to obtain valid audio data may include:
S5241, an average power of the plurality of valid audio frames is determined based on the power of each valid audio frame.

In this embodiment, the average power of the plurality of valid audio frames may be determined based on the following relation:

P = (1/N) × Σ|frai|, i = 1, …, N

where P represents the average power of the plurality of valid audio frames, frai represents the i-th valid audio frame, |frai| represents the power of that frame (the sum of the squares of the absolute values of its sample values), Σ represents the summation function, and N represents the number of valid audio frames.
S5242, the average power is normalized, and the power obtained by the normalization is used as the target power.
In this embodiment, the following relation may be used to normalize the average power, with the power obtained by the normalization used as the target power:

const = alpha / P

where const denotes the target power, P denotes the average power of the plurality of valid audio frames, and alpha denotes a preset average power threshold.
S5243, multiplying each valid audio frame by the target power respectively, to obtain the target valid audio frames.

In this embodiment, the following relation may be used:

FraE = const × frai

where FraE denotes a target valid audio frame, const denotes the target power, and frai denotes a valid audio frame.
S5244, merging the target valid audio frames to obtain valid audio data.
Merging the target valid audio frames to obtain valid audio data, which may include:
and removing the overlapped part of the target effective audio frames based on the overlapping rate of the target effective audio frames, and combining the effective audio frames with the overlapped parts removed according to the time sequence to obtain effective audio data.
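Under the reconstruction above (with const = alpha / P taken as the normalization relation, itself an assumption consistent with the surrounding definitions), steps S5241-S5244 might be sketched as follows; the overlap handling mirrors the 50% framing assumed earlier:

```python
import numpy as np

def normalize_and_merge(valid_frames, alpha, overlap=0.5):
    """Scale the valid audio frames to a target power and merge them back
    into valid audio data, dropping the overlapped parts (steps S5241-S5244)."""
    powers = [float(np.sum(np.abs(f) ** 2)) for f in valid_frames]  # |fra_i|
    avg_power = sum(powers) / len(valid_frames)                     # P
    const = alpha / avg_power                   # target power (assumed relation)
    scaled = [const * f for f in valid_frames]  # FraE = const x fra_i
    hop = int(len(scaled[0]) * (1 - overlap))
    # keep the non-overlapped head of every frame, plus the tail of the last
    pieces = [f[:hop] for f in scaled[:-1]] + [scaled[-1]]
    return np.concatenate(pieces)               # frames merged in time order
```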
Following steps S5241-S5244, the process of performing audio enhancement on the valid audio data based on the signal processing method may include:
S5245, performing inversion processing on the valid audio data.
In this embodiment, the following relation may be used to perform inverse processing on the valid audio data:
z1=-z;
z denotes valid audio data, z1 denotes audio data obtained by inverting the valid audio data, and-z denotes inverting the valid audio data.
S5246, randomly cutting the audio data obtained after the phase inversion processing.
The process of randomly cutting the audio data obtained after the inverse processing may include:
S52461, acquiring the total duration of the audio data obtained after the inversion processing;

S52462, based on a random number algorithm, randomly selecting R different moments between 0 and the total duration, sorting the selected moments to obtain R sorted moments, and, taking each moment as a starting point, selecting an audio segment of a set time length from the audio data obtained after the inversion processing.
For example, suppose the total duration of the audio data obtained after the inversion processing is 4 minutes, R is 3, and the set time length is 10 seconds. Three moments, say the 0.5th minute, the 1st minute, and the 2nd minute, are randomly selected within 0 to 4 minutes, and a 10-second audio segment is selected from the audio data obtained after the inversion processing starting at each of the three moments.
Randomly cutting the audio data obtained after the inversion processing yields a plurality of shorter audio segments, which facilitates subsequent processing and improves processing efficiency.
In this embodiment, the process of performing audio enhancement on the valid audio data based on the signal processing method may also include:
and S5247, turning over the effective audio data.
In this embodiment, the effective audio data is subjected to the flipping process, which may be understood as:
and outputting the effective audio data in a reverse order according to the time sequence. For example, if the valid audio data is the audio sequence [1,2,3,4,5], the audio sequence [1,2,3,4,5] is output in reverse order according to the time sequence, and the audio sequence [5,4,3,2,1] is obtained.
The following relation can be utilized to flip the valid audio data:
z2=flip(z);
z represents valid audio data, z2 represents audio data obtained by outputting the valid audio data in reverse order in chronological order, and flip (z) represents outputting the valid audio data in reverse order in chronological order.
And S5248, randomly cutting the audio data obtained after the overturning processing.
Randomly cutting the audio data obtained by the flipping process to obtain a plurality of audio segments, which may include:
S52481, acquiring the total duration of the audio data obtained after the flipping processing;

S52482, based on a random number algorithm, randomly selecting R different moments between 0 and the total duration, sorting the selected moments to obtain R sorted moments, and, taking each moment as a starting point, selecting an audio segment of a set time length from the audio data obtained after the flipping processing.

For example, suppose the total duration of the audio data obtained after the flipping processing is 4 minutes, R is 3, and the set time length is 10 seconds. Three moments, say the 0.5th minute, the 1st minute, and the 2nd minute, are randomly selected within 0 to 4 minutes, and a 10-second audio segment is selected from the audio data obtained after the flipping processing starting at each of the three moments.

Randomly cutting the audio data obtained after the flipping processing yields a plurality of shorter audio segments, which facilitates subsequent processing and improves processing efficiency.
The process of audio enhancement of the valid audio data based on the signal processing method may also include:
S5249, performing inversion processing on the valid audio data, and flipping the audio data obtained after the inversion processing, to obtain the valid audio data to be processed.
The process of performing the inverse phase processing on the valid audio data can be referred to the related description of step S5245, and will not be described herein again.
The process of performing the flipping process on the audio data obtained after the inverting process can be referred to the related description of step S5247, and is not described herein again.
S52410, randomly clipping effective audio data to be processed.
The process of randomly clipping the valid audio data to be processed can be referred to the related description of step S5248, and will not be described herein again.
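The inversion, flipping, and random-cutting operations of steps S5245-S52410 are simple array manipulations; the sketch below assumes NumPy arrays, with R and the set time length as placeholder values:

```python
import numpy as np

def invert_flip_and_cut(z, sr, R=3, seg_seconds=10, rng=None):
    """Produce the inverted (z1 = -z), flipped (z2 = flip(z)), and
    inverted-then-flipped variants of the valid audio data, then randomly
    cut R segments of a set time length from each (steps S5245-S52410)."""
    rng = rng or np.random.default_rng()
    variants = [-z,                # S5245: inversion processing
                np.flip(z),        # S5247: flipping processing
                np.flip(-z)]       # S5249: inversion then flipping
    seg_len = seg_seconds * sr
    segments = []
    for data in variants:
        # R random starting moments within the total duration, sorted
        starts = np.sort(rng.integers(0, max(1, len(data) - seg_len), size=R))
        segments += [data[s:s + seg_len] for s in starts]
    return segments
```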
Step S53, taking the audio data conforming to the set data format and the first audio enhancement data as the reference audio data.
Taking both the audio data conforming to the set data format and the first audio enhancement data as reference audio data expands the amount of reference audio data available for training the generation network.
Step S54 is to randomly select audio data from the data source, and to use the selected audio data as random audio data.
And step S55, performing audio enhancement on the random audio data based on the signal processing method to obtain second audio enhancement data.
In this embodiment, the audio attribute of the audio data is not changed in the process of performing audio enhancement on the random audio data based on the signal processing method, so that it is ensured that the audio attribute of the second audio enhancement data is the same as that of the random audio data.
In this embodiment, for the process of performing audio enhancement on random audio data based on the signal processing method, reference may be made to related descriptions of performing audio enhancement on audio data conforming to a set data format based on the signal processing method, and details are not described herein again.
And step S56, taking the random audio data and the second audio enhancement data as the audio data to be processed.
Taking both the random audio data and the second audio enhancement data as audio data to be processed expands the amount of audio data to be processed available for training the generation network.
Steps S51-S56 are a specific implementation of step S11 in embodiment 1.
And step S57, inputting the audio data to be processed and the reference audio data into a generation network in the generative confrontation network to be trained to obtain target audio data.
Step S58, when the difference between the target audio data and the reference audio data in the generated network is within the difference threshold range, determining authentication network input data, the authentication network input data including: sub-target audio data which is obtained from the target audio data and meets the set condition of the transform domain, and sub-reference audio data which is obtained from the reference audio data and meets the set condition of the transform domain;
step S59, inputting the input data of the identification network into the identification network in the to-be-trained generative confrontation network, and acquiring an identification result;
step S510, when the discrimination result of the sub-target audio data in the discrimination result is false, updating the internal parameters of the generation network.
The detailed procedures of steps S57-S510 can be found in the related descriptions of steps S12-S15 in embodiment 1, and are not described herein again.
In this embodiment, audio enhancement is performed, based on a signal processing method, on the audio data conforming to the set data format and on the random audio data, so as to obtain first audio enhancement data having the same audio attributes as the audio data conforming to the set data format and second audio enhancement data having the same audio attributes as the random audio data. This expands the audio data available for training the generation network and improves the training performance of the generation network.
As another alternative embodiment of the present application, fig. 7 shows a flowchart of embodiment 1 of the audio data enhancement method provided by the present application; the method may include, but is not limited to, the following steps:
and step S61, acquiring the audio data to be processed.
The audio data to be processed may be understood as any audio data.
And step S62, calling a generation network, and processing the audio data to be processed to obtain target audio data.
In this embodiment, the generated network is obtained by training based on the training method for generating a network described in any one of method embodiments 1 to 5.
Step S63, the target audio data is treated as audio data enhancement data.
In this embodiment, the trained generation network can be called to process the audio data to be processed to obtain target audio data, and the target audio data is used as audio data enhancement data to expand the amount of audio data. The expanded audio data can be applied in different scenarios; for example, it can be used to train a neural network model and improve the performance of that training.
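Purely as an illustration of how the enhancement method might be invoked (the serialization format and the PyTorch framing are assumptions, not part of the disclosure):

```python
import torch  # assuming the generation network is a trained PyTorch module

def enhance_audio(generator_path, to_process):
    """Call a trained generation network on audio data to be processed and
    return the target audio data as audio data enhancement data (S61-S63)."""
    generator = torch.load(generator_path)  # hypothetical serialized generator
    generator.eval()
    with torch.no_grad():
        target = generator(to_process)      # step S62: obtain target audio data
    return target                           # step S63: use as enhancement data
```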
Next, a training apparatus for generating a network according to an embodiment of the present application will be described, and the training apparatus for generating a network described below and the training method for generating a network described above may be referred to correspondingly.
Referring to fig. 8, the training apparatus for generating a network includes: a selection module 100, a first acquisition module 200, a determination module 300, a second acquisition module 400, and a first update module 500.
A selection module 100, configured to select audio data to be processed and reference audio data from a data source;
a first obtaining module 200, configured to input audio data to be processed and reference audio data into a generation network in a generative confrontation network to be trained, so as to obtain target audio data;
a determining module 300, configured to determine, when a difference between target audio data and reference audio data in the generating network is within a difference threshold range, authentication network input data, where the authentication network input data includes: sub-target audio data which is obtained from the target audio data and meets the set condition of the transform domain, and sub-reference audio data which is obtained from the reference audio data and meets the set condition of the transform domain;
a second obtaining module 400, configured to input authentication network input data into an authentication network in a to-be-trained generative confrontation network, and obtain an authentication result;
a first updating module 500, configured to update the internal parameters of the generating network when the authentication result of the sub-target audio data is false.
In this embodiment, the determining module 300 may be specifically configured to:
transforming the target audio data based on a preset Fourier transform function to obtain first frequency domain data;
taking the frequency domain data in the set bandwidth range in the first frequency domain data as sub-target audio data meeting the set conditions of a transform domain;
or, based on the preset fourier transform function, transforming the reference audio data to obtain second frequency domain data;
and taking the frequency domain data within the set bandwidth range in the second frequency domain data as sub-reference audio data meeting the set conditions of the transform domain.
In this embodiment, the determining module 300 may use the frequency domain data in the set bandwidth range in the first frequency domain data as the sub-target audio data meeting the set condition of the transform domain, and may include:
extracting audio features of a set type from the first frequency domain data in frequency domain data within a set bandwidth range, and taking the extracted audio features as sub-target audio data meeting set conditions of a transform domain;
the determining, as sub-reference audio data meeting the transform domain setting condition, frequency domain data within the set bandwidth range in the second frequency domain data, includes:
and extracting the audio features of the set type from the frequency domain data within the set bandwidth range in the second frequency domain data, and taking the extracted audio features as sub-reference audio data meeting the set conditions of the transform domain.
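For illustration, one way the determining module's Fourier branch could be realized is sketched below; the bandwidth limits are placeholder values:

```python
import numpy as np

def band_limited_spectrum(audio, sr, f_lo=0.0, f_hi=8000.0):
    """Transform audio data with a Fourier transform and keep only the
    frequency domain data inside the set bandwidth range, for use as
    sub-target or sub-reference audio data."""
    spectrum = np.fft.rfft(audio)                    # first/second frequency domain data
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)  # center frequency of each bin
    mask = (freqs >= f_lo) & (freqs <= f_hi)         # the set bandwidth range
    return spectrum[mask]
```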
In this embodiment, the determining module 300 may be specifically configured to:
transforming the target audio data based on a preset constant-Q transform function to obtain third frequency domain data;

selecting data of a set type from the third frequency domain data, and taking the selected data as sub-target audio data meeting the set conditions of a transform domain;

or, transforming the reference audio data based on the preset constant-Q transform function to obtain fourth frequency domain data;

and selecting the data of the set type from the fourth frequency domain data, and taking the selected data as sub-reference audio data meeting the set conditions of the transform domain.
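One possible realization of the constant-Q branch, sketched with librosa's constant-Q transform; treating the log-magnitude as the "set type" of data is an assumption:

```python
import numpy as np
import librosa  # provides a constant-Q transform implementation

def cqt_selected_data(audio, sr, n_bins=84):
    """Transform audio data with a constant-Q transform and select a set
    type of data (here the log-magnitude) as the sub-target or
    sub-reference audio data."""
    cqt = librosa.cqt(audio, sr=sr, n_bins=n_bins)  # third/fourth frequency domain data
    return np.log1p(np.abs(cqt))                    # selected data of the set type
```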
In this embodiment, the training apparatus for generating a network may further include:
and a second update module, configured to update the internal parameters of the generation network when the difference between the target audio data and the reference audio data in the generation network is not within the difference threshold range, and to trigger the selection module 100 to reselect audio data to be processed and reference audio data from the data source.
In this embodiment, the training apparatus for generating a network may further include:
and a training module, configured to, when multiple groups of the audio data to be processed and the reference audio data are input into the generation network of the generative confrontation network to be trained, train the discrimination network if, among the plurality of discrimination results, there are discrimination results in which the sub-target audio data is false and the proportion of such false discrimination results does not reach a preset ratio threshold.
In this embodiment, the training module may be specifically configured to:
updating internal parameters of the authentication network;
inputting the training sub-target audio data and the training sub-reference audio data into the identification network for identification, to obtain an identification result;
the process for determining the training sub-target audio data comprises the following steps: inputting the audio data to be processed required by training the identification network into the generation network to obtain training target audio data, obtaining data meeting the set conditions of the transform domain from the training target audio data, and using the obtained data as training sub-target audio data; the determination process of the training sub-reference audio data comprises the following steps: obtaining data meeting the set conditions of the transform domain from the reference audio data required by training the identification network, and taking the obtained data as training sub-reference audio data;
judging whether an identification network loss function value is within a preset threshold range, wherein the identification network loss function value represents the difference between the identification result and a preset identification result;
if not, returning to the step of updating the internal parameters of the authentication network until the value of the authentication network loss function is within the range of the preset threshold value.
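The training module's loop could be sketched as follows, assuming a PyTorch discriminator; transform_domain, the two-argument loss function, and the threshold are hypothetical placeholders rather than elements of the disclosure:

```python
import torch

def train_discriminator(disc, gen, batches, transform_domain, loss_fn, opt,
                        thresh=0.1, max_rounds=1000):
    """Sketch of the discrimination-network training loop: update the
    discriminator's internal parameters until its loss value falls within
    the preset threshold range."""
    for _ in range(max_rounds):
        total = 0.0
        for to_process, reference in batches:
            with torch.no_grad():
                training_target = gen(to_process)            # training target audio data
            sub_target = transform_domain(training_target)   # training sub-target data
            sub_reference = transform_domain(reference)      # training sub-reference data
            loss = loss_fn(disc(sub_target), disc(sub_reference))
            opt.zero_grad()
            loss.backward()
            opt.step()                                       # update internal parameters
            total += float(loss)
        if total / len(batches) < thresh:                    # loss within preset threshold
            return
```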
In this embodiment, the training apparatus for generating a network may further include:
the judging module is used for judging whether the generated confrontation network to be trained meets training end conditions or not after the identification network is trained or before the identification result of the sub-target audio data in the identification result is false and the internal parameters of the generated network are updated;
if the training end condition is met, ending the training;
and if the training end condition is not met, updating the internal parameters of the generated network.
Judging whether the generative confrontation network to be trained meets the training end condition may include:

judging whether the discrimination result of the sub-target audio data among the discrimination results is false, and if so, judging whether the generation network loss function value has converged;

or, judging whether the number of times the sub-target audio data has been output to the authentication network reaches the set number of times;

or, judging whether the number of times the sub-target audio data has been output to the authentication network reaches the set number of times; if the set number of times has not been reached, judging whether the discrimination results of the sub-target audio data among the discrimination results have converged to false; and if so, judging whether the generation network loss function value has converged.
In this embodiment, the selection module 100 may be specifically configured to:
selecting audio data which accords with a set data format from a data source, and taking the selected audio data as reference audio data;
and randomly selecting audio data with the same number as the reference audio data from the data source, and taking the selected audio data as audio data to be processed.
In this embodiment, the selection module 100 may be specifically configured to:
selecting audio data which accords with a set data format from a data source;
performing audio enhancement on the audio data conforming to the set data format based on a signal processing method to obtain first audio enhancement data, wherein the audio attribute of the first audio enhancement data is the same as that of the audio data conforming to the set data format;
taking the audio data conforming to the set data format and the first audio enhancement data as reference audio data;
or randomly selecting audio data from the data source, and taking the selected audio data as random audio data;
performing audio enhancement on the random audio data based on a signal processing method to obtain second audio enhancement data;
and taking the random audio data and the second audio enhancement data as audio data to be processed.
In this embodiment, the selecting module 100 performs audio enhancement on the audio data conforming to the set data format or the random audio data based on a signal processing method, and the audio enhancement may include:
detecting the energy of each audio frame in the audio data conforming to the set data format or the random audio data;
based on the energy of each audio frame, screening out a low-energy audio frame set from the audio data conforming to the set data format or the random audio data, wherein the low-energy audio frame set is composed of a set number of audio frames with the energy lower than a set energy threshold, and the set number of audio frames with the energy lower than the set energy threshold are continuously arranged audio frames;
taking the audio frames in the audio data conforming to the set data format or the random audio data except the audio frames in each low-energy audio frame set as effective audio frames;
and combining the effective audio frames to obtain effective audio data, and performing audio enhancement on the effective audio data based on a signal processing method.
In this embodiment, the selecting module 100 combines a plurality of valid audio frames to obtain valid audio data, and performs audio enhancement on the valid audio data based on a signal processing method, which may include:
determining an average power of a plurality of the valid audio frames based on the power of each of the valid audio frames;
normalizing the average power, and taking the power obtained by normalization as a target power;
respectively multiplying each effective audio frame by the target power to obtain a target effective audio frame;
merging a plurality of target effective audio frames to obtain effective audio data;
performing inversion processing and/or flipping processing on the effective audio data;

and randomly cutting the audio data obtained after the inversion processing and/or the flipping processing.
In another embodiment of the present application, there is provided an audio data enhancement apparatus including:
the acquisition module is used for acquiring audio data to be processed;
the processing module is used for calling a generation network and processing the audio data to be processed to obtain target audio data, wherein the generation network is obtained by training based on the training method for generating a network introduced in any one of method embodiments 1-5;
and the enhancement module is used for taking the target audio data as audio data enhancement data.
In another embodiment of the present application, there is provided an electronic device, which may include:
a memory for storing at least one set of instructions;
a processor for calling and executing the instruction set in the memory, the steps of the training method for generating a network described in any one of method embodiments 1-5 being performed by executing the instruction set.
In another embodiment of the application, a computer storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the training method for generating a network as in any one of method embodiments 1-5.
It should be noted that each embodiment is mainly described as a difference from the other embodiments, and the same and similar parts between the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
As can be seen from the above description of the embodiments, those skilled in the art will understand that all or part of the steps in the above method embodiments may be implemented by software plus related hardware. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The training method for generating a network, the audio data enhancement method and the related device provided by the present application are introduced in detail above, and a specific example is applied in the present application to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (16)

1. A training method for generating a network, comprising:
selecting audio data to be processed and reference audio data from a data source;
inputting the audio data to be processed and the reference audio data into a generation network in a to-be-trained generative confrontation network to obtain target audio data;
determining authentication network input data when a difference between the target audio data and the reference audio data in the generated network is within a difference threshold, the authentication network input data comprising: sub-target audio data which is obtained from the target audio data and meets a transform domain setting condition, and sub-reference audio data which is obtained from the reference audio data and meets the transform domain setting condition;
inputting the identification network input data into an identification network in the to-be-trained generative confrontation network, and acquiring an identification result;
and when the discrimination result of the sub-target audio data in the discrimination result is false, updating the internal parameters of the generation network.
2. The method as claimed in claim 1, wherein the obtaining process of the sub-target audio data meeting the transform domain setting condition comprises:
transforming the target audio data based on a preset Fourier transform function to obtain first frequency domain data;
taking the frequency domain data in the set bandwidth range in the first frequency domain data as sub-target audio data meeting the set conditions of a transform domain;
the obtaining process of the sub-reference audio data that meets the transform domain setting condition includes:
transforming the reference audio data based on the preset Fourier transform function to obtain second frequency domain data;
and taking the frequency domain data within the set bandwidth range in the second frequency domain data as sub-reference audio data meeting the set conditions of the transform domain.
3. The method according to claim 2, wherein the using the frequency domain data within the set bandwidth range in the first frequency domain data as the sub-target audio data meeting the set condition of the transform domain comprises:
extracting audio features of a set type from the first frequency domain data in frequency domain data within a set bandwidth range, and taking the extracted audio features as sub-target audio data meeting set conditions of a transform domain;
the determining, as sub-reference audio data meeting the transform domain setting condition, frequency domain data within the set bandwidth range in the second frequency domain data, includes:
and extracting the audio features of the set type from the frequency domain data within the set bandwidth range in the second frequency domain data, and taking the extracted audio features as sub-reference audio data meeting the set conditions of the transform domain.
4. The method as claimed in claim 1, wherein the obtaining process of the sub-target audio data meeting the transform domain setting condition comprises:
transforming the target audio data based on a preset constant-Q transform function to obtain third frequency domain data;
selecting data of a set type from the third frequency domain data, and taking the selected data as sub-target audio data meeting the set conditions of a transform domain;
the obtaining process of the sub-reference audio data that meets the transform domain setting condition includes:
transforming the reference audio data based on the preset constant-Q transform function to obtain fourth frequency domain data;
and selecting the data of the set type from the fourth frequency domain data, and taking the selected data as sub-reference audio data meeting the set conditions of the transform domain.
5. The method of claim 1, further comprising:
and when the difference between the target audio data and the reference audio data in the generating network is not within the range of the difference threshold, updating the internal parameters of the generating network, and returning to execute the step of selecting the audio data to be processed and the reference audio data from the data source.
6. The method according to claim 1, wherein, when a plurality of groups of the audio data to be processed and the reference audio data are input to the generation network in the generative confrontation network to be trained, if among the plurality of the authentication results there are authentication results in which the sub-target audio data is false, and the proportion of the authentication results in which the sub-target audio data is false does not reach a preset ratio threshold, the authentication network is trained.
7. The method of claim 6, wherein the training the discrimination network comprises:
updating internal parameters of the authentication network;
inputting the training sub-target audio data and the training sub-reference audio data into the identification network for identification, to obtain an identification result;
the process for determining the training sub-target audio data comprises the following steps: inputting the audio data to be processed required by training the identification network into the generation network to obtain training target audio data, obtaining data meeting the set conditions of the transform domain from the training target audio data, and using the obtained data as training sub-target audio data; the determination process of the training sub-reference audio data comprises the following steps: obtaining data meeting the set conditions of the transform domain from the reference audio data required by training the identification network, and taking the obtained data as training sub-reference audio data;
judging whether an identification network loss function value is within a preset threshold range, wherein the identification network loss function value represents the difference between the identification result and a preset identification result;
if not, returning to the step of updating the internal parameters of the authentication network until the value of the authentication network loss function is within the range of the preset threshold value.
8. The method according to claim 6, wherein after the training of the authentication network or before updating the internal parameters of the generation network when the authentication result of the sub-target audio data is false in the authentication result, the method further comprises:
judging whether the discrimination result of the sub-target audio data in the discrimination result is false or not;
if yes, judging whether the generated network loss function value is converged;
if the generated network loss function value is converged, ending the training;
if the generated network loss function value is not converged, updating the internal parameters of the generated network;
or, judging whether the times of outputting the sub-target audio data to the authentication network reaches the set times;
if the number of times is not up to the set number, updating the internal parameters of the generated network;
if the set times are reached, ending the training;
or, judging whether the times of outputting the sub-target audio data to the authentication network reaches the set times;
if the set number of times is not reached, judging whether the discrimination results of the sub-target audio data among the discrimination results have converged to false;
if yes, judging whether the generated network loss function value is converged;
if the generated network loss function value is converged, ending the training;
and if the generated network loss function value is not converged, updating the parameters of the generated network.
9. The method according to any one of claims 1-8, wherein the selecting the audio data to be processed and the reference audio data from the data source comprises:
selecting audio data which accords with a set data format from a data source, and taking the selected audio data as reference audio data;
and randomly selecting audio data with the same number as the reference audio data from the data source, and taking the selected audio data as audio data to be processed.
10. The method according to claim 9, wherein the selecting audio data conforming to the set data format from the data source, and using the selected audio data as the reference audio data, comprises:
selecting audio data which accords with a set data format from a data source;
performing audio enhancement on the audio data conforming to the set data format based on a signal processing method to obtain first audio enhancement data, wherein the audio attribute of the first audio enhancement data is the same as that of the audio data conforming to the set data format;
taking the audio data conforming to the set data format and the first audio enhancement data as reference audio data;
the randomly selecting audio data from the data source, and using the selected audio data as audio data to be processed includes:
randomly selecting audio data from the data source, and taking the selected audio data as random audio data;
performing audio enhancement on the random audio data based on a signal processing method to obtain second audio enhancement data;
and taking the random audio data and the second audio enhancement data as audio data to be processed.
11. The method according to claim 10, wherein the audio enhancement of the audio data conforming to the set data format or the random audio data based on the signal processing method comprises:
detecting the energy of each audio frame in the audio data conforming to the set data format or the random audio data;
based on the energy of each audio frame, screening out a low-energy audio frame set from the audio data conforming to the set data format or the random audio data, wherein the low-energy audio frame set is composed of a set number of audio frames with the energy lower than a set energy threshold, and the set number of audio frames with the energy lower than the set energy threshold are continuously arranged audio frames;
taking the audio frames in the audio data conforming to the set data format or the random audio data except the audio frames in each low-energy audio frame set as effective audio frames;
and combining the effective audio frames to obtain effective audio data, and performing audio enhancement on the effective audio data based on a signal processing method.
12. The method of claim 11, wherein combining a plurality of valid audio frames to obtain valid audio data and performing audio enhancement on the valid audio data based on a signal processing method comprises:
determining an average power of a plurality of the valid audio frames based on the power of each of the valid audio frames;
normalizing the average power, and taking the power obtained by normalization as a target power;
respectively multiplying each effective audio frame by the target power to obtain a target effective audio frame;
merging a plurality of target effective audio frames to obtain effective audio data;
performing inversion processing and/or flipping processing on the effective audio data;

and randomly cutting the audio data obtained after the inversion processing and/or the flipping processing.
13. A method of audio data enhancement, comprising:
acquiring audio data to be processed;
calling a generating network, and processing the audio data to be processed to obtain target audio data, wherein the generating network is obtained by training based on the training method of the generating network of any one of claims 1-12;
and taking the target audio data as audio data enhancement data.
14. A training apparatus that generates a network, comprising:
the selection module is used for selecting audio data to be processed and reference audio data from a data source;
the first acquisition module is used for inputting the audio data to be processed and the reference audio data into a generation network in a to-be-trained generative confrontation network to acquire target audio data;
a determination module to determine authentication network input data when a difference between the target audio data and the reference audio data in the generation network is within a difference threshold, the authentication network input data comprising: sub-target audio data which is obtained from the target audio data and meets a transform domain setting condition, and sub-reference audio data which is obtained from the reference audio data and meets the transform domain setting condition;
the second acquisition module is used for inputting the identification network input data into the identification network in the to-be-trained generative confrontation network and acquiring an identification result;
and the updating module is used for updating the internal parameters of the generated network when the authentication result of the sub-target audio data in the authentication result is false.
15. An electronic device, comprising:
a memory for storing at least one set of instructions;
a processor for invoking and executing said set of instructions in said memory, the steps of the training method of generating a network of any of claims 1-12 being performed by executing said set of instructions.
16. A computer storage medium, having stored thereon a computer program for execution by a processor for carrying out the steps of the training method for generating a network according to any one of claims 1 to 12.
CN202010849195.2A 2020-08-21 2020-08-21 Training method for generating network, audio data enhancement method and related devices Active CN111985643B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010849195.2A CN111985643B (en) 2020-08-21 2020-08-21 Training method for generating network, audio data enhancement method and related devices

Publications (2)

Publication Number Publication Date
CN111985643A true CN111985643A (en) 2020-11-24
CN111985643B CN111985643B (en) 2023-12-01

Family

ID=73442777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010849195.2A Active CN111985643B (en) 2020-08-21 2020-08-21 Training method for generating network, audio data enhancement method and related devices

Country Status (1)

Country Link
CN (1) CN111985643B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545193A (en) * 2018-12-18 2019-03-29 百度在线网络技术(北京)有限公司 Method and apparatus for generating model
CN109948796A (en) * 2019-03-13 2019-06-28 腾讯科技(深圳)有限公司 Self-encoding encoder learning method, device, computer equipment and storage medium
WO2020047298A1 (en) * 2018-08-30 2020-03-05 Dolby International Ab Method and apparatus for controlling enhancement of low-bitrate coded audio
US20200193969A1 (en) * 2018-12-18 2020-06-18 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for generating model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAI Yong; QI Lin; TIE Yun: "Music Generation Based on the Reinforcement Learning Actor-Critic Algorithm", Computer Applications and Software, no. 05 *
YUAN Chen; QIAN Liping; ZHANG Hui; ZHANG Ting: "Malicious Domain Name Training Data Generation Based on Generative Adversarial Networks", Application Research of Computers, no. 05 *

Also Published As

Publication number Publication date
CN111985643B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
Sprengel et al. Audio based bird species identification using deep learning techniques
CN112786057B (en) Voiceprint recognition method and device, electronic equipment and storage medium
US20170249957A1 (en) Method and apparatus for identifying audio signal by removing noise
US20230050565A1 (en) Audio detection method and apparatus, computer device, and readable storage medium
CN113436646A (en) Camouflage voice detection method adopting combined features and random forest
CN113241092A (en) Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network
CN111833884A (en) Voiceprint feature extraction method and device, electronic equipment and storage medium
KR102018286B1 (en) Method and Apparatus for Removing Speech Components in Sound Source
Sahasrabudhe et al. Learning fingerprint orientation fields using continuous restricted Boltzmann machines
Liu et al. Golden gemini is all you need: Finding the sweet spots for speaker verification
CN117577117A (en) Training method and device for orthogonalization low-rank adaptive matrix voice detection model
CN111985643B (en) Training method for generating network, audio data enhancement method and related devices
CN112151052A (en) Voice enhancement method and device, computer equipment and storage medium
CN109119089B (en) Method and equipment for performing transparent processing on music
CN116343759A (en) Method and related device for generating countermeasure sample of black box intelligent voice recognition system
CN113421574B (en) Training method of audio feature extraction model, audio recognition method and related equipment
CN113241054B (en) Speech smoothing model generation method, speech smoothing method and device
CN112201270B (en) Voice noise processing method and device, computer equipment and storage medium
CN109841232A (en) The extracting method of note locations and device and storage medium in music signal
CN117648990A (en) Voice countermeasure sample generation method and system for black box attack
Dinkel et al. Small-footprint convolutional neural network for spoofing detection
Buccoli et al. Unsupervised feature learning for music structural analysis
CN115148195A (en) Training method and audio classification method of audio feature extraction model
CN112115509A (en) Data generation method and device
CN109951243A (en) A kind of spectrum prediction method, system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant