CN113488027A - Hierarchical classification generated audio tracing method, storage medium and computer equipment - Google Patents

Hierarchical classification generated audio tracing method, storage medium and computer equipment

Info

Publication number
CN113488027A
Authority
CN
China
Prior art keywords
audio
training
classification
trained
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111046475.0A
Other languages
Chinese (zh)
Inventor
陶建华
马浩鑫
易江燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202111046475.0A
Publication of CN113488027A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Abstract

The invention provides a hierarchical classification generated audio tracing method, a storage medium and computer equipment, comprising the following steps: extracting acoustic features of the training audio; inputting the acoustic features of the training audio into a binary classification model and training it to obtain a trained binary classification model; labeling the generated training audio with different labels according to its generation method, and inputting the acoustic features of the generated training audio into a multi-class model for training to obtain a trained multi-class model; and extracting acoustic features of the test audio, inputting them into the trained binary classification model, and judging whether the test audio is real or generated speech; if it is judged real, prediction terminates, and if it is judged generated, the acoustic features of the test audio are input into the trained multi-class model to predict the type of its generation source.

Description

Hierarchical classification generated audio tracing method, storage medium and computer equipment
Technical Field
The invention relates to the fields of speech processing and image processing, and in particular to a hierarchical classification generated audio tracing method.
Background
Current generated-speech detection networks output only a binary real/fake result. In practical forensic settings such as public security and court evidence collection, however, people care not only about whether the audio is authentic but also, if it is synthetic, re-recorded, or otherwise generated, about what its generation source is. Research on multi-class audio source tracing is still blank at present.
Generated-speech detection judges whether an input speech sample is generated speech and outputs a binary detection result. Current detection schemes mainly pursue two improvements: more discriminative acoustic features and more effective classifiers. Although end-to-end models in recent years no longer separate the feature extraction module from the classifier, mainstream research still adopts a feature-extractor-plus-classifier architecture. At the classifier level, most studies select a neural network for classification training, such as a residual neural network (ResNet) or a lightweight convolutional neural network (LightCNN), and attend only to judging speech authenticity. In practical application scenarios such as forensics, however, people care not only about the authenticity of the audio but also need to know the source of the fake audio (i.e., which synthesis method generated the fake speech, which company's technology generated the audio, which model of recording device recorded it, etc.).
Publication No. CN113299315A provides a method for generating speech features through continuous learning without storing raw data, comprising: collecting audio data and extracting acoustic features to obtain linear cepstral coefficient features; training a deep learning network model with the linear cepstral coefficient features to obtain a source-domain model; and adding a regularization loss to the training loss function of the source-domain model to constrain the direction of parameter optimization, then updating the parameters of the source-domain model with newly collected audio data to obtain a target-domain model.
That method is characterized by continuously updating the model: the original model is updated with new data while retaining memory of old knowledge. Its innovation lies in the continual-learning training and updating process; what is learned are features of generated speech, after which a classification task is performed and the input audio yields a real/generated classification result. Publication No. CN113314148A provides a lightweight neural network generated-speech discrimination method and system based on raw waveforms, comprising: sampling an audio file at a fixed sampling rate to obtain its raw waveform points, and segmenting them into raw audio frames to obtain a raw audio frame sequence; constructing a search network whose first layer is a fixed one-dimensional convolution layer, followed by a structure of stacked conventional modules and dimension-reduction modules, then an average pooling layer, and finally a fully connected layer; inputting the raw audio frame sequence into the search network and searching for the optimal operation connections between the neurons in the conventional and dimension-reduction modules to obtain an optimal model structure; and training the searched optimal model structure with the raw audio frame sequence to obtain a trained search network.
That method emphasizes the process of generating the model through training: via network structure search, raw audio serves as the input of a network that acts as both feature extractor and classifier, yielding an end-to-end network structure, while the search removes the redundancy of hand-designed networks. Its main innovation is the generation of the model structure; once the model is complete, it still takes audio as input and performs real/generated speech classification.
The technical problem to be solved by the present application is tracing the source of generated speech, not discriminating real from generated speech.
The prior art has the following defects: it produces only a binary real/generated result, which is not detailed enough; no generation source type is given, so audio tracing cannot be performed and no basis of judgment can be provided for judicial evidence collection. Audio tracing is of great significance for judicial forensics; with only a binary real/generated result and no generation source type, the persuasiveness of audio evidence is greatly weakened.
Disclosure of Invention
In view of the above, a first aspect of the present invention provides a hierarchical classification generated audio tracing method, comprising:
S1: extracting acoustic features of the training audio;
S2: inputting the acoustic features of the training audio into a binary classification model and training it to obtain a trained binary classification model;
S3: labeling the generated training audio with different labels according to its generation method, and inputting the acoustic features of the generated training audio into a multi-class model for training to obtain a trained multi-class model;
S4: extracting acoustic features of the test audio, inputting them into the trained binary classification model, and judging whether the test audio is real or generated speech; if it is judged real, prediction terminates; if it is judged generated, the acoustic features of the test audio are input into the trained multi-class model to predict the type of its generation source.
In some specific embodiments, the specific method for extracting the acoustic features of the training audio comprises: sampling the training audio into raw waveform points, then performing pre-emphasis, framing, windowing, fast Fourier transform, linear filter bank filtering, taking the logarithm, and discrete cosine transform to obtain 60-dimensional linear frequency cepstral coefficient (LFCC) features of the audio.
In some specific embodiments, the window length used for windowing is 25 frames.
In some specific embodiments, the fast Fourier transform is a 512-point fast Fourier transform (FFT).
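The extraction pipeline in these embodiments can be sketched in NumPy as follows. Only the 512-point FFT and the 60-dimensional output are stated above; the frame length, hop, filter count, and the name `extract_lfcc` are illustrative assumptions:

```python
import numpy as np

def extract_lfcc(signal, n_fft=512, frame_len=400, hop=160,
                 n_filters=64, n_ceps=60):
    """Sketch of the LFCC pipeline: pre-emphasis, framing, windowing,
    FFT, linear filter bank, log, DCT. Defaults besides n_fft and
    n_ceps are illustrative, not taken from the patent."""
    # Pre-emphasis: y[t] = x[t] - 0.97 * x[t-1]
    y = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing plus Hamming windowing
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)
    # Magnitude spectrum via 512-point FFT
    mag = np.abs(np.fft.rfft(frames, n_fft))
    # Linearly spaced triangular filter bank (linear, not mel: the "L" in LFCC)
    edges = np.linspace(0, n_fft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        fbank[m - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[m - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    # Log filter-bank energies, then DCT-II to obtain cepstral coefficients
    log_e = np.log(mag @ fbank.T + 1e-10)
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * k + 1) / (2 * n_filters)))
    return log_e @ dct.T  # shape: (n_frames, n_ceps)
```

For a one-second 16 kHz input this yields a frames-by-60 feature matrix of the kind the classification models consume.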
In some specific embodiments, the binary classification model employs a LightCNN network.
In some specific embodiments, the binary classification model is trained for 150 rounds with an adaptive moment estimation (adam) optimizer, an initial learning rate of 0.001, and a batch size of 128.
In some specific embodiments, the multi-class model employs a ResNet18 network.
In some specific embodiments, the multi-class model is trained for 100 rounds with the adam optimizer, an initial learning rate of 0.001, and a batch size of 128.
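The optimizer settings shared by both models (adam, initial learning rate 0.001, batch size 128) can be illustrated with a minimal NumPy adam update on a toy batch. The LightCNN and ResNet18 networks themselves are not reproduced here; the logistic-regression objective is purely an assumption for demonstration:

```python
import numpy as np

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One parameter update of the adaptive moment estimation (adam)
    # optimizer, using the 0.001 initial learning rate from the embodiments.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy binary objective: logistic regression on one batch of 128 samples
# (the batch size from the embodiments); the actual models are CNNs.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 8))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(8)
state = {"t": 0, "m": np.zeros(8), "v": np.zeros(8)}
for _ in range(150):                      # 150 training rounds
    p = 1 / (1 + np.exp(-X @ w))          # sigmoid predictions
    grad = X.T @ (p - y) / len(y)         # cross-entropy gradient
    w = adam_step(w, grad, state)
p = 1 / (1 + np.exp(-X @ w))
final_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

The loop drives the cross-entropy below its ln(2) starting value, which is all the sketch is meant to show; real training would iterate over many batches per round.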
A second aspect of the present invention provides a readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the hierarchical classification generated audio tracing method according to the first aspect.
A third aspect of the invention provides computer equipment comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to implement the steps of the hierarchical classification generated audio tracing method according to the first aspect when executing the computer program stored in the memory.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
The method provides a basis of judgment for audio evidence collection: on top of judging that audio is generated, it further provides a criterion for the type of generation source, supporting audio forensics.
Drawings
Fig. 1 is a flowchart of a hierarchical classification generated audio tracing method according to an embodiment of the present invention;
Fig. 2 is a diagram of the prediction process of the hierarchical classification generated audio tracing method according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Example 1:
As illustrated in Fig. 1, this embodiment of the present application provides a hierarchical classification generated audio tracing method:
S1: extracting acoustic features of the training audio;
S2: inputting the acoustic features of the training audio into a binary classification model and training it to obtain a trained binary classification model;
S3: labeling the generated training audio with different labels according to its generation method, and inputting the acoustic features of the generated training audio into a multi-class model for training to obtain a trained multi-class model;
S4: extracting acoustic features of the test audio, inputting them into the trained binary classification model, and judging whether the test audio is real or generated speech; if it is judged real, prediction terminates; if it is judged generated, the acoustic features of the test audio are input into the trained multi-class model to predict the type of its generation source.
In some specific embodiments, the specific method for extracting the acoustic features of the training audio comprises: sampling the training audio into raw waveform points, then performing pre-emphasis, framing, windowing, fast Fourier transform, linear filter bank filtering, taking the logarithm, and discrete cosine transform to obtain 60-dimensional linear frequency cepstral coefficient (LFCC) features of the audio.
In some specific embodiments, the window length used for windowing is 25 frames.
In some specific embodiments, the fast Fourier transform is a 512-point fast Fourier transform (FFT).
In some specific embodiments, the binary classification model employs a LightCNN network.
In some specific embodiments, the binary classification model is trained for 150 rounds with an adaptive moment estimation (adam) optimizer, an initial learning rate of 0.001, and a batch size of 128.
In some specific embodiments, the multi-class model employs a ResNet18 network.
In some specific embodiments, the multi-class model is trained for 100 rounds with the adam optimizer, an initial learning rate of 0.001, and a batch size of 128.
Example 2:
As shown in Fig. 1, for some specific application fields, this embodiment adopts the scheme described in Example 1 and provides a specific embodiment of the hierarchical classification generated audio tracing method; the specific method and steps are as follows:
s1: the method for extracting the acoustic features of the training audio comprises the following steps: sampling training audio to an original waveform point, then performing pre-emphasis, framing, windowing, fast Fourier transform, linear filter bank, logarithm taking and discrete cosine transform to obtain 60-dimensional LFCC characteristics of the audio, wherein the window length is 25 frames, and 512-dimensional FFT is performed;
s2: inputting the acoustic characteristics of the training audio into a two-classification model, and performing two-classification model training to obtain a trained two-classification model; the two-classification model adopts a LightCNN network, the two-classification model is trained for 150 rounds, an adam optimizer is selected, the initial learning rate is set to be 0.001, and the batch data size is 128;
s3: marking different labels on the generated training audio according to the generation method of the training audio, and inputting the acoustic characteristics of the generated training audio into a multi-classification model for training to obtain a trained multi-classification model; the multi-classification model adopts a ResNet18 network, the multi-classification model is trained for 100 rounds, an adam optimizer is selected, the initial learning rate is set to be 0.001, and the batch data size is 128;
s4: as shown in fig. 2, extracting the acoustic features of the test audio, inputting the acoustic features of the test audio into the trained two-class model, performing the discrimination of the real/generated speech, if the acoustic features of the test audio are discriminated to be real, terminating the prediction, and if the acoustic features of the test audio are discriminated to be generated, inputting the acoustic features of the generated test audio into the trained multi-class model to predict the generation source type of the multi-class model.
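The two-stage prediction of step S4 can be sketched as follows. The `.predict()` interface and the stub models are illustrative assumptions standing in for the trained LightCNN and ResNet18:

```python
def trace_audio(features, binary_model, multi_model):
    """Hierarchical prediction (sketch): the binary classifier gates
    the multi-class source tracer, as in step S4."""
    if binary_model.predict(features) == "real":
        return {"verdict": "real", "source": None}   # prediction terminates
    # Only audio judged "generated" reaches source attribution
    return {"verdict": "generated", "source": multi_model.predict(features)}

class StubBinary:
    # Stand-in for the trained LightCNN: calls audio "generated"
    # when the mean feature value is negative (arbitrary toy rule).
    def predict(self, feats):
        return "generated" if sum(feats) / len(feats) < 0 else "real"

class StubMulti:
    # Stand-in for the trained ResNet18 source tracer.
    def predict(self, feats):
        return "synthesis-method-A"   # hypothetical source label
```

With real models, the first stage terminates prediction for audio judged real, and only generated audio receives a source label.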
Example 3:
The present invention further provides a readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the hierarchical classification generated audio tracing method according to the embodiment of the first aspect.
Example 4:
The invention additionally provides computer equipment comprising a processor and a memory, wherein the memory is used to store a computer program, and the processor is configured to implement the steps of the hierarchical classification generated audio tracing method according to the embodiment of the first aspect when executing the computer program stored in the memory.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A hierarchical classification generated audio tracing method, the method comprising:
S1: extracting acoustic features of the training audio;
S2: inputting the acoustic features of the training audio into a binary classification model and training it to obtain a trained binary classification model;
S3: labeling the generated training audio with different labels according to its generation method, and inputting the acoustic features of the generated training audio into a multi-class model for training to obtain a trained multi-class model;
S4: extracting acoustic features of the test audio, inputting them into the trained binary classification model, and judging whether the test audio is real or generated speech; if it is judged real, prediction terminates; if it is judged generated, the acoustic features of the test audio are input into the trained multi-class model to predict the type of its generation source.
2. The hierarchical classification generated audio tracing method according to claim 1, wherein the specific method for extracting the acoustic features of the training audio comprises: sampling the training audio into raw waveform points, then performing pre-emphasis, framing, windowing, fast Fourier transform, linear filter bank filtering, taking the logarithm, and discrete cosine transform to obtain 60-dimensional linear frequency cepstral coefficient features of the audio.
3. The hierarchical classification generated audio tracing method according to claim 2, wherein the window length used for windowing is 25 frames.
4. The hierarchical classification generated audio tracing method according to claim 3, wherein the fast Fourier transform is a 512-point fast Fourier transform.
5. The hierarchical classification generated audio tracing method according to claim 1, wherein the binary classification model employs a lightweight convolutional neural network.
6. The hierarchical classification generated audio tracing method according to claim 5, wherein the binary classification model is trained for 150 rounds with an adaptive moment estimation optimizer, an initial learning rate of 0.001, and a batch size of 128.
7. The hierarchical classification generated audio tracing method according to claim 1, wherein the multi-class model employs an 18-layer residual neural network.
8. The hierarchical classification generated audio tracing method according to claim 7, wherein the multi-class model is trained for 100 rounds with an adaptive moment estimation optimizer, an initial learning rate of 0.001, and a batch size of 128.
9. A readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the hierarchical classification generated audio tracing method of any one of claims 1-8.
10. Computer equipment comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor, when executing the computer program stored on the memory, is configured to implement the steps of the hierarchical classification generated audio tracing method according to any one of claims 1 to 8.
CN202111046475.0A 2021-09-08 2021-09-08 Hierarchical classification generated audio tracing method, storage medium and computer equipment Pending CN113488027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111046475.0A CN113488027A (en) 2021-09-08 2021-09-08 Hierarchical classification generated audio tracing method, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111046475.0A CN113488027A (en) 2021-09-08 2021-09-08 Hierarchical classification generated audio tracing method, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN113488027A 2021-10-08

Family

ID=77947339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111046475.0A Pending CN113488027A (en) 2021-09-08 2021-09-08 Hierarchical classification generated audio tracing method, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN113488027A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083422A (en) * 2022-07-21 2022-09-20 中国科学院自动化研究所 Voice traceability evidence obtaining method and device, equipment and storage medium

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180254046A1 (en) * 2017-03-03 2018-09-06 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
CN110491391A (en) * 2019-07-02 2019-11-22 厦门大学 A kind of deception speech detection method based on deep neural network
CN110600053A (en) * 2019-07-30 2019-12-20 广东工业大学 Cerebral stroke dysarthria risk prediction method based on ResNet and LSTM network
CN110808033A (en) * 2019-09-25 2020-02-18 武汉科技大学 Audio classification method based on dual data enhancement strategy
CN111564163A (en) * 2020-05-08 2020-08-21 宁波大学 RNN-based voice detection method for various counterfeit operations
CN111613246A (en) * 2020-05-28 2020-09-01 腾讯音乐娱乐科技(深圳)有限公司 Audio classification prompting method and related equipment
CN111724810A (en) * 2019-03-19 2020-09-29 杭州海康威视数字技术股份有限公司 Audio classification method and device
US20200322377A1 (en) * 2019-04-08 2020-10-08 Pindrop Security, Inc. Systems and methods for end-to-end architectures for voice spoofing detection
CN111859011A (en) * 2020-07-16 2020-10-30 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, storage medium and electronic equipment
CN112201255A (en) * 2020-09-30 2021-01-08 浙江大学 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method
CN112712809A (en) * 2021-03-29 2021-04-27 北京远鉴信息技术有限公司 Voice detection method and device, electronic equipment and storage medium
CN113128619A (en) * 2021-05-10 2021-07-16 北京瑞莱智慧科技有限公司 Method for training detection model of counterfeit sample, method for identifying counterfeit sample, apparatus, medium, and device
CN113241079A (en) * 2021-04-29 2021-08-10 江西师范大学 Voice spoofing detection method based on residual error neural network
CN113284508A (en) * 2021-07-21 2021-08-20 中国科学院自动化研究所 Hierarchical differentiation based generated audio detection system
CN113299315A (en) * 2021-07-27 2021-08-24 中国科学院自动化研究所 Method for generating voice features through continuous learning without original data storage

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Li et al.: "Disguised speech detection algorithm based on high-frequency speech information", Data Communication *
Zhang Xiongwei et al.: "Research status and prospects of speech spoofing detection methods", Journal of Data Acquisition and Processing *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083422A (en) * 2022-07-21 2022-09-20 中国科学院自动化研究所 Voice traceability evidence obtaining method and device, equipment and storage medium
CN115083422B (en) * 2022-07-21 2022-11-15 中国科学院自动化研究所 Voice traceability evidence obtaining method and device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Kong et al. Weakly labelled audioset tagging with attention neural networks
Lidy et al. CQT-based Convolutional Neural Networks for Audio Scene Classification.
CN110852215B (en) Multi-mode emotion recognition method and system and storage medium
CN111444967B (en) Training method, generating method, device, equipment and medium for generating countermeasure network
CN112912897A (en) Sound classification system
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN112528920A (en) Pet image emotion recognition method based on depth residual error network
Cartwright et al. Tricycle: Audio representation learning from sensor network data using self-supervision
Imran et al. An analysis of audio classification techniques using deep learning architectures
Thornton Audio recognition using mel spectrograms and convolution neural networks
Oo et al. Fusion of Log-Mel Spectrogram and GLCM feature in acoustic scene classification
Wang et al. Automated call detection for acoustic surveys with structured calls of varying length
CN113488027A (en) Hierarchical classification generated audio tracing method, storage medium and computer equipment
Sattigeri et al. A scalable feature learning and tag prediction framework for natural environment sounds
McLoughlin et al. Early detection of continuous and partial audio events using CNN
Xie et al. Investigation of acoustic and visual features for frog call classification
CN113314148B (en) Light-weight neural network generated voice identification method and system based on original waveform
CN113111855B (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium
CN102308307B (en) Method for pattern discovery and recognition
Islam et al. DCNN-LSTM based audio classification combining multiple feature engineering and data augmentation techniques
Xie et al. Image processing and classification procedure for the analysis of australian frog vocalisations
Rajesh et al. Combined evidence of MFCC and CRP features using machine learning algorithms for singer identification
Guo UL-net: Fusion Spatial and Temporal Features for Bird Voice Detection
CN116052725B (en) Fine granularity borborygmus recognition method and device based on deep neural network
Ajitha et al. Emotion Recognition in Speech Using MFCC and Classifiers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination