CN111933180B - Audio splicing detection method and system, mobile terminal and storage medium - Google Patents
Classifications
- G10L25/51 — Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods (neural networks)
- G10L25/30 — Speech or voice analysis characterised by the analysis technique using neural networks
- Y02D30/70 — Reducing energy consumption in wireless communication networks
Abstract
The invention provides an audio splicing detection method and system, a mobile terminal and a storage medium. The method comprises the following steps: acquiring original audio data, and segmenting the original audio in the original audio data to obtain segmented audio; splicing the segmented audio to obtain spliced audio, and extracting audio features from the spliced audio and the original audio respectively to obtain original audio features and spliced audio features; normalizing the original audio features and the spliced audio features respectively, and training a preset recurrent neural network on the normalized features to obtain an audio detection model; and controlling the audio detection model to perform audio splicing detection and output a detection result. Because the audio detection model automatically learns the most suitable audio features for judging whether audio has been spliced, the representativeness of the features is improved, and both the efficiency and the accuracy of audio splicing detection are improved.
Description
Technical Field
The invention belongs to the technical field of audio detection, and particularly relates to an audio splicing detection method, an audio splicing detection system, a mobile terminal and a storage medium.
Background
Voiceprint recognition is a technology for determining a speaker's identity from speech. It is mainly applied in fields such as banking, finance and security, offers low cost and high efficiency, and is easy to deploy on various embedded devices.
However, because sound is easily captured by recording devices such as mobile phones or voice recorders, authentication systems built on voiceprint recognition are vulnerable to attack. Common attack modes include recording playback, speech synthesis, speech generation and speech conversion.
Existing audio splicing detection methods rely on manually selected acoustic features: the waveform of the audio to be detected is matched against a preset waveform, based on the manually selected features, to obtain the detection result. This manual-feature, waveform-matching approach makes audio splicing detection both inefficient and inaccurate.
Disclosure of Invention
The embodiments of the invention aim to provide an audio splicing detection method and system, a mobile terminal and a storage medium, so as to solve the problem that existing audio splicing detection methods, which compare the similarity of voiceprint vectors with a cosine formula or a Euclidean distance formula, achieve low voiceprint identification accuracy.
The embodiment of the invention is realized in such a way that an audio splicing detection method comprises the following steps:
acquiring original audio data, and segmenting the original audio in the original audio data respectively to obtain segmented audio;
splicing the segmented audio to obtain spliced audio, and respectively extracting audio features of the spliced audio and the original audio to obtain spliced audio features and original audio features;
respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and spliced audio features to obtain an audio detection model;
and inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result.
Further, the step of respectively segmenting the original audio in the original audio data comprises:
respectively carrying out random segmentation on each original audio according to a preset segmentation quantity to obtain segmented audio;
specifically, the step of splicing the segmented audio includes:
and extracting the segmentation audio according to the preset segmentation quantity, and splicing the extracted segmentation audio to obtain the spliced audio.
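The segmentation-and-splicing steps above can be sketched as follows; the 16 kHz sampling rate, the five-segment count and the NumPy representation are illustrative assumptions, not part of the claims.

```python
import random
import numpy as np

def split_audio(samples: np.ndarray, num_segments: int) -> list:
    """Randomly segment one audio clip into `num_segments` pieces."""
    # Choose num_segments - 1 distinct cut points, then slice between them.
    cuts = sorted(random.sample(range(1, len(samples)), num_segments - 1))
    bounds = [0] + cuts + [len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:])]

def splice_audio(segment_pool: list, num_segments: int) -> np.ndarray:
    """Draw `num_segments` segments from the pool and concatenate them."""
    chosen = random.sample(segment_pool, num_segments)
    return np.concatenate(chosen)

audio = np.arange(16000, dtype=np.float32)   # one second at an assumed 16 kHz
segments = split_audio(audio, 5)             # random segmentation
spliced = splice_audio(segments, 5)          # splicing drawn segments
```

In practice the segment pool would mix segments from many original clips, so the spliced result crosses splice boundaries between different recordings.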
Further, the step of normalizing the original audio features and the spliced audio features respectively comprises:
respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
and respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value.
Furthermore, the step of training the preset recurrent neural network according to the normalized original audio features and the normalized spliced audio features includes:
setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
performing model training on the preset recurrent neural network according to the positive sample and the negative sample, and performing loss calculation on the preset recurrent neural network to obtain a loss value;
and performing optimization iteration on the preset recurrent neural network according to the loss value until the preset recurrent neural network meets a preset ending condition, and outputting the preset recurrent neural network to obtain the audio detection model.
Further, the detection result includes an original audio score value and a spliced audio score value, and after the step of outputting the detection result, the method further includes:
performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
and if the splicing probability value is smaller than a probability threshold, judging that the audio to be detected is spliced audio.
Further, the step of separately performing audio feature extraction on the spliced audio and the original audio comprises:
and respectively carrying out short-time Fourier transform processing on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics.
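The short-time Fourier transform step can be sketched with a minimal NumPy implementation; the 512-sample frame and 256-sample hop are assumptions chosen so that the output has the 257 frequency bins mentioned in the second embodiment.

```python
import numpy as np

def stft_features(samples, frame_len=512, hop=256):
    """Minimal magnitude STFT: frame the signal, apply a Hann window,
    and take a real FFT per frame -> (num_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack([samples[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # 512-point rfft -> 257 bins

feats = stft_features(np.random.randn(16000))   # one second of toy audio
```

A production system would more likely use a dedicated toolkit (the patent mentions kaldi), but the feature shape is the same.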
Another objective of an embodiment of the present invention is to provide an audio splicing detection system, which includes:
the audio segmentation module is used for acquiring original audio data and segmenting the original audio in the original audio data respectively to obtain segmented audio;
the audio splicing module is used for splicing the segmented audios to obtain spliced audios, and respectively extracting audio features of the spliced audios and the original audios to obtain spliced audio features and original audio features;
the model training module is used for respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and spliced audio features to obtain an audio detection model;
and the audio detection module is used for inputting audio to be detected into the audio detection model and controlling the audio detection model to carry out audio splicing detection so as to output a detection result.
Still further, the audio segmentation module is further configured to:
and respectively carrying out random segmentation on each original audio according to a preset segmentation quantity to obtain the segmented audio.
Another objective of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the audio splicing detection method described above.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the above-mentioned mobile terminal, and the computer program, when executed by a processor, implements the steps of the above-mentioned audio splicing detection method.
According to the embodiments of the invention, manual feature selection is not needed: the audio detection model automatically learns the most suitable audio features for judging whether audio has been spliced, which improves the representativeness of the features and thus both the efficiency and the accuracy of audio splicing detection. In addition, because spliced audio is generated by segmenting and re-splicing audio, a large amount of training data can be generated from a small amount of original audio data, improving data collection efficiency and saving data acquisition time.
Drawings
Fig. 1 is a flowchart of an audio splicing detection method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of an audio splicing detection method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of an audio splicing detection method according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an audio splicing detection system according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a mobile terminal according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Example one
Please refer to fig. 1, which is a flowchart illustrating an audio splicing detection method according to a first embodiment of the present invention, including the steps of:
step S10, acquiring original audio data, and segmenting the original audio in the original audio data respectively to obtain segmented audio;
the original audio data is real audio data, the audio number and the audio duration of the original audio in the original audio data may be set according to requirements, for example, the audio number may be set to 5 thousand, 1 ten thousand, or 2 ten thousand, the audio duration may be set to 3 seconds, 4 seconds, or 5 seconds, the audio duration between different original audios may be different, but the audio durations of all the original audios are within a preset duration range.
Optionally, in this step, if it is detected that the audio duration of any original audio is not within the preset duration range, audio clipping or audio filling is performed on the original audio, so that the audio duration of the original audio after the audio clipping or audio filling is within the preset duration range.
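The optional clipping-or-filling step can be sketched as follows; the 16 kHz sampling rate and the 3-to-5-second duration range are assumptions drawn from the example durations above.

```python
import numpy as np

SAMPLE_RATE = 16000            # assumed sampling rate
MIN_LEN = 3 * SAMPLE_RATE      # assumed lower bound of the preset duration range
MAX_LEN = 5 * SAMPLE_RATE      # assumed upper bound

def fit_duration(samples: np.ndarray) -> np.ndarray:
    """Clip audio that is too long and zero-pad audio that is too short,
    so every clip falls inside the preset duration range."""
    if len(samples) > MAX_LEN:
        return samples[:MAX_LEN]                       # audio clipping
    if len(samples) < MIN_LEN:
        pad = np.zeros(MIN_LEN - len(samples), dtype=samples.dtype)
        return np.concatenate([samples, pad])          # audio filling
    return samples
```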
Furthermore, in this step, the design of segmenting the original audio to obtain the segmented audio effectively facilitates the subsequent generation of spliced audio, and thereby secures the subsequent training data for the preset recurrent neural network.
Step S20, splicing the segmented audios to obtain spliced audios, and respectively extracting audio features of the spliced audios and the original audios to obtain spliced audio features and original audio features;
splicing the segmented audio randomly to obtain a spliced audio, wherein the spliced audio is used as negative sample data of a preset recurrent neural network to ensure the training effect of the preset recurrent neural network;
optionally, in this step, the audio features of the spliced audio and the original audio may be automatically extracted by using a function calculation formula, a function matrix, or other manners, and the audio features may be selected according to requirements, for example, STFT features, MFCC features, or spectrogram features, and the like, in the spliced audio and the original audio may be extracted.
Step S30, respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model;
the original audio features and the spliced audio features are subjected to normalization processing, so that the influence of extreme values or noise on the audio features is effectively reduced, and the accuracy of the training data of the audio detection model is improved;
in the step, all the original audio features after normalization processing are used as positive sample data, and the spliced audio features are used as negative sample data and are written into a preset recurrent neural network for training, so that the audio detection module is obtained.
Specifically, the label of the original audio features is set to 1 and the label of the spliced audio features to 0; the samples are randomly shuffled, 75% of the total sample data is set as the training set and 15% as the test set, and the preset recurrent neural network is trained to obtain the audio detection model;
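The labelling and data-split step can be sketched as follows; the handling of the remaining 10% of samples is unspecified in the text, so treating it as a validation set is an assumption, as are the function and variable names.

```python
import random

def build_splits(original_feats, spliced_feats, train_frac=0.75, test_frac=0.15):
    """Label originals 1 and spliced clips 0, shuffle, then carve out
    training and test subsets; the remainder is kept as a validation set
    (an assumption -- the text leaves the last 10% unspecified)."""
    data = [(f, 1) for f in original_feats] + [(f, 0) for f in spliced_feats]
    random.shuffle(data)
    n_train = int(len(data) * train_frac)
    n_test = int(len(data) * test_frac)
    return (data[:n_train],
            data[n_train:n_train + n_test],
            data[n_train + n_test:])

# Toy feature vectors standing in for STFT features.
train_set, test_set, val_set = build_splits([[0.1]] * 100, [[0.2]] * 100)
```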
optionally, the predetermined recurrent neural network may be a GRU recurrent neural network, where the GRU recurrent neural network includes a 3-layer LSTM structure and the number of hidden layer neurons is 300.
S40, inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
in the embodiment, the GRU network is used as a network structure, so that information in time sequence can be fully utilized, the probability judgment is made by combining front and back information, audio data is just established on the time sequence relation, audio features of all training sets are input into the network, and the output is the two-classification numerical value corresponding to each audio feature data.
According to this embodiment, manual feature selection is not needed: the audio detection model automatically learns the most suitable audio features for judging whether audio has been spliced, which improves the representativeness of the features and thus both the efficiency and the accuracy of audio splicing detection. In addition, because spliced audio is generated by segmenting and re-splicing audio, a large amount of training data can be generated from a small amount of original audio data, improving data collection efficiency during training of the preset recurrent neural network and saving data collection time.
Example two
Please refer to fig. 2, which is a flowchart of an audio splicing detection method according to a second embodiment of the present application, including the steps of:
s11, acquiring original audio data, and respectively performing random segmentation on each original audio according to a preset segmentation number to obtain segmented audio;
the preset splitting number may be set according to a requirement, for example, the preset splitting number may be set to 4, 5, or 10, etc.;
preferably, in this embodiment, the preset number of segments is set to 5, that is, each original audio is randomly segmented into 5 segments, and when the number of original audio in the original audio data is N, the number of segmented audio obtained by segmentation is 5N.
Step S21, extracting the segmented audios according to the preset segmentation quantity, and splicing the extracted segmented audios to obtain spliced audios;
extracting 5 segmented audios from all the segmented audios respectively, and splicing the 5 extracted segmented audios each time to obtain spliced audios;
optionally, in this step, a segmented audio may be extracted from each original audio, and audio splicing may be performed according to the extraction result, so as to obtain N numbers of spliced audios.
Specifically, in this step, the design of splicing the extracted segmented audio to obtain the spliced audio effectively guarantees the training effect on the preset recurrent neural network.
Step S31, short-time Fourier transform processing is respectively carried out on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics;
optionally, in this step, the spliced audio and the original audio may be respectively extracted specifically by directly using a kaldi tool library of python, so as to convert the spliced audio and the original audio into STFT features of 257 dimensions.
Step S41, respectively carrying out normalization processing on the spliced STFT features and the original STFT features, and training a preset recurrent neural network according to the normalized spliced STFT features and original STFT features to obtain an audio detection model;
the original STFT characteristics and the STFT audio characteristics are subjected to normalization processing, so that the influence of extreme values or noise on the audio characteristics is effectively reduced, and the accuracy of the training data of the audio detection model is improved.
S51, inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
wherein the detection result comprises an original audio score value and a spliced audio score value.
S61, performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
the two values output by the audio detection model output layer are converted into probabilities through a SoftMax function, the meanings of the probabilities are the probability value that the audio to be detected is the real audio and the probability value that the audio is spliced, and the calculation mode of the SoftMax function is used for converting the values output by the audio detection model into the range of 0-1, so that whether the audio to be detected is spliced can be directly judged according to the probability value of 0-1.
And S71, if the splicing probability value is smaller than a probability threshold, judging that the audio to be detected is spliced audio.
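The SoftMax conversion and threshold decision can be sketched as follows; the score values, the 0.5 threshold, and the choice of which probability to compare against the threshold are assumptions, since the embodiment does not fix these details.

```python
import math

def softmax(scores):
    """Map raw output-layer scores to probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output-layer scores: [original_score, spliced_score].
original_prob, spliced_prob = softmax([2.0, 0.5])

THRESHOLD = 0.5                          # assumed probability threshold
is_spliced = original_prob < THRESHOLD   # assumed reading: low genuine-audio
                                         # probability means spliced audio
```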
According to this embodiment, manual feature selection is not needed: the audio detection model automatically learns the most suitable audio features for judging whether audio has been spliced, which improves the representativeness of the features and thus both the efficiency and the accuracy of audio splicing detection. In addition, because spliced audio is generated by segmenting and re-splicing audio, a large amount of training data can be generated from a small amount of original audio data, improving data collection efficiency and saving data collection time.
EXAMPLE III
Please refer to fig. 3, which is a flowchart of an audio splicing detection method according to a third embodiment of the present application. The third embodiment refines step S30 of the first embodiment, detailing how to normalize the original audio features and the spliced audio features respectively and how to train the preset recurrent neural network on the normalized features to obtain the audio detection model. It includes the steps of:
step S301, respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
the original audio original numerical value and the spliced audio original numerical value are calculated, so that the subsequent normalization processing aiming at the original audio characteristic and the spliced audio characteristic is effectively facilitated;
step S302, respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
step S303, respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value;
wherein the normalized calculation formula is:
D1 = (A1 - B1) / C1;
wherein A1 is the original audio original value, B1 is the original audio average value, C1 is the original audio standard deviation, and D1 is the original audio normalization value;
D2 = (A2 - B2) / C2;
wherein A2 is the spliced audio original value, B2 is the spliced audio average value, C2 is the spliced audio standard deviation, and D2 is the spliced audio normalization value.
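The standardized calculation formula can be sketched as follows, computing the mean B and standard deviation C over a feature's values and applying D = (A - B) / C; the use of the population (rather than sample) standard deviation is an assumption.

```python
import math

def normalize(values):
    """D = (A - B) / C: subtract the mean B and divide by the
    standard deviation C, for each original value A."""
    mean = sum(values) / len(values)                        # B
    std = math.sqrt(sum((v - mean) ** 2 for v in values)    # C (population std)
                    / len(values))
    return [(v - mean) / std for v in values]               # D for each A

normed = normalize([1.0, 2.0, 3.0, 4.0])
```

The same function is applied separately to the original audio values and the spliced audio values, each with its own mean and standard deviation.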
Step S304, setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
step S305, performing model training on the preset cyclic neural network according to the positive sample and the negative sample, and performing loss calculation on the preset cyclic neural network to obtain a loss value;
the loss calculation of the preset recurrent neural network can be performed by adopting a cross entropy loss function to obtain the loss value, and the loss value is used for updating the parameter weight in the preset recurrent neural network so as to improve the identification efficiency of the preset recurrent neural network.
Step S306, performing optimization iteration on the preset recurrent neural network according to the loss value until the preset recurrent neural network meets a preset ending condition, and outputting the preset recurrent neural network to obtain the audio detection model;
the parameter weight of the iterative preset cyclic neural network can be optimized by adopting an Adam algorithm according to the loss value, the learning rate is 0.00005, 64 audio STFT feature data are transmitted into each batch, one Epoch is trained for 150 batches, and 30 epochs are trained in total;
specifically, in this step, if it is detected that the iteration number of the preset recurrent neural network is equal to the number threshold, or it is detected that the loss value in the preset recurrent neural network is smaller than the loss threshold, it is determined that the preset recurrent neural network satisfies a preset end condition, and the preset recurrent neural network is output to obtain the audio detection model, where the audio detection model is used to receive the audio to be detected and determine whether the audio to be detected is a spliced audio.
In this embodiment, normalizing the original audio features and the spliced audio features effectively reduces the influence of extreme values or noise on the audio features, further improving the accuracy of the training data of the audio detection model. Obtaining a loss value through loss calculation on the preset recurrent neural network and performing optimization iteration according to that loss value effectively updates the parameter weights in the preset recurrent neural network, improving the accuracy of the audio detection model's splicing detection on the audio to be detected.
Example four
Please refer to fig. 4, which is a schematic structural diagram of an audio splicing detection system 100 according to a fourth embodiment of the present invention, including: an audio segmentation module 10, an audio splicing module 11, a model training module 12 and an audio detection module 13, wherein:
the audio segmentation module 10 is configured to obtain original audio data, and segment original audios in the original audio data to obtain segmented audios.
Wherein the audio segmentation module 10 is further configured to: respectively carry out random segmentation on each original audio according to a preset segmentation number to obtain the segmented audio.
And the audio splicing module 11 is configured to splice the segmented audio to obtain a spliced audio, and perform audio feature extraction on the spliced audio and the original audio respectively to obtain a spliced audio feature and an original audio feature.
Wherein, the audio splicing module 11 is further configured to: extract segmented audio according to the preset segmentation number, and splice the extracted segmented audio to obtain the spliced audio.
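As a rough illustration of this segment-and-splice data generation (the function name, the cut-point strategy, and the mixing of segments across originals are assumptions; the embodiment only states that segments are extracted and spliced):

```python
import random

def make_spliced_audio(originals, n_segments, rng=None):
    """Randomly segment each original audio into n_segments pieces, then
    splice together segments drawn from the segmented originals (sketch)."""
    rng = rng or random.Random(0)
    pools = []
    for audio in originals:
        # n_segments - 1 random, distinct cut points inside the audio
        cuts = sorted(rng.sample(range(1, len(audio)), n_segments - 1))
        segments, start = [], 0
        for cut in cuts + [len(audio)]:
            segments.append(audio[start:cut])
            start = cut
        pools.append(segments)
    # take the i-th segment from a randomly chosen original each time
    spliced = []
    for i in range(n_segments):
        spliced.extend(rng.choice(pools)[i])
    return spliced
```

Because each forged clip is assembled from randomly chosen segments, many distinct spliced samples can be generated from a small set of originals, which matches the data-collection benefit claimed later in this embodiment.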
Preferably, the audio splicing module 11 is further configured to: and respectively carrying out short-time Fourier transform processing on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics.
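A minimal magnitude-STFT sketch of this feature extraction, assuming a Hann window; the frame length and hop size are illustrative choices, since the embodiment does not specify them:

```python
import numpy as np

def stft_magnitude(x, frame_len=512, hop=256):
    """Magnitude short-time Fourier transform of a 1-D signal.
    Returns an array of shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # real FFT of each windowed frame, keep magnitudes only
    return np.abs(np.fft.rfft(frames, axis=1))
```

Applying this to both the spliced audio and the original audio yields the spliced STFT features and original STFT features used as training data.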
And the model training module 12 is configured to perform normalization processing on the original audio features and the spliced audio features, and train a preset recurrent neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model.
Wherein the model training module 12 is further configured to: respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
and respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value.
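The mean and standard-deviation steps above correspond to the standard score formula (x - mean) / std, applied separately to the original and spliced feature values; a minimal sketch:

```python
import numpy as np

def zscore_normalize(features):
    """Standardized calculation: subtract the mean of the raw values and
    divide by their standard deviation, i.e. (x - mean) / std."""
    mean = features.mean()
    std = features.std()
    return (features - mean) / std
```

The resulting normalized values have zero mean and unit standard deviation, which is what limits the influence of extreme values on training.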
Preferably, the model training module 12 is further configured to: setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
performing model training on the preset recurrent neural network according to the positive sample and the negative sample, and performing loss calculation on the preset recurrent neural network to obtain a loss value;
and performing optimization iteration on the preset recurrent neural network according to the loss value until the preset recurrent neural network meets a preset end condition, and outputting the preset recurrent neural network to obtain the audio detection model.
And the audio detection module 13 is configured to input the audio to be detected into the audio detection model, and control the audio detection model to perform audio splicing detection, so as to output a detection result.
Wherein the audio detection module 13 is further configured to: performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
and if the splicing probability value is smaller than a probability threshold, determining that the audio to be detected is spliced audio.
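A hedged sketch of this SoftMax step, assuming the two detection scores are converted into a two-class probability and the returned value is the probability mass on the "original" class (the function name and class ordering are assumptions):

```python
import numpy as np

def splice_probability(original_score, spliced_score):
    """SoftMax over the two output score values; returns the probability
    assigned to the first ('original') class."""
    scores = np.array([original_score, spliced_score], dtype=float)
    exp = np.exp(scores - scores.max())   # subtract max for numerical stability
    probs = exp / exp.sum()
    return probs[0]
```

Under this reading, a returned value below the probability threshold means the model assigns little mass to the "original" class, so the audio to be detected is judged to be spliced audio.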
According to this embodiment, manual feature selection is not needed: the audio detection model automatically learns the most suitable audio features for judging whether an audio is spliced, which improves the representativeness of the features and further improves the efficiency and accuracy of audio splicing detection. Moreover, since spliced audio is generated by segmenting and splicing audio, a large amount of training data can be produced from a small amount of original audio data, improving data collection efficiency and saving data collection time.
Example five
Referring to fig. 5, a mobile terminal 101 according to a fifth embodiment of the present invention includes a storage device and a processor; the storage device stores a computer program, and the processor runs the computer program to make the mobile terminal 101 execute the above audio splicing detection method. The mobile terminal 101 may be a robot.
The present embodiment also provides a storage medium on which a computer program used in the above-mentioned mobile terminal 101 is stored, which, when executed, implements the following steps:
acquiring original audio data, and segmenting the original audio in the original audio data respectively to obtain segmented audio;
splicing the segmented audio to obtain spliced audio, and respectively extracting audio features of the spliced audio and the original audio to obtain spliced audio features and original audio features;
respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model;
and inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result. The storage medium may be, for example, a ROM/RAM, a magnetic disk, or an optical disk.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is used as an example, in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to needs, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application.
Those skilled in the art will appreciate that the configuration shown in fig. 4 does not constitute a limitation of the audio splice detection system of the present invention and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components, and that the audio splice detection method of fig. 1-3 may also be implemented using more or fewer components than those shown in fig. 4, or some components in combination, or a different arrangement of components. The units, modules, etc. referred to herein are a series of computer programs that can be executed by a processor (not shown) of the current audio splice detection system and that can perform specific functions, and all of the computer programs can be stored in a storage device (not shown) of the current audio splice detection system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (8)
1. An audio splicing detection method, characterized in that the method comprises:
acquiring original audio data, and if the audio duration of any original audio is detected not to be within a preset duration range, performing audio cutting or audio filling on the original audio;
respectively carrying out random segmentation on each original audio according to a preset segmentation number to obtain segmented audio;
splicing the segmented audios to obtain spliced audios, and respectively performing audio feature extraction on the spliced audios and the original audios to obtain spliced audio features and original audio features;
respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model;
inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
the step of splicing the sliced audio comprises:
and extracting segmented audio according to the preset segmentation number, and splicing the extracted segmented audio to obtain the spliced audio.
2. The audio splicing detection method according to claim 1, wherein said step of normalizing said original audio features and said spliced audio features separately comprises:
respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
and respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value.
3. The audio splicing detection method according to claim 2, wherein the step of training a preset recurrent neural network according to the normalized original audio features and the spliced audio features comprises:
setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
performing model training on the preset recurrent neural network according to the positive sample and the negative sample, and performing loss calculation on the preset recurrent neural network to obtain a loss value;
and performing optimization iteration on the preset recurrent neural network according to the loss value until the preset recurrent neural network meets a preset end condition, and outputting the preset recurrent neural network to obtain the audio detection model.
4. The audio splice detection method of claim 1 wherein the detection result comprises an original audio score value and a spliced audio score value, the method further comprising, after the step of outputting the detection result:
performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
and if the splicing probability value is smaller than a probability threshold, judging that the audio to be detected is spliced audio.
5. The audio splicing detection method according to claim 1, wherein said step of separately performing audio feature extraction on the spliced audio and the original audio comprises:
and respectively carrying out short-time Fourier transform processing on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics.
6. An audio splice detection system, the system comprising:
the audio cutting module is used for acquiring original audio data, and if the audio duration of any original audio is detected not to be within a preset duration range, performing audio cutting or audio filling on the original audio;
respectively carrying out random segmentation on each original audio according to a preset segmentation quantity to obtain segmented audio;
the audio splicing module is used for splicing the segmented audios to obtain spliced audios, and respectively performing audio feature extraction on the spliced audios and the original audios to obtain spliced audio features and original audio features;
the model training module is used for respectively carrying out normalization processing on the original audio features and the spliced audio features and training a preset recurrent neural network according to the original audio features and the spliced audio features after normalization processing to obtain an audio detection model;
the audio detection module is used for inputting audio to be detected into the audio detection model and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
the audio splicing module is further configured to: extract segmented audio according to the preset segmentation number, and splice the extracted segmented audio to obtain the spliced audio.
7. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor running the computer program to make the mobile terminal execute the audio splice detection method according to any of claims 1 to 5.
8. A storage medium, characterized in that it stores a computer program for use in a mobile terminal according to claim 7, which computer program, when being executed by a processor, carries out the steps of the audio splice detection method according to any of the claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010594336.0A CN111933180B (en) | 2020-06-28 | 2020-06-28 | Audio splicing detection method and system, mobile terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111933180A CN111933180A (en) | 2020-11-13 |
CN111933180B true CN111933180B (en) | 2023-04-07 |
Family
ID=73317209
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109243446A (en) * | 2018-10-01 | 2019-01-18 | 厦门快商通信息技术有限公司 | A kind of voice awakening method based on RNN network |
CN109376264A (en) * | 2018-11-09 | 2019-02-22 | 广州势必可赢网络科技有限公司 | A kind of audio-frequency detection, device, equipment and computer readable storage medium |
CN109599117A (en) * | 2018-11-14 | 2019-04-09 | 厦门快商通信息技术有限公司 | A kind of audio data recognition methods and human voice anti-replay identifying system |
CN110428845A (en) * | 2019-07-24 | 2019-11-08 | 厦门快商通科技股份有限公司 | Composite tone detection method, system, mobile terminal and storage medium |
CN110942776B (en) * | 2019-10-31 | 2022-12-06 | 厦门快商通科技股份有限公司 | Audio splicing prevention detection method and system based on GRU |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||