CN111933180B - Audio splicing detection method and system, mobile terminal and storage medium - Google Patents
Classifications
- G10L25/51 — Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods (neural networks)
- G10L25/30 — Speech or voice analysis characterised by the analysis technique using neural networks
- Y02D30/70 — Reducing energy consumption in wireless communication networks
Abstract
The invention provides an audio splicing detection method and system, a mobile terminal and a storage medium. The method comprises the following steps: acquiring original audio data, and segmenting the original audio in the original audio data to obtain segmented audio; splicing the segmented audio to obtain spliced audio, and extracting audio features from the spliced audio and the original audio respectively to obtain original audio features and spliced audio features; normalizing the original audio features and the spliced audio features respectively, and training a preset recurrent neural network on the normalized features to obtain an audio detection model; and controlling the audio detection model to perform audio splicing detection and output a detection result. Because the audio detection model automatically learns the most suitable audio features for judging whether audio has been spliced, the representativeness of the features is improved, and both the efficiency and the accuracy of audio splicing detection are improved.
Description
Technical Field
The invention belongs to the technical field of audio detection, and particularly relates to an audio splicing detection method, an audio splicing detection system, a mobile terminal and a storage medium.
Background
Voiceprint recognition is a technology for determining a speaker's identity from speech. It is mainly applied in fields such as banking, finance and security, offers low cost and high efficiency, and is easy to deploy on various embedded devices.
However, because sound is easily captured by recording devices such as mobile phones or voice recorders, authentication systems built on voiceprint recognition are vulnerable to attack. Common attack modes include recording playback, speech synthesis, speech generation and speech conversion.
Existing audio splicing detection methods rely on manually selected acoustic features: the waveform of the audio to be detected is matched against a preset waveform, based on the manually selected features, to obtain the detection result. This manual-feature, waveform-matching approach makes audio splicing detection both inefficient and inaccurate.
Disclosure of Invention
The embodiments of the invention aim to provide an audio splicing detection method and system, a mobile terminal and a storage medium, so as to solve the problem that existing audio splicing detection methods, which compare the similarity of voiceprint vectors with a cosine formula or a Euclidean distance formula, achieve low voiceprint identification accuracy.
The embodiment of the invention is realized in such a way that an audio splicing detection method comprises the following steps:
acquiring original audio data, and segmenting the original audio in the original audio data respectively to obtain segmented audio;
splicing the segmented audio to obtain spliced audio, and respectively extracting audio features of the spliced audio and the original audio to obtain spliced audio features and original audio features;
respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and spliced audio features to obtain an audio detection model;
and inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result.
Further, the step of respectively segmenting the original audio in the original audio data comprises:
respectively carrying out random segmentation on each original audio according to a preset segmentation quantity to obtain segmented audio;
specifically, the step of splicing the segmented audio includes:
and extracting the segmentation audio according to the preset segmentation quantity, and splicing the extracted segmentation audio to obtain the spliced audio.
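The segmentation-and-splicing steps above can be sketched as follows; the 16 kHz sampling rate, the five-segment count and the NumPy representation are illustrative assumptions, not part of the claims.

```python
import random
import numpy as np

def split_audio(samples: np.ndarray, num_segments: int) -> list:
    """Randomly segment one audio clip into `num_segments` pieces."""
    # Choose num_segments - 1 distinct cut points, then slice between them.
    cuts = sorted(random.sample(range(1, len(samples)), num_segments - 1))
    bounds = [0] + cuts + [len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:])]

def splice_audio(segment_pool: list, num_segments: int) -> np.ndarray:
    """Draw `num_segments` segments from the pool and concatenate them."""
    chosen = random.sample(segment_pool, num_segments)
    return np.concatenate(chosen)

audio = np.arange(16000, dtype=np.float32)   # one second at an assumed 16 kHz
segments = split_audio(audio, 5)             # random segmentation
spliced = splice_audio(segments, 5)          # splicing drawn segments
```

In practice the segment pool would mix segments from many original clips, so the spliced result crosses splice boundaries between different recordings.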
Further, the step of normalizing the original audio features and the spliced audio features respectively comprises:
respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
and respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value.
Furthermore, the step of training the preset recurrent neural network according to the normalized original audio features and the normalized spliced audio features includes:
setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
performing model training on the preset recurrent neural network according to the positive sample and the negative sample, and performing loss calculation on the preset recurrent neural network to obtain a loss value;
and performing optimization iteration on the preset recurrent neural network according to the loss value until the preset recurrent neural network meets a preset ending condition, and outputting the preset recurrent neural network to obtain the audio detection model.
Further, the detection result includes an original audio score value and a spliced audio score value, and after the step of outputting the detection result, the method further includes:
performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
and if the splicing probability value is smaller than a probability threshold, judging that the audio to be detected is spliced audio.
Further, the step of separately performing audio feature extraction on the spliced audio and the original audio comprises:
and respectively carrying out short-time Fourier transform processing on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics.
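The short-time Fourier transform step can be sketched with a minimal NumPy implementation; the 512-sample frame and 256-sample hop are assumptions chosen so that the output has the 257 frequency bins mentioned in the second embodiment.

```python
import numpy as np

def stft_features(samples, frame_len=512, hop=256):
    """Minimal magnitude STFT: frame the signal, apply a Hann window,
    and take a real FFT per frame -> (num_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack([samples[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # 512-point rfft -> 257 bins

feats = stft_features(np.random.randn(16000))   # one second of toy audio
```

A production system would more likely use a dedicated toolkit (the patent mentions kaldi), but the feature shape is the same.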
Another objective of an embodiment of the present invention is to provide an audio splicing detection system, which includes:
the audio segmentation module is used for acquiring original audio data and segmenting the original audio in the original audio data respectively to obtain segmented audio;
the audio splicing module is used for splicing the segmented audios to obtain spliced audios, and respectively extracting audio features of the spliced audios and the original audios to obtain spliced audio features and original audio features;
the model training module is used for respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and spliced audio features to obtain an audio detection model;
and the audio detection module is used for inputting audio to be detected into the audio detection model and controlling the audio detection model to carry out audio splicing detection so as to output a detection result.
Still further, the audio segmentation module is further configured to:
and respectively carrying out random segmentation on each original audio according to a preset segmentation quantity to obtain the segmented audio.
Another objective of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the audio splicing detection method described above.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the above-mentioned mobile terminal, and the computer program, when executed by a processor, implements the steps of the above-mentioned audio splicing detection method.
According to the embodiments of the invention, manual feature selection is not needed: the audio detection model automatically learns the most suitable audio features for judging whether audio has been spliced, which improves the representativeness of the features and thus both the efficiency and the accuracy of audio splicing detection. In addition, because spliced audio is generated by segmenting and re-splicing audio, a large amount of training data can be generated from a small amount of original audio data, improving data collection efficiency and saving data acquisition time.
Drawings
Fig. 1 is a flowchart of an audio splicing detection method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of an audio splicing detection method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of an audio splicing detection method according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an audio splicing detection system according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a mobile terminal according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Example one
Please refer to fig. 1, which is a flowchart illustrating an audio splicing detection method according to a first embodiment of the present invention, including the steps of:
step S10, acquiring original audio data, and segmenting the original audio in the original audio data respectively to obtain segmented audio;
the original audio data is real audio data, the audio number and the audio duration of the original audio in the original audio data may be set according to requirements, for example, the audio number may be set to 5 thousand, 1 ten thousand, or 2 ten thousand, the audio duration may be set to 3 seconds, 4 seconds, or 5 seconds, the audio duration between different original audios may be different, but the audio durations of all the original audios are within a preset duration range.
Optionally, in this step, if it is detected that the audio duration of any original audio is not within the preset duration range, audio clipping or audio filling is performed on the original audio, so that the audio duration of the original audio after the audio clipping or audio filling is within the preset duration range.
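The optional clipping-or-filling step can be sketched as follows; the 16 kHz sampling rate and the 3-to-5-second duration range are assumptions drawn from the example durations above.

```python
import numpy as np

SAMPLE_RATE = 16000            # assumed sampling rate
MIN_LEN = 3 * SAMPLE_RATE      # assumed lower bound of the preset duration range
MAX_LEN = 5 * SAMPLE_RATE      # assumed upper bound

def fit_duration(samples: np.ndarray) -> np.ndarray:
    """Clip audio that is too long and zero-pad audio that is too short,
    so every clip falls inside the preset duration range."""
    if len(samples) > MAX_LEN:
        return samples[:MAX_LEN]                       # audio clipping
    if len(samples) < MIN_LEN:
        pad = np.zeros(MIN_LEN - len(samples), dtype=samples.dtype)
        return np.concatenate([samples, pad])          # audio filling
    return samples
```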
Furthermore, in this step, the design of segmenting the original audio to obtain the segmented audio effectively facilitates the subsequent generation of spliced audio, and thereby secures the subsequent training data for the preset recurrent neural network.
Step S20, splicing the segmented audios to obtain spliced audios, and respectively extracting audio features of the spliced audios and the original audios to obtain spliced audio features and original audio features;
splicing the segmented audio randomly to obtain a spliced audio, wherein the spliced audio is used as negative sample data of a preset recurrent neural network to ensure the training effect of the preset recurrent neural network;
optionally, in this step, the audio features of the spliced audio and the original audio may be automatically extracted by using a function calculation formula, a function matrix, or other manners, and the audio features may be selected according to requirements, for example, STFT features, MFCC features, or spectrogram features, and the like, in the spliced audio and the original audio may be extracted.
Step S30, respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model;
the original audio features and the spliced audio features are subjected to normalization processing, so that the influence of extreme values or noise on the audio features is effectively reduced, and the accuracy of the training data of the audio detection model is improved;
in the step, all the original audio features after normalization processing are used as positive sample data, and the spliced audio features are used as negative sample data and are written into a preset recurrent neural network for training, so that the audio detection module is obtained.
Specifically, the label of the original audio features is set to 1 and the label of the spliced audio features to 0; the samples are randomly shuffled, 75% of the total sample data is set as the training set and 15% as the test set, and the preset recurrent neural network is trained to obtain the audio detection model;
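The labelling and data-split step can be sketched as follows; the handling of the remaining 10% of samples is unspecified in the text, so treating it as a validation set is an assumption, as are the function and variable names.

```python
import random

def build_splits(original_feats, spliced_feats, train_frac=0.75, test_frac=0.15):
    """Label originals 1 and spliced clips 0, shuffle, then carve out
    training and test subsets; the remainder is kept as a validation set
    (an assumption -- the text leaves the last 10% unspecified)."""
    data = [(f, 1) for f in original_feats] + [(f, 0) for f in spliced_feats]
    random.shuffle(data)
    n_train = int(len(data) * train_frac)
    n_test = int(len(data) * test_frac)
    return (data[:n_train],
            data[n_train:n_train + n_test],
            data[n_train + n_test:])

# Toy feature vectors standing in for STFT features.
train_set, test_set, val_set = build_splits([[0.1]] * 100, [[0.2]] * 100)
```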
optionally, the predetermined recurrent neural network may be a GRU recurrent neural network, where the GRU recurrent neural network includes a 3-layer LSTM structure and the number of hidden layer neurons is 300.
S40, inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
in the embodiment, the GRU network is used as a network structure, so that information in time sequence can be fully utilized, the probability judgment is made by combining front and back information, audio data is just established on the time sequence relation, audio features of all training sets are input into the network, and the output is the two-classification numerical value corresponding to each audio feature data.
According to this embodiment, manual feature selection is not needed: the audio detection model automatically learns the most suitable audio features for judging whether audio has been spliced, which improves the representativeness of the features and thus both the efficiency and the accuracy of audio splicing detection. In addition, because spliced audio is generated by segmenting and re-splicing audio, a large amount of training data can be generated from a small amount of original audio data, improving data collection efficiency during training of the preset recurrent neural network and saving data collection time.
Example two
Please refer to fig. 2, which is a flowchart of an audio splicing detection method according to a second embodiment of the present application, including the steps of:
s11, acquiring original audio data, and respectively performing random segmentation on each original audio according to a preset segmentation number to obtain segmented audio;
the preset splitting number may be set according to a requirement, for example, the preset splitting number may be set to 4, 5, or 10, etc.;
preferably, in this embodiment, the preset number of segments is set to 5, that is, each original audio is randomly segmented into 5 segments, and when the number of original audio in the original audio data is N, the number of segmented audio obtained by segmentation is 5N.
Step S21, extracting the segmented audios according to the preset segmentation quantity, and splicing the extracted segmented audios to obtain spliced audios;
extracting 5 segmented audios from all the segmented audios respectively, and splicing the 5 extracted segmented audios each time to obtain spliced audios;
optionally, in this step, a segmented audio may be extracted from each original audio, and audio splicing may be performed according to the extraction result, so as to obtain N numbers of spliced audios.
Specifically, in this step, the design of splicing the extracted segmented audio to obtain the spliced audio effectively guarantees the training effect on the preset recurrent neural network.
Step S31, short-time Fourier transform processing is respectively carried out on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics;
optionally, in this step, the spliced audio and the original audio may be respectively extracted specifically by directly using a kaldi tool library of python, so as to convert the spliced audio and the original audio into STFT features of 257 dimensions.
Step S41, respectively carrying out normalization processing on the spliced STFT features and the original STFT features, and training a preset recurrent neural network according to the normalized spliced STFT features and original STFT features to obtain an audio detection model;
the original STFT characteristics and the STFT audio characteristics are subjected to normalization processing, so that the influence of extreme values or noise on the audio characteristics is effectively reduced, and the accuracy of the training data of the audio detection model is improved.
S51, inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
wherein the detection result comprises an original audio score value and a spliced audio score value.
S61, performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
the two values output by the audio detection model output layer are converted into probabilities through a SoftMax function, the meanings of the probabilities are the probability value that the audio to be detected is the real audio and the probability value that the audio is spliced, and the calculation mode of the SoftMax function is used for converting the values output by the audio detection model into the range of 0-1, so that whether the audio to be detected is spliced can be directly judged according to the probability value of 0-1.
And S71, if the splicing probability value is smaller than a probability threshold, judging that the audio to be detected is spliced audio.
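The SoftMax conversion and threshold decision can be sketched as follows; the score values, the 0.5 threshold, and the choice of which probability to compare against the threshold are assumptions, since the embodiment does not fix these details.

```python
import math

def softmax(scores):
    """Map raw output-layer scores to probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output-layer scores: [original_score, spliced_score].
original_prob, spliced_prob = softmax([2.0, 0.5])

THRESHOLD = 0.5                          # assumed probability threshold
is_spliced = original_prob < THRESHOLD   # assumed reading: low genuine-audio
                                         # probability means spliced audio
```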
According to this embodiment, manual feature selection is not needed: the audio detection model automatically learns the most suitable audio features for judging whether audio has been spliced, which improves the representativeness of the features and thus both the efficiency and the accuracy of audio splicing detection. In addition, because spliced audio is generated by segmenting and re-splicing audio, a large amount of training data can be generated from a small amount of original audio data, improving data collection efficiency and saving data collection time.
EXAMPLE III
Please refer to fig. 3, which is a flowchart of an audio splicing detection method according to a third embodiment of the present application. The third embodiment refines step S30 of the first embodiment, detailing how to normalize the original audio features and the spliced audio features respectively and how to train the preset recurrent neural network on the normalized features to obtain the audio detection model. It includes the steps of:
step S301, respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
the original audio original numerical value and the spliced audio original numerical value are calculated, so that the subsequent normalization processing aiming at the original audio characteristic and the spliced audio characteristic is effectively facilitated;
step S302, respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
step S303, respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value;
wherein the normalized calculation formula is:
D1 = (A1 - B1) / C1;
wherein A1 is the original audio original value, B1 is the original audio average value, C1 is the original audio standard deviation, and D1 is the original audio normalization value;
D2 = (A2 - B2) / C2;
wherein A2 is the spliced audio original value, B2 is the spliced audio average value, C2 is the spliced audio standard deviation, and D2 is the spliced audio normalization value.
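The standardized calculation formula can be sketched as follows, computing the mean B and standard deviation C over a feature's values and applying D = (A - B) / C; the use of the population (rather than sample) standard deviation is an assumption.

```python
import math

def normalize(values):
    """D = (A - B) / C: subtract the mean B and divide by the
    standard deviation C, for each original value A."""
    mean = sum(values) / len(values)                        # B
    std = math.sqrt(sum((v - mean) ** 2 for v in values)    # C (population std)
                    / len(values))
    return [(v - mean) / std for v in values]               # D for each A

normed = normalize([1.0, 2.0, 3.0, 4.0])
```

The same function is applied separately to the original audio values and the spliced audio values, each with its own mean and standard deviation.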
Step S304, setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
step S305, performing model training on the preset cyclic neural network according to the positive sample and the negative sample, and performing loss calculation on the preset cyclic neural network to obtain a loss value;
the loss calculation of the preset recurrent neural network can be performed by adopting a cross entropy loss function to obtain the loss value, and the loss value is used for updating the parameter weight in the preset recurrent neural network so as to improve the identification efficiency of the preset recurrent neural network.
Step S306, performing optimization iteration on the preset recurrent neural network according to the loss value until the preset recurrent neural network meets a preset ending condition, and outputting the preset recurrent neural network to obtain the audio detection model;
the parameter weight of the iterative preset cyclic neural network can be optimized by adopting an Adam algorithm according to the loss value, the learning rate is 0.00005, 64 audio STFT feature data are transmitted into each batch, one Epoch is trained for 150 batches, and 30 epochs are trained in total;
specifically, in this step, if it is detected that the iteration number of the preset recurrent neural network is equal to the number threshold, or it is detected that the loss value in the preset recurrent neural network is smaller than the loss threshold, it is determined that the preset recurrent neural network satisfies a preset end condition, and the preset recurrent neural network is output to obtain the audio detection model, where the audio detection model is used to receive the audio to be detected and determine whether the audio to be detected is a spliced audio.
In this embodiment, normalizing the original audio features and the spliced audio features effectively reduces the influence of extreme values or noise on the audio features, further improving the accuracy of the training data of the audio detection model. Obtaining a loss value through loss calculation on the preset recurrent neural network and performing optimization iteration according to that loss value effectively updates the parameter weights in the preset recurrent neural network, improving the accuracy of the audio detection model's splicing detection on the audio to be detected.
Example four
Please refer to fig. 4, which is a schematic structural diagram of an audio splicing detection system 100 according to a fourth embodiment of the present invention, including: an audio segmentation module 10, an audio splicing module 11, a model training module 12 and an audio detection module 13, wherein:
the audio segmentation module 10 is configured to obtain original audio data, and segment original audios in the original audio data to obtain segmented audios.
Wherein the audio segmentation module 10 is further configured to: respectively carry out random segmentation on each original audio according to a preset segmentation number to obtain the segmented audio.
And the audio splicing module 11 is configured to splice the segmented audio to obtain a spliced audio, and perform audio feature extraction on the spliced audio and the original audio respectively to obtain a spliced audio feature and an original audio feature.
Wherein, the audio splicing module 11 is further configured to: extract segmented audio according to the preset segmentation number, and splice the extracted segmented audio to obtain the spliced audio.
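As a rough illustration of this segment-and-splice data generation (the function name, the cut-point strategy, and the mixing of segments across originals are assumptions; the embodiment only states that segments are extracted and spliced):

```python
import random

def make_spliced_audio(originals, n_segments, rng=None):
    """Randomly segment each original audio into n_segments pieces, then
    splice together segments drawn from the segmented originals (sketch)."""
    rng = rng or random.Random(0)
    pools = []
    for audio in originals:
        # n_segments - 1 random, distinct cut points inside the audio
        cuts = sorted(rng.sample(range(1, len(audio)), n_segments - 1))
        segments, start = [], 0
        for cut in cuts + [len(audio)]:
            segments.append(audio[start:cut])
            start = cut
        pools.append(segments)
    # take the i-th segment from a randomly chosen original each time
    spliced = []
    for i in range(n_segments):
        spliced.extend(rng.choice(pools)[i])
    return spliced
```

Because each forged clip is assembled from randomly chosen segments, many distinct spliced samples can be generated from a small set of originals, which matches the data-collection benefit claimed later in this embodiment.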
Preferably, the audio splicing module 11 is further configured to: and respectively carrying out short-time Fourier transform processing on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics.
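A minimal magnitude-STFT sketch of this feature extraction, assuming a Hann window; the frame length and hop size are illustrative choices, since the embodiment does not specify them:

```python
import numpy as np

def stft_magnitude(x, frame_len=512, hop=256):
    """Magnitude short-time Fourier transform of a 1-D signal.
    Returns an array of shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # real FFT of each windowed frame, keep magnitudes only
    return np.abs(np.fft.rfft(frames, axis=1))
```

Applying this to both the spliced audio and the original audio yields the spliced STFT features and original STFT features used as training data.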
And the model training module 12 is configured to perform normalization processing on the original audio features and the spliced audio features, and train a preset recurrent neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model.
Wherein the model training module 12 is further configured to: respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
and respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value.
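The mean and standard-deviation steps above correspond to the standard score formula (x - mean) / std, applied separately to the original and spliced feature values; a minimal sketch:

```python
import numpy as np

def zscore_normalize(features):
    """Standardized calculation: subtract the mean of the raw values and
    divide by their standard deviation, i.e. (x - mean) / std."""
    mean = features.mean()
    std = features.std()
    return (features - mean) / std
```

The resulting normalized values have zero mean and unit standard deviation, which is what limits the influence of extreme values on training.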
Preferably, the model training module 12 is further configured to: setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
performing model training on the preset recurrent neural network according to the positive sample and the negative sample, and performing loss calculation on the preset recurrent neural network to obtain a loss value;
and performing optimization iteration on the preset recurrent neural network according to the loss value until the preset recurrent neural network meets a preset end condition, and outputting the preset recurrent neural network to obtain the audio detection model.
And the audio detection module 13 is configured to input the audio to be detected into the audio detection model, and control the audio detection model to perform audio splicing detection, so as to output a detection result.
Wherein the audio detection module 13 is further configured to: performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
and if the splicing probability value is smaller than a probability threshold, determining that the audio to be detected is spliced audio.
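A hedged sketch of this SoftMax step, assuming the two detection scores are converted into a two-class probability and the returned value is the probability mass on the "original" class (the function name and class ordering are assumptions):

```python
import numpy as np

def splice_probability(original_score, spliced_score):
    """SoftMax over the two output score values; returns the probability
    assigned to the first ('original') class."""
    scores = np.array([original_score, spliced_score], dtype=float)
    exp = np.exp(scores - scores.max())   # subtract max for numerical stability
    probs = exp / exp.sum()
    return probs[0]
```

Under this reading, a returned value below the probability threshold means the model assigns little mass to the "original" class, so the audio to be detected is judged to be spliced audio.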
According to this embodiment, manual feature selection is not needed: the audio detection model automatically learns the most suitable audio features for judging whether an audio is spliced, which improves the representativeness of the features and further improves the efficiency and accuracy of audio splicing detection. Moreover, since spliced audio is generated by segmenting and splicing audio, a large amount of training data can be produced from a small amount of original audio data, improving data collection efficiency and saving data collection time.
Example five
Referring to fig. 5, a mobile terminal 101 according to a fifth embodiment of the present invention includes a storage device and a processor; the storage device stores a computer program, and the processor runs the computer program to make the mobile terminal 101 execute the above audio splicing detection method. The mobile terminal 101 may be a robot.
The present embodiment also provides a storage medium on which a computer program used in the above-mentioned mobile terminal 101 is stored, which, when executed, implements the following steps:
acquiring original audio data, and segmenting the original audio in the original audio data respectively to obtain segmented audio;
splicing the segmented audio to obtain spliced audio, and respectively extracting audio features of the spliced audio and the original audio to obtain spliced audio features and original audio features;
respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model;
and inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result. The storage medium may be, for example, a ROM/RAM, a magnetic disk, or an optical disk.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is used as an example, in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to needs, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application.
Those skilled in the art will appreciate that the configuration shown in fig. 4 does not constitute a limitation of the audio splice detection system of the present invention and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components, and that the audio splice detection method of fig. 1-3 may also be implemented using more or fewer components than those shown in fig. 4, or some components in combination, or a different arrangement of components. The units, modules, etc. referred to herein are a series of computer programs that can be executed by a processor (not shown) of the current audio splice detection system and that can perform specific functions, and all of the computer programs can be stored in a storage device (not shown) of the current audio splice detection system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (8)
1. An audio splicing detection method, characterized in that the method comprises:
acquiring original audio data, and if the audio duration of any original audio is detected not to be within a preset duration range, performing audio cutting or audio filling on the original audio;
respectively carrying out random segmentation on each original audio according to a preset segmentation number to obtain segmented audio;
splicing the segmented audios to obtain spliced audios, and respectively performing audio feature extraction on the spliced audios and the original audios to obtain spliced audio features and original audio features;
respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model;
inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
the step of splicing the sliced audio comprises:
and extracting segmented audio according to the preset segmentation number, and splicing the extracted segmented audio to obtain the spliced audio.
2. The audio splicing detection method according to claim 1, wherein said step of normalizing said original audio features and said spliced audio features separately comprises:
respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
and respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value.
3. The audio splicing detection method according to claim 2, wherein the step of training a preset recurrent neural network according to the normalized original audio features and the spliced audio features comprises:
setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
performing model training on the preset recurrent neural network according to the positive sample and the negative sample, and performing loss calculation on the preset recurrent neural network to obtain a loss value;
and performing optimization iteration on the preset recurrent neural network according to the loss value until the preset recurrent neural network meets a preset end condition, and outputting the preset recurrent neural network to obtain the audio detection model.
4. The audio splice detection method of claim 1 wherein the detection result comprises an original audio score value and a spliced audio score value, the method further comprising, after the step of outputting the detection result:
performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
and if the splicing probability value is smaller than a probability threshold, judging that the audio to be detected is spliced audio.
5. The audio splicing detection method according to claim 1, wherein said step of separately performing audio feature extraction on the spliced audio and the original audio comprises:
and respectively carrying out short-time Fourier transform processing on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics.
6. An audio splice detection system, the system comprising:
the audio cutting module is used for acquiring original audio data, and if the audio duration of any original audio is detected not to be within a preset duration range, performing audio cutting or audio filling on the original audio;
respectively carrying out random segmentation on each original audio according to a preset segmentation quantity to obtain segmented audio;
the audio splicing module is used for splicing the segmented audios to obtain spliced audios, and respectively performing audio feature extraction on the spliced audios and the original audios to obtain spliced audio features and original audio features;
the model training module is used for respectively carrying out normalization processing on the original audio features and the spliced audio features and training a preset recurrent neural network according to the original audio features and the spliced audio features after normalization processing to obtain an audio detection model;
the audio detection module is used for inputting audio to be detected into the audio detection model and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
the audio splicing module is further configured to: extract segmented audio according to the preset segmentation number, and splice the extracted segmented audio to obtain the spliced audio.
7. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor running the computer program to make the mobile terminal execute the audio splice detection method according to any of claims 1 to 5.
8. A storage medium, characterized in that it stores a computer program for use in a mobile terminal according to claim 7, which computer program, when being executed by a processor, carries out the steps of the audio splice detection method according to any of the claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010594336.0A CN111933180B (en) | 2020-06-28 | 2020-06-28 | Audio splicing detection method and system, mobile terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111933180A CN111933180A (en) | 2020-11-13 |
CN111933180B true CN111933180B (en) | 2023-04-07 |
Family
ID=73317209
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109243446A (en) * | 2018-10-01 | 2019-01-18 | 厦门快商通信息技术有限公司 | A kind of voice awakening method based on RNN network |
CN109376264A (en) * | 2018-11-09 | 2019-02-22 | 广州势必可赢网络科技有限公司 | A kind of audio-frequency detection, device, equipment and computer readable storage medium |
CN109599117A (en) * | 2018-11-14 | 2019-04-09 | 厦门快商通信息技术有限公司 | A kind of audio data recognition methods and human voice anti-replay identifying system |
CN110428845A (en) * | 2019-07-24 | 2019-11-08 | 厦门快商通科技股份有限公司 | Composite tone detection method, system, mobile terminal and storage medium |
CN110942776B (en) * | 2019-10-31 | 2022-12-06 | 厦门快商通科技股份有限公司 | Audio splicing prevention detection method and system based on GRU |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||