CN111933180B - Audio splicing detection method and system, mobile terminal and storage medium - Google Patents


Info

Publication number
CN111933180B
CN111933180B (granted publication) · CN111933180A (application publication) · CN202010594336.0A (application number)
Authority
CN
China
Prior art keywords
audio
original
spliced
features
splicing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010594336.0A
Other languages
Chinese (zh)
Other versions
CN111933180A (en)
Inventor
曾志先
肖龙源
李稀敏
叶志坚
刘晓葳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010594336.0A priority Critical patent/CN111933180B/en
Publication of CN111933180A publication Critical patent/CN111933180A/en
Application granted granted Critical
Publication of CN111933180B publication Critical patent/CN111933180B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides an audio splicing detection method, an audio splicing detection system, a mobile terminal and a storage medium. The method comprises the following steps: acquiring original audio data, and segmenting the original audio in the original audio data to obtain segmented audio; splicing the segmented audio to obtain spliced audio, and extracting audio features from the spliced audio and the original audio respectively to obtain original audio features and spliced audio features; normalizing the original audio features and the spliced audio features respectively, and training a preset recurrent neural network according to the normalized original audio features and spliced audio features to obtain an audio detection model; and controlling the audio detection model to carry out audio splicing detection so as to output a detection result. According to the method, the audio detection model automatically learns the most appropriate audio features as the basis for judging whether an audio is spliced, which improves the representativeness of the features and thus the efficiency and accuracy of audio splicing detection.

Description

Audio splicing detection method and system, mobile terminal and storage medium
Technical Field
The invention belongs to the technical field of audio detection, and particularly relates to an audio splicing detection method, an audio splicing detection system, a mobile terminal and a storage medium.
Background
The voiceprint recognition technology is a technology for judging the identity of a speaker through voice, is mainly applied to the fields of banks, finance, security and the like, has the characteristics of low cost, high efficiency and the like, and is easy to deploy in various embedded devices.
However, since sound is easily captured by recording devices such as mobile phones or voice recorders, authentication systems built on voiceprint recognition are vulnerable to attack; common attack modes include recording playback, speech synthesis, speech generation and speech conversion.
The existing audio splicing detection method requires manual selection of sound-wave features, and splicing detection of the audio to be tested is performed by sound-wave matching: based on the manually selected sound-wave features, the waveform of the audio to be tested is matched against a preset waveform to obtain the audio splicing detection result. However, this manually selected feature-matching approach makes audio splicing detection inefficient and its accuracy poor.
Disclosure of Invention
The embodiment of the invention aims to provide an audio splicing detection method, an audio splicing detection system, a mobile terminal and a storage medium, so as to solve the problems of low detection efficiency and poor detection accuracy caused by the manual sound-wave feature selection and waveform matching adopted in the existing audio splicing detection method.
The embodiment of the invention is realized in such a way that an audio splicing detection method comprises the following steps:
acquiring original audio data, and segmenting the original audio in the original audio data respectively to obtain segmented audio;
splicing the segmented audio to obtain spliced audio, and respectively extracting audio features of the spliced audio and the original audio to obtain spliced audio features and original audio features;
respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and spliced audio features to obtain an audio detection model;
and inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result.
Further, the step of respectively segmenting the original audio in the original audio data comprises:
respectively carrying out random segmentation on each original audio according to a preset segmentation quantity to obtain segmented audio;
specifically, the step of splicing the segmented audio includes:
and extracting segmented audios according to the preset segmentation quantity, and splicing the extracted segmented audios to obtain the spliced audio.
Further, the step of normalizing the original audio features and the spliced audio features respectively comprises:
respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
and respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value.
Furthermore, the step of training the preset recurrent neural network according to the normalized original audio features and the normalized spliced audio features includes:
setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
performing model training on the preset recurrent neural network according to the positive sample and the negative sample, and performing loss calculation on the preset recurrent neural network to obtain a loss value;
and performing optimization iteration on the preset recurrent neural network according to the loss value until the preset recurrent neural network meets a preset ending condition, and outputting the preset recurrent neural network to obtain the audio detection model.
Further, the detection result includes an original audio score value and a spliced audio score value, and after the step of outputting the detection result, the method further includes:
performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
and if the splicing probability is smaller than a probability threshold value, judging that the audio to be tested is spliced audio.
Further, the step of separately performing audio feature extraction on the spliced audio and the original audio comprises:
and respectively carrying out short-time Fourier transform processing on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics.
Another objective of an embodiment of the present invention is to provide an audio splicing detection system, which includes:
the audio segmentation module is used for acquiring original audio data and segmenting the original audio in the original audio data respectively to obtain segmented audio;
the audio splicing module is used for splicing the segmented audios to obtain spliced audios, and respectively extracting audio features of the spliced audios and the original audios to obtain spliced audio features and original audio features;
the model training module is used for respectively carrying out normalization processing on the original audio features and the spliced audio features and training a preset recurrent neural network according to the normalized original audio features and spliced audio features to obtain an audio detection model;
and the audio detection module is used for inputting audio to be detected into the audio detection model and controlling the audio detection model to carry out audio splicing detection so as to output a detection result.
Still further, the audio slicing module is further configured to:
and respectively carrying out random segmentation on each original audio according to a preset segmentation quantity to obtain the segmented audio.
Another objective of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the audio splicing detection method described above.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the above-mentioned mobile terminal, and the computer program, when executed by a processor, implements the steps of the above-mentioned audio splicing detection method.
According to the embodiment of the invention, no manual feature selection is needed: the audio detection model automatically learns the most appropriate audio features as the basis for judging whether an audio is spliced, which improves the representativeness of the features and thus the efficiency and accuracy of audio splicing detection. Because spliced audio is generated by segmenting and re-splicing audio, a large amount of training data can be generated from a small amount of original audio data, which improves data collection efficiency and saves data acquisition time.
Drawings
Fig. 1 is a flowchart of an audio splicing detection method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of an audio splicing detection method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of an audio splicing detection method according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an audio splicing detection system according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a mobile terminal according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Example one
Please refer to fig. 1, which is a flowchart illustrating an audio splicing detection method according to a first embodiment of the present invention, including the steps of:
step S10, acquiring original audio data, and segmenting the original audio in the original audio data respectively to obtain segmented audio;
The original audio data is real audio data. The audio number and the audio duration of the original audios in the original audio data may be set according to requirements; for example, the audio number may be set to 5,000, 10,000 or 20,000, and the audio duration to 3, 4 or 5 seconds. The audio duration may differ between different original audios, but the durations of all original audios must fall within a preset duration range.
Optionally, in this step, if it is detected that the audio duration of any original audio is not within the preset duration range, audio clipping or audio filling is performed on the original audio, so that the audio duration of the original audio after the audio clipping or audio filling is within the preset duration range.
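This clipping-or-padding step can be sketched as follows; the 4-second target length, 16 kHz sample rate and zero-valued padding are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def fit_to_length(audio: np.ndarray, target_len: int) -> np.ndarray:
    """Clip audio longer than target_len; zero-pad audio shorter than it."""
    if len(audio) > target_len:
        return audio[:target_len]                 # audio clipping
    if len(audio) < target_len:
        pad = np.zeros(target_len - len(audio), dtype=audio.dtype)
        return np.concatenate([audio, pad])       # audio filling
    return audio

TARGET_LEN = 4 * 16000  # assumed: 4 seconds at 16 kHz
clipped = fit_to_length(np.ones(5 * 16000), TARGET_LEN)   # too long -> clipped
padded = fit_to_length(np.ones(3 * 16000), TARGET_LEN)    # too short -> padded
```

In practice the check would be against a duration range rather than a single length; a single target length keeps the sketch minimal.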
Furthermore, segmenting each original audio into segmented audios in this step effectively facilitates the subsequent generation of spliced audio, thereby securing training data for the preset recurrent neural network.
Step S20, splicing the segmented audios to obtain spliced audios, and respectively extracting audio features of the spliced audios and the original audios to obtain spliced audio features and original audio features;
The segmented audios are spliced randomly to obtain the spliced audio, which serves as negative sample data for the preset recurrent neural network and thus ensures its training effect;
optionally, in this step, the audio features of the spliced audio and the original audio may be automatically extracted by using a function calculation formula, a function matrix, or other manners, and the audio features may be selected according to requirements, for example, STFT features, MFCC features, or spectrogram features, and the like, in the spliced audio and the original audio may be extracted.
Step S30, respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset recurrent neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model;
the original audio features and the spliced audio features are subjected to normalization processing, so that the influence of extreme values or noise on the audio features is effectively reduced, and the accuracy of the training data of the audio detection model is improved;
In this step, all the normalized original audio features are used as positive sample data and the normalized spliced audio features as negative sample data, and both are fed into the preset recurrent neural network for training to obtain the audio detection model.
Specifically, the label of each original audio feature is set to 1 and the label of each spliced audio feature to 0; the samples are randomly ordered, 75% of the total sample data is set as the training set and 15% as the test set, and the preset recurrent neural network is trained to obtain the audio detection model;
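The labeling, random ordering and 75%/15% split described above can be sketched as follows; the sample counts and feature shapes are placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical feature matrices: one row per audio clip, 257-dimensional features
original_feats = rng.standard_normal((100, 257))  # real audio, label 1
spliced_feats = rng.standard_normal((100, 257))   # spliced audio, label 0

features = np.vstack([original_feats, spliced_feats])
labels = np.concatenate([np.ones(100), np.zeros(100)])

order = rng.permutation(len(features))            # random ordering of all samples
features, labels = features[order], labels[order]

n_train = int(0.75 * len(features))               # 75% training set
n_test = int(0.15 * len(features))                # 15% test set
train_x, train_y = features[:n_train], labels[:n_train]
test_x, test_y = features[n_train:n_train + n_test], labels[n_train:n_train + n_test]
```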
Optionally, the preset recurrent neural network may be a GRU recurrent neural network, which includes a 3-layer LSTM structure with 300 hidden-layer neurons.
Step S40, inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
In this embodiment, the GRU network is used as the network structure because it can fully exploit temporal information and make probability judgments by combining preceding and following context, and audio data is naturally built on such temporal relations. The audio features of all training sets are input into the network, and the output is the binary classification value corresponding to each audio feature sample.
According to this embodiment, no manual feature selection is needed: the audio detection model automatically learns the most appropriate audio features as the basis for judging whether an audio is spliced, which improves the representativeness of the features and thus the efficiency and accuracy of audio splicing detection. Because spliced audio is generated by segmenting and re-splicing audio, a large amount of training data can be generated from a small amount of original audio data, which improves data collection efficiency during training of the preset recurrent neural network and saves data collection time.
Example two
Please refer to fig. 2, which is a flowchart of an audio splicing detection method according to a second embodiment of the present application, including the steps of:
s11, acquiring original audio data, and respectively performing random segmentation on each original audio according to a preset segmentation number to obtain segmented audio;
the preset splitting number may be set according to a requirement, for example, the preset splitting number may be set to 4, 5, or 10, etc.;
preferably, in this embodiment, the preset number of segments is set to 5, that is, each original audio is randomly segmented into 5 segments, and when the number of original audio in the original audio data is N, the number of segmented audio obtained by segmentation is 5N.
Step S21, extracting the segmented audios according to the preset segmentation quantity, and splicing the extracted segmented audios to obtain spliced audios;
Each time, 5 segmented audios are extracted from all the segmented audios and spliced to obtain one spliced audio;
Optionally, in this step, one segmented audio may be extracted from each original audio, and audio splicing may be performed according to the extraction result, so as to obtain N spliced audios.
Specifically, splicing the extracted segmented audios to obtain the spliced audio effectively guarantees the training effect on the preset recurrent neural network.
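The optional scheme of drawing one segment per original audio and splicing the draws can be sketched as follows; the constant-valued placeholder audios exist only so the provenance of each segment stays visible:

```python
import numpy as np

rng = np.random.default_rng(1)
# 4 hypothetical original audios, each filled with its own index value
originals = [np.full(8000, fill_value=i, dtype=float) for i in range(4)]
# Each original already split into 5 segments (even split for simplicity)
segmented = [np.array_split(a, 5) for a in originals]

def make_spliced(segmented: list, rng: np.random.Generator) -> np.ndarray:
    """Draw one random segment from each original audio, shuffle the
    draw order, and concatenate the draws into one spliced audio."""
    picks = [segs[rng.integers(len(segs))] for segs in segmented]
    order = rng.permutation(len(picks))
    return np.concatenate([picks[i] for i in order])

spliced = make_spliced(segmented, rng)
```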
Step S31, short-time Fourier transform processing is respectively carried out on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics;
Optionally, in this step, the STFT features of the spliced audio and the original audio may be extracted directly with the kaldi tool library for Python, converting the spliced audio and the original audio into 257-dimensional STFT features.
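The patent performs this step with the kaldi tool library; as a library-agnostic sketch, equivalent 257-dimensional magnitude features can be computed with NumPy alone, where the 512-sample FFT size (512 // 2 + 1 = 257 bins) and 160-sample hop are assumptions:

```python
import numpy as np

def stft_features(audio: np.ndarray, n_fft: int = 512, hop: int = 160) -> np.ndarray:
    """Frame the signal, apply a Hann window, and take the magnitude of
    the real FFT of each frame; n_fft = 512 yields 257 frequency bins."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, 257)

audio = np.random.default_rng(2).standard_normal(16000)  # placeholder clip
feats = stft_features(audio)
```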
Step S41, respectively carrying out normalization processing on the spliced STFT characteristics and the original STFT characteristics, and training a preset recurrent neural network according to the spliced STFT characteristics and the original STFT characteristics after normalization processing to obtain an audio detection model;
The original STFT features and the spliced STFT features are normalized, which effectively reduces the influence of extreme values or noise on the audio features and improves the accuracy of the training data of the audio detection model.
Step S51, inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
wherein the detection result comprises an original audio score value and a spliced audio score value.
Step S61, performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
The two values output by the output layer of the audio detection model are converted into probabilities by the SoftMax function; they represent the probability that the audio to be tested is real audio and the probability that it is spliced audio. The SoftMax calculation maps the values output by the audio detection model into the range 0-1, so that whether the audio to be tested is spliced can be judged directly from the resulting probability value.
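The SoftMax conversion can be sketched as follows; the example scores, the 0.5 threshold and the comparison direction shown are illustrative assumptions, since the patent does not fix the threshold value:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Convert the model's two output scores into probabilities
    in [0, 1] that sum to 1."""
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical output-layer values: [original-audio score, spliced-audio score]
scores = np.array([2.0, 0.5])
p_original, p_spliced = softmax(scores)

PROB_THRESHOLD = 0.5  # assumed value; the patent leaves the threshold open
is_spliced = bool(p_spliced >= PROB_THRESHOLD)  # assumed comparison direction
```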
Step S71, if the splicing probability value is smaller than the probability threshold, judging that the audio to be tested is spliced audio.
According to this embodiment, no manual feature selection is needed: the audio detection model automatically learns the most appropriate audio features as the basis for judging whether an audio is spliced, which improves the representativeness of the features and further improves the efficiency and accuracy of audio splicing detection. Because spliced audio is generated by segmenting and re-splicing audio, a large amount of training data can be generated from a small amount of original audio data, which improves data collection efficiency and saves data collection time.
Example three
Please refer to fig. 3, which is a flowchart of an audio splicing detection method according to a third embodiment of the present application. The third embodiment refines step S30 of the first embodiment: how to normalize the original audio features and the spliced audio features respectively, and how to train the preset recurrent neural network according to the normalized features to obtain the audio detection model. It includes the steps of:
step S301, respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
Computing the original audio original values and the spliced audio original values effectively facilitates the subsequent normalization of the original audio features and the spliced audio features;
step S302, respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
step S303, respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value;
wherein the normalized calculation formula is:
D1 = (A1 - B1) / C1
wherein A1 is the original audio original value, B1 is the original audio average value, C1 is the original audio standard deviation, and D1 is the original audio normalization value;
D2 = (A2 - B2) / C2
wherein A2 is the spliced audio original value, B2 is the spliced audio average value, C2 is the spliced audio standard deviation, and D2 is the spliced audio normalization value.
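The two formulas above are ordinary z-score normalizations; a minimal sketch:

```python
import numpy as np

def normalize(values: np.ndarray) -> np.ndarray:
    """D = (A - B) / C: subtract the mean B of the original values A
    and divide by their standard deviation C."""
    mean = values.mean()
    std = values.std()
    return (values - mean) / std

original_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # placeholder values
normalized = normalize(original_values)
```

The same function applies unchanged to the spliced audio original values.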
Step S304, setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
step S305, performing model training on the preset cyclic neural network according to the positive sample and the negative sample, and performing loss calculation on the preset cyclic neural network to obtain a loss value;
The loss calculation of the preset recurrent neural network may be performed with a cross-entropy loss function to obtain the loss value, which is used to update the parameter weights in the preset recurrent neural network and thereby improve its recognition performance.
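The cross-entropy loss on the SoftMax probabilities can be sketched as (the example probability vectors are placeholders):

```python
import numpy as np

def cross_entropy(probs: np.ndarray, label: int) -> float:
    """Negative log-likelihood of the true class under the
    predicted probability distribution."""
    return float(-np.log(probs[label]))

# Confident correct prediction -> loss near 0
low_loss = cross_entropy(np.array([0.01, 0.99]), label=1)
# Confident wrong prediction -> large loss
high_loss = cross_entropy(np.array([0.01, 0.99]), label=0)
```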
Step S306, performing optimization iteration on the preset recurrent neural network according to the loss value until the preset recurrent neural network meets a preset ending condition, and outputting the preset recurrent neural network to obtain the audio detection model;
The parameter weights of the preset recurrent neural network may be iteratively optimized with the Adam algorithm according to the loss value, with a learning rate of 0.00005; each batch carries 64 audio STFT feature samples, one epoch is trained over 150 batches, and 30 epochs are trained in total;
specifically, in this step, if it is detected that the iteration number of the preset recurrent neural network is equal to the number threshold, or it is detected that the loss value in the preset recurrent neural network is smaller than the loss threshold, it is determined that the preset recurrent neural network satisfies a preset end condition, and the preset recurrent neural network is output to obtain the audio detection model, where the audio detection model is used to receive the audio to be detected and determine whether the audio to be detected is a spliced audio.
In this embodiment, normalizing the original audio features and the spliced audio features effectively reduces the influence of extreme values or noise on the audio features and improves the accuracy of the training data of the audio detection model. Computing a loss value for the preset recurrent neural network and iteratively optimizing the network according to that loss value effectively updates the parameter weights in the preset recurrent neural network, improving the accuracy of the audio detection model's splicing detection on the audio to be tested.
Example four
Please refer to fig. 4, which is a schematic structural diagram of an audio splicing detection system 100 according to a fourth embodiment of the present invention, including: audio segmentation module 10, audio concatenation module 11, model training module 12 and audio detection module 13, wherein:
the audio segmentation module 10 is configured to obtain original audio data, and segment original audios in the original audio data to obtain segmented audios.
Wherein the audio slicing module 10 is further configured to: and respectively carrying out random segmentation on each original audio according to a preset segmentation number to obtain the segmented audio.
And the audio splicing module 11 is configured to splice the segmented audio to obtain a spliced audio, and perform audio feature extraction on the spliced audio and the original audio respectively to obtain a spliced audio feature and an original audio feature.
Wherein, the audio splicing module 11 is further configured to: extract segmented audios according to the preset segmentation quantity, and splice the extracted segmented audios to obtain the spliced audio.
Preferably, the audio splicing module 11 is further configured to: and respectively carrying out short-time Fourier transform processing on the spliced audio and the original audio to obtain spliced STFT characteristics and original STFT characteristics.
The model training module 12 is configured to normalize the original audio features and the spliced audio features respectively, and to train a preset recurrent neural network on the normalized original and spliced audio features to obtain an audio detection model.
The model training module 12 is further configured to: standardize the values of the original audio features and the spliced audio features respectively to obtain original-audio raw values and spliced-audio raw values;
compute the mean and standard deviation of the original-audio raw values and of the spliced-audio raw values, yielding an original-audio mean, an original-audio standard deviation, a spliced-audio mean and a spliced-audio standard deviation;
and apply the standardized calculation formula to the original-audio raw values and the spliced-audio raw values respectively to obtain an original-audio normalization value and a spliced-audio normalization value.
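The three steps above (raw values, mean and standard deviation, standardized formula) amount to a z-score normalization. A minimal sketch, assuming the patent's "standardized calculation formula" is the usual (x - mean) / std; the `eps` guard against a zero standard deviation is an added assumption:

```python
import numpy as np

def normalize_features(features, eps=1e-8):
    """Z-score normalize a (frames, freq_bins) feature matrix."""
    mean = features.mean()
    std = features.std()
    # assumed formula: subtract the mean, divide by the std deviation
    return (features - mean) / (std + eps)
```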
Preferably, the model training module 12 is further configured to: set the original-audio normalization values as positive samples and the spliced-audio normalization values as negative samples;
train the preset recurrent neural network on the positive and negative samples and perform loss calculation on the network to obtain a loss value;
and iteratively optimize the preset recurrent neural network according to the loss value until the network meets a preset stopping condition, then output it as the audio detection model.
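The training loop above — positive and negative samples, loss calculation, weight updates until a stopping condition — can be sketched with a toy recurrent classifier. This is not the patent's network: the recurrent weights below stay fixed (echo-state style) and only a readout vector is trained from the cross-entropy loss, purely to keep the loss-then-update iteration short and runnable.

```python
import numpy as np

def train_audio_detector(pos, neg, hidden=8, lr=0.5, epochs=300, seed=0):
    """Toy recurrent classifier: positive = original feature sequences,
    negative = spliced ones. Architecture and optimizer are illustrative."""
    rng = np.random.default_rng(seed)
    dim = pos[0].shape[1]
    Wx = rng.normal(0.0, 0.5, (dim, hidden))     # input-to-hidden (fixed)
    Wh = rng.normal(0.0, 0.1, (hidden, hidden))  # hidden-to-hidden (fixed)
    w = np.zeros(hidden)                         # trainable readout

    def last_hidden(seq):
        h = np.zeros(hidden)
        for x in seq:
            h = np.tanh(x @ Wx + h @ Wh)
        return h

    data = [(last_hidden(s), 1.0) for s in pos] + \
           [(last_hidden(s), 0.0) for s in neg]
    for _ in range(epochs):                      # "optimization iteration"
        for h, label in data:
            p = 1.0 / (1.0 + np.exp(-(h @ w)))   # sigmoid score
            w -= lr * (p - label) * h            # cross-entropy gradient step

    def predict(seq):
        """Return 1 for 'original', 0 for 'spliced'."""
        return int(1.0 / (1.0 + np.exp(-(last_hidden(seq) @ w))) > 0.5)

    return predict
```

A fixed epoch count stands in for the patent's "preset ending condition"; a real implementation would monitor the loss value and stop once it converges.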
The audio detection module 13 is configured to input the audio to be detected into the audio detection model and control the audio detection model to perform audio splicing detection, so as to output a detection result.
The audio detection module 13 is further configured to: perform probability calculation on the original audio score value and the spliced audio score value using a SoftMax function to obtain a splicing probability value;
and if the splicing probability value is smaller than a probability threshold, judge that the audio to be detected is spliced audio.
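A minimal sketch of the SoftMax scoring and threshold decision. Interpreting the thresholded quantity as the softmax probability assigned to the original-audio class is an assumption: the text states only that a probability below the threshold means the audio is judged spliced, which is consistent with thresholding the original-audio probability.

```python
import math

def splice_decision(original_score, spliced_score, threshold=0.5):
    """Turn the model's two output scores into a splice verdict.

    Returns (verdict, probability); the original-class interpretation
    of the thresholded probability is an assumption, see lead-in.
    """
    e_orig = math.exp(original_score)
    e_spl = math.exp(spliced_score)
    p_original = e_orig / (e_orig + e_spl)  # two-class softmax
    # below the probability threshold -> judged spliced
    verdict = "spliced" if p_original < threshold else "original"
    return verdict, p_original
```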
In this embodiment, no manual feature selection is needed: the audio detection model automatically learns the most suitable audio features for judging whether audio has been spliced, which improves the representativeness of the features and hence the efficiency and accuracy of audio splicing detection. Moreover, because spliced audio is generated by segmenting and re-splicing existing audio, a large amount of training data can be produced from a small amount of original audio data, improving data collection efficiency and saving collection time.
Example five
Referring to fig. 5, a mobile terminal 101 according to a fifth embodiment of the present invention includes a storage device and a processor, the storage device being used to store a computer program and the processor running the computer program to make the mobile terminal 101 execute the above audio splicing detection method; the mobile terminal 101 may be a robot.
The present embodiment also provides a storage medium on which a computer program used in the above-mentioned mobile terminal 101 is stored, which when executed, includes the steps of:
acquiring original audio data, and segmenting the original audio in the original audio data respectively to obtain segmented audio;
splicing the segmented audio to obtain spliced audio, and respectively extracting audio features of the spliced audio and the original audio to obtain spliced audio features and original audio features;
respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset cyclic neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model;
and inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result. The storage medium may be, for example, a ROM/RAM, a magnetic disk, an optical disk, or the like.
It will be apparent to those skilled in the art that the above division of functional units and modules is used only as an example for convenience and brevity of description. In practical applications, the functions may be distributed among different functional units or modules as needed; that is, the internal structure of the storage device may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; an integrated unit may be implemented in the form of hardware or as a software functional unit. In addition, the specific names of the functional units and modules are only for convenience of distinguishing them from each other and do not limit the protection scope of the present application.
Those skilled in the art will appreciate that the configuration shown in fig. 4 does not limit the audio splicing detection system of the present invention, which may include more or fewer components than shown, combine some components, or arrange the components differently; likewise, the audio splicing detection method of figs. 1-3 may be implemented with more or fewer components than shown in fig. 4, with some components combined, or with a different arrangement of components. The units and modules referred to herein are series of computer programs that can be executed by a processor (not shown) of the audio splicing detection system to perform specific functions, all of which can be stored in a storage device (not shown) of the system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. An audio splicing detection method, characterized in that the method comprises:
acquiring original audio data, and if the audio duration of any original audio is detected not to be within a preset duration range, performing audio cutting or audio filling on the original audio;
respectively carrying out random segmentation on each original audio according to a preset segmentation number to obtain segmented audio;
splicing the segmented audios to obtain spliced audios, and respectively performing audio feature extraction on the spliced audios and the original audios to obtain spliced audio features and original audio features;
respectively carrying out normalization processing on the original audio features and the spliced audio features, and training a preset cyclic neural network according to the normalized original audio features and the spliced audio features to obtain an audio detection model;
inputting the audio to be detected into the audio detection model, and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
the step of splicing the sliced audio comprises:
extracting the segmented audio according to the preset segmentation number, and splicing the extracted segmented audio to obtain the spliced audio.
2. The audio splicing detection method according to claim 1, wherein said step of normalizing said original audio features and said spliced audio features separately comprises:
respectively carrying out numerical value standardization processing on the original audio features and the spliced audio features to obtain original audio original numerical values and spliced audio original numerical values;
respectively carrying out average value calculation and standard deviation calculation on the original audio original numerical value and the spliced audio original numerical value to obtain an original audio average value, an original audio standard deviation, a spliced audio average value and a spliced audio standard deviation;
and respectively calculating the original audio original numerical value and the spliced audio original numerical value according to a standardized calculation formula to obtain an original audio normalization value and a spliced audio normalization value.
3. The audio splicing detection method according to claim 2, wherein the step of training a preset recurrent neural network according to the normalized original audio features and the spliced audio features comprises:
setting the original audio normalization value as a positive sample, and setting the spliced audio normalization value as a negative sample;
performing model training on the preset cyclic neural network according to the positive sample and the negative sample, and performing loss calculation on the preset cyclic neural network to obtain a loss value;
and performing optimization iteration on the preset cyclic neural network according to the loss value until the preset cyclic neural network meets a preset ending condition, and outputting the preset cyclic neural network to obtain the audio detection model.
4. The audio splice detection method of claim 1 wherein the detection result comprises an original audio score value and a spliced audio score value, the method further comprising, after the step of outputting the detection result:
performing probability calculation on the original audio score value and the spliced audio score value by adopting a SoftMax function to obtain a splicing probability value;
and if the splicing probability value is smaller than a probability threshold, judging that the audio to be detected is spliced audio.
5. The audio splicing detection method according to claim 1, wherein said step of separately performing audio feature extraction on the spliced audio and the original audio comprises:
performing short-time Fourier transform processing on the spliced audio and the original audio respectively to obtain spliced STFT features and original STFT features.
6. An audio splice detection system, the system comprising:
the audio cutting module is used for acquiring original audio data, and if the audio duration of any original audio is detected not to be within a preset duration range, performing audio cutting or audio filling on the original audio;
respectively carrying out random segmentation on each original audio according to a preset segmentation quantity to obtain segmented audio;
the audio splicing module is used for splicing the segmented audios to obtain spliced audios, and respectively performing audio feature extraction on the spliced audios and the original audios to obtain spliced audio features and original audio features;
the model training module is used for respectively carrying out normalization processing on the original audio features and the spliced audio features and training a preset cyclic neural network according to the original audio features and the spliced audio features after normalization processing to obtain an audio detection model;
the audio detection module is used for inputting audio to be detected into the audio detection model and controlling the audio detection model to carry out audio splicing detection so as to output a detection result;
the audio splicing module is further configured to: extract the segmented audio according to the preset segmentation quantity, and splice the extracted segmented audio to obtain the spliced audio.
7. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor running the computer program to make the mobile terminal execute the audio splice detection method according to any of claims 1 to 5.
8. A storage medium, characterized in that it stores a computer program for use in a mobile terminal according to claim 7, which computer program, when being executed by a processor, carries out the steps of the audio splice detection method according to any of the claims 1 to 5.
CN202010594336.0A 2020-06-28 2020-06-28 Audio splicing detection method and system, mobile terminal and storage medium Active CN111933180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010594336.0A CN111933180B (en) 2020-06-28 2020-06-28 Audio splicing detection method and system, mobile terminal and storage medium


Publications (2)

Publication Number Publication Date
CN111933180A CN111933180A (en) 2020-11-13
CN111933180B true CN111933180B (en) 2023-04-07

Family

ID=73317209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010594336.0A Active CN111933180B (en) 2020-06-28 2020-06-28 Audio splicing detection method and system, mobile terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111933180B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109243446A (en) * 2018-10-01 2019-01-18 厦门快商通信息技术有限公司 A kind of voice awakening method based on RNN network
CN109376264A (en) * 2018-11-09 2019-02-22 广州势必可赢网络科技有限公司 A kind of audio-frequency detection, device, equipment and computer readable storage medium
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 A kind of audio data recognition methods and human voice anti-replay identifying system
CN110428845A (en) * 2019-07-24 2019-11-08 厦门快商通科技股份有限公司 Composite tone detection method, system, mobile terminal and storage medium
CN110942776B (en) * 2019-10-31 2022-12-06 厦门快商通科技股份有限公司 Audio splicing prevention detection method and system based on GRU



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant