CN114819067A - Spliced audio detection and positioning method and system based on spectrogram segmentation


Info

Publication number
CN114819067A
Authority
CN
China
Prior art keywords: audio, spliced, splicing, segment, spectrogram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210368335.3A
Other languages
Chinese (zh)
Inventor
张振宇 (Zhang Zhenyu)
赵险峰 (Zhao Xianfeng)
易小伟 (Yi Xiaowei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202210368335.3A
Publication of CN114819067A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N3/00 Computing arrangements based on biological models → G06N3/02 Neural networks → G06N3/04 Architecture, e.g. interconnection topology:
            • G06N3/045 Combinations of networks
            • G06N3/047 Probabilistic or stochastic networks
            • G06N3/048 Activation functions
        • G06N3/00 Computing arrangements based on biological models → G06N3/02 Neural networks → G06N3/08 Learning methods
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
        • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00:
            • G10L25/27 characterised by the analysis technique → G10L25/30 using neural networks
            • G10L25/48 specially adapted for particular use → G10L25/51 for comparison or discrimination → G10L25/60 for measuring the quality of voice signals
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
        • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing → G06F2218/08 Feature extraction → G06F2218/10 Feature extraction by analysing the shape of a waveform, e.g. extracting parameters relating to peaks

Abstract

The invention relates to a spliced audio detection and positioning method and system based on spectrogram segmentation. The method comprises the following steps: dividing the audio to be detected into several audio segments S_g to be detected according to the minimum positioning region length of audio splicing tamper positioning; extracting the spectrogram feature F_g of each S_g; concatenating t audio segments S_g into one audio segment S'_g and concatenating the corresponding features F_g into the spectrogram feature F'_g to be input into the network; inputting F'_g into the trained spliced audio detection and positioning network and computing the binary prediction mask corresponding to S'_g; computing, for each S_g, the ratio ρ of the number of spliced sample points to the total number of sample points in its binary prediction mask; and comparing ρ with a preset decision threshold T to judge whether the audio segment S_g is a spliced segment. The method can accurately judge whether an audio block of a specific length is a spliced audio block, improves the accuracy of spliced audio detection and positioning, and is extensible.

Description

Spliced audio detection and positioning method and system based on spectrogram segmentation
Technical Field
The invention relates to a method for detecting tampered audio, in particular to a spliced audio detection and positioning method based on spectrogram segmentation and its application in the field of digital audio forensics, and belongs to the field of multimedia privacy protection within the technical field of information security.
Background
Methods of audio tampering are mainly classified into copy-move and splicing. The copy-move operation intercepts a small audio segment from a recording and pastes it to another position of the same recording to construct a new audio segment; the splicing operation inserts another audio segment at the beginning, middle, or end of a recording to form a new audio segment. Both tampering methods are mainly used to change the original content of the audio so as to fabricate false audio. Audio tampering detection and positioning mainly exploits features such as background-noise inconsistency and voice-characteristic similarity, combined with machine learning methods, to judge whether the audio file under test has been tampered with.
With the widespread use of online audio editing tools, it has become easier to create tampered audio without perceptible traces. Audio splicing can insert another piece of audio either in the middle or at the end of a recording to construct a new audio segment: the former inserts a short segment at some point inside the complete audio, while the latter appends a short segment to the end of the complete audio. Copy-move and splicing reduce the reliability of audio as judicial evidence and hinder the protection of intellectual property. In addition, spliced audio can be used to propagate fake news, negatively impacting society. The ability to detect whether an audio recording has been spliced is therefore a task of great interest in audio forensics.
Various studies on audio splicing detection and positioning have been conducted over the past decades. According to the detection principle, spliced audio detection can be roughly divided into three categories: detection based on background noise, detection based on the electric network frequency (ENF), and detection based on deep learning. First, exploiting the noise-level inconsistency caused by splicing operations, researchers have developed detection methods based on the local noise level of the audio signal, for example: determining the length of each syllable with a spectral entropy (SE) method, computing the background-noise variance of each syllable, and judging whether spliced, tampered audio exists by comparing the similarity of these variances (reference: Meng, X., Li, C., Tian, L.: Detecting audio splicing forgery algorithm based on local noise level estimation. In: 2018 5th International Conference on Systems and Informatics (ICSAI), pp. 861-865. IEEE 2018); or extracting the noise signal of suspicious speech with a parameter-optimized noise estimation algorithm and computing statistics of the mel-frequency features of the estimated noise to detect splicing traces (reference: Yan, D., Dong, M., Gao, J.: Exposing speech splicing with noise level inconsistency. Security and Communication Networks 2021). Furthermore, detecting spliced audio by analyzing the electric network frequency (ENF) signal is an effective approach, based on the fact that inserting one piece of audio into another recording causes abnormal changes in the ENF signal. Some researchers applied a wavelet filter to the ENF signal to highlight abnormal ENF changes and trained a classifier on autoregressive coefficients under a supervised learning framework to identify spliced audio segments (reference: Lin, X., Kang, X.: Supervised audio tampering detection using an autoregressive model. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2142-2146. IEEE 2017); others used multiple ENF features as input feature vectors of a convolutional neural network to detect spliced audio (reference: Mao, M., Xiao, Z., Kang, X., Li, X., Xiao, L.: Electric network frequency based audio forensics using convolutional neural networks. In: IFIP International Conference on Digital Forensics, pp. 253-270. Springer 2020). However, due to legal restrictions, obtaining a concurrent reference dataset of the power system is greatly limited, which makes ENF-based audio splicing detection challenging in practice. More recently, convolutional neural networks (CNN) have been introduced into audio splicing detection: the spectrogram of an audio segment is fed directly into a CNN, and a CNN-based classifier is trained to detect spliced speech (reference: Yan, D., Dong, M., Gao, J.: Exposing speech splicing with noise level inconsistency. Security and Communication Networks 2021). In summary, although some audio splicing detection and positioning methods have achieved effective performance, new techniques are still needed to improve detection and positioning performance. To our knowledge and from a review of the literature, the encoder-decoder architecture has not previously been investigated for audio splicing detection and positioning.
A patent search shows the following related applications in the field of the invention:
Chinese patent CN111564163A, "An RNN-based detection method for multiple audio forgery operations," discloses a detection method based on recurrent neural networks. Based on the dependency between linear spectral coefficients and audio frames, it uses a recurrent neural network (RNN) to learn the intrinsic characteristics of the spectral coefficients, effectively improving the accuracy of forged-speech detection. Since that invention does not involve splice-tampering detection of audio, it clearly differs from the design idea and the specific implementation of the present invention.
Disclosure of Invention
The invention aims to accurately judge whether an audio block of a specific length is a spliced audio block by accurately segmenting the spectrogram of the audio, and on this basis designs a highly accurate audio splicing detection and positioning method.
Compared with other spliced audio detection and positioning methods, the method adopts image semantic segmentation techniques from the computer-vision field, defines the minimum positioning region length (L_slr) of audio splicing tamper positioning, and finally decides whether a given minimum positioning region block is a spliced audio block according to the binary output mask produced by a full convolution network (FCN). The proposed method therefore differs from conventional methods that detect and position splices at arbitrary audio positions, and is particularly suitable for detecting and positioning splices in large-scale, long-duration audio.
According to our research, existing audio splicing detection and positioning methods have the following three limitations. First, for methods based on the noise-level inconsistency caused by splicing operations, performance drops sharply when the signal-to-noise ratios of the spliced sections are close or even identical. Second, due to legal limitations, methods based on analyzing the electric network frequency signal are greatly restricted in acquiring a concurrent reference dataset of the power system, which challenges the practicability of ENF-based splicing detection. Finally, existing neural-network-based spliced audio detection methods can only infer whether a given audio is spliced and cannot position the spliced segments.
Specifically, the technical scheme adopted by the invention is as follows:
a spliced audio detection and positioning method based on spectrogram segmentation comprises the following steps:
1) detecting fragment division: minimum positioning area length L according to audio splicing tampering positioning slr Dividing the audio to be detected into a plurality of audio segments S to be detected g Each audio segment is composed of consecutive sample points and has a length L slr
2) Pretreatment: extracting an audio segment S g Spectrogram feature F of g And according to the size of the network inputT audio segments S g Make up a spliced Audio segment S' g And corresponding spectrogram feature F g Splicing to form a spectrogram feature F 'to be input into the network' g
3) Calculating a binary prediction mask: splicing the obtained spectrogram feature F' g Inputting the Audio segment into a trained ASLNet (Audio Splicing Detection and Localization Network, ASLNet) Network, and calculating a spliced Audio segment S' g And (3) a corresponding binary prediction mask, wherein 1 in the binary prediction mask represents a spliced sampling point, and 0 represents an original sampling point.
4) Calculating the element ratio rho: calculating each audio segment S according to the calculation result of the step 3) g The binary prediction mask of (1) is a ratio ρ of the number of the stitched dots (element 1) to the total number of the dots.
5) Judging splicing segments: according to the calculation result of the step 4), comparing rho with a preset judgment threshold value T, and further judging the section S g Whether it is a spliced segment, where when p>And when T is reached, the fragment is a splicing fragment, otherwise, the fragment is an original fragment.
6) For N divided audio segments S' g Executing the steps 2) to 5) to sequentially judge all the segments S of the audio frequency to be detected g Whether it is a spliced fragment.
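To make the six steps concrete, the following is a minimal illustrative sketch in Python. The helper callables extract_mfcc and aslnet, the value of t (NUM_CONCAT), and the threshold value are assumptions for illustration only, not the patented implementation.

```python
# Illustrative sketch of steps 1)-6); extract_mfcc and aslnet are assumed
# callables supplied by the caller, and the constant values are assumptions.
import numpy as np

L_SLR = 16000          # minimum positioning region length, in samples
T_THRESHOLD = 0.5      # decision threshold T (assumed value)
NUM_CONCAT = 4         # t: segments concatenated per network input (assumed)

def detect_spliced_segments(audio, extract_mfcc, aslnet):
    """Return a spliced/original label for each length-L_SLR segment."""
    # Step 1): divide the audio into segments of L_slr consecutive samples.
    n_seg = len(audio) // L_SLR
    segments = [audio[i * L_SLR:(i + 1) * L_SLR] for i in range(n_seg)]
    labels = []
    for start in range(0, n_seg, NUM_CONCAT):
        group = segments[start:start + NUM_CONCAT]
        # Step 2): extract F_g for each S_g and concatenate along the time axis.
        feats = np.concatenate([extract_mfcc(s) for s in group], axis=1)
        # Step 3): the network outputs a 0/1 mask (1 = spliced, 0 = original).
        mask = aslnet(feats)
        # Steps 4)-5): per-segment ratio rho, compared against threshold T.
        for m in np.array_split(mask, len(group), axis=1):
            rho = float((m == 1).sum()) / m.size
            labels.append("spliced" if rho > T_THRESHOLD else "original")
    return labels
```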
The design and training of the proposed ASLNet network, the extraction of the spectrogram feature F_g, and the definition and calculation of the ratio ρ are explained in detail as follows.
[1] Extraction of the spectrogram feature F_g and the binary real mask:
The invention extracts mel-frequency cepstral coefficient (MFCC) features from an audio segment S_g as the spectrogram features; the flow of extracting the spectrogram features is shown in FIG. 1. The detailed MFCC extraction process is as follows. First, the energy of the signal at high frequencies is emphasized by a pre-emphasis module, and the short-time Fourier transform (STFT) of the pre-emphasized signal is computed using a periodic Hamming window with a length of 2048 samples and an overlap of 512 samples. The energy is then mapped to the mel-frequency scale using a mel filter bank and the logarithm is taken to produce a power map. Finally, the transform coefficients containing significant energy, i.e., the mel-frequency cepstral coefficients, are computed using the discrete cosine transform.
For an audio segment whose minimum positioning unit is 16000 sample points (i.e., L_slr = 16000), the first 24 coefficients are selected as the static MFCC features; the dynamic coefficients and acceleration coefficients are computed and concatenated to the static coefficients, forming a 72-dimensional feature vector per frame. The shape of the MFCC feature matrix is thus 72 × 32, where 72 is the number of coefficients and 32 is the number of frames. In addition, to train the decoder network, a binary real mask (ground truth mask) is designed for each MFCC feature matrix; it consists of 0 and 1 elements and has size 72 × 32. For an original audio segment, every element of the corresponding binary real mask is 0; for a spliced audio segment, every element is 1. In the present invention, the length L_slr can be set, according to the practical application, to the length of the audio clip the user wants to position.
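A minimal sketch of this feature extraction using the librosa library is given below; the exact parameters of the patented method may differ, and the 512-sample hop is the interpretation that reproduces the stated 32 frames per 16000-sample segment.

```python
# Re-implementation sketch of the Fig. 1 feature flow with librosa; the
# pre-emphasis, STFT, mel filtering, log power map, and DCT are folded into
# the library calls shown.
import librosa
import numpy as np

def extract_mfcc_feature(segment, sr=16000):
    """Return a 72 x 32 MFCC feature matrix for a 16000-sample segment."""
    # Pre-emphasis boosts the high-frequency energy of the signal.
    y = librosa.effects.preemphasis(segment)
    # 2048-sample periodic Hamming window; a 512-sample hop yields the
    # 32 frames stated in the text for a 16000-sample segment.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24,
                                n_fft=2048, hop_length=512,
                                window="hamming")
    # Dynamic (delta) and acceleration (delta-delta) coefficients are
    # concatenated with the 24 static coefficients: 3 x 24 = 72 rows.
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2])   # shape (72, 32) for L_slr = 16000
```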
[2] Design and training of ASLNet network:
the overall flow chart of the spliced audio detection and positioning method based on spectrogram segmentation is shown in fig. 2, wherein a full convolution network is the core of the overall flow; the full convolution network structure is a network structure commonly used by the current semantic segmentation algorithm and consists of an encoder and a decoder. The encoder performs convolution and downsampling to capture context information, while the decoder is responsible for deconvolution and upsampling to predict class labels at the pixel level. Many encoder-decoder architectures have been proposed (FCN, U-Net and SegNet) and successfully applied in the field of image pixel segmentation. The basic network architecture of the ASLNet of the present invention is a modified FCN-VGG16, which consists of a VGG16 encoder and a decoder with residual structure. The goal of the VGG16 encoder is to capture the context representation of the acoustic features, while the goal of the decoder is to convert the intermediate feature map to a binary prediction mask.
As shown in FIG. 3, VGG blocks are stacked to construct the VGG16 encoder, where each VGG block consists of two to three convolution blocks followed by one max-pooling layer, for a total of 13 convolutional layers and 5 max-pooling layers; each convolution block consists of a convolutional layer, a batch normalization layer, and a rectified linear unit (ReLU) activation function. All convolutional layers use the same kernel size of 3 × 3 with a convolution stride of 1; the padding size is 1 so that the output size after each convolutional layer stays the same. The max-pooling layers have size 2 × 2 and stride 2, halving the resolution after each VGG block. The decoder, which consists of two transposed convolutional layers and one SoftMax activation function, reconstructs the binary real mask using the basic information extracted by the VGG16 encoder. The kernel size of the first transposed convolution is 4 × 4 with stride 2; the kernel size of the second transposed convolution is 32 × 32 with stride 16. Furthermore, features learned by the lower layers are aggregated into the higher layers through a skip connection from the fourth VGG block to the first transposed convolution. The final SoftMax activation function is used to compute the probability that an element comes from a spliced audio segment. To train the ASLNet network, the data size of the network input is first determined, and then the MFCC matrices of multiple audio segments and the corresponding binary real masks are stitched together as the input and the label of the network, respectively.
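The following is a simplified PyTorch sketch of an FCN-VGG16 encoder-decoder of the kind described. The layer shapes follow the text, while the 1 × 1 scoring convolution, channel widths, and padding choices are assumptions, and input spatial sizes are assumed divisible by 32 (padding may be needed for a 72 × 32 map).

```python
# Simplified sketch of the described FCN-VGG16 encoder-decoder (FCN-16s
# style); details not stated in the text are assumptions.
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """Two to three conv blocks (conv + batch norm + ReLU), then 2x2 max-pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, stride=2))
    return nn.Sequential(*layers)

class ASLNetSketch(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        # VGG16 encoder: 13 conv layers in 5 blocks, 5 max-pooling layers.
        self.block1 = vgg_block(1, 64, 2)
        self.block2 = vgg_block(64, 128, 2)
        self.block3 = vgg_block(128, 256, 3)
        self.block4 = vgg_block(256, 512, 3)
        self.block5 = vgg_block(512, 512, 3)
        # Decoder: 4x4 transposed conv (stride 2), then 32x32 (stride 16).
        self.up2x = nn.ConvTranspose2d(512, 512, 4, stride=2, padding=1)
        self.score = nn.Conv2d(512, n_classes, 1)   # assumed 1x1 scoring layer
        self.up16x = nn.ConvTranspose2d(n_classes, n_classes, 32,
                                        stride=16, padding=8)

    def forward(self, x):                 # x: (batch, 1, freq, time)
        p4 = self.block4(self.block3(self.block2(self.block1(x))))
        p5 = self.block5(p4)
        # Skip connection: fourth-VGG-block features are added to the
        # output of the first transposed convolution.
        fused = self.up2x(p5) + p4
        logits = self.up16x(self.score(fused))
        # SoftMax gives the probability that each element is spliced.
        return torch.softmax(logits, dim=1)
```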
[3] Definition and calculation of the ratio ρ:
the method judges whether the audio of a small block is spliced audio according to a binary prediction mask output by an ASLNet network, just as the element definition in the binary true mask, for an original audio segment, each element in the corresponding binary true mask is 0, and for the spliced audio segment, each element in the corresponding binary true mask is 1. Therefore, the method calculates the ratio rho of the number of the elements of the spliced sampling points in the mask to the total number of the elements according to the binary prediction mask, and the specific rho calculation formula is as follows:
Figure BDA0003586801440000051
where Num represents the number of elements in the set. ρ calculated according to the above formula is compared with a preset threshold value T,to judge the audio block S g Whether the sample is a spliced sample, the specific formula is assumed as follows:
Figure BDA0003586801440000052
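A direct sketch of this computation follows, assuming the predicted mask for one segment S_g is available as a NumPy array of 0/1 elements.

```python
# Sketch of the rho computation and threshold decision for one segment S_g.
import numpy as np

def is_spliced(pred_mask: np.ndarray, threshold: float) -> bool:
    """Decide whether a segment is spliced from its binary prediction mask."""
    rho = np.count_nonzero(pred_mask == 1) / pred_mask.size
    return rho > threshold   # rho > T -> spliced; otherwise original
```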
a spliced audio detection and positioning system based on spectrogram segmentation by adopting the method comprises the following steps:
a detection segment division module for dividing the minimum positioning region length L according to the audio splicing tampering positioning slr Dividing the audio to be detected into a plurality of audio segments S to be detected g
A pre-processing module for extracting the audio segment S g Spectrogram feature F of g And the t audio segments S are divided according to the size of the network input g Spliced into one audio segment S' g And corresponding spectrogram feature F g Splicing to form a spectrogram feature F 'to be input into the network' g
Calculating a binary prediction mask module for splicing the spliced spectrogram features F' g Inputting the audio segment S 'into the trained spliced audio detection and positioning network, and calculating the spliced audio segment S' g A corresponding binary prediction mask;
a calculating element ratio module for calculating each audio segment S g The number of the splicing sampling points in the binary prediction mask is the proportion rho of the number of the splicing sampling points to the total number of the sampling points;
a splicing segment judgment module for comparing rho value with preset judgment threshold value T value and further judging the audio segment S g Whether it is a spliced fragment; for N divided audio segments S' g Sequentially judging all audio frequency segments S of the audio frequency to be detected g Whether it is a spliced fragment.
The spliced audio detection and positioning method based on spectrogram segmentation has the following beneficial effects in the related technical field:
1) It can improve the accuracy of detection and positioning. Since image target segmentation in the vision field is well developed, many advanced network structures achieve high recognition accuracy; applying these advanced structures to the spectrogram of audio for positioning can likewise achieve high accuracy.
2) The positioned region can be displayed visually to a certain extent. The binary prediction mask output by the ASLNet network clearly displays the proportion of spliced elements in the spectrogram features of an audio block S_g; since these elements have a definite correspondence with the audio sample points, the region can be positioned in the original waveform of the audio, and the two-dimensional image display conveys the possibility of tampering more intuitively.
3) The problem of dataset mismatch can be effectively alleviated to a certain extent. A complete audio segment has its own consistent characteristics, and the trained ASLNet network can learn these inherent characteristics; because the implementation of the ASLNet network does not depend on a specific training dataset, the method has a wide application range and can effectively analyze unknown audio datasets.
4) The method is extensible. Parameters such as the segment length L_slr, the full convolution network structure, and the final threshold T can be adjusted to the requirements of the actual environment, so that different spliced audio detection and positioning methods based on spectrogram segmentation can be extended and customized for different speech splicing detection and positioning scenarios.
Drawings
FIG. 1 is a flow chart of coefficient feature extraction for audio spectrogram in the present invention;
FIG. 2 is a flow chart of the spliced audio detection positioning of the present invention;
FIG. 3 is a schematic diagram of an Encoder-Decoder full convolution network of the present invention;
fig. 4 is a schematic diagram of the detection result after network iteration according to the present invention.
Detailed Description
The invention will now be further described by way of specific embodiments with reference to figure 2.
The invention provides a spliced audio detection and positioning method based on spectrogram segmentation, with the following specific operation details:
1) Detection segment division: according to the minimum positioning region length L_slr of audio splicing tamper positioning, divide the audio to be detected into several audio segments S_g to be detected, where each audio segment consists of consecutive sample points and has length L_slr.
2) Preprocessing: extract the spectrogram feature F_g of each audio segment S_g; according to the size of the network input, concatenate t audio segments S_g into one spliced audio segment S'_g and concatenate the corresponding spectrogram features F_g into the spectrogram feature F'_g to be input into the network.
3) Computing the binary prediction mask: input the concatenated spectrogram feature F'_g into the trained ASLNet network and compute the binary prediction mask corresponding to the spliced audio segment S'_g, where 1 in the binary prediction mask represents a spliced sample point and 0 represents an original sample point.
4) Computing the element ratio ρ: from the result of step 3), compute for each audio segment S_g the ratio ρ of the number of spliced sample points (elements equal to 1) to the total number of sample points in its binary prediction mask.
5) Judging spliced segments: from the result of step 4), compare ρ with the preset decision threshold T to judge whether the segment S_g is a spliced segment: when ρ > T, the segment is a spliced segment; otherwise it is an original segment.
6) For the N divided audio segments S'_g, execute steps 2) to 5) to judge in turn whether every segment S_g of the audio to be detected is a spliced segment.
As can be seen from the above detailed description: first, the method calibrates the forged region of the audio spectrogram mainly through a full convolution network from the vision field, thereby realizing splice positioning on the original audio waveform without depending on specific background noise or ENF signals; second, for different practical application scenarios, the minimum positioning region length L_slr and the preset value of T can be changed, thereby producing detection results with positioning intervals of different lengths and different confidence levels. The invention therefore has a wide application range and strong flexibility.
To demonstrate that the invention provides an effective spliced audio detection and positioning method, spliced audio detection and positioning experiments were carried out with the following experimental configuration:
1) Two spliced-sample datasets were made: a 2-second audio dataset (CNSet2s) and a 3-second audio dataset (CNSet3s). First, the audio in the FMFCC-A corpus was cut into segments of 1 second, 2 seconds, and 3 seconds. The 2-second and 3-second clips serve as the original samples of CNSet2s and CNSet3s, with 44,727 and 44,669 clips respectively. Then two non-homologous 1-second clips were randomly selected and concatenated into one spliced clip (i.e., the splice position is at the end of the other audio segment); 86,073 2-second spliced clips were randomly produced in this way to construct CNSet2s. In addition, a 1-second clip and a 2-second clip were randomly selected and the 1-second clip was inserted into the middle of the 2-second clip; 85,865 3-second spliced clips were randomly produced in this way, completing CNSet3s. A sketch of this construction is given below.
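The construction of the two kinds of spliced samples can be sketched as follows; loading of the FMFCC-A clips is assumed to happen elsewhere, and the function names are hypothetical.

```python
# Illustrative construction of the spliced samples; clips are 1-D NumPy
# arrays of samples already cut from the corpus.
import numpy as np

def make_end_splice(clip_a_1s, clip_b_1s):
    """CNSet2s-style sample: two non-homologous 1 s clips joined end to end,
    so the splice point lies at the boundary between the two clips."""
    return np.concatenate([clip_a_1s, clip_b_1s])

def make_mid_splice(host_2s, insert_1s):
    """CNSet3s-style sample: a 1 s clip inserted into the middle of a 2 s clip."""
    mid = len(host_2s) // 2
    return np.concatenate([host_2s[:mid], insert_1s, host_2s[mid:]])
```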
2) Dataset partition: CNSet2s and CNSet3s are each divided into a training set, a validation set, and a test set at a ratio of 6:2:2, used respectively for training the network model, selecting the model, and evaluating model performance.
3) Parameter extraction: in this experiment, the minimum positioning region length is set to L_slr = 16000; that is, 16000 sample points are taken as one minimum positioning interval each time the MFCC feature coefficients are extracted, yielding a 72 × 32 spectrogram matrix for each audio segment, and the binary real mask matrix corresponding to each audio segment in the training and validation sets is prepared.
4) Comparison method: since the method of Jadhav et al. is a current neural-network-based spliced audio detection method, it is compared with the present invention (reference: Jadhav, S., Patole, R., Rege, P.: Audio splicing detection using convolutional neural network. In: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1-5. IEEE 2019).
5) Training and detection: the ASLNet network of the method and the Jadhav network are trained and optimized using the training and validation sets, with each batch set to 64 audios; after each epoch on the training set, the model is validated once on the validation set, the loop runs for 200 epochs, and the model achieving the best validation result is selected as the final model for testing on the test set. In the testing stage, the extracted MFCC spectrogram matrices of the audio segments under test are input into the trained network model to obtain the corresponding predicted binary masks. A generic sketch of this protocol follows.
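The following sketch reflects the described training protocol; the optimizer, the pixel-wise loss, and the evaluate callable are assumptions, since the patent text does not specify them, and the model is assumed to return per-element logits.

```python
# Generic training-loop sketch: batches of 64 audios, one validation pass
# per epoch, 200 epochs, best-validation model kept for testing.
import copy
import torch

def train_aslnet(model, train_loader, val_loader, evaluate, epochs=200):
    opt = torch.optim.Adam(model.parameters())      # assumed optimizer
    loss_fn = torch.nn.CrossEntropyLoss()           # assumed pixel-wise loss on logits
    best_score, best_state = float("-inf"), None
    for _ in range(epochs):                         # the text specifies 200 epochs
        model.train()
        for feats, masks in train_loader:           # masks: per-element class labels
            opt.zero_grad()
            loss = loss_fn(model(feats), masks)
            loss.backward()
            opt.step()
        score = evaluate(model, val_loader)         # one validation pass per epoch
        if score > best_score:                      # keep the best validation model
            best_score, best_state = score, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```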
6) The ratio ρ of elements equal to 1 among all elements of the obtained binary prediction mask is calculated and compared with the predefined threshold T to judge whether the audio segment is a spliced segment; the true positive rate, true negative rate, and accuracy of the model are computed, the experiment is repeated 10 times, and the results are averaged to obtain the final result of the model.
According to the above experimental configuration, the spliced audio detection and positioning results are shown in Table 1. It can be seen that the method effectively detects spliced audio segments, and as the threshold T increases, the true positive rate of the method increases markedly, reducing the number of missed samples. In addition, the detection and positioning results of the method and of the Jadhav method on spliced audio are shown in Table 2; the detection performance of the method is clearly superior to that of the Jadhav method, making it well suited to spliced audio detection and positioning scenarios with high security requirements.
TABLE 1 Detection results of the present invention using different threshold values T
(Table 1 is rendered as an image in the original publication; its values are not reproduced here.)
TABLE 2 Detection results on spliced audio using the Jadhav method and the present invention
(Table 2 is rendered as an image in the original publication; its values are not reproduced here.)
Based on the same inventive concept, another embodiment of the present invention provides a spliced audio detection and positioning system based on spectrogram segmentation using the above method, comprising:
a detection segment division module, for dividing the audio to be detected into several audio segments S_g to be detected according to the minimum positioning region length L_slr of audio splicing tamper positioning;
a preprocessing module, for extracting the spectrogram feature F_g of each audio segment S_g, concatenating t audio segments S_g into one audio segment S'_g according to the size of the network input, and concatenating the corresponding spectrogram features F_g into the spectrogram feature F'_g to be input into the network;
a binary prediction mask computation module, for inputting the concatenated spectrogram feature F'_g into the trained spliced audio detection and positioning network and computing the binary prediction mask corresponding to the spliced audio segment S'_g;
an element ratio computation module, for computing, for each audio segment S_g, the ratio ρ of the number of spliced sample points to the total number of sample points in its binary prediction mask;
a spliced segment judgment module, for comparing ρ with the preset decision threshold T to judge whether the audio segment S_g is a spliced segment, and, for the N divided audio segments S'_g, judging in turn whether every audio segment S_g of the audio to be detected is a spliced segment.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
Other embodiments of the invention: for the preprocessing of step 2), the spectrogram features involved can be replaced by audio waveforms, statistical features of a spectrogram, or any acoustic features of audio.
For the binary prediction mask of step 3), the ASLNet involved can be replaced by a network of any encoder-decoder structure, such as U-Net, SegNet, etc.
For the calculation of the element ratio ρ in step 4), the ratio need not be a ratio of sample counts; it may instead be a weighted ratio or any other quantity representing the proportion.
For the judgment of spliced segments in step 5), the final result need not be obtained by comparison with a threshold; any decision mechanism may be used instead, such as training a binary classifier.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A spliced audio detection and positioning method based on spectrogram segmentation, characterized by comprising the following steps:
dividing the audio to be detected into several audio segments S_g to be detected according to the minimum positioning region length L_slr of audio splicing tamper positioning;
extracting the spectrogram feature F_g of each audio segment S_g, concatenating t audio segments S_g into one audio segment S'_g according to the size of the network input, and concatenating the corresponding spectrogram features F_g into the spectrogram feature F'_g to be input into the network;
inputting the concatenated spectrogram feature F'_g into the trained spliced audio detection and positioning network, and computing the binary prediction mask corresponding to the spliced audio segment S'_g;
computing, for each audio segment S_g, the ratio ρ of the number of spliced sample points to the total number of sample points in its binary prediction mask;
comparing ρ with a preset decision threshold T to judge whether the audio segment S_g is a spliced segment;
for the N divided audio segments S'_g, judging in turn whether every audio segment S_g of the audio to be detected is a spliced segment.
2. The method of claim 1, wherein extracting the spectrogram feature F_g of an audio segment S_g comprises extracting mel-frequency cepstral coefficient features from the audio segment S_g as the spectrogram features.
3. The method according to claim 2, wherein the flow of extracting the spectrogram features comprises: first, enhancing the energy of the signal at high frequencies with a pre-emphasis module, and computing the short-time Fourier transform of the pre-emphasized signal using a periodic Hamming window 2048 samples in length with an overlap of 512 samples; then mapping the energy to the mel-frequency scale with a mel filter bank and taking the logarithm to produce a power map; finally, computing the transform coefficients containing significant energy, i.e., the mel-frequency cepstral coefficients, using the discrete cosine transform.
4. The method of claim 1, wherein the basic network architecture of the spliced audio detection and positioning network is a modified FCN-VGG16 consisting of a VGG16 encoder and a decoder with a residual structure, the VGG16 encoder aiming to capture a contextual representation of the acoustic features and the decoder aiming to convert the intermediate feature map into the binary prediction mask.
5. The method of claim 4, wherein the VGG16 encoder is constructed by stacking VGG blocks, each VGG block consisting of two to three convolution blocks followed by a max-pooling layer; each convolution block consists of a convolutional layer, a batch normalization layer, and a rectified linear unit activation function; the decoder consists of two transposed convolutional layers and a SoftMax activation function; features from the low level are aggregated to the high level by a skip connection from the fourth VGG block to the first transposed convolution; and finally the SoftMax activation function is used to compute the probability that an element comes from a spliced audio segment.
6. The method of claim 1, wherein 1 in the binary prediction mask represents a spliced sample point and 0 represents an original sample point, and the ratio ρ is computed as:

ρ = Num({e | e = 1}) / Num({e})

where Num denotes the number of elements in a set.
7. The method of claim 1, wherein comparing ρ with the preset decision threshold T to judge whether the audio segment S_g is a spliced segment comprises: when ρ > T, the audio segment S_g is a spliced segment; otherwise, the audio segment S_g is an original segment.
8. A spliced audio detection and positioning system based on spectrogram segmentation using the method of any one of claims 1 to 7, comprising:
a detection segment division module, for dividing the audio to be detected into several audio segments S_g to be detected according to the minimum positioning region length L_slr of audio splicing tamper positioning;
a preprocessing module, for extracting the spectrogram feature F_g of each audio segment S_g, concatenating t audio segments S_g into one audio segment S'_g according to the size of the network input, and concatenating the corresponding spectrogram features F_g into the spectrogram feature F'_g to be input into the network;
a binary prediction mask computation module, for inputting the concatenated spectrogram feature F'_g into the trained spliced audio detection and positioning network and computing the binary prediction mask corresponding to the spliced audio segment S'_g;
an element ratio computation module, for computing, for each audio segment S_g, the ratio ρ of the number of spliced sample points to the total number of sample points in its binary prediction mask;
a spliced segment judgment module, for comparing ρ with the preset decision threshold T to judge whether the audio segment S_g is a spliced segment, and, for the N divided audio segments S'_g, judging in turn whether every audio segment S_g of the audio to be detected is a spliced segment.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.
CN202210368335.3A 2022-04-08 2022-04-08 Spliced audio detection and positioning method and system based on spectrogram segmentation Pending CN114819067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210368335.3A CN114819067A (en) 2022-04-08 2022-04-08 Spliced audio detection and positioning method and system based on spectrogram segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210368335.3A CN114819067A (en) 2022-04-08 2022-04-08 Spliced audio detection and positioning method and system based on spectrogram segmentation

Publications (1)

Publication Number Publication Date
CN114819067A true CN114819067A (en) 2022-07-29

Family

ID=82533942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210368335.3A Pending CN114819067A (en) 2022-04-08 2022-04-08 Spliced audio detection and positioning method and system based on spectrogram segmentation

Country Status (1)

Country Link
CN (1) CN114819067A (en)

Similar Documents

Publication Publication Date Title
Abdoli et al. End-to-end environmental sound classification using a 1D convolutional neural network
Stöter et al. Countnet: Estimating the number of concurrent speakers using supervised learning
US10540988B2 (en) Method and apparatus for sound event detection robust to frequency change
US7263485B2 (en) Robust detection and classification of objects in audio using limited training data
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN111986699B (en) Sound event detection method based on full convolution network
Bevinamarad et al. Audio forgery detection techniques: Present and past review
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN114596879A (en) False voice detection method and device, electronic equipment and storage medium
CN114582325A (en) Audio detection method and device, computer equipment and storage medium
Hao et al. Deepfake detection using multiple data modalities
Martin-Morato et al. Adaptive mid-term representations for robust audio event classification
Zhang et al. Aslnet: An encoder-decoder architecture for audio splicing detection and localization
CN109492124B (en) Method and device for detecting bad anchor guided by selective attention clue and electronic equipment
CN110808067A (en) Low signal-to-noise ratio sound event detection method based on binary multiband energy distribution
CN113761282B (en) Video duplicate checking method and device, electronic equipment and storage medium
CN114819067A (en) Spliced audio detection and positioning method and system based on spectrogram segmentation
Chuchra et al. A deep learning approach for splicing detection in digital audios
Lu et al. Detecting Unknown Speech Spoofing Algorithms with Nearest Neighbors
Jingzhou et al. Audio segmentation and classification approach based on adaptive CNN in broadcast domain
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
최인규 Data-Efficient and Weakly Supervised Techniques for Audio Event Detection
AU2005252714B2 (en) Effective audio segmentation and classification
Guo et al. Voice activity detection in the presence of transient based on graph
Ranjan et al. SV-DeiT: Speaker Verification with DeiTCap Spoofing Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination