CN110060703B - Method for detecting and positioning smoothing processing in voice segment - Google Patents
- Publication number
- CN110060703B (application number CN201810055610.XA)
- Authority
- CN
- China
- Prior art keywords: voice, original, detected, training, speech
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/54—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
Abstract
The invention discloses a method for detecting and locating smoothing processing within a speech segment, comprising the following steps: S1, selecting a smoothing filter; S2, selecting original speech, extracting an original speech set, and processing the original speech set through the filter into a training speech set; S3, extracting feature sets from the original speech set and the training speech set; S4, screening samples from the feature set of the original speech and the feature set of the training speech set respectively, and training an SVM classifier model; S5, selecting a speech to be detected, framing it, and extracting a feature set from each frame signal; S6, classifying the feature set of the speech to be detected with the SVM classifier model of step S4, judging whether each frame has been subjected to smoothing processing, and if so, locating the position of the smoothing. Compared with existing detection methods of the same kind, the proposed method has a higher detection rate and can serve as a high-success-rate way of judging whether digital speech has been smoothed.
Description
Technical Field
The present invention relates to the field of media content forensics, and more particularly, to a method of detecting and locating smoothing within a speech segment.
Background
At present, digital recording pens and the recording function of mobile phones are widely available, and digital recording is tending to replace the earlier analog recording. Digital audio plays a very important role as judicial evidence. However, with the wide popularization of audio editing software such as Cool Edit and Adobe Audition, even people without relevant professional knowledge can edit and modify digital audio. Therefore, it is necessary to authenticate the authenticity of digital audio.
Smoothing is a common audio post-processing operation, often applied to the tampered boundary after a digital audio recording has been deleted, cut, or spliced, so the authenticity of digital audio can be assessed by detecting whether smoothing is present. Detection of smoothing in long speech segments is relatively mature: common speech features such as MFCCs (Mel-frequency cepstral coefficients) can detect it effectively. However, when the smoothed speech segment is very short, for example only several hundred or even several tens of samples, most existing frequency-domain speech features are no longer applicable because such a segment contains very little frequency information.
Disclosure of Invention
The present invention provides a method for detecting and locating a smoothing process in a speech segment to overcome the above-mentioned drawbacks of the prior art.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a method of detecting and locating a smoothing process within a speech segment, comprising the steps of:
s1, selecting a smoothing filter;
s2, selecting original voice, extracting an original voice set, and processing the original voice set into a training voice set through the filter;
s3, extracting feature sets from the original voice and the training voice set;
s4, screening out samples from the feature set of the original voice and the feature set of the training voice set respectively, and training an SVM classifier model by adopting a classifier;
s5, selecting a voice to be detected, framing the voice to be detected, and extracting a voice feature set to be detected from each frame signal;
s6, classifying the voice feature set to be detected by using the SVM classifier model of the step S4, judging whether each frame signal has been subjected to smoothing processing, and if so, locating the position of the smoothing.
The working principle of the method is as follows: first, the original speech is smoothed by a smoothing filter to obtain a smoothed speech set; then, the feature set of the smoothed speech set is derived and used, together with the feature set of the original speech, to train a classifier model; finally, the speech to be detected is framed, its feature set is extracted and classified with the classifier model to judge whether each frame has been smoothed, and if so, the position of the smoothing is located.
Preferably, the smoothing filter of step S1 includes a linear filter and a nonlinear filter;
the linear filter comprises a triangular window function and two variants thereof, an average filter and a Gaussian filter;
the nonlinear filter is a median filter.
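As an illustration, representative smoothers of the kinds named above can be sketched in stdlib-only Python. The kernel shapes and unit-sum normalization are conventional choices, not taken from the patent, and the two triangular-window variants are omitted because the patent defines them only by their slope behavior:

```python
import math

def triangular_kernel(n):
    # standard triangular window (odd n), normalized to unit sum
    half = (n + 1) // 2
    w = list(range(1, half + 1)) + list(range(n - half, 0, -1))
    s = float(sum(w))
    return [v / s for v in w]

def average_kernel(n):
    # moving-average (mean) filter kernel
    return [1.0 / n] * n

def gaussian_kernel(n, sigma=1.0):
    # Gaussian kernel; sigma is an illustrative default
    c = (n - 1) / 2.0
    w = [math.exp(-((i - c) ** 2) / (2.0 * sigma ** 2)) for i in range(n)]
    s = sum(w)
    return [v / s for v in w]

def linear_filter(x, kernel):
    # 'same'-length convolution with zero padding at the edges
    r = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, kv in enumerate(kernel):
            t = i + j - r
            if 0 <= t < len(x):
                acc += x[t] * kv
        out.append(acc)
    return out

def median_filter(x, n):
    # the nonlinear smoother: median over a sliding window of length n
    r = n // 2
    out = []
    for i in range(len(x)):
        w = sorted(x[max(0, i - r): i + r + 1])
        out.append(w[len(w) // 2])
    return out
```

Any of these can be applied to a speech fragment (a list of samples) to produce its smoothed counterpart.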
Preferably, the step S2 includes the steps of:
s2.1, selecting original voices, and intercepting non-silent voice fragments with certain sample length from each section of voice to serve as an original voice set;
and S2.2, setting the lengths of the filtering windows to be 5, 7, 9, 11, 13, 15 and 31 respectively, and filtering each voice segment in the original voice set of the step S2.1 by using the filter of the step S1 to obtain a filtered voice segment which is used as a training voice set.
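The expansion in step S2.2 — every original segment filtered at every window length — can be sketched as follows, using only the average filter for brevity (any of the six smoothers could be substituted):

```python
WINDOW_LENGTHS = [5, 7, 9, 11, 13, 15, 31]   # the window lengths of step S2.2

def smooth_average(x, n):
    # average filter of window n, 'same' length, zero-padded edges
    r = n // 2
    out = []
    for i in range(len(x)):
        acc = sum(x[t] for t in range(max(0, i - r), min(len(x), i + r + 1)))
        out.append(acc / n)
    return out

def make_training_set(original_segments):
    # every original segment, filtered at every window length
    return [smooth_average(seg, n)
            for seg in original_segments
            for n in WINDOW_LENGTHS]
```

With six filter types this step multiplies each original segment into 6 × 7 smoothed training segments.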
Preferably, the step S3 is to derive a feature set of each speech segment in the original speech set and the training speech set of the step S2, and the step S3 includes the following steps:
s3.1, performing differential calculation on each section of voice segment in the original voice set and the training voice set in the step S2 to obtain a differential signal corresponding to each section of voice segment;
s3.2, standard deviation calculation is carried out on the difference signal in the step S3.1, and a calculation result is used as a first part of a feature set of each section of voice segment in the original voice set and the training voice set;
s3.3, carrying out Fourier transform on the differential signal in the step S3.1 to obtain a frequency domain signal corresponding to the differential signal;
s3.4, taking the original voice signal sampling rate of the step S2 as Fs, performing standard deviation calculation on the frequency signal of the frequency domain signal of the step S3.3 in a frequency interval from Fs/4 to Fs/2, and taking a calculation result as a second part of a feature set of each voice segment in the original voice set and the training voice set;
s3.5, filtering each section of voice fragment in the original voice set and the training voice set in the step S2 by adopting a median filter with the window length of 5, and calculating a residual error corresponding to each section of voice fragment;
and S3.6, carrying out differential calculation on the residual error in the step S3.5 to obtain a differential signal, and carrying out standard deviation calculation on the differential signal to obtain a standard deviation value which is used as a third part of the feature set of each section of voice segment in the original voice set and the training voice set.
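Steps S3.1–S3.6 can be sketched as a three-part feature extractor in stdlib-only Python (a minimal sketch: the zero-padding of short segments to the 128-point DFT and the exact high-band bin range are my assumptions about the intent of S3.4):

```python
import cmath, math

def diff(x):
    # first-order difference signal (S3.1)
    return [b - a for a, b in zip(x, x[1:])]

def std(x):
    # population standard deviation
    m = sum(x) / len(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

def dft_magnitudes(x, n=128):
    # length-128 DFT (the preferred Fourier length), zero-padded/truncated
    x = (list(x) + [0.0] * n)[:n]
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(x)))
            for k in range(n)]

def median5(x):
    # window-5 median filter used to build the residual (S3.5)
    out = []
    for i in range(len(x)):
        w = sorted(x[max(0, i - 2): i + 3])
        out.append(w[len(w) // 2])
    return out

def feature_vector(segment, n=128):
    d = diff(segment)                              # S3.1
    f1 = std(d)                                    # S3.2: first part
    mags = dft_magnitudes(d, n)                    # S3.3
    f2 = std(mags[n // 4: n // 2 + 1])             # S3.4: bins for Fs/4..Fs/2
    residual = [a - b for a, b in zip(segment, median5(segment))]  # S3.5
    f3 = std(diff(residual))                       # S3.6: third part
    return [f1, f2, f3]
```

The same extractor is reused unchanged on each frame of the speech to be detected in step S5.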
Preferably, the step S5 is configured to extract a feature set of each speech segment of the speech to be detected, and includes the following steps:
s5.1, selecting a voice to be detected, framing the voice to be detected with a certain sample length, and performing differential calculation on each frame signal to obtain a differential signal corresponding to each voice fragment;
s5.2, standard deviation calculation is carried out on the differential signal in the step S5.1, and a calculation result is used as a first part of a feature set of each section of voice fragment of the voice to be detected;
s5.3, carrying out Fourier transform on the differential signal in the step S5.1 to obtain a frequency domain signal corresponding to the differential signal;
s5.4, taking the sampling rate of the voice signal to be detected in the step S5.1 as Fs, calculating the standard deviation of the frequency signal of the frequency domain signal in the step S5.3 in the frequency interval from Fs/4 to Fs/2, and taking the calculation result as the second part of the feature set of each section of voice segment of the voice to be detected;
s5.5, filtering each section of voice fragment in the voice to be detected in the step S5.1 by adopting a median filter with the window length of 5, and calculating a residual error corresponding to each section of voice fragment;
and S5.6, carrying out differential calculation on the residual error in the step S5.5 to obtain a differential signal, and carrying out standard deviation calculation on the differential signal to obtain a standard deviation value which is used as a third part of the feature set of each section of the voice segment of the voice to be detected.
Preferably, the method for screening samples in step S4 randomly selects half of each of the feature set of the original speech and the feature set of the training speech set, and uses the selected half as the feature set sample of the original speech and the feature set sample of the training speech set respectively;
the classifier in the step S4 is a LibSVM classifier.
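The 50/50 screening of step S4 can be sketched as a random split (the patent then trains a LibSVM classifier on the screened samples; `seed` is a hypothetical parameter added here for reproducibility):

```python
import random

def screen_half(feature_sets, seed=0):
    # randomly select half the feature vectors as training samples (step S4);
    # the remainder can be held out for evaluation
    rng = random.Random(seed)
    idx = list(range(len(feature_sets)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train = [feature_sets[i] for i in idx[:half]]
    held_out = [feature_sets[i] for i in idx[half:]]
    return train, held_out
```

The split is applied separately to the original-speech feature set and the training-speech feature set, giving positive and negative samples for the SVM.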
Preferably, the certain sample length is 50, 100, or 150 samples.
Preferably, the Fourier transform length is 128.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
First, a speech feature set is obtained from the original speech set and the training speech set, and a classifier model is obtained through a classifier; when examining a speech to be detected, its feature set is extracted and classified with the classifier model, so as to judge whether each speech segment has been smoothed and to locate the smoothing. Compared with existing detection methods of the same kind, the proposed method has a significantly higher detection rate and can serve as a high-success-rate method for judging whether digital speech has been smoothed, and further for detecting and locating speech tampering performed with commercial audio-processing software.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a method of detecting and locating a smoothing process within a speech segment.
FIG. 2 is a flow chart of detecting the speech to be detected.
FIG. 3 is a diagram of a standard triangular window function.
Fig. 4 is a schematic diagram of the first variant of the triangular window function.
Fig. 5 is a schematic diagram of the second variant of the triangular window function.
FIG. 6 is a statistical histogram of correlation coefficients between adjacent samples in an original speech segment.
FIG. 7 is a statistical histogram of correlation coefficients between adjacent samples in a smoothed speech segment.
Fig. 8 is a differential signal before and after the speech segment smoothing process.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
A method for detecting and locating a smoothing process within a speech segment, as shown in fig. 1, comprising the steps of:
s1, selecting a smoothing filter;
s2, selecting original voice, extracting an original voice set, and processing the original voice set into a training voice set through a filter;
s3, extracting feature sets from the original voice and the training voice set;
s4, respectively screening out samples from the feature set of the original voice and the feature set of the training voice set, and training an SVM classifier model by adopting a classifier;
s5, as shown in FIG. 2, selecting a voice to be detected, framing the voice to be detected, and extracting a voice feature set to be detected from each frame signal;
s6, as shown in FIG. 2, classifying the speech feature set to be detected by using the SVM classifier model of step S4, judging whether each frame signal has been subjected to smoothing processing, and if so, locating the position of the smoothing.
In the present embodiment, the smoothing filter of step S1 includes a linear filter and a nonlinear filter;
The linear filters include the triangular window function shown in FIG. 3, the first triangular window function variant shown in FIG. 4, the second triangular window function variant shown in FIG. 5, an average filter, and a Gaussian filter. The first variant in FIG. 4 differs from the standard triangular window in that its slope decreases in the left half and increases in the right half; the second variant in FIG. 5 differs in that its slope increases in the left half and decreases in the right half.
The nonlinear filter is a median filter.
In the present embodiment, step S2 includes the following steps:
s2.1, selecting original voices, and intercepting non-silent voice fragments with certain sample length from each section of voice to serve as an original voice set;
and S2.2, setting the lengths of the filtering windows to be 5, 7, 9, 11, 13, 15 and 31 respectively, and filtering each voice segment in the original voice set in the step S2.1 by using the filter in the step S1 to obtain a filtered voice segment which is used as a training voice set.
In this embodiment, step S3 is used to derive the feature set of each speech segment in the original speech set and the training speech set of step S2, and step S3 includes the following steps:
s3.1, performing differential calculation on each section of voice segment in the original voice set and the training voice set in the step S2 to obtain a differential signal corresponding to each section of voice segment;
s3.2, standard deviation calculation is carried out on the difference signal in the step S3.1, and a calculation result is used as a first part of a feature set of each section of voice segment in the original voice set and the training voice set;
s3.3, carrying out Fourier transform on the differential signal in the step S3.1 to obtain a frequency domain signal corresponding to the differential signal;
s3.4, taking the original voice signal sampling rate of the step S2 as Fs, carrying out standard deviation calculation on the frequency signals of the frequency domain signals of the step S3.3 in a frequency interval from Fs/4 to Fs/2, and taking the calculation result as a second part of the feature set of each voice segment in the original voice set and the training voice set;
s3.5, filtering each section of voice fragment in the original voice set and the training voice set in the step S2 by adopting a median filter with the window length of 5, and calculating a residual error corresponding to each section of voice fragment;
and S3.6, carrying out differential calculation on the residual error in the step S3.5 to obtain a differential signal, and carrying out standard deviation calculation on the differential signal to obtain a standard deviation value which is used as a third part of the feature set of each section of voice segment in the original voice set and the training voice set.
In this embodiment, as shown in fig. 2, the step S5 is configured to extract a feature set of each speech segment of the speech to be detected, and includes the following steps:
s5.1, selecting the voice to be detected, framing the voice to be detected with a certain sample length, and performing differential calculation on each frame signal to obtain a differential signal corresponding to each voice fragment;
s5.2, standard deviation calculation is carried out on the differential signal in the step S5.1, and a calculation result is used as a first part of a feature set of each section of voice fragment of the voice to be detected;
s5.3, carrying out Fourier transform on the differential signal in the step S5.1 to obtain a frequency domain signal corresponding to the differential signal;
s5.4, taking the sampling rate of the voice signal to be detected in the step S5.1 as Fs, calculating the standard deviation of the frequency signal of the frequency domain signal in the step S5.3 in the frequency interval from Fs/4 to Fs/2, and taking the calculation result as the second part of the feature set of each section of voice segment of the voice to be detected;
s5.5, filtering each section of voice fragment in the voice to be detected in the step S5.1 by adopting a median filter with the window length of 5, and calculating a residual error corresponding to each section of voice fragment;
and S5.6, carrying out differential calculation on the residual error in the step S5.5 to obtain a differential signal, and carrying out standard deviation calculation on the differential signal to obtain a standard deviation value which is used as a third part of the feature set of each section of the voice segment of the voice to be detected.
In this embodiment, the method for screening samples in step S4 randomly selects half of each of the feature set of the original speech and the feature set of the training speech set, and uses the selected half as the feature set sample of the original speech and the feature set sample of the training speech set respectively;
the classifier in step S4 is a LibSVM classifier.
In the present embodiment, the certain sample lengths are 50 samples, 100 samples, and 150 samples.
In this embodiment, the length of the fourier transform is 128.
The principle of the method provided by the invention is as follows:
As shown in fig. 6 and 7, smoothing digital speech at a given position strengthens the correlation between adjacent samples of the speech signal, so the difference signal of a smoothed speech segment differs markedly from that of the original speech; as shown in fig. 8, the difference signal of the smoothed speech signal has smaller amplitude and changes more slowly. Moreover, if a speech segment is filtered again after smoothing, the residual of the already-smoothed signal is flatter than that of the original signal, so the difference signal of the residual also effectively distinguishes smoothed from original speech. The proposed method therefore combines the standard deviation of the difference signal of the speech segment, the standard deviation of the high-frequency part of that difference signal, and the standard deviation of the difference signal of the segment's median-filtering residual into a feature set, which can effectively detect and locate the position of smoothing within a speech segment.
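The mechanism just described — smoothing raises adjacent-sample correlation and flattens the difference signal — can be demonstrated with a small stdlib-only sketch (the toy signal and the window-3 average filter are illustrative assumptions, not the patent's data):

```python
import math

def diff(x):
    # first-order difference signal
    return [b - a for a, b in zip(x, x[1:])]

def std(x):
    m = sum(x) / len(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

def average3(x):
    # simple window-3 smoother standing in for any of the six filters
    out = []
    for i in range(len(x)):
        w = x[max(0, i - 1): i + 2]
        out.append(sum(w) / len(w))
    return out

# toy "speech": a slow oscillation plus an alternating perturbation
original = [math.sin(0.3 * i) + (0.2 if i % 2 else -0.2) for i in range(100)]
smoothed = average3(original)

# smoothing strengthens adjacent-sample correlation, so the difference
# signal of the smoothed segment has a clearly smaller standard deviation
print(std(diff(original)) > std(diff(smoothed)))   # expect True
```

This is exactly the statistic used as the first part of the feature set.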
The present example also includes the following experiments and experimental results.
This embodiment uses a speech library of 13240 utterances. From each utterance, three non-silent speech segments of 50, 100, and 150 samples are intercepted, forming three original speech sets. Each original speech set is then smoothed with the six filter models above to obtain the corresponding smoothed speech sets. The proposed feature set is extracted from each speech set, and a LibSVM classifier is used to classify the original and smoothed speech sets. Three groups of experiments were carried out: experiments following the scheme of the invention, comparative experiments against existing filtering detection algorithms, and experiments detecting the smoothing introduced by audio editing software.
In the experiments performed according to the scheme of the invention, the purpose was to verify the effect of different speech segment lengths on the method. Segments of 50, 100, and 150 samples were tested, and the results are shown in Table 1, Table 2, and Table 3.
TABLE 1 detection Rate of the method proposed by the present invention (Speech segment length 50 samples)
TABLE 2 detection Rate of the method proposed by the present invention (speech segment length 100 samples)
TABLE 3 detection Rate of the method proposed by the present invention (speech segment length 150 samples)
The accuracy in tables 1, 2, and 3 is the average accuracy of the original voice fragment and the smoothed voice fragment classified using LibSVM. For each smoothing operation, the filter windows have seven lengths of 5, 7, 9, 11, 13, 15 and 31, respectively.
The above experimental results show that, for the six different types of filtering operation, the method can effectively distinguish whether a speech segment has been smoothed, even for segments as short as 50 samples, and even when the filter window length is only 5. In practical application, the speech to be detected can be framed, and the proposed feature set extracted and classified for each segment, thereby detecting and locating smoothing within the speech segment.
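The frame-then-classify localization strategy just described can be sketched as follows (`is_smoothed` is a hypothetical stand-in for the trained SVM decision function):

```python
def frames(signal, frame_len=50):
    # non-overlapping frames; 50 samples is the shortest length tested above
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def locate_smoothing(signal, is_smoothed, frame_len=50):
    # classify every frame; return the sample offset of each frame
    # the classifier flags as smoothed
    return [k * frame_len
            for k, frame in enumerate(frames(signal, frame_len))
            if is_smoothed(frame)]
```

The returned offsets give the positions of the smoothing within the speech to be detected.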
In the comparative experiment against existing filtering detection algorithms, this embodiment adopts as baselines the median-filtering detection method based on AR coefficients proposed in "Robust Median Filtering Forensics Using an Autoregressive Model" (IEEE Transactions on Information Forensics and Security, Volume 8, Issue 9, Sept. 2013, DOI: 10.1109/TIFS.2013.2273394) and the speech post-processing detection method proposed in "Audio Postprocessing Detection Based on Amplitude Cooccurrence Vector Feature" (IEEE Signal Processing Letters, Volume 23, Issue 5, May 2016, DOI: 10.1109/LSP.2016.4925600). The speech segment length is 50 samples, and the experimental results are shown in Table 4 and Table 5.
Table 4 detection rate of method for detecting median filter using AR coefficient (voice segment length is 50 samples)
TABLE 5 detection Rate of the method of Speech post-processing detection (Speech segment length 50 samples)
Comparing the experiments performed according to the scheme of the invention with the comparative experiments on the existing filtering detection algorithms, the proposed scheme achieves significantly higher accuracy.
In the experiment on smoothing introduced by audio editing software, the widely used editors Cool Edit and Adobe Audition were selected to edit and modify digital speech. To avoid destroying the continuity of the speech signal, such software automatically smooths the signal at the tampered boundary, usually over only a few dozen samples. The smoothing algorithms of these programs are not disclosed, so, to demonstrate the practical value of the invention, each program was used to delete part of the non-silent portion of every utterance in the speech library of this embodiment, causing both programs to automatically smooth the tampered boundary. The speech segments automatically smoothed by the editor (about 20-30 samples long) were then intercepted as the smoothed speech data set, and un-smoothed segments near the tampered boundary (30 samples long) were intercepted as the original speech data set. The two data sets were classified with the present invention and compared against the AR-coefficient median-filtering method and the Luo method; the experimental results are shown in the following table:
TABLE 6 detection rate of three methods for voice segments smoothed by audio editing software
The experimental results show that the proposed method can effectively detect the unknown smoothing operations applied to digital speech by commercial audio editing software, demonstrating clear practical value for detecting and locating smoothing in speech segments in real-world settings.
From the above three groups of experimental results, the method for detecting and locating smoothing within a speech segment provided by the invention achieves high detection accuracy and can effectively detect six common smoothing operations, covering both linear and nonlinear filters. Even when the smoothed speech signal is only 50 samples long, the method remains effective. It also effectively detects the unknown smoothing operations of commercial audio processing software, giving it genuine practical significance for audio forensics.
It should be understood that the above-described embodiments are merely examples given to illustrate the present invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims.
Claims (7)
1. A method for detecting and locating a smoothing process within a speech segment, comprising the steps of:
s1, selecting a smoothing filter;
s2, selecting original voice, extracting an original voice set, and processing the original voice set into a training voice set through the filter;
S3, extracting feature sets from the original voice set and the training voice set, comprising the following steps:
S3.1, performing differential calculation on each speech segment in the original voice set and the training voice set of step S2 to obtain a differential signal corresponding to each speech segment;
S3.2, performing standard deviation calculation on the differential signal of step S3.1, and taking the calculation result as the first part of the feature set of each speech segment in the original voice set and the training voice set;
S3.3, performing Fourier transform on the differential signal of step S3.1 to obtain a corresponding frequency-domain signal;
S3.4, with the sampling rate of the original voice signal of step S2 denoted Fs, performing standard deviation calculation on the frequency-domain signal of step S3.3 over the frequency interval from Fs/4 to Fs/2, and taking the calculation result as the second part of the feature set of each speech segment in the original voice set and the training voice set;
S3.5, filtering each speech segment in the original voice set and the training voice set of step S2 with a median filter of window length 5, and calculating the residual corresponding to each speech segment;
S3.6, performing differential calculation on the residual of step S3.5 to obtain a differential signal, performing standard deviation calculation on that differential signal, and taking the resulting standard deviation as the third part of the feature set of each speech segment in the original voice set and the training voice set;
S4, screening samples from the feature set of the original voice and the feature set of the training voice set respectively, and using a classifier to train an SVM classifier model;
S5, selecting a speech to be detected, framing the speech to be detected, and extracting a feature set to be detected from each frame signal;
S6, classifying the feature set to be detected with the SVM classifier model of step S4, judging whether each frame signal has undergone smoothing processing, and if so, locating the position where the smoothing processing occurred.
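For illustration only, the three-part feature extraction of steps S3.1 to S3.6 can be sketched in Python. This is a minimal sketch assuming NumPy and SciPy; the function name, the handling of the 128-point FFT band, and the default parameters are illustrative, not part of the claimed method:

```python
import numpy as np
from scipy.signal import medfilt

def extract_features(segment, fs, nfft=128):
    """Three-part feature set for one speech segment (steps S3.1-S3.6).

    Part 1: standard deviation of the first-order difference (S3.1-S3.2).
    Part 2: standard deviation of the FFT magnitude of that difference
            over the band Fs/4 .. Fs/2 (S3.3-S3.4; 128-point FFT per claim 7).
    Part 3: standard deviation of the difference of the residual left by
            a window-5 median filter (S3.5-S3.6).
    """
    diff = np.diff(segment)                          # S3.1: differential signal
    f1 = np.std(diff)                                # S3.2

    spectrum = np.abs(np.fft.fft(diff, n=nfft))      # S3.3
    freqs = np.abs(np.fft.fftfreq(nfft, d=1.0 / fs))
    band = freqs >= fs / 4.0                         # S3.4: Fs/4 .. Fs/2 band
    f2 = np.std(spectrum[band])

    residual = segment - medfilt(segment, kernel_size=5)  # S3.5
    f3 = np.std(np.diff(residual))                   # S3.6
    return np.array([f1, f2, f3])
```

These 3-dimensional feature vectors could then be fed to an SVM, for example `sklearn.svm.SVC`, whose implementation is itself based on LibSVM, matching the classifier named in claim 5.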
2. The method for detecting and locating the smoothing process in a speech segment according to claim 1, wherein the smoothing filter of step S1 comprises a linear filter and a nonlinear filter;
the linear filter comprises a triangular window function and two variants thereof, an average filter, and a Gaussian filter;
the nonlinear filter is a median filter.
3. The method for detecting and locating a smoothing process within a speech segment according to claim 1, wherein said step S2 includes the steps of:
S2.1, selecting original voices, and intercepting a non-silent voice fragment of a certain sample length from each voice to serve as the original voice set;
S2.2, setting the filtering window lengths to 5, 7, 9, 11, 13, 15, and 31 respectively, and filtering each speech segment in the original voice set of step S2.1 with the filter of step S1 to obtain filtered speech segments as the training voice set.
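As a sketch of steps S2.1 and S2.2, the training set can be built by smoothing every original segment with each listed window length. A simple moving-average filter stands in here for the filters of step S1; NumPy is assumed and the function names are illustrative:

```python
import numpy as np

WINDOW_LENGTHS = [5, 7, 9, 11, 13, 15, 31]  # step S2.2

def moving_average(x, win):
    # One of the linear smoothing filters of step S1 (the average filter).
    return np.convolve(x, np.ones(win) / win, mode="same")

def build_training_set(original_segments):
    # Step S2.2: filter every original segment with every window length,
    # yielding len(original_segments) * len(WINDOW_LENGTHS) smoothed segments.
    return [moving_average(seg, w)
            for seg in original_segments
            for w in WINDOW_LENGTHS]
```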
4. The method for detecting and locating the smoothing process in a speech segment according to claim 1, wherein extracting the feature set of each speech segment of the speech to be detected in step S5 comprises the following steps:
S5.1, selecting a speech to be detected, framing it by a certain sample length, and performing differential calculation on each frame signal to obtain a differential signal corresponding to each speech segment;
S5.2, performing standard deviation calculation on the differential signal of step S5.1, and taking the calculation result as the first part of the feature set of each speech segment of the speech to be detected;
S5.3, performing Fourier transform on the differential signal of step S5.1 to obtain a corresponding frequency-domain signal;
S5.4, with the sampling rate of the speech signal to be detected in step S5.1 denoted Fs, performing standard deviation calculation on the frequency-domain signal of step S5.3 over the frequency interval from Fs/4 to Fs/2, and taking the calculation result as the second part of the feature set of each speech segment of the speech to be detected;
S5.5, filtering each speech segment of the speech to be detected in step S5.1 with a median filter of window length 5, and calculating the residual corresponding to each speech segment;
S5.6, performing differential calculation on the residual of step S5.5 to obtain a differential signal, performing standard deviation calculation on that differential signal, and taking the resulting standard deviation as the third part of the feature set of each speech segment of the speech to be detected.
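The framing of step S5.1 and the localization of step S6 can be sketched as follows. The `classify` callable is hypothetical, standing in for the trained SVM model, and the 50-sample default frame length matches the smallest length listed in claim 6:

```python
import numpy as np

def frame_signal(speech, frame_len=50):
    # Step S5.1: split the speech to be detected into fixed-length frames.
    n_frames = len(speech) // frame_len
    return np.reshape(speech[:n_frames * frame_len], (n_frames, frame_len))

def locate_smoothed_frames(speech, classify, frame_len=50):
    # Step S6: flag each frame the classifier labels as smoothed and
    # report its sample range within the speech (the localization result).
    hits = []
    for i, frame in enumerate(frame_signal(speech, frame_len)):
        if classify(frame) == 1:          # 1 = smoothed, 0 = original
            hits.append((i * frame_len, (i + 1) * frame_len))
    return hits
```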
5. The method for detecting and locating the smoothing process in a speech segment according to claim 1, wherein the sample screening of step S4 comprises randomly selecting half of the feature set of the original voice and half of the feature set of the training voice set as the feature set sample of the original voice and the feature set sample of the training voice set respectively;
the classifier of step S4 is a LibSVM classifier.
6. The method according to claim 3 or 4, wherein the certain sample length is 50 samples, 100 samples, or 150 samples.
7. The method for detecting and locating the smoothing process in a speech segment according to claim 1 or 4, wherein the Fourier transform has a length of 128.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810055610.XA CN110060703B (en) | 2018-01-19 | 2018-01-19 | Method for detecting and positioning smoothing processing in voice segment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110060703A CN110060703A (en) | 2019-07-26 |
CN110060703B true CN110060703B (en) | 2021-05-04 |
Family
ID=67315321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810055610.XA Active CN110060703B (en) | 2018-01-19 | 2018-01-19 | Method for detecting and positioning smoothing processing in voice segment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110060703B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111445924B (en) * | 2020-03-18 | 2023-07-04 | 中山大学 | Method for detecting and positioning smoothing process in voice segment based on autoregressive model coefficient |
CN111916059B (en) * | 2020-07-01 | 2022-12-27 | 深圳大学 | Smooth voice detection method and device based on deep learning and intelligent equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103440868A (en) * | 2013-08-09 | 2013-12-11 | 中山大学 | Method for identifying video processed through electronic tone modification |
CN105118503A (en) * | 2015-07-13 | 2015-12-02 | 中山大学 | Ripped audio detection method |
CN106409298A (en) * | 2016-09-30 | 2017-02-15 | 广东技术师范学院 | Identification method of sound rerecording attack |
EP3267350A1 (en) * | 2016-07-06 | 2018-01-10 | Trust Ltd. | Method of and system for analysis of interaction patterns of malware with control centers for detection of cyber attack |
2018-01-19: CN CN201810055610.XA patent application filed (granted as CN110060703B); status: Active
Non-Patent Citations (1)
Title |
---|
Cecilia Pasquini et al., "A Deterministic Approach to Detect Median Filtering in 1D Data," IEEE Transactions on Information Forensics and Security, vol. 11, no. 7, pp. 1425-1436, July 2016. * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10397646B2 (en) | Method, system, and program product for measuring audio video synchronization using lip and teeth characteristics | |
US8831942B1 (en) | System and method for pitch based gender identification with suspicious speaker detection | |
CN108665903B (en) | Automatic detection method and system for audio signal similarity | |
US9704495B2 (en) | Modified mel filter bank structure using spectral characteristics for sound analysis | |
US20080111887A1 (en) | Method, system, and program product for measuring audio video synchronization independent of speaker characteristics | |
CN108538312B (en) | Bayesian information criterion-based automatic positioning method for digital audio tamper points | |
CN110060703B (en) | Method for detecting and positioning smoothing processing in voice segment | |
CN104021785A (en) | Method of extracting speech of most important guest in meeting | |
CN103559882A (en) | Meeting presenter voice extracting method based on speaker division | |
CN106910495A (en) | A kind of audio classification system and method for being applied to abnormal sound detection | |
CN110988137A (en) | Abnormal sound detection system and method based on time-frequency domain characteristics | |
CN111445924B (en) | Method for detecting and positioning smoothing process in voice segment based on autoregressive model coefficient | |
CN110767248B (en) | Anti-modulation interference audio fingerprint extraction method | |
CN109920447B (en) | Recording fraud detection method based on adaptive filter amplitude phase characteristic extraction | |
EP3504708B1 (en) | A device and method for classifying an acoustic environment | |
CN104021791A (en) | Detecting method based on digital audio waveform sudden changes | |
Wang et al. | Automatic audio segmentation using the generalized likelihood ratio | |
CN116884431A (en) | CFCC (computational fluid dynamics) feature-based robust audio copy-paste tamper detection method and device | |
Mills et al. | Replay attack detection based on voice and non-voice sections for speaker verification | |
Lin et al. | A replay speech detection algorithm based on sub-band analysis | |
Baskoro et al. | Analysis of Voice Changes in Anti Forensic Activities Case Study: Voice Changer with Telephone Effect | |
CN104091104A (en) | Feature extraction and authentication method for multi-format audio perceptual Hashing authentication | |
Tapkir et al. | Significance of teager energy operator phase for replay spoof detection | |
Lin et al. | A robust method for speech replay attack detection | |
WO2006113409A2 (en) | Method, system, and program product for measuring audio video synchronization using lip and teeth charateristics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||