CN110060703B - Method for detecting and positioning smoothing processing in voice segment


Info

Publication number
CN110060703B
CN110060703B (application CN201810055610.XA; published as CN110060703A)
Authority
CN
China
Prior art keywords
voice
original
detected
training
speech
Prior art date
Legal status (assumption, not a legal conclusion)
Active
Application number
CN201810055610.XA
Other languages
Chinese (zh)
Other versions
CN110060703A (en)
Inventor
闫琦 (Yan Qi)
杨锐 (Yang Rui)
黄继武 (Huang Jiwu)
Current Assignee
Shenzhen University
Sun Yat Sen University
Original Assignee
Shenzhen University
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University and Sun Yat-sen University
Priority to CN201810055610.XA
Publication of CN110060703A
Application granted
Publication of CN110060703B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L25/54 - Speech or voice analysis techniques specially adapted for comparison or discrimination, for retrieval

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a method for detecting and locating smoothing processing within a speech segment, comprising the following steps: S1, selecting a smoothing filter; S2, selecting original speech, extracting an original speech set, and processing the original speech set through the filter to obtain a training speech set; S3, extracting feature sets from the original speech set and the training speech set; S4, screening samples from the feature set of the original speech and the feature set of the training speech set, and training an SVM classifier model; S5, selecting the speech to be detected, framing it, and extracting a feature set from each frame signal; S6, classifying the feature set of the speech to be detected with the SVM classifier model of step S4, judging whether each frame signal has been smoothed, and if so, locating the position of the smoothing. Compared with existing detection methods of the same kind, the proposed method achieves a higher detection rate and can serve as a high-success-rate means of judging whether digital speech has been smoothed.

Description

Method for detecting and positioning smoothing processing in voice segment
Technical Field
The present invention relates to the field of media content forensics, and more particularly, to a method of detecting and locating smoothing within a speech segment.
Background
At present, the recording functions of digital voice recorders and mobile phones are widely available, and digital recording is replacing the older analog recording. Digital audio plays a very important role as evidence in judicial proceedings. However, with the wide availability of audio editing software such as Cool Edit and Adobe Audition, even people without relevant professional knowledge can edit and modify digital audio. It is therefore necessary to authenticate the authenticity of digital audio.
Smoothing is a common audio post-processing operation, often applied to the tampered boundary after digital audio has been deleted, cut, or spliced, so the authenticity of digital audio can be assessed by detecting whether smoothing is present. Detection of smoothing in long speech segments is relatively mature: common speech features such as MFCC (Mel-frequency cepstral coefficients) can detect it effectively. However, when the smoothed region is short, for example only a few hundred or even a few dozen samples, most existing frequency-domain speech features are no longer applicable, because such a segment contains very little frequency information.
Disclosure of Invention
The present invention provides a method for detecting and locating a smoothing process in a speech segment to overcome the above-mentioned drawbacks of the prior art.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a method of detecting and locating a smoothing process within a speech segment, comprising the steps of:
s1, selecting a smoothing filter;
s2, selecting original voice, extracting an original voice set, and processing the original voice set into a training voice set through the filter;
s3, extracting feature sets from the original voice and the training voice set;
s4, screening out samples from the feature set of the original voice and the feature set of the training voice set respectively, and training an SVM classifier model by adopting a classifier;
s5, selecting a voice to be detected, framing the voice to be detected, and extracting a voice feature set to be detected from each frame signal;
s6, classifying the voice feature set to be detected by using the SVM classifier model of the step S4, judging whether the signal is subjected to smoothing treatment, and if so, positioning the position where the smoothing treatment is positioned.
The working principle of the method is as follows: first, the original speech is smoothed by a smoothing filter to obtain a smoothed speech set; then a feature set is extracted from the smoothed speech set and a classifier model is trained with a classifier; finally, the speech to be detected is framed, a feature set is extracted from each frame and classified with the classifier model to judge whether each frame has been smoothed, and if so, the position of the smoothing is located.
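The detection stage described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the non-overlapping framing, the classifier interface, and the `(start, end)` sample positions returned are my own choices, since the method only specifies framing by a certain sample length.

```python
import numpy as np

def locate_smoothing(signal, fs, frame_len, clf, extract_features):
    """Frame the signal, classify each frame, return smoothed-frame positions.

    clf: trained classifier exposing predict(); extract_features: a function
    mapping (frame, fs) to the feature vector described in steps S3/S5.
    """
    positions = []
    # Non-overlapping frames of frame_len samples (illustrative choice).
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        feats = extract_features(frame, fs).reshape(1, -1)
        if clf.predict(feats)[0] == 1:  # label 1 = smoothed (assumed convention)
            positions.append((start, start + frame_len))
    return positions
```

With a trained model, the returned sample ranges directly give the locations where smoothing is detected.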
Preferably, the smoothing filter of step S1 includes a linear filter and a nonlinear filter;
the linear filter comprises a triangular window function and two variants thereof, an average filter and a Gaussian filter;
the nonlinear filter is a median filter.
Preferably, the step S2 includes the steps of:
s2.1, selecting original voices, and intercepting non-silent voice fragments with certain sample length from each section of voice to serve as an original voice set;
and S2.2, setting the lengths of the filtering windows to be 5, 7, 9, 11, 13, 15 and 31 respectively, and filtering each voice segment in the original voice set of the step S2.1 by using the filter of the step S1 to obtain a filtered voice segment which is used as a training voice set.
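The construction of the training speech set in steps S2.1-S2.2 can be sketched as follows; the `filters` mapping and function names are illustrative, not from the patent:

```python
import numpy as np

WINDOW_LENGTHS = [5, 7, 9, 11, 13, 15, 31]  # filtering-window lengths of S2.2

def build_training_set(original_set, filters):
    """Apply every filter at every window length to each original segment.

    original_set: list of 1-D NumPy arrays (non-silent speech segments, S2.1).
    filters: dict mapping a name to a function f(x, w).
    Returns a flat list of filtered segments forming the training speech set.
    """
    training_set = []
    for seg in original_set:
        for f in filters.values():
            for w in WINDOW_LENGTHS:
                training_set.append(f(seg, w))
    return training_set
```

For each original segment, this yields one filtered copy per (filter, window length) pair.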
Preferably, the step S3 is to derive a feature set of each speech segment in the original speech set and the training speech set of the step S2, and the step S3 includes the following steps:
s3.1, performing differential calculation on each section of voice segment in the original voice set and the training voice set in the step S2 to obtain a differential signal corresponding to each section of voice segment;
s3.2, standard deviation calculation is carried out on the difference signal in the step S3.1, and a calculation result is used as a first part of a feature set of each section of voice segment in the original voice set and the training voice set;
s3.3, carrying out Fourier transform on the differential signal in the step S3.1 to obtain a frequency domain signal corresponding to the differential signal;
s3.4, taking the original voice signal sampling rate of the step S2 as Fs, performing standard deviation calculation on the frequency signal of the frequency domain signal of the step S3.3 in a frequency interval from Fs/4 to Fs/2, and taking a calculation result as a second part of a feature set of each voice segment in the original voice set and the training voice set;
s3.5, filtering each section of voice fragment in the original voice set and the training voice set in the step S2 by adopting a median filter with the window length of 5, and calculating a residual error corresponding to each section of voice fragment;
and S3.6, carrying out differential calculation on the residual error in the step S3.5 to obtain a differential signal, and carrying out standard deviation calculation on the differential signal to obtain a standard deviation value which is used as a third part of the feature set of each section of voice segment in the original voice set and the training voice set.
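The three-part feature set of steps S3.1-S3.6 can be sketched as below. This reflects my reading of the patent: the 128-point FFT magnitude is used as the frequency-domain signal, and the Fs/4 to Fs/2 interval is taken over the positive-frequency bins.

```python
import numpy as np

def extract_features(segment, fs):
    """Three-part feature vector of one speech segment (S3.1-S3.6)."""
    d = np.diff(segment)              # S3.1: first-order difference signal
    f1 = np.std(d)                    # S3.2: first feature

    spec = np.abs(np.fft.fft(d, n=128))        # S3.3: 128-point spectrum
    freqs = np.fft.fftfreq(128, d=1.0 / fs)
    band = (freqs >= fs / 4) & (freqs <= fs / 2)
    f2 = np.std(spec[band])           # S3.4: second feature (high band)

    # S3.5: residual after window-5 median filtering (edge padding assumed)
    xp = np.pad(segment, 2, mode="edge")
    med = np.array([np.median(xp[i:i + 5]) for i in range(len(segment))])
    residual = segment - med
    f3 = np.std(np.diff(residual))    # S3.6: third feature
    return np.array([f1, f2, f3])
```

The same function serves for step S5, applied per frame of the speech to be detected.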
Preferably, the step S5 is configured to extract a feature set of each speech segment of the speech to be detected, and includes the following steps:
s5.1, selecting a voice to be detected, framing the voice to be detected by a certain sample length, and calculating each frame signal to obtain a differential signal corresponding to each section of voice fragment;
s5.2, standard deviation calculation is carried out on the differential signal in the step S5.1, and a calculation result is used as a first part of a feature set of each section of voice fragment of the voice to be detected;
s5.3, carrying out Fourier transform on the differential signal in the step S5.1 to obtain a frequency domain signal corresponding to the differential signal;
s5.4, taking the sampling rate of the voice signal to be detected in the step S5.1 as Fs, calculating the standard deviation of the frequency signal of the frequency domain signal in the step S5.3 in the frequency interval from Fs/4 to Fs/2, and taking the calculation result as the second part of the feature set of each section of voice segment of the voice to be detected;
s5.5, filtering each section of voice fragment in the voice to be detected in the step S5.1 by adopting a median filter with the window length of 5, and calculating a residual error corresponding to each section of voice fragment;
and S5.6, carrying out differential calculation on the residual error in the step S5.5 to obtain a differential signal, and carrying out standard deviation calculation on the differential signal to obtain a standard deviation value which is used as a third part of the feature set of each section of the voice segment of the voice to be detected.
Preferably, the method for screening samples in step S4 randomly selects half of each of the feature set of the original speech and the feature set of the training speech set, and uses the selected half as the feature set sample of the original speech and the feature set sample of the training speech set respectively;
the classifier in step S4 is a LibSVM classifier.
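Step S4 can be sketched with scikit-learn, whose `SVC` wraps LibSVM; the half-and-half random split mirrors the screening described above, while the label assignment, kernel choice, and seed are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def train_smoothing_detector(orig_feats, smooth_feats, seed=0):
    """Train an SVM on a random half of each feature set (S4).

    orig_feats / smooth_feats: (n, 3) feature arrays for original and
    smoothed (training-set) speech segments, respectively.
    """
    rng = np.random.default_rng(seed)
    half_o = rng.permutation(len(orig_feats))[: len(orig_feats) // 2]
    half_s = rng.permutation(len(smooth_feats))[: len(smooth_feats) // 2]
    X = np.vstack([orig_feats[half_o], smooth_feats[half_s]])
    y = np.concatenate([np.zeros(len(half_o)),   # 0 = original
                        np.ones(len(half_s))])   # 1 = smoothed
    clf = SVC(kernel="rbf")  # default RBF kernel; the patent fixes no kernel
    clf.fit(X, y)
    return clf
```

The unused halves of each feature set can then serve as a held-out test set for measuring the detection rate.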
Preferably, the certain sample length is 50, 100, or 150 samples.
Preferably, the length of the Fourier transform is 128.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
firstly, obtaining a voice characteristic set through an original voice and a training voice set, and obtaining a classifier model through a classifier; when detecting the voice to be detected, extracting a voice characteristic set to be detected, and classifying by using the classifier model so as to judge whether the voice segment is subjected to smoothing processing and positioning. Compared with the existing similar detection method, the method provided by the invention obviously has higher detection rate and can be used as a method with high success rate for judging whether the digital voice is processed smoothly and further detecting and positioning the voice falsification by using commercial audio processing software.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a method of detecting and locating a smoothing process within a speech segment.
FIG. 2 is a flow chart of detecting the voice to be detected.
FIG. 3 is a diagram of a standard triangular window function.
Fig. 4 is a schematic diagram of a variation of the first triangular window function.
Fig. 5 is a diagram showing a variation of the second triangular window function.
FIG. 6 is a statistical histogram of correlation coefficients between adjacent samples in an original speech segment.
FIG. 7 is a statistical histogram of correlation coefficients between adjacent samples in a smoothed speech segment.
Fig. 8 is a differential signal before and after the speech segment smoothing process.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
A method for detecting and locating a smoothing process within a speech segment, as shown in fig. 1, comprising the steps of:
s1, selecting a smoothing filter;
s2, selecting original voice, extracting an original voice set, and processing the original voice set into a training voice set through a filter;
s3, extracting feature sets from the original voice and the training voice set;
s4, respectively screening out samples from the feature set of the original voice and the feature set of the training voice set, and training an SVM classifier model by adopting a classifier;
s5, as shown in the figure 2, selecting a voice to be detected, framing the voice to be detected, and extracting a voice feature set to be detected from each frame signal;
s6, as shown in FIG. 2, classifying the speech feature set to be detected by using the SVM classifier model of step S4, judging whether the signal is subjected to smoothing processing, and if so, locating the position where the smoothing processing is located.
In the present embodiment, the smoothing filter of step S1 includes a linear filter and a nonlinear filter;
the linear filter includes a triangular window function as shown in fig. 3, a first triangular window function variation as shown in fig. 4, a second triangular window function variation as shown in fig. 5, an averaging filter, and a gaussian filter; the first triangular window function variant in fig. 4 differs from the standard triangular window function in that the slope decreases in the left half and increases in the right half; the second triangular window function variant in fig. 5 differs from the standard triangular window function in that the slope increases in the left-hand half and decreases in the right-hand half.
The nonlinear filter is a median filter.
In the present embodiment, step S2 includes the following steps:
s2.1, selecting original voices, and intercepting non-silent voice fragments with certain sample length from each section of voice to serve as an original voice set;
and S2.2, setting the lengths of the filtering windows to be 5, 7, 9, 11, 13, 15 and 31 respectively, and filtering each voice segment in the original voice set in the step S2.1 by using the filter in the step S1 to obtain a filtered voice segment which is used as a training voice set.
In this embodiment, step S3 is used to derive the feature set of each speech segment in the original speech set and the training speech set of step S2, and step S3 includes the following steps:
s3.1, performing differential calculation on each section of voice segment in the original voice set and the training voice set in the step S2 to obtain a differential signal corresponding to each section of voice segment;
s3.2, standard deviation calculation is carried out on the difference signal in the step S3.1, and a calculation result is used as a first part of a feature set of each section of voice segment in the original voice set and the training voice set;
s3.3, carrying out Fourier transform on the differential signal in the step S3.1 to obtain a frequency domain signal corresponding to the differential signal;
s3.4, taking the original voice signal sampling rate of the step S2 as Fs, carrying out standard deviation calculation on the frequency signals of the frequency domain signals of the step S3.3 in a frequency interval from Fs/4 to Fs/2, and taking the calculation result as a second part of the feature set of each voice segment in the original voice set and the training voice set;
s3.5, filtering each section of voice fragment in the original voice set and the training voice set in the step S2 by adopting a median filter with the window length of 5, and calculating a residual error corresponding to each section of voice fragment;
and S3.6, carrying out differential calculation on the residual error in the step S3.5 to obtain a differential signal, and carrying out standard deviation calculation on the differential signal to obtain a standard deviation value which is used as a third part of the feature set of each section of voice segment in the original voice set and the training voice set.
In this embodiment, as shown in fig. 2, the step S5 is configured to extract a feature set of each speech segment of the speech to be detected, and includes the following steps:
s5.1, selecting the voice to be detected, framing the voice to be detected by a certain sample length, and calculating each frame signal to obtain a differential signal corresponding to each section of voice fragment;
s5.2, standard deviation calculation is carried out on the differential signal in the step S5.1, and a calculation result is used as a first part of a feature set of each section of voice fragment of the voice to be detected;
s5.3, carrying out Fourier transform on the differential signal in the step S5.1 to obtain a frequency domain signal corresponding to the differential signal;
s5.4, taking the sampling rate of the voice signal to be detected in the step S5.1 as Fs, calculating the standard deviation of the frequency signal of the frequency domain signal in the step S5.3 in the frequency interval from Fs/4 to Fs/2, and taking the calculation result as the second part of the feature set of each section of voice segment of the voice to be detected;
s5.5, filtering each section of voice fragment in the voice to be detected in the step S5.1 by adopting a median filter with the window length of 5, and calculating a residual error corresponding to each section of voice fragment;
and S5.6, carrying out differential calculation on the residual error in the step S5.5 to obtain a differential signal, and carrying out standard deviation calculation on the differential signal to obtain a standard deviation value which is used as a third part of the feature set of each section of the voice segment of the voice to be detected.
In this embodiment, the method for screening samples in step S4 randomly selects half of each of the feature set of the original speech and the feature set of the training speech set, and uses the selected half as the feature set sample of the original speech and the feature set sample of the training speech set respectively;
the classifier in step S4 is a LibSVM classifier.
In the present embodiment, the certain sample length is 50, 100, or 150 samples.
In this embodiment, the length of the Fourier transform is 128.
The principle of the method provided by the invention is as follows:
as shown in fig. 6 and 7, smoothing digital speech at a certain position can enhance correlation between adjacent samples in the speech signal, so that the difference signal of the speech segment after smoothing and the difference signal of the original speech have a significant difference; as shown in fig. 8, the difference signal amplitude value of the smoothed speech signal is smaller and changes more slowly. Meanwhile, after the smoothing processing, the smoothing processing is carried out on the voice segment again, the residual error of the voice signal after the filtering processing is more gentle than that of the original voice signal, and whether the voice signal is subjected to the smoothing processing or not can be effectively distinguished by means of the differential signal of the residual error. The method provided by the invention forms the standard deviation of the difference signal of the voice segment, the standard deviation of the high-frequency part in the difference signal of the voice segment and the standard deviation of the difference signal of the residual error of the voice segment into a characteristic set, and can effectively detect and position the position of smoothing processing in the voice segment.
The present example also includes the following experiments and experimental results.
This embodiment uses a speech library of 13,240 utterances. From each utterance, non-silent speech segments of 50, 100, and 150 samples are intercepted, forming three original speech sets. Each original speech set is then smoothed with the six filter models described above to obtain the corresponding smoothed speech sets. The proposed feature set is extracted from every speech set, and a LibSVM classifier is used to classify the original and smoothed sets. Three groups of experiments were performed: experiments following the scheme of the invention, comparative experiments against existing filtering-detection algorithms, and experiments detecting the smoothing introduced by audio editing software.
In the experiments performed according to the scheme of the invention, the goal was to verify the effect of different speech-segment lengths on the method. Speech segments of 50, 100, and 150 samples were tested; the results are shown in Table 1, Table 2, and Table 3.
TABLE 1 detection Rate of the method proposed by the present invention (Speech segment length 50 samples)
TABLE 2 detection Rate of the method proposed by the present invention (speech segment length 100 samples)
TABLE 3 detection Rate of the method proposed by the present invention (speech segment length 150 samples)
The accuracy values in Tables 1, 2, and 3 are the average classification accuracy of LibSVM on the original and smoothed speech segments. For each smoothing operation, seven filter-window lengths were used: 5, 7, 9, 11, 13, 15, and 31.
The above results show that, for the six types of filtering operation, the method effectively distinguishes whether a speech segment has been smoothed, even for segments as short as 50 samples and even when the filter-window length is only 5. In practical application, the speech to be detected can be framed, and the proposed feature set extracted and classified for each frame, thereby detecting and locating smoothing within the speech segment.
For the comparative experiments with existing filtering-detection algorithms, this embodiment adopts as comparison methods the AR-coefficient median-filtering detection method proposed in "Robust Median Filtering Forensics Using an Autoregressive Model" (IEEE Transactions on Information Forensics and Security, Volume 8, Issue 9, Sept. 2013, DOI: 10.1109/TIFS.2013.2273394) and the speech post-processing detection method proposed in "Audio Postprocessing Detection Based on Amplitude Cooccurrence Vector Feature" (IEEE Signal Processing Letters, Volume 23, Issue 5, May 2016, DOI: 10.1109/LSP.2016.4925600). The speech-segment length is 50 samples; the experimental results are shown in Table 4 and Table 5.
Table 4 detection rate of method for detecting median filter using AR coefficient (voice segment length is 50 samples)
TABLE 5 detection Rate of the method of Speech post-processing detection (Speech segment length 50 samples)
Comparing the experiments performed according to the scheme of the invention with the comparative experiments on existing filtering-detection algorithms, the former achieve significantly higher accuracy.
In the experiment on smoothing introduced by audio editing software, the widely used editors Cool Edit and Adobe Audition were selected to edit digital speech. To avoid breaking the continuity of the speech signal, such software automatically smooths the signal at the tampered boundary, usually over only a few dozen samples, and the smoothing algorithms involved are not publicly disclosed. To demonstrate the practical value of the invention, each program was used to delete a non-silent part of every utterance in the speech library of this embodiment, letting the software automatically smooth the tampered boundary. The automatically smoothed segments (about 20-30 samples long) were intercepted as the smoothed speech data set, and unsmoothed segments near the tampered boundary (30 samples long) were intercepted as the original speech data set. The two data sets were classified with the method of the present invention and compared with the AR-coefficient median-filtering detection method and the Luo method; the experimental results are shown in the following table:
TABLE 6 detection rate of three methods for voice segments smoothed by audio editing software
These results show that the proposed method effectively detects the unknown smoothing operations that commercial audio editing software applies to digital speech. The method thus has clear practical value and can be applied to detecting and locating smoothing within speech segments in real-world settings.
From the three groups of experimental results, the proposed method for detecting and locating smoothing within a speech segment achieves high detection accuracy and effectively detects six common smoothing operations, covering both linear and nonlinear filters. Even when the smoothed speech signal is only 50 samples long, the method remains effective. It also detects the unknown smoothing operations of commercial audio editing software, giving it practical significance for audio forensics.
It should be understood that the above embodiments are merely examples for clearly illustrating the present invention and are not intended to limit its implementations. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims.

Claims (7)

1. A method for detecting and locating a smoothing process within a speech segment, comprising the steps of:
s1, selecting a smoothing filter;
s2, selecting original voice, extracting an original voice set, and processing the original voice set into a training voice set through the filter;
s3, extracting feature sets from the original voice set and the training voice set, comprising the following steps:
s3.1, performing difference calculation on each voice segment in the original voice set and the training voice set of step S2 to obtain a difference signal corresponding to each voice segment;
s3.2, calculating the standard deviation of the difference signal of step S3.1, and taking the calculation result as the first part of the feature set of each voice segment in the original voice set and the training voice set;
s3.3, performing a Fourier transform on the difference signal of step S3.1 to obtain the corresponding frequency-domain signal;
s3.4, taking the sampling rate of the original voice signal of step S2 as Fs, calculating the standard deviation of the frequency-domain signal of step S3.3 over the frequency interval from Fs/4 to Fs/2, and taking the calculation result as the second part of the feature set of each voice segment in the original voice set and the training voice set;
s3.5, filtering each voice segment in the original voice set and the training voice set of step S2 with a median filter of window length 5, and calculating the residual corresponding to each voice segment;
s3.6, performing difference calculation on the residual of step S3.5 to obtain a difference signal, calculating the standard deviation of this difference signal, and taking the resulting value as the third part of the feature set of each voice segment in the original voice set and the training voice set;
s4, screening samples from the feature set of the original voice and the feature set of the training voice set respectively, and using a classifier to train an SVM classifier model;
s5, selecting a voice to be detected, framing the voice to be detected, and extracting a voice feature set to be detected from each frame signal;
s6, classifying the voice feature set to be detected using the SVM classifier model of step S4, judging whether each frame signal has undergone smoothing processing, and, if so, locating the position where the smoothing processing occurred.
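The three-part feature set of steps S3.1–S3.6 can be sketched as follows. This is an illustrative reading of the claim, not the patented implementation: the use of spectral magnitudes for the Fs/4–Fs/2 statistic and the NumPy/SciPy routines are assumptions, and the difference signal is cropped or zero-padded to the 128-point transform length named in claim 7.

```python
import numpy as np
from scipy.signal import medfilt

def extract_features(segment, fs):
    """Three-part feature vector for one voice segment (steps S3.1-S3.6)."""
    # S3.1/S3.2: first-order difference and its standard deviation
    diff = np.diff(segment)
    f1 = np.std(diff)

    # S3.3/S3.4: Fourier transform of the difference signal (length 128,
    # per claim 7), then the standard deviation of the spectral magnitudes
    # over the Fs/4 to Fs/2 band
    spectrum = np.abs(np.fft.rfft(diff, n=128))
    freqs = np.fft.rfftfreq(128, d=1.0 / fs)
    f2 = np.std(spectrum[(freqs >= fs / 4) & (freqs <= fs / 2)])

    # S3.5/S3.6: residual of a window-5 median filter, then the standard
    # deviation of the residual's first-order difference
    residual = segment - medfilt(segment, kernel_size=5)
    f3 = np.std(np.diff(residual))

    return np.array([f1, f2, f3])
```

The same three statistics are computed for the frames of the voice to be detected in steps S5.1–S5.6, so one routine serves both training and detection.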
2. The method for detecting and locating the smoothing process in a speech segment according to claim 1, wherein the smoothing filter of step S1 includes a linear filter and a nonlinear filter;
the linear filters comprise a triangular window function and two variants thereof, an average filter, and a Gaussian filter;
the nonlinear filter is a median filter.
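The filter bank of claim 2 might be built as below. Several details are assumptions: each kernel is normalized to unit sum so that filtering preserves the signal's DC level, the Gaussian standard deviation is taken as one sixth of the window length, and the two triangular-window variants named in the claim are not reproduced here since the patent text does not specify them.

```python
import numpy as np
from scipy.signal import windows

def smoothing_filters(window_len):
    """Linear smoothing kernels per claim 2, normalized to unit sum."""
    tri = windows.triang(window_len)                             # triangular window
    avg = np.ones(window_len)                                    # average (moving-mean) filter
    gauss = windows.gaussian(window_len, std=window_len / 6.0)   # std is an assumption
    return {name: k / k.sum()
            for name, k in
            {"triangular": tri, "average": avg, "gaussian": gauss}.items()}

def apply_linear(x, kernel):
    # 'same'-mode convolution keeps the output length equal to the input length
    return np.convolve(x, kernel, mode="same")
```

The nonlinear member of the bank, the median filter, is available directly as `scipy.signal.medfilt`.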
3. The method for detecting and locating a smoothing process within a speech segment according to claim 1, wherein said step S2 includes the steps of:
s2.1, selecting original voices, and intercepting from each voice a non-silent voice segment of a certain sample length to serve as the original voice set;
and S2.2, setting the filtering window lengths to 5, 7, 9, 11, 13, 15 and 31 respectively, and filtering each voice segment in the original voice set of step S2.1 with the filter of step S1 to obtain filtered voice segments serving as the training voice set.
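Step S2.2 can be sketched as follows. For brevity only the nonlinear median filter of claim 2 is applied; the patent also generates training segments with the linear filters, which could be substituted via the same loop.

```python
import numpy as np
from scipy.signal import medfilt

# Filtering window lengths named in step S2.2
WINDOW_LENGTHS = (5, 7, 9, 11, 13, 15, 31)

def build_training_set(original_segments):
    """Smoothed training set: each original segment filtered at every
    window length. Only the median filter is shown here for brevity."""
    training = []
    for seg in original_segments:
        for w in WINDOW_LENGTHS:
            training.append(medfilt(seg, kernel_size=w))
    return training
```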
4. The method for detecting and locating the smoothing process in a speech segment according to claim 1, wherein extracting the feature set of each voice segment of the voice to be detected in step S5 comprises the following steps:
s5.1, selecting a voice to be detected, framing it with a certain sample length, and performing difference calculation on each frame signal to obtain a difference signal corresponding to each voice segment;
s5.2, calculating the standard deviation of the difference signal of step S5.1, and taking the calculation result as the first part of the feature set of each voice segment of the voice to be detected;
s5.3, performing a Fourier transform on the difference signal of step S5.1 to obtain the corresponding frequency-domain signal;
s5.4, taking the sampling rate of the voice signal to be detected of step S5.1 as Fs, calculating the standard deviation of the frequency-domain signal of step S5.3 over the frequency interval from Fs/4 to Fs/2, and taking the calculation result as the second part of the feature set of each voice segment of the voice to be detected;
s5.5, filtering each voice segment of the voice to be detected of step S5.1 with a median filter of window length 5, and calculating the residual corresponding to each voice segment;
and s5.6, performing difference calculation on the residual of step S5.5 to obtain a difference signal, calculating the standard deviation of this difference signal, and taking the resulting value as the third part of the feature set of each voice segment of the voice to be detected.
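The framing of step S5.1 could look like this. Non-overlapping frames are an assumption, since the claim does not state a hop size; claim 6 names frame lengths of 50, 100 and 150 samples.

```python
import numpy as np

def frame_speech(signal, frame_len=100):
    """Split the voice to be detected into non-overlapping frames of a
    fixed sample length, discarding any trailing partial frame."""
    n = len(signal) // frame_len
    return np.asarray(signal[: n * frame_len]).reshape(n, frame_len)
```

Each row of the result is then passed through the same feature extraction as the training segments before classification.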
5. The method for detecting and locating the smoothing process within a speech segment according to claim 1, wherein the sample screening of step S4 randomly selects half of the feature set of the original voice and half of the feature set of the training voice set as the feature set samples of the original voice and of the training voice set, respectively;
the classifier of step S4 is a LibSVM classifier.
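The sample screening of step S4 might be sketched as below. The subsequent LibSVM model fitting itself is not shown, and the 0/1 labeling convention (0 for original, 1 for smoothed) is an assumption.

```python
import numpy as np

def screen_samples(orig_features, train_features, seed=0):
    """Randomly keep half of each feature set as training samples and
    attach class labels; (X, y) would then be fed to a LibSVM classifier."""
    rng = np.random.default_rng(seed)

    def half(F):
        F = np.asarray(F)
        idx = rng.permutation(len(F))[: len(F) // 2]
        return F[idx]

    X0, X1 = half(orig_features), half(train_features)
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(len(X0)), np.ones(len(X1))])
    return X, y
```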
6. The method according to claim 3 or 4, wherein the certain sample length is 50 samples, 100 samples or 150 samples.
7. The method for detecting and locating the smoothing process within a speech segment according to claim 1 or 4, wherein the length of the Fourier transform is 128.
CN201810055610.XA 2018-01-19 2018-01-19 Method for detecting and positioning smoothing processing in voice segment Active CN110060703B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810055610.XA CN110060703B (en) 2018-01-19 2018-01-19 Method for detecting and positioning smoothing processing in voice segment


Publications (2)

Publication Number Publication Date
CN110060703A CN110060703A (en) 2019-07-26
CN110060703B true CN110060703B (en) 2021-05-04

Family

ID=67315321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810055610.XA Active CN110060703B (en) 2018-01-19 2018-01-19 Method for detecting and positioning smoothing processing in voice segment

Country Status (1)

Country Link
CN (1) CN110060703B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445924B (en) * 2020-03-18 2023-07-04 中山大学 Method for detecting and positioning smoothing process in voice segment based on autoregressive model coefficient
CN111916059B (en) * 2020-07-01 2022-12-27 深圳大学 Smooth voice detection method and device based on deep learning and intelligent equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440868A (en) * 2013-08-09 2013-12-11 中山大学 Method for identifying video processed through electronic tone modification
CN105118503A (en) * 2015-07-13 2015-12-02 中山大学 Ripped audio detection method
CN106409298A (en) * 2016-09-30 2017-02-15 广东技术师范学院 Identification method of sound rerecording attack
EP3267350A1 (en) * 2016-07-06 2018-01-10 Trust Ltd. Method of and system for analysis of interaction patterns of malware with control centers for detection of cyber attack


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Deterministic Approach to Detect Median Filtering in 1D Data; Cecilia Pasquini et al.; IEEE Transactions on Information Forensics and Security; 2016-07-30; Vol. 11, No. 7; 1425-1436 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant