CN102568469B - G.729A compressed pronunciation flow information hiding detection device and detection method - Google Patents


Info

Publication number
CN102568469B
Authority
CN
China
Prior art keywords
phoneme
classifier
vector
sequence
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2011104351639A
Other languages
Chinese (zh)
Other versions
CN102568469A (en)
Inventor
李松斌
黄永峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN2011104351639A priority Critical patent/CN102568469B/en
Publication of CN102568469A publication Critical patent/CN102568469A/en
Application granted granted Critical
Publication of CN102568469B publication Critical patent/CN102568469B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a G.729A compressed voice stream information hiding detection device, which comprises at least a compressed-voice-stream-to-phoneme-sequence mapping module, a phoneme sequence feature extraction module group, a classifier device and a result integration module. The mapping module receives a compressed voice stream delivered from outside, maps it into a phoneme sequence and outputs that sequence. The phoneme sequence feature extraction module group extracts and outputs the phoneme vector space feature vector and the phoneme state transition first-order Markov feature vector of the phoneme sequence. The classifier device trains a separate classifier for each feature vector on a training set, then uses the trained classifiers to classify samples of unknown class and outputs the classification results. The result integration module integrates the outputs of the several classifiers and outputs the integrated result as the final steganography detection result. The device is applicable to detecting quantization index modulation (QIM) information hiding performed, during G.729A speech encoding, with a grouped vector codebook partitioned by the CNV (complementary neighbor vertex) optimization algorithm.

Description

G.729A compressed voice stream information hiding detection device and detection method
Technical field
The present invention relates to the field of information hiding detection, and in particular to a G.729A compressed voice stream information hiding detection device and detection method.
Background technology
In recent years, with the sustained growth of bandwidth and the strengthening trend of network convergence, VoIP has gradually become a popular streaming media communication service on the Internet and is widely used worldwide. It has thoroughly changed the structure of the voice communication market, and the network traffic it produces keeps growing, which makes VoIP very suitable for covert communication over IP networks. The G.729 standard is a VoIP speech coding standard defined by the ITU, and its simplified version G.729A is widely used in VoIP. This makes the G.729A compressed voice stream a potentially threatening cover for information hiding: using it for covert communication would pose a grave danger to national communication supervision, so studying information hiding detection methods for this carrier is necessary. Information hiding detection (also called steganalysis) is the task of judging whether hidden information exists in observed carrier data.
Current methods of information hiding in voice can be roughly divided into the following classes. The first is least significant bit (LSB) replacement or matching on pulse code modulation (PCM) speech data. The second is transform-domain methods, which first transform the carrier data into a transform domain and then embed the secret information by modifying certain transform-domain parameters; commonly used transforms include the cepstrum transform, the discrete cosine transform and the discrete wavelet transform. The third is methods based on quantization index modulation (Quantization Index Modulation, QIM), which are applicable to digital audio, image and video coding that uses vector quantization. Among these three classes, the QIM-based information hiding methods are computationally simple and fast and can perform hiding during the compression encoding process itself, which makes them particularly suitable for hiding information in G.729A voice streams; their threat to national communication security is also the greatest.
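The QIM mechanism referred to above can be illustrated with a minimal sketch. This is not the patent's G.729A implementation: the codebook, its 0/1 partition and all values below are invented for the example. The idea is only that the codebook is split into two groups, the encoder quantizes using only the group whose label equals the secret bit, and the decoder reads the bit back from the group of the received index.

```python
# Illustrative QIM sketch (toy scalar codebook, hypothetical partition).
# Embedding restricts quantization to the codebook half labeled with the
# secret bit; extraction is just the partition label of the chosen index.

def nearest_index(value, codebook, allowed):
    """Index in `allowed` whose codebook entry is closest to `value`."""
    return min(allowed, key=lambda i: abs(codebook[i] - value))

def qim_embed(value, bit, codebook, partition):
    """Quantize `value` using only the codebook entries labeled `bit`."""
    allowed = [i for i in range(len(codebook)) if partition[i] == bit]
    return nearest_index(value, codebook, allowed)

def qim_extract(index, partition):
    """The hidden bit is the partition label of the received index."""
    return partition[index]

# Toy 8-entry codebook with a fixed alternating partition.
codebook = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
partition = [0, 1, 0, 1, 0, 1, 0, 1]

samples = [2.2, 4.9, 0.3]
bits = [1, 0, 1]
indices = [qim_embed(v, b, codebook, partition) for v, b in zip(samples, bits)]
recovered = [qim_extract(i, partition) for i in indices]
print(indices, recovered)  # [3, 4, 1] [1, 0, 1]
```

Note how embedding slightly shifts the chosen indices away from the unconstrained nearest neighbor; it is exactly this disturbance of the index sequence that the detector described below exploits.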
Summary of the invention
In view of the above problems, the object of the present invention is to provide a G.729A compressed voice stream information hiding detection device and detection method, applied to detecting QIM information hiding performed in G.729A speech encoding with a grouped vector codebook partitioned by the CNV (complementary neighbor vertex) optimization algorithm.
To achieve the above object, a G.729A compressed voice stream information hiding detection device of the present invention comprises at least a compressed-voice-stream-to-phoneme-sequence mapping module, a phoneme sequence feature extraction module group, a classifier device and a result integration module, wherein:
the mapping module receives the compressed voice stream delivered from outside, maps it into a phoneme sequence and outputs that sequence;
the phoneme sequence feature extraction module group extracts and outputs the phoneme vector space feature vector and the phoneme state transition first-order Markov feature vector of the phoneme sequence;
the classifier device trains a separate classifier for each feature vector on a training set, then uses the trained classifiers to classify samples of unknown class and outputs the classification results;
the result integration module integrates the outputs of the several classifiers and outputs the integrated result as the final steganography detection result.
Preferably, the phoneme sequence feature extraction module group comprises a PVSF feature extraction module and an FOMF feature extraction module, wherein:
the PVSF feature extraction module extracts and outputs the phoneme vector space feature vector of the phoneme sequence;
the FOMF feature extraction module extracts and outputs the phoneme state transition first-order Markov feature vector.
Preferably, the classifier device comprises a first classifier, a second classifier and a third classifier, wherein:
the first classifier is trained on the phoneme vector space feature vector; the trained classifier then predicts the class of unknown samples and outputs the result to the integration module;
the second classifier is trained on the fused feature vector of the phoneme vector space feature vector and the phoneme state transition first-order Markov feature vector; the trained classifier then predicts the class of unknown samples and outputs the result to the integration module;
the third classifier is trained on the phoneme state transition first-order Markov feature vector; the trained classifier then predicts the class of unknown samples and outputs the result to the integration module.
To achieve the above object, a G.729A compressed voice stream information hiding detection method of the present invention comprises the following steps:
mapping the compressed voice stream into a phoneme sequence;
extracting the phoneme vector space feature vector and the phoneme state transition first-order Markov feature vector of the phoneme sequence;
training a separate classifier for each feature vector, and integrating the classification results of the several classifiers by majority voting to obtain the final classification result.
Preferably, the method of mapping the compressed voice stream into a phoneme sequence is: assuming that the phonemes contained in speech are finite in number, the speech to be mapped is divided into small speech segments, each corresponding to one phoneme, with the duration of a segment taken to be the G.729A frame length.
Preferably, the phoneme sequence feature extraction method is:
the vocal tract parameters during phoneme pronunciation are used as the quantitative description of the phoneme; the LPC filter in G.729A characterizes the vocal tract parameters and is determined by its quantization indices, so each phoneme is made to correspond to the first field of the LPC filter quantization index, and the statistical features of the sequence formed by this field are used as the statistical features of the phoneme sequence;
the phoneme vector space feature vector is used to quantify the distribution imbalance of the phonemes contained in the G.729A speech;
a phoneme state transition first-order Markov chain is used to model the phoneme sequence, the state transition matrix is computed to measure the correlation of the phoneme distribution, and a dimension selection method is adopted to reduce the dimensionality of the state transition matrix: the main diagonal elements of the matrix are taken as the vector characterizing the phoneme distribution correlation.
Preferably, an ensemble classification method is adopted: the phoneme vector space feature vectors and the dimension-reduced first-order Markov feature vectors of the G.729A compressed voice streams in the training set are extracted, and classifiers are trained separately with the phoneme vector space feature vector, the first-order Markov feature vector, and the fusion of the two as features.
The beneficial effects of the present invention are:
the present invention is applied to detecting QIM information hiding performed in G.729A speech encoding with a grouped vector codebook partitioned by the CNV optimization algorithm. Tests on a large amount of data show that when the number of frames in the G.729A frame sequence exceeds 640 (i.e., the voice stream is longer than 0.64 second), the system achieves a detection accuracy above 93%.
Description of drawings
Fig. 1 is a structural schematic diagram of the device described in the embodiment of the invention;
Fig. 2 is a schematic diagram of the phoneme-based voice composition model of the present invention;
Fig. 3 is an example of the disturbance of the G.729A quantization index sequence caused by using the CNV algorithm to perform optimized codebook partitioning and performing QIM embedding based on the optimally partitioned codebook;
Fig. 4 is a schematic diagram of the detection method described in the embodiment of the invention;
Fig. 5 is the detection flowchart of the present invention.
Embodiment
The present invention will be further described below in conjunction with the accompanying drawings.
As shown in Figure 1, the G.729A compressed voice stream information hiding detection device described in the embodiment of the invention comprises at least a compressed-voice-stream-to-phoneme-sequence mapping module, a phoneme sequence feature extraction module group, a classifier device and a result integration module, wherein:
the mapping module receives the compressed voice stream delivered from outside, maps it into a phoneme sequence and outputs that sequence;
the phoneme sequence feature extraction module group extracts and outputs the phoneme vector space feature vector and the phoneme state transition first-order Markov feature vector of the phoneme sequence;
the classifier device trains a separate classifier for each feature vector on a training set, then uses the trained classifiers to classify samples of unknown class and outputs the classification results;
the result integration module integrates the outputs of the several classifiers and outputs the integrated result as the final steganography detection result.
The phoneme sequence feature extraction module group comprises a PVSF feature extraction module and an FOMF feature extraction module, wherein:
the PVSF feature extraction module extracts and outputs the phoneme vector space feature vector of the phoneme sequence;
the FOMF feature extraction module extracts and outputs the phoneme state transition first-order Markov feature vector.
The classifier device comprises a first classifier, a second classifier and a third classifier, wherein:
the first classifier is trained on the phoneme vector space feature vector; the trained classifier then predicts the class of unknown samples and outputs the result to the integration module;
the second classifier is trained on the fused feature vector of the phoneme vector space feature vector and the phoneme state transition first-order Markov feature vector; the trained classifier then predicts the class of unknown samples and outputs the result to the integration module;
the third classifier is trained on the phoneme state transition first-order Markov feature vector; the trained classifier then predicts the class of unknown samples and outputs the result to the integration module.
The basic unit of human pronunciation is the phoneme. Phonemes can be divided into the large classes of vowels and consonants, and each class can be further divided into several subclasses. Different phonemes generally correspond to different vocal tract shapes. A phoneme, which may also be written as a phonetic symbol, is the elementary unit that constitutes language: these discrete elementary units cluster into words according to phonemic and grammatical rules, and words form a complete language system according to syntax. A language system exhibits statistical regularities. For example, statistics show that the most frequently used letter in English is "e", so one may infer that, mapped onto speech, the phoneme corresponding to "e" also occurs most often. Furthermore, the combinations of letters in English follow certain rules, such as "q" being mostly followed by "u", so one may infer that, mapped onto speech, the combinations of phonemes also follow certain rules. In other words, the occurrences of the individual phonemes in a section of speech are unbalanced, and the occurrences of different phonemes are correlated. Information hiding based on the QIM mechanism changes these distribution characteristics, so the phoneme distribution characteristics of a G.729A voice stream sample can be used to determine whether hidden information exists, i.e., to perform steganalysis. This steganalysis method is set forth below. For example, the pronunciation of the English word "shop" consists of the phoneme "sh" produced by a noise source, the phoneme "o" produced by a periodic sound source, and the phoneme "p" produced by an impact sound source, as shown in Fig. 2. In the ideal case a section of speech can be cut into multiple small fragments, each corresponding to a phoneme; in other words, a complete section of speech can be regarded as a sequence of phonemes. The present invention refers to this as the phoneme-based voice composition model.
As shown in Figure 4, a G.729A compressed voice stream steganalysis method comprises the following steps:
mapping the compressed voice stream into a phoneme sequence;
extracting the phoneme vector space feature vector and the phoneme state transition first-order Markov feature vector of the phoneme sequence;
training a separate classifier for each feature vector, and integrating the classification results of the several classifiers by majority voting to obtain the final classification result.
The method of mapping the compressed voice stream into a phoneme sequence is: assuming that the phonemes contained in speech are finite in number, the speech to be mapped is divided into small speech segments, each corresponding to one phoneme, with the duration of a segment taken to be the G.729A frame length.
The specific implementation principle is:
Definition 1. A phoneme ρ_i is a triple (p_i, s_i, t_i), where p_i is the phonetic symbol, s_i is the pronunciation of p_i, i.e. a small speech fragment of a certain duration, and t_i is the duration of that fragment. ρ_i is the basic constituent unit of speech, and the phoneme set P = {ρ_1, ρ_2, ..., ρ_{M-1}, ρ_M} of a language contains finitely many phonemes. A section of speech S of duration T can be cut into a chronologically ordered set of N speech segments S = {f_1, f_2, ..., f_{N-1}, f_N}; when a segment f_k = s_l (k ∈ [1, N], l ∈ [1, M]), we say f_k can be mapped to phoneme ρ_l, written f_k → ρ_l, and the set of all such mapping relations is F. The phoneme-based voice composition model is then described by the triple (P, S, F).
Based on the above model, a section of speech can be cut into a sequence of speech segments f_1, f_2, ..., f_{N-1}, f_N, and this segment sequence can be mapped to a phoneme sequence ρ_1, ρ_2, ..., ρ_{N-1}, ρ_N. The durations of the phonemes in speech are not equal: for example, the voiced sound "o" may last more than 50 milliseconds, while the voiced plosive "b" may last only 10 milliseconds, and the durations vary greatly with speaker and speaking rate. The duration t_i of a phoneme ρ_i is therefore very hard to determine in advance, which makes phoneme-based cutting of a section of speech very difficult. Since the purpose of the model established by the present invention is to analyze whether QIM steganography exists in a G.729A coded frame sequence, and G.729A divides speech into frames of 10 milliseconds and computes one set of LPC coefficients per frame (i.e., estimates one set of vocal tract parameters), G.729A in effect assumes that the vocal tract shape is stable within a 10-millisecond short interval. Assuming that different vocal tract shapes correspond to different phoneme pronunciations, each G.729A frame can be considered to correspond to one phoneme or part of one phoneme. According to statistics on actual speech, the average phoneme duration in English is much larger than 10 milliseconds, which confirms the correctness of this conclusion. Taking 10 milliseconds as the boundary, the present invention calls a phoneme whose t_i does not exceed 10 milliseconds a class-A phoneme, and otherwise a class-B phoneme. For a class-A phoneme, its duration is set to the G.729A frame length L. For a class-B phoneme, let its duration be t_i = nL (n ≥ 1); such a phoneme spans several G.729A frames, though exactly how many is hard to determine. The present invention observes that the signal waveform during class-B phoneme pronunciation is generally periodic; for example, the phoneme "o" in Fig. 2 contains four obvious periods. The signal of one period already reflects the vocal tract characteristics, so for a class-B phoneme G.729A can be considered to have estimated its vocal tract parameters repeatedly. In view of this, the present invention considers that a class-B phoneme with t_i = nL (n ≥ 1) can be divided into n frames on which LPC parameter estimation is performed separately. From the above analysis, each G.729A frame can be made to correspond to one phoneme (for a class-B phoneme, several consecutive frames may correspond to the same phoneme); from this angle each frame can be mapped to a phoneme, so a section of G.729A compressed speech can be regarded as a phoneme sequence, and the features of the phoneme sequence can be represented by the features of the speech frame sequence.
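The frame-to-phoneme mapping just described can be sketched as follows, under the assumption that a parsed G.729A frame exposes its first-stage LPC quantization index (called `l0` here; the field name and the dictionary-based frame representation are hypothetical — real bitstream parsing is codec-specific and is not shown). Each 10 ms frame then contributes one phoneme symbol in [0, 127].

```python
# Sketch: map a sequence of parsed G.729A frames to a phoneme sequence.
# Assumption: each frame is a dict carrying its first-stage LPC codebook
# index "l0" (0..127); this index is the phoneme symbol used by the detector.

def frames_to_phoneme_sequence(frames):
    """One phoneme symbol (first-stage quantization index) per frame."""
    return [frame["l0"] for frame in frames]

# Hypothetical parsed frames; only the first-stage index matters here.
frames = [{"l0": 17}, {"l0": 17}, {"l0": 93}, {"l0": 4}]
print(frames_to_phoneme_sequence(frames))  # [17, 17, 93, 4]
```

Repeated symbols (here 17, 17) correspond to the class-B case above, where several consecutive frames map to the same phoneme.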
The phoneme sequence feature extraction method is:
the vocal tract parameters during phoneme pronunciation are used as the quantitative description of the phoneme; the LPC filter in G.729A characterizes the vocal tract parameters and is determined by its quantization indices, so each phoneme is made to correspond to the first field of the LPC filter quantization index, and the statistical features of the sequence formed by this field are used as the statistical features of the phoneme sequence;
the phoneme vector space feature vector is used to quantify the distribution imbalance of the phonemes contained in the G.729A speech;
a phoneme state transition first-order Markov chain is used to model the phoneme sequence, the state transition matrix is computed to measure the correlation of the phoneme distribution, and a dimension selection method is adopted to reduce the dimensionality of the state transition matrix: the main diagonal elements of the matrix are taken as the vector characterizing the phoneme distribution correlation.
The specific implementation principle is:
Since G.729A computes one group of LPC coefficients for each frame, and the voice composition model above makes each G.729A frame correspond to one phoneme, each group of LPC coefficients is assumed to correspond to one phoneme ρ (under ideal conditions the same phoneme always yields the same LPC coefficients, and different LPC coefficients are regarded as different phonemes). Let the result of vector-quantizing the LPC coefficients of a frame be C = (c_{1,i}, c_{2,j}, c_{3,k}), where c_{1,i} ∈ L1, c_{2,j} ∈ L2, c_{3,k} ∈ L3; then the phoneme ρ and the index C form a one-to-one correspondence, which the present invention writes as
ρ ↔ C = (c_{1,i}, c_{2,j}, c_{3,k})   (1)
If a section of speech contains N frames, G.729A encoding yields the quantization index sequence C_1 C_2 C_3 ... C_{N-1} C_N, and the corresponding phoneme sequence is ρ_1 ρ_2 ρ_3 ... ρ_{N-1} ρ_N. Because the quantization result C = (c_{1,i}, c_{2,j}, c_{3,k}) has 128 × 32 × 32 = 131072 possible values, a small amount of data can hardly reflect its statistical properties; effective statistical features could be obtained only when the G.729A frame sequence is very long, which is obviously unfavorable for steganalysis. Since G.729A adopts split vector quantization, the index sequence C_1 C_2 C_3 ... C_{N-1} C_N actually consists of three subsequences, namely c_{1,1} c_{1,2} c_{1,3} ... c_{1,N-1} c_{1,N}, c_{2,1} c_{2,2} c_{2,3} ... c_{2,N-1} c_{2,N} and c_{3,1} c_{3,2} c_{3,3} ... c_{3,N-1} c_{3,N}. From correspondence (1) it follows that the phoneme sequence also corresponds one-to-one to each of the three sub-index sequences, so QIM steganography will disturb these subsequences (meaning the phoneme sequence is disturbed and its statistical properties on some dimensions change). Of the three fields contained in the quantization result C, the importance of c_{1,i} exceeds that of c_{2,j} and c_{3,k}: c_{1,i} is the first-stage vector needed to compute all 10 LPC coefficients, while c_{2,j} and c_{3,k} are second-stage vectors used only to compute the first five and the last five LPC coefficients respectively. Therefore, to achieve dimensionality reduction, the present invention takes c_{1,i} as an approximate feature characterizing the phoneme ρ_i, i.e.
ρ_i ↔ c_{1,i}
After this reduction ρ_i has only 128 possible values. Accordingly, the statistical features of the quantization subsequence c_{1,1} c_{1,2} c_{1,3} ... c_{1,N-1} c_{1,N} can be adopted as the statistical features of the phoneme sequence ρ_1 ρ_2 ρ_3 ... ρ_{N-1} ρ_N.
Performing codebook partitioning with the CNV algorithm and then performing QIM embedding causes a large disturbance of the quantization index sequence; Fig. 3 gives an example of such disturbance. It shows, for four different categories of speakers, the change of the index sequence c_{1,1} c_{1,2} c_{1,3} ... c_{1,99} c_{1,100} of a section of speech (1 second long, comprising 100 G.729A speech frames) after secret information is embedded. In the four subgraphs of this figure the abscissa is the time-ordered frame number and the ordinate is the quantization index c_{1,i} (1 ≤ i ≤ 100); for an English male, an English female, a Chinese male and a Chinese female speaker, the change of the index sequence before and after embedding is evidently significant. According to the above analysis the phoneme sequence possesses certain statistical properties, and from the correspondence ρ_i ↔ c_{1,i} the quantization index sequence possesses these statistical properties as well. The steganographic write operation disturbs the quantization index sequence and changes some of its statistical features. Obviously, if this kind of change can be effectively analyzed quantitatively, steganalysis can be performed accordingly. To analyze this change quantitatively, a feature statistical model of the phoneme sequence must be established.
As indicated above, the distribution of the phonemes in a phoneme sequence exhibits imbalance and correlation. To quantitatively analyze the imbalance of the phoneme distribution, based on the phoneme-based voice composition model proposed by the present invention and with reference to the modeling method of the document vector space model, a phoneme vector space model (Phoneme Vector Space Model, PVSM) is established as the statistical model of the G.729A phoneme sequence. The model is described as follows:
Definition 2. A phoneme ρ_i is the basic constituent unit of speech; the phoneme set of a language L is P = {ρ_1, ρ_2, ..., ρ_{M-1}, ρ_M}. ρ_i is called a phoneme word (Phoneme Word), P is the phoneme dictionary, and the M-dimensional vector space corresponding to P is called the phoneme vector space.
Definition 3. A section of speech S of language L with duration T can be cut into N chronologically ordered speech segments, each segment f_i corresponding to a word ρ_i in the phoneme dictionary P; the segmented speech sequence is S* = f_1, f_2, ..., f_{N-1}, f_N. The above process is called phoneme-based speech segmentation.
Definition 4. Each speech segment sequence S* = f_1, f_2, ..., f_{N-1}, f_N can be represented by a point (ρ_1, w_1, ρ_2, w_2, ..., ρ_M, w_M) in the phoneme vector space, where ρ_i (ρ_i ∈ P) is the phoneme word corresponding to segment f_i and w_i is the weight of ρ_i. The speech fragment quantization representation formed by the above definitions is the phoneme vector space model (PVSM), and (ρ_1, w_1, ρ_2, w_2, ..., ρ_M, w_M) is the phoneme vector space feature (Phoneme Vector Space Feature, PVSF) vector of S.
For a G.729A compressed voice stream, according to the above analysis no segmentation is needed (each G.729A frame directly corresponds to one phoneme), and from the correspondence ρ_i ↔ c_{1,i} its phoneme dictionary is P = {c_{1,0}, c_{1,1}, ..., c_{1,126}, c_{1,127}}. Under these assumptions, a section of G.729A speech can be represented by the phoneme vector space feature vector (c_{1,0}, w_{1,0}, c_{1,1}, w_{1,1}, ..., c_{1,127}, w_{1,127}), whose dimension is 128. The weight w_{1,i} of c_{1,i} is taken as its normalized frequency of occurrence in the quantization sequence, i.e. w_{1,i} = d_{1,i} / N, where N is the number of G.729A frames contained in this section of speech and d_{1,i} is the number of times c_{1,i} occurs in the quantization sequence.
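The PVSF computation just defined reduces to counting: a minimal sketch, assuming the phoneme sequence is already a list of first-stage indices in [0, 127], computes the 128-dimensional vector of normalized frequencies w_{1,i} = d_{1,i} / N.

```python
# Sketch of the PVSF vector: normalized occurrence frequency of each of the
# 128 possible phoneme symbols (first-stage quantization indices).

from collections import Counter

def pvsf_vector(phoneme_sequence, dictionary_size=128):
    """Return the 128-dim phoneme-vector-space feature w_i = d_i / N."""
    n = len(phoneme_sequence)
    counts = Counter(phoneme_sequence)
    return [counts.get(i, 0) / n for i in range(dictionary_size)]

seq = [17, 17, 93, 4]
v = pvsf_vector(seq)
print(v[17], v[93], len(v))  # 0.5 0.25 128
```

Since the weights are frequencies, the vector sums to 1; a stego stream shifts mass between components of this vector, which is the imbalance signal the first classifier is trained on.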
However, the phoneme vector space feature (PVSF) only reflects the imbalance of the distribution of different phonemes in speech and cannot reflect the correlation of the phoneme distribution. To quantitatively analyze the correlation, the present invention models the phoneme sequence with a Markov chain, regarding each phoneme as a state on the chain (called a phoneme state) and using the state transition probabilities to quantitatively analyze the combination dependence between phonemes. For the phoneme sequence ρ_1, ρ_2, ..., ρ_{N-1}, ρ_N corresponding to a section of speech, if the occurrence of each phoneme is assumed to depend only on the previous phoneme, the sequence can be regarded as a phoneme state transition first-order Markov process; by analogy it could also be regarded as a second- or higher-order Markov process. According to linguistic statistical regularities, the occurrence of a letter is in general strongly related only to its preceding letter, so by analogy the occurrence of a phoneme is considered to be strongly related only to its preceding phoneme; in view of this the present invention adopts a phoneme state transition first-order Markov process to model the phoneme sequence. Let ρ_i be the value of the phoneme random variable at time i and ρ_{i+1} its value at time i+1; then, by the correspondence ρ_i ↔ c_{1,i}, the transition probability between any two states of the first-order Markov chain can be expressed by the conditional probability in formula (2):
Pr_{α,β} = Pr(ρ_{i+1} = α | ρ_i = β),  α, β ∈ [0, 127]   (2)
Computing the conditional probability directly is rather difficult in practice; it is generally converted into joint probabilities, i.e. by the conditional probability formula, formula (2) is converted into formula (3) for computing the correlation between phonemes:
Pr_{α,β} = Pr(ρ_{i+1} = α, ρ_i = β) / Pr(ρ_i = β),  α, β ∈ [0, 127]   (3)
For any phoneme sequence, a 128 × 128 state transition matrix can thus be obtained; this matrix is called the phoneme state transition first-order Markov feature of the phoneme sequence. Obviously this feature quantifies the correlation of adjacent phoneme occurrences, but its dimension is too high for practical application, so dimensionality reduction is required. Common dimension reduction methods include singular value decomposition, principal component analysis and dimension selection. The present invention adopts dimension selection: only the elements on the main diagonal of the transition probability matrix are selected as feature quantities, which reduces the phoneme state transition first-order Markov feature to a 128-dimensional vector. This vector is called the first-order Markov feature (First Order Markov Feature, FOMF) vector of the G.729A frame sequence and is used to quantify the correlative character of the phoneme distribution.
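The FOMF extraction above can be sketched as follows: estimate the transition probabilities of formula (3) by counting transition pairs over predecessor counts, then keep only the main diagonal Pr(ρ_{i+1} = α | ρ_i = α). A tiny toy sequence stands in for a real 128-symbol phoneme sequence.

```python
# Sketch of the FOMF vector: diagonal of the estimated 128x128 first-order
# transition probability matrix of the phoneme sequence (formula (3):
# transition counts divided by predecessor-state counts).

def fomf_vector(phoneme_sequence, dictionary_size=128):
    trans = [[0] * dictionary_size for _ in range(dictionary_size)]
    prev_count = [0] * dictionary_size
    for prev, curr in zip(phoneme_sequence, phoneme_sequence[1:]):
        trans[prev][curr] += 1
        prev_count[prev] += 1
    # Keep only the main diagonal as the 128-dim FOMF vector.
    return [trans[a][a] / prev_count[a] if prev_count[a] else 0.0
            for a in range(dictionary_size)]

seq = [5, 5, 5, 9, 5, 5]
f = fomf_vector(seq)
print(f[5])  # 3 of the 4 departures from state 5 return to 5 -> 0.75
```

The diagonal is a natural dimension selection here: consecutive frames often share a phoneme (the class-B case), so self-transition probabilities carry much of the correlation that QIM embedding disturbs.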
The classifier training method is: extract the phoneme vector space feature vectors and the dimension-reduced first-order Markov feature vectors of the G.729A compressed voice streams in the training set, and train classifiers separately using the phoneme vector space feature vector, the first-order Markov feature vector, and the fusion feature vector of the two as features.
The specific implementation principle is:
Suppose an unknown G.729A compressed voice frame sequence S is given. The goal of the steganalysis of the present invention is to judge whether QIM steganography exists in S; the result has only two classes, "yes" (called the stego class herein) and "no" (called the cover class herein). The steganalysis process is therefore essentially a classification process: the sample S of unknown class is assigned to either the cover class or the stego class. For classification problems, machine-learning-based methods are the current mainstream, and the present invention adopts this approach to judge the class of unknown samples. The class judgment process for an unknown sample can be divided into two steps: first, the features of the G.729A frame sequence of unknown class are extracted to obtain its vectorized feature representation; then a semantic classifier designed for the obtained feature vectors realizes the mapping from the low-level features of the frame sequence to its high-level semantic class. The semantic classifier is generally obtained by supervised learning, i.e., by training with samples whose classes have been labeled. The training and prediction steps of the classifier are as follows:
Step 1: obtain as many G.729A frame sequences of the cover class as possible, apply the QIM embedding method (with the grouping codebook partitioned optimally by the CNV algorithm) to each cover sample to obtain the corresponding stego sample, and label both;
Step 2: extract the PVSF/FOMF features of the two classes of samples obtained in the previous step to form feature vectors, and label the class of each vector;
Step 3: train the classifier: use the set of class-labeled feature vectors obtained in the previous step to train the classifier, yielding the steganography detection classifier;
Step 4: use the classifier to judge whether an unknown sample is steganographic: for a G.729A frame sequence of unknown class, first extract its PVSF/FOMF features to form a feature vector, then feed this vector to the classifier trained in the previous step; the classifier outputs the classification result.
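Steps 3 and 4 above can be sketched schematically as follows. A simple nearest-centroid rule stands in for the support vector machine of the embodiment purely for illustration; the function names are assumptions, and the feature extractors are assumed to have already produced fixed-length vectors (e.g., the 128-dim PVSF or FOMF vectors).

```python
def train_centroids(samples, labels):
    """Step 3 (schematic): 'train' by averaging the feature vectors of each
    class ('cover' / 'stego'), producing one centroid per class."""
    sums, counts = {}, {}
    for vec, lab in zip(samples, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for k, v in enumerate(vec):
            acc[k] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classify(vec, centroids):
    """Step 4 (schematic): assign the unknown sample's feature vector to the
    class whose centroid is nearest in squared Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: dist(vec, centroids[lab]))
```

In the actual embodiment the decision boundary is learned by an SVM rather than by centroid averaging; the pipeline shape (labeled vectors in, class label out) is the same.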
According to the described steganalysis method, the flow of the QIM steganalysis system for G.729A compressed voice streams of the present invention is as shown in Figure 5. The input of the system is a G.729A voice stream (i.e., G.729A frame sequence) fragment of class to be detected, and the output is whether information hiding exists in this stream. For the input G.729A frame sequence, the PVSF feature extractor is used to obtain a 128-dimensional PVSF vector, and the FOMF feature extractor is used to obtain a 128-dimensional FOMF vector. Classifier 1 is obtained by training with the PVSF vector, classifier 3 by training with the FOMF vector, and classifier 2 by training with the joint PVSF-FOMF vector (256 dimensions). A majority voting mechanism over the classification results of the three classifiers determines the final classification result. In this embodiment the classifiers are support vector machines, but embodiments using other classifiers also fall within the protection scope of this patent. Tests on a large amount of data show that when the number of frames of a G.729A frame sequence exceeds 640 (i.e., the voice stream is longer than 0.64 seconds), the system achieves a detection accuracy above 93%.
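The majority-voting step over the three classifiers can be expressed in a few lines. A minimal sketch: each classifier votes "cover" or "stego", and the label receiving the most votes is the final steganalysis result (with three voters a strict majority always exists).

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that received the most votes, e.g. the outputs of
    classifiers 1-3 on the same unknown G.729A frame sequence."""
    return Counter(votes).most_common(1)[0][0]
```

For example, if classifier 1 outputs "cover" while classifiers 2 and 3 output "stego", the final result is "stego".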
The above are only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention shall be encompassed within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope defined by the claims.

Claims (7)

1. A G.729A compressed voice stream information hiding detection device, characterized in that it comprises at least a compressed voice stream to phoneme sequence mapping module, a phoneme sequence feature extraction module group, a classifier device, and a result integration module, wherein:
the compressed voice stream to phoneme sequence mapping module receives an externally delivered compressed voice stream, maps it to a phoneme sequence, and outputs the phoneme sequence;
the phoneme sequence feature extraction module group respectively extracts and outputs the phoneme vector space feature vector and the phoneme state-transition first-order Markov feature vector of the phoneme sequence;
the classifier device trains classifiers for the different feature vectors based on a training set, then uses the trained classifiers to classify samples of unknown class and outputs the classification results;
the result integration module integrates the output results of the multiple classifiers and outputs the integrated result as the final steganalysis result.
2. The G.729A compressed voice stream information hiding detection device according to claim 1, characterized in that the phoneme sequence feature extraction module group comprises a PVSF feature extraction module and a FOMF feature extraction module, wherein:
the PVSF feature extraction module extracts and outputs the phoneme vector space feature vector of the phoneme sequence;
the FOMF feature extraction module extracts and outputs the phoneme state-transition first-order Markov feature vector.
3. The G.729A compressed voice stream information hiding detection device according to claim 1, characterized in that the classifier device comprises a first classifier, a second classifier, and a third classifier, wherein:
the first classifier is trained based on the phoneme vector space feature vector; after training, this classifier predicts samples of unknown class and outputs the result to the integration module;
the second classifier is trained based on the fusion feature vector of the phoneme vector space feature vector and the phoneme state-transition first-order Markov feature vector; after training, this classifier predicts samples of unknown class and outputs the result to the integration module;
the third classifier is trained based on the phoneme state-transition first-order Markov feature vector; after training, this classifier predicts samples of unknown class and outputs the result to the integration module.
4. A G.729A compressed voice stream information hiding detection method, characterized in that it comprises the following steps:
mapping the compressed voice stream to a phoneme sequence;
respectively extracting the phoneme vector space feature vector and the phoneme state-transition first-order Markov feature vector of the phoneme sequence;
training classifiers separately for the various feature vectors, and integrating the classification results of the multiple classifiers based on a majority voting mechanism as the final classification result.
5. The G.729A compressed voice stream information hiding detection method according to claim 4, characterized in that the method of mapping the compressed voice stream to a phoneme sequence is: assuming that the phonemes contained in speech are finite, the speech to be mapped is divided into speech segments corresponding to each phoneme, and the duration of each segment is taken as the G.729A frame length.
6. The G.729A compressed voice stream information hiding detection method according to claim 5, characterized in that the phoneme sequence feature extraction method is:
using the vocal tract parameters during phoneme pronunciation as the quantitative description of the phoneme, characterizing the vocal tract parameters with the LPC filter used in G.729A, where the LPC filter is determined by quantization indices; mapping each phoneme to the first field of the LPC filter quantization indices, and using the statistical features of the sequence formed by this field as the statistical features of the phoneme sequence;
using the phoneme vector space feature vector to quantify the distribution imbalance of the phonemes contained in the G.729A speech;
modeling the phoneme sequence with a phoneme state-transition first-order Markov chain, computing the state-transition matrix to measure the correlation of the phoneme distribution, and reducing the dimensionality of the state-transition matrix by the dimension selection method, i.e., taking the diagonal elements of the matrix as the vector characterizing the phoneme distribution correlation.
7. The G.729A compressed voice stream information hiding detection method according to claim 5, characterized in that an ensemble classification method is adopted: extracting the phoneme vector space feature vectors and the dimension-reduced first-order Markov feature vectors of the G.729A compressed voice streams in the training set, and training classifiers separately using the phoneme vector space feature vector, the first-order Markov feature vector, and the fusion feature vector of the two as features.
CN2011104351639A 2011-12-22 2011-12-22 G.729A compressed pronunciation flow information hiding detection device and detection method Active CN102568469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104351639A CN102568469B (en) 2011-12-22 2011-12-22 G.729A compressed pronunciation flow information hiding detection device and detection method

Publications (2)

Publication Number Publication Date
CN102568469A CN102568469A (en) 2012-07-11
CN102568469B true CN102568469B (en) 2013-10-16

Family

ID=46413726

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant