Summary of the invention
Technical problem: the objective of the invention is to propose a camouflage communication method for secure communication based on speech recognition. The method achieves transparency, robustness and real-time operation of the information hiding, together with self-recovery of the secret information at the receiving end; it has good practical value and offers a new approach to the research and design of secure communication systems.
Technical scheme: the camouflage communication method based on speech recognition of the present invention comprises four major parts: generation of the secret information code stream based on speech recognition, random key generation, embedding of the ciphertext information at the formants of voiced frames, and secret information extraction. The overall procedure is as follows:
1.) Watermark generation and embedding:
A. The user of the system speaks to the system the short command that is to be transmitted secretly.
B. The system first performs speech recognition based on DTW (dynamic time warping) and converts the command into the corresponding text; any recognition error found is corrected. The text is then represented as a binary code stream.
C. According to the formula:
the binary stream is converted from serial binary to parallel quaternary form, where s′(i) is a quaternary code stream value obtained by combining the binary code stream values in pairs; the conversion yields the quaternary code stream S′, and M is the number of bits needed to encode the command in binary. Then, according to the formula:
S = (s′(k) + K1) mod 4, s′(k) ∈ {0,1,2,3}, 0 ≤ k < M/2
encryption is performed to obtain the ciphertext S, where K1 is a predetermined random encryption seed (an even number) and s′(k) is an unencrypted quaternary code stream value.
D. The plaintext speech is divided into frames, each frame is classified as unvoiced or voiced according to its energy, and the voiced frames are identified.
E. A DFT (discrete Fourier transform) is applied to each voiced frame, and the first or second formant is selected according to the control factor K2.
F. According to the formula:
the watermark generated in step C is embedded, where C_k is the coefficient obtained from the DFT (discrete Fourier transform) of the original voiced speech frame, C′_k is the coefficient after the ciphertext is embedded, and β is the embedding depth, i.e. the scale factor governing the change in coefficient magnitude before and after embedding, determined by experiment. The n in 4n + s′(k) takes the smallest value for which the inequality on the right of the formula just holds, s′(k) is the encrypted quaternary ciphertext code stream value, and the rounding operator in the formula denotes rounding to an integer.
G. The inverse DFT (discrete Fourier transform) is applied to the watermarked plaintext to obtain the mixed speech, which is then transmitted.
2.) Watermark extraction:
H. The received mixed speech is first divided into frames in the same way.
I. Each frame is likewise classified as unvoiced or voiced according to its energy, and the voiced frames are identified.
J. A DFT (discrete Fourier transform) is applied to each voiced frame, and the first or second formant is selected according to the same control factor K2.
K. According to the formula:
the secret information is extracted, where C′_k is the coefficient after the ciphertext is embedded, β is the embedding depth, M is the number of bits of the binary-coded command (in quaternary representation the number of command symbols equals M/2), s″(i) is the extracted encrypted quaternary ciphertext value, and round[ ] denotes rounding to the nearest integer.
L. According to the formula:
decryption is performed, where S″ is the extracted quaternary ciphertext code stream after decryption and K1 is the same predetermined random encryption seed as used by the sender (an even number); the other variables are as above.
M. According to the formula:
s(i)=S″,0≤i<M,s(i)∈{0,1}
quaternary-to-binary conversion is performed to obtain the actual binary ciphertext values s(i), which correspond one-to-one to the text; the other variables are as above.
N. The text is displayed on the screen.
Generation of the secret information code stream based on speech recognition: the system is trained in advance on a large amount of data from the speaker. The person conveying the secret information speaks a command into the terminal microphone in a quiet environment; after recognition by the DTW (dynamic time warping) speech recognition system, any minor errors are corrected via the keyboard, and the secret speech information S is then encoded into the secret code stream.
The random key generation part produces the key used to scramble the above speech recognition result and generates the final ciphertext to be hidden. The recognition result sequence S first undergoes quaternary serial-to-parallel conversion to give a new code stream S′; a key K1 is then generated at random and used to encrypt and scramble S′, producing the ciphertext to be hidden, where K1 is an even number.
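The serial-to-parallel conversion and the scrambling step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the pairing order (first bit of each pair taken as the more significant one) is an assumption, since the conversion formula itself is not reproduced in the text. The encryption follows S = (s′(k) + K1) mod 4.

```python
def bits_to_quaternary(bits):
    """Combine the binary stream two bits at a time into symbols in {0, 1, 2, 3}.

    Assumption: the first bit of each pair is the more significant one.
    """
    if len(bits) % 2 != 0:
        raise ValueError("M must be even so that the bits pair up")
    return [2 * bits[i] + bits[i + 1] for i in range(0, len(bits), 2)]


def encrypt(symbols, k1):
    """Apply S = (s'(k) + K1) mod 4 to every quaternary symbol."""
    return [(s + k1) % 4 for s in symbols]


bits = [1, 0, 0, 1, 1, 1, 0, 0]          # M = 8 bits
symbols = bits_to_quaternary(bits)       # [2, 1, 3, 0], i.e. M/2 symbols
cipher = encrypt(symbols, k1=2)          # [0, 3, 1, 2]
```

With M bits in, the stream to be hidden has M/2 quaternary symbols, so each modified coefficient carries two bits.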
In the stage of embedding the ciphertext information at the voiced-frame formants, the plaintext speech is first divided into frames and each frame is classified as unvoiced or voiced. The voiced frames, which account for about 70% of the frames, are searched for their first and second formants. In accordance with the masking effect of the human ear, the first or second formant of the voiced frame is selected adaptively: the second random key K2 determines which formant's frequency point has its coefficient modified. If K2 = 0, the frequency of the first formant is selected; otherwise the frequency of the second formant is selected for embedding. According to the 3 dB bandwidth criterion, the DFT (discrete Fourier transform) coefficient closest to the position of the first or second formant is located and modified to hide the ciphertext information, i.e. the ciphertext S′ to be hidden is embedded in the coefficient C_k. After the replacement, an IDFT (inverse discrete Fourier transform) is applied to the plaintext speech that now carries the hidden ciphertext, giving the mixed speech, which is transmitted over a PSTN (public switched telephone network) channel.
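The framing, the energy-based voiced/unvoiced decision and the K2 selection rule described above might look like the sketch below. The function names, the non-overlapping framing and the fixed energy threshold are assumptions made for illustration; the text only states that the decision is made according to frame energy, and the formant search itself is not shown.

```python
def split_frames(samples, frame_len):
    """Cut the signal into consecutive non-overlapping frames, dropping any tail."""
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]


def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)


def voiced_frame_indices(samples, frame_len, threshold):
    """Indices of frames whose energy exceeds the (assumed) threshold."""
    frames = split_frames(samples, frame_len)
    return [i for i, f in enumerate(frames) if short_time_energy(f) > threshold]


def select_formant_bin(first_bin, second_bin, k2):
    """K2 = 0 selects the first formant's frequency point, otherwise the second."""
    return first_bin if k2 == 0 else second_bin


signal = [0.01] * 4 + [1.0, -1.0, 1.0, -1.0] + [0.02] * 4
voiced = voiced_frame_indices(signal, frame_len=4, threshold=0.5)  # only frame 1
```

Because the receiver repeats exactly the same decision on the mixed speech, the threshold and frame length must be identical at both ends.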
Secret information extraction first divides the received mixed speech into frames using the same frame length as at the embedding end, then performs the voiced/unvoiced decision and applies an N-point DFT (discrete Fourier transform) to each voiced frame. The frequency point corresponding to the first or second formant of each voiced frame is located with the same search method as used for embedding. The coefficient found is then processed to extract the secret information, which is decrypted with the scrambling key; finally, parallel-to-serial conversion recovers the originally transmitted code stream, and the secret information sent by the sender is displayed on the receiving terminal's screen.
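The last two receiver steps, decryption and parallel-to-serial conversion, can be sketched as the inverse of the sender's scrambling. The decryption rule used here, (s″(k) − K1) mod 4, is an assumption derived by inverting S = (s′(k) + K1) mod 4; the function names and the bit order within each symbol are likewise hypothetical.

```python
def decrypt(cipher, k1):
    """Undo the mod-4 scrambling: (s''(k) - K1) mod 4 (assumed inverse)."""
    return [(c - k1) % 4 for c in cipher]


def quaternary_to_bits(symbols):
    """Split each quaternary symbol back into two bits, high bit first."""
    bits = []
    for s in symbols:
        bits.extend([s >> 1, s & 1])
    return bits


extracted = [0, 3, 1, 2]                  # symbols read from the coefficients
plain = decrypt(extracted, k1=2)          # [2, 1, 3, 0]
bits = quaternary_to_bits(plain)          # [1, 0, 0, 1, 1, 1, 0, 0]
```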
Beneficial effects:
1. Speech recognition technology is introduced into the information hiding field as an extremely low bit-rate compression scheme. It greatly reduces the bit rate of the secret speech information and thereby creates the precondition for a real-time information hiding scheme with transparency, robustness and high security.
2. Existing information hiding technology is concentrated mainly on digital watermarking, whereas this scheme achieves large-capacity real-time information hiding and has high practical value in fields such as military secure communication.
3. Existing audio information hiding schemes based on the DFT (discrete Fourier transform) mostly embed at a fixed intermediate frequency; because the position is fixed, security is poor. This scheme instead uses adaptive frequency selection together with encryption before embedding, so that a two-level key protects the secret information. It also makes full use of the characteristics of the human auditory system (HAS) and adopts multi-level modulation to increase the capacity of the hidden information.
Embodiment:
For a system carrying out secure communication, it is vital that the secret information reaches its destination accurately; the form in which the information is transmitted is secondary. This scheme processes the secret speech by speech recognition, which greatly reduces its bit rate, and provides an embedding scheme with the highest possible transparency and robustness, so as to realise real-time secure communication. From the viewpoint of information theory, speech recognition can compress the bit rate because speech contains not only semantic information but also characteristic information about the speaker such as tone, intonation and emotion. In military secure communication these speaker characteristics are all 'redundant' relative to the semantic command, so converting the secret speech into command text by speech recognition and then encoding and transmitting that text greatly reduces the bit rate of the secret information. According to our estimates, after the ciphertext is compressed in this way its bit rate can be kept within 100 bit/s, a rate that current conventional speech compression coding schemes cannot reach. Given the present level of speech recognition technology, it is reasonable to assume that in a military secure communication system the commands transmitted can be drawn from a limited vocabulary, in which case speech recognition can achieve very high accuracy.
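One way to see why a limited vocabulary gives such a low bit rate is to code each recognised word as a fixed-width index into the vocabulary. The sketch below is only an illustration under that assumption: the text does not specify how the command text is mapped to bits, and the vocabulary and function names are hypothetical.

```python
VOCABULARY = ["unit", "to", "Nanjing", "transfer"]  # hypothetical command words


def bits_per_word(vocabulary):
    """Smallest fixed width that can index every vocabulary entry."""
    return max(1, (len(vocabulary) - 1).bit_length())


def encode_command(words, vocabulary):
    """Map each word to its index and emit the index bits, high bit first."""
    width = bits_per_word(vocabulary)
    bits = []
    for word in words:
        index = vocabulary.index(word)
        bits.extend((index >> shift) & 1 for shift in range(width - 1, -1, -1))
    return bits


def decode_command(bits, vocabulary):
    """Rebuild the word sequence from consecutive fixed-width index fields."""
    width = bits_per_word(vocabulary)
    words = []
    for start in range(0, len(bits), width):
        index = 0
        for bit in bits[start:start + width]:
            index = (index << 1) | bit
        words.append(vocabulary[index])
    return words


command = ["unit", "to", "Nanjing", "transfer"]
stream = encode_command(command, VOCABULARY)   # 4 words x 2 bits = 8 bits
```

Under this assumption a four-word command from a four-word vocabulary needs only M = 8 bits, whereas the same command as coded speech would need thousands of bits.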
To guarantee the transparency and robustness of the hidden information after embedding, the information hiding scheme is realised in the frequency domain by adaptive selection of the embedding point and multi-level symbol modulation. Transform-domain schemes usually embed the hidden information at a fixed intermediate-frequency position in the DFT (discrete Fourier transform) domain; because the position is fixed, security is poor. The main differences between this scheme and traditional frequency-domain information hiding schemes are: (1) The embedding position is not fixed; the embedding point is produced by an adaptive search, which is equivalent to a key. (2) For each selected embedding point, modifying one frequency coefficient can convey several symbol states (for example four), realising multi-level modulation and increasing the bit rate of the embedded hidden information. (3) The perceptual characteristics of the human ear are exploited, and the embedded speech, which is transmitted over a PSTN (public switched telephone network) channel, is highly robust against the various possible channel disturbances (companding, low-pass filtering, white noise, etc.).
The semantics (words) obtained from speech recognition are encoded to give the secret information code stream S, which is hidden in the open speech V and transmitted over a PSTN (public switched telephone network) channel. To meet the transparency requirement and spread the secret information as widely as possible, the plaintext V is divided into frames, a voiced/unvoiced decision is made, and the voiced frames of the plaintext are selected for embedding the ciphertext. According to the auditory masking effect of the human ear, the greater the spectral energy at a frequency point, the stronger the masking effect, so more noise can be introduced there without being perceived by the ear. For a voiced frame the spectral energy at the first and second formants is a local maximum, so the frequency point corresponding to the first or second formant of the voiced frame of the open speech (as controlled by the key K2) is selected and its coefficient is modified to embed the secret information. Because modifying the coefficients at these frequencies can introduce relatively large distortion without being perceived by the ear, modifying one coefficient can be made to convey 2^N states (for example N = 2) in order to raise the bit rate of the embedded ciphertext, whereas traditional schemes convey only two states per modified coefficient. The masking effect is thus fully exploited and, while transparency is maintained, the embedding efficiency of the secret information is improved. The receiving end performs the voiced/unvoiced decision according to the same strategy, searches for the formant, extracts the modified coefficient, decides the hidden information and finally decodes it into semantics, which are displayed on the terminal screen. The overall system framework is shown in Figure 1.
A. Generation of the secret information code stream
The present invention uses speech recognition as an extremely low bit-rate compression algorithm to improve the camouflage efficiency of the speech camouflage system. Since military secure communication and similar applications are small-vocabulary speech recognition systems, DTW (dynamic time warping) is used for small-vocabulary recognition. DTW is a comparatively mature speech recognition scheme and achieves a high recognition success rate in such small-vocabulary systems. The system designed in the present invention is trained in advance on a large amount of data from the speaker. The person conveying the secret information speaks a command into the terminal microphone in a quiet environment (such as a secret bunker); after recognition by the DTW speech recognition system, any minor errors are corrected via the keyboard, and the secret speech information S is then encoded into the secret code stream and embedded in the plaintext V. The speaker can be assumed to issue secret commands in the following form: a certain army unit + preposition (for example 'to') + place name (for example 'Nanjing') + military action to be taken (for example 'transfer'). In this way, combined with semantic pauses, the DTW technique achieves a very high recognition rate and has definite practical value.
Tests show that the speech recognition scheme adopted here still achieves a very high recognition rate in the presence of a certain amount of noise. Moreover, the system designed here communicates over wired PSTN (public switched telephone network) channels, so it can withstand very strong electronic jamming in a wartime environment.
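For readers unfamiliar with DTW, the template matching it performs can be sketched as below. This is a textbook dynamic time warping distance, not the recogniser used in the invention: each utterance is reduced here to a sequence of scalar features for brevity (a practical system would compare vectors of spectral features per frame), and the template values and function names are hypothetical.

```python
def dtw_distance(a, b):
    """Classic DTW: minimum cumulative |a[i] - b[j]| over all monotone alignments."""
    inf = float("inf")
    rows, cols = len(a), len(b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]


def recognize(features, templates):
    """Return the vocabulary word whose stored template is nearest under DTW."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))


templates = {"to": [1, 2, 3], "transfer": [3, 2, 1]}   # hypothetical templates
word = recognize([1, 1, 2, 3, 3], templates)            # a slower utterance of "to"
```

The warping absorbs differences in speaking rate, which is why the slower input still matches its template with zero distance.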
B. Secret information embedding process
(1) The secret speech captured by the system in real time is passed through speech recognition and encoded into an M-bit ciphertext code stream S:
S=s(i),0≤i<M,s(i)∈{0,1} (1)
where s(i) is a binary code stream value.
(2) Determine the number of bits conveyed by modifying one coefficient. The system can modulate the ciphertext information with a multi-level alphabet; the coefficient modification is described here using the quaternary case as an example.
S undergoes serial-to-parallel conversion to give a new ciphertext code stream S′:
where s′(i) is the value obtained by converting the binary code stream of equation (1) into a quaternary code stream, S′ is the unencrypted quaternary ciphertext code stream, and M is the number of bits of the binary-coded command after speech recognition.
A key K1 (K1 is an even number) is generated at random and used to encrypt and scramble S′. The specific algorithm is:
where S′ is the encrypted quaternary ciphertext code stream and the other variables are as above.
In this way, even if the algorithm is made public, an eavesdropper obtains at most the encrypted code stream and cannot obtain any useful information.
(3) The open plaintext speech (sampled at 8 kHz) is divided into frames, a voiced/unvoiced decision is made, and the voiced frames that meet the requirements (V_k, of frame length L) are selected for information hiding. An N-point DFT (discrete Fourier transform) is applied to each selected frame V_k, giving
F = DFT(V_k) = {f_k(i), 0≤i≤N} (4)
where f_k(i) denotes the i-th DFT coefficient of the k-th frame used for information hiding and F is the transform result. If the number of DFT points N > L (the number of speech samples in the voiced frame), the frame is zero-padded at the end before the DFT is taken.
(4) Determine the embedding position in each frame and modify the coefficient to hide the information. Choosing a suitable frequency for embedding is very important. In accordance with the masking effect of the human ear, the coefficient of the frequency point corresponding to the first or second formant (as controlled by K2) of the voiced frame is adaptively selected and modified. If K2 = 0, the frequency of the first formant is selected; otherwise the frequency of the second formant is selected for embedding. According to the 3 dB bandwidth criterion, the DFT (discrete Fourier transform) coefficient closest to the position of the first or second formant is located and modified to hide the ciphertext information. The encrypted ciphertext S′ to be hidden is embedded in the plaintext transform coefficient C_k. So that the secret information can be detected blindly at the receiving end, the embedding method is as follows: the coefficient is quantized with embedding depth β, and after quantization the coefficient is
where the rounding operator denotes rounding to an integer and the embedding depth β is determined by experiment. After the information is embedded, the coefficient is
where C_k is the coefficient obtained from the DFT of the original voiced speech frame, C′_k is the coefficient after the ciphertext is embedded, and β is the embedding depth, i.e. the scale factor governing the change in coefficient magnitude before and after embedding, determined by experiment. The n in 4n + s′(k) takes the smallest value for which the inequality on the right of the formula just holds, s′(k) is the encrypted quaternary ciphertext code stream value, and the rounding operator denotes rounding to an integer; the other variables are as above.
Because the first and second formant frequencies of voiced speech are concentrated in the low-to-mid frequency range (200-1000 Hz), embedding the information in this range avoids the information loss that high-frequency components suffer during filtering or quantization. Because the positions of the first and second formants differ from frame to frame, the choice of the hiding frequency is adaptive; this is equivalent to adding one more level of key and further strengthens the security of the secret information. In addition, because the spectral components at the first and second formants of voiced speech are much larger than those at other frequencies, multi-level modulation can be realised and robustness guaranteed while transparency is maintained; provided a suitable embedding depth β is chosen by experiment, the effect of embedding on transparency can be controlled.
(5) The coefficient C′_k obtained after embedding the ciphertext replaces the original plaintext speech coefficient C_k, and an IDFT (inverse discrete Fourier transform) is applied to the modified transform result F to obtain the mixed speech V′, which is transmitted over a PSTN (public switched telephone network) channel.
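Steps (3) to (5) can be illustrated with a quantization-based sketch. Several points are assumptions rather than the patented formulas, which are not reproduced in the text: the coefficient magnitude is moved to β(4n + s′(k)) with n the smallest non-negative integer satisfying 4n + s′(k) ≥ ⌊|C_k|/β⌋, the phase is left unchanged, and the mirrored bin N − k is set to the complex conjugate so that the IDFT output stays real. The naive DFT and all function names are for illustration only.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) DFT; adequate for one short illustrative frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT matching dft() above."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def quantize_with_symbol(coeff, symbol, beta):
    """Move |coeff| onto beta * (4n + symbol), keeping the phase (assumed rule)."""
    q = math.floor(abs(coeff) / beta)
    n = max(0, math.ceil((q - symbol) / 4))   # smallest n with 4n + symbol >= q
    return cmath.rect(beta * (4 * n + symbol), cmath.phase(coeff))


def embed_in_frame(frame, k, symbol, beta):
    """Embed one quaternary symbol at bin k (0 < k < N/2) and return the new frame."""
    spectrum = dft(frame)
    spectrum[k] = quantize_with_symbol(spectrum[k], symbol, beta)
    spectrum[len(frame) - k] = spectrum[k].conjugate()   # keep the signal real
    return [v.real for v in idft(spectrum)]


frame = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]  # peak at bin 3
stego = embed_in_frame(frame, k=3, symbol=2, beta=1.5)
```

A larger β spaces the quantization levels further apart, which makes the symbol harder to corrupt in the channel but changes the coefficient, and hence the audible signal, by a larger amount.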
C. Secret information extraction process
This scheme allows blind detection of the ciphertext information at the receiving end. As shown in Fig. 1, the received mixed speech V″ is first divided into frames using the same frame length as at the embedding end; a voiced/unvoiced decision is then made and an N-point DFT (discrete Fourier transform) is applied to each voiced frame to obtain the corresponding F′. Because the frequency-domain embedding point is not fixed, the frequency point corresponding to the first or second formant of each voiced frame must also be located, using the same search method as for embedding. It can be shown that the local maximum of the DFT coefficient magnitudes of the mixed speech V′ after embedding still lies at the corresponding frequency of V, i.e.
i ∈ 3dBwidth, so the search method used for embedding is equally applicable in the detection process. Once C′_k has been found, it is processed as follows to extract the secret information:
where round[ ] denotes rounding to the nearest integer, C′_k is the DFT coefficient of the received speech after the ciphertext has been embedded, s″(i) is the extracted encrypted quaternary ciphertext code stream value, S″ is the extracted encrypted quaternary ciphertext code stream, and M is the number of bits of the binary-coded command after speech recognition.
S″ is decrypted with the key to give:
where K1 is the same encryption seed as used by the sender (an even number) and S″ is the extracted quaternary ciphertext code stream after decryption. Parallel-to-serial conversion of S″ gives
S=s(i),0≤i<M,s(i)∈{0,1} (8)
where s(i) is an extracted binary ciphertext code stream value, corresponding one-to-one to the text, S is the extracted binary ciphertext code stream, which can be translated or displayed directly as text, and M is the number of bits of the binary-coded command after speech recognition.
Finally, decoding yields the secret information sent by the sender, which is displayed on the receiving terminal's screen.
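The blind detection step can be sketched as follows. The extraction formula is not reproduced in the text, so the rule used here, round[|C′_k|/β] mod 4, is an assumption consistent with the description of round[ ] and of a quaternary symbol carried by a coefficient quantized with depth β; the function names are hypothetical.

```python
import math


def extract_symbol(coeff, beta):
    """Assumed detector: nearest quantization level of |coeff|, reduced mod 4."""
    level = math.floor(abs(coeff) / beta + 0.5)   # round to the nearest integer
    return level % 4


def extract_stream(coeffs, beta):
    """Apply the detector to the selected coefficient of every voiced frame."""
    return [extract_symbol(c, beta) for c in coeffs]


beta = 1.5
clean = extract_symbol(15 + 0j, beta)       # level 10 -> symbol 2
noisy = extract_symbol(15.6 + 0j, beta)     # disturbance below beta/2, still 2
stream = extract_stream([18 + 0j, 19.5j, 15 + 0j, 16.5 + 0j], beta)
```

The detector needs only β and the embedding position, not the original speech, which is what makes the detection blind; under this rule a symbol survives any magnitude disturbance smaller than β/2.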