CN112687282A - Voice source tracking method based on fingerprint image perceptual hashing - Google Patents

Voice source tracking method based on fingerprint image perceptual hashing

Info

Publication number
CN112687282A
CN112687282A (application CN202011401234.9A)
Authority
CN
China
Prior art keywords
fingerprint image; voice; hash; fingerprint; perceptual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011401234.9A
Other languages
Chinese (zh)
Inventor
刘林 (Liu Lin)
贾鹏 (Jia Peng)
刘亮 (Liu Liang)
张磊 (Zhang Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202011401234.9A priority Critical patent/CN112687282A/en
Publication of CN112687282A publication Critical patent/CN112687282A/en
Pending legal-status Critical Current

Landscapes

  • Collating Specific Patterns (AREA)

Abstract

The invention relates to a voice source tracking method based on fingerprint image perceptual hashing, and belongs to the technical field of information processing and voice tracking. The method comprises the following steps: step 1, performing perceptual hashing on a fingerprint image based on image characteristics to generate a hashed fingerprint image; step 2, embedding the hash fingerprint image generated in step 1 into the voice signal in the manner of a digital watermark, thereby binding the unique biological characteristics of the speaker with the voice data and obtaining audio embedded with the perceptual hash of the fingerprint image of the speaker who is the voice source; and step 3, performing identity authentication on the voice using the fingerprint image perceptual hash. The method performs identity authentication based on the perceptual hash generated from the fingerprint image, effectively avoiding the technical defect that voice recognition technology is easily disturbed by external environmental factors; it has certain robustness and collision resistance, meets the uniqueness requirement of the perceptual hash, and also has strong security.

Description

Voice source tracking method based on fingerprint image perceptual hashing
Technical Field
The invention relates to a voice source tracking method based on fingerprint image perceptual hashing, and belongs to the technical field of information processing and voice tracking.
Background
With the acceleration of the digitization of social life, in today's open communication environment it is very easy to tamper with stored voice content and to eavesdrop on transmitted voice, and the source of a voice recording is difficult to authenticate. For example, once economic or legal disputes arise in financial operations or voice-order business, the responsibility of the speaker cannot easily be traced; voice evidence to be admitted in a judicial judgment can obstruct justice if part of its content is maliciously tampered with during storage or transmission but not discovered in time. Even if the content is not tampered with, the possibility that a perjurer, hoping to escape legal sanction, denies having given the testimony at all cannot be ignored. Therefore, research on tracking the source of voice and confirming the identity of the speaker is significant.
The Perceptual Hashing function takes multimedia data as input and outputs a perceptual digest, unidirectionally mapping multimedia digital information with the same perceptual content into a short digital digest. The most advantageous characteristic of perceptual hashing is its perceptual robustness: it tolerates the small-amplitude distortion and deformation frequently encountered when the input object is acquired. When constrained by a perceptual threshold, the perceptual hash function is transitive; at the same time, because the hash algorithm is collision-resistant, multimedia information with completely different perceptual content cannot map to the same perceptual hash value. The perceptual hash effectively reduces the dimensionality of the target object's feature vector, occupies extremely little data capacity, and is therefore suitable for generating a feature-value mark.
The main research directions of voice recognition technology fall into two categories: the first identifies and understands the content of the speaker's voice; the second identifies the speaker's voice by its unique characteristics, thereby identifying the speaker. The first category has produced many mature products, with voice-based human-computer interaction widely used in everyday life. The second category, which aims to bring speaker-identification results into real life, is still limited by many uncertain factors: first, voice signals are non-stationary, microphones differ during acquisition, and their filtering and capture of the original sound source differ; second, a speaker's voice is easily imitated or replayed by high-resolution recording equipment; third, the speaker's identity characteristics are difficult to separate thoroughly from the voice-content characteristics, and feature parameters that both uniquely identify different speakers and resist other external interference are lacking.
When voice signals are used for identity recognition, they contain not only the identity-characteristic information of the individual speaker but also the speaker's voice-content information; that is, the identity-characteristic signal and the voice-content signal are aliased, and the difference between the two signals is extremely small. Even a high-fidelity, high-resolution voice recognition system operating under relatively ideal external conditions is affected by multiple complex external factors in practical use, reducing the accuracy of voice-signal recognition. Therefore, the technical route of using the voice signal as the sole analysis target for speaker identification is difficult to apply at scale in real-life environments, and other types of identification information must be added to assist the voice signal in identifying the speaker.
This application studies the application of speaker biometric fingerprint image perceptual hashing to voice source tracking. Based on three fingerprint image perceptual hash generation schemes — center of gravity, pixel expectation, and center-of-gravity angle — the generated fingerprint image perceptual hash value is embedded into the voice information as a watermark. When the voice source needs to be tracked, the perceptual hash value extracted from the voice information is compared with the perceptual hash values of the fingerprint image library, thereby realizing tracking of the voice source.
Disclosure of Invention
The invention aims to solve the technical problem that, in existing identification and source-authentication of voice signals, recognition accuracy is low due to limitations of acquisition conditions and information-processing means, and provides a voice source tracking method based on fingerprint image perceptual hashing.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
The voice source tracking method based on the fingerprint image perceptual hash comprises the following steps:
step 1, performing perceptual hashing on a fingerprint image based on image characteristics to generate a hashed fingerprint image;
wherein the image characteristics yield three variants of the fingerprint image hash: based on the center of gravity, the pixel expectation, and the center-of-gravity angle;
step 1, specifically comprising the following substeps:
step 1.1 calculate MD5 value with user input as key and calculated MD5 value as pseudo random number generator seed;
step 1.2, carrying out image acquisition on the user fingerprint, and randomly dividing a plurality of rectangular areas from the fingerprint image of the speaker through key control;
wherein, in the random division, the seed of the random number generator is generated in the step 1.1;
step 1.3, selecting parameters with good geometric invariance characteristics in each rectangular area generated in the step 1.2 as analysis objects, quantizing the selected parameters, and forming fingerprint image perception hash for identifying identity information of speakers to generate fingerprint images;
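The sub-steps above (user key → MD5 value → pseudo-random number generator seed → random rectangular areas) can be sketched as follows. This is a minimal illustration, not the patent's implementation; `seed_from_key`, `random_rectangles`, and the rectangle size bounds are assumptions.

```python
import hashlib
import random

def seed_from_key(key: str) -> int:
    # Step 1.1: derive a PRNG seed from the MD5 value of the user-input key
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16)

def random_rectangles(rng: random.Random, img_size: int, n_rects: int,
                      min_side: int = 8, max_side: int = 64):
    # Step 1.2: randomly place n_rects (possibly overlapping) rectangles
    # inside an img_size x img_size fingerprint image, under key control
    rects = []
    for _ in range(n_rects):
        w = rng.randint(min_side, max_side)
        h = rng.randint(min_side, max_side)
        x = rng.randint(0, img_size - w)
        y = rng.randint(0, img_size - h)
        rects.append((x, y, w, h))
    return rects

rng = random.Random(seed_from_key("user-password"))
rects = random_rectangles(rng, img_size=300, n_rects=150)
```

Because the partition is fully determined by the key, the same key reproduces the same rectangles, while a different key yields a different partition — the property the scheme relies on for security.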
step 2, embedding the hash fingerprint image generated in step 1 into the voice signal in the manner of a digital watermark, thereby binding the unique biological characteristics of the speaker with the voice data and obtaining audio embedded with the perceptual hash of the fingerprint image of the speaker who is the voice source;
step 2, specifically comprising the following substeps:
step 2.1, dividing the original audio signal A into M segments, the length of each segment being N;
step 2.2, dividing each audio segment into M1 sub-bands, the length of each sub-band being N/M1; for convenience of watermark embedding, the number of sub-bands M1 is set equal to the length of the perceptual hash in bits;
step 2.3, embedding the perceptual hash value into each of the M audio segments by the LSB (least significant bit) method; specifically, the 0 and 1 bits of the hash value are embedded into the 1st sampling point of each of the M1 sub-bands of the corresponding audio segment; this is repeated segment by segment until all M audio segments are embedded, yielding the audio A' embedded with the perceptual hash of the fingerprint image of the speaker who is the voice source;
step 3, identity authentication is carried out on the voice by utilizing fingerprint image perception hash, and the method specifically comprises the following steps:
let h1 and h2 respectively represent the perceptual hash values of two fingerprint images, and measure the similarity of the two fingerprint images by the normalized Hamming distance;
X represents a fingerprint image, Y represents the image X after a content-preserving operation, and Z represents another fingerprint image different from X; Hk represents the perceptual hash function controlled by the key K; the comparison of perceptual hash values satisfies the following conditions:
D(Hk(X), Hk(Y)) < T1
D(Hk(X), Hk(Z)) > T2
wherein 0 ≤ T1 < T2 ≤ 0.5;
Step 3, specifically comprising the following substeps:
step 3.1, extracting the perceptual hash value H1 from the lowest-order bits of the voice samples;
step 3.2, matching H1 one by one against the perceptual hash values H2 in the fingerprint image perceptual hash library and calculating the normalized Hamming distance D(H1, H2); if D(H1, H2) < T, the matching is successful, i.e., the identity of the speaker who is the voice source can be tracked through the fingerprint image corresponding to H2;
wherein T is the similarity threshold, 0 ≤ T ≤ 0.5.
Advantageous effects
The invention provides a voice source tracking method based on fingerprint image perceptual hashing, which has the following beneficial effects compared with the traditional voice source tracking method:
1. the method combines a plurality of advantageous characteristics of the biological fingerprint image with an encryption key, and embeds the calculated perceptual hash into the voice signal by referring to the digital watermark, so that the identity of the speaker in the voice source is bound with the voice signal, and the tracking of the speaker in the voice source is realized;
2. the method performs identity authentication based on the perceptual hash generated from the fingerprint image, effectively avoiding the technical defect that voice recognition technology is easily disturbed by external environmental factors, and has certain robustness;
3. the method has certain robustness and collision resistance, meets the uniqueness requirement of the perceptual hash, and also has strong security and authentication uniqueness.
Drawings
FIG. 1 is a flowchart of a voice source tracking method based on fingerprint image perceptual hashing according to the present invention;
FIG. 2 is a schematic diagram illustrating a voice source tracking method based on fingerprint image perceptual hashing according to the present invention;
FIG. 3 is a schematic diagram illustrating a gravity-based biometric fingerprint image perceptual hash generation method for voice source tracking based on fingerprint image perceptual hash according to the present invention;
FIG. 4 is a 300X 300 fingerprint image from a fingerprint image library;
FIG. 5 shows a tilted fingerprint image after rotation by the different angles 1°, 2°, 5°, 10°, 30° and 90°;
FIG. 6 compares the rotation-attack resistance of the three perceptual hash algorithms after the fingerprint image is rotated by different angles;
FIG. 7 is a histogram of the matching values.
Detailed Description
The following describes a voice source tracking method based on perceptual hashing of a fingerprint image in detail with reference to specific embodiments and the accompanying drawings.
Example 1
This embodiment describes steps and specific implementation of a voice source tracking method based on perceptual hashing of fingerprint images according to the present invention, and a flow thereof is shown in fig. 1.
In fig. 1, a fingerprint image of the speaker is first captured by a smartphone or fingerprint scanner; a perceptual hash value is then generated from the fingerprint image and embedded in the voice data as a watermark, yielding a voice signal embedded with the fingerprint image perceptual hash, which is transmitted remotely to the receiving end over a transmission channel. When the receiving end needs to track the identity of the speaker of a voice source, the perceptual hash value is first extracted from the voice information and then compared one by one with the perceptual hash values in the fingerprint image perceptual hash library; when its similarity with some perceptual hash value exceeds a certain threshold, the match succeeds, and the identity of the speaker of the voice source is tracked. Since the perceptual hash value is closely tied to the speaker's fingerprint image and different keys are used, the same perceptual hash value essentially never occurs for fingerprint images of different people; the requirements of collision resistance and security are thus satisfied. In addition, the center of gravity of an image resists geometric attacks such as translation and rotation and is highly stable, tolerating the slight geometric distortion of the fingerprint image caused by factors such as angle and pressing force during fingerprint entry, so the corresponding hash value also meets the robustness requirement.
In the specific implementation, as shown in fig. 2, the key is input by the user in the form of a character string. An MD5 value is calculated from the key and used as the seed of a pseudo-random number generator. The speaker's fingerprint is then captured as an image, and the fingerprint image is randomly divided into several rectangular areas under key control. In each rectangular area, parameters with good geometric-invariance characteristics are selected as analysis objects and quantized; finally, the fingerprint image perceptual hash identifying the speaker's identity information is formed, a fingerprint image is generated, and feature extraction, matching and tracking are then performed.
Perceptual hashing maps multimedia objects of any size to a short output according to human perceptual characteristics, so that multimedia objects with the same perception but different forms of expression produce similar or even identical perceptual hash values through the perceptual hash function. Image perceptual hashing maps a digital image into a string of fixed-length feature values. The effect resembles the judgment of the human visual perception system: two images with completely different content yield different perceptual hash values, while images with similar visual effect but slightly different sharpness or shooting angle yield similar or even identical perceptual hash values. This application realizes voice tracking based on perceptual hashing.
In the specific implementation of step 1, the perceptual hashing of the fingerprint image based on image characteristics comes in three variants — based on the center of gravity, the pixel expectation, and the center-of-gravity angle — realized respectively in 1A, 1B and 1C;
1A) the fingerprint image perception hash generation based on the gravity center is specifically that the perception hash generated by the gravity center of the biological fingerprint image is calculated firstly, and then the hash value is embedded into voice data, so that the identity of a speaker is tracked, and the specific implementation is as shown in fig. 2. The method comprises the following steps:
step 1A.1) a user inputs a password (key), an MD5 value of the key is calculated, and the MD5 value is used as a seed of a pseudorandom number generator;
step 1A.2) designate a positive integer N as the number of rectangular areas (i.e., the number of image blocks); under the control of the random number generator, randomly divide the fingerprint image into N possibly-overlapping rectangular areas;
step 1A.3) calculate the center of gravity of each rectangular area and the center of gravity of its complement image using the center-of-gravity formula, and calculate the distance L between the two centers of gravity;
wherein, in order to lengthen the distance between the two centers of gravity, a modified center-of-gravity formula is adopted, as follows:
(modified center-of-gravity formulas for m_x and m_y, rendered as equation images in the original document and not reproduced here)
wherein f(i, j), 1 ≤ i ≤ N, 1 ≤ j ≤ N, represents the gray value of the biometric fingerprint image, and the parameter δ ∈ R+; the modified center-of-gravity formula enlarges the distance between the two centers of gravity, while the parameter δ preserves the robustness of the fingerprint image's center of gravity;
step 1A.4) combine the N distances L into a column vector, then round and quantize it into a binary sequence L', where L' is the generated perceptual hash value; the barycentric coordinates G(m_x, m_y) of a fingerprint image are:

m_x = Σ_i Σ_j i · f(i, j) / Σ_i Σ_j f(i, j)

m_y = Σ_i Σ_j j · f(i, j) / Σ_i Σ_j f(i, j)
in concrete implementation, the center-of-gravity region of an image lies close to its geometric center; when the center of gravity of the image alone does not discriminate well, the complement-image center of gravity G'(m'_x, m'_y) is also used, the complement image being defined as follows:
f'(i, j) = G_level − f(i, j)
wherein G_level represents the maximum gray level of the fingerprint image; the image center of gravity is highly stable under geometric attacks, withstanding operations such as translation and rotation, so the center-of-gravity-based fingerprint image renders the slight geometric distortion introduced during fingerprint entry and reading negligible; the center-of-gravity result depends on the perceived image content, and because different encryption keys are adopted for the fingerprint images of different people, the same perceptual hash value essentially never occurs even when the acquisition results of different people are similar;
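A minimal sketch of the center-of-gravity scheme: per block, the distance between the block's center of gravity and the center of gravity of its complement image f'(i, j) = G_level − f(i, j) is computed and quantized to one bit. The standard (unmodified) centroid formula is used here, since the δ-modified formula appears only as an image in the original; the mod-2 parity quantizer is likewise an assumption, as the patent only says "round and quantize".

```python
import math

def centroid(block):
    # Center of gravity (m_x, m_y): m_x = sum(i*f)/sum(f), m_y = sum(j*f)/sum(f),
    # with 1-based row index i and column index j over the gray values f(i, j)
    total = float(sum(sum(row) for row in block))
    if total == 0:
        return (0.0, 0.0)
    mx = sum((i + 1) * v for i, row in enumerate(block) for v in row) / total
    my = sum((j + 1) * v for row in block for j, v in enumerate(row)) / total
    return mx, my

def complement(block, g_level=255):
    # Complement image: f'(i, j) = G_level - f(i, j)
    return [[g_level - v for v in row] for row in block]

def gravity_hash_bits(blocks):
    # One bit per rectangular block: distance between the block's center of
    # gravity and its complement image's center of gravity, rounded mod 2
    bits = []
    for b in blocks:
        mx, my = centroid(b)
        cx, cy = centroid(complement(b))
        dist = math.hypot(mx - cx, my - cy)
        bits.append(round(dist) % 2)
    return bits
```

Concatenating the bits over the N key-selected rectangles yields the binary sequence L' of the text.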
A block diagram of the center-of-gravity-based biometric fingerprint image perceptual hash generation is shown in fig. 3.
1B) Pixel-expectation-based fingerprint image perceptual hash generation, comprising the following steps:
step 1B.1) a user inputs a key, the key is regarded as a character string, an MD5 value of the key is calculated, and the MD5 value is used as a seed of a pseudorandom number generator;
step 1B.2) a user scans a fingerprint image of the user, and randomly divides the scanned fingerprint image into N overlapped rectangular areas by using a pseudo-random number generator;
step 1B.3) calculating the expectation of pixels of each rectangular area to obtain a vector F consisting of the expectation of rectangular pixels;
step 1B.4) rounding and quantizing F into a binary sequence F ', wherein F' is the generated perceptual hash value;
this method generates the perceptual hash value by calculating the pixel expectation over regions of the fingerprint image; it adapts well to a certain degree of fingerprint image distortion, but adapts poorly to rotation of the fingerprint image, because image rotation distorts the pixels.
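Steps 1B.3–1B.4 reduce to a per-block mean followed by rounding and binarization. A sketch under the same caveat as before: the patent does not fix the quantizer, so the mod-2 parity of each rounded mean is an assumed scheme.

```python
def expectation_hash_bits(blocks):
    # Step 1B.3: the expectation (mean gray value) of the pixels of each block
    means = [sum(sum(row) for row in b) / float(sum(len(row) for row in b))
             for b in blocks]
    # Step 1B.4: round and quantize each expectation to one bit
    # (mod-2 parity of the rounded mean is an assumed quantizer)
    return [round(m) % 2 for m in means]
```

The resulting bit sequence is the binary sequence F' of the text, one bit per key-selected rectangle.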
1C) Perceptual hash generation based on barycentric angles, comprising the steps of:
step 1C.1) a user inputs a key, the key is regarded as a character string, an MD5 value of the key is calculated, and the calculated MD5 value is used as a seed of a pseudorandom number generator;
step 1c.2) the user scans his fingerprint image and randomly segments this image into N overlappable rectangular areas using a pseudo-random number generator.
Step 1C.3) calculating the gravity center of each rectangular area and the corresponding complement chart gravity center by adopting a gravity center calculation formula;
step 1C.4) draw straight lines from the origin of the image (the point whose horizontal and vertical coordinates are zero) to the two centers of gravity of each rectangle, forming two straight lines intersecting at the origin;
step 1C.5) calculating an included angle of two intersecting straight lines to obtain a vector H related to the included angle;
step 1C.6) rounding and quantizing H into a binary sequence H ', and then H' is the generated perceptual hash value.
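Steps 1C.3–1C.6 can be sketched as follows: for each block, the angle at the image origin between the lines to the block's center of gravity and to its complement image's center of gravity is computed, then rounded and quantized. As above, the standard centroid formula and the mod-2 parity quantizer are assumptions.

```python
import math

def centroid(block):
    # Standard center of gravity with 1-based indices
    total = float(sum(sum(row) for row in block))
    mx = sum((i + 1) * v for i, row in enumerate(block) for v in row) / total
    my = sum((j + 1) * v for row in block for j, v in enumerate(row)) / total
    return mx, my

def angle_between(p, q):
    # Angle at the image origin (0, 0) between the lines origin->p and origin->q
    return abs(math.atan2(p[1], p[0]) - math.atan2(q[1], q[0]))

def angle_hash_bits(blocks, g_level=255):
    bits = []
    for b in blocks:
        g = centroid(b)                                        # step 1C.3
        g_c = centroid([[g_level - v for v in row] for row in b])
        theta = math.degrees(angle_between(g, g_c))            # steps 1C.4-1C.5
        bits.append(round(theta) % 2)                          # step 1C.6 (assumed)
    return bits
```

Because both centroids move together under translation and rotation about the origin, the angle between the two lines is comparatively stable, which matches the rotation-resistance result reported for this variant later in the text.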
Because the biometric fingerprint can represent the identity of the speaker, and the hash value calculated from it is a digest of the fingerprint image, the fingerprint image hash value generated for the speaker who is the voice source can be embedded into the voice signal as a watermark, thereby tracking that speaker's identity. Let the digital signal corresponding to the initial audio be A = {a(i)}, 1 ≤ i ≤ L. In the specific implementation, step 2 performs the embedding through the following steps:
step 2.1, segment the audio signal: divide the original audio signal A into M segments, each of length N, so that M = L/N; denote each audio segment as A1(p), p = 1, 2, ..., M;
step 2.2, divide each audio segment into M1 sub-bands A2(p, q), q = 1, 2, ..., M1, so that the length of each sub-band is N/M1; for convenience of watermark embedding, the number of sub-bands M1 is set equal to the length of the perceptual hash in bits;
step 2.3, embed the perceptual hash value into each of the M audio segments by the LSB (least significant bit) method; specifically, the 0 and 1 bits of the hash value are embedded into the 1st sampling point of each of the M1 sub-bands of the corresponding audio segment. Denoting by S1{A2(p, q)} the 1st sampling point of each sub-band, the embedding detail is:
LSB(S1{A2(p, q)}) ← h_d (1)
wherein LSB(·) represents the lowest bit of the sampling point, and h_d is the corresponding bit of the perceptual hash;
embed the M audio segments one by one by this method until all audio segments are embedded, obtaining the audio A' embedded with the perceptual hash of the fingerprint image of the speaker who is the voice source;
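Steps 2.1–2.3 amount to overwriting the least significant bit of the first sample of each sub-band with one hash bit, repeated across all segments. A sketch under the assumption of integer PCM samples; the function name and parameters are illustrative:

```python
def embed_hash_lsb(audio, hash_bits, n_segments):
    # audio: integer PCM samples; hash_bits: the perceptual hash, one bit per sub-band
    m1 = len(hash_bits)
    seg_len = len(audio) // n_segments       # N in the text (step 2.1)
    sub_len = seg_len // m1                  # N / M1 in the text (step 2.2)
    out = list(audio)
    for p in range(n_segments):              # each audio segment A1(p)
        for q, bit in enumerate(hash_bits):  # each sub-band A2(p, q)
            idx = p * seg_len + q * sub_len  # 1st sampling point of the sub-band
            out[idx] = (out[idx] & ~1) | bit # LSB(S1{A2(p, q)}) <- hash bit
    return out
```

Since only the lowest bit of one sample per sub-band changes, each modified sample differs from the original by at most 1, keeping the watermark inaudible.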
Let h1 and h2 respectively represent the perceptual hash values of two fingerprint images; the normalized Hamming distance is used to measure the similarity of the two fingerprint images and is defined as follows:

D(h1, h2) = (1/N) Σ_{i=1}^{N} |h1(i) − h2(i)| (2)
One fingerprint image is denoted by X, the image X after a content-preserving operation by Y, and another fingerprint image different from X by Z. Hk represents the perceptual hash function controlled by the key K. The comparison of perceptual hash values satisfies the following conditions:
D(Hk(X),Hk(Y))<T1 (3)
D(Hk(X),Hk(Z))>T2 (4)
wherein 0 ≤ T1 < T2 ≤ 0.5; in comparing perceptual hash values, the larger the gap between T1 and T2, the better. Ideally, the normalized Hamming distance of similar fingerprint images should be close to 0, and that of dissimilar fingerprint images close to 0.5.
Voice source tracking implementation: when tracking a voice source, first extract the perceptual hash value H1 from the lowest-order bits of the voice sample values, then match it one by one against the perceptual hash values H2 in the fingerprint image perceptual hash library, calculating the normalized Hamming distance D(H1, H2):

D(H1, H2) = (1/N) Σ_{i=1}^{N} |H1(i) − H2(i)|

wherein N is the length of the perceptual hash value. The smaller the normalized Hamming distance D(H1, H2), the closer the two fingerprint images and the higher the source-tracking accuracy.
Let T be the similarity threshold, 0 ≤ T ≤ 0.5. If D(H1, H2) < T, the matching is successful, i.e., the identity of the speaker who is the voice source can be tracked through the fingerprint image corresponding to H2. The smaller the similarity threshold T, the higher the source-tracking accuracy.
Performance analysis — robustness: in the experiment, the fingerprint images come from the million-scale FingerPass fingerprint image database of the intelligent biometric systems research group at the Institute of Automation, Chinese Academy of Sciences. The fingerprint images are 300 × 300 BMP grayscale images, and the generated perceptual hash value length is 150 bits.
Fig. 4 is a 300 × 300 fingerprint image from the fingerprint image library. A speaker often inevitably tilts the finger at some angle when entering a fingerprint; fig. 5 shows the speaker's tilted fingerprint image after rotation by the different angles 1°, 2°, 5°, 10°, 30° and 90°. Calculating the normalized Hamming distance between the rotated perceptual hash value h2 and the perceptual hash value h1 of the original fingerprint image yields the relation between rotation angle and normalized Hamming distance. Fig. 6 compares the rotation-attack resistance of the three perceptual hash algorithms after the fingerprint image is rotated by different angles. As can be seen from fig. 6, the perceptual hash generation algorithm based on the center-of-gravity angle resists the attack best; the normalized Hamming distance of all three algorithms grows with the rotation angle, although it drops somewhat at certain angles, rising overall in a wave-like fashion. In addition, the normalized Hamming distance is large for rotation angles above 5° and acceptable for rotation angles within 5°, showing that the speaker who is the voice source can be tracked fairly accurately.
Collision resistance: to study the collision resistance of the proposed perceptual hash algorithms, we tested the center-of-gravity-based perceptual hash generation algorithm as an example. 60 fingerprint images of size 304 × 256 were randomly selected from a fingerprint image library (collected in our laboratory) to generate hash values, which were matched pairwise to obtain 1770 matching results. Fig. 7 is a histogram of the matching values; it can be seen that the matching results approximately fit a Gaussian distribution N(μ, σ), with mathematical expectation μ = 1.327 and standard deviation σ = 0.5786. In the test, a threshold T' = 0.43 is selected, and the collision resistance of the generated perceptual hash values is calculated according to the following formula:
(equation (9), the collision-probability formula, is rendered as an image in the original document and not reproduced here)
The collision rate of the center-of-gravity-based perceptual hash generation algorithm calculated by equation (9) is 9.043 × 10⁻⁵; the collision rate of the algorithm is thus small, which guarantees the uniqueness of the perceptual hash value. Fig. 7: statistical histogram of the 1770 matching results.
Security: owing to the extreme sensitivity of the MD5 algorithm to its input, different user keys generate different random numbers and hence different random partitions of the fingerprint image, producing the perceptual hash value segments shown in table 1. As can be seen from table 1, the perceptual hash values generated by different keys are distinct; matching the two perceptual hash values gives a Euclidean distance of 85.9867, indicating that the distance is large and that the security requirement of the hash sequence is satisfied. Therefore, without knowledge of the key, the perceptual hash value of the fingerprint image cannot be obtained.
TABLE 1 perceptual Hash value segments generated by different keys for the same fingerprint image
(Table 1 is rendered as an image in the original document and is not reproduced here.)
While the foregoing is directed to the preferred embodiment of the present invention, it is not intended that the invention be limited to the embodiment and the drawings disclosed herein. Equivalents and modifications may be made without departing from the spirit of the disclosure, which is to be considered as within the scope of the invention.

Claims (5)

1. A voice source tracking method based on fingerprint image perceptual hashing is characterized in that: the method comprises the following steps:
step 1, performing perceptual hashing on a fingerprint image based on image characteristics to generate a hashed fingerprint image;
step 1, specifically comprising the following substeps:
step 1.1 calculate MD5 value with user input as key and calculated MD5 value as pseudo random number generator seed;
step 1.2, carrying out image acquisition on the user fingerprint, and randomly dividing a plurality of rectangular areas from the fingerprint image of the speaker through key control;
step 1.3, selecting parameters with good geometric invariance in each rectangular area generated in step 1.2 as analysis objects, quantizing the selected parameters, and forming the fingerprint image perceptual hash used to identify the identity information of the speaker, thereby generating the hashed fingerprint image;
step 2, embedding the hashed fingerprint image generated in step 1 into the voice signal in the form of a digital watermark, thereby binding the unique biological characteristics of the speaker to the voice data and obtaining audio embedded with the fingerprint image perceptual hash of the speaker of the voice source;
step 2, specifically comprising the following substeps:
step 2.1, dividing the original audio signal A into M sections, wherein the length of each section of audio signal is N;
step 2.2, dividing each audio segment into M1 sub-bands, the length of each sub-band being N/M1; for convenience of watermark embedding, the number of sub-bands M1 is set equal to the number of perceptual hash bits;
step 2.3, embedding the perceptual hash values into the M audio segments respectively by the LSB (least significant bit) method; specifically, the 0 and 1 bits of each hash value are embedded one by one into the 1st sampling point of each of the M1 sub-bands of the corresponding audio segment; this is repeated until all M audio segments are embedded, yielding the audio A' embedded with the fingerprint image perceptual hash of the speaker of the voice source;
step 3, identity authentication is carried out on the voice by utilizing fingerprint image perception hash, and the method specifically comprises the following steps:
letting h1 and h2 respectively denote the perceptual hash values of two fingerprint images, and measuring their similarity by the normalized Hamming distance;
X represents a fingerprint image, Y represents the image obtained from X by a content-preserving operation, and Z represents another fingerprint image different from X; Hk denotes the perceptual hash function controlled by the key K, and the comparison of perceptual hash values satisfies:
D(Hk(X), Hk(Y)) < T1
D(Hk(X), Hk(Z)) > T2
step 3, specifically comprising the following substeps:
step 3.1, extracting the perceptual hash value H1 from the lowest-order bits of the voice samples;
step 3.2, matching H1 one by one against the perceptual hash values H2 in the fingerprint image perceptual hash library and calculating the normalized Hamming distance D(H1, H2); if D(H1, H2) < T, the matching is successful, i.e. the fingerprint image corresponding to H2 tracks the identity of the speaker of the voice source.
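As an illustrative aside (not part of the claims), the matching of steps 3.1–3.2 above can be sketched as follows; the bit sequences and the threshold value are hypothetical examples, with T constrained to 0 ≤ T ≤ 0.5 as claim 5 requires:

```python
def normalized_hamming(h1, h2):
    """Normalized Hamming distance between two equal-length bit sequences."""
    assert len(h1) == len(h2)
    return sum(a != b for a, b in zip(h1, h2)) / len(h1)

def match(h_extracted, hash_library, t=0.25):
    """Return the library indices whose distance to the extracted hash
    falls below the similarity threshold t (step 3.2)."""
    return [i for i, h in enumerate(hash_library)
            if normalized_hamming(h_extracted, h) < t]
```

A successful match returns the index of a library hash, from which the stored fingerprint image, and hence the speaker's identity, can be recovered.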
2. The voice source tracking method based on the perceptual hashing of fingerprint images as claimed in claim 1, wherein: in step 1, the image features comprise three types, based respectively on the gravity center, the pixel expectation, and the gravity center angle of the fingerprint image.
3. The voice source tracking method based on the perceptual hashing of fingerprint images as claimed in claim 2, wherein: in step 1.2, the random division is controlled by the pseudo-random number generator seeded in step 1.1.
4. The voice source tracking method based on the perceptual hashing of fingerprint images as claimed in claim 3, wherein: in step 3, 0 ≤ T1 < T2 ≤ 0.5.
5. The voice source tracking method based on the perceptual hashing of fingerprint images as claimed in claim 4, wherein: in step 3, T is the similarity threshold and 0 ≤ T ≤ 0.5.
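As an illustrative aside (not part of the claims), the LSB embedding of step 2.3 and the extraction of step 3.1 can be sketched for one audio segment of integer samples; the segment length, sub-band count, and sample values below are hypothetical:

```python
def embed_hash_lsb(samples, hash_bits, n_subbands):
    """Embed one hash bit into the LSB of the 1st sample of each of
    n_subbands equal-length sub-bands of one audio segment (step 2.3)."""
    assert len(hash_bits) == n_subbands
    sub_len = len(samples) // n_subbands
    out = list(samples)
    for i, bit in enumerate(hash_bits):
        pos = i * sub_len                       # 1st sample of sub-band i
        out[pos] = (out[pos] & ~1) | (bit & 1)  # overwrite the LSB only
    return out

def extract_hash_lsb(samples, n_subbands):
    """Recover the embedded bits from the sub-band 1st samples (step 3.1)."""
    sub_len = len(samples) // n_subbands
    return [samples[i * sub_len] & 1 for i in range(n_subbands)]
```

Because only the least significant bit of one sample per sub-band changes, each embedded bit perturbs its sample by at most 1, which is why LSB watermarking is perceptually transparent.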
CN202011401234.9A 2020-12-02 2020-12-02 Voice source tracking method based on fingerprint image perceptual hashing Pending CN112687282A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011401234.9A CN112687282A (en) 2020-12-02 2020-12-02 Voice source tracking method based on fingerprint image perceptual hashing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011401234.9A CN112687282A (en) 2020-12-02 2020-12-02 Voice source tracking method based on fingerprint image perceptual hashing

Publications (1)

Publication Number Publication Date
CN112687282A 2021-04-20

Family

ID=75445909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011401234.9A Pending CN112687282A (en) 2020-12-02 2020-12-02 Voice source tracking method based on fingerprint image perceptual hashing

Country Status (1)

Country Link
CN (1) CN112687282A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104835499A (en) * 2015-05-13 2015-08-12 西南交通大学 Cipher text speech perception hashing and retrieving scheme based on time-frequency domain trend change
CN110414200A (en) * 2019-04-08 2019-11-05 广州腾讯科技有限公司 Auth method, device, storage medium and computer equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HWAI-TSU HU: "Efficient and robust frame-synchronized blind audio watermarking by featuring multilevel DWT and DCT", 《CLUSTER COMPUTING》 *
QIU YONG: "Research on Voice Identity and Content Authentication Technology Based on Perceptual Hashing", 《China Master's Theses Full-text Database, Information Science and Technology》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113592744A (en) * 2021-08-12 2021-11-02 长光卫星技术有限公司 Geometric precise correction method suitable for high-resolution remote sensing image
CN113592744B (en) * 2021-08-12 2024-03-19 长光卫星技术股份有限公司 Geometric fine correction method suitable for high-resolution remote sensing image
CN115021966A (en) * 2022-05-06 2022-09-06 深圳比特微电子科技有限公司 Voice access method, user access equipment and remote system
CN117116275A (en) * 2023-10-23 2023-11-24 浙江华创视讯科技有限公司 Multi-mode fused audio watermarking method, device and storage medium
CN117116275B (en) * 2023-10-23 2024-02-20 浙江华创视讯科技有限公司 Multi-mode fused audio watermarking method, device and storage medium

Similar Documents

Publication Publication Date Title
CN112687282A (en) Voice source tracking method based on fingerprint image perceptual hashing
Muhammad et al. A secure method for color image steganography using gray-level modification and multi-level encryption
CN104823203B (en) Biometric templates safety and key generate
CN102306305B (en) Method for authenticating safety identity based on organic characteristic watermark
CN101345054B (en) Digital watermark production and recognition method used for audio document
Ouyang et al. Robust hashing for image authentication using SIFT feature and quaternion Zernike moments
Yan et al. Multi-scale difference map fusion for tamper localization using binary ranking hashing
CN107993669B (en) Voice content authentication and tampering recovery method based on modification of least significant digit weight
CN108122225B (en) Digital image tampering detection method based on self-adaptive feature points
Bartlow et al. Protecting iris images through asymmetric digital watermarking
Bhalshankar et al. Audio steganography: LSB technique using a pyramid structure and range of bytes
Zenati et al. SSDIS-BEM: A new signature steganography document image system based on beta elliptic modeling
Choudhury et al. Cancelable iris Biometrics based on data hiding schemes
CN1315091C (en) Digital image recognising method based on characteristics
Liu et al. Data protection in palmprint recognition via dynamic random invisible watermark embedding
Dutta et al. An efficient and secure digital image watermarking using features from iris image
Benhamza et al. Image forgery detection review
Yuling et al. Robust Image Hashing Using Radon Transform and Invariant Features.
Ma et al. Block pyramid based adaptive quantization watermarking for multimodal biometric authentication
Sayeed et al. Forgery detection in dynamic signature verification by entailing principal component analysis
Dutta et al. Watermark generation from fingerprint features for digital right management control
Partala et al. Improving robustness of biometric identity determination with digital watermarking
Kaur et al. A secure and high payload digital audio watermarking using features from iris image
Low et al. A preliminary study on biometric watermarking for offline handwritten signature
Karsh Geometric invariant image authentication system using hashing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210420