CN115186125A - Encrypted voice retrieval method and device, electronic equipment and storage medium

Info

Publication number: CN115186125A
Application number: CN202210734067.2A
Authority: CN
Inventors: 黄石磊; 蒋志燕; 陈诚; 廖晨; 冯湘
Original assignee: Shenzhen Raisound Technology Co ltd
Current assignee: Shenzhen Raisound Technology Co ltd
Priority date: 2022-06-24
Filing date: 2022-06-24
Publication date: 2022-10-14

Abstract

The embodiments of the present disclosure relate to a retrieval method and apparatus for encrypted voice, an electronic device, and a storage medium. The method comprises: acquiring a retrieval sound segment, wherein the retrieval sound segment is a voice segment used for retrieval; dividing the retrieval sound segment into a syllable sequence; generating a hash value of each syllable in the syllable sequence by using a hash algorithm to obtain a target hash sequence; determining, from at least one predetermined hash sequence, a hash sequence matching the target hash sequence, wherein each hash sequence in the at least one hash sequence corresponds to an encrypted sound segment; and determining the encrypted sound segment corresponding to the determined hash sequence as a retrieval result. This scheme can improve the speed and accuracy of retrieving encrypted voice.

Description

Encrypted voice retrieval method and device, electronic equipment and storage medium

Technical Field

The embodiment of the disclosure relates to the technical field of computers, and in particular relates to a retrieval method and device for encrypted voice, an electronic device and a storage medium.

Background

Voice encryption is an effective way to ensure the security of voice information, and voice retrieval is a technical means of quickly locating voice content. In the prior art, a neural network model is usually used to retrieve encrypted speech.

However, because voice characteristics change greatly after encryption and the volume of encrypted data is massive, how to retrieve encrypted voice faster and more accurately remains a problem worth studying.

Disclosure of Invention

In view of the above, to solve some or all of the above technical problems, embodiments of the present disclosure provide a method and an apparatus for retrieving encrypted speech, an electronic device, and a storage medium.

In a first aspect, an embodiment of the present disclosure provides a retrieval method for encrypted speech, where the method includes:

acquiring a retrieval sound segment, wherein the retrieval sound segment is a voice segment for retrieval;

dividing the retrieval sound segment into a syllable sequence;

generating a hash value of each syllable in the syllable sequence by adopting a hash algorithm to obtain a target hash sequence;

determining a hash sequence matching the target hash sequence from at least one predetermined hash sequence, wherein each hash sequence in the at least one hash sequence corresponds to an encrypted sound segment;

and determining the encrypted sound segment corresponding to the determined hash sequence as a retrieval result.

Optionally, in the method according to any embodiment of the present disclosure, the at least one hash sequence is generated by:

acquiring target voice;

dividing the target voice into a voice segment sequence;

dividing each sound segment in the sound segment sequence into syllable sequences to obtain at least one syllable sequence;

and generating a hash value of each syllable in the at least one syllable sequence by adopting a hash algorithm to obtain at least one hash sequence.

Optionally, in the method according to any embodiment of the present disclosure, the correspondence between each hash sequence in the at least one hash sequence and the encrypted sound segment is established as follows:

respectively encrypting the sound segments in the sound segment sequence to obtain at least one encrypted sound segment;

and, for each hash sequence in the at least one hash sequence, embedding the hash sequence as a watermark into the encrypted sound segment obtained by encrypting the sound segment corresponding to the hash sequence, so as to establish the correspondence between the hash sequence and the encrypted sound segment.

Optionally, in the method according to any embodiment of the present disclosure, the determining, from at least one predetermined hash sequence, a hash sequence that matches the target hash sequence includes:

selecting a hash sequence from at least one predetermined hash sequence, and executing the following determination steps based on the hash sequence: and if the normalized Hamming distance between each hash value of the target hash sequence and each corresponding hash value of the hash sequence is smaller than or equal to a preset distance threshold, determining that the hash sequence is matched with the target hash sequence.

Optionally, in the method according to any embodiment of the present disclosure, determining whether the normalized Hamming distance between each hash value of the target hash sequence and the corresponding hash value of the hash sequence is smaller than or equal to the preset distance threshold includes:

determining whether the normalized Hamming distance between the first hash value of the target hash sequence and the first hash value of the hash sequence is smaller than or equal to the preset distance threshold;

and if the normalized Hamming distance between the two first hash values is smaller than or equal to the preset distance threshold, sequentially determining, starting from the second hash value, whether the normalized Hamming distance between each hash value in the target hash sequence and the corresponding hash value in the hash sequence is smaller than or equal to the preset distance threshold.

Optionally, in the method according to any embodiment of the present disclosure, the determining, from at least one predetermined hash sequence, a hash sequence that matches the target hash sequence further includes:

and if the normalized Hamming distance between two corresponding hash values is greater than the preset distance threshold, selecting an unselected hash sequence from the at least one hash sequence, and executing the determining step based on the selected hash sequence.

Optionally, in a method according to any embodiment of the present disclosure, the generating a hash value of each syllable in the syllable sequence by using a hash algorithm to obtain a target hash sequence includes:

generating a hash value of each syllable in the syllable sequence by adopting a first hash algorithm to obtain a first target hash sequence; and

the determining a hash sequence matching the target hash sequence from at least one predetermined hash sequence includes:

determining a first hash sequence matched with the first target hash sequence from at least one predetermined first hash sequence; and

the determining the encrypted segment corresponding to the determined hash sequence as a retrieval result includes:

if the matching degree of the determined first hash sequence and the first target hash sequence is smaller than or equal to a preset matching degree threshold value, generating a hash value of each syllable in the syllable sequence by adopting a second hash algorithm to obtain a second target hash sequence;

determining a second hash sequence matched with the second target hash sequence from at least one predetermined second hash sequence; wherein each second hash sequence in the at least one second hash sequence corresponds to an encrypted sound segment; and the sound segment decrypted by the encrypted sound segment corresponding to each second hash sequence in the at least one second hash sequence is the same as the sound segment decrypted by the encrypted sound segment corresponding to each first hash sequence in the at least one first hash sequence.

Optionally, in a method according to any embodiment of the present disclosure, the determining, from at least one predetermined hash sequence, a hash sequence that matches the target hash sequence includes:

if the number of the voice frames of the retrieval sound segment is smaller than the preset number, determining a hash subsequence matched with the target hash sequence from at least one predetermined hash sequence; and

and determining the encrypted sub-sound segment corresponding to the matched hash sub-sequence as a retrieval result.
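As a non-limiting illustration of the hash subsequence matching described above, the following Python sketch slides a short target hash sequence along a longer stored hash sequence. The function names, the representation of hash values as bit strings, and the use of the normalized Hamming distance as the matching criterion here are assumptions made for illustration only.

```python
from typing import List, Optional


def normalized_hamming(a: str, b: str) -> float:
    """Fraction of differing bits between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("hash values must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def find_subsequence(target: List[str], stored: List[str],
                     threshold: float) -> Optional[int]:
    """Return the first offset at which every target hash value is within
    the threshold of the aligned stored hash value, or None if no such
    offset exists (hypothetical helper, not part of the disclosure)."""
    n = len(target)
    for start in range(len(stored) - n + 1):
        window = stored[start:start + n]
        if all(normalized_hamming(t, s) <= threshold
               for t, s in zip(target, window)):
            return start
    return None


# Stored hash sequence of one (longer) sound segment, one hash per syllable.
stored = ["0000", "1111", "1010", "0101"]
# Target hash sequence of a short retrieval sound segment.
target = ["1110", "1010"]
offset = find_subsequence(target, stored, threshold=0.25)
```

In this sketch the returned offset identifies which syllables of the stored sound segment form the matched sub-sound segment.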

In a second aspect, an embodiment of the present disclosure provides a retrieval apparatus for encrypted speech, where the apparatus includes:

an acquisition unit configured to acquire a search segment, wherein the search segment is a speech segment for searching;

a dividing unit configured to divide the search segment into syllable sequences;

the generating unit is configured to generate hash values of all syllables in the syllable sequence by adopting a hash algorithm to obtain a target hash sequence;

a first determining unit configured to determine a hash sequence matching the target hash sequence from at least one predetermined hash sequence, wherein each hash sequence of the at least one hash sequence corresponds to an encrypted sound segment;

a second determination unit configured to determine the encrypted sound segment corresponding to the determined hash sequence as a retrieval result.

Optionally, in the apparatus according to any embodiment of the present disclosure, the at least one hash sequence is generated by:

acquiring target voice;

dividing the target voice into a voice segment sequence;

Optionally, in the apparatus according to any embodiment of the present disclosure, a correspondence between each hash sequence in the at least one hash sequence and the encrypted segment is established as follows:

Optionally, in an apparatus according to any embodiment of the present disclosure, the first determining unit is specifically configured to:

Optionally, in an apparatus according to any embodiment of the present disclosure, if the normalized hamming distance between each hash value of the target hash sequence and each corresponding hash value of the hash sequence is smaller than or equal to a preset distance threshold, the method includes:

if the normalized Hamming distance between the two hash values is smaller than or equal to the preset distance threshold, sequentially determining whether the normalized Hamming distance between the hash value in the target hash sequence and the hash value in the hash sequence is smaller than or equal to the preset distance threshold from the second hash value.

Optionally, in the apparatus according to any embodiment of the present disclosure, the first determining unit is further configured to:

Optionally, in an apparatus according to any embodiment of the present disclosure, the generating unit is specifically configured to:

the first determining unit is specifically configured to:

the second determining unit is specifically configured to:

and determining the encrypted sub-sound segment corresponding to the matched hash subsequence as a retrieval result.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including:

a memory for storing a computer program;

a processor, configured to execute the computer program stored in the memory, and when the computer program is executed, implement the method of any embodiment of the encrypted speech retrieval method according to the first aspect of the present disclosure.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any embodiment of the encrypted speech retrieval method according to the first aspect.

In a fifth aspect, the disclosed embodiments provide a computer program comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the steps in the method according to any one of the embodiments of the method for retrieving encrypted speech according to the first aspect described above.

The retrieval method of the encrypted speech provided by the embodiment of the present disclosure may obtain a retrieval segment, where the retrieval segment is a speech segment used for retrieval, then divide the retrieval segment into syllable sequences, then generate hash values of all syllables in the syllable sequences by using a hash algorithm to obtain target hash sequences, then determine hash sequences matched with the target hash sequences from at least one predetermined hash sequence, where each hash sequence in the at least one hash sequence corresponds to one encrypted segment, and finally determine the encrypted segments corresponding to the determined hash sequences as retrieval results. According to the scheme, the retrieval of the encrypted voice is realized through the Hash algorithm, and the speed and the accuracy of retrieving the encrypted voice are improved.

Drawings

Fig. 1A is a schematic flowchart of a retrieval method of encrypted speech according to an embodiment of the present disclosure;

Fig. 1B is a flow diagram of one implementation of Fig. 1A;

Fig. 2A is a schematic flowchart of another encrypted speech retrieval method according to an embodiment of the present disclosure;

Fig. 2B is a schematic illustration of a matching process for Fig. 2A;

Fig. 3A is a schematic flowchart of another encrypted speech retrieval method according to an embodiment of the present disclosure;

Fig. 3B is a schematic diagram of a matching process for Fig. 3A;

Fig. 4 is a schematic structural diagram of a retrieval apparatus for encrypted speech according to an embodiment of the present disclosure;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of parts and steps, numerical expressions, and values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.

It will be understood by those within the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one object, step, device, or module from another object, and do not denote any particular technical meaning or logical order therebetween.

It is also understood that in the present embodiment, "a plurality" may mean two or more, and "at least one" may mean one, two or more.

It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.

In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.

It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

It should be noted that, in the present disclosure, the embodiments and the features of the embodiments may be combined with each other without conflict. For the purpose of facilitating an understanding of the embodiments of the present disclosure, the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

Fig. 1A is a schematic flowchart of a retrieval method of encrypted speech according to an embodiment of the present disclosure. As shown in Fig. 1A, the method specifically includes:

101. Acquire a retrieval sound segment.

In the present embodiment, the execution subject of the retrieval method of the encrypted voice (e.g., server, terminal device, retrieval means of the encrypted voice, etc.) may acquire the retrieval segment.

Wherein, the retrieval sound segment is a voice segment for retrieval. The number of the voice frames of the search segment may be any value or may be a fixed value.

102. Divide the retrieval sound segment into a syllable sequence.

In this embodiment, the execution body may use a syllable division algorithm to divide the search segment into syllable sequences.

Wherein, the syllable is the smallest voice structure unit formed by phoneme combination. There are clearly perceptible boundaries between syllables. In Chinese, the pronunciation of a Chinese character is a syllable.

103. Generate a hash value of each syllable in the syllable sequence by using a hash algorithm to obtain a target hash sequence.

In this embodiment, the executing body may generate a hash value of each syllable in the syllable sequence by using a hash algorithm for each syllable in the syllable sequence, so as to obtain a target hash sequence. The number of hash values in the target hash sequence may be equal to the number of syllables in the syllable sequence.

Among them, the Hash Algorithm (HA) is an algorithm for generating a fingerprint of data. As an example, the hash algorithm may include mean hash (aHash), perceptual hash (pHash), and difference value hash (dHash), among others.
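The disclosure names mean hash and difference hash but does not specify how they are computed on audio. The following Python sketch is therefore only an assumed illustration: it supposes that each syllable has already been reduced to a short, non-empty list of numeric feature values (a hypothetical representation, for example per-frame energies) and derives one bit string per syllable.

```python
from typing import Callable, List, Sequence


def mean_hash(features: Sequence[float]) -> str:
    """Mean-hash style fingerprint: bit is 1 where a value exceeds the mean."""
    mean = sum(features) / len(features)
    return "".join("1" if value > mean else "0" for value in features)


def difference_hash(features: Sequence[float]) -> str:
    """Difference-hash style fingerprint: bit is 1 where the next value
    is larger than the current one."""
    return "".join("1" if nxt > cur else "0"
                   for cur, nxt in zip(features, features[1:]))


def hash_sequence(syllables: Sequence[Sequence[float]],
                  hash_fn: Callable[[Sequence[float]], str] = mean_hash
                  ) -> List[str]:
    """One hash value per syllable, in syllable order."""
    return [hash_fn(syllable) for syllable in syllables]


# Two syllables, each given as four assumed feature values.
syllables = [[0.1, 0.9, 0.4, 0.6], [0.8, 0.2, 0.3, 0.7]]
target_hash_sequence = hash_sequence(syllables)
```

The resulting list has as many hash values as there are syllables, consistent with the relationship stated for step 103.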

104. Determine a hash sequence matching the target hash sequence from at least one predetermined hash sequence.

In this embodiment, the execution body may determine a hash sequence matching the target hash sequence from at least one predetermined hash sequence.

Wherein each hash sequence in the at least one hash sequence corresponds to an encrypted sound segment.

As an example, for each hash sequence in the at least one hash sequence, the execution body may calculate the Jaccard similarity coefficient between the hash sequence and the target hash sequence.

Then, the hash sequence corresponding to the largest of the computed Jaccard similarity coefficients is taken as the hash sequence matching the target hash sequence. Alternatively, one or more hash sequences whose Jaccard similarity coefficients are greater than or equal to a target threshold are taken as hash sequences matching the target hash sequence. The target threshold may be a preset value, or may be a product of the number of hash sequences in the at least one hash sequence and a preset percentage.
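As a minimal sketch of the Jaccard-based selection described above, the following Python code treats each hash sequence as a set of hash values, which is an assumption since the disclosure does not state how the coefficient is computed over a sequence. The identifiers `best_match`, `matches_above` and the segment names are hypothetical.

```python
from typing import Dict, List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| of two hash sequences,
    each viewed as a set of hash values (order is ignored)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def best_match(target: Sequence[str], stored: Dict[str, List[str]]) -> str:
    """Identifier of the stored hash sequence with the largest coefficient."""
    return max(stored, key=lambda seg_id: jaccard(target, stored[seg_id]))


def matches_above(target: Sequence[str], stored: Dict[str, List[str]],
                  threshold: float) -> List[str]:
    """Identifiers of all stored hash sequences at or above the threshold."""
    return [seg_id for seg_id, seq in stored.items()
            if jaccard(target, seq) >= threshold]


stored = {
    "segment_A": ["1010", "0110", "1100"],
    "segment_B": ["0001", "0110", "1111"],
}
target = ["1010", "0110", "1101"]
winner = best_match(target, stored)
above = matches_above(target, stored, threshold=0.4)
```

Here `segment_A` shares two of four distinct hash values with the target (coefficient 0.5), while `segment_B` shares one of five (coefficient 0.2).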

As another example, a hash sequence, among the at least one predetermined hash sequence, that is equal in length to the target hash sequence and whose normalized Hamming distance from the target hash sequence is smaller than a preset threshold may be determined as a hash sequence matching the target hash sequence.

Here, the following correspondence relationship may exist between the hash sequence and the encrypted segment:

Sound segment A is encrypted to obtain encrypted sound segment A. Sound segment A is divided to obtain syllable sequence A. The hash value of each syllable in syllable sequence A is determined by using a hash algorithm to obtain hash sequence A. In this scenario, hash sequence A corresponds to encrypted sound segment A.

105. Determine the encrypted sound segment corresponding to the determined hash sequence as a retrieval result.

In this embodiment, the execution body may determine an encrypted segment corresponding to the determined hash sequence as a search result.

In some optional implementations of this embodiment, the at least one hash sequence is generated by:

first, a target voice is acquired. Wherein the target speech may be any speech. Usually, the number of frames of the target speech is greater than or equal to a preset value, that is, the target speech usually has a certain length, so that the retrieval of the segments can be performed therein.

And then, dividing the target voice into a voice segment sequence. In some cases, the number of speech frames for each segment in the sequence of segments may be equal. Alternatively, each segment in the sequence of segments may correspond to a speech.

Then, each of the segments in the segment sequence is divided into syllable sequences to obtain at least one syllable sequence.

And then, generating a hash value of each syllable in the at least one syllable sequence by adopting a hash algorithm to obtain at least one hash sequence.
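The generation of the at least one hash sequence described above can be sketched as follows. This is an illustration under stated assumptions: the target voice is given as a list of numeric frame values, sound segments have an equal number of frames, the syllable division algorithm is replaced by a fixed-size placeholder split, and a mean-hash style function stands in for the hash algorithm. None of these choices is prescribed by the disclosure.

```python
from typing import List, Sequence


def split_fixed(items: Sequence[float], size: int) -> List[List[float]]:
    """Split a sequence into consecutive chunks of the given size."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def mean_hash(features: Sequence[float]) -> str:
    """Bit is 1 where a value exceeds the mean of the syllable's values."""
    mean = sum(features) / len(features)
    return "".join("1" if value > mean else "0" for value in features)


def build_hash_sequences(target_speech: Sequence[float],
                         frames_per_segment: int,
                         frames_per_syllable: int) -> List[List[str]]:
    """Target voice -> sound segments -> syllables -> one hash sequence
    per sound segment."""
    hash_sequences = []
    for segment in split_fixed(target_speech, frames_per_segment):
        # Placeholder for a real syllable division algorithm (assumption).
        syllables = split_fixed(segment, frames_per_syllable)
        hash_sequences.append([mean_hash(s) for s in syllables])
    return hash_sequences


target_speech = [1, 5, 2, 6, 7, 3, 8, 4]
hash_sequences = build_hash_sequences(target_speech,
                                      frames_per_segment=4,
                                      frames_per_syllable=2)
```

Each inner list is the hash sequence of one sound segment, so the i-th hash sequence can later be associated with the i-th encrypted sound segment.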

Here, the hash algorithm used may be the same as the hash algorithm in step 103 described above. In other words, for the process of obtaining each hash sequence in the at least one hash sequence, reference may be made to the process of generating the target hash sequence in step 103, which is not described herein again.

It can be understood that, in the above alternative implementation, the target speech is divided into the segment sequence, and then each segment in the segment sequence is divided into the syllable sequence, so as to obtain at least one hash sequence, thereby, when the retrieval of the encrypted speech is performed, the accuracy of the retrieval can be improved.

In some application scenarios in the above optional implementation manners, the correspondence between each hash sequence in the at least one hash sequence and the encrypted segment is established in the following manner:

firstly, respectively encrypting the sound segments in the sound segment sequence to obtain at least one encrypted sound segment.

And then, regarding each hash sequence in the at least one hash sequence, embedding the hash sequence as a watermark into an encrypted sound segment obtained by encrypting the sound segment corresponding to the hash sequence so as to establish a corresponding relation between the hash sequence and the encrypted sound segment.
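The disclosure does not specify the watermarking scheme. Purely as an assumed illustration, the following Python sketch embeds the bits of a hash sequence into the least significant bits of the leading bytes of an encrypted sound segment and reads them back. The function names and the least-significant-bit scheme are hypothetical, and a practical scheme would need to choose embedding positions that do not interfere with decryption.

```python
def embed_watermark(encrypted: bytes, bits: str) -> bytes:
    """Overwrite the least significant bit of the first len(bits) bytes
    with the watermark bits (assumed scheme, for illustration only)."""
    if len(bits) > len(encrypted):
        raise ValueError("watermark is longer than the carrier")
    out = bytearray(encrypted)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | int(bit)
    return bytes(out)


def extract_watermark(watermarked: bytes, n_bits: int) -> str:
    """Read back the least significant bit of the first n_bits bytes."""
    return "".join(str(byte & 1) for byte in watermarked[:n_bits])


hash_sequence = ["1010", "0110"]          # one hash value per syllable
bits = "".join(hash_sequence)             # flattened watermark payload
encrypted_segment = bytes(range(16))      # stand-in for real ciphertext
watermarked = embed_watermark(encrypted_segment, bits)
recovered = extract_watermark(watermarked, len(bits))
```

Because the hash sequence travels inside the encrypted sound segment, the correspondence between the two can be recovered from the watermarked data itself.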

It can be understood that in the above application scenario, the hash sequence is used as a watermark and embedded into the corresponding encrypted segment, so that the speed of encrypted voice retrieval is increased.

In some optional implementations of this embodiment, step 103 may be described as: and generating a hash value of each syllable in the syllable sequence by adopting a first hash algorithm to obtain a first target hash sequence.

On this basis, the step 104 can be described as: and determining a first hash sequence matched with the first target hash sequence from at least one predetermined first hash sequence.

On this basis, the executing body may execute the step 105 in the following manner:

firstly, if the matching degree of the determined first hash sequence and the first target hash sequence is less than or equal to a preset matching degree threshold value, generating a hash value of each syllable in the syllable sequence by adopting a second hash algorithm to obtain a second target hash sequence.

The matching degree may characterize a similarity between the determined first hash sequence and the first target hash sequence. The similarity can be calculated in various ways, and is not described herein again.

And then, determining a second hash sequence matched with the second target hash sequence from at least one predetermined second hash sequence.

Wherein each of the at least one second hash sequence corresponds to an encrypted segment. And the sound segment decrypted by the encrypted sound segment corresponding to each second hash sequence in the at least one second hash sequence is the same as the sound segment decrypted by the encrypted sound segment corresponding to each first hash sequence in the at least one first hash sequence.

It is to be understood that, in the above alternative implementation manner, the first hash sequence and the second hash sequence may be obtained separately in the following manner:

After the at least one syllable sequence is obtained as described above, two different hash algorithms are used to generate hash values of the syllables in the at least one syllable sequence, so as to obtain at least one first hash sequence and at least one second hash sequence. In other words, the first hash sequence and the second hash sequence differ in the hash algorithm employed.

Here, in the above optional implementation manner, by generating the first hash sequence and the second hash sequence, when the matching degree of the determined first hash sequence and the first target hash sequence is less than or equal to a preset matching degree threshold, the second hash sequence matching the second target hash sequence may be determined from at least one second hash sequence, so as to improve the accuracy of the retrieval.
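The fallback from the first hash algorithm to the second can be sketched as below. The sketch assumes a mean-hash style function as the first algorithm, a difference-hash style function as the second, and a matching degree defined as the fraction of equal bits across aligned hash values. These choices, and all identifiers, are illustrative assumptions and are not fixed by the disclosure.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Syllables = List[List[float]]


def mean_hash(f: Sequence[float]) -> str:
    mean = sum(f) / len(f)
    return "".join("1" if v > mean else "0" for v in f)


def difference_hash(f: Sequence[float]) -> str:
    return "".join("1" if b > a else "0" for a, b in zip(f, f[1:]))


def matching_degree(seq_a: List[str], seq_b: List[str]) -> float:
    """Assumed similarity: fraction of equal bits over aligned hash values."""
    if len(seq_a) != len(seq_b):
        return 0.0
    total = sum(len(a) for a in seq_a)
    same = sum(x == y for a, b in zip(seq_a, seq_b) for x, y in zip(a, b))
    return same / total if total else 0.0


def build_index(segments: Dict[str, Syllables],
                hash_fn: Callable[[Sequence[float]], str]
                ) -> Dict[str, List[str]]:
    return {k: [hash_fn(s) for s in v] for k, v in segments.items()}


def best(target: List[str], index: Dict[str, List[str]]) -> Tuple[str, float]:
    seg_id = max(index, key=lambda k: matching_degree(target, index[k]))
    return seg_id, matching_degree(target, index[seg_id])


def retrieve(query: Syllables, first_index: Dict[str, List[str]],
             second_index: Dict[str, List[str]],
             threshold: float) -> Tuple[str, str]:
    seg_id, degree = best([mean_hash(s) for s in query], first_index)
    if degree > threshold:
        return seg_id, "first"
    # Matching degree too low: re-hash the query with the second algorithm.
    seg_id, _ = best([difference_hash(s) for s in query], second_index)
    return seg_id, "second"


segments = {
    "segment_A": [[1, 2, 3, 4], [4, 3, 2, 1]],
    "segment_B": [[1, 4, 2, 3], [2, 4, 1, 3]],
}
first_index = build_index(segments, mean_hash)
second_index = build_index(segments, difference_hash)
exact = retrieve(segments["segment_A"], first_index, second_index, 0.8)
noisy = retrieve([[1, 3, 2, 4], [4, 3, 2, 1]], first_index, second_index, 0.8)
```

In this example the noisy query reaches a first-stage matching degree of only 0.75 against both stored sequences, so the second hash algorithm is used and selects `segment_A`.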

Optionally, if the matching degree between the determined second hash sequence and the second target hash sequence is smaller than or equal to the preset matching degree threshold, a third hash algorithm may be used to generate hash values of all syllables in the syllable sequence, so as to obtain a third target hash sequence.

The matching degree may characterize a similarity between the determined second hash sequence and the second target hash sequence. The similarity can be calculated in various ways, and is not described herein again.

And then, determining a third hash sequence matched with the third target hash sequence from at least one predetermined third hash sequence.

Wherein each third hash sequence in the at least one third hash sequence corresponds to an encrypted sound segment. And the decrypted sound segment of the encrypted sound segment corresponding to each third hash sequence in the at least one third hash sequence is the same as the decrypted sound segment of the encrypted sound segment corresponding to each first hash sequence in the at least one first hash sequence.

Similarly, the first hash sequence, the second hash sequence and the third hash sequence differ in that the hash algorithms used are different.

In order to improve robustness and distinguishability of perceptual hashing and search speed of an algorithm and make the perceptual hashing more suitable for large-scale data processing, an application scenario of the above embodiment proposes a search method of an encrypted sound segment of syllable-level perceptual hashing. As shown in fig. 1B, fig. 1B is a schematic flow diagram for one implementation of fig. 1A. In fig. 1B, a generation process and a user retrieval process of encrypted voice data with watermark (i.e. data obtained after embedding a hash sequence as a watermark into a corresponding encrypted segment) are shown.

The generation process of the watermarked encrypted voice is as follows. First, posterior probability features of the sound segments to be encrypted (namely, the sound segment sequence formed by dividing the target voice) are extracted, and each sound segment in the sound segment sequence is divided into ordered syllables by a syllable division algorithm to obtain at least one syllable sequence. Fixed-length perceptual hash values are then generated syllable by syllable, and the perceptual hash values of these syllables, taken in order, constitute the hash sequence of the whole sound segment. The hash sequence of each sound segment is embedded as a watermark into the corresponding encrypted sound segment to obtain the watermarked encrypted voice. The system hash table is constructed from the perceptual hash sequences of all sound segments.

When a sound segment is retrieved, the perceptual hash sequence (the target hash sequence) of the retrieval voice (namely, the retrieval sound segment) is first generated syllable by syllable, and the system hash table is then searched for perceptual hash sequences that have the same length and a matching head. If the Hamming distance between the perceptual hash sequences is smaller than a set threshold, the match is considered successful and the retrieval result is output.
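A minimal sketch of such a table lookup is given below. It assumes that the system hash table groups stored hash sequences by length, that a head matches when the Hamming distance between the first hash values does not exceed a separate head threshold, and that hash values have a fixed length. The table layout, both thresholds, and all identifiers are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

Table = DefaultDict[int, List[Tuple[str, List[str]]]]


def hamming(a: str, b: str) -> int:
    """Number of differing bits (hash values assumed to be fixed length)."""
    return sum(x != y for x, y in zip(a, b))


def build_table(hash_sequences: Dict[str, List[str]]) -> Table:
    """Group stored hash sequences by their number of hash values."""
    table: Table = defaultdict(list)
    for seg_id, seq in hash_sequences.items():
        table[len(seq)].append((seg_id, seq))
    return table


def search(table: Table, target: List[str],
           head_threshold: int, sequence_threshold: int) -> List[str]:
    """Return identifiers of same-length sequences whose head matches and
    whose total Hamming distance is below the sequence threshold."""
    results = []
    for seg_id, seq in table.get(len(target), []):
        if hamming(target[0], seq[0]) > head_threshold:
            continue  # head does not match, skip the full comparison
        distance = sum(hamming(t, s) for t, s in zip(target, seq))
        if distance < sequence_threshold:
            results.append(seg_id)
    return results


table = build_table({
    "seg_1": ["1010", "0110"],
    "seg_2": ["1011", "0000"],
    "seg_3": ["1010", "0110", "1111"],
})
hits = search(table, ["1010", "0111"], head_threshold=1, sequence_threshold=2)
```

Only `seg_1` is returned: `seg_3` has a different length, and `seg_2` passes the head check but its total distance of 4 is not below the sequence threshold.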

Fig. 2A is a schematic flowchart of another encrypted speech retrieval method provided in an embodiment of the present disclosure. As shown in Fig. 2A, the method specifically includes:

201. Acquire a retrieval sound segment.

Wherein, the retrieval sound segment is a voice segment for retrieval.

In this embodiment, step 201 is substantially the same as step 101 in the corresponding embodiment of fig. 1A, and is not described herein again.

202. Divide the retrieval sound segment into a syllable sequence.

In this embodiment, the execution body may divide the search segment into syllable sequences.

In this embodiment, step 202 is substantially the same as step 102 in the corresponding embodiment of fig. 1A, and is not described herein again.

203. Generate a hash value of each syllable in the syllable sequence by using a hash algorithm to obtain a target hash sequence.

In this embodiment, the executing entity may generate hash values of the syllables in the syllable sequence by using a hash algorithm, so as to obtain a target hash sequence.

In this embodiment, step 203 is substantially the same as step 103 in the corresponding embodiment of fig. 1A, and is not described herein again.

204. Select a hash sequence from at least one predetermined hash sequence, and execute the following determination step based on the hash sequence: if the normalized Hamming distance between each hash value of the target hash sequence and the corresponding hash value of the hash sequence is smaller than or equal to a preset distance threshold, determine that the hash sequence matches the target hash sequence.

In this embodiment, the executing body may select a hash sequence from at least one predetermined hash sequence, and execute the following determining steps based on the hash sequence: and if the normalized Hamming distance between each hash value of the target hash sequence and each corresponding hash value of the hash sequence is smaller than or equal to a preset distance threshold, determining that the hash sequence is matched with the target hash sequence.

Here, the correspondence between the hash sequence and the encrypted segment may refer to the above description, and is not described herein again.

205. Determine the encrypted sound segment corresponding to the determined hash sequence as a retrieval result.

In this embodiment, the execution body may determine the encrypted segment corresponding to the determined hash sequence as the search result.

In this embodiment, step 205 is substantially the same as step 105 in the corresponding embodiment of fig. 1A, and is not described herein again.

According to the retrieval method of the encrypted voice, the hash sequence matched with the target hash sequence is determined by calculating the normalized Hamming distance, and the accuracy of retrieval of the encrypted voice segment can be improved.
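The determination step of step 204 can be sketched as follows. This is a minimal illustration and not the implementation of the disclosure: it assumes each hash value is a bit string of equal length, treats sequences of different lengths as non-matching, and the names `normalized_hamming` and `sequences_match` are hypothetical.

```python
from typing import Sequence


def normalized_hamming(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("hash values must have the same length")
    if not a:
        raise ValueError("hash values must be non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def sequences_match(target: Sequence[str],
                    candidate: Sequence[str],
                    threshold: float) -> bool:
    """True if every pair of corresponding hash values is within the threshold.

    Assumption: a candidate with a different number of hash values
    (syllables) is treated as a non-match.
    """
    if len(target) != len(candidate):
        return False
    return all(normalized_hamming(t, c) <= threshold
               for t, c in zip(target, candidate))
```

For example, `sequences_match(["10101010", "11110000"], ["10101011", "11110000"], 0.2)` compares two syllable hashes that differ in one bit out of eight and two that are identical.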

In some optional implementations of this embodiment, whether the normalized hamming distances are all less than or equal to the preset threshold may be determined as follows:

First, it is determined whether the normalized Hamming distance between the first hash value of the target hash sequence and the first hash value of the hash sequence is less than or equal to the preset distance threshold.

Then, if the normalized Hamming distance between the two first hash values is less than or equal to the preset distance threshold, it is determined in sequence, starting from the second hash value, whether the normalized Hamming distance between each hash value in the target hash sequence and the corresponding hash value in the hash sequence is less than or equal to the preset distance threshold.

In some application scenarios in the foregoing optional implementation manner, the determining, from at least one predetermined hash sequence, a hash sequence that matches the target hash sequence further includes:

if the normalized Hamming distance between the two first hash values is greater than the preset distance threshold, selecting an unselected hash sequence from the at least one hash sequence, and executing the determination step based on that hash sequence.

It can be understood that, in the above optional implementation, the normalized Hamming distance between the two first hash values is compared with the preset distance threshold first. If it is greater than the preset distance threshold, the subsequent comparisons between the remaining hash values are not required, which improves the retrieval speed.
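The header-first strategy described above can be sketched as a loop over the candidate hash sequences. This is an illustrative sketch under stated assumptions: hash values are equal-length bit strings, candidates of a different length are skipped, and `find_matches` is a hypothetical name rather than anything defined in the disclosure.

```python
from typing import List, Sequence


def normalized_hamming(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length bit strings differ."""
    if len(a) != len(b) or not a:
        raise ValueError("hash values must be non-empty and equal in length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def find_matches(target: Sequence[str],
                 candidates: Sequence[Sequence[str]],
                 threshold: float) -> List[int]:
    """Return the indices of candidate hash sequences that match the target."""
    if not target:
        return []
    matches = []
    for index, candidate in enumerate(candidates):
        if len(candidate) != len(target):
            continue  # assumption: different syllable counts never match
        # Compare the two first hash values; on failure move on to the
        # next unselected hash sequence without further comparisons.
        if normalized_hamming(target[0], candidate[0]) > threshold:
            continue
        # Compare the remaining hash values in order, starting from the
        # second one; all() stops at the first pair over the threshold.
        if all(normalized_hamming(t, c) <= threshold
               for t, c in zip(target[1:], candidate[1:])):
            matches.append(index)
    return matches
```

Because most non-matching candidates are rejected after a single comparison, the cost per rejected candidate is one hash comparison rather than one per syllable.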

As an application scenario of the above embodiment, refer to fig. 2B. After the watermarked encrypted voice data and the constructed system voice hash table are uploaded to the cloud server, if the user sends a retrieval request to the cloud server, the ciphertext can be retrieved directly without decrypting the encrypted voice data. Assume that the process of searching for the query voice segment in the watermarked encrypted voice is as follows:

The first step: D-dimensional posterior probability features of the retrieval sound segment Q are extracted, and the retrieval sound segment Q is divided into N syllables using a syllable segmentation technique.

The second step: a perceptual hash sequence with a fixed length of M is generated for each syllable of the retrieval sound segment Q, and the perceptual hash sequences of all the syllables form, in order, the target perceptual hash sequence $H_Q = \{H_{Q1}, H_{Q2}, \ldots, H_{QN}\}$ corresponding to the retrieval sound segment Q.

The third step: the perceptual hash sequence $H_Q = \{H_{Q1}, H_{Q2}, \ldots, H_{QN}\}$ of length M×N is searched for in the system hash table. Since the perceptual hash value generated for each syllable has a fixed length M, the segment corresponding to a perceptual hash sequence of length M×N must contain N syllables. Therefore, the header $H_{Q1}$ of the target perceptual hash $H_Q$ is first matched against the header $H_{S1}$ of the perceptual hash $H_S$ to be matched. If the header match succeeds, the match between $H_Q$ and $H_S$ is then judged; otherwise, no matching judgment between $H_Q$ and $H_S$ is needed. Because $H_Q$ and $H_S$ are each formed from the perceptual hashes of N syllables, the normalized Hamming distance between the perceptual hash sequences of two syllables (such as $H_i$ and $H_j$) needs to be defined before the normalized Hamming distance between $H_Q$ and $H_S$ is calculated.
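The definition announced at the end of the third step is not reproduced in this text. For orientation only, a commonly used definition of the normalized Hamming distance between two binary hashes of length M is given below. It is an assumption and not the formula of the disclosure, and the bit notation $H_i(k)$ is introduced here purely for illustration.

```latex
D(H_i, H_j) = \frac{1}{M} \sum_{k=1}^{M} \left| H_i(k) - H_j(k) \right|,
\qquad H_i(k),\ H_j(k) \in \{0, 1\}
```

Under this assumed definition, the distance ranges from 0 (identical hashes) to 1 (every bit differs), and applying it to two concatenated sequences of length M×N yields the mean of the N per-syllable distances.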

Assume that the similarity threshold is T (i.e., the preset distance threshold), where 0 < T < 0.5. If the normalized Hamming distance is less than T, the match is considered successful and a retrieval result is output; otherwise, $H_Q$ is matched, using the same method, against the next perceptual hash sequence of length M×N in the system hash table. No matching calculation is needed for perceptual hash sequences in the system hash table whose length is not M×N, or whose length is M×N but whose header does not match the header of the target perceptual hash. This search strategy can save a large amount of time when retrieving the query segment.
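The search strategy of the third step can be sketched as follows. The sketch rests on several assumptions that the text does not fix: the system hash table is modeled as a dictionary from a concatenated bit string to an identifier of the encrypted segment, the header test reuses the same threshold T, and the names `search_hash_table` and `segment_id` are hypothetical.

```python
from typing import Dict, List


def normalized_hamming(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length bit strings differ."""
    if len(a) != len(b) or not a:
        raise ValueError("hashes must be non-empty and equal in length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def search_hash_table(target: str,
                      table: Dict[str, str],
                      m: int,
                      t: float) -> List[str]:
    """Return identifiers of encrypted segments whose hash sequence matches.

    target: concatenated perceptual hash sequence of the query (length M*N)
    table:  maps a concatenated perceptual hash sequence to a segment id
    m:      fixed per-syllable hash length M
    t:      distance threshold T, expected to satisfy 0 < T < 0.5
    """
    results = []
    for hash_sequence, segment_id in table.items():
        # Skip sequences whose length is not M*N (different syllable count).
        if len(hash_sequence) != len(target):
            continue
        # Header check: compare only the first syllable hash (M bits).
        if normalized_hamming(target[:m], hash_sequence[:m]) >= t:
            continue
        # Full comparison over all M*N bits.
        if normalized_hamming(target, hash_sequence) < t:
            results.append(segment_id)
    return results
```

Only entries that pass both the length filter and the header check incur the full M×N-bit comparison, which is where the time saving described above comes from.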

The fourth step: after the retrieval is completed in the system hash table and the corresponding encrypted voice segment is obtained, the embedded watermark can be extracted from the encrypted voice by using a watermark extraction algorithm, and the embedded watermark is matched and verified with the watermark of the query voice segment, so that whether the encrypted voice data is damaged or tampered can be checked.

Fig. 3A is a schematic flowchart of a retrieval method of encrypted speech according to an embodiment of the present disclosure, where the retrieval method may be applied to electronic devices such as smart phones, notebook computers, desktop computers, and portable computers.

As shown in fig. 3A, the method specifically includes:

301. and acquiring a retrieval sound segment.

In the present embodiment, the execution subject (e.g., server, terminal device, retrieval means of encrypted voice, etc.) of the retrieval method of encrypted voice may acquire the retrieval segment.

The retrieval sound segment is a voice segment used for retrieval.

In this embodiment, step 301 is substantially the same as step 101 in the embodiment corresponding to fig. 1A, and is not described herein again.

302. And dividing the search sound segment into syllable sequences.

In this embodiment, step 302 is substantially the same as step 102 in the corresponding embodiment of fig. 1A, and is not described herein again.

303. And generating the hash value of each syllable in the syllable sequence by adopting a hash algorithm to obtain a target hash sequence.

In this embodiment, step 303 is substantially the same as step 103 in the embodiment corresponding to fig. 1A, and is not described herein again.

304. And if the number of the voice frames of the retrieval sound segment is less than the preset number, determining a hash subsequence matched with the target hash sequence from at least one predetermined hash sequence.

In this embodiment, if the number of voice frames of the search segment is less than a preset number, the execution main body may determine a hash subsequence matching the target hash sequence from at least one predetermined hash sequence.

The preset number may be the minimum of the voice frame numbers of the sound segments corresponding to the hash sequences in the at least one predetermined hash sequence. Therefore, a retrieval sound segment whose number of voice frames is less than the preset number usually indicates that the retrieval sound segment acquired in step 301 is short; for example, it may indicate that the retrieval sound segment acquired in step 301 is the speech corresponding to a keyword.

As an example, for each hash sequence of the at least one hash sequence, the execution body may calculate the Jaccard similarity coefficient between a hash subsequence of the hash sequence and the target hash sequence.

Then, the hash subsequence corresponding to the largest of the calculated Jaccard similarity coefficients is taken as the hash subsequence matching the target hash sequence. Alternatively, one or more hash subsequences whose calculated Jaccard similarity coefficients are greater than or equal to a target threshold are taken as hash subsequences matching the target hash sequence. The target threshold may be a preset value, or may be the product of the number of hash sequences in the at least one hash sequence and a preset percentage.
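A minimal sketch of the first variant (selecting the largest Jaccard similarity coefficient) is given below. The text does not state how hash subsequences are enumerated or how the coefficient is computed on hash values, so the sketch assumes contiguous windows of the same length as the target and computes the coefficient on sets of exactly equal hash values; `jaccard` and `best_subsequence` are hypothetical names.

```python
from typing import Sequence, Tuple


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| of two hash sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def best_subsequence(target: Sequence[str],
                     sequences: Sequence[Sequence[str]]
                     ) -> Tuple[int, int, float]:
    """Return (sequence index, start offset, coefficient) of the best window.

    Returns (-1, -1, -1.0) if no sequence is at least as long as the target.
    """
    window = len(target)
    best = (-1, -1, -1.0)
    for seq_index, sequence in enumerate(sequences):
        # Slide a window of the target's length over the hash sequence.
        for start in range(len(sequence) - window + 1):
            score = jaccard(target, sequence[start:start + window])
            if score > best[2]:
                best = (seq_index, start, score)
    return best
```

Because the set comparison relies on exact equality, it does not tolerate small bit differences between perceptual hashes; the Hamming-distance variant described in the text may be more suitable where that tolerance is needed.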

As another example, a hash subsequence of the at least one predetermined hash sequence that is equal in length to the target hash sequence and whose Hamming distance from the target hash sequence is smaller than a preset threshold may be determined as a hash subsequence matching the target hash sequence.

Here, the following correspondence may exist between the hash sequence and the encrypted segment:

305. And determining the encrypted sub-sound segment corresponding to the matched hash subsequence as a retrieval result.

In this embodiment, the execution body may determine the encrypted sub-sound segment corresponding to the matching hash sub-sequence as the search result.

According to the retrieval method of the encrypted voice provided by the embodiment of the disclosure, a retrieval sound segment whose number of voice frames is less than the preset number can be retrieved within the encrypted sound segments by determining the hash subsequence matching the target hash sequence.

The following description is made for the purpose of illustrating the embodiments of the present disclosure, but it should be noted that the embodiments of the present disclosure may have the features described below, but the following description is not to be construed as limiting the scope of the embodiments of the present disclosure.

Here, the above embodiment is explained by way of example, taking the case in which the retrieval sound segment acquired in step 301 is the speech of a keyword:

Illustratively, the search may be performed on the perceptual hash sequence of each segment into which the target speech is divided. First, perceptual hash values are generated in units of the syllables of the keyword, and these perceptual hash values form an ordered perceptual hash set (the target perceptual hash sequence). Then, within the perceptual hash sequence of each sound segment, perceptual hash subsets of the same length and with a matching header are matched in turn; if the normalized Hamming distance between the perceptual hash subsets is smaller than a preset distance threshold, the sound segment is located. Matching then continues until the perceptual hash sequences of all segments have been searched. In short, retrieving a retrieval sound segment differs from retrieving a keyword in the object being matched: for a retrieval sound segment, the retrieval object is the perceptual hash sequence of each segment in the system hash table, whereas for a keyword, the retrieval object is a subset of each perceptual hash sequence in the system hash table.

Specifically, as shown in fig. 3B, after the watermarked encrypted voice data and the system hash table are uploaded to the cloud server, the user sends a request to the server to retrieve a voice keyword. Assume that the keyword to be searched is K, and that the system hash table is composed of the perceptual hash sequences $(H^{1}, H^{2}, \ldots, H^{t})$ of all segments $(A_{1}, A_{2}, \ldots, A_{t})$, where the hash sequence of segment $A_{i}$ is $H^{i} = (H^{i}_{1}, H^{i}_{2}, \ldots, H^{i}_{N})$, composed in order of the perceptual hash values of N syllables. The number of syllables contained in different segments may differ. The keyword K is searched for in the perceptual hash sequence corresponding to each sound segment of the system hash table; the specific process of searching for and matching K in the perceptual hash sequence $H^{i} = (H^{i}_{1}, H^{i}_{2}, \ldots, H^{i}_{N})$ of segment $A_{i}$ is detailed below:

The first step: D-dimensional posterior probability features of the query keyword K are extracted, and the query keyword is divided into L syllables using a syllable segmentation technique, where L is less than N.

The second step: a perceptual hash sequence with a fixed length of M is generated for each syllable of the query keyword K, and the perceptual hash sequences of all the syllables form, in order, the target perceptual hash sequence $H_K = \{H_{K1}, H_{K2}, \ldots, H_{KL}\}$ corresponding to the query keyword.

The third step: from the perceptual hash sequence $H^{i} = (H^{i}_{1}, H^{i}_{2}, \ldots, H^{i}_{N})$ of segment $A_{i}$, N-L+1 perceptual hash subsets to be matched are generated, each of length L: $Hv^{i} = \{H^{i}_{v}, H^{i}_{v+1}, \ldots, H^{i}_{v+L-1}\}$, $v = 1, 2, \ldots, N-L+1$.

The fourth step: the target perceptual hash sequence corresponding to the query keyword is used as a sliding window to retrieve, in turn, the N-L+1 perceptual hash subsets to be matched, each of length L. First, it is judged whether the header of the target perceptual hash matches the header of the perceptual hash subset to be matched. If so, it is further judged whether the target perceptual hash matches the perceptual hash subset by calculating the normalized Hamming distance between them; if not, no further matching is required. The basis of this matching strategy is that, for syllable-level perceptual hash sequences, sequences whose headers match may match as a whole, whereas sequences whose headers do not match cannot match as a whole.
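The third and fourth steps can be sketched together as a sliding-window search. This is an illustration under assumptions, not the method of the disclosure: per-syllable hashes are bit strings of equal length, the header test reuses the same threshold, the whole-window distance is taken as the mean of the per-syllable distances, and `locate_keyword` is a hypothetical name. Offsets are 0-based, so offset 0 corresponds to v = 1 in the text.

```python
from typing import List, Sequence


def normalized_hamming(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length bit strings differ."""
    if len(a) != len(b) or not a:
        raise ValueError("hashes must be non-empty and equal in length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def locate_keyword(keyword_hashes: Sequence[str],
                   segment_hashes: Sequence[str],
                   threshold: float) -> List[int]:
    """Return 0-based syllable offsets at which the keyword matches."""
    l, n = len(keyword_hashes), len(segment_hashes)
    if l == 0 or l > n:
        return []
    hits = []
    # There are N - L + 1 candidate windows of length L.
    for v in range(n - l + 1):
        window = segment_hashes[v:v + l]
        # Header check: skip the window if its first syllable hash is too far.
        if normalized_hamming(keyword_hashes[0], window[0]) >= threshold:
            continue
        # Whole-window distance (assumed: mean of per-syllable distances).
        distance = sum(normalized_hamming(k, w)
                       for k, w in zip(keyword_hashes, window)) / l
        if distance < threshold:
            hits.append(v)
    return hits
```

Running this over the perceptual hash sequence of every segment in the system hash table, and collecting the segments with at least one hit, corresponds to the keyword retrieval described above.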

Fig. 4 is a schematic structural diagram of a retrieval apparatus for encrypted speech according to an embodiment of the present disclosure, which specifically includes:

an obtaining unit 401 configured to obtain a search sound segment, where the search sound segment is a speech segment for performing a search;

a dividing unit 402 configured to divide the search segment into syllable sequences;

a generating unit 403, configured to generate a hash value of each syllable in the syllable sequence by using a hash algorithm, so as to obtain a target hash sequence;

a first determining unit 404 configured to determine a hash sequence matching the target hash sequence from at least one predetermined hash sequence, wherein each hash sequence of the at least one hash sequence corresponds to an encrypted sound segment;

a second determining unit 405 configured to determine the encrypted sound segment corresponding to the determined hash sequence as a retrieval result.

Optionally, in an apparatus according to any embodiment of the present disclosure, the at least one hash sequence is generated by:

acquiring a target voice;

dividing the target voice into a voice segment sequence;

Optionally, in the apparatus according to any embodiment of the present disclosure, the correspondence between each hash sequence in the at least one hash sequence and the encrypted sound segment is established as follows:

Optionally, in an apparatus according to any embodiment of the present disclosure, the first determining unit 404 is specifically configured to:

Optionally, in an apparatus according to any embodiment of the present disclosure, the first determining unit 404 is further configured to:

Optionally, in an apparatus according to any embodiment of the present disclosure, the generating unit 403 is specifically configured to:

the first determining unit 404 is specifically configured to:

the second determining unit 405 is specifically configured to:

determining a second hash sequence matched with the second target hash sequence from at least one predetermined second hash sequence; wherein each second hash sequence in the at least one second hash sequence corresponds to an encrypted sound segment; and the sound segment after decryption of the encrypted sound segment corresponding to each second hash sequence in the at least one second hash sequence is the same as the sound segment after decryption of the encrypted sound segment corresponding to each first hash sequence in the at least one first hash sequence.

the second determining unit 405 is specifically configured to:

The encrypted speech retrieval device provided in this embodiment may be the encrypted speech retrieval device shown in fig. 4, and may perform all the steps of the encrypted speech retrieval method shown in fig. 1A to 3B, thereby achieving the technical effect of that method. For details, refer to the description of fig. 1A to 3B, which is not repeated here for brevity.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, where the electronic device 500 shown in fig. 5 includes: at least one processor 501, memory 502, at least one network interface 504, and other user interfaces 503. The various components in the electronic device 500 are coupled together by a bus system 505. It is understood that the bus system 505 is used to enable connection communications between these components. The bus system 505 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 505 in FIG. 5.

The user interface 503 may include a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen).

It is to be understood that the memory 502 in embodiments of the present disclosure may be either volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 502 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

In some embodiments, memory 502 stores elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system 5021 and application programs 5022.

The operating system 5021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application 5022 includes various applications, such as a Media Player (Media Player), a Browser (Browser), and the like, for implementing various application services. A program implementing the method of an embodiment of the present disclosure may be included in the application program 5022.

In this embodiment, by calling a program or an instruction stored in the memory 502, specifically, a program or an instruction stored in the application 5022, the processor 501 is configured to execute the method steps provided by the method embodiments, for example, including:

dividing the search sound segment into syllable sequences;

acquiring a target voice;

dividing the target voice into a voice segment sequence;

Optionally, in the method according to any embodiment of the present disclosure, a correspondence between each hash sequence in the at least one hash sequence and the encrypted segment is established as follows:

selecting a hash sequence from at least one predetermined hash sequence, and executing the following determination steps based on the hash sequence: and if the normalized Hamming distance between each hash value of the target hash sequence and each corresponding hash value of the hash sequence is smaller than or equal to a preset distance threshold value, determining that the hash sequence is matched with the target hash sequence.

Optionally, in the method according to any embodiment of the present disclosure, if the normalized hamming distance between each hash value of the target hash sequence and each corresponding hash value of the hash sequence is less than or equal to a preset distance threshold, the method includes:

Optionally, in the method according to any embodiment of the present disclosure, the generating a hash value of each syllable in the syllable sequence by using a hash algorithm to obtain a target hash sequence includes:

the determining, as a search result, the encrypted segment corresponding to the determined hash sequence includes:

The method disclosed by the embodiments of the present disclosure can be applied to the processor 501, or implemented by the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 501 or by instructions in the form of software. The processor 501 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present disclosure may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software elements in the decoding processor. The software elements may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 502, and the processor 501 reads the information in the memory 502 and completes the steps of the method in combination with its hardware.

It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented in one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the above-described functions of the present disclosure, or a combination thereof.

For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.

The electronic device provided in this embodiment may be the electronic device shown in fig. 5, and may perform all the steps of the encrypted voice retrieval method shown in fig. 1A to 3B, so as to achieve the technical effect of the encrypted voice retrieval method shown in fig. 1A to 3B.

The embodiments of the present disclosure also provide a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; and it may also include a combination of the above kinds of memory.

The one or more programs in the storage medium are executable by one or more processors to implement the above retrieval method of the encrypted voice executed on the electronic device side.

The processor executes the retrieval program of the encrypted voice stored in the memory to realize the following steps of the retrieval method of the encrypted voice executed on the electronic device side:

dividing the search sound segment into syllable sequences;

acquiring a target voice;

dividing the target voice into a voice segment sequence;

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), internal memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above embodiments describe the objects, technical solutions, and advantages of the present disclosure in further detail. It should be understood that the above are merely specific embodiments of the present disclosure and are not intended to limit its scope; any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present disclosure shall fall within the scope of the present disclosure.

Claims

1. A method for retrieving encrypted speech, the method comprising:

dividing the search sound segment into syllable sequences;

generating hash values of all syllables in the syllable sequence by adopting a hash algorithm to obtain a target hash sequence;

2. The method of claim 1, wherein the at least one hash sequence is generated by:

acquiring target voice;

dividing the target voice into a voice segment sequence;

3. The method according to claim 2, wherein the correspondence between each hash sequence of the at least one hash sequence and the encrypted sound segment is established as follows:

4. The method of claim 1, wherein determining the hash sequence matching the target hash sequence from the predetermined at least one hash sequence comprises:

5. The method of claim 4, wherein if the normalized Hamming distance between each hash value of the target hash sequence and each corresponding hash value of the hash sequence is less than or equal to a predetermined distance threshold, the method comprises:

if the normalized Hamming distance between the two hash values is smaller than or equal to the preset distance threshold, sequentially determining whether the normalized Hamming distance between the hash value in the target hash sequence and the hash value in the hash sequence is smaller than or equal to the preset distance threshold from the second hash value; and

the determining, from at least one predetermined hash sequence, a hash sequence matching the target hash sequence, further includes:

6. The method according to any one of claims 1 to 5, wherein the generating a hash value of each syllable in the syllable sequence by using a hash algorithm to obtain a target hash sequence comprises:

the determining the encrypted sound segment corresponding to the determined hash sequence as a retrieval result comprises:

determining a second hash sequence matched with the second target hash sequence from at least one predetermined second hash sequence; wherein each of the at least one second hash sequence corresponds to an encrypted sound segment; and the sound segment after decryption of the encrypted sound segment corresponding to each second hash sequence in the at least one second hash sequence is the same as the sound segment after decryption of the encrypted sound segment corresponding to each first hash sequence in the at least one first hash sequence.

7. The method according to any one of claims 1 to 5, wherein determining the hash sequence matching the target hash sequence from the predetermined at least one hash sequence comprises:

the determining the encrypted sound segment corresponding to the determined hash sequence as a retrieval result includes:

8. An encrypted speech retrieval apparatus, comprising:

an acquisition unit configured to acquire a search segment, wherein the search segment is a speech segment used for searching;

a segmentation unit configured to segment the search segment into a sequence of syllables;

a generating unit configured to generate a hash value of each syllable in the syllable sequence by using a hash algorithm to obtain a target hash sequence;

a first determining unit configured to determine a hash sequence matching the target hash sequence from at least one predetermined hash sequence, wherein each of the at least one hash sequence corresponds to one encrypted sound segment;

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for executing a computer program stored in the memory, and when executed, implementing the method of any of the preceding claims 1-7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of the preceding claims 1 to 7.