CN112445934B - Voice retrieval method, device, equipment and storage medium


Info

Publication number
CN112445934B
Authority
CN
China
Prior art keywords
vector, audio, vector sequence, retrieval, sequence
Prior art date
Legal status
Active
Application number
CN202110133004.7A
Other languages
Chinese (zh)
Other versions
CN112445934A (en)
Inventor
丁浩杰
邓箐
吴富章
Current Assignee
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yuanjian Information Technology Co Ltd
Priority to CN202110133004.7A
Publication of CN112445934A
Application granted
Publication of CN112445934B


Classifications

    • G06F16/61: Information retrieval of audio data; indexing; data structures therefor; storage structures
    • G06F16/65: Information retrieval of audio data; clustering; classification
    • G06F16/683: Retrieval of audio data characterised by using metadata automatically derived from the content
    • G06F16/685: Retrieval of audio data using an automatically derived transcript of the audio data, e.g. lyrics
    • G06F40/284: Handling natural language data; lexical analysis, e.g. tokenisation or collocates

Abstract

The application provides a voice retrieval method, device, equipment and storage medium, belonging to the technical field of voice keyword detection. The voice retrieval method comprises the following steps: acquiring retrieval audio; obtaining at least one first vector sequence of the retrieval audio by using a pre-trained classification model, each first vector sequence corresponding to an audio speed; generating an index value corresponding to each first vector sequence according to the value, in a preset dimension, of a target vector in that sequence; acquiring an index value mapping table of the stored audio in a retrieved audio library; retrieving the retrieved audio library according to the index values and the mapping table to obtain a retrieval result; and outputting the retrieval result. The method and device can improve retrieval efficiency.

Description

Voice retrieval method, device, equipment and storage medium
Technical Field
The present application relates to the field of voice keyword detection technologies, and in particular, to a voice retrieval method, apparatus, device, and storage medium.
Background
With the development of the internet, the amount of audio data keeps growing, and retrieving the required information from massive audio data has become an urgent need.
At present, in retrieval based on sample keywords, when searching continuous speech the position at which a keyword occurs is unknown, so the keyword needs to be searched for over the vector sequence of the retrieved audio by means of sliding matching.
However, the retrieval methods currently adopted are inefficient, both in their use of floating-point operations between vectors and in their sliding mechanism during retrieval matching. For example, the representation vectors of audio are high-dimensional vector sequences, which makes the floating-point computation very time-consuming. In addition, sliding matching must use a small step length to avoid omissions during the slide, so many useless matches are made, increasing resource consumption. For these reasons, the current retrieval approach is inefficient.
Disclosure of Invention
The application aims to provide a voice retrieval method, a voice retrieval device, voice retrieval equipment and a storage medium, which can improve the retrieval efficiency.
The embodiment of the application is realized as follows:
in one aspect of the embodiments of the present application, a method for voice retrieval is provided, including:
acquiring retrieval audio, wherein the retrieval audio comprises at least one keyword;
obtaining at least one first vector sequence of the retrieval audio by adopting a classification model obtained by pre-training, wherein each first vector sequence corresponds to an audio speed; generating an index value corresponding to each first vector sequence according to a value in a preset dimension of a target vector in each first vector sequence, wherein the target vector is a first vector in the first vector sequence;
acquiring an index value mapping table of stored audio in a retrieved audio library, wherein each row of the index value mapping table is used for recording the mapping relation between an index value and one or more vectors in a second vector sequence of the stored audio, and the vectors with the same value in a preset dimension are mapped to the same index value;
searching the searched audio library according to the index value corresponding to each first vector sequence and the index value mapping table to obtain a search result;
and outputting a retrieval result.
Optionally, retrieving the retrieved audio library according to the index value corresponding to each first vector sequence and the index value mapping table to obtain a retrieval result includes:
screening target rows with index values corresponding to the first vector sequences from the index value mapping table, and forming a vector set to be matched by vectors in the target rows;
aiming at each vector in the vector set to be matched, respectively matching a sub-vector sequence in a second vector sequence taking the vector as an initial vector with each first vector sequence to obtain at least one sub-vector sequence matched with the retrieval audio;
and obtaining a retrieval result according to at least one sub-vector sequence.
Optionally, for each vector in the vector set to be matched, matching a sub-vector sequence in a second vector sequence with the vector as a starting vector with each first vector sequence, to obtain at least one sub-vector sequence matched with the search audio, includes:
sequentially calculating the exclusive or value of each first vector in the sub-vector sequences and a corresponding second vector in one first vector sequence, wherein the position of the first vector in the sub-vector sequences is the same as that of the second vector in the first vector sequences;
accumulating the calculated exclusive or values to obtain a similarity result;
and if the similarity result meets a preset threshold value, determining that the sub-vector sequence is matched with the retrieval audio.
Optionally, obtaining a search result according to at least one sub-vector sequence, including:
performing de-duplication processing on at least one sub-vector sequence to obtain a processed sub-vector sequence;
determining whether to add the sub-vector sequence into a retrieval result set to be selected according to the similarity result of the sub-vector sequence;
if the retrieval audio is matched with each audio in the retrieved audio library, sorting each sub-vector sequence in the retrieval result set to be selected according to the similarity result, and selecting the sub-vector sequences with the preset number in the sorting result;
and taking the audio segments corresponding to the sub-vector sequences with the preset number as retrieval results.
Optionally, generating an index value corresponding to each first vector sequence according to a value in a preset dimension of a target vector in each first vector sequence includes:
and taking the value of the last dimension of the target vector in each first vector sequence as the index value corresponding to each first vector sequence.
Optionally, obtaining at least one first vector sequence of the search audio by using a classification model obtained by pre-training, including:
performing inflexion (speed-change) processing on the retrieval audio to obtain a plurality of target audios, wherein the audio speeds of the target audios are different;
and respectively adopting a classification model to obtain a first vector sequence of each target audio.
Optionally, obtaining the first vector sequence of each target audio by using a classification model respectively includes:
extracting a mel frequency cepstrum coefficient characteristic vector of the target audio;
inputting the Mel frequency cepstrum coefficient feature vector into a classification model according to a preset window length and a preset frame moving number to obtain an embedded vector;
and performing local hash mapping on the embedded vectors to obtain a first vector sequence of the target audio, wherein the value of the last dimension of each vector in the first vector sequence is the number of parameters with value 1 in that vector.
In another aspect of the embodiments of the present application, there is provided a speech retrieval apparatus, including: the system comprises an audio acquisition module, a vector determination module, an index acquisition module, a result retrieval module and a result output module;
the audio acquisition module is used for acquiring retrieval audio, and the retrieval audio comprises at least one keyword;
the vector determination module is used for obtaining at least one first vector sequence of the retrieval audio by adopting a classification model obtained by pre-training, and each first vector sequence corresponds to an audio speed; generating an index value corresponding to each first vector sequence according to a value in a preset dimension of a target vector in each first vector sequence, wherein the target vector is a first vector in the first vector sequence;
the index acquisition module is used for acquiring an index value mapping table of stored audio in the audio library to be retrieved, each row of the index value mapping table is used for recording the mapping relation between one index value and one or more vectors in a second vector sequence of the stored audio, and vectors with equal values in a preset dimension are mapped to the same index value;
the result retrieval module is used for retrieving the retrieved audio library to obtain a retrieval result according to the index value corresponding to each first vector sequence and the index value mapping table;
and the result output module is used for outputting the retrieval result.
Optionally, the result retrieval module is specifically configured to screen out, from the index value mapping table, a target row whose index value is an index value corresponding to each first vector sequence, and form a set of vectors to be matched from vectors in the target row; aiming at each vector in the vector set to be matched, respectively matching a sub-vector sequence in a second vector sequence taking the vector as an initial vector with each first vector sequence to obtain at least one sub-vector sequence matched with the retrieval audio; and obtaining a retrieval result according to at least one sub-vector sequence.
Optionally, the result retrieving module is specifically configured to sequentially calculate an exclusive or value between each first vector in the sub-vector sequences and a corresponding second vector in one first vector sequence, where a position of the first vector in the sub-vector sequences is the same as a position of the second vector in the first vector sequences; accumulating the calculated exclusive or values to obtain a similarity result; and if the similarity result meets a preset threshold value, determining that the sub-vector sequence is matched with the retrieval audio.
Optionally, the result retrieval module is specifically configured to perform de-duplication processing on at least one sub-vector sequence to obtain a processed sub-vector sequence; determining whether to add the sub-vector sequence into a retrieval result set to be selected according to the similarity result of the sub-vector sequence; if the retrieval audio is matched with each audio in the retrieved audio library, sorting each sub-vector sequence in the retrieval result set to be selected according to the similarity result, and selecting the sub-vector sequences with the preset number in the sorting result; and taking the audio segments corresponding to the sub-vector sequences with the preset number as retrieval results.
Optionally, the vector determining module is specifically configured to use a value of a last dimension of the target vector in each first vector sequence as an index value corresponding to each first vector sequence.
Optionally, the vector determination module is specifically configured to perform inflexion (speed-change) processing on the retrieval audio to obtain a plurality of target audios with different audio speeds, and to obtain a first vector sequence of each target audio using the classification model.
Optionally, the vector determination module is specifically configured to extract a Mel frequency cepstrum coefficient feature vector of the target audio; input the Mel frequency cepstrum coefficient feature vector into the classification model according to a preset window length and a preset number of shifted frames to obtain an embedded vector; and perform local hash mapping on the embedded vectors to obtain a first vector sequence of the target audio, wherein the value of the last dimension of each vector in the first vector sequence is the number of parameters with value 1 in that vector.
In another aspect of the embodiments of the present application, there is provided a computer device, including: a memory and a processor, wherein the memory stores a computer program operable on the processor, and the processor implements the steps of the voice retrieval method when executing the computer program.
In another aspect of the embodiments of the present application, a storage medium is provided, and the storage medium stores a computer program, and the computer program, when executed by a processor, implements the steps of the voice retrieval method.
The beneficial effects of the embodiment of the application include:
according to the voice retrieval method, the voice retrieval device, the voice retrieval equipment and the storage medium, retrieval audio can be obtained; obtaining at least one first vector sequence of the retrieval audio by adopting a classification model obtained by pre-training, wherein each first vector sequence corresponds to an audio speed; generating an index value corresponding to each first vector sequence according to the value of the target vector in the first vector sequence in the preset dimension; acquiring an index value mapping table of stored audio in a retrieved audio library; searching the searched audio library according to the index value corresponding to each first vector sequence and the index value mapping table to obtain a search result; and outputting a retrieval result. The index value corresponding to each first vector sequence is generated according to the value in the preset dimension of the target vector in each first vector sequence, each vector can be expressed in an index mode, a retrieval result can be retrieved from the retrieved audio library according to the index value corresponding to each first vector sequence and the index value mapping table, the retrieval can be realized by matching according to the difference of the index values through the index value mapping table, each vector in the whole sequence is prevented from being matched, the complexity of calculation is reduced, and the retrieval efficiency is improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a first schematic flowchart of a voice retrieval method according to an embodiment of the present application;
fig. 2 is a second schematic flowchart of a voice retrieval method according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating matching processing performed in a speech retrieval method according to an embodiment of the present application;
fig. 4 is a third schematic flowchart of a voice retrieval method according to an embodiment of the present application;
fig. 5 is a fourth schematic flowchart of a voice retrieval method according to an embodiment of the present application;
fig. 6 is a fifth flowchart illustrating a voice retrieval method according to an embodiment of the present application;
fig. 7 is a sixth schematic flowchart of a speech retrieval method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a voice retrieval apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it is noted that the terms "first", "second", "third", and the like are used merely for distinguishing between descriptions and are not intended to indicate or imply relative importance.
In the technical field of voice keyword detection, it is generally necessary to locate, within a section of audio, the position consistent with the audio content corresponding to an input keyword, and to give a confidence for that location, thereby retrieving the audio corresponding to the keyword. That is, the voice retrieval method provided in this embodiment of the application is a method of determining the position of a keyword in the retrieved audio library.
The following specifically explains a specific implementation process of the voice retrieval method provided in the embodiment of the present application.
Fig. 1 is a first schematic flowchart of a voice retrieval method according to an embodiment of the present application. Referring to fig. 1, the voice retrieval method includes:
S110: acquiring retrieval audio.
Wherein the retrieval audio comprises at least one keyword.
Alternatively, the retrieved audio may be a recorded file recorded by the user himself or an audio format file downloaded from the internet, which is not limited herein. The search audio may include at least one keyword, and the keyword may be a speech keyword read in any language.
For example: the user can read aloud and record the Chinese word "yesterday" to generate a corresponding recording file; the recording file is the retrieval audio, and the keyword is "yesterday". Alternatively, the user may read out a plurality of words and record one or more audio files; accordingly, the retrieval audio may include a plurality of keywords.
S120: obtaining at least one first vector sequence of the retrieval audio by using a pre-trained classification model.
Wherein, each first vector sequence corresponds to an audio speed.
Optionally, the pre-trained classification model may be an embedding extractor obtained by training on a large corpus labeled at the word level. The extractor may be any neural network classification model, without limitation; the classification model performs vector extraction on the keywords in the input retrieval audio to obtain a first vector sequence corresponding to each keyword.
Alternatively, the first vector sequence corresponding to each keyword may include a plurality of vectors, and different first vector sequences may correspond to different play speeds of the search audio including the same keyword.
For example, the aforementioned retrieval audio with the keyword "yesterday" may be input into the pre-trained classification model for vector extraction, giving a plurality of first vector sequences corresponding to "yesterday". These vector sequences are obtained by adjusting the playback speed of the retrieval audio containing the keyword: the audio is played at the original speed, at an accelerated speed, at a decelerated speed, and so on, and each version is input into the pre-trained classification model for vector extraction, yielding respectively a first vector sequence corresponding to the original speed, one corresponding to the accelerated speed and one corresponding to the decelerated speed.
S130: generating an index value corresponding to each first vector sequence according to the value of the target vector in the first vector sequence in the preset dimension.
Wherein the target vector is a first vector in the first vector sequence.
Optionally, each first vector sequence may include a plurality of vectors, and the vectors are arranged in the first vector sequence according to a certain sequence, where the sequence of the vectors may be arranged according to the time sequence of the audio corresponding to the vectors.
Optionally, each vector may further include a plurality of dimensions, each dimension is provided with a value, and an index value corresponding to each first vector sequence may be established according to a value in a preset dimension of the target vector. Wherein the predetermined dimension may be one dimension of each vector, for example: the preset dimension may refer to the last dimension of each vector.
Alternatively, the index value may be a symbolic identifier indicating the vector sequence in which a vector is located; the first vector sequence corresponding to an index value can be determined from that index value, and if there are multiple first vector sequences, a corresponding index value may be set for each of them.
S140: acquiring an index value mapping table of the stored audio in the retrieved audio library.
Each row of the index value mapping table is used for recording a mapping relation between an index value and one or more vectors in the second vector sequence of the stored audio, and vectors with equal values in a preset dimension are mapped to the same index value.
Alternatively, the retrieved audio library may be a database built from stored audio. The database may include an index value mapping table corresponding to the stored audio; the table may have multiple rows and multiple columns, each row representing one or more vectors in the second vector sequence of the stored audio that correspond to the same index value. The second vector sequence may be a vector sequence obtained by extracting vectors from the stored audio with the pre-trained classification model; it likewise includes a plurality of vectors, each of which has a plurality of dimensions with a value in each dimension. Vectors with equal values in the preset dimension are mapped to the same index value. The index value mapping table is shown in table 1 below:
TABLE 1
N1: e0, e1, e5, ...
N2: e2, e9, e15, ...
...
Ni: ...
where N1 to Ni are the different index values and e0 to e(m-1) are the vectors of the second vector sequence; i is the total number of index values and m is the total number of vectors in the second vector sequence.
The first row of table 1 indicates that, among the vectors of the second vector sequence, those whose index value is N1 include e0, e1, e5 and so on; the other rows are read in the same way and are not described again here.
Optionally, the predetermined dimension of each vector in the second vector sequence is the same as the predetermined dimension of the target vector in the first vector sequence, for example: are all the last dimension.
Optionally, a specific establishment process of the index value mapping table is as follows:
multiple stored audios may be acquired and pre-processed, for example by removing the silent portions at the head and the tail. Mel Frequency Cepstral Coefficient (MFCC) feature vectors of the stored audio are then extracted and input into the classification model according to a preset window length and a preset number of shifted frames to obtain embedded vectors. The preset window length may be 50 frames, and the preset number of shifted frames may be 10 frames.
After the embedded vector is obtained, local hash mapping may be performed. The mapping is:
bt = A · xt
where bt is the mapped vector, of dimension m; A is an m × n′ random matrix serving as the hash mapping matrix; and xt is the embedded vector. In this embodiment, m may be 128 and n′ may be 256. By the principle of hash mapping, if two vectors are similar in the original vector space, with high probability they remain similar in the new space after mapping; if the original two vectors are not similar, the probability that they are similar after mapping is extremely low.
Optionally, the mapped vector may be processed further. bt consists of 128 positive and negative floating-point numbers; since each dimension in a high-dimensional vector can represent a direction, each positive floating-point number in bt may be encoded as 1 and each negative one as 0. The resulting 0/1 vector is denoted ct.
The number of 1s in ct may then be counted and recorded as N, and N is appended to ct as a new dimension; N serves as the index value of ct during retrieval. The vector containing N is denoted et. et is the vector finally saved, stored in the index value mapping table of the retrieved audio library.
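The construction can be sketched in Python as follows; the 256-dimensional embedding size, the random seed, and the choice to store vector positions (rather than vector copies) in each row of the table are illustrative assumptions, not details fixed by the embodiment:

```python
import numpy as np
from collections import defaultdict

RNG = np.random.default_rng(0)
A = RNG.standard_normal((128, 256))     # hash mapping matrix A of size m x n' (m=128, n'=256)

def lsh_encode(x_t):
    """Map an embedded vector x_t to e_t: 128 sign bits plus the 1-count N."""
    b_t = A @ x_t                        # b_t = A * x_t
    c_t = (b_t > 0).astype(np.uint8)     # positive floats -> 1, negative -> 0
    n = int(c_t.sum())                   # N: number of 1s, used as the index value
    return np.append(c_t, n)             # e_t: c_t with N appended as the last dimension

def build_index_table(second_vector_sequence):
    """Index value mapping table: each row maps an index value N_i to the
    positions of the vectors e_t of the second vector sequence that share it."""
    table = defaultdict(list)
    for pos, e_t in enumerate(second_vector_sequence):
        table[int(e_t[-1])].append(pos)
    return table
```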
S150: retrieving the retrieved audio library according to the index value corresponding to each first vector sequence and the index value mapping table.
Optionally, since the preset dimension of each vector in the second vector sequence is the same as the preset dimension of the target vector in the first vector sequence, the index value of a vector in the second vector sequence may equal the index value of the target vector in the first vector sequence. The first vector sequence can be matched against the second vector sequence according to this equality to obtain a matching result; if the matching succeeds, the matching result may be taken as the retrieval result.
The retrieval result may include the position of the first vector sequence within the second vector sequence; from this positional relationship, the specific position of the retrieval audio within the stored audio, for example its start and end times, can also be obtained, thereby completing the retrieval. Optionally, the retrieval result may further include a confidence for the position of the first vector sequence in the second vector sequence, which characterizes the accuracy of the retrieval result.
S160: outputting the retrieval result.
Optionally, the search result may be output for display to allow the user to determine the start and stop time of the search audio in the stored audio, and the confidence level of the search result.
For example: all the retrieval results and the corresponding confidence of each retrieval result can be displayed in a table establishing mode.
According to the voice retrieval method provided by this embodiment of the application, retrieval audio can be acquired; at least one first vector sequence of the retrieval audio is obtained by using a pre-trained classification model, each first vector sequence corresponding to an audio speed; an index value corresponding to each first vector sequence is generated according to the value, in a preset dimension, of the target vector in that sequence; an index value mapping table of the stored audio in a retrieved audio library is acquired; the retrieved audio library is retrieved according to the index values and the mapping table to obtain a retrieval result; and the retrieval result is output. Because the index value is generated from the value, in a preset dimension, of the target vector in each first vector sequence, each vector can be expressed by an index; retrieval is then performed by comparing index values through the index value mapping table, which avoids matching every vector of the whole sequence, reduces computational complexity, and improves retrieval efficiency.
The following is a detailed explanation of another implementation procedure of the speech retrieval method in the embodiments provided in the present application.
Fig. 2 is a second schematic flowchart of a voice retrieval method according to an embodiment of the present application. Referring to fig. 2, retrieving the retrieved audio library to obtain a retrieval result according to the index value corresponding to each first vector sequence and the index value mapping table includes:
S210: screening, from the index value mapping table, the target rows whose index values correspond to the first vector sequences, and forming a vector set to be matched from the vectors in the target rows.
Optionally, the corresponding target row may be found in the index value mapping table according to the index value of each first vector sequence. As shown in table 1, if the index value of a first vector sequence is N1, the first row of the index value mapping table, i.e. the N1 row, is taken as the target row. Correspondingly, the vectors of the second vector sequence recorded in the target row, e.g. e0, e1, e5 and so on, form the vector set to be matched.
Optionally, table 1 shows only part of the index value mapping table. In actual computation, a plurality of index values may be determined according to the specific size of the preset dimension of each vector in the second vector sequence; each index value occupies one row of the table, and the corresponding target row is found from the table according to the specific index value of the first vector sequence.
S220: for each vector in the vector set to be matched, matching the sub-vector sequence in the second vector sequence that takes the vector as its starting vector against each first vector sequence, to obtain at least one sub-vector sequence matching the retrieval audio.
Alternatively, the sequence of subvectors may be a sequence of a plurality of consecutive vectors in the second vector sequence.
For each vector in the vector set to be matched, the sub-vector sequence in the second vector sequence that takes the vector as its starting vector is matched against each first vector sequence. For example, the method may specifically be:
if the vectors in the set of vectors to be matched are e2, e6 and e7, then the sub-vector sequences in the second vector sequence starting at e2, e6 and e7 are each matched against each first vector sequence; if one of the first vector sequences is k2-k3, then the sub-vector sequences e2-e3, e6-e7 and e7-e8 are each matched against k2-k3.
Alternatively, if the first vector in the first vector sequence is taken as the target vector when determining the target vector in S130, each vector in the set of vectors to be matched may be taken as the starting vector, and then a sub-vector sequence with the vector as the starting point is obtained, where the length of the sub-vector sequence is equal to the length of the first vector sequence.
Accordingly, if the last vector in the first vector sequence is taken as the target vector when determining the target vector in S130, each vector in the set of vectors to be matched may be taken as a termination vector, and then a sub-vector sequence with the vector as an end point is obtained, where the length of the sub-vector sequence is equal to the length of the first vector sequence.
S230: obtaining a retrieval result according to the at least one sub-vector sequence.
Optionally, after the plurality of sub-vector sequences are obtained, the accumulation processing described below may be applied to each of them, and a similarity threshold may be set; the matching confidence of each sub-vector sequence is determined against this threshold. Accordingly, the confidence and the position of each sub-vector sequence in the second vector sequence may be taken as a retrieval result, with the corresponding confidence determined for each retrieval result.
The following explains an implementation process of performing matching processing provided in the embodiments of the present application with a specific schematic diagram.
Fig. 3 is a schematic diagram illustrating the matching processing performed in a voice retrieval method according to an embodiment of the present application. Referring to fig. 3, fig. 3 includes a second vector sequence 310 and a first vector sequence 320, where the second vector sequence 310 includes a plurality of vectors, e0 to ei, and the first vector sequence is explained using an example comprising two vectors.
Take the first vector in the first vector sequence as the target vector. If the index value of the target vector is N3, the N3 row is found from the index value mapping table as the target row, and the vectors in that row, e.g. e0, e8 and e9, are determined to form the vector set to be matched. The length of each sub-vector sequence is determined by the length of the first vector sequence: since the first vector sequence here contains 2 vectors, each sub-vector sequence also contains 2 vectors. Taking e0, e8 and e9 respectively as starting points, the corresponding sub-vector sequences e0-e1, e8-e9 and e9-e10 are obtained.
Correspondingly, if there are multiple first vector sequences, the above method may be respectively adopted to perform multiple matching to find corresponding sub-vector sequences respectively.
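Continuing the sketch above, the screening-and-slicing step might look as follows (the function name and generator interface are assumptions):

```python
def candidate_subsequences(table, second_seq, first_seq):
    """Screen the target row by the index value of the target (first) vector,
    then slice sub-vector sequences of the same length as the first vector
    sequence, each starting at a vector from that row."""
    index_value = int(first_seq[0][-1])   # last dimension of the target vector
    length = len(first_seq)
    for start in table.get(index_value, []):
        if start + length <= len(second_seq):
            yield start, second_seq[start:start + length]
```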
The following specifically explains a specific implementation process of matching by the voice retrieval method provided in the embodiment of the present application.
Fig. 4 is a third schematic flowchart of a voice retrieval method according to an embodiment of the present application. Referring to fig. 4, matching, for each vector in the set of vectors to be matched, the sub-vector sequence in the second vector sequence that takes the vector as its starting vector against each first vector sequence to obtain at least one sub-vector sequence matching the retrieval audio includes:
S410: sequentially calculating the exclusive-or value of each first vector in the sub-vector sequence and the corresponding second vector in one first vector sequence.
Wherein the position of the first vector in the sequence of sub-vectors is the same as the position of the second vector in the sequence of first vectors.
Alternatively, each vector in the sub-vector sequence may be taken as a first vector, each vector in the first vector sequence may be taken as a second vector, and each first vector corresponds to one second vector according to the same position in the vector sequence.
For example: suppose the sub-vector sequence is e8-e9 and there are three first vector sequences, kt1, kt2 and kt3, which respectively correspond to different audio speeds of the same retrieval audio. Taking the first vector sequence kt1 as an example, assume kt1 comprises two vectors, k0-k1. Then when e8 is the first vector, k0 is the second vector corresponding to e8; accordingly, when e9 is the first vector, k1 is the second vector corresponding to e9.
Alternatively, the exclusive-or value of each first vector and its corresponding second vector may be calculated in turn, with each vector written in binary, one bit per dimension. For example, a first vector may be represented as 100 (its first dimension is 1, its second 0 and its third 0) and a second vector as 010 (its first dimension is 0, its second 1 and its third 0). The exclusive-or is taken dimension by dimension: the first dimensions of the two vectors differ, so the exclusive-or value is 1; the second dimensions also differ, so the value is again 1; the third dimensions are the same, so the value is 0. The exclusive-or result obtained is therefore 110.
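The worked example, written as bitwise operations for concreteness (a minimal illustration, not part of the embodiment):

```python
first, second = 0b100, 0b010
xor = first ^ second                    # 0b110: dimensions 1 and 2 differ, dimension 3 matches
assert xor == 0b110 and bin(xor).count("1") == 2   # two differing dimensions
```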
Optionally, since the plurality of first vector sequences are vector sequences of the same retrieval audio at different audio speeds, a sub-vector sequence that successfully matches any one of the first vector sequences during matching means the retrieval audio matches the corresponding position in the retrieved audio library.
S420: accumulating the calculated exclusive-or values to obtain a similarity result.
Alternatively, after determining the xor value of the first vector and the second vector of each pair, the xor value of each first vector and the second vector may be accumulated to determine the corresponding similarity result.
Taking the above exclusive-or result 110 as an example, the values of its dimensions are added, i.e. 1 + 1 + 0, so the contribution of this pair to the similarity result is 2.
The similarity result formula is calculated as follows:
M=S/n;
where M is the similarity result, S is the accumulated value (the exclusive-or values of all first/second vector pairs added together), and n is the number of comparisons, i.e. the number of first/second vector pairs, which equals the length of the sub-vector sequence.
S430: if the similarity result meets a preset threshold, determining that the sub-vector sequence matches the retrieval audio.
Optionally, a preset threshold M0 may be set. If M > M0, it may be concluded that more than half of the vector directions in the current comparison are dissimilar, i.e. the preset threshold is not met. If M ≤ M0, the similarity result is determined to meet the preset threshold, and the sub-vector sequence and the retrieval audio are determined to match each other.
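A sketch of S410 to S430 in Python, assuming the vector layout of the earlier sketches (128 sign bits plus the count N in the last dimension, which is excluded from the comparison):

```python
import numpy as np

def similarity(sub_seq, first_seq):
    """M = S / n: S accumulates the exclusive-or values over all vector pairs,
    n is the number of pairs (the length of the sub-vector sequence)."""
    s = sum(int(np.bitwise_xor(f[:-1], g[:-1]).sum())   # XOR over the 128 bit dims
            for f, g in zip(sub_seq, first_seq))
    return s / len(first_seq)

def matches(sub_seq, first_seq, m0):
    """S430: a smaller M means fewer differing directions, i.e. more similar."""
    return similarity(sub_seq, first_seq) <= m0
```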
The following is a specific explanation of a specific process for obtaining a search result according to at least one sub-vector sequence provided in the embodiment of the present application.
Fig. 5 is a fourth schematic flowchart of a voice retrieval method according to an embodiment of the present application. Referring to fig. 5, obtaining a retrieval result according to at least one sub-vector sequence includes:
S510: performing de-duplication processing on the at least one sub-vector sequence to obtain the processed sub-vector sequences.
Optionally, an NMS (Non-Maximum Suppression) algorithm may be used to perform de-duplication processing on at least one sub-vector sequence to obtain a processed sub-vector sequence.
The specific de-duplication process is as follows:
a similarity result is obtained for each sub-vector sequence, and the sub-vector sequences are sorted by similarity result from small to large. All sub-vector sequences are then traversed: for each pair of adjacent sub-vector sequences, the overlapping region is computed and compared with a preset overlap decision threshold. If the overlap exceeds the threshold, the less similar of the two sub-vector sequences is deleted. The remaining sub-vector sequences are the processed sub-vector sequences.
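A minimal sketch of this de-duplication; where the translation is ambiguous it assumes the standard NMS choices, namely a fractional overlap threshold and keeping the better-scoring (more similar) of two overlapping hits:

```python
def nms_dedup(candidates, overlap_ratio=0.5):
    """candidates: (start, length, similarity) triples; smaller similarity = better,
    per the threshold test of S430. Greedy 1-D non-maximum suppression."""
    kept = []
    for start, length, sim in sorted(candidates, key=lambda c: c[2]):
        end = start + length
        for k_start, k_length, _ in kept:
            overlap = min(end, k_start + k_length) - max(start, k_start)
            if overlap > overlap_ratio * length:
                break                     # overlaps a better hit too much: drop it
        else:
            kept.append((start, length, sim))
    return kept
```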
S520: determining, according to the similarity result of each sub-vector sequence, whether to add it to the candidate retrieval result set.
Optionally, whether to add the sub-vector sequence to the candidate retrieval result set may be determined according to the similarity result obtained in the foregoing S420, and if the similarity result satisfies a preset threshold, the sub-vector sequence may be determined to be added to the candidate retrieval result set.
The candidate retrieval result set may include a plurality of sub-vector sequences whose similarity results satisfy a preset threshold.
S530: if the retrieval audio has been matched against each audio in the retrieved audio library, sorting the sub-vector sequences in the candidate retrieval result set according to the similarity result, and selecting a preset number of sub-vector sequences from the sorted result.
Optionally, if all the first vector sequences of the retrieval audio have been matched against the corresponding sub-vector sequences in the retrieved audio library, the sub-vector sequences in the candidate retrieval result set may be sorted according to the similarity result, for example in descending order, and the top P sub-vector sequences selected from the sorted result, where P is a preset number that can be set according to the actual requirements of the user and is not limited here.
S540: taking the audio segments corresponding to the preset number of sub-vector sequences as the retrieval results.
Optionally, after selecting P number of sub-vector sequences, audio segments corresponding to the P sub-vector sequences may be determined, where the audio segments are segments in stored audio corresponding to the retrieved audio library.
Optionally, the audio segments and the start time and the end time of the audio segments in the stored audio may be used as the retrieval result, and accordingly, the similarity result corresponding to the sub-vector sequence may also be used as the confidence.
Optionally, the selected sub-vector sequences may be maintained as a data heap; when a new sub-vector sequence is input, the heap is updated and reordered so that it always contains P retrieval results.
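One way to realize such a data heap is a fixed-size priority queue; the sketch below is an assumed implementation, not one prescribed by the embodiment:

```python
import heapq

class TopP:
    """Pool of the P best hits. A smaller similarity result is better (per S430),
    so the heap is keyed on negated similarity and evicts the worst hit when
    it grows past P."""
    def __init__(self, p):
        self.p = p
        self._heap = []                    # entries: (-similarity, start, length)

    def push(self, similarity, start, length):
        heapq.heappush(self._heap, (-similarity, start, length))
        if len(self._heap) > self.p:
            heapq.heappop(self._heap)      # pops the largest similarity, i.e. the worst hit

    def results(self):
        return sorted((-neg, start, length) for neg, start, length in self._heap)
```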
Optionally, generating an index value corresponding to each first vector sequence according to a value in a preset dimension of a target vector in each first vector sequence includes:
and taking the value of the last dimension of the target vector in each first vector sequence as the index value corresponding to each first vector sequence.
Optionally, the last dimension of the target vector may be used as a preset dimension, that is, the value of the last dimension of the target vector in each first vector sequence may be used as the index value corresponding to each first vector sequence.
The following explains a specific implementation procedure of obtaining the first vector sequence of the retrieved audio provided in the embodiment of the present application.
Fig. 6 is a fifth schematic flowchart of a voice retrieval method according to an embodiment of the present application. Referring to fig. 6, obtaining at least one first vector sequence of the retrieval audio using the pre-trained classification model includes:
S610: performing inflexion (speed-change) processing on the retrieval audio to obtain a plurality of target audios.
Wherein the audio speeds of the target audio are different.
Optionally, before processing with the classification model, the retrieval audio may be subjected to inflexion (speed-change) processing, that is, the original retrieval audio is sped up or slowed down to obtain a plurality of target audios; the target audios are the results of applying different speed changes to the same retrieval audio, for example: sped-up audio, slowed-down audio, and so on.
Optionally, before the inflexion processing, the retrieval audio may also be pre-processed, for example by removing the silent portions at its beginning and end.
S620: obtaining a first vector sequence of each target audio using the classification model.
Optionally, the target audios may each be input into the classification model for extraction, yielding a first vector sequence corresponding to each target audio, i.e. the plurality of first vector sequences.
The sped-up audio yields the first vector sequence corresponding to the accelerated speed, the slowed-down audio yields the first vector sequence corresponding to the decelerated speed, and the audio not subjected to inflexion processing yields the first vector sequence corresponding to the original speed.
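A sketch of the inflexion (speed-change) step using librosa; the tooling choice, sample rate and rate set are assumptions, since the embodiment only requires original, sped-up and slowed-down versions:

```python
import librosa

def speed_variants(path, rates=(0.9, 1.0, 1.1)):
    """Return the retrieval audio at several playback speeds (assumed rates)."""
    y, sr = librosa.load(path, sr=16000)           # 16 kHz mono is an assumption
    y, _ = librosa.effects.trim(y)                 # pre-processing: drop head/tail silence
    return [librosa.effects.time_stretch(y, rate=r) for r in rates]
```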
The following explains the specific implementation process of obtaining the first vector sequence of each target audio provided in the embodiments of the present application in detail.
Fig. 7 is a sixth schematic flowchart of a voice retrieval method according to an embodiment of the present application. Referring to fig. 7, obtaining the first vector sequence of each target audio using the classification model includes:
S710: extracting the Mel frequency cepstrum coefficient feature vector of the target audio.
Optionally, Mel Frequency Cepstral Coefficient (MFCC) feature vectors are features widely used in automatic speech and speaker recognition; they characterize audio in vector form. The specific extraction method is as follows:
(1) pre-emphasis, framing and windowing are carried out on the target audio;
(2) performing fast Fourier transform on the processed target audio to obtain a short-time frequency spectrum corresponding to the target audio;
(3) converting the frequency of the short-time spectrum through a Mel filter bank and taking the logarithm;
(4) performing discrete cosine transform on the resulting log mel spectrum to obtain the Mel frequency cepstrum coefficient feature vector of the target audio.
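A sketch of steps (1) to (4) using librosa, which performs the framing, FFT, mel filtering, logarithm and DCT internally in librosa.feature.mfcc; the coefficient count and frame sizes are assumptions:

```python
import librosa

def mfcc_features(y, sr=16000):
    """Extract MFCC frames: (1) pre-emphasis, then (2)-(4) inside feature.mfcc."""
    y = librosa.effects.preemphasis(y)             # (1) pre-emphasis
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160).T   # (frames, 13)
```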
S720: inputting the Mel frequency cepstrum coefficient feature vector into the classification model according to the preset window length and the preset number of shifted frames to obtain an embedded vector.
Optionally, the preset window length may be 50 frames and the preset number of shifted frames may be 50 frames; the feature vectors are input into the classification model window by window for vector processing, yielding the embedded vectors, which are high-dimensional embedded feature vectors.
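The windowing itself is simple slicing; a minimal sketch, with the classification model call omitted:

```python
import numpy as np

def model_windows(mfcc_frames, win=50, hop=50):
    """Cut the MFCC frame sequence (shape: frames x coefficients) into model
    inputs of `win` frames, advancing `hop` frames at a time."""
    return [mfcc_frames[i:i + win]
            for i in range(0, len(mfcc_frames) - win + 1, hop)]
```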
S730: performing local hash mapping on the embedded vector to obtain the first vector sequence of the target audio.
The value of the last dimension of each vector in the first vector sequence is the number of parameters with value 1 in that vector.
Optionally, the determining method of the first vector sequence is similar to the determining process of determining each vector in the index value mapping table in S140, and is not repeated herein.
The following describes the apparatus, device, storage medium and the like for executing the voice retrieval method provided by the present application; for their specific implementation processes and technical effects, refer to the description above, which is not repeated below.
Fig. 8 is a schematic structural diagram of a voice retrieving device according to an embodiment of the present application, and referring to fig. 8, the voice retrieving device includes: the audio retrieval system comprises an audio acquisition module 100, a vector determination module 200, an index acquisition module 300, a result retrieval module 400 and a result output module 500;
an audio acquisition module 100, configured to acquire a search audio, where the search audio includes at least one keyword;
the vector determination module 200 is configured to obtain at least one first vector sequence of the retrieved audio by using a classification model obtained through pre-training, where each first vector sequence corresponds to an audio speed; generating an index value corresponding to each first vector sequence according to a value in a preset dimension of a target vector in each first vector sequence, wherein the target vector is a first vector in the first vector sequence;
the index obtaining module 300 is configured to obtain an index value mapping table of stored audio in the retrieved audio library, where each row of the index value mapping table is used to record a mapping relationship between an index value and one or more vectors in a second vector sequence of stored audio, and vectors with equal values in a preset dimension are mapped to the same index value;
a result retrieval module 400, configured to retrieve a retrieval result from the retrieved audio library according to the index value corresponding to each first vector sequence and the index value mapping table;
and a result output module 500, configured to output the search result.
Optionally, the result retrieval module 400 is specifically configured to screen out, from the index value mapping table, a target row whose index value is an index value corresponding to each first vector sequence, and form a set of vectors to be matched from vectors in the target row; aiming at each vector in the vector set to be matched, respectively matching a sub-vector sequence in a second vector sequence taking the vector as an initial vector with each first vector sequence to obtain at least one sub-vector sequence matched with the retrieval audio; and obtaining a retrieval result according to at least one sub-vector sequence.
Optionally, the result retrieving module 400 is specifically configured to sequentially calculate an exclusive or value between each first vector in the sub-vector sequences and a corresponding second vector in one first vector sequence, where a position of the first vector in the sub-vector sequences is the same as a position of the second vector in the first vector sequences; accumulating the calculated exclusive or values to obtain a similarity result; and if the similarity result meets a preset threshold value, determining that the sub-vector sequence is matched with the retrieval audio.
Optionally, the result retrieving module 400 is specifically configured to perform de-duplication processing on at least one sub-vector sequence to obtain a processed sub-vector sequence; determining whether to add the sub-vector sequence into a retrieval result set to be selected according to the similarity result of the sub-vector sequence; if the retrieval audio is matched with each audio in the retrieved audio library, sorting each sub-vector sequence in the retrieval result set to be selected according to the similarity result, and selecting the sub-vector sequences with the preset number in the sorting result; and taking the audio segments corresponding to the sub-vector sequences with the preset number as retrieval results.
Optionally, the vector determining module 200 is specifically configured to use a value of a last dimension of the target vector in each first vector sequence as an index value corresponding to each first vector sequence.
Optionally, the vector determination module 200 is specifically configured to perform variable-speed processing on the retrieval audio to obtain multiple target audios, where the audio speeds of the target audios are different from one another; and to obtain a first vector sequence of each target audio by respectively using the classification model.
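One plausible realization of the variable-speed processing is time-stretching the query at several rates, e.g. with librosa; the patent does not prescribe this library, and the rates used here are arbitrary placeholders:

```python
import librosa

def speed_variants(y, rates=(0.9, 1.0, 1.1)):
    """Produce speed-changed copies of the query waveform so that a keyword
    spoken faster or slower in the stored audio can still be matched."""
    return [librosa.effects.time_stretch(y, rate=r) for r in rates]
```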
Optionally, the vector determination module 200 is specifically configured to extract a Mel-frequency cepstral coefficient (MFCC) feature vector of the target audio; to input the MFCC feature vector into the classification model according to a preset window length and a preset frame shift to obtain embedded vectors; and to perform locality-sensitive hash mapping on the embedded vectors to obtain the first vector sequence of the target audio, where the last dimension of each vector in the first vector sequence represents the number of parameters in the vector whose value is 1.
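Pulling the pieces together, a hedged end-to-end sketch of the feature pipeline. The mean-pooled MFCC window merely stands in for the pre-trained classification model's embedding, and the random-hyperplane hash is one common form of locality-sensitive hashing; window length, frame shift and plane count are illustrative:

```python
import numpy as np
import librosa

def first_vector_sequence(y, sr, planes, win=32, hop=16):
    """MFCCs -> windowed embeddings -> LSH sign bits, with the popcount
    (number of 1-bits) appended as the last dimension / index value."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T    # (frames, 13)
    vectors = []
    for start in range(0, len(mfcc) - win + 1, hop):
        emb = mfcc[start:start + win].mean(axis=0)          # stand-in for the model
        bits = (emb @ planes.T > 0).astype(np.uint8)        # one bit per hyperplane
        vectors.append(np.append(bits, bits.sum()))         # popcount as index value
    return np.array(vectors)

# planes = np.random.randn(64, 13)  # one random hyperplane per hash bit
```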
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
These modules may be implemented as one or more integrated circuits configured to implement the above methods, for example: one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs), among others. As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. As yet another example, these modules may be integrated together and implemented in the form of a system-on-chip (SoC).
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. Referring to fig. 9, the computer device includes a memory 910 and a processor 920, where the memory 910 stores a computer program operable on the processor 920, and the processor 920 implements the steps of the above voice retrieval method when executing the computer program.
In another aspect of the embodiments of the present application, a storage medium is further provided, where a computer program is stored, and when the computer program is executed by a processor, the steps of the above-mentioned voice retrieval method are implemented.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is merely a logical division, and an actual implementation may adopt another division: multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The above description covers only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto; any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed herein shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
The above description is only a preferred embodiment of the present application and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall fall within its scope of protection.

Claims (9)

1. A method for voice retrieval, comprising:
acquiring retrieval audio, wherein the retrieval audio comprises at least one keyword;
obtaining at least one first vector sequence of the retrieval audio by using a classification model obtained through pre-training, wherein each first vector sequence corresponds to one audio speed; generating an index value corresponding to each first vector sequence according to the value in a preset dimension of a target vector in each first vector sequence, wherein the target vector is the first vector in the first vector sequence;
acquiring an index value mapping table of stored audio in a retrieved audio library, wherein each row of the index value mapping table records a mapping relationship between an index value and one or more vectors in a second vector sequence of the stored audio, and vectors with equal values in the preset dimension are mapped to the same index value;
retrieving a retrieval result from the retrieved audio library according to the index value corresponding to each first vector sequence and the index value mapping table;
outputting the retrieval result;
wherein the retrieving a retrieval result from the retrieved audio library according to the index value corresponding to each first vector sequence and the index value mapping table comprises:
screening out, from the index value mapping table, target rows whose index values are the index values corresponding to the first vector sequences, and forming a vector set to be matched from the vectors in the target rows;
for each vector in the vector set to be matched, matching the sub-vector sequence in the second vector sequence that takes the vector as its starting vector against each first vector sequence, to obtain at least one sub-vector sequence matching the retrieval audio;
and obtaining the retrieval result according to the at least one sub-vector sequence.
2. The method according to claim 1, wherein the matching, for each vector in the vector set to be matched, of the sub-vector sequence in the second vector sequence that takes the vector as its starting vector against each first vector sequence to obtain at least one sub-vector sequence matching the retrieval audio comprises:
sequentially calculating an exclusive-or value between each first vector in the sub-vector sequence and the corresponding second vector in one first vector sequence, wherein the position of the first vector in the sub-vector sequence is the same as the position of the second vector in the first vector sequence;
accumulating the calculated exclusive or values to obtain a similarity result;
and if the similarity result meets a preset threshold value, determining that the sub-vector sequence is matched with the retrieval audio.
3. The method according to claim 2, wherein the obtaining the retrieval result according to the at least one sub-vector sequence comprises:
performing de-duplication processing on the at least one sub-vector sequence to obtain processed sub-vector sequences;
determining, according to the similarity result of each sub-vector sequence, whether to add the sub-vector sequence to a candidate retrieval result set;
after the retrieval audio has been matched against each audio in the retrieved audio library, sorting the sub-vector sequences in the candidate retrieval result set according to their similarity results, and selecting a preset number of sub-vector sequences from the sorted result;
and taking the audio segments corresponding to the preset number of sub-vector sequences as the retrieval result.
4. The method according to any one of claims 1-3, wherein the generating an index value corresponding to each first vector sequence according to the value in a preset dimension of a target vector in each first vector sequence comprises:
and taking the value of the last dimension of the target vector in each first vector sequence as the index value corresponding to each first vector sequence.
5. The method according to any one of claims 1-3, wherein the obtaining at least one first vector sequence of the retrieval audio by using a classification model obtained through pre-training comprises:
performing variable-speed processing on the retrieval audio to obtain a plurality of target audios, wherein the audio speeds of the target audios are different from one another;
and obtaining a first vector sequence of each target audio by respectively using the classification model.
6. The method according to claim 5, wherein the obtaining a first vector sequence of each target audio by respectively using the classification model comprises:
extracting a Mel-frequency cepstral coefficient (MFCC) feature vector of the target audio;
inputting the MFCC feature vector into the classification model according to a preset window length and a preset frame shift to obtain embedded vectors;
and performing locality-sensitive hash mapping on the embedded vectors to obtain the first vector sequence of the target audio, wherein the value of the last dimension of each vector in the first vector sequence represents the number of parameters in the vector whose value is 1.
7. A voice retrieval apparatus, comprising: an audio acquisition module, a vector determination module, an index acquisition module, a result retrieval module and a result output module;
the audio acquisition module is used for acquiring retrieval audio, and the retrieval audio comprises at least one keyword;
the vector determination module is configured to obtain at least one first vector sequence of the retrieval audio by using a classification model obtained through pre-training, wherein each first vector sequence corresponds to one audio speed; and to generate an index value corresponding to each first vector sequence according to the value in a preset dimension of a target vector in each first vector sequence, wherein the target vector is the first vector in the first vector sequence;
the index acquisition module is configured to acquire an index value mapping table of stored audio in a retrieved audio library, wherein each row of the index value mapping table records a mapping relationship between an index value and one or more vectors in the second vector sequence of the stored audio, and vectors with equal values in the preset dimension are mapped to the same index value;
the result retrieval module is used for retrieving retrieval results from the retrieved audio library according to the index values corresponding to the first vector sequences and the index value mapping table;
the result output module is used for outputting the retrieval result;
the result retrieval module is specifically configured to screen out, from the index value mapping table, target rows whose index values are the index values corresponding to the first vector sequences, and to form a vector set to be matched from the vectors in the target rows; for each vector in the vector set to be matched, to match the sub-vector sequence in the second vector sequence that takes the vector as its starting vector against each first vector sequence, obtaining at least one sub-vector sequence matching the retrieval audio; and to obtain the retrieval result according to the at least one sub-vector sequence.
8. A computer device, comprising: a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
9. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202110133004.7A 2021-02-01 2021-02-01 Voice retrieval method, device, equipment and storage medium Active CN112445934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110133004.7A CN112445934B (en) 2021-02-01 2021-02-01 Voice retrieval method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112445934A (en) 2021-03-05
CN112445934B (en) 2021-04-20

Family

ID=74739758

Country Status (1)

Country Link
CN (1) CN112445934B (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant