WO2016119604A1 - Voice information search method and apparatus, and server - Google Patents

Voice information search method and apparatus, and server

Info

Publication number
WO2016119604A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
feature
target
feature descriptor
speech
Application number
PCT/CN2016/071164
Other languages
French (fr)
Chinese (zh)
Inventor
闻乃松
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
闻乃松 (Wen Naisong)
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited) and 闻乃松 (Wen Naisong)
Publication of WO2016119604A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search

Definitions

  • The present application belongs to the technical field of electronic information signal processing, and in particular relates to a voice information search method, apparatus, and server.
  • a common practice in the art may include acquiring speech features of the audited speech and generating an acoustic model through training in order to create a pronunciation template for each of the pronunciations.
  • the acoustic features to be recognized are matched with the acoustic models of the audited speech one by one, and the pronunciation template closest to the speech to be recognized is selected as the expressed meaning of the speech to be recognized.
  • The speech information is usually divided into a plurality of audio features; for example, with a frame length of 20 ms, a 10-second speech segment generates 500 audio features.
  • The stored audit voices often number in the thousands.
  • Audit voices with the same meaning may include multiple different dialects and expressions.
  • The existing pronunciation-template-based method for audio recognition therefore faces a high-dimensional feature indexing and query process with long query times, which reduces query efficiency.
  • The purpose of the present application is to provide a voice information search method, apparatus, and server, which can extract voice underlying features unrelated to a specific person, perform quantization coding, establish an index, and search the existing target voices in the database through a K-d tree, achieving fast search of voice content and improving query efficiency.
  • the voice information searching method, device and server provided by the application are implemented as follows:
  • a voice information searching method comprising:
  • Extracting a voice feature of a target voice, and generating a feature descriptor of the target voice based on the voice feature;
  • Quantizing and encoding the feature descriptor to generate a quantized encoded feature descriptor, and storing the feature descriptor;
  • Acquiring a feature descriptor of a voice to be recognized; searching, among the stored feature descriptors, for a target voice corresponding to a feature descriptor that matches the feature descriptor of the to-be-recognized voice, and using the found target voice as a target candidate set corresponding to the to-be-recognized voice;
  • Selecting a search result of the to-be-recognized voice from the target candidate set according to a predetermined rule.
  • a voice information searching device comprising:
  • an information acquiring module configured to acquire a target voice and extract a voice feature of the target voice;
  • a descriptor module configured to generate a feature descriptor of the target voice based on the voice feature of the target voice;
  • a quantization coding module configured to perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor;
  • an identification information module configured to acquire a feature descriptor of the voice to be recognized;
  • a first search module configured to search, among the stored feature descriptors, for a target voice corresponding to a feature descriptor that matches the feature descriptor of the to-be-recognized voice, and to use the found target voice as a target candidate set corresponding to the to-be-recognized voice;
  • a second search module configured to select a search result of the to-be-recognized voice from the target candidate set according to a predetermined rule.
  • a voice information search server, comprising:
  • a first processing unit configured to acquire a target voice and generate a feature descriptor of the target voice, and further configured to perform quantization coding on the feature descriptor;
  • a storage unit configured to separately store feature descriptors that share the same path among the quantized encoded feature descriptors;
  • a second processing unit configured to acquire a feature descriptor of the to-be-recognized voice, further configured to search, among the stored feature descriptors, for the target voices whose feature descriptors match that of the to-be-recognized voice so as to acquire a candidate set, and further configured to select a search result of the to-be-recognized voice from the candidate set according to a predetermined rule.
  • The present application provides a voice information search method, apparatus, and server, which can learn and express a phoneme-level model of the target voice information to be audited in a voice information database, and generate feature descriptors to create a feature descriptor index.
  • The generated feature descriptors can be quantized and encoded, reducing the index dimension and information length of the feature descriptors and improving processing speed during indexing.
  • The present application uses a K-d tree to obtain a target candidate set with a smaller search range for the voice to be recognized, and then further filters the candidate set to select the search result.
  • The voice information search method provided by the present application converts traditional high-dimensional, complex pronunciation-template speech recognition into a search for similar audio features; by reducing the index dimension through feature-descriptor quantization coding and optimizing query efficiency with the K-d tree, the voice information search speed is greatly improved.
  • FIG. 1 is a schematic flow chart of an embodiment of a voice information searching method according to the present application.
  • FIG. 2 is a schematic diagram of performing quantization coding on a feature descriptor according to the present application.
  • FIG. 3 is a schematic diagram of an index of establishing a feature descriptor in the present application.
  • FIG. 4 is a schematic structural diagram of a module of an embodiment of a voice information searching apparatus according to the present application.
  • FIG. 5 is a schematic structural diagram of a module of a quantization coding module provided by the present application.
  • FIG. 6 is a schematic structural diagram of a module of a second search module provided by the present application.
  • The voice information search method provided in the present application can add the voice underlying features of different languages corresponding to target keywords or phrases to be audited to a database, perform phoneme-level model learning and expression, generate feature descriptors, and establish an index.
  • A K-d tree is used to obtain the target candidate set of the speech to be recognized, and the search result is then further selected from it.
  • Feature-descriptor quantization coding reduces the index dimension, and the K-d tree optimizes query efficiency, greatly improving voice information search speed.
  • FIG. 1 is a flowchart of a method for a voice information search method according to an embodiment of the present disclosure. As shown in FIG. 1, the method may include:
  • S1 Extract a voice feature of the target voice in the voice information base, and generate a feature descriptor of the target voice.
  • The voice information library may include stored target voices.
  • the target voice can be pre-acquired or set.
  • the target voice may specifically include different content in different application scenarios.
  • the target voice in the voice information base may be a keyword or phrase involving content such as horror, pornography, advertisement, fraud, etc., including multiple dialects or multiple voices.
  • The target voice in the voice information library may include voice keywords or phrases that perform function control on home smart devices such as a smart TV or a speaker, or on a car driving control device.
  • For example, target voice information of keywords or phrases such as "weather", "apple", and "Double Eleven" may be used.
  • the target voice stored in the voice information database may be set according to different application scenarios.
  • the voice information search method described in this application may be applicable to, but not limited to, an application scenario of voice content audit based on security considerations.
  • Speech recognition based on automatic auditing of voice content usually needs to extract the underlying features of the voice that are unrelated to a specific person, so that the same words spoken by different people, or by the same person in different states and on different occasions, can be recognized more accurately.
  • The methods for extracting the underlying features of speech unrelated to a specific person in the present application may include the MFCC (Mel-Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction) methods.
  • MFCC is based on Fourier and cepstrum analysis: it performs a Fourier transform on the sampling points in a short-time audio frame to obtain the energy of the frame at each frequency, which can better reflect the frequency-domain characteristics of the audio signal. Therefore, in this embodiment, the MFCC method can be used to acquire the voice features of the target voice in the voice information base.
  • the extracting the voice features of the target voice in the voice information database may include the following processing steps:
  • the pre-processing may include operations such as voice format conversion, pre-emphasis, framing, windowing, and the like on the target voice.
  • the specific implementation process of pre-processing the target voice may include:
  • The target voices stored in the voice information database may include voice information collected in multiple formats, such as the AMR format and the WAV format.
  • The different target voice formats can be uniformly converted into the WAV format, which facilitates unified and fast processing of the subsequent data.
  • Pre-emphasis generally refers to flattening the spectrum of the signal by passing the speech signal through a high-pass filter, so that the spectrum can be obtained with the same or a similar signal-to-noise ratio from low frequencies to high frequencies.
  • Pre-emphasis can also eliminate vocal cord and lip effects in the vocalization process, compensating the high-frequency part of the speech signal that is suppressed by the articulatory system and highlighting the high-frequency formants.
  • The filter formula is as follows: H(z) = 1 − μz⁻¹, i.e. y(n) = x(n) − μ·x(n − 1).
  • The value of μ is between 0.9 and 1.0, and in this embodiment it can be 0.97.
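As an illustrative sketch (not part of the patent), the pre-emphasis filter above can be written in plain Python; the coefficient μ = 0.97 follows the embodiment:

```python
def pre_emphasis(signal, mu=0.97):
    """Apply the high-pass pre-emphasis filter y(n) = x(n) - mu * x(n-1)."""
    return [signal[0]] + [signal[n] - mu * signal[n - 1]
                          for n in range(1, len(signal))]

samples = [1.0, 1.0, 1.0, 1.0]        # a flat (purely low-frequency) signal
emphasized = pre_emphasis(samples)
# the constant component is strongly attenuated after the first sample
```

A flat input illustrates the high-pass behaviour: every sample after the first is reduced to 1 − 0.97 = 0.03.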
  • The framing may include dividing all the sampling points of the target speech into frames of N sampling points each.
  • an overlapping area may be formed between adjacent frames, and the overlapping area may include M sampling points.
  • the sampling frequency of the speech signal in the embodiment may be 8000 Hz
  • the sampling point N included in each frame is 512
  • the overlapping area M is 256.
  • The speech signal can then be windowed, and only the data inside the window is processed at a time. Since the actual speech signal is very long, it is usually unnecessary to process very long data in one pass; instead, a segment of data is taken and analyzed, then the next segment is taken and analyzed. For this purpose, a function can be constructed that has non-zero values in a certain interval and is zero in the remaining intervals.
  • the Hamming window is such a function.
  • The current framed signal can be multiplied by the Hamming window, and the window is usually moved by one third or one half of its length when processing the next frame, to increase the continuity between the left and right ends of adjacent frames.
  • The windowing formula provided in this embodiment may be: S′(n) = S(n) × W(n), where W(n) = (1 − a) − a·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, in which:
  • N is the number of sampling points of the frame
  • S(n) represents the speech signal
  • S'(n) is the speech signal after the windowing process
  • a can have a value of 0.46.
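The framing and windowing steps above can be sketched as follows; this illustrative Python (with hypothetical helper names) uses the embodiment's parameters: frames of N = 512 samples with a 256-sample overlap, and a = 0.46:

```python
import math

def frame_signal(signal, frame_len=512, overlap=256):
    """Split the signal into frames of frame_len samples,
    adjacent frames sharing `overlap` samples."""
    step = frame_len - overlap
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, step)]

def hamming_window(frame, a=0.46):
    """S'(n) = S(n) * ((1 - a) - a * cos(2*pi*n / (N - 1)))."""
    N = len(frame)
    return [s * ((1 - a) - a * math.cos(2 * math.pi * n / (N - 1)))
            for n, s in enumerate(frame)]

signal = [1.0] * 1024
frames = frame_signal(signal)          # 3 frames of 512 samples each
windowed = hamming_window(frames[0])   # tapered toward 0.08 at the edges
```

At the frame edges the window value is (1 − 0.46) − 0.46 = 0.08, which is what smooths the junction between adjacent frames.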
  • the energy of the target speech after pre-processing can be obtained.
  • S102 Calculate an energy spectrum of the pre-processed target speech.
  • The pre-processed frame speech signals may be subjected to a fast Fourier transform to obtain the spectrum of each frame, and the squared modulus of the spectrum may then be taken to obtain the power spectrum of the speech signal.
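A minimal sketch of S102, not taken from the patent: a transform per frame followed by the squared magnitude. A naive DFT is used here for clarity; a real implementation would use an FFT:

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT of one frame followed by |X(k)|^2
    (a real implementation would use an FFT)."""
    N = len(frame)
    spectrum = []
    for k in range(N):
        X_k = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                  for n in range(N))
        spectrum.append(abs(X_k) ** 2)
    return spectrum

# a pure cosine at one cycle per frame concentrates its energy
# in bins k = 1 and k = N - 1
frame = [math.cos(2 * math.pi * n / 8) for n in range(8)]
ps = power_spectrum(frame)
```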
  • S103 Perform Mel filtering on the energy spectrum, and calculate a logarithm of the energy spectrum after the Mel filtering.
  • The energy spectrum can be smoothed by a set of pre-configured Mel-scale triangular filters, eliminating harmonics and highlighting the formants of the original speech.
  • The logarithm of each filter output can then be calculated using the following equation: s(m) = ln( Σ_{k=0}^{N−1} |Xa(k)|²·Hm(k) ), 1 ≤ m ≤ M, where:
  • N is the number of sampling points of the frame
  • M is the number of filters
  • Xa is the Fourier transformed spectrum
  • Hm is the mth filter.
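The filtering-and-logarithm step of S103 can be sketched as below; the two-filter bank here is a toy illustration (not actually mel-spaced), and the flooring constant is an assumption to keep the logarithm defined:

```python
import math

def log_mel_energies(power_spec, filterbank):
    """s(m) = ln( sum_k |Xa(k)|^2 * Hm(k) ) for each triangular filter Hm."""
    energies = []
    for Hm in filterbank:
        e = sum(p * h for p, h in zip(power_spec, Hm))
        energies.append(math.log(max(e, 1e-12)))  # floor to avoid log(0)
    return energies

# two toy triangular filters over a 6-bin power spectrum
filterbank = [
    [0.0, 0.5, 1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 1.0, 0.5],
]
power_spec = [1.0] * 6
s = log_mel_energies(power_spec, filterbank)   # s[0] = ln(2)
```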
  • S104 Perform a DCT (Discrete Cosine Transform) on the logarithm of the energy spectrum to obtain the MFCC coefficients, and acquire the voice feature of the target voice.
  • The DCT may take the form C(l) = Σ_{m=1}^{M} s(m)·cos(πl(m − 0.5)/M), l = 1, 2, …, L, where:
  • N is the number of sampling points of the frame
  • M is the number of filters
  • L is the order of the MFCC coefficients, which is usually taken as 12–16.
  • the MFCC coefficient obtained by the above transformation can be used as the speech feature of the target speech.
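S104 can be sketched with a common DCT-II form (an illustrative implementation, not the patent's exact one); a constant filterbank output is used below purely to show the transform's behaviour:

```python
import math

def mfcc_from_log_energies(log_energies, L=12):
    """C(l) = sum_{m=1..M} s(m) * cos(pi * l * (m - 0.5) / M), l = 1..L."""
    M = len(log_energies)
    return [
        sum(s * math.cos(math.pi * l * (m + 0.5) / M)
            for m, s in enumerate(log_energies))
        for l in range(1, L + 1)
    ]

coeffs = mfcc_from_log_energies([1.0] * 26, L=12)
# constant filterbank energies carry no spectral shape,
# so every coefficient comes out ~0
```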
  • Dynamic parameters may also be added to the voice feature to improve speech recognition.
  • A difference parameter that characterizes the dynamic properties of the voice may be added to the voice feature; it helps distinguish the same words spoken by different people and improves the voice recognition performance of the system. Therefore, the voice information search method in this embodiment may further include:
  • S105 Calculate first and second order difference coefficients of the MFCC coefficients, and add the first and second order difference coefficients to the voice feature. That is, the speech feature may further include first and second order differential coefficients of the MFCC coefficients.
  • The first-order difference coefficients of the MFCC coefficients can be calculated by the following formula (second-order differences are obtained by applying the same formula to the first-order differences): dt = ( Σ_{n=1}^{N} n·(C_{t+n} − C_{t−n}) ) / ( 2·Σ_{n=1}^{N} n² ), where:
  • dt represents the tth first-order difference
  • Ct represents the t-th cepstral coefficient
  • N represents the time difference of the first-order derivative, which may take 1 or 2.
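The difference computation of S105 can be sketched as follows; handling the edge frames by padding is an assumption, since the text does not specify it:

```python
def delta(coeffs, N=2):
    """d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    with the first/last coefficient repeated at the edges."""
    T = len(coeffs)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = [coeffs[0]] * N + list(coeffs) + [coeffs[-1]] * N
    return [
        sum(n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)) / denom
        for t in range(T)
    ]

# a linearly increasing cepstral track has a constant interior slope of 1
d = delta([0.0, 1.0, 2.0, 3.0, 4.0], N=1)
```

Second-order differences are obtained by applying `delta` to the first-order output again.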
  • the feature descriptors of the target speech may be generated.
  • The feature descriptors described in this application may include VLAD (Vector of Locally Aggregated Descriptors) feature descriptors.
  • The following is a method for generating a feature descriptor provided by this embodiment, which can aggregate the voice features into a single feature vector.
  • the specific implementation process may include:
  • S101' acquiring, by using a k-means clustering method, the codebook of the target voice for the extracted voice feature;
  • Generating the VLAD descriptor usually requires training the codebook first: N speech features are randomly selected from the extracted speech features, and a codebook {μ1, …, μk} with k codewords is obtained by the k-means clustering method.
  • Each codeword in the codebook can represent the aggregation of one or more identical or similar speech samples; for example, μ1 can represent the aggregation of target voices in the voice information database that have the same meaning but different tones. In this way, a large number of target voices in the voice information base can be aggregated to form a codebook with k codewords.
  • The value of k may be much smaller than the number of target speeches in the voice information library.
  • S102' acquiring the voice feature set of the target voice, and calculating, for each codeword, the sum of the residual vectors between the voice features in the set and their closest codeword in the codebook.
  • The one or more speech features of a target speech in the speech information library may form a feature set {x1, …, xp}.
  • For the feature set formed by the voice features of a target voice, the codeword closest to each voice feature is searched in the codebook, the residual vector between each voice feature and its nearest codeword is calculated, and all residual vectors belonging to the same codeword are accumulated, as shown below: vi = Σ_{xt : NN(xt) = μi} (xt − μi)
  • where xt is the t-th feature in the feature set {x1, …, xp}, 1 ≤ t ≤ p, and μi is the i-th codeword in the codebook {μ1, …, μk}.
  • S103' normalizing the sum of the residual vectors of the codewords to generate a feature descriptor of the target speech.
  • The accumulated residual vector of each codeword is normalized.
  • The normalized codeword residual-vector sums may then be concatenated to form the VLAD total feature descriptor V of the target speech.
  • The VLAD total feature descriptor V can be expressed as: V = {v′1, …, v′k}
  • where v′i is the vector data of the normalized feature descriptor, and the normalization can be performed using the following formula: v′i = vi / ‖vi‖₂
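The three steps S101'–S103' can be condensed into a short illustrative sketch; the two-codeword codebook and 2-dimensional features below are toy values, not trained by k-means:

```python
import math

def vlad(features, codebook):
    """Accumulate residuals (x - mu_i) per nearest codeword,
    then L2-normalize each codeword block and concatenate."""
    d = len(codebook[0])
    sums = [[0.0] * d for _ in codebook]
    for x in features:
        # nearest codeword by squared Euclidean distance
        i = min(range(len(codebook)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])))
        for dim in range(d):
            sums[i][dim] += x[dim] - codebook[i][dim]
    descriptor = []
    for v in sums:
        norm = math.sqrt(sum(c * c for c in v)) or 1.0
        descriptor.extend(c / norm for c in v)  # concatenate normalized blocks
    return descriptor

codebook = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]]
V = vlad(features, codebook)   # length = k * d = 4
```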
  • the voice feature of the target voice collected and stored in the voice information database may be extracted, and then the feature descriptor of the target voice may be generated.
  • Each VLAD feature descriptor in the total feature descriptor V may be a multi-dimensional feature vector, for example a 128-dimensional feature vector {l1, l2, l3, …, l128}.
  • Each dimension of each feature vector is a clustering index of the target speech, and the corresponding target speech can be found according to the multi-dimensional feature vector v′i.
  • S2 Perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor.
  • An obtained VLAD feature descriptor usually reaches thousands of bits.
  • To reduce the difficulty and complexity of the voice information retrieval, the acquired VLAD feature descriptor is quantized, which reduces the dimension and information length of the feature descriptor and optimizes search query efficiency.
  • Performing the quantization coding on the feature descriptor to generate the quantized encoded feature descriptor may include:
  • S201 Divide each of the feature descriptors into L sub-vectors, cluster the sub-vectors, and set index numbers for the sub-vector clusters, L ≥ 2;
  • S202 The L sub-vectors of each of the feature descriptors are respectively represented by index numbers of the clusters that are closest to the sub-vectors, and the quantized encoded feature descriptors are generated.
  • FIG. 2 is a schematic diagram of performing quantization coding on a feature descriptor according to the embodiment.
  • Taking a 128-dimensional VLAD feature descriptor as an example, it can be divided into 8 equal parts (y1 to y8), each equal part containing a 16-dimensional component (16 components) of the 128-dimensional feature vector.
  • Each of the components can be separately clustered and mapped to 256 cluster centers (256 centroids).
  • Each sub-vector after clustering can be represented by an index number; for example, q1(y1) in FIG. 2 can represent the cluster index of the first 16-dimensional component y1 of the 128-dimensional feature vector.
  • In this way, the high-dimensional VLAD feature descriptor can be represented as a low-dimensional feature vector of length L whose components are secondary-clustering index numbers, which can greatly improve the search speed of subsequent voice information.
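The quantization coding of S201/S202 can be sketched as follows; the patent's sizes (128-dim vectors, 8 sub-vectors, 256 centers each) are shrunk to toy values, and the sub-codebooks are given by hand rather than learned by clustering:

```python
def quantize_descriptor(vector, sub_codebooks):
    """Split the vector into len(sub_codebooks) equal parts; encode each part
    as the index of its nearest cluster center in its sub-codebook."""
    L = len(sub_codebooks)
    sub_len = len(vector) // L
    codes = []
    for i, centers in enumerate(sub_codebooks):
        sub = vector[i * sub_len:(i + 1) * sub_len]
        codes.append(min(range(len(centers)),
                         key=lambda c: sum((a - b) ** 2
                                           for a, b in zip(sub, centers[c]))))
    return codes

# toy setup: a 4-dim vector split into 2 sub-vectors, 2 centers per sub-codebook
sub_codebooks = [
    [[0.0, 0.0], [5.0, 5.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
codes = quantize_descriptor([4.8, 5.1, 0.2, -0.1], sub_codebooks)
```

The 4-dimensional float vector is thus reduced to the 2-component index code, mirroring how the 128-dimensional descriptor shrinks to 8 index numbers.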
  • the quantized coded feature descriptor may be stored.
  • the specific storage may be implemented in the manner of indexing the quantized encoded feature descriptor.
  • the K-d tree index may be used in the embodiment, and the storing the feature descriptor implementation manner may include:
  • S203' Starting from the root node of the K-d tree, the value of the feature descriptor at the index dimension of each non-leaf node is compared with the division value of that non-leaf node, and, based on the comparison result and the resulting path direction, the feature descriptor is stored into a leaf node of the K-d tree.
  • FIG. 3 is a schematic diagram of an index for establishing a feature descriptor according to the present application.
  • a K-d tree of height (3+1) can be established, and the last layer of the K-d tree is a leaf node.
  • The feature descriptor to be stored is a 3-dimensional feature vector; the index dimension and division value of each non-leaf node may be pre-assigned, and the path direction for each comparison result may be preset. For example, if the value of the feature descriptor at the index dimension corresponding to the non-leaf node is smaller than the division value of that node, the comparison continues in the left subtree of the node; otherwise it continues in the right subtree, until a leaf node is reached.
  • The index dimension of a node indicates which dimension of the feature descriptor is compared at that node.
  • A dimension that has already been compared is not compared again: when index dimensions are assigned to non-leaf nodes, the index dimension values already assigned on the path from the K-d tree root node to the current non-leaf node are no longer allocated to the current non-leaf node.
  • In a specific implementation, the division value of a non-leaf node may be half of the number of cluster centers set in the above quantization coding; since that number is 256, the division value can be set to 128.
  • the feature descriptors of the 3-dimensional feature vectors can be stored in the leaf nodes, respectively.
  • Suppose a feature descriptor is the 3-dimensional feature vector (45, 210, 60).
  • Starting from the root node of the K-d tree established according to FIG. 3, the value of the feature descriptor (45, 210, 60) at the index dimension shown in the root node (index dimension 2, division value 128) is compared with the division value.
  • The value of the feature descriptor (45, 210, 60) at the current node's index dimension 2 is 210, which is greater than the current node's division value 128, so the search enters the right subtree of the root node.
  • The index dimension of the current non-leaf node is 1 and its division value is 128. The value of the feature descriptor (45, 210, 60) at index dimension 1 is 45, which is smaller than the division value 128, so the search enters the left subtree of the current non-leaf node.
  • According to the current non-leaf node's index dimension of 3 and division value of 128, the value 60 of the feature descriptor (45, 210, 60) is compared with 128; since it is smaller, the search enters the left side of the current non-leaf node and arrives at a leaf node of the K-d tree.
  • The feature descriptor (45, 210, 60) may then be stored in that leaf node.
  • Other feature descriptors can likewise be stored into leaf nodes of the corresponding K-d tree according to the above method.
  • Each leaf node stores the feature vectors of all feature descriptors whose search path from the root node leads to that leaf node.
  • Each dimension of each of the feature vectors is a clustering index of the target speech.
  • In the K-d index established above, each leaf node stores a set of feature descriptors sharing the same path, and the feature descriptors stored by one leaf node can be used as a target candidate set.
  • To equalize the number of feature descriptors stored in each leaf node, so that the target candidate sets stored in the leaf nodes have comparable capacity, the division of index dimensions for non-leaf nodes may proceed as follows:
  • the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1 ≤ S ≤ L;
  • the index dimension value S of the current non-leaf node is chosen from the index dimension values not yet assigned on the path from the K-d tree root node to the current non-leaf node.
  • L in the above is the length of the feature descriptor, that is, the dimension of the feature vector in the feature descriptor.
  • Randomly dividing the index dimensions of the non-leaf nodes improves the uniform distribution of nodes on the left and right sides of the K-d tree, avoids an excessive number of feature descriptors in the target candidate sets stored by some leaf nodes, improves the subsequent search speed within a target candidate set, and balances the load of each target candidate set.
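A simplified sketch of this storage scheme, not from the patent: here the tree is flattened into a dictionary keyed by the left/right path, with one shared division value of 128 and the index-dimension sequence of the FIG. 3 example. A real K-d tree would store per-node index dimensions and division values:

```python
def leaf_path(descriptor, plan, split=128):
    """Follow the tree: at each level, compare the descriptor's value at that
    level's index dimension (1-based) with the division value.
    'L' means left subtree (value < split), 'R' means right."""
    return ''.join('L' if descriptor[dim - 1] < split else 'R' for dim in plan)

def store(tree, descriptor, plan):
    """Append the descriptor to the leaf (candidate set) at the end of its path."""
    tree.setdefault(leaf_path(descriptor, plan), []).append(descriptor)

plan = [2, 1, 3]                    # index dimensions at depths 0, 1, 2 (as in FIG. 3)
tree = {}
store(tree, (45, 210, 60), plan)    # path: R (210>=128), L (45<128), L (60<128)
store(tree, (90, 135, 75), plan)    # same path R, L, L -> same leaf

# a query descriptor that follows the same path lands on the same candidate set
candidates = tree[leaf_path((90, 135, 78), plan)]
```

Descriptors sharing a root-to-leaf path end up in one leaf, which is exactly the target candidate set used in the next step.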
  • S3 acquiring a feature descriptor of the voice to be recognized, searching, in the stored feature descriptor, a target voice corresponding to the feature descriptor that matches the feature descriptor of the to-be-recognized voice, and finding the target Voice as a target candidate set corresponding to the to-be-recognized voice;
  • the speech feature of the speech to be recognized may be extracted according to the method described above, and the quantized and encoded feature descriptor of the speech to be recognized may be acquired.
  • The feature descriptor of the to-be-recognized voice may be matched against the feature descriptors of the target voices, the target voices corresponding to the feature descriptors that match the feature descriptor of the to-be-recognized voice may be found, and the found target voices may be used as the target candidate set corresponding to the to-be-recognized voice.
  • The target voices corresponding to the matching feature descriptors may include the target voices corresponding to the feature descriptors stored in the leaf node of the established K-d tree whose path is the same as that of the feature descriptor of the to-be-recognized voice.
  • For example, the voice to be recognized may generate a feature descriptor (90, 135, 78). The feature descriptor search can then be performed following the K-d tree shown in FIG. 3, with its designed index dimensions and division values, finally obtaining the target candidate set of the leaf node matching the feature descriptor of the speech to be recognized.
  • The search process for the feature descriptor of the to-be-recognized speech in the K-d tree may follow the storage procedure of the feature descriptor (45, 210, 60) of the target speech: the target speech corresponding to the feature descriptor (90, 135, 78) of the speech to be recognized can be found and the target candidate set determined, namely the leaf node in FIG. 3 where (90, 135, 75) is located. This locates the speech to be recognized to a smaller target candidate set of target speeches.
  • S4 Select the search result of the to-be-identified voice in the target candidate set according to a predetermined rule.
  • After the target candidate set is obtained, the search scope has been greatly reduced. A further selection may then be made according to the predetermined rule to obtain the search result of the to-be-recognized voice.
  • This embodiment provides a method for further selecting the search result within the target candidate set. Specifically, selecting the search result of the to-be-recognized voice from the target candidate set according to a predetermined rule may include:
  • The Euclidean distance between the feature descriptor of the to-be-recognized speech and each feature vector in the target candidate set may be calculated, the calculated Euclidean distances may be sorted in increasing order, and the first R feature descriptors with the smallest Euclidean distances may be selected as the search result set.
  • The target voices in the voice information database corresponding to the feature descriptors in the search result set are the search results of the voice to be recognized.
  • the range of values of R in the top R results with the smallest Euclidean distance selected here can be set according to requirements.
  • R may take a value of 1, which means selecting the single feature descriptor with the smallest Euclidean distance.
  • For example, if the feature descriptor of the speech to be recognized is (90, 135, 78), the feature descriptor with the smallest Euclidean distance in the acquired target candidate set, (90, 135, 75), may be selected as the search result set.
  • R may also take a value greater than 1 and smaller than the number of feature descriptors in the target candidate set, for example 3, so that the three feature descriptors with the smallest Euclidean distances may be selected from the target candidate set; the target speeches corresponding to the other two feature descriptors in the search result set can serve as reference or alternative search results.
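The top-R selection of S4 can be sketched as follows, reusing the example descriptors from the text (an illustrative sketch, not the patent's implementation):

```python
import math

def select_results(query, candidates, R=1):
    """Sort the candidate set by Euclidean distance to the query
    and keep the first R descriptors as the search result set."""
    dist = lambda v: math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))
    return sorted(candidates, key=dist)[:R]

candidates = [(90, 135, 75), (45, 210, 60), (88, 140, 70)]
best = select_results((90, 135, 78), candidates, R=1)
# (90, 135, 75) lies at distance 3 from the query, closer than the others
```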
  • The voice information search method provided by the present application can extract the underlying speech features of the target speech, generate the corresponding VLAD feature descriptors to represent the feature set of the target speech, and locate the corresponding target speech according to a feature descriptor.
  • The present application introduces the VLAD feature descriptor to convert a speech feature set into a fixed-length overall feature, and then converts the VLAD feature descriptor, which has a long information length and high dimension, into a feature vector of smaller length and lower dimension, greatly improving information reading, parsing, and indexing speed.
  • the target candidate set is established, and the to-be-recognized speech is searched by the K-d tree index to obtain a target candidate set with a smaller search range, and the search result is further selected in the target candidate set, thereby further speeding up the search speed.
  • the present application further provides a voice information searching device, which can store target voices in a voice information database according to rules into corresponding candidate set modules, and establish a corresponding indexing mechanism. After obtaining the speech to be recognized, the target candidate set can be quickly obtained through the index, and the search result is further obtained, thereby converting the speech recognition of the traditional pronunciation template into the search of the speech feature, and improving the search by quantitatively encoding and optimizing the index. Speed, optimize query efficiency.
  • FIG. 4 is a block diagram of the structure of a voice information search device according to the present application. As shown in FIG. 4, the device may include:
  • the information acquiring module 101 is configured to acquire a target voice, and extract a voice feature of the target voice;
  • the descriptor module 102 may be configured to generate a feature descriptor of the target voice based on a voice feature of the target voice;
  • the quantization coding module 103 may be configured to perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor;
  • the identification information module 104 can be configured to acquire a feature descriptor of the voice to be recognized;
  • the first search module 105 may be configured to: in the stored feature descriptor, search for a target voice corresponding to the feature descriptor that matches the feature descriptor of the to-be-recognized voice, and use the found target voice as a target candidate set corresponding to the to-be-recognized voice;
  • the second search module 106 may be configured to select the search result of the to-be-identified voice in the target candidate set according to a predetermined rule.
  • the manner in which the information acquisition module 101 extracts the voice features of the target voice may include extraction methods based on MFCC and PLP.
  • the process of extracting speech features using MFCC may include:
  • the pre-processing the target voice may include: performing voice format conversion, pre-emphasis, framing, and windowing on the target voice.
  • the process of extracting voice features by using the MFCC may further include:
  • the quantization and coding module 103 may specifically include:
  • the clustering module may be configured to divide each feature descriptor into L sub-vectors, cluster the sub-vectors, and assign an index number to each sub-vector cluster, L ≥ 2;
  • the mapping module may be configured to represent each of the L sub-vectors of a feature descriptor by the index number of the cluster nearest to that sub-vector, forming the quantized encoded feature descriptor.
  • each sub-vector index can be given an 8-bit binary representation, so that each feature descriptor is converted into a lower-dimensional feature vector of length L.
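The sub-vector quantization described above can be sketched as follows (illustrative Python, not the patented implementation; the sub-codebooks are assumed to be pre-trained, with at most 256 centroids each so that every index fits in 8 bits):

```python
def quantize(descriptor, codebooks):
    """Product-quantize a feature descriptor: split it into L sub-vectors
    and replace each sub-vector by the index of its nearest centroid in
    the corresponding sub-codebook."""
    L = len(codebooks)
    d = len(descriptor) // L
    code = []
    for i, centroids in enumerate(codebooks):
        sub = descriptor[i * d:(i + 1) * d]
        # nearest centroid by squared Euclidean distance; with at most
        # 256 centroids per sub-codebook the index fits in 8 bits
        best = min(range(len(centroids)),
                   key=lambda k: sum((a - b) ** 2
                                     for a, b in zip(sub, centroids[k])))
        code.append(best)
    return code
```

Each descriptor thus shrinks to L small integers, which is what makes the later index comparisons cheap.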
  • FIG. 5 is a schematic structural diagram of a module of a quantization and coding module provided by the present application.
  • the quantization and coding module 103 may specifically include:
  • the index tree construction module 1031 can be used to establish a K-d tree with a height of (L+1);
  • the preset rule module 1032 may be configured to assign, according to a preset rule, an index dimension and a corresponding partition value to each non-leaf node of the K-d tree, and to establish the result path direction for the comparison against the partition value;
  • the storage module 1033 may be configured to compare, starting from the root node of the K-d tree, the value of a feature descriptor on the index dimension of each non-leaf node with that node's partition value, and to store the feature descriptor into a leaf node of the K-d tree based on the comparison result and the result path direction.
  • the established result path direction may include comparing the feature descriptor with the partition value of the current non-leaf node: if the former is greater than the latter, the left subtree of the current non-leaf node is entered to continue the comparison; otherwise the right subtree is entered. Of course, the comparison may instead be set so that the left subtree is entered when the former is smaller than the latter; the specific rule can be set as required.
  • the preset rule module 1032 is configured to divide the index dimension for the non-leaf node.
  • the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1 ≤ S ≤ L, and the index dimension value S of the current non-leaf node is one that has not yet been assigned on the path from the K-d tree root node to the current non-leaf node.
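A minimal sketch of the index tree these modules describe, under stated assumptions: the names are illustrative, the partition value 128 is simply the midpoint of the 8-bit code range, and the greater-than-goes-left routing follows the convention above:

```python
import random

class KDNode:
    """K-d tree over quantized codes of length L; leaves are buckets."""
    def __init__(self, L, used=frozenset(), depth=0):
        if depth == L:                      # leaf at depth L, so tree height is L+1
            self.dim, self.bucket = None, []
        else:
            # split dimension: random, and not yet used on the root-to-node path
            self.dim = random.choice([s for s in range(L) if s not in used])
            self.value = 128                # illustrative partition value for 8-bit codes
            self.left = KDNode(L, used | {self.dim}, depth + 1)
            self.right = KDNode(L, used | {self.dim}, depth + 1)

    def insert(self, code):
        node = self
        while node.dim is not None:
            # greater than the partition value -> left subtree, else right
            node = node.left if code[node.dim] > node.value else node.right
        node.bucket.append(code)
        return node
```

Descriptors whose codes compare the same way at every non-leaf node end up in the same leaf bucket, which is what yields the target candidate set at query time.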
  • FIG. 6 is a schematic structural diagram of a module of a second search module provided by the present application.
  • the second search module 106 of the voice search device may specifically include:
  • the distance calculation module 1061 may be configured to calculate an Euclidean distance between the feature descriptor in the target candidate set and the feature descriptor of the to-be-recognized voice;
  • the screening module 1062 is configured to select, in the target candidate set, the R feature descriptors with the smallest Euclidean distance to the feature descriptor of the voice to be recognized as the search result set, R ≥ 1;
  • the target speech module 1063 can be configured to acquire a target speech corresponding to the feature descriptor in the search result set.
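The distance calculation and screening steps above can be sketched as follows (illustrative, not the patented implementation):

```python
def top_r(query, candidates, R=1):
    """Rank candidate descriptors by Euclidean distance to the query
    descriptor and keep the R closest as the search result set."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(query, v)) ** 0.5
    return sorted(candidates, key=dist)[:R]
```

With R = 1 only the nearest target speech is returned; with R > 1 the remaining matches serve as the reference or alternative results mentioned earlier.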
  • the descriptor module 102 may specifically include:
  • the codebook training module may be configured to acquire the codebook of the target voice by using a k-means clustering method for the extracted voice features;
  • the voice feature set module may be configured to obtain a voice feature set and calculate the sum of the residual vectors between the features in the set and their nearest codewords in the codebook; the voice feature set may be, for example, {x1, ..., xp}, where each x represents a voice feature of the corresponding target speech.
  • the normalization processing module may be configured to normalize a sum of residual vectors of the codewords to generate a feature descriptor of the target speech.
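A plain-Python sketch of the descriptor generation these modules describe, assuming the codebook has already been trained by k-means: the residual of each feature to its nearest codeword is accumulated per codeword, and the concatenated result is L2-normalized:

```python
def vlad(features, codebook):
    """Aggregate a variable-size set of speech features into one
    fixed-length descriptor: accumulate residuals x - c(x) per codeword,
    concatenate, then L2-normalize."""
    k, d = len(codebook), len(codebook[0])
    acc = [[0.0] * d for _ in range(k)]
    for x in features:
        # index of the codeword nearest to this feature
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, codebook[c])))
        for t in range(d):
            acc[j][t] += x[t] - codebook[j][t]
    flat = [v for row in acc for v in row]
    norm = sum(v * v for v in flat) ** 0.5 or 1.0
    return [v / norm for v in flat]
```

The output length is k × d regardless of how many features the speech produced, which is what makes the descriptor fixed-length.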
  • a voice information searching device described in the present application can be applied to a plurality of device terminals or servers.
  • for example, a smart mobile terminal commonly used in daily life can obtain voice information to be recognized and send it to a server, and the server can perform a voice search using the voice information search method and device implemented in this application to obtain a corresponding voice search result, which can then be further processed. Therefore, the application further provides a voice information search server, the server being configured to include:
  • a first processing unit configured to acquire a target voice, generate a feature descriptor of the target voice, and further configured to perform quantization coding on the feature descriptor;
  • a storage unit configured to separately store feature descriptors with the same path among the quantized encoded feature descriptors;
  • a second processing unit configured to acquire a feature descriptor of the voice to be recognized; further configured to search, among the stored feature descriptors, for the target voices of the feature descriptors that match the voice to be recognized to obtain a candidate set; and further configured to select the search result of the voice to be recognized in the candidate set according to a predetermined rule.
  • the server may further return the obtained search result of the to-be-identified voice to the client that sends the voice to be recognized, or perform other processing in combination with the function module of the server or other server.
  • the voice search server provided in this embodiment combines the feature descriptor and the index tree to optimize the voice indexing method, and improves the server voice search speed.
  • the voice information search method, device and server provided by the present application can store target voices into corresponding candidate sets according to rules and establish a corresponding indexing mechanism; after the voice to be recognized is acquired, the target candidate set can be quickly obtained through the index and the search result further obtained, thereby converting traditional pronunciation-template speech recognition into a search over speech features, with quantization coding and the K-d tree optimizing the index to improve the search speed and query efficiency.
  • the embodiments of the present application may also be implemented with a slightly modified transmission mechanism or data processing standard based on certain protocols or standards. The present application can still be implemented as long as the information interaction and information judgment feedback manners of the above embodiments are satisfied; details are not repeated here.
  • the unit, device or module illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function.
  • the above devices are described as being separately divided into various modules by function.
  • when implementing the present application, the functions of the modules may be implemented in the same one or more pieces of software and/or hardware, or a module implementing one function may be implemented by a combination of multiple sub-modules or sub-units.
  • the controller can be implemented by logically programming the method steps, for example by means of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers.
  • the application can be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, classes, and the like that perform particular tasks or implement particular abstract data types.
  • the present application can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communication network.
  • program modules can be located in both local and remote computer storage media including storage devices.
  • the present application can be implemented by means of software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, mobile terminal, server, or network device, etc.) to perform the methods described in the various embodiments of the present application or in portions of the embodiments.


Abstract

A voice information search method and apparatus, and a server. The method comprises: extracting voice features of target voices in a voice information base to generate feature descriptors of the target voices (S1); acquiring a feature descriptor of a voice to be identified, and taking a found target voice corresponding to a feature descriptor matching the feature descriptor of the voice to be identified as a target candidate set corresponding to the voice to be identified (S3); and selecting a search result of the voice to be identified in the target candidate set according to a pre-set rule (S4). By utilizing the method, query efficiency can be optimised, and the voice search speed can be increased.

Description

Voice information search method, device and server

Technical field
The present application belongs to the technical field of electronic information signal processing, and in particular relates to a voice information search method, device and server.
Background
In the future, speech recognition will gradually become a key technology for human-computer interaction in electronic information technology. At present, demand for speech recognition technology is growing in fields such as bank self-service, public self-service, WeChat terminal applications, and instant voice communication, especially for voice content review based on security considerations in the mobile Internet era.
For example, in many current social applications, users can post voice messages with various content, some of which may involve illegal information such as terror, pornography, advertising, and fraud. At present, the commonly used method is to automatically review voice content based on speech recognition technology for specific languages and keywords. A common practice in this technology is to acquire the speech features of the audited speech and train an acoustic model, in order to create a pronunciation template for each pronunciation. At recognition time, the speech features to be recognized are matched one by one against the acoustic models of the audited speech, and the pronunciation template closest to the speech to be recognized is selected as the expressed meaning of that speech.
In the actual speech recognition process, speech information is usually divided into many audio features; for example, with a frame length of 20 ms, a 10-second speech segment will generate 500 audio features. The stored audited voices often number in the thousands or tens of thousands, an audited voice with the same meaning may include expressions in multiple dialects and tones, and the pronunciation model of each audited voice contains a large number of audio features. On large data sets, the existing pronunciation-model-based audio recognition methods face complex high-dimensional feature indexing and query processes and long query times, which reduce query efficiency.
Summary of the invention
The purpose of the present application is to provide a voice information search method, device and server that can extract underlying speech features unrelated to any specific speaker, perform quantization coding on them, build an index, and search the existing target voices in the database through a K-d tree, achieving fast search of voice content and improving query efficiency.
The voice information search method, device and server provided by the present application are implemented as follows:
A voice information search method, the method comprising:
extracting voice features of a target voice in a voice information base, and generating a feature descriptor of the target voice;
quantizing and encoding the feature descriptor to generate a quantized encoded feature descriptor, and storing the feature descriptor;
acquiring a feature descriptor of the voice to be recognized; searching, among the stored feature descriptors, for the target voices corresponding to the feature descriptors that match the feature descriptor of the voice to be recognized, and using the found target voices as the target candidate set corresponding to the voice to be recognized;
selecting a search result of the voice to be recognized in the target candidate set according to a predetermined rule.
A voice information search device, the device comprising:
an information acquisition module, configured to acquire a target voice and extract voice features of the target voice;
a descriptor module, configured to generate a feature descriptor of the target voice based on the voice features of the target voice;
a quantization coding module, configured to perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor;
an identification information module, configured to acquire a feature descriptor of the voice to be recognized;
a first search module, configured to search, among the stored feature descriptors, for the target voices corresponding to the feature descriptors that match the feature descriptor of the voice to be recognized, and use the found target voices as the target candidate set corresponding to the voice to be recognized;
a second search module, configured to select a search result of the voice to be recognized in the target candidate set according to a predetermined rule.
A voice information search server, the server being configured to include:
a first processing unit, configured to acquire a target voice and generate a feature descriptor of the target voice; and further configured to perform quantization coding on the feature descriptor;
a storage unit, configured to separately store feature descriptors with the same path among the quantized encoded feature descriptors;
a second processing unit, configured to acquire a feature descriptor of the voice to be recognized; further configured to search, among the stored feature descriptors, for the target voices of the feature descriptors that match the voice to be recognized to obtain a candidate set; and further configured to select the search result of the voice to be recognized in the candidate set according to a predetermined rule.
The present application provides a voice information search method, device and server that can perform phoneme-level model learning and representation of the target voice information of keywords or phrases to be audited stored in a voice information base, generate feature descriptors, and build an index. In the present application the generated feature descriptors can be quantized and encoded, reducing the feature descriptor index dimension and information length and improving the processing speed of indexing. At query time, the present application uses a K-d tree to obtain a target candidate set with a smaller search range for the voice to be recognized, and then further filters out the search results. The voice information search method provided by the present application converts traditional high-dimensional, complex pronunciation-template speech recognition into a search for similar audio features, and, by reducing the index dimension through feature descriptors and optimizing query efficiency through the K-d tree, can greatly improve the voice information search speed.
Brief description of the drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flowchart of an embodiment of a voice information search method according to the present application;
FIG. 2 is a schematic diagram of quantizing and encoding a feature descriptor according to the present application;
FIG. 3 is a schematic diagram of building an index of feature descriptors in the present application;
FIG. 4 is a schematic structural diagram of the modules of an embodiment of a voice information search device according to the present application;
FIG. 5 is a schematic structural diagram of the modules of a quantization coding module provided by the present application;
FIG. 6 is a schematic structural diagram of the modules of a second search module provided by the present application.
Detailed description
In order to enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are clearly and completely described below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort shall fall within the scope of protection of the present application.
The voice information search method provided in the present application can add the underlying speech features, in different languages, of the target keywords or phrases to be audited to a database, perform phoneme-level model learning and representation, generate feature descriptors, and build an index. At query time, a K-d tree is used to obtain the target candidate set of the voice to be recognized, and the search results are then further filtered. Based on the above retrieval scheme, feature descriptor quantization coding reduces the index dimension and the K-d tree optimizes query efficiency, which can greatly improve the voice information search speed.
FIG. 1 is a flowchart of an embodiment of the voice information search method described in the present application. As shown in FIG. 1, the method may include:
S1: Extract voice features of a target voice in the voice information base, and generate a feature descriptor of the target voice.
The voice information base may include stored target voices. The target voices can be collected or set in advance, and may include different content in different application scenarios. For example, in voice content review based on security considerations, the target voices in the voice information base may be keywords or phrases in multiple dialects or tones that involve content such as terror, pornography, advertising, and fraud. In a home audio-video or automotive intelligent control terminal, the target voices may include voice keywords or phrases for controlling home smart devices, such as smart TVs and audio systems, or driving control devices. Alternatively, they may be collected target voice information of keywords or phrases commonly used in smart terminal applications for social networking, shopping, and chat, such as "weather", "apple", or "Double Eleven". The target voices stored in the voice information base can be set according to different application scenarios, and the voice information search method described in this application is applicable to, but not limited to, the application scenario of voice content review based on security considerations.
Speech recognition based on automatic voice content review usually needs to extract underlying speech features that are independent of any specific speaker, so that the same words spoken by different people, or by the same person in different states and on different occasions, can be recognized more accurately. Methods for extracting speaker-independent underlying speech features in the present application may include MFCC (Mel-Frequency Cepstrum Coefficients) and PLP (Perceptual Linear Prediction). MFCC is based on Fourier and cepstrum analysis: a Fourier transform is applied to the sampling points in a short audio frame to obtain the energy of that frame at each frequency, which reflects the frequency-domain characteristics of the audio signal well. Therefore, in this embodiment, the MFCC method can be used to obtain the voice features of the target voices in the voice information base.
Specifically, extracting the voice features of the target voice in the voice information base may include the following processing steps:
S101: Pre-process the target voice.
The pre-processing may include operations such as voice format conversion, pre-emphasis, framing, and windowing of the target voice. In this embodiment, the specific implementation of pre-processing the target voice may include:
S1011: Voice format conversion.
The target voices stored in the voice information base may be collected in multiple formats, such as the amr and wav formats. In this embodiment, the different target voice formats can be uniformly converted into the wav format to facilitate unified and fast processing of subsequent data.
S1012: Pre-emphasis.
Pre-emphasis usually refers to passing the speech signal through a high-pass filter to flatten the signal spectrum, ensuring that the spectrum can be obtained with the same or a similar signal-to-noise ratio over the whole band from low to high frequencies. Pre-emphasis can also eliminate the vocal-cord and lip effects of the vocalization process, compensating the high-frequency part of the speech signal suppressed by the articulation system and emphasizing the high-frequency formants. The filter formula is as follows:
H(z) = 1 - μz^(-1)
In the above formula, the value of μ is between 0.9 and 1.0; in this embodiment it can be taken as 0.97.
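The filter above amounts to the difference equation y[n] = x[n] - μ·x[n-1]; a minimal sketch:

```python
def pre_emphasis(signal, mu=0.97):
    """High-pass pre-emphasis: y[n] = x[n] - mu * x[n-1].
    The first sample is passed through unchanged."""
    return [signal[0]] + [signal[n] - mu * signal[n - 1]
                          for n in range(1, len(signal))]
```

A constant (DC) signal is attenuated to roughly 1 - μ of its level, while fast changes pass through, which is the flattening effect described above.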
S1013: Framing.
Framing may include taking every N sampling points of the target voice as one frame. In this embodiment, to avoid excessive change between two adjacent frames, adjacent frames may share an overlapping region of M sampling points. Specifically, in this embodiment the sampling frequency of the speech signal may be 8000 Hz, the number of sampling points per frame N may be 512, and the overlap M may be 256.
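The framing step can be sketched as follows (illustrative; the start of each frame advances by N - M points so that adjacent frames overlap by M):

```python
def split_frames(signal, N=512, M=256):
    """Split a sample sequence into frames of N points; consecutive
    frames share an overlap of M points, i.e. the hop size is N - M."""
    hop = N - M
    return [signal[i:i + N] for i in range(0, len(signal) - N + 1, hop)]
```

With N = 512, M = 256 and an 8000 Hz sampling rate, each frame covers 64 ms and a new frame starts every 32 ms.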
S1014: Windowing.
Usually, to process the speech signal efficiently, the signal can be windowed so that only the data in the window is processed at a time. Because the actual speech signal is very long, it is usually unnecessary to process very long data at once: a segment of data can be taken each time and analyzed, then the next segment taken and analyzed. In this process a function can be constructed that has non-zero values in a certain interval and is 0 elsewhere. The Hamming window is such a function. In this embodiment, the signal of the current frame can be multiplied by the Hamming window; when processing the next frame, the window is usually moved by one third or one half of its length each time to increase the continuity between the left and right ends of the frames. The windowing formula provided in this embodiment may be:
S′(n) = S(n) × W(n)
In the above formula,

W(n) = (1 - a) - a × cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
In the above formula, N is the number of sampling points per frame, S(n) denotes the speech signal, S′(n) is the windowed speech signal, and a can take the value 0.46.
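A sketch of the windowing step, assuming the standard Hamming form with a = 0.46 (the equation image in the original is not reproduced here, so the exact constants are an assumption):

```python
import math

def hamming_window(N, a=0.46):
    """W(n) = (1 - a) - a*cos(2*pi*n/(N-1)): the Hamming window."""
    return [(1 - a) - a * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

def apply_window(frame):
    """S'(n) = S(n) * W(n): multiply each frame sample by the window."""
    w = hamming_window(len(frame))
    return [s * wn for s, wn in zip(frame, w)]
```

The window is near zero at both frame edges and 1.0 at the center, tapering each frame so that the overlap between adjacent frames smooths discontinuities.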
After the above pre-processing, the energy of the pre-processed target voice can be obtained.
S102: Calculate the energy spectrum of the pre-processed target voice.
In this embodiment, a fast Fourier transform can be applied to each pre-processed frame of the speech signal to obtain the spectrum of each frame, and the squared magnitude of the spectrum can then be taken to obtain the power spectrum of the speech signal.
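A sketch of the per-frame power spectrum (a naive O(N²) DFT for clarity; a real implementation would use an FFT):

```python
import cmath

def power_spectrum(frame):
    """DFT of one windowed frame followed by the squared magnitude
    at each frequency bin."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N))) ** 2
            for k in range(N)]
```

Each output value is the energy of the frame at one frequency bin; these energies feed the Mel filter bank in the next step.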
S103: Perform Mel filtering on the energy spectrum, and calculate the logarithm of the Mel-filtered energy spectrum.
The energy spectrum can be passed through a pre-set group of Mel-scale triangular filters to smooth the spectrum, eliminate the effect of harmonics, and highlight the formants of the original speech. The logarithm of each filter output can then be calculated using the following equation:
s(m) = ln( Σ_{k=0}^{N-1} |Xa(k)|² × Hm(k) ), 0 ≤ m < M
In the above formula, N is the number of sampling points per frame, M is the number of filters, Xa is the Fourier-transformed spectrum, and Hm is the m-th filter.
S104:对所述能量谱的对数进行DCT变换得到MFCC系数,获取所述目标语音的语音特征。S104: performing DCT transform on the logarithm of the energy spectrum to obtain MFCC coefficients, and acquiring a voice feature of the target voice.
本实施例中可以采用下式进行DCT(Discrete Cosine Transform,离散余弦变换)变换得到MFCC系数:In this embodiment, DCT (Discrete Cosine Transform) transformation can be performed by using the following formula to obtain MFCC coefficients:
C(n) = Σ_{m=0}^{M−1} s(m) cos( πn(m+0.5)/M ), n = 1, 2, …, L
上式中,N为帧的采样点的个数,M为滤波器个数,L阶指MFCC系数阶数,通常取12-16。 In the above formula, N is the number of sampling points of the frame, M is the number of filters, and L is the order of the MFCC coefficients, which is usually taken as 12-16.
可以将上式变换得到的MFCC系数作为目标语音的语音特征。The MFCC coefficient obtained by the above transformation can be used as the speech feature of the target speech.
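The DCT step above can be written directly from the formula; L = 12 is one illustrative choice inside the 12-16 range mentioned in the text.

```python
import numpy as np

def mfcc_from_log_mel(s, L=12):
    # C(n) = sum_{m=0}^{M-1} s(m) * cos(pi * n * (m + 0.5) / M), n = 1..L
    s = np.asarray(s, dtype=float)
    M = len(s)
    n = np.arange(1, L + 1)[:, None]
    m = np.arange(M)[None, :]
    return (np.cos(np.pi * n * (m + 0.5) / M) * s).sum(axis=1)
```

Because the DCT basis vectors for n ≥ 1 are orthogonal to a constant input, a flat log-energy vector yields all-zero coefficients, which is a quick sanity check of the transform.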
另一种优选的实施方式中,还可以在所述语音特征中添加动态参数,提高语音识别性。本实施例所述的方法中可以在所述语音特征中加入表征语音动态特性的差分参数,屏蔽可以区分不同的人说的同样的话的特征,提高系统的语音识别性能。因此,本实施例中所述语音信息搜索方法还可以包括:In another preferred embodiment, dynamic parameters may also be added to the voice feature to improve speech recognition. In the method described in this embodiment, a difference parameter that characterizes the dynamic characteristics of the voice may be added to the voice feature, and the mask may distinguish the same words that different people say, and improve the voice recognition performance of the system. Therefore, the voice information searching method in this embodiment may further include:
S105:计算所述MFCC系数的一阶和二阶差分系数,将所述一阶和二阶差分系数添加至所述语音特征中。即所述语音特征还可以包括所述MFCC系数的一阶和二阶差分系数。所述MFCC系数的一阶和二阶差分系数具体的可以采用下述公式计算:S105: Calculate first and second order difference coefficients of the MFCC coefficients, and add the first and second order difference coefficients to the voice feature. That is, the speech feature may further include first and second order differential coefficients of the MFCC coefficients. The first and second order difference coefficients of the MFCC coefficients can be specifically calculated by the following formula:
dt = ( Σ_{n=1}^{N} n(Ct+n − Ct−n) ) / ( 2 Σ_{n=1}^{N} n² )
上式中dt表示第t个一阶差分,Ct表示第t个倒谱系数,N表示一阶导数的时间差,可取1或2。将上式的结果再代入就可以得到二阶差分的参数。In the above formula, dt represents the tth first-order difference, Ct represents the t-th cepstral coefficient, and N represents the time difference of the first-order derivative, which may take 1 or 2. By substituting the results of the above equation, the parameters of the second order difference can be obtained.
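A sketch of the difference computation. Handling the sequence boundaries by repeating the edge cepstra is a common convention assumed here, not something the text specifies.

```python
import numpy as np

def delta(c, N=2):
    # d_t = sum_{n=1..N} n * (C_{t+n} - C_{t-n}) / (2 * sum_{n=1..N} n^2)
    c = np.asarray(c, dtype=float)
    padded = np.pad(c, N, mode='edge')  # repeat edge values where t-n < 0 or t+n >= T
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(len(c))
    ])
```

Applying `delta` to the first-order sequence gives the second-order difference coefficients, matching the substitution described above.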
提取所述语音信息库中目标语音的语音特征后,可以生成所述目标语音的特征描述符。本申请中所述的特征描述符可以包括语音特征的VLAD(vector of locally aggregated descriptors,局部特征聚合描述符)特征描述符。下述是本实施例提供的一种生成特征描述符的实施方法,可以将语音特征聚合到单独特征向量中。具体的实施过程可以包括:After the speech features of the target speech in the speech information library are extracted, the feature descriptors of the target speech may be generated. The feature descriptors described in this application may include VLAD (Vector of Locally Aggregated Descriptors) feature descriptors of the speech features. The following is an implementation of generating feature descriptors provided by this embodiment, which can aggregate the speech features into a single feature vector. The specific implementation process may include:
S101’:对所述提取的语音特征通过k-means聚类方法获取所述目标语音的码本;S101': acquiring, by using a k-means clustering method, the codebook of the target voice for the extracted voice feature;
生成所述VLAD描述符通常需要先训练码本,可以从所述提取的语音特征中随机选取N个语音特征,通过k-means聚类方法得到码字数量为k的码本{μ1,...,μk}。所述码本中的每一项为码字,可以表示为一个或者多个相同或者相近语音样本的聚合,例如μ1可以表示为语音信息库中多个不同语气但表示相同含义的目标语音的聚合。这样可以将所述语音信息库中的大量目标语音聚合形成码字数量为k的码本。所述k可以远远小于所述语音信息库中目标语音的数量。Generating the VLAD descriptor usually requires training a codebook first: N speech features can be randomly selected from the extracted speech features, and a codebook {μ1, ..., μk} with k codewords is obtained by k-means clustering. Each entry in the codebook is a codeword, which can represent an aggregation of one or more identical or similar speech samples; for example, μ1 may represent an aggregation of multiple target speeches in the speech information library that have different tones but express the same meaning. In this way, the large number of target speeches in the speech information library can be aggregated into a codebook with k codewords, where k can be far smaller than the number of target speeches in the speech information library.
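A toy k-means training loop for the codebook. A real system would pool far more features and use many more codewords; the initialization, iteration count, and seed here are illustrative assumptions.

```python
import numpy as np

def train_codebook(feats, k, iters=20, seed=0):
    # Plain k-means: learn k codewords {mu_1 .. mu_k} from the pooled features
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # assign every feature to its nearest codeword
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # move each codeword to the mean of its assigned features
        for i in range(k):
            if np.any(labels == i):
                centers[i] = feats[labels == i].mean(axis=0)
    return centers
```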
S102’:获取所述目标语音的语音特征集合,计算所述语音特征集合与所述码本中距离最近的码字的所有残差向量之和。S102': acquiring a voice feature set of the target voice, and calculating a sum of the voice feature set and all residual vectors of the codeword closest to the codebook.
所述语音信息库中一个目标语音的一个或多个语音特征可以形成一个特征集合{x1,...,xp}。对于某个目标语音的语音特征形成的特征集合,依次查找所述特征集合中的每个语音特征在所述码本中距离最近的码字,并计算所述特征集合中的语音特征与当前查找到的距离最近的码字的残差向量,然后累加属于同一个码字的所有残差向量。如下式所示:One or more speech features of one target speech in the speech information library can form a feature set {x1, ..., xp}. For the feature set formed by the speech features of a target speech, the codeword in the codebook closest to each speech feature in the set is looked up in turn, the residual vector between that speech feature and the closest codeword just found is calculated, and then all residual vectors belonging to the same codeword are accumulated, as shown below:
vi = Σ_{xt:NN(xt)=i} (xt − μi)
上式中,xt为所述特征集合{x1,...,xp}中的第t个特征,1≤t≤P,μi为所述码本{μ1,...,μk}中的第i个码字,1≤i≤k。xt:NN(xt)=i表示映射为同一个码字的特征子集。In the above formula, xt is the t-th feature in the feature set {x1, ..., xp}, 1≤t≤P, and μi is the i-th codeword in the codebook {μ1, ..., μk}, 1≤i≤k. xt:NN(xt)=i denotes the subset of features mapped to the same codeword.
S103’:对所述码字的残差向量之和进行归一化,生成所述目标语音的特征描述符。S103': normalizing the sum of the residual vectors of the codewords to generate a feature descriptor of the target speech.
对所述累加后的码字的残差向量进行归一化。可以连接所述归一化后的码字残差向量之和组成目标语音的VLAD总特征描述符V。所述VLAD总特征描述符V可以表示为:The residual vector of the accumulated codeword is normalized. The sum of the normalized codeword residual vectors may be connected to form a VLAD total feature descriptor V of the target speech. The VLAD total feature descriptor V can be expressed as:
V={v′i,...,v′k}V={v' i ,...,v' k }
其中v′i为所述归一化处理后的特征描述符的向量数据,所述的归一化处理可以采用下式进行处理:Where v′ i is the vector data of the normalized processed feature descriptor, and the normalization process can be processed by using the following formula:
v′i = vi / ‖vi‖2
本实施例中可以提取语音信息库中采集存储的目标语音的语音特征,然后可以生成所述目标语音的特征描述符。所述的总特征描述符V中的每一个VLAD特征描述符可以为一个多维的特性向量,例如可以为一个128维的特征向量{l1,l2,l3,......,l128},其中每一个特征向量的每个维度都是目标语音的聚类索引,可以根据所述多维的特征向量v′i定位找到相应的目标语音。In this embodiment, the voice feature of the target voice collected and stored in the voice information database may be extracted, and then the feature descriptor of the target voice may be generated. Each of the VLAD feature descriptors in the total feature descriptor V may be a multi-dimensional feature vector, for example, may be a 128-dimensional feature vector {l1, l2, l3, ..., l128}, Each dimension of each feature vector is a clustering index of the target speech, and the corresponding target speech can be found according to the multi-dimensional feature vector v′ i .
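Putting S101'-S103' together, a compact sketch: residuals to the nearest codeword are accumulated per codeword, each block is normalized (L2 normalization is one common VLAD choice, assumed here), and the blocks are concatenated into the total descriptor V.

```python
import numpy as np

def vlad(feats, codebook):
    # v_i = sum of (x_t - mu_i) over features whose nearest codeword is mu_i,
    # then each v_i is L2-normalized and the blocks are concatenated into V.
    feats = np.asarray(feats, dtype=float)
    k, d = codebook.shape
    nn = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=-1).argmin(axis=1)
    V = np.zeros((k, d))
    for i in range(k):
        if np.any(nn == i):
            V[i] = (feats[nn == i] - codebook[i]).sum(axis=0)
            norm = np.linalg.norm(V[i])
            if norm > 0:
                V[i] /= norm
    return V.ravel()  # the total descriptor V = {v'_1, ..., v'_k}
```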
S2:对所述特征描述符进行量化编码,生成量化编码后的特征描述符,并存储所述特征描述符。S2: Perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor.
实际的实施过程中,所述得到的一个VLAD特征描述符通常达到上千比特,为了加快搜索速度,尤其是在大数据集的应用场景下降低语音信息检索的困难和复杂性,本申请可以对所述获取的VLAD特征描述符进行量化编码,降低所述特征描述符的维度和信息长度,优化搜索查询效率。In actual implementation, a single VLAD feature descriptor obtained as above usually reaches thousands of bits. To speed up searching, and in particular to reduce the difficulty and complexity of speech information retrieval in large-data-set scenarios, this application may quantize and encode the obtained VLAD feature descriptors, reducing the dimension and information length of the feature descriptors and improving search and query efficiency.
具体的所述对所述特征描述符进行量化编码生成量化编码后的特征描述符可以包括:Specifically, performing quantization coding on the feature descriptors to generate the quantization-coded feature descriptors may include:
S201:将每个所述特征描述符等分成L份子向量,对所述L个子向量分别进行聚类,并设置所述子向量聚类后的索引编号,L≥2;S201: Dividing each of the feature descriptors into L sub-vectors, respectively clustering the L sub-vectors, and setting an index number of the sub-vector clustering, L≥2;
S202:将每个所述特征描述符的L个子向量分别用与所述子向量距离最近的所述聚类的索引编号表示,生成量化编码后的特征描述符。S202: The L sub-vectors of each of the feature descriptors are respectively represented by index numbers of the clusters that are closest to the sub-vectors, and the quantized encoded feature descriptors are generated.
图2是本实施例中所述一种对特征描述符进行量化编码的示意图。图2中以VLAD特征描述符为128维特征向量为例,可以将其分成8等份(y1~y8),每一等份中包括128维特征向量中的16维分量(16 components)。可以分别对所述每个分量进行单独聚类并映射到256个聚类中心上(256 centroids)。聚类后的每个子向量可以用索引编号表示,例如图2中的q1(y1)可以表示包括所述128维特征向量的前16维分量。分成的8等份的子向量中每个子向量可以用8位二进制表示,这样128维的特征向量就可以用8bit×8=64bit的信息表示,降低了信息处理维度和信息数据长度,提高处理效率。FIG. 2 is a schematic diagram of quantization coding of a feature descriptor in this embodiment. Taking a VLAD feature descriptor that is a 128-dimensional feature vector as an example, it can be divided into 8 equal parts (y1 to y8), each part containing 16 components of the 128-dimensional feature vector. Each part can be clustered separately and mapped onto 256 cluster centers (256 centroids). Each clustered sub-vector can then be represented by an index number; for example, q1(y1) in FIG. 2 can represent the first 16 components of the 128-dimensional feature vector. Each of the 8 sub-vectors can be represented in 8-bit binary, so the 128-dimensional feature vector can be represented with 8 bit × 8 = 64 bits of information, which reduces the processing dimension and data length and improves processing efficiency.
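A product-quantization-style sketch of the encoding in FIG. 2, shrunk to L = 2 sub-vectors and 4 centroids per sub-codebook so it stays readable; the settings in the text are 8 sub-vectors and 256 centroids (one byte each), and the sub-codebook values below are made up for illustration.

```python
import numpy as np

def pq_encode(desc, sub_codebooks):
    # Split the descriptor into L equal sub-vectors and replace each one by the
    # index of its nearest centroid in the corresponding sub-codebook.
    L = len(sub_codebooks)
    parts = np.split(np.asarray(desc, dtype=float), L)
    codes = [int(np.linalg.norm(cb - part, axis=1).argmin())
             for part, cb in zip(parts, sub_codebooks)]
    return np.array(codes, dtype=np.uint8)  # with 256 centroids: 1 byte per sub-vector
```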
通过上述方法可以将高维度的VLAD特征描述符表示为长度为L,分量为二级聚类索引编号的低维特征向量,可以大大提高后续语音信息的搜索速度。Through the above method, the high-dimensional VLAD feature descriptor can be represented as a low-dimensional feature vector with a length L and a component of a secondary clustering index number, which can greatly improve the search speed of subsequent voice information.
对所述特征描述符进行量化编码形成所述量化编码后特征描述符后,可以存储所述量化编码后的特征描述符。具体的存储时可以包括以所述量化编码后的特征描述符的索引的方式实现。本实施例中优选的可以采用建立K-d树索引,具体的所述存储所述特征描述符实施方式可以包括:After performing quantization quantization on the feature descriptor to form the quantized coded feature descriptor, the quantized coded feature descriptor may be stored. The specific storage may be implemented in the manner of indexing the quantized encoded feature descriptor. Preferably, the K-d tree index may be used in the embodiment, and the storing the feature descriptor implementation manner may include:
S201’:建立高度为(L+1)的K-d树;S201': establishing a K-d tree with a height of (L+1);
S202’:为所述K-d树的非叶子节点划分索引维度和与所述索引维度相对应的划分值;并建立与所述划分值进行比较的结果路径指向;S202': dividing an index dimension and a partition value corresponding to the index dimension for a non-leaf node of the K-d tree; and establishing a result path direction that is compared with the partition value;
S203’:从所述K-d树的根节点开始,将与非叶子节点的索引维度相对应的特征描述符的值与所述非叶子节点的划分值进行比较,并基于比较的结果与所述结果路径指向将所述特征描述符存储至所述K-d树的叶子节点中。S203': starting from a root node of the Kd tree, comparing a value of a feature descriptor corresponding to an index dimension of the non-leaf node with a division value of the non-leaf node, and based on the result of the comparison and the result The path points to store the feature descriptors into the leaf nodes of the Kd tree.
图3是本申请所述建立特征描述符的索引示意图。在图3的示例中,可以建立一个高度为(3+1)的K-d树,所述K-d树最后一层为叶子节点。需要进行存储的特征描述符为3维特征向量,可以预先划分每个非叶子节点的索引维度和划分值,并可以预先设置比较的结果路径指向,例如如果所述特征描述符在该非叶子节点上相应索引维度的值小于该非叶子节点的划分值,则进入该非叶子节点的左子树继续进行比较,否则进行右子树继续进行比较,直至进入叶子节点。需要说明的是,所述的索引维度可以表示为特征描述符中索引维度的值所指示的那一个维度的值,如果某一非叶子节点的上层节点已经比较过的维度则不再进行比较。因此,在非叶子节点索引维度划分的时候,从所述K-d树根节点到当前非叶子节点的路径上已经划分过的索引维度值不再划分给当前的非叶子节点。FIG. 3 is a schematic diagram of an index for establishing a feature descriptor according to the present application. In the example of FIG. 3, a K-d tree of height (3+1) can be established, and the last layer of the K-d tree is a leaf node. The feature descriptor that needs to be stored is a 3-dimensional feature vector, and the index dimension and the partition value of each non-leaf node may be pre-divided, and the result path direction of the comparison may be preset, for example, if the feature descriptor is at the non-leaf node If the value of the corresponding index dimension is smaller than the division value of the non-leaf node, the left subtree entering the non-leaf node continues to be compared, otherwise the right subtree is continued to be compared until the leaf node is entered. It should be noted that the index dimension may be represented as the value of the dimension indicated by the value of the index dimension in the feature descriptor. If the dimension of the upper node of a non-leaf node has been compared, the comparison is no longer performed. Therefore, when the non-leaf node index dimension is divided, the index dimension values that have been divided from the K-d tree root node to the current non-leaf node are no longer allocated to the current non-leaf node.
上述为非叶子节点划分的划分值,在具体的实施中可以取值上述量化编码时设置的所述聚类中心个数的一半,如上述量化编码时设置的聚类中心个数为256,在设置所述划分值时可以取值为128。The above-mentioned division value of the non-leaf node division may be half of the number of cluster centers set in the above-mentioned quantization coding in a specific implementation, and the number of cluster centers set in the above-mentioned quantization coding is 256. The value of the division value can be set to 128.
在上述例子中,可以分别将3维特征向量的特征描述符存储到叶子节点中。如一个特征描述符为(45,210,60)的3维特征向量,按照图3建立的所述K-d树可以先从根节点开始,将特征描述符(45,210,60)与根节点中所示的索引维度为2、划分值为128进行相应索引维度的值的比较。当前节点索引维度为2所对应的所述特征描述符(45,210,60)的值为210,大于当前节点划分值128,则进入根节点的右子树。然后所述特征描述符(45,210,60)继续与当前非叶子节点设置的索引维度为1、划分值为128进行相应索引维度值的比较,所述特征描述符(45,210,60)索引维度为1所对应的值为45,小于当前非叶子节点的划分值128,则进入当前非叶子节点的左子树。同样的方法,可以按照当前非叶子节点设置索引维度为3、划分值为128将所述特征描述符(45,210,60)的值60与128进行比较,然后进入当前非叶子节点的左侧,到达所述K-d树的叶子节点。此时,可以将所述特征描述符(45,210,60)存入至该叶子节点中。对应更高维的特征描述符,可以建立相应的K-d树,仍然按照上述方法进行存储到叶子节点中。In the above example, the feature descriptors, each a 3-dimensional feature vector, can be stored into the leaf nodes. For instance, for a feature descriptor that is the 3-dimensional feature vector (45, 210, 60), the K-d tree built as in FIG. 3 starts from the root node, where the index dimension is 2 and the partition value is 128, and the descriptor's value on that index dimension is compared with the partition value. The value of the feature descriptor (45, 210, 60) on index dimension 2 is 210, which is greater than the partition value 128 of the current node, so the search enters the right subtree of the root node. The feature descriptor (45, 210, 60) is then compared at the current non-leaf node, whose index dimension is 1 and partition value is 128; its value on index dimension 1 is 45, which is smaller than the partition value 128, so the search enters the left subtree of the current non-leaf node. In the same way, with the current non-leaf node's index dimension set to 3 and partition value to 128, the value 60 of the feature descriptor (45, 210, 60) is compared with 128, the search enters the left side of the current non-leaf node, and a leaf node of the K-d tree is reached. At this point, the feature descriptor (45, 210, 60) can be stored in that leaf node. For higher-dimensional feature descriptors, a corresponding K-d tree can be built and the descriptors stored into its leaf nodes in the same way.
上述K-d中的最后一层为叶子节点,每个叶子节点存储的为所有经过从根节点到该叶子节点这条搜索路径的特征描述符的特征向量。其中每一个特征向量的每个维度都是目标语音的聚类索引。这样,每个叶子节点存储的是上述建立的K-d索引中相同路径的特征描述符的集合,在此可以将一个叶子节点存储的特征描述符作为一个目标候选集。The last layer in the above K-d is a leaf node, and each leaf node stores a feature vector of all feature descriptors passing through the search path from the root node to the leaf node. Each dimension of each of the feature vectors is a clustering index of the target speech. Thus, each leaf node stores a set of feature descriptors of the same path in the above-mentioned established K-d index, where a feature descriptor stored by one leaf node can be used as a target candidate set.
本申请所述方法优选的实施例中,为保证每个所述叶子节点存储的目标候选集容量相当,均衡各个叶子节点存储的特征描述符的数量,所述为非叶子节点划分索引维度可以包括:为非叶子节点划分的索引维度值S为随机生成的取值范围为1≤S≤L的整数,并且当前非叶子节点的索引维度值S为从所述K-d树根节点到所述当前非叶子节点的路径上未划分过的索引维度值。上述中的L为所述特征描述符的长度,即所述特征描述符中特征向量的维数。这样,非叶子节点所述索引维度的随机划分,提高K-d树左右两侧节点的均匀分布性,可以避免部分叶子节点存储的目标候选集内的特征描述符数量过多,可以提高后续在目标候选集内的搜索速度,均衡各个目标候选集的负载。In a preferred embodiment of the method described in this application, to ensure that the target candidate sets stored in the leaf nodes are of comparable size and to balance the number of feature descriptors stored in each leaf node, partitioning the index dimensions for the non-leaf nodes may include: the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1≤S≤L, and the index dimension value S of the current non-leaf node has not yet been assigned on the path from the root of the K-d tree to the current non-leaf node. Here L is the length of the feature descriptor, i.e. the dimension of the feature vector in the feature descriptor. Randomly partitioning the index dimensions of the non-leaf nodes in this way makes the nodes on the left and right sides of the K-d tree more evenly distributed, avoids some leaf nodes storing too many feature descriptors in their target candidate sets, speeds up subsequent searching within a target candidate set, and balances the load across the target candidate sets.
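A stripped-down sketch of the descent described above, with preset split dimensions and the split value 128. The tree shape and node layout are illustrative (the text's example tree has height 4; a height-3 tree keeps the sketch short), and dimensions are 0-indexed here, so "dimension 1" in the code is the second component.

```python
def make_leaf():
    return {'leaf': []}  # a leaf holds one target candidate set of descriptors

def make_node(dim, split, left, right):
    return {'dim': dim, 'split': split, 'left': left, 'right': right}

def kd_insert(tree, desc):
    # Descend: at each inner node compare the descriptor's value on that node's
    # split dimension with the split value; < goes left, >= goes right.
    node = tree
    while 'leaf' not in node:
        node = node['left'] if desc[node['dim']] < node['split'] else node['right']
    node['leaf'].append(desc)
    return node

def kd_lookup(tree, desc):
    # The same descent locates the candidate set for a query descriptor.
    node = tree
    while 'leaf' not in node:
        node = node['left'] if desc[node['dim']] < node['split'] else node['right']
    return node['leaf']
```

Because lookup follows exactly the same path as insertion, a query lands in the leaf holding the stored descriptors that share its path.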
S3:获取待识别语音的特征描述符,在所述存储的所述特征描述符中,查找与所述待识别语音的特征描述符相匹配的特征描述符对应的目标语音,将查找到的目标语音作为所述待识别语音对应的目标候选集;S3: acquiring a feature descriptor of the voice to be recognized, searching, in the stored feature descriptor, a target voice corresponding to the feature descriptor that matches the feature descriptor of the to-be-recognized voice, and finding the target Voice as a target candidate set corresponding to the to-be-recognized voice;
对于给定的待识别语音,可以按照上述所述的方法提取待识别语音的语音特征,并获取所述待识别语音的量化编码后的特征描述符。可以将所述待识别语音的特征描述符与所述目标语音的特征描述符匹配,查找到与所述待识别语音的特征描述符相匹配的特征描述符对应的目标语音,将查找到的目标语音作为所述待识别语音对应的目标候选集。所述与所述待识别语音的特征描述符相匹配的特征描述符对应的目标语音在本实施例中可以包括在建立的K-d树中与所述待识别语音的特征描述符路径相同的叶子节点中存储的特征描述符所对应的目标语音。For a given speech to be recognized, the speech feature of the speech to be recognized may be extracted according to the method described above, and the quantized and encoded feature descriptor of the speech to be recognized may be acquired. The feature descriptor of the to-be-recognized voice may be matched with the feature descriptor of the target voice, and the target voice corresponding to the feature descriptor matching the feature descriptor of the to-be-recognized voice may be found, and the target to be found is found. The voice is used as a target candidate set corresponding to the to-be-recognized voice. The target voice corresponding to the feature descriptor matching the feature descriptor of the to-be-identified voice may include a leaf node in the established Kd tree that is the same as the feature descriptor path of the to-be-recognized voice. The target voice corresponding to the feature descriptor stored in it.
例如通过类似上述图3所示的实施方式,可以将待识别语音生成与叶子节点存储的特征描述符的维度相同的特征描述符,例如(90,135,78)。然后可以按照图3所示的K-d树及设置的索引维度和划分值进行特征描述符的查找,最终获取到一个叶子节点的与待识别语音的特征描述符相匹配的目标候选集。所述待识别语音的特征描述符在所述K-d树中的查找过程可以参照前述目标语音的特征描述符(45,210,60)的存储过程,可以查找到待识别语音的特征描述符(90,135,78)对应的目标语音,确定目标候选集,如图3所示的(90,135,75)所在的叶子节点。这样实现了将待识别语音定位到一个范围较小的目标候选集所对应的目标语音集中。For example, in an implementation similar to that shown in FIG. 3, a feature descriptor with the same dimension as those stored in the leaf nodes, e.g. (90, 135, 78), can be generated for the speech to be recognized. The feature descriptor lookup can then be performed following the K-d tree of FIG. 3 and its configured index dimensions and partition values, finally obtaining the target candidate set of the leaf node that matches the feature descriptor of the speech to be recognized. The lookup of the feature descriptor of the speech to be recognized in the K-d tree can follow the storage procedure of the target speech feature descriptor (45, 210, 60) described above; the target speeches corresponding to the feature descriptor (90, 135, 78) of the speech to be recognized can be found and the target candidate set determined, e.g. the leaf node containing (90, 135, 75) in FIG. 3. In this way the speech to be recognized is located within the target speech set corresponding to a target candidate set of much smaller scope.
S4:根据预定规则在所述目标候选集中选取所述待识别语音的搜索结果。S4: Select the search result of the to-be-identified voice in the target candidate set according to a predetermined rule.
获取目标候选集后,已经大大缩小了搜索范围。然后可以根据预先设置的预定做进一步精选,获取所述待识别语音的搜索结果。本实施例中提供一种在所述目标候选集中进一步精选搜索结果的方法,因此,本实施例中具体所述根据预定规则在所述目标候选集中选取所述待识别语音的搜索结果可以包括:After obtaining the target candidate set, the search scope has been greatly reduced. Then, further selection may be made according to a preset schedule to obtain a search result of the to-be-identified voice. In this embodiment, a method for further selecting a search result in the target candidate set is provided. Therefore, the search result that specifically selects the to-be-identified voice in the target candidate set according to a predetermined rule in the embodiment may include: :
在所述目标候选集中选取与所述待识别语音的特征描述符欧氏距离最小的前R个特征描述符作为搜索结果集,以所述搜索结果集所对应的目标语音作为所述待识别语音的搜索结果,R≥1。Selecting, in the target candidate set, a top R feature descriptors having the smallest Euclidean distance from the feature descriptor of the to-be-recognized speech as a search result set, and using the target speech corresponding to the search result set as the to-be-recognized speech Search results for R≥1.
本实施例中可以计算所述待识别语音的特征描述符与所述目标候选集中的特征向量的欧式距离,可以将计算得出的欧式距离按照递增的顺序进行排序,选取所述欧式距离最小的前R个特征描述作为搜索结果集。当然,所述搜索结果集内的特征描述符所述对应的语音信息库中的目标语音即为所述待识别语音的搜索结果。In this embodiment, the Euclidean distance between the feature descriptor of the to-be-recognized speech and the feature vector in the target candidate set may be calculated, and the calculated Euclidean distance may be sorted in an increasing order, and the Euclidean distance is selected to be the smallest. The first R feature descriptions are used as a search result set. Certainly, the target voice in the corresponding voice information database in the feature descriptor in the search result set is the search result of the voice to be recognized.
这里所述的选取的欧式距离最小的前R个结果中R的取值范围可以根据需求进行自行设置。例如所述R可以取值为1,可以表示为选取所述欧式距离最小的特征描述符,如待识别语音的特征描述符为(90,135,78),在获取的目标候选集中可以选出欧式距离最小的特征描述符(90,135,75)作为所述搜索结果集。当然,在实际的应用中,所述R也可以取值大于1,小于所述目标候选集内特征描述符的个数,例如可以取值为3,这样可以从所述目标候选集中选出欧式距离最小的前3个特征描述符(90,135,78)、(87,135,80)、(101,137,81)作为搜索结果集,可以将欧式距离最小的特征描述符(90,135,78)所对应的目标语音作为待识别语音的优选搜索结果,将其余两个搜索结果选集的特征描述符所对应的目标语音作为参考或者备选搜索结果。The range of values of R in the top R results with the smallest Euclidean distance selected here can be set according to requirements. For example, the R may take a value of 1, and may be represented as selecting a feature descriptor with the smallest Euclidean distance. For example, the feature descriptor of the speech to be recognized is (90, 135, 78), and the Euclidean distance may be selected in the acquired target candidate set. The smallest feature descriptor (90, 135, 75) is used as the search result set. Of course, in an actual application, the R may also take a value greater than 1, and is smaller than the number of feature descriptors in the target candidate set, for example, may be 3, so that the European candidate may be selected from the target candidate set. The minimum of the first three feature descriptors (90, 135, 78), (87, 135, 80), (101, 137, 81) as the search result set, the target speech corresponding to the feature descriptor (90, 135, 78) with the smallest Euclidean distance As a preferred search result of the speech to be recognized, the target speech corresponding to the feature descriptors of the remaining two search result sets is used as a reference or an alternative search result.
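The final re-ranking step, sketched with the numbers from the example above; `R` and the candidate values are illustrative.

```python
import numpy as np

def top_r(query, candidates, R=3):
    # Rank the candidate set by Euclidean distance to the query descriptor
    # and keep the R nearest ones, in increasing order of distance.
    c = np.asarray(candidates, dtype=float)
    d = np.linalg.norm(c - np.asarray(query, dtype=float), axis=1)
    return [candidates[i] for i in np.argsort(d)[:R]]
```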
本申请提供的一种语音信息搜索方法,可以提取目标语音的语音底层特征,生成相应的VLAD特征描述符,用于表示目标语音的特征集合,可以根据所述特征描述符定位到相应的目标语音。本申请引入VLAD特征描述符将语音特征集合转换成长度固定的整体特征,然后将信息长度较长、维度较高的VLAD特征描述符经过量化编码转换为向量长度较小、维度更低的特征向量,大大提高了信息读取、解析、索引速度。最后通过K-d树索引,建立目标候选集,将待识别语音通过K-d树索引进行搜索获取搜索范围更小的目标候选集,在所述目标候选集中进一步精选得到搜索结果,进一步加快了搜索速度。The speech information search method provided by this application can extract low-level speech features of the target speech and generate corresponding VLAD feature descriptors that represent the feature set of the target speech, so that the corresponding target speech can be located from the feature descriptors. This application introduces VLAD feature descriptors to convert a set of speech features into a single fixed-length overall feature, and then converts the long, high-dimensional VLAD feature descriptor by quantization coding into a feature vector of smaller length and lower dimension, greatly improving the speed of reading, parsing, and indexing the information. Finally, target candidate sets are established through a K-d tree index; the speech to be recognized is searched through the K-d tree index to obtain a target candidate set of much smaller scope, and the search results are further refined within that target candidate set, which further speeds up the search.
基于本申请所述语音信息搜索方法,本申请还提供一种语音信息搜索装置,该装置可以将语音信息库中的目标语音按照规则分别存储到相应的候选集模块中,并建立相应的索引机制,在获取待识别语音后可以快速的通过索引获取目标候选集,进一步的获取搜索结果,从而将传统的发音模板的语音识别转换成语音特征的搜索,并且通过量化编码和优化索引,提高了搜索速度,优化查询效率。图4是本申请所述一种语音信息搜索装置的模块结构示意图,如图4所示,所述装置可以包括:Based on the voice information searching method of the present application, the present application further provides a voice information searching device, which can store target voices in a voice information database according to rules into corresponding candidate set modules, and establish a corresponding indexing mechanism. After obtaining the speech to be recognized, the target candidate set can be quickly obtained through the index, and the search result is further obtained, thereby converting the speech recognition of the traditional pronunciation template into the search of the speech feature, and improving the search by quantitatively encoding and optimizing the index. Speed, optimize query efficiency. 4 is a block diagram of a structure of a voice information search device according to the present application. As shown in FIG. 4, the device may include:
信息获取模块101,可以用于获取目标语音,并提取所述目标语音的语音特征;The information acquiring module 101 is configured to acquire a target voice, and extract a voice feature of the target voice;
描述符模块102,可以用于基于所述目标语音的语音特征生成所述目标语音的特征描述符;The descriptor module 102 may be configured to generate a feature descriptor of the target voice based on a voice feature of the target voice;
量化编码模块103,可以用于对所述特征描述符进行量化编码,生成量化编码后的特征描述符,并存储所述特征描述符;The quantization coding module 103 may be configured to perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor;
识别信息模块104,可以用于获取待识别语音的特征描述符;The identification information module 104 can be configured to acquire a feature descriptor of the voice to be recognized;
第一搜索模块105,可以用于在所述存储的所述特征描述符中,查找与所述待识别语音的特征描述符相匹配的特征描述符对应的目标语音,将查找到的目标语音作为所述待识别语音对应的目标候选集;The first search module 105 may be configured to: in the stored feature descriptor, search for a target voice corresponding to the feature descriptor that matches the feature descriptor of the to-be-recognized voice, and use the found target voice as a target candidate set corresponding to the to-be-recognized voice;
第二搜索模块106,可以用于根据预定规则在所述目标候选集中选取所述待识别语音的搜索结果。The second search module 106 may be configured to select the search result of the to-be-identified voice in the target candidate set according to a predetermined rule.
所在语音搜索装置中,所述信息获取模块101提取目标语音的语音特征的方式可以包括基于MFCC和PLP方法的提取方式。例如利用MFCC提取语音特征的过程可以包括:In the voice search device, the manner in which the information acquisition module 101 extracts the voice features of the target voice may include an extraction method based on the MFCC and the PLP method. For example, the process of extracting speech features using MFCC may include:
对所述目标语音进行预处理;Preprocessing the target speech;
计算所述预处理后的目标语音的能量谱;Calculating an energy spectrum of the pre-processed target speech;
对所述能量谱进行Mel滤波,计算所述Mel滤波后的能量谱的对数;Performing Mel filtering on the energy spectrum to calculate a logarithm of the energy spectrum after the Mel filtering;
对所述能量谱的对数进行DCT变换得到MFCC系数,获取所述目标语音的语音特征。Performing DCT transform on the logarithm of the energy spectrum to obtain MFCC coefficients, and acquiring speech features of the target speech.
其中,所述对所述目标语音进行预处理具体的可以包括:对所述目标语音进行语音格式转换、预加重、分帧、加窗处理。The pre-processing the target voice may include: performing voice format conversion, pre-emphasis, framing, and windowing on the target voice.
当然,如上述,为进一步提高装置语音识别率,所述利用MFCC提取语音特征的过程还可以包括:As a matter of course, in order to further improve the voice recognition rate of the device, the process of extracting voice features by using the MFCC may further include:
计算所述MFCC系数的一阶和二阶差分系数,将所述一阶和二阶差分系数添加至所述语音特征中。Calculating the first- and second-order difference coefficients of the MFCC coefficients, and adding the first- and second-order difference coefficients to the speech features.
本实施例具体的提取目标语音的语音特征的实施过程可以参照本申请其他实施例中的叙述,在此不做赘述。For the implementation of the voice feature of the target voice in this embodiment, reference may be made to the description in other embodiments of the present application, and details are not described herein.
另一种实施例中,所述量化编码模块103具体的可以包括:In another embodiment, the quantization and coding module 103 may specifically include:
聚类模块,可以用于将每个所述特征描述符等分成L份子向量,对所述L个子向量分别进行聚类,并设置所述子向量聚类后的索引编号,L≥2;a clustering module, which may be configured to divide each of the feature descriptors into L equal sub-vectors, cluster the L sub-vectors separately, and set index numbers for the clustered sub-vectors, L≥2;
映射模块,可以用于将每个所述特征描述符的L个子向量分别用与所述子向量距离最近的所述聚类的索引编号表示,形成量化编码后的特征描述符。The mapping module may be configured to respectively represent L sub-vectors of each of the feature descriptors by an index number of the cluster that is closest to the sub-vector, to form a quantized encoded feature descriptor.
在对所述子向量进行聚类时,每个子向量可以用8位二进制表示,这样可以将每个特征描述符转换成长度为L、维度更低的特征向量。When the sub-vectors are clustered, each sub-vector can be represented in 8-bit binary, so that each feature descriptor can be converted into a feature vector of length L with lower dimension.
FIG. 5 is a schematic structural diagram of a quantization and coding module provided by the present application. As shown in FIG. 5, in another embodiment, the quantization and coding module 103 may specifically include:
an index tree construction module 1031, which may be used to build a K-d tree of height (L+1);
a preset rule module 1032, which may be used to assign, according to preset rules, an index dimension and a partition value corresponding to the index dimension to each non-leaf node of the K-d tree, and to establish the result-path direction used when comparing against the partition value;
a storage module 1033, which may be used to compare, starting from the root node of the K-d tree, the value of a feature descriptor at the index dimension of each non-leaf node with that node's partition value, and to store the feature descriptor into a leaf node of the K-d tree based on the comparison result and the result-path direction.
The established result-path direction may be as follows: the feature descriptor value is compared with the partition value of the current non-leaf node; if the former is greater than the latter, the comparison continues in the left subtree of the current non-leaf node, otherwise it continues in the right subtree. Of course, the rule may instead be set so that the left subtree is entered when the former is smaller than the latter; this can be configured as required.
In a preferred embodiment, the preset rule module 1032 assigning an index dimension to a non-leaf node may specifically include:
the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1 ≤ S ≤ L, and the index dimension value S of the current non-leaf node must not have been assigned already on the path from the root node of the K-d tree to the current non-leaf node.
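The storage scheme above can be sketched roughly as follows: a tree whose L internal levels each test one distinct, randomly chosen dimension of the length-L quantized code against a partition value, with descriptors sharing the same comparison path collected in the same leaf bucket. The class name, the constant partition value, and the dict-of-buckets representation are simplifying assumptions, not the patented structure.

```python
import random
from collections import defaultdict

class KdIndex:
    """Height-(L+1) comparison tree over length-L quantized codes.

    Each of the L internal levels tests one distinct, randomly chosen
    dimension against a partition value; the resulting sequence of
    comparison outcomes is the path that selects the leaf bucket.
    """
    def __init__(self, L, partition_value=128, seed=0):
        rng = random.Random(seed)
        # each dimension used exactly once on the root-to-leaf path
        self.dims = rng.sample(range(L), L)
        # one split value per level (assumed constant here for simplicity)
        self.partition = [partition_value] * L
        self.leaves = defaultdict(list)

    def path(self, code):
        # 'greater goes one way, otherwise the other', as described above
        return tuple(code[d] > p for d, p in zip(self.dims, self.partition))

    def insert(self, code, payload):
        self.leaves[self.path(code)].append(payload)

    def candidates(self, code):
        # all stored items sharing the query code's leaf path
        return self.leaves[self.path(code)]
```

Descriptors with identical paths land in the same leaf, so a query only has to be compared against that leaf's bucket rather than the whole collection.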
FIG. 6 is a schematic structural diagram of a second search module provided by the present application. As shown in FIG. 6, the second search module 106 of the above voice search apparatus may specifically include:
a distance calculation module 1061, which may be used to calculate the Euclidean distance between each feature descriptor in the target candidate set and the feature descriptor of the speech to be recognized;
a screening module 1062, which may be used to select, from the target candidate set, the top R feature descriptors with the smallest Euclidean distance to the feature descriptor of the speech to be recognized as the search result set, R ≥ 1;
a target speech module 1063, which may be used to acquire the target speech corresponding to the feature descriptors in the search result set.
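A minimal sketch of this screening step, under the assumption that candidates arrive as (identifier, descriptor) pairs; all names are illustrative:

```python
import math

def top_r_matches(candidates, query, R=3):
    """Rank candidate (id, descriptor) pairs by Euclidean distance to the
    query descriptor and keep the R closest as the search result set."""
    def dist(desc):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(desc, query)))
    ranked = sorted(candidates, key=lambda item: dist(item[1]))
    return ranked[:R]
```

The identifiers of the returned pairs would then be mapped back to their target speech, as module 1063 describes.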
In another embodiment, the descriptor module 102 may specifically include:
a codebook training module, which may be used to obtain the codebook of the target speech by applying a k-means clustering method to the extracted speech features;
a speech feature set module, which may be used to obtain a speech feature set and to calculate the sum of all residual vectors between the speech features and their nearest codeword in the codebook; the speech feature set may be, for example, {x1, ..., xp}, where each x may represent a speech feature corresponding to one target speech;
a normalization processing module, which may be used to normalize the sums of the codeword residual vectors to generate the feature descriptor of the target speech.
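Read this way, the residual-sum construction resembles a VLAD-style descriptor. The following sketch assumes that reading (NumPy, illustrative names) rather than reproducing the patented computation: each feature's residual to its nearest codeword is accumulated per codeword, and the concatenation is L2-normalized.

```python
import numpy as np

def describe(features, codebook):
    """Sum residuals of each feature to its nearest codeword, then
    L2-normalize the concatenated result into one fixed-length descriptor.
    features: iterable of (d,) arrays; codebook: (K, d) array."""
    K, d = codebook.shape
    agg = np.zeros((K, d))
    for x in features:
        nearest = ((codebook - x) ** 2).sum(1).argmin()
        agg[nearest] += x - codebook[nearest]
    v = agg.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

The result is a single K·d-dimensional, unit-norm vector per target speech, regardless of how many frame-level features the speech produced.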
The voice information search apparatus described in the present application can be applied to a variety of terminal devices or servers. For example, a smart mobile terminal of the kind commonly used in daily life can acquire voice information to be recognized and send it to a server; the server can then perform a voice search using the embodiments of the voice information search method and apparatus described in the present application, obtain the corresponding voice search results, and carry out further processing based on those results. Accordingly, the present application further provides a voice information search server, the server being configured to include:
a first processing unit, configured to acquire target speech and generate feature descriptors of the target speech, and further configured to quantize and encode the feature descriptors;
a storage unit, configured to store, grouped by path, the quantized and encoded feature descriptors that share the same path;
a second processing unit, configured to acquire the feature descriptor of the speech to be recognized, to search the stored feature descriptors for the target speech whose feature descriptors match the speech to be recognized so as to obtain a candidate set, and to select from the candidate set, according to a predetermined rule, the search result for the speech to be recognized.
Of course, the server may further return the obtained search result to the client that sent the speech to be recognized, or perform other processing in combination with functional modules of this or other servers. The voice search server provided in this embodiment combines feature descriptors with an index tree, optimizing the voice indexing method and improving the server's voice search speed.
The voice information search method, apparatus, and server provided by the present application can store target speech into corresponding candidate sets according to rules and build a corresponding indexing mechanism, so that after the speech to be recognized is acquired, the target candidate set can be obtained quickly through the index and the search results retrieved from it. Speech recognition based on traditional pronunciation templates is thereby converted into a search over speech features, and the index is optimized through quantization coding and the K-d tree, improving search speed and query efficiency.
Although the description above mentions information transmission, data transformation, tree-structured data, and the like, the present application is not limited to cases that fully comply with standard communication protocols or data processing standards. The solutions of the embodiments of the present application can also be implemented with transmission mechanisms or data processing standards slightly modified from certain protocols or standards. Of course, even if a proprietary protocol or data processing standard is adopted instead of the general protocols or standards above, the same application can still be realized as long as the information interaction and judgment-feedback manner of the above embodiments is followed; details are not repeated here.
The units, apparatuses, or modules set forth in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. For convenience of description, the above apparatuses are described with their functions divided into various modules. Of course, when implementing the present application, the functions of the modules may be implemented in one or more pieces of software and/or hardware, or modules implementing the same function may be implemented by a combination of multiple sub-modules or sub-units.
Those skilled in the art also know that, besides implementing a controller purely in computer-readable program code, it is entirely possible, by logically programming the method steps, to have the controller achieve the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included within it for realizing various functions can also be regarded as structures within the hardware component. Or the means for realizing various functions can even be regarded as both software modules implementing the method and structures within the hardware component.
The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as ROM/RAM, magnetic disk, or optical disc, and which includes instructions causing a computer device (which may be a personal computer, mobile terminal, server, network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. The present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Although the present application has been described through embodiments, those of ordinary skill in the art will appreciate that there are many variations and changes to the present application without departing from its spirit, and it is intended that the appended claims cover such variations and changes without departing from the spirit of the present application.

Claims (16)

  1. A voice information search method, characterized in that the method comprises:
    extracting speech features of target speech in a voice information base, and generating feature descriptors of the target speech;
    quantizing and encoding the feature descriptors to generate quantized and encoded feature descriptors, and storing the feature descriptors;
    acquiring a feature descriptor of speech to be recognized; searching, among the stored feature descriptors, for the target speech corresponding to the feature descriptors that match the feature descriptor of the speech to be recognized, and taking the found target speech as a target candidate set corresponding to the speech to be recognized;
    selecting, according to a predetermined rule, a search result for the speech to be recognized from the target candidate set.
  2. The voice information search method according to claim 1, characterized in that quantizing and encoding the feature descriptors to generate quantized and encoded feature descriptors comprises:
    dividing each feature descriptor equally into L sub-vectors, clustering the L sub-vectors respectively, and setting index numbers for the clustered sub-vectors, L ≥ 2;
    representing each of the L sub-vectors of each feature descriptor by the index number of the cluster closest to that sub-vector, to generate the quantized and encoded feature descriptor.
  3. The voice information search method according to claim 2, characterized in that storing the feature descriptors comprises:
    building a K-d tree of height (L+1);
    assigning, to each non-leaf node of the K-d tree, an index dimension and a partition value corresponding to the index dimension, and establishing a result-path direction for comparisons against the partition value;
    starting from the root node of the K-d tree, comparing the value of a feature descriptor at the index dimension of each non-leaf node with that node's partition value, and storing the feature descriptor into a leaf node of the K-d tree based on the comparison result and the result-path direction.
  4. The voice information search method according to claim 3, characterized in that assigning index dimensions to non-leaf nodes comprises:
    the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1 ≤ S ≤ L, and the index dimension value S of the current non-leaf node has not been assigned on the path from the root node of the K-d tree to the current non-leaf node.
  5. The voice information search method according to claim 1, characterized in that selecting, according to a predetermined rule, a search result for the speech to be recognized from the target candidate set comprises:
    selecting, from the target candidate set, the top R feature descriptors with the smallest Euclidean distance to the feature descriptor of the speech to be recognized as a search result set, and taking the target speech corresponding to the search result set as the search result for the speech to be recognized, R ≥ 1.
  6. The voice information search method according to claim 1, characterized in that generating feature descriptors of the target speech comprises:
    obtaining a codebook of the target speech by applying a k-means clustering method to the extracted speech features;
    acquiring a speech feature set of the target speech, and calculating the sum of all residual vectors between the speech feature set and the nearest codeword in the codebook;
    normalizing the sums of the codeword residual vectors to generate the feature descriptor of the target speech.
  7. The voice information search method according to claim 1, characterized in that extracting speech features of target speech in the voice information base comprises:
    preprocessing the target speech;
    calculating the energy spectrum of the preprocessed target speech;
    applying Mel filtering to the energy spectrum, and calculating the logarithm of the Mel-filtered energy spectrum;
    applying a DCT transform to the logarithm of the energy spectrum to obtain MFCC coefficients, thereby acquiring the speech features of the target speech.
  8. The voice information search method according to claim 7, characterized in that preprocessing the target speech comprises: performing speech format conversion, pre-emphasis, framing, and windowing on the target speech.
  9. The voice information search method according to claim 8, characterized in that the method further comprises:
    calculating first-order and second-order difference coefficients of the MFCC coefficients, and adding the first-order and second-order difference coefficients to the speech features.
  10. A voice information search apparatus, characterized in that the apparatus comprises:
    an information acquisition module, configured to acquire target speech and extract speech features of the target speech;
    a descriptor module, configured to generate feature descriptors of the target speech based on the speech features of the target speech;
    a quantization coding module, configured to quantize and encode the feature descriptors, generate quantized and encoded feature descriptors, and store the feature descriptors;
    a recognition information module, configured to acquire a feature descriptor of speech to be recognized;
    a first search module, configured to search, among the stored feature descriptors, for the target speech corresponding to the feature descriptors that match the feature descriptor of the speech to be recognized, and to take the found target speech as a target candidate set corresponding to the speech to be recognized;
    a second search module, configured to select, according to a predetermined rule, a search result for the speech to be recognized from the target candidate set.
  11. The voice information search apparatus according to claim 10, characterized in that the quantization coding module comprises:
    a clustering module, configured to divide each feature descriptor equally into L sub-vectors, cluster the L sub-vectors respectively, and set index numbers for the clustered sub-vectors, L ≥ 2;
    a mapping module, configured to represent each of the L sub-vectors of each feature descriptor by the index number of the cluster closest to that sub-vector, to form the quantized and encoded feature descriptor.
  12. The voice information search apparatus according to claim 10, characterized in that the quantization coding module comprises:
    an index tree construction module, configured to build a K-d tree of height (L+1);
    a preset rule module, configured to assign, according to preset rules, an index dimension and a partition value corresponding to the index dimension to each non-leaf node of the K-d tree, and to establish a result-path direction for comparisons against the partition value;
    a storage module, configured to compare, starting from the root node of the K-d tree, the value of a feature descriptor at the index dimension of each non-leaf node with that node's partition value, and to store the feature descriptor into a leaf node of the K-d tree based on the comparison result and the result-path direction.
  13. The voice information search apparatus according to claim 12, characterized in that the preset rule module assigning index dimensions to non-leaf nodes comprises:
    the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1 ≤ S ≤ L, and the index dimension value S of the current non-leaf node has not been assigned on the path from the root node of the K-d tree to the current non-leaf node.
  14. The voice information search apparatus according to claim 10, characterized in that the second search module comprises:
    a distance calculation module, configured to calculate the Euclidean distance between each feature descriptor in the target candidate set and the feature descriptor of the speech to be recognized;
    a screening module, configured to select, from the target candidate set, the top R feature descriptors with the smallest Euclidean distance to the feature descriptor of the speech to be recognized as a search result set, R ≥ 1;
    a target speech module, configured to acquire the target speech corresponding to the feature descriptors in the search result set.
  15. The voice information search apparatus according to claim 10, characterized in that the descriptor module comprises:
    a codebook training module, configured to obtain a codebook of the target speech by applying a k-means clustering method to the extracted speech features;
    a speech feature set module, configured to acquire a speech feature set and calculate the sum of all residual vectors between the speech feature set and the nearest codeword in the codebook;
    a normalization processing module, configured to normalize the sums of the codeword residual vectors to generate the feature descriptor of the target speech.
  16. A voice information search server, characterized in that the server is configured to comprise:
    a first processing unit, configured to acquire target speech and generate feature descriptors of the target speech, and further configured to quantize and encode the feature descriptors;
    a storage unit, configured to store, grouped by path, the quantized and encoded feature descriptors that share the same path;
    a second processing unit, configured to acquire a feature descriptor of speech to be recognized, to search the stored feature descriptors for the target speech whose feature descriptors match the speech to be recognized so as to obtain a candidate set, and to select from the candidate set, according to a predetermined rule, the search result for the speech to be recognized.
PCT/CN2016/071164 2015-01-26 2016-01-18 Voice information search method and apparatus, and server WO2016119604A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510037176.9 2015-01-26
CN201510037176.9A CN105893389A (en) 2015-01-26 2015-01-26 Voice message search method, device and server

Publications (1)

Publication Number Publication Date
WO2016119604A1 true WO2016119604A1 (en) 2016-08-04



Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339457A (en) * 2018-12-18 2020-06-26 富士通株式会社 Method and apparatus for extracting information from web page and storage medium
CN111862967A (en) * 2020-04-07 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112185363A (en) * 2020-10-21 2021-01-05 北京猿力未来科技有限公司 Audio processing method and device
CN111862967B (en) * 2020-04-07 2024-05-24 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium

Families Citing this family (6)

Publication number Priority date Publication date Assignee Title
CN106683664A (en) * 2016-11-22 2017-05-17 中南大学 Voice starting method and system for wireless charging
CN106919662B (en) * 2017-02-14 2021-08-31 复旦大学 Music identification method and system
CN107229691B (en) * 2017-05-19 2021-11-02 上海掌门科技有限公司 Method and equipment for providing social contact object
CN109102824B (en) * 2018-07-06 2021-04-09 北京比特智学科技有限公司 Voice error correction method and device based on man-machine interaction
CN109978066B (en) * 2019-04-01 2020-10-30 苏州大学 Rapid spectral clustering method based on multi-scale data structure
CN112382276A (en) * 2020-10-20 2021-02-19 国网山东省电力公司物资公司 Power grid material information acquisition method and device based on voice semantic recognition

Citations (5)

Publication number Priority date Publication date Assignee Title
EP0838803A2 (en) * 1996-10-28 1998-04-29 Nec Corporation Distance calculation for use in a speech recognition apparatus
CN1455388A (en) * 2002-09-30 2003-11-12 中国科学院声学研究所 Voice identifying system and compression method of characteristic vector set for voice identifying system
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model
CN101154379A (en) * 2006-09-27 2008-04-02 夏普株式会社 Method and device for locating keywords in voice and voice recognition system
CN103310790A (en) * 2012-03-08 2013-09-18 富泰华工业(深圳)有限公司 Electronic device and voice identification method



Also Published As

Publication number Publication date
CN105893389A (en) 2016-08-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16742666; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 16742666; Country of ref document: EP; Kind code of ref document: A1)