WO2016119604A1 - Voice information search method and apparatus, and server - Google Patents

Voice information search method and apparatus, and server

Info

Publication number
WO2016119604A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
feature
target
feature descriptor
speech
Application number
PCT/CN2016/071164
Other languages
French (fr)
Chinese (zh)
Inventor
闻乃松
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
闻乃松 (Wen Naisong)
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited) and 闻乃松 (Wen Naisong)
Publication of WO2016119604A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search

Definitions

  • The present application belongs to the technical field of electronic information signal processing, and in particular relates to a voice information search method, apparatus, and server.
  • a common practice in the art may include acquiring speech features of the audited speech and generating an acoustic model through training in order to create a pronunciation template for each of the pronunciations.
  • the acoustic features to be recognized are matched with the acoustic models of the audited speech one by one, and the pronunciation template closest to the speech to be recognized is selected as the expressed meaning of the speech to be recognized.
  • The speech information is usually divided into a plurality of audio features; for example, with a frame length of 20 ms, a 10-second speech segment generates 500 audio features.
  • The stored audit voices often number in the thousands.
  • Audit voices with the same meaning may include multiple different dialects and expressions.
  • The existing pronunciation-template-based method for audio recognition therefore faces a high-dimensional feature indexing and query process with long query times, which reduces query efficiency.
  • The purpose of the present application is to provide a voice information search method, apparatus, and server, which can extract voice underlying features unrelated to a specific person, perform quantization coding, establish an index, and search the existing target voices in the database through a K-d tree, achieving fast search of voice content and improving query efficiency.
  • the voice information searching method, device and server provided by the application are implemented as follows:
  • a voice information searching method comprising:
  • Extracting a voice feature of a target voice, and generating a feature descriptor of the target voice based on the voice feature;
  • Quantizing and encoding the feature descriptor to generate a quantized encoded feature descriptor, and storing the feature descriptor;
  • Acquiring a feature descriptor of a voice to be recognized; searching, among the stored feature descriptors, for a target voice corresponding to a feature descriptor that matches the feature descriptor of the to-be-recognized voice, and using the found target voice as a target candidate set corresponding to the to-be-recognized voice;
  • Selecting a search result of the to-be-recognized voice from the target candidate set according to a predetermined rule.
  • a voice information searching device comprising:
  • an information acquiring module configured to acquire a target voice and extract a voice feature of the target voice;
  • a descriptor module configured to generate a feature descriptor of the target voice based on the voice feature of the target voice;
  • a quantization coding module configured to perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor;
  • an identification information module configured to acquire a feature descriptor of the voice to be recognized;
  • a first search module configured to search, among the stored feature descriptors, for a target voice corresponding to a feature descriptor that matches the feature descriptor of the to-be-recognized voice, and to use the found target voice as a target candidate set corresponding to the to-be-recognized voice;
  • a second search module configured to select a search result of the to-be-recognized voice from the target candidate set according to a predetermined rule.
  • a voice information search server, comprising:
  • a first processing unit configured to acquire a target voice and generate a feature descriptor of the target voice, and further configured to perform quantization coding on the feature descriptor;
  • a storage unit configured to separately store feature descriptors that share the same path among the quantized encoded feature descriptors;
  • a second processing unit configured to acquire a feature descriptor of the to-be-recognized voice, further configured to search, among the stored feature descriptors, for the target voices whose feature descriptors match that of the to-be-recognized voice so as to acquire a candidate set, and further configured to select a search result of the to-be-recognized voice from the candidate set according to a predetermined rule.
  • The present application provides a voice information search method, apparatus, and server, which can learn and express a phoneme-level model of the target voice information to be audited in a voice information database, and generate feature descriptors to create a feature descriptor index.
  • The generated feature descriptors can be quantized and encoded, reducing the index dimension and information length of the feature descriptors and improving processing speed during indexing.
  • The present application uses a K-d tree to obtain a target candidate set with a smaller search range for the voice to be recognized, and then further filters the candidate set to select the search result.
  • The voice information search method provided by the present application converts traditional high-dimensional, complex pronunciation-template speech recognition into a search for similar audio features; by reducing the index dimension through feature-descriptor quantization coding and optimizing query efficiency with the K-d tree, the voice information search speed is greatly improved.
  • FIG. 1 is a schematic flow chart of an embodiment of a voice information searching method according to the present application.
  • FIG. 2 is a schematic diagram of performing quantization coding on a feature descriptor according to the present application.
  • FIG. 3 is a schematic diagram of an index of establishing a feature descriptor in the present application.
  • FIG. 4 is a schematic structural diagram of a module of an embodiment of a voice information searching apparatus according to the present application.
  • FIG. 5 is a schematic structural diagram of a module of a quantization coding module provided by the present application.
  • FIG. 6 is a schematic structural diagram of a module of a second search module provided by the present application.
  • The voice information search method provided in the present application can add the voice underlying features of different languages corresponding to target keywords or phrases to be audited to a database, perform phoneme-level model learning and expression, generate feature descriptors, and establish an index.
  • A K-d tree is used to obtain the target candidate set of the speech to be recognized, and the search result is then further selected from it.
  • Feature-descriptor quantization coding reduces the index dimension, and the K-d tree optimizes query efficiency, greatly improving voice information search speed.
  • FIG. 1 is a flowchart of a method for a voice information search method according to an embodiment of the present disclosure. As shown in FIG. 1, the method may include:
  • S1 Extract a voice feature of the target voice in the voice information base, and generate a feature descriptor of the target voice.
  • The voice information library may include stored target voices.
  • the target voice can be pre-acquired or set.
  • the target voice may specifically include different content in different application scenarios.
  • the target voice in the voice information base may be a keyword or phrase involving content such as horror, pornography, advertisement, fraud, etc., including multiple dialects or multiple voices.
  • The target voice in the voice information library may include voice keywords or phrases that perform function control on home smart devices such as a smart TV or a speaker, or on a car driving control device.
  • For example, target voice information of keywords or phrases such as "weather", "apple", and "Double Eleven" may be used.
  • the target voice stored in the voice information database may be set according to different application scenarios.
  • the voice information search method described in this application may be applicable to, but not limited to, an application scenario of voice content audit based on security considerations.
  • Speech recognition based on automatic auditing of voice content usually needs to extract the underlying features of the voice that are unrelated to a specific person, so that the same words spoken by different people, or by the same person in different states and on different occasions, can be recognized more accurately.
  • The methods for extracting the underlying features of speech unrelated to a specific person in the present application may include the MFCC (Mel-Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction) methods.
  • MFCC is based on Fourier and cepstrum analysis: it performs a Fourier transform on the sampling points in a short-time audio frame to obtain the energy of the frame at each frequency, which can better reflect the frequency-domain characteristics of the audio signal. Therefore, in this embodiment, the MFCC method can be used to acquire the voice features of the target voice in the voice information base.
  • the extracting the voice features of the target voice in the voice information database may include the following processing steps:
  • the pre-processing may include operations such as voice format conversion, pre-emphasis, framing, windowing, and the like on the target voice.
  • the specific implementation process of pre-processing the target voice may include:
  • The target voices stored in the voice information database may include voice information collected in multiple formats, such as the AMR format and the WAV format.
  • The different target voice formats can be uniformly converted into the WAV format, which facilitates unified and fast processing of the subsequent data.
  • Pre-emphasis generally refers to flattening the spectrum of the signal by passing the speech signal through a high-pass filter, so that the spectrum can be obtained with the same or a similar signal-to-noise ratio from low frequencies to high frequencies.
  • Pre-emphasis can also eliminate vocal cord and lip effects in the vocalization process, compensating the high-frequency part of the speech signal that is suppressed by the articulatory system and highlighting the high-frequency formants.
  • The filter formula is as follows: H(z) = 1 − μz⁻¹, i.e. y(n) = x(n) − μ·x(n − 1).
  • The value of μ is between 0.9 and 1.0, and in this embodiment it can be 0.97.
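As an illustrative sketch (not part of the patent), the pre-emphasis filter above can be written in plain Python; the coefficient μ = 0.97 follows the embodiment:

```python
def pre_emphasis(signal, mu=0.97):
    """Apply the high-pass pre-emphasis filter y(n) = x(n) - mu * x(n-1)."""
    return [signal[0]] + [signal[n] - mu * signal[n - 1]
                          for n in range(1, len(signal))]

samples = [1.0, 1.0, 1.0, 1.0]        # a flat (purely low-frequency) signal
emphasized = pre_emphasis(samples)
# the constant component is strongly attenuated after the first sample
```

A flat input illustrates the high-pass behaviour: every sample after the first is reduced to 1 − 0.97 = 0.03.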
  • The framing may include dividing all the sampling points of the target speech into frames of N sampling points each.
  • an overlapping area may be formed between adjacent frames, and the overlapping area may include M sampling points.
  • the sampling frequency of the speech signal in the embodiment may be 8000 Hz
  • the sampling point N included in each frame is 512
  • the overlapping area M is 256.
  • The speech signal can then be windowed, and only the data inside the window is processed at a time. Since the actual speech signal is very long, it is usually unnecessary to process very long data in one pass; instead, a segment of data is taken and analyzed, then the next segment is taken and analyzed. For this purpose, a function can be constructed that has non-zero values in a certain interval and is zero in the remaining intervals.
  • the Hamming window is such a function.
  • The current framed signal can be multiplied by the Hamming window, and the window is usually moved by one third or one half of its length when processing the next frame, to increase the continuity between the left and right ends of adjacent frames.
  • The windowing formula provided in this embodiment may be: S′(n) = S(n) × W(n), where W(n) = (1 − a) − a·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, in which:
  • N is the number of sampling points of the frame
  • S(n) represents the speech signal
  • S'(n) is the speech signal after the windowing process
  • a can have a value of 0.46.
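The framing and windowing steps above can be sketched as follows; this illustrative Python (with hypothetical helper names) uses the embodiment's parameters: frames of N = 512 samples with a 256-sample overlap, and a = 0.46:

```python
import math

def frame_signal(signal, frame_len=512, overlap=256):
    """Split the signal into frames of frame_len samples,
    adjacent frames sharing `overlap` samples."""
    step = frame_len - overlap
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, step)]

def hamming_window(frame, a=0.46):
    """S'(n) = S(n) * ((1 - a) - a * cos(2*pi*n / (N - 1)))."""
    N = len(frame)
    return [s * ((1 - a) - a * math.cos(2 * math.pi * n / (N - 1)))
            for n, s in enumerate(frame)]

signal = [1.0] * 1024
frames = frame_signal(signal)          # 3 frames of 512 samples each
windowed = hamming_window(frames[0])   # tapered toward 0.08 at the edges
```

At the frame edges the window value is (1 − 0.46) − 0.46 = 0.08, which is what smooths the junction between adjacent frames.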
  • the energy of the target speech after pre-processing can be obtained.
  • S102 Calculate an energy spectrum of the pre-processed target speech.
  • The pre-processed frame speech signals may be subjected to a fast Fourier transform to obtain the spectrum of each frame, and the squared modulus of the spectrum may then be taken to obtain the power spectrum of the speech signal.
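A minimal sketch of S102, not taken from the patent: a transform per frame followed by the squared magnitude. A naive DFT is used here for clarity; a real implementation would use an FFT:

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT of one frame followed by |X(k)|^2
    (a real implementation would use an FFT)."""
    N = len(frame)
    spectrum = []
    for k in range(N):
        X_k = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                  for n in range(N))
        spectrum.append(abs(X_k) ** 2)
    return spectrum

# a pure cosine at one cycle per frame concentrates its energy
# in bins k = 1 and k = N - 1
frame = [math.cos(2 * math.pi * n / 8) for n in range(8)]
ps = power_spectrum(frame)
```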
  • S103 Perform Mel filtering on the energy spectrum, and calculate a logarithm of the energy spectrum after the Mel filtering.
  • The energy spectrum can be smoothed by a set of pre-configured Mel-scale triangular filters, eliminating harmonics and highlighting the formants of the original speech.
  • The logarithm of each filter output can then be calculated using the following equation: s(m) = ln( Σ_{k=0}^{N−1} |Xa(k)|²·Hm(k) ), 1 ≤ m ≤ M, where:
  • N is the number of sampling points of the frame
  • M is the number of filters
  • Xa is the Fourier transformed spectrum
  • Hm is the mth filter.
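The filtering-and-logarithm step of S103 can be sketched as below; the two-filter bank here is a toy illustration (not actually mel-spaced), and the flooring constant is an assumption to keep the logarithm defined:

```python
import math

def log_mel_energies(power_spec, filterbank):
    """s(m) = ln( sum_k |Xa(k)|^2 * Hm(k) ) for each triangular filter Hm."""
    energies = []
    for Hm in filterbank:
        e = sum(p * h for p, h in zip(power_spec, Hm))
        energies.append(math.log(max(e, 1e-12)))  # floor to avoid log(0)
    return energies

# two toy triangular filters over a 6-bin power spectrum
filterbank = [
    [0.0, 0.5, 1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 1.0, 0.5],
]
power_spec = [1.0] * 6
s = log_mel_energies(power_spec, filterbank)   # s[0] = ln(2)
```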
  • S104 Perform a DCT (Discrete Cosine Transform) on the logarithm of the energy spectrum to obtain the MFCC coefficients, and acquire the voice feature of the target voice.
  • The DCT may take the form C(l) = Σ_{m=1}^{M} s(m)·cos(πl(m − 0.5)/M), l = 1, 2, …, L, where:
  • N is the number of sampling points of the frame
  • M is the number of filters
  • L is the order of the MFCC coefficients, which is usually taken as 12–16.
  • the MFCC coefficient obtained by the above transformation can be used as the speech feature of the target speech.
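S104 can be sketched with a common DCT-II form (an illustrative implementation, not the patent's exact one); a constant filterbank output is used below purely to show the transform's behaviour:

```python
import math

def mfcc_from_log_energies(log_energies, L=12):
    """C(l) = sum_{m=1..M} s(m) * cos(pi * l * (m - 0.5) / M), l = 1..L."""
    M = len(log_energies)
    return [
        sum(s * math.cos(math.pi * l * (m + 0.5) / M)
            for m, s in enumerate(log_energies))
        for l in range(1, L + 1)
    ]

coeffs = mfcc_from_log_energies([1.0] * 26, L=12)
# constant filterbank energies carry no spectral shape,
# so every coefficient comes out ~0
```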
  • Dynamic parameters may also be added to the voice feature to improve speech recognition.
  • A difference parameter that characterizes the dynamic properties of the voice may be added to the voice feature; it helps distinguish the same words spoken by different people and improves the voice recognition performance of the system. Therefore, the voice information search method in this embodiment may further include:
  • S105 Calculate first and second order difference coefficients of the MFCC coefficients, and add the first and second order difference coefficients to the voice feature. That is, the speech feature may further include first and second order differential coefficients of the MFCC coefficients.
  • The first-order difference coefficients of the MFCC coefficients can be calculated by the following formula (second-order differences are obtained by applying the same formula to the first-order differences): dt = ( Σ_{n=1}^{N} n·(C_{t+n} − C_{t−n}) ) / ( 2·Σ_{n=1}^{N} n² ), where:
  • dt represents the tth first-order difference
  • Ct represents the t-th cepstral coefficient
  • N represents the time difference of the first-order derivative, which may take 1 or 2.
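The difference computation of S105 can be sketched as follows; handling the edge frames by padding is an assumption, since the text does not specify it:

```python
def delta(coeffs, N=2):
    """d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    with the first/last coefficient repeated at the edges."""
    T = len(coeffs)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = [coeffs[0]] * N + list(coeffs) + [coeffs[-1]] * N
    return [
        sum(n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)) / denom
        for t in range(T)
    ]

# a linearly increasing cepstral track has a constant interior slope of 1
d = delta([0.0, 1.0, 2.0, 3.0, 4.0], N=1)
```

Second-order differences are obtained by applying `delta` to the first-order output again.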
  • the feature descriptors of the target speech may be generated.
  • The feature descriptors described in this application may include VLAD (Vector of Locally Aggregated Descriptors) feature descriptors.
  • The following is a method for generating a feature descriptor provided by this embodiment, which can aggregate the voice features into a single feature vector.
  • the specific implementation process may include:
  • S101' acquiring, by using a k-means clustering method, the codebook of the target voice for the extracted voice feature;
  • Generating the VLAD descriptor usually requires training the codebook first: N speech features are randomly selected from the extracted speech features, and a codebook {μ1, …, μk} with k codewords is obtained by the k-means clustering method.
  • Each codeword in the codebook can represent the aggregation of one or more identical or similar speech samples; for example, μ1 can represent the aggregation of target voices in the voice information database that have the same meaning but different tones. In this way, a large number of target voices in the voice information base can be aggregated to form a codebook with k codewords.
  • The value of k may be much smaller than the number of target speeches in the voice information library.
  • S102' acquiring the voice feature set of the target voice, and calculating, for each codeword, the sum of the residual vectors between the voice features in the set and their closest codeword in the codebook.
  • The one or more speech features of a target speech in the speech information library may form a feature set {x1, …, xp}.
  • For the feature set formed by the voice features of a target voice, the codeword closest to each voice feature is searched in the codebook, the residual vector between each voice feature and its nearest codeword is calculated, and all residual vectors belonging to the same codeword are accumulated, as shown below: vi = Σ_{xt : NN(xt) = μi} (xt − μi)
  • where xt is the t-th feature in the feature set {x1, …, xp}, 1 ≤ t ≤ p, and μi is the i-th codeword in the codebook {μ1, …, μk}.
  • S103' normalizing the sum of the residual vectors of the codewords to generate a feature descriptor of the target speech.
  • The accumulated residual vector of each codeword is normalized.
  • The normalized codeword residual-vector sums may then be concatenated to form the VLAD total feature descriptor V of the target speech.
  • The VLAD total feature descriptor V can be expressed as: V = {v′1, …, v′k}
  • where v′i is the vector data of the normalized feature descriptor, and the normalization can be performed using the following formula: v′i = vi / ‖vi‖₂
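The three steps S101'–S103' can be condensed into a short illustrative sketch; the two-codeword codebook and 2-dimensional features below are toy values, not trained by k-means:

```python
import math

def vlad(features, codebook):
    """Accumulate residuals (x - mu_i) per nearest codeword,
    then L2-normalize each codeword block and concatenate."""
    d = len(codebook[0])
    sums = [[0.0] * d for _ in codebook]
    for x in features:
        # nearest codeword by squared Euclidean distance
        i = min(range(len(codebook)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])))
        for dim in range(d):
            sums[i][dim] += x[dim] - codebook[i][dim]
    descriptor = []
    for v in sums:
        norm = math.sqrt(sum(c * c for c in v)) or 1.0
        descriptor.extend(c / norm for c in v)  # concatenate normalized blocks
    return descriptor

codebook = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]]
V = vlad(features, codebook)   # length = k * d = 4
```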
  • the voice feature of the target voice collected and stored in the voice information database may be extracted, and then the feature descriptor of the target voice may be generated.
  • Each VLAD feature descriptor in the total feature descriptor V may be a multi-dimensional feature vector, for example a 128-dimensional feature vector {l1, l2, l3, …, l128}.
  • Each dimension of each feature vector is a clustering index of the target speech, and the corresponding target speech can be found according to the multi-dimensional feature vector v′i.
  • S2 Perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor.
  • An obtained VLAD feature descriptor usually reaches thousands of bits.
  • To reduce the difficulty and complexity of the voice information retrieval, the acquired VLAD feature descriptor is quantized, which reduces the dimension and information length of the feature descriptor and optimizes search query efficiency.
  • Performing the quantization coding on the feature descriptor to generate the quantized encoded feature descriptor may include:
  • S201 Divide each of the feature descriptors into L sub-vectors, cluster the sub-vectors, and set index numbers for the sub-vector clusters, L ≥ 2;
  • S202 The L sub-vectors of each of the feature descriptors are respectively represented by index numbers of the clusters that are closest to the sub-vectors, and the quantized encoded feature descriptors are generated.
  • FIG. 2 is a schematic diagram of performing quantization coding on a feature descriptor according to the embodiment.
  • Taking a 128-dimensional VLAD feature descriptor as an example, it can be divided into 8 equal parts (y1 to y8), each equal part containing a 16-dimensional component (16 components) of the 128-dimensional feature vector.
  • Each of the components can be separately clustered and mapped to 256 cluster centers (256 centroids).
  • Each sub-vector after clustering can be represented by an index number; for example, q1(y1) in FIG. 2 can represent the cluster index of the first 16-dimensional component y1 of the 128-dimensional feature vector.
  • In this way, the high-dimensional VLAD feature descriptor can be represented as a low-dimensional feature vector of length L whose components are secondary-clustering index numbers, which can greatly improve the search speed of subsequent voice information.
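The quantization coding of S201/S202 can be sketched as follows; the patent's sizes (128-dim vectors, 8 sub-vectors, 256 centers each) are shrunk to toy values, and the sub-codebooks are given by hand rather than learned by clustering:

```python
def quantize_descriptor(vector, sub_codebooks):
    """Split the vector into len(sub_codebooks) equal parts; encode each part
    as the index of its nearest cluster center in its sub-codebook."""
    L = len(sub_codebooks)
    sub_len = len(vector) // L
    codes = []
    for i, centers in enumerate(sub_codebooks):
        sub = vector[i * sub_len:(i + 1) * sub_len]
        codes.append(min(range(len(centers)),
                         key=lambda c: sum((a - b) ** 2
                                           for a, b in zip(sub, centers[c]))))
    return codes

# toy setup: a 4-dim vector split into 2 sub-vectors, 2 centers per sub-codebook
sub_codebooks = [
    [[0.0, 0.0], [5.0, 5.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
codes = quantize_descriptor([4.8, 5.1, 0.2, -0.1], sub_codebooks)
```

The 4-dimensional float vector is thus reduced to the 2-component index code, mirroring how the 128-dimensional descriptor shrinks to 8 index numbers.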
  • the quantized coded feature descriptor may be stored.
  • the specific storage may be implemented in the manner of indexing the quantized encoded feature descriptor.
  • the K-d tree index may be used in the embodiment, and the storing the feature descriptor implementation manner may include:
  • S203' Starting from the root node of the K-d tree, the value of the feature descriptor at the index dimension of each non-leaf node is compared with the division value of that non-leaf node, and, based on the comparison result and the resulting path direction, the feature descriptor is stored into a leaf node of the K-d tree.
  • FIG. 3 is a schematic diagram of an index for establishing a feature descriptor according to the present application.
  • a K-d tree of height (3+1) can be established, and the last layer of the K-d tree is a leaf node.
  • The feature descriptor to be stored is a 3-dimensional feature vector; the index dimension and division value of each non-leaf node may be pre-assigned, and the path direction for each comparison result may be preset. For example, if the value of the feature descriptor at the index dimension corresponding to the non-leaf node is smaller than the division value of that node, the comparison continues in the left subtree of the node; otherwise it continues in the right subtree, until a leaf node is reached.
  • The index dimension of a node indicates which dimension of the feature descriptor is compared at that node.
  • A dimension that has already been compared is not compared again: when index dimensions are assigned to non-leaf nodes, the index dimension values already assigned on the path from the K-d tree root node to the current non-leaf node are no longer allocated to the current non-leaf node.
  • In a specific implementation, the division value of a non-leaf node may be half of the number of cluster centers set in the above quantization coding; since that number is 256, the division value can be set to 128.
  • the feature descriptors of the 3-dimensional feature vectors can be stored in the leaf nodes, respectively.
  • Suppose a feature descriptor is the 3-dimensional feature vector (45, 210, 60).
  • Starting from the root node of the K-d tree established according to FIG. 3, the value of the feature descriptor (45, 210, 60) at the index dimension shown in the root node (index dimension 2, division value 128) is compared with the division value.
  • The value of the feature descriptor (45, 210, 60) at the current node's index dimension 2 is 210, which is greater than the current node's division value 128, so the search enters the right subtree of the root node.
  • The index dimension of the current non-leaf node is 1 and its division value is 128. The value of the feature descriptor (45, 210, 60) at index dimension 1 is 45, which is smaller than the division value 128, so the search enters the left subtree of the current non-leaf node.
  • According to the current non-leaf node's index dimension of 3 and division value of 128, the value 60 of the feature descriptor (45, 210, 60) is compared with 128; since it is smaller, the search enters the left side of the current non-leaf node and arrives at a leaf node of the K-d tree.
  • The feature descriptor (45, 210, 60) may then be stored in that leaf node.
  • Other feature descriptors can likewise be stored into leaf nodes of the corresponding K-d tree according to the above method.
  • Each leaf node stores the feature vectors of all feature descriptors whose search path from the root node leads to that leaf node.
  • Each dimension of each of the feature vectors is a clustering index of the target speech.
  • In the K-d index established above, each leaf node stores a set of feature descriptors sharing the same path, and the feature descriptors stored by one leaf node can be used as a target candidate set.
  • To equalize the number of feature descriptors stored in each leaf node, so that the target candidate sets stored in the leaf nodes have comparable capacity, the division of index dimensions for non-leaf nodes may proceed as follows:
  • the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1 ≤ S ≤ L;
  • the index dimension value S of the current non-leaf node is chosen from the index dimension values not yet assigned on the path from the K-d tree root node to the current non-leaf node.
  • L in the above is the length of the feature descriptor, that is, the dimension of the feature vector in the feature descriptor.
  • Randomly dividing the index dimensions of the non-leaf nodes improves the uniform distribution of nodes on the left and right sides of the K-d tree, avoids an excessive number of feature descriptors in the target candidate sets stored by some leaf nodes, improves the subsequent search speed within a target candidate set, and balances the load of each target candidate set.
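A simplified sketch of this storage scheme, not from the patent: here the tree is flattened into a dictionary keyed by the left/right path, with one shared division value of 128 and the index-dimension sequence of the FIG. 3 example. A real K-d tree would store per-node index dimensions and division values:

```python
def leaf_path(descriptor, plan, split=128):
    """Follow the tree: at each level, compare the descriptor's value at that
    level's index dimension (1-based) with the division value.
    'L' means left subtree (value < split), 'R' means right."""
    return ''.join('L' if descriptor[dim - 1] < split else 'R' for dim in plan)

def store(tree, descriptor, plan):
    """Append the descriptor to the leaf (candidate set) at the end of its path."""
    tree.setdefault(leaf_path(descriptor, plan), []).append(descriptor)

plan = [2, 1, 3]                    # index dimensions at depths 0, 1, 2 (as in FIG. 3)
tree = {}
store(tree, (45, 210, 60), plan)    # path: R (210>=128), L (45<128), L (60<128)
store(tree, (90, 135, 75), plan)    # same path R, L, L -> same leaf

# a query descriptor that follows the same path lands on the same candidate set
candidates = tree[leaf_path((90, 135, 78), plan)]
```

Descriptors sharing a root-to-leaf path end up in one leaf, which is exactly the target candidate set used in the next step.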
  • S3 acquiring a feature descriptor of the voice to be recognized, searching, in the stored feature descriptor, a target voice corresponding to the feature descriptor that matches the feature descriptor of the to-be-recognized voice, and finding the target Voice as a target candidate set corresponding to the to-be-recognized voice;
  • the speech feature of the speech to be recognized may be extracted according to the method described above, and the quantized and encoded feature descriptor of the speech to be recognized may be acquired.
  • The feature descriptor of the to-be-recognized voice may be matched against the feature descriptors of the target voices, the target voices corresponding to the feature descriptors that match the feature descriptor of the to-be-recognized voice may be found, and the found target voices may be used as the target candidate set corresponding to the to-be-recognized voice.
  • The target voices corresponding to the matching feature descriptors may include the target voices corresponding to the feature descriptors stored in the leaf node of the established K-d tree whose path is the same as that of the feature descriptor of the to-be-recognized voice.
  • For example, the voice to be recognized may generate a feature descriptor (90, 135, 78). The feature descriptor search can then be performed following the K-d tree shown in FIG. 3, with its designed index dimensions and division values, finally obtaining the target candidate set of the leaf node matching the feature descriptor of the speech to be recognized.
  • The search process for the feature descriptor of the to-be-recognized speech in the K-d tree may follow the storage procedure of the feature descriptor (45, 210, 60) of the target speech: the target speech corresponding to the feature descriptor (90, 135, 78) of the speech to be recognized can be found and the target candidate set determined, namely the leaf node in FIG. 3 where (90, 135, 75) is located. This locates the speech to be recognized to a smaller target candidate set of target speeches.
  • S4 Select the search result of the to-be-identified voice in the target candidate set according to a predetermined rule.
  • After the target candidate set is obtained, the search scope has been greatly reduced. A further selection may then be made according to the predetermined rule to obtain the search result of the to-be-recognized voice.
  • This embodiment provides a method for further selecting the search result within the target candidate set. Specifically, selecting the search result of the to-be-recognized voice from the target candidate set according to a predetermined rule may include:
  • The Euclidean distance between the feature descriptor of the to-be-recognized speech and each feature vector in the target candidate set may be calculated, the calculated Euclidean distances may be sorted in increasing order, and the first R feature descriptors with the smallest Euclidean distances may be selected as the search result set.
  • The target voices in the voice information database corresponding to the feature descriptors in the search result set are the search results of the voice to be recognized.
  • the range of values of R in the top R results with the smallest Euclidean distance selected here can be set according to requirements.
  • R may take a value of 1, which means selecting the single feature descriptor with the smallest Euclidean distance.
  • For example, if the feature descriptor of the speech to be recognized is (90, 135, 78), the feature descriptor with the smallest Euclidean distance in the acquired target candidate set, (90, 135, 75), may be selected as the search result set.
  • R may also take a value greater than 1 and smaller than the number of feature descriptors in the target candidate set, for example 3, so that the three feature descriptors with the smallest Euclidean distances may be selected from the target candidate set; the target speeches corresponding to the other two feature descriptors in the search result set can serve as reference or alternative search results.
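The top-R selection of S4 can be sketched as follows, reusing the example descriptors from the text (an illustrative sketch, not the patent's implementation):

```python
import math

def select_results(query, candidates, R=1):
    """Sort the candidate set by Euclidean distance to the query
    and keep the first R descriptors as the search result set."""
    dist = lambda v: math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))
    return sorted(candidates, key=dist)[:R]

candidates = [(90, 135, 75), (45, 210, 60), (88, 140, 70)]
best = select_results((90, 135, 78), candidates, R=1)
# (90, 135, 75) lies at distance 3 from the query, closer than the others
```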
  • The voice information search method provided by the present application can extract the underlying speech features of the target speech, generate the corresponding VLAD feature descriptors to represent the feature set of the target speech, and locate the corresponding target speech according to a feature descriptor.
  • The present application introduces the VLAD feature descriptor to convert a speech feature set into a fixed-length overall feature, and then converts the VLAD feature descriptor, which has a long information length and high dimension, into a feature vector of smaller length and lower dimension, greatly improving information reading, parsing, and indexing speed.
  • the target candidate set is established, and the to-be-recognized speech is searched by the K-d tree index to obtain a target candidate set with a smaller search range, and the search result is further selected in the target candidate set, thereby further speeding up the search speed.
  • the present application further provides a voice information searching device, which can store target voices in a voice information database according to rules into corresponding candidate set modules, and establish a corresponding indexing mechanism. After obtaining the speech to be recognized, the target candidate set can be quickly obtained through the index, and the search result is further obtained, thereby converting the speech recognition of the traditional pronunciation template into the search of the speech feature, and improving the search by quantitatively encoding and optimizing the index. Speed, optimize query efficiency.
  • FIG. 4 is a block diagram of the structure of a voice information search device according to the present application. As shown in FIG. 4, the device may include:
  • the information acquiring module 101 is configured to acquire a target voice, and extract a voice feature of the target voice;
  • the descriptor module 102 may be configured to generate a feature descriptor of the target voice based on a voice feature of the target voice;
  • the quantization coding module 103 may be configured to perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor;
  • the identification information module 104 can be configured to acquire a feature descriptor of the voice to be recognized;
  • the first search module 105 may be configured to: in the stored feature descriptor, search for a target voice corresponding to the feature descriptor that matches the feature descriptor of the to-be-recognized voice, and use the found target voice as a target candidate set corresponding to the to-be-recognized voice;
  • the second search module 106 may be configured to select the search result of the to-be-identified voice in the target candidate set according to a predetermined rule.
  • the manner in which the information acquisition module 101 extracts the voice features of the target voice may include extraction methods based on MFCC and PLP.
  • the process of extracting speech features using MFCC may include:
  • the pre-processing the target voice may include: performing voice format conversion, pre-emphasis, framing, and windowing on the target voice.
  • the process of extracting voice features by using the MFCC may further include:
  • the quantization and coding module 103 may specifically include:
  • the clustering module may be configured to divide each feature descriptor into L sub-vectors, cluster the sub-vectors, and assign an index number to each sub-vector cluster, L ≥ 2;
  • the mapping module may be configured to represent each of the L sub-vectors of a feature descriptor by the index number of the cluster nearest to that sub-vector, forming the quantized encoded feature descriptor.
  • each sub-vector index can be given an 8-bit binary representation, so that each feature descriptor is converted into a lower-dimensional feature vector of length L.
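The sub-vector quantization described above can be sketched as follows (illustrative Python, not the patented implementation; the sub-codebooks are assumed to be pre-trained, with at most 256 centroids each so that every index fits in 8 bits):

```python
def quantize(descriptor, codebooks):
    """Product-quantize a feature descriptor: split it into L sub-vectors
    and replace each sub-vector by the index of its nearest centroid in
    the corresponding sub-codebook."""
    L = len(codebooks)
    d = len(descriptor) // L
    code = []
    for i, centroids in enumerate(codebooks):
        sub = descriptor[i * d:(i + 1) * d]
        # nearest centroid by squared Euclidean distance; with at most
        # 256 centroids per sub-codebook the index fits in 8 bits
        best = min(range(len(centroids)),
                   key=lambda k: sum((a - b) ** 2
                                     for a, b in zip(sub, centroids[k])))
        code.append(best)
    return code
```

Each descriptor thus shrinks to L small integers, which is what makes the later index comparisons cheap.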
  • FIG. 5 is a schematic structural diagram of a module of a quantization and coding module provided by the present application.
  • the quantization and coding module 103 may specifically include:
  • the index tree construction module 1031 can be used to establish a K-d tree with a height of (L+1);
  • the preset rule module 1032 may be configured to assign, according to a preset rule, an index dimension and a corresponding partition value to each non-leaf node of the K-d tree, and to establish the result path direction for the comparison against the partition value;
  • the storage module 1033 may be configured to compare, starting from the root node of the K-d tree, the value of a feature descriptor on the index dimension of each non-leaf node with that node's partition value, and to store the feature descriptor into a leaf node of the K-d tree based on the comparison result and the result path direction.
  • the established result path direction may include comparing the feature descriptor with the partition value of the current non-leaf node: if the former is greater than the latter, the left subtree of the current non-leaf node is entered to continue the comparison; otherwise the right subtree is entered. Of course, the comparison may instead be set so that the left subtree is entered when the former is smaller than the latter; the specific rule can be set as required.
  • the preset rule module 1032 is configured to divide the index dimension for the non-leaf node.
  • the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1 ≤ S ≤ L, and the index dimension value S of the current non-leaf node is one that has not yet been assigned on the path from the K-d tree root node to the current non-leaf node.
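A minimal sketch of the index tree these modules describe, under stated assumptions: the names are illustrative, the partition value 128 is simply the midpoint of the 8-bit code range, and the greater-than-goes-left routing follows the convention above:

```python
import random

class KDNode:
    """K-d tree over quantized codes of length L; leaves are buckets."""
    def __init__(self, L, used=frozenset(), depth=0):
        if depth == L:                      # leaf at depth L, so tree height is L+1
            self.dim, self.bucket = None, []
        else:
            # split dimension: random, and not yet used on the root-to-node path
            self.dim = random.choice([s for s in range(L) if s not in used])
            self.value = 128                # illustrative partition value for 8-bit codes
            self.left = KDNode(L, used | {self.dim}, depth + 1)
            self.right = KDNode(L, used | {self.dim}, depth + 1)

    def insert(self, code):
        node = self
        while node.dim is not None:
            # greater than the partition value -> left subtree, else right
            node = node.left if code[node.dim] > node.value else node.right
        node.bucket.append(code)
        return node
```

Descriptors whose codes compare the same way at every non-leaf node end up in the same leaf bucket, which is what yields the target candidate set at query time.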
  • FIG. 6 is a schematic structural diagram of a module of a second search module provided by the present application.
  • the second search module 106 of the voice search device may specifically include:
  • the distance calculation module 1061 may be configured to calculate an Euclidean distance between the feature descriptor in the target candidate set and the feature descriptor of the to-be-recognized voice;
  • the screening module 1062 is configured to select, in the target candidate set, the R feature descriptors with the smallest Euclidean distance to the feature descriptor of the voice to be recognized as the search result set, R ≥ 1;
  • the target speech module 1063 can be configured to acquire a target speech corresponding to the feature descriptor in the search result set.
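The distance calculation and screening steps above can be sketched as follows (illustrative, not the patented implementation):

```python
def top_r(query, candidates, R=1):
    """Rank candidate descriptors by Euclidean distance to the query
    descriptor and keep the R closest as the search result set."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(query, v)) ** 0.5
    return sorted(candidates, key=dist)[:R]
```

With R = 1 only the nearest target speech is returned; with R > 1 the remaining matches serve as the reference or alternative results mentioned earlier.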
  • the descriptor module 102 may specifically include:
  • the codebook training module may be configured to acquire the codebook of the target voice by using a k-means clustering method for the extracted voice features;
  • the voice feature set module may be configured to obtain a voice feature set and calculate the sum of the residual vectors between the features in the set and their nearest codewords in the codebook; the voice feature set may be, for example, {x1, ..., xp}, where each x represents a voice feature of the corresponding target speech.
  • the normalization processing module may be configured to normalize a sum of residual vectors of the codewords to generate a feature descriptor of the target speech.
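A plain-Python sketch of the descriptor generation these modules describe, assuming the codebook has already been trained by k-means: the residual of each feature to its nearest codeword is accumulated per codeword, and the concatenated result is L2-normalized:

```python
def vlad(features, codebook):
    """Aggregate a variable-size set of speech features into one
    fixed-length descriptor: accumulate residuals x - c(x) per codeword,
    concatenate, then L2-normalize."""
    k, d = len(codebook), len(codebook[0])
    acc = [[0.0] * d for _ in range(k)]
    for x in features:
        # index of the codeword nearest to this feature
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, codebook[c])))
        for t in range(d):
            acc[j][t] += x[t] - codebook[j][t]
    flat = [v for row in acc for v in row]
    norm = sum(v * v for v in flat) ** 0.5 or 1.0
    return [v / norm for v in flat]
```

The output length is k × d regardless of how many features the speech produced, which is what makes the descriptor fixed-length.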
  • a voice information searching device described in the present application can be applied to a plurality of device terminals or servers.
  • for example, a smart mobile terminal commonly used in daily life can obtain voice information to be recognized and send it to a server, and the server can perform a voice search using the voice information search method and device implemented in this application to obtain a corresponding voice search result, which can then be further processed. Therefore, the application further provides a voice information search server, the server being configured to include:
  • a first processing unit configured to acquire a target voice, generate a feature descriptor of the target voice, and further configured to perform quantization coding on the feature descriptor;
  • a storage unit configured to separately store feature descriptors with the same path among the quantized encoded feature descriptors;
  • a second processing unit configured to acquire a feature descriptor of the voice to be recognized; further configured to search, among the stored feature descriptors, for the target voices of the feature descriptors that match the voice to be recognized to obtain a candidate set; and further configured to select the search result of the voice to be recognized in the candidate set according to a predetermined rule.
  • the server may further return the obtained search result of the to-be-identified voice to the client that sends the voice to be recognized, or perform other processing in combination with the function module of the server or other server.
  • the voice search server provided in this embodiment combines the feature descriptor and the index tree to optimize the voice indexing method, and improves the server voice search speed.
  • the voice information search method, device and server provided by the present application can store target voices into corresponding candidate sets according to rules and establish a corresponding indexing mechanism; after the voice to be recognized is acquired, the target candidate set can be quickly obtained through the index and the search result further obtained, thereby converting traditional pronunciation-template speech recognition into a search over speech features, with quantization coding and the K-d tree optimizing the index to improve the search speed and query efficiency.
  • the embodiments of the present application may also be implemented with a slightly modified transmission mechanism or data processing standard based on certain protocols or standards. The present application can still be implemented as long as the information interaction and information judgment feedback manners of the above embodiments are satisfied; details are not repeated here.
  • the unit, device or module illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function.
  • the above devices are described as being separately divided into various modules by function.
  • when implementing the present application, the functions of the modules may be implemented in the same one or more pieces of software and/or hardware, or a module implementing one function may be implemented by a combination of multiple sub-modules or sub-units.
  • the controller can be implemented by logically programming the method steps, for example by means of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers.
  • the application can be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, classes, and the like that perform particular tasks or implement particular abstract data types.
  • the present application can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communication network.
  • program modules can be located in both local and remote computer storage media including storage devices.
  • the present application can be implemented by means of software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, mobile terminal, server, or network device, etc.) to perform the methods described in the various embodiments of the present application or in portions of the embodiments.


Abstract

A voice information search method and apparatus, and a server. The method comprises: extracting voice features of target voices in a voice information base to generate feature descriptors of the target voices (S1); acquiring a feature descriptor of a voice to be identified, and taking a found target voice corresponding to a feature descriptor matching the feature descriptor of the voice to be identified as a target candidate set corresponding to the voice to be identified (S3); and selecting a search result of the voice to be identified in the target candidate set according to a pre-set rule (S4). By utilizing the method, query efficiency can be optimised, and the voice search speed can be increased.

Description

Voice information search method, device and server

Technical field
The present application belongs to the technical field of electronic information signal processing, and in particular relates to a voice information search method, device and server.
Background
In the future, speech recognition will gradually become a key technology for human-computer interaction in electronic information technology. At present, demand for speech recognition technology is growing in fields such as bank self-service, public self-service, WeChat terminal applications, and instant voice communication, especially for voice content review based on security considerations in the mobile Internet era.
For example, in many current social applications, users can post voice messages with various content, some of which may involve illegal information such as terror, pornography, advertising, and fraud. At present, the commonly used method is to automatically review voice content based on speech recognition technology for specific languages and keywords. A common practice in this technology is to acquire the speech features of the audited speech and train an acoustic model, in order to create a pronunciation template for each pronunciation. At recognition time, the speech features to be recognized are matched one by one against the acoustic models of the audited speech, and the pronunciation template closest to the speech to be recognized is selected as the expressed meaning of that speech.
In the actual speech recognition process, speech information is usually divided into many audio features; for example, with a frame length of 20 ms, a 10-second speech segment will generate 500 audio features. The stored audited voices often number in the thousands or tens of thousands, an audited voice with the same meaning may include expressions in multiple dialects and tones, and the pronunciation model of each audited voice contains a large number of audio features. On large data sets, the existing pronunciation-model-based audio recognition methods face complex high-dimensional feature indexing and query processes and long query times, which reduce query efficiency.
Summary of the invention
The purpose of the present application is to provide a voice information search method, device and server that can extract underlying speech features unrelated to any specific speaker, perform quantization coding on them, build an index, and search the existing target voices in the database through a K-d tree, achieving fast search of voice content and improving query efficiency.
The voice information search method, device and server provided by the present application are implemented as follows:
A voice information search method, the method comprising:
extracting voice features of a target voice in a voice information base, and generating a feature descriptor of the target voice;
quantizing and encoding the feature descriptor to generate a quantized encoded feature descriptor, and storing the feature descriptor;
acquiring a feature descriptor of the voice to be recognized; searching, among the stored feature descriptors, for the target voices corresponding to the feature descriptors that match the feature descriptor of the voice to be recognized, and using the found target voices as the target candidate set corresponding to the voice to be recognized;
selecting a search result of the voice to be recognized in the target candidate set according to a predetermined rule.
A voice information search device, the device comprising:
an information acquisition module, configured to acquire a target voice and extract voice features of the target voice;
a descriptor module, configured to generate a feature descriptor of the target voice based on the voice features of the target voice;
a quantization coding module, configured to perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor;
an identification information module, configured to acquire a feature descriptor of the voice to be recognized;
a first search module, configured to search, among the stored feature descriptors, for the target voices corresponding to the feature descriptors that match the feature descriptor of the voice to be recognized, and use the found target voices as the target candidate set corresponding to the voice to be recognized;
a second search module, configured to select a search result of the voice to be recognized in the target candidate set according to a predetermined rule.
A voice information search server, the server being configured to include:
a first processing unit, configured to acquire a target voice and generate a feature descriptor of the target voice; and further configured to perform quantization coding on the feature descriptor;
a storage unit, configured to separately store feature descriptors with the same path among the quantized encoded feature descriptors;
a second processing unit, configured to acquire a feature descriptor of the voice to be recognized; further configured to search, among the stored feature descriptors, for the target voices of the feature descriptors that match the voice to be recognized to obtain a candidate set; and further configured to select the search result of the voice to be recognized in the candidate set according to a predetermined rule.
The present application provides a voice information search method, device and server that can perform phoneme-level model learning and representation of the target voice information of keywords or phrases to be audited stored in a voice information base, generate feature descriptors, and build an index. In the present application the generated feature descriptors can be quantized and encoded, reducing the feature descriptor index dimension and information length and improving the processing speed of indexing. At query time, the present application uses a K-d tree to obtain a target candidate set with a smaller search range for the voice to be recognized, and then further filters out the search results. The voice information search method provided by the present application converts traditional high-dimensional, complex pronunciation-template speech recognition into a search for similar audio features, and, by reducing the index dimension through feature descriptors and optimizing query efficiency through the K-d tree, can greatly improve the voice information search speed.
Brief description of the drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flowchart of an embodiment of a voice information search method according to the present application;
FIG. 2 is a schematic diagram of quantizing and encoding a feature descriptor according to the present application;
FIG. 3 is a schematic diagram of building an index of feature descriptors in the present application;
FIG. 4 is a schematic structural diagram of the modules of an embodiment of a voice information search device according to the present application;
FIG. 5 is a schematic structural diagram of the modules of a quantization coding module provided by the present application;
FIG. 6 is a schematic structural diagram of the modules of a second search module provided by the present application.
Detailed description
In order to enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are clearly and completely described below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort shall fall within the scope of protection of the present application.
The voice information search method provided in the present application can add the underlying speech features, in different languages, of the target keywords or phrases to be audited to a database, perform phoneme-level model learning and representation, generate feature descriptors, and build an index. At query time, a K-d tree is used to obtain the target candidate set of the voice to be recognized, and the search results are then further filtered. Based on the above retrieval scheme, feature descriptor quantization coding reduces the index dimension and the K-d tree optimizes query efficiency, which can greatly improve the voice information search speed.
FIG. 1 is a flowchart of an embodiment of the voice information search method described in the present application. As shown in FIG. 1, the method may include:
S1: Extract voice features of a target voice in the voice information base, and generate a feature descriptor of the target voice.
The voice information base may include stored target voices. The target voices can be collected or set in advance, and may include different content in different application scenarios. For example, in voice content review based on security considerations, the target voices in the voice information base may be keywords or phrases in multiple dialects or tones that involve content such as terror, pornography, advertising, and fraud. In a home audio-video or automotive intelligent control terminal, the target voices may include voice keywords or phrases for controlling home smart devices, such as smart TVs and audio systems, or driving control devices. Alternatively, they may be collected target voice information of keywords or phrases commonly used in smart terminal applications for social networking, shopping, and chat, such as "weather", "apple", or "Double Eleven". The target voices stored in the voice information base can be set according to different application scenarios, and the voice information search method described in this application is applicable to, but not limited to, the application scenario of voice content review based on security considerations.
Speech recognition based on automatic voice content review usually needs to extract underlying speech features that are independent of any specific speaker, so that the same words spoken by different people, or by the same person in different states and on different occasions, can be recognized more accurately. Methods for extracting speaker-independent underlying speech features in the present application may include MFCC (Mel-Frequency Cepstrum Coefficients) and PLP (Perceptual Linear Prediction). MFCC is based on Fourier and cepstrum analysis: a Fourier transform is applied to the sampling points in a short audio frame to obtain the energy of that frame at each frequency, which reflects the frequency-domain characteristics of the audio signal well. Therefore, in this embodiment, the MFCC method can be used to obtain the voice features of the target voices in the voice information base.
Specifically, extracting the voice features of the target voice in the voice information base may include the following processing steps:
S101: Pre-process the target voice.
The pre-processing may include operations such as voice format conversion, pre-emphasis, framing, and windowing of the target voice. In this embodiment, the specific implementation of pre-processing the target voice may include:
S1011: Voice format conversion.
The target voices stored in the voice information base may be collected in multiple formats, such as the amr and wav formats. In this embodiment, the different target voice formats can be uniformly converted into the wav format to facilitate unified and fast processing of subsequent data.
S1012: Pre-emphasis.
Pre-emphasis usually refers to passing the speech signal through a high-pass filter to flatten the signal spectrum, ensuring that the spectrum can be obtained with the same or a similar signal-to-noise ratio over the whole band from low to high frequencies. Pre-emphasis can also eliminate the vocal-cord and lip effects of the vocalization process, compensating the high-frequency part of the speech signal suppressed by the articulation system and emphasizing the high-frequency formants. The filter formula is as follows:
H(z) = 1 - μz^(-1)
In the above formula, the value of μ is between 0.9 and 1.0; in this embodiment it can be taken as 0.97.
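The filter above amounts to the difference equation y[n] = x[n] - μ·x[n-1]; a minimal sketch:

```python
def pre_emphasis(signal, mu=0.97):
    """High-pass pre-emphasis: y[n] = x[n] - mu * x[n-1].
    The first sample is passed through unchanged."""
    return [signal[0]] + [signal[n] - mu * signal[n - 1]
                          for n in range(1, len(signal))]
```

A constant (DC) signal is attenuated to roughly 1 - μ of its level, while fast changes pass through, which is the flattening effect described above.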
S1013: Framing.
Framing may include taking every N sampling points of the target voice as one frame. In this embodiment, to avoid excessive change between two adjacent frames, adjacent frames may share an overlapping region of M sampling points. Specifically, in this embodiment the sampling frequency of the speech signal may be 8000 Hz, the number of sampling points per frame N may be 512, and the overlap M may be 256.
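The framing step can be sketched as follows (illustrative; the start of each frame advances by N - M points so that adjacent frames overlap by M):

```python
def split_frames(signal, N=512, M=256):
    """Split a sample sequence into frames of N points; consecutive
    frames share an overlap of M points, i.e. the hop size is N - M."""
    hop = N - M
    return [signal[i:i + N] for i in range(0, len(signal) - N + 1, hop)]
```

With N = 512, M = 256 and an 8000 Hz sampling rate, each frame covers 64 ms and a new frame starts every 32 ms.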
S1014: Windowing.
Usually, to process the speech signal efficiently, the signal can be windowed so that only the data in the window is processed at a time. Because the actual speech signal is very long, it is usually unnecessary to process very long data at once: a segment of data can be taken each time and analyzed, then the next segment taken and analyzed. In this process a function can be constructed that has non-zero values in a certain interval and is 0 elsewhere. The Hamming window is such a function. In this embodiment, the signal of the current frame can be multiplied by the Hamming window; when processing the next frame, the window is usually moved by one third or one half of its length each time to increase the continuity between the left and right ends of the frames. The windowing formula provided in this embodiment may be:
S′(n) = S(n) × W(n)
In the above formula,

W(n) = (1 - a) - a × cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
In the above formula, N is the number of sampling points per frame, S(n) denotes the speech signal, S′(n) is the windowed speech signal, and a can take the value 0.46.
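A sketch of the windowing step, assuming the standard Hamming form with a = 0.46 (the equation image in the original is not reproduced here, so the exact constants are an assumption):

```python
import math

def hamming_window(N, a=0.46):
    """W(n) = (1 - a) - a*cos(2*pi*n/(N-1)): the Hamming window."""
    return [(1 - a) - a * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

def apply_window(frame):
    """S'(n) = S(n) * W(n): multiply each frame sample by the window."""
    w = hamming_window(len(frame))
    return [s * wn for s, wn in zip(frame, w)]
```

The window is near zero at both frame edges and 1.0 at the center, tapering each frame so that the overlap between adjacent frames smooths discontinuities.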
After the above pre-processing, the energy of the pre-processed target voice can be obtained.
S102: Calculate the energy spectrum of the pre-processed target voice.
In this embodiment, a fast Fourier transform can be applied to each pre-processed frame of the speech signal to obtain the spectrum of each frame, and the squared magnitude of the spectrum can then be taken to obtain the power spectrum of the speech signal.
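A sketch of the per-frame power spectrum (a naive O(N²) DFT for clarity; a real implementation would use an FFT):

```python
import cmath

def power_spectrum(frame):
    """DFT of one windowed frame followed by the squared magnitude
    at each frequency bin."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N))) ** 2
            for k in range(N)]
```

Each output value is the energy of the frame at one frequency bin; these energies feed the Mel filter bank in the next step.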
S103: Perform Mel filtering on the energy spectrum, and calculate the logarithm of the Mel-filtered energy spectrum.
The energy spectrum can be passed through a pre-set group of Mel-scale triangular filters to smooth the spectrum, eliminate the effect of harmonics, and highlight the formants of the original speech. The logarithm of each filter output can then be calculated using the following equation:
s(m) = ln( Σ_{k=0}^{N-1} |Xa(k)|² × Hm(k) ), 0 ≤ m < M
In the above formula, N is the number of sampling points per frame, M is the number of filters, Xa is the Fourier-transformed spectrum, and Hm is the m-th filter.
S104:对所述能量谱的对数进行DCT变换得到MFCC系数,获取所述目标语音的语音特征。S104: performing DCT transform on the logarithm of the energy spectrum to obtain MFCC coefficients, and acquiring a voice feature of the target voice.
本实施例中可以采用下式进行DCT(Discrete Cosine Transform,离散余弦变换)变换得到MFCC系数:In this embodiment, DCT (Discrete Cosine Transform) transformation can be performed by using the following formula to obtain MFCC coefficients:
C(n) = Σ_{m=0}^{M−1} s(m) cos( πn(m+0.5)/M ), n = 1, 2, …, L
上式中,N为帧的采样点的个数,M为滤波器个数,L阶指MFCC系数阶数,通常取12-16。 In the above formula, N is the number of sampling points of the frame, M is the number of filters, and L is the order of the MFCC coefficients, which is usually taken as 12-16.
可以将上式变换得到的MFCC系数作为目标语音的语音特征。The MFCC coefficient obtained by the above transformation can be used as the speech feature of the target speech.
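The DCT step above can be written directly from the formula; L = 12 is one illustrative choice inside the 12-16 range mentioned in the text.

```python
import numpy as np

def mfcc_from_log_mel(s, L=12):
    # C(n) = sum_{m=0}^{M-1} s(m) * cos(pi * n * (m + 0.5) / M), n = 1..L
    s = np.asarray(s, dtype=float)
    M = len(s)
    n = np.arange(1, L + 1)[:, None]
    m = np.arange(M)[None, :]
    return (np.cos(np.pi * n * (m + 0.5) / M) * s).sum(axis=1)
```

Because the DCT basis vectors for n ≥ 1 are orthogonal to a constant input, a flat log-energy vector yields all-zero coefficients, which is a quick sanity check of the transform.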
另一种优选的实施方式中,还可以在所述语音特征中添加动态参数,提高语音识别性。本实施例所述的方法中可以在所述语音特征中加入表征语音动态特性的差分参数,屏蔽可以区分不同的人说的同样的话的特征,提高系统的语音识别性能。因此,本实施例中所述语音信息搜索方法还可以包括:In another preferred embodiment, dynamic parameters may also be added to the voice feature to improve speech recognition. In the method described in this embodiment, a difference parameter that characterizes the dynamic characteristics of the voice may be added to the voice feature, and the mask may distinguish the same words that different people say, and improve the voice recognition performance of the system. Therefore, the voice information searching method in this embodiment may further include:
S105:计算所述MFCC系数的一阶和二阶差分系数,将所述一阶和二阶差分系数添加至所述语音特征中。即所述语音特征还可以包括所述MFCC系数的一阶和二阶差分系数。所述MFCC系数的一阶和二阶差分系数具体的可以采用下述公式计算:S105: Calculate first and second order difference coefficients of the MFCC coefficients, and add the first and second order difference coefficients to the voice feature. That is, the speech feature may further include first and second order differential coefficients of the MFCC coefficients. The first and second order difference coefficients of the MFCC coefficients can be specifically calculated by the following formula:
dt = ( Σ_{n=1}^{N} n(Ct+n − Ct−n) ) / ( 2 Σ_{n=1}^{N} n² )
上式中dt表示第t个一阶差分,Ct表示第t个倒谱系数,N表示一阶导数的时间差,可取1或2。将上式的结果再代入就可以得到二阶差分的参数。In the above formula, dt represents the tth first-order difference, Ct represents the t-th cepstral coefficient, and N represents the time difference of the first-order derivative, which may take 1 or 2. By substituting the results of the above equation, the parameters of the second order difference can be obtained.
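A sketch of the difference computation. Handling the sequence boundaries by repeating the edge cepstra is a common convention assumed here, not something the text specifies.

```python
import numpy as np

def delta(c, N=2):
    # d_t = sum_{n=1..N} n * (C_{t+n} - C_{t-n}) / (2 * sum_{n=1..N} n^2)
    c = np.asarray(c, dtype=float)
    padded = np.pad(c, N, mode='edge')  # repeat edge values where t-n < 0 or t+n >= T
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(len(c))
    ])
```

Applying `delta` to the first-order sequence gives the second-order difference coefficients, matching the substitution described above.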
提取所述语音信息库中目标语音的语音特征后,可以生成所述目标语音的特征描述符。本申请中所述的特征描述符可以包括语音特征的VLAD(vector of locally aggregated descriptors,局部特征聚合描述符)特征描述符。下述是本实施例提供的一种生成特征描述符的实施方法,可以将语音特征聚合到单独特征向量中。具体的实施过程可以包括:After the speech features of the target speech in the speech information library are extracted, the feature descriptors of the target speech may be generated. The feature descriptors described in this application may include VLAD (Vector of Locally Aggregated Descriptors) feature descriptors of the speech features. The following is an implementation of generating feature descriptors provided by this embodiment, which can aggregate the speech features into a single feature vector. The specific implementation process may include:
S101’:对所述提取的语音特征通过k-means聚类方法获取所述目标语音的码本;S101': acquiring, by using a k-means clustering method, the codebook of the target voice for the extracted voice feature;
生成所述VLAD描述符通常需要先训练码本,可以从所述提取的语音特征中随机选取N个语音特征,通过k-means聚类方法得到码字数量为k的码本{μ1,...,μk}。所述码本中的每一项为码字,可以表示为一个或者多个相同或者相近语音样本的聚合,例如μ1可以表示为语音信息库中多个不同语气但表示相同含义的目标语音的聚合。这样可以将所述语音信息库中的大量目标语音聚合形成码字数量为k的码本。所述k可以远远小于所述语音信息库中目标语音的数量。Generating the VLAD descriptor usually requires training a codebook first: N speech features can be randomly selected from the extracted speech features, and a codebook {μ1, ..., μk} with k codewords is obtained by k-means clustering. Each entry in the codebook is a codeword, which can represent an aggregation of one or more identical or similar speech samples; for example, μ1 may represent an aggregation of multiple target speeches in the speech information library that have different tones but express the same meaning. In this way, the large number of target speeches in the speech information library can be aggregated into a codebook with k codewords, where k can be far smaller than the number of target speeches in the speech information library.
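A toy k-means training loop for the codebook. A real system would pool far more features and use many more codewords; the initialization, iteration count, and seed here are illustrative assumptions.

```python
import numpy as np

def train_codebook(feats, k, iters=20, seed=0):
    # Plain k-means: learn k codewords {mu_1 .. mu_k} from the pooled features
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # assign every feature to its nearest codeword
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # move each codeword to the mean of its assigned features
        for i in range(k):
            if np.any(labels == i):
                centers[i] = feats[labels == i].mean(axis=0)
    return centers
```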
S102’:获取所述目标语音的语音特征集合,计算所述语音特征集合与所述码本中距离最近的码字的所有残差向量之和。S102': acquiring a voice feature set of the target voice, and calculating a sum of the voice feature set and all residual vectors of the codeword closest to the codebook.
所述语音信息库中一个目标语音的一个或多个语音特征可以形成一个特征集合{x1,...,xp}。对于某个目标语音的语音特征形成的特征集合,依次查找所述特征集合中的每个语音特征在所述码本中距离最近的码字,并计算所述特征集合中的语音特征与当前查找到的距离最近的码字的残差向量,然后累加属于同一个码字的所有残差向量。如下式所示:One or more speech features of one target speech in the speech information library can form a feature set {x1, ..., xp}. For the feature set formed by the speech features of a target speech, the codeword in the codebook closest to each speech feature in the set is looked up in turn, the residual vector between that speech feature and the closest codeword just found is calculated, and then all residual vectors belonging to the same codeword are accumulated, as shown below:
vi = Σ_{xt:NN(xt)=i} (xt − μi)
上式中,xt为所述特征集合{x1,...,xp}中的第t个特征,1≤t≤P,μi为所述码本{μ1,...,μk}中的第i个码字,1≤i≤k。xt:NN(xt)=i表示映射为同一个码字的特征子集。In the above formula, xt is the t-th feature in the feature set {x1, ..., xp}, 1≤t≤P, and μi is the i-th codeword in the codebook {μ1, ..., μk}, 1≤i≤k. xt:NN(xt)=i denotes the subset of features mapped to the same codeword.
S103’:对所述码字的残差向量之和进行归一化,生成所述目标语音的特征描述符。S103': normalizing the sum of the residual vectors of the codewords to generate a feature descriptor of the target speech.
对所述累加后的码字的残差向量进行归一化。可以连接所述归一化后的码字残差向量之和组成目标语音的VLAD总特征描述符V。所述VLAD总特征描述符V可以表示为:The residual vector of the accumulated codeword is normalized. The sum of the normalized codeword residual vectors may be connected to form a VLAD total feature descriptor V of the target speech. The VLAD total feature descriptor V can be expressed as:
V={v′i,...,v′k}V={v' i ,...,v' k }
其中v′i为所述归一化处理后的特征描述符的向量数据,所述的归一化处理可以采用下式进行处理:Where v′ i is the vector data of the normalized processed feature descriptor, and the normalization process can be processed by using the following formula:
v′i = vi / ‖vi‖2
本实施例中可以提取语音信息库中采集存储的目标语音的语音特征,然后可以生成所述目标语音的特征描述符。所述的总特征描述符V中的每一个VLAD特征描述符可以为一个多维的特性向量,例如可以为一个128维的特征向量{l1,l2,l3,......,l128},其中每一个特征向量的每个维度都是目标语音的聚类索引,可以根据所述多维的特征向量v′i定位找到相应的目标语音。In this embodiment, the voice feature of the target voice collected and stored in the voice information database may be extracted, and then the feature descriptor of the target voice may be generated. Each of the VLAD feature descriptors in the total feature descriptor V may be a multi-dimensional feature vector, for example, may be a 128-dimensional feature vector {l1, l2, l3, ..., l128}, Each dimension of each feature vector is a clustering index of the target speech, and the corresponding target speech can be found according to the multi-dimensional feature vector v′ i .
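Putting S101'-S103' together, a compact sketch: residuals to the nearest codeword are accumulated per codeword, each block is normalized (L2 normalization is one common VLAD choice, assumed here), and the blocks are concatenated into the total descriptor V.

```python
import numpy as np

def vlad(feats, codebook):
    # v_i = sum of (x_t - mu_i) over features whose nearest codeword is mu_i,
    # then each v_i is L2-normalized and the blocks are concatenated into V.
    feats = np.asarray(feats, dtype=float)
    k, d = codebook.shape
    nn = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=-1).argmin(axis=1)
    V = np.zeros((k, d))
    for i in range(k):
        if np.any(nn == i):
            V[i] = (feats[nn == i] - codebook[i]).sum(axis=0)
            norm = np.linalg.norm(V[i])
            if norm > 0:
                V[i] /= norm
    return V.ravel()  # the total descriptor V = {v'_1, ..., v'_k}
```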
S2:对所述特征描述符进行量化编码,生成量化编码后的特征描述符,并存储所述特征描述符。S2: Perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor.
实际的实施过程中,所述得到的一个VLAD特征描述符通常达到上千比特,为了加快搜索速度,尤其是在大数据集的应用场景下降低语音信息检索的困难和复杂性,本申请可以对所述获取的VLAD特征描述符进行量化编码,降低所述特征描述符的维度和信息长度,优化搜索查询效率。In actual implementation, a single VLAD feature descriptor obtained as above usually reaches thousands of bits. To speed up searching, and in particular to reduce the difficulty and complexity of speech information retrieval in large-data-set scenarios, this application may quantize and encode the obtained VLAD feature descriptors, reducing the dimension and information length of the feature descriptors and improving search and query efficiency.
具体的所述对所述特征描述符进行量化编码生成量化编码后的特征描述符可以包括:Specifically, performing quantization coding on the feature descriptors to generate the quantization-coded feature descriptors may include:
S201:将每个所述特征描述符等分成L份子向量,对所述L个子向量分别进行聚类,并设置所述子向量聚类后的索引编号,L≥2;S201: Dividing each of the feature descriptors into L sub-vectors, respectively clustering the L sub-vectors, and setting an index number of the sub-vector clustering, L≥2;
S202:将每个所述特征描述符的L个子向量分别用与所述子向量距离最近的所述聚类的索引编号表示,生成量化编码后的特征描述符。S202: The L sub-vectors of each of the feature descriptors are respectively represented by index numbers of the clusters that are closest to the sub-vectors, and the quantized encoded feature descriptors are generated.
图2是本实施例中所述一种对特征描述符进行量化编码的示意图。图2中以VLAD特征描述符为128维特征向量为例,可以将其分成8等份(y1~y8),每一等份中包括128维特征向量中的16维分量(16 components)。可以分别对所述每个分量进行单独聚类并映射到256个聚类中心上(256 centroids)。聚类后的每个子向量可以用索引编号表示,例如图2中的q1(y1)可以表示包括所述128维特征向量的前16维分量。分成的8等份的子向量中每个子向量可以用8位二进制表示,这样128维的特征向量就可以用8bit×8=64bit的信息表示,降低了信息处理维度和信息数据长度,提高处理效率。FIG. 2 is a schematic diagram of quantization coding of a feature descriptor in this embodiment. Taking a VLAD feature descriptor that is a 128-dimensional feature vector as an example, it can be divided into 8 equal parts (y1 to y8), each part containing 16 components of the 128-dimensional feature vector. Each part can be clustered separately and mapped onto 256 cluster centers (256 centroids). Each clustered sub-vector can then be represented by an index number; for example, q1(y1) in FIG. 2 can represent the first 16 components of the 128-dimensional feature vector. Each of the 8 sub-vectors can be represented in 8-bit binary, so the 128-dimensional feature vector can be represented with 8 bit × 8 = 64 bits of information, which reduces the processing dimension and data length and improves processing efficiency.
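A product-quantization-style sketch of the encoding in FIG. 2, shrunk to L = 2 sub-vectors and 4 centroids per sub-codebook so it stays readable; the settings in the text are 8 sub-vectors and 256 centroids (one byte each), and the sub-codebook values below are made up for illustration.

```python
import numpy as np

def pq_encode(desc, sub_codebooks):
    # Split the descriptor into L equal sub-vectors and replace each one by the
    # index of its nearest centroid in the corresponding sub-codebook.
    L = len(sub_codebooks)
    parts = np.split(np.asarray(desc, dtype=float), L)
    codes = [int(np.linalg.norm(cb - part, axis=1).argmin())
             for part, cb in zip(parts, sub_codebooks)]
    return np.array(codes, dtype=np.uint8)  # with 256 centroids: 1 byte per sub-vector
```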
通过上述方法可以将高维度的VLAD特征描述符表示为长度为L,分量为二级聚类索引编号的低维特征向量,可以大大提高后续语音信息的搜索速度。Through the above method, the high-dimensional VLAD feature descriptor can be represented as a low-dimensional feature vector with a length L and a component of a secondary clustering index number, which can greatly improve the search speed of subsequent voice information.
对所述特征描述符进行量化编码形成所述量化编码后特征描述符后,可以存储所述量化编码后的特征描述符。具体的存储时可以包括以所述量化编码后的特征描述符的索引的方式实现。本实施例中优选的可以采用建立K-d树索引,具体的所述存储所述特征描述符实施方式可以包括:After performing quantization quantization on the feature descriptor to form the quantized coded feature descriptor, the quantized coded feature descriptor may be stored. The specific storage may be implemented in the manner of indexing the quantized encoded feature descriptor. Preferably, the K-d tree index may be used in the embodiment, and the storing the feature descriptor implementation manner may include:
S201’:建立高度为(L+1)的K-d树;S201': establishing a K-d tree with a height of (L+1);
S202’:为所述K-d树的非叶子节点划分索引维度和与所述索引维度相对应的划分值;并建立与所述划分值进行比较的结果路径指向;S202': dividing an index dimension and a partition value corresponding to the index dimension for a non-leaf node of the K-d tree; and establishing a result path direction that is compared with the partition value;
S203’:从所述K-d树的根节点开始,将与非叶子节点的索引维度相对应的特征描述符的值与所述非叶子节点的划分值进行比较,并基于比较的结果与所述结果路径指向将所述特征描述符存储至所述K-d树的叶子节点中。S203': starting from a root node of the Kd tree, comparing a value of a feature descriptor corresponding to an index dimension of the non-leaf node with a division value of the non-leaf node, and based on the result of the comparison and the result The path points to store the feature descriptors into the leaf nodes of the Kd tree.
图3是本申请所述建立特征描述符的索引示意图。在图3的示例中,可以建立一个高度为(3+1)的K-d树,所述K-d树最后一层为叶子节点。需要进行存储的特征描述符为3维特征向量,可以预先划分每个非叶子节点的索引维度和划分值,并可以预先设置比较的结果路径指向,例如如果所述特征描述符在该非叶子节点上相应索引维度的值小于该非叶子节点的划分值,则进入该非叶子节点的左子树继续进行比较,否则进行右子树继续进行比较,直至进入叶子节点。需要说明的是,所述的索引维度可以表示为特征描述符中索引维度的值所指示的那一个维度的值,如果某一非叶子节点的上层节点已经比较过的维度则不再进行比较。因此,在非叶子节点索引维度划分的时候,从所述K-d树根节点到当前非叶子节点的路径上已经划分过的索引维度值不再划分给当前的非叶子节点。FIG. 3 is a schematic diagram of an index for establishing a feature descriptor according to the present application. In the example of FIG. 3, a K-d tree of height (3+1) can be established, and the last layer of the K-d tree is a leaf node. The feature descriptor that needs to be stored is a 3-dimensional feature vector, and the index dimension and the partition value of each non-leaf node may be pre-divided, and the result path direction of the comparison may be preset, for example, if the feature descriptor is at the non-leaf node If the value of the corresponding index dimension is smaller than the division value of the non-leaf node, the left subtree entering the non-leaf node continues to be compared, otherwise the right subtree is continued to be compared until the leaf node is entered. It should be noted that the index dimension may be represented as the value of the dimension indicated by the value of the index dimension in the feature descriptor. If the dimension of the upper node of a non-leaf node has been compared, the comparison is no longer performed. Therefore, when the non-leaf node index dimension is divided, the index dimension values that have been divided from the K-d tree root node to the current non-leaf node are no longer allocated to the current non-leaf node.
上述为非叶子节点划分的划分值,在具体的实施中可以取值上述量化编码时设置的所述聚类中心个数的一半,如上述量化编码时设置的聚类中心个数为256,在设置所述划分值时可以取值为128。The above-mentioned division value of the non-leaf node division may be half of the number of cluster centers set in the above-mentioned quantization coding in a specific implementation, and the number of cluster centers set in the above-mentioned quantization coding is 256. The value of the division value can be set to 128.
在上述例子中,可以分别将3维特征向量的特征描述符存储到叶子节点中。如一个特征描述符为(45,210,60)的3维特征向量,按照图3建立的所述K-d树可以先从根节点开始,将特征描述符(45,210,60)与根节点中所示的索引维度为2、划分值为128进行相应索引维度的值的比较。当前节点索引维度为2所对应的所述特征描述符(45,210,60)的值为210,大于当前节点划分值128,则进入根节点的右子树。然后所述特征描述符(45,210,60)继续与当前非叶子节点设置的索引维度为1、划分值为128进行相应索引维度值的比较,所述特征描述符(45,210,60)索引维度为1所对应的值为45,小于当前非叶子节点的划分值128,则进入当前非叶子节点的左子树。同样的方法,可以按照当前非叶子节点设置索引维度为3、划分值为128将所述特征描述符(45,210,60)的值60与128进行比较,然后进入当前非叶子节点的左侧,到达所述K-d树的叶子节点。此时,可以将所述特征描述符(45,210,60)存入至该叶子节点中。对应更高维的特征描述符,可以建立相应的K-d树,仍然按照上述方法进行存储到叶子节点中。In the above example, the feature descriptors, each a 3-dimensional feature vector, can be stored into the leaf nodes. For instance, for a feature descriptor that is the 3-dimensional feature vector (45, 210, 60), the K-d tree built as in FIG. 3 starts from the root node, where the index dimension is 2 and the partition value is 128, and the descriptor's value on that index dimension is compared with the partition value. The value of the feature descriptor (45, 210, 60) on index dimension 2 is 210, which is greater than the partition value 128 of the current node, so the search enters the right subtree of the root node. The feature descriptor (45, 210, 60) is then compared at the current non-leaf node, whose index dimension is 1 and partition value is 128; its value on index dimension 1 is 45, which is smaller than the partition value 128, so the search enters the left subtree of the current non-leaf node. In the same way, with the current non-leaf node's index dimension set to 3 and partition value to 128, the value 60 of the feature descriptor (45, 210, 60) is compared with 128, the search enters the left side of the current non-leaf node, and a leaf node of the K-d tree is reached. At this point, the feature descriptor (45, 210, 60) can be stored in that leaf node. For higher-dimensional feature descriptors, a corresponding K-d tree can be built and the descriptors stored into its leaf nodes in the same way.
上述K-d中的最后一层为叶子节点,每个叶子节点存储的为所有经过从根节点到该叶子节点这条搜索路径的特征描述符的特征向量。其中每一个特征向量的每个维度都是目标语音的聚类索引。这样,每个叶子节点存储的是上述建立的K-d索引中相同路径的特征描述符的集合,在此可以将一个叶子节点存储的特征描述符作为一个目标候选集。The last layer in the above K-d is a leaf node, and each leaf node stores a feature vector of all feature descriptors passing through the search path from the root node to the leaf node. Each dimension of each of the feature vectors is a clustering index of the target speech. Thus, each leaf node stores a set of feature descriptors of the same path in the above-mentioned established K-d index, where a feature descriptor stored by one leaf node can be used as a target candidate set.
本申请所述方法优选的实施例中,为保证每个所述叶子节点存储的目标候选集容量相当,均衡各个叶子节点存储的特征描述符的数量,所述为非叶子节点划分索引维度可以包括:为非叶子节点划分的索引维度值S为随机生成的取值范围为1≤S≤L的整数,并且当前非叶子节点的索引维度值S为从所述K-d树根节点到所述当前非叶子节点的路径上未划分过的索引维度值。上述中的L为所述特征描述符的长度,即所述特征描述符中特征向量的维数。这样,非叶子节点所述索引维度的随机划分,提高K-d树左右两侧节点的均匀分布性,可以避免部分叶子节点存储的目标候选集内的特征描述符数量过多,可以提高后续在目标候选集内的搜索速度,均衡各个目标候选集的负载。In a preferred embodiment of the method described in this application, to ensure that the target candidate sets stored in the leaf nodes are of comparable size and to balance the number of feature descriptors stored in each leaf node, partitioning the index dimensions for the non-leaf nodes may include: the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1≤S≤L, and the index dimension value S of the current non-leaf node has not yet been assigned on the path from the root of the K-d tree to the current non-leaf node. Here L is the length of the feature descriptor, i.e. the dimension of the feature vector in the feature descriptor. Randomly partitioning the index dimensions of the non-leaf nodes in this way makes the nodes on the left and right sides of the K-d tree more evenly distributed, avoids some leaf nodes storing too many feature descriptors in their target candidate sets, speeds up subsequent searching within a target candidate set, and balances the load across the target candidate sets.
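A stripped-down sketch of the descent described above, with preset split dimensions and the split value 128. The tree shape and node layout are illustrative (the text's example tree has height 4; a height-3 tree keeps the sketch short), and dimensions are 0-indexed here, so "dimension 1" in the code is the second component.

```python
def make_leaf():
    return {'leaf': []}  # a leaf holds one target candidate set of descriptors

def make_node(dim, split, left, right):
    return {'dim': dim, 'split': split, 'left': left, 'right': right}

def kd_insert(tree, desc):
    # Descend: at each inner node compare the descriptor's value on that node's
    # split dimension with the split value; < goes left, >= goes right.
    node = tree
    while 'leaf' not in node:
        node = node['left'] if desc[node['dim']] < node['split'] else node['right']
    node['leaf'].append(desc)
    return node

def kd_lookup(tree, desc):
    # The same descent locates the candidate set for a query descriptor.
    node = tree
    while 'leaf' not in node:
        node = node['left'] if desc[node['dim']] < node['split'] else node['right']
    return node['leaf']
```

Because lookup follows exactly the same path as insertion, a query lands in the leaf holding the stored descriptors that share its path.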
S3:获取待识别语音的特征描述符,在所述存储的所述特征描述符中,查找与所述待识别语音的特征描述符相匹配的特征描述符对应的目标语音,将查找到的目标语音作为所述待识别语音对应的目标候选集;S3: acquiring a feature descriptor of the voice to be recognized, searching, in the stored feature descriptor, a target voice corresponding to the feature descriptor that matches the feature descriptor of the to-be-recognized voice, and finding the target Voice as a target candidate set corresponding to the to-be-recognized voice;
对于给定的待识别语音,可以按照上述所述的方法提取待识别语音的语音特征,并获取所述待识别语音的量化编码后的特征描述符。可以将所述待识别语音的特征描述符与所述目标语音的特征描述符匹配,查找到与所述待识别语音的特征描述符相匹配的特征描述符对应的目标语音,将查找到的目标语音作为所述待识别语音对应的目标候选集。所述与所述待识别语音的特征描述符相匹配的特征描述符对应的目标语音在本实施例中可以包括在建立的K-d树中与所述待识别语音的特征描述符路径相同的叶子节点中存储的特征描述符所对应的目标语音。For a given speech to be recognized, the speech feature of the speech to be recognized may be extracted according to the method described above, and the quantized and encoded feature descriptor of the speech to be recognized may be acquired. The feature descriptor of the to-be-recognized voice may be matched with the feature descriptor of the target voice, and the target voice corresponding to the feature descriptor matching the feature descriptor of the to-be-recognized voice may be found, and the target to be found is found. The voice is used as a target candidate set corresponding to the to-be-recognized voice. The target voice corresponding to the feature descriptor matching the feature descriptor of the to-be-identified voice may include a leaf node in the established Kd tree that is the same as the feature descriptor path of the to-be-recognized voice. The target voice corresponding to the feature descriptor stored in it.
例如通过类似上述图3所示的实施方式,可以将待识别语音生成与叶子节点存储的特征描述符的维度相同的特征描述符,例如(90,135,78)。然后可以按照图3所示的K-d树及设置的索引维度和划分值进行特征描述符的查找,最终获取到一个叶子节点的与待识别语音的特征描述符相匹配的目标候选集。所述待识别语音的特征描述符在所述K-d树中的查找过程可以参照前述目标语音的特征描述符(45,210,60)的存储过程,可以查找到待识别语音的特征描述符(90,135,78)对应的目标语音,确定目标候选集,如图3所示的(90,135,75)所在的叶子节点。这样实现了将待识别语音定位到一个范围较小的目标候选集所对应的目标语音集中。For example, in an implementation similar to that shown in FIG. 3, a feature descriptor with the same dimension as those stored in the leaf nodes, e.g. (90, 135, 78), can be generated for the speech to be recognized. The feature descriptor lookup can then be performed following the K-d tree of FIG. 3 and its configured index dimensions and partition values, finally obtaining the target candidate set of the leaf node that matches the feature descriptor of the speech to be recognized. The lookup of the feature descriptor of the speech to be recognized in the K-d tree can follow the storage procedure of the target speech feature descriptor (45, 210, 60) described above; the target speeches corresponding to the feature descriptor (90, 135, 78) of the speech to be recognized can be found and the target candidate set determined, e.g. the leaf node containing (90, 135, 75) in FIG. 3. In this way the speech to be recognized is located within the target speech set corresponding to a target candidate set of much smaller scope.
S4:根据预定规则在所述目标候选集中选取所述待识别语音的搜索结果。S4: Select the search result of the to-be-identified voice in the target candidate set according to a predetermined rule.
获取目标候选集后,已经大大缩小了搜索范围。然后可以根据预先设置的预定做进一步精选,获取所述待识别语音的搜索结果。本实施例中提供一种在所述目标候选集中进一步精选搜索结果的方法,因此,本实施例中具体所述根据预定规则在所述目标候选集中选取所述待识别语音的搜索结果可以包括:After obtaining the target candidate set, the search scope has been greatly reduced. Then, further selection may be made according to a preset schedule to obtain a search result of the to-be-identified voice. In this embodiment, a method for further selecting a search result in the target candidate set is provided. Therefore, the search result that specifically selects the to-be-identified voice in the target candidate set according to a predetermined rule in the embodiment may include: :
在所述目标候选集中选取与所述待识别语音的特征描述符欧氏距离最小的前R个特征描述符作为搜索结果集,以所述搜索结果集所对应的目标语音作为所述待识别语音的搜索结果,R≥1。Selecting, in the target candidate set, a top R feature descriptors having the smallest Euclidean distance from the feature descriptor of the to-be-recognized speech as a search result set, and using the target speech corresponding to the search result set as the to-be-recognized speech Search results for R≥1.
本实施例中可以计算所述待识别语音的特征描述符与所述目标候选集中的特征向量的欧式距离,可以将计算得出的欧式距离按照递增的顺序进行排序,选取所述欧式距离最小的前R个特征描述作为搜索结果集。当然,所述搜索结果集内的特征描述符所述对应的语音信息库中的目标语音即为所述待识别语音的搜索结果。In this embodiment, the Euclidean distance between the feature descriptor of the to-be-recognized speech and the feature vector in the target candidate set may be calculated, and the calculated Euclidean distance may be sorted in an increasing order, and the Euclidean distance is selected to be the smallest. The first R feature descriptions are used as a search result set. Certainly, the target voice in the corresponding voice information database in the feature descriptor in the search result set is the search result of the voice to be recognized.
这里所述的选取的欧式距离最小的前R个结果中R的取值范围可以根据需求进行自行设置。例如所述R可以取值为1,可以表示为选取所述欧式距离最小的特征描述符,如待识别语音的特征描述符为(90,135,78),在获取的目标候选集中可以选出欧式距离最小的特征描述符(90,135,75)作为所述搜索结果集。当然,在实际的应用中,所述R也可以取值大于1,小于所述目标候选集内特征描述符的个数,例如可以取值为3,这样可以从所述目标候选集中选出欧式距离最小的前3个特征描述符(90,135,78)、(87,135,80)、(101,137,81)作为搜索结果集,可以将欧式距离最小的特征描述符(90,135,78)所对应的目标语音作为待识别语音的优选搜索结果,将其余两个搜索结果选集的特征描述符所对应的目标语音作为参考或者备选搜索结果。The range of values of R in the top R results with the smallest Euclidean distance selected here can be set according to requirements. For example, the R may take a value of 1, and may be represented as selecting a feature descriptor with the smallest Euclidean distance. For example, the feature descriptor of the speech to be recognized is (90, 135, 78), and the Euclidean distance may be selected in the acquired target candidate set. The smallest feature descriptor (90, 135, 75) is used as the search result set. Of course, in an actual application, the R may also take a value greater than 1, and is smaller than the number of feature descriptors in the target candidate set, for example, may be 3, so that the European candidate may be selected from the target candidate set. The minimum of the first three feature descriptors (90, 135, 78), (87, 135, 80), (101, 137, 81) as the search result set, the target speech corresponding to the feature descriptor (90, 135, 78) with the smallest Euclidean distance As a preferred search result of the speech to be recognized, the target speech corresponding to the feature descriptors of the remaining two search result sets is used as a reference or an alternative search result.
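The final re-ranking step, sketched with the numbers from the example above; `R` and the candidate values are illustrative.

```python
import numpy as np

def top_r(query, candidates, R=3):
    # Rank the candidate set by Euclidean distance to the query descriptor
    # and keep the R nearest ones, in increasing order of distance.
    c = np.asarray(candidates, dtype=float)
    d = np.linalg.norm(c - np.asarray(query, dtype=float), axis=1)
    return [candidates[i] for i in np.argsort(d)[:R]]
```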
本申请提供的一种语音信息搜索方法,可以提取目标语音的语音底层特征,生成相应的VLAD特征描述符,用于表示目标语音的特征集合,可以根据所述特征描述符定位到相应的目标语音。本申请引入VLAD特征描述符将语音特征集合转换成长度固定的整体特征,然后将信息长度较长、维度较高的VLAD特征描述符经过量化编码转换为向量长度较小、维度更低的特征向量,大大提高了信息读取、解析、索引速度。最后通过K-d树索引,建立目标候选集,将待识别语音通过K-d树索引进行搜索获取搜索范围更小的目标候选集,在所述目标候选集中进一步精选得到搜索结果,进一步加快了搜索速度。The speech information search method provided by this application can extract low-level speech features of the target speech and generate corresponding VLAD feature descriptors that represent the feature set of the target speech, so that the corresponding target speech can be located from the feature descriptors. This application introduces VLAD feature descriptors to convert a set of speech features into a single fixed-length overall feature, and then converts the long, high-dimensional VLAD feature descriptor by quantization coding into a feature vector of smaller length and lower dimension, greatly improving the speed of reading, parsing, and indexing the information. Finally, target candidate sets are established through a K-d tree index; the speech to be recognized is searched through the K-d tree index to obtain a target candidate set of much smaller scope, and the search results are further refined within that target candidate set, which further speeds up the search.
基于本申请所述语音信息搜索方法,本申请还提供一种语音信息搜索装置,该装置可以将语音信息库中的目标语音按照规则分别存储到相应的候选集模块中,并建立相应的索引机制,在获取待识别语音后可以快速的通过索引获取目标候选集,进一步的获取搜索结果,从而将传统的发音模板的语音识别转换成语音特征的搜索,并且通过量化编码和优化索引,提高了搜索速度,优化查询效率。图4是本申请所述一种语音信息搜索装置的模块结构示意图,如图4所示,所述装置可以包括:Based on the voice information searching method of the present application, the present application further provides a voice information searching device, which can store target voices in a voice information database according to rules into corresponding candidate set modules, and establish a corresponding indexing mechanism. After obtaining the speech to be recognized, the target candidate set can be quickly obtained through the index, and the search result is further obtained, thereby converting the speech recognition of the traditional pronunciation template into the search of the speech feature, and improving the search by quantitatively encoding and optimizing the index. Speed, optimize query efficiency. 4 is a block diagram of a structure of a voice information search device according to the present application. As shown in FIG. 4, the device may include:
信息获取模块101,可以用于获取目标语音,并提取所述目标语音的语音特征;The information acquiring module 101 is configured to acquire a target voice, and extract a voice feature of the target voice;
描述符模块102,可以用于基于所述目标语音的语音特征生成所述目标语音的特征描述符;The descriptor module 102 may be configured to generate a feature descriptor of the target voice based on a voice feature of the target voice;
量化编码模块103,可以用于对所述特征描述符进行量化编码,生成量化编码后的特征描述符,并存储所述特征描述符;The quantization coding module 103 may be configured to perform quantization coding on the feature descriptor, generate a quantized encoded feature descriptor, and store the feature descriptor;
识别信息模块104,可以用于获取待识别语音的特征描述符;The identification information module 104 can be configured to acquire a feature descriptor of the voice to be recognized;
第一搜索模块105,可以用于在所述存储的所述特征描述符中,查找与所述待识别语音的特征描述符相匹配的特征描述符对应的目标语音,将查找到的目标语音作为所述待识别语音对应的目标候选集;The first search module 105 may be configured to: in the stored feature descriptor, search for a target voice corresponding to the feature descriptor that matches the feature descriptor of the to-be-recognized voice, and use the found target voice as a target candidate set corresponding to the to-be-recognized voice;
第二搜索模块106,可以用于根据预定规则在所述目标候选集中选取所述待识别语音的搜索结果。The second search module 106 may be configured to select the search result of the to-be-identified voice in the target candidate set according to a predetermined rule.
所在语音搜索装置中,所述信息获取模块101提取目标语音的语音特征的方式可以包括基于MFCC和PLP方法的提取方式。例如利用MFCC提取语音特征的过程可以包括:In the voice search device, the manner in which the information acquisition module 101 extracts the voice features of the target voice may include an extraction method based on the MFCC and the PLP method. For example, the process of extracting speech features using MFCC may include:
对所述目标语音进行预处理;Preprocessing the target speech;
计算所述预处理后的目标语音的能量谱;Calculating an energy spectrum of the pre-processed target speech;
对所述能量谱进行Mel滤波,计算所述Mel滤波后的能量谱的对数;Performing Mel filtering on the energy spectrum to calculate a logarithm of the energy spectrum after the Mel filtering;
对所述能量谱的对数进行DCT变换得到MFCC系数,获取所述目标语音的语音特征。Performing DCT transform on the logarithm of the energy spectrum to obtain MFCC coefficients, and acquiring speech features of the target speech.
其中,所述对所述目标语音进行预处理具体的可以包括:对所述目标语音进行语音格式转换、预加重、分帧、加窗处理。The pre-processing the target voice may include: performing voice format conversion, pre-emphasis, framing, and windowing on the target voice.
当然,如上述,为进一步提高装置语音识别率,所述利用MFCC提取语音特征的过程还可以包括:As a matter of course, in order to further improve the voice recognition rate of the device, the process of extracting voice features by using the MFCC may further include:
计算所述MFCC系数的一阶和二阶差分系数,将所述一阶和二阶差分系数添加至所述语音特征中。Calculating the first- and second-order difference coefficients of the MFCC coefficients, and adding the first- and second-order difference coefficients to the speech features.
本实施例具体的提取目标语音的语音特征的实施过程可以参照本申请其他实施例中的叙述,在此不做赘述。For the implementation of the voice feature of the target voice in this embodiment, reference may be made to the description in other embodiments of the present application, and details are not described herein.
另一种实施例中,所述量化编码模块103具体的可以包括:In another embodiment, the quantization and coding module 103 may specifically include:
聚类模块,可以用于将每个所述特征描述符等分成L份子向量,对所述L个子向量分别进行聚类,并设置所述子向量聚类后的索引编号,L≥2;a clustering module, which may be configured to divide each of the feature descriptors into L equal sub-vectors, cluster the L sub-vectors separately, and set index numbers for the clustered sub-vectors, L≥2;
映射模块,可以用于将每个所述特征描述符的L个子向量分别用与所述子向量距离最近的所述聚类的索引编号表示,形成量化编码后的特征描述符。The mapping module may be configured to respectively represent L sub-vectors of each of the feature descriptors by an index number of the cluster that is closest to the sub-vector, to form a quantized encoded feature descriptor.
在对所述子向量进行聚类时,每个子向量可以用8位二进制表示,这样可以将每个特征描述符转换成长度为L、维度更低的特征向量。When the sub-vectors are clustered, each sub-vector can be represented in 8-bit binary, so that each feature descriptor can be converted into a feature vector of length L with lower dimension.
FIG. 5 is a schematic structural diagram of a quantization and coding module provided by the present application. As shown in FIG. 5, in another embodiment, the quantization and coding module 103 may specifically include:
an index tree construction module 1031, which may be used to build a K-d tree of height (L+1);
a preset rule module 1032, which may be used to assign, according to preset rules, an index dimension and a partition value corresponding to the index dimension to each non-leaf node of the K-d tree, and to establish the result-path direction used when comparing against the partition value;
a storage module 1033, which may be used to compare, starting from the root node of the K-d tree, the value of a feature descriptor at the index dimension of each non-leaf node with that node's partition value, and to store the feature descriptor into a leaf node of the K-d tree based on the comparison result and the result-path direction.
The established result-path direction may be as follows: the feature descriptor value is compared with the partition value of the current non-leaf node; if the former is greater than the latter, the comparison continues in the left subtree of the current non-leaf node, otherwise it continues in the right subtree. Of course, the rule may instead be set so that the left subtree is entered when the former is smaller than the latter; this can be configured as required.
In a preferred embodiment, the preset rule module 1032 assigning an index dimension to a non-leaf node may specifically include:
the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1 ≤ S ≤ L, and the index dimension value S of the current non-leaf node must not have been assigned already on the path from the root node of the K-d tree to the current non-leaf node.
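The storage scheme above can be sketched roughly as follows: a tree whose L internal levels each test one distinct, randomly chosen dimension of the length-L quantized code against a partition value, with descriptors sharing the same comparison path collected in the same leaf bucket. The class name, the constant partition value, and the dict-of-buckets representation are simplifying assumptions, not the patented structure.

```python
import random
from collections import defaultdict

class KdIndex:
    """Height-(L+1) comparison tree over length-L quantized codes.

    Each of the L internal levels tests one distinct, randomly chosen
    dimension against a partition value; the resulting sequence of
    comparison outcomes is the path that selects the leaf bucket.
    """
    def __init__(self, L, partition_value=128, seed=0):
        rng = random.Random(seed)
        # each dimension used exactly once on the root-to-leaf path
        self.dims = rng.sample(range(L), L)
        # one split value per level (assumed constant here for simplicity)
        self.partition = [partition_value] * L
        self.leaves = defaultdict(list)

    def path(self, code):
        # 'greater goes one way, otherwise the other', as described above
        return tuple(code[d] > p for d, p in zip(self.dims, self.partition))

    def insert(self, code, payload):
        self.leaves[self.path(code)].append(payload)

    def candidates(self, code):
        # all stored items sharing the query code's leaf path
        return self.leaves[self.path(code)]
```

Descriptors with identical paths land in the same leaf, so a query only has to be compared against that leaf's bucket rather than the whole collection.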
FIG. 6 is a schematic structural diagram of a second search module provided by the present application. As shown in FIG. 6, the second search module 106 of the above voice search apparatus may specifically include:
a distance calculation module 1061, which may be used to calculate the Euclidean distance between each feature descriptor in the target candidate set and the feature descriptor of the speech to be recognized;
a screening module 1062, which may be used to select, from the target candidate set, the top R feature descriptors with the smallest Euclidean distance to the feature descriptor of the speech to be recognized as the search result set, R ≥ 1;
a target speech module 1063, which may be used to acquire the target speech corresponding to the feature descriptors in the search result set.
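A minimal sketch of this screening step, under the assumption that candidates arrive as (identifier, descriptor) pairs; all names are illustrative:

```python
import math

def top_r_matches(candidates, query, R=3):
    """Rank candidate (id, descriptor) pairs by Euclidean distance to the
    query descriptor and keep the R closest as the search result set."""
    def dist(desc):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(desc, query)))
    ranked = sorted(candidates, key=lambda item: dist(item[1]))
    return ranked[:R]
```

The identifiers of the returned pairs would then be mapped back to their target speech, as module 1063 describes.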
In another embodiment, the descriptor module 102 may specifically include:
a codebook training module, which may be used to obtain the codebook of the target speech by applying a k-means clustering method to the extracted speech features;
a speech feature set module, which may be used to obtain a speech feature set and to calculate the sum of all residual vectors between the speech features and their nearest codeword in the codebook; the speech feature set may be, for example, {x1, ..., xp}, where each x may represent a speech feature corresponding to one target speech;
a normalization processing module, which may be used to normalize the sums of the codeword residual vectors to generate the feature descriptor of the target speech.
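Read this way, the residual-sum construction resembles a VLAD-style descriptor. The following sketch assumes that reading (NumPy, illustrative names) rather than reproducing the patented computation: each feature's residual to its nearest codeword is accumulated per codeword, and the concatenation is L2-normalized.

```python
import numpy as np

def describe(features, codebook):
    """Sum residuals of each feature to its nearest codeword, then
    L2-normalize the concatenated result into one fixed-length descriptor.
    features: iterable of (d,) arrays; codebook: (K, d) array."""
    K, d = codebook.shape
    agg = np.zeros((K, d))
    for x in features:
        nearest = ((codebook - x) ** 2).sum(1).argmin()
        agg[nearest] += x - codebook[nearest]
    v = agg.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

The result is a single K·d-dimensional, unit-norm vector per target speech, regardless of how many frame-level features the speech produced.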
The voice information search apparatus described in the present application can be applied to a variety of terminal devices or servers. For example, a smart mobile terminal of the kind commonly used in daily life can acquire voice information to be recognized and send it to a server; the server can then perform a voice search using the embodiments of the voice information search method and apparatus described in the present application, obtain the corresponding voice search results, and carry out further processing based on those results. Accordingly, the present application further provides a voice information search server, the server being configured to include:
a first processing unit, configured to acquire target speech and generate feature descriptors of the target speech, and further configured to quantize and encode the feature descriptors;
a storage unit, configured to store, grouped by path, the quantized and encoded feature descriptors that share the same path;
a second processing unit, configured to acquire the feature descriptor of the speech to be recognized, to search the stored feature descriptors for the target speech whose feature descriptors match the speech to be recognized so as to obtain a candidate set, and to select from the candidate set, according to a predetermined rule, the search result for the speech to be recognized.
Of course, the server may further return the obtained search result to the client that sent the speech to be recognized, or perform other processing in combination with functional modules of this or other servers. The voice search server provided in this embodiment combines feature descriptors with an index tree, optimizing the voice indexing method and improving the server's voice search speed.
The voice information search method, apparatus, and server provided by the present application can store target speech into corresponding candidate sets according to rules and build a corresponding indexing mechanism, so that after the speech to be recognized is acquired, the target candidate set can be obtained quickly through the index and the search results retrieved from it. Speech recognition based on traditional pronunciation templates is thereby converted into a search over speech features, and the index is optimized through quantization coding and the K-d tree, improving search speed and query efficiency.
Although the description above mentions information transmission, data transformation, tree-structured data, and the like, the present application is not limited to cases that fully comply with standard communication protocols or data processing standards. The solutions of the embodiments of the present application can also be implemented with transmission mechanisms or data processing standards slightly modified from certain protocols or standards. Of course, even if a proprietary protocol or data processing standard is adopted instead of the general protocols or standards above, the same application can still be realized as long as the information interaction and judgment-feedback manner of the above embodiments is followed; details are not repeated here.
The units, apparatuses, or modules set forth in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. For convenience of description, the above apparatuses are described with their functions divided into various modules. Of course, when implementing the present application, the functions of the modules may be implemented in one or more pieces of software and/or hardware, or modules implementing the same function may be implemented by a combination of multiple sub-modules or sub-units.
Those skilled in the art also know that, besides implementing a controller purely in computer-readable program code, it is entirely possible, by logically programming the method steps, to have the controller achieve the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included within it for realizing various functions can also be regarded as structures within the hardware component. Or the means for realizing various functions can even be regarded as both software modules implementing the method and structures within the hardware component.
The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as ROM/RAM, magnetic disk, or optical disc, and which includes instructions causing a computer device (which may be a personal computer, mobile terminal, server, network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. The present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Although the present application has been described through embodiments, those of ordinary skill in the art will appreciate that there are many variations and changes to the present application without departing from its spirit, and it is intended that the appended claims cover such variations and changes without departing from the spirit of the present application.

Claims (16)

  1. A voice information search method, characterized in that the method comprises:
    extracting speech features of target speech in a voice information base, and generating feature descriptors of the target speech;
    quantizing and encoding the feature descriptors to generate quantized and encoded feature descriptors, and storing the feature descriptors;
    acquiring a feature descriptor of speech to be recognized; searching, among the stored feature descriptors, for the target speech corresponding to the feature descriptors that match the feature descriptor of the speech to be recognized, and taking the found target speech as a target candidate set corresponding to the speech to be recognized;
    selecting, according to a predetermined rule, a search result for the speech to be recognized from the target candidate set.
  2. The voice information search method according to claim 1, characterized in that quantizing and encoding the feature descriptors to generate quantized and encoded feature descriptors comprises:
    dividing each feature descriptor equally into L sub-vectors, clustering the L sub-vectors respectively, and setting index numbers for the clustered sub-vectors, L ≥ 2;
    representing each of the L sub-vectors of each feature descriptor by the index number of the cluster closest to that sub-vector, to generate the quantized and encoded feature descriptor.
  3. The voice information search method according to claim 2, characterized in that storing the feature descriptors comprises:
    building a K-d tree of height (L+1);
    assigning, to each non-leaf node of the K-d tree, an index dimension and a partition value corresponding to the index dimension, and establishing a result-path direction for comparisons against the partition value;
    starting from the root node of the K-d tree, comparing the value of a feature descriptor at the index dimension of each non-leaf node with that node's partition value, and storing the feature descriptor into a leaf node of the K-d tree based on the comparison result and the result-path direction.
  4. The voice information search method according to claim 3, characterized in that assigning index dimensions to non-leaf nodes comprises:
    the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1 ≤ S ≤ L, and the index dimension value S of the current non-leaf node has not been assigned on the path from the root node of the K-d tree to the current non-leaf node.
  5. The voice information search method according to claim 1, characterized in that selecting, according to a predetermined rule, a search result for the speech to be recognized from the target candidate set comprises:
    selecting, from the target candidate set, the top R feature descriptors with the smallest Euclidean distance to the feature descriptor of the speech to be recognized as a search result set, and taking the target speech corresponding to the search result set as the search result for the speech to be recognized, R ≥ 1.
  6. The voice information search method according to claim 1, characterized in that generating feature descriptors of the target speech comprises:
    obtaining a codebook of the target speech by applying a k-means clustering method to the extracted speech features;
    acquiring a speech feature set of the target speech, and calculating the sum of all residual vectors between the speech feature set and the nearest codeword in the codebook;
    normalizing the sums of the codeword residual vectors to generate the feature descriptor of the target speech.
  7. The voice information search method according to claim 1, characterized in that extracting speech features of target speech in the voice information base comprises:
    preprocessing the target speech;
    calculating the energy spectrum of the preprocessed target speech;
    applying Mel filtering to the energy spectrum, and calculating the logarithm of the Mel-filtered energy spectrum;
    applying a DCT transform to the logarithm of the energy spectrum to obtain MFCC coefficients, thereby acquiring the speech features of the target speech.
  8. The voice information search method according to claim 7, characterized in that preprocessing the target speech comprises: performing speech format conversion, pre-emphasis, framing, and windowing on the target speech.
  9. The voice information search method according to claim 8, characterized in that the method further comprises:
    calculating first-order and second-order difference coefficients of the MFCC coefficients, and adding the first-order and second-order difference coefficients to the speech features.
  10. A voice information search apparatus, characterized in that the apparatus comprises:
    an information acquisition module, configured to acquire target speech and extract speech features of the target speech;
    a descriptor module, configured to generate feature descriptors of the target speech based on the speech features of the target speech;
    a quantization coding module, configured to quantize and encode the feature descriptors, generate quantized and encoded feature descriptors, and store the feature descriptors;
    a recognition information module, configured to acquire a feature descriptor of speech to be recognized;
    a first search module, configured to search, among the stored feature descriptors, for the target speech corresponding to the feature descriptors that match the feature descriptor of the speech to be recognized, and to take the found target speech as a target candidate set corresponding to the speech to be recognized;
    a second search module, configured to select, according to a predetermined rule, a search result for the speech to be recognized from the target candidate set.
  11. The voice information search apparatus according to claim 10, characterized in that the quantization coding module comprises:
    a clustering module, configured to divide each feature descriptor equally into L sub-vectors, cluster the L sub-vectors respectively, and set index numbers for the clustered sub-vectors, L ≥ 2;
    a mapping module, configured to represent each of the L sub-vectors of each feature descriptor by the index number of the cluster closest to that sub-vector, to form the quantized and encoded feature descriptor.
  12. The voice information search apparatus according to claim 10, characterized in that the quantization coding module comprises:
    an index tree construction module, configured to build a K-d tree of height (L+1);
    a preset rule module, configured to assign, according to preset rules, an index dimension and a partition value corresponding to the index dimension to each non-leaf node of the K-d tree, and to establish a result-path direction for comparisons against the partition value;
    a storage module, configured to compare, starting from the root node of the K-d tree, the value of a feature descriptor at the index dimension of each non-leaf node with that node's partition value, and to store the feature descriptor into a leaf node of the K-d tree based on the comparison result and the result-path direction.
  13. The voice information search apparatus according to claim 12, characterized in that the preset rule module assigning index dimensions to non-leaf nodes comprises:
    the index dimension value S assigned to a non-leaf node is a randomly generated integer in the range 1 ≤ S ≤ L, and the index dimension value S of the current non-leaf node has not been assigned on the path from the root node of the K-d tree to the current non-leaf node.
  14. The voice information search apparatus according to claim 10, characterized in that the second search module comprises:
    a distance calculation module, configured to calculate the Euclidean distance between each feature descriptor in the target candidate set and the feature descriptor of the speech to be recognized;
    a screening module, configured to select, from the target candidate set, the top R feature descriptors with the smallest Euclidean distance to the feature descriptor of the speech to be recognized as a search result set, R ≥ 1;
    a target speech module, configured to acquire the target speech corresponding to the feature descriptors in the search result set.
  15. The voice information search apparatus according to claim 10, characterized in that the descriptor module comprises:
    a codebook training module, configured to obtain a codebook of the target speech by applying a k-means clustering method to the extracted speech features;
    a speech feature set module, configured to acquire a speech feature set and calculate the sum of all residual vectors between the speech feature set and the nearest codeword in the codebook;
    a normalization processing module, configured to normalize the sums of the codeword residual vectors to generate the feature descriptor of the target speech.
  16. A voice information search server, characterized in that the server is configured to comprise:
    a first processing unit, configured to acquire target speech and generate feature descriptors of the target speech, and further configured to quantize and encode the feature descriptors;
    a storage unit, configured to store, grouped by path, the quantized and encoded feature descriptors that share the same path;
    a second processing unit, configured to acquire a feature descriptor of speech to be recognized, to search the stored feature descriptors for the target speech whose feature descriptors match the speech to be recognized so as to obtain a candidate set, and to select from the candidate set, according to a predetermined rule, the search result for the speech to be recognized.
PCT/CN2016/071164 2015-01-26 2016-01-18 Voice information search method and apparatus, and server WO2016119604A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510037176.9 2015-01-26
CN201510037176.9A CN105893389A (en) 2015-01-26 2015-01-26 Voice message search method, device and server

Publications (1)

Publication Number Publication Date
WO2016119604A1 true WO2016119604A1 (en) 2016-08-04



Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339457A (en) * 2018-12-18 2020-06-26 富士通株式会社 Method and apparatus for extracting information from web page and storage medium
CN111862967A (en) * 2020-04-07 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112185363A (en) * 2020-10-21 2021-01-05 北京猿力未来科技有限公司 Audio processing method and device
CN111862967B (en) * 2020-04-07 2024-05-24 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium

Families Citing this family (6)

Publication number Priority date Publication date Assignee Title
CN106683664A (en) * 2016-11-22 2017-05-17 中南大学 Voice starting method and system for wireless charging
CN106919662B (en) * 2017-02-14 2021-08-31 复旦大学 Music identification method and system
CN107229691B (en) * 2017-05-19 2021-11-02 上海掌门科技有限公司 Method and equipment for providing social contact object
CN109102824B (en) * 2018-07-06 2021-04-09 北京比特智学科技有限公司 Voice error correction method and device based on man-machine interaction
CN109978066B (en) * 2019-04-01 2020-10-30 苏州大学 Rapid spectral clustering method based on multi-scale data structure
CN112382276A (en) * 2020-10-20 2021-02-19 国网山东省电力公司物资公司 Power grid material information acquisition method and device based on voice semantic recognition

Citations (5)

Publication number Priority date Publication date Assignee Title
EP0838803A2 (en) * 1996-10-28 1998-04-29 Nec Corporation Distance calculation for use in a speech recognition apparatus
CN1455388A (en) * 2002-09-30 2003-11-12 中国科学院声学研究所 Voice identifying system and compression method of characteristic vector set for voice identifying system
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model
CN101154379A (en) * 2006-09-27 2008-04-02 夏普株式会社 Method and device for locating keywords in voice and voice recognition system
CN103310790A (en) * 2012-03-08 2013-09-18 富泰华工业(深圳)有限公司 Electronic device and voice identification method



Also Published As

Publication number Publication date
CN105893389A (en) 2016-08-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16742666; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 16742666; Country of ref document: EP; Kind code of ref document: A1)