WO2019237518A1 - Model library establishment method, speech recognition method, apparatus, device and medium - Google Patents

Model library establishment method, speech recognition method, apparatus, device and medium

Info

Publication number
WO2019237518A1
WO2019237518A1 (PCT/CN2018/104040; CN2018104040W)
Authority
WO
WIPO (PCT)
Prior art keywords
model
current
training
data
original
Prior art date
Application number
PCT/CN2018/104040
Other languages
English (en)
French (fr)
Inventor
涂宏
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2019237518A1



Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • the present application relates to the field of voiceprint recognition, and in particular to a method for establishing a model library, a speech recognition method, an apparatus, a device, and a medium.
  • voiceprint recognition is increasingly favored by system developers and users.
  • the world market share of voiceprint recognition is second only to that of fingerprint and palmprint biometric recognition, and continues to rise.
  • the advantages of voiceprint recognition are: (1) voice acquisition for voiceprint features is convenient and natural, and voiceprint extraction can be done unobtrusively, so user acceptance is high; (2) voice acquisition costs are low and use is simple: a single microphone suffices, and no additional recording equipment is needed when using communication equipment; (3) it is suitable for remote identity verification, since only a microphone, telephone, or mobile phone is needed to achieve remote login over a network (communication network or Internet); (4) the algorithm complexity of voiceprint recognition and confirmation is low; (5) combined with other measures, such as content identification through speech recognition, the accuracy can be improved.
  • voiceprint recognition generally compares the voice to be tested in turn with the speaker voices already in the database, and then confirms the target speaker. When the number of speakers in the database is large, comparing each in turn to find the target speaker greatly reduces recognition efficiency.
  • the above method, apparatus, device, and medium for establishing a model library establish a current hierarchical model based on the training speech features extracted from the original speech data, store the current hierarchical model in the hierarchical model library, and then divide the original speech data into at least two current training subsets according to the model hierarchical logic; once the number of samples in a current training subset is not greater than a preset threshold, that current training subset is determined to be a recognition data set, completing the establishment of the model library and yielding a hierarchical model library that facilitates rapid identification against the recognition data sets.
  • a model library building method includes:
  • obtaining a training sample set, where the training sample set includes at least two original speech data; if the number of samples of the original speech data in the training sample set is greater than a preset threshold, establishing a current hierarchical model based on the training speech features extracted from the original speech data, storing the current hierarchical model in the hierarchical model library, determining the model hierarchical logic in the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchical logic;
  • if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to be the training sample set; if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
  • a model library building device includes:
  • a training sample set acquisition module for acquiring a training sample set, where the training sample set includes at least two original speech data
  • a hierarchical model storage module, configured to establish a current hierarchical model based on the training speech features extracted from the original speech data if the number of samples of the original speech data in the training sample set is greater than a preset threshold, store the current hierarchical model in the hierarchical model library, determine the model hierarchical logic in the hierarchical model library, and divide the original speech data into at least two current training subsets according to the model hierarchical logic;
  • An update training sample set module for updating the current training subset to the training sample set if the number of samples in the current training subset is greater than a preset threshold
  • a recognition data set module is configured to determine the current training subset as a recognition data set if the number of samples in the current training subset is not greater than a preset threshold, and store the recognition data set in a hierarchical model library.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and is characterized in that when the processor executes the computer-readable instructions, the following steps are implemented:
  • obtaining a training sample set, where the training sample set includes at least two original speech data; if the number of samples of the original speech data in the training sample set is greater than a preset threshold, establishing a current hierarchical model based on the training speech features extracted from the original speech data, storing the current hierarchical model in the hierarchical model library, determining the model hierarchical logic in the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchical logic;
  • if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to be the training sample set; if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
  • One or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • obtaining a training sample set, where the training sample set includes at least two original speech data; if the number of samples of the original speech data in the training sample set is greater than a preset threshold, establishing a current hierarchical model based on the training speech features extracted from the original speech data, storing the current hierarchical model in the hierarchical model library, determining the model hierarchical logic in the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchical logic;
  • if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to be the training sample set; if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
  • a computer-readable storage medium stores computer-readable instructions. When the computer-readable instructions are executed by a processor, the steps of the method for establishing a model library are implemented.
  • the above method, apparatus, device, and medium for establishing a model library establish the current hierarchical models step by step by checking the number of samples of the original speech data in the training sample set, until each current training subset of the training sample set is determined as a recognition data set and stored in the hierarchical model library, completing the establishment of the hierarchical model library.
  • the hierarchical model library stores all the original voice data in different recognition data sets, avoiding the need to compare the voice data to be tested against every original voice data in turn during subsequent recognition; the voice data to be tested can be quickly matched to a recognition data set according to the model hierarchical logic of the hierarchical model library, improving the efficiency of speech recognition.
  • a speech recognition method includes:
  • obtaining voice data to be tested, and extracting a voice feature to be tested corresponding to the voice data to be tested; processing the voice feature to be tested based on the model hierarchical logic and the current hierarchical models in the hierarchical model library to determine a target node; using the recognition data set corresponding to the target node as the target data set, where each original voice data in the target data set carries a speaker identification;
  • obtaining the spatial distance between the voice feature to be tested and each original voice feature in the target data set, and determining the target speaker identification corresponding to the voice data to be tested.
  • a voice recognition device includes:
  • a test voice acquisition module, configured to acquire the voice data to be tested and extract the voice feature to be tested corresponding to the voice data to be tested;
  • a target model determination module, configured to process the voice feature to be tested based on the model hierarchical logic in the hierarchical model library and the current hierarchical models to determine the target node;
  • the corresponding identification data set module is configured to use the identification data set corresponding to the target node as the target data set, and each original voice data in the target data set carries a speaker identifier;
  • a speaker identification module is configured to obtain a spatial distance between the voice feature to be tested and each original voice feature in the target data set, and determine a target speaker identifier corresponding to the voice data to be tested.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and is characterized in that when the processor executes the computer-readable instructions, the following steps are implemented:
  • obtaining voice data to be tested, and extracting a voice feature to be tested corresponding to the voice data to be tested; processing the voice feature to be tested based on the model hierarchical logic and the current hierarchical models in the hierarchical model library to determine a target node; using the recognition data set corresponding to the target node as the target data set, where each original voice data in the target data set carries a speaker identification;
  • obtaining the spatial distance between the voice feature to be tested and each original voice feature in the target data set, and determining the target speaker identification corresponding to the voice data to be tested.
  • One or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • obtaining voice data to be tested, and extracting a voice feature to be tested corresponding to the voice data to be tested; processing the voice feature to be tested based on the model hierarchical logic and the current hierarchical models in the hierarchical model library to determine a target node; using the recognition data set corresponding to the target node as the target data set, where each original voice data in the target data set carries a speaker identification;
  • obtaining the spatial distance between the voice feature to be tested and each original voice feature in the target data set, and determining the target speaker identification corresponding to the voice data to be tested.
  • FIG. 1 is a schematic diagram of an application environment of a method for establishing a model library according to an embodiment of the present application
  • FIG. 2 is a flowchart of a method for establishing a model library according to an embodiment of the present application
  • FIG. 3 is another specific flowchart of a method for establishing a model library according to an embodiment of the present application
  • FIG. 4 is another specific flowchart of a method for establishing a model library according to an embodiment of the present application.
  • FIG. 5 is another specific flowchart of a method for establishing a model library according to an embodiment of the present application.
  • FIG. 6 is another specific flowchart of a method for establishing a model library according to an embodiment of the present application.
  • FIG. 7 is a flowchart of a speech recognition method according to an embodiment of the present application.
  • FIG. 8 is a schematic block diagram of a model library establishing device in an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of a voice recognition device in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the method for establishing a model library provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1, where a computer device communicates with a recognition server through a network.
  • computer equipment includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the recognition server can be implemented as an independent server or as a server cluster composed of multiple servers.
  • the computer equipment can collect the original speech data and send the original speech data to the recognition server through the network, so that the recognition server can use the obtained original speech data for modeling.
  • a method for establishing a model library is provided.
  • the method is described as applied to the recognition server in FIG. 1 as an example, and includes the following steps:
  • the training sample set is a set of multiple original speech data.
  • the original voice data is the voice data entered by the speaker through a computer device.
  • the computer device can send the collected original voice data to the recognition server, and the recognition server stores the received original voice data in a database for subsequent recognition calls.
  • the number of samples is the number of original speech data in the training sample set.
  • the preset threshold is a preset value used to decide whether the training sample set needs to be further divided; that is, the preset threshold is the minimum number at which the original voice data will continue to be divided. For example, if the preset threshold is 100 and the number of samples of the original speech data is 99, the original speech data is not divided; if the number of samples of the original speech data is 100, the original speech data is further divided into at least two current training subsets.
  • the training speech feature is a speech feature obtained after feature extraction of the original speech data; in this embodiment, Mel-Frequency Cepstral Coefficients (hereinafter, MFCC features) can be used as the training speech features.
  • tests have found that the human ear acts like a filter bank and attends only to certain specific frequency components (human hearing is non-linear in frequency), meaning the human ear receives only a limited set of frequency components.
  • these filters are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters are fewer and sparsely distributed.
  • the Mel-scale filter bank has high resolution in the low-frequency part, consistent with the hearing characteristics of the human ear; therefore Mel-frequency cepstral coefficients are used as the training speech features, which can accurately reflect the speaker's speech characteristics.
  • the current hierarchical model is obtained by projecting the training speech features corresponding to multiple original speech data belonging to the same set onto a low-dimensional total variability subspace, yielding a fixed-length vector representation; it is the speech model representing the multiple original speech data belonging to that set.
  • the hierarchical model library is a database including multiple hierarchical models and storing each hierarchical model according to the model hierarchical logic.
  • the model hierarchical logic specifically means: each hierarchical model is saved to the node at the corresponding position in a hierarchical model library built as a tree structure.
  • the tree structure includes a root node and multiple child nodes.
  • the root node has no predecessor node, and each child node has exactly one predecessor node; if the root node or a child node has successor nodes, it has at least two of them.
  • the child nodes include leaf nodes at the extreme ends and intermediate nodes located between the root node and the leaf nodes.
  • a child node without successor nodes is a leaf node, and a child node with successor nodes is an intermediate node.
  • the current training subset is a subset of the original speech data in the training sample set divided equally.
  • in step S20, the recognition server establishes the current hierarchical model for the original speech data and stores it in the hierarchical model library, which facilitates speech clustering decisions based on the current hierarchical model during speech recognition.
  • the current training subset is a data set formed after the original voice data is evenly divided by the number in step S20.
  • the training sample set includes at least two original speech data, which is the speech data entered by the speaker through the computer device. It should be noted that in step S30, if the number of samples in the current training subset is greater than the preset threshold, the current training subset is updated to be the training sample set, so that steps S20 and the steps following S20 can be repeated on it, in order to determine whether the current training subset needs further division.
  • the identification data set is a current training subset in which the number of samples of the original speech data does not exceed the preset threshold. That is, the number of samples of the original speech data in the current training subset is compared with the preset threshold, and a current training subset whose sample count is not greater than the preset threshold is determined to be a recognition data set, which requires no further division. Understandably, so that the recognition server can locate specific original voice data during voice recognition, the recognition data set is stored in the hierarchical model library in step S40.
  • this method for establishing a model library establishes a current hierarchical model based on the training voice features extracted from the original voice data, stores the current hierarchical model in the hierarchical model library, and then divides the original voice data into at least two current training subsets according to the model hierarchical logic, repeating until the number of samples in every current training subset is not greater than the preset threshold; each such subset is then determined to be a recognition data set, completing the establishment of the model library. The resulting hierarchical model library facilitates rapid location of the relevant recognition data set, which helps improve the efficiency of subsequent speech recognition based on this library.
  • the method for establishing a model library further includes:
  • the tree structure includes a root node and at least two child nodes associated with the root node.
  • the tree structure refers to a data structure in which there is a "one-to-many" tree-shaped relationship between data elements, and is an important type of non-linear data structure.
  • the root node does not have a predecessor node, and each other node has only one predecessor node.
  • Child nodes include the extreme leaf nodes and intermediate nodes between the root and leaf nodes.
  • a leaf node has no subsequent nodes, and the number of subsequent nodes of each of the remaining nodes (including the root node and the intermediate node) may be one or multiple.
  • a hierarchical model database is established by using a tree structure, and multiple hierarchical models can be associated according to the relationship between the root node and the child nodes, which is beneficial to quickly find the original voice data corresponding to the speaker based on the determined association relationship.
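  • As a minimal, hypothetical sketch only (none of these names come from the patent; train_model and split_by_model are simple stand-ins for the hierarchical-model training and the model-guided division of step S20), the tree-structured library and the threshold-driven partitioning of steps S10 to S40 might look like this in Python:

    import numpy as np

    PRESET_THRESHOLD = 100  # example value used in the text

    class Node:
        """A node of the hierarchical model library: root, intermediate, or leaf."""
        def __init__(self):
            self.model = None                 # hierarchical model stored at this node
            self.children = []                # successor nodes
            self.recognition_data_set = None  # original speech features (leaf nodes only)

    def train_model(samples):
        # Stand-in: represent a set by the mean of its feature vectors; the patent
        # instead trains an i-vector-style hierarchical model for the set.
        return np.mean(samples, axis=0)

    def split_by_model(samples):
        # Stand-in two-way split by distance to two seed vectors (one k-means step);
        # the patent divides according to the model hierarchical logic.
        a, b = samples[0], samples[-1]
        left = [s for s in samples if np.linalg.norm(s - a) <= np.linalg.norm(s - b)]
        right = [s for s in samples if np.linalg.norm(s - a) > np.linalg.norm(s - b)]
        if not left or not right:
            half = len(samples) // 2
            left, right = samples[:half], samples[half:]
        return [left, right]

    def build_model_library(samples, node):
        """Divide `samples` until every current training subset fits the threshold."""
        if len(samples) <= PRESET_THRESHOLD:          # step S40: recognition data set
            node.recognition_data_set = samples
            return
        for subset in split_by_model(samples):        # step S20: at least two subsets
            child = Node()
            child.model = train_model(subset)         # model representing this subset
            node.children.append(child)
            build_model_library(subset, child)        # step S30: recurse on the subset

    root = Node()
    build_model_library(list(np.random.default_rng(0).random((500, 12))), root)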
  • step S20 the current hierarchical model is stored in the hierarchical model library, and the model hierarchical logic in the hierarchical model library is determined, which specifically includes the following steps:
  • the current hierarchical model is obtained by projecting the training speech features corresponding to multiple original speech data belonging to the same set onto a low-dimensional total variability subspace, yielding a fixed-length vector representation; it is the speech model representing the multiple original speech data belonging to that set.
  • the model hierarchical logic refers to the logical relationship whereby each hierarchical model is saved, level by level, to the node of the corresponding level in the hierarchical model library established according to the tree structure.
  • the tree structure includes a level-zero root node and at least two successor nodes associated with it, also called first-level child nodes; if a first-level child node has successor nodes, it is associated with at least two second-level child nodes, and so on, down to the leaf nodes (which have no associated successor nodes) that complete the tree structure.
  • the child nodes include leaf nodes at the extreme ends and intermediate nodes located between the root node and the leaf nodes; a child node without successor nodes is a leaf node, and a child node with successor nodes is an intermediate node.
  • the level-zero root node has no predecessor node, and each child node at the remaining levels has exactly one predecessor node. If the voice data corresponding to the current node needs to be divided, it is divided into at least two voice subsets; so that each voice subset corresponds to one successor node, the current node (root node or intermediate node) has at least two successor nodes.
  • understandably, no hierarchical model is stored on the level-zero root node or on the leaf nodes of the tree structure; each remaining intermediate node stores the hierarchical model corresponding to its level.
  • the recognition server may store each newly created current hierarchical model on child nodes according to the model hierarchical logic. During speech recognition, this makes it possible to find the at least two level-N+1 child nodes under a level-N child node, match the speech data to be recognized held at the level-N node against the hierarchical model of each level-N+1 child node, select the level-N+1 child node whose hierarchical model matches best, and pass the speech data to be recognized down to that node.
  • the leaf nodes store the original speech data corresponding to the speakers; matching the speech data to be recognized against each original speech data on the leaf node yields the speaker corresponding to the best-matching original speech data, which is the result of speech recognition.
  • the recognition server uses a tree structure to establish the hierarchical model library, which associates multiple hierarchical models through the relationships between predecessor nodes and child nodes and helps quickly locate the original voice data corresponding to a speaker based on those relationships.
  • storing each newly created current hierarchical model on the child nodes according to the hierarchical logic makes it easy, during speech recognition, to move from the current node to the hierarchical models of its at least two child nodes.
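  • As an illustrative continuation of the sketch above (score is a hypothetical similarity function; higher means a better match), descending the tree during recognition can be outlined as:

    def find_target_node(root, feature, score):
        """Follow the best-matching child at each level until a leaf is reached."""
        node = root
        while node.children:
            node = max(node.children, key=lambda child: score(child.model, feature))
        return node  # the leaf whose recognition data set will then be searched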
  • step S20, that is, establishing the current hierarchical model according to the training speech features extracted from the original speech data, specifically includes the following steps:
  • the original voice data is the voice data entered by the speaker through a computer device; the computer device can send the collected original voice data to the recognition server, which stores it in a database for subsequent recognition calls.
  • the training speech feature is a speech feature obtained after feature extraction of the original speech data; in this embodiment, Mel-Frequency Cepstral Coefficients (MFCC features) can be used as the training speech features.
  • tests have found that the human ear acts like a filter bank and attends only to certain specific frequency components (human hearing is non-linear in frequency), meaning the human ear receives only a limited set of frequency components.
  • these filters are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters are fewer and sparsely distributed.
  • the Mel-scale filter bank has high resolution in the low-frequency part, consistent with the hearing characteristics of the human ear; therefore Mel-frequency cepstral coefficients are used as the training speech features, which can accurately reflect the speaker's speech characteristics.
  • in step S21, the recognition server obtains the training voice features of the original voice data, namely the MFCC features, for model training; these can effectively reflect the voice characteristics of the data set in which the original voice data resides.
  • a simplified model algorithm is used to simplify the training voice features to obtain simplified voice features.
  • the simplified-model algorithm refers to a Gaussian blur processing algorithm, which is used to reduce the noise and the level of detail of a voice file.
  • the simplified speech features are the purer speech features obtained after the simplified-model algorithm removes the sound noise.
  • in step S22, the simplified-model algorithm is used to simplify the training voice features: the two-dimensional normal distribution of the training voice features is obtained first, and then all phonemes of that distribution are blurred, yielding purer simplified voice features.
  • these largely preserve the characteristics of the training speech features and help improve the efficiency of the subsequent training of the current hierarchical model.
  • the Expectation-Maximization algorithm (hereinafter, the EM algorithm) is an iterative algorithm used in statistics to find maximum likelihood estimates of parameters in probability models that depend on unobservable latent variables.
  • the total variability subspace (Total Variability Space, hereinafter T space) is a global change mapping matrix, set up directly to contain all possible speaker information in the voice data.
  • the speaker space and channel space are not separated in the T space.
  • T space can map high-dimensional sufficient statistics (supervectors) to low-dimensional i-vectors (identity vectors) that serve as speaker representations, thereby reducing dimensionality.
  • the training process of the T space includes: given a preset UBM (universal background model), using factor analysis and the EM (expectation-maximization) algorithm to iterate to convergence and obtain the T space.
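  • For orientation only (this relation is standard in i-vector modeling and is not quoted from the patent): the supervector M of an utterance is modeled as M = m + T·w, where m is the UBM mean supervector, T is the total variability matrix, and w is the low-dimensional identity vector. A hedged numpy sketch of recovering such a w, using least squares as a stand-in for the full statistics-based i-vector posterior:

    import numpy as np

    rng = np.random.default_rng(0)
    D, R = 1024, 64                     # supervector dimension and T-space rank (illustrative)
    T = rng.standard_normal((D, R))     # total variability matrix (stand-in)
    m = rng.standard_normal(D)          # UBM mean supervector (stand-in)
    M = m + T @ rng.standard_normal(R)  # an observed supervector

    w, *_ = np.linalg.lstsq(T, M - m, rcond=None)
    print(w.shape)                      # (64,): the low-dimensional representation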
  • specifically, the EM algorithm is used to iterate over the simplified speech features to estimate the T space.
  • the process of obtaining the T space is as follows:
  • given a sample set x = (x(1), x(2), ..., x(m)) of m independent samples, where the category z(i) corresponding to each sample x(i) is unknown, the joint probability distribution p(x, z; θ) must be considered, and suitable θ and z must be found to maximize the log-likelihood L(θ), with a maximum number of iterations J:
  • a) E-step iteration: compute the conditional probability expectation of the joint distribution, i.e., calculate the posterior probability Q_i(z(i)) of the latent variable from the initial value of the parameter θ or from the parameter value obtained in the previous iteration;
  • b) M-step iteration: maximize the resulting expected log-likelihood with respect to θ to obtain the updated parameter θ(j+1);
  • c) if θ(j+1) of the M-step iteration has converged, the algorithm ends; otherwise, return to step a) and perform the E-step iteration again.
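  • A minimal runnable EM example (a two-component one-dimensional Gaussian mixture, far smaller than the T-space estimation the patent describes, but following the same E-step/M-step loop):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

    mu = np.array([-1.0, 1.0])           # initial parameter values
    var = np.array([1.0, 1.0])
    weights = np.array([0.5, 0.5])
    for _ in range(50):                  # maximum number of iterations J
        # a) E-step: posterior Q_i(z) of the latent component for every sample
        dens = weights * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        q = dens / dens.sum(axis=1, keepdims=True)
        # b) M-step: re-estimate the parameters from the expected log-likelihood
        nk = q.sum(axis=0)
        mu = (q * x[:, None]).sum(axis=0) / nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        weights = nk / len(x)
        # c) in practice, also test the parameter change for convergence here

    print(mu)  # approaches the true component means (-2, 3)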
  • the total variability subspace obtained in step S23 does not distinguish between the speaker space and the channel space; it merges the information of both into a single space, which reduces computational complexity and facilitates obtaining the simplified current universal speech vector based on this subspace.
  • the simplified speech feature is the speech feature obtained after processing by the simplified-model algorithm in step S22.
  • the current universal speech vector is the fixed-length vector representation obtained by projecting the simplified speech features onto the low-dimensional total variability subspace; it represents the current hierarchical model formed by the multiple original speech data belonging to the same set.
  • the recognition server uses the simplified-model algorithm to simplify the training voice features; after obtaining the simplified voice features and projecting them into the total variability subspace, a purer and simpler current hierarchical model is obtained, so that subsequent speech clustering of the speaker's speech data based on the current hierarchical model is less complex and faster.
  • step S22 a simplified model algorithm is used to simplify the training voice features to obtain the simplified voice features, which specifically include the following steps:
  • the Gaussian filter can perform linear smooth filtering on the input training speech features, is suitable for eliminating Gaussian noise, and is widely used in the noise reduction process.
  • the Gaussian filter processes the training speech features by taking a weighted average: taking the phonemes in the training speech features as an example, the value of each phoneme is obtained by a weighted average of its own value and the values of the other phonemes in its neighborhood.
  • the two-dimensional normal distribution (also known as the two-dimensional Gaussian distribution) has a density function with the following characteristics: it is symmetric about μ and reaches its maximum at μ, its value tends to 0 at positive (negative) infinity, and it has inflection points at μ ± σ; the shape of the distribution is high in the middle and low on both sides, a bell curve above the x-axis.
  • the Gaussian filter processes the training voice features by scanning each phoneme in the training voice data with a 3x3 mask and replacing the value at the mask center with the weighted average of the phonemes in the neighborhood determined by the mask; the phoneme values form the two-dimensional normal distribution of the training speech data.
  • the weighted average of each phoneme is calculated by multiplying each value in the 3x3 neighborhood by the corresponding Gaussian kernel coefficient and summing, with the coefficients normalized to sum to 1.
  • through step S221, the noise in the training voice features can be removed and the output linearly smoothed, yielding purer features for further processing.
  • the simplified-model algorithm may use a Gaussian blur algorithm to simplify the two-dimensional normal distribution.
  • specifically, each phoneme takes the average value of the surrounding phonemes: the "middle point" takes the average of the "surrounding points". Numerically this is a smoothing; graphically it produces a blur effect in which the middle point loses detail. Clearly, the larger the range over which the average is taken, the stronger the blur effect.
  • through the simplified-model algorithm, the recognition server can obtain the simplified speech features of the two-dimensional normal distribution corresponding to the training speech features, further reducing the speech detail of the training speech features and simplifying them.
  • in this way, the recognition server denoises and then reduces the detail of the training voice features in sequence, obtaining pure, simplified voice features, which helps improve the recognition efficiency of voice clustering.
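  • A hedged sketch of the 3x3 Gaussian-mask smoothing described above, applied to a 2-D array standing in for the "two-dimensional normal distribution" of training speech features (the patent does not give the kernel weights; the 1-2-1 mask below is a common choice):

    import numpy as np

    def gaussian_blur_3x3(features):
        kernel = np.array([[1, 2, 1],
                           [2, 4, 2],
                           [1, 2, 1]], dtype=float) / 16.0  # weights sum to 1
        padded = np.pad(features, 1, mode="edge")
        out = np.zeros_like(features, dtype=float)
        rows, cols = features.shape
        for i in range(rows):
            for j in range(cols):
                # replace each value by the weighted average of its 3x3 neighborhood
                out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
        return out

    blurred = gaussian_blur_3x3(np.random.default_rng(0).random((5, 5)))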
  • the model library building method also includes:
  • the identification data set is a data set in which the number of original voice data does not exceed the preset threshold; such a data set is defined as an identification data set and is not divided further.
  • the hierarchical model library is a database including multiple hierarchical models.
  • the original voice data is the voice data entered by the speaker through the computer equipment.
  • the speaker identification is the identifier corresponding to the original voice data, indicating the unique identity of the speaker.
  • a user ID, a mobile phone number, or an identification number may be used as the speaker identification.
  • the original speech feature is a speech feature that distinguishes the speaker from others; specifically, it is the speech feature obtained after feature extraction of the original speech data. In this embodiment, Mel-Frequency Cepstral Coefficients (MFCC features) are used as the original speech features.
  • S421 Preprocess the original voice data to obtain preprocessed voice data.
  • step S421 the original voice data is pre-processed to obtain pre-processed voice data, which specifically includes the following steps:
  • Pre-emphasis processing is performed on the original voice data.
  • pre-emphasis is a signal processing method that compensates the high-frequency component of the input signal at the transmitting end.
  • the idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the transmitting end of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission, so that the receiving end can obtain a better signal waveform.
  • Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
  • the pre-emphasis formula is s'(n) = s(n) - a·s(n-1), where the value of a ranges over 0.9 < a < 1.0.
  • in this embodiment, a pre-emphasis coefficient a of 0.97 gives better results.
  • this pre-emphasis processing can eliminate interference caused by the vocal cords and lips during vocalization, effectively compensate the suppressed high-frequency part of the original speech data, highlight the high-frequency formants of the original speech data, and enhance its signal amplitude, which helps extract the training speech features.
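  • A minimal sketch of this pre-emphasis step, s'(n) = s(n) - a·s(n-1), with the coefficient a = 0.97 mentioned above (the sampling rate is our assumption):

    import numpy as np

    def pre_emphasis(signal, a=0.97):
        # s'(n) = s(n) - a * s(n-1); the first sample is passed through unchanged
        return np.append(signal[0], signal[1:] - a * signal[:-1])

    emphasized = pre_emphasis(np.random.default_rng(0).standard_normal(16000))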
  • S4212 Perform frame processing on the pre-emphasized original voice data.
  • Framing refers to the speech processing technology that cuts the entire voice signal into several segments.
  • the size of each frame is in the range of 10-30ms, and the frame shift is about 1/2 frame length.
  • frame shift refers to the overlap between two adjacent frames, which avoids excessive variation between the two adjacent frames.
  • framing divides the original speech data into several segments of speech data; subdividing the original speech data in this way facilitates the extraction of the training speech features.
  • S4213 Perform windowing on the original framed voice data to obtain preprocessed voice data.
  • the windowing process specifically refers to processing each frame of the original speech data with a window function; the window function may be a Hamming window.
  • the windowing formula is s'(n) = s(n)·W(n), with the Hamming window W(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, where N is the window length, n is the time index, s(n) is the signal amplitude in the time domain, and s'(n) is the windowed signal amplitude in the time domain. Windowing the framed original speech data to obtain the pre-processed speech data makes the framed time-domain signal more continuous, which helps extract the training speech features of the original speech data.
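  • A hedged sketch of framing and Hamming windowing (the 25 ms frame and 10 ms shift are illustrative values within the 10-30 ms range given above; the 16 kHz sampling rate is also our assumption):

    import numpy as np

    def frame_and_window(signal, sr=16000, frame_ms=25, shift_ms=10):
        frame_len, shift = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
        n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
        window = np.hamming(frame_len)  # 0.54 - 0.46*cos(2*pi*n/(N-1))
        frames = np.stack([signal[i * shift : i * shift + frame_len]
                           for i in range(n_frames)])
        return frames * window

    frames = frame_and_window(np.random.default_rng(0).standard_normal(16000))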
  • the pre-processing operations on the original speech data in steps S4211 to S4213 provide the basis for extracting the original speech features of the original speech data, and make the extracted original speech features more representative of the original speech data.
  • S422 Perform a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the original speech data, and obtain the power spectrum of the original speech data according to the frequency spectrum.
  • a fast Fourier transform (FFT) is performed on the pre-processed voice data to convert it from signal amplitudes in the time domain to signal amplitudes in the frequency domain (the spectrum).
  • the formula for calculating the spectrum is s(k) = Σ_{n=0}^{N-1} s(n)·e^(-2πikn/N), 0 ≤ k ≤ N-1, where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time index, k is the frequency index, and i is the imaginary unit.
  • the power spectrum of the pre-processed voice data, hereinafter referred to as the power spectrum of the original voice data, can be obtained directly from the spectrum.
  • the formula for calculating the power spectrum of the original speech data is P(k) = |s(k)|²/N, where N is the frame size and s(k) is the signal amplitude in the frequency domain.
  • S423 Use the Mel scale filter bank to process the power spectrum of the original speech data, and obtain the Mel power spectrum of the original speech data.
  • processing the power spectrum of the original speech data with the Mel-scale filter bank amounts to a Mel-frequency analysis of the power spectrum, and Mel-frequency analysis is an analysis based on human auditory perception.
  • tests have found that the human ear acts like a filter bank and attends only to certain specific frequency components (human hearing is non-linear in frequency), meaning the human ear receives only a limited set of sound frequencies.
  • these filters are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters are fewer and sparsely distributed. Understandably, the Mel-scale filter bank has high resolution in the low-frequency part, consistent with the hearing characteristics of the human ear; this is also the physical meaning of the Mel scale.
  • a Mel-scale filter bank is used to process the power spectrum of the original speech data to obtain the Mel power spectrum of the original voice data: the frequency-domain signal is segmented by the Mel-scale filter bank so that each frequency segment corresponds to one energy value. If the number of filters is 22, 22 energy values corresponding to the Mel power spectrum of the original speech data are obtained.
  • the Mel power spectrum obtained after this analysis retains the frequency portion closely related to the characteristics of the human ear, and this portion reflects the characteristics of the original speech data well.
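  • A short sketch of this step using librosa's Mel filter bank (the library choice is ours, not the patent's), with the 22 filters of the example above:

    import librosa
    import numpy as np

    sr, n_fft = 16000, 512
    signal = np.random.default_rng(0).standard_normal(n_fft)
    power_spectrum = np.abs(np.fft.rfft(signal, n_fft)) ** 2 / n_fft

    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=22)  # shape (22, 257)
    mel_power = mel_fb @ power_spectrum                          # 22 energy values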
  • S424 Perform cepstrum analysis on the Mel power spectrum to obtain the MFCC characteristics of the original speech data.
  • cepstrum refers to the inverse Fourier transform of the logarithm of the Fourier spectrum of a signal; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
  • cepstrum analysis is performed on the Mel power spectrum, and from the cepstrum result the MFCC features of the original speech data are obtained.
  • by performing cepstrum analysis on the Mel power spectrum, the features contained in it, which are otherwise high-dimensional and difficult to use directly, can be converted into easy-to-use features (MFCC feature vectors used for training or recognition).
  • that is, the MFCC features can be used as the original speech features, as coefficients that distinguish different voices.
  • the original speech features can reflect the differences between voices and can be used to identify and distinguish the original speech data.
  • step S424 cepstrum analysis is performed on the Mel power spectrum to obtain the MFCC characteristics of the original speech data, including the following steps:
  • S4241 Take the log value of the Mel power spectrum, and obtain the Mel power spectrum to be transformed.
  • a log value log of the Mel power spectrum is taken to obtain a Mel power spectrum m to be transformed.
  • S4242 Perform discrete cosine transform on the Mel power spectrum to be transformed to obtain the MFCC feature of the original speech data.
  • a discrete cosine transform is performed on the Mel power spectrum m to be transformed to obtain corresponding MFCC features of the original speech data.
  • the second to thirteenth coefficients are taken as the original speech features.
  • Speech features can reflect the differences between speech data.
  • the formula for the discrete cosine transform of the Mel power spectrum m to be transformed is C(i) = Σ_{j=1}^{N} m(j)·cos(π·i·(j - 0.5)/N), where N is the number of Mel filter outputs (the frame length of m), m(j) is the j-th component of the Mel power spectrum to be transformed, and j is its independent variable. Because the Mel filters overlap, the energy values obtained with the Mel-scale filters are correlated; the discrete cosine transform reduces the dimensionality of and abstracts the Mel power spectrum m, and, compared with the Fourier transform, its result has no imaginary part, which is a clear computational advantage.
  • steps S421 to S424 perform feature extraction on the original speech data.
  • the original speech features obtained represent the original speech data well, and the current hierarchical model trained on these features makes speech recognition based on it more accurate.
  • step S42 uses the Mel-frequency cepstral coefficients as the original speech features, which accurately reflect the speech characteristics of the original speech data, so that the current hierarchical model trained with these original speech features achieves higher recognition accuracy.
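  • Pulling steps S421 to S424 together, a hedged end-to-end sketch (parameter choices beyond the 0.97 coefficient, the 10-30 ms frame size, the 22-filter example, and the use of coefficients 2 to 13 are our assumptions):

    import numpy as np
    import librosa
    from scipy.fft import dct

    def mfcc_features(signal, sr=16000, n_fft=400, shift=160, n_mels=22):
        signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # S4211 pre-emphasis
        n_frames = 1 + max(0, (len(signal) - n_fft) // shift)           # S4212 framing
        frames = np.stack([signal[i * shift : i * shift + n_fft] for i in range(n_frames)])
        frames = frames * np.hamming(n_fft)                             # S4213 windowing
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft         # S422 power spectrum
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        mel_power = power @ mel_fb.T                                    # S423 Mel power spectrum
        log_mel = np.log(mel_power + 1e-10)                             # S4241 log value
        coeffs = dct(log_mel, type=2, norm="ortho", axis=1)             # S4242 DCT
        return coeffs[:, 1:13]                                          # coefficients 2 to 13

    feats = mfcc_features(np.random.default_rng(0).standard_normal(16000))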
  • the original speech feature is the original speech feature corresponding to the original speech data obtained in step S42
  • the speaker identification is the speaker identification corresponding to the original speech data obtained in step S41.
  • in step S43, the original speech features and the speaker identification are stored in association in the identification data set, which helps subsequently to quickly obtain, from the identification data set, the original speech data corresponding to a speaker identification for speech recognition.
  • the recognition server uses the Mel-frequency cepstral coefficients as the training voice features, which can accurately reflect the voice characteristics of the original voice data; storing the original voice features and the speaker identification in association in the recognition data set helps subsequently to quickly obtain, based on the recognition data set, the original speech data corresponding to a speaker identification and perform speech recognition on it.
  • this method for establishing a model library establishes the current hierarchical models step by step by checking the number of samples of the original speech data in the training sample set, until each current training subset of the training sample set is determined as a recognition data set and stored in the hierarchical model library, completing the establishment of the hierarchical model library.
  • the hierarchical model library stores all the original voice data in different recognition data sets, avoiding the need to compare the voice data to be tested against every original voice data in turn during subsequent recognition; the voice data to be tested can be quickly matched to a recognition data set according to the model hierarchical logic of the hierarchical model library, improving the efficiency of speech recognition.
  • the recognition server uses a tree structure to establish the hierarchical model library, which associates multiple hierarchical models through the relationships between the root node and the child nodes and helps subsequently to quickly find the original voice data corresponding to a speaker based on those relationships; the current hierarchical model established each time can be stored on the root node or a child node according to the hierarchical logic, which helps, during speech recognition, to move from the current hierarchical model on a root or child node to the next hierarchical models on its at least two successor nodes.
  • a speech recognition method is provided.
  • the method is applied to the recognition server in FIG. 1 as an example, and includes the following steps:
  • the voice data to be tested refers to voice data that needs to be tested, and specifically refers to voice data used to confirm the speaker identification corresponding to the voice data in the hierarchical model library.
  • the voice feature to be tested is a corresponding MFCC feature obtained after feature extraction is performed on the voice data to be tested.
  • the process of feature extraction in step S50 is the same as the foregoing steps S421 to S424. To avoid repetition, details are not repeated here.
  • Step S50 uses the Mel frequency cepstrum coefficient as the voice feature to be measured, which can accurately reflect the voice feature corresponding to the speaker.
  • S60: Process the voice feature to be tested according to the model hierarchical logic in the hierarchical model library and the current hierarchical models to determine the target node.
  • the hierarchical model library is the database generated in steps S10 to S40; it includes multiple trained current hierarchical models, each stored in the database according to the model hierarchical logic.
  • the model hierarchical logic specifically means: each hierarchical model is saved to the node at the corresponding position in a hierarchical model library built as a tree structure.
  • the tree structure includes a root node and multiple child nodes.
  • the root node has no predecessor node, and each child node has exactly one predecessor node; if the root node or a child node has successor nodes, it has at least two of them.
  • the child nodes include leaf nodes at the extreme ends and intermediate nodes located between the root node and the leaf nodes.
  • a child node without successor nodes is a leaf node, and a child node with successor nodes is an intermediate node.
  • the target node is the node reached by processing the voice feature to be tested according to the model hierarchical logic in the hierarchical model library and the current hierarchical models.
  • a leaf node without successor nodes serves as the target node.
  • by processing the voice feature to be tested according to the model hierarchical logic in the hierarchical model library and the current hierarchical models, the recognition server speeds up recognition and avoids directly comparing the voice data to be tested with every original voice data in turn: the search descends until the target node is found, greatly reducing the number of voice feature comparisons required.
  • the identification data set corresponding to the target node is used as the target data set, and each original voice data in the target data set carries a speaker identification.
  • the target node is the leaf node found in the hierarchical model library in step S60.
  • the identification data set is a set that is associated with a leaf node of a tree structure formed by a hierarchical model library, and the number of original voice data in the data set does not exceed a preset threshold.
  • the hierarchical model library is a database including multiple current hierarchical models and identification data sets.
  • the original voice data is the original voice data obtained in step S41.
  • the speaker identification is, as in step S41, the identifier corresponding to the original voice data, indicating the unique identity of the speaker; a user ID, a mobile phone number, or an identification number may be used as the speaker identification.
  • in step S70, the recognition server finds the identification data set associated with the target node as the target data set, and further obtains all the original voice data and the corresponding speaker identifications in that identification data set according to step S41.
  • the spatial distance, i.e., the cosine similarity, can be used to measure the difference in spatial direction between two features.
  • the target speaker identification is the speaker identification corresponding to the speech feature to be measured in the hierarchical model library.
  • the spatial distance between the voice feature to be tested and each original voice feature in the target data set can be determined by the cosine similarity formula cos(θ) = (Σᵢ Aᵢ·Bᵢ)/(√(Σᵢ Aᵢ²)·√(Σᵢ Bᵢ²)), where Aᵢ and Bᵢ respectively denote the components of the voice feature to be tested and of the original voice feature.
  • the spatial distance, that is, the cosine value, ranges from -1 to 1: -1 indicates that the two speech features point in opposite directions in the space, 1 indicates that they point in the same direction, 0 indicates that they are independent, and values in between indicate degrees of similarity or dissimilarity. Understandably, the closer the value is to 1, the more similar the two speech features.
  • the recognition server takes, as the target speaker identification, the speaker identification corresponding to the maximum spatial distance between the voice feature to be tested and an original voice feature.
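  • A minimal sketch of this matching step (the pairing of speaker IDs with features is our illustrative data layout, not the patent's):

    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def target_speaker(test_feature, target_data_set):
        """target_data_set: list of (speaker_id, original_feature) pairs."""
        return max(target_data_set,
                   key=lambda pair: cosine_similarity(test_feature, pair[1]))[0]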
  • the speech recognition method provided in steps S50 to S80 processes the voice feature to be tested through the model hierarchical logic in the hierarchical model library and the current hierarchical models to determine the target node.
  • the target speaker identification corresponding to the voice data to be tested can then be determined by comparing the voice data to be tested only with the original speech data in the target data set, rather than with all the original voice data in turn, which improves the efficiency of speech recognition.
  • a model library building device is provided, and the model library building device corresponds to the model library building method in the above embodiment in a one-to-one correspondence.
  • the model library building device includes a training sample set acquisition module 10, a hierarchical model storage module 20, a training sample set update module 30, and a recognition data set determination module 40.
  • the detailed description of each functional module is as follows:
  • the training sample set acquisition module 10 is configured to obtain a training sample set, where the training sample set includes at least two original speech data.
  • the hierarchical model storage module 20 is configured to, if the number of samples of the original speech data in the training sample set is greater than a preset threshold, establish a current hierarchical model according to the training speech features extracted from the original speech data, store the current hierarchical model in the hierarchical model library, determine the model hierarchical logic in the hierarchical model library, and divide the original speech data into at least two current training subsets according to the model hierarchical logic.
  • the training sample set update module 30 is configured to update the current training subset to be the training sample set if the number of samples in the current training subset is greater than the preset threshold.
  • the recognition data set determination module 40 is configured to determine the current training subset as a recognition data set if the number of samples in the current training subset is not greater than the preset threshold, and store the recognition data set in the hierarchical model library.
  • the model library building device further includes a model library creating unit 11.
  • the model library creating unit 11 is configured to establish a hierarchical model library using a tree structure, where the tree structure includes a root node and at least two child nodes associated with the root node.
  • the hierarchical model storage module 20 includes a hierarchical logic determination unit 12.
  • the hierarchical logic determination unit 12 is configured to store the current hierarchical model on a child node of the tree structure, and determine the model hierarchical logic according to the storage location of the current hierarchical model in the tree structure.
  • the hierarchical model storage module 20 includes a training feature acquisition unit 21, a simplified feature acquisition unit 22, a subspace acquisition unit 23, and a hierarchical model acquisition unit 24.
  • the training feature acquisition unit 21 is configured to perform feature extraction on the original voice data to obtain the training voice features.
  • the simplified feature acquisition unit 22 is configured to simplify the training voice features using the simplified-model algorithm to obtain the simplified voice features.
  • the subspace acquisition unit 23 is configured to iterate over the simplified speech features using the expectation-maximization algorithm to obtain the total variability subspace.
  • the hierarchical model acquisition unit 24 is configured to project the simplified speech features onto the total variability subspace to obtain the current hierarchical model.
  • the simplified feature acquisition unit 22 includes a normal distribution acquisition subunit 221 and a speech feature acquisition subunit 222.
  • the normal distribution acquisition subunit 221 is configured to process the training speech features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution.
  • the speech feature acquisition subunit 222 is configured to simplify the two-dimensional normal distribution using the simplified-model algorithm to obtain the simplified speech features.
  • the model library building device further includes a speaker data acquisition unit 41, an original feature acquisition unit 42, and an original feature storage unit 43.
  • the speaker data acquisition unit 41 is configured to obtain the original voice data and the corresponding speaker identification in each identification data set.
  • the original feature acquisition unit 42 is configured to perform feature extraction on the original speech data to obtain the original speech features corresponding to the original speech data.
  • the original feature storage unit 43 is configured to store the original speech features and the speaker identification in association in the recognition data set.
  • a speech recognition device is provided, the speech recognition device corresponding one-to-one to the speech recognition method in the embodiment described above.
  • the speech recognition device includes a test speech obtaining module 50, a target model determination module 60, a corresponding recognition data set module 70, and a speaker identifier determination module 80.
  • the functional modules are described in detail as follows:
  • the test speech obtaining module 50 is configured to obtain speech data to be tested and extract the speech features to be tested corresponding to the speech data to be tested.
  • the target model determination module 60 is configured to process the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library to determine a target node.
  • the corresponding recognition data set module 70 is configured to use the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier.
  • the speaker identifier determination module 80 is configured to obtain the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determine the target speaker identifier corresponding to the speech data to be tested.
  • Each module in the above device may be implemented wholly or partly by software, by hardware, or by a combination thereof.
  • the above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
  • a computer device is provided.
  • the computer device may be a server, and the internal structure diagram may be as shown in FIG. 10.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium.
  • the database of the computer device is used to store data related to speech recognition.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by a processor to implement a model library building method.
  • a computer device is provided, which includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented: obtaining a training sample set, the training sample set including at least two original speech data; if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, storing the current hierarchical model in a hierarchical model library, determining the model hierarchy logic of the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic; if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to the training sample set; if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set and storing the recognition data set in the hierarchical model library.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: creating the hierarchical model library as a tree structure, the tree structure including one root node and at least two child nodes associated with the root node; storing the current hierarchical model in the hierarchical model library and determining the model hierarchy logic of the hierarchical model library includes: storing the current hierarchical model on a child node of the tree structure, and determining the model hierarchy logic from the storage location of the current hierarchical model in the tree structure.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: performing feature extraction on the original speech data to obtain training speech features; simplifying the training speech features with a simplified model algorithm to obtain simplified speech features; iterating the simplified speech features with the expectation-maximization algorithm to obtain the total variability subspace; and projecting the simplified speech features onto the total variability subspace to obtain the current hierarchical model.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: processing the training speech features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution; and simplifying the two-dimensional normal distribution with the simplified model algorithm to obtain the simplified speech features.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: obtaining the original speech data and the corresponding speaker identifier in each recognition data set; performing feature extraction on the original speech data to obtain the original speech features corresponding to the original speech data; and storing the original speech features and the speaker identifier in association in the recognition data set.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: obtaining speech data to be tested and extracting the speech features to be tested corresponding to the speech data to be tested; processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library to determine a target node; using the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier; and obtaining the spatial distance between the speech features to be tested and each original speech feature in the target data set to determine the target speaker identifier corresponding to the speech data to be tested.
  • one or more non-volatile readable storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps: obtaining a training sample set, the training sample set including at least two original speech data; if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, storing the current hierarchical model in a hierarchical model library, determining the model hierarchy logic of the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic; if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to the training sample set; if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set and storing the recognition data set in the hierarchical model library.
  • when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: creating the hierarchical model library as a tree structure, the tree structure including one root node and at least two child nodes associated with the root node; storing the current hierarchical model in the hierarchical model library and determining the model hierarchy logic of the hierarchical model library includes: storing the current hierarchical model on a child node of the tree structure, and determining the model hierarchy logic from the storage location of the current hierarchical model in the tree structure.
  • when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: performing feature extraction on the original speech data to obtain training speech features; simplifying the training speech features with a simplified model algorithm to obtain simplified speech features; iterating the simplified speech features with the expectation-maximization algorithm to obtain the total variability subspace; and projecting the simplified speech features onto the total variability subspace to obtain the current hierarchical model.
  • when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: processing the training speech features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution; and simplifying the two-dimensional normal distribution with the simplified model algorithm to obtain the simplified speech features.
  • when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: obtaining the original speech data and the corresponding speaker identifier in each recognition data set; performing feature extraction on the original speech data to obtain the original speech features corresponding to the original speech data; and storing the original speech features and the speaker identifier in association in the recognition data set.
  • when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: obtaining speech data to be tested and extracting the speech features to be tested corresponding to the speech data to be tested; processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library to determine a target node; using the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier; and obtaining the spatial distance between the speech features to be tested and each original speech feature in the target data set to determine the target speaker identifier corresponding to the speech data to be tested.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present application discloses a model library establishment method, a speech recognition method, an apparatus, a device, and a medium. The model library establishment method includes: obtaining a training sample set; if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic; if the number of samples in a current training subset is greater than the preset threshold, updating that current training subset to the training sample set; if the number of samples in a current training subset is not greater than the preset threshold, determining that current training subset as a recognition data set. The method enables the recognition server to quickly match the speech data to be tested to its recognition data set according to the model hierarchy logic of the hierarchical model library, improving the efficiency of speech recognition.

Description

Model library establishment method, speech recognition method, apparatus, device, and medium
The present application is based on, and claims priority from, Chinese invention application No. 201810592869.8, filed on June 11, 2018 and entitled "Model library establishment method, speech recognition method, apparatus, device, and medium".
Technical Field
The present application relates to the field of voiceprint recognition, and in particular to a model library establishment method, a speech recognition method, an apparatus, a device, and a medium.
Background
Voiceprint recognition is increasingly favored by system developers and users; its world market share is second only to fingerprint and palmprint biometrics and continues to rise. Its advantages are: (1) speech carrying voiceprint features can be acquired conveniently and naturally, and voiceprint extraction can be completed imperceptibly, so user acceptance is high; (2) speech acquisition is cheap and simple, requiring only a microphone and no additional recording equipment when communication devices are used; (3) it is suitable for remote identity verification, since a single microphone, telephone, or mobile phone allows remote login over a network (a communication network or the Internet); (4) the algorithmic complexity of voiceprint identification and verification is low; (5) combined with other measures, such as content verification through speech recognition, accuracy can be further improved.
Voiceprint recognition generally confirms the target speaker by comparing the speech to be tested in turn against the speaker voices already stored in a database. When the number of speakers in the database is large, finding the target speaker by sequential comparison greatly degrades recognition efficiency.
Summary
In view of this, it is necessary to provide, in response to the above technical problem, a model library establishment method, apparatus, device, and medium that can improve recognition efficiency.
In the model library establishment method, apparatus, device, and medium described below, a current hierarchical model is established from the training speech features extracted from the original speech data; after the current hierarchical model is stored in the hierarchical model library, the original speech data is divided into at least two current training subsets according to the model hierarchy logic, until the number of samples in a current training subset is not greater than the preset threshold, at which point that subset is determined as a recognition data set and the model library is complete, yielding a hierarchical model library that supports fast lookup of the recognition data set.
A model library establishment method includes:
obtaining a training sample set, the training sample set including at least two original speech data;
if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, storing the current hierarchical model in a hierarchical model library, determining the model hierarchy logic of the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic;
if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to the training sample set;
if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
A model library establishment device includes:
a training sample set obtaining module, configured to obtain a training sample set, the training sample set including at least two original speech data;
a hierarchical model storage module, configured to, if the number of original speech data samples in the training sample set is greater than a preset threshold, establish a current hierarchical model from the training speech features extracted from the original speech data, store the current hierarchical model in a hierarchical model library, determine the model hierarchy logic of the hierarchical model library, and divide the original speech data into at least two current training subsets according to the model hierarchy logic;
a training sample set update module, configured to update the current training subset to the training sample set if the number of samples in the current training subset is greater than the preset threshold;
a recognition data set determination module, configured to determine the current training subset as a recognition data set if the number of samples in the current training subset is not greater than the preset threshold, and store the recognition data set in the hierarchical model library.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
obtaining a training sample set, the training sample set including at least two original speech data;
if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, storing the current hierarchical model in a hierarchical model library, determining the model hierarchy logic of the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic;
if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to the training sample set;
if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
One or more non-volatile readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a training sample set, the training sample set including at least two original speech data;
if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, storing the current hierarchical model in a hierarchical model library, determining the model hierarchy logic of the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic;
if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to the training sample set;
if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
A computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the above model library establishment method.
In the above model library establishment method, apparatus, device, and medium, current hierarchical models are established level by level by checking the number of original speech data samples in the training sample set, until a current training subset of the training sample set is determined as a recognition data set and stored in the hierarchical model library, completing the library. This hierarchical model library stores all original speech data in separate recognition data sets, so that later recognition of speech data to be tested need not compare against every original speech data in turn; the recognition data set containing the speech data to be tested can be matched quickly according to the model hierarchy logic of the library, improving the efficiency of speech recognition.
On this basis, it is also necessary to provide, in response to the above technical problem, a speech recognition method, apparatus, device, and medium that can improve recognition efficiency.
A speech recognition method includes:
obtaining speech data to be tested, and extracting the speech features to be tested corresponding to the speech data to be tested;
processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library, to determine a target node;
using the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier;
obtaining the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determining the target speaker identifier corresponding to the speech data to be tested.
A speech recognition device includes:
a test speech obtaining module, configured to obtain speech data to be tested and extract the speech features to be tested corresponding to the speech data to be tested;
a target model determination module, configured to process the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library to determine a target node;
a corresponding recognition data set module, configured to use the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier;
a speaker identifier determination module, configured to obtain the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determine the target speaker identifier corresponding to the speech data to be tested.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
obtaining speech data to be tested, and extracting the speech features to be tested corresponding to the speech data to be tested;
processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library, to determine a target node;
using the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier;
obtaining the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determining the target speaker identifier corresponding to the speech data to be tested.
One or more non-volatile readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining speech data to be tested, and extracting the speech features to be tested corresponding to the speech data to be tested;
processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library, to determine a target node;
using the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier;
obtaining the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determining the target speaker identifier corresponding to the speech data to be tested.
Details of one or more embodiments of the present application are set out in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of the model library establishment method in an embodiment of the present application;
FIG. 2 is a flowchart of the model library establishment method in an embodiment of the present application;
FIG. 3 is another specific flowchart of the model library establishment method in an embodiment of the present application;
FIG. 4 is another specific flowchart of the model library establishment method in an embodiment of the present application;
FIG. 5 is another specific flowchart of the model library establishment method in an embodiment of the present application;
FIG. 6 is another specific flowchart of the model library establishment method in an embodiment of the present application;
FIG. 7 is a flowchart of the speech recognition method in an embodiment of the present application;
FIG. 8 is a schematic block diagram of the model library establishment device in an embodiment of the present application;
FIG. 9 is a schematic block diagram of the speech recognition device in an embodiment of the present application;
FIG. 10 is a schematic diagram of the computer device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The model library establishment method provided by the embodiments of the present application can be applied in the application environment of FIG. 1, in which a computer device communicates with a recognition server over a network. The computer device includes, but is not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The recognition server can be implemented as a standalone server or as a server cluster composed of multiple servers. The computer device can collect original speech data and send it to the recognition server over the network, so that the recognition server can build models from the received original speech data.
In one embodiment, as shown in FIG. 2, a model library establishment method is provided. Taking its application to the recognition server in FIG. 1 as an example, the method includes the following steps:
S10. Obtain a training sample set, the training sample set including at least two original speech data.
Here, the training sample set is a collection of multiple original speech data. Original speech data is the speech data recorded by a speaker through a computer device; the computer device sends the collected original speech data to the recognition server, which stores it in a database for later recognition calls.
S20. If the number of original speech data samples in the training sample set is greater than a preset threshold, establish a current hierarchical model from the training speech features extracted from the original speech data, store the current hierarchical model in a hierarchical model library, determine the model hierarchy logic of the hierarchical model library, and divide the original speech data into at least two current training subsets according to the model hierarchy logic.
Here, the sample count is the number of original speech data in the training sample set. The preset threshold is a predefined value that decides whether the training sample set needs to be divided evenly, i.e., the smallest sample count at which the original speech data is still divided evenly by count. For example, if the preset threshold is 100 and the number of original speech data samples is 99, the original speech data is not divided; if the number of samples exceeds 100, the original speech data is further divided into at least two current training subsets.
Training speech features are the speech features obtained by performing feature extraction on the original speech data; in this embodiment, Mel-frequency cepstral coefficients (hereinafter MFCC features) are used as the training speech features. Research shows that the human ear behaves like a filter bank, attending only to certain frequency components (human hearing is nonlinear in frequency), so the ear receives only a limited set of sound frequencies. These filters are not uniformly distributed along the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparser. A Mel-scale filter bank has high resolution in the low-frequency part, matching the hearing characteristics of the human ear, so Mel-frequency cepstral coefficients are adopted as the training speech features to represent a speaker's speech characteristics accurately.
A current hierarchical model is a fixed-length vector representation obtained by projecting the training speech features of the multiple original speech data belonging to one set onto a low-dimensional total variability subspace; it represents the speech model of the original speech data belonging to that set.
Understandably, the hierarchical model library is a database that contains multiple hierarchical models and stores them according to the model hierarchy logic. Specifically, the model hierarchy logic stores each hierarchical model on the node at the corresponding position of a hierarchical model library built as a tree structure. The tree structure includes one root node and multiple child nodes: the root node has no predecessor, and every other node has exactly one predecessor. If the root node or a child node has successors, it has at least two of them. Child nodes comprise the terminal leaf nodes and the intermediate nodes between the root and the leaves; a child node without successors is a leaf node, and a child node with successors is an intermediate node.
Each time a new current hierarchical model is established, an association must be created for it: the position of the child node holding the current hierarchical model is located in the tree structure, the predecessor of that child node and the higher-level hierarchical model stored on the predecessor are obtained, and the current hierarchical model is associated with the higher-level hierarchical model in the hierarchical model library, realizing the logical relations required by the model hierarchy logic.
A current training subset is a sub-data-set formed by dividing the original speech data in the training sample set evenly by count.
In step S20, the recognition server establishes a current hierarchical model for the original speech data and stores it in the hierarchical model library, which facilitates cluster decisions over speech based on the current hierarchical model during speech recognition.
S30. If the number of samples in the current training subset is greater than the preset threshold, update the current training subset to the training sample set.
Here, the current training subset is the data set formed in step S20 by dividing the original speech data evenly by count.
The training sample set includes at least two original speech data, the original speech data being the speech data recorded by a speaker through a computer device. It should be noted that in step S30, when the number of samples in the current training subset is greater than the preset threshold, the current training subset is updated to the training sample set, so that step S20 and the subsequent steps can be repeated on it, deciding whether this current training subset needs further division.
S40. If the number of samples in the current training subset is not greater than the preset threshold, determine the current training subset as a recognition data set, and store the recognition data set in the hierarchical model library.
Here, a recognition data set is a current training subset whose number of original speech data samples does not exceed the preset threshold. That is, the number of original speech data samples in the current training subset is compared with the preset threshold, and a current training subset whose sample count is not greater than the preset threshold is determined as a recognition data set, without further division of that data set. Understandably, so that specific original speech data can be found during speech recognition, step S40 stores every recognition data set in the hierarchical model library.
The model library establishment method proposed by the embodiments of the present application establishes a current hierarchical model from the training speech features extracted from the original speech data, stores the current hierarchical model in the hierarchical model library, and then divides the original speech data into at least two current training subsets according to the model hierarchy logic, until the sample count of every current training subset is no greater than the preset threshold, at which point the current training subsets are determined as recognition data sets and the model library is complete. This builds a hierarchical model library that supports fast lookup of recognition data sets and helps improve the efficiency of subsequent speech recognition based on it.
In one embodiment, as shown in FIG. 3, before step S10, i.e., before the step of obtaining the training sample set, the model library establishment method further includes:
S11. Create the hierarchical model library as a tree structure, the tree structure including one root node and at least two child nodes associated with the root node.
Here, a tree structure is a data structure in which data elements stand in "one-to-many" tree-shaped relations; it is an important class of nonlinear data structures. In a tree structure, the root node has no predecessor, and every other node has exactly one predecessor. Child nodes comprise the terminal leaf nodes and the intermediate nodes between the root node and the leaf nodes. A leaf node has no successors, while every other node (including the root and intermediate nodes) may have one or more successors.
Building the hierarchical model library as a tree structure in step S11 associates the multiple hierarchical models through the root node-child node relations, making it possible later to quickly locate the original speech data corresponding to a speaker based on the established associations.
In a specific implementation, in step S20, storing the current hierarchical model in the hierarchical model library and determining the model hierarchy logic of the hierarchical model library specifically includes the following step:
S12. Store the current hierarchical model on a child node of the tree structure, and determine the model hierarchy logic from the storage location of the current hierarchical model in the tree structure.
Here, a current hierarchical model is a fixed-length vector representation obtained by projecting the training speech features of the multiple original speech data belonging to one set onto a low-dimensional total variability subspace, representing the speech model of the original speech data belonging to that set.
The model hierarchy logic is the logical relation by which each hierarchical model is saved, by level, on the node of the corresponding level in the hierarchical model library built as a tree structure. The tree structure includes one level-0 root node and at least two successors of that root node, also called level-1 child nodes; if a level-1 child node has successors, it is associated with at least two level-2 child nodes, and so on, until the leaf nodes of the tree (nodes with no associated successors) are reached. Child nodes comprise the terminal leaf nodes and the intermediate nodes between the root node and the leaf nodes; a child node without successors is a leaf node, and a child node with successors is an intermediate node.
Understandably, the level-0 root node has no predecessor, and each child node of every other level has exactly one predecessor. If the speech data corresponding to the current child node needs division, it is divided into at least two speech subsets; since each speech subset must correspond to one successor node, the current node (root or intermediate) has at least two successors.
Following the logical relations of the tree structure and the requirements of this embodiment, no hierarchical model is stored on the level-0 root node or on the leaf nodes; every other intermediate node stores, by level, one corresponding hierarchical model.
Further, the specific process of associating and storing a hierarchical model of the corresponding level on each intermediate node, level by level, is as follows:
1. First obtain the speech data corresponding to a level-N predecessor node. If the number of samples of that speech data is greater than the preset division threshold, divide the speech data into at least two level-(N+1) speech subsets. After the division, establish a level-(N+1) hierarchical model for each speech subset from the speech data of that subset.
2. Create at least two level-(N+1) child nodes for the predecessor node (the number of level-(N+1) child nodes equals the number of level-(N+1) speech subsets), and associate each level-(N+1) speech subset with one level-(N+1) child node.
3. Then decide whether the level-(N+1) speech subset of each level-(N+1) child node needs further division; if so, repeat steps 1 to 2 until no speech subset of any child node needs further division.
In step S12, the recognition server stores each newly established current hierarchical model on a child node according to the hierarchy logic. During speech recognition this makes it possible, from a level-N child node, to find the corresponding level-(N+1) child nodes (at least two), match the speech data to be recognized against the hierarchical model of each level-(N+1) child node, take the level-(N+1) child node whose hierarchical model matches best, and continue classifying the speech data to be recognized under that current level-(N+1) child node.
The speech data to be recognized is then compared repeatedly against the hierarchical model of each level-(N+2) child node of that current level-(N+1) child node, until it is classified down to a leaf node of the tree. The leaf node stores the original speech data of speakers; the speech data to be recognized is matched against each original speech data on the leaf node, and the speaker of the best-matching original speech data is the result of the speech recognition.
In steps S11 to S12, the recognition server builds the hierarchical model library as a tree structure, which associates the multiple hierarchical models through predecessor-child relations and facilitates quickly locating the original speech data corresponding to a speaker from those associations; storing each newly established current hierarchical model on a child node according to the hierarchy logic facilitates finding, from the current hierarchical model of the root node, the next-level hierarchical models of the corresponding child nodes (at least two) during speech recognition.
In one embodiment, as shown in FIG. 4, in step S20, establishing the current hierarchical model from the training speech features extracted from the original speech data specifically includes the following steps:
S21. Perform feature extraction on the original speech data to obtain training speech features.
Here, original speech data is the speech data recorded by a speaker through a computer device; the computer device sends the collected original speech data to the recognition server, which stores it in a database for later recognition calls.
Training speech features are the speech features obtained by performing feature extraction on the original speech data; in this embodiment, Mel-frequency cepstral coefficients (MFCC features) are used as the training speech features. Research shows that the human ear behaves like a filter bank, attending only to certain frequency components (human hearing is nonlinear in frequency), so the ear receives only a limited set of sound frequencies. These filters are not uniformly distributed along the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparser. A Mel-scale filter bank has high resolution in the low-frequency part, matching the hearing characteristics of the human ear, so Mel-frequency cepstral coefficients are adopted as the training speech features to represent a speaker's speech characteristics accurately.
In step S21, the recognition server obtains the training speech features, i.e., the MFCC features, of the original speech data for model training, which effectively represents the speech characteristics of the data set containing the original speech data.
S22. Simplify the training speech features with a simplified model algorithm to obtain simplified speech features.
Here, the simplified model algorithm refers to a Gaussian blur (Gaussian smoothing) algorithm, used to reduce the sound noise and the level of detail of the speech file. Simplified speech features are relatively clean speech features from which the sound noise has been removed by the simplified model algorithm.
In step S22, simplifying the training speech features with the simplified model algorithm may specifically first obtain the two-dimensional normal distribution of the training speech features and then blur all phonemes of the two-dimensional normal distribution to obtain cleaner simplified speech features. The simplified speech features reflect the characteristics of the training speech features to a large extent and help improve the efficiency of the subsequent training of the current hierarchical model.
S23. Iterate the simplified speech features with the expectation-maximization algorithm to obtain the total variability subspace.
Here, the expectation-maximization algorithm (hereinafter EM algorithm) is an iterative algorithm used in statistics to find maximum-likelihood estimates of parameters in probability models that depend on unobservable latent variables.
The total variability subspace (Total Variability Space, hereinafter T space) directly sets up one globally varying projection matrix to contain all possible speaker information in the speech data, without separating the speaker space from the channel space within the T space. The T space can map high-dimensional sufficient statistics (supervectors) to an i-vector (identity vector) serving as a low-dimensional speaker representation, achieving dimensionality reduction. Training the T space includes: based on a preset UBM model, using vector analysis and the EM (expectation-maximization) algorithm to compute the T space by convergence.
The process of iterating the simplified speech features with the EM algorithm to obtain the T space is as follows:
A sample set x = (x^(1), x^(2), ..., x^(m)) of m independent samples is set in advance; the class z^(i) of each sample x^(i) is unknown. The parameter theta of the joint distribution model p(x, z | theta) and of the conditional distribution model p(z | x, theta) must be found, i.e., suitable theta and z must be found to maximize L(theta), with maximum iteration count J:
1) Randomly initialize the model parameter theta of the simplified speech features, with initial value theta^0.
2) for j from 1 to J (the maximum iteration count), iterate the EM algorithm:
a) E-step: compute the conditional probability expectation of the joint distribution. From the initial value of theta, or the parameter value obtained in the previous iteration, compute the posterior probability Q_i(z^(i)) of the latent variable as its current estimate:
Q_i(z^(i)) = P(z^(i) | x^(i), theta^j)
L(theta, theta^j) = sum_{i=1}^{m} sum_{z^(i)} Q_i(z^(i)) * log[ p(x^(i), z^(i) | theta) / Q_i(z^(i)) ]
b) M-step: maximize L(theta, theta^j) to obtain theta^(j+1) (maximize the likelihood function to obtain the new parameter value):
theta^(j+1) = argmax_theta L(theta, theta^j)
c) If theta^(j+1) of the M-step has converged, the algorithm ends; otherwise, return to step a) for another E-step.
3) Output: the model parameter theta of the T space.
The total variability subspace obtained in step S23 does not distinguish the speaker space from the channel space; the information of the vocal-tract space and of the channel space converges into one space, which lowers computational complexity and makes it convenient to further obtain a simplified current universal speech vector based on the total variability subspace.
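As an illustration only, the following toy Python sketch runs the same E-step/M-step alternation on a two-component one-dimensional Gaussian mixture; the T-space estimation of this embodiment applies the same loop to far higher-dimensional statistics, which is omitted here.

```python
import numpy as np

# Toy illustration of the E/M alternation above on a 1-D two-component
# Gaussian mixture (not the T-space estimation itself).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0]); var = np.array([1.0, 1.0]); pi = np.array([0.5, 0.5])
for j in range(50):  # J: maximum number of iterations
    # E-step: posterior Q_i(z) of the hidden component for each sample
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    q = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters by maximizing the expected log-likelihood
    nk = q.sum(axis=0)
    mu_new = (q * x[:, None]).sum(axis=0) / nk
    var = (q * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
    pi = nk / len(x)
    if np.allclose(mu, mu_new, atol=1e-6):  # convergence check on theta
        mu = mu_new
        break
    mu = mu_new
```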
S24. Project the simplified speech features onto the total variability subspace to obtain the current hierarchical model.
Here, the simplified speech features are the speech features obtained in step S22 after processing by the simplified model algorithm.
The current universal speech vector is a fixed-length vector representation obtained by projecting the simplified speech features onto the low-dimensional total variability subspace; it represents the current hierarchical model formed by the multiple original speech data belonging to the same set.
In steps S21 to S24, the recognition server simplifies the training speech features with the simplified model algorithm to obtain the simplified speech features and then projects the simplified speech features onto the total variability subspace, obtaining a cleaner, simpler current hierarchical model, so that subsequent speech clustering of speakers' speech data based on the current hierarchical model is less complex and faster.
In one embodiment, as shown in FIG. 5, in step S22, simplifying the training speech features with the simplified model algorithm to obtain the simplified speech features specifically includes the following steps:
S221. Process the training speech features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution.
Here, a Gaussian filter performs linear smoothing of the input training speech features; it is suited to eliminating Gaussian noise and is widely used in noise reduction. Processing the training speech features with a Gaussian filter is specifically a weighted-averaging process: taking the phonemes of the training speech features as an example, the value of each phoneme is obtained as the weighted average of itself and the other phonemes in its neighborhood.
A two-dimensional normal distribution (also called a two-dimensional Gaussian distribution) satisfies the following density-function properties: it is symmetric about mu, reaches its maximum at mu, tends to 0 at positive (negative) infinity, and has inflection points at mu plus or minus sigma. Its shape is high in the middle and low at both sides; the graph is a bell-shaped curve above the x axis.
Specifically, the Gaussian filter processes the training speech features as follows: scan every phoneme of the training speech data with a 3*3 mask, and replace the value of the phoneme at the mask center with the weighted average of the phonemes in the neighborhood determined by the mask, forming the two-dimensional normal distribution of the training speech data. The weighted average of each phoneme is computed as follows:
(1) Compute the total of the weights of the phonemes. (2) Scan the phonemes of the training speech features one by one; from the weights at each position of the phoneme, compute the weighted average of its neighborhood and assign the obtained weighted average to the phoneme at the current position. (3) Repeat step (2) until all phonemes of the training speech features have been processed.
Through step S221, the noise in the training speech features can be removed, and the output is a linearly smoothed sound filtering result, yielding a clean signal for further processing.
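The following Python sketch mirrors the 3*3 mask scan described above on a two-dimensional array of feature values; the specific kernel weights are an assumption (a common 3*3 Gaussian approximation), since the text does not fix them.

```python
import numpy as np

# Sketch of the 3x3 weighted-average scan of step S221.
kernel = np.array([[1, 2, 1],
                   [2, 4, 2],
                   [1, 2, 1]], dtype=float)
kernel /= kernel.sum()  # step (1): total of the weights

def gaussian_smooth(feat):
    padded = np.pad(feat, 1, mode="edge")
    out = np.empty_like(feat, dtype=float)
    rows, cols = feat.shape
    for r in range(rows):            # step (2): scan every element
        for c in range(cols):
            window = padded[r:r + 3, c:c + 3]
            out[r, c] = (window * kernel).sum()  # weighted neighbourhood mean
    return out                       # step (3): all elements processed

feat = np.random.default_rng(2).normal(size=(6, 8))
smoothed = gaussian_smooth(feat)
```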
S222. Simplify the two-dimensional normal distribution with the simplified model algorithm to obtain the simplified speech features.
Applied to this embodiment, the simplified model algorithm may use a Gaussian blur algorithm to simplify the two-dimensional normal distribution.
Specifically, the Gaussian blur algorithm simplifies the two-dimensional normal distribution as follows: every phoneme takes the average value of the surrounding phonemes, the "middle point" taking the average of the "surrounding points". Numerically this is a smoothing; graphically it produces a blurring effect, the "middle point" losing detail. Clearly, the larger the range over which the average is taken, the stronger the blurring.
In step S222, the recognition server can obtain, via the simplified model algorithm, the simplified speech features of the two-dimensional normal distribution corresponding to the training speech features, further reducing the speech detail of the training speech features and simplifying them.
In steps S221 to S222, the recognition server successively denoises the training speech features and reduces their detail, obtaining clean, simple simplified speech features, which helps improve the recognition efficiency of speech clustering.
In one embodiment, as shown in FIG. 6, after step S40, i.e., after the step of determining the current training subset as a recognition data set and storing the recognition data set in the hierarchical model library, the model library establishment method further includes:
S41. Obtain the original speech data and the corresponding speaker identifier in each recognition data set.
Here, a recognition data set is a data set whose number of original speech data does not exceed the preset threshold; such a data set is defined as a recognition data set and need not be divided further. The hierarchical model library is a database that includes multiple hierarchical models.
Original speech data is the speech data recorded by a speaker through a computer device. Correspondingly, the speaker identifier is the identity identifier of the speaker of the original speech data, indicating the speaker's unique identity; a user ID, mobile phone number, or ID card number may be used as the speaker identifier.
S42. Perform feature extraction on the original speech data to obtain the original speech features corresponding to the original speech data.
Here, original speech features are the speech features that distinguish a speaker from others, specifically the speech features obtained by performing feature extraction on the original speech data; in this embodiment, Mel-frequency cepstral coefficients (MFCC features) may be used as the original speech features. The original speech features corresponding to the original speech data are obtained as follows:
S421: Preprocess the original speech data to obtain preprocessed speech data.
In a specific implementation, step S421, preprocessing the original speech data to obtain preprocessed speech data, specifically includes the following steps:
S4211: Apply pre-emphasis to the original speech data, the pre-emphasis being computed as s'_n = s_n - a*s_(n-1), where s_n is the signal amplitude in the time domain, s_(n-1) is the signal amplitude at the previous instant corresponding to s_n, s'_n is the pre-emphasized signal amplitude in the time domain, and a is the pre-emphasis coefficient with 0.9 < a < 1.0.
Here, pre-emphasis is a signal processing method that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal suffers heavy loss during transmission; to let the receiving end obtain a reasonably good waveform, the damaged signal must be compensated. The idea of pre-emphasis is to strengthen the high-frequency components of the signal at the transmitting end of the line, compensating the excessive attenuation of the high frequencies during transmission so that the receiving end obtains a better waveform. Pre-emphasis has no effect on noise and therefore effectively raises the output signal-to-noise ratio.
In this embodiment, pre-emphasis is applied to the original speech data via s'_n = s_n - a*s_(n-1), where s_n is the time-domain signal amplitude, i.e., the amplitude of the speech expressed in the time domain, s_(n-1) is the signal amplitude at the previous instant, s'_n is the pre-emphasized time-domain signal amplitude, and a is the pre-emphasis coefficient with 0.9 < a < 1.0; here a = 0.97 gives a fairly good pre-emphasis effect. This pre-emphasis can eliminate interference caused by the vocal cords and lips during phonation, effectively compensates the suppressed high-frequency part of the original speech data, highlights the high-frequency formants of the original speech data, and strengthens the signal amplitude of the original speech data, which helps extract the training speech features.
S4212: Apply framing to the pre-emphasized original speech data.
Specifically, after pre-emphasizing the original speech data, framing should also be performed. Framing is a speech-processing technique that cuts the whole speech signal into segments; each frame is 10-30 ms long, with a frame shift of roughly half a frame length. The frame shift is the overlap region between two adjacent frames, avoiding overly large changes between them. Framing splits the original speech data into segments, subdividing it and facilitating the extraction of the training speech features.
S4213: Apply windowing to the framed original speech data to obtain the preprocessed speech data, the windowing being computed as
s'_n = (0.54 - 0.46*cos(2*pi*n/(N-1))) * s_n, 0 <= n <= N-1,
where N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the windowed signal amplitude in the time domain.
Specifically, after framing the original speech data, discontinuities appear at the start and end of every frame, so the more frames there are, the larger the error against the original speech data. Windowing solves this problem: it makes the framed original speech data continuous and lets each frame exhibit the characteristics of a periodic function. Windowing specifically means processing the original speech data with a window function; the window function may be a Hamming window, in which case the windowing formula is the one given above, with N the Hamming window length, n time, s_n the time-domain signal amplitude, and s'_n the windowed time-domain signal amplitude. Windowing the original speech data to obtain the preprocessed speech data makes the time-domain signal of the framed original speech data continuous, which helps extract the training speech features of the original speech data.
The preprocessing of the original speech data in steps S4211-S4213 above lays the foundation for extracting the original speech features, making the extracted original speech features more representative of the original speech data.
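A compact sketch of the preprocessing chain S4211-S4213 might look as follows; the 25 ms frame length and half-frame shift are typical values consistent with the ranges above, not values mandated by the text.

```python
import numpy as np

# Sketch of S4211-S4213: pre-emphasis, framing, and Hamming windowing.
def preprocess(signal, sample_rate=16000, a=0.97, frame_ms=25, shift_ms=12.5):
    # S4211: pre-emphasis  s'_n = s_n - a * s_(n-1)
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    # S4212: framing (10-30 ms frames, roughly half-frame shift)
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # S4213: Hamming window applied to every frame
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.default_rng(3).normal(size=16000))
```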
S422: Apply the fast Fourier transform to the preprocessed speech data to obtain the spectrum of the original speech data, and obtain the power spectrum of the original speech data from the spectrum.
Here, the fast Fourier transform (FFT) is the collective name for efficient, fast computer algorithms for the discrete Fourier transform. Such algorithms greatly reduce the number of multiplications the computer needs for the discrete Fourier transform; the more sampling points are transformed, the more significant the savings of the FFT algorithm.
Specifically, the fast Fourier transform is applied to the preprocessed speech data to convert it from signal amplitudes in the time domain to signal amplitudes in the frequency domain (the spectrum). The spectrum is computed as
s(k) = sum_{n=0}^{N-1} s(n) * e^(-2*pi*i*k*n/N), 1 <= k <= N,
where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is time, and i is the imaginary unit. After the spectrum of the preprocessed speech data is obtained, the power spectrum of the preprocessed speech data can be derived directly from it; below, the power spectrum of the preprocessed speech data is called the power spectrum of the original speech data. The power spectrum of the original speech data is computed as
P(k) = |s(k)|^2 / N,
where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the preprocessed speech data from time-domain signal amplitudes to frequency-domain signal amplitudes and then obtaining the power spectrum of the original speech data from them provides an important technical basis for extracting the original speech features from the power spectrum of the original speech data.
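Continuing the sketch, step S422 can be expressed in a few lines; the frames array from the preprocessing sketch is assumed, and the 512-point FFT size is an illustrative choice.

```python
import numpy as np

# Sketch of S422: FFT of each preprocessed frame and the power spectrum
# P(k) = |S(k)|^2 / N, following the formulas above.
N_FFT = 512
spectrum = np.fft.rfft(frames, n=N_FFT)   # frequency-domain amplitudes S(k)
power = (np.abs(spectrum) ** 2) / N_FFT   # power spectrum of each frame
```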
S423: Process the power spectrum of the original speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the original speech data.
Here, processing the power spectrum of the original speech data with a Mel-scale filter bank is a Mel-frequency analysis of the power spectrum, and Mel-frequency analysis is based on human auditory perception. Research shows that the human ear behaves like a filter bank, attending only to certain frequency components (human hearing is nonlinear in frequency), i.e., the ear receives only a limited set of sound frequencies. These filters are not uniformly distributed along the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparser. Understandably, the Mel-scale filter bank has high resolution in the low-frequency part, matching the hearing characteristics of the human ear, which is the physical meaning of the Mel scale.
In this embodiment, the power spectrum of the original speech data is processed by a Mel-scale filter bank to obtain its Mel power spectrum; segmenting the frequency-domain signal with the Mel-scale filter bank makes each frequency band correspond to one value in the end. If there are 22 filters, 22 energy values corresponding to the Mel power spectrum of the original speech data are obtained. Through Mel-frequency analysis of the power spectrum of the original speech data, the resulting Mel power spectrum retains the frequency portions closely related to the characteristics of the human ear, and those portions reflect the characteristics of the original speech data well.
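A sketch of a 22-filter Mel-scale filter bank applied to the power spectrum follows; the triangular-filter construction is the standard design and is an assumption of this sketch (power comes from the FFT sketch above).

```python
import numpy as np

# Sketch of S423: a 22-filter Mel-scale filter bank (triangular filters).
def mel_filterbank(n_filters=22, n_fft=512, sample_rate=16000):
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        bank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        bank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return bank

mel_power = power @ mel_filterbank().T  # 22 energy values per frame
```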
S424: Perform cepstral analysis on the Mel power spectrum to obtain the MFCC features of the original speech data.
Here, the cepstrum denotes the inverse Fourier transform of the logarithm of a signal's Fourier spectrum; since the Fourier spectrum is generally a complex spectrum, the cepstrum is also called the complex cepstrum.
Specifically, cepstral analysis is performed on the Mel power spectrum, and the MFCC features of the original speech data are analyzed and obtained from the cepstral result. Through this cepstral analysis, the features contained in the Mel power spectrum of the original speech data, whose dimensionality is originally too high to use directly, are converted into easy-to-use features (MFCC feature vectors used for training or recognition). The MFCC features can serve as the coefficients by which original speech features distinguish different speech; they reflect the differences between speech and can be used to recognize and distinguish the original speech data.
In a specific implementation, step S424, performing cepstral analysis on the Mel power spectrum to obtain the MFCC features of the original speech data, includes the following steps:
S4241: Take the logarithm of the Mel power spectrum to obtain the Mel power spectrum to be transformed.
Specifically, following the definition of the cepstrum, take the logarithm log of the Mel power spectrum to obtain the Mel power spectrum to be transformed, m.
S4242: Apply the discrete cosine transform to the Mel power spectrum to be transformed, to obtain the MFCC features of the original speech data.
Specifically, the discrete cosine transform (DCT) is applied to the Mel power spectrum to be transformed, m, to obtain the corresponding MFCC features of the original speech data; generally the 2nd to 13th coefficients are taken as the original speech features, which can reflect the differences between speech data. The discrete cosine transform of the Mel power spectrum to be transformed can be written in the standard DCT form
C_n = sum_{j=1}^{N} m_j * cos(pi*n*(j - 0.5)/N),
where N is the frame length, m is the Mel power spectrum to be transformed, and j is the index variable of the Mel power spectrum to be transformed. Because the Mel filters overlap, the energy values obtained with the Mel-scale filters are correlated with one another; the discrete cosine transform can perform dimensionality-reducing compression and abstraction on the Mel power spectrum to be transformed, m, and yield the indirect original speech features. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which is a clear computational advantage.
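The cepstral steps S4241-S4242 then reduce to a logarithm followed by a DCT, keeping the 2nd to 13th coefficients (mel_power comes from the filter-bank sketch above).

```python
import numpy as np
from scipy.fftpack import dct

# Sketch of S4241-S4242: log of the Mel power spectrum, then a DCT,
# keeping coefficients 2-13 as the original speech features.
log_mel = np.log(mel_power + 1e-10)                          # S4241
mfcc = dct(log_mel, type=2, axis=1, norm="ortho")[:, 1:13]   # S4242
```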
The feature extraction of steps S421-S424 on the original speech data finally yields original speech features that represent the original speech data well; these original speech features can train the corresponding current hierarchical model, making the results of speech recognition based on that current hierarchical model more precise.
Step S42 adopts Mel-frequency cepstral coefficients as the original speech features, which accurately represent the speech characteristics of the original speech data, so that the current hierarchical model trained with these original speech features has higher recognition accuracy.
S43. Store the original speech features and the speaker identifier in association in the recognition data set.
Here, the original speech features are those obtained in step S42 for the original speech data, and the speaker identifier is the one obtained in step S41 for the original speech data.
Storing the original speech features and the speaker identifier in association in the recognition data set in step S43 makes it possible later to quickly obtain, from that recognition data set, the original speech data corresponding to a speaker identifier, so as to perform speech recognition based on the original speech data.
In steps S41 to S43, the recognition server adopts Mel-frequency cepstral coefficients as the training speech features, which accurately represent the speech characteristics of the original speech data; storing the original speech features and the speaker identifier in association in the recognition data set facilitates quickly obtaining the original speech data corresponding to a speaker identifier from that recognition data set, for speech recognition based on the original speech data.
The model library establishment method provided by the embodiments of the present application establishes current hierarchical models level by level by checking the number of original speech data samples in the training sample set, until a current training subset of the training sample set is determined as a recognition data set and stored in the hierarchical model library, completing the hierarchical model library. This hierarchical model library stores all original speech data in separate recognition data sets, so that later recognition of speech data to be tested need not compare against all original speech data in turn; the recognition data set containing the speech data to be tested can be matched quickly according to the model hierarchy logic of the library, improving the efficiency of speech recognition.
Preferably, the recognition server builds the hierarchical model library as a tree structure, which associates the multiple hierarchical models through root node-child node relations and facilitates quickly locating the original speech data corresponding to a speaker from those associations; the recognition server stores each newly established current hierarchical model on the root node or a child node according to the hierarchy logic, which facilitates finding, from the current hierarchical model of the root node or a child node, the next-level hierarchical models of the corresponding successor nodes (at least two) during speech recognition.
In one embodiment, as shown in FIG. 7, a speech recognition method is provided. Taking its application to the recognition server in FIG. 1 as an example, the method includes the following steps:
S50. Obtain speech data to be tested, and extract the speech features to be tested corresponding to the speech data to be tested.
Here, the speech data to be tested is the speech data that needs testing, specifically the speech data whose corresponding speaker identifier in the hierarchical model library is to be confirmed. The speech features to be tested are the corresponding MFCC features obtained by performing feature extraction on the speech data to be tested. The feature extraction in step S50 is the same as in steps S421 to S424 above and, to avoid repetition, is not described again here.
Step S50 adopts Mel-frequency cepstral coefficients as the speech features to be tested, which accurately represent the speech characteristics of the speaker.
S60. Process the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library, to determine the target node.
Here, the hierarchical model library is the database generated in steps S10 to S40; it includes multiple trained current hierarchical models saved in the database and stores each current hierarchical model according to the hierarchy logic. Specifically, the model hierarchy logic stores each hierarchical model on the node at the corresponding position of a hierarchical model library built as a tree structure. The tree structure includes one root node and multiple child nodes: the root node has no predecessor, and every other child node has exactly one predecessor. If the root node or a child node has successors, it has at least two of them. Child nodes comprise the terminal leaf nodes and the intermediate nodes between the root and the leaves; a child node without successors is a leaf node, and a child node with successors is an intermediate node.
Each time a new current hierarchical model is established, an association must be created for it: the position of the child node holding the current hierarchical model is located in the tree structure, the predecessor of that child node and the higher-level hierarchical model stored on the predecessor are obtained, and the current hierarchical model is associated with the higher-level hierarchical model in the hierarchical model library, realizing the logical relations required by the model hierarchy logic.
The target node is found by processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library, continuously searching downward along the associations from the root node of the tree structure formed by the hierarchical model library, until a leaf node with no successors is found; that leaf node serves as the target node.
In step S60, the recognition server processes the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library, speeding up recognition: instead of directly comparing the speech data to be tested against the original speech data one by one, it searches until the target node is found, greatly reducing the number of speech feature comparisons that must actually be performed.
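Step S60 amounts to a greedy descent of the tree; the following sketch reuses the Node structure from the build sketch above, with score as a hypothetical similarity function such as the cosine measure of step S80.

```python
# Sketch of S60: descend the tree from the root, at each level keeping the
# child whose hierarchical model best matches the features to be tested.
# score(model, feature) is a hypothetical similarity function.
def find_target_node(root, feature):
    node = root
    while node.children:  # stop at a leaf (the target node)
        node = max(node.children, key=lambda ch: score(ch.model, feature))
    return node
```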
S70. Use the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier.
Here, the target node is the leaf node of the hierarchical model library found in step S60.
A recognition data set is stored in association with a leaf node of the tree structure formed by the hierarchical model library, and the number of original speech data in the set does not exceed the preset threshold. The hierarchical model library is a database including multiple current hierarchical models and recognition data sets.
The original speech data is that of step S41, i.e., the speech data recorded by a speaker through a computer device. The speaker identifier corresponds to that of step S41, i.e., the identity identifier of the speaker of the original speech data, indicating the speaker's unique identity; a user ID, mobile phone number, or ID card number may be used as the speaker identifier.
In step S70, the recognition server takes the recognition data set associated with the target node as the target data set and further obtains, as in step S41, all original speech data and the corresponding speaker identifiers in that recognition data set.
S80. Obtain the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determine the target speaker identifier corresponding to the speech data to be tested.
Here, applied to this embodiment, the spatial distance may use the geometric cosine of the angle to measure the difference in spatial direction between two features.
The target speaker identifier is the speaker identifier corresponding to the speech features to be tested in the hierarchical model library.
Specifically, the spatial distance between the speech features to be tested and each original speech feature in the target data set can be evaluated by the following formula:
cos(theta) = sum_i(A_i * B_i) / ( sqrt(sum_i A_i^2) * sqrt(sum_i B_i^2) ),
where A_i and B_i denote the components of the speech features to be tested and of the original speech features, respectively. From the formula, the spatial distance, i.e., the cosine value, ranges from -1 to 1: -1 means the two speech features point in opposite directions in space, 1 means they point in the same direction, and 0 means they are independent. Values between -1 and 1 indicate the similarity or dissimilarity between the two speech features; understandably, the closer the similarity is to 1, the closer the two speech features are.
In this step, the recognition server takes as the target speaker identifier the speaker identifier corresponding to the maximum spatial distance (i.e., the largest cosine similarity) between the speech features to be tested and the original speech features.
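The cosine comparison of step S80 can be sketched as follows; the arrays are random placeholders standing in for the feature to be tested and the original speech features of the target data set.

```python
import numpy as np

# Sketch of S80: cosine similarity between the feature to be tested (a) and
# every original speech feature in the target data set (rows of b_matrix);
# the speaker identifier with the largest value is taken as the target.
def cosine_similarity(a, b_matrix):
    num = b_matrix @ a
    den = np.linalg.norm(b_matrix, axis=1) * np.linalg.norm(a)
    return num / den

scores = cosine_similarity(np.ones(12),
                           np.random.default_rng(4).normal(size=(5, 12)))
target_index = int(np.argmax(scores))  # index of the target speaker identifier
```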
The speech recognition method provided by steps S50 to S80 processes the speech features to be tested through the model hierarchy logic and the current hierarchical models in the hierarchical model library to determine the target node. Using the recognition data set corresponding to the target node as the target data set, the target speaker identifier corresponding to the speech data to be tested can be determined, avoiding a direct, sequential comparison of the speech data to be tested against all original speech data; the corresponding current hierarchical models are looked up level by level through the hierarchical model library to determine the target data set, and only then is a sequential comparison made against the limited original speech data in the target data set to determine the target speaker identifier, thereby improving the efficiency of speech recognition.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, a model library establishment device is provided, corresponding one-to-one to the model library establishment method in the above embodiment. As shown in FIG. 8, the model library establishment device includes a training sample set obtaining module 10, a hierarchical model storage module 20, a training sample set update module 30, and a recognition data set determination module 40. The functional modules are described in detail as follows:
The training sample set obtaining module 10 is configured to obtain a training sample set, the training sample set including at least two original speech data.
The hierarchical model storage module 20 is configured to, if the number of original speech data samples in the training sample set is greater than a preset threshold, establish a current hierarchical model from the training speech features extracted from the original speech data, store the current hierarchical model in a hierarchical model library, determine the model hierarchy logic of the hierarchical model library, and divide the original speech data into at least two current training subsets according to the model hierarchy logic.
The training sample set update module 30 is configured to update the current training subset to the training sample set if the number of samples in the current training subset is greater than the preset threshold.
The recognition data set determination module 40 is configured to determine the current training subset as a recognition data set if the number of samples in the current training subset is not greater than the preset threshold, and store the recognition data set in the hierarchical model library.
Preferably, the model library establishment device further includes a model library creation unit 11.
The model library creation unit 11 is configured to create the hierarchical model library as a tree structure, the tree structure including one root node and at least two child nodes associated with the root node.
Preferably, the hierarchical model storage module 20 includes a hierarchy logic determination unit 12.
The hierarchy logic determination unit 12 is configured to store the current hierarchical model on a child node of the tree structure and determine the model hierarchy logic from the storage location of the current hierarchical model in the tree structure.
Preferably, the hierarchical model storage module 20 includes a training feature obtaining unit 21, a simplified feature obtaining unit 22, a subspace obtaining unit 23, and a hierarchical model obtaining unit 24.
The training feature obtaining unit 21 is configured to perform feature extraction on the original speech data to obtain training speech features.
The simplified feature obtaining unit 22 is configured to simplify the training speech features with the simplified model algorithm to obtain simplified speech features.
The subspace obtaining unit 23 is configured to iterate the simplified speech features with the expectation-maximization algorithm to obtain the total variability subspace.
The hierarchical model obtaining unit 24 is configured to project the simplified speech features onto the total variability subspace to obtain the current hierarchical model.
Preferably, the simplified feature obtaining unit 22 includes a normal distribution obtaining subunit 221 and a speech feature obtaining subunit 222.
The normal distribution obtaining subunit 221 is configured to process the training speech features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution.
The speech feature obtaining subunit 222 is configured to simplify the two-dimensional normal distribution with the simplified model algorithm to obtain the simplified speech features.
Preferably, the model library establishment device further includes a speaker data obtaining unit 41, an original feature obtaining unit 42, and an original feature storage unit 43.
The speaker data obtaining unit 41 is configured to obtain the original speech data and the corresponding speaker identifier in each recognition data set.
The original feature obtaining unit 42 is configured to perform feature extraction on the original speech data to obtain the original speech features corresponding to the original speech data.
The original feature storage unit 43 is configured to store the original speech features and the speaker identifier in association in the recognition data set.
In one embodiment, a speech recognition device is provided, corresponding one-to-one to the speech recognition method in the above embodiment. As shown in FIG. 9, the speech recognition device includes a test speech obtaining module 50, a target model determination module 60, a corresponding recognition data set module 70, and a speaker identifier determination module 80. The functional modules are described in detail as follows:
The test speech obtaining module 50 is configured to obtain speech data to be tested and extract the speech features to be tested corresponding to the speech data to be tested.
The target model determination module 60 is configured to process the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library to determine a target node.
The corresponding recognition data set module 70 is configured to use the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier.
The speaker identifier determination module 80 is configured to obtain the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determine the target speaker identifier corresponding to the speech data to be tested.
For the specific limitations of the model library establishment device, reference may be made to the limitations of the model library establishment method above, which are not repeated here. Each module of the above model library establishment device may be implemented wholly or partly by software, by hardware, or by a combination thereof. The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device stores data related to speech recognition. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a model library establishment method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when executing the computer-readable instructions, the processor implements the following steps: obtaining a training sample set, the training sample set including at least two original speech data; if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, storing the current hierarchical model in a hierarchical model library, determining the model hierarchy logic of the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic; if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to the training sample set; if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following steps: creating the hierarchical model library as a tree structure, the tree structure including one root node and at least two child nodes associated with the root node; storing the current hierarchical model in the hierarchical model library and determining the model hierarchy logic of the hierarchical model library includes: storing the current hierarchical model on a child node of the tree structure, and determining the model hierarchy logic from the storage location of the current hierarchical model in the tree structure.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following steps: performing feature extraction on the original speech data to obtain training speech features; simplifying the training speech features with a simplified model algorithm to obtain simplified speech features; iterating the simplified speech features with the expectation-maximization algorithm to obtain the total variability subspace; projecting the simplified speech features onto the total variability subspace to obtain the current hierarchical model.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following steps: processing the training speech features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution; simplifying the two-dimensional normal distribution with the simplified model algorithm to obtain the simplified speech features.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following steps: obtaining the original speech data and the corresponding speaker identifier in each recognition data set; performing feature extraction on the original speech data to obtain the original speech features corresponding to the original speech data; storing the original speech features and the speaker identifier in association in the recognition data set.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following steps: obtaining speech data to be tested, and extracting the speech features to be tested corresponding to the speech data to be tested; processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library to determine a target node; using the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier; obtaining the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determining the target speaker identifier corresponding to the speech data to be tested.
In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps: obtaining a training sample set, the training sample set including at least two original speech data; if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, storing the current hierarchical model in a hierarchical model library, determining the model hierarchy logic of the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic; if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to the training sample set; if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
In one embodiment, when the computer-readable instructions are executed by the one or more processors, the one or more processors further perform the following steps: creating the hierarchical model library as a tree structure, the tree structure including one root node and at least two child nodes associated with the root node; storing the current hierarchical model in the hierarchical model library and determining the model hierarchy logic of the hierarchical model library includes: storing the current hierarchical model on a child node of the tree structure, and determining the model hierarchy logic from the storage location of the current hierarchical model in the tree structure.
In one embodiment, when the computer-readable instructions are executed by the one or more processors, the one or more processors further perform the following steps: performing feature extraction on the original speech data to obtain training speech features; simplifying the training speech features with a simplified model algorithm to obtain simplified speech features; iterating the simplified speech features with the expectation-maximization algorithm to obtain the total variability subspace; projecting the simplified speech features onto the total variability subspace to obtain the current hierarchical model.
In one embodiment, when the computer-readable instructions are executed by the one or more processors, the one or more processors further perform the following steps: processing the training speech features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution; simplifying the two-dimensional normal distribution with the simplified model algorithm to obtain the simplified speech features.
In one embodiment, when the computer-readable instructions are executed by the one or more processors, the one or more processors further perform the following steps: obtaining the original speech data and the corresponding speaker identifier in each recognition data set; performing feature extraction on the original speech data to obtain the original speech features corresponding to the original speech data; storing the original speech features and the speaker identifier in association in the recognition data set.
In one embodiment, when the computer-readable instructions are executed by the one or more processors, the one or more processors further perform the following steps: obtaining speech data to be tested, and extracting the speech features to be tested corresponding to the speech data to be tested; processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in the hierarchical model library to determine a target node; using the recognition data set corresponding to the target node as the target data set, each original speech data in the target data set carrying a speaker identifier; obtaining the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determining the target speaker identifier corresponding to the speech data to be tested.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions instructing the relevant hardware; the computer-readable instructions can be stored in a non-volatile computer-readable storage medium, and when executed, the computer-readable instructions may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
A person skilled in the art can clearly understand that, for convenience and brevity of description, the division into the above functional units and modules is used only as an example; in practical applications, the above functions can be assigned to different functional units and modules as needed, i.e., the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to explain the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included in the protection scope of the present application.

Claims (20)

  1. A model library establishment method, characterized by comprising:
    obtaining a training sample set, the training sample set including at least two original speech data;
    if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, storing the current hierarchical model in a hierarchical model library, determining the model hierarchy logic of the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic;
    if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to the training sample set;
    if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
  2. The model library establishment method according to claim 1, characterized in that, before the step of obtaining the training sample set, the model library establishment method further comprises:
    creating the hierarchical model library as a tree structure, the tree structure including one root node and at least two child nodes associated with the root node;
    wherein storing the current hierarchical model in the hierarchical model library and determining the model hierarchy logic of the hierarchical model library comprises:
    storing the current hierarchical model on a child node of the tree structure, and determining the model hierarchy logic from the storage location of the current hierarchical model in the tree structure.
  3. The model library establishment method according to claim 1, characterized in that establishing the current hierarchical model from the training speech features extracted from the original speech data comprises:
    performing feature extraction on the original speech data to obtain training speech features;
    simplifying the training speech features with a simplified model algorithm to obtain simplified speech features;
    iterating the simplified speech features with the expectation-maximization algorithm to obtain a total variability subspace;
    projecting the simplified speech features onto the total variability subspace to obtain the current hierarchical model.
  4. The model library establishment method according to claim 3, characterized in that simplifying the training speech features with the simplified model algorithm to obtain the simplified speech features comprises:
    processing the training speech features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution;
    simplifying the two-dimensional normal distribution with the simplified model algorithm to obtain the simplified speech features.
  5. The model library establishment method according to claim 1, characterized in that, after the step of determining the current training subset as a recognition data set and storing the recognition data set in the hierarchical model library, the model library establishment method further comprises:
    obtaining the original speech data and the corresponding speaker identifier in each recognition data set;
    performing feature extraction on the original speech data to obtain the original speech features corresponding to the original speech data;
    storing the original speech features and the speaker identifier in association in the recognition data set.
  6. A speech recognition method, characterized by comprising:
    obtaining speech data to be tested, and extracting the speech features to be tested corresponding to the speech data to be tested;
    processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in a hierarchical model library, to determine a target node;
    using the recognition data set corresponding to the target node as a target data set, each original speech data in the target data set carrying a speaker identifier;
    obtaining the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determining the target speaker identifier corresponding to the speech data to be tested.
  7. A model library establishment device, characterized by comprising:
    a training sample set obtaining module, configured to obtain a training sample set, the training sample set including at least two original speech data;
    a hierarchical model storage module, configured to, if the number of original speech data samples in the training sample set is greater than a preset threshold, establish a current hierarchical model from the training speech features extracted from the original speech data, store the current hierarchical model in a hierarchical model library, determine the model hierarchy logic of the hierarchical model library, and divide the original speech data into at least two current training subsets according to the model hierarchy logic;
    a training sample set update module, configured to update the current training subset to the training sample set if the number of samples in the current training subset is greater than the preset threshold;
    a recognition data set determination module, configured to determine the current training subset as a recognition data set if the number of samples in the current training subset is not greater than the preset threshold, and store the recognition data set in the hierarchical model library.
  8. A speech recognition device, characterized by comprising:
    a test speech obtaining module, configured to obtain speech data to be tested and extract the speech features to be tested corresponding to the speech data to be tested;
    a target model determination module, configured to process the speech features to be tested according to the model hierarchy logic and the current hierarchical models in a hierarchical model library to determine a target node;
    a corresponding recognition data set module, configured to use the recognition data set corresponding to the target node as a target data set, each original speech data in the target data set carrying a speaker identifier;
    a speaker identifier determination module, configured to obtain the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determine the target speaker identifier corresponding to the speech data to be tested.
  9. A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    obtaining a training sample set, the training sample set including at least two original speech data;
    if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, storing the current hierarchical model in a hierarchical model library, determining the model hierarchy logic of the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic;
    if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to the training sample set;
    if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
  10. The computer device according to claim 9, characterized in that, before the step of obtaining the training sample set, the processor further implements the following steps when executing the computer-readable instructions:
    creating the hierarchical model library as a tree structure, the tree structure including one root node and at least two child nodes associated with the root node;
    wherein storing the current hierarchical model in the hierarchical model library and determining the model hierarchy logic of the hierarchical model library comprises:
    storing the current hierarchical model on a child node of the tree structure, and determining the model hierarchy logic from the storage location of the current hierarchical model in the tree structure.
  11. The computer device according to claim 9, characterized in that establishing the current hierarchical model from the training speech features extracted from the original speech data comprises:
    performing feature extraction on the original speech data to obtain training speech features;
    simplifying the training speech features with a simplified model algorithm to obtain simplified speech features;
    iterating the simplified speech features with the expectation-maximization algorithm to obtain a total variability subspace;
    projecting the simplified speech features onto the total variability subspace to obtain the current hierarchical model.
  12. The computer device according to claim 11, characterized in that simplifying the training speech features with the simplified model algorithm to obtain the simplified speech features comprises:
    processing the training speech features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution;
    simplifying the two-dimensional normal distribution with the simplified model algorithm to obtain the simplified speech features.
  13. The computer device according to claim 9, characterized in that, after the step of determining the current training subset as a recognition data set and storing the recognition data set in the hierarchical model library, the processor further implements the following steps when executing the computer-readable instructions:
    obtaining the original speech data and the corresponding speaker identifier in each recognition data set;
    performing feature extraction on the original speech data to obtain the original speech features corresponding to the original speech data;
    storing the original speech features and the speaker identifier in association in the recognition data set.
  14. A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    obtaining speech data to be tested, and extracting the speech features to be tested corresponding to the speech data to be tested;
    processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in a hierarchical model library, to determine a target node;
    using the recognition data set corresponding to the target node as a target data set, each original speech data in the target data set carrying a speaker identifier;
    obtaining the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determining the target speaker identifier corresponding to the speech data to be tested.
  15. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
    obtaining a training sample set, the training sample set including at least two original speech data;
    if the number of original speech data samples in the training sample set is greater than a preset threshold, establishing a current hierarchical model from the training speech features extracted from the original speech data, storing the current hierarchical model in a hierarchical model library, determining the model hierarchy logic of the hierarchical model library, and dividing the original speech data into at least two current training subsets according to the model hierarchy logic;
    if the number of samples in the current training subset is greater than the preset threshold, updating the current training subset to the training sample set;
    if the number of samples in the current training subset is not greater than the preset threshold, determining the current training subset as a recognition data set, and storing the recognition data set in the hierarchical model library.
  16. The non-volatile readable storage media according to claim 15, characterized in that, before the step of obtaining the training sample set, the one or more processors further perform the following steps:
    creating the hierarchical model library as a tree structure, the tree structure including one root node and at least two child nodes associated with the root node;
    wherein storing the current hierarchical model in the hierarchical model library and determining the model hierarchy logic of the hierarchical model library comprises:
    storing the current hierarchical model on a child node of the tree structure, and determining the model hierarchy logic from the storage location of the current hierarchical model in the tree structure.
  17. The non-volatile readable storage media according to claim 15, characterized in that establishing the current hierarchical model from the training speech features extracted from the original speech data comprises:
    performing feature extraction on the original speech data to obtain training speech features;
    simplifying the training speech features with a simplified model algorithm to obtain simplified speech features;
    iterating the simplified speech features with the expectation-maximization algorithm to obtain a total variability subspace;
    projecting the simplified speech features onto the total variability subspace to obtain the current hierarchical model.
  18. The non-volatile readable storage media according to claim 17, characterized in that simplifying the training speech features with the simplified model algorithm to obtain the simplified speech features comprises:
    processing the training speech features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution;
    simplifying the two-dimensional normal distribution with the simplified model algorithm to obtain the simplified speech features.
  19. The non-volatile readable storage media according to claim 15, characterized in that, after the step of determining the current training subset as a recognition data set and storing the recognition data set in the hierarchical model library, the one or more processors further perform the following steps:
    obtaining the original speech data and the corresponding speaker identifier in each recognition data set;
    performing feature extraction on the original speech data to obtain the original speech features corresponding to the original speech data;
    storing the original speech features and the speaker identifier in association in the recognition data set.
  20. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
    obtaining speech data to be tested, and extracting the speech features to be tested corresponding to the speech data to be tested;
    processing the speech features to be tested according to the model hierarchy logic and the current hierarchical models in a hierarchical model library, to determine a target node;
    using the recognition data set corresponding to the target node as a target data set, each original speech data in the target data set carrying a speaker identifier;
    obtaining the spatial distance between the speech features to be tested and each original speech feature in the target data set, and determining the target speaker identifier corresponding to the speech data to be tested.
PCT/CN2018/104040 2018-06-11 2018-09-05 Model library establishment method, speech recognition method, apparatus, device, and medium WO2019237518A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810592869.8 2018-06-11
CN201810592869.8A CN108922543B (zh) 2018-06-11 Model library establishment method, speech recognition method, apparatus, device, and medium

Publications (1)

Publication Number Publication Date
WO2019237518A1 true WO2019237518A1 (zh) 2019-12-19

Family

ID=64418041

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/104040 WO2019237518A1 (zh) 2018-06-11 2018-09-05 模型库建立方法、语音识别方法、装置、设备及介质

Country Status (2)

Country Link
CN (1) CN108922543B (zh)
WO (1) WO2019237518A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114530163A (zh) * 2021-12-31 2022-05-24 安徽云磬科技产业发展有限公司 Density-clustering-based method and system for identifying equipment life cycles by sound recognition

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060667B (zh) * 2019-03-15 2023-05-30 平安科技(深圳)有限公司 Batch processing method and apparatus for speech information, computer device, and storage medium
CN110148403B (zh) * 2019-05-21 2021-04-13 腾讯科技(深圳)有限公司 Decoding network generation method, speech recognition method, apparatus, device, and medium
CN110414709A (zh) * 2019-06-18 2019-11-05 重庆金融资产交易所有限责任公司 Intelligent debt risk prediction method and apparatus, and computer-readable storage medium
CN110782879B (zh) * 2019-09-18 2023-07-07 平安科技(深圳)有限公司 Sample-size-based voiceprint clustering method, apparatus, device, and storage medium
CN111247585B (zh) * 2019-12-27 2024-03-29 深圳市优必选科技股份有限公司 Voice conversion method, apparatus, device, and storage medium
CN112634863B (zh) * 2020-12-09 2024-02-09 深圳市优必选科技股份有限公司 Training method and apparatus for a speech synthesis model, electronic device, and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1447278A (zh) * 2002-11-15 2003-10-08 郑方 Voiceprint recognition method
CN101562012A (zh) * 2008-04-16 2009-10-21 创而新(中国)科技有限公司 Speech grading assessment method and system
CN104135577A (zh) * 2014-08-27 2014-11-05 陈包容 Method and apparatus for quickly finding a contact based on custom speech
US20150325240A1 (en) * 2014-05-06 2015-11-12 Alibaba Group Holding Limited Method and system for speech input
CN107993071A (zh) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 Electronic device, voiceprint-based identity verification method, and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1099662C (zh) * 1997-09-05 2003-01-22 中国科学院声学研究所 Large-vocabulary continuous speech recognition method for Mandarin Chinese
US6684186B2 (en) * 1999-01-26 2004-01-27 International Business Machines Corporation Speaker recognition using a hierarchical speaker model tree
US6754626B2 (en) * 2001-03-01 2004-06-22 International Business Machines Corporation Creating a hierarchical tree of language models for a dialog system based on prompt and dialog context
US6941264B2 (en) * 2001-08-16 2005-09-06 Sony Electronics Inc. Retraining and updating speech models for speech recognition
CN102789779A (zh) * 2012-07-12 2012-11-21 广东外语外贸大学 Speech recognition system and recognition method thereof
CN104268279B (zh) * 2014-10-16 2018-04-20 魔方天空科技(北京)有限公司 Corpus data query method and apparatus
CN105006231A (zh) * 2015-05-08 2015-10-28 南京邮电大学 Distributed large-population speaker recognition method based on fuzzy-clustering decision trees
CN105096955B (zh) * 2015-09-06 2019-02-01 广东外语外贸大学 Fast speaker recognition method and system based on model-growing clustering
CN107993663A (zh) * 2017-09-11 2018-05-04 北京航空航天大学 Android-based voiceprint recognition method

Also Published As

Publication number Publication date
CN108922543B (zh) 2022-08-16
CN108922543A (zh) 2018-11-30

Similar Documents

Publication Publication Date Title
WO2019237518A1 (zh) Model library establishment method, speech recognition method, apparatus, device, and medium
WO2019237519A1 (zh) Universal vector training method, speech clustering method, apparatus, device, and medium
JP7152514B2 (ja) Voiceprint identification method, model training method, server, and computer program
JP7177167B2 (ja) Mixed speech identification method, apparatus, and computer program
CN109065028B (zh) Speaker clustering method, apparatus, computer device, and storage medium
WO2019232829A1 (zh) Voiceprint recognition method, apparatus, computer device, and storage medium
CN109087648B (zh) Counter speech monitoring method, apparatus, computer device, and storage medium
WO2020177380A1 (zh) Short-text-based voiceprint detection method, apparatus, device, and storage medium
WO2018107810A1 (zh) Voiceprint recognition method, apparatus, electronic device, and medium
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
TW201935464A (zh) Voiceprint recognition method and apparatus based on memorability bottleneck features
WO2020181824A1 (zh) Voiceprint recognition method, apparatus, and device, and computer-readable storage medium
WO2019232826A1 (zh) i-vector extraction method, speaker recognition method, apparatus, device, and medium
WO2019227574A1 (zh) Speech model training method, speech recognition method, apparatus, device, and medium
CN109360572B (zh) Call separation method, apparatus, computer device, and storage medium
WO2020024396A1 (zh) Music style recognition method, apparatus, computer device, and storage medium
WO2020224114A1 (zh) Speaker verification method, apparatus, device, and medium based on a residual time-delay network
WO2022178942A1 (zh) Emotion recognition method, apparatus, computer device, and storage medium
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN113223536B (zh) Voiceprint recognition method, apparatus, and terminal device
WO2022143723A1 (zh) Speech recognition model training method, speech recognition method, and corresponding apparatus
WO2021227259A1 (zh) Accent detection method and apparatus, and non-transitory storage medium
WO2019232848A1 (zh) Speech distinguishing method, apparatus, computer device, and storage medium
WO2019232833A1 (zh) Speech distinguishing method, apparatus, computer device, and storage medium
CN112053694A (zh) Voiceprint recognition method based on fusion of CNN and GRU networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18922303

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18922303

Country of ref document: EP

Kind code of ref document: A1