CN108922543A - Model library establishment method, speech recognition method, apparatus, device and medium - Google Patents

Model library establishment method, speech recognition method, apparatus, device and medium

Publication number: CN108922543A (granted as CN108922543B)
Authority: CN (China)
Application number: CN201810592869.8A
Original language: Chinese (zh)
Inventor: 涂宏 (Tu Hong)
Applicant and current assignee: Ping An Technology (Shenzhen) Co., Ltd.
Priority: CN201810592869.8A; PCT/CN2018/104040 (WO2019237518A1)
Legal status: Active (granted)
Prior art keywords: model, voice data, training, current, model library


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction

Abstract

The invention discloses a model library establishment method, a speech recognition method, an apparatus, a device and a medium. The model library establishment method includes: obtaining a training sample set; if the sample size of the original voice data in the training sample set is greater than a preset threshold, establishing a current classification model from the training voice features extracted from the original voice data, and dividing the original voice data into at least two current training subsets according to the model classification logic; if the sample size of a current training subset is greater than the preset threshold, updating that current training subset to be the training sample set; if the sample size of a current training subset is not greater than the preset threshold, determining that current training subset to be an identification data set. This method enables a recognition server to quickly match the identification data set containing the voice data to be tested according to the model classification logic in the hierarchical model library, improving the efficiency of speech recognition.

Description

Model library establishment method, speech recognition method, apparatus, device and medium
Technical field
The present invention relates to the field of voiceprint recognition, and more particularly to a model library establishment method, a speech recognition method, an apparatus, a device and a medium.
Background art
Voiceprint recognition applications are increasingly favored by system developers and users; the global market share of voiceprint recognition is second only to fingerprint and palmprint biometrics, and continues to rise. The advantages of voiceprint recognition are: (1) acquiring voice containing voiceprint features is convenient and natural, and voiceprint extraction can be completed almost unconsciously, so user acceptance is high; (2) voice acquisition is cheap, requiring only a microphone, and no additional recording equipment is needed when communication devices are used; (3) it is suitable for remote identity confirmation, since a single microphone, telephone or mobile phone is enough to enable remote login through a network (communication network or the Internet); (4) the algorithms for voiceprint identification and verification have low complexity; (5) combined with other measures, such as content recognition via speech recognition, accuracy can be further improved.
Voiceprint recognition generally confirms the target speaker by sequentially comparing the voice to be tested against the speakers' voices in an existing database. When the number of speakers in the database is huge, sequentially comparing to find the target speaker greatly reduces recognition efficiency.
Summary of the invention
Based on this, it is necessary to address the above technical problem by providing a model library establishment method, apparatus, device and medium that can improve recognition efficiency.
In the above model library establishment method, apparatus, device and medium, a current classification model is established from the training voice features extracted from the original voice data; after the current classification model is stored in the hierarchical model library, the original voice data is divided into at least two current training subsets according to the model classification logic, until the sample size of each current training subset is not greater than the preset threshold, at which point the current training subset can be determined to be an identification data set, completing the establishment of the model library. This builds a hierarchical model library that makes it possible to quickly locate an identification data set.
A model library establishment method, including:
obtaining a training sample set, where the training sample set includes at least two pieces of original voice data;
if the sample size of the original voice data in the training sample set is greater than a preset threshold, establishing a current classification model from the training voice features extracted from the original voice data, storing the current classification model in a hierarchical model library, determining the model classification logic in the hierarchical model library, and dividing the original voice data into at least two current training subsets according to the model classification logic;
if the sample size of a current training subset is greater than the preset threshold, updating that current training subset to be the training sample set;
if the sample size of a current training subset is not greater than the preset threshold, determining that current training subset to be an identification data set, and storing the identification data set in the hierarchical model library.
A model library establishment apparatus, including:
a training sample set obtaining module, configured to obtain a training sample set, where the training sample set includes at least two pieces of original voice data;
a hierarchical model storing module, configured to: if the sample size of the original voice data in the training sample set is greater than a preset threshold, establish a current classification model from the training voice features extracted from the original voice data, store the current classification model in a hierarchical model library, determine the model classification logic in the hierarchical model library, and divide the original voice data into at least two current training subsets according to the model classification logic;
a training sample set updating module, configured to: if the sample size of a current training subset is greater than the preset threshold, update that current training subset to be the training sample set;
an identification data set determining module, configured to: if the sample size of a current training subset is not greater than the preset threshold, determine that current training subset to be an identification data set, and store the identification data set in the hierarchical model library.
A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above model library establishment method when executing the computer program.
A computer-readable storage medium storing a computer program, where the computer program implements the steps of the above model library establishment method when executed by a processor.
In the above model library establishment method, apparatus, device and medium, current classification models are established level by level by determining the sample size of the original voice data in the training sample set, until the current training subsets of the training sample set are determined to be identification data sets and stored in the hierarchical model library, completing the establishment of the hierarchical model library. The hierarchical model library stores all original voice data into different identification data sets, which avoids sequentially comparing all original voice data when subsequently identifying voice data to be tested: the identification data set containing the voice data to be tested can be quickly matched according to the model classification logic in the hierarchical model library, improving the efficiency of speech recognition.
Based on this, it is also necessary to provide a speech recognition method, apparatus, device and medium that can improve recognition efficiency.
A speech recognition method, including:
obtaining voice data to be tested, and extracting the voice features to be tested corresponding to the voice data to be tested;
processing the voice features to be tested according to the model classification logic and the current classification models in the hierarchical model library, and determining a target node;
taking the identification data set corresponding to the target node as a target data set, where each piece of original voice data in the target data set carries a speaker identifier;
obtaining the spatial distance between the voice features to be tested and each original voice feature in the target data set, and determining the target speaker identifier corresponding to the voice data to be tested.
A speech recognition apparatus, including:
a tested voice obtaining module, configured to obtain voice data to be tested and extract the voice features to be tested corresponding to the voice data to be tested;
a target node determining module, configured to process the voice features to be tested according to the model classification logic and the current classification models in the hierarchical model library, and determine a target node;
a corresponding identification data set module, configured to take the identification data set corresponding to the target node as a target data set, where each piece of original voice data in the target data set carries a speaker identifier;
a speaker identifier determining module, configured to obtain the spatial distance between the voice features to be tested and each original voice feature in the target data set, and determine the target speaker identifier corresponding to the voice data to be tested.
A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above speech recognition method when executing the computer program.
A computer-readable storage medium storing a computer program, where the computer program implements the steps of the above speech recognition method when executed by a processor.
In the above speech recognition method, apparatus, device and medium, the voice features to be tested are processed according to the model classification logic and the current classification models in the hierarchical model library to determine a target node. Taking the identification data set corresponding to the target node as the target data set, the target speaker identifier corresponding to the voice data to be tested can be determined. This avoids directly comparing the voice data to be tested against all original voice data one by one: the corresponding current classification models are searched level by level through the hierarchical model library to determine the target data set, and only the limited amount of original voice data in the target data set is then compared to determine the target speaker identifier, thereby improving the efficiency of speech recognition.
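Once the target data set has been determined, the final comparison step described above amounts to a nearest-neighbor search over spatial distances. The sketch below is a minimal illustration under assumptions of this example (Euclidean distance, and the names `identify_speaker`, `speaker_id`, `features` are not from the patent):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify_speaker(features_to_test, target_data_set):
    """Return the speaker identifier of the original voice feature in the
    target data set whose spatial distance to the tested features is smallest."""
    best = min(target_data_set,
               key=lambda rec: euclidean(features_to_test, rec["features"]))
    return best["speaker_id"]

# Toy target data set: each record carries a speaker identifier.
target_data_set = [
    {"speaker_id": "spk1", "features": [0.0, 0.0]},
    {"speaker_id": "spk2", "features": [3.0, 4.0]},
    {"speaker_id": "spk3", "features": [10.0, 0.0]},
]
who = identify_speaker([2.9, 4.2], target_data_set)
# [2.9, 4.2] lies closest to [3.0, 4.0], so "spk2" is returned.
```

Because only the records of one identification data set are scanned, the cost of this step stays bounded regardless of how many speakers the whole library holds.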
Detailed description of the invention
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the accompanying drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative effort.
Fig. 1 is a schematic diagram of the application environment of the model library establishment method in an embodiment of the present invention;
Fig. 2 is a flow chart of the model library establishment method in an embodiment of the present invention;
Fig. 3 is another specific flow chart of the model library establishment method in an embodiment of the present invention;
Fig. 4 is another specific flow chart of the model library establishment method in an embodiment of the present invention;
Fig. 5 is another specific flow chart of the model library establishment method in an embodiment of the present invention;
Fig. 6 is another specific flow chart of the model library establishment method in an embodiment of the present invention;
Fig. 7 is a flow chart of the speech recognition method in an embodiment of the present invention;
Fig. 8 is a functional block diagram of the model library establishment apparatus in an embodiment of the present invention;
Fig. 9 is a functional block diagram of the speech recognition apparatus in an embodiment of the present invention;
Fig. 10 is a schematic diagram of the computer device in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The model library establishment method provided in the embodiments of the present invention can be applied in the application environment of Fig. 1, in which a computer device communicates with a recognition server through a network. The computer device includes, but is not limited to, various personal computers, laptops, smartphones, tablets and portable wearable devices. The recognition server can be implemented as an independent server or as a server cluster composed of multiple servers. The computer device can collect original voice data and send it to the recognition server through the network, so that the recognition server can perform modeling using the original voice data it obtains.
In one embodiment, as shown in Fig. 2, a model library establishment method is provided. The method is described by taking its application to the recognition server in Fig. 1 as an example, and includes the following steps:
S10. Obtain a training sample set, where the training sample set includes at least two pieces of original voice data.
The training sample set is a set of multiple pieces of original voice data. Original voice data is voice data entered by a speaker through a computer device; the computer device sends the collected original voice data to the recognition server, and the recognition server stores the received original voice data in a database for subsequent recognition calls.
S20. If the sample size of the original voice data in the training sample set is greater than a preset threshold, establish a current classification model from the training voice features extracted from the original voice data, store the current classification model in the hierarchical model library, determine the model classification logic in the hierarchical model library, and divide the original voice data into at least two current training subsets according to the model classification logic.
The sample size is the number of pieces of original voice data in the training sample set. The preset threshold is a pre-set threshold used to determine whether the training sample set needs to be divided further; that is, the preset threshold is the minimum quantity at which the original voice data can continue to be divided by quantity. For example, with a preset threshold of 100: if the sample size of the original voice data is 99, the original voice data is not divided; if the sample size is 100, the original voice data continues to be divided into at least two current training subsets.
A training voice feature is a voice feature obtained by performing feature extraction on the original voice data. In this embodiment, Mel-Frequency Cepstral Coefficients (hereinafter MFCC features) can be used as the training voice features. Human hearing behaves like a bank of filters that focuses only on certain specific frequency components (human hearing is nonlinear in frequency); that is, the ear receives sound signals only within a limited frequency range. Moreover, these filters are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparsely distributed. The Mel-scale filter bank has high resolution in the low-frequency part, which is consistent with the auditory characteristics of the human ear; therefore, using Mel-frequency cepstral coefficients as the training voice features can accurately reflect the speaker's voice characteristics.
The current classification model is a speech model obtained by projecting the training voice features corresponding to the multiple pieces of original voice data belonging to the same set onto a low-dimensional total variability subspace, with the resulting fixed-length characterization vector representing the speech model corresponding to the multiple pieces of original voice data belonging to that set.
It can be understood that the hierarchical model library is a database that contains multiple hierarchical models and stores each hierarchical model according to the model classification logic. The model classification logic specifically includes: saving each hierarchical model on the node at the corresponding position in the hierarchical model library established by a tree structure. The tree structure includes one root node and multiple child nodes; the root node has no predecessor node, and every other child node has one and only one predecessor node. If the root node or a child node has successor nodes, it includes at least two successor nodes. The child nodes include the leaf nodes at the lowest level and the intermediate nodes between the root node and the leaf nodes; a child node without successor nodes is a leaf node, and a child node with successor nodes is an intermediate node.
Every time a new current classification model is established, an association relationship must be created for it: the predecessor node of the child node where the current classification model is located in the tree structure is obtained, along with the upper-level hierarchical model stored on that predecessor node; the current classification model and the upper-level hierarchical model are then associated in the hierarchical model library to realize the logical relationship required by the model classification logic.
A current training subset is a sub-data set formed by dividing the original voice data in the training sample set equally by quantity.
In step S20, the recognition server establishes a current classification model for the original voice data and stores it in the hierarchical model library, which facilitates voice clustering decisions based on the current classification model during speech recognition.
S30. If the sample size of a current training subset is greater than the preset threshold, update that current training subset to be the training sample set.
The current training subset is the data set formed after the original voice data is divided equally by quantity in step S20.
The training sample set includes at least two pieces of original voice data, which are voice data entered by a speaker through a computer device. It should be noted that, in step S30, when the sample size of a current training subset is greater than the preset threshold, that current training subset is updated to be the training sample set so that step S20 and its subsequent steps can be repeated for it, thereby determining whether the current training subset needs further division.
S40. If the sample size of a current training subset is not greater than the preset threshold, determine that current training subset to be an identification data set, and store the identification data set in the hierarchical model library.
An identification data set is a current training subset whose sample size of original voice data is not greater than the preset threshold. The sample size of the original voice data in the current training subset is compared with the preset threshold; a current training subset whose sample size is not greater than the preset threshold is determined to be an identification data set and is not divided further. It can be understood that, for ease of finding specific original voice data during speech recognition, the recognition server stores all identification data sets in the hierarchical model library in step S40.
In the model library establishment method proposed by the embodiment of the present invention, a current classification model is established from the training voice features extracted from the original voice data; after the current classification model is stored in the hierarchical model library, the original voice data is divided into at least two current training subsets according to the model classification logic, until the sample size of every current training subset is not greater than the preset threshold, at which point the current training subsets can be determined to be identification data sets, completing the establishment of the model library. This builds a hierarchical model library that makes identification data sets quick to locate, helping to improve the efficiency of subsequent speech recognition based on the hierarchical model library.
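The recursive flow of steps S10 to S40 can be sketched as follows. This is a minimal illustration under assumed names (`build_model_library`, `train_classification_model`), not the patent's implementation; the classification model here is a stub, and the division simply splits the samples in half by quantity, as described above.

```python
PRESET_THRESHOLD = 100  # the patent gives 100 only as an example value

def train_classification_model(samples):
    """Stub standing in for establishing a current classification model (S20)."""
    return {"n_samples": len(samples)}

def build_model_library(training_sample_set, preset_threshold=PRESET_THRESHOLD):
    """Recursively build the hierarchical model library as a nested dict.

    A node whose sample size exceeds the threshold stores a classification
    model and at least two child subsets (S20/S30); otherwise it becomes a
    leaf holding an identification data set (S40).
    """
    if len(training_sample_set) <= preset_threshold:
        return {"identification_data_set": training_sample_set}
    model = train_classification_model(training_sample_set)
    mid = len(training_sample_set) // 2  # divide equally by quantity
    subsets = [training_sample_set[:mid], training_sample_set[mid:]]
    return {
        "classification_model": model,
        "children": [build_model_library(s, preset_threshold) for s in subsets],
    }

library = build_model_library(list(range(250)))
# 250 > 100, so the root stores a model and splits into two subsets of 125;
# each subset of 125 splits again into leaves of 62 and 63 samples.
```

The recursion terminates exactly when every subset is at or below the preset threshold, mirroring the update loop of steps S20 to S40.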
In one embodiment, as shown in Fig. 3, before step S10, that is, before the step of obtaining the training sample set, the model library establishment method further includes:
S11. Create the hierarchical model library using a tree structure, where the tree structure includes one root node and at least two child nodes associated with the root node.
A tree structure is an important nonlinear data structure in which the data elements have a "one-to-many" tree-like relationship. In a tree structure, the root node has no predecessor node, and every other node has one and only one predecessor node. The child nodes include the leaf nodes at the lowest level and the intermediate nodes between the root node and the leaf nodes. A leaf node has no successor nodes, while every other node (including the root node and the intermediate nodes) may have one or more successor nodes.
Step S11 establishes the hierarchical model library using a tree structure, which associates multiple hierarchical models through the root node and child node relationships, facilitating the subsequent rapid search for the original voice data corresponding to a speaker based on the determined associations.
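The tree structure described above can be sketched as a simple node type. The class and field names (`TreeNode`, `model`, `identification_data_set`) are illustrative assumptions of this sketch, not the patent's naming:

```python
class TreeNode:
    """A node of the hierarchical model library's tree structure.

    The root node has no predecessor; every other node has exactly one.
    An intermediate node holds a hierarchical model and at least two
    children; a leaf node holds an identification data set instead.
    """
    def __init__(self, model=None, identification_data_set=None):
        self.model = model                                      # intermediate nodes
        self.identification_data_set = identification_data_set  # leaf nodes
        self.predecessor = None
        self.children = []

    def add_child(self, child):
        """Attach a successor node, recording this node as its predecessor."""
        child.predecessor = self
        self.children.append(child)
        return child

    def is_leaf(self):
        return not self.children

# Root with two associated child nodes, as required by step S11.
root = TreeNode()
left = root.add_child(TreeNode(model="model-A"))
right = root.add_child(TreeNode(model="model-B"))
```

Keeping the predecessor pointer explicit mirrors the association step described for every newly established classification model.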
In a specific embodiment, step S20, i.e. storing the current classification model in the hierarchical model library and determining the model classification logic in the hierarchical model library, specifically includes the following step:
S12. Store the current classification model in a child node of the tree structure, and determine the model classification logic according to the storage location of the current classification model in the tree structure.
The current classification model is a speech model obtained by projecting the training voice features corresponding to the multiple pieces of original voice data belonging to the same set onto a low-dimensional total variability subspace, with the resulting fixed-length characterization vector representing the speech model corresponding to the multiple pieces of original voice data belonging to that set.
The model classification logic refers to the logical relationship of saving each hierarchical model, level by level, on the node of the corresponding level in the hierarchical model library established by the tree structure. The tree structure includes one level-zero root node and at least two successor nodes associated with it, also called first-level child nodes; if a first-level child node has successor nodes, it is associated with at least two second-level child nodes, and so on, until the leaf nodes of the tree structure (leaf nodes have no associated successor nodes). The child nodes include the leaf nodes at the lowest level and the intermediate nodes between the root node and the leaf nodes; a child node without successor nodes is a leaf node, and a child node with successor nodes is an intermediate node.
It can be understood that the level-zero root node has no predecessor node, and each child node at every other level has one and only one predecessor node. If the voice data corresponding to the current node needs to be divided, it is divided into at least two voice subsets; to give each voice subset a corresponding successor node, the current node (root node or intermediate node) includes at least two successor nodes.
According to the logical relationship of the above tree structure and the requirements of this embodiment, no hierarchical model is stored on the level-zero root node or the leaf nodes of the tree; every other intermediate node stores, level by level, one hierarchical model of the corresponding level.
Further, the specific implementation process of storing, level by level, the hierarchical model of the corresponding level in association with each intermediate node is as follows:
1. First obtain the voice data corresponding to a level-N predecessor node. If the sample size of the voice data corresponding to the predecessor node is greater than the preset quantity threshold for division, the voice data needs to be divided into at least two level-N+1 voice subsets. After the division is completed, a level-N+1 hierarchical model of the corresponding level is established for each voice subset according to the voice data of that subset.
2. Establish at least two level-N+1 child nodes for the predecessor node (the number of level-N+1 child nodes equals the number of level-N+1 voice subsets), and associate each level-N+1 voice subset with a level-N+1 child node.
3. Determine whether the level-N+1 voice subset corresponding to each level-N+1 child node needs to be divided again; if so, repeat steps 1 and 2 until the voice subset corresponding to each child node no longer needs division.
In step S12, the recognition server stores each established current classification model on a child node according to the classification logic. During speech recognition, this makes it possible to find the corresponding at least two level-N+1 child nodes from a level-N child node, match the voice data to be identified at the level-N child node against the hierarchical model corresponding to each level-N+1 child node, obtain the current level-N+1 child node corresponding to the classification model with the highest matching degree, and continue assigning the voice data to be identified to that current level-N+1 child node.
The voice data to be identified is then repeatedly compared against the hierarchical models corresponding to each level-N+2 child node of the current level-N+1 child node, until the voice data to be identified is assigned to a leaf node of the tree. The original voice data corresponding to the speakers is stored on the leaf node; the voice data to be identified is matched against each piece of original voice data on the leaf node, and the speaker corresponding to the original voice data with the highest matching degree is obtained, which is the result of the speech recognition.
In steps S11 and S12, the recognition server establishes the hierarchical model library using a tree structure, which associates multiple hierarchical models through the predecessor and child node relationships, facilitating the subsequent rapid lookup of the original voice data corresponding to a speaker based on these associations. The recognition server stores each established current classification model on a child node according to the classification logic, so that during speech recognition the corresponding at least two child nodes, and hence the corresponding lower-level hierarchical models, can be found starting from the current classification model at the root.
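The level-by-level matching just described can be sketched as a descent through the tree. `score` is a stand-in for matching the voice data to be identified against a child's hierarchical model, and the dict-based node layout is an assumption of this sketch:

```python
def classify_to_leaf(node, features, score):
    """Descend from the root: at each level pick the child whose hierarchical
    model matches the features best, until a leaf node is reached."""
    while node.get("children"):
        node = max(node["children"],
                   key=lambda child: score(features, child["model"]))
    return node  # leaf node holding an identification data set

# Toy tree: models are reference values, score is negative absolute distance.
tree = {
    "children": [
        {"model": 10.0, "children": [
            {"model": 5.0, "identification_data_set": ["spk1", "spk2"]},
            {"model": 15.0, "identification_data_set": ["spk3"]},
        ]},
        {"model": 50.0, "children": [
            {"model": 40.0, "identification_data_set": ["spk4"]},
            {"model": 60.0, "identification_data_set": ["spk5"]},
        ]},
    ],
}
score = lambda x, m: -abs(x - m)
leaf = classify_to_leaf(tree, 14.0, score)
# 14.0 matches 10.0 better than 50.0, then 15.0 better than 5.0.
```

Only one branch per level is explored, which is why the descent replaces a full scan of all original voice data.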
In one embodiment, as shown in Fig. 4, step S20, i.e. establishing the current classification model from the training voice features extracted from the original voice data, specifically includes the following steps:
S21. Perform feature extraction on the original voice data to obtain the training voice features.
Original voice data is voice data entered by a speaker through a computer device; the computer device sends the collected original voice data to the recognition server, and the recognition server stores the received original voice data in a database for subsequent recognition calls.
A training voice feature is a voice feature obtained by performing feature extraction on the original voice data. In this embodiment, Mel-Frequency Cepstral Coefficients (hereinafter MFCC features) can be used as the training voice features. Human hearing behaves like a bank of filters that focuses only on certain specific frequency components (human hearing is nonlinear in frequency); that is, the ear receives sound signals only within a limited frequency range. Moreover, these filters are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparsely distributed. The Mel-scale filter bank has high resolution in the low-frequency part, which is consistent with the auditory characteristics of the human ear; therefore, using Mel-frequency cepstral coefficients as the training voice features can accurately reflect the speaker's voice characteristics.
In step S21, the identification server obtains the training voice features (MFCC features) of the primary voice data for model training; these effectively characterize the data set to which the primary voice data belongs.
S22. Simplify the training voice features with a simplified-model algorithm to obtain the simplified voice features.
Here, the simplified-model algorithm is a Gaussian blur (Gaussian smoothing) algorithm used to reduce the noise and the level of detail of the voice data. The simplified voice features are the purer voice features that remain after the simplified-model algorithm has removed the noise.
In step S22, simplifying the training voice features with the simplified-model algorithm specifically consists of first obtaining the two-dimensional normal distribution of the training voice features and then blurring all phonemes of that distribution. The resulting simplified voice features largely preserve the characteristics of the training voice features and help improve the efficiency of the subsequent training of the current hierarchy model.
S23. Iterate over the simplified voice features with the expectation-maximization algorithm to obtain the total variability subspace.
Here, the expectation-maximization algorithm (Expectation-Maximization Algorithm, hereinafter EM algorithm) is an iterative algorithm used in statistics to find maximum-likelihood estimates of the parameters of probability models that depend on unobservable latent variables.
The total variability subspace (Total Variability Space, hereinafter the T space) is a single global mapping matrix that contains all possible speaker information in the voice data; it does not separate speaker space from channel space. The T space maps a high-dimensional sufficient statistic (supervector) to a low-dimensional speaker representation, the i-vector (identity vector), thereby performing dimensionality reduction. Training the T space consists of starting from a preset UBM (universal background model) and computing the T space to convergence with factor analysis and the EM (Expectation-Maximization) algorithm.
Iterating over the simplified voice features with the EM algorithm to obtain the T space proceeds as follows:
Given a sample set x = (x^(1), x^(2), ..., x^(m)) of m independent samples, where the class z^(i) of each sample x^(i) is unknown, the parameter θ of the joint distribution probability model p(x, z | θ) and of the conditional distribution probability model p(z | x, θ) must be estimated; that is, suitable θ and z must be found that maximize the log-likelihood

L(θ) = Σ_{i=1}^{m} log Σ_{z^(i)} p(x^(i), z^(i) | θ),

with J the maximum number of iterations:
1) Randomly initialize the model parameter θ of the simplified voice features to an initial value θ^0.
2) For j from 1 to J (the maximum number of iterations), run one EM iteration:
a) E-step: compute the conditional expectation of the joint distribution. Using the initial value of θ (or the parameter value from the previous iteration), compute the posterior probability Q_i(z^(i)) of the latent variable, which serves as its current estimate:

Q_i(z^(i)) = P(z^(i) | x^(i), θ^j)
b) M-step: maximize L(θ, θ^j) to obtain θ^{j+1} (the likelihood function is maximized to yield the new parameter value):

θ^{j+1} = argmax_θ Σ_{i=1}^{m} Σ_{z^(i)} Q_i(z^(i)) log p(x^(i), z^(i) | θ)
c) If θ^{j+1} from the M-step has converged, the algorithm terminates; otherwise return to step a) for another E-step.
3) Output: the model parameter θ of the T space.
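The three-step loop above is the generic EM recipe. As an illustrative sketch only — a two-component one-dimensional Gaussian mixture rather than the patent's T-space training on supervectors — the E-step/M-step iteration with a convergence check can be written as:

```python
import math

def em_gmm_1d(data, n_iter=50, tol=1e-6):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # Step 1): initialise the model parameters theta_0.
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    prev_ll = -math.inf
    for _ in range(n_iter):                   # step 2): iterate up to J times
        # E-step: posterior responsibilities Q_i(z) of the latent component.
        resp = []
        for x in data:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: maximise the expected complete-data log-likelihood.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-3)        # floor to avoid a degenerate variance
            pi[k] = nk / len(data)
        # Step c): stop once the log-likelihood has converged.
        ll = sum(math.log(sum(pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                              / math.sqrt(2 * math.pi * var[k]) for k in range(2)))
                 for x in data)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, var, pi                        # step 3): output the parameters
```

The variance floor and the min/max initialisation are implementation choices, not part of the source's description.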
The total variability subspace obtained in step S23 does not distinguish speaker space from channel space; it merges the vocal-tract and channel information into a single space, which reduces computational complexity and allows the simplified current universal voice vector to be obtained from it.
S24. Project the simplified voice features onto the total variability subspace to obtain the current hierarchy model.
Here, the simplified voice features are the voice features obtained after processing with the simplified-model algorithm in step S22.
The current universal voice vector is the fixed-length representation vector obtained by projecting the simplified voice features onto the low-dimensional total variability subspace; it represents the multiple primary voice data belonging to the same set and forms the current hierarchy model.
In steps S21 to S24, the identification server simplifies the training voice features with the simplified-model algorithm and then projects the simplified voice features onto the total variability subspace, yielding a purer and simpler current hierarchy model. Subsequent voice clustering of the speaker's voice data against this model is therefore less complex and faster.
In one embodiment, as shown in FIG. 5, step S22 — simplifying the training voice features with the simplified-model algorithm to obtain the simplified voice features — specifically includes the following steps:
S221. Process the training voice features with a Gaussian filter to obtain the corresponding two-dimensional normal distribution.
Here, a Gaussian filter performs linear smoothing on the input training voice features; it is suited to eliminating Gaussian noise and is widely used for noise reduction. Processing the training voice features with the Gaussian filter is a weighted-averaging process: the value of each phoneme in the training voice features is replaced by a weighted average of itself and the other phoneme values in its neighborhood.
A two-dimensional normal distribution (also called a two-dimensional Gaussian distribution) satisfies the following density-function properties: it is symmetric about μ, reaches its maximum at μ, tends to 0 at positive and negative infinity, and has inflection points at μ ± σ. Its shape is high in the middle and low on both sides — a bell curve above the x-axis.
Specifically, the Gaussian filter processes the training voice features as follows: a 3×3 mask scans every phoneme of the training voice data, and the value of the phoneme at the center of the mask is replaced by the weighted average of the phonemes in the neighborhood determined by the mask, yielding the two-dimensional normal distribution of the training voice data. The weighted average of each phoneme is computed as follows:
(1) Sum the weights of each phoneme. (2) Scan the phonemes of the training voice features one by one; using the weights at each position, compute the weighted average of the neighborhood and assign it to the phoneme at the current position. (3) Repeat step (2) until all phonemes of the training voice features have been processed.
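A minimal sketch of the weighted-average scan in steps (1)–(3), reduced to one dimension with an assumed 3-tap normalised kernel (the patent's 3×3 mask is the two-dimensional analogue):

```python
def smooth_features(values, kernel=(0.25, 0.5, 0.25)):
    """Replace each value by a weighted average of itself and its neighbours,
    mirroring the mask-based smoothing described above (1-D for simplicity)."""
    assert abs(sum(kernel) - 1.0) < 1e-9   # step (1): the weights are normalised
    out = []
    for i in range(len(values)):           # step (2): scan every element
        left = values[max(i - 1, 0)]       # replicate edge values at the borders
        right = values[min(i + 1, len(values) - 1)]
        out.append(kernel[0] * left + kernel[1] * values[i] + kernel[2] * right)
    return out                             # step (3): all elements processed
```

The kernel weights and the edge-replication policy are assumptions; the source only specifies that a neighborhood weighted average is taken.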
Through step S221, the noise in the training voice features is removed and a linearly smoothed output is produced, giving a clean signal for further processing.
S222. Simplify the two-dimensional normal distribution with the simplified-model algorithm to obtain the simplified voice features.
In this embodiment, the simplified-model algorithm uses the Gaussian blur algorithm to simplify the two-dimensional normal distribution.
Specifically, the Gaussian blur algorithm simplifies the two-dimensional normal distribution as follows: each phoneme takes the average value of the surrounding phonemes, i.e., each "center point" takes the average of its "surrounding points". Numerically this is a smoothing; graphically it produces a blur in which the center point loses detail. Clearly, the larger the averaging range, the stronger the blur.
In step S222, the identification server uses the simplified-model algorithm to obtain, from the two-dimensional normal distribution of the training voice features, the simplified voice features, further reducing the level of voice detail in the training voice features.
In steps S221 to S222, the identification server successively denoises the training voice features and reduces their detail, obtaining clean, simple simplified voice features that improve the recognition efficiency of voice clustering.
In one embodiment, as shown in FIG. 6, after step S40 — determining the current training subset to be an identification data set and storing the identification data set in the hierarchy model library — the model library establishing method further includes:
S41. Obtain the primary voice data and the corresponding speaker identifier in each identification data set.
Here, an identification data set is a data set in which the number of primary voice data items does not exceed the preset threshold; such a data set is defined as an identification data set and is not divided further. The hierarchy model library is the database containing the hierarchy models.
The primary voice data is the voice data recorded by the speaker through a computer device. Correspondingly, the speaker identifier is the speaker-identity identifier of the primary voice data; it indicates the speaker's unique identity and may be a user ID, a mobile-phone number, an ID-card number, or the like.
S42. Perform feature extraction on the primary voice data to obtain the corresponding original voice features.
Here, the original voice features are the voice features that distinguish the speaker from others, specifically the voice features obtained by performing feature extraction on the primary voice data. In this embodiment, Mel-Frequency Cepstral Coefficients (hereinafter MFCC features) are used as the original voice features. The original voice features of the primary voice data are obtained as follows:
S421: Pre-process the primary voice data to obtain pre-processed voice data.
In a specific embodiment, step S421 — pre-processing the primary voice data to obtain the pre-processed voice data — specifically includes the following steps:
S4211: Apply pre-emphasis to the primary voice data. The pre-emphasis formula is s'_n = s_n - a*s_(n-1), where s_n is the signal amplitude in the time domain, s_(n-1) is the signal amplitude at the previous moment, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with 0.9 < a < 1.0.
Here, pre-emphasis is a signal-processing technique that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal suffers heavy loss in transmission; for the receiving end to obtain a good signal waveform, the impaired signal must be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the transmitting end of the line, compensating for their excessive attenuation in transmission. Pre-emphasis does not affect the noise and therefore effectively improves the output signal-to-noise ratio.
In this embodiment, pre-emphasis is applied to the primary voice data using s'_n = s_n - a*s_(n-1), where s_n is the time-domain signal amplitude (the amplitude of the voice data expressed in the time domain), s_(n-1) is the signal amplitude at the previous moment, s'_n is the time-domain signal amplitude after pre-emphasis, and a is the pre-emphasis coefficient, 0.9 < a < 1.0; a = 0.97 gives good results. Pre-emphasis eliminates interference caused by the vocal cords and lips during phonation, effectively compensates the suppressed high-frequency part of the primary voice data, highlights its high-frequency formants, and strengthens its signal amplitude, all of which help extract the training voice features.
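The pre-emphasis formula translates directly into code. This is a minimal sketch; passing the first sample through unchanged is an assumption, since the source does not specify the boundary handling:

```python
def pre_emphasis(signal, a=0.97):
    """Apply s'_n = s_n - a * s_(n-1); the first sample is passed through."""
    return [signal[0]] + [signal[n] - a * signal[n - 1]
                          for n in range(1, len(signal))]
```

With a close to 1, slowly varying (low-frequency) content largely cancels while rapid (high-frequency) changes are preserved, which is exactly the high-frequency boost described above.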
S4212: Divide the pre-emphasized primary voice data into frames.
Specifically, after pre-emphasis, the primary voice data is divided into frames. Framing is a voice-processing technique that cuts the whole voice signal into segments; each frame is 10–30 ms long, and the frame shift is usually half the frame length. The frame shift is the overlap between two adjacent frames and prevents them from changing too abruptly. Framing divides the primary voice data into segments, which facilitates the extraction of the training voice features.
S4213: Apply a window to each frame of the primary voice data to obtain the pre-processed voice data. The windowing formula is

s'_n = s_n * (0.54 - 0.46*cos(2πn/(N-1))),

where N is the window length, n is time, s_n is the time-domain signal amplitude, and s'_n is the time-domain signal amplitude after windowing.

Specifically, after the primary voice data has been framed, discontinuities appear at the start and end of each frame, so the framed signal deviates from the primary voice data. Windowing solves this problem: it makes the framed primary voice data continuous and lets each frame exhibit the characteristics of a periodic function. Windowing means processing the primary voice data with a window function; a Hamming window may be chosen, in which case the formula is as above, with N the Hamming-window length. Windowing the primary voice data to obtain the pre-processed voice data makes the time-domain signal of the framed primary voice data continuous, which helps extract the training voice features of the primary voice data.
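Framing and windowing as described in S4212–S4213 can be sketched as follows; the frame length, the half-frame hop default, and the standard Hamming coefficients 0.54/0.46 are assumptions where the source leaves details open:

```python
import math

def frame_and_window(signal, frame_len, hop=None):
    """Split a signal into overlapping frames (hop defaults to half the frame
    length, per the text) and apply a Hamming window
    w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) to every frame."""
    hop = hop or frame_len // 2
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

Each windowed frame tapers to small values at both ends, removing the frame-edge discontinuities described above before the Fourier transform of step S422.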
Steps S4211 to S4213 above are the pre-processing operations on the primary voice data; they provide the basis for extracting the original voice features and make the extracted original voice features more representative of the primary voice data.
S422: Apply a fast Fourier transform to the pre-processed voice data to obtain the frequency spectrum of the primary voice data, and obtain the power spectrum of the primary voice data from the spectrum.
Here, the fast Fourier transform (Fast Fourier Transformation, FFT) is the collective name for efficient, fast computer algorithms for computing the discrete Fourier transform. Using such an algorithm greatly reduces the number of multiplications the computer needs to compute the discrete Fourier transform; the more sampling points are transformed, the greater the savings in computation.
Specifically, the fast Fourier transform converts the pre-processed voice data from the signal amplitude in the time domain to the signal amplitude in the frequency domain (the spectrum). The spectrum is computed as

s(k) = Σ_{n=0}^{N-1} s(n)·e^(-2πikn/N), 0 ≤ k ≤ N-1,

where N is the frame size, s(k) is the frequency-domain signal amplitude, s(n) is the time-domain signal amplitude, n is time, and i is the imaginary unit. Once the spectrum of the pre-processed voice data is obtained, its power spectrum — hereinafter called the power spectrum of the primary voice data — follows directly:

P(k) = |s(k)|² / N,

where N is the frame size and s(k) is the frequency-domain signal amplitude. Converting the pre-processed voice data from the time-domain amplitude to the frequency-domain amplitude and deriving the power spectrum from it provides the important technical basis for extracting the original voice features from the power spectrum of the primary voice data.
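A naive discrete Fourier transform makes the spectrum and power-spectrum computation concrete; a real FFT implementation produces the same values far faster, so this is a sketch that favours clarity over speed:

```python
import cmath

def power_spectrum(frame):
    """Naive DFT of one frame followed by the power spectrum:
    S(k) = sum_n s(n) * exp(-2*pi*i*k*n/N);  P(k) = |S(k)|^2 / N."""
    n_fft = len(frame)
    spectrum = [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                    for n in range(n_fft))
                for k in range(n_fft)]
    return [abs(s) ** 2 / n_fft for s in spectrum]
```

For a constant (DC) frame all the energy lands in bin 0, which is a quick sanity check on the implementation.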
S423: Process the power spectrum of the primary voice data with a mel-scale filter bank to obtain the mel power spectrum of the primary voice data.
Here, processing the power spectrum of the primary voice data with a mel-scale filter bank is a mel-frequency analysis of the power spectrum, an analysis based on human auditory perception. Experiments show that the human ear behaves like a filter bank, attending only to certain specific frequency components (human hearing is nonlinear in frequency); that is, the ear admits only a limited range of sound frequencies. These filters are not uniformly spaced on the frequency axis: there are many densely packed filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparsely distributed. It can be appreciated that a mel-scale filter bank has high resolution in the low-frequency region, consistent with the auditory characteristics of the human ear; this is the physical meaning of the mel scale.
In this embodiment, the power spectrum of the primary voice data is processed with the mel-scale filter bank to obtain the mel power spectrum of the primary voice data: the filter bank partitions the frequency-domain signal so that each frequency band corresponds to one value. If the number of filters is 22, for example, the mel power spectrum of the primary voice data yields 22 energy values. The mel-frequency analysis of the power spectrum keeps the frequency components that are closely related to the characteristics of the human ear, so the mel power spectrum reflects the features of the primary voice data well.
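A sketch of a triangular mel-scale filter bank — dense at low frequencies, sparse at high, as described above — under the common 2595·log10(1 + f/700) mel mapping, which the source does not spell out:

```python
import math

def mel_filterbank(n_filters, n_fft_bins, sample_rate):
    """Build triangular filters spaced evenly on the mel scale; applying each
    filter to a power spectrum yields one mel-energy value per band."""
    def hz_to_mel(f):
        return 2595.0 * math.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_filters triangles need n_filters + 2 evenly spaced mel points.
    mel_points = [max_mel * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(n_fft_bins * mel_to_hz(m) / (sample_rate / 2.0))
            for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * n_fft_bins
        for k in range(left, centre):          # rising edge of the triangle
            filt[k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling edge of the triangle
            filt[k] = (right - k) / max(right - centre, 1)
        bank.append(filt)
    return bank
```

With 22 filters, as in the text's example, multiplying a power spectrum by each of the 22 rows and summing gives the 22 mel energy values.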
S424: Perform cepstral analysis on the mel power spectrum to obtain the MFCC features of the primary voice data.
Here, the cepstrum refers to the inverse Fourier transform of the logarithm of a signal's Fourier spectrum; since the ordinary Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
Specifically, cepstral analysis is performed on the mel power spectrum, and the MFCC features of the primary voice data are obtained from the cepstral result. Through the cepstral analysis on the mel power spectrum, the features contained in the mel power spectrum of the primary voice data — originally of too high a dimension to use directly — are converted into easy-to-use features (MFCC feature vectors for training or identification). The MFCC features can serve as original voice features that distinguish different voices: they reflect the differences between voices and can be used to identify and distinguish the primary voice data.
In a specific embodiment, step S424 — performing cepstral analysis on the mel power spectrum to obtain the MFCC features of the primary voice data — includes the following steps:
S4241: Take the logarithm of the mel power spectrum to obtain the mel power spectrum to be transformed.
Specifically, following the definition of the cepstrum, the logarithm log of the mel power spectrum is taken, giving the mel power spectrum m to be transformed.
S4242: Apply a discrete cosine transform to the mel power spectrum to be transformed to obtain the MFCC features of the primary voice data.
Specifically, a discrete cosine transform (Discrete Cosine Transform, DCT) is applied to the mel power spectrum m to be transformed, giving the MFCC features of the corresponding primary voice data; the 2nd to 13th coefficients are usually taken as the original voice features, which reflect the differences between voice data. The discrete cosine transform of the mel power spectrum m to be transformed is

C(j) = Σ_{k=1}^{N} m(k)·cos(πj(k-0.5)/N),

where N is the frame length, m is the mel power spectrum to be transformed, and j is the index of the mel power spectrum to be transformed. Because the mel filters overlap, the energy values obtained with the mel-scale filter bank are correlated; the discrete cosine transform can reduce the dimension of the mel power spectrum m to be transformed, compressing and abstracting it to yield the original voice features indirectly. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which is a clear computational advantage.
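The DCT step can be sketched directly from the standard DCT-II form; the number of coefficients kept is an assumption (the text's "2nd to 13th" convention suggests 13):

```python
import math

def dct_cepstrum(log_mel, n_coeffs=13):
    """DCT-II of the log mel power spectrum:
    C(j) = sum_{k=1..N} m(k) * cos(pi * j * (k - 0.5) / N).
    MFCC features conventionally keep the 2nd..13th coefficients."""
    n = len(log_mel)
    return [sum(log_mel[k] * math.cos(math.pi * j * (k + 0.5) / n)
                for k in range(n))
            for j in range(n_coeffs)]
```

A constant input concentrates everything in coefficient 0 and zeroes the rest, reflecting how the DCT decorrelates the overlapping mel-filter energies.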
In steps S421 to S424, feature extraction is applied to the primary voice data; the resulting original voice features embody the primary voice data well, and the corresponding current hierarchy model trained on them yields more accurate results during speech recognition.
In step S42, using Mel-frequency cepstral coefficients as the original voice features accurately captures the voice characteristics of the primary voice data, so the current hierarchy model trained on the original voice features achieves higher recognition accuracy.
S43. Store the original voice features and the speaker identifier in association in the identification data set.
Here, the original voice features are the features of the primary voice data obtained in step S42, and the speaker identifier is the speaker identifier of the primary voice data obtained in step S41.
Storing the original voice features and the speaker identifier in association in the identification data set in step S43 allows the primary voice data corresponding to a speaker identifier to be retrieved rapidly from that data set later, so that speech recognition can be performed on the primary voice data.
In steps S41 to S43, the identification server uses Mel-frequency cepstral coefficients as the voice features, which accurately capture the characteristics of the primary voice data, and stores the original voice features and the speaker identifiers in association in the identification data set; based on that identification data set, the primary voice data corresponding to a speaker identifier can later be retrieved rapidly for speech recognition.
In the model library establishing method provided by this embodiment of the present invention, hierarchy models are built level by level according to the number of primary voice data samples in the training sample set, until the current training subsets of the training sample set are determined to be identification data sets; storing the identification data sets in the hierarchy model library completes the library. The hierarchy model library distributes all the primary voice data across different identification data sets, so that when voice data to be tested is later identified it need not be compared against every primary voice data one by one: the classification logic of the models in the hierarchy model library quickly matches the identification data set to which the voice data to be tested belongs, improving the efficiency of speech recognition.
Preferably, the identification server builds the hierarchy model library as a tree, so that the hierarchy models are linked through root-node/child-node relationships; these associations later allow the primary voice data corresponding to a speaker to be found rapidly. The identification server stores each newly built current hierarchy model on the root node or a child node according to the classification logic, so that during speech recognition the next-level hierarchy models on the at least two successor nodes can be found from the current hierarchy model on the root node or a child node.
In one embodiment, as shown in FIG. 7, a speech recognition method is provided. Taking its application to the identification server in FIG. 1 as an example, the method includes the following steps:
S50. Obtain the voice data to be tested and extract the corresponding voice features to be tested.
Here, the voice data to be tested is voice data that needs testing, specifically voice data whose corresponding speaker identifier in the hierarchy model library must be confirmed. The voice features to be tested are the corresponding MFCC features obtained by performing feature extraction on the voice data to be tested. The feature-extraction process of step S50 is the same as steps S421 to S424 above and, to avoid repetition, is not described again here.
In step S50, using Mel-frequency cepstral coefficients as the voice features to be tested accurately captures the speaker's voice characteristics.
S60. Process the voice features to be tested according to the model classification logic and the current hierarchy models in the hierarchy model library, and determine the target node.
Here, the hierarchy model library is the database generated by steps S10 to S40; it contains the trained current hierarchy models and stores each of them according to the classification logic. The model classification logic is as follows: each hierarchy model is saved on the node at the corresponding position of the tree on which the hierarchy model library is built. The tree comprises one root node and multiple child nodes; the root node has no predecessor node, and every other child node has exactly one predecessor node. If the root node or a child node has successor nodes, it has at least two. The child nodes comprise the leaf nodes at the ends and the intermediate nodes between the root node and the leaf nodes: a child node without successor nodes is a leaf node, and a child node with successor nodes is an intermediate node.
Whenever a new current hierarchy model is built, an association must be established for it: the position of its child node in the tree is found, and the upper-level hierarchy model stored on that child node's predecessor node is obtained. The current hierarchy model and the upper-level hierarchy model are then associated in the hierarchy model library, realizing the logical relationships required by the model classification logic.
The target node is found by processing the voice features to be tested according to the model classification logic and the current hierarchy models in the hierarchy model library: starting from the root node of the tree formed by the hierarchy model library, the associations are followed downward until a leaf node without successor nodes is reached; that leaf node is the target node.
In step S60, the identification server processes the voice features to be tested according to the model classification logic and the current hierarchy models in the hierarchy model library. This accelerates identification: rather than comparing the voice data to be tested directly against the primary voice data one by one, the server follows the tree until the target node is found, greatly reducing the number of voice-feature comparisons that must actually be made.
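The descent from root to leaf described in step S60 can be sketched with a minimal tree structure; the `router` callable stands in for the per-node hierarchy model and is purely hypothetical:

```python
class Node:
    """One node of the hierarchy-model tree: an internal node holds a routing
    model that sends a feature to one of its >= 2 children; a leaf node
    (no successors) holds an identification data set."""
    def __init__(self, router=None, children=None, data_set=None):
        self.router = router            # callable: feature -> child index
        self.children = children or []  # empty list means this is a leaf
        self.data_set = data_set        # only set on leaf nodes

def find_target_node(root, feature):
    """Descend from the root, letting each node's model pick a child,
    until a leaf without successor nodes is reached: the target node."""
    node = root
    while node.children:
        node = node.children[node.router(feature)]
    return node
```

The cost of identification is then proportional to the tree depth rather than to the total number of stored primary voice data, which is the speed-up the text describes.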
S70. Take the identification data set corresponding to the target node as the target data set; each primary voice data in the target data set carries a speaker identifier.
Here, the target node is the leaf node found in the hierarchy model library by step S60.
An identification data set is stored in association with a leaf node of the tree formed by the hierarchy model library, and is a set in which the number of primary voice data items is less than the preset threshold. The hierarchy model library is the database containing the current hierarchy models and the identification data sets.
The primary voice data is the primary voice data obtained in step S41, i.e. the voice data recorded by the speaker through a computer device. The speaker identifier corresponds to step S41: it is the speaker-identity identifier of the primary voice data, indicating the speaker's unique identity, and may be a user ID, a mobile-phone number, an ID-card number, or the like.
In step S70, the identification server finds the identification data set associated with the target node and takes it as the target data set, and further obtains, per step S41, all the primary voice data in that identification data set together with their speaker identifiers.
S80. Obtain the spatial distance between the voice features to be tested and each original voice feature in the target data set, and determine the target speaker identifier corresponding to the voice data to be tested.
Here, in this embodiment the spatial distance uses the geometric cosine of the included angle to measure the difference in direction between two features in space.
The target speaker identifier is the speaker identifier in the hierarchy model library that corresponds to the voice features to be tested.
Specifically, the spatial distance between the voice features to be tested and each original voice feature in the target data set may be determined by the following formula:

cos(A, B) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²)),

where Aᵢ and Bᵢ are the components of the voice features to be tested and of the original voice feature, respectively. The formula shows that the spatial distance, i.e. the cosine value, ranges from -1 to 1: -1 means the two voice features point in opposite directions in the space, 1 means they point in the same direction, and 0 means the two voice features are independent. Values between -1 and 1 indicate the degree of similarity or dissimilarity between the two voice features; understandably, the closer the similarity is to 1, the closer the two voice features are.
In this step, the identification server takes as the target speaker identifier the speaker identifier corresponding to the original voice feature whose spatial distance to the voice features to be tested is largest.
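The cosine comparison and the arg-max over speaker identifiers can be sketched as follows; the dictionary of enrolled features is an assumed stand-in for the target data set:

```python
import math

def cosine_similarity(a, b):
    """cos(A, B) = sum(A_i * B_i) / (||A|| * ||B||); ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify_speaker(test_feature, enrolled):
    """Return the speaker ID whose stored original voice feature is closest
    in direction to the feature under test (largest cosine value)."""
    return max(enrolled,
               key=lambda sid: cosine_similarity(test_feature, enrolled[sid]))
```

Because only the limited set of features in the target data set is compared, this final step stays cheap regardless of how many speakers the whole library holds.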
In the speech recognition method provided by steps S50 to S80, the voice features to be tested are processed by the model classification logic and the current hierarchy models in the hierarchy model library to determine the target node. Taking the identification data set corresponding to the target node as the target data set, the target speaker identifier corresponding to the voice data to be tested can be determined. This avoids comparing the voice data to be tested directly against all the primary voice data one by one: the corresponding current hierarchy models are searched level by level through the hierarchy model library to determine the target data set, and only then is the limited primary voice data in the target data set compared to determine the target speaker identifier, improving the efficiency of speech recognition.
It should be understood that the serial numbers of the steps in the above embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, a model library establishing device is provided, and the device corresponds one-to-one to the model library establishing method in the above embodiment. As shown in Figure 8, the model library establishing device includes a training sample set obtaining module 10, a hierarchical model storing module 20, a training sample set updating module 30 and an identification data set determining module 40. The functional modules are described in detail as follows:
The training sample set obtaining module 10 is configured to obtain a training sample set, the training sample set including at least two items of original voice data.
The hierarchical model storing module 20 is configured to, if the sample size of the original voice data in the training sample set is greater than a preset threshold, establish a current hierarchical model according to the training voice features extracted from the original voice data, store the current hierarchical model in the hierarchical model library, determine the model grading logic in the hierarchical model library, and divide the original voice data into at least two current training subsets according to the model grading logic.
The training sample set updating module 30 is configured to, if the sample size of a current training subset is greater than the preset threshold, update the current training subset to be the training sample set.
The identification data set determining module 40 is configured to, if the sample size of a current training subset is not greater than the preset threshold, determine the current training subset to be an identification data set and store the identification data set in the hierarchical model library.
Preferably, the model library establishing device further includes a model library creating unit 11.
The model library creating unit 11 is configured to create the hierarchical model library using a tree structure, the tree structure including one root node and at least two child nodes associated with the root node.
Preferably, the hierarchical model storing module 20 includes a grading logic determining unit 12.
The grading logic determining unit 12 is configured to store the current hierarchical model in a child node of the tree structure, and determine the model grading logic according to the storage location of the current hierarchical model in the tree structure.
Preferably, the hierarchical model storing module 20 includes a training feature obtaining unit 21, a simplified feature obtaining unit 22, a subspace obtaining unit 23 and a hierarchical model obtaining unit 24.
The training feature obtaining unit 21 is configured to perform feature extraction on the original voice data to obtain training voice features.
The simplified feature obtaining unit 22 is configured to perform simplification processing on the training voice features using a model simplification algorithm to obtain simplified voice features.
The subspace obtaining unit 23 is configured to iterate the simplified voice features using an expectation-maximization (EM) algorithm to obtain a total variability subspace.
The hierarchical model obtaining unit 24 is configured to project the simplified voice features onto the total variability subspace to obtain the current hierarchical model.
Preferably, the simplified feature obtaining unit 22 includes a normal distribution obtaining subunit 221 and a voice feature obtaining subunit 222.
The normal distribution obtaining subunit 221 is configured to process the training voice features using a Gaussian filter to obtain corresponding two-dimensional normal distributions.
The voice feature obtaining subunit 222 is configured to simplify the two-dimensional normal distributions using the model simplification algorithm to obtain the simplified voice features.
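By way of illustration only, the Gaussian filtering performed by subunit 221 can be sketched as one-dimensional smoothing of a feature sequence; this is a hedged stand-in for the patent's actual filter, and the kernel radius and sigma are assumed values:

```python
import math

def gaussian_kernel(radius, sigma):
    # Discrete Gaussian weights over [-radius, radius], normalized to sum to 1.
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    s = sum(k)
    return [w / s for w in k]

def gaussian_filter(feature, radius=2, sigma=1.0):
    # Smooth a 1-D training voice feature; samples beyond either edge
    # are clamped to the nearest border value.
    kernel = gaussian_kernel(radius, sigma)
    n = len(feature)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * feature[idx]
        out.append(acc)
    return out
```

Because the kernel is normalized, a constant feature sequence passes through the filter unchanged, which is a quick sanity check on any smoothing implementation.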
Preferably, the model library establishing device further includes a speaker data obtaining unit 41, an original feature obtaining unit 42 and an original feature storing unit 43.
The speaker data obtaining unit 41 is configured to obtain the original voice data in each identification data set and the corresponding speaker identifiers.
The original feature obtaining unit 42 is configured to perform feature extraction on the original voice data to obtain the original voice features corresponding to the original voice data.
The original feature storing unit 43 is configured to store the original voice features and the speaker identifiers in association in the identification data set.
In one embodiment, a speech recognition device is provided, and the device corresponds one-to-one to the speech recognition method in the above embodiment. As shown in Figure 9, the speech recognition device includes a test voice obtaining module 50, a target node determining module 60, an identification data set corresponding module 70 and a speaker identifier determining module 80. The functional modules are described in detail as follows:
The test voice obtaining module 50 is configured to obtain voice data to be tested and extract the voice feature to be tested corresponding to the voice data to be tested.
The target node determining module 60 is configured to process the voice feature to be tested according to the model grading logic and the current hierarchical models in the hierarchical model library to determine a target node.
The identification data set corresponding module 70 is configured to take the identification data set corresponding to the target node as the target data set, each item of original voice data in the target data set carrying a speaker identifier.
The speaker identifier determining module 80 is configured to obtain the space distance between the voice feature to be tested and each original voice feature in the target data set, and determine the target speaker identifier corresponding to the voice data to be tested.
For the specific limitations of the model library establishing device, refer to the limitations of the model library establishing method above; details are not repeated here. Each module in the above model library establishing device may be implemented wholly or partly by software, hardware or a combination thereof. The above modules may be embedded in or independent of a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Figure 10. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores data related to speech recognition. The network interface of the computer device communicates with external terminals through a network connection. The computer program, when executed by the processor, implements a model library establishing method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor implements the following steps when executing the computer program: obtaining a training sample set, the training sample set including at least two items of original voice data; if the sample size of the original voice data in the training sample set is greater than a preset threshold, establishing a current hierarchical model according to the training voice features extracted from the original voice data, storing the current hierarchical model in a hierarchical model library, determining the model grading logic in the hierarchical model library, and dividing the original voice data into at least two current training subsets according to the model grading logic; if the sample size of a current training subset is greater than the preset threshold, updating the current training subset to be the training sample set; and if the sample size of a current training subset is not greater than the preset threshold, determining the current training subset to be an identification data set and storing the identification data set in the hierarchical model library.
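The recursive build loop in these steps (split while the sample size exceeds the preset threshold, otherwise store an identification data set at the leaf) can be sketched as follows. This is a toy illustration only: the current hierarchical model is reduced to a simple median split over scalar features, so the split rule, the threshold value and all names are assumptions, not the patent's actual classification model:

```python
PRESET_THRESHOLD = 2  # hypothetical sample-size threshold

class Node:
    def __init__(self):
        self.model = None               # stand-in for the current hierarchical model
        self.children = []              # child nodes of the tree structure
        self.identification_set = None  # set only at leaves (identification data set)

def build_model_library(node, training_sample_set):
    # If the sample size is not greater than the threshold, the current
    # training subset becomes an identification data set stored at this node.
    if len(training_sample_set) <= PRESET_THRESHOLD:
        node.identification_set = list(training_sample_set)
        return
    # Otherwise "train" a model (here: a median split value), divide the
    # data into two current training subsets, and recurse on each.
    ordered = sorted(training_sample_set)
    mid = len(ordered) // 2
    node.model = ordered[mid]
    for subset in (ordered[:mid], ordered[mid:]):
        child = Node()
        node.children.append(child)
        build_model_library(child, subset)

def find_target_node(node, feature):
    # Model grading logic: descend until a leaf (identification data set).
    while node.identification_set is None:
        node = node.children[0] if feature < node.model else node.children[1]
    return node
```

Splitting on sorted halves guarantees that every subset is strictly smaller than its parent, so the recursion always terminates at leaves no larger than the threshold.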
In one embodiment, the processor further implements the following steps when executing the computer program: creating the hierarchical model library using a tree structure, the tree structure including one root node and at least two child nodes associated with the root node; wherein storing the current hierarchical model in the hierarchical model library and determining the model grading logic in the hierarchical model library includes: storing the current hierarchical model in a child node of the tree structure, and determining the model grading logic according to the storage location of the current hierarchical model in the tree structure.
In one embodiment, the processor implements the following steps when executing the computer program: performing feature extraction on the original voice data to obtain training voice features; performing simplification processing on the training voice features using a model simplification algorithm to obtain simplified voice features; iterating the simplified voice features using an expectation-maximization (EM) algorithm to obtain a total variability subspace; and projecting the simplified voice features onto the total variability subspace to obtain the current hierarchical model.
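The final projection step can be illustrated minimally as follows; the basis vectors here are hypothetical placeholders for a total variability subspace that would in practice be estimated by the EM iterations, and an orthonormal basis is assumed for simplicity:

```python
def project_to_subspace(feature, subspace):
    # subspace: list of basis vectors (rows) spanning the total variability
    # subspace. The projection is the vector of dot products of the simplified
    # voice feature with each basis vector (assumes orthonormal rows).
    return [sum(b * f for b, f in zip(basis, feature)) for basis in subspace]
```

The projected coordinates are lower-dimensional than the input feature, which is what makes the resulting hierarchical models cheap to evaluate while descending the tree.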
In one embodiment, the processor implements the following steps when executing the computer program: processing the training voice features using a Gaussian filter to obtain corresponding two-dimensional normal distributions; and simplifying the two-dimensional normal distributions using the model simplification algorithm to obtain the simplified voice features.
In one embodiment, the processor further implements the following steps when executing the computer program: obtaining the original voice data in each identification data set and the corresponding speaker identifiers; performing feature extraction on the original voice data to obtain the original voice features corresponding to the original voice data; and storing the original voice features and the speaker identifiers in association in the identification data set.
In one embodiment, the processor implements the following steps when executing the computer program: obtaining voice data to be tested, and extracting the voice feature to be tested corresponding to the voice data to be tested; processing the voice feature to be tested according to the model grading logic and the current hierarchical models in the hierarchical model library to determine a target node; taking the identification data set corresponding to the target node as the target data set, each item of original voice data in the target data set carrying a speaker identifier; and obtaining the space distance between the voice feature to be tested and each original voice feature in the target data set to determine the target speaker identifier corresponding to the voice data to be tested.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. The computer program, when executed by a processor, implements the following steps: obtaining a training sample set, the training sample set including at least two items of original voice data; if the sample size of the original voice data in the training sample set is greater than a preset threshold, establishing a current hierarchical model according to the training voice features extracted from the original voice data, storing the current hierarchical model in a hierarchical model library, determining the model grading logic in the hierarchical model library, and dividing the original voice data into at least two current training subsets according to the model grading logic; if the sample size of a current training subset is greater than the preset threshold, updating the current training subset to be the training sample set; and if the sample size of a current training subset is not greater than the preset threshold, determining the current training subset to be an identification data set and storing the identification data set in the hierarchical model library.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: creating the hierarchical model library using a tree structure, the tree structure including one root node and at least two child nodes associated with the root node; wherein storing the current hierarchical model in the hierarchical model library and determining the model grading logic in the hierarchical model library includes: storing the current hierarchical model in a child node of the tree structure, and determining the model grading logic according to the storage location of the current hierarchical model in the tree structure.
In one embodiment, the computer program, when executed by the processor, implements the following steps: performing feature extraction on the original voice data to obtain training voice features; performing simplification processing on the training voice features using a model simplification algorithm to obtain simplified voice features; iterating the simplified voice features using an expectation-maximization (EM) algorithm to obtain a total variability subspace; and projecting the simplified voice features onto the total variability subspace to obtain the current hierarchical model.
In one embodiment, the computer program, when executed by the processor, implements the following steps: processing the training voice features using a Gaussian filter to obtain corresponding two-dimensional normal distributions; and simplifying the two-dimensional normal distributions using the model simplification algorithm to obtain the simplified voice features.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: obtaining the original voice data in each identification data set and the corresponding speaker identifiers; performing feature extraction on the original voice data to obtain the original voice features corresponding to the original voice data; and storing the original voice features and the speaker identifiers in association in the identification data set.
In one embodiment, the computer program, when executed by the processor, implements the following steps: obtaining voice data to be tested, and extracting the voice feature to be tested corresponding to the voice data to be tested; processing the voice feature to be tested according to the model grading logic and the current hierarchical models in the hierarchical model library to determine a target node; taking the identification data set corresponding to the target node as the target data set, each item of original voice data in the target data set carrying a speaker identifier; and obtaining the space distance between the voice feature to be tested and each original voice feature in the target data set to determine the target speaker identifier corresponding to the voice data to be tested.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be completed by instructing relevant hardware through a computer program. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, the division into the above functional units and modules is only used as an example. In practical applications, the above functions may be allocated to different functional units or modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be included within the protection scope of the present invention.

Claims (10)

1. A model library establishing method, characterized by comprising:
obtaining a training sample set, the training sample set including at least two items of original voice data;
if the sample size of the original voice data in the training sample set is greater than a preset threshold, establishing a current hierarchical model according to training voice features extracted from the original voice data, storing the current hierarchical model in a hierarchical model library, determining model grading logic in the hierarchical model library, and dividing the original voice data into at least two current training subsets according to the model grading logic;
if the sample size of a current training subset is greater than the preset threshold, updating the current training subset to be the training sample set; and
if the sample size of a current training subset is not greater than the preset threshold, determining the current training subset to be an identification data set, and storing the identification data set in the hierarchical model library.
2. The model library establishing method according to claim 1, characterized in that before the step of obtaining the training sample set, the model library establishing method further comprises:
creating the hierarchical model library using a tree structure, the tree structure including one root node and at least two child nodes associated with the root node;
wherein storing the current hierarchical model in the hierarchical model library and determining the model grading logic in the hierarchical model library comprises:
storing the current hierarchical model in a child node of the tree structure, and determining the model grading logic according to the storage location of the current hierarchical model in the tree structure.
3. The model library establishing method according to claim 1, characterized in that establishing the current hierarchical model according to the training voice features extracted from the original voice data comprises:
performing feature extraction on the original voice data to obtain training voice features;
performing simplification processing on the training voice features using a model simplification algorithm to obtain simplified voice features;
iterating the simplified voice features using an expectation-maximization (EM) algorithm to obtain a total variability subspace; and
projecting the simplified voice features onto the total variability subspace to obtain the current hierarchical model.
4. The model library establishing method according to claim 3, characterized in that performing simplification processing on the training voice features using the model simplification algorithm to obtain the simplified voice features comprises:
processing the training voice features using a Gaussian filter to obtain corresponding two-dimensional normal distributions; and
simplifying the two-dimensional normal distributions using the model simplification algorithm to obtain the simplified voice features.
5. The model library establishing method according to claim 1, characterized in that after the step of determining the current training subset to be the identification data set and storing the identification data set in the hierarchical model library, the model library establishing method further comprises:
obtaining the original voice data in each identification data set and the corresponding speaker identifiers;
performing feature extraction on the original voice data to obtain original voice features corresponding to the original voice data; and
storing the original voice features and the speaker identifiers in association in the identification data set.
6. A speech recognition method, characterized by comprising:
obtaining voice data to be tested, and extracting a voice feature to be tested corresponding to the voice data to be tested;
processing the voice feature to be tested according to model grading logic and current hierarchical models in a hierarchical model library to determine a target node;
taking an identification data set corresponding to the target node as a target data set, each item of original voice data in the target data set carrying a speaker identifier; and
obtaining the space distance between the voice feature to be tested and each original voice feature in the target data set, and determining a target speaker identifier corresponding to the voice data to be tested.
7. A model library establishing device, characterized by comprising:
a training sample set obtaining module configured to obtain a training sample set, the training sample set including at least two items of original voice data;
a hierarchical model storing module configured to, if the sample size of the original voice data in the training sample set is greater than a preset threshold, establish a current hierarchical model according to training voice features extracted from the original voice data, store the current hierarchical model in a hierarchical model library, determine model grading logic in the hierarchical model library, and divide the original voice data into at least two current training subsets according to the model grading logic;
a training sample set updating module configured to, if the sample size of a current training subset is greater than the preset threshold, update the current training subset to be the training sample set; and
an identification data set determining module configured to, if the sample size of a current training subset is not greater than the preset threshold, determine the current training subset to be an identification data set and store the identification data set in the hierarchical model library.
8. A speech recognition device, characterized by comprising:
a test voice obtaining module configured to obtain voice data to be tested and extract a voice feature to be tested corresponding to the voice data to be tested;
a target node determining module configured to process the voice feature to be tested according to model grading logic and current hierarchical models in a hierarchical model library to determine a target node;
an identification data set corresponding module configured to take an identification data set corresponding to the target node as a target data set, each item of original voice data in the target data set carrying a speaker identifier; and
a speaker identifier determining module configured to obtain the space distance between the voice feature to be tested and each original voice feature in the target data set, and determine a target speaker identifier corresponding to the voice data to be tested.
9. A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the model library establishing method according to any one of claims 1 to 5, or implements the steps of the speech recognition method according to claim 6.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the model library establishing method according to any one of claims 1 to 5, or implements the steps of the speech recognition method according to claim 6.
CN201810592869.8A 2018-06-11 2018-06-11 Model base establishing method, voice recognition method, device, equipment and medium Active CN108922543B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810592869.8A CN108922543B (en) 2018-06-11 2018-06-11 Model base establishing method, voice recognition method, device, equipment and medium
PCT/CN2018/104040 WO2019237518A1 (en) 2018-06-11 2018-09-05 Model library establishment method, voice recognition method and apparatus, and device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810592869.8A CN108922543B (en) 2018-06-11 2018-06-11 Model base establishing method, voice recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN108922543A true CN108922543A (en) 2018-11-30
CN108922543B CN108922543B (en) 2022-08-16

Family

ID=64418041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810592869.8A Active CN108922543B (en) 2018-06-11 2018-06-11 Model base establishing method, voice recognition method, device, equipment and medium

Country Status (2)

Country Link
CN (1) CN108922543B (en)
WO (1) WO2019237518A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060667A (en) * 2019-03-15 2019-07-26 平安科技(深圳)有限公司 Batch processing method, device, computer equipment and the storage medium of voice messaging
CN110414709A (en) * 2019-06-18 2019-11-05 重庆金融资产交易所有限责任公司 Debt risk intelligent Forecasting, device and computer readable storage medium
CN110428819A (en) * 2019-05-21 2019-11-08 腾讯科技(深圳)有限公司 Decoding network generation method, audio recognition method, device, equipment and medium
CN110782879A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Sample size-based voiceprint clustering method, device, equipment and storage medium
CN112634863A (en) * 2020-12-09 2021-04-09 深圳市优必选科技股份有限公司 Training method and device of speech synthesis model, electronic equipment and medium
WO2021128256A1 (en) * 2019-12-27 2021-07-01 深圳市优必选科技股份有限公司 Voice conversion method, apparatus and device, and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1211026A (en) * 1997-09-05 1999-03-17 中国科学院声学研究所 Continuous voice identification technology for Chinese putonghua large vocabulary
US20030014250A1 (en) * 1999-01-26 2003-01-16 Homayoon S. M. Beigi Method and apparatus for speaker recognition using a hierarchical speaker model tree
US20030036903A1 (en) * 2001-08-16 2003-02-20 Sony Corporation Retraining and updating speech models for speech recognition
CN1447278A (en) * 2002-11-15 2003-10-08 郑方 Method for recognizing voice print
CN1535460A (en) * 2001-03-01 2004-10-06 �Ҵ���˾ Hierarchichal language models
CN102789779A (en) * 2012-07-12 2012-11-21 广东外语外贸大学 Speech recognition system and recognition method thereof
CN104268279A (en) * 2014-10-16 2015-01-07 魔方天空科技(北京)有限公司 Query method and device of corpus data
CN105006231A (en) * 2015-05-08 2015-10-28 南京邮电大学 Distributed large population speaker recognition method based on fuzzy clustering decision tree
CN105096955A (en) * 2015-09-06 2015-11-25 广东外语外贸大学 Speaker rapid identification method and system based on growing and clustering algorithm of models
CN107993663A (en) * 2017-09-11 2018-05-04 北京航空航天大学 A kind of method for recognizing sound-groove based on Android

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101562012B (en) * 2008-04-16 2011-07-20 创而新(中国)科技有限公司 Method and system for graded measurement of voice
CN105096935B (en) * 2014-05-06 2019-08-09 阿里巴巴集团控股有限公司 A kind of pronunciation inputting method, device and system
CN104135577A (en) * 2014-08-27 2014-11-05 陈包容 Method and device for quickly finding contact persons based on user-defined voice
CN107993071A (en) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 Electronic device, auth method and storage medium based on vocal print


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060667A (en) * 2019-03-15 2019-07-26 Ping An Technology (Shenzhen) Co., Ltd. Batch processing method and device for voice information, computer equipment and storage medium
WO2020186695A1 (en) * 2019-03-15 2020-09-24 Ping An Technology (Shenzhen) Co., Ltd. Voice information batch processing method and apparatus, computer device, and storage medium
CN110060667B (en) * 2019-03-15 2023-05-30 Ping An Technology (Shenzhen) Co., Ltd. Batch processing method and device for voice information, computer equipment and storage medium
CN110428819A (en) * 2019-05-21 2019-11-08 Tencent Technology (Shenzhen) Co., Ltd. Decoding network generation method, voice recognition method, device, equipment and medium
CN110428819B (en) * 2019-05-21 2020-11-24 Tencent Technology (Shenzhen) Co., Ltd. Decoding network generation method, voice recognition method, device, equipment and medium
CN110414709A (en) * 2019-06-18 2019-11-05 Chongqing Financial Assets Exchange Co., Ltd. Intelligent debt risk prediction method and device, and computer-readable storage medium
CN110782879A (en) * 2019-09-18 2020-02-11 Ping An Technology (Shenzhen) Co., Ltd. Sample-size-based voiceprint clustering method, device, equipment and storage medium
WO2021051505A1 (en) * 2019-09-18 2021-03-25 Ping An Technology (Shenzhen) Co., Ltd. Method, device, and apparatus for performing voiceprint clustering on the basis of sample size, and storage medium
WO2021128256A1 (en) * 2019-12-27 2021-07-01 UBTECH Robotics Corp Ltd Voice conversion method, apparatus and device, and storage medium
CN112634863A (en) * 2020-12-09 2021-04-09 UBTECH Robotics Corp Ltd Training method and device of speech synthesis model, electronic equipment and medium
CN112634863B (en) * 2020-12-09 2024-02-09 UBTECH Robotics Corp Ltd Training method and device of speech synthesis model, electronic equipment and medium

Also Published As

Publication number Publication date
WO2019237518A1 (en) 2019-12-19
CN108922543B (en) 2022-08-16

Similar Documents

Publication Publication Date Title
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
WO2021139425A1 (en) Voice activity detection method, apparatus and device, and storage medium
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
CN109065028B (en) Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN108922544A (en) Universal vector training method, voice clustering method, device, equipment and medium
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
CN110600017A (en) Training method of voice processing model, voice recognition method, system and device
WO2019200744A1 (en) Self-updated anti-fraud method and apparatus, computer device and storage medium
CN109065022B (en) I-vector extraction method, and speaker recognition method, device, equipment and medium
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN107993663A (en) Android-based voiceprint recognition method
CN107093422B (en) Voice recognition method and voice recognition system
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
CN112712809A (en) Voice detection method and device, electronic equipment and storage medium
CN111161713A (en) Voice gender identification method and device and computing equipment
CN110136726A (en) Voice gender estimation method, device, system and storage medium
CN112992155A (en) Far-field voice speaker recognition method and device based on residual error neural network
Hossan et al. Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization
CN111968650A (en) Voice matching method and device, electronic equipment and storage medium
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN111179942B (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and computer readable storage medium
Kambli et al. MelSpectroNet: Enhancing Voice Authentication Security with AI-based Siamese Model and Noise Reduction for Seamless User Experience

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant