CN109410956A - Object identification method, apparatus, device and storage medium for audio data - Google Patents

Object identification method, apparatus, device and storage medium for audio data

Info

Publication number
CN109410956A
Authority
CN
China
Prior art keywords
audio data
target
voiceprint
feature
Prior art date
Legal status
Granted
Application number
CN201811580955.3A
Other languages
Chinese (zh)
Other versions
CN109410956B (en)
Inventor
张享
高建清
王智国
胡国平
胡郁
刘庆峰
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201811580955.3A priority Critical patent/CN109410956B/en
Publication of CN109410956A publication Critical patent/CN109410956A/en
Application granted granted Critical
Publication of CN109410956B publication Critical patent/CN109410956B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 — Training, enrolment or model building
    • G10L17/06 — Decision making techniques; Pattern matching strategies
    • G10L17/08 — Use of distortion metrics or a particular distance between probe pattern and reference templates

Abstract

This application provides an object identification method, apparatus, device and storage medium for audio data. The method includes: obtaining audio data to be identified in a target scene, and a target voiceprint feature set adapted to the target scene; and identifying the object corresponding to the audio data to be identified based on the target voiceprint feature set adapted to the target scene. In the object identification method provided by this application, because the target voiceprint feature set is adapted to the target scene, the voiceprint features extracted from the audio data to be identified in the target scene can be matched more reliably against the set, which improves identification of the objects corresponding to audio data captured in the target scene.

Description

Object identification method, apparatus, device and storage medium for audio data
Technical field
This application relates to the field of audio data processing, and in particular to an object identification method, apparatus, device and storage medium for audio data.
Background
In some scenarios (for example, conferences, lectures or diplomatic occasions), when an object speaks, the speaker's information needs to be displayed so that the other participants can see who is speaking and follow the content of the speech more easily.
Understandably, to display the speaking object's information, the object corresponding to the audio data must be identified after the speaker's audio data has been captured.
An existing identification scheme extracts a voiceprint feature from the audio data to be identified and matches it against the voiceprint features in a voiceprint library. Sometimes, however, the extracted voiceprint fails to match any entry, in which case the object corresponding to the audio data cannot be determined; that is, the existing scheme performs poorly.
Summary of the invention
In view of this, this application provides an object identification method, apparatus, device and storage medium for audio data, so as to provide an identification scheme with better recognition performance. The scheme is as follows:
An object identification method for audio data, comprising:
obtaining audio data to be identified in a target scene, and a target voiceprint feature set adapted to the target scene;
identifying the object corresponding to the audio data to be identified based on the target voiceprint feature set adapted to the target scene.
Optionally, obtaining the target voiceprint feature set adapted to the target scene comprises:
obtaining target audio data in the target scene, wherein the target audio data includes at least the audio data to be identified;
migrating, according to the target audio data, a plurality of voiceprint features corresponding to a plurality of pre-collected pieces of history audio data, to obtain the target voiceprint feature set adapted to the target scene.
Optionally, the pieces of history audio data are history audio data of a plurality of objects;
migrating, according to the target audio data, the plurality of voiceprint features corresponding to the pre-collected pieces of history audio data to obtain the target voiceprint feature set adapted to the target scene comprises:
obtaining, based on the target audio data, target acoustic features that correspond to the respective pieces of history audio data and can adapt to the target scene;
obtaining, from the target acoustic features corresponding to the pieces of history audio data, the voiceprint features corresponding to the respective objects;
taking the set formed by the voiceprint features corresponding to the objects as the target voiceprint feature set adapted to the target scene.
Optionally, migrating, according to the target audio data, the plurality of voiceprint features corresponding to the pre-collected pieces of history audio data to obtain the target voiceprint feature set adapted to the target scene further comprises:
performing, with a pre-established migration transformation model, a migration transformation on the voiceprint features corresponding to the objects, to obtain migration-transformed voiceprint features corresponding to the objects;
taking the set formed by the migration-transformed voiceprint features corresponding to the objects as the target voiceprint feature set adapted to the target scene.
Optionally, obtaining, based on the target audio data, the target acoustic features that correspond to the pieces of history audio data and can adapt to the target scene comprises:
extracting acoustic features from the target audio data to obtain the acoustic features corresponding to the target audio data, and extracting acoustic features from each piece of history audio data to obtain the acoustic features corresponding to the pieces of history audio data;
performing, based on the acoustic features corresponding to the target audio data, a migration transformation on the acoustic features corresponding to the pieces of history audio data, to obtain the migration-transformed target acoustic features that correspond to the pieces of history audio data and can adapt to the target scene;
wherein the migration transformation minimizes the distance between the acoustic features corresponding to the history audio data and the acoustic features corresponding to the target audio data.
Optionally, performing, based on the acoustic features corresponding to the target audio data, the migration transformation on the acoustic features corresponding to the pieces of history audio data comprises:
for any piece of history audio data:
determining the migration transformation matrix corresponding to the piece of history audio data, based on the acoustic features corresponding to the piece of history audio data and the acoustic features corresponding to the target audio data;
applying the migration transformation matrix corresponding to the piece of history audio data to the acoustic features corresponding to the piece of history audio data, and taking the migration-transformed acoustic features as target acoustic features;
so as to obtain the target acoustic features corresponding to each piece of history audio data.
Optionally, determining the migration transformation matrix corresponding to the piece of history audio data, based on the acoustic features corresponding to the piece of history audio data and the acoustic features corresponding to the target audio data, comprises:
obtaining a pre-established universal Gaussian mixture model, the universal Gaussian mixture model comprising a plurality of single Gaussians;
computing the probability of the acoustic features corresponding to the piece of history audio data under each single Gaussian, and the probability of the acoustic features corresponding to the target audio data under each single Gaussian;
expressing the distance between the acoustic features corresponding to the piece of history audio data and the acoustic features corresponding to the target audio data in terms of these probabilities under each single Gaussian;
determining a transformation matrix error by minimizing that distance, and determining the migration transformation matrix corresponding to the piece of history audio data based on a preset migration transformation matrix and the transformation matrix error.
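The per-Gaussian probabilities that the distance expression above is built from can be computed as in the following sketch. The diagonal-covariance universal GMM and all names are illustrative assumptions, not the patent's exact formulation:

```python
import numpy as np

def gaussian_posteriors(frames, means, variances, weights):
    """Per-frame posterior probability of each single Gaussian in a
    diagonal-covariance universal GMM (a minimal sketch)."""
    # log N(x; mu_k, diag(var_k)) for every frame and component
    diff = frames[:, None, :] - means[None, :, :]             # (T, K, D)
    log_det = np.sum(np.log(variances), axis=1)               # (K,)
    log_prob = -0.5 * (np.sum(diff**2 / variances[None], axis=2)
                       + log_det[None] + means.shape[1] * np.log(2 * np.pi))
    log_w = np.log(weights)[None] + log_prob                  # (T, K)
    log_norm = np.logaddexp.reduce(log_w, axis=1, keepdims=True)
    return np.exp(log_w - log_norm)                           # rows sum to 1

rng = np.random.default_rng(1)
frames = rng.normal(size=(50, 4))       # acoustic features of one recording
means = rng.normal(size=(8, 4))         # 8 single Gaussians of the universal GMM
variances = np.ones((8, 4))
weights = np.full(8, 1 / 8)
post = gaussian_posteriors(frames, means, variances, weights)
assert post.shape == (50, 8)
assert np.allclose(post.sum(axis=1), 1.0)
```

Computing these posteriors for both the history features and the target features gives the per-Gaussian probabilities that the claimed distance compares.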
Optionally, determining the migration transformation matrix corresponding to the piece of history audio data, based on the acoustic features corresponding to the piece of history audio data and the acoustic features corresponding to the target audio data, comprises:
obtaining a pre-established universal Gaussian mixture model, the universal Gaussian mixture model comprising a plurality of single Gaussians;
computing the probability of the acoustic features corresponding to the target audio data under each single Gaussian;
choosing, based on those probabilities, a plurality of target single Gaussians from among the single Gaussians, the probabilities of the acoustic features corresponding to the target audio data under the target single Gaussians all being greater than their probabilities under the other single Gaussians;
expressing the distance between the acoustic features corresponding to the piece of history audio data and the acoustic features corresponding to the target audio data in terms of the mean and variance of each set of acoustic features under each target single Gaussian;
determining a transformation matrix error by minimizing that distance, and determining the migration transformation matrix corresponding to the piece of history audio data based on a preset migration transformation matrix and the transformation matrix error.
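The statistics this variant compares — target single Gaussians selected by probability, with occupancy-weighted means and variances under each — can be sketched as follows; the weighting scheme and names are assumptions for illustration:

```python
import numpy as np

def top_gaussian_stats(frames, posteriors, k):
    """Select the k single Gaussians under which the features are most probable
    and compute the features' occupancy-weighted mean and variance under each."""
    occupancy = posteriors.sum(axis=0)                  # total probability mass per Gaussian
    top = np.argsort(occupancy)[-k:]                    # the k target single Gaussians
    stats = {}
    for g in top:
        w = posteriors[:, g:g + 1]                      # (T, 1) per-frame weights
        mean = (w * frames).sum(axis=0) / w.sum()
        var = (w * (frames - mean) ** 2).sum(axis=0) / w.sum()
        stats[int(g)] = (mean, var)
    return stats

rng = np.random.default_rng(4)
frames = rng.normal(size=(40, 3))                       # one recording's features
posteriors = rng.dirichlet(np.ones(6), size=40)         # stand-in GMM posteriors
stats = top_gaussian_stats(frames, posteriors, k=2)
assert len(stats) == 2
for mean, var in stats.values():
    assert mean.shape == (3,) and np.all(var >= 0)
```

The distance in the claim would then be built from the gap between these per-Gaussian means and variances for the history features and for the target features.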
Optionally, obtaining, from the target acoustic features corresponding to the pieces of history audio data, the voiceprint features corresponding to the respective objects comprises:
combining, among the pieces of history audio data, the target acoustic features corresponding to the one or more pieces of history audio data of the same object, to obtain the acoustic features corresponding to each of the objects;
extracting voiceprint features from the acoustic features corresponding to each object, to obtain the voiceprint features corresponding to the respective objects.
Optionally, the migration transformation model is a generative adversarial model;
the generative adversarial model comprises a generation module and an adversarial discrimination module;
the generation module is configured to perform a migration transformation on the input voiceprint features and output migration-transformed voiceprint features, wherein the input voiceprint features include the voiceprint features corresponding to the pieces of history audio data and the voiceprint features corresponding to the target audio data;
during training, the adversarial discrimination module discriminates whether the scene to which the migration-transformed voiceprint features output by the generation module belong is the target scene or a non-target scene.
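A minimal sketch of the two modules' forward passes, under the assumption of an affine generator and a logistic discriminator (the patent does not specify the architectures, and no training loop is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Generation module: an affine migration transform of a voiceprint (assumed form).
W = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))
b = np.zeros(dim)
def generate(v):
    return v @ W.T + b

# Adversarial discrimination module: scores whether a (transformed) voiceprint
# belongs to the target scene (towards 1) or a non-target scene (towards 0).
w_d = rng.normal(size=dim)
def discriminate(v):
    return 1.0 / (1.0 + np.exp(-(v @ w_d)))

history_print = rng.normal(size=dim)        # voiceprint from a non-target scene
migrated = generate(history_print)

# Generator objective during training: fool the discriminator into labelling
# the migrated print as belonging to the target scene.
gen_loss = -np.log(discriminate(migrated) + 1e-12)
assert migrated.shape == (dim,)
```

Training would alternate updates so the generator pulls history-scene voiceprints towards the target-scene distribution while the discriminator tries to tell them apart.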
Optionally, after the target voiceprint feature set adapted to the target scene is obtained, the object identification method for audio data further comprises:
determining, from the target audio data, the audio segments belonging to the same object, based on the target voiceprint feature set adapted to the target scene;
extracting voiceprint features from the audio fragments belonging to the same object, and updating the target voiceprint feature set adapted to the target scene with the extracted voiceprint features.
Optionally, determining, from the target audio data, the audio segments belonging to the same object based on the target voiceprint feature set adapted to the target scene comprises:
obtaining the valid audio segments from the target audio data;
cutting the valid audio segments into a plurality of audio fragments of different objects;
for any audio fragment among the plurality of audio fragments, determining the object corresponding to the fragment based on the voiceprint feature extracted from the fragment and the target voiceprint feature set adapted to the target scene, so as to obtain the objects corresponding to the respective fragments;
determining the audio segments belonging to the same object based on the objects corresponding to the fragments.
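The fragment-to-object assignment and grouping steps above can be sketched as follows; the fragment voiceprints are assumed to be already extracted, and cosine similarity is an assumed metric:

```python
import numpy as np

def group_segments(fragments, target_set):
    """Assign each audio fragment's voiceprint to an object via the target
    voiceprint feature set, then group fragments by object."""
    groups = {}
    for frag_id, vp in fragments.items():
        sims = {obj: float(np.dot(vp, ref) /
                           (np.linalg.norm(vp) * np.linalg.norm(ref)))
                for obj, ref in target_set.items()}
        obj = max(sims, key=sims.get)            # most similar object wins
        groups.setdefault(obj, []).append(frag_id)
    return groups

# Toy two-dimensional voiceprints (illustrative only)
target_set = {"A": np.array([1.0, 0.0]), "B": np.array([0.0, 1.0])}
fragments = {"seg1": np.array([0.9, 0.1]),
             "seg2": np.array([0.2, 0.8]),
             "seg3": np.array([0.95, 0.05])}
print(group_segments(fragments, target_set))  # {'A': ['seg1', 'seg3'], 'B': ['seg2']}
```

The per-object groups are then the "audio segments belonging to the same object" from which fresh voiceprints are extracted to update the set.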
An object identification apparatus for audio data, comprising: an audio data obtaining module, a voiceprint feature set obtaining module, and an object identification module;
the audio data obtaining module is configured to obtain audio data to be identified in a target scene;
the voiceprint feature set obtaining module is configured to obtain a target voiceprint feature set adapted to the target scene;
the object identification module is configured to identify the object corresponding to the audio data to be identified based on the target voiceprint feature set adapted to the target scene.
Optionally, the voiceprint feature set obtaining module comprises an audio data collection submodule and a voiceprint feature migration submodule;
the audio data collection submodule is configured to obtain target audio data in the target scene, wherein the target audio data includes at least the audio data to be identified;
the voiceprint feature migration submodule is configured to migrate, according to the target audio data, a plurality of voiceprint features corresponding to a plurality of pre-collected pieces of history audio data, to obtain the target voiceprint feature set adapted to the target scene.
Optionally, the voiceprint feature migration submodule comprises an acoustic feature determination submodule and a first voiceprint feature determination submodule;
the acoustic feature determination submodule is configured to obtain, based on the target audio data, target acoustic features that correspond to the pieces of history audio data and can adapt to the target scene;
the first voiceprint feature determination submodule is configured to obtain, from the target acoustic features corresponding to the pieces of history audio data, the voiceprint features corresponding to the respective objects; the set formed by the voiceprint features corresponding to the objects serves as the target voiceprint feature set adapted to the target scene.
Optionally, the voiceprint feature migration submodule further comprises a second voiceprint feature determination submodule;
the second voiceprint feature determination submodule is configured to perform, with a pre-established migration transformation model, a migration transformation on the voiceprint features corresponding to the objects, to obtain migration-transformed voiceprint features corresponding to the objects; the set formed by the migration-transformed voiceprint features corresponding to the objects serves as the target voiceprint feature set adapted to the target scene.
Optionally, the object identification apparatus for audio data further comprises an audio segment determination module, a voiceprint feature extraction module, and a voiceprint feature update module;
the audio segment determination module is configured to determine, from the target audio data, the audio segments belonging to the same object, based on the target voiceprint feature set adapted to the target scene;
the voiceprint feature extraction module is configured to extract voiceprint features from the audio fragments belonging to the same object;
the voiceprint feature update module is configured to update the target voiceprint feature set adapted to the target scene with the voiceprint features extracted by the voiceprint feature extraction module.
An object identification device for audio data, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement each step of the object identification method for audio data.
A readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing each step of the object identification method for audio data.
It can be seen from the above technical solutions that the object identification method, apparatus, device and storage medium for audio data provided by this application can obtain the audio data to be identified in a target scene and a target voiceprint feature set adapted to the target scene, and then identify the object corresponding to the audio data to be identified based on that set. Because the voiceprint features in the target voiceprint feature set are adapted to the target scene, the voiceprint features extracted from the audio data to be identified in the target scene can be matched more reliably, which improves identification of the objects corresponding to audio data in the target scene.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application or the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of this application; those of ordinary skill in the art may further derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the object identification method for audio data provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of one implementation of migrating, according to the target audio data, the plurality of voiceprint features corresponding to pre-collected pieces of history audio data to obtain the target voiceprint feature set adapted to the target scene, in the object identification method for audio data provided by an embodiment of this application;
Fig. 3 is a schematic flowchart of obtaining, based on the target audio data in the target scene, the target acoustic features that correspond to the pieces of history audio data and can adapt to the target scene, in the object identification method for audio data provided by an embodiment of this application;
Fig. 4 and Fig. 5 are schematic diagrams of one implementation of determining the migration transformation matrix corresponding to any piece of history audio data, in the object identification method for audio data provided by an embodiment of this application;
Fig. 6 and Fig. 7 are schematic diagrams of another implementation of determining the migration transformation matrix corresponding to any piece of history audio data, in the object identification method for audio data provided by an embodiment of this application;
Fig. 8 is a schematic flowchart of another implementation of migrating, according to the target audio data, the plurality of voiceprint features corresponding to pre-collected pieces of history audio data to obtain the target voiceprint feature set adapted to the target scene, in the object identification method for audio data provided by an embodiment of this application;
Fig. 9 is a schematic flowchart of the process of adjusting the voiceprint features corresponding to some or all of the objects in the target voiceprint feature set, in the object identification method for audio data provided by an embodiment of this application;
Fig. 10 is a schematic structural diagram of the object identification apparatus for audio data provided by an embodiment of this application;
Fig. 11 is a schematic structural diagram of the object identification device for audio data provided by an embodiment of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
In view of the poor recognition performance of the identification schemes in the prior art, the inventors conducted in-depth research and finally proposed an identification scheme with better performance. The object identification method for audio data provided by this application is introduced through the following embodiments.
Referring to Fig. 1, which shows a schematic flowchart of the object identification method for audio data provided by an embodiment of this application, the method may include:
Step S101: obtain the audio data to be identified in a target scene, and a target voiceprint feature set adapted to the target scene.
The inventors found in the course of making the invention that the voiceprint-matching failures of the existing schemes are caused by scene mismatch. For example, the voiceprint features in the voiceprint library may be extracted from audio data recorded in several different scenes such as A, B and C, while the audio data to be identified may come from scene D; the difference between the scenes means that voiceprints extracted from audio data recorded in different scenes cannot be matched, which leads to poor recognition performance. In view of this, this application obtains a target voiceprint feature set adapted to the target scene and uses it for voiceprint matching.
Step S102: identify the object corresponding to the audio data to be identified based on the target voiceprint feature set adapted to the target scene.
The target voiceprint feature set adapted to the target scene contains a plurality of voiceprint features adapted to the target scene.
Identifying the object corresponding to the audio data to be identified based on the target voiceprint feature set adapted to the target scene may include: extracting a voiceprint feature from the audio data to be identified; determining, from the target voiceprint feature set, the voiceprint feature that matches the feature extracted from the audio data to be identified; and determining the object corresponding to the matched voiceprint feature as the object corresponding to the audio data to be identified. Assuming the voiceprint feature extracted from the audio data to be identified is X, the similarity between X and each voiceprint feature in the target voiceprint feature set is computed, and the feature in the set with the highest similarity to X is determined to be the feature matching X.
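The highest-similarity matching step described above can be sketched as follows; cosine similarity and the speaker names are assumptions, since the embodiment does not fix a similarity measure:

```python
import numpy as np

def identify(x, target_set):
    """Return the object whose voiceprint in the target set is most similar
    to the extracted voiceprint x (target_set: object name -> vector)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(target_set, key=lambda obj: cos(x, target_set[obj]))

# Toy voiceprints (illustrative only)
target_set = {
    "speaker_A": np.array([1.0, 0.0, 0.2]),
    "speaker_B": np.array([0.1, 0.9, 0.3]),
}
x = np.array([0.9, 0.1, 0.15])   # voiceprint extracted from the audio to identify
print(identify(x, target_set))    # speaker_A
```

In practice a similarity threshold would also be applied so that a voiceprint matching nothing in the set is rejected rather than forced onto the nearest object.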
The object identification method for audio data provided by this embodiment of the application can obtain the audio data to be identified in the target scene and the target voiceprint feature set adapted to the target scene, and then identify, based on that set, the object corresponding to the audio data to be identified in the target scene. Because the voiceprint features in the target voiceprint feature set are adapted to the target scene, the voiceprint features extracted from the audio data to be identified in the target scene can be matched more reliably against the set, which improves identification of the objects corresponding to the audio data to be identified in the target scene.
In another embodiment of this application, the obtaining, in step S101 of the above embodiment, of the target voiceprint feature set adapted to the target scene is introduced.
The process of obtaining the target voiceprint feature set adapted to the target scene may include: obtaining target audio data in the target scene; and migrating, according to the target audio data, a plurality of voiceprint features corresponding to a plurality of pre-collected pieces of history audio data, to obtain the target voiceprint feature set adapted to the target scene.
The target audio data includes at least the audio data to be identified, and may further include other audio data in the target scene.
In some real-time scenes (such as a conference or a debate), speakers speak live. In this embodiment the audio data to be identified may be the audio data of the current speech of the current speaker, and the target audio data may include the audio data of that current speech together with all audio data obtained before the current moment. In addition, a user may record the audio data of a scene and play it back later; in that case the audio data to be identified may also be the audio data currently being played, and the target audio data correspondingly includes the audio data currently being played and the audio data preceding it.
The pre-collected pieces of history audio data may include history audio data recorded in non-target scenes, and may further include registration audio data recorded in the target scene. Registration audio data refers to audio data of a user pre-recorded in the target scene; there is usually one piece of registration audio data per user, though there may of course be several.
It should be noted that the pieces of history audio data are history audio data of a plurality of objects, and the objects include at least the objects in the target scene. In one possible implementation, the history audio data of the objects may be obtained as follows: obtain the list of all objects in the target scene, and obtain the history audio data of each object according to the list. Illustratively, if the target scene is a conference scene, the list of attendees may be obtained and the history audio data of the attendees obtained according to that list.
In this embodiment, there are multiple ways to migrate, according to the target audio data, the voiceprint features corresponding to the pre-collected pieces of history audio data so as to obtain the target voiceprint feature set adapted to the target scene. Referring to Fig. 2, which shows a schematic flowchart of one possible implementation, the process may include:
Step S201: obtain, based on the target audio data, target acoustic features that correspond to the pieces of history audio data and can adapt to the target scene.
It should be noted that, since voiceprint features are extracted from acoustic features, to obtain voiceprint features that can adapt to the target scene one should first obtain acoustic features that can adapt to the target scene.
Step S202: obtain, from the target acoustic features corresponding to the pieces of history audio data, the voiceprint features corresponding to the respective objects; the set formed by the voiceprint features corresponding to the objects serves as the target voiceprint feature set adapted to the target scene.
Obtaining the voiceprint features corresponding to the objects from the target acoustic features corresponding to the pieces of history audio data includes: combining, among the pieces of history audio data, the target acoustic features of the one or more pieces of history audio data corresponding to the same object, to obtain the acoustic features corresponding to each object; and extracting voiceprint features from the acoustic features corresponding to each object, to obtain the voiceprint features corresponding to the respective objects.
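The combine-then-extract step can be sketched as follows, with a frame mean standing in for a real voiceprint extractor (an assumption for illustration; in practice this would be, e.g., an i-vector or neural embedding model):

```python
import numpy as np

def object_voiceprints(history, owner):
    """Combine the migration-transformed acoustic features of each object's
    history audio and extract one voiceprint per object.

    history: recording id -> (frames, dims) feature matrix
    owner:   recording id -> object name
    """
    per_object = {}
    for rec_id, frames in history.items():
        per_object.setdefault(owner[rec_id], []).append(frames)
    # A mean over all of an object's frames is a placeholder "extraction".
    return {obj: np.vstack(chunks).mean(axis=0)
            for obj, chunks in per_object.items()}

rng = np.random.default_rng(2)
history = {f"rec{i}": rng.normal(size=(20, 5)) for i in range(4)}
owner = {"rec0": "A", "rec1": "A", "rec2": "B", "rec3": "C"}
prints = object_voiceprints(history, owner)
assert set(prints) == {"A", "B", "C"} and prints["A"].shape == (5,)
```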
The above step S201 — obtaining, based on the target audio data in the target scene, the target acoustic features that correspond to the pieces of history audio data and can adapt to the target scene — is introduced below.
Referring to Fig. 3, which shows a schematic flowchart of the process of obtaining, based on the target audio data in the target scene, the target acoustic features that correspond to the pieces of history audio data and can adapt to the target scene, the process may include:
Step S301: extract acoustic features from the target audio data to obtain the acoustic features corresponding to the target audio data, and extract acoustic features from each piece of history audio data to obtain the acoustic features corresponding to the pieces of history audio data.
Step S302: perform, based on the acoustic features corresponding to the target audio data, a migration transformation on the acoustic features corresponding to the pieces of history audio data, to obtain migration-transformed target acoustic features that correspond to the pieces of history audio data and can adapt to the target scene.
The migration transformation minimizes the distance between the acoustic features corresponding to the history audio data and the acoustic features corresponding to the target audio data.
It should be noted that the acoustic feature extracted from history audio data is influenced vulnerable to scene, it is based on this, this implementation Example carries out migration transformation to the acoustic feature extracted from history audio data, and making to migrate transformed acoustic feature can adapt to Target scene.
Specifically, being based on the corresponding acoustic feature of target audio data, sound corresponding to a plurality of history audio data Learning feature and carrying out the process of migration transformation may include: to be based on this history audio data for any bar history audio data Corresponding acoustic feature and the corresponding acoustic feature of target audio data determine the corresponding migration transformation of this history audio data Matrix;Using the corresponding migration transformation matrix of this history audio data to the corresponding acoustic feature of this history audio data into Row migration transformation, migrates transformed feature as target acoustical feature;It is corresponding to obtain a plurality of history audio data Target acoustical feature.
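Applying a per-utterance migration transformation matrix is a plain linear map over the frames of an utterance. A minimal sketch (the matrix W_i and the feature dimensions are illustrative, not the patent's values):

```python
import numpy as np

def apply_migration(acoustic_feats, W):
    """Apply a per-utterance migration transformation matrix W to every
    frame of an utterance's acoustic features (frames are rows)."""
    return acoustic_feats @ W.T

rng = np.random.default_rng(0)
W_i = np.eye(3) + 0.1 * rng.standard_normal((3, 3))  # hypothetical matrix for the i-th utterance
hist_feats = rng.standard_normal((50, 3))            # 50 frames of 3-dim historical features
target_feats = apply_migration(hist_feats, W_i)      # migration-transformed target acoustic features
```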
By constructing a migration transformation matrix between the acoustic feature corresponding to a piece of historical audio data and the acoustic feature corresponding to the target audio data, this embodiment makes the probability distributions of the two acoustic features as close to identical as possible.
Assume that the scene corresponding to the target audio data and the scene corresponding to the i-th piece of historical audio data have different impulse responses. The acoustic features of the same phoneme under the different scenes may then be expressed as:
F_old = X_s · H_old, F_new = X_s · H_new  (1)
where X_s denotes the acoustic feature of a phoneme state s collected in a quiet environment, H_old and H_new denote the impulse responses of the scene corresponding to the historical audio data and of the target scene respectively, and F_old and F_new are the acoustic feature extracted from the i-th piece of historical audio data and the acoustic feature extracted from the target audio data respectively.
To reduce the gap between acoustic feature F_old and acoustic feature F_new, a migration transformation matrix W_i must be constructed between them so that the distance between the two is minimized; W_i is the migration transformation matrix corresponding to the i-th piece of historical audio data. It should be noted that the plurality of pieces of historical audio data may correspond to different scenes, and different scenes correspond to different migration transformation matrices; for this reason, this embodiment determines a migration transformation matrix for each piece of historical audio data separately.
The distance between acoustic feature F_old and acoustic feature F_new is computed under the same spatial distribution: both F_old and F_new are features in that space, and since the migration transformation by W_i is linear, W_i F_old still lies in the same space. The objective function for determining the transformation matrix W_i is:
min |f(W_i F_old) - f(F_new)|  (2)
where f is the transform function of that space.
There are multiple implementations for solving the migration transformation matrix W_i corresponding to the i-th piece of historical audio data. Referring to Figs. 4 and 5, a schematic diagram of one implementation is shown, which may include:
Step S401: obtain a pre-established universal Gaussian mixture model, the universal Gaussian mixture model including multiple single Gaussians.
This embodiment trains a universal Gaussian mixture model M using a plurality of pieces of audio data from different scenes and different objects. The acoustic features extracted from audio data of any scene and any object all obey the distribution of the universal Gaussian mixture model M, which includes multiple single Gaussians (m_1, m_2, ..., m_n).
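The per-Gaussian probabilities used in the following steps can be sketched as averaged posterior probabilities of the frames under each single Gaussian of the universal model. This is an assumption about how p_j is computed, since the patent only names "the probability under each single Gaussian"; the toy model parameters are illustrative:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Diagonal-covariance Gaussian density, evaluated per frame."""
    norm = np.prod(2 * np.pi * var) ** -0.5
    return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var, axis=-1))

def component_probs(feats, weights, means, vars_):
    """Average posterior probability of the frames under each single Gaussian."""
    likes = np.stack([w * gaussian_pdf(feats, m, v)
                      for w, m, v in zip(weights, means, vars_)])  # (n_gauss, n_frames)
    post = likes / likes.sum(axis=0, keepdims=True)
    return post.mean(axis=1)  # one probability per single Gaussian

# Toy universal model with two single Gaussians m1, m2
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
vars_ = np.ones((2, 2))
feats = np.zeros((10, 2))  # frames sitting on m1's mean
p = component_probs(feats, weights, means, vars_)
```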
Step S402: calculate the probability of the acoustic feature corresponding to the target audio data under each single Gaussian, and calculate the probability of the acoustic feature corresponding to the i-th piece of historical audio data under each single Gaussian.
Step S403: express the distance between the acoustic feature corresponding to the i-th piece of historical audio data and the acoustic feature corresponding to the target audio data, using the probability of the acoustic feature corresponding to the target audio data under each single Gaussian and the probability of the acoustic feature corresponding to the i-th piece of historical audio data under each single Gaussian.
Assume that the probability of the acoustic feature corresponding to the i-th piece of historical audio data under single Gaussian m_j is p_j,old, and the probability of the acoustic feature corresponding to the target audio data under single Gaussian m_j is p_j,new. The distance between the acoustic feature F_old corresponding to the i-th piece of historical audio data and the acoustic feature F_new corresponding to the target audio data may then be expressed as:
Step S404: obtain a transformation matrix error by minimizing the distance between the acoustic feature corresponding to the i-th piece of historical audio data and the acoustic feature corresponding to the target audio data, and determine the migration transformation matrix corresponding to the i-th piece of historical audio data based on a preset migration transformation matrix and the obtained transformation matrix error.
The distance minimization may be expressed as:
From the above minimization formula, the transformation matrix error ∇W_i can be obtained using the following formula:
where x_old is the migration-transformed acoustic feature, and μ_j is the mean of single Gaussian m_j.
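The patent's closed-form expression for the transformation matrix error is not reproduced in this text, so the "minimize the distance, then correct a preset matrix by the error" idea is sketched here generically: a finite-difference gradient descent over W, starting from a preset matrix (the identity), under a toy distance function that stands in for the probability-based distance. Everything below the function definition is an illustrative assumption:

```python
import numpy as np

def solve_migration_matrix(dist, W0, lr=0.05, iters=200, eps=1e-4):
    """Minimize dist(W) by finite-difference gradient descent, starting from a
    preset matrix W0 (the accumulated update plays the role of the matrix error)."""
    W = W0.astype(float).copy()
    for _ in range(iters):
        grad = np.zeros_like(W)
        for idx in np.ndindex(W.shape):
            Wp = W.copy(); Wp[idx] += eps
            Wm = W.copy(); Wm[idx] -= eps
            grad[idx] = (dist(Wp) - dist(Wm)) / (2 * eps)
        W -= lr * grad
    return W

# Toy distance: the transformed historical features should match the target-scene mean
rng = np.random.default_rng(1)
F_old = rng.standard_normal((100, 2)) + 3.0  # historical features, offset by the scene
mu_new = np.zeros(2)                         # target-scene mean
dist = lambda W: float(np.sum(((F_old @ W.T).mean(axis=0) - mu_new) ** 2))
W_i = solve_migration_matrix(dist, np.eye(2))
```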
Referring to Figs. 6 and 7, a schematic diagram of another possible implementation of solving the transformation matrix W_i corresponding to the i-th piece of historical audio data is shown, which may include:
Step S601: obtain the pre-established universal Gaussian mixture model, the universal Gaussian mixture model including multiple single Gaussians.
Step S602: calculate the probability of the acoustic feature corresponding to the target audio data under each single Gaussian.
Step S603: based on the probabilities of the acoustic feature corresponding to the target audio data under the single Gaussians, select multiple target single Gaussians from the multiple single Gaussians.
The probability of the acoustic feature corresponding to the target audio data under each target single Gaussian is greater than its probability under any of the other single Gaussians.
Assume the universal Gaussian mixture model includes n single Gaussians m_1, m_2, ..., m_n. The probabilities of the acoustic feature corresponding to the target audio data under the single Gaussians may be sorted in descending order, and the single Gaussians corresponding to the first x (x < n) probabilities are selected as the target single Gaussians.
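The top-x selection of step S603 can be sketched directly (the probability values below are illustrative):

```python
import numpy as np

def select_target_gaussians(probs, x):
    """Return indices of the x single Gaussians with the largest probabilities,
    sorted in descending probability order (x < n)."""
    order = np.argsort(probs)[::-1]
    return order[:x]

probs = np.array([0.05, 0.40, 0.15, 0.30, 0.10])  # probabilities under m1..m5
targets = select_target_gaussians(probs, 2)        # indices of m2 and m4
```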
Step S604: calculate the mean and variance of the acoustic feature corresponding to the target audio data under each target single Gaussian, and calculate the mean and variance of the acoustic feature corresponding to the i-th piece of historical audio data under each target single Gaussian.
Step S605: express the distance between the acoustic feature corresponding to the i-th piece of historical audio data and the acoustic feature corresponding to the target audio data, using the mean and variance of the acoustic feature corresponding to the i-th piece of historical audio data under each target single Gaussian and the mean and variance of the acoustic feature corresponding to the target audio data under each target single Gaussian.
Assume that, under target single Gaussian m_j, the acoustic feature corresponding to the i-th piece of historical audio data has mean μ_old and variance δ_old, and the acoustic feature corresponding to the target audio data has mean μ_new and variance δ_new. The distance between the acoustic feature F_old corresponding to the i-th piece of historical audio data and the acoustic feature F_new corresponding to the target audio data may then be expressed as:
Step S606: obtain a transformation matrix error by minimizing the distance between the acoustic feature corresponding to the i-th piece of historical audio data and the acoustic feature corresponding to the target audio data, and determine the migration transformation matrix corresponding to the i-th piece of historical audio data based on a preset migration transformation matrix and the obtained transformation matrix error.
The distance minimization may be expressed as:
It should be noted that the manner of obtaining the transformation matrix error in this step is similar to that in the above implementation, and is not repeated here. In addition, the parts that this implementation shares with the above implementation may be cross-referenced and are likewise not repeated here.
To obtain voiceprint features that better adapt to the target scene, this embodiment further provides another possible implementation of migrating, according to the target audio data, the multiple voiceprint features corresponding to the plurality of pieces of historical audio data collected in advance, so as to obtain the target voiceprint feature set adapted to the target scene. Referring to Fig. 8, a schematic flowchart of this implementation is shown, which may include:
Step S801: based on the target audio data in the target scene, obtain the target acoustic features corresponding to the plurality of pieces of historical audio data.
Step S802: obtain, from the target acoustic features corresponding to the plurality of pieces of historical audio data, the voiceprint features corresponding to the multiple objects.
For the specific implementation of step S801 and step S802, refer to the related description of step S201 and step S202 in the above embodiment; details are not repeated here.
Step S803: using a pre-established migration transformation model, perform migration transformation on the voiceprint features corresponding to the multiple objects to obtain the migration-transformed voiceprint features corresponding to the multiple objects; the set formed by the migration-transformed voiceprint features corresponding to the multiple objects serves as the target voiceprint feature set adapted to the target scene.
The migration transformation model is trained using the voiceprint features extracted from the target acoustic features corresponding to the historical audio data (i.e., the migration-transformed acoustic features) and the voiceprint features extracted from the target audio data.
The migration transformation model in this embodiment may be a generative adversarial model, which includes a generation module and an adversarial discrimination module, wherein:
The generation module is configured to perform migration transformation on an input voiceprint feature and output the migration-transformed voiceprint feature, where the input voiceprint features include the voiceprint features corresponding to the historical audio data and the voiceprint features corresponding to the target audio data.
The adversarial discrimination module is configured to discriminate, during training, whether the scene to which the migration-transformed voiceprint feature output by the generation module belongs is the target scene or a non-target scene.
The training process of the migration transformation model includes: inputting the voiceprint features extracted from the historical audio data and the voiceprint features extracted from the target audio data in the target scene into the generation module, to obtain the migration-transformed voiceprint features output by the generation module; inputting the migration-transformed voiceprint features output by the generation module into the adversarial discrimination module; discriminating, by the adversarial discrimination module, whether the scene to which an input voiceprint feature belongs is the target scene or a non-target scene, to obtain a discrimination result; and updating the parameters of the generation module based on the discrimination result, until the adversarial discrimination module cannot distinguish whether an input voiceprint feature comes from the target scene or a non-target scene. The discrimination result output by the adversarial discrimination module may be characterized by 0/1, where 0 indicates a non-target scene and 1 indicates the target scene.
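The adversarial training loop described above can be sketched with a toy linear generation module and a logistic-regression discrimination module; the data, dimensions, learning rates, and model forms are all illustrative assumptions, not the patent's actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical voiceprints: historical (non-target scene) vs. target scene
hist_vp = rng.standard_normal((200, dim)) + 2.0  # offset models the scene mismatch
tgt_vp = rng.standard_normal((200, dim))

G = np.eye(dim)              # generation module: linear migration transform
w, b = np.zeros(dim), 0.0    # adversarial discrimination module (logistic)

for step in range(200):
    fake = hist_vp @ G.T     # migration-transformed voiceprints
    # Discriminator step: label 1 = target scene, 0 = non-target scene
    X = np.vstack([tgt_vp, fake])
    y = np.concatenate([np.ones(len(tgt_vp)), np.zeros(len(fake))])
    p = sigmoid(X @ w + b)
    w += 0.05 * (X.T @ (y - p)) / len(y)
    b += 0.05 * float(np.mean(y - p))
    # Generator step: update G so transformed voiceprints are judged "target scene"
    p_fake = sigmoid(fake @ w + b)
    grad_G = -np.outer(w, (1.0 - p_fake) @ hist_vp) / len(fake)
    G -= 0.05 * grad_G
```

Training stops, per the text, once the discrimination module can no longer tell the transformed historical voiceprints from target-scene voiceprints.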
After the migration transformation model is trained, the voiceprint feature of each object obtained in step S802 may be input into the migration transformation model for migration transformation, to obtain the migration-transformed voiceprint feature.
After the migration-transformed voiceprint features corresponding to the multiple objects are obtained, they may be stored into a voiceprint library so as to be matched against the voiceprint features extracted from the audio to be identified. Assume the plurality of pieces of historical audio data include both historical audio data in non-target scenes and registration audio data in the target scene; if the historical audio data in the non-target scenes correspond to 40 objects and the registration audio data in the target scene correspond to 20 objects, then the final voiceprint library contains the voiceprint features corresponding to 50 objects. When the migration-transformed voiceprint features are stored into the voiceprint library, they are stored on a per-object basis.
It should be noted that, in the target scene, the audio data of a speaking object is affected by the speech content and the speaking environment (for example, in a meeting, the speaker's position or the ambient noise). Therefore, the voiceprint features extracted from audio data acquired at different moments may differ, and even the voiceprint of the same speaking object extracted at an earlier moment (for example, 5 or 10 minutes before) may have changed somewhat, which affects the recognition effect. Based on this, to further improve the effect, the method provided in the embodiments of the present application further includes: dynamically adjusting, over time, the voiceprint features of some or all objects in the above target voiceprint feature set adapted to the target scene, so that the voiceprint feature corresponding to each object can match, to a higher degree, the voiceprint feature extracted from the audio data to be identified.
Referring to Fig. 9, a schematic flowchart of an implementation of adjusting the voiceprint features corresponding to some or all objects in the target voiceprint feature set adapted to the target scene is shown, which may include:
Step S901: based on the target voiceprint feature set adapted to the target scene, determine the audio segments belonging to the same object from the target audio data.
Specifically, the process of determining the audio segments belonging to the same object from the target audio data based on the target voiceprint feature set adapted to the target scene may include:
Step S9011: obtain valid audio segments from the target audio data.
It should be noted that the target audio data may contain pure noise data, which is invalid. Based on this, in this embodiment, voice endpoint detection is performed on the target audio data (for example, an energy-based voice endpoint detection method or a model-based voice endpoint detection method may be used) to remove the pure noise segments and retain the valid audio segments.
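The energy-based voice endpoint detection mentioned above can be sketched as simple frame-energy thresholding (the frame length and threshold are illustrative assumptions; practical detectors also smooth the decision over frames):

```python
import numpy as np

def energy_vad(signal, frame_len=160, threshold=0.01):
    """Energy-based voice endpoint detection: mark each frame as speech (True)
    when its mean-square energy exceeds a fixed threshold."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    return energy > threshold

sig = np.concatenate([np.zeros(800),                        # silence / pure noise
                      0.5 * np.sin(np.linspace(0, 50, 800)),  # "speech" segment
                      np.zeros(800)])
voiced = energy_vad(sig)  # per-frame speech/non-speech decisions
```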
Step S9012: split the valid audio segments into multiple audio fragments of different objects.
Speaker change point detection may be performed on the valid audio segments (for example, a BIC-based change point detection method, or a change point detection method based on a deep neural network (DNN) or a long short-term memory network (LSTM) model may be used), and the valid audio segments are cut into multiple audio fragments of different objects based on the change point detection result.
Step S9013: for any audio fragment among the multiple audio fragments, extract a voiceprint feature from the audio fragment as the voiceprint feature corresponding to the audio fragment, and determine the object corresponding to the audio fragment based on the voiceprint feature corresponding to the audio fragment and the target voiceprint feature set adapted to the target scene, thereby obtaining the objects corresponding to the multiple audio fragments.
Specifically, for any audio fragment among the multiple audio fragments, the distance between the voiceprint feature corresponding to the audio fragment and each voiceprint feature in the target voiceprint feature set adapted to the target scene may be calculated separately, and the object corresponding to the voiceprint feature closest to the voiceprint feature of the audio fragment is determined as the object corresponding to the audio fragment.
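The nearest-voiceprint assignment of step S9013 can be sketched as follows; Euclidean distance and the speaker names are assumptions, since the patent does not fix the distance metric:

```python
import numpy as np

def identify_object(fragment_vp, voiceprint_set):
    """Assign a fragment to the object whose stored voiceprint is nearest
    (Euclidean distance used here for illustration)."""
    best, best_d = None, float("inf")
    for obj, vp in voiceprint_set.items():
        d = np.linalg.norm(fragment_vp - vp)
        if d < best_d:
            best, best_d = obj, d
    return best

library = {"speaker_A": np.array([1.0, 0.0]),
           "speaker_B": np.array([0.0, 1.0])}
who = identify_object(np.array([0.9, 0.1]), library)  # nearest to speaker_A
```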
It should be noted that, in order to divide the valid audio segments precisely, in this embodiment, after the object corresponding to each audio fragment is determined, the audio fragments of the same object may be merged and then split again, and the object corresponding to each audio fragment is determined again based on the target voiceprint feature set adapted to the target scene. This process of merging and splitting audio fragments and determining their corresponding objects may be executed multiple times.
Step S9014: based on the objects corresponding to the multiple audio fragments, obtain the audio fragments belonging to the same object.
Since the object corresponding to each audio fragment has been determined, it can be known which audio fragments belong to the same object.
Step S902: extract a voiceprint feature from the audio fragments belonging to the same object, and update the target voiceprint feature set adapted to the target scene with the extracted voiceprint feature.
Considering that a voiceprint feature extracted from a small audio fragment is less accurate, in this embodiment, the audio fragments of the same object are merged, a voiceprint feature is extracted from the merged audio data, and the target voiceprint feature set adapted to the target scene is updated with the extracted voiceprint feature.
The target voiceprint feature set may be updated with the extracted voiceprint feature by addition or by replacement. Since the voiceprints in the target voiceprint feature set are stored in the voiceprint library on a per-object basis, after a voiceprint feature is extracted through step S902, the extracted voiceprint feature may be added to the voiceprint features of the corresponding object in the voiceprint library. However, since the voiceprint features in the voiceprint library can be updated in real time, adding a voiceprint feature at every update would inevitably cause the amount of data in the voiceprint library to increase rapidly. To avoid this, in another possible implementation, the voiceprint feature of the corresponding object in the voiceprint library may instead be replaced with the voiceprint feature extracted through step S902: specifically, for the object corresponding to the extracted voiceprint feature, if the voiceprint features of that object stored in the voiceprint library include a voiceprint feature obtained from audio data in the target scene, that voiceprint feature is replaced with the voiceprint feature extracted through step S902.
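The append-versus-replace update policy can be sketched with a per-object library keyed by scene; the dictionary layout and scene labels are assumptions for illustration:

```python
def update_voiceprint_library(library, obj, new_vp, replace_target_scene=True):
    """Update an object's entry; each entry keeps a list of (scene, voiceprint)
    pairs. In replace mode, an existing target-scene voiceprint is overwritten
    instead of appended, so the library does not grow on every update."""
    entries = library.setdefault(obj, [])
    if replace_target_scene:
        for i, (scene, _) in enumerate(entries):
            if scene == "target":
                entries[i] = ("target", new_vp)
                return library
    entries.append(("target", new_vp))
    return library

lib = {"speaker_A": [("non-target", [0.1, 0.2])]}
update_voiceprint_library(lib, "speaker_A", [0.3, 0.4])  # no target entry yet: appends
update_voiceprint_library(lib, "speaker_A", [0.5, 0.6])  # replaces the target entry
```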
In addition, the method provided in the embodiments of the present application may further include: transcribing the target audio data in the target scene into text, and adding the information of the identified objects into the text, so that the speech content of each object is associated with the corresponding object and the speech contents of different objects are distinguished. For example, if the target scene is a debate scene, after the objects are identified, the speech contents of different objects, belonging to different debaters, can be distinguished. In addition, in this embodiment, the information of an object (such as the object's name) may also be obtained from the text, and the information of the identified object may be corrected based on the obtained information.
In the present application, since the voiceprint features in the voiceprint library (i.e., the target voiceprint feature set) are adapted to the target scene, they can better match the voiceprint features extracted from the audio data to be identified in the target scene, thereby improving the recognition effect on the objects corresponding to the audio data to be identified in the target scene.
The embodiments of the present application further provide an object recognition apparatus for audio data, which is described below; the object recognition apparatus for audio data described below and the object recognition method for audio data described above may be referred to in correspondence with each other.
Referring to Fig. 10, a schematic structural diagram of an object recognition apparatus for audio data provided in the embodiments of the present application is shown. As shown in Fig. 10, the apparatus may include: an audio data acquisition module 1001, a voiceprint feature set acquisition module 1002, and an object identification module 1003.
The audio data acquisition module 1001 is configured to acquire audio data to be identified in a target scene.
The voiceprint feature set acquisition module 1002 is configured to acquire a target voiceprint feature set adapted to the target scene.
The object identification module 1003 is configured to identify the object corresponding to the audio data to be identified based on the target voiceprint feature set adapted to the target scene.
The object recognition apparatus for audio data provided in the embodiments of the present application can acquire the audio data to be identified in the target scene and the target voiceprint feature set adapted to the target scene, and then identify the object corresponding to the audio data to be identified in the target scene based on that set. Since the voiceprint features in the target voiceprint feature set are adapted to the target scene, they can better match the voiceprint features extracted from the audio data to be identified in the target scene, thereby improving the recognition effect on the objects corresponding to the audio data to be identified in the target scene.
In a possible implementation, the voiceprint feature set acquisition module 1002 in the object recognition apparatus for audio data provided in the above embodiment may include: an audio data acquisition submodule and a voiceprint feature migration submodule.
The audio data acquisition submodule is configured to acquire the target audio data in the target scene, wherein the target audio data at least includes the audio data to be identified.
The voiceprint feature migration submodule is configured to migrate, according to the target audio data, the multiple voiceprint features corresponding to the plurality of pieces of historical audio data collected in advance, to obtain the target voiceprint feature set adapted to the target scene.
In a possible implementation, the voiceprint feature migration submodule includes: an acoustic feature determination submodule and a first voiceprint feature determination submodule.
The acoustic feature determination submodule is configured to obtain, based on the target audio data, the target acoustic features that correspond to the plurality of pieces of historical audio data and can adapt to the target scene.
The first voiceprint feature determination submodule is configured to obtain the voiceprint features corresponding to the multiple objects from the target acoustic features corresponding to the plurality of pieces of historical audio data; the set formed by the voiceprint features corresponding to the multiple objects serves as the target voiceprint feature set adapted to the target scene.
In a possible implementation, the voiceprint feature migration submodule further includes: a second voiceprint feature determination submodule.
The second voiceprint feature determination submodule is configured to perform migration transformation on the voiceprint features corresponding to the multiple objects using the pre-established migration transformation model, to obtain the migration-transformed voiceprint features corresponding to the multiple objects; the set formed by the migration-transformed voiceprint features corresponding to the multiple objects serves as the target voiceprint feature set adapted to the target scene.
In a possible implementation, the acoustic feature determination submodule includes: a first feature extraction submodule, a migration transformation submodule, and a second feature extraction submodule.
The first feature extraction submodule is configured to extract an acoustic feature from the target audio data to obtain the acoustic feature corresponding to the target audio data, and extract acoustic features from the plurality of pieces of historical audio data respectively to obtain the acoustic features corresponding to the plurality of pieces of historical audio data.
The migration transformation submodule is configured to perform migration transformation on the acoustic features corresponding to the plurality of pieces of historical audio data based on the acoustic feature corresponding to the target audio data, to obtain the migration-transformed target acoustic features corresponding to the plurality of pieces of historical audio data, wherein the migration transformation is used to minimize the distance between the acoustic feature corresponding to a piece of historical audio data and the acoustic feature corresponding to the target audio data.
The second feature extraction submodule is configured to extract acoustic features from a plurality of pieces of registration audio data respectively as target acoustic features, to obtain the target acoustic features corresponding to the plurality of pieces of registration audio data.
In a possible implementation, the migration transformation submodule is specifically configured to: for any piece of historical audio data, determine the migration transformation matrix corresponding to this piece of historical audio data based on the acoustic feature corresponding to this piece of historical audio data and the acoustic feature corresponding to the target audio data; and perform migration transformation on the acoustic feature corresponding to this piece of historical audio data using the determined migration transformation matrix, taking the migration-transformed acoustic feature as the target acoustic feature, thereby obtaining the target acoustic features corresponding to the plurality of pieces of historical audio data.
In a possible implementation, when determining the migration transformation matrix corresponding to a piece of audio data based on the acoustic feature corresponding to this piece of audio data and the acoustic feature corresponding to the target audio data, the migration transformation submodule is specifically configured to: obtain the pre-established universal Gaussian mixture model, which includes multiple single Gaussians; calculate the probability of the acoustic feature corresponding to this piece of historical audio data under each single Gaussian and the probability of the acoustic feature corresponding to the target audio data under each single Gaussian; express the distance between the acoustic feature corresponding to this piece of historical audio data and the acoustic feature corresponding to the target audio data using the two sets of probabilities; determine the transformation matrix error by minimizing the distance; and determine the migration transformation matrix corresponding to this piece of historical audio data based on a preset migration transformation matrix and the transformation matrix error.
In another possible implementation, when determining the migration transformation matrix corresponding to a piece of audio data based on the acoustic feature corresponding to this piece of audio data and the acoustic feature corresponding to the target audio data, the migration transformation submodule is specifically configured to: obtain the pre-established universal Gaussian mixture model, which includes multiple single Gaussians; calculate the probability of the acoustic feature corresponding to the target audio data under each single Gaussian; select multiple target single Gaussians from the multiple single Gaussians based on these probabilities, where the probability of the acoustic feature corresponding to the target audio data under each target single Gaussian is greater than its probability under any of the other single Gaussians; express the distance between the acoustic feature corresponding to this piece of historical audio data and the acoustic feature corresponding to the target audio data using the means and variances of the two acoustic features under each target single Gaussian; determine the transformation matrix error by minimizing the distance; and determine the migration transformation matrix corresponding to this piece of historical audio data based on a preset migration transformation matrix and the transformation matrix error.
In a possible implementation, the first voiceprint feature determination submodule is specifically configured to: combine, among the plurality of pieces of historical audio data, the target acoustic features corresponding to the one or more pieces of historical audio data that belong to the same object, to obtain the acoustic features corresponding to the multiple objects; and extract a voiceprint feature from the acoustic features corresponding to each of the multiple objects, to obtain the voiceprint features corresponding to the multiple objects.
In a possible implementation, the migration transformation model is a generative adversarial model; the generative adversarial model includes: a generation module and an adversarial discrimination module.
The generation module is configured to perform migration transformation on an input voiceprint feature and output the migration-transformed voiceprint feature, wherein the input voiceprint features include the voiceprint features corresponding to the plurality of pieces of historical audio data and the voiceprint features corresponding to the target audio data.
The adversarial discrimination module is configured to discriminate, during training, whether the scene to which the migration-transformed voiceprint feature output by the generation module belongs is the target scene or a non-target scene.
In one possible implementation, the object recognition equipment of the audio data further include: audio section determines mould Block, vocal print feature extraction module and vocal print feature update module.
The audio section determining module, for based on the target scene be adapted target vocal print feature set, from The audio section for belonging to same target is determined in the target audio data;
The voiceprint feature extraction module is configured to extract voiceprint features from the audio fragments belonging to the same object;
The voiceprint feature update module is configured to update the target voiceprint feature set adapted to the target scene with the voiceprint features extracted by the voiceprint feature extraction module.
In one possible implementation, the audio segment determination module includes: a valid audio segment acquisition submodule, an audio cutting submodule, an object determination submodule, and an audio fragment determination submodule.
The valid audio segment acquisition submodule is configured to obtain valid audio segments from the target audio data.
The audio cutting submodule is configured to cut the valid audio segments into multiple audio fragments of different objects.
The object determination submodule is configured, for any audio fragment among the multiple audio fragments, to determine the object corresponding to that audio fragment based on the voiceprint feature extracted from it and the target voiceprint feature set adapted to the target scene, so as to obtain the objects corresponding to the multiple audio fragments.
The audio fragment determination submodule is configured to determine the audio segments belonging to the same object based on the objects corresponding to the multiple audio fragments.
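The last two submodules can be sketched together: score each cut fragment's voiceprint against every enrolled print in the scene-adapted set, label it with the best match, and group fragments by object. This is an illustrative sketch only; the function name, cosine scoring, and threshold are assumptions not specified by the patent.

```python
import numpy as np

def assign_segments(segment_prints, target_set, threshold=0.0):
    """Label each audio fragment's voiceprint with the best-matching object
    from the scene-adapted target set (cosine similarity), then group
    fragment indices by object."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    groups = {}
    for i, seg in enumerate(segment_prints):
        scores = {obj: cos(seg, vp) for obj, vp in target_set.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            groups.setdefault(best, []).append(i)   # fragments of the same object
    return groups
```

The grouped fragments per object are exactly what the update module above would pool to refresh that object's voiceprint in the target set.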
An embodiment of the present application further provides an object recognition device for audio data. Referring to Figure 11, which shows a schematic structural diagram of the object recognition device for audio data, the device may include: at least one processor 1101, at least one communication interface 1102, at least one memory 1103, and at least one communication bus 1104;
In this embodiment of the present application, the number of each of the processor 1101, the communication interface 1102, the memory 1103, and the communication bus 1104 is at least one, and the processor 1101, the communication interface 1102, and the memory 1103 communicate with one another via the communication bus 1104;
The processor 1101 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
The memory 1103 may include a high-speed RAM memory, and may further include a non-volatile memory, for example at least one magnetic disk storage;
The memory stores a program that the processor can invoke, and the program is configured to:
obtain audio data to be identified in a target scene, and a target voiceprint feature set adapted to the target scene;
identify an object corresponding to the audio data to be identified based on the target voiceprint feature set adapted to the target scene.
Optionally, for the refinement and extension functions of the program, reference may be made to the description above.
An embodiment of the present application further provides a readable storage medium that can store a program suitable for execution by a processor, the program being configured to:
obtain audio data to be identified in a target scene, and a target voiceprint feature set adapted to the target scene;
identify an object corresponding to the audio data to be identified based on the target voiceprint feature set adapted to the target scene.
Optionally, for the refinement and extension functions of the program, reference may be made to the description above.
Finally, it should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts of the embodiments, reference may be made to one another.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (19)

1. An object identification method for audio data, characterized by comprising:
obtaining audio data to be identified in a target scene, and a target voiceprint feature set adapted to the target scene;
identifying an object corresponding to the audio data to be identified based on the target voiceprint feature set adapted to the target scene.
2. The object identification method for audio data according to claim 1, wherein obtaining the target voiceprint feature set adapted to the target scene comprises:
obtaining target audio data in the target scene, wherein the target audio data includes at least the audio data to be identified;
migrating, according to the target audio data, multiple voiceprint features corresponding to a plurality of pre-collected history audio data items, to obtain the target voiceprint feature set adapted to the target scene.
3. The object identification method for audio data according to claim 2, wherein the plurality of history audio data items are history audio data of multiple objects;
migrating, according to the target audio data, the multiple voiceprint features corresponding to the plurality of pre-collected history audio data items, to obtain the target voiceprint feature set adapted to the target scene, comprises:
obtaining, based on the target audio data, target acoustic features that correspond to the plurality of history audio data items and can adapt to the target scene;
obtaining the voiceprint features corresponding to the multiple objects from the target acoustic features corresponding to the plurality of history audio data items;
taking the set formed by the voiceprint features corresponding to the multiple objects as the target voiceprint feature set adapted to the target scene.
4. The object identification method for audio data according to claim 3, wherein migrating, according to the target audio data, the multiple voiceprint features corresponding to the plurality of pre-collected history audio data items, to obtain the target voiceprint feature set adapted to the target scene, further comprises:
performing migration transformation on the voiceprint features corresponding to the multiple objects using a pre-established migration transformation model, to obtain the migrated voiceprint features corresponding to the multiple objects;
taking the set formed by the migrated voiceprint features corresponding to the multiple objects as the target voiceprint feature set adapted to the target scene.
5. The object identification method for audio data according to claim 3, wherein obtaining, based on the target audio data, the target acoustic features that correspond to the plurality of history audio data items and can adapt to the target scene comprises:
extracting acoustic features from the target audio data to obtain the acoustic features corresponding to the target audio data, and extracting acoustic features from each of the plurality of history audio data items to obtain the acoustic features corresponding to the plurality of history audio data items;
performing migration transformation on the acoustic features corresponding to the plurality of history audio data items based on the acoustic features corresponding to the target audio data, to obtain the migrated target acoustic features that correspond to the plurality of history audio data items and can adapt to the target scene;
wherein the migration transformation minimizes the distance between the acoustic features corresponding to the history audio data and the acoustic features corresponding to the target audio data.
6. The object identification method for audio data according to claim 5, wherein performing migration transformation on the acoustic features corresponding to the plurality of history audio data items based on the acoustic features corresponding to the target audio data comprises:
for any history audio data item:
determining a migration transformation matrix corresponding to the history audio data item based on the acoustic features corresponding to the history audio data item and the acoustic features corresponding to the target audio data;
performing migration transformation on the acoustic features corresponding to the history audio data item using the migration transformation matrix corresponding to the history audio data item, and taking the migrated acoustic features as target acoustic features;
thereby obtaining the target acoustic features corresponding to the plurality of history audio data items.
7. The object identification method for audio data according to claim 6, wherein determining the migration transformation matrix corresponding to the history audio data item based on the acoustic features corresponding to the history audio data item and the acoustic features corresponding to the target audio data comprises:
obtaining a pre-established universal Gaussian mixture model, the universal Gaussian mixture model comprising multiple single Gaussians;
calculating the probability of the acoustic features corresponding to the history audio data item in each single Gaussian and the probability of the acoustic features corresponding to the target audio data in each single Gaussian;
expressing the distance between the acoustic features corresponding to the history audio data item and the acoustic features corresponding to the target audio data using the probability of the acoustic features corresponding to the history audio data item in each single Gaussian and the probability of the acoustic features corresponding to the target audio data in each single Gaussian;
determining a transformation matrix error by minimizing the distance, and determining the migration transformation matrix corresponding to the history audio data item based on a preset migration transformation matrix and the transformation matrix error.
8. The object identification method for audio data according to claim 6, wherein determining the migration transformation matrix corresponding to the history audio data item based on the acoustic features corresponding to the history audio data item and the acoustic features corresponding to the target audio data comprises:
obtaining a pre-established universal Gaussian mixture model, the universal Gaussian mixture model comprising multiple single Gaussians;
calculating the probability of the acoustic features corresponding to the target audio data in each single Gaussian;
selecting multiple target single Gaussians from the multiple single Gaussians based on the probability of the acoustic features corresponding to the target audio data in each single Gaussian, wherein the probability of the acoustic features corresponding to the target audio data in each target single Gaussian is greater than the probability in the other single Gaussians;
expressing the distance between the acoustic features corresponding to the history audio data item and the acoustic features corresponding to the target audio data using the mean and variance of the acoustic features corresponding to the history audio data item in each target single Gaussian and the mean and variance of the acoustic features corresponding to the target audio data in each target single Gaussian;
determining a transformation matrix error by minimizing the distance, and determining the migration transformation matrix corresponding to the history audio data item based on a preset migration transformation matrix and the transformation matrix error.
9. The object identification method for audio data according to claim 3, wherein obtaining the voiceprint features corresponding to the multiple objects from the target acoustic features corresponding to the plurality of history audio data items comprises:
combining, among the plurality of history audio data items, the target acoustic features corresponding to at least one history audio data item of the same object, to obtain the acoustic features corresponding to the multiple objects;
extracting voiceprint features from the acoustic features corresponding to each of the multiple objects, to obtain the voiceprint features corresponding to the multiple objects.
10. The object identification method for audio data according to claim 4, wherein the migration transformation model is a generative adversarial model;
the generative adversarial model includes a generation module and an adversarial discrimination module;
the generation module is configured to perform migration transformation on input voiceprint features and output the migrated voiceprint features, wherein the input voiceprint features include the voiceprint features corresponding to the plurality of history audio data items and the voiceprint feature corresponding to the target audio data;
the adversarial discrimination module is configured, during training, to discriminate whether the scene to which the migrated voiceprint features output by the generation module belong is the target scene or a non-target scene.
11. The object identification method for audio data according to any one of claims 1 to 10, wherein, after obtaining the target voiceprint feature set adapted to the target scene, the method further comprises:
determining the audio segments belonging to the same object from the target audio data based on the target voiceprint feature set adapted to the target scene;
extracting voiceprint features from the audio fragments belonging to the same object, and updating the target voiceprint feature set adapted to the target scene with the extracted voiceprint features.
12. The object identification method for audio data according to claim 11, wherein determining the audio segments belonging to the same object from the target audio data based on the target voiceprint feature set adapted to the target scene comprises:
obtaining valid audio segments from the target audio data;
cutting the valid audio segments into multiple audio fragments of different objects;
for any audio fragment among the multiple audio fragments, determining the object corresponding to the audio fragment based on the voiceprint feature extracted from the audio fragment and the target voiceprint feature set adapted to the target scene, to obtain the objects corresponding to the multiple audio fragments;
determining the audio segments belonging to the same object based on the objects corresponding to the multiple audio fragments.
13. An object recognition apparatus for audio data, characterized by comprising: an audio data acquisition module, a voiceprint feature set acquisition module, and an object identification module;
the audio data acquisition module is configured to obtain audio data to be identified in a target scene;
the voiceprint feature set acquisition module is configured to obtain a target voiceprint feature set adapted to the target scene;
the object identification module is configured to identify an object corresponding to the audio data to be identified based on the target voiceprint feature set adapted to the target scene.
14. The object recognition apparatus for audio data according to claim 13, wherein the voiceprint feature set acquisition module comprises: an audio data acquisition submodule and a voiceprint feature migration submodule;
the audio data acquisition submodule is configured to obtain target audio data in the target scene, wherein the target audio data includes at least the audio data to be identified;
the voiceprint feature migration submodule is configured to migrate, according to the target audio data, multiple voiceprint features corresponding to a plurality of pre-collected history audio data items, to obtain the target voiceprint feature set adapted to the target scene.
15. The object recognition apparatus for audio data according to claim 14, wherein the voiceprint feature migration submodule comprises: an acoustic feature determination submodule and a first voiceprint feature determination submodule;
the acoustic feature determination submodule is configured to obtain, based on the target audio data, target acoustic features that correspond to the plurality of history audio data items and can adapt to the target scene;
the first voiceprint feature determination submodule is configured to obtain the voiceprint features corresponding to the multiple objects from the target acoustic features corresponding to the plurality of history audio data items, and to take the set formed by the voiceprint features corresponding to the multiple objects as the target voiceprint feature set adapted to the target scene.
16. The object recognition apparatus for audio data according to claim 15, wherein the voiceprint feature migration submodule further comprises: a second voiceprint feature determination submodule;
the second voiceprint feature determination submodule is configured to perform migration transformation on the voiceprint features corresponding to the multiple objects using a pre-established migration transformation model, to obtain the migrated voiceprint features corresponding to the multiple objects, and to take the set formed by the migrated voiceprint features corresponding to the multiple objects as the target voiceprint feature set adapted to the target scene.
17. The object recognition apparatus for audio data according to any one of claims 13 to 16, wherein the apparatus further comprises: an audio segment determination module, a voiceprint feature extraction module, and a voiceprint feature update module;
the audio segment determination module is configured to determine, from the target audio data, the audio segments belonging to the same object based on the target voiceprint feature set adapted to the target scene;
the voiceprint feature extraction module is configured to extract voiceprint features from the audio fragments belonging to the same object;
the voiceprint feature update module is configured to update the target voiceprint feature set adapted to the target scene with the voiceprint features extracted by the voiceprint feature extraction module.
18. An object recognition device for audio data, characterized by comprising: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement each step of the object identification method for audio data according to any one of claims 1 to 12.
19. A readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, each step of the object identification method for audio data according to any one of claims 1 to 12 is implemented.
CN201811580955.3A 2018-12-24 2018-12-24 Object identification method, device, equipment and storage medium of audio data Active CN109410956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811580955.3A CN109410956B (en) 2018-12-24 2018-12-24 Object identification method, device, equipment and storage medium of audio data


Publications (2)

Publication Number Publication Date
CN109410956A true CN109410956A (en) 2019-03-01
CN109410956B CN109410956B (en) 2021-10-08

Family

ID=65460822


Country Status (1)

Country Link
CN (1) CN109410956B (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226743A (en) * 2007-12-05 2008-07-23 浙江大学 Method for recognizing speaker based on conversion of neutral and affection sound-groove model
CN103258535A (en) * 2013-05-30 2013-08-21 中国人民财产保险股份有限公司 Identity recognition method and system based on voiceprint recognition
WO2015074411A1 (en) * 2013-11-20 2015-05-28 中兴通讯股份有限公司 Terminal unlocking method, apparatus and terminal
CN105244031A (en) * 2015-10-26 2016-01-13 北京锐安科技有限公司 Speaker identification method and device
CN107610709A (en) * 2017-08-01 2018-01-19 百度在线网络技术(北京)有限公司 A kind of method and system for training Application on Voiceprint Recognition model
CN107680600A (en) * 2017-09-11 2018-02-09 平安科技(深圳)有限公司 Sound-groove model training method, audio recognition method, device, equipment and medium
US20180197547A1 (en) * 2017-01-10 2018-07-12 Fujitsu Limited Identity verification method and apparatus based on voiceprint
CN108305615A (en) * 2017-10-23 2018-07-20 腾讯科技(深圳)有限公司 A kind of object identifying method and its equipment, storage medium, terminal
CN108305633A (en) * 2018-01-16 2018-07-20 平安科技(深圳)有限公司 Speech verification method, apparatus, computer equipment and computer readable storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN Jiangyun: "Research on a Text-Independent Speaker Recognition System", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145772A (en) * 2019-12-28 2020-05-12 广州国音智能科技有限公司 Voice enhancement method, system and equipment
CN111309962A (en) * 2020-01-20 2020-06-19 北京字节跳动网络技术有限公司 Method and device for extracting audio clip and electronic equipment
CN111309962B (en) * 2020-01-20 2023-05-16 抖音视界有限公司 Method and device for extracting audio clips and electronic equipment
CN111653283A (en) * 2020-06-28 2020-09-11 讯飞智元信息科技有限公司 Cross-scene voiceprint comparison method, device, equipment and storage medium
CN111653283B (en) * 2020-06-28 2024-03-01 讯飞智元信息科技有限公司 Cross-scene voiceprint comparison method, device, equipment and storage medium
CN112820300A (en) * 2021-02-25 2021-05-18 北京小米松果电子有限公司 Audio processing method and device, terminal and storage medium
CN112820300B (en) * 2021-02-25 2023-12-19 北京小米松果电子有限公司 Audio processing method and device, terminal and storage medium
CN114400009A (en) * 2022-03-10 2022-04-26 深圳市声扬科技有限公司 Voiceprint recognition method and device and electronic equipment
CN115064176A (en) * 2022-06-22 2022-09-16 广州市迪声音响有限公司 Voiceprint screening system and method

Also Published As

Publication number Publication date
CN109410956B (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN109410956A (en) A kind of object identifying method of audio data, device, equipment and storage medium
US11900947B2 (en) Method and system for automatically diarising a sound recording
CN106057206B (en) Sound-groove model training method, method for recognizing sound-groove and device
JP6993353B2 (en) Neural network-based voiceprint information extraction method and device
CN106251874B (en) A kind of voice gate inhibition and quiet environment monitoring method and system
CN105096941B (en) Audio recognition method and device
CN108305615A (en) A kind of object identifying method and its equipment, storage medium, terminal
CN106504768B (en) Phone testing audio frequency classification method and device based on artificial intelligence
CN108766418A (en) Sound end recognition methods, device and equipment
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN104036774A (en) Method and system for recognizing Tibetan dialects
CN110310647A (en) A kind of speech identity feature extractor, classifier training method and relevant device
CN110544482A (en) single-channel voice separation system
CN114882914A (en) Aliasing tone processing method, device and storage medium
WO2020170907A1 (en) Signal processing device, learning device, signal processing method, learning method, and program
CN109448732A (en) A kind of digit string processing method and processing device
Chandankhede et al. Voice recognition based security system using convolutional neural network
Ng et al. Teacher-student training for text-independent speaker recognition
CN105679323B (en) A kind of number discovery method and system
CN105845131A (en) Far-talking voice recognition method and device
CN110569908B (en) Speaker counting method and system
Bui et al. A non-linear GMM KL and GUMI kernel for SVM using GMM-UBM supervector in home acoustic event classification
JP2019152737A (en) Speaker estimation method and speaker estimation device
Martin-Morato et al. On the robustness of deep features for audio event classification in adverse environments
CN113889081A (en) Speech recognition method, medium, device and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant