CN116486789A - Speech recognition model generation method, speech recognition method, device and equipment - Google Patents

Speech recognition model generation method, speech recognition method, device and equipment

Info

Publication number
CN116486789A
Authority
CN
China
Prior art keywords
voice
sample
similarity
speech
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210048877.2A
Other languages
Chinese (zh)
Inventor
陈勇
王浪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Wuhan Kingsoft Office Software Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Wuhan Kingsoft Office Software Co Ltd
Application filed by Beijing Kingsoft Office Software Inc, Zhuhai Kingsoft Office Software Co Ltd, Wuhan Kingsoft Office Software Co Ltd filed Critical Beijing Kingsoft Office Software Inc

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/07: Adaptation to the speaker
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 2015/0631: Creating reference templates; Clustering
    • Y02T 10/40: Engine management systems (under Y02T, climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a speech recognition model generation method, a speech recognition method, a device and equipment, and relates to the technical field of speech recognition. The method comprises the following steps: acquiring a multi-element voice sample data combination; performing feature extraction on the multi-element voice sample data combination through a feature extraction module of a twin neural network to obtain voice feature vectors representing different sample types; determining the similarity between the voice feature vectors representing different sample types through a similarity calculation module of the twin neural network; training the twin neural network according to the similarity between the voice feature vectors representing different sample types to obtain a trained twin neural network; and constructing a voice recognition model by using the feature extraction module of the trained twin neural network.

Description

Speech recognition model generation method, speech recognition method, device and equipment
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, a device, and equipment for generating a speech recognition model.
Background
Voice is one of the natural attributes of a person. Because of physiological differences in speakers' vocal organs and behavioral differences acquired over time, each person's voice carries strong personal characteristics, and identifying a user's identity by voice has a number of unique advantages: the voice is an inherent characteristic of the user and cannot be lost or forgotten, voice signals are convenient to collect, and the system equipment is inexpensive; compared with other biometric technologies such as fingerprints and gestures, voice is convenient to use and readily accepted by users.
The key to recognizing a user's identity by voice is to use a voice recognition model to extract, from the voice signal, voice features that represent the user; these features reflect the user's inherent characteristics and directly affect the accuracy of identity recognition. In the related art, a voice recognition model is mainly generated by combining spectral features of speech with a probabilistic model to recognize the speaker's identity. Such a model can recognize the user's identity accurately in a noise-free environment, but its structure is very complex, the number of parameters involved is large, it is difficult to extract voice features with strong characterization capability, and the accuracy of recognizing the user's identity by voice is low.
Disclosure of Invention
In view of this, the present application provides a speech recognition model generation method, a speech recognition method, a device and equipment, aiming to solve the problem in the prior art that the speech recognition model has a very complex structure and it is difficult to extract speech features with strong characterization capability.
According to a first aspect of the present application, there is provided a method for generating a speech recognition model, the method comprising:
acquiring a multi-element voice sample data combination, wherein the multi-element voice sample data combination comprises a plurality of voice data with different sample types;
extracting features of the multi-element voice sample data combination through a feature extraction module of the twin neural network to obtain voice feature vectors representing different sample types;
determining the similarity between the voice feature vectors representing different sample types through a similarity calculation module of the twin neural network;
training the twin neural network according to the similarity between the voice feature vectors representing different sample types to obtain a trained twin neural network;
and constructing a voice recognition model by using the trained feature extraction module of the twin neural network.
Further, the acquiring the multi-element voice sample data combination specifically includes:
acquiring voice data of a target speaker as an anchor point sample;
acquiring voice data which is the same as the anchor point sample in attribute dimension representing a speaker as a positive sample;
acquiring voice data which is different from the anchor point sample in attribute dimension representing a speaker as a negative sample;
the anchor point sample, the positive sample, and the negative sample are constructed as the multi-component speech sample data combination.
Further, the acquiring the multi-element voice sample data combination specifically includes:
acquiring voice data of a target speaker as an anchor point sample;
acquiring voice data which is the same as the anchor point sample in attribute dimension representing a speaker as a positive sample;
acquiring voice data which is different from the anchor point sample in attribute dimension representing a speaker as a negative sample;
obtaining, as a hard negative sample, speech data that is different from the target speaker but identical to the anchor point sample in an attribute dimension characterizing at least one speech feature;
the anchor point sample, the positive sample, the negative sample, and the hard negative sample are constructed as the multi-component speech sample data combination.
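As an illustration only (the application itself contains no code), the following Python sketch shows one way such a sample combination could be assembled from a labeled voice data set; the field names `speaker`, `text`, `duration` and `gender`, and the random selection strategy, are assumptions made here for the example.

```python
import random

def build_sample_combination(dataset, target_speaker):
    """Assemble (anchor, positive, negative, hard negative) from a list of
    utterances, each assumed to be a dict with keys 'speaker', 'text',
    'duration', 'gender' and 'audio'."""
    own = [u for u in dataset if u["speaker"] == target_speaker]
    others = [u for u in dataset if u["speaker"] != target_speaker]

    anchor, positive = random.sample(own, 2)   # two utterances of the target speaker

    def shares_attribute(u):
        # same as the anchor in at least one attribute dimension of a speech feature
        return (u["text"] == anchor["text"]
                or u["duration"] == anchor["duration"]
                or u["gender"] == anchor["gender"])

    hard_candidates = [u for u in others if shares_attribute(u)]
    hard_negative = random.choice(hard_candidates) if hard_candidates else None
    # negative: a non-target speaker sharing no attribute dimension with the anchor
    negative = random.choice([u for u in others if not shares_attribute(u)])

    return anchor, positive, negative, hard_negative
```

Selecting the hard negative first, and requiring the negative sample to share no attribute dimension with the anchor, mirrors the selection rule given later in the detailed description.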
Further, the feature extraction module, through the twin neural network, performs feature extraction on the multi-element voice sample data combination to obtain voice feature vectors representing different sample types, and specifically includes:
the following is performed for each sample type of speech data in the multi-element speech sample data combination:
performing Fourier transform on the voice data to obtain corresponding frequency components;
and converting the frequency components into corresponding spectral feature matrices, and inputting the spectral feature matrices into the feature extraction module of the twin neural network for encoding to obtain the corresponding voice feature vectors.
Further, the determining, by the similarity calculation module of the twin neural network, the similarity between the speech feature vectors characterizing different sample types specifically includes:
performing, by the similarity calculation module of the twin neural network, at least one of the following:
calculating the similarity between the voice characteristic vector of the anchor point sample and the voice characteristic vector of the positive sample;
calculating the similarity between the voice feature vector of the anchor point sample and the voice feature vector of the negative sample;
and calculating the similarity between the voice characteristic vector of the anchor point sample and the voice characteristic vector of the hard negative sample.
Further, training the twin neural network according to the similarity between the voice feature vectors representing different sample types to obtain a trained twin neural network, which specifically includes:
carrying the similarity between the voice feature vectors representing different sample types into a preset loss function, and calculating to obtain a loss value;
and under the condition that the loss value is not smaller than a first preset threshold value, adjusting parameters in the feature extraction module, and stopping training until the loss value is smaller than the first preset threshold value, so as to obtain the feature extraction module of the trained twin neural network.
Further, before the similarity between the speech feature vectors representing different sample types is brought into a preset loss function, and a loss value is calculated, the method further includes:
setting learning task requirements for the twin neural network by utilizing the difference between sample types of the voice data;
and setting a loss function of the twin neural network according to the learning task requirement.
Further, the learning task requirement includes: a first learning task requirement and a second learning task requirement;
The first learning task requirement is that the first similarity is greater than the second similarity, the second similarity is greater than the third similarity, and the distance between the first similarity and the third similarity is greater than a second preset threshold; the second learning task requirement is that the first similarity is greater than the third similarity and the distance between the first similarity and the third similarity is greater than a third preset threshold;
wherein the first similarity is a similarity between a speech feature vector of the anchor sample and a speech feature vector of the positive sample; the second similarity is the similarity between the speech feature vector of the anchor sample and the speech feature vector of the hard negative sample; the third similarity is a similarity between the speech feature vector of the anchor sample and the speech feature vector of the negative sample.
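Written compactly, using S1, S2 and S3 for the first, second and third similarity (the notation used in the worked examples later in this description) and writing the second and third preset thresholds as margin2 and margin3 purely for illustration, the two learning task requirements are:

First learning task requirement: S1 > S2 > S3 and S1 - S3 > margin2
Second learning task requirement: S1 > S3 and S1 - S3 > margin3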
According to a second aspect of the present application, there is provided a speech recognition method, in which the speech recognition model used is trained according to the above-described speech recognition model generation method; the speech recognition method comprises the following steps:
acquiring voice data of a user to be identified;
responding to a voice recognition instruction, calling the voice recognition model to perform feature extraction on voice data of the user to be recognized, and obtaining a voice feature vector of the user to be recognized;
Performing feature matching processing on the voice feature vector of the user to be identified and a pre-stored standard voice feature vector, and determining identity information corresponding to the standard voice feature vector matched with the voice feature vector of the user to be identified as the identity information of the user to be identified;
the standard voice feature vector is obtained by calling the voice recognition model to perform feature extraction on voice data of a standard user.
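As a hedged sketch of how the feature matching step could look (the application does not prescribe a particular matching algorithm): cosine similarity, the `enrolled` dictionary of standard vectors, and the 0.7 acceptance threshold are all assumptions chosen for illustration.

```python
import numpy as np

def identify_user(query_vector, enrolled, accept_threshold=0.7):
    """Match the speech feature vector of the user to be identified against the
    pre-stored standard vectors.  `enrolled` maps identity information to a
    standard speech feature vector produced by the same speech recognition
    model on enrollment audio.  Cosine similarity and the 0.7 acceptance
    threshold are illustrative assumptions."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_identity, best_score = None, -1.0
    for identity, standard_vector in enrolled.items():
        score = cosine(query_vector, standard_vector)
        if score > best_score:
            best_identity, best_score = identity, score
    return best_identity if best_score >= accept_threshold else None
```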
According to a third aspect of the present application, there is provided a generation apparatus of a speech recognition model, the apparatus comprising:
a first acquisition unit configured to acquire a plurality of voice sample data combinations including voice data of different sample types;
the extraction unit is used for extracting the characteristics of the multi-element voice sample data combination through a characteristic extraction module of the twin neural network to obtain voice characteristic vectors representing different sample types;
the determining unit is used for determining the similarity between the voice feature vectors representing different sample types through a similarity calculation module of the twin neural network;
the training unit is used for training the twin neural network according to the similarity between the voice feature vectors representing different sample types to obtain a trained twin neural network;
And the construction unit is used for constructing a voice recognition model by using the trained feature extraction module of the twin neural network.
Further, the sample types include an anchor sample, a positive sample, and a negative sample, and the first acquisition unit includes:
the first selecting module is used for acquiring the voice data of the target speaker as an anchor point sample;
the second selecting module is used for acquiring voice data which is the same as the anchor point sample in attribute dimension of the characterization speaker as a positive sample;
a third selecting module, configured to obtain, as a negative sample, speech data different from the anchor point sample in attribute dimension representing a speaker;
a construction module for constructing the anchor point sample, the positive sample and the negative sample as the multi-element voice sample data combination.
Further, the sample types include an anchor sample, a positive sample, a negative sample, and a hard negative sample, and the first acquisition unit includes:
the first selecting module is used for acquiring the voice data of the target speaker as an anchor point sample;
the second selecting module is used for acquiring voice data which is the same as the anchor point sample in attribute dimension of the characterization speaker as a positive sample;
A third selecting module, configured to obtain, as a negative sample, speech data different from the anchor point sample in attribute dimension representing a speaker;
a fourth selection module, configured to obtain, as a hard negative sample, speech data that is different from the target speaker but is the same as the anchor point sample in an attribute dimension that characterizes at least one speech feature;
a construction module for constructing the anchor point sample, the positive sample, the negative sample, and the hard negative sample as the multi-element speech sample data combination.
Further, the extraction unit is specifically configured to perform the following operations for each sample type of speech data in the multi-component speech sample data combination:
performing Fourier transform on the voice data to obtain corresponding frequency components;
and converting the frequency components into corresponding spectral feature matrices, and inputting the spectral feature matrices into the feature extraction module of the twin neural network for encoding to obtain the corresponding voice feature vectors.
Further, the determining unit is specifically configured to perform at least one of the following operations through the similarity calculation module of the twin neural network:
calculating the similarity between the voice characteristic vector of the anchor point sample and the voice characteristic vector of the positive sample;
Calculating the similarity between the voice feature vector of the anchor point sample and the voice feature vector of the negative sample;
and calculating the similarity between the voice characteristic vector of the anchor point sample and the voice characteristic vector of the hard negative sample.
Further, the training unit includes:
the calculation module is used for bringing the similarity between the voice feature vectors representing different sample types into a preset loss function and calculating to obtain a loss value;
and the adjusting module is used for adjusting the parameters in the feature extraction module under the condition that the loss value is not smaller than a first preset threshold value, stopping training until the loss value is smaller than the first preset threshold value, and obtaining the feature extraction module of the trained twin neural network.
Further, the training unit further includes:
the setting module is used for setting learning task requirements for the twin neural network by utilizing the difference between the sample types of the voice data before the similarity between the voice feature vectors representing different sample types is brought into a preset loss function and a loss value is obtained through calculation;
and the construction module is used for setting the loss function of the twin neural network according to the learning task requirement.
Further, the learning task requirement includes: a first learning task requirement and a second learning task requirement;
the first learning task requirement is that the first similarity is greater than the second similarity, the second similarity is greater than the third similarity, and the distance between the first similarity and the third similarity is greater than a second preset threshold; the second learning task requirement is that the first similarity is greater than the third similarity and the distance between the first similarity and the third similarity is greater than a third preset threshold;
wherein the first similarity is a similarity between a speech feature vector of the anchor sample and a speech feature vector of the positive sample; the second similarity is the similarity between the speech feature vector of the anchor sample and the speech feature vector of the hard negative sample; the third similarity is a similarity between the speech feature vector of the anchor sample and the speech feature vector of the negative sample.
According to a fourth aspect of the present application, there is provided a speech recognition apparatus, in which the speech recognition model used is trained according to the above-described speech recognition model generation method; the apparatus comprises:
the second acquisition unit is used for acquiring voice data of the user to be identified;
The calling unit is used for responding to the voice recognition instruction and calling the voice recognition model to perform feature extraction on the voice data of the user to be recognized so as to obtain a voice feature vector of the user to be recognized;
the matching unit is used for carrying out feature matching processing on the voice feature vector of the user to be identified and a pre-stored standard voice feature vector, and determining identity information corresponding to the standard voice feature vector matched with the voice feature vector of the user to be identified as the identity information of the user to be identified;
the standard voice feature vector is obtained by calling the voice recognition model to perform feature extraction on voice data of a standard user.
According to a fifth aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described speech recognition model generation method and speech recognition method.
According to a sixth aspect of the present application, there is provided a speech recognition model generating apparatus, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, the processor implementing the above-described speech recognition model generating method and speech recognition method when executing the program.
By means of the above technical scheme, in the speech recognition model generation method, speech recognition method, device and equipment provided by the present application, a multi-element voice sample data combination is obtained, the combination comprising voice data of a plurality of different sample types; feature extraction is performed on the combination through the feature extraction module of a twin neural network to obtain voice feature vectors representing the different sample types; the similarity between these voice feature vectors is determined through the similarity calculation module of the twin neural network; the twin neural network is trained according to that similarity to obtain a trained twin neural network; and the feature extraction module of the trained twin neural network is used to construct a voice recognition model. Compared with the prior art, in which spectral features are combined with a probabilistic model for user identity recognition, the present application constructs the voice recognition model with a twin neural network: the voices of different speakers are represented as voice data of different sample types, end-to-end learning is then performed on the voice data of each sample type, and the learned similarity between the voice feature vectors of the different sample types reflects the voice characteristics of different speakers. The constructed voice recognition model can therefore extract voice features with strong characterization capability, improving the accuracy of recognizing the user's identity by voice.
The foregoing description is only an overview of the technical solutions of the present application, and may be implemented according to the content of the specification in order to make the technical means of the present application more clearly understood, and in order to make the above-mentioned and other objects, features and advantages of the present application more clearly understood, the following detailed description of the present application will be given.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a schematic flow chart of a method for generating a speech recognition model according to an embodiment of the present application;
fig. 2 shows a block diagram of a twin neural network in a practical application scenario according to an embodiment of the present application;
fig. 3 shows a block diagram of a twin neural network in a practical application scenario according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating another speech recognition method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a device for generating a speech recognition model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another speech recognition model generating apparatus according to an embodiment of the present application;
Fig. 7 shows a schematic structural diagram of a voice recognition device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
In the related art, a voice recognition model is mainly generated by combining spectral features of speech with a probabilistic model to recognize the speaker's identity. Such a method can recognize the user's identity accurately in a noise-free environment, but the structure of the voice recognition model is very complex, the number of parameters involved in recognizing the user's identity by voice is large, it is difficult to extract voice features with strong characterization capability, and the accuracy of recognizing the user's identity by voice is low.
In order to solve the problem, the present embodiment provides a method for generating a speech recognition model, as shown in fig. 1, where the method may be applied to a server side of a speech recognition platform, and includes the following steps:
101. Acquiring a multi-element voice sample data combination.
The multi-element voice sample data combination comprises voice data of a plurality of different sample types, the sample types at least including an anchor sample, a positive sample and a negative sample. The samples can be selected from a voice data set containing different speakers: the anchor sample is one piece of voice data of any speaker selected from the data set, and that speaker is the target speaker; the positive sample is another piece of voice data of the target speaker selected from the data set; the negative sample is voice data of a non-target speaker. Further, to better express the commonality of the same speaker and the differences among different speakers, the sample types may also include a hard negative sample, namely voice data selected from the data set that shares some commonality with the anchor sample in some attribute dimension; for example, voice data with the same text content as the anchor sample, voice data from a speaker of the same gender, or voice data of the same duration.
For example, suppose the voice data set includes 15 pieces of voice data from user A, user B, user C and user D, with 3 pieces of voice data per user. If one piece of voice data of user B is selected from the data set as the anchor sample, user B is the target speaker; the positive sample is another piece of voice data of user B; the negative sample can be a piece of voice data of user A, user C or user D; and the hard negative sample can be a piece of voice data from user A, user C or user D that has the same text content, the same duration or the same speaker gender as the anchor sample of user B.
It can be understood that, considering that the voice data of different speakers has different voice features, the voice data of the speakers can be used as the characterization of the user identity, and the twin neural network is trained by using the voice data of different sample types in the multi-element voice sample data combination, so as to learn the voice features of the same speaker and the voice features of different speakers.
The execution subject of this embodiment may be a speech recognition model generation device, which may specifically be configured on the server side of a voice recognition platform. By connecting to terminals with a voice recognition function, the server can receive user voice data collected by a terminal and identify the user from it. A multi-element voice sample data combination constructed from the voice data of different speakers therefore contains voice data of different sample types and can better represent the voice characteristics of different speakers.
102. And carrying out feature extraction on the multi-element voice sample data combination through a feature extraction module of the twin neural network to obtain voice feature vectors representing different sample types.
The feature extraction module of the twin neural network takes the multi-element voice sample data combination as input and outputs an embedded high-dimensional representation of it. During feature extraction, the module uses a labeling mechanism to model the relationship between the speech components of different speakers in the sample data, so that the representations associated with different speaker labels are pushed apart while the representations associated with the same speaker label are drawn together. This strengthens the expression of what samples of the same speaker have in common and of what distinguishes different speakers, thereby improving the feature extraction capability of the module.
By way of example, the feature extraction module may be any functional module that is capable of extracting features from speech sample data, such as an encoder, a shared encoder, etc.
In the present application, the feature extraction module may be a language model such as BERT. The number of encoders may be set according to the number of sample types contained in the multi-element voice sample data combination, with the voice data of each sample type input to one encoder; the encoders share their weight parameters during feature extraction and each outputs the voice feature vector of the corresponding sample type. For example, for sample types consisting of an anchor sample, a positive sample and a negative sample, one encoder may be set up for each of the three, with the weight parameters of the encoders shared during feature extraction.
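A minimal PyTorch sketch of the weight-sharing idea follows. It is not the patent's implementation: the tiny `SharedEncoder` stands in for the BERT-style encoder mentioned above, and reusing one module instance for every branch is simply one common way to realize shared weight parameters.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Toy stand-in for the encoder: maps a spectral feature matrix
    (batch, time, freq_bins) to a fixed-size speech feature vector."""
    def __init__(self, freq_bins=128, embed_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(freq_bins, embed_dim), nn.ReLU(),
                                  nn.Linear(embed_dim, embed_dim))

    def forward(self, spectrum):
        frame_embeddings = self.proj(spectrum)   # encode each time frame
        return frame_embeddings.mean(dim=1)      # pool over time -> (batch, embed_dim)

encoder = SharedEncoder()
# Reusing the same module instance for every branch realizes the shared weights.
spec_anchor, spec_positive, spec_negative = (torch.randn(1, 50, 128) for _ in range(3))
embed_anchor = encoder(spec_anchor)
embed_positive = encoder(spec_positive)
embed_negative = encoder(spec_negative)
```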
103. And determining the similarity between the voice feature vectors representing different sample types through a similarity calculation module of the twin neural network.
The similarity calculation module of the twin neural network is used to calculate the similarity between the voice feature vectors representing different sample types: it takes the voice feature vectors of the different sample types as input and outputs the similarity between them. The similarity value can, to a certain extent, reflect the difference between the voice features of the same speaker and those of different speakers; in general, the similarity between the voice feature vectors of sample types from the same speaker is higher, and the similarity between those of different speakers is lower.
In consideration of different sample types, the speech feature vector of the anchor sample can be used as a reference value, and similarity calculation can be performed on the speech feature vector of the anchor sample and the speech feature vectors of other sample types respectively to obtain similarity between any two vectors, wherein the similarity calculation mode can be distance between the two speech feature vectors, for example, euclidean distance.
104. Training the twin neural network according to the similarity between the voice feature vectors representing different sample types to obtain the trained twin neural network.
In particular, during training the voice data of the different sample types are usually labeled in advance; a label can indicate whether two pieces of voice data come from the same speaker or whether they share the same attribute dimension (for example the same text content, the same speaker gender, or the same duration). The twin neural network outputs the similarity between the voice feature vectors of the different sample types; from this similarity it can be judged whether the voice data of different sample types come from the same speaker, and whether this judgment meets the required condition serves as the basis for stopping training. Training continues until the similarity between the voice feature vectors output by the twin neural network satisfies the stopping condition, yielding the trained twin neural network.
The speaker of each piece of voice data in the data set, and its attribute dimensions, can be marked in advance by manual annotation; whether two pieces of voice data come from the same speaker is determined from the speaker label, and whether they share the same attribute dimension is determined from the attribute dimension labels.
105. And constructing a voice recognition model by using the feature extraction module of the trained twin neural network.
It can be understood that, since the feature extraction module of the trained twin neural network has better voice feature extraction capability, the voice features of the same speaker can be identified and the voice features of different speakers can be distinguished, and the voice features representing the speaker can be accurately extracted from the voice data by using the voice recognition model constructed by the feature extraction module of the trained twin neural network, and the voice features can represent the identity information of the speaker to a certain extent and can be applied to the scene of speaker identity recognition, for example, the voice data of the same speaker can be identified from a plurality of voice data from different speakers or the voice data of a specific speaker can be identified.
According to the method for generating the voice recognition model, the multiple voice sample data combinations are obtained, the multiple voice sample data combinations comprise a plurality of voice data of different sample types, the feature extraction module of the twin neural network is used for extracting features of the multiple voice sample data combinations to obtain voice feature vectors representing different sample types, the similarity between the voice feature vectors representing different sample types is determined through the similarity calculation module of the twin neural network, the twin neural network is trained according to the similarity between the voice feature vectors representing different sample types, the trained twin neural network is obtained, and the feature extraction module of the trained twin neural network is used for constructing the voice recognition model. Compared with the prior art that the voice spectrum features are combined with the probability model to carry out user identity recognition, the method and the device have the advantages that the twin neural network is used to construct the voice recognition model, voices of different speakers are represented as voice data of different sample types, then the end-to-end technology is used to learn the voice data of the different sample types respectively, similarity among voice feature vectors of the different sample types is obtained through learning to reflect the voice features of the different speakers, and therefore the constructed voice recognition model can extract voice features with strong characterization capability, and accuracy of voice recognition user identity is improved.
Further, as a refinement and extension of the foregoing embodiment, in order to fully describe the implementation process of this embodiment, in some embodiments, the foregoing multi-element voice sample data combination includes an anchor sample, a positive sample, and a negative sample, and voice data of different sample types are selected from the voice data set, where, considering diversity of the voice sample data, the voice data in the voice data set should include voice feature attributes as much as possible, so that the voice data set may cover a wider sample type, for example, the voice data set may include voice data of different text contents and different voice durations of the same speaker, or may include voice data of the same text contents and the same voice duration of different speakers.
Specifically, in selecting the voice data of different sample types from the voice data set, step 101 may include: the method comprises the steps of obtaining voice data of a target speaker as an anchor point sample, obtaining voice data which is the same as the anchor point sample in attribute dimension representing the speaker as a positive sample, obtaining voice data which is different from the anchor point sample in attribute dimension representing the speaker as a negative sample, and further constructing the anchor point sample, the positive sample and the negative sample into a multi-element voice sample data combination. It should be noted that, the voice data of the target speaker may be obtained as an anchor sample, and the voice data of the target speaker may be selected from the voice data set as the anchor sample, or the voice data of the target speaker may be collected in real time as the anchor sample. The positive sample, the negative sample and the hard negative sample can be obtained in the same way, namely can be selected from a voice data set, and can also be collected in real time.
Further, considering the diversity of sample types, the sample types include hard negative samples in addition to anchor samples, positive samples and negative samples, and step 101 may specifically include: the method comprises the steps of obtaining voice data of a target speaker as an anchor point sample, obtaining voice data which is the same as the anchor point sample in attribute dimension representing the speaker as a positive sample, obtaining voice data which is different from the anchor point sample in attribute dimension representing the speaker as a negative sample, obtaining voice data which is different from the target speaker but is the same as the anchor point sample in attribute dimension representing at least one voice feature as a hard negative sample, and further constructing the anchor point sample, the positive sample, the negative sample and the hard negative sample into a multi-element voice sample data combination.
The attribute dimension characterizing the speech feature herein may be speaker information, speech text content, speech duration, speech intonation, speech emotion, and the like. It can be appreciated that, since the hard negative sample is the same voice data as the anchor point sample in the attribute dimension representing a certain voice feature, the effect of enhancing the representation capability and distinguishing capability of the model can be achieved in the training process, and the training effect of the model can be improved.
For example, suppose the speech data set includes speech data of speakers A1 to A10, each with at least 5 pieces of speech data. One piece of speech data of speaker A1 is selected as the anchor sample; the positive sample is another piece of speech data of speaker A1; the negative sample is a piece of speech data of a speaker other than A1, i.e. one of speakers A2 to A10; and the hard negative sample is a piece of speech data of a speaker other than A1 that shares some attribute dimension of a speech feature with the anchor sample, i.e. a piece of speech data of one of speakers A2 to A10 whose text content matches that of the anchor sample, or whose duration matches it, or whose speaker has the same gender as speaker A1.
It should be noted that, in the case that the sample type includes a hard negative sample, in order to avoid that the negative sample and the hard negative sample are selected to the same voice data, it may be set that the selection of the hard negative sample from the voice data set is prioritized, and then the negative sample is selected from the voice data set, and the selection condition of the negative sample from the voice data set may be further defined, that is, the voice data of the non-target speaker selected by the negative sample needs not to have any commonality with the voice data of the target speaker, that is, the voice data of the negative sample and the voice data of the anchor sample have no commonality in each attribute dimension.
In some embodiments, step 102 is further expanded as follows: for each sample type of voice data in the multi-element voice sample data combination, a Fourier transform is performed on the voice data to obtain the corresponding frequency components, the frequency components are converted into a corresponding spectral feature matrix, and the spectral feature matrix is input to the feature extraction module of the twin neural network for encoding to obtain the corresponding voice feature vector.
Typically, the voice data in the data set can be visualized, from the amplitudes and the sampling frequency, as a graph of amplitude over time, i.e. a time-domain representation of the voice data; these amplitudes alone, however, are not very informative, as they only record the response of the recording. To better understand the voice data, the frequency domain can indicate how many different frequency components the voice contains. The Fourier transform is a way of converting voice data from the time domain to the frequency domain: it decomposes the voice data into sound waves of different frequencies, thereby obtaining the corresponding frequency components.
It can be understood that moving from the time domain to the frequency domain allows the various frequency components contained in the voice data, such as amplitude, power, energy and phase relations, i.e. the spectral characteristics of the signal, to be analyzed. Further, to make the input parameters of the twin neural network uniform, the frequency components can be converted into corresponding spectral feature matrices, which are input to the feature extraction module of the twin neural network for encoding to obtain the corresponding voice feature vectors.
The feature extraction module is composed of a plurality of encoders with shared weight parameters. Concretely, after the voice data of the different sample types are Fourier-transformed, spectral feature matrices of the same dimension are formed, and the spectral feature matrix of each sample type is input into one encoder for training, yielding the voice feature vectors representing the different sample types. In one example, the sample types may contain only an anchor sample, a positive sample and a negative sample; in that case the number of encoders (e.g., BERT encoders) is three, and the voice data of each sample type is input to one BERT encoder. Correspondingly, the three encoders output the voice feature vectors of the corresponding sample types.
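For illustration, here is a minimal NumPy sketch of turning a waveform into such a spectral feature matrix via a short-time Fourier transform; the frame length, hop size, Hann window and magnitude representation are assumptions, not values given in the application.

```python
import numpy as np

def spectral_feature_matrix(waveform, frame_len=400, hop=160):
    """Slice the waveform into frames, Fourier-transform each frame and stack
    the magnitudes into a (num_frames, frame_len // 2 + 1) matrix.  Frame
    length, hop size and the Hann window are illustrative choices."""
    window = np.hanning(frame_len)
    frames = [waveform[i:i + frame_len] * window
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

# Example: 1 second of audio at 16 kHz -> matrix fed to the shared encoder.
matrix = spectral_feature_matrix(np.random.randn(16000))
print(matrix.shape)  # (98, 201)
```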
In some embodiments, the step 103 is further developed, where the similarity calculation module may calculate the cosine similarity between the speech feature vectors of any two sample types, and the obtained cosine similarity may represent the difference between the two sample types to a certain extent, and in general, there should be a low difference between sample types representing the same speaker, such as an anchor sample and a positive sample, and a high difference between sample types representing different speakers, such as an anchor sample and a negative sample.
For the sample types contained in the multi-element voice sample data combination, the similarity calculation module of the twin neural network performs at least one of the following operations: calculating the similarity between the voice feature vector of the anchor sample and that of the positive sample; calculating the similarity between the voice feature vector of the anchor sample and that of the negative sample; and calculating the similarity between the voice feature vector of the anchor sample and that of the hard negative sample.
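A short sketch of these pairwise similarity computations, assuming cosine similarity (as in the worked examples below) and PyTorch embeddings:

```python
import torch
import torch.nn.functional as F

# Illustrative embeddings; in practice they come from the shared encoder above.
embed_anchor, embed_pos, embed_neg, embed_hard = (torch.randn(1, 256) for _ in range(4))

s1 = F.cosine_similarity(embed_anchor, embed_pos)   # anchor vs. positive
s3 = F.cosine_similarity(embed_anchor, embed_neg)   # anchor vs. negative
s2 = F.cosine_similarity(embed_anchor, embed_hard)  # anchor vs. hard negative (if present)
```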
In some embodiments, the step 104 may further include: and carrying the similarity between the voice feature vectors representing different sample types into a preset loss function, calculating to obtain a loss value, and adjusting parameters in the feature extraction module under the condition that the loss value is not smaller than a first preset threshold value until the loss value is smaller than the first preset threshold value, stopping training, and obtaining the feature extraction module of the trained twin neural network.
Considering the training effect of the twin neural network, the multiple voice sample data are combined to carry a sample type label of the voice data, the sample type label can reflect the difference between sample types of the voice data, the learning task requirement can be set for the twin neural network by utilizing the difference between the sample types of the voice data, and the loss function of the twin neural network can be set according to the learning task requirement.
It can be understood that, since the anchor sample and the positive sample are voice data of the same speaker, the anchor sample and the hard negative sample are voice data that match in the attribute dimension of some voice feature, and the anchor sample and the negative sample are voice data of different speakers, the differences between the sample types reflect different similarity relations. Sample types of voice data from the same speaker should differ little, i.e. the similarity between the voice feature vector of the anchor sample and that of the positive sample should be high; sample types of voice data from different speakers should differ greatly, i.e. the similarity between the voice feature vector of the anchor sample and that of the negative sample should be low; and the difference for voice data from different speakers that nevertheless share an attribute dimension lies between the two, i.e. the similarity between the voice feature vector of the anchor sample and that of the hard negative sample lies between the two similarities above.
In combination with the difference between the different sample types reflected by the similarity, as an example, for the case that the sample types include an anchor sample, a positive sample, a negative sample, and a hard negative sample, the setting of the loss function needs to satisfy the similarity relationship between the speech feature vectors of the different sample types in the following learning task requirements: the first similarity is greater than the second similarity, the second similarity is greater than the third similarity, and the distance between the first similarity and the third similarity is greater than a second preset threshold, the first similarity is the similarity between the speech feature vector of the anchor sample and the speech feature vector of the positive sample, the second similarity is the similarity between the speech feature vector of the anchor sample and the speech feature vector of the hard negative sample, and the third similarity is the similarity between the speech feature vector of the anchor sample and the speech feature vector of the negative sample.
In combination with the difference between the different sample types reflected by the similarity, as another example, for the case where the sample types include an anchor sample, a positive sample, and a negative sample, the setting of the loss function needs to satisfy the similarity relationship between the speech feature vectors of the different sample types in the following learning task requirements: the first similarity is greater than the third similarity, the distance between the first similarity and the third similarity is greater than a third preset threshold, the first similarity is the similarity between the voice feature vector of the anchor sample and the voice feature vector of the positive sample, and the third similarity is the similarity between the voice feature vector of the anchor sample and the voice feature vector of the negative sample.
The second preset threshold and the third preset threshold may be set according to an actual scene, and the two may be the same value or different values.
It can be understood that the loss value output by the loss function of the twin neural network reflects the characteristics that distinguish different speakers. During training, the parameters of the twin neural network are adjusted according to the loss value so as to strengthen the network's representation of the same speaker while weakening its representation of different speakers, giving the network a stronger ability to extract the voice characteristics of different speakers.
The first preset threshold and the third preset threshold may be set to a value of typically 0 or approaching 0 according to the actual scene.
In addition, each time the parameters in the feature extraction module are adjusted, the similarity between the voice feature vectors of the different sample types changes and so does the calculated loss value; after long training the loss value may still not fall below the first preset threshold. In this case, while still guaranteeing the training effect, the training time can be shortened by setting a maximum number of iterations for stopping training. For example, with the iteration number set to 10, if the loss value output by the loss function is still not smaller than the first preset threshold after 10 iterations of training, the parameters in the feature extraction module are no longer adjusted, training stops, and the feature extraction module of the trained twin neural network is obtained.
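A hedged sketch of this stopping logic follows; the Adam optimizer, learning rate and threshold value are illustrative assumptions, and the iteration cap of 10 simply follows the example above.

```python
import torch

def train(encoder, batches, loss_fn, first_threshold=1e-3, max_iterations=10):
    """Adjust the feature extractor until the loss falls below the first preset
    threshold, or give up after `max_iterations` passes over the data."""
    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
    for _ in range(max_iterations):
        for anchor, positive, negative in batches:
            loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
            if loss.item() < first_threshold:
                return encoder                    # training-stop condition reached
            optimizer.zero_grad()
            loss.backward()                       # adjust parameters of the feature extractor
            optimizer.step()
    return encoder                                # iteration cap reached; stop anyway
```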
In a practical application scenario, taking a shared encoder as the feature extraction module, the structure of the twin neural network is shown in fig. 2. The input layer of the twin neural network receives voice data of different sample types; from left to right the inputs are an anchor sample (voice data of speaker 1), a positive sample (voice data of speaker 1) and a negative sample (voice data of speaker 3). The feature extraction module is trained on the voice data of each sample type to obtain voice feature vectors Embed1, Embed2 and Embed4 of the same dimension, representing the different sample types, and the similarity calculation module then computes the similarity between the voice feature vectors representing the different sample types, obtaining:
S1(anc,pos)=cosine(Embed1,Embed2)
S3(anc,neg)=cosine(Embed1,Embed4)
Wherein S1(anc, pos) is the similarity between the voice feature vector Embed1 of the anchor sample and the voice feature vector Embed2 of the positive sample, set as the first similarity; S3(anc, neg) is the similarity between the voice feature vector Embed1 of the anchor sample and the voice feature vector Embed4 of the negative sample, set as the third similarity. Cosine similarity between vectors is used here. It is desirable that the first similarity be greater than the third similarity, S1(anc, pos) > S3(anc, neg), and that the distance between the first and third similarities be greater than a third preset threshold, i.e. the distance between S1(anc, pos) and S3(anc, neg) is maximized: S1(anc, pos) > S3(anc, neg) + margin, where margin is a hyperparameter in (0, 1). The final loss function is: Loss = L1 + L2, where L1 = max(S3 - S1 + margin, 0) and L2 = max(S3, 0), i.e. Loss = max(S3 - S1 + margin, 0) + max(S3, 0).
The closer the loss value output by the loss function is to 0, the better the similarity relation between the voice feature vectors of the different sample types conforms to the learning task, meaning that the parameters of the twin neural network have reached the training target and need no further adjustment; otherwise, the similarity relation does not meet the learning task requirement and the parameters must be adjusted according to the loss value. Specifically, for the loss function Loss: the value output by L1 reflects both the distance between S1 and S3 and their relative magnitude. When S1 is greater than S3 and the gap between them exceeds the preset threshold margin, i.e. S1 - S3 > margin, equivalently S3 - S1 + margin < 0, the learning task requirement is met and L1 outputs 0; conversely, when S1 is less than S3 the requirement is not met, S3 - S1 + margin is necessarily greater than zero, and L1 outputs the sum of that difference and margin. Similarly, the value output by L2 reflects the relation between S3 and 0: when S3 approaches 0, L2 outputs 0 and the learning task requirement is met; otherwise it is not.
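The loss in this example can be written down directly from S1, S3 and margin. A minimal sketch in PyTorch (the margin value 0.2 is an arbitrary illustration):

```python
import torch
import torch.nn.functional as F

def triplet_style_loss(embed_anchor, embed_pos, embed_neg, margin=0.2):
    """Loss = max(S3 - S1 + margin, 0) + max(S3, 0), with
    S1 = cos(anchor, positive) and S3 = cos(anchor, negative), as above."""
    s1 = F.cosine_similarity(embed_anchor, embed_pos, dim=-1)
    s3 = F.cosine_similarity(embed_anchor, embed_neg, dim=-1)
    l1 = torch.clamp(s3 - s1 + margin, min=0)
    l2 = torch.clamp(s3, min=0)
    return (l1 + l2).mean()
```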
As a further extension of the application scenario, again taking a shared encoder as the feature extraction module, the structure of the twin neural network is shown in fig. 3. The inputs, from left to right, are an anchor sample (voice data of speaker 1), a positive sample (voice data of speaker 1), a hard negative sample (voice data of speaker 2) and a negative sample (voice data of speaker 3). The feature extraction module encodes the voice data of each sample type into voice feature vectors Embed1, Embed2, Embed3 and Embed4 of the same dimension, and the similarity calculation module then computes the similarities between the voice feature vectors of the different sample types:
S1(anc,pos)=cosine(Embed1,Embed2)
S2(anc,hard)=cosine(Embed1,Embed3)
S3(anc,neg)=cosine(Embed1,Embed4)
Wherein S1(anc, pos) is the similarity between the voice feature vector Embed1 of the anchor sample and the voice feature vector Embed2 of the positive sample, set as the first similarity; S2(anc, hard) is the similarity between Embed1 and the voice feature vector Embed3 of the hard negative sample, set as the second similarity; S3(anc, neg) is the similarity between Embed1 and the voice feature vector Embed4 of the negative sample, set as the third similarity. Cosine similarity between vectors is used here. It is desired that the first similarity is greater than the second similarity and the second similarity is greater than the third similarity, namely S1(anc, pos) > S2(anc, hard) > S3(anc, neg), and that the distance between the first similarity and the third similarity is greater than a second preset threshold; that is, the distance between S1(anc, pos) and S3(anc, neg) is to be enlarged so that S1(anc, pos) > S3(anc, neg) + margin, where margin is a hyper-parameter taking a value in (0, 1). The final loss function is: Loss = L1 + L2, where L1 = max(S3 - S1 + margin, 0) and L2 = max(S3 - S2, 0), that is, Loss = max(S3 - S1 + margin, 0) + max(S3 - S2, 0).
Specifically, for this loss function, the detailed description of L1 is the same as above. The value output by L2 reflects the magnitude relation between S2 and S3: when S2 is greater than S3, L2 outputs 0 and the learning task requirement is met; when S2 is smaller than S3, the requirement is not met and L2 outputs the difference S3 - S2.
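Under the same assumptions as the previous sketch (PyTorch tensors, margin as an example value, not part of the patent itself), the extended loss with a hard negative sample could be written as:

import torch
import torch.nn.functional as F

def quadruplet_cosine_loss(embed_anchor, embed_pos, embed_hard, embed_neg, margin=0.2):
    # S1 = cosine(anchor, positive), S2 = cosine(anchor, hard negative), S3 = cosine(anchor, negative)
    s1 = F.cosine_similarity(embed_anchor, embed_pos, dim=-1)
    s2 = F.cosine_similarity(embed_anchor, embed_hard, dim=-1)
    s3 = F.cosine_similarity(embed_anchor, embed_neg, dim=-1)
    l1 = torch.clamp(s3 - s1 + margin, min=0.0)   # max(S3 - S1 + margin, 0)
    l2 = torch.clamp(s3 - s2, min=0.0)            # max(S3 - S2, 0)
    return (l1 + l2).mean()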
Further, as an application of the above speech recognition model, the present application also provides a speech recognition method, as shown in fig. 4. The method may be applied to a speech recognition terminal and includes the following steps:
201. Acquire voice data of the user to be recognized.
The voice data of the user to be recognized is voice data collected from a user whose identity information is unknown; it can be obtained through a terminal device or a voice collection device.
202. In response to a voice recognition instruction, call the voice recognition model to perform feature extraction on the voice data of the user to be recognized, obtaining a voice feature vector of the user to be recognized.
The speech recognition model is trained using the above method for generating a speech recognition model. Specifically, the speech recognition model is called to perform feature extraction on the user's voice data, obtaining a voice feature vector characterizing the user to be recognized; this vector serves as the basis for judging the identity of the user to be recognized. The specific feature extraction process is described in detail above and is not repeated here.
203. Perform feature matching between the voice feature vector of the user to be recognized and the pre-stored standard voice feature vectors, and determine the identity information corresponding to the standard voice feature vector matching the voice feature vector of the user to be recognized as the identity information of the user to be recognized.
The standard voice feature vectors are obtained by calling the voice recognition model to perform feature extraction on the voice data of standard users, where a standard user is a user whose identity information has already been collected and recorded. By matching the voice feature vector of the user to be recognized against the pre-stored standard voice feature vectors, it can be judged whether the user to be recognized is a standard user, and the identity information corresponding to the matched standard voice feature vector is determined as the identity information of the user to be recognized.
In general, a voice platform pre-stores a large number of standard voice feature vectors, each corresponding to the identity information of one standard user. In the feature matching process, the voice feature vector of the user to be recognized can be compared with the standard feature vectors of the standard users one by one until a matching standard feature vector is found, and the identity information of the corresponding standard user is then taken as the identity information of the user to be recognized.
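A minimal sketch of this one-by-one matching follows, assuming the feature vectors are NumPy arrays and using cosine similarity; the acceptance threshold is an illustrative assumption, not a value specified by this application.

import numpy as np

def identify_user(query_vec, standard_vecs, identities, threshold=0.8):
    # Compare the vector of the user to be recognized with each pre-stored standard vector
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    for vec, identity in zip(standard_vecs, identities):
        v = np.asarray(vec, dtype=float)
        v = v / np.linalg.norm(v)
        if float(np.dot(q, v)) >= threshold:   # matching standard feature vector found
            return identity                    # its identity information is returned
    return None                                # no stored standard user matches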
The feature matching process can be applied both to judging whether two pieces of voice data come from the same user and to judging whether the speaker of a piece of voice already exists in an existing voice library; different recognition modes are selected for different application scenarios. For judging whether voice comes from a target user, the voice feature vectors extracted from the two pieces of voice can each be compared with the voice feature vector of the target user, and if both comparison results are consistent, the two pieces of voice data can be judged to come from the target user. For the scenario of identifying a user's identity, the voice feature vector characterizing the user to be recognized is matched against the standard voice feature vectors of standard users to judge whether the user is a standard user.
It can be understood that the voice data of the user to be recognized may come from one user or from a plurality of users, so at least one voice feature vector characterizing the user to be recognized may be extracted, and the user's identity cannot be determined from the voice feature vector alone. For voice data of a single user, the voice feature vector characterizing the user to be recognized can be compared with the pre-stored voice feature vectors of standard users, and the identity information of the user to be recognized is then determined from the comparison result. For voice data of multiple users, some of the extracted voice feature vectors may characterize the same user, which would require repeating a large number of comparisons. To save comparison time, the voice feature vectors characterizing the users to be recognized can first be clustered and mapped to different users to be recognized, and each clustered voice feature vector is then compared with the pre-stored voice feature vectors of standard users to identify the users.
For example, suppose the voice data of the users to be recognized consists of 10 utterances. Feature extraction is performed on the 10 utterances by the feature extraction module of the twin neural network to obtain 10 voice feature vectors characterizing the users to be recognized. After clustering, the voice feature vectors are mapped to user A, user B and user C, whose identity information is still unknown at this point. The clustered voice feature vectors are then matched against the pre-stored voice feature vectors of standard users; if user A, user B or user C matches a standard user, the identity information corresponding to that standard user is determined to be the identity information of user A, user B or user C.
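For the multi-user case just described, a hedged sketch of clustering followed by matching might look like the following; k-means with a known number of speakers (3 in the example) and the similarity threshold are illustrative assumptions only, not choices made by this application.

import numpy as np
from sklearn.cluster import KMeans

def cluster_then_identify(query_vecs, standard_vecs, identities, n_speakers=3, threshold=0.8):
    # Cluster the extracted vectors first, then match each cluster centroid against stored vectors
    X = np.asarray(query_vecs, dtype=float)
    S = np.asarray(standard_vecs, dtype=float)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    centroids = KMeans(n_clusters=n_speakers, n_init=10).fit(X).cluster_centers_
    results = {}
    for label, c in enumerate(centroids):
        sims = S @ (c / np.linalg.norm(c))     # cosine similarity to every standard vector
        best = int(np.argmax(sims))
        results[label] = identities[best] if sims[best] >= threshold else None
    return results                             # cluster label -> matched identity, or None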
Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides a device for generating a speech recognition model, as shown in fig. 5, where the device includes: a first acquisition unit 31, an extraction unit 32, a determination unit 33, a training unit 34, a construction unit 35.
A first obtaining unit 31 that may be used to obtain a plurality of voice sample data combinations including voice data of different sample types;
The extracting unit 32 may be configured to perform feature extraction on the multi-element speech sample data combination by using a feature extracting module of the twin neural network, so as to obtain speech feature vectors representing different sample types;
a determining unit 33, configured to determine, by means of a similarity calculation module of the twin neural network, a similarity between the speech feature vectors characterizing different sample types;
the training unit 34 may be configured to train the twin neural network according to the similarity between the speech feature vectors representing different sample types, to obtain a trained twin neural network;
a construction unit 35 may be configured to construct a speech recognition model using the trained feature extraction module of the twin neural network.
With the above device for generating a speech recognition model, a multi-element voice sample data combination containing voice data of different sample types is obtained; the feature extraction module of the twin neural network extracts features from the combination to obtain voice feature vectors representing the different sample types; the similarity calculation module of the twin neural network determines the similarities between these voice feature vectors; the twin neural network is trained according to these similarities to obtain a trained twin neural network; and the trained feature extraction module of the twin neural network is used to construct a speech recognition model. Compared with the prior art, in which spectral speech features are combined with a probability model to recognize user identity, the present application constructs the speech recognition model with a twin neural network: the voices of different speakers are represented as voice data of different sample types, an end-to-end technique learns the voice data of each sample type, and the similarities between the voice feature vectors of different sample types are learned to reflect the voice characteristics of different speakers. The constructed speech recognition model can therefore extract voice features with strong characterization capability, improving the accuracy of recognizing a user's identity from voice.
In a specific application scenario, as shown in fig. 6, the first obtaining unit 31 includes:
the first selection module 311 may be configured to obtain voice data of a target speaker as an anchor sample;
a second selection module 312, configured to obtain, as a positive sample, speech data identical to the anchor point sample in attribute dimension representing a speaker;
a third selection module 313, configured to obtain, as a negative sample, speech data different from the anchor point sample in attribute dimension representing a speaker;
a construction module 314 may be configured to construct the anchor sample, the positive sample, and the negative sample as the multi-component speech sample data combination.
In a specific application scenario, as shown in fig. 6, the first obtaining unit 31 includes:
the first selection module 311 may be configured to obtain voice data of a target speaker as an anchor sample;
a second selection module 312, configured to obtain, as a positive sample, speech data identical to the anchor point sample in attribute dimension representing a speaker;
a third selection module 313, configured to obtain, as a negative sample, speech data different from the anchor point sample in attribute dimension representing a speaker;
A fourth selection module 315, which may be configured to obtain, as a hard negative sample, speech data that is different from the target speaker but is the same as the anchor point sample in an attribute dimension that characterizes at least one speech feature;
a construction module 314 may be configured to construct the anchor point sample, the positive sample, the negative sample, and the hard negative sample as the multi-component speech sample data combination.
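To make the roles of these selection modules concrete, here is a small sketch of assembling one anchor/positive/hard-negative/negative combination. The corpus layout (a dict mapping each speaker to (attribute, voice_data) pairs) and the notion of a shared speech attribute are illustrative assumptions for this sketch, not a data format defined by this application.

import random

def build_sample_combination(corpus, target_speaker, shared_attribute):
    # anchor and positive: two utterances from the target speaker
    anchor = random.choice(corpus[target_speaker])[1]
    positive = random.choice(corpus[target_speaker])[1]
    others = [spk for spk in corpus if spk != target_speaker]
    # hard negative: a different speaker whose utterance shares at least one speech attribute
    hard_speaker = random.choice(others)
    hard_candidates = [v for attr, v in corpus[hard_speaker] if attr == shared_attribute]
    hard_negative = random.choice(hard_candidates) if hard_candidates else None
    # negative: any utterance from a different speaker
    negative = random.choice(corpus[random.choice(others)])[1]
    return anchor, positive, hard_negative, negative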
In a specific application scenario, the extracting unit 32 may be specifically configured to perform the following operations for each sample type of the multi-element speech sample data combination:
performing Fourier transform on the voice data to obtain corresponding frequency components;
and converting the frequency components into a corresponding spectrum feature matrix, and inputting the spectrum feature matrix into the feature extraction module of the twin neural network for encoding, to obtain the corresponding voice feature vector.
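A minimal sketch of this Fourier-transform-and-encode step follows, assuming a one-dimensional NumPy waveform and treating the feature extraction module as an opaque encoder callable; the frame length, hop size and Hann window are illustrative choices, not parameters fixed by this application.

import numpy as np

def speech_to_feature_vector(waveform, encoder, n_fft=512, hop=160):
    # Short-time Fourier transform: frequency components for each frame
    # (assumes the waveform is at least n_fft samples long)
    frames = []
    for start in range(0, len(waveform) - n_fft + 1, hop):
        frame = waveform[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)))
    feature_matrix = np.stack(frames, axis=0)   # spectrum feature matrix (frames x frequency bins)
    return encoder(feature_matrix)              # encoded into a voice feature vector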
In a specific application scenario, the determining unit 33 may be specifically configured to perform, by means of the similarity calculation module of the twin neural network, at least one of the following operations:
calculating the similarity between the voice characteristic vector of the anchor point sample and the voice characteristic vector of the positive sample;
Calculating the similarity between the voice feature vector of the anchor point sample and the voice feature vector of the negative sample;
and calculating the similarity between the voice characteristic vector of the anchor point sample and the voice characteristic vector of the hard negative sample.
In a specific application scenario, as shown in fig. 6, the training unit 34 includes:
the calculating module 341 may be configured to bring the similarity between the speech feature vectors representing different sample types into a preset loss function, and calculate a loss value;
the adjusting module 342 may be configured to adjust the parameters in the feature extraction module when the loss value is not less than a first preset threshold, and stop training until the loss value is less than the first preset threshold, so as to obtain a feature extraction module of the trained twin neural network.
In a specific application scenario, as shown in fig. 6, the training unit 34 further includes:
the setting module 343 may be configured to set learning task requirements for the twin neural network by using the differences between the sample types of the voice data, before the similarities between the voice feature vectors characterizing different sample types are brought into the preset loss function to calculate the loss value;
A construction module 344 may be configured to set a loss function of the twin neural network according to the learning task requirements.
In a specific application scenario, the learning task requirement includes: a first learning task requirement and a second learning task requirement;
the first learning task requirement is that the first similarity is greater than the second similarity, the second similarity is greater than the third similarity, and the distance between the first similarity and the third similarity is greater than a second preset threshold; the second learning task requirement is that the first similarity is greater than the third similarity and the distance between the first similarity and the third similarity is greater than a third preset threshold;
wherein the first similarity is a similarity between a speech feature vector of the anchor sample and a speech feature vector of the positive sample; the second similarity is the similarity between the speech feature vector of the anchor sample and the speech feature vector of the hard negative sample; the third similarity is a similarity between the speech feature vector of the anchor sample and the speech feature vector of the negative sample.
It should be noted that, in the other corresponding descriptions of each functional unit related to the generating device of the speech recognition model applicable to the server side provided in this embodiment, reference may be made to the corresponding descriptions in fig. 1 and fig. 2, and no further description is given here.
Further, as a specific implementation of the method of fig. 4, an embodiment of the present application provides a speech recognition device, as shown in fig. 7, where a speech recognition model used in the speech recognition method is trained according to the method for generating the speech recognition model, where the device includes: a second acquisition unit 41, a calling unit 42, a matching unit 43.
A second obtaining unit 41, configured to obtain voice data of a user to be identified;
the calling unit 42 may be configured to respond to a voice recognition instruction, and call the voice recognition model to perform feature extraction on voice data of the user to be recognized, so as to obtain a voice feature vector of the user to be recognized;
the matching unit 43 may be configured to perform feature matching processing on the speech feature vector of the user to be identified and a pre-stored standard speech feature vector, and determine identity information corresponding to the standard speech feature vector that is matched with the speech feature vector of the user to be identified as identity information of the user to be identified;
the standard voice feature vector is obtained by calling the voice recognition model to perform feature extraction on voice data of a standard user.
It should be noted that, in the other corresponding descriptions of the functional units related to the voice recognition device applicable to the voice recognition terminal side provided in this embodiment, reference may be made to the corresponding descriptions in fig. 4, and no further description is given here.
Based on the method shown in fig. 1, correspondingly, the embodiment of the application also provides a storage medium, on which a computer program is stored, which when executed by a processor, implements the method for generating the speech recognition model shown in fig. 1; based on the above method shown in fig. 4, correspondingly, the embodiment of the application further provides a storage medium, on which a computer program is stored, which when executed by a processor, implements the above voice recognition method shown in fig. 4.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and includes several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods described in various implementation scenarios of the present application.
Based on the method shown in fig. 1 and the virtual device embodiments shown in fig. 5 to 6, in order to achieve the above objective, the embodiments of the present application further provide a server entity device, which may specifically be a computer, a server, or other network devices, where the entity device includes a storage medium and a processor; a storage medium storing a computer program; a processor for executing a computer program to implement the above-described method for generating a speech recognition model as shown in fig. 1.
Based on the method shown in fig. 4 and the virtual device embodiment shown in fig. 7, in order to achieve the above objective, the embodiment of the present application further provides a voice recognition terminal entity device, which may specifically be a computer, a smart phone, a tablet computer, a smart watch, or a network device, where the entity device includes a readable storage medium and a processor; a readable storage medium storing a computer program; a processor for executing the computer program to implement the above-described speech recognition method as shown in fig. 4.
Optionally, both of the above-mentioned physical devices may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WI-FI module, and so on. The user interface may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), etc.
It will be appreciated by those skilled in the art that the structure of the entity device for generating a speech recognition model according to the present embodiment is not limited to the entity device, and may include more or fewer components, or some components may be combined, or different arrangements of components.
The storage medium may also include an operating system, a network communication module. The operating system is a program that manages the generation of the above-described speech recognition model and the hardware and software resources of the speech recognition entity device, supporting the execution of information processing programs and other software and/or programs. The network communication module is used for realizing communication among all components in the storage medium and communication with other hardware and software in the information processing entity equipment.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus a necessary general hardware platform, or by hardware. Compared with the existing approach, the present application constructs the speech recognition model with a twin neural network: the voices of different speakers are represented as voice data of different sample types, an end-to-end technique learns the voice data of each sample type, and the similarities between the voice feature vectors of different sample types are learned to reflect the voice characteristics of different speakers, so that the constructed speech recognition model can extract voice features with strong characterization capability and the accuracy of recognizing a user's identity from voice is improved.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of one preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will appreciate that modules in an apparatus in an implementation scenario may be distributed in an apparatus in an implementation scenario according to an implementation scenario description, or that corresponding changes may be located in one or more apparatuses different from the implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The foregoing application serial numbers are merely for description, and do not represent advantages or disadvantages of the implementation scenario. The foregoing disclosure is merely a few specific implementations of the present application, but the present application is not limited thereto and any variations that can be considered by a person skilled in the art shall fall within the protection scope of the present application.

Claims (13)

1. A method for generating a speech recognition model, comprising:
acquiring a multi-element voice sample data combination, wherein the multi-element voice sample data combination comprises a plurality of voice data with different sample types;
extracting features of the multi-element voice sample data combination through a feature extraction module of the twin neural network to obtain voice feature vectors representing different sample types;
Determining the similarity between the voice feature vectors representing different sample types through a similarity calculation module of the twin neural network;
training the twin neural network according to the similarity between the voice feature vectors representing different sample types to obtain a trained twin neural network;
and constructing a voice recognition model by using the trained feature extraction module of the twin neural network.
2. The method according to claim 1, wherein the acquiring a multi-element voice sample data combination specifically comprises:
acquiring voice data of a target speaker as an anchor point sample;
acquiring voice data which is the same as the anchor point sample in attribute dimension representing a speaker as a positive sample;
acquiring voice data which is different from the anchor point sample in attribute dimension representing a speaker as a negative sample;
the anchor point sample, the positive sample, and the negative sample are constructed as the multi-component speech sample data combination.
3. The method according to claim 1, wherein the acquiring a multi-element voice sample data combination specifically comprises:
acquiring voice data of a target speaker as an anchor point sample;
Acquiring voice data which is the same as the anchor point sample in attribute dimension representing a speaker as a positive sample;
acquiring voice data which is different from the anchor point sample in attribute dimension representing a speaker as a negative sample;
obtaining, as a hard negative sample, speech data that is different from the target speaker but identical to the anchor point sample in an attribute dimension characterizing at least one speech feature;
the anchor point sample, the positive sample, the negative sample, and the hard negative sample are constructed as the multi-component speech sample data combination.
4. The method according to claim 1, wherein the extracting features of the multi-element voice sample data combination through a feature extraction module of the twin neural network to obtain voice feature vectors representing different sample types specifically comprises:
the following is performed for each sample type of speech data in the multi-element speech sample data combination:
performing Fourier transform on the voice data to obtain corresponding frequency components;
and converting the frequency components into a corresponding spectrum feature matrix, and inputting the spectrum feature matrix into the feature extraction module of the twin neural network for encoding, to obtain the corresponding voice feature vector.
5. The method according to any one of claims 1 to 4, wherein the determining, through the similarity calculation module of the twin neural network, the similarity between the voice feature vectors representing different sample types specifically comprises:
performing, by the similarity calculation module of the twin neural network, at least one of:
calculating the similarity between the voice characteristic vector of the anchor point sample and the voice characteristic vector of the positive sample;
calculating the similarity between the voice feature vector of the anchor point sample and the voice feature vector of the negative sample;
and calculating the similarity between the voice characteristic vector of the anchor point sample and the voice characteristic vector of the hard negative sample.
6. The method according to claim 1, wherein the training the twin neural network according to the similarity between the speech feature vectors characterizing different sample types to obtain a trained twin neural network specifically comprises:
carrying the similarity between the voice feature vectors representing different sample types into a preset loss function, and calculating to obtain a loss value;
and under the condition that the loss value is not smaller than a first preset threshold value, adjusting parameters in the feature extraction module, and stopping training until the loss value is smaller than the first preset threshold value, so as to obtain the feature extraction module of the trained twin neural network.
7. The method of claim 6, wherein before the carrying the similarity between the voice feature vectors representing different sample types into the preset loss function, the method further comprises:
setting learning task requirements for the twin neural network by utilizing the difference between sample types of the voice data;
and setting a loss function of the twin neural network according to the learning task requirement.
8. The method of claim 7, wherein the learning task requirement comprises: a first learning task requirement and a second learning task requirement;
the first learning task requirement is that the first similarity is greater than the second similarity, the second similarity is greater than the third similarity, and the distance between the first similarity and the third similarity is greater than a second preset threshold; the second learning task requirement is that the first similarity is greater than the third similarity and the distance between the first similarity and the third similarity is greater than a third preset threshold;
wherein the first similarity is a similarity between a speech feature vector of the anchor sample and a speech feature vector of the positive sample; the second similarity is the similarity between the speech feature vector of the anchor sample and the speech feature vector of the hard negative sample; the third similarity is a similarity between the speech feature vector of the anchor sample and the speech feature vector of the negative sample.
9. A speech recognition method, characterized in that a speech recognition model used in the speech recognition method is trained according to the method of any one of claims 1 to 8, the speech recognition method comprising:
acquiring voice data of a user to be identified;
responding to a voice recognition instruction, calling the voice recognition model to perform feature extraction on voice data of the user to be recognized, and obtaining a voice feature vector of the user to be recognized;
performing feature matching processing on the voice feature vector of the user to be identified and a pre-stored standard voice feature vector, and determining identity information corresponding to the standard voice feature vector matched with the voice feature vector of the user to be identified as the identity information of the user to be identified;
the standard voice feature vector is obtained by calling the voice recognition model to perform feature extraction on voice data of a standard user.
10. A device for generating a speech recognition model, the device comprising:
a first acquisition unit configured to acquire a plurality of voice sample data combinations including voice data of different sample types;
The extraction unit is used for extracting the characteristics of the multi-element voice sample data combination through a characteristic extraction module of the twin neural network to obtain voice characteristic vectors representing different sample types;
the determining unit is used for determining the similarity between the voice feature vectors representing different sample types through a similarity calculation module of the twin neural network;
the training unit is used for training the twin neural network according to the similarity between the voice feature vectors representing different sample types to obtain a trained twin neural network;
and the construction unit is used for constructing a voice recognition model by using the trained feature extraction module of the twin neural network.
11. A speech recognition device, the device comprising:
the second acquisition unit is used for acquiring voice data of the user to be identified;
the calling unit is used for responding to the voice recognition instruction and calling the voice recognition model to perform feature extraction on the voice data of the user to be recognized so as to obtain a voice feature vector of the user to be recognized;
the matching unit is used for carrying out feature matching processing on the voice feature vector of the user to be identified and a pre-stored standard voice feature vector, and determining identity information corresponding to the standard voice feature vector matched with the voice feature vector of the user to be identified as the identity information of the user to be identified;
The standard voice feature vector is obtained by calling the voice recognition model to perform feature extraction on voice data of a standard user; the speech recognition model is trained using the method for generating a speech recognition model according to any one of claims 1 to 8.
12. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 9 when the computer program is executed.
13. A computer storage medium having stored thereon a computer program, which when executed by a processor realizes the steps of the method according to any of claims 1 to 9.
CN202210048877.2A 2022-01-17 2022-01-17 Speech recognition model generation method, speech recognition method, device and equipment Pending CN116486789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210048877.2A CN116486789A (en) 2022-01-17 2022-01-17 Speech recognition model generation method, speech recognition method, device and equipment

Publications (1)

Publication Number Publication Date
CN116486789A (en) 2023-07-25

Family

ID=87216471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210048877.2A Pending CN116486789A (en) 2022-01-17 2022-01-17 Speech recognition model generation method, speech recognition method, device and equipment

Country Status (1)

Country Link
CN (1) CN116486789A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758936A (en) * 2023-08-18 2023-09-15 腾讯科技(深圳)有限公司 Processing method and device of audio fingerprint feature extraction model and computer equipment
CN116758936B (en) * 2023-08-18 2023-11-07 腾讯科技(深圳)有限公司 Processing method and device of audio fingerprint feature extraction model and computer equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination