CN113593579B - Voiceprint recognition method and device and electronic equipment - Google Patents

Voiceprint recognition method and device and electronic equipment

Info

Publication number
CN113593579B
CN113593579B (application CN202110838405.2A)
Authority
CN
China
Prior art keywords
voice
similarity
voice sample
voiceprint
voiceprint feature
Prior art date
Legal status
Active
Application number
CN202110838405.2A
Other languages
Chinese (zh)
Other versions
CN113593579A (en)
Inventor
陈燕丽
蒋宁
吴海英
王洪斌
刘敏
孟庆林
Current Assignee
Mashang Consumer Finance Co Ltd
Original Assignee
Mashang Consumer Finance Co Ltd
Priority date
Filing date
Publication date
Application filed by Mashang Consumer Finance Co Ltd
Priority to CN202110838405.2A
Publication of CN113593579A
Application granted
Publication of CN113593579B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the application provides a voiceprint recognition method, a voiceprint recognition device and electronic equipment. When voiceprint recognition is performed, a first similarity between a first voiceprint feature of the voice to be recognized and a second voiceprint feature of a preset voice is obtained first. Because the influence of the voice quality factor on the voiceprint recognition result is fully considered, the first similarity is adjusted in combination with the voice quality factor corresponding to the voice to be recognized to obtain a second similarity; voiceprint recognition is then performed on the voice to be recognized according to the second similarity. This avoids the low accuracy of voiceprint recognition results caused by ignoring voice quality factors and effectively improves the accuracy of the voiceprint recognition result.

Description

Voiceprint recognition method and device and electronic equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a voiceprint recognition method, a voiceprint recognition device, and an electronic device.
Background
In order to ensure the security of services, identity recognition technology is generally required in many scenarios to identify the identity of the user, for example facial recognition, fingerprint recognition or voiceprint recognition. Voiceprint recognition converts the acoustic signal into an electrical signal, and a computer then recognizes the speaker's identity through a voiceprint recognition model according to the speaker's acoustic characteristics.
When the user identity is identified through voiceprint recognition, voice data input by the user is collected, the corresponding voiceprint features are extracted from the voice data, the cosine distance between the extracted voiceprint features and the voiceprint features corresponding to the pre-stored voice data is calculated, and whether the current user and the user to whom the pre-stored voice data belongs are the same user is determined according to the calculated cosine distance, so that identification of the user identity is completed through the voiceprint recognition technology.
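For illustration only (this code is not part of the original disclosure), such a conventional cosine-distance comparison can be sketched as follows; the threshold value used here is made up for the example:

```python
import numpy as np

def cosine_score(current_feature, stored_feature):
    # Cosine similarity between the voiceprint feature extracted from the
    # currently input voice data and the voiceprint feature of pre-stored voice data.
    a = np.asarray(current_feature, dtype=float)
    b = np.asarray(stored_feature, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same user if the score reaches a preset threshold (0.6 is illustrative only).
same_user = cosine_score([0.1, 0.8, 0.3], [0.2, 0.7, 0.4]) >= 0.6
```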
However, the accuracy of the voiceprint recognition result obtained with the existing voiceprint recognition method is relatively low.
Disclosure of Invention
The embodiment of the application provides a voiceprint recognition method, a voiceprint recognition device and electronic equipment, which improve the accuracy of voiceprint recognition results.
In a first aspect, an embodiment of the present application provides a voiceprint recognition method, where the voiceprint recognition method may include:
And acquiring a first similarity between the first voiceprint feature of the voice to be recognized and the second voiceprint feature of the preset voice.
And adjusting the first similarity according to the voice quality factor corresponding to the voice to be recognized to obtain a second similarity, wherein the voice quality factor is used for representing the voice quality of the voice to be recognized.
And carrying out voiceprint recognition on the voice to be recognized according to the second similarity.
In a second aspect, an embodiment of the present application further provides a method for training a voiceprint feature extraction model, where the method for training a voiceprint feature extraction model may include:
acquiring a plurality of voice sample pairs and marking information corresponding to each voice sample pair in the plurality of voice sample pairs; each voice sample pair comprises a first voice sample collected through a first channel and a second voice sample collected through a second channel, and the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user or not.
And inputting the voice sample pairs into a preset initial voiceprint feature extraction model to obtain a first voiceprint feature corresponding to a first voice sample and a second voiceprint feature corresponding to a second voice sample in each voice sample pair.
Training the initial voiceprint feature extraction model according to the first voiceprint feature, the second voiceprint feature and the marking information corresponding to each voice sample to obtain the voiceprint feature extraction model.
In a third aspect, an embodiment of the present application further provides a method for determining a similarity adjustment parameter, where the method for determining a similarity adjustment parameter may include:
a first similarity between a first voiceprint feature of a first voice sample and a second voiceprint feature of a second voice sample of each of a plurality of pairs of voice samples is determined.
Inputting the first similarity and a target voice quality factor into a similarity prediction function to obtain a prediction result of each voice sample pair, wherein the prediction result is used for representing the probability that the first voice sample and the second voice sample belong to the same user, and the target voice quality factor is determined based on the voice quality factor of the first voice sample and the voice quality factor of the second voice sample.
And determining target similarity adjustment parameters according to the prediction results of the voice sample pairs and the marking information of the voice sample pairs, wherein the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user.
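As a rough, non-authoritative sketch of this third aspect, one could assume a sigmoid-style similarity prediction function and fit the adjustment parameters (weight of the first similarity, weight vector of the quality factors, and bias) to the labelled pairs by gradient descent; the choice of sigmoid, binary cross-entropy and SGD below is an assumption, not something specified by the patent:

```python
import torch

def fit_adjustment_params(s, q, labels, epochs=200, lr=0.05):
    """Learn (w_s, w_q, b) from labelled voice sample pairs.

    s      : (P,) tensor of first similarities for P voice sample pairs
    q      : (P, K) tensor of target voice quality factors for the pairs
    labels : (P,) tensor, 1 if the two samples belong to the same user, else 0
    """
    w_s = torch.zeros(1, requires_grad=True)
    w_q = torch.zeros(q.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w_s, w_q, b], lr=lr)
    for _ in range(epochs):
        pred = torch.sigmoid(w_s * s + q @ w_q + b)  # probability of "same user"
        loss = torch.nn.functional.binary_cross_entropy(pred, labels.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w_s.item(), w_q.detach(), b.item()
```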
In a fourth aspect, an embodiment of the present application further provides a user identity identification method, where the user identity identification method may include:
and acquiring the voice to be recognized input by the user to be recognized.
Inputting the voice to be recognized and the preset voice into a voiceprint feature extraction model to obtain a first voiceprint feature corresponding to the voice to be recognized and a second voiceprint feature corresponding to the preset voice.
Determining a first similarity between the first voiceprint feature and the second voiceprint feature; and adjusting the first similarity according to the voice quality factor corresponding to the voice to be recognized to obtain a second similarity, wherein the voice quality factor is used for representing the voice quality of the voice to be recognized.
And identifying the identity of the user to be identified according to the second similarity.
In a fifth aspect, an embodiment of the present application further provides a voiceprint recognition apparatus, where the voiceprint recognition apparatus includes:
The voiceprint recognition device comprises an acquisition unit, a processing unit and a recognition unit, wherein the acquisition unit is used for acquiring a first similarity between a first voiceprint feature of a voice to be recognized and a second voiceprint feature of a preset voice.
The processing unit is used for adjusting the first similarity according to the voice quality factor corresponding to the voice to be recognized to obtain a second similarity, and the voice quality factor is used for representing the voice quality of the voice to be recognized.
And the recognition unit is used for carrying out voiceprint recognition on the voice to be recognized according to the second similarity.
In a sixth aspect, an embodiment of the present application further provides a training device for a voiceprint feature extraction model, where the training device for a voiceprint feature extraction model may include:
An obtaining unit, configured to obtain a plurality of voice sample pairs and marking information corresponding to each voice sample pair in the plurality of voice sample pairs; each voice sample pair comprises a first voice sample collected through a first channel and a second voice sample collected through a second channel, and the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user or not.
The processing unit is used for inputting the first frequency spectrum characteristic corresponding to the first voice sample and the second frequency spectrum characteristic corresponding to the second voice sample included in each voice sample pair into a preset initial voiceprint characteristic extraction model to obtain the first voiceprint characteristic corresponding to the first voice sample and the second voiceprint characteristic corresponding to the second voice sample.
The training unit is used for training the initial voiceprint feature extraction model according to the first voiceprint feature, the second voiceprint feature and the marking information corresponding to each voice sample to obtain the voiceprint feature extraction model.
In a seventh aspect, an embodiment of the present application further provides a device for determining a similarity adjustment parameter, where the device for determining a similarity adjustment parameter may include:
An acquisition unit for determining a first similarity between a first voiceprint feature of a first voice sample and a second voiceprint feature of a second voice sample of each of a plurality of voice sample pairs.
The processing unit is used for inputting the first similarity and a target voice quality factor into a similarity prediction function to obtain a prediction result of each voice sample pair, wherein the prediction result is used for representing the probability that the first voice sample and the second voice sample belong to the same user, and the target voice quality factor is determined based on the voice quality factor of the first voice sample and the voice quality factor of the second voice sample.
And the determining unit is used for determining target similarity adjustment parameters according to the prediction results of the voice sample pairs and the marking information of the voice sample pairs, wherein the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user or not.
In an eighth aspect, an embodiment of the present application further provides a user identity recognition device, where the user identity recognition device may include:
The acquisition unit is used for acquiring the voice to be recognized input by the user to be recognized.
The processing unit is used for inputting the voice to be recognized and the preset voice into the voiceprint feature extraction model to obtain a first voiceprint feature corresponding to the voice to be recognized and a second voiceprint feature corresponding to the preset voice.
A determining unit configured to determine a first similarity between the first voiceprint feature and the second voiceprint feature; and adjusting the first similarity according to the voice quality factor corresponding to the voice to be recognized to obtain a second similarity, wherein the voice quality factor is used for representing the voice quality of the voice to be recognized.
And the identification unit is used for identifying the identity of the user to be identified according to the second similarity.
In a ninth aspect, an embodiment of the present application further provides an electronic device, where the electronic device may include: a memory, a processor;
The memory is configured to store a computer program.
The processor is configured to read the computer program stored in the memory, and execute the voiceprint recognition method according to the first aspect, or execute the training method of the voiceprint feature extraction model according to the second aspect, or execute the similarity adjustment parameter determination method according to the third aspect, or execute the user identification method according to the fourth aspect according to the computer program in the memory.
In a tenth aspect, an embodiment of the present application further provides a readable storage medium in which computer-executable instructions are stored, where the computer-executable instructions, when executed by a processor, implement the voiceprint recognition method according to the first aspect, or the training method of the voiceprint feature extraction model according to the second aspect, or the method for determining the similarity adjustment parameter according to the third aspect, or the user identification method according to the fourth aspect.
In an eleventh aspect, an embodiment of the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the voiceprint recognition method according to the first aspect, or the training method of the voiceprint feature extraction model according to the second aspect, or the method for determining the similarity adjustment parameter according to the third aspect, or the user identification method according to the fourth aspect.
According to the voiceprint recognition method, the voiceprint recognition device and the electronic equipment, when voiceprint recognition is carried out, the first similarity between the first voiceprint feature of the voice to be recognized and the second voiceprint feature of the preset voice can be obtained first, and the influence of the voice quality factor on the voiceprint recognition result can be fully considered, so that the first similarity is adjusted by combining the voice quality factor corresponding to the voice to be recognized, and the second similarity is obtained; and then, voice print recognition is carried out on the voice to be recognized according to the second similarity, so that the problem of lower accuracy of voice print recognition results caused by the fact that voice quality factors are not considered can be solved, and the accuracy of the voice print recognition results is effectively improved.
Drawings
Fig. 1 is a schematic flow chart of a voiceprint recognition method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a dual-network architecture model according to an embodiment of the present application;
FIG. 3 is a flowchart of a training method of a voiceprint feature extraction model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an Ecapa network model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an initial voiceprint feature extraction model according to an embodiment of the present application;
Fig. 6 is a flowchart of a method for determining a similarity adjustment parameter according to an embodiment of the present application;
Fig. 7 is a schematic flow chart of a user identification method according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a voiceprint recognition device according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a training device for a voiceprint feature extraction model according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a device for determining a similarity adjustment parameter according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a user identity recognition device according to an embodiment of the present application;
Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Specific embodiments of the present disclosure have been shown by way of the above drawings and will be described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the disclosed concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
In embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: there are three cases, a alone, a and B together, and B alone, wherein a, B may be singular or plural. In the text description of the present application, the character "/" generally indicates that the front-rear associated object is an or relationship.
The technical scheme provided by the embodiment of the application can be applied to the scene of voiceprint recognition. The voiceprint recognition is used as a trusted voiceprint feature authentication technology, has wide application prospect in various fields and scenes such as identity authentication, security check and the like, and is one of the preferred identity authentication schemes of a plurality of call centers.
In the prior art, when the identity of a user is identified through voiceprint recognition, the cosine distance between the voiceprint features corresponding to the voice data currently input by the user and the voiceprint features corresponding to pre-stored voice data is calculated. If the calculated cosine distance is greater than or equal to a preset threshold, the current user and the user to whom the pre-stored voice data belongs are determined to be the same user; if it is smaller than the preset threshold, they are determined to be different users, thereby completing the identification of the user identity through the voiceprint recognition technology.
However, the existing voiceprint recognition method determines the voiceprint recognition result directly from the cosine distance between the voiceprint features and does not consider the influence of other factors on the result, such as voice duration, signal-to-noise ratio and volume. As a result, the accuracy of the voiceprint recognition result obtained with the existing method is relatively low.
In order to improve the accuracy of the voiceprint recognition result, the influence of voice quality factors such as voice duration, signal-to-noise ratio, volume and the like on the voiceprint recognition can be fully considered in the process of determining the voiceprint recognition result, so that the problem of low accuracy of the voiceprint recognition result caused by the fact that voice quality factors such as voice duration, signal-to-noise ratio, volume and the like are not considered can be solved, and the accuracy of the voiceprint recognition result is effectively improved.
Hereinafter, the voiceprint recognition method provided by the present application will be described in detail by way of specific examples. It is to be understood that the following embodiments may be combined with each other and that some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 1 is a flowchart of a voiceprint recognition method according to an embodiment of the present application, where the voiceprint recognition method may be performed by a software and/or hardware device, and the hardware device may be, for example, a voiceprint recognition device. For example, referring to fig. 1, the voiceprint recognition method may include:
S101, obtaining first similarity between first voiceprint features of voice to be recognized and second voiceprint features of preset voice.
The voice to be recognized can be understood as a voice which needs to be verified at present, and the preset voice can be understood as a voice which is stored in advance and used as a verification basis.
For example, when obtaining the first similarity between the first voiceprint feature of the voice to be recognized and the second voiceprint feature of the preset voice, at least two possible implementations may be included:
In one possible implementation manner, the voice to be recognized and the preset voice may be input directly into a similarity model, which extracts the first voiceprint feature of the voice to be recognized and the second voiceprint feature of the preset voice and determines the first similarity between the two, so that the first similarity is obtained.
In this possible implementation manner, the similarity model has the capability of extracting voiceprint features and determining the similarity of the voiceprint features, and the input of the similarity model is two voices, and the output is the similarity between the voiceprint features of the two voices respectively.
In another possible implementation manner, the first voiceprint feature of the voice to be recognized and the second voiceprint feature of the preset voice may be extracted through a voiceprint feature extraction model, and the first similarity between the first voiceprint feature and the second voiceprint feature is then calculated, so that the first similarity is obtained.
In this possible implementation manner, when the first voiceprint feature of the voice to be recognized and the second voiceprint feature of the preset voice are extracted through the voiceprint feature extraction model, the description may be made in combination with two scenarios.
In one scenario: if the voice to be recognized and the preset voice are collected through the same channel, for example, are collected through network channels or are collected through telephone channels, and channel differences do not exist, the first voice print feature of the voice to be recognized and the second voice print feature of the preset voice can be directly extracted through the voice print feature extraction model of the existing single network architecture. Wherein, the single network architecture refers to a network architecture with only one network model.
In another scenario: if the voice to be recognized and the preset voice are collected through different channels, for example, the voice to be recognized is collected through a telephone channel, and the preset voice is collected through a network channel. For example, in the existing enterprise call center scene, in the registration link, a user can read according to the prompt text of an enterprise application program, and in the reading process, the voice of the user can be collected through a network channel, and the voice is a preset voice which is used as a matching basis subsequently; after registration is completed, when the user subsequently transacts the service through the call center, the identity of the user needs to be verified in the service transacting process. In the verification link, the voice of the user can be collected through a telephone channel, the voice is the voice to be recognized, and the two voices are collected through different channels.
Because the encoding and decoding algorithms corresponding to different channels are different, if the voiceprint features of voices acquired through different channels are extracted through the same voiceprint feature extraction model, the accuracy of the extracted voiceprint features is relatively low. Therefore, in order to improve the accuracy of the extracted voiceprint features, a voiceprint feature extraction model with a dual-network architecture may be constructed. The voiceprint feature extraction model of the dual-network architecture may include a first network model and a second network model with the same network architecture, and may follow an existing pseudo-twin (pseudo-siamese) network model, as shown in fig. 2, which is a schematic architecture diagram of the dual-network architecture model provided in an embodiment of the present application. The first network model may be obtained by training on voice samples collected through a telephone channel and may subsequently be used to extract voiceprint features of voices collected through the telephone channel; the second network model may be obtained by training on voice samples collected through a network channel and may subsequently be used to extract voiceprint features of voices collected through the network channel. The first network model and the second network model are trained with the same loss function. How to train the voiceprint feature extraction model of the dual-network architecture will be described in detail later in the second embodiment shown in fig. 3.
When the first voiceprint feature of the voice to be recognized and the second voiceprint feature of the preset voice are respectively extracted through the voiceprint feature extraction model of the double-network architecture, the first spectral feature can be extracted from the voice to be recognized, the second spectral feature can be extracted from the preset voice, the first spectral feature and the second spectral feature are input into the voiceprint feature extraction model, the first voiceprint feature is obtained through the first network model in the voiceprint feature extraction model, and the second voiceprint feature is obtained through the second network model in the voiceprint feature extraction model, so that the first voiceprint feature of the voice to be recognized and the second voiceprint feature of the preset voice are extracted. When the voiceprint characteristics of the voice to be recognized and the preset voice acquired through different channels are extracted through the voiceprint characteristic extraction model of the double-network architecture, the voiceprint characteristics are extracted in a targeted manner through distinguishing the channels, so that the problem that the accuracy of the voiceprint recognition result is low due to channel difference can be solved, and the accuracy of the voiceprint recognition result is improved.
For example, when the first spectral feature is extracted from the voice to be recognized, processing such as pre-emphasis, framing, windowing, Fourier transform, filtering and logarithmic operation may be performed on the voice to be recognized to obtain an 80×T fbank spectral feature, which is determined as the first spectral feature. Similarly, when the second spectral feature is extracted from the preset voice, the preset voice may be pre-emphasized, framed, windowed, Fourier transformed, filtered and log-processed to obtain an 80×T fbank spectral feature, which is determined as the second spectral feature. The specific processing may be set according to actual needs.
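For illustration only, an 80-dimensional fbank feature of this kind can be computed, for example, with torchaudio; this is a sketch, and the 16 kHz sample rate implied by the file and the 25 ms / 10 ms framing are assumptions rather than values given in the patent:

```python
import torchaudio
from torchaudio.compliance import kaldi

# Load a mono utterance; waveform has shape (1, num_samples).
waveform, sample_rate = torchaudio.load("utterance.wav")

# Kaldi-style fbank: pre-emphasis, framing, windowing, FFT, Mel filtering and log
# are performed internally, yielding a (T, 80) matrix of log filter-bank energies.
feats = kaldi.fbank(
    waveform,
    num_mel_bins=80,
    frame_length=25.0,
    frame_shift=10.0,
    sample_frequency=sample_rate,
)
feats = feats.t()  # transpose to the 80 x T layout described above
```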
After the first voiceprint feature and the second voiceprint feature are extracted, respectively, a first similarity between the first voiceprint feature and the second voiceprint feature can be calculated. It should be noted that, in the embodiment of the present application, unlike the prior art, the voiceprint recognition is not directly performed on the voice to be recognized according to the first similarity, but the influence of the voice quality factor on the voiceprint recognition result is fully considered, so that the first similarity can be adjusted according to the voice quality factor corresponding to the voice to be recognized to obtain the second similarity, that is, the following S102 is executed:
S102, adjusting the first similarity according to a voice quality factor corresponding to the voice to be recognized to obtain the second similarity, wherein the voice quality factor is used for representing the voice quality of the voice to be recognized.
The voice quality factor may include factors related to voice quality, such as voice duration, signal to noise ratio, volume, and the like, and may be specifically set according to actual needs.
For example, when the first similarity is adjusted according to the voice quality factor corresponding to the voice to be recognized to obtain the second similarity, a similarity adjustment parameter may be obtained first, where the similarity adjustment parameter includes a weight of the first similarity, a weight of the voice quality factor, and a bias; and adjusting the first similarity according to the weight of the first similarity, the weight of the voice quality factor, the bias and the voice quality factor to obtain the second similarity. Therefore, the accuracy of the acquired second similarity can be effectively improved by acquiring the similarity adjustment parameters and adjusting the first similarity by combining the similarity adjustment parameters on the basis of fully considering the voice quality factors. The bias is used for enabling the difference value between the second similarity and the first similarity to be smaller than a preset value.
For example, when the similarity adjustment parameter is acquired, the pre-learned similarity adjustment parameter may be acquired locally, or the pre-learned similarity adjustment parameter may be acquired from another device, which may be specifically set according to actual needs. It should be noted that, how to learn the three similarity adjustment parameters in advance will be described in detail later through the third embodiment shown in fig. 6.
After the similarity adjustment parameter and the voice quality factor are obtained, the first similarity may be adjusted using the following equation 1 to obtain the adjusted second similarity:

l(s) = ω_s · s + W_qᵀ · q + b    (Equation 1)

where l(s) denotes the second similarity, ω_s denotes the weight of the first similarity, s denotes the first similarity, W_q denotes the weight of the speech quality factor, q denotes the speech quality factor, W_qᵀ denotes the transpose of W_q, and b denotes the bias. Therefore, the accuracy of the acquired second similarity can be effectively improved by acquiring the similarity adjustment parameter and adjusting the first similarity in combination with the similarity adjustment parameter and the voice quality factor.
It should be noted that, the formula for adjusting the first similarity by using the weight of the first similarity, the weight of the voice quality factor, the bias and the voice quality factor is not limited to the formula 1, and various modifications or adjustments may be made on the basis of the formula 1, or a new adjustment formula may be constructed on the basis of these parameters.
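For illustration, equation 1 can be applied as follows. This is only a sketch: the numeric parameter values, the quality-factor vector and the decision threshold are made up for the example, and the real parameters are learned as described in the third aspect:

```python
import numpy as np

def adjust_similarity(s, q, w_s, w_q, b):
    """Equation 1: l(s) = w_s * s + w_q^T * q + b.

    s   : first similarity of the two voiceprint features
    q   : voice quality factors of the voice to be recognized, e.g. [duration, SNR, volume]
    w_s : weight of the first similarity
    w_q : weight vector of the voice quality factors (same length as q)
    b   : bias keeping the second similarity close to the first similarity
    """
    return w_s * s + float(np.dot(np.asarray(w_q), np.asarray(q))) + b

second_similarity = adjust_similarity(
    s=0.62, q=[3.5, 18.0, 0.7], w_s=0.9, w_q=[0.01, 0.004, 0.05], b=0.02
)
# S103: voiceprint recognition by comparing the second similarity with a preset value.
same_user = second_similarity >= 0.6
```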
After the adjusted second similarity is obtained in S102, voiceprint recognition may be performed on the speech to be recognized according to the second similarity, that is, the following S103 is performed:
s103, voiceprint recognition is carried out on the voice to be recognized according to the second similarity.
For example, when voiceprint recognition is performed on the voice to be recognized according to the second similarity, if the second similarity is greater than or equal to a preset value, determining that the user to which the voice to be recognized belongs and the user to which the preset voice belongs are the same user; if the second similarity is smaller than the preset value, determining that the user to which the voice to be recognized belongs and the user to which the preset voice belongs are different users, and thus completing voiceprint recognition.
It can be seen that, in the embodiment of the present application, when voiceprint recognition is performed, a first similarity between a first voiceprint feature of a voice to be recognized and a second voiceprint feature of a preset voice can be obtained first, and an influence of a voice quality factor on a voiceprint recognition result is fully considered, so that the first similarity is adjusted by combining the voice quality factor corresponding to the voice to be recognized, to obtain a second similarity; and then, voice print recognition is carried out on the voice to be recognized according to the second similarity, so that the problem of lower accuracy of voice print recognition results caused by the fact that voice quality factors are not considered can be solved, and the accuracy of the voice print recognition results is effectively improved.
Fig. 3 is a flow chart of a training method of a voiceprint feature extraction model according to an embodiment of the present application, where the training method of the voiceprint feature extraction model may be performed by a software and/or hardware device, for example, the hardware device may be a training device of the voiceprint feature extraction model. For example, referring to fig. 3, the training method of the voiceprint feature extraction model may include:
S301, a plurality of voice sample pairs and marking information corresponding to each voice sample pair in the plurality of voice sample pairs are obtained.
Each voice sample pair comprises a first voice sample collected through a first channel and a second voice sample collected through a second channel, and the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user or not. The first channel and the second channel are different channels. By way of example, the first channel may be a telephone channel and the second channel may be a network channel; the first channel may also be a network channel and the second channel may be a telephone channel, as long as the first channel and the second channel are different channels.
For example, when a plurality of voice sample pairs are acquired, a first voice sample set may be acquired first, and the voice samples in the first voice sample set are classified according to the channel through which each voice sample was collected: the voice samples collected through the first channel are divided into one voice sample training set, recorded as the first voice sample training set; the voice samples collected through the second channel are divided into another voice sample training set, recorded as the second voice sample training set; and a plurality of voice sample pairs are then constructed from the first voice sample training set and the second voice sample training set.
For example, when the first voice sample set is acquired, an initial voice sample set may be acquired first, augmentation processing may be performed on the voice samples in the initial voice sample set, and the first voice sample set may be determined from the augmented voice sample set. The augmentation processing includes at least one of noise-adding processing, volume-changing processing, speech-speed processing or rate processing, and may be set according to actual needs; the embodiment of the present application is described only by taking these augmentation operations as an example, but is not limited thereto.
It may be appreciated that, in the embodiment of the present application, the purpose of performing augmentation processing on the voice samples in the initial voice sample set when acquiring the first voice sample set is to improve the robustness of the voiceprint feature extraction model trained on the first voice sample set, so that it can be applied to more voiceprint recognition scenarios.
In addition, voice samples from speakers with few utterances and voice samples with a short duration have little reference value when training the voiceprint feature extraction model. Therefore, before the voice samples in the initial voice sample set are augmented, the voice samples may first be screened: speakers with fewer than 5 utterances and voices with a short duration are removed from the initial voice sample set. Removing voice samples with little reference value reduces the amount of data to be augmented and thus improves the efficiency of the augmentation processing.
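A minimal sketch of this screening and augmentation is given below for illustration only; the SNR, gain and minimum-duration values are assumptions, and speech-speed or rate processing could be added in the same spirit, for example by resampling the waveform:

```python
import torch

def keep_sample(num_utterances, duration_s, min_utts=5, min_duration_s=2.0):
    # Screening: drop speakers with fewer than 5 utterances and short voice samples
    # (the 2-second limit here is an assumed value).
    return num_utterances >= min_utts and duration_s >= min_duration_s

def augment(waveform, snr_db=15.0, gain=0.8):
    """Noise-adding and volume-changing augmentation for a (1, num_samples) waveform."""
    signal_power = waveform.pow(2).mean()
    noise = torch.randn_like(waveform)
    scale = torch.sqrt(signal_power / (noise.pow(2).mean() * 10 ** (snr_db / 10)))
    noisy = waveform + scale * noise     # noise-adding processing at the given SNR
    return gain * noisy                  # volume-changing processing
```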
When a plurality of voice sample pairs are constructed according to the first voice sample training set and the second voice sample training set, one voice sample collected through the first channel can be arbitrarily selected from the first voice sample training set to serve as a first voice sample in the voice sample pair, one voice sample collected through the second channel is arbitrarily selected from the second voice sample training set to serve as a second voice sample in the voice sample pair, and one voice sample pair can be constructed by the selected first voice sample and second voice sample. By a similar method, a plurality of speech sample pairs can be constructed. Assuming that the first voice sample training set includes M first voice samples collected through the first channel, and the second voice sample training set includes N second voice samples collected through the second channel, m×n voice sample pairs may be constructed through the first voice sample training set and the second voice sample training set.
In addition, it can be understood that, when each voice sample in the voice sample set is acquired, the user information that each voice sample belongs to can be acquired together, so that when the voice sample pair is constructed, the corresponding marking information of the voice sample pair can be determined according to the user information corresponding to the first voice sample and the second voice sample included in the voice sample pair.
For example, if it is determined that the first voice sample and the second voice sample belong to the same user according to the user information corresponding to the first voice sample and the second voice sample included in the voice sample pair, the flag information may be marked as 1, and conversely, if it is determined that the first voice sample and the second voice sample belong to different users according to the user information corresponding to the first voice sample and the second voice sample included in the voice sample pair, the flag information may be marked as 0.
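A sketch of the M×N pair construction and labelling described above (illustrative only; the tuple layout of the sample lists is an assumption):

```python
from itertools import product

def build_pairs(first_channel_samples, second_channel_samples):
    """first/second_channel_samples: lists of (user_id, voice_sample) tuples
    collected through the first and second channel respectively."""
    pairs, labels = [], []
    for (user1, voice1), (user2, voice2) in product(first_channel_samples,
                                                    second_channel_samples):
        pairs.append((voice1, voice2))
        labels.append(1 if user1 == user2 else 0)  # marking information
    return pairs, labels
```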
It can be understood that, in the embodiment of the present application, in order to further improve the accuracy of the voiceprint feature extraction model obtained by training, the plurality of voice sample pairs may be voice sample pairs under an application scene of the voiceprint feature extraction model, so that the voiceprint feature extraction model is trained by using the voice sample pairs under the application scene in a targeted manner, so that the trained voiceprint feature extraction model can be more suitable for voiceprint feature extraction under the application scene, and the accuracy of the extracted voiceprint features can be further improved. For example, when the voiceprint feature extraction model is applied to an enterprise call center scenario, the plurality of voice sample pairs may include capturing voice samples of a user through a network channel in a registration procedure, and capturing voice samples of the user through a telephone channel during business handling. When the voiceprint feature extraction model is applied to a banking scene, the plurality of voice sample pairs can include a banking registration step of collecting voice samples of a user through a network channel and collecting voice samples of the user through a telephone channel in a banking processing process.
After acquiring the plurality of voice sample pairs and the flag information corresponding to each of the plurality of voice sample pairs, the following S302 may be executed:
S302, inputting each voice sample pair into a preset initial voiceprint feature extraction model to obtain a first voiceprint feature corresponding to a first voice sample and a second voiceprint feature corresponding to a second voice sample in each voice sample pair.
For example, when the initial voiceprint feature extraction model is obtained, the basic voiceprint feature extraction model can be trained by means of a large number of open-source voice sample sets based on a training method of transfer learning, and the initial voiceprint feature extraction model is obtained based on the voiceprint feature extraction model obtained by training.
For example, when training the basic voiceprint feature extraction model with a large number of open-source speech sample sets, a large number of open-source speech sample sets may be acquired first, which may be denoted as a second speech sample set. Similar to the method for obtaining the first voice sample set, the initial voice sample set may be obtained first, the voice samples in the initial voice sample set may be subjected to an amplification process, and the amplified voice sample set may be determined to be the second voice sample set. The amplification process includes at least one of a noise increasing process, a volume changing process, a speech speed processing, or a rate processing, and may be specifically set according to actual needs, where the embodiment of the present application is only described by taking the amplification process including at least one of a noise increasing process, a volume changing process, a speech speed processing, or a rate processing as an example, but the embodiment of the present application is not limited thereto.
It may be appreciated that, in the embodiment of the present application, when the second speech sample set is acquired, the object of the present application is to perform the augmentation processing on the speech samples in the initial speech sample set: the robustness of the basic voiceprint feature extraction model obtained based on the second voice sample set training is improved, so that the method can be suitable for more voiceprint recognition scenes.
In addition, in view of fewer voice samples of a single speaker and shorter voice samples, the reference value is smaller when the voice sample is used for training a voiceprint feature extraction model, so that before the voice samples in the initial voice sample set are subjected to the augmentation treatment, the voice samples in the initial voice sample set can be screened first, fewer voice samples of the single speaker and shorter voice samples are removed from the initial voice sample set, and therefore the data volume of the augmentation treatment can be reduced by removing the voice samples with smaller reference value, and the augmentation treatment efficiency is improved.
When the basic voiceprint feature extraction model is trained by the second voice sample set, the spectral features, such as fbank spectral features, mfcc spectral features, or the like, of each voice sample in the second voice sample set can be extracted first, the spectral features of each voice sample are input into the initial basic voiceprint feature extraction model for training, if the loss function converges, the voiceprint feature extraction model when the loss function converges is determined as the basic voiceprint feature extraction model.
For example, the initial basic voiceprint feature extraction model can be any of an Ecapa network model, a ResNet network model or a Tdnn network model. In the embodiment of the present application, taking the initial basic voiceprint feature extraction model as an Ecapa network model as an example, the network structure of the Ecapa network model may be as shown in fig. 4, which is a schematic structural diagram of an Ecapa network model provided in an embodiment of the present application. Ecapa is a neural network structure based on an attention mechanism. The Ecapa network model includes a one-dimensional convolution layer and four SE-Res2NetBlock (SE-Res2 block) structures; the features output by the four SE-Res2NetBlock structures are concatenated and fed into a final one-dimensional convolution, which is followed by an attentive statistics pooling layer with an attention mechanism and a BN layer for outputting the voiceprint features. The first layer is a one-dimensional convolution with a convolution kernel of 5 and a time-context dilation interval of 1. In each SE-Res2NetBlock structure, the convolution kernel is k and the dilation interval of the time context is d, and k determines the number of output channels. The SENet network focuses on the relations among channels, so that the basic voiceprint feature extraction model can better learn the features of different channels, and the larger the value of d, the more contextual features of the voice sample can be learned.
As can be seen in connection with fig. 4, in the embodiment of the present application, in the four SE-Res2NetBlock structures, in top-down order, k=3 and d=2 in the first SE-Res2NetBlock; k=3 and d=3 in the second SE-Res2NetBlock; k=3 and d=4 in the third SE-Res2NetBlock; and k=3 and d=5 in the fourth SE-Res2NetBlock. When multi-level feature aggregation is performed over the four SE-Res2NetBlock structures, the features of the shallow SE-Res2NetBlock structures may be aggregated in two ways. In one way, the features output by the four SE-Res2NetBlock layers may be concatenated, giving a feature dimension 4 times that of the input. In another way, the features output by the four SE-Res2NetBlock layers may be summed. In the embodiment of the present application, the features of the shallow SE-Res2NetBlock structures may be aggregated by concatenation. In addition, the attentive statistics pooling layer + BN layer in the Ecapa network model can use the attention mechanism to give different weights to different frames, generating not only a weighted average but also a weighted standard deviation. In this way, long-term changes of the voiceprint characteristics can be captured more effectively, and the output of the attentive statistics pooling layer and the BN layer is the extracted voiceprint feature. When the basic voiceprint feature extraction model is trained on the basis of the initial basic voiceprint feature extraction model, an FC layer and an AAM-Softmax layer can be added after the attentive statistics pooling layer and the BN layer to serve as a classifier, and the network parameters of the initial basic voiceprint feature extraction model are iteratively optimized through a loss function constructed from the classification result until the loss function converges; the voiceprint feature extraction model at convergence is determined to be the basic voiceprint feature extraction model.
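As an illustration of the attentive statistics pooling described above (a weighted average plus a weighted standard deviation over frames), a simplified PyTorch sketch might look as follows; the attention bottleneck size is an assumption, and the patent's exact layer is not reproduced here:

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, channels, attention_channels=128):
        super().__init__()
        # Frame-level attention producing one weight per channel and frame.
        self.attention = nn.Sequential(
            nn.Conv1d(channels, attention_channels, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(attention_channels, channels, kernel_size=1),
        )

    def forward(self, x):                      # x: (batch, channels, frames)
        alpha = torch.softmax(self.attention(x), dim=2)
        mean = torch.sum(alpha * x, dim=2)                 # weighted average
        var = torch.sum(alpha * x * x, dim=2) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-9))              # weighted standard deviation
        return torch.cat([mean, std], dim=1)   # (batch, 2 * channels)
```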
After the basic voiceprint feature extraction model is obtained through training, the initial voiceprint feature extraction model can be constructed based on the basic voiceprint feature extraction model obtained through training in view of the fact that the initial voiceprint feature extraction model comprises a first network model and a second network model which are identical in network architecture. Taking the basic voiceprint feature extraction model as an Ecapa network model as an example, the network structure of the initial voiceprint feature extraction model can be shown in fig. 5, and fig. 5 is a schematic structural diagram of an initial voiceprint feature extraction model provided by the embodiment of the present application, it can be seen that the initial voiceprint feature extraction model includes two Ecapa network models with identical structures, where one Ecapa network model can be used for training to obtain a first network model in the voiceprint feature extraction model, and the other Ecapa network model can be used for training to obtain a second network model in the voiceprint feature extraction model. It should be noted that the two Ecapa network models with identical structures are not separately split, but have the same loss function, so that an initial voiceprint feature extraction model of the dual network architecture can be obtained.
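Schematically, the initial voiceprint feature extraction model of fig. 5 can be viewed as two encoders of identical architecture trained jointly under one loss. A high-level sketch follows; the encoder modules are placeholders and do not reproduce the Ecapa internals:

```python
import torch.nn as nn

class DualChannelVoiceprintModel(nn.Module):
    """Pseudo-twin arrangement: two Ecapa-style encoders with identical structure,
    one per channel, optimized jointly with a single loss function."""

    def __init__(self, encoder_first_channel: nn.Module, encoder_second_channel: nn.Module):
        super().__init__()
        self.encoder_first_channel = encoder_first_channel    # e.g. telephone channel
        self.encoder_second_channel = encoder_second_channel  # e.g. network channel

    def forward(self, fbank_first, fbank_second):
        # Each spectral feature goes through the network model of its own channel.
        emb_first = self.encoder_first_channel(fbank_first)
        emb_second = self.encoder_second_channel(fbank_second)
        return emb_first, emb_second
```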
After a plurality of voice sample pairs and an initial voiceprint feature extraction model are respectively acquired, first acquiring a first frequency spectrum feature corresponding to a first voice sample and a second frequency spectrum feature corresponding to a second voice sample included in each voice sample pair; and inputting the first frequency spectrum characteristic corresponding to the first voice sample and the second frequency spectrum characteristic corresponding to the second voice sample included in each voice sample pair into a preset initial voiceprint characteristic extraction model to obtain the first voiceprint characteristic corresponding to the first voice sample and the second voiceprint characteristic corresponding to the second voice sample included in each voice sample pair.
Referring to fig. 5, when the first spectral feature corresponding to the first voice sample and the second spectral feature corresponding to the second voice sample included in each voice sample pair are taken as two inputs of the initial voiceprint feature extraction model and input into the initial voiceprint feature extraction model, if the first spectral feature corresponding to the first voice sample is input into the Ecapa network model on the left side and the second spectral feature corresponding to the second voice sample is input into the Ecapa network model on the right side, the first network model obtained by training based on the Ecapa network model on the left side can be used for extracting voiceprint features in the voice collected by the channel corresponding to the first voice sample; the second network model obtained by subsequent training based on the Ecapa network model on the right side can be used for extracting voiceprint features in the voice acquired by the channel corresponding to the second voice sample; conversely, if the first spectral feature corresponding to the first voice sample is input into the Ecapa network model on the right side, and the second spectral feature corresponding to the second voice sample is input into the Ecapa network model on the left side, the first network model obtained by training based on the Ecapa network model on the left side can be used for extracting voiceprint features in the voice acquired by the channel corresponding to the second voice sample; the second network model obtained by subsequent training based on the right Ecapa network model can be used for extracting voiceprint features in the voice acquired by the channel corresponding to the first voice sample, and can be specifically set according to actual needs. Therefore, by distinguishing the channels, the voiceprint characteristics are extracted in a targeted manner, and the problem that the accuracy of the voiceprint recognition result is low due to channel difference can be solved, so that the accuracy of the voiceprint recognition result is improved.
After extracting the first voiceprint feature and the second voiceprint feature corresponding to each voice sample by the initial voiceprint feature extraction model, the following S303 may be executed:
S303, training the initial voiceprint feature extraction model according to the first voiceprint feature, the second voiceprint feature and the marking information corresponding to each voice sample to obtain a voiceprint feature extraction model.
For example, when training the initial voiceprint feature extraction model according to the first voiceprint feature, the second voiceprint feature and the label information corresponding to each voice sample, the training method may include:
Firstly, the Euclidean distance between the first voiceprint feature and the second voiceprint feature corresponding to each voice sample pair is determined; the first loss function corresponding to the voice sample pair is then determined according to the Euclidean distance, the marking information corresponding to the voice sample pair, and the gap (margin). The initial voiceprint feature extraction model is trained according to the first loss function corresponding to each voice sample pair: if the first loss function converges, the voiceprint feature extraction model at convergence is taken as the voiceprint feature extraction model; if the first loss function does not converge, the network parameters of the model are modified during training until the first loss function converges, and the model at convergence is taken as the voiceprint feature extraction model. Because the first loss function is calculated from the Euclidean distance between the two voiceprint features together with the marking information and the gap corresponding to the voice sample pair, taking the model at convergence as the voiceprint feature extraction model makes the voiceprint features of voices input by the same user through different channels as close as possible, and the voiceprint features of voices input by different users through different channels as far apart as possible, thereby improving the accuracy of the voiceprint feature extraction model.
In addition, during training, the Euclidean distance between the first voiceprint feature output by the first network model and the second voiceprint feature output by the second network model can be continuously reduced, so that the value of the first loss function, calculated from this Euclidean distance, is driven down until the first loss function converges, and the voiceprint feature extraction model at convergence is taken as the voiceprint feature extraction model. In this way, the voiceprint features of voices input by the same user through different channels are as close as possible, and the voiceprint features of voices input by different users through different channels are as far apart as possible, which improves the accuracy of the voiceprint recognition result.
For example, when calculating the Euclidean distance between the first voiceprint feature and the second voiceprint feature, assuming that the first voiceprint feature x1 is a feature tensor of shape [N, C, H, W] and the second voiceprint feature x2 is a feature tensor of shape [M, C, H, W], the Euclidean distance between them can be calculated through a torch distance function, where N=M, N=1 or M=1.
It may be understood that the embodiment of the present application is only described by taking the calculation of the Euclidean distance between the first voiceprint feature and the second voiceprint feature through a torch distance function as an example, but the embodiment of the present application is not limited thereto.
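The source text truncates the name of the torch routine. One natural reading, offered here only as an assumption, is torch.nn.functional.pairwise_distance, whose broadcasting behaviour over the leading batch dimension matches the stated condition N=M, N=1 or M=1:

```python
import torch
import torch.nn.functional as F

# Illustrative embeddings: N first voiceprint features and M second voiceprint
# features, each flattened to a 192-dimensional vector.
first_vp = torch.randn(4, 192)   # N = 4
second_vp = torch.randn(1, 192)  # M = 1, broadcast against N

# Euclidean (p=2) distance along the last dimension; the leading dimensions
# must satisfy N == M, N == 1 or M == 1 so that they broadcast.
d_w = F.pairwise_distance(first_vp, second_vp, p=2)
print(d_w.shape)  # torch.Size([4])
```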
After the Euclidean distance between the first voiceprint feature and the second voiceprint feature is calculated, the first loss function corresponding to the voice sample pair can be further determined according to the Euclidean distance, the marking information corresponding to the voice sample pair, and the gap; specifically, refer to the following formula 2:
Wherein L1 represents the first loss function corresponding to the voice sample pair, Y represents the marking information corresponding to the voice sample pair, D_W represents the Euclidean distance, and m represents the gap. The value of m is related to the value of D_W; typically m is set to 2 times D_W, although it may also be 1.9 times or 2.1 times D_W, and may be set according to actual needs. The embodiment of the present application is only illustrated by taking m as 2 times D_W as an example, and is not limited thereto.
It should be noted that the formula for determining the first loss function corresponding to a voice sample pair from the Euclidean distance, the marking information corresponding to the voice sample pair and the gap is not limited to formula 2; various modifications or adjustments can be made on the basis of formula 2, or a new adjustment formula can be constructed on the basis of these parameters.
With formula 2, the Euclidean distance between the two voiceprint features, the marking information corresponding to the voice sample pair and the gap are combined to jointly determine the first loss function, so that adjusting the voiceprint feature extraction model based on the determined first loss function can effectively improve the accuracy of the voiceprint feature extraction model.
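The image of formula 2 is not reproduced in this text; only its symbols (label Y, Euclidean distance D_W, margin m) are described. A standard contrastive loss uses exactly these ingredients, so the sketch below adopts that standard form purely as an assumption, including the label convention, which the text does not fix.

```python
import torch

def contrastive_loss(d_w, y, m=2.0):
    """Assumed form of the first loss function (formula 2).

    d_w: Euclidean distance between the two voiceprint features.
    y:   pair label; 1 is assumed to mean "same user", 0 "different users".
    m:   gap (margin); the text suggests choosing it relative to D_W.
    """
    same_user = y * d_w.pow(2)                                 # pull matched pairs together
    diff_user = (1 - y) * torch.clamp(m - d_w, min=0).pow(2)   # push others apart up to m
    return 0.5 * (same_user + diff_user).mean()

loss = contrastive_loss(torch.tensor([0.3, 1.8]), torch.tensor([1.0, 0.0]))
```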
For example, in the embodiment of the present application, whether the loss function L1 converges may be determined by three conditions, and satisfying any one of them determines that the loss function L1 has converged. Condition 1: the number of training iterations of the voiceprint feature extraction model under the loss function reaches a preset iteration threshold. Condition 2: the loss value of the loss function is smaller than a preset loss threshold and remains stable. Condition 3: the training sample set of the voiceprint feature extraction model is divided into a training data set and a test data set at a ratio of 8:2, and the loss function value on the test data set remains stable and no longer decreases.
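Purely for illustration, the three stopping conditions can be expressed as a simple check; the threshold values, the patience window and the helper structure are assumptions, while the iteration budget, loss threshold and 8:2 held-out split come from the text.

```python
def has_converged(iteration, train_loss_history, val_loss_history,
                  max_iters=100_000, loss_threshold=1e-3, patience=5):
    # Condition 1: the preset number of iterations has been reached.
    if iteration >= max_iters:
        return True
    # Condition 2: the training loss is below the threshold and stays stable.
    recent = train_loss_history[-patience:]
    if len(recent) == patience and max(recent) < loss_threshold:
        return True
    # Condition 3: the loss on the held-out 20% test split no longer decreases.
    if len(val_loss_history) > patience:
        if min(val_loss_history[-patience:]) >= min(val_loss_history[:-patience]):
            return True
    return False
```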
If the first loss function converges, the voiceprint feature extraction model at convergence is taken as the voiceprint feature extraction model; if the first loss function does not converge, the network parameters of the model are modified during training until the first loss function converges, and the model at convergence is taken as the voiceprint feature extraction model. In this way, a voiceprint feature extraction model with a dual-network architecture is obtained through training.
It can be seen that, in the embodiment of the present application, when the voiceprint feature extraction model is acquired, a plurality of voice sample pairs and the marking information corresponding to each of the plurality of voice sample pairs may be acquired first; the first spectral feature corresponding to the first voice sample and the second spectral feature corresponding to the second voice sample included in each voice sample pair are input into a preset initial voiceprint feature extraction model to obtain the first voiceprint feature corresponding to the first voice sample and the second voiceprint feature corresponding to the second voice sample; and the initial voiceprint feature extraction model is trained according to the first voiceprint feature, the second voiceprint feature and the marking information corresponding to each voice sample pair. Because the model is trained on first voice samples collected through first channels and second voice samples collected through second channels, the trained voiceprint feature extraction model has the ability to distinguish channels and extract voiceprint features in a targeted manner. Distinguishing the channels when extracting voiceprint features alleviates the low accuracy of voiceprint recognition results caused by channel differences and thus improves the accuracy of the voiceprint recognition results.
Fig. 6 is a flowchart of a method for determining a similarity adjustment parameter according to an embodiment of the present application. The method may be performed by a software and/or hardware apparatus; for example, the hardware apparatus may be a determining device for the similarity adjustment parameter. For example, referring to fig. 6, the method for determining the similarity adjustment parameter may include:
S601, determining a first similarity between a first voiceprint feature of a first voice sample and a second voiceprint feature of a second voice sample of each of a plurality of voice sample pairs.
For example, when determining the first similarity s between the first voiceprint feature of the first voice sample and the second voiceprint feature of the second voice sample, the first similarity may be determined through an existing similarity model, or through another voiceprint recognition model, and may be set according to actual needs; the embodiment of the present application does not specifically limit how the first similarity between the first voiceprint feature and the second voiceprint feature is determined.
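Since the text leaves the similarity model open, cosine similarity between the two embeddings is used below purely as an assumed stand-in for "an existing similarity model":

```python
import torch
import torch.nn.functional as F

first_vp = torch.randn(192)   # first voiceprint feature (illustrative)
second_vp = torch.randn(192)  # second voiceprint feature (illustrative)

# First similarity s between the two voiceprint features (assumed metric).
s = F.cosine_similarity(first_vp.unsqueeze(0), second_vp.unsqueeze(0)).item()
```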
S602, inputting the first similarity and the target voice quality factor into a similarity prediction function to obtain a prediction result of each voice sample pair.
The prediction result is used for representing the probability that the first voice sample and the second voice sample belong to the same user, and its value range can be [0,1]; the target voice quality factor is determined based on the voice quality factor of the first voice sample and the voice quality factor of the second voice sample.
For example, the similarity prediction function can be expressed as the following formula 3:
Wherein h(s, q) represents the similarity prediction function, and the value of h(s, q) represents the prediction result of the voice sample pair; l(s) represents the second similarity between the voiceprint feature of the first voice sample and the voiceprint feature of the second voice sample in each voice sample pair; ω_s represents the weight of the first similarity; s represents the first similarity between the voiceprint feature of the first voice sample and the voiceprint feature of the second voice sample in each voice sample pair; W_q represents the weight of the target voice quality factor; q represents the target voice quality factor; b represents the bias; and W_q^T represents the transpose of W_q.
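The image of formula 3 is not reproduced here. From the listed symbols, a linear combination ω_s·s + W_q^T·q + b giving the second similarity l(s), passed through a sigmoid to yield a prediction in [0, 1], is one natural reading; both the linear form and the sigmoid are assumptions in the sketch below.

```python
import torch

def predict_same_user(s, q, w_s, w_q, b):
    """Assumed form of the similarity prediction function h(s, q).

    l = w_s * s + w_q^T q + b   (second similarity adjusted by voice quality)
    h = sigmoid(l)              (probability that the pair shares one speaker)
    """
    l = w_s * s + torch.dot(w_q, q) + b
    return torch.sigmoid(l)

h = predict_same_user(s=torch.tensor(0.72),             # first similarity
                      q=torch.tensor([0.8, 0.6, 0.9]),  # target voice quality factor (assumed vector)
                      w_s=torch.tensor(0.5),
                      w_q=torch.full((3,), 0.1),
                      b=torch.tensor(0.0))
```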
For example, when determining the target voice quality factor, the difference between the voice quality factor of the first voice sample and the voice quality factor of the second voice sample may be determined as the target voice quality factor; or the average of the two voice quality factors may be determined as the target voice quality factor; or the voice quality factor of whichever of the first voice sample and the second voice sample has the worse voice quality may be determined as the target voice quality factor. This may be set according to actual needs, and the embodiment of the present application is not further limited. In this way, the target voice quality factor, derived from the difference, the average, or the worse-quality sample's voice quality factor, supplies the voice quality condition to the adjustment of the similarity prediction function, so that the target similarity adjustment parameter can be better learned through the similarity prediction function.
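A small illustration of the three options for deriving the target voice quality factor; treating the factor as a scalar with larger values meaning better quality is an assumption.

```python
def target_quality_factor(q1: float, q2: float, mode: str = "worst") -> float:
    """Combine the voice quality factors q1, q2 of the two samples into one."""
    if mode == "difference":
        return abs(q1 - q2)
    if mode == "average":
        return (q1 + q2) / 2
    # "worst": quality factor of the sample with the worse voice quality,
    # assuming a larger factor means better quality.
    return min(q1, q2)
```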
After the first similarity and the target voice quality factor are obtained, the first similarity and the target voice quality factor may be input into the similarity prediction function to obtain the prediction result of each voice sample pair, and the following S603 is executed according to the prediction result of each voice sample pair:
S603, determining target similarity adjustment parameters according to the prediction results of the voice sample pairs and the marking information of the voice sample pairs, wherein the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user.
For example, when determining the target similarity adjustment parameter according to the prediction result of each voice sample pair and the label information of each voice sample pair, the second loss function corresponding to each voice sample pair may be determined first based on the prediction result of each voice sample pair and the label information of each voice sample pair, for example, see the following formula 4:
Wherein L2 represents the second loss function corresponding to each voice sample pair, y represents the marking information of each voice sample pair, y=1 indicates that the first voice sample and the second voice sample belong to the same user, y=0 indicates that they belong to different users, h(s, q) represents the similarity prediction function, and the value of h(s, q) represents the prediction result of each voice sample pair. Determining the second loss function by combining the marking information and the prediction result of each voice sample pair drives the prediction result as close to the marking information as possible, which improves the accuracy of the similarity prediction function.
Wherein l(s) represents the second similarity between the voiceprint feature of the first voice sample and the voiceprint feature of the second voice sample in each voice sample pair; ω_s represents the weight of the first similarity; s represents the first similarity between the voiceprint feature of the first voice sample and the voiceprint feature of the second voice sample in each voice sample pair; W_q represents the weight of the target voice quality factor; q represents the target voice quality factor; b represents the bias; and W_q^T represents the transpose of W_q.
It should be noted that the formula for determining the second loss function corresponding to each voice sample pair from the prediction result of each voice sample pair and the marking information of each voice sample pair is not limited to formula 4; various modifications or adjustments may be made on the basis of formula 4, or a new adjustment formula may be constructed on the basis of these parameters.
When determining the second loss function L2 in combination with formula 4, the weight ω_s of the first similarity, the weight W_q of the voice quality factor, and the bias b may first be assigned random initial values; for example, ω_s is assigned a decimal between 0 and 1, W_q is a vector of all 1's, and b is set to 0. The voice quality and the first similarity corresponding to each voice sample pair are then input into formula 4 to obtain the second loss function corresponding to each voice sample pair. Considering that the plurality of voice sample pairs form one batch of samples for a single training operation, an average loss function of the second loss functions corresponding to the plurality of voice sample pairs can be calculated, and the target similarity adjustment parameter is determined based on the average loss function: training continues until the average loss function converges, and the similarity adjustment parameter at convergence is taken as the target similarity adjustment parameter. Compared with determining the target similarity adjustment parameter from the loss function of a single voice sample pair, this achieves a better adjustment effect and provides more adjustment basis for the target similarity adjustment parameter, thereby effectively improving the accuracy of the determined target similarity adjustment parameter.
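The following sketch ties the pieces together under stated assumptions: PyTorch, the sigmoid prediction sketched after formula 3, and binary cross-entropy standing in for the unreproduced formula 4. The random initial assignment and the batch-averaged loss follow the description above; all numeric values are illustrative.

```python
import torch

torch.manual_seed(0)

# Random initial assignment described above.
w_s = torch.rand(1, requires_grad=True)    # decimal between 0 and 1
w_q = torch.ones(3, requires_grad=True)    # vector of all 1's
b = torch.zeros(1, requires_grad=True)     # bias set to 0

optimizer = torch.optim.SGD([w_s, w_q, b], lr=0.1)

# One batch of voice sample pairs (illustrative values).
s_batch = torch.tensor([0.8, 0.2, 0.6])    # first similarities
q_batch = torch.rand(3, 3)                 # target voice quality factors
y_batch = torch.tensor([1.0, 0.0, 1.0])    # marking information (same user or not)

for step in range(100):
    optimizer.zero_grad()
    l = w_s * s_batch + q_batch @ w_q + b              # adjusted similarity per pair
    h = torch.sigmoid(l)                               # assumed prediction function
    # Average of the per-pair losses over the batch; BCE is assumed for formula 4.
    loss = torch.nn.functional.binary_cross_entropy(h, y_batch)
    loss.backward()
    optimizer.step()

# w_s, w_q and b at convergence would be taken as the target similarity
# adjustment parameters.
```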
For example, in the embodiment of the present application, whether the average loss function converges may be determined by three conditions, and satisfying any one of them determines that the average loss function has converged. Condition 1: the number of training iterations of the similarity prediction function shown in formula 3 under the average loss function reaches an iteration threshold. Condition 2: the loss value of the average loss function is smaller than a loss threshold and remains stable. Condition 3: the training sample set of the similarity prediction function is divided into a training data set and a test data set at a ratio of 8:2, and the average loss function value on the test data set remains stable and no longer decreases.
If the average loss function converges, the similarity adjustment parameter at convergence is determined as the target similarity adjustment parameter. If the average loss function does not converge, the similarity adjustment parameter is modified during training until the average loss function converges, and the similarity adjustment parameter at convergence is taken as the target similarity adjustment parameter.
It can be seen that, in the embodiment of the present application, when determining the similarity adjustment parameter, the first similarity between the first voiceprint feature of the first voice sample and the second voiceprint feature of the second voice sample of each of the plurality of voice sample pairs may be determined; the first similarity and the target voice quality factor are input into the similarity prediction function to obtain the prediction result of each voice sample pair; and the target similarity adjustment parameter is then determined according to the prediction result of each voice sample pair and the marking information of each voice sample pair. In this way, the target similarity adjustment parameter can be obtained through training, and during subsequent voiceprint recognition the first similarity can be adjusted jointly according to the target similarity adjustment parameter and the voice quality factor, thereby effectively improving the accuracy of the voiceprint recognition result.
The voiceprint feature extraction model obtained through training in the above embodiment can be applied to scenarios in which users are identified based on voiceprint features. When user identification is performed based on voiceprint features, the voiceprint features can be extracted through the voiceprint feature extraction model and used as the basis for identity recognition. This has broad application prospects in fields and scenarios such as identity recognition and security checking, and is one of the preferred identity verification schemes for many call centers.
Fig. 7 is a flowchart of a user identification method according to an embodiment of the present application, where the user identification method may be performed by a software and/or hardware device. For example, referring to fig. 7, the user identification method may include:
S701, acquiring voice to be recognized input by a user to be recognized.
The voice to be recognized can be understood as the voice input by the user who needs to be authenticated currently.
For example, the voice to be recognized input by the user to be recognized may be acquired directly through a microphone of the electronic device, received from another electronic device, or obtained in another manner, and may be set according to actual needs; the embodiment of the present application does not specifically limit the manner of acquiring the voice to be recognized input by the user to be recognized.
S702, inputting the voice to be recognized and the preset voice into a voiceprint feature extraction model to obtain a first voiceprint feature corresponding to the voice to be recognized and a second voiceprint feature corresponding to the preset voice.
The preset voice can be understood as a voice which is stored in advance and is used as a verification basis when user identity verification is performed.
S703, determining a first similarity between the first voiceprint feature and the second voiceprint feature; and adjusting the first similarity according to a voice quality factor corresponding to the voice to be recognized to obtain a second similarity, wherein the voice quality factor is used for representing the voice quality of the voice to be recognized.
It should be noted that, in S703, the method for determining the first similarity between the first voiceprint feature and the second voiceprint feature is similar to the method for acquiring the first similarity between the first voiceprint feature of the voice to be recognized input by the user to be recognized and the second voiceprint feature of the preset voice in S101 above; reference may be made to the related description of S101, which is not repeated here.
In addition, in S703, the method for adjusting the first similarity according to the voice quality factor corresponding to the voice to be recognized to obtain the second similarity is similar to that in S102 above; reference may be made to the related description of S102, which is not repeated here.
S704, identifying the identity of the user to be identified according to the second similarity.
For example, when the identity of the user to be recognized is recognized according to the second similarity, if the second similarity is greater than or equal to a preset similarity threshold, the voice to be recognized and the preset voice are judged to belong to the same user, the identity verification of the user to be recognized is determined to be successful, and the user to be recognized is a legitimate user; if the second similarity is smaller than the preset similarity threshold, the voice to be recognized and the preset voice are judged to belong to different users, the identity verification of the user to be recognized is determined to have failed, and the user to be recognized is an illegitimate user.
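The decision rule reduces to a single threshold comparison; the threshold value below is illustrative.

```python
def verify_identity(second_similarity: float, threshold: float = 0.75) -> bool:
    """Return True if the voice to be recognized and the preset voice are
    judged to belong to the same user (illustrative threshold)."""
    return second_similarity >= threshold
```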
It can be seen that, in the embodiment of the present application, when identity recognition is performed, the voice to be recognized input by the user to be recognized and the preset voice may be input into the voiceprint feature extraction model to obtain the first voiceprint feature corresponding to the voice to be recognized and the second voiceprint feature corresponding to the preset voice; the first similarity between the first voiceprint feature and the second voiceprint feature is determined; the first similarity is adjusted according to the voice quality factor corresponding to the voice to be recognized to obtain the second similarity; and the identity of the user to be recognized is recognized according to the second similarity. Determining the second similarity in combination with the voice quality factor corresponding to the voice to be recognized improves the accuracy of the second similarity, and performing identity recognition according to this more accurate second similarity further improves the accuracy of identity recognition.
Fig. 8 is a schematic structural diagram of a voiceprint recognition device 80 according to an embodiment of the present application, for example, please refer to fig. 8, the voiceprint recognition device 80 includes:
An obtaining unit 801, configured to obtain a first similarity between a first voiceprint feature of a voice to be recognized and a second voiceprint feature of a preset voice.
The processing unit 802 is configured to adjust the first similarity according to a voice quality factor corresponding to the voice to be recognized, so as to obtain a second similarity, where the voice quality factor is used to characterize the voice quality of the voice to be recognized.
And a recognition unit 803 for performing voiceprint recognition on the voice to be recognized according to the second similarity.
Optionally, the processing unit 802 is specifically configured to obtain a similarity adjustment parameter, where the similarity adjustment parameter includes a weight of the first similarity, a weight of the voice quality factor, and a bias; the bias is used for enabling the difference value between the second similarity and the first similarity to be smaller than a preset value; and adjusting the first similarity according to the weight of the first similarity, the weight of the voice quality factor, the bias and the voice quality factor to obtain the second similarity.
Optionally, the processing unit 802 is specifically configured to determine the second similarity according to l(s) = ω_s·s + W_q^T·q + b, where l(s) denotes the second similarity, ω_s denotes the weight of the first similarity, s denotes the first similarity, W_q denotes the weight of the voice quality factor, q denotes the voice quality factor, b denotes the bias, and W_q^T denotes the transpose of W_q.
Optionally, the processing unit 802 is further configured to extract a first spectral feature from the voice to be recognized, and extract a second spectral feature from the preset voice; inputting the first frequency spectrum feature and the second frequency spectrum feature into a voiceprint feature extraction model, obtaining a first voiceprint feature through a first network model in the voiceprint feature extraction model, and obtaining a second voiceprint feature through a second network model in the voiceprint feature extraction model; the first network model and the second network model are obtained through the same loss function training.
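As a sketch of how this processing path could look in code, the example below assumes torchaudio log-mel filterbanks as the spectral features and reuses the dual-branch extractor sketched earlier; both choices, and the 16 kHz sample rate, are assumptions.

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

def extract_voiceprints(wave_to_recognize, wave_preset, model):
    # wave_*: 1-D waveform tensors at 16 kHz; model: the dual-branch extractor.
    # First and second spectral features (assumed to be log-mel spectrograms).
    spec1 = mel(wave_to_recognize).clamp(min=1e-6).log().transpose(-1, -2)
    spec2 = mel(wave_preset).clamp(min=1e-6).log().transpose(-1, -2)
    # First network model -> first voiceprint feature,
    # second network model -> second voiceprint feature.
    return model(spec1.unsqueeze(0), spec2.unsqueeze(0))
```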
The voiceprint recognition device 80 in the embodiment of the present application may implement the technical scheme of the voiceprint recognition method in the above embodiment, and its implementation principle and beneficial effects are similar to those of the voiceprint recognition method, and may refer to the implementation principle and beneficial effects of the voiceprint recognition method, which are not described herein.
Fig. 9 is a schematic structural diagram of a training device 90 for a voiceprint feature extraction model according to an embodiment of the present application, for example, referring to fig. 9, the training device 90 for a voiceprint feature extraction model may include:
An acquiring unit 901, configured to acquire a plurality of voice sample pairs and tag information corresponding to each of the plurality of voice sample pairs; each voice sample pair comprises a first voice sample collected through a first channel and a second voice sample collected through a second channel, and the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user or not.
The processing unit 902 is configured to input a first spectral feature corresponding to a first voice sample and a second spectral feature corresponding to a second voice sample included in each voice sample pair to a preset initial voiceprint feature extraction model, so as to obtain a first voiceprint feature corresponding to the first voice sample and a second voiceprint feature corresponding to the second voice sample.
The training unit 903 is configured to train the initial voiceprint feature extraction model according to the first voiceprint feature, the second voiceprint feature, and the label information corresponding to each voice sample, so as to obtain a voiceprint feature extraction model.
Optionally, the training unit 903 is specifically configured to determine a euclidean distance between a first voiceprint feature and a second voiceprint feature corresponding to each voice sample; determining a first loss function corresponding to the voice sample pair according to the Euclidean distance, the marking information corresponding to the voice sample pair and the difference; training the initial voiceprint feature extraction model according to the corresponding first loss function of each voice sample until the loss function converges, and taking the voiceprint feature extraction model when the first loss function converges as the voiceprint feature extraction model.
Optionally, the training unit 903 is specifically configured to determine the first loss function corresponding to the voice sample pair according to the aforementioned formula 2.
Wherein L1 represents the first loss function corresponding to the voice sample pair, Y represents the marking information corresponding to the voice sample pair, D_W represents the Euclidean distance, and m represents the gap.
The training device 90 for the voiceprint feature extraction model in the embodiment of the present application may execute the technical scheme of the training method for the voiceprint feature extraction model in the above embodiment, and the implementation principle and beneficial effects of the training device are similar to those of the voiceprint feature extraction model, and may refer to the implementation principle and beneficial effects of the training method for the voiceprint feature extraction model, and will not be described herein.
Fig. 10 is a schematic structural diagram of a device 100 for determining a similarity adjustment parameter according to an embodiment of the present application, for example, referring to fig. 10, the device 100 for determining a similarity adjustment parameter may include:
An obtaining unit 1001 is configured to determine a first similarity between a first voiceprint feature of a first voice sample and a second voiceprint feature of a second voice sample of each of a plurality of voice sample pairs.
The processing unit 1002 is configured to input a first similarity and a target speech quality factor into a similarity prediction function, to obtain a prediction result of each speech sample pair, where the prediction result is used to characterize a probability that the first speech sample and the second speech sample belong to the same user, and the target speech quality factor is determined based on the speech quality factor of the first speech sample and the speech quality factor of the second speech sample.
A determining unit 1003, configured to determine a target similarity adjustment parameter according to a prediction result of each voice sample pair and label information of each voice sample pair, where the label information is used to characterize whether the first voice sample and the second voice sample belong to the same user.
Optionally, the determining unit 1003 is specifically configured to determine, according to the prediction result of each voice sample pair and the label information of each voice sample pair, a second loss function corresponding to each voice sample pair; determining an average loss function according to the second loss function corresponding to each voice sample pair; and if the average loss function converges, determining a similarity adjustment parameter when the average loss function converges as the target similarity adjustment parameter.
Optionally, the determining unit 1003 is specifically configured to determine the second loss function corresponding to each voice sample pair according to the aforementioned formula 4.
Wherein L2 represents the second loss function corresponding to each voice sample pair, y represents the marking information of each voice sample pair, h(s, q) represents the similarity prediction function, the value of h(s, q) represents the prediction result of each voice sample pair, l(s) represents the second similarity between the voiceprint feature of the first voice sample and the voiceprint feature of the second voice sample in each voice sample pair, s represents the first similarity between the voiceprint feature of the first voice sample and the voiceprint feature of the second voice sample in each voice sample pair, and q represents the target voice quality factor.
Optionally, the processing unit 1002 is further configured to determine a difference between the speech quality factor of the first speech sample and the speech quality factor of the second speech sample as the target speech quality factor; or determining an average value between the voice quality factor of the first voice sample and the voice quality factor of the second voice sample as a target voice quality factor; or determining the voice quality factor of the voice sample with the worst voice quality among the first voice sample and the second voice sample as the target voice quality factor.
The determining device 100 for the similarity adjustment parameter shown in the embodiment of the present application may execute the technical scheme of the determining method for the similarity adjustment parameter in the above embodiment, and the implementation principle and the beneficial effects of the determining method for the similarity adjustment parameter are similar to those of the determining method for the similarity adjustment parameter, and may refer to the implementation principle and the beneficial effects of the determining method for the similarity adjustment parameter, which are not described herein.
Fig. 11 is a schematic structural diagram of a user identity recognition device 110 according to an embodiment of the present application, for example, referring to fig. 11, the user identity recognition device 110 may include:
the obtaining unit 1101 is configured to obtain a voice to be recognized input by a user to be recognized.
The processing unit 1102 is configured to input a voice to be recognized and a preset voice into the voiceprint feature extraction model, so as to obtain a first voiceprint feature corresponding to the voice to be recognized and a second voiceprint feature corresponding to the preset voice.
A determining unit 1103 for determining a first similarity between the first voiceprint feature and the second voiceprint feature; and adjusting the first similarity according to a voice quality factor corresponding to the voice to be recognized to obtain a second similarity, wherein the voice quality factor is used for representing the voice quality of the voice to be recognized.
The identifying unit 1104 is configured to identify the identity of the user to be identified according to the second similarity.
The user identity recognition device 110 in the embodiment of the present application may execute the technical scheme of the user identity recognition method in the above embodiment, and the implementation principle and beneficial effects of the user identity recognition method are similar to those of the user identity recognition method, and may refer to the implementation principle and beneficial effects of the user identity recognition method, and will not be described herein.
Fig. 12 is a schematic structural diagram of an electronic device 120 according to an embodiment of the present application, for example, referring to fig. 12, the electronic device 120 may include a processor 1201 and a memory 1202; wherein,
The memory 1202 is used for storing a computer program.
The processor 1201 is configured to read the computer program stored in the memory 1202, and execute the voiceprint recognition method in the above embodiment, or execute the training method of the voiceprint feature extraction model in the above embodiment, or execute the similarity adjustment parameter determination method in the above embodiment, or execute the user identification method in the above embodiment according to the computer program in the memory 1202.
Alternatively, the memory 1202 may be separate or integrated with the processor 1201. When the memory 1202 is separate from the processor 1201, the electronic device 120 may further include: a bus connecting the memory 1202 and the processor 1201.
Optionally, the present embodiment further includes: a communication interface, which may be connected to the processor 1201 by a bus. The processor 1201 may control the communication interface to implement the functions of acquisition and transmission of the electronic device 120 described above.
For example, in the embodiment of the present application, the electronic device 120 may be a terminal or a server, and may be specifically set according to actual needs.
The electronic device 120 shown in the embodiment of the present application can implement the technical solution of the voiceprint recognition method in the above embodiment, the training method of the voiceprint feature extraction model in the above embodiment, the method for determining the similarity adjustment parameter in the above embodiment, or the user identity recognition method in the above embodiment. The implementation principles and beneficial effects are similar to those of the corresponding methods, to which reference may be made; details are not repeated here.
The embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions. When a processor executes the computer-executable instructions, the technical solution of the voiceprint recognition method, the training method of the voiceprint feature extraction model, the method for determining the similarity adjustment parameter, or the user identity recognition method in the above embodiments is implemented. The implementation principles and beneficial effects are similar to those of the corresponding methods, to which reference may be made; details are not repeated here.
The embodiment of the present application further provides a computer program product, including a computer program. When the computer program is executed by a processor, the technical solution of the voiceprint recognition method, the training method of the voiceprint feature extraction model, the method for determining the similarity adjustment parameter, or the user identity recognition method in the above embodiments is implemented. The implementation principles and beneficial effects are similar to those of the corresponding methods, to which reference may be made; details are not repeated here.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection illustrated or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated modules implemented in the form of software functional modules may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods of the embodiments of the present application.
It should be understood that the above processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the present application may be embodied as being executed directly by a hardware processor, or executed by a combination of hardware and software modules in the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile memory NVM, such as at least one magnetic disk memory, and may also be a U-disk, a removable hard disk, a read-only memory, a magnetic disk or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus in the drawings of the present application is not limited to only one bus or one type of bus.
The computer-readable storage medium described above may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (18)

1. A method of voiceprint recognition comprising:
Acquiring a first similarity between a first voiceprint feature of a voice to be recognized and a second voiceprint feature of a preset voice;
The first similarity is adjusted according to the voice quality factor corresponding to the voice to be recognized, so that a second similarity is obtained, and the voice quality factor is used for representing the voice quality of the voice to be recognized;
and carrying out voiceprint recognition on the voice to be recognized according to the second similarity.
2. The method of claim 1, wherein the adjusting the first similarity according to the voice quality factor corresponding to the voice to be recognized to obtain the second similarity includes:
obtaining a similarity adjustment parameter, wherein the similarity adjustment parameter comprises the weight of the first similarity, the weight of the voice quality factor, and a bias; wherein the bias is used for enabling the difference value between the second similarity and the first similarity to be smaller than a preset value;
and adjusting the first similarity according to the weight of the first similarity, the weight of the voice quality factor, the bias and the voice quality factor to obtain the second similarity.
3. The method of claim 2, wherein said adjusting said first similarity based on said first similarity weight, said voice quality factor weight, said bias, and said voice quality factor to obtain said second similarity comprises:
According to the following formula, determining the second similarity;
Wherein l(s) represents the second similarity, ω_s represents the weight of the first similarity, s represents the first similarity, W_q represents the weight of the voice quality factor, q represents the voice quality factor, b represents the bias, and W_q^T represents the transpose of W_q.
4. A method according to any one of claims 1-3, wherein the method further comprises:
Extracting a first frequency spectrum characteristic from the voice to be recognized, and extracting a second frequency spectrum characteristic from the preset voice;
inputting the first frequency spectrum feature and the second frequency spectrum feature into a voiceprint feature extraction model, obtaining the first voiceprint feature through a first network model in the voiceprint feature extraction model, and obtaining the second voiceprint feature through a second network model in the voiceprint feature extraction model; the first network model and the second network model are obtained through the same loss function training.
5. A training method of a voiceprint feature extraction model is characterized by comprising the following steps:
Acquiring a plurality of voice sample pairs and marking information corresponding to each voice sample pair in the plurality of voice sample pairs; each voice sample pair comprises a first voice sample collected through a first channel and a second voice sample collected through a second channel, and the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user or not;
inputting a first frequency spectrum characteristic corresponding to the first voice sample and a second frequency spectrum characteristic corresponding to the second voice sample included in each voice sample pair into a preset initial voiceprint characteristic extraction model to obtain a first voiceprint characteristic corresponding to the first voice sample and a second voiceprint characteristic corresponding to the second voice sample;
training the initial voiceprint feature extraction model according to the first voiceprint feature, the second voiceprint feature and the marking information corresponding to each voice sample to obtain the voiceprint feature extraction model, wherein the voiceprint feature extraction model is used for realizing the voiceprint recognition method according to any one of claims 1-4.
6. The method of claim 5, wherein training the initial voiceprint feature extraction model based on the first voiceprint feature, the second voiceprint feature, and the label information corresponding to each of the voice samples comprises:
Determining Euclidean distance between a first voiceprint feature and a second voiceprint feature corresponding to each voice sample;
Determining a first loss function corresponding to the voice sample pair according to the Euclidean distance, the marking information corresponding to the voice sample pair and the difference;
Training the initial voiceprint feature extraction model according to the corresponding first loss function of each voice sample until the first loss function converges, and taking the voiceprint feature extraction model when the first loss function converges as the voiceprint feature extraction model.
7. The method of claim 6, wherein determining the first loss function corresponding to the pair of voice samples based on the euclidean distance, the label information corresponding to the pair of voice samples, and the gap comprises:
According to the following formula, determining a first loss function corresponding to the voice sample pair;
wherein L1 represents the first loss function corresponding to the voice sample pair, Y represents the marking information corresponding to the voice sample pair, D_W represents the Euclidean distance, and m represents the gap.
8. A method for determining a similarity adjustment parameter, comprising:
Determining a first similarity between a first voiceprint feature of a first voice sample and a second voiceprint feature of a second voice sample of each of a plurality of pairs of voice samples;
Inputting the first similarity and a target voice quality factor into a similarity prediction function to obtain a prediction result of each voice sample pair, wherein the prediction result is used for representing the probability that the first voice sample and the second voice sample belong to the same user, and the target voice quality factor is determined based on the voice quality factor of the first voice sample and the voice quality factor of the second voice sample;
And determining target similarity adjustment parameters according to the prediction results of the voice sample pairs and the marking information of the voice sample pairs, wherein the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user.
9. The method of claim 8, wherein determining the target similarity adjustment parameter based on the predicted result of each of the pair of voice samples and the label information of each of the pair of voice samples comprises:
determining a second loss function corresponding to each voice sample pair according to the prediction result of each voice sample pair and the marking information of each voice sample pair;
determining an average loss function according to the second loss function corresponding to each voice sample pair;
And if the average loss function converges, determining a similarity adjustment parameter when the average loss function converges as the target similarity adjustment parameter.
10. The method of claim 9, wherein determining the second loss function corresponding to each of the voice sample pairs based on the prediction result of each of the voice sample pairs and the label information of each of the voice sample pairs comprises:
According to the following formula, determining a second loss function corresponding to each voice sample pair;
wherein L2 represents the second loss function corresponding to each voice sample pair, y represents the marking information of each voice sample pair, h(s, q) represents the similarity prediction function, the value of h(s, q) represents the prediction result of each voice sample pair, l(s) represents the second similarity between the voiceprint feature of the first voice sample and the voiceprint feature of the second voice sample in each voice sample pair, s represents the first similarity between the voiceprint feature of the first voice sample and the voiceprint feature of the second voice sample in each voice sample pair, and q represents the target voice quality factor.
11. The method according to any one of claims 8-10, further comprising:
determining a difference between a speech quality factor of the first speech sample and a speech quality factor of the second speech sample as the target speech quality factor; or alternatively
Determining an average value between the voice quality factor of the first voice sample and the voice quality factor of the second voice sample as the target voice quality factor; or alternatively
And determining the voice quality factor of the voice sample with the worst voice quality in the first voice sample and the second voice sample as the target voice quality factor.
12. A method for identifying a user, comprising:
acquiring voice to be recognized input by a user to be recognized;
inputting the voice to be recognized and the preset voice into a voiceprint feature extraction model to obtain a first voiceprint feature corresponding to the voice to be recognized and a second voiceprint feature corresponding to the preset voice;
Determining a first similarity between the first voiceprint feature and the second voiceprint feature; the first similarity is adjusted according to the voice quality factor corresponding to the voice to be recognized, so that a second similarity is obtained, and the voice quality factor is used for representing the voice quality of the voice to be recognized;
and identifying the identity of the user to be identified according to the second similarity.
13. A voiceprint recognition apparatus, comprising:
an acquisition unit, configured to acquire a first similarity between a first voiceprint feature of a voice to be recognized and a second voiceprint feature of a preset voice;
a processing unit, configured to adjust the first similarity according to a voice quality factor corresponding to the voice to be recognized to obtain a second similarity, wherein the voice quality factor is used for representing the voice quality of the voice to be recognized; and
a recognition unit, configured to perform voiceprint recognition on the voice to be recognized according to the second similarity.
14. A training device for a voiceprint feature extraction model, comprising:
an obtaining unit, configured to obtain a plurality of voice sample pairs and marking information corresponding to each voice sample pair in the plurality of voice sample pairs, wherein each voice sample pair comprises a first voice sample collected through a first channel and a second voice sample collected through a second channel, and the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user;
a processing unit, configured to input a first spectrum feature corresponding to the first voice sample and a second spectrum feature corresponding to the second voice sample included in each voice sample pair into a preset initial voiceprint feature extraction model to obtain a first voiceprint feature corresponding to the first voice sample and a second voiceprint feature corresponding to the second voice sample; and
a training unit, configured to train the initial voiceprint feature extraction model according to the first voiceprint feature, the second voiceprint feature and the marking information corresponding to each voice sample pair to obtain the voiceprint feature extraction model, wherein the voiceprint feature extraction model is used for implementing the voiceprint recognition method according to any one of claims 1-4.
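As a rough illustration of how such a training device might operate on channel-mismatched pairs, the PyTorch sketch below trains a toy extractor on (first-channel, second-channel) spectrum-feature pairs with same/different-user labels. The network architecture, the cosine-based objective and all dimensions are assumptions for illustration and are not taken from the claims.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoiceprintExtractor(nn.Module):
    """Toy stand-in for the initial voiceprint feature extraction model: maps a
    spectrum feature vector to an L2-normalised voiceprint embedding. The layer
    sizes are arbitrary illustrative choices."""
    def __init__(self, spec_dim=80, emb_dim=192):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(spec_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, spec):
        return F.normalize(self.net(spec), dim=-1)

def train_step(model, optimizer, spec_a, spec_b, labels):
    """One pair-based training step: the cosine similarity of the two embeddings
    is pushed towards 1 for same-user pairs and towards 0 for different-user
    pairs. The BCE-on-cosine objective is an assumption; the claims only state
    that training uses the two voiceprint features and the marking information."""
    emb_a = model(spec_a)                        # first voiceprint feature
    emb_b = model(spec_b)                        # second voiceprint feature
    sim = (emb_a * emb_b).sum(dim=-1)            # cosine similarity in [-1, 1]
    prob = (sim + 1.0) / 2.0                     # rescale to [0, 1] for BCE
    loss = F.binary_cross_entropy(prob, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```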
15. A device for determining a similarity adjustment parameter, comprising:
an acquisition unit, configured to determine a first similarity between a first voiceprint feature of a first voice sample and a second voiceprint feature of a second voice sample in each of a plurality of voice sample pairs;
a processing unit, configured to input the first similarity and a target voice quality factor into a similarity prediction function to obtain a prediction result of each voice sample pair, wherein the prediction result is used for representing the probability that the first voice sample and the second voice sample belong to the same user, and the target voice quality factor is determined based on the voice quality factor of the first voice sample and the voice quality factor of the second voice sample; and
a determining unit, configured to determine target similarity adjustment parameters according to the prediction result of each voice sample pair and the marking information of each voice sample pair, wherein the marking information is used for representing whether the first voice sample and the second voice sample belong to the same user.
16. A user identification device, comprising:
an acquisition unit, configured to acquire the voice to be recognized input by a user to be recognized;
a processing unit, configured to input the voice to be recognized and a preset voice into a voiceprint feature extraction model to obtain a first voiceprint feature corresponding to the voice to be recognized and a second voiceprint feature corresponding to the preset voice;
a determining unit, configured to determine a first similarity between the first voiceprint feature and the second voiceprint feature, and to adjust the first similarity according to a voice quality factor corresponding to the voice to be recognized to obtain a second similarity, wherein the voice quality factor is used for representing the voice quality of the voice to be recognized; and
an identification unit, configured to identify the identity of the user to be recognized according to the second similarity.
17. An electronic device, comprising a memory and a processor, wherein:
the memory is configured to store a computer program; and
the processor is configured to read the computer program stored in the memory and, according to the computer program, execute the voiceprint recognition method according to any one of claims 1 to 4, or the training method of the voiceprint feature extraction model according to any one of claims 5 to 7, or the similarity adjustment parameter determination method according to any one of claims 8 to 11, or the user identification method according to claim 12.
18. A readable storage medium having computer-executable instructions stored thereon, wherein the computer-executable instructions, when executed by a processor, are used for implementing the voiceprint recognition method according to any one of claims 1 to 4, or the training method of the voiceprint feature extraction model according to any one of claims 5 to 7, or the similarity adjustment parameter determination method according to any one of claims 8 to 11, or the user identification method according to claim 12.
CN202110838405.2A 2021-07-23 2021-07-23 Voiceprint recognition method and device and electronic equipment Active CN113593579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110838405.2A CN113593579B (en) 2021-07-23 2021-07-23 Voiceprint recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110838405.2A CN113593579B (en) 2021-07-23 2021-07-23 Voiceprint recognition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113593579A CN113593579A (en) 2021-11-02
CN113593579B true CN113593579B (en) 2024-04-30

Family

ID=78249312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110838405.2A Active CN113593579B (en) 2021-07-23 2021-07-23 Voiceprint recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113593579B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115565537B (en) * 2022-09-01 2024-03-15 荣耀终端有限公司 Voiceprint recognition method and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1006507A1 (en) * 1998-11-27 2000-06-07 Ascom Systec AG Method for realizing speaker recognition
JP2006189544A (en) * 2005-01-05 2006-07-20 Matsushita Electric Ind Co Ltd Interpretation system, interpretation method, recording medium with interpretation program recorded thereon, and interpretation program
CN102270451A (en) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN105632499A (en) * 2014-10-31 2016-06-01 株式会社东芝 Method and device for optimizing voice recognition result
CN108040032A (en) * 2017-11-02 2018-05-15 阿里巴巴集团控股有限公司 A kind of voiceprint authentication method, account register method and device
CN109614881A (en) * 2018-11-19 2019-04-12 中国地质大学(武汉) It can the biometric authentication method of automatic adjusument threshold value, equipment and storage equipment
CN111785283A (en) * 2020-05-18 2020-10-16 北京三快在线科技有限公司 Voiceprint recognition model training method and device, electronic equipment and storage medium
CN112053695A (en) * 2020-09-11 2020-12-08 北京三快在线科技有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN112435673A (en) * 2020-12-15 2021-03-02 北京声智科技有限公司 Model training method and electronic terminal

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9992123B2 (en) * 2016-06-30 2018-06-05 Verizon Patent And Licensing Inc. Methods and systems for evaluating voice over Wi-Fi (VoWiFi) call quality
US10667155B2 (en) * 2018-07-16 2020-05-26 Verizon Patent And Licensing Inc. Methods and systems for evaluating voice call quality

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1006507A1 (en) * 1998-11-27 2000-06-07 Ascom Systec AG Method for realizing speaker recognition
JP2006189544A (en) * 2005-01-05 2006-07-20 Matsushita Electric Ind Co Ltd Interpretation system, interpretation method, recording medium with interpretation program recorded thereon, and interpretation program
CN102270451A (en) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN105632499A (en) * 2014-10-31 2016-06-01 株式会社东芝 Method and device for optimizing voice recognition result
CN108040032A (en) * 2017-11-02 2018-05-15 阿里巴巴集团控股有限公司 A kind of voiceprint authentication method, account register method and device
WO2019085575A1 (en) * 2017-11-02 2019-05-09 阿里巴巴集团控股有限公司 Voiceprint authentication method and apparatus, and account registration method and apparatus
CN109614881A (en) * 2018-11-19 2019-04-12 中国地质大学(武汉) It can the biometric authentication method of automatic adjusument threshold value, equipment and storage equipment
CN111785283A (en) * 2020-05-18 2020-10-16 北京三快在线科技有限公司 Voiceprint recognition model training method and device, electronic equipment and storage medium
CN112053695A (en) * 2020-09-11 2020-12-08 北京三快在线科技有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN112435673A (en) * 2020-12-15 2021-03-02 北京声智科技有限公司 Model training method and electronic terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Minmin; Ma Jun; Gong Chenxiao; Chen Liangliang; Zheng Qianqian. Design of voiceprint recognition system software based on MATLAB. Science & Technology Vision (科技视界), 2013, pp. 7, 54. *

Also Published As

Publication number Publication date
CN113593579A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN107517207A (en) Server, auth method and computer-readable recording medium
WO2017215558A1 (en) Voiceprint recognition method and device
RU2727720C1 (en) Method and device for personal identification
CN112949708B (en) Emotion recognition method, emotion recognition device, computer equipment and storage medium
CN105991593B (en) A kind of method and device identifying consumer's risk
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
WO2021051608A1 (en) Voiceprint recognition method and device employing deep learning, and apparatus
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN111508524B (en) Method and system for identifying voice source equipment
CN112382300A (en) Voiceprint identification method, model training method, device, equipment and storage medium
CN110781952A (en) Image identification risk prompting method, device, equipment and storage medium
CN109920435A (en) A kind of method for recognizing sound-groove and voice print identification device
CN113593579B (en) Voiceprint recognition method and device and electronic equipment
CN117237757A (en) Face recognition model training method and device, electronic equipment and medium
EP4170526A1 (en) An authentication system and method
CN111161759A (en) Audio quality evaluation method and device, electronic equipment and computer storage medium
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models
CN106373576B (en) Speaker confirmation method and system based on VQ and SVM algorithms
CN112735381B (en) Model updating method and device
CN114078484B (en) Speech emotion recognition method, device and storage medium
CN112820298B (en) Voiceprint recognition method and device
Li et al. Advanced RawNet2 with Attention-based Channel Masking for Synthetic Speech Detection
CN114333786A (en) Speech emotion recognition method and related device, electronic equipment and storage medium
CN112309404A (en) Machine voice identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant