CN113921017A - Voice identity detection method and device, electronic equipment and storage medium

Info

Publication number: CN113921017A
Application number: CN202111524105.3A
Authority: CN (China)
Prior art keywords: voice, speech, syllable, same, comparison
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 张伟彬, 丁俊豪, 卢光明
Current Assignee: Voiceai Technologies Co ltd
Original Assignee: Voiceai Technologies Co ltd
Application filed by Voiceai Technologies Co ltd
Priority to CN202111524105.3A
Publication of CN113921017A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephone Function (AREA)

Abstract

The application discloses a voice identity verification method and device, an electronic device, and a storage medium. The method includes: performing syllable matching on the speech recognition result of a comparison speech and the speech recognition result of a sample speech, and determining a plurality of same syllables of the comparison speech relative to the sample speech, the plurality of same syllables including syllables of at least two syllable types; calculating the voiceprint similarity between the first speech segment corresponding to each same syllable and the corresponding second speech segment according to the speech features of the two segments, the first speech segment being the segment corresponding to the same syllable in the comparison speech and the second speech segment being the corresponding segment in the sample speech; and determining the identity verification result of the comparison speech and the sample speech according to the voiceprint similarities. This scheme can improve the accuracy of the identity verification result.

Description

Voice identity detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular to a voice identity verification method and apparatus, an electronic device, and a storage medium.
Background
Voice identity verification determines, by comparing and analyzing two pieces of input speech, whether they come from the same person. In the prior art, the accuracy of voice identity verification is not high, so the reliability of the verification result is also low. How to improve the accuracy of voice identity verification is therefore a technical problem urgently to be solved in the prior art.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present application provide a voice identity verification method and apparatus, an electronic device, and a storage medium to improve upon the problems above.
According to an aspect of the embodiments of the present application, there is provided a voice identity verification method, including: performing syllable matching on the speech recognition result of a comparison speech and the speech recognition result of a sample speech, and determining a plurality of same syllables of the comparison speech relative to the sample speech, the plurality of same syllables including syllables of at least two syllable types, the syllable types including a word type, a character type, and a phoneme type; calculating the voiceprint similarity between the first speech segment corresponding to each same syllable and the corresponding second speech segment according to the speech features of the first speech segment and those of the second speech segment, the first speech segment being the speech segment corresponding to the same syllable in the comparison speech and the second speech segment being the speech segment corresponding to the same syllable in the sample speech; and determining the identity verification result of the comparison speech and the sample speech according to the voiceprint similarities between the first speech segments corresponding to the same syllables and the corresponding second speech segments.
According to an aspect of the embodiments of the present application, there is provided a voice identity verification device, including: a syllable matching module, configured to perform syllable matching on the speech recognition result of a comparison speech and the speech recognition result of a sample speech and determine a plurality of same syllables of the comparison speech relative to the sample speech, the plurality of same syllables including syllables of at least two syllable types, the syllable types including a word type, a character type, and a phoneme type; a voiceprint similarity calculation module, configured to calculate the voiceprint similarity between the first speech segment corresponding to each same syllable and the corresponding second speech segment according to the speech features of the two segments, the first speech segment being the speech segment corresponding to the same syllable in the comparison speech and the second speech segment being the speech segment corresponding to the same syllable in the sample speech; and an identity verification result determining module, configured to determine the identity verification result of the comparison speech and the sample speech according to the voiceprint similarities between the first speech segments corresponding to the same syllables and the corresponding second speech segments.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: a processor; a memory having computer readable instructions stored thereon which, when executed by the processor, implement a method of verifying speech identity as described above.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer-readable instructions, which, when executed by a processor, implement the method for checking the identity of speech as described above.
In the scheme of the application, syllable matching is performed based on the speech recognition result of the comparison speech and that of the sample speech to determine a plurality of same syllables of the comparison speech relative to the sample speech; the voiceprint similarity between the first speech segment corresponding to each same syllable and the corresponding second speech segment is then calculated according to the speech features of the two segments; and the identity verification result of the comparison speech and the sample speech is determined according to the calculated voiceprint similarities. Because the plurality of same syllables include syllables of at least two syllable types, and syllables of different syllable types differ in stability and in the granularity of the features they express, combining same syllables of at least two syllable types in the voiceprint similarity calculation and in determining the verification result can improve the accuracy and validity of the identity verification result.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a flow chart illustrating a method of verifying speech identity according to an embodiment of the present application.
Fig. 2 is a flow chart illustrating obtaining a speech recognition result according to an embodiment of the present application.
FIG. 3 is a flow chart illustrating syllable matching according to an embodiment of the present application.
FIG. 4 is a flowchart illustrating step 120 according to an embodiment of the present application.
FIGS. 5A-5C show spectrograms of the first speech segments and the corresponding second speech segments for three identical syllables according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating step 120 according to another embodiment of the present application.
Fig. 7 is a flowchart illustrating steps 120 and 130 according to an embodiment of the present application.
Fig. 8 is a flowchart illustrating step 130 according to another embodiment of the present application.
Fig. 9 is a flowchart illustrating step 130 according to another embodiment of the present application.
Fig. 10 is a flowchart illustrating step 130 according to another embodiment of the present application.
Fig. 11 is a block diagram of a device for verifying speech identity according to an embodiment of the present application.
FIG. 12 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It should be noted that: reference herein to "a plurality" means two or more. "And/or" describes the association relationship of associated objects and means that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
Fig. 1 is a flowchart illustrating a voice identity verification method according to an embodiment of the present application. The method may be performed by a computer device with processing capability, such as a server or a terminal device (e.g., a desktop or notebook computer), or by a verification system comprising both a server and a terminal device; this is not specifically limited here. Referring to fig. 1, the method includes at least steps 110 to 130, described in detail as follows:
Step 110, performing syllable matching on the speech recognition result of a comparison speech and the speech recognition result of a sample speech, and determining a plurality of same syllables of the comparison speech relative to the sample speech; the plurality of same syllables includes syllables of at least two syllable types, the syllable types including a word type, a character type, and a phoneme type.
In some embodiments, the speech recognition result may be obtained by performing speech-to-text recognition on the comparison speech (or the sample speech), so that the speech recognition result of the comparison speech indicates the text content corresponding to the comparison speech and the speech recognition result of the sample speech indicates the text content corresponding to the sample speech.
In some embodiments, the speech recognition result may be obtained by performing phoneme recognition on the comparison speech (or the sample speech), so that the speech recognition result of the comparison speech indicates the phoneme content corresponding to the comparison speech, and the speech recognition result of the sample speech indicates the phoneme content corresponding to the sample speech.
A phoneme is the minimum unit of speech divided according to the natural attributes of speech, and phonemes fall into two major categories: vowels and consonants. Phonemes are identified from the pronunciation actions within a syllable, one action forming one phoneme. For example, the Mandarin syllable ā ("ah") has only one phoneme, a; ài ("love") has two phonemes, a and i; and dài ("generation") has three phonemes, d, a, and i.
In this scheme, a syllable of the word type is a word, a syllable of the character type is a character, and a syllable of the phoneme type is a phoneme. Thus, in this scheme, the same syllables may be the same word, the same character, or the same phoneme.
In some embodiments, the speech recognition result includes both phoneme content and text content, so that phoneme matching may be performed based on the phoneme content to determine identical phonemes, character matching may be performed based on the text content to determine identical characters, and word matching may be performed based on the text content to determine identical words.
Fig. 2 is a flowchart illustrating how a speech recognition result is obtained according to an embodiment of the present application. As shown in fig. 2, the process includes: step 210, inputting a speech signal; step 220, active speech detection; step 230, speech recognition; and step 240, outputting the phoneme content and the text content.
If the speech signal in step 210 is the comparison speech, the phoneme content and text content output in step 240 are the speech recognition result of the comparison speech; conversely, if the speech signal in step 210 is the sample speech, they are the speech recognition result of the sample speech.
Active speech detection, also called speech endpoint detection or speech boundary detection, identifies and eliminates long silence segments from the speech signal and determines the active speech segments (i.e., non-silence segments), so that in step 230 speech recognition is performed only on the active speech segments, without attending to the silence segments. As an illustration, this can be done with an off-the-shelf detector, as sketched below.
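The following minimal sketch uses the py-webrtcvad package (a Python binding of the WebRTC voice activity detector); the choice of library, the frame size, and the aggressiveness level are assumptions of this example rather than anything mandated by the application.

```python
import webrtcvad

def active_segments(pcm16: bytes, sample_rate: int = 16000, frame_ms: int = 30):
    """Yield (start_sec, end_sec) spans of non-silence in 16-bit mono PCM."""
    vad = webrtcvad.Vad(2)                       # aggressiveness 0-3
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2
    spans, start = [], None
    for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
        t = i / 2 / sample_rate                  # byte offset -> seconds
        if vad.is_speech(pcm16[i:i + frame_bytes], sample_rate):
            start = t if start is None else start
        elif start is not None:
            spans.append((start, t))             # silence closes a span
            start = None
    if start is not None:
        spans.append((start, len(pcm16) / 2 / sample_rate))
    return spans
```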
The same syllable of the comparison speech with respect to the sample speech refers to a syllable included in both the speech recognition result of the comparison speech and the speech recognition result of the sample speech.
In some embodiments, a plurality of syllables may be selected from the sample speech as reference syllables. Each reference syllable is then matched against the speech recognition result of the comparison speech: if the reference syllable is included in that result, it is determined to be a same syllable of the comparison speech relative to the sample speech; otherwise it is not. The process is repeated until every reference syllable has been checked. It will be appreciated that, to ensure the determined same syllables include syllables of at least two syllable types, the plurality of reference syllables must also include syllables of at least two syllable types. A minimal sketch of this matching loop is given below.
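This sketch assumes the recognition results have already been reduced to plain token sequences (words, characters, or phonemes); real ASR output would also carry the time position information discussed below. The helper name is hypothetical.

```python
def match_same_syllables(reference_syllables, comparison_tokens):
    """Return the reference syllables that also occur in the comparison speech."""
    comparison_set = set(comparison_tokens)      # tokens of one syllable type
    return [s for s in reference_syllables if s in comparison_set]

# Each syllable type is matched separately, then the results are merged, so
# the final set contains same syllables of at least two syllable types.
same_words = match_same_syllables(
    ["today", "where"], ["where", "do", "you", "go", "today"])
```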
FIG. 3 is a flowchart illustrating syllable matching according to an embodiment of the present application. In this embodiment, the speech recognition result of the comparison speech includes the phoneme content and text content corresponding to the comparison speech, and the speech recognition result of the sample speech includes the phoneme content and text content corresponding to the sample speech. As shown in fig. 3, the phoneme content and text content corresponding to the comparison speech are syllable-matched against those corresponding to the sample speech, and a plurality of same syllables, together with the time position information of each same syllable, are output. The time position information of a same syllable includes its time position information in the comparison speech and its time position information in the sample speech.
Specifically, the time position information of a same syllable in the comparison speech can be understood as the time position information of the first speech segment corresponding to that syllable in the comparison speech, which indicates the start time and the end time of the first speech segment within the comparison speech.
For example, suppose the text content corresponding to the comparison speech is "where do you go today" and the same syllable is "today"; if the speech segment corresponding to the word "today" starts at time t1 and ends at time t2 in the comparison speech, then the time position information of the first speech segment corresponding to the same syllable "today" indicates the start time t1 and the end time t2.
In some embodiments, while performing speech recognition on the comparison speech (or the sample speech), for example speech-to-text recognition, not only is the text content corresponding to each audio segment identified, but the time position information of the audio segment corresponding to each piece of text content (a word or a character) is also determined. Thus, during syllable matching based on the speech recognition results of the comparison speech and the sample speech, not only can the same words and/or same characters of the comparison speech relative to the sample speech be determined, but the time position information of each same word or same character in the comparison speech and in the sample speech can also be determined.
Similarly, when the speech recognition performed on the comparison speech (or the sample speech) is phoneme recognition, not only is the phoneme corresponding to each audio segment identified, but the time position information of that audio segment in the comparison speech (or the sample speech) is also determined, so that when syllables are matched, the time position information of each same phoneme in the comparison speech and in the sample speech can be determined correspondingly.
It can be understood that, because the durations of the audio segments corresponding to syllables of different syllable types differ, the comprehensiveness of the sound features they embody also differs. Generally, among the word type, the character type, and the phoneme type, the audio corresponding to a word-type syllable has the longest duration, so the audio segment corresponding to a word-type syllable carries the largest amount of information about the speaker's voiceprint features; the audio corresponding to a phoneme-type syllable has the shortest duration, so its audio segment carries less voiceprint information, but it can express the speaker's voiceprint features at a finer granularity.
Therefore, in the scheme of the application, same syllables of at least two syllable types are combined to verify the identity of the comparison speech against the sample speech. Both the comprehensiveness and the fineness with which the audio corresponding to the same syllables expresses voiceprint features are thereby taken into account, so the accuracy of the identity verification result can be ensured.
Step 120, calculating the voiceprint similarity between the first speech segment corresponding to each same syllable and the corresponding second speech segment according to the speech features of the first speech segment and those of the second speech segment; the first speech segment is the speech segment corresponding to the same syllable in the comparison speech, and the second speech segment is the speech segment corresponding to the same syllable in the sample speech.
In some embodiments, prior to step 120, the method further comprises: acquiring the voice characteristics of the first voice section corresponding to the same syllable; and acquiring the voice characteristics of the second voice section corresponding to the same syllable.
In some embodiments, the speech recognition result of the comparison speech indicates time position information of each syllable included in the comparison speech; in this embodiment, the step of obtaining the speech features of the first speech segment corresponding to the same syllable further includes: according to the time position information of the same syllable in the comparison voice, segment extraction is carried out in the comparison voice to obtain a first voice section corresponding to the same syllable; and performing voice feature extraction on the first voice section to obtain the voice feature of the first voice section.
The speech features may include characteristic parameters such as the center frequency, bandwidth, and intensity of the formants, and characteristic curves such as the formant trend, the fundamental frequency trajectory, and the LPC spectrum. In an embodiment, frequency domain features such as the formants and fundamental frequency of each speech frame in the speech segments (the first speech segment and the second speech segment) may be calculated automatically by speech signal processing algorithms, for example the autocorrelation method, the cepstrum method, or linear predictive coding (LPC), so as to obtain the feature curves (formant trend, fundamental frequency trajectory, LPC spectrum, and the like) of the speech segments.
After the first speech segment corresponding to a same syllable is extracted, the characteristic parameters (center frequency, bandwidth, and intensity) and the feature curves (formant trend, fundamental frequency trajectory, LPC spectrum, and the like) can be extracted in the above manner to obtain the speech features of the first speech segment.
The speech features of the second speech segment corresponding to the same syllable can be extracted in a similar manner, and are not described herein again.
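As a concrete illustration, two of the feature curves named above (the fundamental frequency trajectory and frame-wise LPC coefficients) could be computed roughly as follows; the use of librosa, the frame sizes, and the LPC order are assumptions of this sketch, not something specified by the application.

```python
import librosa
import numpy as np

def speech_feature_curves(y: np.ndarray, sr: int = 16000):
    # Fundamental-frequency trajectory via the YIN estimator.
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
    # Frame-wise LPC coefficients (order 12) as a stand-in for the LPC spectrum.
    frames = librosa.util.frame(y, frame_length=512, hop_length=256).T
    lpc = np.array([librosa.lpc(np.ascontiguousarray(f), order=12)
                    for f in frames])
    return f0, lpc
```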
In other embodiments, the speech recognition result of the comparison speech indicates the time position information of each syllable included in the comparison speech; in this embodiment, the step of obtaining the speech characteristics of the first speech segment corresponding to the same syllable further includes: determining a segmented spectrogram corresponding to the first voice segment in the spectrogram of the comparison voice according to the time position information of the same syllable in the comparison voice; and acquiring the voice features of the first voice section from the segmented voice spectrogram corresponding to the first voice section.
Where the spectrogram of the comparison speech and the spectrogram of the sample speech are generated in advance, the segment spectrogram corresponding to each syllable can be located within the full spectrogram through the time position information of the syllable indicated by the speech recognition result, so the corresponding segment spectrogram can be extracted from the spectrogram of the speech based on the syllable's time position information.
The spectrogram of a speech is obtained by time-frequency transformation of the time-domain signal, which computes the frequency domain features of each speech frame; the spectrogram therefore expresses the frequency domain features of each syllable, such as the formants, center frequency, and bandwidth. Accordingly, the speech features of the first speech segment can be obtained directly from the segment spectrogram corresponding to it, and the speech features of the second speech segment can be obtained in a similar manner, which is not described again here.
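A sketch of cutting the segment spectrogram for one syllable out of a precomputed magnitude spectrogram, given the syllable's start and end times; the STFT hop length here is an illustrative assumption.

```python
import numpy as np

def segment_spectrogram(spec: np.ndarray, t_start: float, t_end: float,
                        sr: int = 16000, hop_length: int = 256) -> np.ndarray:
    """spec has shape (n_freq_bins, n_frames); return the frames covering
    the interval [t_start, t_end] in seconds."""
    f0 = int(t_start * sr / hop_length)
    f1 = int(np.ceil(t_end * sr / hop_length))
    return spec[:, f0:f1]
```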
Referring to fig. 3, in step 130, the identity check result of the comparison speech and the sample speech is determined according to the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable.
Identity verification tests whether two speeches come from the same person, so the identity verification result of the comparison speech and the sample speech indicates whether the two come from the same person, or the probability that they do.
In some embodiments, the number of target same syllables, that is, same syllables whose first speech segment and second speech segment have a voiceprint similarity exceeding a similarity threshold, may be counted, and the identity verification result of the comparison speech and the sample speech then determined based on that number.
In the scheme of the application, as described above, syllable matching is performed based on the speech recognition results of the comparison speech and the sample speech to determine a plurality of same syllables; the voiceprint similarity between the first and second speech segments corresponding to each same syllable is calculated from their speech features; and the identity verification result is determined from the calculated voiceprint similarities. Because the plurality of same syllables include syllables of at least two syllable types, and syllables of different types differ in stability and in the granularity of the features they express, combining same syllables of at least two syllable types in the voiceprint similarity calculation and result determination can improve the accuracy and validity of the identity verification result.
In some embodiments, the speech features include speech feature curves and speech feature parameters; in this embodiment, as shown in fig. 4, step 120 includes:
step 410, determining the similarity of the characteristic curves according to the voice characteristic curve of the first voice segment corresponding to the same syllable and the voice characteristic curve of the second voice segment corresponding to the same syllable.
The speech feature curve may be one or more of the formant trend, fundamental frequency trajectory, LPC spectrum, and the like listed above. For each speech feature curve, the feature curve similarity between the curve of the first speech segment corresponding to the same syllable and that of the corresponding second speech segment is calculated.
It is understood that, when there are a plurality of voice characteristic curves, the calculated similarity of the characteristic curves is also a plurality.
Step 420, determining a feature parameter deviation according to the voice feature parameters of the first voice segment corresponding to the same syllable and the voice feature parameters of the second voice segment corresponding to the same syllable.
The speech characteristic parameters may be the center frequency, bandwidth, intensity, etc. of the formants listed above, and are not particularly limited herein.
The feature parameter deviation may be obtained by subtracting, for each speech feature parameter, the speech feature parameter of the first speech segment corresponding to the same syllable from the speech feature parameter of the second speech segment corresponding to the same syllable.
It is understood that, when the speech feature parameter includes a plurality of parameters, the calculated feature parameter deviation also corresponds to a plurality of parameters.
Step 430, determining the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable according to the feature curve similarity and the feature parameter deviation.
In some embodiments, the feature curve similarity and the feature parameter deviation may be weighted, and the result of the weighted calculation is used as the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable. The weighting weight corresponding to the similarity of each characteristic curve and the deviation of each characteristic parameter may be set according to actual needs, and is not specifically limited herein.
Through the above process, the voiceprint similarity between the first and second speech segments corresponding to a same syllable is calculated comprehensively by combining the speech feature curves and the speech feature parameters of the segments; since speech features in multiple dimensions are combined, the accuracy of the calculated voiceprint similarity can be ensured.
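One way the weighting in step 430 might look, assuming (as this example does) that curve similarities raise the score and parameter deviations lower it; the application leaves the weights and the exact combination to the implementer.

```python
def voiceprint_similarity(curve_sims, param_devs, curve_weights, dev_weights):
    """Combine feature curve similarities and feature parameter deviations
    into one voiceprint similarity score (illustrative only)."""
    score = sum(w * s for w, s in zip(curve_weights, curve_sims))
    score -= sum(w * abs(d) for w, d in zip(dev_weights, param_devs))
    return score
```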
In some embodiments, the segment spectrogram of the first speech segment corresponding to a same syllable and that of the corresponding second speech segment can be displayed side by side in a display interface, so that a user can visually compare the similarity of the same syllable in the comparison speech and the sample speech from the displayed segment spectrograms.
FIG. 5A illustrates the segment spectrograms of the word "which" in two speeches according to one embodiment: the left side of fig. 5A shows the segment spectrogram of the word in audio 1, and the right side shows it in audio 2.
FIG. 5B illustrates the segment spectrograms of the word "yes" in two speeches according to one embodiment: the left side of fig. 5B shows the segment spectrogram of the word in audio 1, and the right side shows it in audio 2.
FIG. 5C illustrates the segment spectrograms of the phoneme "e4" in two speeches according to one embodiment: the left side of fig. 5C shows the segment spectrogram of the phoneme "e4" in "this" in audio 1, and the right side shows it in audio 2.
In some embodiments, as shown in fig. 6, step 120, comprises:
step 610, determining a first speech feature vector of the first speech segment corresponding to the same syllable according to the speech feature of the first speech segment corresponding to the same syllable.
Step 620, determining a second speech feature vector of the second speech segment corresponding to the same syllable according to the speech feature of the second speech segment corresponding to the same syllable.
In this embodiment, the speech features of the first speech segment corresponding to the same syllable and the speech features of the second speech segment corresponding to the same syllable are vectorized respectively to obtain a first speech feature vector of the first speech segment corresponding to the same syllable and a second speech feature vector of the second speech segment corresponding to the same syllable.
Step 630, calculating to obtain the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable according to the first speech feature vector of the first speech segment corresponding to the same syllable and the second speech feature vector of the second speech segment corresponding to the same syllable.
In some embodiments, the distance between the first speech feature vector and the second speech feature vector may be calculated, and the voiceprint similarity between the first speech segment corresponding to the same syllable and the corresponding second speech segment determined from that distance. The distance may be a Euclidean distance or the like, and is not particularly limited. It will be appreciated that the greater the calculated distance, the lower the voiceprint similarity between the two segments.
In some embodiments, a similarity between the first speech feature vector and the second speech feature vector may be calculated directly as the voiceprint similarity between the first and second speech segments corresponding to the same syllable; the similarity measure may be, for example, cosine similarity, relative entropy, or the Jaccard similarity coefficient, and is not particularly limited here.
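A minimal sketch of step 630 using cosine similarity; any of the other measures listed above (Euclidean distance, relative entropy, Jaccard coefficient) could be substituted.

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    # Cosine of the angle between the two speech feature vectors; values
    # closer to 1 indicate more similar voiceprints.
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```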
In this embodiment, the voice features of the first voice segment corresponding to the same syllable and the voice features of the second voice segment corresponding to the same syllable are vectorized, so that the calculation of the voiceprint similarity is simplified.
In some embodiments, as shown in fig. 7, step 120, comprises:
Step 121, selecting one same syllable from the plurality of same syllables as the candidate same syllable, in order of syllable length from long to short.
As described above, the length of a word-type syllable is greater than that of a character-type syllable, which in turn is greater than that of a phoneme-type syllable; the longer a syllable, the longer the duration of the corresponding audio segment and, correspondingly, the more speech features it expresses. Therefore, in this embodiment, candidate same syllables are selected from the plurality of same syllables in order of syllable length from long to short for the voiceprint similarity calculation; since the features expressed are richer, the accuracy of the calculated voiceprint similarity can be ensured.
Step 122, calculating the voiceprint similarity between the first speech segment corresponding to the candidate same syllable and the second speech segment corresponding to the candidate same syllable. The specific calculation process of the voiceprint similarity is described above, and is not described herein again.
In this embodiment, the step 130 includes:
Step 131, if the voiceprint similarity between the first speech segment corresponding to the candidate same syllable and the corresponding second speech segment exceeds a similarity threshold, the count is incremented; otherwise, if the voiceprint similarity does not exceed the similarity threshold, the count is not incremented. In this way the number of same syllables whose voiceprint similarity exceeds the similarity threshold is counted.
The similarity threshold may be set as needed, for example, 95%, 96%, 98%, etc., and is not particularly limited herein.
Step 132, determining whether the cumulative count reaches a target number. If it does, step 133 is executed: the identity verification result is determined to be a result indicating that the comparison speech and the sample speech come from the same person. If it does not, the process returns to step 121 to select the next candidate same syllable and calculate its voiceprint similarity.
In some embodiments, once the cumulative count is determined in step 132 to have reached the target number, no further candidate same syllables need be selected from the plurality of same syllables for voiceprint similarity calculation. A sketch of the whole loop follows.
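Steps 121 through 133 can be read as the following loop; the threshold and target count are illustrative values, and similarity_of stands in for whichever voiceprint similarity calculation from the embodiments above is used.

```python
def comes_from_same_person(same_syllables, similarity_of,
                           threshold=0.95, target=5):
    """Walk the same syllables from longest to shortest, counting those whose
    voiceprint similarity passes the threshold; stop early once the target
    count is reached (step 133)."""
    count = 0
    for syllable in sorted(same_syllables, key=len, reverse=True):
        if similarity_of(syllable) > threshold:
            count += 1
            if count >= target:
                return True   # first verification result: same person
    return False              # target never reached: not the same person
```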
In some embodiments, the identity verification result may include a result indicating that the comparison speech and the sample speech come from the same person (referred to as the first verification result for convenience) and a result indicating that they do not (the second verification result). Thus, if the cumulative count obtained from the voiceprint similarity calculation over all the same syllables does not reach the target number, the identity verification result is determined to be the second verification result.
In some embodiments, the identity verification result may further include several results besides the first and second verification results, and a number range may be set for each result; the result corresponding to the number range in which the final cumulative count falls is then taken as the identity verification result of the comparison speech and the sample speech.
For example, suppose the identity verification result includes first through fifth verification results, where the third indicates a higher probability that the two speeches come from the same person, the fourth indicates a lower probability, and the fifth indicates that it cannot be determined whether they come from the same person. The first result can be set to correspond to a first number range, the third to a second number range, the fifth to a third number range, the fourth to a fourth number range, and the second to a fifth number range. Thus, if the final cumulative count falls within the fifth number range, the identity verification result is determined to be the second verification result. It is understood that the counts in the first number range are greater than those in the second number range, which are greater than those in the third, then the fourth, then the fifth.
In other embodiments, as shown in FIG. 8, step 130, comprises:
step 810, determining a target same syllable with the corresponding voiceprint similarity exceeding a similarity threshold according to the voiceprint similarity between the first speech segment corresponding to each same syllable and the second speech segment corresponding to each same syllable.
The target identical syllable is the identical syllable whose corresponding voiceprint similarity exceeds the similarity threshold.
Step 820, determining the identity test result of the comparison voice and the sample voice according to the number of the target same syllables.
In some embodiments, after calculating the voiceprint similarity between the first speech segment and the second speech segment corresponding to all the same syllables, the target same syllables may be determined correspondingly, and the number of the target same syllables may be counted.
In some embodiments, a number range may be set for each type of the identity check result, so that, after the number of the target identical syllables is determined, the identity check result corresponding to the number range to which the number of the target identical syllables belongs is used as the identity check result of the comparison speech and the sample speech.
In other embodiments, as shown in FIG. 9, step 130, comprises:
step 910, determining a target same syllable whose corresponding voiceprint similarity exceeds a similarity threshold according to the voiceprint similarity between the first speech segment corresponding to each same syllable and the second speech segment corresponding to each same syllable.
Step 920, according to the syllable type to which the target identical syllable belongs, counting the number of the target identical syllables belonging to each syllable type.
Step 930, calculating the probability that the comparison speech and the sample speech come from the same person according to the first weight corresponding to each syllable type and the number of the target same syllables belonging to each syllable type.
In some embodiments, a weighting calculation may be performed according to the first weight corresponding to each syllable type and the number of target identical syllables belonging to each syllable type, and the weighting calculation result is used as the probability that the comparison speech and the sample speech come from the same person.
For example, if the first weight corresponding to the word type is A1, the first weight corresponding to the character type is A2, and the first weight corresponding to the phoneme type is A3, and the numbers of target same syllables determined statistically to belong to the word type, the character type, and the phoneme type are B1, B2, and B3 respectively, then the result of A1 × B1 + A2 × B2 + A3 × B3 can be taken as the probability that the comparison speech and the sample speech come from the same person. The foregoing is, of course, merely exemplary and is not to be construed as limiting the scope of the invention.
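The worked example above, written out as code; the weight values and counts are purely illustrative, and the application does not specify how (or whether) the weighted sum is normalized into a true probability.

```python
def same_person_probability(weights, counts):
    # Weighted count of target same syllables per syllable type:
    # A1*B1 + A2*B2 + A3*B3 in the notation above.
    return sum(weights[t] * counts[t] for t in weights)

p = same_person_probability(
    weights={"word": 0.10, "character": 0.05, "phoneme": 0.02},  # A1 > A2 > A3
    counts={"word": 3, "character": 5, "phoneme": 8},            # B1, B2, B3
)
```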
In this embodiment, the first weight reflects the degree to which same syllables of the corresponding syllable type contribute to the identity verification result; as described above, since word-type syllables express more features, the first weight corresponding to the word type > the first weight corresponding to the character type > the first weight corresponding to the phoneme type.
And 940, determining the identity test result of the comparison voice and the sample voice according to the probability that the comparison voice and the sample voice come from the same person.
In some embodiments, a probability range (referred to as the first probability range for ease of distinction) may be set for each type of identity verification result, so that, given the determined probability that the comparison speech and the sample speech come from the same person, the verification result corresponding to the first probability range in which that probability falls is taken as the identity verification result of the comparison speech and the sample speech.
In this embodiment, the probability that the comparison speech and the sample speech come from the same person is calculated by combining the first weight corresponding to each syllable type with the number of target same syllables belonging to that type. The contribution of syllables of different types to the verification result and the number of target same syllables of each type are thus considered together, which ensures the reasonableness and accuracy of the determined probability, and in turn the accuracy of the identity verification result determined from it.
In other embodiments, step 110 includes: performing syllable matching on the speech recognition result of the comparison speech and the speech recognition result of the sample speech under at least two granularities, and determining the same syllables of the comparison speech relative to the sample speech under each granularity, the granularities including word granularity, character granularity, and phoneme granularity. Under each granularity, the same syllables correspond to the syllable type determined by that granularity: if syllable matching is performed under word granularity, the determined same syllables are all words; under character granularity, all characters; and under phoneme granularity, all phonemes.
In this embodiment, since syllable matching is performed at least two granularities, the obtained syllable matching result includes a syllable matching result corresponding to each of the at least two granularities.
In this embodiment, as shown in fig. 10, step 130 includes:
step 1010, based on the same syllable at each granularity, counting the number of target same syllables corresponding to which the voiceprint similarity exceeds the similarity threshold.
Step 1020, calculating a reference probability that the comparison speech and the sample speech come from the same person at each granularity according to the number of target identical syllables of which the corresponding voiceprint similarity exceeds a similarity threshold at each granularity.
In some embodiments, the mapping relationship between the number of target identical syllables and the reference probability for each granularity may be preset, so that, after determining the number of target identical syllables for each granularity, the reference probability mapped by the number of target identical syllables for the granularity is used as the reference probability that the comparison speech and the sample speech come from the same person for the granularity.
Under different granularities, the mapping relation between the number of the same syllables of the set target and the reference probability can be the same or different, and the mapping relation can be specifically set according to actual needs.
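One possible shape for this mapping, using a per-granularity lookup table; the application leaves the actual numbers to the implementer, so everything below is an assumption.

```python
# Hypothetical count-to-probability tables, one per granularity.
REFERENCE_PROB = {
    "word":    {0: 0.05, 1: 0.30, 2: 0.60, 3: 0.85},
    "phoneme": {0: 0.05, 2: 0.30, 4: 0.60, 8: 0.85},
}

def reference_probability(granularity: str, count: int) -> float:
    table = REFERENCE_PROB[granularity]
    # Use the largest tabulated count that does not exceed the observed count.
    key = max((k for k in table if k <= count), default=min(table))
    return table[key]
```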
Step 1030, determining a target probability that the comparison voice and the sample voice come from the same person according to the reference probability that the comparison voice and the sample voice come from the same person at each granularity and the second weight corresponding to each granularity.
In some embodiments, the reference probabilities at least two granularities may be weighted according to the second weight corresponding to each granularity, and the result of the weighted calculation is used as the target probability that the comparison speech and the sample speech are from the same person.
For example, if the at least two granularities include word granularity and phoneme granularity, the second weight corresponding to word granularity is C1, the second weight corresponding to phoneme granularity is C2, the reference probability under word granularity is P1, and the reference probability under phoneme granularity is P2, then the target probability may be C1 × P1 + C2 × P2. It should be noted that the above is merely exemplary and should not be considered as limiting the scope of the application.
Step 1040, determining the identity test result of the comparison voice and the sample voice according to the target probability that the comparison voice and the sample voice come from the same person.
In some embodiments, a probability range (referred to as a second probability range for easy distinction) corresponding to each type of the identity check result may be set, so that, after the target probability is determined, the identity check result corresponding to the second probability range in which the target probability is located is used as the identity check result of the comparison speech and the sample speech.
In this embodiment, the target probability is calculated by combining the second weight corresponding to each granularity with the reference probability calculated under that granularity. The contribution of the syllable matching results under different granularities to the verification result and the numbers of target same syllables under different granularities are thus considered together, which ensures the reasonableness and accuracy of the determined probability that the comparison speech and the sample speech come from the same person, and in turn the accuracy of the identity verification result determined from it.
Embodiments of the apparatus of the present application are described below, which may be used to perform the methods of the above-described embodiments of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the above-described embodiments of the method of the present application.
Fig. 11 is a block diagram of a voice identity verification device according to an embodiment. As shown in fig. 11, the device includes: a syllable matching module 1110, configured to perform syllable matching on the speech recognition result of a comparison speech and the speech recognition result of a sample speech, and determine a plurality of same syllables of the comparison speech relative to the sample speech, the plurality of same syllables including syllables of at least two syllable types, the syllable types including a word type, a character type, and a phoneme type; a voiceprint similarity calculation module 1120, configured to calculate the voiceprint similarity between the first speech segment corresponding to each same syllable and the corresponding second speech segment according to the speech features of the two segments, the first speech segment being the speech segment corresponding to the same syllable in the comparison speech and the second speech segment being the speech segment corresponding to the same syllable in the sample speech; and an identity verification result determining module 1130, configured to determine the identity verification result of the comparison speech and the sample speech according to the voiceprint similarities between the first speech segments corresponding to the same syllables and the corresponding second speech segments.
In some embodiments, the apparatus for verifying speech identity further comprises: the first voice feature acquisition module is used for acquiring the voice features of the first voice section corresponding to the same syllable; and the second voice characteristic acquisition module is used for acquiring the voice characteristics of the second voice section corresponding to the same syllable.
In some embodiments, the speech recognition result of the comparison speech indicates time position information of each syllable included in the comparison speech; in this embodiment, the first voice feature obtaining module includes: the segment extraction unit is used for extracting segments in the comparison voice according to the time position information of the same syllable in the comparison voice to obtain a first voice segment corresponding to the same syllable; and the first extraction unit is used for extracting the voice characteristics of the first voice section to obtain the voice characteristics of the first voice section.
In other embodiments, the speech recognition result of the comparison speech indicates the time position information of each syllable included in the comparison speech; in this embodiment, the first voice feature obtaining module includes: the segmented spectrogram determining unit is used for determining a segmented spectrogram corresponding to the first voice segment in the spectrogram of the comparison voice according to the time position information of the same syllable in the comparison voice; and the voice feature acquisition unit is used for acquiring the voice features of the first voice section from the segmented spectrogram corresponding to the first voice section.
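Both acquisition routes, slicing the raw comparison waveform by the syllable's time position information (preceding embodiment) and slicing a precomputed spectrogram (this embodiment), can be sketched in a few lines; the sample rate, frame hop, and array shapes below are assumed.

```python
# Illustrative sketches of both acquisition routes: cutting the first speech
# segment out of the raw comparison waveform, or taking the matching frames
# of a precomputed spectrogram. Sample rate and frame hop are assumptions.

import numpy as np

SAMPLE_RATE = 16000   # Hz, assumed
HOP_SECONDS = 0.010   # spectrogram frame hop in seconds, assumed

def extract_waveform_segment(speech: np.ndarray, start_s: float, end_s: float) -> np.ndarray:
    """Route 1: slice the waveform by the syllable's time position information."""
    return speech[round(start_s * SAMPLE_RATE):round(end_s * SAMPLE_RATE)]

def extract_spectrogram_segment(spec: np.ndarray, start_s: float, end_s: float) -> np.ndarray:
    """Route 2: slice frames of an existing (freq_bins, frames) spectrogram."""
    return spec[:, round(start_s / HOP_SECONDS):round(end_s / HOP_SECONDS)]

speech = np.random.randn(3 * SAMPLE_RATE)       # 3 s of stand-in comparison speech
spec = np.abs(np.random.randn(257, 300))        # stand-in 257-bin, 300-frame spectrogram
print(extract_waveform_segment(speech, 0.42, 0.75).shape)   # (5280,)
print(extract_spectrogram_segment(spec, 0.42, 0.75).shape)  # (257, 33)
```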
In some embodiments, the speech features include speech feature curves and speech feature parameters; the voiceprint similarity calculation module 1120 includes: a characteristic curve similarity determining unit, configured to determine a similarity of characteristic curves according to a voice characteristic curve of a first voice segment corresponding to the same syllable and a voice characteristic curve of a second voice segment corresponding to the same syllable; the characteristic parameter deviation calculating unit is used for determining the characteristic parameter deviation according to the voice characteristic parameters of the first voice section corresponding to the same syllable and the voice characteristic parameters of the second voice section corresponding to the same syllable; and the first voiceprint similarity determining unit is used for determining the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable according to the characteristic curve similarity and the characteristic parameter deviation.
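One hedged realization of this curve-plus-parameter scoring follows; Pearson correlation for the feature curves, mean relative deviation for the scalar parameters, and the blending weight `alpha` are assumed choices, since the embodiment does not fix the metrics.

```python
# A hedged realization of the curve-plus-parameter route: correlation of the
# resampled feature curves, relative deviation of scalar parameters, and a
# weighted blend. The metrics and `alpha` are assumptions.

import numpy as np

def curve_similarity(curve_a: np.ndarray, curve_b: np.ndarray) -> float:
    """Resample both curves to a common length, then map Pearson correlation
    from [-1, 1] into [0, 1]."""
    n = min(len(curve_a), len(curve_b))
    a = np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(curve_a)), curve_a)
    b = np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(curve_b)), curve_b)
    return (np.corrcoef(a, b)[0, 1] + 1.0) / 2.0

def parameter_deviation(params_a: np.ndarray, params_b: np.ndarray) -> float:
    """Mean relative deviation of scalar feature parameters (e.g. formants)."""
    return float(np.mean(np.abs(params_a - params_b) / (np.abs(params_b) + 1e-8)))

def voiceprint_similarity(curve_a, curve_b, params_a, params_b, alpha: float = 0.6) -> float:
    """Blend curve similarity with (1 - parameter deviation)."""
    return alpha * curve_similarity(curve_a, curve_b) + \
           (1.0 - alpha) * max(0.0, 1.0 - parameter_deviation(params_a, params_b))

f0_first = np.array([180.0, 185.0, 192.0, 200.0, 196.0])   # assumed pitch curve
f0_second = np.array([182.0, 186.0, 190.0, 198.0, 197.0])
formants_first = np.array([690.0, 1220.0, 2600.0])          # assumed formants
formants_second = np.array([700.0, 1190.0, 2650.0])
print(voiceprint_similarity(f0_first, f0_second, formants_first, formants_second))
```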
In other embodiments, the speech feature comprises at least two speech feature parameters; the voiceprint similarity calculation module 1120 includes: a first speech feature vector determining unit, configured to determine a first speech feature vector of a first speech segment corresponding to the same syllable according to speech features of the first speech segment corresponding to the same syllable; a second speech feature vector determining unit, configured to determine a second speech feature vector of a second speech segment corresponding to the same syllable according to speech features of the second speech segment corresponding to the same syllable; and the second voiceprint similarity determining unit is used for calculating the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable according to the first voice feature vector of the first voice section corresponding to the same syllable and the second voice feature vector of the second voice section corresponding to the same syllable.
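For the vector route, cosine similarity is a natural, though assumed, scoring function; the pitch and formant values below are made up for illustration.

```python
# Sketch of the vector route: stack each segment's feature parameters into a
# vector and score the pair with cosine similarity, an assumed metric.

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

first_vector = np.array([210.0, 690.0, 1220.0, 2600.0])    # assumed pitch + formants
second_vector = np.array([205.0, 700.0, 1190.0, 2650.0])
print(f"voiceprint similarity: {cosine_similarity(first_vector, second_vector):.4f}")
```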
In some embodiments, the voiceprint similarity calculation module 1120 includes: a candidate same syllable selecting unit, configured to select one same syllable from the plurality of same syllables as a candidate same syllable in order of syllable length from long to short; and a calculating unit, configured to calculate the voiceprint similarity between the first speech segment corresponding to the candidate same syllable and the second speech segment corresponding to the candidate same syllable. In this embodiment, the identity test result determining module 1130 includes: a counting unit, configured to increment an accumulated number if the voiceprint similarity between the first speech segment corresponding to the candidate same syllable and the second speech segment corresponding to the candidate same syllable exceeds a similarity threshold; a judging unit, configured to judge whether the accumulated number reaches a target number; and a first result determining unit, configured to determine, if the target number is reached, the identity test result as a test result indicating that the comparison speech and the sample speech are from the same person, and, if the target number is not reached, to return to the step of selecting one same syllable from the plurality of same syllables as a candidate same syllable in order of syllable length from long to short.
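The longest-first, early-stopping logic of this embodiment can be sketched as follows; `score` stands in for whichever voiceprint-similarity routine is in use, and the similarity threshold and target number are assumed values.

```python
# Sketch of the longest-first early-stopping loop described above. `score`
# is a stand-in scoring function; threshold and target number are assumed.

def verify_longest_first(same_syllables, score,
                         similarity_threshold: float = 0.8, target_number: int = 5) -> bool:
    """same_syllables: list of (syllable, first_segment, second_segment) tuples."""
    accumulated = 0
    # Longer syllables carry more speaker information, so they are tested
    # first, and the loop stops once enough of them clear the threshold.
    for syllable, first_seg, second_seg in sorted(
            same_syllables, key=lambda item: len(item[0]), reverse=True):
        if score(first_seg, second_seg) > similarity_threshold:
            accumulated += 1
            if accumulated >= target_number:
                return True        # same person; remaining syllables never scored
    return False                   # the target number was never reached
```

Front-loading the longest syllables means the most speaker-discriminative segments are scored first, which is what makes stopping early safe.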
In some embodiments, the identity test result determining module 1130 includes: a target same syllable determining unit, configured to determine the target same syllables whose voiceprint similarity exceeds a similarity threshold according to the voiceprint similarity between the first speech segment corresponding to each same syllable and the second speech segment corresponding to each same syllable; and a second result determining unit, configured to determine the identity test result of the comparison speech and the sample speech according to the number of the target same syllables.
In other embodiments, the identity test result determining module 1130 includes: a target same syllable determining unit, configured to determine the target same syllables whose voiceprint similarity exceeds a similarity threshold according to the voiceprint similarity between the first speech segment corresponding to each same syllable and the second speech segment corresponding to each same syllable; a first number counting unit, configured to count the number of target same syllables belonging to each syllable type according to the syllable type to which each target same syllable belongs; a probability calculating unit, configured to calculate the probability that the comparison speech and the sample speech are from the same person according to the first weight corresponding to each syllable type and the number of target same syllables belonging to each syllable type; and a third result determining unit, configured to determine the identity test result of the comparison speech and the sample speech according to the probability that the comparison speech and the sample speech are from the same person.
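A hypothetical rendering of this first-weight fusion: count the target same syllables per syllable type, then take a weighted combination; the weights and the saturating constant are illustrative assumptions.

```python
# Hypothetical first-weight fusion across syllable types: count target same
# syllables per type, then weight each type's saturating score. The weights
# and `scale` constant are assumptions.

from collections import Counter

def same_person_probability(target_same_syllables, first_weights: dict,
                            scale: float = 8.0) -> float:
    """target_same_syllables: (syllable, syllable_type) pairs whose voiceprint
    similarity already exceeded the similarity threshold."""
    counts = Counter(syllable_type for _, syllable_type in target_same_syllables)
    total_weight = sum(first_weights.values())
    # Each type contributes a saturating score n / (n + scale), weighted by
    # that type's first weight.
    return sum(
        w * (counts[t] / (counts[t] + scale)) for t, w in first_weights.items()
    ) / total_weight

matches = [("nihao", "word"), ("ni", "character"), ("hao", "character"), ("n", "phoneme")]
first_weights = {"word": 0.5, "character": 0.3, "phoneme": 0.2}   # assumed
print(f"P(same person) ≈ {same_person_probability(matches, first_weights):.3f}")
```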
In some embodiments, the syllable matching module 1110 is further configured to perform syllable matching on the speech recognition result of the comparison speech and the speech recognition result of the sample speech at each of at least two granularities, and determine the same syllables of the comparison speech with respect to the sample speech at each granularity, the granularities including word granularity, character granularity, and phoneme granularity. In this embodiment, the identity test result determining module 1130 includes: a first number determining unit, configured to count, based on the same syllables at each granularity, the number of target same syllables whose corresponding voiceprint similarity exceeds a similarity threshold; a reference probability calculating unit, configured to calculate, according to the number of target same syllables whose corresponding voiceprint similarity exceeds the similarity threshold at each granularity, a reference probability that the comparison speech and the sample speech are from the same person at that granularity; a target probability calculating unit, configured to determine a target probability that the comparison speech and the sample speech are from the same person according to the reference probability at each granularity and a second weight corresponding to each granularity; and a fourth result determining unit, configured to determine the identity test result of the comparison speech and the sample speech according to the target probability that the comparison speech and the sample speech are from the same person.
Fig. 12 shows a schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of the present application. It should be noted that the computer system 1200 of the electronic device shown in Fig. 12 is only an example, and should not impose any limitation on the functions or the scope of use of the embodiments of the present application.
As shown in Fig. 12, the computer system 1200 includes a Central Processing Unit (CPU) 1201, which can perform various appropriate actions and processes, such as executing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. The RAM 1203 also stores various programs and data necessary for system operation. The CPU 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An Input/Output (I/O) interface 1205 is also connected to the bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a display device such as a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as necessary, so that a computer program read therefrom is installed into the storage section 1208 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211. When the computer program is executed by the Central Processing Unit (CPU) 1201, it performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable storage medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable storage medium carries computer readable instructions which, when executed by a processor, implement the method of any of the embodiments described above.
According to an aspect of the present application, there is also provided an electronic device, including: a processor; a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method of any of the above embodiments.
According to an aspect of an embodiment of the present application, there is provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of any of the above embodiments.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (13)

1. A method for testing speech identity, comprising:
performing syllable matching on a voice recognition result of a comparison voice and a voice recognition result of a sample voice, and determining a plurality of same syllables of the comparison voice relative to the sample voice; the plurality of same syllables including syllables of at least two syllable types, the syllable types including a character type, a word type, and a phoneme type;
calculating the voiceprint similarity between the first voice section corresponding to the same syllable and the corresponding second voice section according to the voice characteristics of the first voice section corresponding to the same syllable and the voice characteristics of the corresponding second voice section; the first voice segment is a voice segment corresponding to the same syllable in the comparison voice; the second speech segment is a speech segment corresponding to the same syllable in the sample speech;
and determining the identity test result of the comparison voice and the sample voice according to the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable.
2. A method according to claim 1, wherein before calculating the voiceprint similarity between a first speech segment corresponding to the same syllable and a corresponding second speech segment according to the speech characteristics of the first speech segment corresponding to the same syllable and the speech characteristics of the corresponding second speech segment, the method further comprises:
acquiring the voice characteristics of the first voice section corresponding to the same syllable;
and acquiring the voice characteristics of the second voice section corresponding to the same syllable.
3. The method according to claim 2, wherein the speech recognition result of the comparison speech indicates time position information of each syllable included in the comparison speech;
the obtaining of the voice features of the first voice segment corresponding to the same syllable includes:
according to the time position information of the same syllable in the comparison voice, segment extraction is carried out in the comparison voice to obtain a first voice section corresponding to the same syllable;
and performing voice feature extraction on the first voice section to obtain the voice feature of the first voice section.
4. The method according to claim 2, wherein the speech recognition result of the comparison speech indicates time position information of each syllable included in the comparison speech;
the obtaining of the voice features of the first voice segment corresponding to the same syllable includes:
determining a segmented spectrogram corresponding to the first voice segment in the spectrogram of the comparison voice according to the time position information of the same syllable in the comparison voice;
and acquiring the voice features of the first voice section from the segmented spectrogram corresponding to the first voice section.
5. The method of claim 1, wherein the speech features comprise speech feature curves and speech feature parameters;
the calculating the voiceprint similarity between the first speech segment corresponding to the same syllable and the corresponding second speech segment according to the speech features of the first speech segment corresponding to the same syllable and the speech features of the corresponding second speech segment includes:
determining the similarity of the characteristic curves according to the voice characteristic curve of the first voice section corresponding to the same syllable and the voice characteristic curve of the second voice section corresponding to the same syllable;
determining the characteristic parameter deviation according to the voice characteristic parameters of the first voice section corresponding to the same syllable and the voice characteristic parameters of the second voice section corresponding to the same syllable;
and determining the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable according to the characteristic curve similarity and the characteristic parameter deviation.
6. The method according to claim 1, wherein said calculating the voiceprint similarity between the first speech segment corresponding to the same syllable and the corresponding second speech segment according to the speech features of the first speech segment corresponding to the same syllable and the speech features of the corresponding second speech segment comprises:
determining a first voice feature vector of a first voice segment corresponding to the same syllable according to the voice feature of the first voice segment corresponding to the same syllable;
determining a second voice characteristic vector of a second voice segment corresponding to the same syllable according to the voice characteristics of the second voice segment corresponding to the same syllable;
and calculating the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable according to the first voice characteristic vector of the first voice section corresponding to the same syllable and the second voice characteristic vector of the second voice section corresponding to the same syllable.
7. The method according to claim 1, wherein said calculating the voiceprint similarity between the first speech segment corresponding to the same syllable and the corresponding second speech segment according to the speech features of the first speech segment corresponding to the same syllable and the speech features of the corresponding second speech segment comprises:
selecting one same syllable from the plurality of same syllables as a candidate same syllable according to the sequence of syllables from long to short;
calculating the voiceprint similarity between the first voice section corresponding to the candidate same syllable and the second voice section corresponding to the candidate same syllable;
the determining the identity test result of the comparison voice and the sample voice according to the voiceprint similarity between the first voice segment corresponding to the same syllable and the second voice segment corresponding to the same syllable comprises:
if the voiceprint similarity between the first voice section corresponding to the candidate same syllable and the second voice section corresponding to the candidate same syllable exceeds a similarity threshold, incrementing an accumulated number;
judging whether the accumulated number reaches a target number or not;
if the target number is reached, determining the identity test result as a test result indicating that the comparison voice and the sample voice are from the same person;
and if the target number is not reached, returning to the step of selecting one same syllable from the plurality of same syllables as a candidate same syllable according to the sequence of the syllables from long to short.
8. A method according to claim 1, wherein said determining the result of the identity check between the comparison speech and the sample speech according to the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable comprises:
determining a target same syllable of which the corresponding voiceprint similarity exceeds a similarity threshold according to the voiceprint similarity between the first voice section corresponding to each same syllable and the second voice section corresponding to each same syllable;
and determining the identity test result of the comparison voice and the sample voice according to the number of the target same syllables.
9. A method according to claim 1, wherein said determining the result of the identity check between the comparison speech and the sample speech according to the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable comprises:
determining a target same syllable of which the corresponding voiceprint similarity exceeds a similarity threshold according to the voiceprint similarity between the first voice section corresponding to each same syllable and the second voice section corresponding to each same syllable;
counting the number of the target identical syllables belonging to each syllable type according to the syllable type to which the target identical syllables belong;
calculating the probability that the comparison voice and the sample voice come from the same person according to the first weight corresponding to each syllable type and the number of the target same syllables belonging to each syllable type;
and determining the identity test result of the comparison voice and the sample voice according to the probability that the comparison voice and the sample voice come from the same person.
10. The method of claim 1, wherein syllable matching the speech recognition result of the comparison speech with the speech recognition result of the sample speech to determine a plurality of identical syllables of the comparison speech with respect to the sample speech comprises:
performing syllable matching on a voice recognition result of a comparison voice and a voice recognition result of a sample voice under at least two granularities, and determining the same syllables of the comparison voice relative to the sample voice under each granularity, wherein the granularities comprise word granularity, character granularity, and phoneme granularity;
the determining the identity test result of the comparison voice and the sample voice according to the voiceprint similarity between the first voice segment corresponding to the same syllable and the second voice segment corresponding to the same syllable comprises:
based on the same syllables under each granularity, counting the number of target same syllables of which the corresponding voiceprint similarity exceeds a similarity threshold;
calculating reference probabilities that the comparison speech and the sample speech come from the same person at each granularity according to the number of target same syllables of which the corresponding voiceprint similarity exceeds a similarity threshold at each granularity;
determining a target probability that the comparison voice and the sample voice come from the same person according to the reference probability that the comparison voice and the sample voice come from the same person at each granularity and a second weight corresponding to each granularity;
and determining the identity test result of the comparison voice and the sample voice according to the target probability that the comparison voice and the sample voice come from the same person.
11. A device for verifying speech identity, comprising:
the syllable matching module is used for carrying out syllable matching on the voice recognition result of the comparison voice and the voice recognition result of the sample voice and determining a plurality of same syllables of the comparison voice relative to the sample voice; the plurality of same syllables including syllables of at least two syllable types, the syllable types including a character type, a word type, and a phoneme type;
the voiceprint similarity calculation module is used for calculating the voiceprint similarity between the first voice section corresponding to the same syllable and the corresponding second voice section according to the voice characteristics of the first voice section corresponding to the same syllable and the voice characteristics of the corresponding second voice section; the first voice segment is a voice segment corresponding to the same syllable in the comparison voice; the second speech segment is a speech segment corresponding to the same syllable in the sample speech;
and the identity test result determining module is used for determining the identity test result of the comparison voice and the sample voice according to the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable.
12. An electronic device, comprising:
a processor;
a memory having computer-readable instructions stored thereon which, when executed by the processor, implement the method of any one of claims 1-10.
13. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-10.
CN202111524105.3A 2021-12-14 2021-12-14 Voice identity detection method and device, electronic equipment and storage medium Pending CN113921017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111524105.3A CN113921017A (en) 2021-12-14 2021-12-14 Voice identity detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111524105.3A CN113921017A (en) 2021-12-14 2021-12-14 Voice identity detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113921017A true CN113921017A (en) 2022-01-11

Family

ID=79249176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111524105.3A Pending CN113921017A (en) 2021-12-14 2021-12-14 Voice identity detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113921017A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114255764A (en) * 2022-02-28 2022-03-29 深圳市声扬科技有限公司 Audio information processing method and device, electronic equipment and storage medium
CN117133271A (en) * 2023-10-25 2023-11-28 北京吉道尔科技有限公司 Block chain-based e-commerce platform shopping and intelligent voice evaluation method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101887722A (en) * 2009-06-18 2010-11-17 博石金(北京)信息技术有限公司 Rapid voiceprint authentication method
CN107680601A (en) * 2017-10-18 2018-02-09 深圳势必可赢科技有限公司 A kind of identity homogeneity method of inspection retrieved based on sound spectrograph and phoneme and device
CN108492830A (en) * 2018-03-28 2018-09-04 深圳市声扬科技有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
CN108766417A (en) * 2018-05-29 2018-11-06 广州国音科技有限公司 A kind of the identity homogeneity method of inspection and device based on phoneme automatically retrieval
CN110875044A (en) * 2018-08-30 2020-03-10 中国科学院声学研究所 Speaker identification method based on word correlation score calculation
CN111341300A (en) * 2020-02-28 2020-06-26 广州国音智能科技有限公司 Method, device and equipment for acquiring voice comparison phonemes
CN111429921A (en) * 2020-03-02 2020-07-17 厦门快商通科技股份有限公司 Voiceprint recognition method, system, mobile terminal and storage medium


Similar Documents

Publication Publication Date Title
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN111429946A (en) Voice emotion recognition method, device, medium and electronic equipment
CN107665705B (en) Voice keyword recognition method, device, equipment and computer readable storage medium
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
JP4568371B2 (en) Computerized method and computer program for distinguishing between at least two event classes
US10607601B2 (en) Speech recognition by selecting and refining hot words
CN109559735B (en) Voice recognition method, terminal equipment and medium based on neural network
CN113921017A (en) Voice identity detection method and device, electronic equipment and storage medium
CN111783450B (en) Phrase extraction method and device in corpus text, storage medium and electronic equipment
CN116635934A (en) Unsupervised learning of separate phonetic content and style representations
CN112015872A (en) Question recognition method and device
Vuppala et al. Improved consonant–vowel recognition for low bit‐rate coded speech
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
CN114360557A (en) Voice tone conversion method, model training method, device, equipment and medium
WO2021014612A1 (en) Utterance segment detection device, utterance segment detection method, and program
US11037583B2 (en) Detection of music segment in audio signal
CN113112992B (en) Voice recognition method and device, storage medium and server
Kashani et al. Sequential use of spectral models to reduce deletion and insertion errors in vowel detection
CN110675858A (en) Terminal control method and device based on emotion recognition
CN115662473A (en) Emotion recognition method and device based on voice data and electronic equipment
Płonkowski Using bands of frequencies for vowel recognition for Polish language
CN114530142A (en) Information recommendation method, device and equipment based on random forest and storage medium
CN113035230A (en) Authentication model training method and device and electronic equipment
Karpov Efficient speaker recognition for mobile devices
Mittal et al. Classical and deep learning data processing techniques for speech and speaker recognitions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination