CN113921017A - Voice identity detection method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN113921017A (application number CN202111524105.3A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- syllable
- same
- comparison
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/06 — Decision making techniques; Pattern matching strategies
- G10L17/14 — Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
Abstract
The application discloses a voice identity check method and device, an electronic device, and a storage medium. The method includes: performing syllable matching between the speech recognition result of a comparison voice and the speech recognition result of a sample voice to determine a plurality of identical syllables of the comparison voice relative to the sample voice, the identical syllables including syllables of at least two syllable types; calculating, for each identical syllable, the voiceprint similarity between the corresponding first speech segment and the corresponding second speech segment according to the speech features of the two segments; and determining the identity check result of the comparison voice and the sample voice according to these voiceprint similarities. The scheme can improve the accuracy of the identity check result.
Description
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method and an apparatus for testing voice identity, an electronic device, and a storage medium.
Background
A voice identity check determines whether two input voices come from the same person by comparing and analyzing them. In the prior art, the accuracy of voice identity checking is low, and the reliability of the check result is therefore low. How to improve the accuracy of the voice identity check is thus a technical problem to be urgently solved.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present application provide a voice identity check method and apparatus, an electronic device, and a storage medium to address them.
According to an aspect of the embodiments of the present application, a voice identity check method is provided, including: performing syllable matching between the speech recognition result of a comparison voice and the speech recognition result of a sample voice, and determining a plurality of identical syllables of the comparison voice relative to the sample voice, the identical syllables including syllables of at least two syllable types, the syllable types including a word type and a phoneme type; calculating the voiceprint similarity between the first speech segment corresponding to each identical syllable and the corresponding second speech segment according to the speech features of the two segments, where the first speech segment is the segment corresponding to the identical syllable in the comparison voice and the second speech segment is the segment corresponding to the identical syllable in the sample voice; and determining the identity check result of the comparison voice and the sample voice according to the voiceprint similarities.
According to an aspect of an embodiment of the present application, a voice identity check apparatus is provided, including: a syllable matching module configured to perform syllable matching between the speech recognition result of a comparison voice and that of a sample voice, and to determine a plurality of identical syllables of the comparison voice relative to the sample voice, the identical syllables including syllables of at least two syllable types, the syllable types including a word type and a phoneme type; a voiceprint similarity calculation module configured to calculate the voiceprint similarity between the first speech segment corresponding to each identical syllable and the corresponding second speech segment according to the speech features of the two segments, where the first speech segment is the segment corresponding to the identical syllable in the comparison voice and the second speech segment is the segment corresponding to the identical syllable in the sample voice; and an identity check result determining module configured to determine the identity check result of the comparison voice and the sample voice according to the voiceprint similarities.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: a processor; a memory having computer readable instructions stored thereon which, when executed by the processor, implement a method of verifying speech identity as described above.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer-readable instructions, which, when executed by a processor, implement the method for checking the identity of speech as described above.
In the scheme of the application, syllable matching is performed based on the speech recognition results of the comparison voice and the sample voice to determine a plurality of identical syllables of the comparison voice relative to the sample voice; the voiceprint similarity between the first speech segment and the second speech segment corresponding to each identical syllable is then calculated according to the speech features of the two segments, and the identity check result of the comparison voice and the sample voice is determined from these similarities. Because the identical syllables include syllables of at least two syllable types, and syllables of different types differ in stability and in the granularity of the features they express, combining identical syllables of at least two types in the voiceprint similarity calculation improves the accuracy and validity of the identity check result.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a flow chart illustrating a method of verifying speech identity according to an embodiment of the present application.
Fig. 2 is a flow chart illustrating obtaining a speech recognition result according to an embodiment of the present application.
FIG. 3 is a flow chart illustrating syllable matching according to an embodiment of the present application.
FIG. 4 is a flowchart illustrating step 120 according to an embodiment of the present application.
FIGS. 5A-5C show spectrograms of the first speech segments corresponding to three identical syllables and spectrograms of the corresponding second speech segments, according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating step 120 according to another embodiment of the present application.
Fig. 7 is a flowchart illustrating steps 120 and 130 according to an embodiment of the present application.
Fig. 8 is a flowchart illustrating step 130 according to another embodiment of the present application.
Fig. 9 is a flowchart illustrating step 130 according to another embodiment of the present application.
Fig. 10 is a flowchart illustrating step 130 according to another embodiment of the present application.
Fig. 11 is a block diagram of a device for verifying speech identity according to an embodiment of the present application.
FIG. 12 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It should be noted that: reference herein to "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the preceding and following objects.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
fig. 1 is a flowchart illustrating a method for checking voice identity according to an embodiment of the present application, which may be performed by a computer device with processing capability, such as a server, a terminal device (e.g., a desktop computer, a notebook computer), and the like, and may also be performed by a checking system including the server and the terminal device, which is not limited in detail herein. Referring to fig. 1, the method includes at least steps 110 to 130, which are described in detail as follows:
In some embodiments, the speech recognition result may be obtained by speech-to-text recognition of the comparison voice (or the sample voice), so that the speech recognition result of the comparison voice indicates the text content corresponding to the comparison voice, and the speech recognition result of the sample voice indicates the text content corresponding to the sample voice.
In some embodiments, the speech recognition result may be obtained by performing phoneme recognition on the comparison speech (or the sample speech), so that the speech recognition result of the comparison speech indicates the phoneme content corresponding to the comparison speech, and the speech recognition result of the sample speech indicates the phoneme content corresponding to the sample speech.
A phoneme (phone) is the smallest unit of speech, divided according to the natural attributes of speech, and falls into two major categories: vowels and consonants. Phonemes are analyzed according to the articulatory actions within a syllable, one action forming one phoneme. For example, the Mandarin syllable 啊 (ā) has only one phoneme, ā; 爱 (ài) has two phonemes, a and i; 代 (dài) has three phonemes, d, a, and i.
In this scheme, a syllable whose syllable type is the word type is a word, and a syllable whose syllable type is the phoneme type is a phoneme. Thus, identical syllables may be identical words or identical phonemes.
In some embodiments, the speech recognition result includes both phoneme content and text content, so that phoneme matching may be performed based on the phoneme content to determine identical phonemes, and word matching may be performed based on the text content to determine identical words.
Fig. 2 is a flowchart illustrating obtaining a speech recognition result according to an embodiment of the present application, as shown in fig. 2, including: step 210, inputting a voice signal; step 220, active voice detection; step 230, voice recognition; and step 240, outputting the phoneme content and the character content.
If the speech signal in step 210 is a comparison speech, the content of the phoneme and the content of the text output in step 240 are speech recognition results of the comparison speech; on the contrary, if the speech signal in step 210 is the sample speech, the content of the phoneme and the content of the text output in step 240 are the speech recognition result of the sample speech.
Active speech detection, also called voice endpoint detection or voice boundary detection, identifies and eliminates long silence segments from the speech signal and determines the active speech segments (i.e., non-silent segments), so that in step 230 speech recognition is performed only on the active speech segments, without attending to the silence segments.
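The patent does not specify a particular VAD algorithm. As a hedged illustration only, a minimal energy-based detector (the function name, frame length, and threshold ratio are all assumptions, not taken from the patent) could locate the active segments like this:

```python
import numpy as np

def detect_active_segments(signal, sample_rate, frame_ms=20, energy_ratio=0.1):
    """Return (start_sample, end_sample) pairs of non-silent regions.

    A frame counts as 'active' when its energy exceeds energy_ratio times
    the mean frame energy -- a simple stand-in for the VAD of step 220.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(float) ** 2).mean(axis=1)
    active = energy > energy_ratio * energy.mean()

    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                                  # segment opens
        elif not is_active and start is not None:
            segments.append((start * frame_len, i * frame_len))
            start = None                               # segment closes
    if start is not None:                              # still active at end
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```

A production system would use a statistically trained VAD rather than a fixed energy ratio; this sketch only shows the segment bookkeeping.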
The same syllable of the comparison speech with respect to the sample speech refers to a syllable included in both the speech recognition result of the comparison speech and the speech recognition result of the sample speech.
In some embodiments, a plurality of syllables may be selected from the sample voice as reference syllables. Each reference syllable is then matched against the speech recognition result of the comparison voice: if the speech recognition result of the comparison voice includes the reference syllable, the reference syllable is determined to be an identical syllable of the comparison voice relative to the sample voice; otherwise it is not. This process is repeated for each reference syllable. It will be appreciated that, to ensure that the determined identical syllables include syllables of at least two syllable types, the reference syllables must also include syllables of at least two syllable types.
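The matching loop described above can be sketched as follows (the data layout and function name are illustrative assumptions; a real implementation would also record time positions):

```python
def find_identical_syllables(reference_syllables, comparison_result):
    """Keep each (syllable, type) drawn from the sample voice only if the
    comparison voice's recognition result also contains that syllable."""
    identical = []
    for syllable, syll_type in reference_syllables:
        # comparison_result maps a syllable type ("word", "phoneme", ...)
        # to the list of syllables recognized in the comparison voice
        if syllable in comparison_result.get(syll_type, []):
            identical.append((syllable, syll_type))
    return identical
```

For example, with reference syllables of two types, `[("today", "word"), ("d", "phoneme"), ("zh", "phoneme")]`, and a comparison result `{"word": ["today", "where"], "phoneme": ["d", "a", "i"]}`, the function keeps `("today", "word")` and `("d", "phoneme")`.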
FIG. 3 is a flow chart illustrating syllable matching according to an embodiment of the present application. In this embodiment, the speech recognition result of the comparison voice includes the phoneme content and the text content corresponding to the comparison voice, and the speech recognition result of the sample voice includes the phoneme content and the text content corresponding to the sample voice. As shown in fig. 3, the phoneme content and text content of the comparison voice are syllable-matched against those of the sample voice, and a plurality of identical syllables together with the time position information of each identical syllable are output. The time position information of an identical syllable includes its time position information in the comparison voice and its time position information in the sample voice.
Specifically, the time position information of the same syllable in the comparison speech is also understood as the time position information of the first speech segment corresponding to the same syllable in the comparison speech, and the time position information of the first speech segment corresponding to the same syllable in the comparison speech indicates the start time of the first speech segment in the comparison speech and the end time of the first speech segment in the comparison speech.
For example, suppose the text content of a comparison voice is "where did you go today" and the identical syllable is "today". If the speech segment corresponding to the word "today" starts at time t1 and ends at time t2 in the comparison voice, then the time position information of the first speech segment corresponding to the identical syllable "today" indicates the start time t1 and the end time t2.
In some embodiments, in the process of performing speech recognition on the comparison voice (or the sample voice), for example speech-to-text recognition, not only is the text content corresponding to each audio segment identified, but the time position information in the voice of the audio segment corresponding to each piece of text content (a character or a word) is also determined. Therefore, in the process of syllable matching based on the speech recognition results of the comparison voice and the sample voice, not only can the identical characters and/or identical words of the comparison voice relative to the sample voice be determined, but their time position information in the comparison voice and in the sample voice can be further determined as well.
Similarly, in the phoneme recognition process of performing speech recognition on the comparison speech (or the sample speech), not only the phoneme corresponding to each audio segment is recognized and determined, but also the time position information of the audio segment corresponding to each phoneme in the comparison speech (or the sample speech) is correspondingly determined, so that when the syllables are matched, the time position information of the same phoneme in the comparison speech and the time position information of the same phoneme in the sample speech can be correspondingly determined.
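As a hedged illustration of the bookkeeping these paragraphs describe, one might record each identical syllable together with its time positions in both voices (the field names are illustrative, not the patent's):

```python
from dataclasses import dataclass

@dataclass
class SyllableMatch:
    syllable: str        # e.g. the word "today" or the phoneme "d"
    syll_type: str       # "word", "character", or "phoneme"
    comp_start: float    # start time of the first speech segment (seconds)
    comp_end: float      # end time of the first speech segment (seconds)
    sample_start: float  # start time of the second speech segment (seconds)
    sample_end: float    # end time of the second speech segment (seconds)

    def comp_duration(self) -> float:
        """Duration of the first speech segment in the comparison voice."""
        return self.comp_end - self.comp_start
```

The `comp_start`/`comp_end` pair plays the role of (t1, t2) in the "today" example above.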
It can be understood that, because the durations of the audio segments corresponding to syllables of different syllable types differ, the comprehensiveness of the sound features they embody also differs. Generally, among syllables of the word type, the character type, and the phoneme type, the audio corresponding to a word-type syllable has the longest duration, so its audio segment carries the largest amount of information about the speaker's voiceprint features; the audio corresponding to a phoneme-type syllable has the shortest duration, so its audio segment carries the least such information, but it can express the speaker's voiceprint features at a finer granularity.
Therefore, in the scheme of the application, the same syllables of at least two syllable types are combined to carry out identity verification on the comparison voice and the sample voice, and the comprehensiveness and the fineness of the voice frequency corresponding to the same syllables in the voiceprint feature expression are considered, so that the accuracy of the identity verification result can be ensured.
In some embodiments, prior to step 120, the method further comprises: acquiring the voice characteristics of the first voice section corresponding to the same syllable; and acquiring the voice characteristics of the second voice section corresponding to the same syllable.
In some embodiments, the speech recognition result of the comparison speech indicates time position information of each syllable included in the comparison speech; in this embodiment, the step of obtaining the speech features of the first speech segment corresponding to the same syllable further includes: according to the time position information of the same syllable in the comparison voice, segment extraction is carried out in the comparison voice to obtain a first voice section corresponding to the same syllable; and performing voice feature extraction on the first voice section to obtain the voice feature of the first voice section.
The speech features may include characteristic parameters such as the center frequency, bandwidth, and intensity of a formant, and may further include characteristic curves such as the formant trend, the fundamental frequency trajectory, and the LPC spectrum. In an embodiment, frequency-domain features such as the formants and fundamental frequency of each speech frame in the speech segments (the first speech segment and the second speech segment) may be computed automatically by speech-signal-processing algorithms, for example the autocorrelation method, the cepstrum method, or linear predictive coding (LPC), so as to obtain the characteristic curves (formant trend, fundamental frequency trajectory, LPC spectrum, etc.) of the speech segments.
After the first speech segment corresponding to an identical syllable is extracted, the characteristic parameters (center frequency, bandwidth, and intensity) and the characteristic curves (formant trend, fundamental frequency trajectory, LPC spectrum, etc.) can be extracted in the above manner to obtain the speech features of the first speech segment.
The speech features of the second speech segment corresponding to the same syllable can be extracted in a similar manner, and are not described herein again.
In other embodiments, the speech recognition result of the comparison speech indicates the time position information of each syllable included in the comparison speech; in this embodiment, the step of obtaining the speech characteristics of the first speech segment corresponding to the same syllable further includes: determining a segmented spectrogram corresponding to the first voice segment in the spectrogram of the comparison voice according to the time position information of the same syllable in the comparison voice; and acquiring the voice features of the first voice section from the segmented voice spectrogram corresponding to the first voice section.
In a case where a spectrogram of a comparison voice and a spectrogram of a sample voice are generated in advance, a segment spectrogram corresponding to each syllable is associated with time position information in the spectrogram of the voice based on time position information of each syllable indicated by a voice recognition result, whereby the corresponding segment spectrogram can be extracted from the spectrogram of the voice based on the time position information of the syllable.
The spectrogram of a voice is obtained by time-frequency transformation of the time-domain signal, computing the frequency-domain features of each speech frame. The spectrogram therefore expresses the frequency-domain features of each syllable, such as its formants, center frequency, and bandwidth, so the speech features of the first speech segment can be obtained directly from the segment spectrogram corresponding to that first speech segment. Similarly, the speech features of the second speech segment can be obtained in the same manner, which is not repeated here.
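Extracting a segment spectrogram from a precomputed whole-utterance spectrogram by time position, as described above, might be sketched as follows (the column-per-frame layout and the hop parameter are assumptions about how the spectrogram was computed):

```python
import numpy as np

def segment_spectrogram(spectrogram, hop_s, start_time, end_time):
    """Slice the spectrogram columns (one column per analysis frame) that
    fall inside [start_time, end_time); hop_s is the frame hop in seconds."""
    start_col = int(start_time / hop_s)
    end_col = int(np.ceil(end_time / hop_s))
    return spectrogram[:, start_col:end_col]
```

The start and end times are exactly the per-syllable time position information produced by the syllable matching step.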
Referring to fig. 3, in step 130, the identity check result of the comparison speech and the sample speech is determined according to the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable.
The identity check tests whether two voices come from the same person; thus, the identity check result between the comparison voice and the sample voice indicates whether the two voices come from the same person, or the probability that they do.
In some embodiments, the number of target identical syllables with a voiceprint similarity exceeding a similarity threshold between the first speech segment corresponding to the identical syllable and the second speech segment corresponding to the identical syllable may be counted, and then the identity check result of the comparison speech and the sample speech is determined based on the number of target identical syllables.
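The counting rule of this embodiment might be sketched as follows (the similarity threshold and the decision ratio are assumptions; the patent leaves their values unspecified):

```python
def identity_check(similarities, sim_threshold=0.8, count_ratio=0.6):
    """similarities: one voiceprint similarity per identical syllable.

    Count the 'target' identical syllables whose first/second speech
    segments exceed sim_threshold, then declare 'same speaker' if a
    sufficient fraction of the identical syllables qualify."""
    target = sum(1 for s in similarities if s > sim_threshold)
    return target >= count_ratio * len(similarities)
```

With per-syllable similarities [0.9, 0.85, 0.95, 0.4], three of four syllables qualify and the check passes; with [0.5, 0.6, 0.9], only one of three qualifies and it fails.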
In some embodiments, the speech features include speech feature curves and speech feature parameters; in this embodiment, as shown in fig. 4, step 120 includes:
Step 410, according to each speech feature curve, calculating the feature curve similarity between the speech feature curve of the first speech segment corresponding to the same syllable and the speech feature curve of the corresponding second speech segment. The speech feature curve may be one or more of the curves listed above, such as the formant trend, the fundamental frequency trajectory, and the LPC spectrum.
It is understood that, when there are a plurality of speech feature curves, a plurality of feature curve similarities are calculated correspondingly.
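As an illustration of step 410, a feature curve similarity could be computed as follows. This is a minimal sketch under assumptions the embodiment does not fix: the two curves (e.g. formant or fundamental frequency trajectories) are resampled to a common length and compared by Pearson correlation, mapped to a 0..1 scale; the choice of metric is hypothetical.

```python
import numpy as np

def curve_similarity(curve_a: np.ndarray, curve_b: np.ndarray) -> float:
    """Similarity between two speech feature curves on a 0..1 scale.

    The curves are resampled to a common number of points so that they
    can be compared point-wise, then scored by Pearson correlation.
    """
    n = min(len(curve_a), len(curve_b))
    grid = np.linspace(0.0, 1.0, n)
    # Resample each curve onto the common grid.
    a = np.interp(grid, np.linspace(0.0, 1.0, len(curve_a)), curve_a)
    b = np.interp(grid, np.linspace(0.0, 1.0, len(curve_b)), curve_b)
    r = np.corrcoef(a, b)[0, 1]
    # Map the correlation from [-1, 1] to a similarity in [0, 1].
    return float((r + 1.0) / 2.0)
```

An identical pair of trajectories yields 1.0, while opposite trends yield 0.0; constant curves (zero variance) are not handled here and would need a separate branch.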
Step 420, determining a feature parameter deviation according to the voice feature parameters of the first voice segment corresponding to the same syllable and the voice feature parameters of the second voice segment corresponding to the same syllable.
The speech characteristic parameters may be the center frequency, bandwidth, intensity, etc. of the formants listed above, and are not particularly limited herein.
The feature parameter deviation may be obtained by subtracting, for each speech feature parameter, the speech feature parameter of the first speech segment corresponding to the same syllable from the speech feature parameter of the second speech segment corresponding to the same syllable.
It is understood that, when there are a plurality of speech feature parameters, a plurality of feature parameter deviations are calculated correspondingly.
Step 430, determining the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable according to the feature curve similarity and the feature parameter deviation.
In some embodiments, the feature curve similarity and the feature parameter deviation may be weighted, and the result of the weighted calculation is used as the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable. The weighting weight corresponding to the similarity of each characteristic curve and the deviation of each characteristic parameter may be set according to actual needs, and is not specifically limited herein.
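The weighted combination of step 430 can be sketched as follows. This is only an illustration: the embodiment states that the curve similarities and parameter deviations are weighted, but does not specify how a deviation is converted into a similarity-like score; the `1 / (1 + |deviation|)` mapping below is a hypothetical choice.

```python
def voiceprint_similarity(curve_sims, param_devs, curve_weights, dev_weights):
    """Combine feature curve similarities and feature parameter deviations
    into a single voiceprint similarity via a weighted average.

    Each parameter deviation is first mapped to a score in (0, 1] with
    1 / (1 + |deviation|), so that a zero deviation counts as a perfect
    match (an assumed convention, not fixed by the embodiment).
    """
    score = sum(w * s for w, s in zip(curve_weights, curve_sims))
    score += sum(w / (1.0 + abs(d)) for w, d in zip(dev_weights, param_devs))
    return score / (sum(curve_weights) + sum(dev_weights))
```

With all weights equal to 1, a perfectly matching curve and a zero parameter deviation yield a similarity of 1.0.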
Through the above process, the voiceprint similarity between the first speech segment and the second speech segment corresponding to the same syllable is calculated comprehensively by combining the speech feature curves and the speech feature parameters of the speech segments, that is, speech features in multiple dimensions, so the accuracy of the calculated voiceprint similarity can be ensured.
In some embodiments, based on the segmented speech spectrogram of the first speech segment corresponding to the extracted same syllable and the segmented speech spectrogram of the corresponding second speech segment, the segmented speech spectrogram of the first speech segment corresponding to the same syllable and the segmented speech spectrogram of the corresponding second speech segment can be respectively displayed in the display interface, thereby facilitating a user to visually observe the similarity of the same syllable in the comparison speech and the sample speech based on the displayed segmented speech spectrogram of the same syllable.
FIG. 5A illustrates segmented spectrograms of the word "which" in two voices according to one embodiment. The left side of FIG. 5A shows the segmented spectrogram of the word "which" in audio 1; the right side of FIG. 5A shows the segmented spectrogram of the same word in audio 2.
FIG. 5B illustrates segmented spectrograms of the word "yes" in two voices according to one embodiment. The left side of FIG. 5B shows the segmented spectrogram of the word "yes" in audio 1; the right side of FIG. 5B shows the segmented spectrogram of the same word in audio 2.
FIG. 5C illustrates segmented spectrograms of the phoneme "e4" in two voices according to one embodiment. The left side of FIG. 5C shows the segmented spectrogram of the phoneme "e4" in "this" in audio 1; the right side of FIG. 5C shows the segmented spectrogram of the phoneme "e4" in "this" in audio 2.
In some embodiments, as shown in fig. 6, step 120, comprises:
step 610, determining a first speech feature vector of the first speech segment corresponding to the same syllable according to the speech feature of the first speech segment corresponding to the same syllable.
Step 620, determining a second speech feature vector of the second speech segment corresponding to the same syllable according to the speech feature of the second speech segment corresponding to the same syllable.
In this embodiment, the speech features of the first speech segment corresponding to the same syllable and the speech features of the second speech segment corresponding to the same syllable are vectorized respectively to obtain a first speech feature vector of the first speech segment corresponding to the same syllable and a second speech feature vector of the second speech segment corresponding to the same syllable.
In some embodiments, a distance between the first speech feature vector and the second speech feature vector may be calculated, and the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable may be determined according to the calculated distance. The distance may be a Euclidean distance or the like, and is not particularly limited herein. It will be appreciated that the greater the calculated distance, the lower the voiceprint similarity between the first speech segment corresponding to the same syllable and the corresponding second speech segment.
In some embodiments, a similarity between the first speech feature vector and the second speech feature vector may be calculated as the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable. The calculated similarity may be, for example, a cosine similarity, a relative entropy, or a Jaccard similarity coefficient, and is not particularly limited herein.
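The two vector comparisons named above (a Euclidean distance, where larger means less similar, and a cosine similarity, where larger means more similar) can be sketched directly; this is a minimal illustration of the standard formulas, not the application's specific implementation.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two speech feature vectors.
    A larger distance indicates a lower voiceprint similarity."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine similarity between two speech feature vectors.
    A larger value indicates a higher voiceprint similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

Vectors pointing in the same direction give a cosine similarity of 1.0 regardless of magnitude, which is one reason cosine similarity is a common choice for comparing fixed-length voice feature vectors.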
In this embodiment, the voice features of the first voice segment corresponding to the same syllable and the voice features of the second voice segment corresponding to the same syllable are vectorized, so that the calculation of the voiceprint similarity is simplified.
In some embodiments, as shown in fig. 7, step 120, comprises:
Step 121, selecting one same syllable from the plurality of same syllables as a candidate same syllable according to the order of syllables from long to short.
As described above, the length of a syllable of the word type is greater than that of a syllable of the character type, which in turn is greater than that of a syllable of the phoneme type; the longer the syllable, the longer the duration of the corresponding audio segment and, correspondingly, the more speech features it expresses. Therefore, in the scheme of this embodiment, candidate same syllables are selected from the plurality of same syllables in order of syllable length from long to short for the voiceprint similarity calculation, and because longer syllables express more speech features, the accuracy of the calculated voiceprint similarity can be ensured.
Step 122, calculating the voiceprint similarity between the first speech segment corresponding to the candidate same syllable and the second speech segment corresponding to the candidate same syllable. The specific calculation process of the voiceprint similarity is described above, and is not described herein again.
In this embodiment, the step 130 includes:
The similarity threshold may be set as needed, for example, 95%, 96%, 98%, etc., and is not particularly limited herein.
In some embodiments, if the cumulative number is determined to reach the target number in step 132, the voiceprint similarity calculation may not be performed subsequently by selecting the candidate identical syllable from the plurality of identical syllables.
In some embodiments, the identity test result may include a test result indicating that the comparison voice and the sample voice are from the same person (referred to as a first test result for convenience of description) and a test result indicating that the comparison voice and the sample voice are not from the same person (referred to as a second test result). Thus, if, after the voiceprint similarity has been calculated for all the same syllables according to the above process, the cumulative number still does not reach the target number, the identity test result is determined to be the second test result.
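The loop of steps 121 and 122, with the early stop once the accumulated count reaches the target number (step 132), can be sketched as follows. The threshold and target values are placeholders; the similarity function is whatever voiceprint similarity calculation the preceding embodiments use.

```python
def identity_test(same_syllables, similarity_fn, threshold=0.95, target=5):
    """Walk the shared syllables from longest to shortest, accumulate the
    count of those whose voiceprint similarity exceeds the threshold, and
    stop early as soon as the target number is reached."""
    matched = 0
    for syllable in sorted(same_syllables, key=len, reverse=True):
        if similarity_fn(syllable) > threshold:
            matched += 1
            if matched >= target:
                return "same person"      # first test result
    return "not same person"              # second test result
```

Processing longer syllables first means the early stop tends to be reached using the most feature-rich segments, so the remaining (shorter) syllables never need a voiceprint similarity calculation.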
In some embodiments, the identity test result may further include a plurality of test results other than the first test result and the second test result, and a number range may be set for each identity test result, so that, when a cumulative number counted based on the voiceprint similarity of all the same syllables is calculated, the identity test result corresponding to the number range in which the cumulative number is located is used as the identity test result of the comparison speech and the sample speech.
For example, the identity test result may include the first test result, the second test result, a third test result, a fourth test result, and a fifth test result, where the third test result indicates that the comparison voice and the sample voice are from the same person with a higher probability, the fourth test result indicates that the comparison voice and the sample voice are from the same person with a lower probability, and the fifth test result indicates that it cannot be determined whether the comparison voice and the sample voice are from the same person. The first test result may be set to correspond to a first number range, the third test result to a second number range, the fifth test result to a third number range, the fourth test result to a fourth number range, and the second test result to a fifth number range. Thus, if the finally obtained cumulative number falls within the fifth number range, the identity test result of the comparison voice and the sample voice is determined to be the second test result. It is understood that the counts in the first number range are greater than those in the second number range, which are in turn greater than those in the third, fourth, and fifth number ranges.
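The five-way mapping from cumulative count to test result can be sketched with example boundaries; the concrete number ranges are left open by the application, so the values below are placeholders only.

```python
# Example boundaries only — the application leaves the concrete
# number ranges to be configured according to actual needs.
RESULT_RANGES = [
    (8, "first: same person"),
    (6, "third: likely same person"),
    (4, "fifth: undetermined"),
    (2, "fourth: likely not same person"),
    (0, "second: not same person"),
]

def result_from_count(cumulative_count):
    """Return the identity test result whose number range contains the
    cumulative count (ranges ordered from highest to lowest counts)."""
    for lower_bound, result in RESULT_RANGES:
        if cumulative_count >= lower_bound:
            return result
    return RESULT_RANGES[-1][1]
```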
In other embodiments, as shown in FIG. 8, step 130, comprises:
step 810, determining a target same syllable with the corresponding voiceprint similarity exceeding a similarity threshold according to the voiceprint similarity between the first speech segment corresponding to each same syllable and the second speech segment corresponding to each same syllable.
The target identical syllable is the identical syllable whose corresponding voiceprint similarity exceeds the similarity threshold.
In some embodiments, after calculating the voiceprint similarity between the first speech segment and the second speech segment corresponding to all the same syllables, the target same syllables may be determined correspondingly, and the number of the target same syllables may be counted.
In some embodiments, a number range may be set for each type of the identity check result, so that, after the number of the target identical syllables is determined, the identity check result corresponding to the number range to which the number of the target identical syllables belongs is used as the identity check result of the comparison speech and the sample speech.
In other embodiments, as shown in FIG. 9, step 130, comprises:
step 910, determining a target same syllable whose corresponding voiceprint similarity exceeds a similarity threshold according to the voiceprint similarity between the first speech segment corresponding to each same syllable and the second speech segment corresponding to each same syllable.
In some embodiments, a weighting calculation may be performed according to the first weight corresponding to each syllable type and the number of target identical syllables belonging to each syllable type, and the weighting calculation result is used as the probability that the comparison speech and the sample speech come from the same person.
For example, if the first weight corresponding to the word type is A1, the first weight corresponding to the character type is A2, and the first weight corresponding to the phoneme type is A3, and it is determined statistically that the number of target same syllables belonging to the word type is B1, the number belonging to the character type is B2, and the number belonging to the phoneme type is B3, the result of A1 × B1 + A2 × B2 + A3 × B3 can be taken as the probability that the comparison speech and the sample speech come from the same person. The foregoing is, of course, merely exemplary and is not to be construed as limiting the scope of the invention.
In the present embodiment, the first weight is used to reflect the degree of contribution of same syllables of the corresponding syllable type to the identity test result; as described above, since syllables of the word type express more features, the first weight corresponding to the word type > the first weight corresponding to the character type > the first weight corresponding to the phoneme type.
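The weighted calculation above (A1 × B1 + A2 × B2 + A3 × B3) can be sketched as follows. The weight values in the usage example are hypothetical; the only constraint stated by the embodiment is the ordering word > character > phoneme.

```python
def same_person_probability(target_counts, first_weights):
    """Weighted sum of per-type target-same-syllable counts, taken as the
    probability that the two voices come from the same person.

    target_counts / first_weights are keyed by syllable type; types with
    no target same syllables contribute zero."""
    return sum(first_weights[t] * target_counts.get(t, 0) for t in first_weights)
```

For example, with placeholder weights respecting the stated ordering:

```python
weights = {"word": 0.10, "character": 0.05, "phoneme": 0.02}
p = same_person_probability({"word": 2, "character": 2, "phoneme": 5}, weights)
# p is 0.10*2 + 0.05*2 + 0.02*5 = 0.4
```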
And 940, determining the identity test result of the comparison voice and the sample voice according to the probability that the comparison voice and the sample voice come from the same person.
In some embodiments, a probability range (referred to as a first probability range for ease of distinction) may be set for each type of the identity check result, so that, based on the probabilities that the determined comparison voice and the sample voice are from the same person, the identity check result corresponding to the first probability range in which the determined probability is located is taken as the identity check result of the comparison voice and the sample voice.
In this embodiment, the probability that the comparison speech and the sample speech are from the same person is calculated by combining the first weight corresponding to each syllable type and the number of target identical syllables belonging to each syllable type, so that the probability that the comparison speech and the sample speech are from the same person is determined by comprehensively considering the contribution degree of syllables of different syllable types to the identity test result and the number of target identical syllables of different syllable types, the reasonability and the accuracy of the probability that the determined comparison speech and the sample speech are from the same person are ensured, and the accuracy of the identity test result of the comparison speech and the sample speech determined based on the probability is ensured.
In other embodiments, step 110 includes: performing syllable matching on the speech recognition result of the comparison speech and the speech recognition result of the sample speech at at least two granularities, and determining the same syllables of the comparison speech relative to the sample speech at each granularity, where the granularities include a word granularity, a character granularity, and a phoneme granularity. At each granularity, the same syllables correspond to the syllable type determined by that granularity. Specifically, if syllable matching is performed at the word granularity, the determined same syllables are all words; if syllable matching is performed at the character granularity, the determined same syllables are all characters; if syllable matching is performed at the phoneme granularity, the determined same syllables are all phonemes.
In this embodiment, since syllable matching is performed at least two granularities, the obtained syllable matching result includes a syllable matching result corresponding to each of the at least two granularities.
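Multi-granularity matching in step 110 can be sketched as a per-granularity intersection of recognised units. The representation is an assumption for illustration: each recognition result is taken to be a dict mapping a granularity name to the list of units recognised at that granularity.

```python
def match_same_syllables(comparison_result, sample_result):
    """Return, for each granularity, the units present in both the
    comparison speech and the sample speech recognition results."""
    matches = {}
    for granularity in ("word", "character", "phoneme"):
        a = comparison_result.get(granularity, [])
        b = sample_result.get(granularity, [])
        matches[granularity] = sorted(set(a) & set(b))
    return matches
```

A real implementation would also carry the time position information of each matched unit (as described earlier for segment extraction), which this sketch omits.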
In this embodiment, as shown in fig. 10, step 130 includes:
In some embodiments, the mapping relationship between the number of target identical syllables and the reference probability for each granularity may be preset, so that, after determining the number of target identical syllables for each granularity, the reference probability mapped by the number of target identical syllables for the granularity is used as the reference probability that the comparison speech and the sample speech come from the same person for the granularity.
Under different granularities, the mapping relation between the number of the same syllables of the set target and the reference probability can be the same or different, and the mapping relation can be specifically set according to actual needs.
Step 1030, determining a target probability that the comparison voice and the sample voice come from the same person according to the reference probability that the comparison voice and the sample voice come from the same person at each granularity and the second weight corresponding to each granularity.
In some embodiments, the reference probabilities at least two granularities may be weighted according to the second weight corresponding to each granularity, and the result of the weighted calculation is used as the target probability that the comparison speech and the sample speech are from the same person.
For example, if the at least two granularities include a word granularity and a phoneme granularity, the second weight corresponding to the word granularity is C1, the second weight corresponding to the phoneme granularity is C2, the reference probability at the word granularity is P1, and the reference probability at the phoneme granularity is P2, the target probability may be: C1 × P1 + C2 × P2. It should be noted that the above is merely exemplary and should not be considered as limiting the scope of the application.
In some embodiments, a probability range (referred to as a second probability range for easy distinction) corresponding to each type of the identity check result may be set, so that, after the target probability is determined, the identity check result corresponding to the second probability range in which the target probability is located is used as the identity check result of the comparison speech and the sample speech.
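Step 1030 (C1 × P1 + C2 × P2, generalised over granularities) can be sketched as follows. The assumption that the second weights sum to 1, so the result remains a probability in [0, 1], is ours; the embodiment leaves the weights to be set according to actual needs.

```python
def target_probability(reference_probs, second_weights):
    """Weighted combination of per-granularity reference probabilities
    into the target probability that the comparison speech and the
    sample speech come from the same person."""
    return sum(second_weights[g] * p for g, p in reference_probs.items())
```

The target probability would then be compared against the second probability ranges described above to produce the final identity test result.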
In this embodiment, the target probability is calculated by combining the second weight corresponding to each granularity and the reference probability calculated under each granularity, and the target probability that the comparison speech and the sample speech are from the same person is determined by comprehensively considering the contribution degree of the syllable matching results under different granularities to the identity test result and the number of the target same syllables under different granularities, so that the reasonability and the accuracy of the probability that the determined comparison speech and the sample speech are from the same person are ensured, and the accuracy of the identity test result of the comparison speech and the sample speech determined based on the probability is further ensured.
Embodiments of the apparatus of the present application are described below, which may be used to perform the methods of the above-described embodiments of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the above-described embodiments of the method of the present application.
Fig. 11 is a block diagram showing a voice identity verifying apparatus according to an embodiment. As shown in fig. 11, the voice identity verifying apparatus includes: a syllable matching module 1110, configured to perform syllable matching on a speech recognition result of a comparison speech and a speech recognition result of a sample speech, and determine a plurality of same syllables of the comparison speech relative to the sample speech, the plurality of same syllables including syllables of at least two syllable types, the syllable types including a word type, a character type, and a phoneme type; a voiceprint similarity calculation module 1120, configured to calculate a voiceprint similarity between a first speech segment corresponding to the same syllable and the corresponding second speech segment according to the speech features of the first speech segment and those of the corresponding second speech segment, the first speech segment being the speech segment corresponding to the same syllable in the comparison speech, and the second speech segment being the speech segment corresponding to the same syllable in the sample speech; and an identity test result determining module 1130, configured to determine the identity test result of the comparison speech and the sample speech according to the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable.
In some embodiments, the apparatus for verifying speech identity further comprises: the first voice feature acquisition module is used for acquiring the voice features of the first voice section corresponding to the same syllable; and the second voice characteristic acquisition module is used for acquiring the voice characteristics of the second voice section corresponding to the same syllable.
In some embodiments, the speech recognition result of the comparison speech indicates time position information of each syllable included in the comparison speech; in this embodiment, the first voice feature obtaining module includes: the segment extraction unit is used for extracting segments in the comparison voice according to the time position information of the same syllable in the comparison voice to obtain a first voice segment corresponding to the same syllable; and the first extraction unit is used for extracting the voice characteristics of the first voice section to obtain the voice characteristics of the first voice section.
In other embodiments, the speech recognition result of the comparison speech indicates the time position information of each syllable included in the comparison speech; in this embodiment, the first voice feature obtaining module includes: the segmented spectrogram determining unit is used for determining a segmented spectrogram corresponding to the first voice segment in the spectrogram of the comparison voice according to the time position information of the same syllable in the comparison voice; and the voice feature acquisition unit is used for acquiring the voice features of the first voice section from the segmented spectrogram corresponding to the first voice section.
In some embodiments, the speech features include speech feature curves and speech feature parameters; the voiceprint similarity calculation module 1120 includes: a characteristic curve similarity determining unit, configured to determine a similarity of characteristic curves according to a voice characteristic curve of a first voice segment corresponding to the same syllable and a voice characteristic curve of a second voice segment corresponding to the same syllable; the characteristic parameter deviation calculating unit is used for determining the characteristic parameter deviation according to the voice characteristic parameters of the first voice section corresponding to the same syllable and the voice characteristic parameters of the second voice section corresponding to the same syllable; and the first voiceprint similarity determining unit is used for determining the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable according to the characteristic curve similarity and the characteristic parameter deviation.
In other embodiments, the speech feature comprises at least two speech feature parameters; the voiceprint similarity calculation module 1120 includes: a first speech feature vector determining unit, configured to determine a first speech feature vector of a first speech segment corresponding to the same syllable according to speech features of the first speech segment corresponding to the same syllable; a second speech feature vector determining unit, configured to determine a second speech feature vector of a second speech segment corresponding to the same syllable according to speech features of the second speech segment corresponding to the same syllable; and the second voiceprint similarity determining unit is used for calculating the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable according to the first voice feature vector of the first voice section corresponding to the same syllable and the second voice feature vector of the second voice section corresponding to the same syllable.
In some embodiments, the voiceprint similarity calculation module 1120 includes: the candidate same syllable selecting unit is used for selecting one same syllable from the plurality of same syllables as a candidate same syllable according to the sequence of syllables from long to short; a calculating unit, configured to calculate a voiceprint similarity between a first speech segment corresponding to the candidate same syllable and the second speech segment corresponding to the candidate same syllable; in this embodiment, the identity check result determining module 1130 includes: a quantity accumulation unit, configured to accumulate quantities if a voiceprint similarity between a first speech segment corresponding to the candidate same syllable and a second speech segment corresponding to the candidate same syllable exceeds a similarity threshold; a judging unit for judging whether the accumulated number reaches a target number; a first result determination unit configured to determine, if a target number is reached, the identity test result as a test result indicating that the comparison voice and the sample voice are from the same person; and if the target number is not reached, returning to the step of selecting one same syllable from the plurality of same syllables as a candidate same syllable according to the sequence of the syllables from long to short.
In some embodiments, the identity check result determination module 1130 includes: a target identical syllable determining unit, configured to determine a target identical syllable, for which a voiceprint similarity exceeds a similarity threshold, according to a voiceprint similarity between a first speech segment corresponding to each identical syllable and a second speech segment corresponding to each identical syllable; and the second result determining unit is used for determining the identity test result of the comparison voice and the sample voice according to the number of the target same syllables.
In other embodiments, the identity check result determination module 1130 includes: a target identical syllable determining unit, configured to determine a target identical syllable, for which a voiceprint similarity exceeds a similarity threshold, according to a voiceprint similarity between a first speech segment corresponding to each identical syllable and a second speech segment corresponding to each identical syllable; a first number determination unit for counting the number of target identical syllables belonging to each syllable type according to the syllable type to which the target identical syllables belong; a probability calculation unit for calculating the probability that the comparison speech and the sample speech come from the same person according to the first weight corresponding to each syllable type and the number of target same syllables belonging to each syllable type; and the third result determining unit is used for determining the identity test result of the comparison voice and the sample voice according to the probability that the comparison voice and the sample voice come from the same person.
In some embodiments, the syllable matching module 1110 is further configured to: perform syllable matching on the speech recognition result of the comparison speech and the speech recognition result of the sample speech at at least two granularities, and determine the same syllables of the comparison speech relative to the sample speech at each granularity, the granularities including a word granularity, a character granularity, and a phoneme granularity; in this embodiment, the identity check result determining module 1130 includes: a first quantity determining unit, configured to count, based on the same syllables at each granularity, the number of target same syllables whose corresponding voiceprint similarity exceeds a similarity threshold; a reference probability calculation unit, configured to calculate, according to the number of target same syllables at each granularity whose corresponding voiceprint similarity exceeds the similarity threshold, the reference probability that the comparison speech and the sample speech come from the same person at each granularity; a target probability calculating unit, configured to determine the target probability that the comparison speech and the sample speech come from the same person according to the reference probability at each granularity and the second weight corresponding to each granularity; and a fourth result determining unit, configured to determine the identity test result of the comparison speech and the sample speech according to the target probability that the comparison speech and the sample speech come from the same person.
FIG. 12 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application. It should be noted that the computer system 1200 of the electronic device shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 12, the computer system 1200 includes a Central Processing Unit (CPU) 1201, which can perform various appropriate actions and processes, such as executing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. The RAM 1203 also stores various programs and data necessary for system operation. The CPU 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An Input/Output (I/O) interface 1205 is also connected to the bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a Display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. A driver 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1210 as necessary, so that a computer program read out therefrom is mounted into the storage section 1208 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211. The computer program executes various functions defined in the system of the present application when executed by a Central Processing Unit (CPU) 1201.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of these units do not, in any case, constitute a limitation on the units themselves.
As another aspect, the present application also provides a computer-readable storage medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable storage medium carries computer readable instructions which, when executed by a processor, implement the method of any of the embodiments described above.
According to an aspect of the present application, there is also provided an electronic device, including: a processor; a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method of any of the above embodiments.
According to an aspect of an embodiment of the present application, there is provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of any of the above embodiments.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present application, the features and functionality of two or more modules or units described above may be embodied in a single module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Claims (13)
1. A method for testing speech identity, comprising:
performing syllable matching on a voice recognition result of a comparison voice and a voice recognition result of a sample voice, and determining a plurality of same syllables of the comparison voice relative to the sample voice; the plurality of same syllables including syllables of at least two syllable types, the syllable types including a character type, a word type, and a phoneme type;
calculating the voiceprint similarity between a first voice segment corresponding to the same syllable and a corresponding second voice segment according to the voice features of the first voice segment corresponding to the same syllable and the voice features of the corresponding second voice segment; the first voice segment is a voice segment corresponding to the same syllable in the comparison voice; the second voice segment is a voice segment corresponding to the same syllable in the sample voice;
and determining the identity test result of the comparison voice and the sample voice according to the voiceprint similarity between the first voice segment corresponding to the same syllable and the second voice segment corresponding to the same syllable.
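The three steps of claim 1 can be illustrated with a short Python sketch. This is a non-authoritative illustration only: the function names, the dictionary-based data shapes, and the majority-vote decision rule are all hypothetical, since the claim does not fix a particular similarity function, threshold, or decision criterion.

```python
def find_same_syllables(comparison_result, sample_result):
    """Syllable matching: return the syllables (with their syllable type)
    that appear in both recognition results."""
    sample_keys = {(s["syllable"], s["type"]) for s in sample_result}
    return [s for s in comparison_result if (s["syllable"], s["type"]) in sample_keys]

def verify_identity(comparison_result, sample_result, similarity_fn, threshold=0.8):
    """Score each matched syllable pair with a caller-supplied voiceprint
    similarity function, then decide whether the two voices come from the
    same person (here: a simple majority of scores above the threshold)."""
    same = find_same_syllables(comparison_result, sample_result)
    scores = [similarity_fn(s) for s in same]  # one score per matched syllable
    matched = sum(score > threshold for score in scores)
    return matched >= max(1, len(scores) // 2)
```

In practice `similarity_fn` would compare the first and second voice segments extracted for that syllable; here it is stubbed out so the control flow of the claim is visible on its own.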
2. A method according to claim 1, wherein before calculating the voiceprint similarity between a first speech segment corresponding to the same syllable and a corresponding second speech segment according to the speech characteristics of the first speech segment corresponding to the same syllable and the speech characteristics of the corresponding second speech segment, the method further comprises:
acquiring the voice characteristics of the first voice section corresponding to the same syllable;
and acquiring the voice characteristics of the second voice section corresponding to the same syllable.
3. The method according to claim 2, wherein the speech recognition result of the comparison speech indicates time position information of each syllable included in the comparison speech;
the obtaining of the voice features of the first voice segment corresponding to the same syllable includes:
according to the time position information of the same syllable in the comparison voice, segment extraction is carried out in the comparison voice to obtain a first voice section corresponding to the same syllable;
and performing voice feature extraction on the first voice section to obtain the voice feature of the first voice section.
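As a concrete illustration of the segment-extraction step in claim 3, the time position information for a same syllable can be used to slice that syllable's samples out of the comparison waveform. This is a minimal sketch under stated assumptions: the 16 kHz sample rate and second-based start/end positions are illustrative choices, not part of the claim.

```python
def extract_segment(waveform, start_s, end_s, sample_rate=16000):
    """Cut out the first voice segment for one same syllable, using the
    time position information (start/end in seconds, assumed format)
    that the speech recognizer reported for that syllable."""
    start = int(start_s * sample_rate)
    end = int(end_s * sample_rate)
    return waveform[start:end]
```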
4. The method according to claim 2, wherein the speech recognition result of the comparison speech indicates time position information of each syllable included in the comparison speech;
the obtaining of the voice features of the first voice segment corresponding to the same syllable includes:
determining a segmented spectrogram corresponding to the first voice segment in the spectrogram of the comparison voice according to the time position information of the same syllable in the comparison voice;
and acquiring the voice features of the first voice section from the segmented voice spectrogram corresponding to the first voice section.
5. The method of claim 1, wherein the speech features comprise speech feature curves and speech feature parameters;
the calculating the voiceprint similarity between the first speech segment corresponding to the same syllable and the corresponding second speech segment according to the speech features of the first speech segment corresponding to the same syllable and the speech features of the corresponding second speech segment includes:
determining the similarity of the characteristic curves according to the voice characteristic curve of the first voice section corresponding to the same syllable and the voice characteristic curve of the second voice section corresponding to the same syllable;
determining the characteristic parameter deviation according to the voice characteristic parameters of the first voice section corresponding to the same syllable and the voice characteristic parameters of the second voice section corresponding to the same syllable;
and determining the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable according to the characteristic curve similarity and the characteristic parameter deviation.
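One possible realisation of the fusion in claim 5 is to map the gap between the two feature curves into a similarity in [0, 1], compute the deviation between the feature parameters, and combine the two with a weight. The specific formulas and the weight `alpha` below are hypothetical illustrations; the claim only requires that both the curve similarity and the parameter deviation contribute to the voiceprint similarity.

```python
def curve_similarity(curve_a, curve_b):
    """Similarity of two voice feature curves: mean absolute gap,
    mapped into (0, 1] (identical curves give 1.0)."""
    diffs = [abs(a - b) for a, b in zip(curve_a, curve_b)]
    return 1.0 / (1.0 + sum(diffs) / len(diffs))

def parameter_deviation(params_a, params_b):
    """Mean absolute deviation between two sets of voice feature parameters."""
    return sum(abs(a - b) for a, b in zip(params_a, params_b)) / len(params_a)

def voiceprint_similarity(curve_a, curve_b, params_a, params_b, alpha=0.7):
    """Illustrative fusion: curve similarity weighted against a penalty
    for parameter deviation."""
    return alpha * curve_similarity(curve_a, curve_b) - (1 - alpha) * parameter_deviation(params_a, params_b)
```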
6. The method according to claim 1, wherein said calculating the voiceprint similarity between the first speech segment corresponding to the same syllable and the corresponding second speech segment according to the speech features of the first speech segment corresponding to the same syllable and the speech features of the corresponding second speech segment comprises:
determining a first voice feature vector of a first voice segment corresponding to the same syllable according to the voice feature of the first voice segment corresponding to the same syllable;
determining a second voice characteristic vector of a second voice segment corresponding to the same syllable according to the voice characteristics of the second voice segment corresponding to the same syllable;
and calculating the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable according to the first voice characteristic vector of the first voice section corresponding to the same syllable and the second voice characteristic vector of the second voice section corresponding to the same syllable.
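A common way to compare the two voice feature vectors of claim 6 is cosine similarity; the claim itself does not mandate this particular vector comparison, so the sketch below is one illustrative choice.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Voiceprint similarity as the cosine of the angle between the first
    and second voice feature vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)
```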
7. The method according to claim 1, wherein said calculating the voiceprint similarity between the first speech segment corresponding to the same syllable and the corresponding second speech segment according to the speech features of the first speech segment corresponding to the same syllable and the speech features of the corresponding second speech segment comprises:
selecting one same syllable from the plurality of same syllables as a candidate same syllable according to the sequence of syllables from long to short;
calculating the voiceprint similarity between the first voice section corresponding to the candidate same syllable and the second voice section corresponding to the candidate same syllable;
the determining the identity test result of the comparison voice and the sample voice according to the voiceprint similarity between the first voice segment corresponding to the same syllable and the second voice segment corresponding to the same syllable comprises:
if the voiceprint similarity between the first voice section corresponding to the candidate same syllable and the second voice section corresponding to the candidate same syllable exceeds a similarity threshold, incrementing an accumulated number;
judging whether the accumulated number reaches a target number;
if the target number is reached, determining the identity test result as a test result indicating that the comparison voice and the sample voice are from the same person;
and if the target number is not reached, returning to the step of selecting one same syllable from the plurality of same syllables as a candidate same syllable according to the sequence of the syllables from long to short.
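The loop of claim 7 — score candidates from the longest syllable down, and stop as soon as enough of them pass the threshold — can be sketched as follows. The threshold and target values are hypothetical defaults; the point of the structure is that short syllables may never need to be scored at all.

```python
def verify_longest_first(same_syllables, similarity_fn, threshold=0.8, target=3):
    """Walk the same syllables in order of length from longest to shortest,
    counting those whose voiceprint similarity exceeds the threshold.
    Return True (same person) as soon as the accumulated number reaches
    the target number; otherwise exhaust the candidates and return False."""
    count = 0
    for syllable in sorted(same_syllables, key=len, reverse=True):
        if similarity_fn(syllable) > threshold:
            count += 1
        if count >= target:
            return True
    return False
```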
8. The method according to claim 1, wherein the determining the identity test result of the comparison voice and the sample voice according to the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable comprises:
determining a target same syllable of which the corresponding voiceprint similarity exceeds a similarity threshold according to the voiceprint similarity between the first voice section corresponding to each same syllable and the second voice section corresponding to each same syllable;
and determining the identity test result of the comparison voice and the sample voice according to the number of the target same syllables.
9. The method according to claim 1, wherein the determining the identity test result of the comparison voice and the sample voice according to the voiceprint similarity between the first speech segment corresponding to the same syllable and the second speech segment corresponding to the same syllable comprises:
determining a target same syllable of which the corresponding voiceprint similarity exceeds a similarity threshold according to the voiceprint similarity between the first voice section corresponding to each same syllable and the second voice section corresponding to each same syllable;
counting the number of the target identical syllables belonging to each syllable type according to the syllable type to which the target identical syllables belong;
calculating the probability that the comparison voice and the sample voice come from the same person according to the first weight corresponding to each syllable type and the number of the target same syllables belonging to each syllable type;
and determining the identity test result of the comparison voice and the sample voice according to the probability that the comparison voice and the sample voice come from the same person.
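Claim 9's type-weighted fusion can be sketched as below. The syllable-type labels, the first-weight values, and the squashing function that turns the weighted count into a probability are all hypothetical; the claim only requires that the probability be computed from the per-type counts of target same syllables and one first weight per syllable type.

```python
def same_person_probability(target_counts, first_weights):
    """Fuse per-syllable-type counts of target same syllables (those whose
    voiceprint similarity exceeded the threshold) into a probability that
    the comparison voice and the sample voice come from the same person."""
    score = sum(first_weights[t] * n for t, n in target_counts.items())
    return score / (1.0 + score)  # illustrative squash into [0, 1)
```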
10. The method of claim 1, wherein syllable matching the speech recognition result of the comparison speech with the speech recognition result of the sample speech to determine a plurality of identical syllables of the comparison speech with respect to the sample speech comprises:
performing syllable matching on a voice recognition result of a comparison voice and a voice recognition result of a sample voice under at least two granularities, and determining the same syllable of the comparison voice relative to the sample voice under each granularity, wherein the granularities comprise character granularity, word granularity, and phoneme granularity;
the determining the identity test result of the comparison voice and the sample voice according to the voiceprint similarity between the first voice segment corresponding to the same syllable and the second voice segment corresponding to the same syllable comprises:
based on the same syllables under each granularity, counting the number of target same syllables of which the corresponding voiceprint similarity exceeds a similarity threshold;
calculating reference probabilities that the comparison speech and the sample speech come from the same person at each granularity according to the number of target same syllables of which the corresponding voiceprint similarity exceeds a similarity threshold at each granularity;
determining a target probability that the comparison voice and the sample voice come from the same person according to the reference probability that the comparison voice and the sample voice come from the same person at each granularity and a second weight corresponding to each granularity;
and determining the identity test result of the comparison voice and the sample voice according to the target probability that the comparison voice and the sample voice come from the same person.
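The final fusion step of claim 10 — a weighted combination of the per-granularity reference probabilities using one second weight per granularity — can be sketched as below. The granularity labels, probability values, and weights are illustrative assumptions.

```python
def fused_probability(reference_probs, second_weights):
    """Combine the per-granularity reference probabilities that the two
    voices come from the same person, using one second weight per
    granularity, into the target probability."""
    return sum(second_weights[g] * p for g, p in reference_probs.items())

# Hypothetical reference probabilities and second weights (weights sum to 1)
reference = {"character": 0.9, "word": 0.8, "phoneme": 0.6}
weights = {"character": 0.3, "word": 0.4, "phoneme": 0.3}
target_probability = fused_probability(reference, weights)
```

The target probability would then be compared against a decision criterion to produce the identity test result.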
11. A device for verifying speech identity, comprising:
the syllable matching module is used for performing syllable matching on the voice recognition result of the comparison voice and the voice recognition result of the sample voice and determining a plurality of same syllables of the comparison voice relative to the sample voice; the plurality of same syllables including syllables of at least two syllable types, the syllable types including a character type, a word type, and a phoneme type;
the voiceprint similarity calculation module is used for calculating the voiceprint similarity between the first voice section corresponding to the same syllable and the corresponding second voice section according to the voice characteristics of the first voice section corresponding to the same syllable and the voice characteristics of the corresponding second voice section; the first voice segment is a voice segment corresponding to the same syllable in the comparison voice; the second speech segment is a speech segment corresponding to the same syllable in the sample speech;
and the identity test result determining module is used for determining the identity test result of the comparison voice and the sample voice according to the voiceprint similarity between the first voice section corresponding to the same syllable and the second voice section corresponding to the same syllable.
12. An electronic device, comprising:
a processor;
a memory having computer-readable instructions stored thereon which, when executed by the processor, implement the method of any one of claims 1-10.
13. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111524105.3A CN113921017A (en) | 2021-12-14 | 2021-12-14 | Voice identity detection method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111524105.3A CN113921017A (en) | 2021-12-14 | 2021-12-14 | Voice identity detection method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113921017A (en) | 2022-01-11 |
Family
ID=79249176
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111524105.3A Pending CN113921017A (en) | 2021-12-14 | 2021-12-14 | Voice identity detection method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113921017A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114255764A (en) * | 2022-02-28 | 2022-03-29 | 深圳市声扬科技有限公司 | Audio information processing method and device, electronic equipment and storage medium |
CN117133271A (en) * | 2023-10-25 | 2023-11-28 | 北京吉道尔科技有限公司 | Block chain-based e-commerce platform shopping and intelligent voice evaluation method and system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101887722A (en) * | 2009-06-18 | 2010-11-17 | 博石金(北京)信息技术有限公司 | Rapid voiceprint authentication method |
CN107680601A (en) * | 2017-10-18 | 2018-02-09 | 深圳势必可赢科技有限公司 | A kind of identity homogeneity method of inspection retrieved based on sound spectrograph and phoneme and device |
CN108492830A (en) * | 2018-03-28 | 2018-09-04 | 深圳市声扬科技有限公司 | Method for recognizing sound-groove, device, computer equipment and storage medium |
CN108766417A (en) * | 2018-05-29 | 2018-11-06 | 广州国音科技有限公司 | A kind of the identity homogeneity method of inspection and device based on phoneme automatically retrieval |
CN110875044A (en) * | 2018-08-30 | 2020-03-10 | 中国科学院声学研究所 | Speaker identification method based on word correlation score calculation |
CN111341300A (en) * | 2020-02-28 | 2020-06-26 | 广州国音智能科技有限公司 | Method, device and equipment for acquiring voice comparison phonemes |
CN111429921A (en) * | 2020-03-02 | 2020-07-17 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
CN111429946A (en) | Voice emotion recognition method, device, medium and electronic equipment | |
CN107665705B (en) | Voice keyword recognition method, device, equipment and computer readable storage medium | |
CN107729313B (en) | Deep neural network-based polyphone pronunciation distinguishing method and device | |
JP4568371B2 (en) | Computerized method and computer program for distinguishing between at least two event classes | |
US10607601B2 (en) | Speech recognition by selecting and refining hot words | |
CN109559735B (en) | Voice recognition method, terminal equipment and medium based on neural network | |
CN113921017A (en) | Voice identity detection method and device, electronic equipment and storage medium | |
CN111783450B (en) | Phrase extraction method and device in corpus text, storage medium and electronic equipment | |
CN116635934A (en) | Unsupervised learning of separate phonetic content and style representations | |
CN112015872A (en) | Question recognition method and device | |
Vuppala et al. | Improved consonant–vowel recognition for low bit‐rate coded speech | |
CN110503956B (en) | Voice recognition method, device, medium and electronic equipment | |
CN114360557A (en) | Voice tone conversion method, model training method, device, equipment and medium | |
WO2021014612A1 (en) | Utterance segment detection device, utterance segment detection method, and program | |
US11037583B2 (en) | Detection of music segment in audio signal | |
CN113112992B (en) | Voice recognition method and device, storage medium and server | |
Kashani et al. | Sequential use of spectral models to reduce deletion and insertion errors in vowel detection | |
CN110675858A (en) | Terminal control method and device based on emotion recognition | |
CN115662473A (en) | Emotion recognition method and device based on voice data and electronic equipment | |
Płonkowski | Using bands of frequencies for vowel recognition for Polish language | |
CN114530142A (en) | Information recommendation method, device and equipment based on random forest and storage medium | |
CN113035230A (en) | Authentication model training method and device and electronic equipment | |
Karpov | Efficient speaker recognition for mobile devices | |
Mittal et al. | Classical and deep learning data processing techniques for speech and speaker recognitions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||