CN111739547A - Voice matching method and device, computer equipment and storage medium

Voice matching method and device, computer equipment and storage medium

Info

Publication number: CN111739547A (application); CN111739547B (granted)
Application number: CN202010719805.7A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: voice, voiceprint, restored, speech, transformation model
Inventors: 张伟彬 (Zhang Weibin), 丁俊豪 (Ding Junhao)
Assignee (original and current): Voiceai Technologies Co., Ltd.
Legal status: Granted; Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality characterised by the process used
    • G10L21/013: Adapting to target pitch
    • G10L2021/0135: Voice conversion or morphing
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/06: Decision making techniques; pattern matching strategies
    • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/15: the extracted parameters being formant information
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: using neural networks
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/90: Pitch determination of speech signals

Abstract

The application relates to a voice matching method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring voice-changed speech to be matched; restoring the voice-changed speech through a voice feature transformation model to obtain restored speech; performing voiceprint comparison between the original speech of a suspect and the restored speech; when the voiceprint comparison result is a mismatch, adjusting the parameter values of the parameters in the voice feature transformation model and returning to the step of restoring the voice-changed speech through the voice feature transformation model to iterate, until the voiceprint comparison result is a match or an iteration stop condition is met; and determining the matching result of the voice-changed speech and the original speech according to the voiceprint comparison result when iteration stops. This method improves the efficiency of voice matching.

Description

Voice matching method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for voice matching, a computer device, and a storage medium.
Background
With the development of computer technology, voice-changing technology has emerged: it alters the signal characteristics of speech, turning a speaker's speech into voice-changed speech. In some scenarios, for example when public security officers handle a case, it is necessary to restore the voice-changed speech and match the restored speech against the original speech of a suspect to determine whether the speaker of the voice-changed speech is the suspect.
In the conventional approach, when a voice feature transformation model is used to restore voice-changed speech in order to judge whether its speaker is a suspect, several groups of typical signal-feature transformation parameters of the model are tried and fine-tuned manually to find the model parameters that bring the restored speech closest to the original speech. Adjusting the model parameters by manual trial and error and then matching the restored speech against the original speech with the adjusted model is cumbersome, labor-intensive, and inefficient.
Disclosure of Invention
In view of the above, it is necessary to provide a voice matching method, apparatus, computer device and storage medium capable of improving the efficiency of voice matching.
A method of speech matching, the method comprising:
acquiring voice-changed speech to be matched;
restoring the voice-changed speech through a voice feature transformation model to obtain restored speech;
performing voiceprint comparison between the original speech of a suspect and the restored speech;
when the voiceprint comparison result is a mismatch, adjusting parameter values of parameters in the voice feature transformation model and returning to the step of restoring the voice-changed speech through the voice feature transformation model to iterate, until the voiceprint comparison result is a match or an iteration stop condition is met;
and determining a matching result of the voice-changed speech and the original speech according to the voiceprint comparison result when iteration stops.
In one embodiment, before the restoring of the voice-changed speech through the voice feature transformation model to obtain the restored speech, the method further includes:
determining at least one parameter of the voice feature transformation model, the at least one parameter characterizing at least one speech feature used to restore the voice-changed speech;
selecting an initial parameter value for each parameter of the voice feature transformation model;
and establishing the voice feature transformation model according to its parameters and their initial parameter values.
In one embodiment, the voiceprint comparison between the original speech of the suspect and the restored speech comprises:
performing high-pass filtering on the original speech of the suspect and on the restored speech;
segmenting the high-pass-filtered original speech and restored speech;
and performing voiceprint comparison between the segmented original speech and the segmented restored speech.
In one embodiment, the voiceprint comparing the original voice of the suspect with the restored voice comprises:
acquiring a first voiceprint feature of the original speech and a second voiceprint feature of the restored speech;
calculating a voiceprint comparison score between the first voiceprint feature and the second voiceprint feature;
when the voiceprint comparison score is higher than or equal to a score threshold, the voiceprint comparison result is a match;
and when the voiceprint comparison score is lower than the score threshold, the voiceprint comparison result is a mismatch.
In one embodiment, the adjusting the parameter values of the parameters in the speech feature transformation model comprises:
determining a target interval for adjusting parameter values of parameters in the voice feature transformation model;
searching the target interval for a target parameter value that yields a higher voiceprint comparison score between the restored speech and the original speech;
determining the target parameter value as an adjusted parameter value of the corresponding parameter in the speech feature transformation model.
In one embodiment, the determining a target interval for adjusting a parameter value of a parameter in the speech feature transformation model includes:
acquiring the interval length for adjusting the parameter value of the parameter;
and determining a target interval corresponding to the parameter according to the interval length by taking the current parameter value of the parameter in the voice feature transformation model as a center.
In one embodiment, the obtaining the first voiceprint feature of the original speech and the second voiceprint feature of the restored speech includes:
extracting frame-level features of the original speech, and computing sentence-level features of the original speech from the frame-level features;
obtaining the first voiceprint feature from the frame-level and sentence-level features of the original speech;
extracting frame-level features of the restored speech, and computing sentence-level features of the restored speech from the frame-level features;
and obtaining the second voiceprint feature from the frame-level and sentence-level features of the restored speech.
A speech matching apparatus, the apparatus comprising:
the acquisition module is used for acquiring the voice-changed speech to be matched;
the restoring module is used for restoring the voice-changed speech through the voice feature transformation model to obtain restored speech;
the voiceprint comparison module is used for performing voiceprint comparison between the original speech of the suspect and the restored speech;
the adjusting module is used for adjusting parameter values of parameters in the voice feature transformation model when the voiceprint comparison result is a mismatch and returning to the step of restoring the voice-changed speech through the voice feature transformation model to iterate, until the voiceprint comparison result is a match or an iteration stop condition is met;
and the determining module is used for determining a matching result of the voice-changed speech and the original speech according to the voiceprint comparison result when iteration stops.
In one embodiment, the apparatus further comprises:
the determining module is further used for determining at least one parameter of the voice feature transformation model, the at least one parameter characterizing at least one speech feature used to restore the voice-changed speech;
the selection module is used for respectively selecting initial parameter values of the parameters of the voice feature transformation model;
and the establishing module is used for establishing a voice characteristic transformation model according to the parameters of the voice characteristic transformation model and the initial parameter values of the parameters.
In one embodiment, the voiceprint comparison module is further configured to:
performing high-pass filtering on the original speech of the suspect and on the restored speech;
segmenting the high-pass-filtered original speech and restored speech;
and performing voiceprint comparison between the segmented original speech and the segmented restored speech.
In one embodiment, the voiceprint comparison module is further configured to:
acquiring a first voiceprint feature of the original voice and a second voiceprint feature of the restored voice;
calculating a voiceprint comparison score of the first voiceprint feature and the second voiceprint feature;
when the voiceprint comparison score is higher than or equal to a score threshold, the voiceprint comparison result is a match;
and when the voiceprint comparison score is lower than the score threshold, the voiceprint comparison result is a mismatch.
In one embodiment, the adjustment module is further configured to:
determining a target interval for adjusting parameter values of parameters in the voice feature transformation model;
searching the target interval for a target parameter value that yields a higher voiceprint comparison score between the restored speech and the original speech;
determining the target parameter value as an adjusted parameter value of the corresponding parameter in the speech feature transformation model.
In one embodiment, the determining module is further configured to:
acquiring the interval length for adjusting the parameter value of the parameter;
and determining a target interval corresponding to the parameter according to the interval length by taking the current parameter value of the parameter in the voice feature transformation model as the center.
In one embodiment, the voiceprint comparison module is further configured to:
extracting frame-level features of the original speech, and computing sentence-level features of the original speech from the frame-level features;
obtaining the first voiceprint feature from the frame-level and sentence-level features of the original speech;
extracting frame-level features of the restored speech, and computing sentence-level features of the restored speech from the frame-level features;
and obtaining the second voiceprint feature from the frame-level and sentence-level features of the restored speech.
A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the speech matching method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech matching method.
In the above embodiments, the computer device restores the voice-changed speech to be matched through the voice feature transformation model to obtain restored speech, automatically adjusts the parameter values of the parameters in the voice feature transformation model according to the result of the voiceprint comparison between the original speech and the restored speech, and finally determines the matching result of the voice-changed speech and the original speech. During restoration, automatically adjusting the parameter values lets the computer device quickly obtain the restored speech closest to the original speech and derive the matching result from it, which improves the efficiency of voice matching.
Drawings
FIG. 1 is a flow diagram illustrating a method for speech matching in one embodiment;
FIG. 2 is a flow diagram illustrating the process of obtaining unvoiced speech according to one embodiment;
FIG. 3 is a schematic flow chart of obtaining voiceprint comparison results in one embodiment;
FIG. 4 is a schematic flow chart of obtaining voiceprint comparison results in another embodiment;
FIG. 5 is a block diagram showing the structure of a speech matching apparatus according to an embodiment;
FIG. 6 is a block diagram showing the construction of a speech matching apparatus according to another embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment;
fig. 8 is an internal structural view of a computer device in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
According to the voice matching method, the computer device restores the acquired voice-changed speech through the voice feature transformation model to obtain restored speech, and performs voiceprint comparison between the restored speech and the original speech of the suspect. When the voiceprint comparison result is a mismatch, the computer device adjusts the parameter values of the parameters in the voice feature transformation model and returns to the step of restoring the voice-changed speech through the voice feature transformation model to iterate, until the voiceprint comparison result is a match or an iteration stop condition is met. The computer device then determines whether the voice-changed speech matches the original speech according to the voiceprint comparison result when iteration stops. The computer device may be a terminal or a server. The terminal can be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device; the server can be an independent server or a cluster of servers.
In one embodiment, as shown in fig. 1, a speech matching method is provided, which is described by taking the method as an example applied to a computer device, and comprises the following steps:
S102, the computer device acquires the voice-changed speech to be matched.
Voice-changed speech is obtained by altering the speech features of a speech signal through a voice-changing algorithm. Speech features include segmental features and suprasegmental features. Segmental features reflect the timbre of the speech; they are mainly related to the physiological and physical characteristics of the articulatory organs, are relatively stable, and are hard to change in a short time; examples include pitch, speaking rate, and formants. Suprasegmental features reflect the prosodic characteristics of the speech; they are mainly shaped by social and psychological factors and are unstable, including the rhythm and naturalness of the language.
Suppose x is the input speech before voice changing, y is the voice-changed speech, and θ represents the speech features altered in the voice-changing process. The computer device transforms the speech x into the voice-changed speech y = f_θ(x), where f_θ denotes a voice-changing function with parameter θ: the speech output by f_θ is changed in the features that θ parameterizes. θ may contain a single parameter (e.g., pitch, speaking rate, rhythm, or formant) or multiple parameters (e.g., both pitch and speaking rate, or pitch, prosody, and formant together). The computer device can restore the voice-changed speech y through the inverse function f_θ⁻¹, obtaining the restored speech f_θ⁻¹(y). Since f_θ⁻¹ also takes the parameter θ, obtaining the parameter value of θ yields f_θ⁻¹.
S104, the computer device restores the voice-changed speech through the voice feature transformation model to obtain the restored speech.
Here the voice feature transformation model is the function f_θ⁻¹ that the computer device establishes for the speech features θ involved in the voice-changing process, where y is the voice-changed speech and θ may contain a single parameter (e.g., pitch, speaking rate, rhythm, or formant) or multiple parameters (e.g., both pitch and speaking rate, or pitch, rhythm, and formant together).
In one embodiment, building the voice feature transformation model includes: determining at least one parameter of the model, the at least one parameter characterizing at least one speech feature used to restore the voice-changed speech; selecting an initial parameter value for each parameter; and establishing the voice feature transformation model from the parameters and their initial parameter values.
In one embodiment, the computer device determines the parameter of the voice feature transformation model to be the pitch frequency, which characterizes the pitch of the restored speech.
In one embodiment, the computer device determines the parameters of the voice feature transformation model to be the number of accents and the sound frequency, which characterize the prosody and the naturalness of the restored speech respectively.
In one embodiment, the computer device determines the parameters of the voice feature transformation model to be the pitch frequency, the formant frequency, the formant frequency bandwidth, and the formant amplitude, which characterize the pitch and formant features of the restored speech. After determining the parameters, the computer device selects an initial pitch frequency of 200 Hz, an initial formant frequency of 1000 Hz, an initial formant frequency bandwidth of 500 Hz, and an initial formant amplitude of 15 dB. The computer device sets the parameter θ of f_θ⁻¹ to these quantities and establishes the voice feature transformation model from the selected initial parameter values.
S106, the computer device performs voiceprint comparison between the original speech of the suspect and the restored speech.
The original speech of the suspect is speech, captured by a voice acquisition device, of the speaker suspected of producing the voice-changed speech to be matched. For example, if the public security officers suspect that the speaker of the voice-changed speech is Zhang San, the original speech consists of one or more recordings of Zhang San made by the officers. The content of the original speech may be the same as or different from that of the voice-changed speech.
By this voiceprint comparison, the computer device can judge whether the original speech and the restored speech match.
S108, when the voiceprint comparison result is a mismatch, the computer device adjusts the parameter values of the parameters in the voice feature transformation model and returns to the step of restoring the voice-changed speech through the voice feature transformation model to iterate, until the voiceprint comparison result is a match or an iteration stop condition is met.
The iteration stop condition is a condition the computer device sets for terminating the iterative process. For example: the number of iterations reaches a preset value; all parameter values in a preselected range have been traversed; or the difference between the voiceprint comparison scores of the restored speech and the original speech before and after an iteration is smaller than a preset threshold.
The computer device performs voiceprint comparison between the restored speech produced by the voice feature transformation model and the original speech of the suspect; if the result is a mismatch, it adjusts the parameter values of the parameters in the model accordingly. After adjusting the parameter values, the computer device again restores the voice-changed speech with the model and compares the restored speech against the original speech by voiceprint, then either determines the matching result from the comparison or continues adjusting the parameter values, until the voiceprint comparison result is a match or the iteration stop condition is met.
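A sketch of this S104 to S110 loop is given below. The callables restore, compare, and adjust stand in for the voice feature transformation model, the voiceprint comparison, and the parameter adjustment described above; the 80-point threshold, the iteration cap, and the minimum score improvement are illustrative assumptions:

def match_voices(changed_speech, original_speech, restore, compare, adjust,
                 score_threshold=80.0, max_iterations=50):
    """Restore, compare voiceprints, adjust parameters, and stop on a
    match or on an iteration-stop condition."""
    params = None       # restore() is assumed to use initial values when None
    prev_score = None
    for _ in range(max_iterations):              # stop condition: iteration cap
        restored = restore(changed_speech, params)
        score = compare(original_speech, restored)  # voiceprint comparison score
        if score >= score_threshold:
            return True, score                   # voiceprint result: match
        if prev_score is not None and abs(score - prev_score) < 1e-3:
            break                                # stop condition: score stops improving
        prev_score = score
        params = adjust(params, score)           # adjust theta and iterate again
    return False, prev_score                     # iteration stopped without a match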
S110, the computer device determines the matching result of the voice-changed speech and the original speech according to the voiceprint comparison result when iteration stops.
If, when iteration stops, the voiceprint comparison between the original speech and the speech restored by the voice feature transformation model is a match, the computer device determines that the voice-changed speech matches the original speech; if the iteration stop condition is met but that comparison is a mismatch, the computer device determines that the voice-changed speech does not match the original speech.
In the above embodiments, the computer device restores the voice-changed speech to be matched through the voice feature transformation model to obtain restored speech, automatically adjusts the parameter values of the parameters in the voice feature transformation model according to the result of the voiceprint comparison between the original speech and the restored speech, and finally determines the matching result of the voice-changed speech and the original speech. During restoration, automatically adjusting the parameter values lets the computer device quickly obtain the restored speech closest to the original speech and derive the matching result from it, which improves the efficiency of voice matching.
In one embodiment, the voice-changed speech is generated by changing the pitch of normal speech; changing the pitch of speech also changes its pitch frequency. The computer device obtains the pitch-changed speech by altering the pitch frequency of normal speech with a resampling-based method. As shown in fig. 2, the computer device obtains the voice-changed speech through this pitch-changing method in the following steps:
S202, acquire the speech whose pitch is to be changed and the speed-change factor Q/P.
S204, apply speed-change processing to the speech according to the speed-change factor Q/P to obtain speed-changed speech.
S206, up-sample the speed-changed speech by a factor of P according to the obtained resampling factor P/Q.
S208, down-sample by a factor of Q according to the resampling factor, obtaining voice-changed speech whose pitch frequency is Q/P times that of the normal speech.
Resampling speech stretches or compresses its spectrum, so the change in sampling frequency before and after resampling corresponds to the change in pitch frequency. Let the ratio of the new sampling frequency to the original be P/Q, where P/Q is a fraction in lowest terms and P and Q are the up-sampling and down-sampling factors respectively. Because the up- and down-sampling factors differ, after P/Q-fold resampling the duration of the speech becomes P/Q times the original. To keep the speaking rate consistent, the speech is therefore speed-changed before resampling so that its duration becomes Q/P times the original, which cancels the duration change introduced by resampling.
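Under the assumption that librosa is available for the time-stretch and resampling steps, S202 to S208 might be sketched as follows; the returned samples are meant to be played back at the original rate sr:

import numpy as np
import librosa

def change_pitch_by_resampling(y: np.ndarray, sr: int, P: int, Q: int) -> np.ndarray:
    """Sketch of S202-S208: return speech whose pitch frequency is Q/P times
    the input's while the overall duration stays unchanged. P/Q is the
    resampling factor in lowest terms."""
    # S204: speed-change step; stretch the duration to Q/P times the original
    # so that the P/Q duration scaling introduced by resampling cancels out
    y_stretched = librosa.effects.time_stretch(y, rate=P / Q)
    # S206 + S208: resample by P/Q (conceptually, up-sample by P and then
    # down-sample by Q); treating the result as if it were still sampled at
    # sr scales the spectrum, and hence the pitch frequency, by Q/P
    return librosa.resample(y_stretched, orig_sr=sr, target_sr=sr * P // Q)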
In one embodiment, the voiceprint comparison between the original speech of the suspect and the restored speech includes: performing high-pass filtering on the original speech of the suspect and on the restored speech; segmenting the high-pass-filtered original speech and restored speech; and performing voiceprint comparison between the segmented original speech and the segmented restored speech.
The purpose of high-pass filtering the original speech and the restored speech is to pre-emphasize the high-frequency components of the speech. Because speech energy is concentrated in the low band while the high band carries little energy, the output signal-to-noise ratio in the high band is clearly insufficient and high-frequency information is hard to recover. Boosting the high band and its resolution allows the voiceprint features of the original and restored speech to be extracted more effectively.
Speech is a time-series signal and is not stationary macroscopically, but its production is closely tied to the movement of the articulatory organs, whose inertia makes their state change much more slowly than the acoustic vibration; the voiceprint characteristics of speech can therefore be regarded as essentially unchanged over a short period. The computer device accordingly divides the original speech and the restored speech into several segments and extracts the voiceprint features of each segment separately, for example cutting each into segments of 15 ms, 30 ms, or 40 ms.
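A common way to realize these two preprocessing steps is sketched below; the 0.97 pre-emphasis coefficient and the 30 ms default segment length are conventional choices assumed here, not values fixed by this application:

import numpy as np

def pre_emphasize(y: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass (pre-emphasis) filter y[n] - alpha * y[n-1],
    boosting the weak high-frequency band before feature extraction."""
    return np.append(y[0], y[1:] - alpha * y[:-1])

def segment(y: np.ndarray, sr: int, frame_ms: float = 30.0) -> np.ndarray:
    """Cut the filtered speech into short segments (e.g. 15, 30, or 40 ms)
    over which the voiceprint is treated as quasi-stationary."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(y) // frame_len
    return y[: n_frames * frame_len].reshape(n_frames, frame_len)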
In one embodiment, the voiceprint comparison between the original speech of the suspect and the restored speech includes: acquiring a first voiceprint feature of the original speech and a second voiceprint feature of the restored speech; calculating a voiceprint comparison score between the first and second voiceprint features; when the score is higher than or equal to a score threshold, the voiceprint comparison result is a match; and when the score is lower than the score threshold, the result is a mismatch.
Voiceprint features are individualized physiological characteristics that represent a speaker's vocal traits, and they are unique to the speaker. They include: (1) features tied to the articulation mechanism of human physiology, such as spectrum, cepstrum, formants, pitch frequency, and reflection coefficients; (2) lexical features associated with socioeconomic status and education level, such as a speaker's preferred spoken or written expressions; (3) prosody and speaking rate; and (4) language, dialect, and accent features.
The computer device extracts the first voiceprint feature from the original speech of the suspect and the second voiceprint feature from the restored speech. Common extraction methods include Mel-frequency cepstral coefficients, linear prediction cepstral coefficients, and deep-learning-based voiceprint feature extraction.
After obtaining the first and second voiceprint features, the computer device computes their voiceprint comparison score. Common scoring methods include the i-vector and x-vector approaches.
The computer device then compares the voiceprint comparison score against a score threshold to determine the comparison result. For example, with scores out of 100 and a threshold of 80, a score of 80 or above means the voiceprint comparison result is a match, and a score below 80 means a mismatch.
Each individual's voiceprint characteristics are distinct and remain stable after adulthood; even a speaker who deliberately imitates another person's voice and tone cannot make the voiceprint features identical. The matching result of the original speech and the restored speech can therefore be determined quickly from the voiceprint features, with high accuracy.
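For illustration, once two voiceprint embeddings have been extracted (for example by an i-vector or x-vector front end), the scoring and thresholding step might look like the following sketch; mapping cosine similarity onto a 0-100 scale is an assumption made here so that the 80-point threshold of the example above applies:

import numpy as np

def voiceprint_score(emb_original: np.ndarray, emb_restored: np.ndarray) -> float:
    """Cosine similarity between two voiceprint embeddings, mapped to a
    0-100 score; a stand-in for the scoring back end."""
    cos = float(np.dot(emb_original, emb_restored) /
                (np.linalg.norm(emb_original) * np.linalg.norm(emb_restored)))
    return 50.0 * (cos + 1.0)  # [-1, 1] -> [0, 100]

def is_match(score: float, threshold: float = 80.0) -> bool:
    """Apply the embodiment's rule: match iff score >= threshold."""
    return score >= threshold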
In one embodiment, a computer device extracts the first voiceprint feature of the original speech and the second voiceprint feature of the restored speech with a voiceprint recognition engine and obtains their voiceprint comparison score from it; the engine encapsulates the algorithms for extracting voiceprint features and computing voiceprint comparison scores.
The process of the computer device obtaining the result of comparing the voiceprint of the original voice and the restored voice by using the voiceprint recognition engine is shown in fig. 3, and comprises the following steps:
s302, original voice and restored voice are obtained.
S304, inputting the original voice and the restored voice into a voiceprint recognition engine.
S306, obtaining a voiceprint comparison score of the first voiceprint feature of the original voice output by the voiceprint recognition engine and the second voiceprint feature of the restored voice.
S308, judging whether the voiceprint comparison score is higher than or equal to a score threshold value, and if the voiceprint comparison score is higher than or equal to the score threshold value, executing S310; if the score of voiceprint comparison is lower than the score threshold, S312 is executed.
S310, determining that the result of the voiceprint comparison is matching.
S312, determining that the result of the voiceprint comparison is a mismatch.
The specific contents of S302 to S312 may refer to the specific implementation process described above.
In one embodiment, taking the pitch as the parameter in the voice feature transformation model, the flow by which the computer device adjusts the pitch parameter value and determines the matching result of the voice-changed speech and the original speech is shown in fig. 4 and includes the following steps:
s402, obtaining the current parameter value of the tone parameter to be adjusted in the voice characteristic transformation model.
S404, determining a target interval and searching a target parameter value of the tone in the target interval.
And S406, determining the target parameter value as the parameter value of the tone in the voice feature transformation model.
S408, the pitch of the voice-changed speech is restored through the voice feature transformation model, and the restored speech is obtained.
S410, obtaining a first voiceprint feature of the original voice of the suspect and a second voiceprint feature of the restored voice.
S412, calculating a voiceprint comparison score of the first voiceprint feature and the second voiceprint feature.
S414, determining whether the voiceprint comparison score is higher than or equal to the score threshold, if so, executing S416; if the voiceprint comparison score is below the score threshold, S418 is performed.
And S416, determining the result of the voiceprint comparison as matching.
S418, judging whether an iteration stopping condition is met, and executing S420 if the iteration stopping condition is met; if the iteration stop condition is not satisfied, returning to S404, re-determining the target interval, and searching for the target parameter value of the tone in the target interval.
S420, determining that the result of the voiceprint comparison is a mismatch.
The specific contents of S402 to S420 may refer to the specific implementation process described above.
In one embodiment, adjusting the parameter values of the parameters in the voice feature transformation model comprises: determining a target interval for adjusting the parameter value of a parameter; searching the target interval for a target parameter value that yields a higher voiceprint comparison score between the restored speech and the original speech; and taking the target parameter value as the adjusted value of the corresponding parameter in the model.
To search the target interval for such a target parameter value, the computer device generates candidate parameter values within the interval at a preset step size according to a search algorithm, and then evaluates each candidate for a higher voiceprint comparison score until all candidate parameter values in the target interval have been traversed.
In one embodiment, the computer device gradually narrows the target interval in preset steps according to the search results, so that the voiceprint comparison score obtained with the mean parameter value of the narrowed interval is higher than that obtained with the mean of the interval before narrowing. When the interval length falls below a preset length, the mean of the interval is selected as the target parameter value.
In one embodiment, the computer device selects candidate parameter values within the target interval at a preset step size, and then uses a search algorithm to find, among the candidates, the target parameter value that maximizes the voiceprint comparison score between the restored speech and the original speech. In one embodiment, the computer device performs this search with a bisection (binary search) algorithm.
Take the parameter θ of the voice feature transformation model f_θ⁻¹ to be the pitch, for example. The computer device first determines the target interval [a, b] for adjusting the parameter value of θ, i.e., the adjusted θ lies within [a, b]. The computer device then calculates the midpoint m = (a + b) / 2 of a and b and, from m, computes θ₁ = (a + m) / 2 and θ₂ = (m + b) / 2. With the pitch parameter value set to θ₁ and θ₂ in turn, the voice feature transformation model restores the voice-changed speech y, giving the restored speech s₁ = f_θ₁⁻¹(y) and s₂ = f_θ₂⁻¹(y) respectively. The computer device then calculates the voiceprint comparison scores of s₁ and s₂ against the original speech. If the score of s₁ is higher than or equal to the score of s₂, the computer device continues the bisection search for a higher-scoring parameter value within [a, m]; if the score of s₁ is lower than the score of s₂, the computer device continues the bisection search within [m, b].
In another embodiment, the computer device may further search for parameter values of parameters in the speech feature transformation model by a sequential search algorithm, a block search algorithm, a hash table search algorithm, a binary tree search algorithm, or the like.
In one embodiment, determining the target interval for adjusting the parameter value of a parameter in the voice feature transformation model includes: acquiring the interval length over which the parameter value is adjusted, and determining the target interval for the parameter from that interval length, centered on the parameter's current value in the model. For example, take the parameter θ of f_θ⁻¹ to be the pitch, expressed as the pitch frequency. If the current parameter value of θ in the model is 360 Hz and the computer device obtains an interval length of 200 Hz for adjusting θ, then, centering the 200 Hz interval at 360 Hz, the target interval for adjusting θ is [260 Hz, 460 Hz].
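The interval construction and the bisection search of the preceding example can be sketched together as follows; score_fn is an assumed callable that restores the voice-changed speech with the given parameter value and returns its voiceprint comparison score against the original speech, and the 1.0 minimum interval length is illustrative:

def target_interval(current_value: float, interval_length: float):
    """Centre the search interval on the current parameter value, as in the
    360 Hz / 200 Hz example: returns (260.0, 460.0) for those inputs."""
    half = interval_length / 2.0
    return current_value - half, current_value + half

def bisection_search_parameter(score_fn, low: float, high: float,
                               min_length: float = 1.0) -> float:
    """Repeatedly score the quarter points of [low, high] and keep the half
    interval whose candidate scores better, as in the worked example."""
    while high - low > min_length:
        mid = (low + high) / 2.0
        theta1, theta2 = (low + mid) / 2.0, (mid + high) / 2.0
        if score_fn(theta1) >= score_fn(theta2):
            high = mid  # keep [low, mid], which contains theta1
        else:
            low = mid   # keep [mid, high], which contains theta2
    return (low + high) / 2.0

# Hypothetical usage: search the pitch frequency around the current 360 Hz value.
low, high = target_interval(360.0, 200.0)  # (260.0, 460.0)
# best = bisection_search_parameter(score_fn, low, high)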
In one embodiment, acquiring the first voiceprint feature of the original speech and the second voiceprint feature of the restored speech comprises: extracting frame-level features of the original speech and computing its sentence-level features from them; obtaining the first voiceprint feature from the frame-level and sentence-level features of the original speech; extracting frame-level features of the restored speech and computing its sentence-level features from them; and obtaining the second voiceprint feature from the frame-level and sentence-level features of the restored speech.
Frame-level features comprise the speech features of individual frames together with the temporal correlations of those features across frames; sentence-level features are obtained by averaging over the frame-level features.
In one embodiment, the computer device feeds the original speech and the restored speech into a neural network, extracts their frame-level features through the network, passes those features to the network's statistics layer, and computes the mean and standard deviation of the frame-level features in that layer to obtain the sentence-level features.
Because the original speech and the restored speech are time-series signals whose extracted features differ from moment to moment, extracting frame-level features lets the computer device capture per-frame information and inter-frame temporal correlations, giving a finer description of both signals. Sentence-level features are then obtained by averaging the frame-level features. Since the voiceprint features of the original and restored speech include both frame-level and sentence-level features, they carry more comprehensive information.
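The statistics layer described above can be sketched as a simple pooling operation; the feature dimensions in the usage line are arbitrary example values:

import numpy as np

def statistics_pooling(frame_features: np.ndarray) -> np.ndarray:
    """Collapse frame-level features of shape (n_frames, feat_dim) into a
    sentence-level vector by concatenating the per-dimension mean and
    standard deviation, as in the statistics layer described above."""
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])

# e.g. 200 frames of 24-dimensional features -> one 48-dimensional
# sentence-level vector; frame- and sentence-level features together
# make up the voiceprint feature.
sentence_vec = statistics_pooling(np.random.randn(200, 24))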
It should be understood that although the steps in the flowcharts of figs. 1-4 are shown sequentially, as indicated by the arrows, they are not necessarily performed in that order: unless explicitly stated otherwise, the steps have no strict ordering and may be performed in other orders. Moreover, at least some of the steps in figs. 1-4 may comprise multiple sub-steps or stages, which need not be completed at the same time or executed sequentially, but may be performed in turn or in alternation with other steps or with sub-steps of other steps.
In one embodiment, as shown in fig. 5, there is provided a voice matching apparatus including: an obtaining module 502, a restoring module 504, a voiceprint comparison module 506, an adjusting module 508, and a determining module 510, wherein:
an obtaining module 502, configured to obtain a sound-changing voice to be matched;
the restoring module 504 is configured to restore the voice-changed speech through the voice feature transformation model to obtain restored speech;
a voiceprint comparison module 506, configured to perform voiceprint comparison between the original speech of the suspect and the restored speech;
an adjusting module 508, configured to adjust parameter values of parameters in the voice feature transformation model when the voiceprint comparison result is a mismatch and return to the step of restoring the voice-changed speech through the voice feature transformation model to iterate, until the voiceprint comparison result is a match or an iteration stop condition is met;
and the determining module 510 is configured to determine a matching result between the voice-changed speech and the original speech according to the voiceprint comparison result when iteration stops.
In the above embodiments, the computer device restores the voice-changed speech to be matched through the voice feature transformation model to obtain restored speech, automatically adjusts the parameter values of the parameters in the voice feature transformation model according to the result of the voiceprint comparison between the original speech and the restored speech, and finally determines the matching result of the voice-changed speech and the original speech. During restoration, automatically adjusting the parameter values lets the computer device quickly obtain the restored speech closest to the original speech and derive the matching result from it, which improves the efficiency of voice matching.
In one embodiment, as shown in fig. 6, the apparatus further comprises:
a determining module 510, further configured to determine at least one parameter of the voice feature transformation model, the at least one parameter characterizing at least one speech feature used to restore the voice-changed speech;
a selecting module 512, configured to select initial parameter values of parameters of the speech feature transformation model respectively;
and a building module 514, configured to build the voice feature transformation model according to the parameters of the voice feature transformation model and the initial parameter values of the parameters.
In one embodiment, the voiceprint comparison module 506 is further configured to:
perform high-pass filtering on the original speech of the suspect and on the restored speech;
segment the high-pass-filtered original speech and restored speech;
and perform voiceprint comparison between the segmented original speech and the segmented restored speech.
In one embodiment, the voiceprint comparison module 506 is further configured to:
acquiring a first voiceprint feature of original voice and a second voiceprint feature of restored voice;
calculating a voiceprint comparison score of the first voiceprint characteristic and the second voiceprint characteristic;
when the voiceprint comparison score is higher than or equal to the score threshold, the voiceprint comparison result is a match;
and when the voiceprint comparison score is lower than the score threshold, the voiceprint comparison result is a mismatch.
In one embodiment, the adjustment module 508 is further configured to:
determining a target interval for adjusting parameter values of parameters in the voice feature transformation model;
search the target interval for a target parameter value that yields a higher voiceprint comparison score between the restored speech and the original speech;
determining the target parameter value as the adjusted parameter value of the corresponding parameter in the speech feature transformation model.
In one embodiment, the determining module 510 is further configured to:
acquiring the interval length for adjusting the parameter value of the parameter;
and taking the current parameter value of the parameter in the voice feature transformation model as the center, and determining a target interval corresponding to the parameter according to the interval length.
In one embodiment, the voiceprint comparison module 506 is further configured to:
extracting frame-level features of the original speech, and aggregating them to obtain sentence-level features of the original speech;
obtaining the first voiceprint feature from the frame-level and sentence-level features of the original speech;
extracting frame-level features of the restored speech, and aggregating them to obtain sentence-level features of the restored speech;
and obtaining the second voiceprint feature from the frame-level and sentence-level features of the restored speech.
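As an illustration of this step, the sketch below uses MFCCs as stand-in frame-level features and mean/standard-deviation pooling as the operation that yields sentence-level features; both choices, and returning the pooled vector alone as the voiceprint feature, are assumptions rather than the embodiment's fixed design.

```python
# A sketch of frame-level feature extraction and pooling into an
# utterance-level voiceprint feature. MFCCs and mean/std pooling are
# illustrative stand-ins for the embodiment's features and operation.
import numpy as np
import librosa


def voiceprint_feature(y, sr):
    frames = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # (20, n_frames)
    sentence = np.concatenate([frames.mean(axis=1),       # pool over time
                               frames.std(axis=1)])
    return sentence  # fixed-length vector used for voiceprint comparison


# first_vp  = voiceprint_feature(original, sr)
# second_vp = voiceprint_feature(restored, sr)
```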
For the specific definition of the voice matching apparatus, refer to the definition of the voice matching method above, which is not repeated here. Each module in the voice matching apparatus can be implemented wholly or partially in software, hardware, or a combination of the two. The modules can be embedded in hardware form in, or be independent of, the processor of the computer device, or be stored in software form in the memory of the computer device, so that the processor can invoke them and execute the corresponding operations.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program from the non-volatile storage medium. The database of the computer device stores voice matching data. The network interface of the computer device communicates with external terminals over a network connection. The computer program, when executed by the processor, implements the voice matching method.
In one embodiment, a computer device is provided, which may be a terminal whose internal structure may be as shown in fig. 8. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program from the non-volatile storage medium. The communication interface of the computer device performs wired or wireless communication with external terminals; wireless communication can be implemented through Wi-Fi, a carrier network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements the voice matching method. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, or mouse.
It will be appreciated by those skilled in the art that the structures shown in figs. 7 and 8 are only block diagrams of part of the structure relevant to the present disclosure and do not limit the computer device to which the present disclosure may be applied; a particular computer device may include more or fewer components than shown in the figures, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program: acquiring voice-changed speech to be matched; restoring the voice-changed speech through a speech feature transformation model to obtain restored speech; performing voiceprint comparison on the suspect's original speech and the restored speech; when the voiceprint comparison result is a mismatch, adjusting the parameter values of the parameters in the speech feature transformation model and returning to the restoring step for another iteration, until the voiceprint comparison result is a match or an iteration stop condition is met; and determining the matching result between the voice-changed speech and the original speech according to the voiceprint comparison result when the iteration stops.
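Tying the steps together, the sketch below shows the restore-compare-adjust loop they describe, reusing the hypothetical helpers from the earlier sketches (SpeechFeatureTransformationModel, voiceprint_feature, voiceprint_score, target_interval, search_target_value); the iteration budget, per-parameter interval length, and score threshold are illustrative assumptions.

```python
# A sketch of the overall iterative matching loop. All helper names come
# from the earlier sketches; budgets and thresholds are assumptions.
def score_with(model, name, value, changed, original, sr):
    """Score the restored speech against the original with one parameter
    temporarily set to the given value."""
    old = model.params[name]
    model.params[name] = value
    restored = model.restore(changed, sr)
    score = voiceprint_score(voiceprint_feature(original, sr),
                             voiceprint_feature(restored, sr))
    model.params[name] = old
    return score


def match_voice(changed, original, sr, model, max_iters=20, threshold=0.7):
    score = 0.0
    for _ in range(max_iters):                      # iteration stop condition
        restored = model.restore(changed, sr)
        score = voiceprint_score(voiceprint_feature(original, sr),
                                 voiceprint_feature(restored, sr))
        if score >= threshold:                      # voiceprints match
            return True, score
        for name in model.params:                   # adjust each parameter
            interval = target_interval(model.params[name], 1.0)
            model.params[name] = search_target_value(
                interval, lambda v: score_with(model, name, v,
                                               changed, original, sr))
    return False, score                             # stopped without a match
```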
In one embodiment, the processor, when executing the computer program, further implements the following steps: determining at least one parameter of the speech feature transformation model, the at least one parameter characterizing at least one speech feature used to restore the voice-changed speech; selecting an initial parameter value for each parameter of the speech feature transformation model; and building the speech feature transformation model from the parameters and their initial parameter values.
In one embodiment, the processor, when executing the computer program, further implements the following steps: high-pass filtering the suspect's original speech and the restored speech respectively; segmenting the high-pass-filtered original speech and restored speech respectively; and performing voiceprint comparison on the segmented original speech and restored speech.
In one embodiment, the processor, when executing the computer program, further implements the following steps: acquiring a first voiceprint feature of the original speech and a second voiceprint feature of the restored speech; calculating a voiceprint comparison score between the first voiceprint feature and the second voiceprint feature; when the voiceprint comparison score is higher than or equal to a score threshold, the voiceprint comparison result is a match; and when the voiceprint comparison score is lower than the score threshold, the voiceprint comparison result is a mismatch.
In one embodiment, the processor, when executing the computer program, further implements the following steps: determining a target interval for adjusting the parameter value of a parameter in the speech feature transformation model; searching the target interval for the target parameter value that yields the highest voiceprint comparison score between the restored speech and the original speech; and determining the target parameter value as the adjusted parameter value of the corresponding parameter in the speech feature transformation model.
In one embodiment, the processor, when executing the computer program, further implements the following steps: acquiring the interval length used for adjusting the parameter value of the parameter; and determining the target interval corresponding to the parameter according to the interval length, centered on the parameter's current value in the speech feature transformation model.
In one embodiment, the processor, when executing the computer program, further implements the following steps: extracting frame-level features of the original speech, and aggregating them to obtain sentence-level features of the original speech; obtaining the first voiceprint feature from the frame-level and sentence-level features of the original speech; extracting frame-level features of the restored speech, and aggregating them to obtain sentence-level features of the restored speech; and obtaining the second voiceprint feature from the frame-level and sentence-level features of the restored speech.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the following steps: acquiring voice-changed speech to be matched; restoring the voice-changed speech through a speech feature transformation model to obtain restored speech; performing voiceprint comparison on the suspect's original speech and the restored speech; when the voiceprint comparison result is a mismatch, adjusting the parameter values of the parameters in the speech feature transformation model and returning to the restoring step for another iteration, until the voiceprint comparison result is a match or an iteration stop condition is met; and determining the matching result between the voice-changed speech and the original speech according to the voiceprint comparison result when the iteration stops.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: determining at least one parameter of the speech feature transformation model, the at least one parameter characterizing at least one speech feature used to restore the voice-changed speech; selecting an initial parameter value for each parameter of the speech feature transformation model; and building the speech feature transformation model from the parameters and their initial parameter values.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: high-pass filtering the suspect's original speech and the restored speech respectively; segmenting the high-pass-filtered original speech and restored speech respectively; and performing voiceprint comparison on the segmented original speech and restored speech.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring a first voiceprint feature of the original speech and a second voiceprint feature of the restored speech; calculating a voiceprint comparison score between the first voiceprint feature and the second voiceprint feature; when the voiceprint comparison score is higher than or equal to a score threshold, the voiceprint comparison result is a match; and when the voiceprint comparison score is lower than the score threshold, the voiceprint comparison result is a mismatch.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: determining a target interval for adjusting the parameter value of a parameter in the speech feature transformation model; searching the target interval for the target parameter value that yields the highest voiceprint comparison score between the restored speech and the original speech; and determining the target parameter value as the adjusted parameter value of the corresponding parameter in the speech feature transformation model.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring the interval length used for adjusting the parameter value of the parameter; and determining the target interval corresponding to the parameter according to the interval length, centered on the parameter's current value in the speech feature transformation model.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: extracting frame-level features of the original speech, and aggregating them to obtain sentence-level features of the original speech; obtaining the first voiceprint feature from the frame-level and sentence-level features of the original speech; extracting frame-level features of the restored speech, and aggregating them to obtain sentence-level features of the restored speech; and obtaining the second voiceprint feature from the frame-level and sentence-level features of the restored speech.
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of speech matching, the method comprising:
acquiring voice-changed speech to be matched;
restoring the voice-changed speech through a speech feature transformation model to obtain restored speech;
performing voiceprint comparison on original speech of a suspect and the restored speech;
when the voiceprint comparison result is a mismatch, adjusting parameter values of parameters in the speech feature transformation model, and returning to the step of restoring the voice-changed speech through the speech feature transformation model for another iteration, until the voiceprint comparison result is a match or an iteration stop condition is met;
and determining a matching result between the voice-changed speech and the original speech according to the voiceprint comparison result when iteration stops.
2. The method of claim 1, wherein before the restoring the voice-changed speech through the speech feature transformation model to obtain the restored speech, the method further comprises:
determining at least one parameter of the speech feature transformation model; the at least one parameter characterizes at least one speech feature used to restore the voice-changed speech;
selecting an initial parameter value for each parameter of the speech feature transformation model;
and building the speech feature transformation model according to the parameters of the speech feature transformation model and the initial parameter values of the parameters.
3. The method of claim 1, wherein the performing voiceprint comparison on the original speech of the suspect and the restored speech comprises:
high-pass filtering the original speech of the suspect and the restored speech respectively;
segmenting the high-pass-filtered original speech and restored speech respectively;
and performing voiceprint comparison on the segmented original speech and restored speech.
4. The method of claim 1, wherein the performing voiceprint comparison on the original speech of the suspect and the restored speech comprises:
acquiring a first voiceprint feature of the original speech and a second voiceprint feature of the restored speech;
calculating a voiceprint comparison score between the first voiceprint feature and the second voiceprint feature;
when the voiceprint comparison score is higher than or equal to a score threshold, the voiceprint comparison result is a match;
and when the voiceprint comparison score is lower than the score threshold, the voiceprint comparison result is a mismatch.
5. The method of claim 4, wherein the adjusting parameter values of parameters in the speech feature transformation model comprises:
determining a target interval for adjusting the parameter value of a parameter in the speech feature transformation model;
searching the target interval for the target parameter value that yields the highest voiceprint comparison score between the restored speech and the original speech;
and determining the target parameter value as the adjusted parameter value of the corresponding parameter in the speech feature transformation model.
6. The method of claim 5, wherein the determining a target interval for adjusting the parameter value of a parameter in the speech feature transformation model comprises:
acquiring the interval length used for adjusting the parameter value of the parameter;
and determining the target interval corresponding to the parameter according to the interval length, centered on the current parameter value of the parameter in the speech feature transformation model.
7. The method of claim 4, wherein the acquiring a first voiceprint feature of the original speech and a second voiceprint feature of the restored speech comprises:
extracting frame-level features of the original speech, and aggregating them to obtain sentence-level features of the original speech;
obtaining the first voiceprint feature according to the frame-level and sentence-level features of the original speech;
extracting frame-level features of the restored speech, and aggregating them to obtain sentence-level features of the restored speech;
and obtaining the second voiceprint feature according to the frame-level and sentence-level features of the restored speech.
8. A voice matching apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire voice-changed speech to be matched;
a restoring module, configured to restore the voice-changed speech through a speech feature transformation model to obtain restored speech;
a voiceprint comparison module, configured to perform voiceprint comparison on original speech of a suspect and the restored speech;
an adjusting module, configured to, when the voiceprint comparison result is a mismatch, adjust parameter values of parameters in the speech feature transformation model and return to the step of restoring the voice-changed speech through the speech feature transformation model for another iteration, until the voiceprint comparison result is a match or an iteration stop condition is met;
and a determining module, configured to determine a matching result between the voice-changed speech and the original speech according to the voiceprint comparison result when iteration stops.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010719805.7A 2020-07-24 2020-07-24 Voice matching method and device, computer equipment and storage medium Active CN111739547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010719805.7A CN111739547B (en) 2020-07-24 2020-07-24 Voice matching method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111739547A true CN111739547A (en) 2020-10-02
CN111739547B CN111739547B (en) 2020-11-24

Family

ID=72657542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010719805.7A Active CN111739547B (en) 2020-07-24 2020-07-24 Voice matching method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111739547B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002236666A (en) * 2001-02-09 2002-08-23 Matsushita Electric Ind Co Ltd Personal authentication device
US7412039B2 (en) * 2004-04-23 2008-08-12 International Business Machines Corporation Method and system for verifying an attachment file within an e-mail
CN103730121A (en) * 2013-12-24 2014-04-16 中山大学 Method and device for recognizing disguised sounds
CN108198574A (en) * 2017-12-29 2018-06-22 科大讯飞股份有限公司 Change of voice detection method and device
CN109215680A (en) * 2018-08-16 2019-01-15 公安部第三研究所 A kind of voice restoration method based on convolutional neural networks
CN109616131A (en) * 2018-11-12 2019-04-12 南京南大电子智慧型服务机器人研究院有限公司 A kind of number real-time voice is changed voice method
CN110459242A (en) * 2019-08-21 2019-11-15 广州国音智能科技有限公司 Change of voice detection method, terminal and computer readable storage medium
CN110932960A (en) * 2019-11-04 2020-03-27 深圳市声扬科技有限公司 Social software-based fraud prevention method, server and system

Also Published As

Publication number Publication date
CN111739547B (en) 2020-11-24

Similar Documents

Publication Publication Date Title
US11887582B2 (en) Training and testing utterance-based frameworks
US11450313B2 (en) Determining phonetic relationships
US12027165B2 (en) Computer program, server, terminal, and speech signal processing method
WO2013020329A1 (en) Parameter speech synthesis method and system
CN111508511A (en) Real-time sound changing method and device
US10854182B1 (en) Singing assisting system, singing assisting method, and non-transitory computer-readable medium comprising instructions for executing the same
US20230206896A1 (en) Method and system for applying synthetic speech to speaker image
WO2023279976A1 (en) Speech synthesis method, apparatus, device, and storage medium
CN112735454A (en) Audio processing method and device, electronic equipment and readable storage medium
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
KR102528019B1 (en) A TTS system based on artificial intelligence technology
CN111739547B (en) Voice matching method and device, computer equipment and storage medium
CN114999440B (en) Avatar generation method, apparatus, device, storage medium, and program product
CN114708876B (en) Audio processing method, device, electronic equipment and storage medium
CN115810341A (en) Audio synthesis method, apparatus, device and medium
CN112992110B (en) Audio processing method, device, computing equipment and medium
KR102532253B1 (en) A method and a TTS system for calculating a decoder score of an attention alignment corresponded to a spectrogram
CN115273806A (en) Song synthesis model training method and device and song synthesis method and device
US9928832B2 (en) Method and apparatus for classifying lexical stress
KR102503066B1 (en) A method and a TTS system for evaluating the quality of a spectrogram using scores of an attention alignment
JP7079455B1 (en) Acoustic model learning devices, methods and programs, as well as speech synthesizers, methods and programs
CN113990346A (en) Fundamental frequency prediction method and device
CN116580693A (en) Training method of tone color conversion model, tone color conversion method, device and equipment
CN117746832A (en) Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
CN116486765A (en) Singing voice generating method, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant