CN115064149A - Model matching method and device, electronic equipment and readable storage medium - Google Patents

Model matching method and device, electronic equipment and readable storage medium

Info

Publication number
CN115064149A
Authority
CN
China
Prior art keywords
voiceprint
target
audio
scene
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210642792.7A
Other languages
Chinese (zh)
Inventor
吕翔
印晶晶
卢恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Himalaya Technology Co ltd
Original Assignee
Shanghai Himalaya Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Himalaya Technology Co ltd filed Critical Shanghai Himalaya Technology Co ltd
Priority to CN202210642792.7A priority Critical patent/CN115064149A/en
Publication of CN115064149A publication Critical patent/CN115064149A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The embodiment of the invention provides a model matching method and device, electronic equipment and a readable storage medium, and relates to the field of computers. Firstly, a target voiceprint feature of a target user, first matching information, and a plurality of scene voiceprint features corresponding to a fixed text are acquired; second matching information corresponding to each scene voiceprint feature is acquired respectively, and a matching degree score between the target voiceprint feature and each scene voiceprint feature is obtained respectively. Then, each matching degree score is normalized by using the first matching information and the second matching information to obtain a normalized score corresponding to each scene voiceprint feature. Finally, a target speech synthesis model matched with the target voiceprint feature is determined from the model library according to all the normalized scores. In this way, a matched target speech synthesis model is selected from the model library according to the voiceprint feature of the target user, which is time-saving and convenient.

Description

Model matching method and device, electronic equipment and readable storage medium
Technical Field
The invention relates to the field of computers, in particular to a model matching method, a model matching device, electronic equipment and a readable storage medium.
Background
Speech synthesis, also known as text to speech (tts), refers to generating speech from text by a computer. Personalized tts generates speech while preserving the timbre characteristics of a specific speaker. Compared with general tts, personalized tts is more popular with users because it can simulate a person's timbre characteristics and bring the audience closer.
In the prior art, synthesizing personalized speech for a user requires training a personalized tts model for that user in advance. On one hand, preparing the training data and training the personalized tts model are both time-consuming; on the other hand, different personalized tts models have to be trained for different user requirements. Therefore, in the prior art, obtaining a tts model specific to a user is time-consuming and labor-intensive.
Disclosure of Invention
The invention aims to provide a model matching method, a model matching device, an electronic device and a readable storage medium, so as to solve the problems in the prior art.
Embodiments of the invention may be implemented as follows:
in a first aspect, the present invention provides a model matching method, including:
acquiring a target voiceprint characteristic corresponding to a target audio of a target user; the target audio corresponds to a fixed text, and the fixed text corresponds to a service scene;
acquiring first matching information corresponding to the target voiceprint characteristics and the anchor voiceprint library;
acquiring a plurality of scene voiceprint characteristics corresponding to the fixed text; the model library comprises a plurality of voice synthesis models, and each scene voiceprint feature corresponds to one voice synthesis model;
respectively acquiring second matching information corresponding to each scene voiceprint feature;
respectively obtaining a matching degree score of the target voiceprint characteristics and each scene voiceprint characteristic;
respectively carrying out normalization processing on each matching degree score by using the first matching information and the second matching information to obtain a normalization score corresponding to each scene voiceprint feature;
and determining a target voice synthesis model matched with the target voiceprint characteristics from the model library according to all the normalized scores.
In an alternative embodiment, the method further comprises:
acquiring a template audio text set corresponding to the template audio set; the template audio set comprises a plurality of template audios, and the template audio text set comprises a plurality of template audio texts; any one template audio has a corresponding template audio text;
respectively inputting the template audio text set into each speech synthesis model to obtain a speech synthesis audio set corresponding to each template audio text; the speech synthesis audio set comprises a plurality of speech synthesis audios; one speech synthesis audio corresponding to each template audio text corresponds to one speech synthesis model;
performing feature extraction on the voice synthesis audio set and the template audio set by using a voiceprint model to obtain a voiceprint feature set so as to form the anchor voiceprint library; wherein the anchor voiceprint library comprises voiceprint features of each of the speech synthesis audios and voiceprint features of each of the template audios.
In an optional implementation manner, the anchor voiceprint library includes a plurality of voiceprint features, the first matching information includes a first mean value and a first standard deviation, and the step of obtaining the first matching information corresponding to the target voiceprint feature and the anchor voiceprint library includes:
matching the target voiceprint characteristics with each voiceprint characteristic in an anchor voiceprint library respectively to obtain a plurality of first matching scores corresponding to the target voiceprint characteristics; each of the first match scores corresponds to one of the voiceprint features in the anchor voiceprint library;
K first matching scores are selected from the plurality of first matching scores, and the first mean value and the first standard deviation are calculated based on the K first matching scores.
In an optional implementation manner, the step of obtaining a plurality of scene voiceprint features corresponding to the fixed text includes:
respectively inputting the fixed text into each voice synthesis model to obtain a plurality of target voice synthesis audios;
and respectively extracting the characteristics of each target voice synthesis audio by using a voiceprint model to obtain a plurality of scene voiceprint characteristics.
In an optional implementation manner, the anchor voiceprint library includes a plurality of voiceprint features, the second matching information includes a second mean and a second standard deviation of each scene voiceprint feature, and the step of respectively obtaining the second matching information corresponding to each scene voiceprint feature includes:
matching each scene voiceprint feature with each voiceprint feature in an anchor voiceprint library one by one to obtain a plurality of second matching scores corresponding to each scene voiceprint feature;
and aiming at each scene voiceprint feature, selecting K second matching scores from a plurality of second matching scores corresponding to each scene voiceprint feature, and calculating a second mean value and a second standard deviation of the scene voiceprint feature based on the K second matching scores.
In a second aspect, the present invention provides a model matching apparatus, including a first obtaining module, a second obtaining module and a processing module;
the first obtaining module is configured to:
acquiring a target voiceprint characteristic corresponding to a target audio of a target user; the target audio corresponds to a fixed text, and the fixed text corresponds to a service scene;
acquiring first matching information corresponding to the target voiceprint characteristics and the anchor voiceprint library;
the second obtaining module is configured to:
acquiring a plurality of scene voiceprint characteristics corresponding to the fixed text; the model library comprises a plurality of voice synthesis models, and each scene voiceprint feature corresponds to one voice synthesis model;
respectively acquiring second matching information corresponding to each scene voiceprint feature;
respectively obtaining a matching degree score of the target voiceprint characteristics and each scene voiceprint characteristic;
the processing module is configured to:
respectively carrying out normalization processing on each matching degree score by using the first matching information and the second matching information to obtain a normalization score corresponding to each scene voiceprint feature;
and determining a target voice synthesis model matched with the target voiceprint characteristics from the model library according to all the normalized scores.
In an optional embodiment, the first obtaining module is further configured to:
acquiring a template audio text set corresponding to the template audio set; the template audio set comprises a plurality of template audios, and the template audio text set comprises a plurality of template audio texts; any one template audio has a corresponding template audio text;
respectively inputting the template audio text set into each speech synthesis model to obtain a speech synthesis audio set corresponding to each template audio text; the speech synthesis audio set comprises a plurality of speech synthesis audios; one speech synthesis audio corresponding to each template audio text corresponds to one speech synthesis model;
performing feature extraction on the voice synthesis audio set and the template audio set by using a voiceprint model to obtain a voiceprint feature set so as to form the anchor voiceprint library; and the anchor point voiceprint library comprises the voiceprint characteristics of each voice synthesis audio and the voiceprint characteristics of each template audio.
In an optional embodiment, the first obtaining module is specifically configured to:
respectively inputting the fixed text into each speech synthesis model to obtain a plurality of target speech synthesis audios;
and respectively extracting the characteristics of each target voice synthesis audio by utilizing a voiceprint model to obtain a plurality of scene voiceprint characteristics.
In a third aspect, the present invention provides an electronic device comprising: a memory and a processor, the memory storing machine readable instructions executable by the processor, the processor executing the machine readable instructions to implement the method of any one of the preceding embodiments when the electronic device is running.
In a fourth aspect, the present invention provides a readable storage medium storing a computer program for execution by a processor to implement the method of any one of the preceding embodiments.
Compared with the prior art, the embodiment of the invention provides a model matching method and device, electronic equipment and a readable storage medium. The method comprises: firstly acquiring a target voiceprint feature of a target user, first matching information, and a plurality of scene voiceprint features corresponding to a fixed text; respectively acquiring second matching information corresponding to each scene voiceprint feature, and respectively obtaining a matching degree score between the target voiceprint feature and each scene voiceprint feature. Then, each matching degree score is normalized by using the first matching information and the second matching information to obtain a normalized score corresponding to each scene voiceprint feature. Finally, a target speech synthesis model matched with the target voiceprint feature is determined from the model library according to all the normalized scores. In this way, a matched target speech synthesis model is selected from the model library according to the voiceprint feature of the target user, which is time-saving and convenient.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flow chart of a model matching method according to an embodiment of the present invention.
Fig. 2 is a second schematic flowchart of a model matching method according to an embodiment of the present invention.
Fig. 3 is a third schematic flowchart of a model matching method according to an embodiment of the present invention.
Fig. 4 is a fourth flowchart illustrating a model matching method according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a process for constructing an anchor voiceprint library according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a model matching apparatus according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
Speech synthesis, also known as text to speech (tts), refers to generating speech from text by a computer. With the development of deep learning, general tts has advanced greatly and can generate highly realistic speech. Personalized tts generates speech while preserving the timbre characteristics of a specific speaker. Compared with general tts, personalized tts is more popular with users because it can simulate a person's timbre characteristics and bring the audience closer.
In the prior art, synthesizing personalized speech for a user requires training a personalized tts model for that user in advance. On one hand, preparing the training data and training the personalized tts model are both time-consuming; on the other hand, different personalized tts models have to be trained for different user requirements. Therefore, in the prior art, obtaining a tts model specific to a user is time-consuming and labor-intensive.
In view of this, the embodiment of the present invention provides a model matching method, which can determine a target speech synthesis model matched with a target voiceprint feature of a target user from a model library according to an anchor voiceprint library, and is convenient and fast, thereby avoiding waiting time of the user. The following detailed description is made by way of examples, with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic flow chart of a model matching method according to an embodiment of the present invention, where an execution subject of the method may be an electronic device. The method comprises the following steps:
s101, obtaining a target voiceprint characteristic corresponding to a target audio of a target user.
In this embodiment, the target audio corresponds to a fixed text, and the fixed text corresponds to a service scene.
Alternatively, a target audio of the target user may be obtained first, where the target audio is an audio obtained by the target user reading the fixed text. And then, carrying out feature extraction on the target audio by using the voiceprint model to obtain target voiceprint features.
S102, first matching information corresponding to the target voiceprint characteristics and the anchor voiceprint library is obtained.
It will be appreciated that the first matching information may characterize the mean and standard deviation of the matching scores between the target voiceprint feature and each voiceprint feature in the anchor voiceprint library.
S103, obtaining a plurality of scene voiceprint characteristics corresponding to the fixed text.
In this embodiment, the model library may include a plurality of speech synthesis models, and each scene voiceprint feature corresponds to one speech synthesis model.
S104, respectively acquiring second matching information corresponding to the voiceprint characteristics of each scene;
in this embodiment, each scene voiceprint feature may correspond to one piece of second matching information. The second matching information may characterize a mean and variance of the matching value scores for the corresponding scene voiceprint features and each of the voiceprint features in the anchor voiceprint library.
And S105, respectively obtaining the matching degree scores of the target voiceprint characteristics and the voiceprint characteristics of each scene.
In this embodiment, the matching degree score may represent the similarity between the target voiceprint feature and one scene voiceprint feature.
And S106, respectively carrying out normalization processing on each matching degree score by utilizing the first matching information and the second matching information to obtain a normalization score corresponding to each scene voiceprint feature.
In this embodiment, the normalized score corresponding to each scene voiceprint feature may represent the timbre similarity between the audio synthesized by the speech synthesis model corresponding to that scene voiceprint feature and the speaking voice of the target user. It can be understood that the matching degree score evaluates the similarity between the target voiceprint feature of the target audio (real speech) and a scene voiceprint feature (synthetic speech), a comparison that naturally carries differences and inaccuracy. The matching degree score therefore needs to be normalized, so as to reduce the inaccuracy and inconsistency of the similarity evaluation process and improve the accuracy of the normalized score.
And S107, determining a target voice synthesis model matched with the target voiceprint characteristics from the model library according to all the normalized scores.
It will be appreciated that the greater the normalized score, the greater the degree of timbre similarity between the audio synthesized by the corresponding speech synthesis model and the speech of the target user. The target speech synthesis model may be the speech synthesis model for which the normalized score is greatest.
The embodiment of the invention provides a model matching method: firstly, a target voiceprint feature of a target user, first matching information, and a plurality of scene voiceprint features corresponding to a fixed text are acquired; second matching information corresponding to each scene voiceprint feature is acquired respectively, and a matching degree score between the target voiceprint feature and each scene voiceprint feature is obtained respectively. Then, each matching degree score is normalized by using the first matching information and the second matching information to obtain a normalized score corresponding to each scene voiceprint feature. Finally, a target speech synthesis model matched with the target voiceprint feature is determined from the model library according to all the normalized scores. In this way, the matched target speech synthesis model is selected directly from the model library according to the voiceprint feature of the target user, which is time-saving and convenient.
In an alternative embodiment, the first matching information and the second matching information are derived based on an anchor voiceprint library. The model matching method may further include the steps of:
and S100, constructing an anchor voiceprint library based on the template audio set and the model library.
It is understood that the template audio set includes a plurality of template audios, and each template audio may be a real human voice. The model library comprises a plurality of trained speech synthesis models.
Optionally, the anchor voiceprint library includes a plurality of voiceprint features, one part is a voiceprint feature of the template audio, and the other part is a voiceprint feature of the speech synthesis audio. The substeps of step S100 may include:
s1001, acquiring a template audio text set corresponding to the template audio set.
In this embodiment, the template audio set may include a plurality of template audios, and a template audio text set may be obtained by performing speech recognition on all the template audios. Accordingly, a plurality of template audio texts may be included in the set of template audio texts. Any one template audio has a corresponding template audio text.
S1002, respectively inputting the template audio text set into each speech synthesis model, and obtaining a speech synthesis audio set corresponding to each template audio text.
In this embodiment, the set of speech synthesis audios includes a plurality of speech synthesis audios. The number of speech synthesis audios corresponding to each template audio text matches the number of speech synthesis models in the model library. That is, one speech synthesis audio corresponding to each template audio text corresponds to one speech synthesis model.
S1003, utilizing the voiceprint model to perform feature extraction on the voice synthesis audio set and the template audio set to obtain a voiceprint feature set so as to form an anchor voiceprint library.
In this embodiment, the voice synthesis audio set and the template audio set may be respectively input into the voiceprint model, and feature extraction may be performed to obtain an anchor voiceprint library. The anchor voiceprint library can contain voiceprint features for each speech synthesis audio and the voiceprint features for each template audio.
It can be understood that, because the anchor voiceprint library contains the voiceprint features of the template audios as well as the voiceprint features of the speech synthesis audios, the first matching information and the second matching information can be more accurate.
In an optional embodiment, the anchor voiceprint library includes a plurality of voiceprint features, and the first matching information is obtained by matching the target voiceprint feature with each voiceprint feature in the anchor voiceprint library. The first matching information may include a first mean and a first standard deviation. Accordingly, on the basis of fig. 1, referring to fig. 2, the sub-steps of the step S102 may include:
and S1021, matching the target voiceprint characteristics with each voiceprint characteristic in the anchor voiceprint library respectively to obtain a plurality of first matching scores corresponding to the target voiceprint characteristics.
S1022, K first matching scores are selected from the multiple first matching scores, and a first mean value and a first standard deviation are calculated based on the K first matching scores.
In this embodiment, each first match score corresponds to a voiceprint feature in the anchor voiceprint library. The first match score may represent a similarity between the target voiceprint feature and the voiceprint features in the anchor voiceprint library.
It will be appreciated that the K first matching scores may be the first K first matching scores when all the first matching scores are sorted in descending order. The size of K is a preset empirical value; for example, K may be 1000, 1500, and so on. These values are merely examples and are not intended to be limiting.
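For ease of understanding, the following is a minimal illustrative sketch, in Python, of selecting the K largest first matching scores and computing the first mean and first standard deviation. The function and variable names (top_k_mean_std, first_scores) are assumptions made for illustration and do not limit the embodiments.

```python
import numpy as np

def top_k_mean_std(matching_scores, k=1000):
    """Select the K largest matching scores and return their mean and standard deviation."""
    scores = np.asarray(matching_scores, dtype=np.float64)
    k = min(k, scores.size)            # guard against small anchor voiceprint libraries
    top_k = np.sort(scores)[::-1][:k]  # sort descending, keep the first K scores
    return float(top_k.mean()), float(top_k.std())

# Example: first_scores holds the first matching scores of the target voiceprint
# feature against every voiceprint feature in the anchor voiceprint library.
# mean, std = top_k_mean_std(first_scores, k=1000)
```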
In an alternative embodiment, the scene voiceprint features are extracted from the target speech synthesis audio obtained from a fixed text input speech synthesis model. Accordingly, on the basis of fig. 1, referring to fig. 3, the sub-steps of step S103 may include:
and S1031, respectively inputting the fixed texts into each speech synthesis model to obtain a plurality of target speech synthesis audios.
S1032, respectively carrying out feature extraction on each target voice synthesis audio by using the voiceprint model to obtain a plurality of scene voiceprint features.
In this embodiment, the fixed text may correspond to one service scenario, and different service scenarios may correspond to different fixed texts. The number of target speech synthesis audios is the same as the number of speech synthesis models in the model library.
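A corresponding illustrative sketch of steps S1031 and S1032 is given below, assuming hypothetical helpers synthesize(model, text) and extract_voiceprint(audio) that stand in for the speech synthesis models and the voiceprint model; neither name is defined by this disclosure.

```python
def scene_voiceprint_features(fixed_text, speech_synthesis_models,
                              synthesize, extract_voiceprint):
    """Scene voiceprint features i_1 ... i_m for one fixed (business-scene) text.

    One target speech synthesis audio, and hence one scene voiceprint feature,
    is produced per speech synthesis model in the model library.
    """
    return [extract_voiceprint(synthesize(model, fixed_text))
            for model in speech_synthesis_models]
```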
In an optional embodiment, the second matching information is obtained by matching each scene voiceprint feature with each voiceprint feature in the anchor voiceprint library one by one. The second matching information may include a second mean and a second standard deviation of each scene voiceprint feature. Accordingly, on the basis of fig. 1, referring to fig. 4, the sub-steps of step S104 may include:
s1041, matching each scene voiceprint feature with each voiceprint feature in the anchor voiceprint library one by one to obtain a plurality of second matching scores corresponding to each scene voiceprint feature.
S1042, aiming at each scene voiceprint feature, selecting K second matching scores from the plurality of second matching scores corresponding to each scene voiceprint feature, and calculating a second mean value and a second standard deviation of the scene voiceprint feature based on the K second matching scores.
In this embodiment, for each scene voiceprint feature, the scene voiceprint feature may be matched with each voiceprint feature in the anchor voiceprint library to obtain a plurality of second matching scores corresponding to the scene voiceprint feature. The second match score may represent a similarity between the scene voiceprint features and the voiceprint features in the anchor voiceprint library.
For each scene voiceprint feature, K second matching scores may be selected from a plurality of second matching scores corresponding to the scene voiceprint feature, and a second mean value and a second standard deviation of the scene voiceprint feature may be calculated based on the selected K second matching scores.
It can be understood that the value of K in step S1042 is the same as the value of K in step S1022 described above. The K second matching scores may be the first K second matching scores when all the second matching scores are sorted in descending order.
Optionally, the target voiceprint feature, the scene voiceprint features, and the voiceprint features in the anchor voiceprint library may be in the form of vectors. The first matching score, the second matching score, and the matching degree score may be calculated in the same manner.
In an alternative example, the first matching score, the second matching score or the matching degree score can be obtained by calculating cosine similarity, that is, the similarity between two voiceprint features is measured by cosine similarity. In another alternative example, the first matching score, the second matching score or the matching degree score may be obtained by using a PLDA (Probabilistic Linear Discriminant Analysis) algorithm.
Taking the cosine similarity calculation as an example, the cosine similarity is calculated as follows:
cos θ = (A · B) / (‖A‖ ‖B‖)
where A denotes one vector, B denotes the other vector, and cos θ denotes the cosine similarity between the two vectors.
When calculating the first matching score, A may represent the target voiceprint feature, B may represent a voiceprint feature in the anchor voiceprint library, and cos θ may represent the similarity between the target voiceprint feature and the voiceprint feature in the anchor voiceprint library.
When calculating the second matching score, A may represent a scene voiceprint feature, B may represent a voiceprint feature in the anchor voiceprint library, and cos θ may represent the similarity between the scene voiceprint feature and the voiceprint feature in the anchor voiceprint library.
When calculating the matching degree score, A may represent the target voiceprint feature, B may represent a scene voiceprint feature, and cos θ may represent the similarity between the target voiceprint feature and the scene voiceprint feature.
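A minimal sketch of the cosine similarity described above, assuming the voiceprint features are given as numeric vectors (the names are illustrative only):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint feature vectors A and B."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The same function can be used for the first matching score (target feature vs.
# anchor feature), the second matching score (scene feature vs. anchor feature)
# and the matching degree score (target feature vs. scene feature).
```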
In order to facilitate understanding of the model matching method provided in the embodiment of the present invention, the above steps are described below with specific examples.
Referring to fig. 5, fig. 5 is a schematic diagram of a process for constructing an anchor voiceprint library according to an embodiment of the present invention. Assuming that the model library contains m speech synthesis models (speech synthesis model 1-speech synthesis model m in fig. 5), the following is the process of constructing the anchor voiceprint library:
taking an example that the template audio set includes a template audio X, first, in the template audio text set obtained by speech recognition in step S1001, the template audio text corresponding to the template audio X is X.
Secondly, the template audio text X is input into each speech synthesis model respectively, and the obtained speech synthesis audios are x_1, x_2, ..., x_m.
Then, feature extraction is performed on the template audio and all the speech synthesis audios (x, x_1, x_2, ..., x_m) by using the voiceprint model, so as to obtain the voiceprint feature set e, e_1, e_2, ..., e_m.
Thus, the voiceprint features contained in the anchor voiceprint library include e, e_1, e_2, ..., e_m, that is, m+1 voiceprint features. It should be noted that the anchor voiceprint library may be pre-constructed, stored in a database, and directly called when in use.
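The construction process shown in fig. 5 can be sketched as follows. Here transcribe(audio), synthesize(model, text) and extract_voiceprint(audio) are hypothetical helpers standing in for the speech recognition step, the speech synthesis models and the voiceprint model; they are assumptions for illustration only.

```python
def build_anchor_voiceprint_library(template_audios, speech_synthesis_models,
                                    transcribe, synthesize, extract_voiceprint):
    """Build the anchor voiceprint library from a template audio set and a model library.

    For a single template audio x this yields the m+1 voiceprint features
    e, e_1, ..., e_m described in the example above.
    """
    anchor_library = []
    for template_audio in template_audios:
        text = transcribe(template_audio)                          # step S1001: template audio text X
        anchor_library.append(extract_voiceprint(template_audio))  # voiceprint feature e
        for model in speech_synthesis_models:                      # step S1002: audios x_1 ... x_m
            synthesized_audio = synthesize(model, text)
            anchor_library.append(extract_voiceprint(synthesized_audio))  # features e_1 ... e_m
    return anchor_library
```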
Assuming that the fixed text is Y, the following describes a process of obtaining the second matching information:
Firstly, a plurality of scene voiceprint features corresponding to the fixed text are acquired: the fixed text Y is input into each speech synthesis model respectively to obtain m target speech synthesis audios y_1, y_2, ..., y_m. Then, feature extraction is performed on each target speech synthesis audio by using the voiceprint model, so as to obtain m scene voiceprint features i_1, i_2, ..., i_m.
Taking the scene voiceprint feature i_1 as an example, the process of obtaining its second matching information is as follows: the cosine similarity between the scene voiceprint feature i_1 and each voiceprint feature in the anchor voiceprint library is calculated, so as to obtain m+1 second matching scores corresponding to the scene voiceprint feature i_1. Then K second matching scores are selected from the m+1 second matching scores, and the corresponding second mean mean_1 and second standard deviation std_1 are obtained through calculation. The second mean mean_1 and the second standard deviation std_1 are the second matching information of the scene voiceprint feature i_1.
Thus, by repeating the above process for each scene voiceprint feature, the second means mean_1, mean_2, ..., mean_m and the second standard deviations std_1, std_2, ..., std_m of the m scene voiceprint features i_1, i_2, ..., i_m can be obtained.
It should be noted that, the plurality of scene voiceprint features corresponding to the fixed text may be obtained by preprocessing and stored in the database, and may be directly called when used. Similarly, the second matching information of each scene voiceprint feature can be obtained by preprocessing, stored in the database, and directly called when in use.
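Reusing the cosine_similarity and top_k_mean_std sketches above, the second matching information of every scene voiceprint feature could be precomputed as follows; this is illustrative only and not a required implementation.

```python
def second_matching_info(scene_features, anchor_library, k=1000):
    """Return (mean_j, std_j) for every scene voiceprint feature i_j."""
    info = []
    for scene_feature in scene_features:
        second_scores = [cosine_similarity(scene_feature, anchor_feature)
                         for anchor_feature in anchor_library]  # m+1 second matching scores
        info.append(top_k_mean_std(second_scores, k))            # (mean_j, std_j)
    return info
```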
When the target user needs a personalized speech synthesis model, the process of determining the target speech synthesis model matched with the target user from the model library is described below:
Firstly, in step S101, the target user reads the fixed text Y aloud, so that the obtained target audio of the target user is y; feature extraction is then performed on the target audio y by using the voiceprint model to obtain the target voiceprint feature i.
The process of obtaining the first matching information is similar to the process of obtaining the second matching information, and is briefly described here: in step S102, the cosine similarity between the target voiceprint feature i and each voiceprint feature in the anchor voiceprint library is first calculated, so as to obtain m+1 first matching scores corresponding to the target voiceprint feature i. Then K first matching scores are selected from the m+1 first matching scores, and the corresponding first mean mean and first standard deviation std are obtained through calculation.
It can be understood that the first mean mean and the first standard deviation std are the first matching information of the target voiceprint feature i.
Next, in step S103, the m scene voiceprint features i_1, i_2, ..., i_m corresponding to the fixed text Y may be obtained from the database. Alternatively, the m scene voiceprint features i_1, i_2, ..., i_m corresponding to the fixed text Y may be obtained by real-time processing based on the fixed text Y and the model library. Likewise, in step S104, the second matching information corresponding to each scene voiceprint feature may be acquired from the database. Alternatively, the second matching information corresponding to each scene voiceprint feature may be calculated in real time based on the m scene voiceprint features i_1, i_2, ..., i_m obtained in step S103 and the anchor voiceprint library.
In step S105, the matching degree scores s_1, s_2, ..., s_m between the target voiceprint feature i and each scene voiceprint feature are obtained respectively. Taking the matching degree score between the target voiceprint feature i and the scene voiceprint feature i_1 as an example: the cosine similarity between the target voiceprint feature i and the scene voiceprint feature i_1 can be calculated, and the obtained cosine similarity is taken as the matching degree score between the target voiceprint feature i and the scene voiceprint feature i_1.
In step S106, each matching degree score is normalized by using the first matching information (mean and std) and the second matching information (mean_1, mean_2, ..., mean_m and std_1, std_2, ..., std_m), so as to obtain the normalized score corresponding to each scene voiceprint feature. Taking the normalization of the matching degree score s_m as an example, the formula for normalizing the matching degree score s_m is as follows:
s_norm_m = 0.5 * ((s_m - mean) / std + (s_m - mean_m) / std_m)
where s_m is the matching degree score between the target voiceprint feature i and the scene voiceprint feature i_m, mean is the first mean corresponding to the target voiceprint feature i, std is the first standard deviation corresponding to the target voiceprint feature i, mean_m is the second mean corresponding to the scene voiceprint feature i_m, std_m is the second standard deviation corresponding to the scene voiceprint feature i_m, and s_norm_m is the normalized score corresponding to the scene voiceprint feature i_m.
Thus, the m normalized scores s_norm_1, s_norm_2, ..., s_norm_m corresponding to the m scene voiceprint features i_1, i_2, ..., i_m can be obtained.
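The normalization formula can be transcribed directly into code. The sketch below is illustrative and assumes the first and second matching information are given as (mean, std) pairs.

```python
def normalize_score(s_m, first_info, second_info_m):
    """Normalize one matching degree score s_m.

    first_info    = (mean, std)     : first matching information of the target voiceprint feature
    second_info_m = (mean_m, std_m) : second matching information of scene voiceprint feature i_m
    """
    mean, std = first_info
    mean_m, std_m = second_info_m
    return 0.5 * ((s_m - mean) / std + (s_m - mean_m) / std_m)
```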
In a possible situation, if every normalized score is lower than a preset threshold, it indicates that no speech synthesis model matching the target user exists in the model library, and the model library needs to be updated.
Finally, in step S107, all the normalized scores may be sorted in descending order; the scene voiceprint feature corresponding to the first (largest) normalized score has the highest matching degree with the target voiceprint feature, and accordingly the speech synthesis model corresponding to that normalized score may be used as the target speech synthesis model.
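Putting steps S105 to S107 together, one possible selection procedure is sketched below, reusing the normalize_score sketch above; the threshold argument corresponds to the preset threshold mentioned above, and returning None indicates that no matching model exists in the model library. The sketch is illustrative only.

```python
def select_target_model(matching_scores, first_info, second_infos, models, threshold=None):
    """Pick the speech synthesis model whose normalized score is the largest."""
    normalized = [normalize_score(s, first_info, info)
                  for s, info in zip(matching_scores, second_infos)]
    best = max(range(len(normalized)), key=lambda j: normalized[j])
    if threshold is not None and normalized[best] < threshold:
        return None  # every normalized score is below the threshold: update the model library
    return models[best]
```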
In the above description, m denotes a quantity and is an integer greater than 0. The execution sequence of the steps in the method embodiments described above is not limited to that shown in the drawings; the execution sequence of the steps is subject to the practical application.
In order to perform the corresponding steps in the above method embodiments and various possible embodiments, an implementation of the model matching apparatus is given below.
Referring to fig. 6, fig. 6 is a schematic structural diagram illustrating a model matching apparatus according to an embodiment of the present invention. The apparatus includes a first acquisition module 210, a second acquisition module 220, and a processing module 230.
A first obtaining module 210, configured to:
acquiring a target voiceprint characteristic corresponding to a target audio of a target user;
acquiring first matching information corresponding to the target voiceprint characteristics and the anchor voiceprint library;
the target audio corresponds to a fixed text, and the fixed text corresponds to a service scene.
A second obtaining module 220, configured to:
acquiring a plurality of scene voiceprint characteristics corresponding to the fixed text;
respectively acquiring second matching information corresponding to the voiceprint characteristics of each scene;
and respectively obtaining the matching degree score of the target voiceprint characteristics and each scene voiceprint characteristics.
The model library comprises a plurality of voice synthesis models, and each scene voiceprint feature corresponds to one voice synthesis model.
A processing module 230 configured to:
respectively carrying out normalization processing on each matching degree score by using the first matching information and the second matching information to obtain a normalization score corresponding to each scene voiceprint feature;
and determining a target voice synthesis model matched with the target voiceprint characteristics from the model library according to all the normalized scores.
In an alternative embodiment, the first obtaining module 210 may be further configured to construct an anchor voiceprint library based on the template audio set and the model library. Specifically, the first obtaining module 210 may be configured to:
acquiring a template audio text set corresponding to the template audio set;
it is understood that the set of template audio may include a plurality of template audio and the set of template audio text includes a plurality of template audio text. Any one template audio may have a corresponding template audio text.
Respectively inputting the template audio text set into each speech synthesis model to obtain a speech synthesis audio set corresponding to each template audio text;
Wherein the speech synthesis audio set comprises a plurality of speech synthesis audios.
One speech synthesis audio corresponding to each template audio text corresponds to one speech synthesis model;
and performing feature extraction on the voice synthesis audio set and the template audio set by using a voiceprint model to obtain a voiceprint feature set so as to form an anchor voiceprint library.
Wherein, the anchor point voiceprint library comprises the voiceprint characteristics of each voice synthesis audio and the voiceprint characteristics of each template audio.
In an optional embodiment, the second obtaining module 220 may be specifically configured to:
respectively inputting the fixed text into each voice synthesis model to obtain a plurality of target voice synthesis audios;
and respectively extracting the characteristics of each target voice synthesis audio by using the voiceprint model to obtain a plurality of scene voiceprint characteristics.
It is understood that the first obtaining module 210 may be configured to perform the steps S100, S101, S102 and their respective sub-steps, the second obtaining module 220 may be configured to perform the steps S103, S104, S105 and their respective sub-steps, and the processing module 230 may be configured to perform the steps S106, S107 and their respective sub-steps.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the model matching apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 300 includes a processor 310, a memory 320, and a bus 330, the processor 310 being coupled to the memory 320 via the bus 330.
The electronic device 300 may be, but is not limited to, a smart phone, a computer, a personal computer, a smart tablet, a notebook, etc.
The memory 320 may be used to store a software program, such as the model matching device shown in FIG. 6. The memory 320 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Flash Memory (Flash), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The processor 310 may be an integrated circuit chip having signal processing capabilities. The processor 310 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 320 stores machine-readable instructions executable by the processor 310. The processor 310, when executing the machine-readable instructions, implements the model matching method disclosed in the above embodiments.
It will be appreciated that the configuration shown in fig. 7 is merely illustrative and that electronic device 300 may include more or fewer components than shown in fig. 7 or have a different configuration than shown in fig. 7. The components shown in fig. 7 may be implemented in hardware, software, or a combination thereof.
The embodiment of the invention also provides a readable storage medium, on which a computer program is stored; when the computer program is executed by a processor, the model matching method disclosed in the above embodiments is implemented. The readable storage medium can be, but is not limited to, various media that can store program codes, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a PROM, an EPROM, an EEPROM, a flash disk or an optical disk.
In summary, the embodiments of the present invention provide a model matching method and device, an electronic device, and a readable storage medium. Firstly, a target voiceprint feature of a target user, first matching information, and a plurality of scene voiceprint features corresponding to a fixed text are acquired; second matching information corresponding to each scene voiceprint feature is acquired respectively, and a matching degree score between the target voiceprint feature and each scene voiceprint feature is obtained respectively. Then, each matching degree score is normalized by using the first matching information and the second matching information to obtain a normalized score corresponding to each scene voiceprint feature. Finally, a target speech synthesis model matched with the target voiceprint feature is determined from the model library according to all the normalized scores. In this way, the matched target speech synthesis model is selected from the model library according to the voiceprint feature of the target user, which is time-saving and convenient. Moreover, each matching degree score is normalized, which reduces the inaccuracy and inconsistency of the similarity evaluation process between voiceprint features and improves the accuracy of the normalized scores.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A method of model matching, comprising:
acquiring a target voiceprint characteristic corresponding to a target audio of a target user; the target audio corresponds to a fixed text;
acquiring first matching information corresponding to the target voiceprint characteristics and the anchor voiceprint library;
acquiring a plurality of scene voiceprint characteristics corresponding to the fixed text; the model library comprises a plurality of voice synthesis models, and each scene voiceprint feature corresponds to one voice synthesis model;
respectively acquiring second matching information corresponding to each scene voiceprint feature;
respectively obtaining a matching degree score of the target voiceprint characteristics and each scene voiceprint characteristic;
respectively carrying out normalization processing on each matching degree score by using the first matching information and the second matching information to obtain a normalization score corresponding to each scene voiceprint feature;
and determining a target voice synthesis model matched with the target voiceprint characteristics from the model library according to all the normalized scores.
2. The method of claim 1, wherein the method further comprises:
acquiring a template audio text set corresponding to the template audio set; the template audio set comprises a plurality of template audios, and the template audio text set comprises a plurality of template audio texts; any one template audio has a corresponding template audio text;
respectively inputting the template audio text sets into each speech synthesis model to obtain a speech synthesis audio set corresponding to each template audio text set; the voice synthesis audio set comprises a plurality of voice synthesis audios; one speech synthesis audio corresponding to each template audio text corresponds to one speech synthesis model;
performing feature extraction on the voice synthesis audio set and the template audio set by using a voiceprint model to obtain a voiceprint feature set so as to form the anchor voiceprint library; wherein the anchor voiceprint library comprises voiceprint features of each of the speech synthesis audios and voiceprint features of each of the template audios.
3. The method of claim 1, wherein the anchor voiceprint library comprises a plurality of voiceprint features, the first matching information comprises a first mean and a first standard deviation, and the step of obtaining the first matching information for the target voiceprint feature corresponding to the anchor voiceprint library comprises:
matching the target voiceprint features with each voiceprint feature in an anchor voiceprint library respectively to obtain a plurality of first matching scores corresponding to the target voiceprint features; each of the first match scores corresponds to one of the voiceprint features in the anchor voiceprint library;
K first matching scores are selected from the plurality of first matching scores, and the first mean value and the first standard deviation are calculated based on the K first matching scores.
4. The method of claim 1, wherein the step of obtaining a plurality of scene voiceprint features corresponding to the fixed text comprises:
respectively inputting the fixed text into each voice synthesis model to obtain a plurality of target voice synthesis audios;
and respectively extracting the characteristics of each target voice synthesis audio by using a voiceprint model to obtain a plurality of scene voiceprint characteristics.
5. The method according to claim 1, wherein the anchor voiceprint library includes a plurality of voiceprint features, the second matching information includes a second mean and a second standard deviation of each of the scene voiceprint features, and the step of respectively obtaining the second matching information corresponding to each of the scene voiceprint features includes:
matching each scene voiceprint feature with each voiceprint feature in an anchor voiceprint library one by one to obtain a plurality of second matching scores corresponding to each scene voiceprint feature;
and aiming at each scene voiceprint feature, selecting K second matching scores from a plurality of second matching scores corresponding to each scene voiceprint feature, and calculating a second mean value and a second standard deviation of the scene voiceprint feature based on the K second matching scores.
6. A model matching device is characterized by comprising a first acquisition module, a second acquisition module and a processing module;
the first obtaining module is configured to:
acquiring a target voiceprint characteristic corresponding to a target audio of a target user; the target audio corresponds to a fixed text;
acquiring first matching information corresponding to the target voiceprint characteristics and the anchor voiceprint library;
the second obtaining module is configured to:
acquiring a plurality of scene voiceprint characteristics corresponding to the fixed text; the model library comprises a plurality of voice synthesis models, and each scene voiceprint feature corresponds to one voice synthesis model;
respectively acquiring second matching information corresponding to each scene voiceprint feature;
respectively obtaining a matching degree score of the target voiceprint characteristics and each scene voiceprint characteristic;
the processing module is configured to:
respectively carrying out normalization processing on each matching degree score by using the first matching information and the second matching information to obtain a normalization score corresponding to each scene voiceprint feature;
and determining a target voice synthesis model matched with the target voiceprint characteristics from the model library according to all the normalized scores.
7. The apparatus of claim 6, wherein the first obtaining module is further to:
acquiring a template audio text set corresponding to the template audio set; the template audio set comprises a plurality of template audios, and the template audio text set comprises a plurality of template audio texts; any one template audio has a corresponding template audio text;
respectively inputting the template audio text sets into each speech synthesis model to obtain a speech synthesis audio set corresponding to each template audio text set; the voice synthesis audio set comprises a plurality of voice synthesis audios; one speech synthesis audio corresponding to each template audio text corresponds to one speech synthesis model;
performing feature extraction on the voice synthesis audio set and the template audio set by using a voiceprint model to obtain a voiceprint feature set so as to form the anchor voiceprint library; and the anchor point voiceprint library comprises the voiceprint characteristics of each voice synthesis audio and the voiceprint characteristics of each template audio.
8. The apparatus of claim 6, wherein the first obtaining module is specifically configured to:
respectively inputting the fixed text into each voice synthesis model to obtain a plurality of target voice synthesis audios;
and respectively extracting the characteristics of each target voice synthesis audio by using a voiceprint model to obtain a plurality of scene voiceprint characteristics.
9. An electronic device, comprising: a memory storing machine-readable instructions executable by the processor, and a processor executing the machine-readable instructions to implement the method of any one of claims 1 to 5 when the electronic device is run.
10. A readable storage medium, characterized in that it stores a computer program which is executed by a processor to implement the method of any one of claims 1 to 5.
CN202210642792.7A 2022-06-08 2022-06-08 Model matching method and device, electronic equipment and readable storage medium Pending CN115064149A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210642792.7A CN115064149A (en) 2022-06-08 2022-06-08 Model matching method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210642792.7A CN115064149A (en) 2022-06-08 2022-06-08 Model matching method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115064149A true CN115064149A (en) 2022-09-16

Family

ID=83200793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210642792.7A Pending CN115064149A (en) 2022-06-08 2022-06-08 Model matching method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115064149A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination