CN117275526A - Evaluation method and device of speech synthesis system, storage medium and computing device


Info

Publication number
CN117275526A
CN117275526A
Authority
CN
China
Prior art keywords
similarity
voice
synthesis system
evaluation
speech
Prior art date
Legal status
Pending
Application number
CN202311057123.4A
Other languages
Chinese (zh)
Inventor
沈伟林 (Shen Weilin)
周邦建 (Zhou Bangjian)
Current Assignee
Huayuan Computing Technology Shanghai Co ltd
Original Assignee
Huayuan Computing Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Huayuan Computing Technology Shanghai Co ltd
Priority to CN202311057123.4A
Publication of CN117275526A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/69: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an evaluation method and device for a speech synthesis system, a storage medium, and a computing device, the speech synthesis system being used to generate synthesized voice that restores a target tone color. The evaluation method comprises the following steps: acquiring user voice having the target tone color and at least one segment of synthesized voice; extracting a first voiceprint feature of the user voice and a second voiceprint feature of the at least one segment of synthesized voice; calculating the similarity between the first voiceprint feature and the second voiceprint feature to obtain an evaluation similarity; and evaluating the speech synthesis system according to the evaluation similarity. The method and device can evaluate the similarity between the synthesized voice and the target voice automatically, objectively, and accurately.

Description

Evaluation method and device of speech synthesis system, storage medium and computing device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for evaluating a speech synthesis system, a storage medium, and a computing device.
Background
Speech synthesis, or text-to-speech (TTS), is a technique that converts text into speech, so that the content to be expressed can be spoken in different tone colors (timbres). How closely the synthesized voice resembles the target speaker's voice reflects how well the speech synthesis system restores the target speaker's tone color and speaking style, and is therefore an important indicator of the quality of the system.
Conventional speaker-similarity assessment uses subjective methods such as the Mean Opinion Score (MOS). Evaluators are asked to score the overall similarity of the voice on a scale from 1 to 5, where a higher score indicates better similarity. This subjective evaluation of voice similarity is widely used to assess the capability of speech synthesis systems.
However, subjective evaluation has the following problems: 1) it requires a large number of evaluators, making the whole process time-consuming and labor-intensive; 2) it is not objective: the scores given by human evaluators can be disturbed by many factors, for example differing instructions, and each evaluator's own internal criteria can influence the result.
Disclosure of Invention
The present application can evaluate the similarity between synthesized voice and the target voice automatically, objectively, and accurately.
In order to achieve the above purpose, the present application provides the following technical solutions:
in a first aspect, there is provided an evaluation method of a speech synthesis system, the speech synthesis system being used to generate synthesized voice that restores a target tone color, the evaluation method comprising: acquiring user voice having the target tone color and at least one segment of synthesized voice; extracting a first voiceprint feature of the user voice and a second voiceprint feature of the at least one segment of synthesized voice; calculating the similarity between the first voiceprint feature and the second voiceprint feature to obtain an evaluation similarity; and evaluating the speech synthesis system according to the evaluation similarity.
Optionally, the calculating the similarity between the first voiceprint feature and the second voiceprint feature includes: respectively calculating a first similarity between the first voiceprint feature and the second voiceprint feature of each segment of synthesized voice in a plurality of segments of synthesized voice; and calculating the average value of the plurality of first similarities as the evaluation similarity.
Optionally, the evaluating the speech synthesis system according to the evaluation similarity includes: calculating a second similarity between the user speech and a reference user speech, wherein the reference user speech and the user speech are from the same user; and evaluating the voice synthesis system according to the difference between the evaluation similarity and the second similarity.
Optionally, the smaller the difference between the evaluation similarity and the second similarity, the better the performance of the speech synthesis system.
Optionally, the evaluating the speech synthesis system according to the evaluation similarity includes: and calculating the performance grade of the voice synthesis system according to the evaluation similarity, wherein the evaluation similarity is positively correlated with the performance grade.
Optionally, the extracting the first voiceprint feature of the user voice and the second voiceprint feature of the at least one segment of synthesized voice includes: and extracting the first voiceprint features and the second voiceprint features by adopting a voiceprint recognition model which is trained in advance.
Optionally, after the step of acquiring the user voice having the target tone color and at least one segment of synthesized voice, the method further includes: preprocessing the user voice and the at least one segment of synthesized voice, the preprocessing including one or both of: volume normalization and silence removal.
In a second aspect, the present application also discloses an evaluation device of a speech synthesis system for generating a synthesized speech for restoring a target tone, the evaluation device comprising: the acquisition module is used for acquiring the user voice with the target tone and at least one section of synthesized voice; the extraction module is used for extracting the first voiceprint feature of the user voice and the second voiceprint feature of the at least one section of synthesized voice; the computing module is used for computing the similarity between the first voiceprint feature and the second voiceprint feature to obtain evaluation similarity; and the evaluation module is used for evaluating the voice synthesis system according to the evaluation similarity.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program for execution by a processor to perform the method provided by the first aspect.
In a fourth aspect, there is provided an evaluation device for a speech synthesis system comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, the processor running the computer program to perform any one of the methods provided in the first aspect.
In a fifth aspect, there is provided a computer program product having a computer program stored thereon, the computer program being executable by a processor to perform the method provided by the first aspect.
In a sixth aspect, embodiments of the present application further provide a chip (or data transmission device) on which a computer program is stored; when the computer program is executed by the chip, the steps of the method described above are implemented.
In a seventh aspect, an embodiment of the present application further provides a system chip, where the system chip is applied to a terminal, where the system chip includes at least one processor and an interface circuit, where the interface circuit and the at least one processor are interconnected by a line, and the at least one processor is configured to execute instructions to perform a method provided in the first aspect.
Compared with the prior art, the technical scheme of the application has the following beneficial effects:
in the technical solution, user voice having a target tone color and at least one segment of synthesized voice are acquired; a first voiceprint feature of the user voice and a second voiceprint feature of the at least one segment of synthesized voice are extracted; the similarity between the first voiceprint feature and the second voiceprint feature is calculated to obtain an evaluation similarity; and the speech synthesis system is evaluated according to the evaluation similarity. Evaluating the similarity between the synthesized voice and the user voice with voiceprints makes the evaluation automatic and objective, overcoming the subjectivity of the traditional method; moreover, because the evaluation similarity measures how well the speech synthesis system restores the user's tone color and speaking style, the performance of the system can be evaluated objectively and accurately.
Further, a second similarity between the user voice and a reference user voice is calculated, where the reference user voice and the user voice come from the same user, and the speech synthesis system is evaluated according to the difference between the evaluation similarity and the second similarity. Instead of judging the system directly from the evaluation similarity between the user voice and the synthesized voice, this scheme compares the evaluation similarity against the second similarity, which avoids errors caused by natural variation in the user's tone color and further improves the accuracy of the performance evaluation.
Drawings
FIG. 1 is a flow chart of a method for evaluating a speech synthesis system according to an embodiment of the present application;
FIG. 2 is a block diagram of a speech synthesis system and an evaluation device according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of another method of evaluating a speech synthesis system provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an evaluation device of a speech synthesis system according to an embodiment of the present application;
fig. 5 is a schematic diagram of a hardware structure of an evaluation device of a speech synthesis system according to an embodiment of the present application.
Detailed Description
As described in the background, subjective evaluation has the following problems: 1) it requires a large number of evaluators, making the whole process time-consuming and labor-intensive; 2) it is not objective: the scores given by human evaluators can be disturbed by many factors, for example differing instructions, and each evaluator's own internal criteria can influence the result.
The present application provides a method that uses voiceprints to evaluate the similarity between synthesized voice and user voice, achieving automatic, objective evaluation of that similarity and overcoming the subjectivity of the traditional method. Moreover, because the evaluation similarity measures how well the speech synthesis system restores the user's tone color and speaking style, the performance of the system can be evaluated objectively and accurately.
In order to make the above objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
Referring to fig. 1, the method provided in the present application specifically includes the following steps:
step 101: acquiring user voice with target tone color and at least one section of synthesized voice;
step 102: extracting a first voiceprint feature of user voice and a second voiceprint feature of at least one section of synthesized voice;
step 103: calculating the similarity of the first voiceprint feature and the second voiceprint feature to obtain an evaluation similarity;
step 104: and evaluating the speech synthesis system according to the evaluation similarity.
It should be noted that the serial numbers of the steps in the present embodiment do not represent a limitation on the execution sequence of the steps.
It will be appreciated that in a specific implementation, the method for evaluating a speech synthesis system may be implemented in a software program running on a processor integrated within a chip or a chip module. The method may also be implemented by combining software with hardware, which is not limited in this application.
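As a concrete illustration of steps 101 to 104, the following is a minimal Python sketch; the extract_voiceprint helper is a hypothetical placeholder for any pre-trained speaker-embedding model (one publicly available option is sketched under step 102 below), and cosine similarity is the measure discussed under step 103:

```python
import numpy as np
from scipy.spatial.distance import cosine as cosine_distance

def extract_voiceprint(wav: np.ndarray) -> np.ndarray:
    # Hypothetical placeholder for a pre-trained voiceprint model (step 102).
    raise NotImplementedError

def evaluate(user_wav: np.ndarray, synth_wav: np.ndarray) -> float:
    # Step 101 has supplied user voice with the target tone color plus a
    # synthesized segment; step 102 extracts the two voiceprint features.
    first_feature = extract_voiceprint(user_wav)
    second_feature = extract_voiceprint(synth_wav)
    # Step 103: similarity of the two features gives the evaluation similarity
    # (scipy's cosine() is a distance, so similarity = 1 - distance).
    evaluation_similarity = 1.0 - cosine_distance(first_feature, second_feature)
    # Step 104: the score (or a grade derived from it) evaluates the system.
    return evaluation_similarity
```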
Referring to fig. 2, fig. 2 shows an architecture diagram of a speech synthesis system and an evaluation device. The speech synthesis system 20 is configured to generate synthesized voice that restores the target tone color. The evaluation device 30 is used to evaluate the performance of the speech synthesis system 20 from the perspective of tone color, i.e., whether the synthesized voice generated by the speech synthesis system 20 is close to the target tone color.
Specifically, the target tone color is input to the speech synthesis system 20, and the speech synthesis system 20 may generate a synthesized speech that restores the target tone color.
Specifically, the evaluation device 30 may perform the steps of the method shown in fig. 1 to evaluate the performance of the speech synthesis system 20. The synthesized voice and the user voice having the target tone color are input to the evaluation device 30, and the evaluation device 30 outputs an evaluation result. The evaluation result may be a performance parameter of the speech synthesis system, for example a performance level or a performance score.
With continued reference to fig. 1, in step 101, the synthesized voice has a duration greater than or equal to a threshold, for example 10 seconds; accordingly, the duration of the user voice is also greater than or equal to the threshold, for example 10 seconds.
Further, to ensure the accuracy of subsequent voiceprint feature extraction, the synthesized voice and the user voice may be preprocessed. Specifically, the preprocessing includes one or both of: volume normalization and silence removal.
In the embodiment of the invention, normalizing the volume of the synthesized voice and the user voice reduces the influence of loudness on the extracted voiceprints; removing silent segments discards invalid data, reduces the amount of subsequent computation, and improves evaluation efficiency.
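As a minimal sketch of both preprocessing steps (assuming 16 kHz mono floating-point waveforms; the frame length and energy threshold below are illustrative choices, not values from this application):

```python
import numpy as np

def normalize_volume(wav: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    # Volume normalization: rescale so every clip has the same peak
    # amplitude, reducing the influence of recording level on the voiceprint.
    peak = np.max(np.abs(wav))
    return wav if peak == 0 else wav * (target_peak / peak)

def remove_silence(wav: np.ndarray, sr: int = 16000,
                   frame_ms: int = 25, energy_thresh: float = 1e-4) -> np.ndarray:
    # Silence removal: drop frames whose mean energy falls below a threshold,
    # discarding invalid data before voiceprint extraction.
    frame_len = int(sr * frame_ms / 1000)
    frames = [wav[i:i + frame_len] for i in range(0, len(wav), frame_len)]
    voiced = [f for f in frames if np.mean(f ** 2) > energy_thresh]
    return np.concatenate(voiced) if voiced else wav
```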
In step 102, the evaluation device 30 extracts a first voiceprint feature of the user voice and a second voiceprint feature of each of the at least one segment of synthesized voice.
In implementations, the first voiceprint feature and the second voiceprint feature can be extracted using a pre-trained voiceprint recognition model.
Specifically, the voiceprint recognition model can be trained as follows: construct training data comprising a plurality of synthesized voice samples and their corresponding voiceprint features, together with a plurality of user voice samples and their corresponding voiceprint features; then train the voiceprint recognition model on the training data.
It should be noted that the voiceprint recognition model can be built with any suitable neural network algorithm, which is not limited in this application.
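The application does not fix a particular network; purely as an illustration, a publicly available speaker-verification encoder can play the role of the pre-trained voiceprint recognition model. The sketch below uses SpeechBrain's ECAPA-TDNN speaker encoder; the model source string and API are SpeechBrain's, assumed here for illustration, and are not part of this application:

```python
import torch
from speechbrain.pretrained import EncoderClassifier  # pip install speechbrain

# Publicly available speaker encoder standing in for the application's
# pre-trained voiceprint recognition model (an illustrative assumption).
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def extract_voiceprint(wav: torch.Tensor) -> torch.Tensor:
    # wav: (1, num_samples) mono waveform at 16 kHz; returns an embedding vector.
    with torch.no_grad():
        return encoder.encode_batch(wav).squeeze()
```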
In steps 103 and 104, the evaluation similarity is obtained from the similarity between the first voiceprint feature and the second voiceprint feature and is used to evaluate the speech synthesis system 20; that is, the evaluation result of the speech synthesis system 20 is calculated from the evaluation similarity.
Specifically, cosine similarity of the first voiceprint feature and the second voiceprint feature may be calculated as the evaluation similarity.
In one non-limiting embodiment, a plurality of segments of synthesized voice may be obtained, since different voice segments carry some error and variation. When calculating the evaluation similarity, a first similarity between the user voice and each segment of synthesized voice is calculated, and the average value of the plurality of first similarities is taken as the evaluation similarity.
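A sketch of this multi-segment averaging, assuming the voiceprint features are numpy vectors:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two voiceprint feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluation_similarity(first_feat: np.ndarray, second_feats: list) -> float:
    # One first similarity per synthesized segment; their average is
    # taken as the evaluation similarity.
    first_sims = [cosine(first_feat, f) for f in second_feats]
    return sum(first_sims) / len(first_sims)
```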
In one embodiment, the evaluation similarity may be used directly to evaluate the speech synthesis system 20. Specifically, the greater the evaluation similarity, the better the performance of the speech synthesis system 20.
Specifically, the performance level of the speech synthesis system 20 is calculated from the evaluation similarity, which is positively correlated with the performance level. For example, an evaluation similarity of 0.95 to 1 corresponds to level 1; 0.9 to 0.95 to level 2; 0.85 to 0.9 to level 3; and so on.
It can be understood that the specific number of levels and the mapping relationship between the evaluation similarity and the levels can be set according to the actual application scenario, which is not limited in this application.
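A sketch of the similarity-to-level mapping using the example bands above (the bands beyond level 3 are extrapolated from the "and so on" and are illustrative):

```python
def performance_level(evaluation_similarity: float) -> int:
    # Level 1 is the best grade; each 0.05 drop in similarity lowers the
    # level by one, following the example bands in the text.
    lower_bounds = [0.95, 0.90, 0.85, 0.80]  # levels 1..4
    for level, bound in enumerate(lower_bounds, start=1):
        if evaluation_similarity >= bound:
            return level
    return len(lower_bounds) + 1  # everything below the last band
```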
In another embodiment, the evaluation similarity may be compared to a threshold: if the evaluation similarity is above the threshold, the synthesized voice may be rated as similar to the user voice and the speech synthesis system 20 as performing well; if it is below the threshold, the synthesized voice may be rated as dissimilar from the user voice and the performance of the speech synthesis system 20 as poor.
In yet another embodiment, a second similarity between the user voice and a reference user voice is calculated, the reference user voice and the user voice coming from the same user, and the speech synthesis system is evaluated according to the difference between the evaluation similarity and the second similarity. Specifically, the smaller the difference between the evaluation similarity and the second similarity, the better the performance of the speech synthesis system 20; conversely, the greater the difference, the poorer the performance.
Specifically, the difference between the evaluation similarity and the second similarity may be compared to a threshold: a difference above the threshold indicates poor performance of the speech synthesis system 20, and a difference below the threshold indicates good performance.
For example, if the second similarity between the user voice and the reference user voice is 0.9 and the evaluation similarity between the synthesized voice and the user voice is 0.8, the difference is 0.1 and the speech synthesis system 20 performs well; if the evaluation similarity is instead 0.6, the difference is 0.3 and the performance of the speech synthesis system 20 is poor.
In this embodiment, by selecting a reference user voice with the same target tone color, computing the second similarity between the user voice and the reference user voice, and evaluating the system from the difference between the evaluation similarity and the second similarity, errors caused by natural variation in the user's tone color are avoided, making the evaluation of the speech synthesis system more objective and accurate.
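A sketch of this reference-based check, again assuming numpy feature vectors; the 0.15 decision threshold is illustrative only:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_reference_check(user_feat: np.ndarray, ref_user_feat: np.ndarray,
                           evaluation_similarity: float,
                           diff_threshold: float = 0.15) -> bool:
    # Second similarity: two recordings of the same user, i.e. how similar
    # genuine speech from the target speaker is to itself.
    second_similarity = cosine(user_feat, ref_user_feat)
    # The smaller the gap between the evaluation similarity and this natural
    # bound, the better the speech synthesis system performs.
    return abs(second_similarity - evaluation_similarity) < diff_threshold
```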
Referring to fig. 3, fig. 3 shows a specific flow of another evaluation method of the speech synthesis system.
In step 301, user voice of at least 10 seconds and synthesized voice of at least 10 seconds are read.
In step 302, the user voice and the synthesized voice are volume-normalized and silent segments are removed.
In step 303, voiceprint feature vectors of the user voice and the multi-segment synthesized voice are extracted using the voiceprint recognition model, resulting in a first voiceprint feature and a plurality of second voiceprint features.
In step 304, the similarity between the first voiceprint feature and each of the plurality of second voiceprint features is calculated, and the average is taken as the evaluation similarity.
In step 305, the speech synthesis system is evaluated based on the evaluation similarity.
For more specific implementations of the embodiments of the present application, please refer to the foregoing embodiments, and the details are not repeated here.
The above embodiments evaluate the performance of the speech synthesis system from the perspective of tone color; the performance may also be evaluated along further dimensions, such as the word accuracy of the synthesized speech and the naturalness of the synthesized speech.
In one non-limiting embodiment, the speech synthesis system is used to synthesize more than 10 seconds of speech from a target text. The above evaluation method may further include the following steps: converting the synthesized speech into a first text, and comparing whether each word in the first text is consistent with the corresponding word in the target text. Specifically, the number of consistent words is counted and the ratio of this number to the total number of words in the target text is calculated; the larger the ratio, the better the performance of the speech synthesis system.
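A sketch of this word-accuracy check; the speech-to-text step that produces the first text is assumed to be an external recognizer, and difflib's longest-matching-block alignment is one simple way to count characters that match in order (for Chinese text, each character is effectively one word):

```python
import difflib

def word_accuracy(first_text: str, target_text: str) -> float:
    # Align the recognized transcript of the synthesized speech against the
    # target text and count the characters they agree on, in order.
    matcher = difflib.SequenceMatcher(None, target_text, first_text)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Ratio of consistent characters to the total in the target text;
    # the larger the ratio, the better the system's word accuracy.
    return matched / len(target_text)
```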
In another non-limiting embodiment, first prosodic features of the synthesized speech, such as the fundamental frequency and the rhythm, and corresponding second prosodic features of the user voice may be extracted. Comparing the first prosodic features with the second prosodic features yields a prosodic similarity that can serve as an evaluation index of the naturalness of the synthesized speech.
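One possible prosodic comparison, sketched below, extracts fundamental-frequency (F0) contours with librosa's pYIN implementation and compares their summary statistics; the choice of statistics and the scoring formula are illustrative assumptions, not part of this application:

```python
import numpy as np
import librosa

def f0_stats(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    # F0 contour via pYIN; unvoiced frames come back as NaN and are dropped.
    f0, _, _ = librosa.pyin(wav, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'), sr=sr)
    voiced = f0[~np.isnan(f0)]
    return np.array([np.mean(voiced), np.std(voiced)])

def prosody_similarity(synth_wav: np.ndarray, user_wav: np.ndarray,
                       sr: int = 16000) -> float:
    # Compare the first prosodic features (synthesized) with the second
    # (user); closer F0 mean/spread yields a higher score in (0, 1].
    a, b = f0_stats(synth_wav, sr), f0_stats(user_wav, sr)
    return float(1.0 / (1.0 + np.linalg.norm(a - b) / np.linalg.norm(b)))
```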
It should be noted that, in practical application, different combinations may be selected according to practical application requirements to evaluate the performance of the speech synthesis system, which is not limited in this application.
Referring to fig. 4, fig. 4 shows an evaluation device 40 of a speech synthesis system, where the evaluation device 40 of the speech synthesis system may include:
an acquisition module 401, configured to acquire a user voice with a target tone color and at least one segment of synthesized voice;
an extraction module 402, configured to extract a first voiceprint feature of the user voice and a second voiceprint feature of the at least one segment of synthesized voice;
a calculating module 403, configured to calculate a similarity between the first voiceprint feature and the second voiceprint feature to obtain an estimated similarity;
and the evaluation module 404 is configured to evaluate the speech synthesis system according to the evaluation similarity.
In a specific implementation, the above-mentioned evaluation device 40 may correspond to a chip in a terminal device that carries the evaluation function, such as a System-On-a-Chip (SOC) or a baseband chip; or to a chip module in the terminal device that contains such a chip; or to a chip module having a chip with a data processing function; or to the terminal device itself.
Other relevant descriptions of the evaluation device 40 may refer to those in the foregoing embodiments, and will not be repeated here.
Each of the apparatuses and each of the modules/units contained in the products described in the above embodiments may be a software module/unit, a hardware module/unit, or a combination of both. For example:
• For a device or product applied to or integrated on a chip, its modules/units may all be implemented in hardware such as circuits; or at least some modules/units may be implemented as a software program running on a processor integrated inside the chip, with the remaining (if any) implemented in hardware such as circuits.
• For a device or product applied to or integrated in a chip module, its modules/units may all be implemented in hardware such as circuits, and different modules/units may be located in the same component (e.g., a chip or a circuit module) or in different components of the chip module; or at least some modules/units may be implemented as a software program running on a processor integrated inside the chip module, with the remaining (if any) implemented in hardware such as circuits.
• For a device or product applied to or integrated in a terminal device, its modules/units may all be implemented in hardware such as circuits, and different modules/units may be located in the same component (e.g., a chip or a circuit module) or in different components of the terminal device; or at least some modules/units may be implemented as a software program running on a processor integrated within the terminal device, with the remaining (if any) implemented in hardware such as circuits.
The embodiment of the application also discloses a storage medium, which is a computer-readable storage medium on which a computer program is stored; when run by a processor, the computer program can execute the steps of the methods shown in fig. 1 to fig. 3. The storage medium may include read-only memory (ROM), random access memory (RAM), magnetic or optical disks, and the like, and may also include non-volatile or non-transitory memory.
Referring to fig. 5, the embodiment of the application further provides a hardware structure schematic diagram of the communication device. The apparatus comprises a processor 501, a memory 502 and a transceiver 503.
The processor 501 may be a general purpose central processing unit (central processing unit, CPU), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in accordance with aspects of the present application. The processor 501 may also include multiple CPUs, and the processor 501 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, or processing cores for processing data (e.g., computer program instructions).
The memory 502 may be a ROM or another type of static storage device capable of storing static information and instructions, a RAM or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer; the embodiments of the present application do not limit this in any way. The memory 502 may be separate (located outside or inside the apparatus) or integrated with the processor 501, and may contain computer program code. The processor 501 executes the computer program code stored in the memory 502, thereby implementing the methods provided in the embodiments of the present application.
The processor 501, the memory 502 and the transceiver 503 are connected by a bus. The transceiver 503 is used to communicate with other devices or communication networks. Alternatively, the transceiver 503 may include a transmitter and a receiver. The means for implementing the receiving function in the transceiver 503 may be regarded as a receiver for performing the steps of receiving in the embodiments of the present application. The means for implementing the transmitting function in the transceiver 503 may be regarded as a transmitter for performing the steps of transmitting in the embodiments of the present application.
When the schematic structural diagram shown in fig. 5 is used to illustrate the structure of the terminal device of the above embodiments, the processor 501 is configured to control and manage the actions of the terminal device; for example, the processor 501 supports the terminal device in performing the steps shown in fig. 1 or fig. 3 and/or the actions performed by the terminal device in the other processes described in the embodiments of the present application. The processor 501 may communicate with other network entities, for example the network devices described above, through the transceiver 503. The memory 502 is used to store the program code and data of the terminal device.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In this context, the character "/" indicates that the front and rear associated objects are an "or" relationship.
The term "plurality" as used in the embodiments herein refers to two or more.
The descriptions "first", "second", etc. in the embodiments of the present application are used only to illustrate and distinguish the described objects; they imply no ordering, place no particular limit on the number of devices in the embodiments, and should not be construed as limiting the embodiments in any way.
The "connection" in the embodiments of the present application refers to various connection manners such as direct connection or indirect connection, so as to implement communication between devices, which is not limited in any way in the embodiments of the present application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any other combination. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with the embodiments of the present application are all or partially produced. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center by wired or wireless means.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus, and system may be implemented in other manners. For example, the device embodiments described above are merely illustrative; for example, the division of the units is only one logic function division, and other division modes can be adopted in actual implementation; for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be physically included separately, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform part of the steps of the methods described in the embodiments of the present application.
Although the present application is disclosed above, the present application is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention shall be defined by the appended claims.

Claims (10)

1. An evaluation method of a speech synthesis system, the speech synthesis system being used for generating synthesized voice that restores a target tone color, the evaluation method comprising:
acquiring user voice having the target tone color and at least one segment of synthesized voice;
extracting a first voiceprint feature of the user voice and a second voiceprint feature of the at least one segment of synthesized voice;
calculating the similarity of the first voiceprint feature and the second voiceprint feature to obtain an evaluation similarity;
and evaluating the voice synthesis system according to the evaluation similarity.
2. The method of claim 1, wherein calculating the similarity of the first voiceprint feature to the second voiceprint feature comprises:
respectively calculating a first similarity between the first voiceprint feature and the second voiceprint feature of each segment of synthesized voice in a plurality of segments of synthesized voice;
an average value of the plurality of first similarities is calculated as the evaluation similarity.
3. The method of evaluating a speech synthesis system according to claim 1, wherein the evaluating the speech synthesis system according to the evaluation similarity comprises:
calculating a second similarity between the user speech and a reference user speech, wherein the reference user speech and the user speech are from the same user;
and evaluating the voice synthesis system according to the difference between the evaluation similarity and the second similarity.
4. The method according to claim 3, wherein the smaller the difference between the evaluation similarity and the second similarity, the better the performance of the speech synthesis system.
5. The method of evaluating a speech synthesis system according to claim 1, wherein the evaluating the speech synthesis system according to the evaluation similarity comprises:
and calculating the performance grade of the voice synthesis system according to the evaluation similarity, wherein the evaluation similarity is positively correlated with the performance grade.
6. The method of claim 1, wherein the extracting the first voiceprint feature of the user's voice and the second voiceprint feature of the at least one segment of synthesized voice comprises:
and extracting the first voiceprint features and the second voiceprint features by adopting a voiceprint recognition model which is trained in advance.
7. The method for evaluating a speech synthesis system according to claim 1, wherein after the step of acquiring the user voice having the target tone color and at least one segment of synthesized voice, the method further comprises:
preprocessing the user voice and the at least one segment of synthesized voice, the preprocessing including one or both of: volume normalization and silence removal.
8. An evaluation device of a speech synthesis system, the speech synthesis system being used for generating synthesized voice that restores a target tone color, the evaluation device comprising:
the acquisition module is used for acquiring the user voice having the target tone color and at least one segment of synthesized voice;
the extraction module is used for extracting the first voiceprint feature of the user voice and the second voiceprint feature of the at least one section of synthesized voice;
the computing module is used for computing the similarity between the first voiceprint feature and the second voiceprint feature to obtain evaluation similarity;
and the evaluation module is used for evaluating the voice synthesis system according to the evaluation similarity.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the evaluation method of a speech synthesis system according to any one of claims 1 to 7.
10. A computing device comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, characterized in that the processor, when executing the computer program, performs the steps of the method of evaluating a speech synthesis system according to any of claims 1 to 7.
CN202311057123.4A 2023-08-21 2023-08-21 Evaluation method and device of speech synthesis system, storage medium and computing device Pending CN117275526A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311057123.4A CN117275526A (en) 2023-08-21 2023-08-21 Evaluation method and device of speech synthesis system, storage medium and computing device


Publications (1)

Publication Number Publication Date
CN117275526A 2023-12-22

Family

ID=89216824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311057123.4A Pending CN117275526A (en) 2023-08-21 2023-08-21 Evaluation method and device of speech synthesis system, storage medium and computing device

Country Status (1)

Country Link
CN (1) CN117275526A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination