CN111312210B - Image-text fused speech synthesis method and device - Google Patents

Image-text fused speech synthesis method and device

Info

Publication number
CN111312210B
CN111312210B CN202010145198.8A CN202010145198A
Authority
CN
China
Prior art keywords
text
features
visual information
module
synthesizing
Prior art date
Legal status
Active
Application number
CN202010145198.8A
Other languages
Chinese (zh)
Other versions
CN111312210A (en)
Inventor
张晋
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202010145198.8A
Publication of CN111312210A
Application granted
Publication of CN111312210B
Active legal status: Current
Anticipated expiration legal status

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers

Abstract

The invention discloses an image-text fused speech synthesis method and device. The method comprises the following steps: acquiring text information features of a current sports text; determining a video picture corresponding to the current sports text and acquiring visual information features of the video picture; fusing the text information features and the visual information features to obtain multi-modal features; and synthesizing the target speech based on the multi-modal features. Because the text information features and the visual information features are fused into multi-modal features before speech synthesis, the synthesized speech is no longer uniform and monotonous: different speech can be synthesized for different scene states. This solves the problems in the prior art that, because only single-modal text features are used, accurate and flexible customized speech synthesis cannot be achieved for the synthesis scene, the synthesis result becomes monotonous, and special situations cannot be handled.

Description

Image-text fused speech synthesis method and device
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to an image-text fused speech synthesis method and device.
Background
With the development of speech synthesis technology and the popularization of its applications, speech synthesis services are increasingly accepted and used by users. An existing speech synthesis system is mainly divided into a front end and a back end: the front end is responsible for extracting text feature information such as word segmentation, part of speech and polyphone labeling, and the back end generates the speech from the text features extracted by the front end. End-to-end synthesis systems have emerged in order to simplify the synthesis system as much as possible and to reduce manual intervention and the requirements on linguistic background knowledge: text or ZhuYin (phonetic) characters are input directly, and the system outputs the audio waveform. However, this approach has the following disadvantage: prior-art speech synthesis methods only use single-modal text features and neglect the importance of visual information, so accurate and flexible customized speech synthesis cannot be achieved for the synthesis scene. The synthesis result becomes monotonous, and special situations cannot be handled.
Disclosure of Invention
In view of the above problems, the present invention fuses text information features and visual information features into multi-modal features and synthesizes speech from them.
An image-text fused speech synthesis method comprises the following steps:
acquiring text information features of a current sports text;
determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture;
fusing the text information features and the visual information features to obtain multi-modal features;
synthesizing the target speech according to the multi-modal features.
Preferably, the acquiring the text information features of the current sports text includes:
constructing a preset dictionary;
acquiring each character in the current sports text;
encoding each of the characters into a one-hot encoded vector;
and acquiring the text information features of the one-hot encoded vectors by using an embedding layer of the preset dictionary.
Preferably, the determining a video picture corresponding to the current sports text according to the current sports text and acquiring the visual information features of the video picture includes:
determining a sports event video corresponding to the current sports text according to the current sports text;
acquiring n frames corresponding to the sports event video;
acquiring n pieces of visual information from the n frames by using an image encoder;
and combining the n pieces of visual information to determine the visual information features of the video picture.
Preferably, the fusing the text information feature and the visual information feature to obtain a multi-modal feature includes:
generating a weight proportion by using the visual information characteristics;
carrying out weighting processing on the text information characteristics by utilizing the weight proportion;
and determining the text information features after the weighting processing as the multi-modal features.
Preferably, the synthesizing the target speech according to the multi-modal features includes:
decoding the multi-modal features with an attention-based decoder;
generating a spectrum from the decoded multi-modal features by using a post-processing module;
and synthesizing the target voice according to the frequency spectrum.
An image-text fused speech synthesis device, the device comprising:
the first acquisition module is used for acquiring the text information features of the current sports text;
the second acquisition module is used for determining a video picture corresponding to the current sports text according to the current sports text and acquiring visual information features of the video picture;
the fusion module is used for fusing the text information features and the visual information features to obtain multi-modal features;
and the synthesis module is used for synthesizing the target voice according to the multi-modal characteristics.
Preferably, the first obtaining module includes:
the construction submodule is used for constructing a preset dictionary;
the first acquisition sub-module is used for acquiring each character in the current sports text;
an encoding sub-module for encoding each of the characters into a one-hot encoded vector;
and the second obtaining submodule is used for obtaining the text information features of the one-hot encoded vectors by using the embedding layer of the preset dictionary.
Preferably, the second obtaining module includes:
the first determining submodule is used for determining the sports event video corresponding to the current sports text according to the current sports text;
the third acquisition submodule is used for acquiring n frames corresponding to the sports event video;
a fourth obtaining sub-module, configured to obtain n pieces of visual information of the n frames by using an image encoder;
and the combining sub-module is used for combining the n pieces of visual information to determine the visual information characteristics of the video pictures.
Preferably, the fusion module includes:
the first generation submodule is used for generating a weight proportion by utilizing the visual information characteristics;
the processing submodule is used for carrying out weighting processing on the text information characteristics by utilizing the weight proportion;
and the second determining sub-module is used for determining the text information features after the weighting processing as the multi-modal features.
Preferably, the synthesis module comprises:
a decoding sub-module for decoding the multi-modal features with an attention-based decoder;
the second generation submodule is used for generating a spectrum from the decoded multi-modal features by using the post-processing module;
and the synthesis sub-module is used for synthesizing the target voice according to the frequency spectrum.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
Fig. 1 is a workflow chart of the image-text fused speech synthesis method provided by the present invention;
fig. 2 is another workflow chart of the image-text fused speech synthesis method provided by the present invention;
fig. 3 is a further workflow chart of the image-text fused speech synthesis method provided by the present invention;
fig. 4 is a structural diagram of the image-text fused speech synthesis apparatus provided by the present invention;
fig. 5 is another structural diagram of the image-text fused speech synthesis apparatus provided by the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
With the development of speech synthesis technology and the popularization of its applications, speech synthesis services are increasingly accepted and used by users. An existing speech synthesis system is mainly divided into a front end and a back end: the front end is responsible for extracting text feature information such as word segmentation, part of speech and polyphone labeling, and the back end generates the speech from the text features extracted by the front end. End-to-end synthesis systems have emerged in order to simplify the synthesis system as much as possible and to reduce manual intervention and the requirements on linguistic background knowledge: text or ZhuYin (phonetic) characters are input directly, and the system outputs the audio waveform. However, this approach has the following disadvantage: prior-art speech synthesis methods only use single-modal text features and neglect the importance of visual information, so accurate and flexible customized speech synthesis cannot be achieved for the synthesis scene. The synthesis result becomes monotonous, and special situations cannot be handled. In order to solve the above problems, the present embodiment discloses a method that fuses text information features and visual information features into multi-modal features and synthesizes speech from them.
An image-text fused speech synthesis method, as shown in fig. 1, includes the following steps:
step S101, acquiring text information features of a current sports text;
step S102, determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture;
step S103, fusing the text information features and the visual information features to obtain multi-modal features;
step S104, synthesizing the target speech according to the multi-modal features.
The working principle of the technical scheme is as follows: text information features of a current sports text are acquired; a video picture corresponding to the current sports text is determined according to the current sports text, and visual information features of the video picture are acquired; the text information features and the visual information features are then fused to obtain multi-modal features; and finally the target speech is synthesized according to the multi-modal features.
The beneficial effects of the above technical scheme are: because the text information features and the visual information features are fused into multi-modal features before speech synthesis, the synthesis result is no longer uniform and monotonous, and different speech can be synthesized for different scene states. This solves the problems in the prior art that, because only single-modal text features are used, accurate and flexible customized speech synthesis cannot be achieved for the synthesis scene, the synthesis result becomes monotonous, and special situations cannot be handled.
In one embodiment, as shown in fig. 2, acquiring the text information features of the current sports text includes:
step S201, constructing a preset dictionary;
step S202, acquiring each character in the current sports text;
step S203, encoding each character into a one-hot encoding vector;
step S204, acquiring the text information features of the one-hot encoded vectors by using an embedding layer of the preset dictionary.
The beneficial effects of the above technical scheme are: encoding the characters into one-hot encoded vectors makes the text information features easier and more accurate to obtain, and because every character is encoded, no important information is missed, which avoids inaccurate text information features.
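For illustration only, the following is a minimal sketch of steps S201 to S204 in PyTorch. The toy dictionary, the embedding size and all variable names are assumptions made for this example, not details taken from the patent; in particular, the embedding layer here realizes the one-hot lookup of steps S203 and S204 in a single operation.

```python
import torch
import torch.nn as nn

# S201: a toy preset dictionary mapping each character to an index (illustrative only).
chars = ["比", "赛", "开", "始", "进", "球"]
preset_dict = {c: i for i, c in enumerate(chars)}

class CharEncoder(nn.Module):
    """Turns character indices into text information features (S203 + S204)."""
    def __init__(self, vocab_size: int, embed_dim: int = 256):
        super().__init__()
        # An embedding lookup is equivalent to multiplying a one-hot vector
        # by a learned matrix, so S203 and S204 collapse into one layer.
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) indices of the characters from S202.
        return self.embedding(char_ids)                    # (batch, seq_len, embed_dim)

text = "比赛开始"                                            # S202: the current sports text
char_ids = torch.tensor([[preset_dict[c] for c in text]])
text_features = CharEncoder(len(preset_dict))(char_ids)    # text information features
```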
In one embodiment, determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture includes:
determining a sports event video corresponding to the current sports text according to the current sports text;
acquiring n frames corresponding to a sports event video;
acquiring n visual information of n frames by using an image encoder;
combining the n pieces of visual information to determine the visual information characteristics of the video pictures;
in this embodiment, n is a positive integer of 2 or more.
The beneficial effects of the above technical scheme are: acquiring the n frames corresponding to the sports event video makes the n pieces of visual information more accurate, and the visual information features obtained by combining the n frames are clearer than features obtained from a single pass over the sports event video, which provides a good sample for fusion with the text information features.
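A minimal sketch of this visual-feature step follows, assuming torchvision's ResNet-18 as the image encoder and random tensors in place of real decoded frames; the embodiment only requires some image encoder, so the specific backbone, frame count and 224x224 input size are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Image encoder: ResNet-18 with its classification head removed (assumption;
# any sub-network producing a per-frame feature vector would do).
resnet = models.resnet18(weights=None)
image_encoder = nn.Sequential(*list(resnet.children())[:-1])

# n frames (n >= 2) corresponding to the sports event video, already decoded
# and resized to 3x224x224; random data stands in for real frames here.
n = 5
frames = torch.randn(n, 3, 224, 224)

with torch.no_grad():
    feats = image_encoder(frames)        # (n, 512, 1, 1) after global pooling
visual_features = feats.flatten(1)       # n pieces of visual information, (n, 512)
```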
In one embodiment, fusing the textual information features and visual information features to obtain multimodal features includes:
generating a weight proportion by using visual information characteristics;
carrying out weighting processing on the text information characteristics by using the weight proportion;
and determining the text information features after the weighting processing as multi-modal features.
The beneficial effects of the above technical scheme are: by fusing the visual information and the text information features into multi-modal information, personalized speech can be generated, giving users a more vivid and pleasant experience. For example, when broadcasting a football match, visual cues such as shots and goals captured in the video frames can supplement the text so that a more exciting speech effect is generated. Likewise, when a machine reads picture books, Tang poems and the like aloud, the emotional tone of the generated speech can be set according to the illustrations in the book; for example, if the colours in an illustration are dark, the generated speech may lean towards a sad emotion.
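A minimal sketch of the fusion step described above, in which the visual information features generate a weight proportion that is applied to the text information features; the pooling over frames, the sigmoid gating and all dimensions are illustrative assumptions rather than the patented implementation itself.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Weights the text features with a proportion generated from the visual features."""
    def __init__(self, text_dim: int = 256, visual_dim: int = 512):
        super().__init__()
        # Maps the pooled visual information to one weight per text-feature dimension.
        self.weight_gen = nn.Sequential(nn.Linear(visual_dim, text_dim), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:   (batch, seq_len, text_dim) -- text information features
        # visual_feats: (n_frames, visual_dim)     -- n pieces of visual information
        pooled = visual_feats.mean(dim=0)          # combine the n frames
        weights = self.weight_gen(pooled)          # weight proportion, (text_dim,)
        return text_feats * weights                # weighted text = multi-modal features

fusion = MultiModalFusion()
multimodal = fusion(torch.randn(1, 10, 256), torch.randn(5, 512))   # (1, 10, 256)
```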
In one embodiment, synthesizing target speech from multi-modal features includes:
decoding the multi-modal features with an attention-based decoder;
generating a spectrum from the decoded multi-modal features by using a post-processing module;
and synthesizing the target voice according to the frequency spectrum.
The beneficial effects of the above technical scheme are: the attention-based decoder focuses on the most relevant parts of the multi-modal features during decoding, so that the type and emotion of the synthesized speech can be adjusted appropriately according to what the attention attends to. The synthesized speech therefore has the advantage of diversity.
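A minimal sketch of the decoding and post-processing steps follows, assuming a Tacotron-style arrangement (a GRU decoder with a simple additive attention followed by a convolutional post-net that outputs a mel spectrum); the embodiment fixes only an attention-based RNN decoder and a post-processing module, so the concrete layers, frame count and mel dimension are assumptions.

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """Attention-based RNN decoder over the multi-modal feature sequence."""
    def __init__(self, feat_dim: int = 256, hidden_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)   # simple additive-style attention
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.to_mel = nn.Linear(hidden_dim, n_mels)

    def forward(self, memory: torch.Tensor, n_frames: int = 20) -> torch.Tensor:
        # memory: multi-modal features, (seq_len, feat_dim)
        h = memory.new_zeros(self.rnn.hidden_size)
        outputs = []
        for _ in range(n_frames):
            # attention weights over the multi-modal sequence
            scores = self.score(torch.cat([memory, h.expand(memory.size(0), -1)], dim=-1))
            alpha = torch.softmax(scores, dim=0)                  # (seq_len, 1)
            context = (alpha * memory).sum(dim=0)                 # attended context, (feat_dim,)
            h = self.rnn(context.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
            outputs.append(self.to_mel(h))
        return torch.stack(outputs)                               # (n_frames, n_mels)

class PostNet(nn.Module):
    """Post-processing module: refines decoder output into the final spectrum."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.Tanh(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2))

    def forward(self, mel: torch.Tensor) -> torch.Tensor:         # mel: (n_frames, n_mels)
        x = mel.t().unsqueeze(0)                                  # (1, n_mels, n_frames)
        return (x + self.conv(x)).squeeze(0).t()                  # residual refinement

spectrum = PostNet()(AttentionDecoder()(torch.randn(10, 256)))    # (20, 80)
```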
In one embodiment, as shown in fig. 3, the method includes:
Step 1: obtain the text information feature representation of the sports text to be broadcast through a character encoder. A dictionary is constructed, each character is encoded into a one-hot vector, and the text information feature representation is then obtained through an embedding layer;
Step 2: obtain the visual information feature representation of the picture frames corresponding to the sports event video through an image encoder. The visual information feature representation of the corresponding images can be extracted by a sub-network such as ResNet;
Step 3: the multi-modal feature fusion module fuses the text information features and the corresponding visual information features to obtain the multi-modal feature representation. Either concatenation along the feature dimension can be adopted, or weights can be generated from the image features to weight the text information;
Step 4: decode the multi-modal features with an attention-based decoder module. The decoder has an RNN structure and outputs a sequence feature representation based on the attention component;
Step 5: the post-processing module processes the decoder output to generate a spectral representation, from which the speech is obtained.
The beneficial effects of the above technical scheme are: single-modal text-to-speech synthesis ignores important visual information and, in some scenes, cannot generate speech that varies with the scene. The above method generates a personalized speech representation through multi-modal fusion of image and text information. For example, when broadcasting a football match, visual cues such as shots and goals captured in the video frames can supplement the text so that a more exciting speech effect is generated. Likewise, when a machine reads picture books, Tang poems and the like aloud, the emotional tone of the generated speech can be set according to the illustrations in the book; for example, if the colours in an illustration are dark, the generated speech may lean towards a sad emotion.
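For completeness, the sketches above can be wired together in the order of steps 1 to 5. This usage example assumes the CharEncoder, preset_dict, image_encoder, frames, MultiModalFusion, AttentionDecoder and PostNet definitions from the earlier sketches are in scope, and it recovers a waveform with librosa's Griffin-Lim mel inversion purely as a stand-in for whatever vocoder step 5 actually uses.

```python
import numpy as np
import librosa
import torch

# Step 1: text information features of the sports text to be broadcast.
char_ids = torch.tensor([[preset_dict[c] for c in "比赛开始"]])
text_feats = CharEncoder(len(preset_dict))(char_ids)            # (1, seq_len, 256)

# Step 2: visual information features of the corresponding video frames.
visual_feats = image_encoder(frames).flatten(1)                  # (n, 512)

# Step 3: multi-modal fusion.
multimodal = MultiModalFusion()(text_feats, visual_feats)[0]     # (seq_len, 256)

# Steps 4-5: attention-based decoding, post-processing to a mel spectrum,
# then waveform recovery (Griffin-Lim as an illustrative stand-in).
mel = PostNet()(AttentionDecoder()(multimodal))                  # (frames, 80)
mel_np = np.maximum(mel.detach().numpy().T, 1e-5)                # (80, frames), non-negative
wav = librosa.feature.inverse.mel_to_audio(mel_np, sr=22050)     # target speech waveform
```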
This embodiment also discloses an image-text fused speech synthesis device, as shown in fig. 4, the device including:
a first obtaining module 401, configured to obtain text information features of a current sports text;
a second obtaining module 402, configured to determine, according to the current sports text, a video picture corresponding to the current sports text, and obtain visual information features of the video picture;
the fusion module 403 is configured to fuse the text information features and the visual information features to obtain multi-modal features;
a synthesis module 404 for synthesizing the target speech based on the multi-modal features.
In one embodiment, the first obtaining module includes:
a construction sub-module 4011 configured to construct a preset dictionary;
the first obtaining sub-module 4012 is configured to obtain each character in the current sports text;
an encoding sub-module 4013, configured to encode each character into a one-hot encoded vector;
a second obtaining sub-module 4014, configured to obtain the text information features of the one-hot encoded vectors by using the embedding layer of the preset dictionary.
In one embodiment, the second obtaining module includes:
the first determining submodule is used for determining the sports event video corresponding to the current sports text according to the current sports text;
the third acquisition sub-module is used for acquiring n frames corresponding to the sports event video;
a fourth obtaining sub-module, configured to obtain n pieces of visual information of the n frames by using the image encoder;
and the combining sub-module is used for combining the n pieces of visual information to determine the visual information characteristics of the video pictures.
In one embodiment, a fusion module includes:
the first generation submodule is used for generating a weight proportion by utilizing the visual information characteristics;
the processing submodule is used for carrying out weighting processing on the text information characteristics by utilizing the weight proportion;
and the second determining sub-module is used for determining the text information features after the weighting processing as multi-modal features.
In one embodiment, the synthesis module comprises:
a decoding sub-module for decoding the multi-modal features with an attention-based decoder;
the second generation submodule is used for generating a spectrum from the decoded multi-modal features by using the post-processing module;
and the synthesis submodule is used for synthesizing the target voice according to the frequency spectrum.
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention are used merely to distinguish between different modules and steps, and do not limit the invention.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. An image-text fused speech synthesis method, characterized by comprising the following steps:
acquiring text information features of a current sports text;
determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture;
fusing the text information features and the visual information features to obtain multi-modal features;
synthesizing target speech according to the multi-modal features;
the fusing the text information features and the visual information features to obtain multi-modal features comprises:
generating a weight proportion by using the visual information characteristics;
carrying out weighting processing on the text information characteristics by utilizing the weight proportion;
and determining the text information features after the weighting processing as the multi-modal features.
2. The image-text fused speech synthesis method according to claim 1, wherein the acquiring the text information features of the current sports text comprises:
constructing a preset dictionary;
acquiring each character in the current sports text;
encoding each of the characters into a one-hot encoded vector;
and acquiring the text information features of the one-hot encoded vectors by using an embedding layer of the preset dictionary.
3. The image-text fused speech synthesis method according to claim 1, wherein the determining a video picture corresponding to the current sports text according to the current sports text and acquiring the visual information features of the video picture comprises:
determining a sports event video corresponding to the current sports text according to the current sports text;
acquiring n frames corresponding to the sports event video;
acquiring n visual information of the n frames by using an image encoder;
and determining the n pieces of visual information as the visual information characteristics of the video pictures in combination.
4. The image-text fused speech synthesis method according to claim 1, wherein the synthesizing target speech according to the multi-modal features comprises:
decoding the multi-modal features with an attention-based decoder;
generating a spectrum from the decoded multi-modal features by using a post-processing module;
and synthesizing the target voice according to the frequency spectrum.
5. An image-text fused speech synthesis apparatus, comprising:
the first acquisition module is used for acquiring the text information features of the current sports text;
the second acquisition module is used for determining a video picture corresponding to the current sports text according to the current sports text and acquiring visual information characteristics of the video picture;
the fusion module is used for fusing the text information features and the visual information features to obtain multi-modal features;
a synthesis module for synthesizing a target speech according to the multi-modal features;
the fusion module comprises:
the first generation submodule is used for generating a weight proportion by utilizing the visual information characteristics;
the processing submodule is used for carrying out weighting processing on the text information characteristics by utilizing the weight proportion;
and the second determining sub-module is used for determining the text information features after the weighting processing as the multi-modal features.
6. The image-text fused speech synthesis device according to claim 5, wherein the first obtaining module comprises:
the construction submodule is used for constructing a preset dictionary;
the first obtaining sub-module is used for obtaining each character in the current sports text;
an encoding sub-module for encoding each of the characters into a one-hot encoded vector;
and the second obtaining submodule is used for obtaining the text information features of the one-hot encoded vectors by using the embedding layer of the preset dictionary.
7. The image-text fused speech synthesis device according to claim 5, wherein the second obtaining module comprises:
the first determining submodule is used for determining the sports event video corresponding to the current sports text according to the current sports text;
the third acquisition submodule is used for acquiring n frames corresponding to the sports event video;
a fourth obtaining sub-module, configured to obtain n pieces of visual information of the n frames by using an image encoder;
and the combining sub-module is used for combining the n pieces of visual information to determine the visual information characteristics of the video pictures.
8. The image-text fused speech synthesis device according to claim 5, wherein the synthesis module comprises:
a decoding sub-module for decoding the multi-modal features with an attention-based decoder;
the second generation submodule is used for generating a spectrum from the decoded multi-modal features by using the post-processing module;
and the synthesis sub-module is used for synthesizing the target voice according to the frequency spectrum.
CN202010145198.8A 2020-03-05 2020-03-05 Image-text fused speech synthesis method and device Active CN111312210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010145198.8A CN111312210B (en) 2020-03-05 2020-03-05 Image-text fused speech synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010145198.8A CN111312210B (en) 2020-03-05 2020-03-05 Image-text fused speech synthesis method and device

Publications (2)

Publication Number Publication Date
CN111312210A CN111312210A (en) 2020-06-19
CN111312210B (en) 2023-03-21

Family

ID=71148105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010145198.8A Active CN111312210B (en) 2020-03-05 2020-03-05 Image-text fused speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN111312210B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114945108A (en) * 2022-05-14 2022-08-26 云知声智能科技股份有限公司 Method and device for assisting vision-impaired person in understanding picture

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7117155B2 (en) * 1999-09-07 2006-10-03 At&T Corp. Coarticulation method for audio-visual text-to-speech synthesis
JP4762553B2 (en) * 2005-01-05 2011-08-31 三菱電機株式会社 Text-to-speech synthesis method and apparatus, text-to-speech synthesis program, and computer-readable recording medium recording the program
CN105096934B (en) * 2015-06-30 2019-02-12 百度在线网络技术(北京)有限公司 Construct method, phoneme synthesizing method, device and the equipment in phonetic feature library
CN105931631A (en) * 2016-04-15 2016-09-07 北京地平线机器人技术研发有限公司 Voice synthesis system and method
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN106910514A (en) * 2017-04-30 2017-06-30 上海爱优威软件开发有限公司 Method of speech processing and system
JP6619072B2 (en) * 2018-10-10 2019-12-11 日本電信電話株式会社 SOUND SYNTHESIS DEVICE, SOUND SYNTHESIS METHOD, AND PROGRAM THEREOF
CN109461435B (en) * 2018-11-19 2022-07-01 北京光年无限科技有限公司 Intelligent robot-oriented voice synthesis method and device
CN109754778B (en) * 2019-01-17 2023-05-30 平安科技(深圳)有限公司 Text speech synthesis method and device and computer equipment
CN110136692B (en) * 2019-04-30 2021-12-14 北京小米移动软件有限公司 Speech synthesis method, apparatus, device and storage medium
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion
KR20190104941A (en) * 2019-08-22 2019-09-11 엘지전자 주식회사 Speech synthesis method based on emotion information and apparatus therefor

Also Published As

Publication number Publication date
CN111312210A (en) 2020-06-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant