CN111916050A - Speech synthesis method, speech synthesis device, storage medium and electronic equipment - Google Patents

Speech synthesis method, speech synthesis device, storage medium and electronic equipment

Info

Publication number
CN111916050A
CN111916050A (application CN202010771100.XA)
Authority
CN
China
Prior art keywords
information
target
target image
image
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010771100.XA
Other languages
Chinese (zh)
Inventor
殷翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010771100.XA priority Critical patent/CN111916050A/en
Publication of CN111916050A publication Critical patent/CN111916050A/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/5866 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The present disclosure relates to a speech synthesis method, apparatus, storage medium, and electronic device. The method comprises: acquiring a target image to be processed; extracting first feature information of the target image and second feature information of at least one target object in the target image, and generating description information corresponding to the target image according to the first feature information and the second feature information; and performing speech synthesis according to the description information to obtain audio information corresponding to the target image. This improves the comprehensiveness and richness of the feature information extracted from the target image and provides accurate data support for the subsequent generation of the description information. Moreover, when the description information of the target image is generated, both the information of the target image as a whole and the content information of the target objects it contains are considered, so that the target image can be described more comprehensively based on the target objects, and the degree of matching between the obtained audio information and the target image is improved.

Description

Speech synthesis method, speech synthesis device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, apparatus, storage medium, and electronic device.
Background
With the development of computer technology, image processing is applied ever more widely. For example, to further simplify user operations, corresponding audio may be generated from an image. In the prior art, an image is usually processed by convolution to generate text description information corresponding to the image, from which audio is then generated. In this process, because the sizes of the convolution kernels used for the convolution are set differently, the obtained text description information may not match the image, so the degree of matching between the image and the audio is low.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method of speech synthesis, the method comprising:
acquiring a target image to be processed;
extracting first characteristic information of the target image and second characteristic information of at least one target object in the target image, and generating description information corresponding to the target image according to the first characteristic information and the second characteristic information;
and carrying out voice synthesis according to the description information to obtain audio information corresponding to the target image.
In a second aspect, a speech synthesis apparatus is provided, the apparatus comprising:
the first acquisition module is used for acquiring a target image to be processed;
the generating module is used for extracting first characteristic information of the target image and second characteristic information of at least one target object in the target image, and generating description information corresponding to the target image according to the first characteristic information and the second characteristic information;
and the synthesis module is used for carrying out voice synthesis according to the description information so as to obtain the audio information corresponding to the target image.
In a third aspect, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of any of the first aspects.
In a fourth aspect, an electronic device is provided, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to implement the steps of the method of any of the above first aspects.
In the above technical solution, first feature information of a target image and second feature information of at least one target object in the target image are extracted, description information corresponding to the target image is generated according to the first and second feature information, and speech synthesis is then performed according to the description information to obtain audio information corresponding to the target image. Thus, besides the feature information of the target image itself, the target object in the target image is also identified and its feature information obtained, which improves the comprehensiveness and richness of the extracted feature information and provides accurate data support for the subsequent generation of the description information. Meanwhile, in the present disclosure, when the description information of the target image is generated, both the information of the target image and the content information of the target objects it contains are considered, so that the target image can be described more comprehensively based on the target objects, and no feature processing with convolution kernels of different sizes is required. This ensures the comprehensiveness and accuracy of the determined description information and improves the degree of matching between the obtained audio information and the target image.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
In the drawings:
FIG. 1 is a flow diagram of a method of speech synthesis provided in accordance with an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a description information generation model provided in accordance with one embodiment of the present disclosure;
FIG. 3 is a block diagram of a speech synthesis apparatus provided in accordance with an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device for implementing an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present disclosure, where as shown in fig. 1, the method includes:
in step 11, a target image to be processed is acquired. The target image may be an image uploaded or imported by a user.
In step 12, first feature information of the target image and second feature information of at least one target object in the target image are extracted, and description information corresponding to the target image is generated according to the first feature information and the second feature information.
The first feature information of the target image may be extracted through an image feature extraction model. The first feature information may be a global feature of the target image, that is, feature information obtained by performing feature extraction on the complete target image; for example, a convolutional neural network (CNN) may be trained in advance to serve as the image feature extraction model. The target image may contain target objects such as a mountain or the sun, and once a target object in the target image has been determined, its feature information, that is, the second feature information, can be obtained. The description information corresponding to the target image may then be generated according to the first and second feature information, for example by splicing the matrices of the first feature information and the second feature information; the matrix splicing can be performed with an existing splicing function such as concat, which the present disclosure does not limit. This extraction guarantees the comprehensiveness and richness of the feature information corresponding to the target image and provides data support for generating more accurate description information.
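For illustration only (this code is not part of the patent text), the following is a minimal sketch of this step, assuming a pretrained ResNet-50 as the image feature extraction model and per-object features produced elsewhere by a detector; the function name and tensor shapes are assumptions:

```python
import torch
import torchvision.models as models

# Assumed setup: a pretrained CNN serves as the image feature extraction model.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # keep the pooled 2048-d global feature
backbone.eval()

def build_target_features(image: torch.Tensor, object_feats: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W) preprocessed tensor; object_feats: (N, D) features of
    the N detected target objects (the "second feature information")."""
    with torch.no_grad():
        first = backbone(image)                     # (1, 2048) global "first" features
    first = first.expand(object_feats.size(0), -1)  # repeat the global feature per object
    # "Matrix splicing" as described above: concatenate along the feature dimension.
    return torch.cat([first, object_feats], dim=1)  # (N, 2048 + D)
```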
In step 13, speech synthesis is performed according to the description information to obtain audio information corresponding to the target image.
As an example, speech synthesis may be implemented based on TTS (Text To Speech) technology to obtain the audio information corresponding to the target image. As another example, a speech synthesis model may be trained based on a neural network to produce the corresponding acoustic feature information from the description information, and the acoustic feature information may then be synthesized into audio information by a vocoder, which can further improve the intelligibility and fluency of the obtained audio information.
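As a hedged illustration of the second example (an acoustic model followed by a vocoder), the sketch below uses a hypothetical LSTM-based acoustic model; neither the architecture nor the names are prescribed by the disclosure:

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Hypothetical acoustic model: description token ids -> mel-spectrogram frames."""
    def __init__(self, vocab_size: int, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.to_mel = nn.Linear(2 * dim, n_mels)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(self.embed(tokens))
        return self.to_mel(h)  # (batch, time, n_mels) acoustic feature information

def synthesize(description_tokens: torch.Tensor, acoustic_model: nn.Module, vocoder):
    mel = acoustic_model(description_tokens)  # acoustic features from the description
    return vocoder(mel)                       # the vocoder turns features into a waveform
```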
Therefore, in the above technical solution, by extracting first feature information of a target image and second feature information of at least one target object in the target image, and generating description information corresponding to the target image according to the first feature information and the second feature information, speech synthesis can be performed according to the description information to obtain audio information corresponding to the target image. Therefore, by the technical scheme, the characteristic information of the target image can be obtained, the target object in the target image can be further identified, and the characteristic information corresponding to the target object can be obtained, so that the comprehensiveness and richness of the obtained characteristic information of the target image can be improved, and accurate data support is provided for the subsequent generation of the description information. Meanwhile, in the disclosure, when the description information of the target image is generated, not only the information of the target image but also the content information of the target object contained in the target image is considered, so that the target image can be more comprehensively described based on the target object, and feature processing is not required through different convolution kernels, thereby ensuring the comprehensiveness and accuracy of the determined description information and improving the matching degree of the obtained audio information and the target image.
In order to enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present disclosure, the above steps are described in detail below.
Optionally, in step 12, another exemplary implementation manner of extracting first feature information of the target image and second feature information of at least one target object in the target image, and generating description information corresponding to the target image according to the first feature information and the second feature information is as follows, and the step may include:
inputting the target image into a description information generation model, extracting first feature information of the target image and second feature information of at least one target object in the target image through the description information generation model, and generating description information corresponding to the target image according to the first feature information and the second feature information.
As shown in fig. 2, the description information generation model 200 includes:
a first feature extraction sub-model 201, configured to perform feature extraction on the target image to obtain the first feature information, where the first feature extraction sub-model may be a CNN network;
a second feature extraction submodel 202, configured to perform image segmentation on the target image to determine at least one target object in the target image, and use feature information corresponding to the target object in a last feature layer in the second feature extraction submodel as the second feature information, where the second feature extraction submodel may be a Mask-RCNN network;
and the description information generation submodel 203 is used for obtaining target feature information according to the first feature information and the second feature information, and generating description information corresponding to the target image according to the target feature information.
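The wiring below is a sketch under stated assumptions, not the patent's implementation: the first sub-model is approximated by a pretrained ResNet-50, the second by torchvision's Mask R-CNN, and the second feature information is replaced by a simple score-weighted label vector per object, since the "last feature layer" features depend on internals the text does not specify. The decoder is a hypothetical stand-in for sub-model 203:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.detection import (maskrcnn_resnet50_fpn,
                                          MaskRCNN_ResNet50_FPN_Weights)

class DescriptionGenerationModel(nn.Module):
    """Sketch of sub-models 201 (CNN), 202 (Mask R-CNN), 203 (decoder)."""
    def __init__(self, decoder: nn.Module, num_classes: int = 91):
        super().__init__()
        cnn = resnet50(weights=ResNet50_Weights.DEFAULT)
        cnn.fc = nn.Identity()
        self.first_extractor = cnn                  # 201: global image features
        self.second_extractor = maskrcnn_resnet50_fpn(
            weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)  # 202: segmentation
        self.second_extractor.eval()
        self.decoder = decoder                      # 203: features -> description
        self.num_classes = num_classes

    def forward(self, image: torch.Tensor):
        first = self.first_extractor(image.unsqueeze(0))   # (1, 2048)
        with torch.no_grad():
            det = self.second_extractor([image])[0]        # boxes, labels, masks, scores
        n = det["labels"].numel()
        # Assumed stand-in for the "second feature information":
        # a score-weighted one-hot label vector per detected object.
        second = torch.zeros(n, self.num_classes)
        second[torch.arange(n), det["labels"]] = det["scores"]
        target_feats = torch.cat([first.expand(n, -1), second], dim=1)
        return self.decoder(target_feats)                  # description information
```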
Hereinafter, the description information generation model may be obtained by training as follows:
A sample image, target text information corresponding to the sample image, and target segmentation information corresponding to a sample object in the sample image are acquired, wherein the target segmentation information represents the position of the sample object in the sample image. For example, a plurality of sample images may be obtained from a local image library or downloaded from a network; a description text of each sample image may be annotated as the target text information, and the position information of at least one target object in the sample image may be annotated as the target segmentation information.
And training a neural network model according to the sample image, the target text information and the target segmentation information to obtain the description information generation model.
That is, the sample image serves as the model input, the target text information corresponding to the sample image serves as the target output of the model, and the target segmentation information serves as a constraint on the model, so that training the neural network model yields the description information generation model. For example, when a sample image is selected from the plurality of sample images as the model input, the first feature information of the sample image may be obtained by the first feature extraction sub-model, while the segmentation information of at least one target object in the sample image and the second feature information of that object may be obtained by the second feature extraction sub-model. The description information generation sub-model may then obtain the target feature information from the first and second feature information and generate the corresponding description information. Since the segmentation information of the target object in the sample image is also produced by the neural network model, the model can be constrained with the target segmentation information corresponding to the sample image. This improves the image segmentation accuracy, that is, the accuracy of the obtained second feature information of the target object, and brings the description information output by the neural network model closer to the target output, so as to obtain the description information generation model.
In this technical scheme, a sample image, target text information corresponding to the sample image, and target segmentation information corresponding to a sample object in the sample image are acquired, and a neural network model is trained on them to obtain the description information generation model. Obtaining the description information generation model by training a neural network has two benefits. On the one hand, the segmentation of target objects in images can be learned during training, which improves the accuracy of the determined target objects and thus of their second feature information, ensures to a certain extent the comprehensiveness and accuracy of the determined description information, and improves the accuracy of the description information generation model. On the other hand, the application range of description information generation can be widened. In addition, besides learning the target output of the neural network model, the added auxiliary task of learning the segmentation information improves the applicability of the description information generation model to different test data, further improving its accuracy and application range and making it convenient to use.
Optionally, in the training of the neural network model, a loss value of the neural network model is determined according to a first loss value and a second loss value, wherein the first loss value is determined according to the target text information and description information corresponding to the sample image output by the neural network model, and the second loss value is determined according to the target segmentation information and segmentation information corresponding to the sample image determined by the neural network model.
In this embodiment, the sample image is input into the neural network model to obtain the description information and the segmentation information corresponding to the sample image. A first loss value may then be determined from the target text information and the description information output by the neural network model, for example by softmax cross entropy; likewise, a second loss value may be calculated from the target segmentation information and the segmentation information output by the neural network model, for example by RMSE (root mean square error). The calculation methods for the first loss value and the second loss value may be chosen according to the actual usage scenario and may be the same or different, which the present disclosure does not limit.
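As an illustration of this loss combination, here is a short sketch assuming token-level description logits and per-pixel segmentation maps; the weighting factor alpha is an assumption, not taken from the disclosure:

```python
import torch
import torch.nn.functional as F

def training_loss(pred_logits, target_tokens, pred_masks, target_masks, alpha=1.0):
    """Combined loss sketch: first loss (softmax cross entropy on the description)
    plus second loss (RMSE on the segmentation), as suggested in this embodiment."""
    # First loss: cross entropy between predicted and target description tokens.
    first = F.cross_entropy(pred_logits.reshape(-1, pred_logits.size(-1)),
                            target_tokens.reshape(-1))
    # Second loss: root mean square error between predicted and target segmentation.
    second = torch.sqrt(F.mse_loss(pred_masks, target_masks))
    return first + alpha * second
```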
Therefore, in the above technical scheme, when the loss value of the neural network model is determined for training, not only the loss on the model's learning target, i.e. the description information, is considered, but also the constraint of the segmentation information, i.e. the loss on the position information of the target object. This ensures accurate segmentation of target objects in the input image and improves the accuracy of their feature information. Moreover, because the losses of the description information and the segmentation information are considered jointly, the impact on the accuracy of the description information generation model when the training data and test data deviate from each other can be reduced to a certain extent, which improves the generalization of the trained model and widens its application range.
Optionally, in step 13, an exemplary implementation manner of performing speech synthesis according to the description information to obtain audio information corresponding to the target image is as follows, including:
performing semantic correction on the description information to obtain corrected target description information;
and carrying out voice synthesis according to the target description information to obtain the audio information.
For example, a language model may be trained on a large amount of language text based on an LSTM-RNN (long short-term memory recurrent neural network), and semantic correction may be performed on the description information based on this language model: for instance, the language model may correct word order or add connective words, so that the semantics of the corrected target description information better conform to the user's natural language logic. The manner of performing speech synthesis according to the target description information is the same as described above and is not repeated here.
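A hedged sketch of one possible realization: candidate rewrites of the description (for example with connective words inserted) are scored by an LSTM language model, and the most fluent one is kept. The model class, the candidate generation, and the encode function are all assumptions for illustration:

```python
import torch
import torch.nn as nn

class LstmLanguageModel(nn.Module):
    """Hypothetical LSTM-RNN language model used to score fluency."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def log_prob(self, tokens: torch.Tensor) -> float:
        """Mean log-probability of a (1, T) token sequence under the model."""
        h, _ = self.lstm(self.embed(tokens[:, :-1]))
        logp = torch.log_softmax(self.out(h), dim=-1)
        return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).mean().item()

def semantic_correct(description: str, candidates: list, lm: LstmLanguageModel, encode) -> str:
    """Keep whichever of the original description or its candidate rewrites
    the language model considers most fluent; encode(text) -> (1, T) LongTensor."""
    options = [description, *candidates]
    return max(options, key=lambda text: lm.log_prob(encode(text)))
```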
Therefore, through the above technical scheme, target description information that better conforms to the user's natural language logic can be obtained, which improves the degree of matching between the target description information and the target image as well as its readability, and further improves the intelligibility of the audio information. Moreover, since no manual correction by the user is needed, the automation and simplicity of the speech synthesis method are further improved, making it convenient to use.
Optionally, the method further comprises:
and acquiring music information corresponding to the target image, wherein the music information is used for carrying out background music dubbing on the target image.
As an example, the obtaining of the music information corresponding to the target image is implemented as follows, including:
and displaying an audio template selection interface, wherein a plurality of alternative audio templates are loaded in the audio template selection interface. The alternative audio template may be, for example, a piece of music uploaded by a user clip, or a piece of music pre-stored in an audio library.
And responding to the user's selection operation in the audio template selection interface, determining the audio template selected by the user as the target audio template, extracting the accompaniment information and the lyric information in the target audio template, and determining the accompaniment information and the lyric information as the music information; that is, the lyrics and the accompaniment of the target audio template are extracted separately as the music information.
As another example, the obtaining of the music information corresponding to the target image is implemented as follows, including:
receiving music score data input by a user;
and analyzing the music score data, and determining accompaniment information obtained by analysis and corresponding to the music score data as the music information. Wherein the accompaniment information may include one or more of melody information, tempo information, measure information, and paragraph information.
The music score data may be input into a music information extraction model for analysis. The music information extraction model may be a pre-trained model: for example, the accompaniment information in training scores may be annotated in advance with labels, so that the music information extraction model can be obtained by training on the training scores and the corresponding accompaniment annotations. The model may be trained on training scores with any machine learning method, which the present disclosure does not limit.
Therefore, in this way, the user can obtain the music information for dubbing the target image with background music simply by selecting an audio template or uploading music score data. This provides a data basis for the subsequent speech synthesis, facilitates the synthesis of rhythmic audio information, avoids the complicated operation of manually adding music afterwards, simplifies the user's operation flow, and makes the method convenient to use.
Accordingly, in step 13, another exemplary implementation manner of performing speech synthesis according to the description information to obtain the audio information corresponding to the target image is as follows, including:
and carrying out voice synthesis according to the description information and the music information to obtain audio information corresponding to the target image.
Optionally, the performing speech synthesis according to the description information and the music information to obtain audio information corresponding to the target image includes:
determining voice acoustic characteristic information corresponding to the description information according to the music information;
performing voice synthesis according to the voice acoustic characteristic information to obtain voice waveform data;
and synthesizing the voice waveform data and the accompaniment waveform data determined according to the music information to obtain the audio information.
As an example, if the music information is extracted from the target audio template, the singing duration of each character in the description information may be determined according to the lyric information in the music information, for example by forced alignment between the lyric information and the character sequence of the description information. Spectrum data may then be generated from the singing duration of each character in the description information and the accompaniment information in the music information, which yields the speech acoustic feature information corresponding to the description information. Thereafter, the speech acoustic feature information may be synthesized into speech waveform data by the vocoder. In this embodiment, the accompaniment audio without human voice can be extracted directly from the target audio template to obtain the accompaniment waveform data, and the speech waveform data and the accompaniment waveform data of the target audio template are then synthesized to obtain the audio information.
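The final mixing step lends itself to a small sketch: overlay the synthesized voice waveform on the accompaniment waveform to produce the audio information. Equal sample rates are assumed, and the peak normalization is an added safeguard, not something the disclosure specifies:

```python
import numpy as np

def mix_audio(voice: np.ndarray, accompaniment: np.ndarray) -> np.ndarray:
    """Synthesize the audio information from voice and accompaniment waveform data."""
    n = max(len(voice), len(accompaniment))
    voice = np.pad(voice, (0, n - len(voice)))                    # zero-pad to equal length
    accompaniment = np.pad(accompaniment, (0, n - len(accompaniment)))
    mixed = voice + accompaniment
    peak = np.abs(mixed).max()
    return mixed / peak if peak > 1.0 else mixed                  # avoid clipping
```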
As another example, if the music information is parsed from music score data, the description information and the music information may be processed by a music synthesis model. The music synthesis model may be obtained in advance by joint-processing training of a neural network model on the text information and accompaniment information of training samples: for example, the text information and accompaniment information of a training sample serve as the model input, and the singing data and accompaniment data extracted from the music of the training sample serve as the target output, so that model training yields the music synthesis model. By jointly processing the text information and the music information, the speech waveform data corresponding to the description information and the waveform data corresponding to the music information can be adjusted to match each other. Accordingly, the description information and the music information can be input into the music synthesis model, which outputs the speech acoustic feature information corresponding to the description information and the accompaniment acoustic feature information corresponding to the music information. Speech synthesis can then be performed by the vocoder on the speech acoustic feature information to obtain speech waveform data, and on the accompaniment acoustic feature information to obtain accompaniment waveform data, after which the speech waveform data and the accompaniment waveform data determined according to the music information can be synthesized to obtain the audio information.
Therefore, with this technical scheme, audio information containing musical characteristics can be obtained by acquiring the music information, and song audio can be generated from the description information corresponding to the target image. This expands the diversity of the audio information generated from the target image, facilitates personalized operation by users, and meets their usage needs while preserving the simplicity and automation of the speech synthesis method, further improving the user experience.
Optionally, the method further comprises:
and determining an audio segment related to the target object in the audio information. The audio segment may be determined to be the audio segment, and may be an audio segment corresponding to a name including the target object, and if the target object is the sun, the audio segment related to the target object may be a segment corresponding to the "sun" in the audio information; or a segment having associated information with the target object, for example, if the target object is the sun, the audio segment related to the target object may be a segment corresponding to "weather" in the audio information, where the association relationship may be realized by pre-storing a corresponding relationship table for query, and an inference model may be realized by training a neural network model for associated inference, which is not limited in this disclosure.
And then, according to the audio clip, carrying out special effect processing on the image area where the target object is located to obtain a special effect image.
After the audio segment is determined, the number of frames of the special effect image can be determined from the duration of the audio segment and the image frame rate, where the image frame rate may be set according to the actual usage scenario. The special effect processing may be enlarging the image area where the target object is displayed, or making the area of the target image outside the target object's image area transparent. It may be implemented, after determining the boundary of the image area where the target object is located, by enlarging the image area through pixel interpolation, or by multiplying the pixels outside the boundary by a transparency factor.
In one possible embodiment, a plurality of identical special effect images may be generated in the manner described above; in another possible embodiment, a plurality of different special effect images may be generated to achieve a gradual display effect, as in the sketch below. For example, when the image area where the target object is located is enlarged, the enlargement ratio of each frame may be set to increase gradually, so that the image area appears to zoom in progressively. As another example, the transparency factor corresponding to each frame of the special effect image may be set to increase gradually so as to highlight the image area.
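A sketch of the gradual enlargement effect, assuming Pillow and a known bounding box for the target object's image area; the frame count follows the duration-times-frame-rate rule above, while max_scale and fps are illustrative parameters, not from the disclosure:

```python
from PIL import Image

def gradual_zoom_frames(image_path: str, box: tuple, duration_s: float,
                        fps: int = 25, max_scale: float = 1.5) -> list:
    """Generate special effect frames that progressively enlarge the region
    box = (left, top, right, bottom) where the target object is located."""
    img = Image.open(image_path)
    n_frames = max(int(duration_s * fps), 1)   # duration * frame rate, as in the text
    frames = []
    for i in range(n_frames):
        scale = 1.0 + (max_scale - 1.0) * i / max(n_frames - 1, 1)
        region = img.crop(box)
        w, h = region.size
        region = region.resize((int(w * scale), int(h * scale)))  # pixel interpolation
        frame = img.copy()
        cx, cy = (box[0] + box[2]) // 2, (box[1] + box[3]) // 2
        frame.paste(region, (cx - region.width // 2, cy - region.height // 2))
        frames.append(frame)
    return frames
```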
After the special effect image is obtained, video synthesis is carried out according to the audio information, the target image and the special effect image to obtain a target video, wherein in the target video, an image frame corresponding to the audio clip is the special effect image.
In this embodiment, the video may be generated according to the content of the audio information: while the audio segment related to the target object is playing, the corresponding image frames are the special effect images for that segment, and for the rest of the audio information the corresponding image frame is the target image. Thus, in the generated target video, whenever content related to the target object is being played, the image area corresponding to the target object is highlighted in the displayed image data.
Therefore, with this technical scheme, by merely uploading a target image the user can directly obtain video information corresponding to the content of the target image, and the displayed image data changes with the content of the audio information. This ensures the degree of matching between the audio data and the image data, improves the accuracy of the obtained target video, simplifies user operations, better meets users' usage needs, increases the diversity and personalization of target image processing, and further improves the user experience.
The present disclosure also provides a speech synthesis apparatus, as shown in fig. 3, the apparatus 300 includes:
a first obtaining module 301, configured to obtain a target image to be processed;
a generating module 302, configured to extract first feature information of the target image and second feature information of at least one target object in the target image, and generate description information corresponding to the target image according to the first feature information and the second feature information;
and a synthesis module 303, configured to perform speech synthesis according to the description information to obtain audio information corresponding to the target image.
Optionally, the generating module is configured to:
input the target image into a description information generation model, extract first feature information of the target image and second feature information of at least one target object in the target image through the description information generation model, and generate description information corresponding to the target image according to the first feature information and the second feature information; wherein the description information generation model includes:
the first feature extraction submodel is used for carrying out feature extraction on the target image so as to obtain first feature information;
the second feature extraction submodel is used for carrying out image segmentation on the target image so as to determine at least one target object in the target image, and taking feature information corresponding to the target object in the last feature layer in the second feature extraction submodel as the second feature information;
and the description information generation submodel is used for obtaining target feature information according to the first feature information and the second feature information and generating description information corresponding to the target image according to the target feature information.
Optionally, the synthesis module comprises:
the correction submodule is used for carrying out semantic correction on the description information to obtain corrected target description information;
and the first synthesis submodule is used for carrying out voice synthesis according to the target description information so as to obtain the audio information.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring music information corresponding to the target image;
the synthesis module comprises:
and the second synthesis submodule is used for carrying out voice synthesis according to the description information and the music information so as to obtain audio information corresponding to the target image.
Optionally, the second obtaining module includes:
the display submodule is used for displaying an audio template selection interface, wherein the audio template selection interface carries a plurality of alternative audio templates;
the first determining sub-module is used for responding to the selection operation of a user in the audio template selection interface, determining the audio template selected by the user as a target audio template, and extracting accompaniment information and lyric information in the target audio template to determine the accompaniment information and the lyric information as the music information;
or, the second obtaining module includes:
the receiving submodule is used for receiving the music score data input by a user;
and the second determining submodule is used for analyzing the music score data, and determining accompaniment information obtained by analysis and corresponding to the music score data as the music information.
Optionally, the apparatus further comprises:
the determining module is used for determining an audio segment related to the target object in the audio information;
the processing module is used for carrying out special effect processing on the image area where the target object is located according to the audio frequency fragment so as to obtain a special effect image;
and a third synthesis module, configured to perform video synthesis according to the audio information, the target image, and the special effect image to obtain a target video, where in the target video, an image frame corresponding to the audio clip is the special effect image.
Optionally, the second synthesis submodule comprises:
the third determining submodule is used for determining the voice acoustic characteristic information corresponding to the description information according to the music information;
the third synthesis submodule is used for carrying out voice synthesis according to the voice acoustic characteristic information to obtain voice waveform data;
and the fourth synthesis submodule is used for synthesizing the voice waveform data and the accompaniment waveform data determined according to the music information to obtain the audio information.
Optionally, the description information generation model is obtained by:
acquiring a sample image, target text information corresponding to the sample image and target segmentation information corresponding to a sample object in the sample image, wherein the target segmentation information is used for representing the position of the sample object in the sample image;
and training a neural network model according to the sample image, the target text information and the target segmentation information to obtain the description information generation model.
Optionally, in the training of the neural network model, a loss value of the neural network model is determined according to a first loss value and a second loss value, wherein the first loss value is determined according to the target text information and description information corresponding to the sample image output by the neural network model, and the second loss value is determined according to the target segmentation information and segmentation information corresponding to the sample image determined by the neural network model.
Referring now to FIG. 4, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 4, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a target image to be processed; extracting first characteristic information of the target image and second characteristic information of at least one target object in the target image, and generating description information corresponding to the target image according to the first characteristic information and the second characteristic information; and carrying out voice synthesis according to the description information to obtain audio information corresponding to the target image.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of the module does not in some cases constitute a limitation to the module itself, and for example, the first acquisition module may also be described as a "module that acquires a target image to be processed".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a speech synthesis method according to one or more embodiments of the present disclosure, wherein the method includes:
acquiring a target image to be processed;
extracting first characteristic information of the target image and second characteristic information of at least one target object in the target image, and generating description information corresponding to the target image according to the first characteristic information and the second characteristic information;
and carrying out voice synthesis according to the description information to obtain audio information corresponding to the target image.
Example 2 provides the method of example 1, wherein the extracting first feature information of the target image and second feature information of at least one target object in the target image, and generating description information corresponding to the target image according to the first feature information and the second feature information includes:
inputting the target image into a description information generation model, extracting first feature information of the target image and second feature information of at least one target object in the target image through the description information generation model, and generating description information corresponding to the target image according to the first feature information and the second feature information; wherein the description information generation model includes:
the first feature extraction submodel is used for performing feature extraction on the target image so as to obtain the first feature information;
the second feature extraction submodel is used for performing image segmentation on the target image so as to determine the at least one target object in the target image, and taking feature information corresponding to the target object in the last feature layer of the second feature extraction submodel as the second feature information;
and the description information generation submodel is used for obtaining target feature information according to the first feature information and the second feature information and generating description information corresponding to the target image according to the target feature information.
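One possible layout of such a model is sketched below in PyTorch, purely for illustration: the backbones, dimensions and concatenation-based fusion are assumptions, and the per-object pooling of the segmentation branch is collapsed into a single pooled feature map for brevity.

    import torch
    import torch.nn as nn

    class DescriptionModel(nn.Module):
        def __init__(self, feat_dim: int = 256, vocab_size: int = 10000):
            super().__init__()
            # First feature extraction submodel: whole-image features.
            self.first = nn.Sequential(
                nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            # Second feature extraction submodel: a segmentation-style trunk
            # whose last feature layer supplies the second feature information.
            self.second = nn.Sequential(
                nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            # Description generation submodel: fuse, then decode to tokens.
            self.decoder = nn.Linear(2 * feat_dim, vocab_size)

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            first_feat = self.first(image)    # first feature information
            second_feat = self.second(image)  # second feature information
            target_feat = torch.cat([first_feat, second_feat], dim=-1)
            return self.decoder(target_feat)  # token logits for the description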
Example 3 provides the method of example 1, wherein the performing speech synthesis according to the description information to obtain audio information corresponding to the target image includes:
performing semantic correction on the description information to obtain corrected target description information;
and performing speech synthesis according to the target description information to obtain the audio information.
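A toy illustration of this two-step flow follows; the correction table is a made-up stand-in for whatever semantic correction model an implementation might use, and the byte encoding stands in for real waveform data.

    # Hypothetical correction table: implausible fragments -> corrected text.
    CORRECTIONS = {"a dog eating the sky": "a dog looking at the sky"}

    def correct_description(description: str) -> str:
        # Step 1: semantic correction of the generated description.
        return CORRECTIONS.get(description, description)

    def synthesize_from_description(description: str) -> bytes:
        # Step 2: speech synthesis from the corrected target description.
        target_description = correct_description(description)
        return target_description.encode()  # stand-in for synthesized audio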
Example 4 provides the method of example 1, wherein the method further comprises:
acquiring music information corresponding to the target image;
the performing speech synthesis according to the description information to obtain audio information corresponding to the target image includes:
and performing speech synthesis according to the description information and the music information to obtain audio information corresponding to the target image.
Example 5 provides the method of example 4, wherein the obtaining of the music information corresponding to the target image includes:
displaying an audio template selection interface, wherein a plurality of candidate audio templates are loaded in the audio template selection interface;
in response to a selection operation of a user in the audio template selection interface, determining the audio template selected by the user as a target audio template, and extracting accompaniment information and lyric information from the target audio template to determine the accompaniment information and the lyric information as the music information;
or, the acquiring the music information corresponding to the target image includes:
receiving music score data input by a user;
and parsing the music score data, and determining the accompaniment information that is obtained by the parsing and corresponds to the music score data as the music information.
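The two acquisition paths of Example 5 might be sketched as below; the template table and the whitespace-separated "score" format are invented for the example (a real implementation could parse MIDI or MusicXML instead).

    from typing import Dict, Optional

    # Hypothetical template library backing the selection interface.
    AUDIO_TEMPLATES: Dict[str, Dict[str, str]] = {
        "template_1": {"accompaniment": "acc_1.wav", "lyrics": "la la la"},
    }

    def music_info_from_template(selected: str) -> Optional[Dict[str, str]]:
        # Path 1: the user selects a target audio template; its accompaniment
        # information and lyric information become the music information.
        return AUDIO_TEMPLATES.get(selected)

    def music_info_from_score(score_data: str) -> Dict[str, str]:
        # Path 2: parse user-supplied score data; the parsed accompaniment
        # information becomes the music information.
        notes = score_data.split()
        return {"accompaniment": " ".join(notes), "lyrics": ""}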
Example 6 provides the method of example 4 or example 5, wherein the method further comprises:
determining an audio segment related to the target object in the audio information;
performing, according to the audio segment, special effect processing on the image area where the target object is located, so as to obtain a special effect image;
and performing video synthesis according to the audio information, the target image and the special effect image to obtain a target video, wherein, in the target video, the image frames corresponding to the audio segment are the special effect image.
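The frame-selection logic of Example 6 can be illustrated with the short sketch below; the frame rate, segment boundaries and string frame labels are assumptions chosen only to show the timing rule.

    from typing import List, Tuple

    def build_target_video(audio_len_s: float,
                           segment: Tuple[float, float],
                           fps: int = 25) -> List[str]:
        # Frames that fall inside the object-related audio segment use the
        # special effect image; all other frames use the plain target image.
        start, end = segment
        frames = []
        for i in range(int(audio_len_s * fps)):
            t = i / fps
            frames.append("special_effect_image" if start <= t <= end
                          else "target_image")
        return frames

    # e.g. a 10 s track whose object-related segment spans seconds 2 to 4
    frames = build_target_video(10.0, (2.0, 4.0))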
Example 7 provides the method of example 4 or example 5, wherein the performing speech synthesis according to the description information and the music information to obtain audio information corresponding to the target image, includes:
determining speech acoustic feature information corresponding to the description information according to the music information;
performing speech synthesis according to the speech acoustic feature information to obtain speech waveform data;
and synthesizing the speech waveform data with accompaniment waveform data determined according to the music information to obtain the audio information.
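A numeric NumPy sketch of Example 7 follows. The sample rate, the sine-tone "synthesis", the mixing weights and the random stand-in accompaniment are all assumptions; a real system would run a vocoder over full acoustic features rather than render sine tones.

    import numpy as np

    SR = 16000  # sample rate (Hz); an assumption for this sketch

    def acoustic_features_from_music(note_freqs, dur_s=0.5):
        # Toy "speech acoustic feature information": one pitch per note.
        return [(f, dur_s) for f in note_freqs]

    def render_voice(features):
        # Toy synthesis: one sine tone per pitch value.
        return np.concatenate([
            np.sin(2 * np.pi * f * np.arange(int(SR * d)) / SR)
            for f, d in features])

    def mix(voice, accompaniment):
        # Align lengths, then sum the speech and accompaniment waveforms.
        n = min(len(voice), len(accompaniment))
        return 0.6 * voice[:n] + 0.4 * accompaniment[:n]

    voice = render_voice(acoustic_features_from_music([262.0, 294.0, 330.0]))
    accompaniment = 0.1 * np.random.randn(len(voice))  # stand-in track
    audio_information = mix(voice, accompaniment)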
Example 8 provides the method of example 2, wherein the description information generation model is obtained by:
acquiring a sample image, target text information corresponding to the sample image and target segmentation information corresponding to a sample object in the sample image, wherein the target segmentation information is used for representing the position of the sample object in the sample image;
and training a neural network model according to the sample image, the target text information and the target segmentation information to obtain the description information generation model.
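A hedged sketch of one training step for such a model is given below in PyTorch. The deliberately tiny two-headed network, the toy labels and the unweighted sum of a caption loss and a segmentation loss are illustrative assumptions, not the training scheme prescribed by the disclosure.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyDescriptionNet(nn.Module):
        # A deliberately tiny two-headed stand-in for the real model.
        def __init__(self, vocab_size: int = 100):
            super().__init__()
            self.trunk = nn.Conv2d(3, 8, kernel_size=3, padding=1)
            self.caption_head = nn.Linear(8, vocab_size)    # text supervision
            self.seg_head = nn.Conv2d(8, 1, kernel_size=1)  # segmentation supervision

        def forward(self, x):
            h = torch.relu(self.trunk(x))
            return self.caption_head(h.mean(dim=(2, 3))), self.seg_head(h)

    model = ToyDescriptionNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    image = torch.randn(2, 3, 32, 32)       # sample images
    text_ids = torch.randint(0, 100, (2,))  # target text information (toy labels)
    seg_mask = torch.rand(2, 1, 32, 32)     # target segmentation information
    logits, seg_logits = model(image)
    loss = (F.cross_entropy(logits, text_ids)
            + F.binary_cross_entropy_with_logits(seg_logits, seg_mask))
    loss.backward()
    opt.step()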
Example 9 provides a speech synthesis apparatus according to one or more embodiments of the present disclosure, wherein the apparatus includes:
the first acquisition module is used for acquiring a target image to be processed;
the generating module is used for extracting first feature information of the target image and second feature information of at least one target object in the target image, and generating description information corresponding to the target image according to the first feature information and the second feature information;
and the synthesis module is used for performing speech synthesis according to the description information so as to obtain the audio information corresponding to the target image.
Example 10 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing device, performs the steps of the method of any of examples 1-8, in accordance with one or more embodiments of the present disclosure.
Example 11 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of any of examples 1-8.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by interchanging the above features with (but not limited to) features disclosed in this disclosure that have similar functions.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (11)

1. A method of speech synthesis, the method comprising:
acquiring a target image to be processed;
extracting first feature information of the target image and second feature information of at least one target object in the target image, and generating description information corresponding to the target image according to the first feature information and the second feature information;
and performing speech synthesis according to the description information to obtain audio information corresponding to the target image.
2. The method according to claim 1, wherein the extracting first feature information of the target image and second feature information of at least one target object in the target image, and generating description information corresponding to the target image according to the first feature information and the second feature information comprises:
inputting the target image into a description information generation model, extracting first feature information of the target image and second feature information of at least one target object in the target image through the description information generation model, and generating description information corresponding to the target image according to the first feature information and the second feature information; wherein the description information generation model includes:
the first feature extraction submodel is used for performing feature extraction on the target image so as to obtain the first feature information;
the second feature extraction submodel is used for performing image segmentation on the target image so as to determine the at least one target object in the target image, and taking feature information corresponding to the target object in the last feature layer of the second feature extraction submodel as the second feature information;
and the description information generation submodel is used for obtaining target feature information according to the first feature information and the second feature information and generating description information corresponding to the target image according to the target feature information.
3. The method according to claim 1, wherein the performing speech synthesis according to the description information to obtain audio information corresponding to the target image comprises:
performing semantic correction on the description information to obtain corrected target description information;
and performing speech synthesis according to the target description information to obtain the audio information.
4. The method of claim 1, further comprising:
acquiring music information corresponding to the target image;
the performing speech synthesis according to the description information to obtain audio information corresponding to the target image includes:
and performing speech synthesis according to the description information and the music information to obtain audio information corresponding to the target image.
5. The method according to claim 4, wherein the acquiring of the music information corresponding to the target image includes:
displaying an audio template selection interface, wherein a plurality of candidate audio templates are loaded in the audio template selection interface;
in response to a selection operation of a user in the audio template selection interface, determining the audio template selected by the user as a target audio template, and extracting accompaniment information and lyric information from the target audio template to determine the accompaniment information and the lyric information as the music information;
or, the acquiring the music information corresponding to the target image includes:
receiving music score data input by a user;
and parsing the music score data, and determining the accompaniment information that is obtained by the parsing and corresponds to the music score data as the music information.
6. The method according to claim 4 or 5, characterized in that the method further comprises:
determining an audio segment related to the target object in the audio information;
performing, according to the audio segment, special effect processing on the image area where the target object is located, so as to obtain a special effect image;
and performing video synthesis according to the audio information, the target image and the special effect image to obtain a target video, wherein, in the target video, the image frames corresponding to the audio segment are the special effect image.
7. The method according to claim 4 or 5, wherein performing speech synthesis according to the description information and the music information to obtain audio information corresponding to the target image comprises:
determining speech acoustic feature information corresponding to the description information according to the music information;
performing speech synthesis according to the speech acoustic feature information to obtain speech waveform data;
and synthesizing the speech waveform data with accompaniment waveform data determined according to the music information to obtain the audio information.
8. The method of claim 2, wherein the description information generation model is obtained by:
acquiring a sample image, target text information corresponding to the sample image and target segmentation information corresponding to a sample object in the sample image, wherein the target segmentation information is used for representing the position of the sample object in the sample image;
and training a neural network model according to the sample image, the target text information and the target segmentation information to obtain the description information generation model.
9. A speech synthesis apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring a target image to be processed;
the generating module is used for extracting first feature information of the target image and second feature information of at least one target object in the target image, and generating description information corresponding to the target image according to the first feature information and the second feature information;
and the synthesis module is used for performing speech synthesis according to the description information so as to obtain the audio information corresponding to the target image.
10. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 8.
11. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 8.
CN202010771100.XA 2020-08-03 2020-08-03 Speech synthesis method, speech synthesis device, storage medium and electronic equipment Pending CN111916050A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010771100.XA CN111916050A (en) 2020-08-03 2020-08-03 Speech synthesis method, speech synthesis device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN111916050A true CN111916050A (en) 2020-11-10

Family

ID=73288085

Country Status (1)

Country Link
CN (1) CN111916050A (en)

Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570456A (en) * 2016-10-13 2017-04-19 华南理工大学 Handwritten Chinese character recognition method based on a fully convolutional recurrent network
CN107315772A (en) * 2017-05-24 2017-11-03 北京邮电大学 Question matching method and device based on deep learning
CN107678561A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Voice input error correction method and device based on artificial intelligence
CN108804428A (en) * 2018-06-12 2018-11-13 苏州大学 Method, system and related apparatus for correcting term mistranslation in translation
CN108985377A (en) * 2018-07-18 2018-12-11 太原理工大学 Image high-level semantics recognition method based on multi-feature fusion with a deep network
CN109685116A (en) * 2018-11-30 2019-04-26 腾讯科技(深圳)有限公司 Image description information generation method and device, and electronic device
CN109726696A (en) * 2019-01-03 2019-05-07 电子科技大学 Image description generation system and method based on a weighted attention mechanism
CN109801293A (en) * 2019-01-08 2019-05-24 平安科技(深圳)有限公司 Remote sensing image segmentation method, device, storage medium and server
CN110163829A (en) * 2019-04-19 2019-08-23 北京沃东天骏信息技术有限公司 Image generating method, device and computer readable storage medium
CN110309839A (en) * 2019-08-27 2019-10-08 北京金山数字娱乐科技有限公司 Image description method and device
CN110362698A (en) * 2019-07-08 2019-10-22 北京字节跳动网络技术有限公司 Picture information generation method, device, mobile terminal and storage medium
CN110390363A (en) * 2019-07-29 2019-10-29 上海海事大学 Image description method
CN110399516A (en) * 2019-07-29 2019-11-01 拉扎斯网络科技(上海)有限公司 Image processing method and apparatus, readable storage medium and electronic device
CN110519636A (en) * 2019-09-04 2019-11-29 腾讯科技(深圳)有限公司 Voice information playing method and device, computer equipment and storage medium
CN110674850A (en) * 2019-09-03 2020-01-10 武汉大学 Image description generation method based on attention mechanism
CN110717498A (en) * 2019-09-16 2020-01-21 腾讯科技(深圳)有限公司 Image description generation method and device, and electronic equipment
CN111062397A (en) * 2019-12-18 2020-04-24 厦门商集网络科技有限责任公司 Intelligent bill processing system
CN111275721A (en) * 2020-02-14 2020-06-12 北京推想科技有限公司 Image segmentation method and device, electronic equipment and storage medium
CN111402894A (en) * 2020-03-25 2020-07-10 北京声智科技有限公司 Voice recognition method and electronic equipment
CN111461968A (en) * 2020-04-01 2020-07-28 北京字节跳动网络技术有限公司 Picture processing method and device, electronic equipment and computer readable medium
CN111461964A (en) * 2020-04-01 2020-07-28 北京字节跳动网络技术有限公司 Picture processing method and device, electronic equipment and computer readable medium
CN111461969A (en) * 2020-04-01 2020-07-28 北京字节跳动网络技术有限公司 Picture processing method, device, electronic equipment and computer readable medium
CN111461965A (en) * 2020-04-01 2020-07-28 北京字节跳动网络技术有限公司 Picture processing method and device, electronic equipment and computer readable medium
CN111461967A (en) * 2020-04-01 2020-07-28 北京字节跳动网络技术有限公司 Picture processing method, device, equipment and computer readable medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709548A (en) * 2021-08-09 2021-11-26 北京达佳互联信息技术有限公司 Image-based multimedia data synthesis method, device, equipment and storage medium
CN113709548B (en) * 2021-08-09 2023-08-25 北京达佳互联信息技术有限公司 Image-based multimedia data synthesis method, device, equipment and storage medium
CN114945108A (en) * 2022-05-14 2022-08-26 云知声智能科技股份有限公司 Method and device for assisting visually impaired persons in understanding pictures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201110)