CN110808019A - Song generation method and electronic equipment - Google Patents


Info

Publication number
CN110808019A
Authority
CN
China
Prior art keywords
target
network model
song
image
tune
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911053532.0A
Other languages
Chinese (zh)
Inventor
曹新英
秦帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN201911053532.0A
Publication of CN110808019A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/0008 Associated control or indicating means
    • G10H 1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The embodiment of the invention discloses a song generation method and an electronic device. The method comprises: receiving an image and text information input by a user; determining a target tune, a target rhythm and target lyrics according to the image and the text information; and generating a target song according to the target tune, the target rhythm and the target lyrics. With the song generation method disclosed by the embodiment of the invention, the user only needs to input into the electronic device the image and the text information for the song to be generated, which triggers the electronic device to generate the song from them; even a non-professional can easily complete song production, the operation is convenient and fast, and no professional expertise is required.

Description

Song generation method and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of electronic equipment, in particular to a song generation method and electronic equipment.
Background
With the rapid development of mobile internet technology and the continued popularization of electronic devices, people rely on electronic devices more and more for work, communication, entertainment and other daily activities. Through an electronic device, a user may watch videos, play songs, navigate, communicate, and so on.
The songs played on an electronic device are all finished products, and the user can only choose among these finished products for playback. A finished song requires professional composers, lyricists and other professionals to produce collaboratively; song production therefore demands a very high degree of expertise, and non-professionals cannot create songs according to their personal needs.
Disclosure of Invention
The embodiment of the invention provides a song generation method, aiming to solve the prior-art problem that a non-professional cannot create and generate a song according to personal needs.
In order to solve the technical problem, the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a song generation method applied to an electronic device, where the method includes: receiving an image and text information input by a user; determining a target tune, a target rhythm and target lyrics according to the image and the text information; and generating a target song according to the target tune, the target rhythm and the target lyrics.
In a second aspect, an embodiment of the present invention provides an electronic device, including: a receiving module for receiving an image and text information input by a user; a determining module for determining a target tune, a target rhythm and target lyrics according to the image and the text information; and a generating module for generating a target song according to the target tune, the target rhythm and the target lyrics.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor; when executed by the processor, the computer program implements the steps of any one of the song generation methods described in the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of any one of the song generation methods described in the embodiments of the present invention.
In the embodiment of the invention, the electronic device receives an image and text information input by a user, determines a target tune, a target rhythm and target lyrics according to the image and the text information, and generates a target song according to the target tune, the target rhythm and the target lyrics. The user only needs to select the image and the text information for the song to be generated and input them into the electronic device, which triggers the electronic device to generate the song accordingly; even a non-professional can easily complete song production, the operation is convenient and fast, and no professional expertise is required.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flowchart illustrating steps of a song generation method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of steps of a song generation method according to a second embodiment of the invention;
FIG. 3 is a schematic diagram of model training;
fig. 4 is a block diagram of an electronic device according to a third embodiment of the present invention;
fig. 5 is a schematic diagram of a hardware structure of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present invention, it should be understood that the sequence numbers of the processes below do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
Example one
Referring to fig. 1, a flowchart illustrating steps of a song generating method according to a first embodiment of the present invention is shown.
The song generation method of the embodiment of the invention comprises the following steps:
Step 101: receiving image and text information input by a user.
The song generation method disclosed in the embodiment of the invention is applied to an electronic device on which song generation software is installed. When a user wants to make a song, the user inputs an image and text information into the electronic device, which triggers the electronic device to generate a song according to the input image and text information.
The text information can be an article, a poem or a short passage edited by the user, and the corresponding image can be an illustration in the article, a picture matching the poem, or a photo of a place taken by the user. The specific sources of the image and the text information are not limited in the embodiment of the invention.
Step 102: and determining a target tune, a target rhythm and target lyrics according to the image and the character information.
The image contains features such as the scene and the behavior of the subject, for example: a silent night, a cornfield under a clear sky, a competition venue, eaves on a rainy day, a solemn church, and so on. The style of the song can be determined from such image features, for example: light jazz, rock, a sad ballad, etc. From the semantic information and central meaning contained in the text information, the scene, main characters, time, events and emotion type described by the text can be analyzed, for example: students running on a campus playground in the early morning, dusk outside the window on a cloudy day, the sorrow of parting at a railway station at night, and so on. The target tune and target rhythm of the song to be generated can then be determined from the song style and from the scene, characters, time, events and emotion type described by the text information, while the target lyrics are determined from the content of the text information.
Step 103: and generating the target song according to the target tune, the target rhythm and the target lyrics.
After the target tune, the target rhythm and the target lyrics are determined, they can be synthesized in an existing manner to finally generate the target song. The specific synthesis manner can be selected by a person skilled in the art according to actual requirements, and is not limited in the embodiment of the invention.
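As an illustration only, the following is a minimal sketch of one possible synthesis route, not the method of the embodiment itself: the target tune is assumed to be a list of MIDI pitch numbers, the target rhythm a parallel list of note durations in beats, and the target lyrics a plain string; the open-source mido library is used here to write the result to a MIDI file, with the lyrics attached as MIDI lyric events.

```python
# Hedged sketch: one way to assemble tune + rhythm + lyrics into a song
# file. All data formats here (pitch lists, durations in beats, one word
# per note) are illustrative assumptions, not the patent's own format.
import mido

def synthesize_song(tune, rhythm, lyrics, path="target_song.mid"):
    mid = mido.MidiFile(ticks_per_beat=480)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    words = lyrics.split()
    for i, (pitch, beats) in enumerate(zip(tune, rhythm)):
        if i < len(words):  # attach one lyric word per note where possible
            track.append(mido.MetaMessage("lyrics", text=words[i], time=0))
        track.append(mido.Message("note_on", note=pitch, velocity=64, time=0))
        track.append(mido.Message("note_off", note=pitch, velocity=64,
                                  time=int(beats * mid.ticks_per_beat)))
    mid.save(path)

# Example: a four-note phrase sung over four words.
synthesize_song([60, 62, 64, 65], [1.0, 0.5, 0.5, 2.0], "let the song begin")
```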
With the song generation method provided by the embodiment of the invention, the electronic device receives an image and text information input by the user, determines a target tune, a target rhythm and target lyrics according to them, and generates the target song. The user only needs to select the image and the text information for the song to be generated and input them into the electronic device, which triggers the electronic device to generate the song accordingly; even a non-professional can easily complete song production, the operation is convenient and fast, and no professional expertise is required.
Example two
Referring to fig. 2, a flowchart illustrating steps of a song generating method according to a second embodiment of the present invention is shown.
The song generation method of the embodiment of the invention comprises the following steps:
Step 201: determining N groups of song training data.
Steps 201 to 202 constitute the training procedure for the network models used in the song generation method, namely: a first neural network model, a second neural network model, a tune generation network model, a rhythm generation network model and a lyric generation network model. To train these network models, N songs need to be labeled in advance to obtain N groups of song training data, where each group comprises the image, text information, tune, rhythm and lyrics corresponding to one song. N is a positive integer.
The songs used for training may be randomly selected from a song library or manually chosen by a person skilled in the art, and should cover different types, styles, languages, and so on.
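For concreteness, one group of song training data might be represented as follows; this is a sketch under the assumption that tunes and rhythms are stored as numeric sequences, and the field names are illustrative rather than taken from the embodiment.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SongTrainingSample:
    """One of the N labeled groups: an image and text information paired
    with the tune, rhythm and lyrics of the corresponding song."""
    image_path: str      # image associated with the song
    text: str            # text information (article, poem, short passage)
    tune: List[int]      # e.g. a MIDI pitch sequence
    rhythm: List[float]  # e.g. note durations in beats
    lyrics: str          # ground-truth lyrics of the song

# N such samples, covering different types, styles and languages,
# make up the training set.
training_data: List[SongTrainingSample] = []
```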
Step 202: and training the first neural network model, the second neural network model, the tune generation network model, the rhythm generation network model and the lyric generation network model through N groups of song training data.
In the training process, each group of song training data is extracted in turn from the N groups and used to train each network model, until each network model converges to a preset effect, at which point training ends. The network model training process is described schematically below with reference to fig. 3.
The electronic device extracts a group of song training data, takes the image and the text information as input, and feeds them into the first neural network model and the second neural network model respectively. The left and right rectangular regions encircled by dotted lines in fig. 3 represent the first neural network model and the second neural network model, which extract image features and text features; the extracted image features and text features are an n-dimensional feature vector and an m-dimensional feature vector respectively.
After the image feature vector and the text feature vector are obtained, they are spliced to obtain a target feature vector. The target feature vector is then fed into the tune generation network model, the rhythm generation network model and the lyric generation network model respectively, to determine the tune, the rhythm and the lyrics. All three generation models adopt the structure of a generative adversarial network, which comprises two parts: a generation network part and a discrimination network part. The generation network part generates the required result, such as the tune, the rhythm or the lyrics; the discrimination network part judges whether the generated result is accurate. For example, when judging whether a generated tune is accurate, the generated tune is compared with the tune of the original song to obtain the distance between the two; this distance serves as the loss function in training the generation network model, and the smaller the loss function, the more real and reliable the generated result. Training a network model is thus the process of adjusting its parameters to continuously reduce this loss function; when the loss function falls to a preset value, the network model is determined to have converged to the preset effect, and its training is complete.
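The following PyTorch sketch illustrates the generate/discriminate structure for the tune model only. The layer sizes, and the use of a mean absolute difference as the "distance" between the generated tune and the original tune, are assumptions for illustration; the embodiment only states that such a distance serves as the loss function.

```python
# Hedged sketch of a generation network part and a discrimination network
# part for the tune model; the rhythm and lyric models would be analogous.
import torch
import torch.nn as nn

class TuneGenerator(nn.Module):
    def __init__(self, feat_dim: int, seq_len: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, seq_len))

    def forward(self, target_feature):   # target feature vector -> tune
        return self.net(target_feature)

class TuneDiscriminator(nn.Module):
    def __init__(self, seq_len: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(seq_len, 128), nn.ReLU(),
                                 nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, tune):             # probability that the tune is real
        return self.net(tune)

def tune_loss(generated_tune, original_tune, discriminator):
    # Distance between the generated tune and the original song's tune:
    # the smaller this loss, the more "real and reliable" the result.
    distance = torch.mean(torch.abs(generated_tune - original_tune))
    adversarial = -torch.log(discriminator(generated_tune) + 1e-8).mean()
    return distance + adversarial
```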
Specifically, the rectangular portions circled by solid lines in fig. 3 are, from left to right, the tune generation network model, the rhythm generation network model and the lyric generation network model. Since the tune, the rhythm and the lyrics are all sequence data, the three generation models adopt structures for generating serialized data. The tune, rhythm and lyrics output by the generation models are compared with the tune, rhythm and lyrics in the group of song training data respectively, to obtain the loss functions. When the loss function of a generation model is higher than the preset value, its parameters are adjusted accordingly and the next group of song training data is input to continue training; this process is repeated until the loss functions of all three generation models have fallen to the preset value, at which point training stops.
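A minimal sketch of that stopping criterion follows, assuming each group of training data yields one loss per generation model and that the three losses backpropagate through independent parameters (for example, with the fused target feature detached); the preset value is an illustrative assumption.

```python
# Keep cycling through the N groups of song training data until the losses
# of all three generation models have fallen to the preset value.
PRESET_LOSS = 0.05  # illustrative convergence threshold

def train_until_converged(optimizers, training_data, compute_losses):
    while True:
        worst = 0.0
        for sample in training_data:          # one group at a time
            losses = compute_losses(sample)   # (tune, rhythm, lyric) losses
            for loss, opt in zip(losses, optimizers):
                opt.zero_grad()
                loss.backward()
                opt.step()
            worst = max(worst, *(l.item() for l in losses))
        if worst <= PRESET_LOSS:              # all three below preset value
            return
```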
The above only exemplarily describes the process of training the network models from training data; in a specific implementation, a person skilled in the art may train the network models required for song generation in any suitable existing manner.
It should be noted that steps 201 to 202 are optional: if each network model has already been trained before the song generation method of the embodiment of the present invention is executed, the model training process need not be performed, and step 203 and the subsequent steps are executed directly.
Step 203: and receiving image and text information input by a user.
The embodiment of the invention is described taking the generation of a song from an image and text information as an example, so when a user wants to make a song, the user inputs an image and text information into the electronic device. The image and the text information may be imported into the electronic device by the user, downloaded from the network, entered manually, or drawn by the user.
Step 204: and respectively extracting the features of the image and the character information to obtain the image features and the character features.
A preferred way of extracting features from the image and the text information respectively to obtain the image features and the text features is as follows:
inputting the image into a first neural network model; acquiring an image feature vector output by the first neural network model, where the image feature vector is used to represent the style of the target song; inputting the text information into a second neural network model; and acquiring a text feature vector output by the second neural network model, where the text feature vector is used to represent the central meaning of the target song.
Preferably, the first neural network model is a convolutional neural network model, since convolutional neural networks perform well at image feature extraction. The second neural network model is a recurrent neural network: it is mainly used to extract features from the text information, which is sequence data, and recurrent neural networks are better suited to extracting features from sequence data.
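A minimal sketch of the two preferred extractors, with illustrative sizes: a small convolutional network that maps an image to an n-dimensional feature vector, and a recurrent (GRU) network that maps a token sequence to an m-dimensional feature vector.

```python
import torch
import torch.nn as nn

class ImageFeatureNet(nn.Module):      # first neural network model (CNN)
    def __init__(self, n_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, n_dim)

    def forward(self, image):          # (B, 3, H, W) -> (B, n)
        return self.fc(self.conv(image).flatten(1))

class TextFeatureNet(nn.Module):       # second neural network model (RNN)
    def __init__(self, vocab_size: int = 10000, m_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.gru = nn.GRU(64, m_dim, batch_first=True)

    def forward(self, token_ids):      # (B, T) token ids -> (B, m)
        _, hidden = self.gru(self.embed(token_ids))
        return hidden[-1]              # final hidden state as the feature
```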
Step 205: and carrying out feature fusion on the image features and the character features to obtain target features.
For the specific technique of fusing the image features and the text features, reference may be made to existing related technology; this is not specifically limited in the embodiment of the present invention.
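One simple fusion consistent with the splicing described in the training process is plain concatenation; this is an assumption for illustration, since the embodiment leaves the fusion technique open.

```python
import torch

def fuse_features(image_feat, text_feat):
    # (B, n) image features + (B, m) text features -> (B, n + m) target features
    return torch.cat([image_feat, text_feat], dim=1)
```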
Step 206: and determining a target tune, a target rhythm and target lyrics according to the target characteristics.
The specific implementation of this step is similar to step 102, and is not described herein again.
In addition, optionally, in a specific implementation, the target features may be input into the tune generation network model, the rhythm generation network model and the lyric generation network model respectively, and the target tune output by the tune generation network model, the target rhythm output by the rhythm generation network model and the target lyrics output by the lyric generation network model are obtained respectively.
Determining the tune, the rhythm and the lyrics through pre-trained generation network models has the advantage that the output tune, rhythm and lyrics are more reliable.
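Putting the pieces together, the inference step might look as follows; the model objects are the illustrative ones sketched above, not a prescribed implementation.

```python
def determine_song_parts(target_feature, tune_model, rhythm_model, lyric_model):
    tune = tune_model(target_feature)      # target tune
    rhythm = rhythm_model(target_feature)  # target rhythm
    lyrics = lyric_model(target_feature)   # target lyrics
    return tune, rhythm, lyrics
```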
Step 207: and generating the target song according to the target tune, the target rhythm and the target lyrics.
After the target tune, the target rhythm and the target lyrics are determined, the target song can be generated automatically by the song generation software.
After the target song is generated, an audition key, a save key and a delete key can be displayed on the interface; the user can trigger the audition key to listen to the target song, or directly trigger the save key to store it. After auditioning the target song, the user can choose to save or delete it according to how the song turned out.
Optionally, steps 204 to 206 may be replaced with step 102 in the first embodiment; this is not specifically limited in the embodiment of the present invention.
With the song generation method provided by the embodiment of the invention, the electronic device receives an image and text information input by the user, determines a target tune, a target rhythm and target lyrics according to them, and generates the target song. The user only needs to select the image and the text information for the song to be generated and input them into the electronic device; even a non-professional can easily complete song production, the operation is convenient and fast, and no professional expertise is required. In addition, because the tune, rhythm and lyrics of the target song are determined by several pre-trained generation network models, the generated target song matches the user's input image and text information more closely.
Having described the song generation method according to the embodiment of the present invention, an electronic device according to the embodiment of the present invention will be described with reference to the accompanying drawings.
EXAMPLE III
Referring to fig. 4, a block diagram of an electronic device according to a third embodiment of the present invention is shown.
The electronic device of the embodiment of the invention comprises: a receiving module 401, configured to receive an image and text information input by a user; a determining module 402, configured to determine a target tune, a target rhythm and target lyrics according to the image and the text information; and a generating module 403, configured to generate a target song according to the target tune, the target rhythm and the target lyrics.
Preferably, the determining module 402 comprises: an extraction submodule 4021, configured to extract features from the image and the text information respectively to obtain image features and text features; a fusion submodule 4022, configured to perform feature fusion on the image features and the text features to obtain target features; and a determining submodule 4023, configured to determine a target tune, a target rhythm and target lyrics according to the target features.
Preferably, the extraction submodule is specifically configured to: input the image into a first neural network model; acquire an image feature vector output by the first neural network model, where the image feature vector is used to represent the style of the target song; input the text information into a second neural network model; and acquire a text feature vector output by the second neural network model, where the text feature vector is used to represent the central meaning of the target song.
Preferably, the determining submodule is specifically configured to: input the target features into a tune generation network model, a rhythm generation network model and a lyric generation network model respectively; and acquire the target tune output by the tune generation network model, the target rhythm output by the rhythm generation network model and the target lyrics output by the lyric generation network model respectively.
Preferably, the electronic device further includes: a training data determining module 404, configured to determine N groups of song training data before the receiving module receives the image and text information input by the user, where each group of song training data includes: images, text information, tunes, rhythms and lyrics corresponding to the songs; a model training module 405, configured to train the first neural network model, the second neural network model, the tune generating network model, the rhythm generating network model, and the lyric generating network model through the N groups of song training data.
The electronic device provided in the embodiment of the present invention can implement each process implemented by the electronic device in the method embodiments of fig. 1 to fig. 3, and is not described herein again to avoid repetition.
The electronic device provided by the embodiment of the invention receives an image and text information input by the user, determines a target tune, a target rhythm and target lyrics according to them, and generates the target song. The user only needs to select the image and the text information for the song to be generated and input them into the electronic device, which triggers the electronic device to generate the song accordingly; even a non-professional can easily complete song production, the operation is convenient and fast, and no professional expertise is required.
Example four
Referring to fig. 5, a block diagram of an electronic device according to a fourth embodiment of the present invention is shown.
Fig. 5 is a schematic diagram of a hardware structure of an electronic device 500 for implementing various embodiments of the present invention, where the electronic device 500 includes, but is not limited to: a radio frequency unit 501, a network module 502, an audio output unit 503, an input unit 504, a sensor 505, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, a processor 510, and a power supply 511. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 5 does not constitute a limitation of the electronic device, and that the electronic device may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
The user input unit 507 is configured to receive an image and text information input by a user; the processor 510 is configured to determine a target tune, a target rhythm and target lyrics according to the image and the text information, and to generate a target song according to the target tune, the target rhythm and the target lyrics.
The electronic device provided by the embodiment of the invention receives an image and text information input by the user, determines a target tune, a target rhythm and target lyrics according to them, and generates the target song. The user only needs to select the image and the text information for the song to be generated and input them into the electronic device, which triggers the electronic device to generate the song accordingly; even a non-professional can easily complete song production, the operation is convenient and fast, and no professional expertise is required.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 501 may be used for receiving and sending signals during a message transceiving or call process; specifically, it receives downlink data from a base station and forwards it to the processor 510 for processing, and sends uplink data to the base station. In general, the radio frequency unit 501 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 501 can also communicate with a network and other devices through a wireless communication system.
The electronic device provides wireless broadband internet access to the user via the network module 502, such as assisting the user in sending and receiving e-mails, browsing web pages, and accessing streaming media.
The audio output unit 503 may convert audio data received by the radio frequency unit 501 or the network module 502 or stored in the memory 509 into an audio signal and output as sound. Also, the audio output unit 503 may also provide audio output related to a specific function performed by the electronic apparatus 500 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 503 includes a speaker, a buzzer, a receiver, and the like.
The input unit 504 is used to receive an audio or video signal. The input unit 504 may include a Graphics Processing Unit (GPU) 5041 and a microphone 5042. The graphics processor 5041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode, and the processed image frames may be displayed on the display unit 506. The image frames processed by the graphics processor 5041 may be stored in the memory 509 (or other storage medium) or transmitted via the radio frequency unit 501 or the network module 502. The microphone 5042 may receive sounds and may be capable of processing such sounds into audio data; in the phone call mode, the processed audio data may be converted into a format transmittable to a mobile communication base station and output via the radio frequency unit 501.
The electronic device 500 also includes at least one sensor 505, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 5061 according to the brightness of ambient light, and a proximity sensor that can turn off the display panel 5061 and/or a backlight when the electronic device 500 is moved to the ear. The display panel 5061 may be a flexible display screen comprising a screen base, a liftable module array and a flexible screen stacked in sequence. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of the electronic device (such as horizontal/vertical screen switching, related games, magnetometer posture calibration) and for vibration recognition functions (such as pedometer or tapping); the sensors 505 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which are not described in detail herein.
The display unit 506 is used to display information input by the user or information provided to the user. The Display unit 506 may include a Display panel 5061, and the Display panel 5061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 507 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 507 includes a touch panel 5071 and other input devices 5072. The touch panel 5071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations by a user on or near the touch panel 5071 using a finger, a stylus, or any suitable object or attachment). The touch panel 5071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 510, and receives and executes commands sent by the processor 510. The touch panel 5071 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 5071, the user input unit 507 may include other input devices 5072. In particular, the other input devices 5072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
Further, the touch panel 5071 may be overlaid on the display panel 5061, and when the touch panel 5071 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 510 to determine the type of the touch event, and then the processor 510 provides a corresponding visual output on the display panel 5061 according to the type of the touch event. Although in fig. 5, the touch panel 5071 and the display panel 5061 are two independent components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 5071 and the display panel 5061 may be integrated to implement the input and output functions of the electronic device, and is not limited herein.
The interface unit 508 is an interface for connecting an external device to the electronic apparatus 500. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 508 may be used to receive input (e.g., data information, power, etc.) from external devices and transmit the received input to one or more elements within the electronic apparatus 500 or may be used to transmit data between the electronic apparatus 500 and external devices.
The memory 509 may be used to store software programs as well as various data. The memory 509 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 509 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 510 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 509 and calling data stored in the memory 509, thereby performing overall monitoring of the electronic device. Processor 510 may include one or more processing units; preferably, the processor 510 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 510.
The electronic device 500 may further include a power supply 511 (e.g., a battery) for supplying power to various components, and preferably, the power supply 511 may be logically connected to the processor 510 via a power management system, so as to implement functions of managing charging, discharging, and power consumption via the power management system.
In addition, the electronic device 500 includes some functional modules that are not shown, and are not described in detail herein.
Preferably, an embodiment of the present invention further provides an electronic device, including a processor 510, a memory 509, and a computer program stored in the memory 509 and executable on the processor 510. When executed by the processor 510, the computer program implements the processes of the above song generation method embodiments and can achieve the same technical effect; to avoid repetition, details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the embodiment of the song generating method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A song generation method applied to an electronic device, characterized by comprising the following steps:
receiving an image and text information input by a user;
determining a target tune, a target rhythm and target lyrics according to the image and the text information;
and generating a target song according to the target tune, the target rhythm and the target lyrics.
2. The method of claim 1,
the step of determining a target tune, a target rhythm and target lyrics according to the image and the text information specifically comprises the following steps:
extracting features from the image and from the text information respectively, to obtain image features and text features;
performing feature fusion on the image features and the text features to obtain target features;
and determining a target tune, a target rhythm and target lyrics according to the target features.
3. The method of claim 2,
the step of extracting features from the image and from the text information respectively to obtain the image features and the text features specifically comprises the following steps:
inputting the image into a first neural network model;
acquiring an image feature vector output by the first neural network model, wherein the image feature vector is used for representing the style of a target song;
inputting the text information into a second neural network model;
and acquiring a text feature vector output by the second neural network model, wherein the text feature vector is used for representing the central meaning of the target song.
4. The method of claim 2,
the step of determining a target tune, a target rhythm and target lyrics according to the target features specifically comprises the following steps:
inputting the target features into a tune generation network model, a rhythm generation network model and a lyric generation network model respectively;
and respectively acquiring a target tune output by the tune generation network model, a target rhythm output by the rhythm generation network model and target lyrics output by the lyric generation network model.
5. The method of claim 1,
before the step of receiving the image and the text information input by the user, the method further comprises:
determining N groups of song training data, wherein each group of song training data comprises: images, text information, tunes, rhythms and lyrics corresponding to the songs;
and training the first neural network model, the second neural network model, the tune generation network model, the rhythm generation network model and the lyric generation network model through the N groups of song training data.
6. An electronic device, characterized in that the electronic device comprises:
the receiving module is used for receiving an image and text information input by a user;
the determining module is used for determining a target tune, a target rhythm and target lyrics according to the image and the text information;
and the generating module is used for generating the target song according to the target tune, the target rhythm and the target lyrics.
7. The electronic device of claim 6, wherein the determining module comprises:
the extraction submodule is used for extracting features from the image and from the text information respectively, to obtain image features and text features;
the fusion submodule is used for performing feature fusion on the image features and the text features to obtain target features;
and the determining submodule is used for determining a target tune, a target rhythm and target lyrics according to the target features.
8. The electronic device of claim 7, wherein the extraction submodule is specifically configured to:
inputting the image into a first neural network model;
acquiring an image feature vector output by the first neural network model, wherein the image feature vector is used for representing the style of a target song;
inputting the text information into a second neural network model;
and acquiring a text feature vector output by the second neural network model, wherein the text feature vector is used for representing the central meaning of the target song.
9. The electronic device of claim 7, wherein the determination submodule is specifically configured to:
inputting the target features into a tune generation network model, a rhythm generation network model and a lyric generation network model respectively;
and respectively acquiring a target tune output by the tune generation network model, a target rhythm output by the rhythm generation network model and target lyrics output by the lyric generation network model.
10. The electronic device of claim 6, further comprising:
a training data determining module, configured to determine N groups of song training data before the receiving module receives image and text information input by a user, where each group of song training data includes: images, text information, tunes, rhythms and lyrics corresponding to the songs;
and the model training module is used for training the first neural network model, the second neural network model, the tune generation network model, the rhythm generation network model and the lyric generation network model through the N groups of song training data.
CN201911053532.0A 2019-10-31 2019-10-31 Song generation method and electronic equipment Pending CN110808019A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911053532.0A CN110808019A (en) 2019-10-31 2019-10-31 Song generation method and electronic equipment


Publications (1)

Publication Number Publication Date
CN110808019A (en) 2020-02-18

Family

ID=69489825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911053532.0A Pending CN110808019A (en) 2019-10-31 2019-10-31 Song generation method and electronic equipment

Country Status (1)

Country Link
CN (1) CN110808019A (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003099045A (en) * 2001-09-21 2003-04-04 Yamaha Corp Automatic music composing program
CN106328176A (en) * 2016-08-15 2017-01-11 广州酷狗计算机科技有限公司 Method and device for generating song audio
CN106528655A (en) * 2016-10-18 2017-03-22 百度在线网络技术(北京)有限公司 Text subject recognition method and device
CN108806655A (en) * 2017-04-26 2018-11-13 微软技术许可有限责任公司 Song automatically generates
CN108806656A (en) * 2017-04-26 2018-11-13 微软技术许可有限责任公司 Song automatically generates
CN109599079A (en) * 2017-09-30 2019-04-09 腾讯科技(深圳)有限公司 A kind of generation method and device of music
CN108090140A (en) * 2017-12-04 2018-05-29 维沃移动通信有限公司 A kind of playback of songs method and mobile terminal
CN108399409A (en) * 2018-01-19 2018-08-14 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN110309327A (en) * 2018-02-28 2019-10-08 北京搜狗科技发展有限公司 Audio generation method, device and the generating means for audio
CN109448683A (en) * 2018-11-12 2019-03-08 平安科技(深圳)有限公司 Music generating method and device neural network based

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111326131A (en) * 2020-03-03 2020-06-23 北京香侬慧语科技有限责任公司 Song conversion method, device, equipment and medium
CN111326131B (en) * 2020-03-03 2023-06-02 北京香侬慧语科技有限责任公司 Song conversion method, device, equipment and medium
CN111680185A (en) * 2020-05-29 2020-09-18 平安科技(深圳)有限公司 Music generation method, music generation device, electronic device and storage medium
CN113160781A (en) * 2021-04-12 2021-07-23 广州酷狗计算机科技有限公司 Audio generation method and device, computer equipment and storage medium
CN113160781B (en) * 2021-04-12 2023-11-17 广州酷狗计算机科技有限公司 Audio generation method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109558512B (en) Audio-based personalized recommendation method and device and mobile terminal
CN110097872B (en) Audio processing method and electronic equipment
CN111524501B (en) Voice playing method, device, computer equipment and computer readable storage medium
CN108668024B (en) Voice processing method and terminal
CN110568926B (en) Sound signal processing method and terminal equipment
CN111402866A (en) Semantic recognition method and device and electronic equipment
CN109819167B (en) Image processing method and device and mobile terminal
CN110830368B (en) Instant messaging message sending method and electronic equipment
CN109885162B (en) Vibration method and mobile terminal
WO2021179991A1 (en) Audio processing method and electronic device
CN110808019A (en) Song generation method and electronic equipment
CN107798107A (en) The method and mobile device of song recommendations
CN109246474B (en) Video file editing method and mobile terminal
CN109391842B (en) Dubbing method and mobile terminal
CN111370018A (en) Audio data processing method, electronic device and medium
CN111491211A (en) Video processing method, video processing device and electronic equipment
CN111601174A (en) Subtitle adding method and device
CN110706679B (en) Audio processing method and electronic equipment
CN110544287B (en) Picture allocation processing method and electronic equipment
CN109949809B (en) Voice control method and terminal equipment
CN111143614A (en) Video display method and electronic equipment
CN108763475B (en) Recording method, recording device and terminal equipment
CN112735388B (en) Network model training method, voice recognition processing method and related equipment
CN111292727B (en) Voice recognition method and electronic equipment
CN109582820B (en) Song playing method, terminal equipment and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200218)