CN110019919B - Method and device for generating rhyming lyrics


Info

Publication number
CN110019919B
CN110019919B
Authority
CN
China
Prior art keywords
rhyme
lyrics
images
image
description
Prior art date
Legal status
Active
Application number
CN201710939775.9A
Other languages
Chinese (zh)
Other versions
CN110019919A (en)
Inventor
邹子馨
王楠
朱晓龙
张友谊
林少彬
郑永森
李廣之
康世胤
陀得意
何静
陈在真
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201710939775.9A
Publication of CN110019919A
Application granted
Publication of CN110019919B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval of audio data
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 Retrieval using metadata automatically derived from the content
    • G06F 16/685 Retrieval using an automatically derived transcript of audio data, e.g. lyrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements


Abstract

The embodiments of the invention disclose a method and device for generating rhyming lyrics, used to generate rhyming lyrics automatically from input images. The method provided by an embodiment of the invention comprises the following steps: performing scene recognition on each of a plurality of images input at a terminal, and generating description text matching the scene of each image; obtaining, from the description text matching each image's scene, the Chinese pinyin and final of the last character of the description text; and generating rhyming lyrics for the plurality of images according to the pinyin and rhyme foot of the last character of the description text, where the rhyming lyrics for each image have the same rhyme foot as the last character of the description text matching that image's scene.

Description

Method and device for generating rhyming lyrics
Technical Field
The invention relates to the field of computer technology, and in particular to a method and device for generating rhyming lyrics.
Background
Music plays an irreplaceable role in people's lives and can be divided into many genres according to rhythm. Hip-hop music (rap music or hip hop) is a style in which lyrics are recited rhythmically (rapped) over an accompaniment, the accompaniment typically being produced by music sampling. At present, music is mainly created manually; hip-hop music, for example, is composed by professional hip-hop singers. People without a musical background, however, have no ability to create music at all.
To make music creation accessible without such a barrier, music that ordinary users can enjoy needs to be generated, and in the music generation process the design of the rhyme is a crucial link. In the prior art, rhyming text is generally written by hand, and the rhyme scheme is likewise determined manually. This manual approach to rhyme design is very time-consuming and cannot generate rhyming lyrics automatically.
Disclosure of Invention
The embodiments of the invention provide a method and device for generating rhyming lyrics, used to generate rhyming lyrics automatically from input images.
To solve the above technical problem, the embodiments of the present invention provide the following technical solutions:
in a first aspect, an embodiment of the present invention provides a method for generating rhyming lyrics, including:
performing scene recognition on each of a plurality of images input at a terminal, and generating description text matching the scene of each image;
obtaining, from the description text matching each image's scene, the Chinese pinyin and final of the last character of the description text;
and generating rhyming lyrics for the plurality of images according to the pinyin and rhyme foot of the last character of the description text, where the rhyming lyrics for each image have the same rhyme foot as the last character of the description text matching that image's scene.
In a second aspect, an embodiment of the present invention further provides an apparatus for generating rhyming lyrics, the apparatus including:
a scene recognition module, configured to perform scene recognition on a plurality of images input at a terminal and generate description text matching the scene of each image;
a final acquisition module, configured to obtain, from the description text matching each image's scene, the Chinese pinyin and final of the last character of the description text;
and a lyric generation module, configured to generate rhyming lyrics for the plurality of images according to the Chinese pinyin and rhyme foot of the last character of the description text, where the rhyming lyrics for each image have the same rhyme foot as the last character of the description text matching that image's scene.
In a third aspect of the present application, a computer-readable storage medium is provided, having instructions stored therein which, when run on a computer, cause the computer to perform the method of the above aspects.
As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages:
in the embodiment of the invention, scene recognition is first performed on each of a plurality of images input at a terminal to generate description text matching the scene of each image; the Chinese pinyin and rhyme foot of the last character are then obtained from the description text matching each image's scene; finally, rhyming lyrics for the plurality of images are generated according to that pinyin and rhyme foot, the rhyming lyrics for each image sharing the rhyme foot of the last character of the matching description text. In the embodiment of the invention, image music can be generated simply by providing several images at the terminal: scene recognition is performed on the images, description text suited to each scene is matched automatically, and rhyme design is applied to that text, so the generated rhyming lyrics fit the character of the music. Because the rhyming lyrics are generated from the images input at the terminal, the output image music is closely tied to the image material the user provides, and rhyming lyrics can be generated automatically from the input images.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for generating rhyming lyrics according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a music generation method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a generation process for hip-hop music according to an embodiment of the present invention;
Fig. 4-a is a schematic diagram of a client uploading a plurality of images according to an embodiment of the present invention;
Fig. 4-b is a schematic flowchart of converting rhyming lyrics into speech according to an embodiment of the present invention;
Fig. 5-a is a schematic structural diagram of a music generation apparatus according to an embodiment of the present invention;
Fig. 5-b is a schematic structural diagram of a scene recognition module according to an embodiment of the present invention;
Fig. 5-c is a schematic structural diagram of a rhyme matching module according to an embodiment of the present invention;
Fig. 5-d is a schematic structural diagram of a lyric generation module according to an embodiment of the present invention;
Fig. 5-e is a schematic structural diagram of a lyric acquisition module according to an embodiment of the present invention;
Fig. 5-f is a schematic structural diagram of a speech generation module according to an embodiment of the present invention;
Fig. 6-a is a schematic structural diagram of an apparatus for generating rhyming lyrics according to an embodiment of the present invention;
Fig. 6-b is a schematic structural diagram of a scene recognition module according to an embodiment of the present invention;
Fig. 6-c is a schematic structural diagram of a lyric generation module according to an embodiment of the present invention;
Fig. 6-d is a schematic structural diagram of a lyric acquisition module according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a terminal to which the method for generating rhyming lyrics according to an embodiment of the present invention is applied.
Detailed Description
The embodiments of the invention provide a method and device for generating rhyming lyrics, used to generate rhyming lyrics automatically from input images.
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one skilled in the art from the embodiments given herein are intended to be within the scope of the invention.
The terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The following are detailed descriptions.
The method for generating rhyming lyrics according to the embodiments of the present invention can be applied, in particular, to generating rhyming lyrics that match the description text of various images input by a user. Referring to Fig. 1, a method for generating rhyming lyrics according to an embodiment of the present invention may include the following steps:
101. Perform scene recognition on each of a plurality of images input at a terminal, and generate description text matching the scene of each image.
In the embodiment of the present invention, a user may input a plurality of images at a terminal to generate image music; image music as described in the embodiments of the present invention refers to music with a rhythm adapted to the plurality of images the user inputs. The images input at the terminal may be pre-stored on the terminal by the user, or captured by the user in real time with the terminal's camera; for example, the plurality of images may be captured after the terminal enters a photographing mode, or obtained from an album of the terminal. The manner in which the plurality of images are input at the terminal is not limited.
In the embodiment of the present invention, scene recognition may be performed on each of the plurality of images input at the terminal to identify the scene of each image. Image scenes can be classified in many ways; for example, four main scene classes may be used: landscape, people, food, and self-portrait. Scene recognition is performed on each picture uploaded by the user, and text describing the scene of each image is matched automatically. For example, if an image shows a bird in a blue sky, scene recognition automatically yields the description "a bird soars in the blue sky".
In some embodiments of the present invention, step 101 of performing scene recognition on the plurality of images input at the terminal and generating description text matching the scene of each image includes:
A1. performing scene recognition on the plurality of images using a deep-learning neural network model to obtain recognized image features, and determining the scene of each image from the image features;
A2. performing image description generation from the recognized image features and each image's scene, to obtain description text matching each image's scene.
In the embodiment of the invention, a deep-learning neural network model, which may also be called a neural image annotation model, can be used for scene recognition on the plurality of images: image features are recognized through the model, and the scene of each image is determined from those features. Image recognition is the technique of using a computer to process, analyze, and understand an image in order to recognize targets and objects of various patterns. Image description generation is then performed from the recognized image features and each image's scene, yielding description text matching each scene; that is, the deep-learning neural network identifies the image scene and automatically matches related description text for it. Image description generation here means extracting image features based on computer vision, using scene and object category information as prior knowledge, and cooperatively generating a description sentence that fuses the scene and the object categories.
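The patent does not name a concrete model, so the following minimal Python sketch uses an off-the-shelf image-captioning model as a stand-in for the neural image annotation model; the Hugging Face pipeline and model name are assumptions for illustration only.

```python
# Sketch only: a generic pretrained captioning model stands in for the
# patent's deep-learning neural image annotation model.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")  # assumed model

def describe_images(image_paths: list[str]) -> list[str]:
    """Step A2: return one description sentence per input image."""
    return [captioner(path)[0]["generated_text"] for path in image_paths]

# Usage: describe_images(["photo1.jpg", "photo2.jpg"]) might return
# ["a bird flying over a blue sky", "a group of people on a beach"].
```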
102. Obtain, from the description text matching each image's scene, the pinyin and final of the last character of the description text.
103. Generate rhyming lyrics for the plurality of images according to the Chinese pinyin and rhyme foot of the last character of the description text, where the rhyming lyrics for each image have the same rhyme foot as the last character of the description text matching that image's scene.
In the embodiment of the invention, description text matching each image's scene can be generated through scene recognition; this text is the basis for generating the lyrics. Rhyme design is carried out on the description text matched to each image's scene, and rhyming lyrics are generated for each image. Rhyming lyrics here mean a passage of lyrics that rhyme; the rhyming lyrics for an image may be one line of lyrics, or two or more lines.
In some embodiments of the present invention, the pinyin and final of the last character are obtained from the description text matching each image's scene, and rhyming lyrics for the plurality of images are generated according to that pinyin and rhyme foot, the rhyming lyrics for each image sharing the rhyme foot of the last character of the matching description text.
The pinyin and final of the last character can be obtained from the description text matching each image's scene. There are fewer than 8,000 commonly used Chinese characters, so a pinyin table of the common characters can be generated in advance, indexed by character, and loaded into memory, and the pinyin of any character can then be fetched as needed. The finals table shows 35 finals; all finals can be placed in an array, sorted by length, and compared against the syllable string in order to obtain the final of the last character. After the pinyin and final of the last character of the description text are obtained, rhyming lyrics for the plurality of images are generated on that basis, the rhyming lyrics for each image sharing the final of the last character of the matching description text. The generated lyrics thus derive their rhyme foot from the final of the description text's last character, and using the same final throughout gives the lyrics a harmonious, unified rhyme, so the rhyming lyrics generated for the plurality of images read more smoothly.
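A minimal sketch of this lookup, assuming the pypinyin library stands in for the pre-built pinyin table; the finals list below follows the standard 35-final table but is illustrative rather than quoted from the patent.

```python
from pypinyin import lazy_pinyin

# Finals of Mandarin pinyin, sorted longest-first so compound finals
# (e.g. "ian") are matched before their single-final suffixes ("an").
FINALS = sorted(
    ["a", "o", "e", "i", "u", "v", "ai", "ei", "ui", "ao", "ou", "iu",
     "ie", "ve", "er", "an", "en", "in", "un", "vn", "ang", "eng", "ing",
     "ong", "ia", "iao", "ian", "iang", "iong", "ua", "uo", "uai", "uan",
     "uang", "van"],
    key=len, reverse=True)

def final_of_last_char(text: str) -> tuple[str, str]:
    """Return (pinyin, final) for the last character of a description."""
    syllable = lazy_pinyin(text)[-1]   # pinyin of the last character
    for f in FINALS:                   # longest match wins
        if syllable.endswith(f):
            return syllable, f
    return syllable, ""
```

For the description "a bird soars in the blue sky" (小鸟在蓝天翱翔), the last character 翔 yields the syllable "xiang" and the final "iang", which then serves as the rhyme foot.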
Further, in some embodiments of the present invention, step 103 of generating rhyming lyrics for the plurality of images according to the pinyin and final of the last character of the description text includes:
1031. listing all finals that may occur in the pinyin of the last character of the description text;
1032. determining a final distribution rule from all the listed finals;
1033. determining, from the finals conforming to the final distribution rule, the rhyme foot corresponding to the last character of the description text;
1034. obtaining the rhyming lyrics for the plurality of images from a pre-generated lyric template according to each image's scene and the rhyme foot for that scene, where lyric text for a plurality of scenes and a plurality of rhyme feet is configured in the lyric template in advance.
Each image's scene is matched with description text ending in a last character. All possible finals of that last character's pinyin can be listed, and for each final, several lines of description text for different scenes are generated in advance as lyric templates. From many data samples of description text, the distribution of last-character finals is determined, the several most frequent finals are found, and the data volume for those finals is increased. Which final serves as the rhyme foot is thus decided by the final distribution rule, the lyric templates are searched on that basis, and the rhyming lyrics corresponding to the images are finally obtained from the templates.
For example, taking the generation of rhyming lyrics for hip-hop music, hip-hop lines corresponding to different finals in different scenes can be generated as lyric templates, with more lines generated for high-frequency finals so there is more to choose from. A matching line is then selected at random according to the final and the scene to produce the rhyming lyrics of the hip-hop music. For the same template set, hip-hop lines generated for the same final rhyme with each other; when a certain final occurs frequently, more lyric templates can be generated for that high-frequency rhyme foot, so the rhyming lyrics can be drawn from a larger pool of templates.
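A sketch of this template lookup, assuming an in-memory table keyed by (scene, rhyme foot) with more entries for high-frequency finals; the scene names echo the patent, but the table contents are placeholders rather than the patent's lyric data.

```python
import random

# Placeholder template table: (scene, rhyme foot) -> candidate lines.
LYRIC_TEMPLATES: dict[tuple[str, str], list[str]] = {
    ("landscape", "iang"): ["template line 1", "template line 2"],
    ("people", "iang"): ["template line 3"],
    ("food", "a"): ["template line 4", "template line 5"],
}

def pick_supplementary_lyric(scene: str, rhyme_foot: str) -> str:
    """Step 10342: randomly pick a template line matching scene and rhyme."""
    candidates = LYRIC_TEMPLATES.get((scene, rhyme_foot), [])
    return random.choice(candidates) if candidates else ""

def rhyming_lyrics(description: str, scene: str, rhyme_foot: str) -> list[str]:
    """Steps 10341-10343: the description text becomes the first line and a
    template line sharing its rhyme foot becomes the second."""
    return [description, pick_supplementary_lyric(scene, rhyme_foot)]
```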
Further, in some embodiments of the present invention, step 1034 of obtaining the rhyming lyrics for the plurality of images from a pre-generated lyric template according to each image's scene and the rhyme foot for that scene includes:
10341. generating image-description lyrics from the description text matching each image's scene;
10342. obtaining supplementary lyrics from a pre-generated lyric template according to each image's scene and the rhyme foot for that scene;
10343. combining the image-description lyrics and the supplementary lyrics to obtain the rhyming lyrics.
Specifically, in the above embodiment of the present invention, the description text matching each image's scene may serve as the image-description lyrics, i.e., lyrics derived from the description text; for example, the description "a bird soars in the blue sky" can be used directly as an image-description lyric. In step 10342, the supplementary lyrics may be obtained alongside the image-description lyrics; the supplementary lyrics come from the lyric template and have the same rhyme foot as the image-description lyrics. Finally, the image-description lyrics and the supplementary lyrics are combined to obtain the rhyming lyrics. In other words, the description text is supplemented with a rhyming line: for the image-description lyric "a bird soars in the blue sky", a supplementary line with the same rhyme foot is found in the lyric template, such as "almost good, almost praise", so that the finally generated rhyming lyrics in this example may read: "a bird soars in the blue sky; almost good, almost praise."
Further, in some embodiments of the present invention, step 10342 of obtaining supplementary lyrics from a pre-generated lyric template according to each image's scene and the rhyme foot for that scene includes:
determining a rhyme foot conforming to a double rhyme from the image-description lyrics;
and obtaining supplementary lyrics from a pre-generated lyric template according to each image's scene and the double-rhyme rhyme foot for that scene.
When the rhyme foot is obtained from the image-description lyrics, a rhyme foot conforming to a double rhyme can also be determined. A double rhyme is a rhyme on the last two characters. The supplementary lyrics are obtained from the lyric template based on the scene and the double-rhyme rhyme foot, and are generated in double-rhyme form, so that the supplementary lyrics share the same two-character rhyme foot as the image-description lyrics.
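A sketch of the double-rhyme check under the same assumptions as the earlier finals sketch (pypinyin plus the FINALS table): two lines form a double rhyme when the finals of their last two characters match pairwise.

```python
from pypinyin import lazy_pinyin

def last_two_finals(text: str) -> tuple[str, ...]:
    """Finals of the last two characters (FINALS from the earlier sketch)."""
    syllables = lazy_pinyin(text)[-2:]
    return tuple(next((f for f in FINALS if s.endswith(f)), "")
                 for s in syllables)

def is_double_rhyme(line_a: str, line_b: str) -> bool:
    """True when both lines share the same two-character rhyme foot."""
    return last_two_finals(line_a) == last_two_finals(line_b)
```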
As can be seen from the above, scene recognition is first performed on each of the plurality of images input at the terminal to generate description text matching each image's scene; the Chinese pinyin and final of the last character are then obtained from the description text matching each image's scene; finally, rhyming lyrics for the plurality of images are generated according to that pinyin and final, the rhyming lyrics for each image sharing the final of the last character of the matching description text. In the embodiment of the invention, image music can be generated simply by providing several images at the terminal: scene recognition is performed on the images, description text suited to each scene is matched automatically, and rhyme design is applied to that text, so the generated rhyming lyrics fit the character of the music. Because the rhyming lyrics are generated from the images input at the terminal, the output image music is closely tied to the image material the user provides, and rhyming lyrics can be generated automatically from the input images.
The music generation method according to the embodiments of the present invention can be applied, in particular, to generating music matching the description text of various images input by a user. Referring to Fig. 2, a method for generating music according to an embodiment of the present invention includes the following steps:
101. Perform scene recognition on each of a plurality of images input at a terminal, and generate description text matching the scene of each image.
In the embodiment of the present invention, a user may input a plurality of images at a terminal to generate image music; image music here refers to music with a rhythm adapted to the plurality of images the user inputs. The images may be pre-stored on the terminal by the user, or captured in real time with the terminal's camera; for example, they may be captured after the terminal enters a photographing mode, or obtained from an album of the terminal. The manner in which the images are input at the terminal is not limited.
102. Obtain, from the description text matching each image's scene, the pinyin and final of the last character of the description text.
103. Generate rhyming lyrics for the plurality of images according to the pinyin and rhyme foot of the last character of the description text, where the rhyming lyrics for each image have the same rhyme foot as the last character of the description text matching that image's scene.
104. Convert the rhyming lyrics for each of the plurality of images into speech.
In the embodiment of the present invention, after the rhyming lyrics for the plurality of images are obtained, they may be converted into speech; specifically, Text-To-Speech (TTS) may be used to convert all the rhyming lyrics obtained in step 103 into speech.
In some embodiments of the present invention, step 104 of converting the rhyming lyrics for the plurality of images into speech includes:
C1. performing text analysis on the rhyming lyrics for the plurality of images to obtain a text analysis result;
C2. extracting linguistic features from the text analysis result;
C3. performing phoneme-level duration prediction and adaptive duration adjustment according to the linguistic features, to obtain prosodic features and part-of-speech features matching the rhyming lyrics;
C4. generating pronunciation with a neural network model based on the linguistic features and the matching prosodic and part-of-speech features, to obtain the speech.
Text analysis first yields the text analysis result, including prosody prediction and part-of-speech prediction, which is converted into an input vector for the neural network model. A duration model can then be used for phoneme-level duration prediction and adaptive duration adjustment: because the rhyming lyrics generated in this application differ from ordinary speech in being rhythmic, an adaptive adjustment is applied to the duration prediction result, ensuring that the original pronunciation of each character does not change on the beat. Finally, pronunciation is generated with a neural network model based on the linguistic features and the matching prosodic and part-of-speech features, yielding the speech.
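The patent does not give a formula for the adaptive adjustment; the sketch below shows one plausible reading, rescaling a line's predicted phoneme durations so the line fills a whole number of beats while preserving each phoneme's relative length (the beat-snapping rule itself is an assumption).

```python
import numpy as np

def snap_durations_to_beats(durations: np.ndarray, bpm: float) -> np.ndarray:
    """Uniformly rescale predicted phoneme durations (in seconds) so the
    line lasts a whole number of beats; relative proportions are kept, so
    each character's pronunciation is stretched to the beat, not altered."""
    seconds_per_beat = 60.0 / bpm
    total = durations.sum()
    n_beats = max(1, round(total / seconds_per_beat))  # nearest beat count
    return durations * (n_beats * seconds_per_beat / total)

# Toy example: five phoneme durations predicted by a duration model.
line = np.array([0.12, 0.25, 0.18, 0.30, 0.22])
print(snap_durations_to_beats(line, bpm=90.0))  # rescaled to exactly 2 beats
```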
105. Synthesize the speech with preset background music to generate image music.
In the embodiment of the present invention, after the rhyming lyrics are converted into speech, the speech carries the content of the rhyming lyrics, and combining the speech with background music produces the final image music. The image music is synthesized from the rhyming lyrics composed for the plurality of images the user input and the background music, so that when the user plays the image music, they hear a piece of music with lyrics and rhythm. For example, after hip-hop rhyming lyrics are composed from the images, they are synthesized with hip-hop background music to obtain a piece of hip-hop music, completing Text To Rap (TTR).
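A sketch of this final mix, assuming the pydub library and illustrative file names; the patent does not specify the audio tooling or the mixing parameters.

```python
from pydub import AudioSegment

vocal = AudioSegment.from_file("rap_vocals.wav")         # TTS output (assumed name)
backing = AudioSegment.from_file("hiphop_beat.mp3") - 6  # duck the beat by 6 dB
track = backing[:len(vocal)].overlay(vocal)              # align and mix
track.export("image_music.mp3", format="mp3")
```

Here the backing track is trimmed to the vocal's length and the rap vocal is overlaid on top; in practice the beat would loop or fade out rather than cut off.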
As can be seen from the above, scene recognition is first performed on each of the plurality of images input at the terminal to generate description text matching each image's scene; keyword-based rhyme matching is then performed on the description text matching each image's scene to generate rhyming lyrics for the plurality of images; the rhyming lyrics are converted into speech; and finally the speech is synthesized with preset background music to generate image music. In the embodiment of the invention, image music can be generated simply by providing several images at the terminal: scene recognition is performed on the images, description text suited to each scene is matched automatically, rhyme design is applied so the generated rhyming lyrics fit the character of the music, the rhyming lyrics are converted into speech, and the speech is finally synthesized with background music into a piece of image music. Because the rhyming lyrics in the image music are generated from the images input at the terminal, the output image music is closely tied to the image material the user provides, and music matching the description text of the scenes can be generated automatically from the input images.
To better understand and implement the above solutions of the embodiments of the present invention, a corresponding application scenario is described below by way of example.
In the embodiment of the invention, songs can be composed through artificial intelligence (AI), a forward-looking attempt that offers reference value for applying AI in larger scenarios in the future. Taking the generation of hip-hop music as an example, TTR (Text To Rap) converts text into rap music: scene recognition is performed on the plurality of input images, a description conforming to each scene is produced, rhyme design is applied to the descriptions based on the recognized image content, the scene descriptions are converted into speech through TTS, and background music with a specific rhythm is then added and seamlessly joined with the spoken text to complete a piece of pleasing music with hip-hop character.
In the embodiment of the invention, scene recognition is performed on the plurality of input images, which are finally combined into an MV (music video) with hip-hop music. A user inputs a plurality of images from a mobile-phone client applet; after the images are uploaded, a deep-learning neural network model identifies each image's scene and automatically matches related description text, rhyme design is then performed based on the recognized image content, and finally the rhymed text is converted into speech through TTS.
as shown in fig. 3, a schematic diagram of a generation process of hip-hop music according to an embodiment of the present invention. The system mainly comprises four parts:
1. and uploading or selecting a plurality of images from the mobile phone client by the user. A user input image is acquired.
2. And identifying image scenes. And carrying out scene recognition on the input multiple images and giving out related descriptors.
3. And (5) performing rhyme design. And performing rhyme design on the related description language.
4. The text is converted to speech. And converting the description language passing through the rhyme into voice.
When a user submits a plurality of images at the mobile-phone client, the input images are recognized: scene recognition is performed, suitable description text is matched automatically, and rhyme design and supplementary lines are produced according to the rhyme of the description text. An artificial-intelligence algorithm produces the description text directly from the input pictures, judging the scene of each, for example "a bird flies in the blue sky" or "a person is on the beach"; the text is then converted into speech through TTS, and subsequent processing generates a piece of hip-hop music.
Each part is now illustrated in turn. Fig. 4-a is a schematic diagram of a user uploading a plurality of images from the mobile-phone client. The user takes several pictures or selects existing pictures on the phone to upload from the client. When the user taps the upload-picture button, two options appear: "take a photo" and "select from the phone album". Multiple pictures can be uploaded at a time.
Next, image scene recognition is performed and text is matched automatically: scene recognition is performed on each picture uploaded by the user, text is matched automatically for each image, and the text for the images is concatenated. To generate a text annotation for an input image, the neural image annotation model is trained to maximize its success probability; the deep-learning neural network model here has the same meaning as the annotation model, and can generate novel image descriptions. For example, it may generate annotations such as: "a man in grey waves a wand while a man in black watches beside him", or: a bus "sits" next to a person.
The following illustrates the text rhyme design method provided by the embodiment of the present invention. It relates to the field of AI image description generation, and in particular to a keyword-based rhyme matching method, which mainly comprises the following steps:
1. and acquiring character information generated by image description to obtain Chinese pinyin and finals of corresponding Chinese characters.
2. All possible vowels are arranged from the pinyin, a plurality of sentences of description characters with different scenes are generated in advance for each vowel, and the lyrics of a second sentence are supplemented by the method through double rhymes in one sentence. The pre-generation mode is as follows: all vowels of the Chinese pinyin can be listed. Each vowel is written with four scenes of landscape, figure, self-timer and food.
3. And finding out a description character final distribution rule through the data sample, finding out several most distributed finals, and increasing the data volume for the several finals.
4. And generating a capture scene through the image description and the character description, and matching rhyme data through the character rhyme mother generated through the character description.
5. And finally, displaying the complete rhyme lyric works.
This technical solution is based on image recognition and image description generation: the images uploaded by the user are converted into text, a second line of lyrics is matched through the final of the last character of the first line generated by the image description together with the image scene, the rhyming lyrics are finally generated, and the AI then sings the song. From the pictures uploaded by the user to a complete work sung by the AI, the whole interactive process greatly enhances interactivity and interest. Many candidate lyrics can be used for the matching; here the final of the last character of the first line is used to match the second line.
First, image description generation is performed: image description information is obtained from the photos uploaded by the user using an AI image description generation technique, yielding one descriptive sentence per picture.
Then the pinyin of the characters is obtained. There are fewer than 8,000 commonly used Chinese characters, so a pinyin table of the common characters is generated in advance, indexed by character, and loaded into memory; when the pinyin of a character is needed, it is fetched through the index in O(1) time.
For example, each character of a description sentence maps to a pinyin syllable with a tone number, such as a1 or o1.
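A minimal sketch of such an index, assuming the pypinyin library as the source of the pre-built table (the patent does not name a library); Style.TONE3 appends tone numbers like those in the example above.

```python
from pypinyin import pinyin, Style

# Stand-in for the pre-built table of the ~8,000 common characters.
COMMON_CHARS = "的一是在不了有和人这"
PINYIN_INDEX = {ch: pinyin(ch, style=Style.TONE3)[0][0] for ch in COMMON_CHARS}

print(PINYIN_INDEX["人"])  # O(1) dict lookup once loaded -> 'ren2'
```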
Then the final is obtained. The finals table lists 35 finals. Take a character such as "change" (bian), whose syllable ends in "ian": a syllable can contain both a compound final and single finals, e.g. "ian" contains the finals "i" and "an", so matching must look at the longer finals first, then the compound finals, and finally the single finals. In this implementation, all finals are placed in an array, sorted by length from longest to shortest, and compared against the syllable string in order; the first match is the final of the character.
The image-description scene is then obtained: the scene is distinguished by matching keywords contained in the description text generated for the image. At present four scenes are mainly distinguished: landscape, people, food, and self-portrait; the figure shows part of the corresponding keywords.
Examples are as follows. When the scene is landscape, there may be several descriptors, such as landscape-sunlight, landscape-sea, landscape-rain, landscape-flower, and landscape-grass. When the scene is people, there may be descriptors such as people-boy and people-girl. When the scene is food, there may be descriptors such as food-gourmet. When the scene is self-portrait, there may be descriptors such as selfie-photo and selfie-avatar.
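A sketch of this keyword-based scene matching; the keyword sets below merely echo the examples above and are not the patent's full keyword lists.

```python
SCENE_KEYWORDS = {
    "landscape": ["sunlight", "sea", "rain", "flower", "grass"],
    "people": ["boy", "girl"],
    "food": ["gourmet"],
    "selfie": ["photo", "avatar"],
}

def classify_scene(description: str) -> str:
    """Match keywords in the description text to distinguish the scene."""
    for scene, keywords in SCENE_KEYWORDS.items():
        if any(word in description for word in keywords):
            return scene
    return "landscape"  # assumed default when no keyword matches
```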
Then the supplementary lyrics are obtained according to the scene and the final. First, corresponding hip-hop lines are generated for the different scenes and different finals, with more lines generated for the high-frequency finals to choose from. A matching line is then selected at random according to the final and the scene.
Examples of template lines are as follows (each is keyed by a final and, where given, a scene; the English is an approximate rendering of the original Chinese templates):
- a, landscape: has almost the same trails
- a, person: is almost big, with almost a brake
- a, food: almost frying, almost spicy with hot fire
- ia, food: shrimp, with almost a fright
- ia, person: almost harmonious, there are almost two of us
- ia, food: a family with more or less food, with more or less sunset
- ua, landscape: wild flowers almost everywhere, beautiful like a picture
- ua, person: the figures are almost similar, like the eight diagrams
- ce: subject to more or less loss, and more or less spurred on
- che, landscape: the river is almost clear
- food: greedy for rare food, with a rare young pigeon
- re, landscape: more or less sunny, with more or less severe heat
- te, person: a coal of almost similar whiteness
- ye: a more or less universal night, a more or less choking sob
- ze: the more and more common the life, the more and more the selection
- he, landscape: a river with more or less gaps
- ke, person: almost severe customers
- ke, food: almost drunk, with almost thirst
The finally generated rhyming lyrics may be as follows:
A group of people walking on a busy street [image description]
Almost busy, almost forgotten [supplementary lyric]
Tall buildings in the city [image description]
Almost a landscape, with almost the same trails [supplementary lyric]
A food photo from a gathering with friends [image description]
Almost face to face, almost at a loss for words [supplementary lyric]
Finally, the text is converted into speech. As shown in Fig. 4-b, text analysis is performed on the description text to provide information for subsequent feature extraction, mainly including pronunciation generation, prosody prediction, and word prediction. After the text analysis result is obtained, its linguistic features are extracted and converted into an input vector for the neural network. Phoneme-level duration prediction is performed with a duration model; predicting phonemes with the duration model yields a better rhythm. Hip-hop differs from ordinary speech in being rhythmic, so an adaptive adjustment is applied to the duration prediction result; the adaptation is performed automatically by a neural network, ensuring that the original pronunciation of each character does not change on the beat. The input for the hip-hop singing is the description text. Acoustic feature prediction comprises prosody prediction and part-of-speech prediction; the hip-hop rhythm input is obtained through neural network prediction. The background music may be background music with a faster tempo. The hip-hop lyrics are obtained by performing scene recognition on the images to obtain description text and applying rhyme design.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of action combinations, but those skilled in the art will appreciate that the present invention is not limited by the order of actions described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the invention.
To facilitate a better implementation of the above-described aspects of embodiments of the present invention, the following also provides related apparatus for implementing the above-described aspects.
Referring to Fig. 5-a, an apparatus 500 for generating music according to an embodiment of the present invention includes a scene recognition module 501, a rhyme matching module 502, a speech generation module 503, and a music generation module 504, wherein:
the scene recognition module 501 is configured to perform scene recognition on a plurality of images input at a terminal and generate description text matching the scene of each image;
the rhyme matching module 502 is configured to perform keyword-based rhyme matching on the description text matching each image's scene and generate rhyming lyrics for the plurality of images;
the speech generation module 503 is configured to convert the rhyming lyrics for the plurality of images into speech;
the music generation module 504 is configured to synthesize the speech with preset background music to generate image music.
In some embodiments of the present invention, referring to Fig. 5-b, the scene recognition module 501 includes:
a scene determination module 5011, configured to perform scene recognition on the plurality of images using the deep-learning neural network model to obtain recognized image features, and determine the scene of each image from the image features;
an image description module 5012, configured to perform image description generation from the recognized image features and each image's scene, to obtain description text matching each image's scene.
In some embodiments of the present invention, referring to Fig. 5-c, the rhyme matching module 502 includes:
a final acquisition module 5021, configured to obtain, from the description text matching each image's scene, the pinyin and final of the last character of the description text;
a lyric generation module 5022, configured to generate rhyming lyrics for the plurality of images according to the pinyin and rhyme foot of the last character, where the rhyming lyrics for each image have the same rhyme foot as the last character of the description text matching that image's scene.
In some embodiments of the present invention, referring to Fig. 5-d, the lyric generation module 5022 includes:
a final listing module 50221, configured to list all finals from the pinyin of the last character of the description text;
a rule determination module 50222, configured to determine a final distribution rule from all the listed finals;
a rhyme foot determination module 50223, configured to determine, from the finals conforming to the final distribution rule, the rhyme foot of the last character of the description text;
a lyric acquisition module 50224, configured to obtain the rhyming lyrics for the plurality of images from a pre-generated lyric template according to each image's scene and the rhyme foot for that scene, where lyric text for a plurality of scenes and a plurality of rhyme feet is configured in the lyric template in advance.
In some embodiments of the present invention, referring to Fig. 5-e, the lyric acquisition module 50224 includes:
a description lyric generation module 502241, configured to generate image-description lyrics from the description text matching each image's scene;
a supplementary lyric generation module 502242, configured to obtain supplementary lyrics from a pre-generated lyric template according to each image's scene and the rhyme foot for that scene;
a lyric synthesis module 502243, configured to combine the image-description lyrics with the supplementary lyrics to obtain the rhyming lyrics.
In some embodiments of the present invention, the plurality of images are captured after the terminal enters a photographing mode; or the plurality of images are obtained from an album of the terminal.
In some embodiments of the present invention, referring to Fig. 5-f, the speech generation module 503 includes:
a text analysis module 5031, configured to perform text analysis on the rhyming lyrics for the plurality of images to obtain a text analysis result;
a linguistic feature extraction module 5032, configured to extract linguistic features from the text analysis result;
a prosodic and part-of-speech feature acquisition module 5033, configured to perform phoneme-level duration prediction and adaptive duration adjustment according to the linguistic features, to obtain prosodic features and part-of-speech features matching the rhyming lyrics;
a pronunciation generation module 5034, configured to generate pronunciation with a neural network model based on the linguistic features and the matching prosodic and part-of-speech features, to obtain the speech.
As can be seen from the above, scene recognition is first performed on each of the plurality of images input at the terminal to generate description text matching each image's scene; keyword-based rhyme matching is then performed on the description text matching each image's scene to generate rhyming lyrics for the plurality of images; the rhyming lyrics are converted into speech; and finally the speech is synthesized with preset background music to generate image music. In the embodiment of the invention, image music can be generated simply by providing several images at the terminal: scene recognition is performed on the images, description text suited to each scene is matched automatically, rhyme design is applied so the generated rhyming lyrics fit the character of the music, the rhyming lyrics are converted into speech, and the speech is finally synthesized with background music into a piece of image music. Because the rhyming lyrics in the image music are generated from the images input at the terminal, the output image music is closely tied to the image material the user provides, and music matching the description text of the scenes can be generated automatically from the input images.
To facilitate a better implementation of the above-described aspects of embodiments of the present invention, the following also provides related apparatus for implementing the above-described aspects.
Referring to Fig. 6-a, an apparatus 600 for generating rhyming lyrics according to an embodiment of the present invention may include a scene recognition module 601, a final acquisition module 602, and a lyric generation module 603, wherein:
the scene recognition module 601 is configured to perform scene recognition on a plurality of images input at a terminal and generate description text matching the scene of each image;
the final acquisition module 602 is configured to obtain, from the description text matching each image's scene, the pinyin and final of the last character of the description text;
the lyric generation module 603 is configured to generate rhyming lyrics for the plurality of images according to the pinyin and rhyme foot of the last character, where the rhyming lyrics for each image have the same rhyme foot as the last character of the description text matching that image's scene.
In some embodiments of the present invention, referring to fig. 6-b, the scene recognition module 601 includes:
a scene determining module 6011, configured to perform scene recognition on the plurality of images according to a deep learning neural network model to obtain recognized image features, and to determine the scenes corresponding to the plurality of images according to the image features;
an image description module 6012, configured to perform image description generation according to the recognized image features and the scenes corresponding to the plurality of images, to obtain description text respectively matching the scenes corresponding to the plurality of images.
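As a non-limiting sketch of modules 6011 and 6012, the following uses an off-the-shelf torchvision classifier as a stand-in; the patent does not name a specific network, and the ImageNet label set merely approximates scene categories (a production system might use a scene dataset such as Places365):

```python
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()
preprocess = weights.transforms()                # resize, crop, normalize

def recognize_scene(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)                        # "recognized image features"
    return weights.meta["categories"][logits.argmax(1).item()]

# Toy description-generation step built on the recognized label:
# description = f"a photo of {recognize_scene('beach.jpg')}"
```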
In some embodiments of the present invention, referring to fig. 6-c, the lyric generating module 603 includes:
a final arranging module 6031, configured to list all the finals from the Chinese pinyin corresponding to the last word in the description text;
a rule determining module 6032, configured to determine a final distribution rule according to all the listed finals;
a rhyme foot determining module 6033, configured to determine the rhyme foot corresponding to the last word in the description text from the finals conforming to the final distribution rule;
a lyric obtaining module 6034, configured to obtain the rhyme lyrics corresponding to the plurality of images from a pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme foot in each scene, where lyric text corresponding to multiple scenes and multiple rhyme feet is pre-configured in the lyric template.
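The following sketch illustrates one plausible reading of modules 6031 to 6034. The "most frequent final" heuristic standing in for the distribution rule and the two-entry lyric template are both assumptions introduced for illustration:

```python
from collections import Counter

def pick_rhyme_foot(finals):
    return Counter(finals).most_common(1)[0][0]   # dominant final = rhyme foot

LYRIC_TEMPLATE = {                                # (scene, rhyme foot) -> lines
    ("sunset", "ia"): ["晚风吹过山崖", "夕阳染红万家"],
    ("beach", "ai"): ["浪花退回大海", "脚印还在等待"],
}

finals = ["ia", "ai", "ia"]                       # finals of each last word
foot = pick_rhyme_foot(finals)                    # -> 'ia'
print(LYRIC_TEMPLATE.get(("sunset", foot), []))
```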
In some embodiments of the present invention, referring to fig. 6-d, the lyric obtaining module 6034 includes:
a description lyric generating module 60341, configured to generate image description lyrics according to the description text matching the scene corresponding to each image;
a supplementary lyric generating module 60342, configured to acquire supplementary lyrics from a pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme feet in each scene;
and a lyric synthesizing module 60343, configured to synthesize the image description lyrics and the supplementary lyrics together to obtain the rhyme lyrics.
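A minimal sketch of modules 60341 to 60343, assuming that "synthesizing together" means alternating each image-description line with a supplementary template line:

```python
def synthesize_lyrics(description_lines, supplementary_lines):
    merged = []
    for desc, supp in zip(description_lines, supplementary_lines):
        merged.extend([desc, supp])               # description, then supplement
    return merged

print(synthesize_lyrics(["海边的晚霞"], ["夕阳染红万家"]))
```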
In some embodiments of the present invention, the plurality of images are acquired after the terminal enters a photographing mode; alternatively,
the plurality of images are acquired from an album of the terminal.
In some embodiments of the invention, the supplementary lyric generating module 60342 is specifically configured to determine a rhyme foot conforming to a double rhyme from the image description lyrics, and to acquire supplementary lyrics from a pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme feet conforming to the double rhyme in each scene.
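As a sketch of the double-rhyme check, the following again assumes pypinyin and assumes that a double rhyme requires the finals of the last two syllables to match:

```python
from pypinyin import lazy_pinyin, Style

def double_rhyme_foot(line):
    # finals of the last two characters of a lyric line
    return tuple(lazy_pinyin(line, style=Style.FINALS)[-2:])

a, b = "夕阳染红万家", "晚风吹过山崖"
print(double_rhyme_foot(a))                            # ('an', 'ia')
print(double_rhyme_foot(a) == double_rhyme_foot(b))    # True: a double rhyme
```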
In the embodiment of the present invention, scene recognition is first performed on the plurality of images input in the terminal to generate description text respectively matching the scenes corresponding to the plurality of images; the Chinese pinyin and the final corresponding to the last word in the description text are then obtained from the description text matching the scene corresponding to each image; finally, rhyme lyrics corresponding to the plurality of images are generated according to the Chinese pinyin and the rhyme foot corresponding to the last word in the description text, the rhyme lyrics corresponding to each image having the same rhyme foot as the last word of the description text matching the scene corresponding to that image. Image music can thus be generated simply by providing a plurality of images through the terminal: scene recognition is performed on the images, description text adapted to each scene is matched automatically, and a rhyme design is applied to the scene description text so that the generated rhyme lyrics conform to musical characteristics. Because the rhyme lyrics are generated from the images input at the terminal, the output image music is closely associated with the image material provided by the user, and rhyme lyrics can be generated automatically from the input images.
For convenience of description, fig. 7 shows only the portion related to the embodiment of the present invention; for details not disclosed, please refer to the method portion of the embodiments. The terminal may be any terminal device, including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales) terminal, a vehicle-mounted computer, and the like. The following takes a mobile phone as an example:
fig. 7 is a block diagram illustrating a partial structure of a mobile phone related to a terminal provided in an embodiment of the present invention. Referring to fig. 7, the handset includes: radio Frequency (RF) circuit 1010, memory 1020, input unit 1030, display unit 1040, sensor 1050, audio circuit 1060, wireless fidelity (WiFi) module 1070, processor 1080, and power source 1090. Those skilled in the art will appreciate that the handset configuration shown in fig. 7 is not intended to be limiting and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 7:
The RF circuit 1010 may be used for receiving and transmitting signals during information transmission and reception or during a call; in particular, after downlink information from a base station is received, it is forwarded to the processor 1080 for processing, and uplink data is transmitted to the base station. In general, the RF circuit 1010 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1010 may communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1020 may be used to store software programs and modules, and the processor 1080 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data (such as audio data and a phonebook) created according to the use of the mobile phone. Further, the memory 1020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 1030 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1030 may include a touch panel 1031 and other input devices 1032. The touch panel 1031, also referred to as a touch screen, may collect touch operations by the user on or near it (for example, operations performed on or near the touch panel 1031 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection devices according to a preset program. Optionally, the touch panel 1031 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal caused by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 1080, and can also receive and execute commands sent by the processor 1080. The touch panel 1031 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 1031, the input unit 1030 may include other input devices 1032, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 1040 may be used to display information input by the user or information provided to the user, as well as the various menus of the mobile phone. The display unit 1040 may include a display panel 1041, which may optionally be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. Further, the touch panel 1031 can cover the display panel 1041; when the touch panel 1031 detects a touch operation on or near it, the touch operation is transmitted to the processor 1080 to determine the type of the touch event, and the processor 1080 then provides a corresponding visual output on the display panel 1041 according to the type of the touch event. Although in fig. 7 the touch panel 1031 and the display panel 1041 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 1031 and the display panel 1041 may be integrated to implement those functions.
The handset may also include at least one sensor 1050, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1041 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1041 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing gestures of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometers and taps), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The audio circuit 1060, speaker 1061, and microphone 1062 may provide an audio interface between the user and the mobile phone. The audio circuit 1060 can transmit the electrical signal converted from received audio data to the speaker 1061, where it is converted into a sound signal and output; conversely, the microphone 1062 converts collected sound signals into electrical signals, which are received by the audio circuit 1060 and converted into audio data; the audio data is then output to the processor 1080 for processing and subsequently sent, for example, to another mobile phone via the RF circuit 1010, or output to the memory 1020 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1070, the mobile phone can help the user receive and send e-mails, browse web pages, access streaming media, and the like, providing wireless broadband Internet access. Although fig. 7 shows the WiFi module 1070, it is understood that it is not an essential component of the mobile phone and may be omitted as needed within a scope that does not change the essence of the invention.
The processor 1080 is the control center of the mobile phone; it connects the various parts of the whole mobile phone by using various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 1020 and calling the data stored in the memory 1020, thereby monitoring the mobile phone as a whole. Optionally, the processor 1080 may include one or more processing units; preferably, the processor 1080 may integrate an application processor, which mainly handles the operating system, user interfaces, and application programs, and a modem processor, which mainly handles wireless communication. It is to be appreciated that the modem processor may also not be integrated into the processor 1080.
The mobile phone also includes a power supply 1090 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically coupled to the processor 1080 via a power management system, so that charging, discharging, and power-consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In the embodiment of the present invention, the processor 1080 included in the terminal also has the function of controlling execution of the above method flow performed by the terminal.
It should be noted that the above-described apparatus embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationship between modules indicates that they have a communication connection, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the above without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus the necessary general-purpose hardware, or by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. In general, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary: analog circuits, digital circuits, dedicated circuits, and so on. For the present invention, however, implementation by a software program is in most cases the preferable embodiment. Based on such understanding, the technical solutions of the present invention, or the portions thereof contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc of a computer, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
In summary, the above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: the technical solutions described in the above embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A method for generating rhyme lyrics, characterized in that the method comprises:
performing scene recognition respectively on a plurality of images input in a terminal to generate description text respectively matching the scenes corresponding to the plurality of images;
obtaining, from the description text matching the scene corresponding to each image, the Chinese pinyin and the final corresponding to the last word in the description text;
and generating rhyme lyrics corresponding to the plurality of images according to the Chinese pinyin and the rhyme foot corresponding to the last word in the description text, wherein the rhyme lyrics corresponding to each image have the same rhyme foot as the last word of the description text matching the scene corresponding to that image.
2. The method according to claim 1, wherein the performing scene recognition on the plurality of images input in the terminal and generating description text respectively matching the scenes corresponding to the plurality of images comprises:
performing scene recognition on the plurality of images according to a deep learning neural network model to obtain recognized image features, and determining the scenes corresponding to the plurality of images according to the image features;
and performing image description generation according to the recognized image features and the scenes corresponding to the plurality of images, to obtain description text respectively matching the scenes corresponding to the plurality of images.
3. The method of claim 1, wherein the generating rhyme lyrics corresponding to the plurality of images according to the Chinese pinyin and the rhyme foot corresponding to the last word in the description text comprises:
listing all the finals from the Chinese pinyin corresponding to the last word in the description text;
determining a final distribution rule according to all the listed finals;
determining the rhyme foot corresponding to the last word in the description text from the finals conforming to the final distribution rule;
and obtaining the rhyme lyrics corresponding to the plurality of images from a pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme foot in each scene, wherein lyric text corresponding to multiple scenes and multiple rhyme feet is pre-configured in the lyric template.
4. The method of claim 3, wherein the obtaining the rhyme lyrics corresponding to the plurality of images from a pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme feet in each scene comprises:
generating image description lyrics according to the description text matching the scene corresponding to each image;
acquiring supplementary lyrics from a pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme feet in each scene;
and synthesizing the image description lyrics and the supplementary lyrics together to obtain the rhyme lyrics.
5. The method of claim 4, wherein the acquiring supplementary lyrics from the pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme feet in each scene comprises:
determining a rhyme foot conforming to a double rhyme according to the image description lyrics;
and acquiring supplementary lyrics from the pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme feet conforming to the double rhyme in each scene.
6. The method according to claim 1, wherein the plurality of images are acquired after the terminal enters a photographing mode; alternatively,
the plurality of images are acquired from an album of the terminal.
7. An apparatus for generating rhyme lyrics, the apparatus comprising:
the scene recognition module is used for respectively performing scene recognition on a plurality of images input in the terminal, and generating description text respectively matching the scenes corresponding to the plurality of images;
the vowel obtaining module is used for obtaining, from the description text matching the scene corresponding to each image, the Chinese pinyin and the final corresponding to the last word in the description text;
and the lyric generating module is used for generating rhyme lyrics corresponding to the plurality of images according to the Chinese pinyin and the rhyme foot corresponding to the last word in the description text, wherein the rhyme lyrics corresponding to each image have the same rhyme foot as the last word of the description text matching the scene corresponding to that image.
8. The apparatus of claim 7, wherein the scene recognition module comprises:
the scene determining module is used for performing scene recognition on the plurality of images according to a deep learning neural network model to obtain recognized image features, and determining the scenes corresponding to the plurality of images according to the image features;
and the image description module is used for performing image description generation according to the recognized image features and the scenes corresponding to the plurality of images, to obtain description text respectively matching the scenes corresponding to the plurality of images.
9. The apparatus of claim 7, wherein the lyric generation module comprises:
the final arranging module is used for listing all the finals from the Chinese pinyin corresponding to the last word in the description text;
the rule determining module is used for determining a final distribution rule according to all the listed finals;
the rhyme foot determining module is used for determining the rhyme foot corresponding to the last word in the description text from the finals conforming to the final distribution rule;
and the lyric obtaining module is used for obtaining the rhyme lyrics corresponding to the plurality of images from a pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme foot in each scene, wherein lyric text corresponding to multiple scenes and multiple rhyme feet is pre-configured in the lyric template.
10. The apparatus of claim 9, wherein the lyric obtaining module comprises:
the description lyric generating module is used for generating image description lyrics according to the description text matching the scene corresponding to each image;
the supplementary lyric generating module is used for acquiring supplementary lyrics from a pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme feet in each scene;
and the lyric synthesis module is used for synthesizing the image description lyrics and the supplementary lyrics together to obtain the rhyme lyrics.
11. The apparatus of claim 10, wherein the supplementary lyric generating module is specifically configured to determine a rhyme foot conforming to a double rhyme from the image description lyrics, and to acquire supplementary lyrics from the pre-generated lyric template according to the scenes corresponding to the plurality of images and the corresponding rhyme feet conforming to the double rhyme in each scene.
12. The apparatus according to claim 7, wherein the plurality of images are acquired after the terminal enters a photographing mode; alternatively,
the plurality of images are acquired from an album of the terminal.
13. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1-6.
CN201710939775.9A 2017-09-30 2017-09-30 Method and device for generating rhyme-rhyme lyrics Active CN110019919B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710939775.9A CN110019919B (en) 2017-09-30 2017-09-30 Method and device for generating rhyme-rhyme lyrics

Publications (2)

Publication Number Publication Date
CN110019919A CN110019919A (en) 2019-07-16
CN110019919B true CN110019919B (en) 2022-07-26

Family

ID=67186509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710939775.9A Active CN110019919B (en) 2017-09-30 2017-09-30 Method and device for generating rhyme-rhyme lyrics

Country Status (1)

Country Link
CN (1) CN110019919B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177317A (en) * 2019-12-20 2020-05-19 吕梁学院 Literature theory rapid retrieval query system and method
CN111241829B (en) * 2020-01-14 2023-05-05 成都潜在人工智能科技有限公司 Intelligent lyric modification method and auxiliary system based on neural network
CN112035699A (en) * 2020-08-27 2020-12-04 北京字节跳动网络技术有限公司 Music synthesis method, device, equipment and computer readable medium
CN116011431A (en) * 2023-03-22 2023-04-25 暗链科技(深圳)有限公司 Method for generating mnemonic words and electronic equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004102302A (en) * 1996-05-29 2004-04-02 Yamaha Corp Apparatus and method for assisting lyric writing and storage medium
CN101089793A (en) * 2006-06-15 2007-12-19 李树喜 Poems rhyme base vowel indexing input method
CN102385596A (en) * 2010-09-03 2012-03-21 腾讯科技(深圳)有限公司 Verse searching method and device
JP2014170146A (en) * 2013-03-05 2014-09-18 Univ Of Tokyo Method and device for automatically composing chorus from japanese lyrics
CN106547789A (en) * 2015-09-22 2017-03-29 阿里巴巴集团控股有限公司 A kind of lyrics generation method and device
CN105955938A (en) * 2016-04-25 2016-09-21 广州酷狗计算机科技有限公司 Method and device for editing lyrics
CN107122492A (en) * 2017-05-19 2017-09-01 北京金山安全软件有限公司 Lyric generation method and device based on picture content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GAVIN PAUL; "Lyric App Nerds: Meet 'Giorgio Cam', Turning Pictures into Song"; http://www.songlyrics.com/news/lyric-app-nerds-meet-giorgio-cam-turning-pictures-into-song/; 2016-11-17; full text of web page *

Also Published As

Publication number Publication date
CN110019919A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN109599079B (en) Music generation method and device
CN110019919B (en) Method and device for generating rhyme-rhyme lyrics
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
CN106652996B (en) Prompt tone generation method and device and mobile terminal
CN110381388A (en) A kind of method for generating captions and device based on artificial intelligence
KR101189053B1 (en) Method For Video Call Based on an Avatar And System, Apparatus thereof
CN109783798A (en) Method, apparatus, terminal and the storage medium of text information addition picture
CN110740262A (en) Background music adding method and device and electronic equipment
WO2021008538A1 (en) Voice interaction method and related device
US11470240B2 (en) Method and terminal device for matching photgraphed objects and preset text imformation
CN109391842B (en) Dubbing method and mobile terminal
CN109815363A (en) Generation method, device, terminal and the storage medium of lyrics content
CN110808019A (en) Song generation method and electronic equipment
CN109302528A (en) A kind of photographic method, mobile terminal and computer readable storage medium
CN110033502A (en) Video creating method, device, storage medium and electronic equipment
CN111915744A (en) Interaction method, terminal and storage medium for augmented reality image
CN116229311A (en) Video processing method, device and storage medium
US20210224310A1 (en) Electronic device and story generation method thereof
CN109684501A (en) Lyrics information generation method and its device
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
CN112489619A (en) Voice processing method, terminal device and storage medium
CN116708920B (en) Video processing method, device and storage medium applied to virtual image synthesis
CN116708899B (en) Video processing method, device and storage medium applied to virtual image synthesis
US20230418553A1 (en) Electronic apparatus and control method therefor
CN116188647A (en) Control method, intelligent terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant