CN115547278A - Rap generation - Google Patents

Rap generation

Info

Publication number
CN115547278A
CN115547278A
Authority
CN
China
Prior art keywords
sequence
rap
representation
beat
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110732470.7A
Other languages
Chinese (zh)
Inventor
Xu Tan (谭旭)
Tao Qin (秦涛)
Tie-Yan Liu (刘铁岩)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC filed Critical Microsoft Technology Licensing LLC
Priority to CN202110732470.7A priority Critical patent/CN115547278A/en
Publication of CN115547278A publication Critical patent/CN115547278A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/36 Accompaniment arrangements
    • G10H1/40 Rhythm
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101 Music Composition or musical creation; Tools or processes therefor
    • G10H2210/111 Automatic composing, i.e. using predefined musical rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/341 Rhythm pattern selection, synthesis or composition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311 Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

According to an implementation of the present disclosure, a scheme for rap generation is presented. In this scheme, an input sequence corresponding to a first part of a rap is obtained. In the input sequence, the words in each sentence of the first part are arranged in reverse order, and the beat marker of each beat of the first part is adjacent to the word corresponding to that beat. A text representation sequence and a rhyme representation sequence corresponding to the input sequence are determined. Based on the text representation sequence and the rhyme representation sequence, a second part of the rap is generated according to a rap generation model. In this way, a rap having both rhyme and rhythm may be generated.

Description

Rap generation
Background
Rap is a form of music that originated in the 1970s and has evolved into one of the world's mainstream music genres. In general, rap lyrics need to be semantically meaningful and fashionable to convey an interesting story or express emotion. Unlike natural language or other artistic genres (e.g., lyrics or poems), rap has distinctive features. First, rap usually contains complex rhyme patterns across several consecutive sentences; second, because rap lyrics are typically rapped over rhythmic accompaniment, the lyrics need to be aligned with the beats.
Disclosure of Invention
According to an implementation of the present disclosure, a scheme for rap generation is presented. In this scheme, an input sequence corresponding to a first part of a rap is obtained. In the input sequence, the words in each sentence of the first part are arranged in reverse order, and the beat marker of each beat of the first part is adjacent to the word corresponding to that beat. A text representation sequence and a rhyme representation sequence corresponding to the input sequence are determined. Based on the text representation sequence and the rhyme representation sequence, a second part of the rap is generated according to a rap generation model. In this way, a rap having both rhyme and rhythm can be generated.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
FIG. 1 illustrates a block diagram of a computing device capable of implementing implementations of the present disclosure;
FIG. 2 illustrates an architecture diagram of a rap generation system according to some implementations of the present disclosure;
FIG. 3 illustrates a schematic diagram of lyrics arranged in forward order and in reverse order, in accordance with some implementations of the present disclosure;
FIG. 4 illustrates a schematic diagram of a process for constructing a training data set, in accordance with some implementations of the present disclosure;
FIG. 5 illustrates an example of training rap in a training dataset according to some implementations of the present disclosure;
FIG. 6 illustrates a flow diagram of a method of applying a rap generation model in accordance with some implementations of the present disclosure; and
FIG. 7 illustrates a flow diagram of a method of training a rap generation model according to some implementations of the present disclosure.
In the drawings, the same or similar reference characters are used to designate the same or similar elements.
Detailed Description
The present disclosure will now be discussed with reference to several example implementations. It should be understood that these implementations are discussed only to enable those of ordinary skill in the art to better understand and thus implement the present disclosure, and are not intended to imply any limitation as to the scope of the present disclosure.
As used herein, the term "include" and its variants are to be read as open-ended terms meaning "including, but not limited to". The term "based on" is to be read as "based, at least in part, on". The terms "one implementation" and "an implementation" are to be read as "at least one implementation". The term "another implementation" is to be read as "at least one other implementation". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
As used herein, a "neural network" is capable of processing an input and providing a corresponding output, which generally includes an input layer and an output layer and one or more hidden layers between the input layer and the output layer. Neural networks used in deep learning applications typically include many hidden layers, extending the depth of the network. The layers of the neural network are connected in sequence such that the output of a previous layer is provided as the input of a subsequent layer, wherein the input layer receives the input of the neural network and the output of the output layer is the final output of the neural network. Each layer of the neural network includes one or more nodes (also referred to as processing nodes or neurons), each node processing an input from a previous layer. The terms "neural network", "network", and "neural network model" are used interchangeably herein.
As used herein, the expressions "word corresponding to a beat", "word aligned with a beat", "beat aligned with a word", "aligned beat", and the like mean that, when the rap is sung, the word is recited or sung at the same time as the beat.
As used herein, the terms "rap" and "rap song" are used interchangeably. As mentioned above, rap has distinct features compared to natural language and other artistic genres, such as complex rhyme patterns and beats that need to be aligned with the lyrics. In view of this, how to automatically generate rap with good rhyme and rhythm is a challenging problem.
Some schemes have been proposed in connection with rap generation. These schemes mainly focus on lyric generation, and some of them propose rhyme modeling strategies. For example, in one scheme, an end marker is added directly at the end of each verse line, in the hope that the model learns the rhyme pattern implicitly. In another scheme, a two-step strategy is applied, in which rap lyrics are generated first and a rhyme marker is then added at the end of the generated lyrics. However, these schemes do not guarantee the rhyming of every sentence of the lyrics, and only concern the rhyme of the last word.
While many schemes propose rhyme modeling for other artistic genres (e.g., poetry), such schemes are not suitable for rap generation due to the complex rhyme patterns of rap songs. For example, poetry requires only the final word in each sentence to rhyme, while rap requires rhyming across multiple consecutive words at the end of each sentence. Furthermore, the conventional schemes described above do not take into account rhythm modeling (i.e., the beats in a rap song), and thus cannot generate beats aligned with the lyrics. However, lyric generation without beats cannot be considered complete rap generation.
In accordance with implementations of the present disclosure, a solution for generating rap songs is provided that addresses one or more of the above-mentioned problems, as well as other potential problems. In this solution, an input sequence corresponding to a first portion of a rap song is obtained. The words in each sentence of the input sequence are arranged in reverse order. Beat markers for the respective beats of the first portion of the rap song are also included in the input sequence, with each beat marker adjacent to the word corresponding to its beat. A text representation sequence and a rhyme representation sequence corresponding to the input sequence are determined. Based on the text representation sequence and the rhyme representation sequence, a second portion of the rap song is generated according to a rap generation model.
In this way, the remaining portion of a rap song can be generated based on the textual information and the tempo information of a given portion of the rap song. A rap song generated in this manner takes both the text information and the tempo information into account, and thus can rhyme well and have good rhythm. The rap generation model can therefore generate rap songs with both rhythm and rhyme.
Various example implementations of this approach are described in further detail below in conjunction with the figures.
Example Environment
FIG. 1 illustrates a block diagram of a computing device 100 capable of implementing multiple implementations of the present disclosure. It should be understood that the computing device 100 illustrated in FIG. 1 is merely exemplary and should not be construed as limiting in any way the functionality or scope of the implementations described in this disclosure. As shown in FIG. 1, the computing device 100 takes the form of a general-purpose computing device. Components of the computing device 100 may include, but are not limited to, one or more processors or processing units 110, a memory 120, a storage device 130, one or more communication units 140, one or more input devices 150, and one or more output devices 160.
In some implementations, the computing device 100 may be implemented as various user terminals or service terminals with computing capabilities. The service terminals may be servers, mainframe computing devices, and the like provided by various service providers. The user terminals may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile handset, station, unit, device, multimedia computer, multimedia tablet, Internet node, communicator, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, Personal Communication System (PCS) device, personal navigation device, Personal Digital Assistant (PDA), audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, game device, or any combination thereof, including the accessories and peripherals of these devices, or any combination thereof. It is also contemplated that the computing device 100 can support any type of user interface (such as "wearable" circuitry, etc.).
The processing unit 110 may be a real or virtual processor and can perform various processes according to programs stored in the memory 120. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capability of the computing device 100. The processing unit 110 may also be referred to as a central processing unit (CPU), microprocessor, controller, or microcontroller.
Computing device 100 typically includes a number of computer storage media. Such media may be any available media accessible by the computing device 100, including, but not limited to, volatile and non-volatile media, and removable and non-removable media. The memory 120 may be volatile memory (e.g., registers, cache, Random Access Memory (RAM)), non-volatile memory (e.g., Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory), or some combination thereof. The memory 120 may include a rap generation module 122 configured to perform the functions of the various implementations described herein. The rap generation module 122 may be accessed and executed by the processing unit 110 to implement the corresponding functionality.
Storage device 130 may be a removable or non-removable medium and may include a machine-readable medium that can be used to store information and/or data and that can be accessed within computing device 100. The computing device 100 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in FIG. 1, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces.
The communication unit 140 enables communication with another computing device over a communication medium. Additionally, the functionality of the components of computing device 100 may be implemented in a single computing cluster or multiple computing machines, which are capable of communicating over a communications connection. Thus, the computing device 100 may operate in a networked environment using logical connections to one or more other servers, personal Computers (PCs), or another general network node.
The input device 150 may be one or more of a variety of input devices such as a mouse, keyboard, trackball, voice input device, and the like. Output device 160 may be one or more output devices such as a display, speakers, printer, or the like. Computing device 100 may also communicate with one or more external devices (not shown), such as storage devices, display devices, etc., communicating with one or more devices that enable a user to interact with computing device 100, or communicating with any devices (e.g., network cards, modems, etc.) that enable computing device 100 to communicate with one or more other computing devices, as desired, via communication unit 140. Such communication may be performed via input/output (I/O) interfaces (not shown).
In some implementations, some or all of the various components of computing device 100 may be provided in the form of a cloud computing architecture, in addition to being integrated on a single device. In a cloud computing architecture, these components may be remotely located and may work together to implement the functionality described in this disclosure. In some implementations, cloud computing provides computing, software, data access, and storage services that do not require end users to know the physical location or configuration of the systems or hardware providing these services. In various implementations, cloud computing provides services over a wide area network (such as the internet) using appropriate protocols. For example, cloud computing providers provide applications over a wide area network, and they may be accessed through a web browser or any other computing component. The software or components of the cloud computing architecture and corresponding data may be stored on a server at a remote location. Computing resources in a cloud computing environment may be consolidated at a remote data center location or they may be dispersed. Cloud computing infrastructures can provide services through shared data centers, even though they appear as a single point of access to users. Accordingly, the components and functionality described herein may be provided from a service provider at a remote location using a cloud computing architecture. Alternatively, they may be provided from a conventional server, or they may be installed directly or otherwise on the client device.
Computing device 100 may be used to implement rap song generation in various implementations of the present disclosure. As shown in FIG. 1, the rap generation module 122 is deployed with or otherwise utilizes a rap generation model 180. The rap generation model 180 is configured to generate a rap song that includes lyrics and beats aligned with words in the lyrics. The rap generation model 180 may be implemented using any suitable type of neural network. For example, a Transformer may be used to implement the rap generation model 180. The Transformer may have any suitable number of attention heads and layers.
Computing device 100 may receive an input 170 through the input device 150. In some implementations, the input 170 may be an indication or command to begin generating the rap 190. In some implementations, the input 170 may include a portion of the rap 190 to be generated. For example, the input 170 may include one or more sentences of the lyrics of the rap 190 and the beat markers of the beats aligned with the words (also referred to as "beat information"). As another example, the input 170 may include one or more words of the lyrics of the rap 190 without including any beat markers.
In response to input 170, rap generation module 122 begins generating rap 190 using rap generation model 180. The generation of rap 190 will be described below with reference to fig. 2. The rap generating model 180 is trained using a training data set prior to being utilized by the rap generating module 122. Training of the rap generating model 180 may be implemented at the computing device 100 or other computing device.
System architecture
Fig. 2 illustrates an architecture diagram of a rap generation system 200 according to some implementations of the present disclosure. The rap generation system 200 takes an input sequence 210 corresponding to a first portion of a rap song and generates a second portion of the rap song according to the rap generation model 180. For example, fig. 2 shows an output sequence 250 that includes an element 251, element 251 being one element of the second portion of the rap song. Some other intermediate sequences are also shown in fig. 2 in addition to the input sequence 210 and the output sequence 250, which sequences will be described in detail below.
Examples of input sequences
As shown in FIG. 2, the element "[START]" in the input sequence 210 is a start marker that indicates the start of the input sequence 210, e.g., the start of an entire rap song. Each element "[BEAT]" in the input sequence 210 is a beat marker for a beat related to the rhythm. A beat marker is adjacent to the word corresponding to its beat. For example, as shown in FIG. 2, the word to the left of a "[BEAT]" marker corresponds to that beat. That is, when the beat corresponding to the beat marker is struck, the word to the left of the beat marker is recited or sung. Taking the input sequence 210 as an example, the word to the left of the "[BEAT]" marker at the 4th position of the input sequence 210 will be recited or sung when the corresponding beat is struck. As another example, the word to the left of the "[BEAT]" marker at the 7th position of the input sequence 210 will be recited or sung when the corresponding beat is struck.
It should be understood that the positions of the "[BEAT]" markers shown in FIG. 2 are merely exemplary. In some implementations, the input sequence 210 may be determined in such a way that the word to the right of a beat marker corresponds to the beat. In such an implementation, the word to the right of the beat marker will be recited or sung when the corresponding beat is struck.
The element "[ SEP ]" in the input sequence 210 is a sentence separation marker, which represents the separation between different sentences of the rap song. The input sequence 210 comprises two statement separation flags [ SEP ], which means that the first part comprises two statements.
The other elements in the input sequence 210, such as the word elements shown in FIG. 2 (e.g., the element "look"), are the lyrics of the first part. The first part of the rap song may be several sentences of a certain rap song. The first part includes not only the information of each word of the lyrics of the first part but also the beat information of the first part.
In the input sequence 210, the words in the same sentence of the first part of the rap song are arranged in reverse order. In the example of FIG. 2, the lyrics of the first part of the rap song are "I raise my head and look up. The sky is a pale expanse." The input sequence 210 corresponding to this first part may be as shown in FIG. 2. Specifically, the element "[START]" (i.e., the start marker) identifies the beginning of the input sequence 210. Each element between the element "[START]" and the first element "[SEP]" (i.e., the first sentence separation marker) corresponds to the first sentence of the first part, "I raise my head and look up." As shown in FIG. 2, in the input sequence 210 the words of the first sentence are arranged in reverse order (i.e., in the order "look", "head", "raise", "I"). Similarly, the words of the second sentence, "The sky is a pale expanse.", are also in reverse order.
Reference is now made to FIG. 3. FIG. 3 illustrates a schematic diagram of lyrics arranged in forward order and in reverse order, according to some implementations of the present disclosure. The sequence 310 in FIG. 3 shows the lyrics in forward order, while the sequence 320 shows the same lyrics in reverse order. As shown in FIG. 3, in the forward-order sequence 310, the subsequence corresponding to the first sentence of the lyrics (i.e., the first row in the sequence 310) includes a set of rhyming elements, and the subsequence corresponding to the second sentence of the lyrics (i.e., the second row in the sequence 310) includes another set of rhyming elements. The two sets of rhyming elements are located at different relative positions (offsets with respect to the start of each subsequence) in the two forward-order subsequences.
In contrast, in the reverse-order sequence 320, the set of rhyming elements in the subsequence corresponding to the first sentence of the lyrics (i.e., the first row in the sequence 320) and the set of rhyming elements in the subsequence corresponding to the second sentence (i.e., the second row in the sequence 320) are located at the same relative positions in the two subsequences. By arranging the lyrics in reverse order, the one or more rhyming words (also referred to as a "rhyming phrase") in each sentence can be located more easily. For rap songs, the rhyming words or phrases in each sentence are of great importance, and they are usually located at the end of each sentence. Thus, by using the reverse-order sequence, these rhyming words or phrases can be located more easily, thereby facilitating the generation of a rap song.
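To make the reverse-order arrangement concrete, the following is a minimal Python sketch of constructing such an input sequence; the function and variable names are illustrative assumptions, not the patent's own code.

```python
# A minimal sketch (not the patent's own implementation) of building the
# reverse-order input sequence described above. `sentences` holds the lyrics
# in forward order; `beat_words` gives, per sentence, the forward-order word
# indices that are aligned with a beat.

def build_input_sequence(sentences, beat_words):
    seq = ["[START]"]
    for sent, beats in zip(sentences, beat_words):
        # Walk the sentence backwards so its words end up in reverse order.
        for fwd_idx in reversed(range(len(sent))):
            seq.append(sent[fwd_idx])
            if fwd_idx in beats:
                seq.append("[BEAT]")  # word is placed to the left of [BEAT]
        seq.append("[SEP]")           # sentence separation marker
    return seq

# One sentence whose 2nd and 4th words (forward order) are beat-aligned:
print(build_input_sequence([["I", "raise", "head", "gaze", "up"]], [{1, 3}]))
# -> ['[START]', 'up', 'gaze', '[BEAT]', 'head', 'raise', '[BEAT]', 'I', '[SEP]']
```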
Reference is made back to FIG. 2. In general, in the example of FIG. 2, the first portion of the rap song represented by the input sequence 210 includes two sentences in forward order, "I raise my head and look up." and "The sky is a pale expanse.", and the words adjacent to the "[BEAT]" markers are aligned with the beats, i.e., they will be recited or sung when the corresponding beats are struck.
In some implementations, the first portion to which the input sequence 210 corresponds may be input by a user. For example, the user may input the first two sentences of the rap song, "I raise my head and look up. The sky is a pale expanse." This input is in turn converted into the input sequence 210. Alternatively or additionally, in some implementations, the first portion to which the input sequence 210 corresponds may be generated by the rap generation model 180 based on a previous input sequence. For example, the user may input a start marker, and the rap generation system 200 may then generate the elements of the input sequence 210 in turn. Alternatively or additionally, in some implementations, the first portion to which the input sequence 210 corresponds may include both a portion input by the user and a portion generated by the rap generation model 180. For example, the first sentence, "I raise my head and look up.", is input by the user, and the second sentence, "The sky is a pale expanse.", is generated by the rap generation model 180.
Multiple representations of an input sequence
The input sequence of the rap generation system 200 is described in detail above. After the input sequence 210 is acquired, the rap generation system 200 determines a plurality of representation sequences corresponding to the input sequence 210.
As shown in FIG. 2, the text representation sequence 220 is used to represent the textual meaning of the elements in the input sequence 210, and can also be considered as a token embedding E. In the text representation sequence, the start marker, the sentence separation markers, and the beat markers in the input sequence 210 correspond to predetermined text representations, respectively. For example, the element "[START]" in the input sequence 210 may be correspondingly represented in the text representation sequence 220 as the element "E_[START]"; the element "[SEP]" may be correspondingly represented as the element "E_[SEP]"; and the element "[BEAT]" may be correspondingly represented as the element "E_[BEAT]". By using the predetermined text representation "E_[BEAT]" corresponding to the beat markers, it is easier to identify and locate the beats in the sequence.
In the text representation sequence 220, the word elements in the input sequence 210 are represented as word tokens. For example, the element "look" may be correspondingly represented as the token of the word "look", i.e., the element "E_look". As another example, the element "raise" may be correspondingly represented as the token of the word "raise", i.e., the element "E_raise".
After the input sequence 210 is acquired, the rap generation system 200 also determines a rhyme representation sequence 240 corresponding to the input sequence 210. The rhyme representation sequence 240 may include any one of, or any combination of, a sentence representation sequence 241, an intra-sentence position representation sequence 242, and a vowel representation sequence 243. In the example of FIG. 2, the rhyme representation sequence 240 includes all three: the sentence representation sequence 241, the intra-sentence position representation sequence 242, and the vowel representation sequence 243.
As shown in FIG. 2, the sentence representation sequence 241 represents the sentence to which each element (e.g., word, beat marker, etc.) in the input sequence 210 belongs. The sentence representation sequence 241 may be considered as a sentence embedding S. By using the sentence representation sequence 241, the individual sentences of the rap song can be conveniently distinguished. The rap generation system 200 can generate the sentence representation sequence 241 corresponding to the input sequence 210 based on the sentence to which each element (e.g., word, beat marker, etc.) in the input sequence 210 belongs. For example, based on the first element "[START]" in the input sequence 210, the corresponding first element "S_[START]" in the sentence representation sequence 241 may be generated, for indicating the start of the sentence representation sequence 241.
Each element in the input sequence 210 between the element "[START]" and the first element "[SEP]" (i.e., the first sentence separation marker) corresponds to the first sentence. Thus, based on each element (word or beat marker) between the element "[START]" and the first element "[SEP]", the corresponding element "S_0" in the sentence representation sequence 241 can be generated. The element "S_0" in the sentence representation sequence 241 indicates that the corresponding element belongs to the first sentence. In some implementations, at the position corresponding to an element "[SEP]" in the input sequence 210, the same element as the element before it may be generated. For example, at the position corresponding to the first element "[SEP]" in the input sequence 210, the element "S_0" may be generated.
Similarly, based on each element (word or beat marker) between the first element "[SEP]" and the second element "[SEP]", the corresponding element "S_1" in the sentence representation sequence 241 can be generated. The element "S_1" in the sentence representation sequence 241 indicates that the corresponding element belongs to the second sentence. It should be appreciated that while in the example of FIG. 2 the input sequence 210 and the sentence representation sequence 241 correspond to only two sentences, in other examples they may correspond to only one sentence, or to more sentences.
As shown in FIG. 2, the intra-sentence position representation sequence 242 represents the position of each word in the input sequence 210 within the sentence to which it belongs. The intra-sentence position representation sequence 242 may be regarded as an intra-sentence position embedding R. By using the intra-sentence position representation sequence 242, the representation of the rhyme pattern of the rap song (e.g., the position and number of the rhyming words) can be enhanced. The intra-sentence position representation sequence 242 corresponding to the input sequence 210 may be generated based on the position of each word within the sentence to which it belongs and the predetermined position representations for the beat markers, the start marker, and the sentence separation markers.
In the intra-sentence position representation sequence, the start marker, the sentence separation markers, and the beat markers in the input sequence 210 correspond to predetermined position representations, respectively. For example, the first element "[START]" in the input sequence 210 indicates the beginning of the sequence and does not belong to any sentence. Based on the element "[START]" in the input sequence 210, the corresponding first element "R_[START]" of the intra-sentence position representation sequence 242 may be generated, which indicates the beginning of the intra-sentence position representation sequence 242. Similarly, based on an element "[SEP]" in the input sequence 210, the element "R_[SEP]" is generated at the corresponding position in the intra-sentence position representation sequence 242, which represents the separation between different sentences.
Each element (including the word elements and the beat marker elements) in the input sequence 210 between the element "[START]" and the first element "[SEP]" (i.e., the first sentence separation marker) corresponds to the first sentence. Based on the position of each word between the element "[START]" and the first element "[SEP]" within the first sentence, and the predetermined position representation for the beat markers, the corresponding elements of the intra-sentence position representation sequence 242 are generated.
The first subsequence of the input sequence 210 corresponding to the first sentence includes the elements "look", "up", "[BEAT]", "head", "raise", "[BEAT]", and "I". The first subsequence thus includes two beat markers, i.e., the two elements "[BEAT]" located at the 3rd and 6th positions of the subsequence. The element "R_[BEAT]" (i.e., the predetermined position representation of the beat markers) may be generated at the corresponding 3rd and 6th positions of the portion of the intra-sentence position representation sequence 242 corresponding to the first sentence.
The element "look" in the input sequence 210 is at the 1st position of the first sentence to which it belongs, so the corresponding element "R_0" in the intra-sentence position representation sequence 242 can be generated. As another example, the element "head" in the input sequence 210 is at the 3rd position of the first sentence to which it belongs (disregarding the beat marker before it). Thus, the element "R_2" may be generated at the position in the intra-sentence position representation sequence 242 corresponding to the element "head" in the input sequence 210.
Similarly, the second subsequence of the input sequence 210, corresponding to the second sentence, includes five word elements and two beat marker elements. At the positions in the intra-sentence position representation sequence 242 corresponding to the elements "[BEAT]" in the second subsequence, the corresponding elements "R_[BEAT]" (i.e., the predetermined position representation) may be generated. At the positions in the intra-sentence position representation sequence 242 corresponding to the five word elements of the second subsequence, the elements "R_0", "R_1", "R_2", "R_3", and "R_4" are generated, respectively.
As shown in FIG. 2, the vowel representation sequence 243 represents the vowel of each word in the input sequence 210. The vowel representation sequence 243 may be considered as a vowel embedding F. By using the vowel representation sequence 243, the representation of the rhyme pattern of the rap song (e.g., which vowels rhyme) can be enhanced. The vowel representation sequence 243 corresponding to the input sequence 210 may be generated based on the vowel of each word in the input sequence 210 and the predetermined vowel representations for the beat markers, the start marker, and the sentence separation markers.
In the vowel representation sequence 243, the start marker, the sentence separation markers, and the beat markers in the input sequence 210 correspond to predetermined vowel representations, respectively. Specifically, the first element "[START]" in the input sequence 210 indicates the beginning of the sequence and contains no vowel. Based on the element "[START]" in the input sequence 210, the corresponding first element "F_[START]" of the vowel representation sequence 243 may be generated. The element "F_[START]" is the predetermined vowel representation corresponding to the start marker and indicates the start of the vowel representation sequence 243. Similarly, based on an element "[SEP]" in the input sequence 210, the element "F_[SEP]" is generated at the corresponding position in the vowel representation sequence 243. The element "F_[SEP]" is the predetermined vowel representation corresponding to the sentence separation marker and represents the separation between sentences. Likewise, based on an element "[BEAT]" in the input sequence 210, the element "F_[BEAT]" is generated at the corresponding position in the vowel representation sequence 243. The element "F_[BEAT]" is the predetermined vowel representation corresponding to the beat marker and indicates the position of a beat in the vowel representation sequence 243.
The elements at the corresponding positions in the vowel representation sequence 243 may be generated based on the vowels of the individual words in the input sequence 210. For example, the vowel of the element "look" (望, wàng) in the input sequence 210 is "ang", so the element "F_ang" may be generated at the corresponding position in the vowel representation sequence 243, as shown in FIG. 2. The element "F_ang" represents the vowel "ang". Similarly, the vowel of the element "head" (头, tóu) in the input sequence 210 is "ou", so the element "F_ou" may be generated at the corresponding position in the vowel representation sequence 243. The element "F_ou" represents the vowel "ou".
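As one possible way to obtain the vowel of each Chinese word, the finals produced by a pinyin library can be used. The sketch below assumes the pypinyin package, which the patent does not mandate; any tool that maps words to vowels would serve.

```python
# A sketch of deriving the vowel (pinyin final) elements of the vowel
# representation sequence. pypinyin is an illustrative choice, not part of
# the patent.
from pypinyin import lazy_pinyin, Style

SPECIAL = {"[START]", "[SEP]", "[BEAT]"}

def vowel_elements(input_seq):
    out = []
    for tok in input_seq:
        if tok in SPECIAL:
            out.append(f"F_{tok}")  # predetermined vowel representation
        else:
            # e.g. "望" -> "ang", "头" -> "ou"
            out.append("F_" + lazy_pinyin(tok, style=Style.FINALS)[0])
    return out
```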
Additionally or alternatively, in some implementations, a position representation sequence 230 may also be generated based on the input sequence 210. As shown in FIG. 2, the position representation sequence 230 may identify the order of the elements (including the start marker, the sentence separation markers, the beat markers, and the words) of the input sequence 210 within the entire input sequence 210. The position representation sequence 230 may be considered as a position embedding P. For example, the element "P_0" is generated in the position representation sequence 230 at the position corresponding to the start marker "[START]" of the input sequence 210; the element "P_1" is generated at the position corresponding to the first word element of the input sequence 210; and so on, up to the element "P_16", which is generated at the position corresponding to the last element "[SEP]" of the input sequence 210.
Example Generation of the second part
The determination of the text representation sequence 220 and the rhyme representation sequence 240 is described in detail above in conjunction with FIG. 2. The process of generating the second part of the rap song using the rap generation model 180 will be described next in conjunction with FIG. 2. The second portion of the rap song is generated according to the rap generation model 180 based on at least the determined text representation sequence 220 and rhyme representation sequence 240.
Additionally or alternatively, in some implementations, the second portion of the rap song may be generated according to the rap generation model 180 based on a combination of the determined text representation sequence 220, rhyme representation sequence 240, and position representation sequence 230. For example, in the example of FIG. 2, the text representation sequence 220, the rhyme representation sequence 240 (including the sentence representation sequence 241, the intra-sentence position representation sequence 242, and the vowel representation sequence 243), and the position representation sequence 230 may be added element-wise. The summed sequence is input to the rap generation model 180 to generate the second portion of the rap song. Using the rhyme representation sequence 240, which combines the sentence representation sequence 241, the intra-sentence position representation sequence 242, and the vowel representation sequence 243, not only helps to determine the individual sentences in the first part of the rap song, but also enhances the representation of the rhymes in the first part of the rap song.
It should be appreciated that the text representation sequence 220, the rhyme representation sequence 240, and the optional position representation sequence 230 may be combined in any suitable manner to generate the second portion of the rap song according to the rap generation model 180. For example, the text representation sequence 220, the rhyme representation sequence 240, and the optional position representation sequence 230 may be concatenated, and the concatenated representation sequence may then be input to the rap generation model 180.
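As an illustration of the element-wise addition option, the following PyTorch sketch sums the five embeddings before they are fed to the model. The vocabulary sizes and embedding dimension are assumptions for illustration, not values from the patent.

```python
# A minimal sketch of combining the representation sequences by element-wise
# addition. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class RapInputEmbedding(nn.Module):
    def __init__(self, vocab=30000, vowels=40, max_len=1024,
                 max_sent=64, max_intra=64, dim=768):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)       # E: token embedding
        self.sent = nn.Embedding(max_sent, dim)   # S: sentence embedding
        self.intra = nn.Embedding(max_intra, dim) # R: intra-sentence position
        self.vowel = nn.Embedding(vowels, dim)    # F: vowel embedding
        self.pos = nn.Embedding(max_len, dim)     # P: global position

    def forward(self, tok_ids, sent_ids, intra_ids, vowel_ids, pos_ids):
        # Summed sequence fed to the generation model.
        return (self.tok(tok_ids) + self.sent(sent_ids)
                + self.intra(intra_ids) + self.vowel(vowel_ids)
                + self.pos(pos_ids))
```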
In some implementations, the rap generation model 180 may be implemented as an autoregressive model. In such an implementation, the previous output sequence of the rap generation model 180 serves as the next input sequence, so that the elements of the second portion of the rap song are generated sequentially. In particular, the element of the second portion of the rap song following the input sequence 210 (also referred to herein as the "target element") may be determined according to the rap generation model 180 based on the determined text representation sequence 220 and rhyme representation sequence 240. The target element may be one of a word, a beat marker, or a sentence separation marker. Further, the remaining elements of the second portion other than the target element may be generated according to the rap generation model 180 based on an updated sequence (i.e., the output sequence 250 in FIG. 2) that is a combination of the input sequence 210 and the determined target element. For example, a text representation sequence, a rhyme representation sequence, and optionally a position representation sequence corresponding to the updated sequence may be generated. The combination of these sequences is input to the rap generation model 180 to generate the next element after the target element, thereby updating the input sequence again. By analogy, the remaining elements of the second portion may be generated autoregressively.
In the example of FIG. 2, the output sequence 250 is obtained from the rap generation model 180 based on the input sequence 210 as shown in FIG. 2. In the output sequence 250, the element 251 (i.e., the element "welcome") is the target element determined at this step by the rap generation model 180. The rap generation model 180 combines the input sequence 210 and the element 251 into a sequence as the output sequence 250. With the start marker at the front of the output sequence 250, a new input sequence is obtained. A new target element, such as the element "happy", or a beat marker or sentence separation marker, may then be generated according to the rap generation model 180 based on the new input sequence.
In the next generation step, the new input sequence may be the combination of the input sequence 210, the element "welcome", and the element "happy". By repeating a similar process, the rap generation model 180 may generate a final output sequence corresponding to the second portion of the rap song. In the final output sequence, the words of each sentence of the second portion of the rap song are in reverse order. The second portion of the rap song may be derived from the final output sequence.
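A compact sketch of this autoregressive loop follows. Here `model` maps an encoded sequence to per-position scores over the candidate lexicon, and `encode` builds the summed representation sequences (as in the embedding sketch above); both names, the stopping condition, and the maximum length are illustrative assumptions.

```python
# A sketch of the autoregressive generation loop described above: each
# determined target element is appended to the sequence, which is then
# re-encoded for the next step.
import torch

def generate_second_part(model, encode, input_ids, max_len=256):
    seq = list(input_ids)
    with torch.no_grad():
        while len(seq) < max_len:
            logits = model(encode(seq))        # shape: [len(seq), vocab]
            target = int(logits[-1].argmax())  # next word / [SEP] / [BEAT] id
            seq.append(target)
    return seq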
The various elements in the various sequences described above may be stored in the form of vectors. The respective sequences may take the form of a set of vectors, or may take the form of a matrix. It will be understood that the sequences and elements in the sequences may also be stored in any other suitable form.
By the above-described process of generating the second part of the rap song according to the rap generation model 180 based on the text representation sequence 220 and the rhyme representation sequence 240, a rap song with both rhyme and tempo can be generated.
Enhancement of multi-word rhyme
The process of determining the target element according to the rap generation model 180 based on the text representation sequence 220, the rhyme representation sequence 240, and optionally the position representation sequence 230 will be described below. In some implementations, the target element may be determined from a candidate word library. A predetermined candidate word library may be used. The candidate word library may include various words, the sentence separation marker, and the beat marker. For example, the words in the predetermined candidate word library may be generated from the words contained in an existing corpus of rap songs. As another example, the words in the candidate word library may be generated using electronic dictionary data. The words in the candidate word library may be represented by corresponding vectors. The sentence separation marker and the beat marker may each be represented by a special vector in the candidate word library.
As an example, the rap generation model 180 may determine the target element by the following equation (1):
w_i = arg max_w p(w | w_{<i}; θ)    (1)
where θ represents the rap generation model 180; w represents the sentence subsequence of the output sequence in which the target element to be determined is located; i indicates that the target element to be determined is at the i-th position of the sentence subsequence w; w_i represents the target element to be determined; w_{<i} represents the words, sentence separation markers, or beat markers located before position i in the sentence subsequence w; and p(·) represents a probability distribution function, such as a normal (Gaussian) distribution function or any other suitable probability distribution function. By using equation (1), the word, sentence separation marker, or beat marker having the highest probability in the candidate word library may be determined as the target element. Alternatively, the target element may be selected from the top k (k being an integer greater than or equal to 1) words, sentence separation markers, or beat markers with the highest probabilities.
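The selection rule of equation (1), including the top-k alternative just mentioned, can be sketched as follows; the function is an illustrative sketch, not the patent's implementation.

```python
# A sketch of target-element selection per equation (1): greedy arg max, or
# sampling among the k most probable candidates.
import torch

def select_target(logits, k=1):
    probs = torch.softmax(logits, dim=-1)  # p(w | w_<i; theta)
    if k == 1:
        return int(probs.argmax())         # the arg max of equation (1)
    topk = torch.topk(probs, k)            # k most probable candidates
    choice = torch.multinomial(topk.values, num_samples=1)
    return int(topk.indices[choice])
```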
In some implementations, to generate rhymes, particularly sentences with multiple rhyming words, the probability in equation (1) may be adjusted based on the vowels of the candidate words. As an example, when the value of i is not greater than a predetermined threshold N (e.g., 3), i.e., when the target element is located within the last N positions of the sentence in forward order, the adjusted probability of a candidate word may be calculated by equation (2) below. The word in the candidate word library having the highest adjusted probability may be determined as the target element. Alternatively, the target element may be selected from the top k (k being an integer greater than or equal to 1) candidates with the highest adjusted probabilities.
p̂(w | w_{<i}; θ) = α · p(w | w_{<i}; θ) + (1 − α) · π(w)    (2)

where p̂(·) represents the adjusted probability distribution function; α represents a predetermined adjustment factor, for example, the value of α may be 0.9 or any other value between 0 and 1; and π(w) represents a vowel check function.
By way of example, the value of π(w) may be determined as follows: if a candidate word for the target element w_i has the same vowel as the word at the same i-th position (in reverse order) of the previous sentence, the value of π(w) is 1; otherwise, the value of π(w) is 0. It should be understood that these values of π(w) are merely exemplary and are not intended to limit the scope of the present disclosure. In implementations of the present disclosure, π(w) may have any suitable value. Further, for a candidate that is a beat marker or a sentence separation marker, since it has no vowel, its adjusted probability may not be calculated using equation (2) above.
In some implementations, the predetermined threshold N may be a predefined natural number greater than 1, such as 2 or 3. Additionally or alternatively, the value of the predetermined threshold N may be determined from the historical input sequence. For example, a rap song will typically rhyme one word or multiple consecutive words (also known as an N-rhyme) at the end of each sentence. When a rap song rhymes, for example, 2 or 3 words at the end of each sentence, the number of rhyming words at the ends of the sentences of the rap song can be determined by analyzing the input sequence corresponding to the rap song. If the rap song rhymes 2 words at the end of each sentence, the predetermined threshold N may be determined to be 2. For example, in the rap song example shown in FIG. 3, 3 consecutive words rhyme at the end of each sentence; in this example, the predetermined threshold N may be determined to be 3. Other ways of determining the value of the predetermined threshold N may also be used.
It should be understood that the above exemplary value of the adjustment factor α is merely illustrative and does not limit the present disclosure in any way. Likewise, other suitable functions may be used for the vowel check function π(w) described above. For example, the vowel check function π(w) may be determined based on the similarity between the vowel of a candidate word and the vowel of the word at the same position in the previous sentence. When the two vowels are the same, the value of π(w) is 1. When the two vowels are similar (e.g., one vowel is "an" and the other is "ang"), the value of π(w) may be, for example, 0.5. When the two vowels are neither the same nor similar, the value of π(w) is 0.
The adjusted probability calculation described above takes into account the characteristic of rap songs that one word or multiple consecutive words at the end of each sentence usually rhyme. In this way, rhyming rap songs, particularly rap songs with multiple consecutive rhyming words (i.e., N-rhymes), are more likely to be generated, thereby improving the quality of the generated rap songs.
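A small sketch of the vowel check and adjusted probability follows, assuming equation (2) takes the convex-combination form reconstructed above; `vowel_of` and `ref_vowel` are illustrative names, with `ref_vowel` being the vowel of the word at the same reverse-order position in the previous sentence.

```python
# A sketch of the adjusted probability of equation (2) with a 0/1 vowel
# check pi(w), as described above.
def adjusted_probability(p_w, candidate, ref_vowel, vowel_of, alpha=0.9):
    pi = 1.0 if vowel_of(candidate) == ref_vowel else 0.0  # vowel check pi(w)
    return alpha * p_w + (1.0 - alpha) * pi                # equation (2)

# Applied to every candidate word in the library; the candidate with the
# highest adjusted probability is chosen as the target element.
```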
The above describes the process of determining the target element according to the rap generation model 180 based on the text representation sequence 220 and the rhyme representation sequence 240. Alternatively or additionally, a plurality of candidate words for the target element may be determined according to the rap generation model 180 based on the text representation sequence 220 and the rhyme representation sequence 240. For example, the preset number M (e.g., 3) of words with the highest probability values determined according to equation (1) above may be taken as the plurality of candidate words. In other examples, a probability threshold may be set, and when the determined probability of a word is higher than the probability threshold, the word may be considered a candidate word. In the previous sentence of the target element, the reference word located at the same position as the target element (in reverse order) is determined. A candidate word is then selected from the plurality of candidate words as the target element based on the respective similarities between the vowels of the plurality of candidate words and the vowel of the reference word. For example, adjusted probabilities may be calculated for the plurality of candidate words by equation (2), and the candidate word with the highest adjusted probability may be selected as the target element. As another example, a candidate word whose vowel is the same as that of the reference word may be taken as the target element. It should be understood that the preset number M and the probability threshold may be any suitable preset values.
Additionally or alternatively, the input sequence 210 may also have a beat frequency identification (not shown) therein. In this context, the term "beat frequency" refers to the ratio of the total number of words in a rap song to the total number of beats of the rap song. For example, a beat frequency corresponding to a first portion of a rap song may be determined from the first portion. As another example, a desired beat frequency may be input by a user.
If the beat frequency is lower than both a first threshold (e.g., 2) and a second threshold (e.g., 4), the beat frequency is a slow beat frequency. Accordingly, a beat frequency identification such as "[BS]" may be added at the beginning of the input sequence 210. If the beat frequency is higher than the first threshold and lower than the second threshold, the beat frequency is a medium beat frequency. Accordingly, a beat frequency identification such as "[BM]" may be added at the beginning of the input sequence 210. If the beat frequency is higher than both the first threshold and the second threshold, the beat frequency is a fast beat frequency. Accordingly, a beat frequency identification such as "[BF]" may be added at the beginning of the input sequence 210.
It should be understood that the values of the first and second thresholds described above are merely illustrative. In other examples, other suitable values may be used as the first and second thresholds. In some examples, the first threshold and the second threshold may be equal, e.g., both 3.
In this way, beat frequency identification may be added to the input sequence or output sequence corresponding to the rap song. By the beat frequency identification, information about the beat frequency of the rap song may be provided. It should be understood that different rap songs may have different beat frequencies. Some listeners may prefer rap songs that are more rhythmic (i.e., fast beat frequency), while some listeners may prefer rap songs that are less rhythmic (i.e., slow beat frequency). By adding the beat frequency identification, rap songs with different beat frequencies can be generated to meet the preferences of various listeners.
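The mapping from word/beat ratio to beat frequency identification can be sketched as follows, using the example thresholds (2 and 4) given above; the function name is an illustrative assumption.

```python
# A sketch of choosing the beat frequency identification from the ratio of
# total words to total beats, per the thresholds described above.
def beat_frequency_id(num_words, num_beats, first=2.0, second=4.0):
    freq = num_words / max(num_beats, 1)  # beat frequency
    if freq < first:
        return "[BS]"   # slow beat frequency
    elif freq < second:
        return "[BM]"   # medium beat frequency
    return "[BF]"       # fast beat frequency
```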
The architecture of the rap generation system 200, and the process of using the rap generation model 180 according to some implementations of the present disclosure, are described in detail above in conjunction with fig. 2. The training process of the rap generating model 180 will be described below in conjunction with fig. 4 and 5.
Rap generative model training
In the application of the rap generation model 180, the input sequence 210 includes lyrics in reverse order and corresponding beat identifications. Accordingly, when training the rap generation model 180, the training sequences need to include lyrics in reverse order and corresponding beat identifications, similar to the input sequence 210. For this purpose, such a training data set needs to be constructed.
Fig. 4 illustrates a schematic diagram of a process 400 for constructing a training data set in accordance with some implementations of the present disclosure. As shown in fig. 4, at 410, data collection may be performed. For example, rap songs having both lyrics and sung audio may be crawled from the Internet as training rap songs. Additionally or alternatively, the start time and end time, in the audio, of each sentence of the lyrics may also be crawled. In this way, the lyrics and the audio can be kept aligned at the sentence level, which facilitates the subsequent beat alignment at the word level.
At 420, vocal and accompaniment separation may be performed, i.e., the human voice (which sings the lyrics) and the accompaniment (which carries the rhythmic beats) are separated from the crawled audio. For example, any suitable music separation tool may be used for this separation.
At 430, the crawled lyrics may be aligned with the human voice separated from the crawled audio. The time at which a word in the lyrics is expressed in the audio (also referred to herein as the first time) may be determined based on the crawled lyrics and the separated human voice. For example, the separated human voice may be segmented to the sentence level (i.e., into individual sentences) according to the start time and end time of each sentence in the crawled lyrics. Furthermore, the crawled lyrics may be converted into phonemes using a suitable tool, such as Phonemizer. A phoneme-level voice-lyric alignment may then be obtained from the segmented human voice data and the converted phoneme data. Further, based on the obtained phoneme-level voice-lyric alignment, a lyric timestamp for each word in the singing audio, i.e., a timestamp representing the first time, may be obtained.
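For illustration, the phoneme conversion step might use the Phonemizer package as follows; the example sentence, language code, and backend are assumptions of this sketch:

```python
from phonemizer import phonemize

# Convert one lyric sentence into a phoneme string; the phoneme-level
# voice-lyric alignment is then computed between these phonemes and
# the sentence-level segmented human voice.
phonemes = phonemize("we keep it moving on", language="en-us",
                     backend="espeak", strip=True)
print(phonemes)  # an IPA phoneme string for the sentence
```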
At 440, beat detection may be performed on the accompaniment separated from the audio. The time of each beat in the audio (also referred to herein as the second time) may be determined from the separated accompaniment. For example, a beat timestamp for each beat, i.e., a timestamp representing the second time, may be obtained from the accompaniment separated at 420 using a beat tracking tool such as Librosa.
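As a sketch of this beat detection step with Librosa (the file name is illustrative):

```python
import librosa

# Load the separated accompaniment and detect beat times (the second times).
y, sr = librosa.load("accompaniment.wav", sr=None)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_timestamps = librosa.frames_to_time(beat_frames, sr=sr)  # in seconds
```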
At 450, the lyrics and the beats may be aligned. A beat may be determined to correspond to a word if the first time determined for the word matches the second time determined for the beat; in other words, the word is aligned with the beat. For example, the lyrics may be aligned with the beats based on matches between the lyric timestamps of words and the beat timestamps of beats.
In some implementations, the lyrics may be aligned with the beats using an approximate alignment approach. Let $W = \{w_1, w_2, \ldots, w_{|W|}\}$ denote the word sequence of a lyric sentence, where $w_i$ represents the $i$-th word in the sentence, and let $B = \{b_1, b_2, \ldots, b_{|B|}\}$ denote the beat sequence of the separated accompaniment, where $b_j$ represents the $j$-th beat. Further, let $t_{w_i}$ and $t_{b_j}$ denote the timestamps of $w_i$ and $b_j$, respectively.

For each beat $b_j$, the words of the lyric sentence may be filtered by equation (3) to obtain a filtered word set for beat $b_j$:

$$W_{b_j} = \{\, w_i \in W : |t_{w_i} - t_{b_j}| \le r/2 \,\} \tag{3}$$

where $W_{b_j}$ represents the filtered word set and $r$ represents the average duration of a word in the training rap song. For example, the average word duration may be obtained by dividing the total duration of the training rap song by the total number of words in the training rap song.

Next, the words in the filtered word set may be aligned with the beats in the beat sequence using equation (4):

$$w^{*} = \operatorname*{arg\,min}_{w_i \in W_{b_j}} |t_{w_i} - t_{b_j}| \tag{4}$$

The word $w_i$ satisfying equation (4) is the word whose lyric timestamp is closest to the beat timestamp of beat $b_j$; this word $w_i$ is therefore aligned with beat $b_j$. Similarly, the words aligned with the other beats in the beat sequence can be found.
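A runnable sketch of this approximate alignment is given below. It assumes word and beat timestamps in seconds, and that the filtering window of equation (3) spans half the average word duration $r$ on each side of a beat (the exact window is a reconstruction):

```python
def align_beats_to_words(word_times, beat_times, r):
    """Approximate lyric-beat alignment following equations (3) and (4)."""
    alignment = {}  # beat index -> word index
    for j, tb in enumerate(beat_times):
        # Equation (3): keep only words whose lyric timestamps fall
        # within half the average word duration r of the beat.
        candidates = [(i, tw) for i, tw in enumerate(word_times)
                      if abs(tw - tb) <= r / 2]
        if not candidates:
            continue  # no word is close enough to this beat
        # Equation (4): align the beat with the closest remaining word.
        i_star = min(candidates, key=lambda c: abs(c[1] - tb))[0]
        alignment[j] = i_star
    return alignment
```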
When singing a rap song, a rap singer may not sing the words exactly on the beats. With the approximate alignment method described above, each beat is aligned with the word whose lyric timestamp is closest to the beat timestamp. In this way, the influence of the lyrics and beats not being strictly aligned in the singer's performance can be mitigated, and the alignment accuracy of lyrics and beats is improved.
After the word corresponding to each beat in the beat sequence has been determined, the crawled lyrics are stored, together with the lyric-beat alignment information, as a data set 460. Using the process 400, a large number of rap songs may be collected to construct a data set for training the rap generation model 180.
Fig. 5 illustrates an example of a training rap 510 in a data set constructed by the process 400 according to some implementations of the present disclosure. As shown in fig. 5, the data set 460 stores the sequence number of each sentence of the training rap 510, the start time of each sentence, the lyrics of each sentence, and the beat alignment information (denoted by the symbol "×"). In the example of fig. 5, the symbol "×" indicates that a beat is aligned with the word to the right of the symbol. It should be understood that in other examples, other symbols may be used to represent the alignment of beats and words. Likewise, although in the training rap 510 of fig. 5 the symbol "×" marks the word to its right, in other examples the symbol may instead mark the word to its left.
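Purely for illustration, one such stored record might look as follows; the field names and values are assumptions of this sketch, not a storage format mandated by the disclosure:

```python
# Hypothetical serialization of one training rap in data set 460.
# "×" marks the word aligned with a beat, as in fig. 5; w1, w2, ...
# stand in for the actual lyric words.
entry = {
    "song_id": "training_rap_510",
    "sentences": [
        {"index": 1, "start_time": 12.48, "lyrics": "×w1 w2 ×w3 w4 w5"},
        {"index": 2, "start_time": 16.02, "lyrics": "w1 ×w2 w3 ×w4"},
    ],
}
```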
The process 400 of collecting rap songs to construct a data set of training rap songs is described above. The above-described process 400 is merely illustrative and is not intended to limit the scope of the present disclosure. In other examples, other methods may be used to construct the data set.
The above describes the process of constructing a data set for training the rap generating model 180. The training process for the rap generating model 180 will be described below in conjunction with fig. 2.
During the training process, a training sequence corresponding to a first portion of a training rap song is obtained. For example, a training rap song may be selected from the data set 460. Hereinafter, the training rap 510 shown in fig. 5 will be used as an example. In the training sequence, similar to the input sequence 210 depicted in fig. 2, the words in the same sentence of the first part of the training rap 510 are arranged in reverse order, and each beat identifier is adjacent to the word corresponding to that beat.
By processing the first part of the training rap song 510, a training sequence may be obtained. In the training sequence, the subsequences corresponding to the different sentences of the first part keep the original sentence order, while within the subsequence of each sentence the words are arranged in reverse order. In addition, a BEAT identifier "[BEAT]" corresponding to the symbol "×" shown in fig. 5 is retained in each subsequence. The training sequence also has a sentence separator identifier "[SEP]" to separate different sentences, and a start identifier "[START]" to indicate the start of the training sequence.
Taking the first sentence of the training rap 510 as the first part, the corresponding training sequence may include the following elements: "[START]", "[Zi]", "[The]", "[BEAT]", "[apo]", "[simple]", "[pieces]", "[like]", "[BEAT]", "[square]", "[ground]", "[like]", "[in]", "[BEAT]", "[big]", "[long]", "[I]", "[SEP]", where the word tokens are English renderings of the original Chinese lyrics of the training rap 510. In this example, the first beat identifier "[BEAT]" is adjacent to the word "[The]" corresponding to that beat, meaning that this word is spoken or sung when the beat is struck.
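A minimal sketch of constructing such a training sequence, assuming word-level tokens; it places "[BEAT]" immediately after the word it marks, matching the example sequence above (the disclosure requires only adjacency):

```python
def build_training_sequence(sentences, beat_flags):
    """Build a training sequence from the first part of a rap.

    sentences: each sentence as a list of words, in natural order
    beat_flags: parallel lists; True where the word carries a beat
    """
    seq = ["[START]"]
    for words, flags in zip(sentences, beat_flags):
        # Sentence order is preserved; words within a sentence are reversed.
        for word, on_beat in zip(reversed(words), reversed(flags)):
            seq.append(word)
            if on_beat:
                seq.append("[BEAT]")  # beat identifier adjacent to its word
        seq.append("[SEP]")           # sentence separator identifier
    return seq

# build_training_sequence([["a", "b", "c"]], [[True, False, False]])
# -> ['[START]', 'c', 'b', 'a', '[BEAT]', '[SEP]']
```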
Next, based on the training sequence, a text representation sequence and a rhyme representation sequence respectively corresponding to the training sequence are determined. Based on the text representation sequence and the rhyme representation sequence, a prediction of a second portion of the training rap song is generated according to the rap generation model. The process of determining the text representation sequence and the rhyme representation sequence is the same as the process of determining the text representation sequence 220 and the rhyme representation sequence 240 described above in connection with fig. 2, and is therefore not repeated in detail.
Similar to the process described with reference to fig. 2, in the training phase, a sentence representation sequence corresponding to the training sequence may be generated based on the sentence to which each word and beat identifier in the training sequence belongs; the sentence representation sequence may be part of the rhyme representation sequence. An intra-sentence position representation sequence corresponding to the training sequence may be generated based on the position of each word in the sentence to which it belongs and a predetermined position representation for the beat identifier; the intra-sentence position representation sequence may be part of the rhyme representation sequence. A vowel representation sequence corresponding to the training sequence may be generated based on the vowel of each word in the training sequence and a predetermined vowel representation for the beat identifier; the vowel representation sequence may be part of the rhyme representation sequence.
Similar to fig. 2, additionally or alternatively, a position representation sequence corresponding to the training sequence may also be determined as part of the input to the rap generation model 180. A combination of the text representation sequence, the rhyme representation sequence (including the sentence representation sequence, the intra-sentence position representation sequence, and the vowel representation sequence), and optionally the position representation sequence, all determined from the training sequence as described above, may be used as input to the rap generation model 180. Based on this combination of sequences, a prediction of the second portion of the training rap song may be generated according to the rap generation model 180. The process of generating the prediction of the second portion of the training rap song is similar to the process of generating the second portion of a rap song according to the rap generation model 180 described with reference to fig. 2, and is therefore not repeated.
Next, the rap generation model 180 may be trained based on the generated prediction for the second portion and the actual second portion of the training rap song. Taking the training rap 510 of fig. 5 as an example, the training sequence may correspond to the first sentence of the training rap 510, and the generated prediction may correspond to the first two sentences of the training rap 510. A loss function may be determined based on the difference between the predicted sentences and the first two sentences (i.e., the real sentences) of the training rap 510. By minimizing the loss function, a trained rap generation model 180 may be obtained.
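A minimal sketch of this training step, assuming the rap generation model 180 is an autoregressive network returning per-position vocabulary logits (the padding convention with -100 is an assumption of this sketch):

```python
import torch.nn.functional as F

def training_step(model, input_ids, target_ids):
    """One training step: cross-entropy between predicted and real elements."""
    logits = model(input_ids)                  # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),      # flatten predictions
        target_ids.view(-1),                   # real next elements
        ignore_index=-100,                     # skip padded positions
    )
    loss.backward()                            # minimize the loss function
    return loss
```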
As described with reference to fig. 2, beat frequency identification may also be included in the input sequence 210. Similarly, in the training phase, the training sequence may also include a beat frequency identification. In this way, in conjunction with beat identification in the training sequence, the rap generation model 180 can be made to explicitly learn information about beat frequencies.
In some implementations, other types of training data may also be used to pre-train the rap generation model 180 before training it with the training rap songs in the data set 460 (i.e., before fine-tuning the rap generation model 180). As an example, the rap generation model 180 may be pre-trained with training lyrics that do not have beats. For example, rhymed verse text may be used as such training lyrics without beats. Although these lyrics have no beat information, the words in the same sentence are still arranged in reverse order when used as a training sequence.
Alternatively or additionally, songs of a different genre than rap that have beats may be utilized as training songs. Besides rap songs, there are a large number of songs of other genres, such as pop songs and folk songs. These songs also contain beat information that the rap generation model 180 can learn, and may therefore be used to pre-train it. Likewise, the collected songs of these different genres may be constructed into a data set for pre-training the rap generation model 180 using the process 400 or a similar method.
By using training lyrics without beats and/or songs of different genres as training songs, richer training data for the rap generation model 180 can be obtained. In this way, a greater amount of training data can be used to better pre-train the rap generation model 180, which avoids the rap generation model 180 being under-trained due to insufficient rap song data.
Additionally or alternatively, training of the rap generation model 180 may include fine-tuning. For example, the rap generation model 180 may be fine-tuned after it has been pre-trained, with different training data sets used for pre-training and fine-tuning. For example, the larger-scale training data set described above, generated from lyrics without beats and from songs of different genres, may be used for pre-training, while a smaller-scale training data set generated from rap songs alone may be used for fine-tuning the rap generation model 180.
After pre-training, the rap generation model 180 is fine-tuned using a data set formed from rap songs alone (e.g., the data set 460). In this way, the rap generation model 180 can be better adapted to rap songs, further improving its accuracy. By utilizing pre-training and fine-tuning, the language naturalness and rhyme accuracy of the generated rap songs can be improved.
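A schematic two-stage loop under the same assumptions, reusing the training_step sketch above (epoch counts are illustrative):

```python
def train_two_stage(model, pretrain_loader, finetune_loader, optimizer,
                    pretrain_epochs=10, finetune_epochs=3):
    """Pre-train on the larger corpus, then fine-tune on rap-only data."""
    for loader, epochs in ((pretrain_loader, pretrain_epochs),
                           (finetune_loader, finetune_epochs)):
        for _ in range(epochs):
            for input_ids, target_ids in loader:
                optimizer.zero_grad()
                training_step(model, input_ids, target_ids)
                optimizer.step()
    return model
```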
Example method
Fig. 6 illustrates a flow diagram of a method 600 of applying the rap generation model 180 according to some implementations of the present disclosure. The method 600 may be implemented by the computing device 100, for example, at the rap generation module 122 in the memory 120 of the computing device 100.
As shown in fig. 6, at block 610, the computing device 100 obtains an input sequence 210 corresponding to a first portion of a rap. In the input sequence 210, the words in the same sentence of the first portion are arranged in reverse order, and the beat identifier of a beat of the first portion is adjacent to the word corresponding to that beat. In some implementations, at least a portion of the first portion is generated according to the rap generation model 180. Alternatively or additionally, in some implementations, the first portion is input by a user.
At block 620, the computing device 100 determines a text representation sequence 220 and a rhyme representation sequence 240 that respectively correspond to the input sequence 210. Additionally, in some implementations, a position representation sequence 230 corresponding to the input sequence 210 may also be determined.
In some implementations, determining the rhyme representation sequence 240 corresponding to the input sequence 210 includes: generating a vowel representation sequence 243 corresponding to the input sequence 210, as part of the rhyme representation sequence 240, based on the vowel of each word in the input sequence 210 and a predetermined vowel representation for the beat identifier; generating an intra-sentence position representation sequence 242 corresponding to the input sequence 210, as part of the rhyme representation sequence 240, based on the position of each word in the sentence to which it belongs and a predetermined position representation for the beat identifier; and generating a sentence representation sequence 241 corresponding to the input sequence 210, as part of the rhyme representation sequence 240, based on the sentence to which each word and beat identifier in the input sequence 210 belongs.
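A sketch of assembling these three component sequences; the special-token identifiers passed for the start, separator, and beat tokens are placeholders, since the predetermined representations are implementation choices:

```python
SPECIAL = {"[START]", "[SEP]", "[BEAT]"}

def rhyme_representation(seq, vowel_of, special_vowel="[V0]", special_pos=0):
    """Build vowel, intra-sentence position, and sentence id sequences."""
    vowels, positions, sentence_ids = [], [], []
    sent_id, pos = 0, 0
    for tok in seq:
        is_word = tok not in SPECIAL
        vowels.append(vowel_of(tok) if is_word else special_vowel)
        positions.append(pos if is_word else special_pos)
        sentence_ids.append(sent_id)
        if is_word:
            pos += 1              # position of the word within its sentence
        if tok == "[SEP]":
            sent_id, pos = sent_id + 1, 0   # next sentence starts
    return vowels, positions, sentence_ids
```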
At block 630, the computing device 100 generates a second portion of the rap according to the rap generation model 180 based on the text representation sequence 220 and the rhyme representation sequence 240. For example, the elements of the second portion may be generated sequentially.
In some implementations, generating the second portion includes: determining an element of the second portion according to the rap generation model 180 based on the text representation sequence 220 and the rhyme representation sequence 240, the element being one of a word, a beat identifier, or a sentence separator identifier; and sequentially generating subsequent elements of the second portion after that element according to the rap generation model 180, based on an updated sequence formed by combining the input sequence 210 and the element.
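A sketch of this element-by-element loop; model.next_element and the "[END]" marker are placeholders for illustration:

```python
def generate_second_part(model, input_seq, max_len=128):
    """Autoregressively extend the input sequence one element at a time."""
    seq = list(input_seq)
    for _ in range(max_len):
        element = model.next_element(seq)  # a word, "[BEAT]", or "[SEP]"
        if element == "[END]":             # assumed end-of-rap marker
            break
        seq.append(element)                # the updated sequence feeds back in
    return seq[len(input_seq):]            # the generated second portion
```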
In some implementations, determining the element of the second portion includes: determining a plurality of candidate words for the element according to the rap generation model 180 based on the text representation sequence 220 and the rhyme representation sequence 240; and, if it is determined that the element is no more than a threshold distance from the end of the sentence to which it belongs: determining, in the sentence preceding the element, the reference word located at the same position as the element in the reverse order; and selecting a candidate word from the plurality of candidate words as the element based on respective similarities between the vowels of the plurality of candidate words and the vowel of the reference word.
Fig. 7 illustrates a flow diagram of a method 700 of training the rap generating model 180 in accordance with some implementations of the present disclosure. Method 700 may be implemented by any suitable computing device. The method 700 may be implemented by the computing device 100, for example, at the rap generation module 122 in the memory 120 of the computing device 100. Method 700 may also be implemented by another computing device different from computing device 100.
As shown in fig. 7, at block 710, the computing device obtains a training sequence corresponding to a first portion of a training rap. In the training sequence, the words in the same sentence of the first portion are arranged in reverse order, and the beat identifier of a beat of the first portion is adjacent to the word corresponding to that beat. At block 720, the computing device determines a text representation sequence and a rhyme representation sequence that respectively correspond to the training sequence.
In some implementations, determining the rhyme representation sequence corresponding to the training sequence includes: generating a vowel representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the vowel of each word in the training sequence and a predetermined vowel representation for the beat identifier; generating an intra-sentence position representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the position of each word in the sentence to which it belongs and a predetermined position representation for the beat identifier; and generating a sentence representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the sentence to which each word and beat identifier in the training sequence belongs.
At block 730, the computing device generates a prediction of a second portion of the training rap according to the rap generation model 180 based on the text representation sequence and the rhyme representation sequence. At block 740, the computing device trains the rap generation model 180 based on the prediction and the second portion of the training rap.
In some implementations, the method 700 further includes: the computing device obtains audio and lyrics of the training rap, the audio including accompaniment and a human voice expressing the lyrics; determines, based on the lyrics and the human voice, a first time at which a word in the lyrics is expressed in the audio; determines a second time of a beat in the audio from the accompaniment; and determines that the beat corresponds to a word whose first time matches the second time.
In some implementations, prior to training rap generation model 180, rap generation model 180 is pre-trained based on at least one of: training lyrics without beats, or training songs with beats of a different genre than the training rap.
Example implementation
Some example implementations of the present disclosure are listed below.
In a first aspect, the present disclosure provides a computer-implemented method. The method comprises: obtaining an input sequence corresponding to a first part of a rap, wherein in the input sequence, the words in the same sentence of the first part are arranged in reverse order, and a beat identifier of a beat of the first part is adjacent to the word corresponding to the beat; determining a text representation sequence and a rhyme representation sequence respectively corresponding to the input sequence; and generating a second part of the rap according to a rap generation model based on the text representation sequence and the rhyme representation sequence.

In some implementations, determining the rhyme representation sequence corresponding to the input sequence includes: generating a vowel representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on the vowel of each word in the input sequence and a predetermined vowel representation for the beat identifier; generating an intra-sentence position representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on the position of each word in the sentence to which it belongs and a predetermined position representation for the beat identifier; and generating a sentence representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on the sentence to which each word and beat identifier in the input sequence belongs.

In some implementations, generating the second part includes: determining an element of the second part according to the rap generation model based on the text representation sequence and the rhyme representation sequence, the element being one of a word, a beat identifier, or a sentence separator identifier; and sequentially generating subsequent elements of the second part after that element according to the rap generation model, based on an updated sequence formed by combining the input sequence and the element.

In some implementations, determining the element of the second part includes: determining a plurality of candidate words for the element according to the rap generation model based on the text representation sequence and the rhyme representation sequence; and, if it is determined that the element is no more than a threshold distance from the end of the sentence to which it belongs: determining, in the sentence preceding the element, the reference word located at the same position as the element in the reverse order; and selecting a candidate word from the plurality of candidate words as the element based on respective similarities between the vowels of the plurality of candidate words and the vowel of the reference word.
In some implementations, at least a portion of the first portion is generated according to the rap generation model.
In a second aspect, the present disclosure provides a computer-implemented method. The method comprises: obtaining a training sequence corresponding to a first part of a training rap, wherein in the training sequence, the words in the same sentence of the first part are arranged in reverse order, and a beat identifier of a beat of the first part is adjacent to the word corresponding to the beat; determining a text representation sequence and a rhyme representation sequence respectively corresponding to the training sequence; generating a prediction of a second part of the training rap according to a rap generation model based on the text representation sequence and the rhyme representation sequence; and training the rap generation model based on the prediction and the second part of the training rap.

In some implementations, determining the rhyme representation sequence corresponding to the training sequence includes: generating a vowel representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the vowel of each word in the training sequence and a predetermined vowel representation for the beat identifier; generating an intra-sentence position representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the position of each word in the sentence to which it belongs and a predetermined position representation for the beat identifier; and generating a sentence representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the sentence to which each word and beat identifier in the training sequence belongs.

In some implementations, the method further comprises: obtaining audio and lyrics of the training rap, the audio including accompaniment and a human voice expressing the lyrics; determining, based on the lyrics and the human voice, a first time at which a word in the lyrics is expressed in the audio; determining a second time of a beat in the audio from the accompaniment; and determining that the beat corresponds to a word whose first time matches the second time.

In some implementations, prior to training the rap generation model, the rap generation model is pre-trained based on at least one of: training lyrics without beats, or training songs with beats of a genre different from the training rap.
In a third aspect, the present disclosure provides an electronic device. The electronic device includes: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the device to perform acts comprising: obtaining an input sequence corresponding to a first part of a rap, wherein in the input sequence, the words in the same sentence of the first part are arranged in reverse order, and a beat identifier of a beat of the first part is adjacent to the word corresponding to the beat; determining a text representation sequence and a rhyme representation sequence respectively corresponding to the input sequence; and generating a second part of the rap according to a rap generation model based on the text representation sequence and the rhyme representation sequence.

In some implementations, determining the rhyme representation sequence corresponding to the input sequence includes: generating a vowel representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on the vowel of each word in the input sequence and a predetermined vowel representation for the beat identifier; generating an intra-sentence position representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on the position of each word in the sentence to which it belongs and a predetermined position representation for the beat identifier; and generating a sentence representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on the sentence to which each word and beat identifier in the input sequence belongs.

In some implementations, generating the second part includes: determining an element of the second part according to the rap generation model based on the text representation sequence and the rhyme representation sequence, the element being one of a word, a beat identifier, or a sentence separator identifier; and sequentially generating subsequent elements of the second part after that element according to the rap generation model, based on an updated sequence formed by combining the input sequence and the element.

In some implementations, determining the element of the second part includes: determining a plurality of candidate words for the element according to the rap generation model based on the text representation sequence and the rhyme representation sequence; and, if it is determined that the element is no more than a threshold distance from the end of the sentence to which it belongs: determining, in the sentence preceding the element, the reference word located at the same position as the element in the reverse order; and selecting a candidate word from the plurality of candidate words as the element based on respective similarities between the vowels of the plurality of candidate words and the vowel of the reference word.
In some implementations, at least a portion of the first portion is generated according to the rap generation model.
In a fourth aspect, the present disclosure provides an electronic device. The electronic device includes: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the device to perform acts comprising: obtaining a training sequence corresponding to a first part of a training rap, wherein in the training sequence, the words in the same sentence of the first part are arranged in reverse order, and a beat identifier of a beat of the first part is adjacent to the word corresponding to the beat; determining a text representation sequence and a rhyme representation sequence respectively corresponding to the training sequence; generating a prediction of a second part of the training rap according to a rap generation model based on the text representation sequence and the rhyme representation sequence; and training the rap generation model based on the prediction and the second part of the training rap.

In some implementations, determining the rhyme representation sequence corresponding to the training sequence includes: generating a vowel representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the vowel of each word in the training sequence and a predetermined vowel representation for the beat identifier; generating an intra-sentence position representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the position of each word in the sentence to which it belongs and a predetermined position representation for the beat identifier; and generating a sentence representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the sentence to which each word and beat identifier in the training sequence belongs.

In some implementations, the acts further include: obtaining audio and lyrics of the training rap, the audio including accompaniment and a human voice expressing the lyrics; determining, based on the lyrics and the human voice, a first time at which a word in the lyrics is expressed in the audio; determining a second time of a beat in the audio from the accompaniment; and determining that the beat corresponds to a word whose first time matches the second time.

In some implementations, prior to training the rap generation model, the rap generation model is pre-trained based on at least one of: training lyrics without beats, or training songs with beats of a genre different from the training rap.
In a fifth aspect, the present disclosure provides a computer program product tangibly stored in a non-transitory computer storage medium and comprising machine executable instructions that, when executed by a device, cause the device to perform the method of the first aspect described above.
In a sixth aspect, the present disclosure provides a computer program product tangibly stored in a non-transitory computer storage medium and comprising machine executable instructions that, when executed by a device, cause the device to perform the method of the second aspect described above.
In a seventh aspect, the present disclosure provides a computer-readable medium having stored thereon machine-executable instructions that, when executed by a device, cause the device to perform the method of the first aspect described above.
In an eighth aspect, the present disclosure provides a computer-readable medium having stored thereon machine-executable instructions that, when executed by a device, cause the device to perform the method of the second aspect described above.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (20)

1. A computer-implemented method, comprising:
obtaining an input sequence corresponding to a first part of a rap, wherein in the input sequence, the words in the same sentence of the first part are arranged in reverse order, and a beat identifier of a beat of the first part is adjacent to the word corresponding to the beat;
determining a text representation sequence and a rhyme representation sequence respectively corresponding to the input sequence; and
generating a second part of the rap according to a rap generation model based on the text representation sequence and the rhyme representation sequence.
2. The method of claim 1, wherein determining the rhyme representation sequence corresponding to the input sequence comprises:
generating a vowel representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on a vowel of each word in the input sequence and a predetermined vowel representation for the beat identifier;
generating an intra-sentence position representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on a position of each word in the sentence to which the word belongs and a predetermined position representation for the beat identifier; and
generating a sentence representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on the sentence to which each word and beat identifier in the input sequence belongs.
3. The method of claim 1, wherein generating the second part comprises:
determining an element of the second part according to the rap generation model based on the text representation sequence and the rhyme representation sequence, the element being one of a word, a beat identifier, or a sentence separator identifier; and
sequentially generating subsequent elements of the second part after the element according to the rap generation model, based on an updated sequence formed by combining the input sequence and the element.
4. The method of claim 3, wherein determining the element of the second part comprises:
determining a plurality of candidate words for the element according to the rap generation model based on the text representation sequence and the rhyme representation sequence; and
if it is determined that the element is no more than a threshold distance from the end of the sentence to which it belongs:
determining, in the sentence preceding the element, a reference word located at the same position as the element in the reverse order; and
selecting a candidate word from the plurality of candidate words as the element based on respective similarities between vowels of the plurality of candidate words and a vowel of the reference word.
5. The method of claim 1, wherein at least a portion of the first part is generated according to the rap generation model.
6. A computer-implemented method, comprising:
obtaining a training sequence corresponding to a first part of a training rap, wherein in the training sequence, the words in the same sentence of the first part are arranged in reverse order, and a beat identifier of a beat of the first part is adjacent to the word corresponding to the beat;
determining a text representation sequence and a rhyme representation sequence respectively corresponding to the training sequence;
generating a prediction of a second portion of the training rap according to a rap generation model based on the text representation sequence and the rhyme representation sequence; and
training the rap generation model based on the prediction and the second portion of the training rap.
7. The method of claim 6, wherein determining the rhyme representation sequence corresponding to the training sequence comprises:
generating a vowel representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on a vowel of each word in the training sequence and a predetermined vowel representation for the beat identifier;
generating an intra-sentence position representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on a position of each word in the sentence to which the word belongs and a predetermined position representation for the beat identifier; and
generating a sentence representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the sentence to which each word and beat identifier in the training sequence belongs.
8. The method of claim 6, further comprising:
obtaining audio and lyrics of the training rap, the audio including accompaniment and a human voice expressing the lyrics;
determining a first time at which a word in the lyrics is represented in the audio based on the lyrics and the human voice;
determining a second time of the beat in the audio from the accompaniment; and
determining that the beat corresponds to a word whose first time matches the second time.
9. The method of claim 6, wherein prior to training the rap generating model, the rap generating model is pre-trained based on at least one of:
training lyrics without beats, or
a training song with beats of a genre different from the training rap.
10. An electronic device, comprising:
a processing unit; and
a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the electronic device to perform acts comprising:
obtaining an input sequence corresponding to a first part of a rap, wherein in the input sequence, the words in the same sentence of the first part are arranged in reverse order, and a beat identifier of a beat of the first part is adjacent to the word corresponding to the beat;
determining a text representation sequence and a rhyme representation sequence respectively corresponding to the input sequence; and
generating a second part of the rap according to a rap generation model based on the text representation sequence and the rhyme representation sequence.
11. The electronic device of claim 10, wherein determining the rhyme representation sequence corresponding to the input sequence comprises:
generating a vowel representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on a vowel of each word in the input sequence and a predetermined vowel representation for the beat identifier;
generating an intra-sentence position representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on a position of each word in the sentence to which the word belongs and a predetermined position representation for the beat identifier; and
generating a sentence representation sequence corresponding to the input sequence, as part of the rhyme representation sequence, based on the sentence to which each word and beat identifier in the input sequence belongs.
12. The electronic device of claim 10, wherein generating the second part comprises:
determining an element of the second part according to the rap generation model based on the text representation sequence and the rhyme representation sequence, the element being one of a word, a beat identifier, or a sentence separator identifier; and
sequentially generating subsequent elements of the second part after the element according to the rap generation model, based on an updated sequence formed by combining the input sequence and the element.
13. The electronic device of claim 12, wherein determining the element of the second part comprises:
determining a plurality of candidate words for the element according to the rap generation model based on the text representation sequence and the rhyme representation sequence; and
if it is determined that the element is no more than a threshold distance from the end of the sentence to which it belongs:
determining, in the sentence preceding the element, a reference word located at the same position as the element in the reverse order; and
selecting a candidate word from the plurality of candidate words as the element based on respective similarities between vowels of the plurality of candidate words and a vowel of the reference word.
14. The electronic device of claim 10, wherein at least a portion of the first part is generated according to the rap generation model.
15. An electronic device, comprising:
a processing unit; and
a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the electronic device to perform acts comprising:
obtaining a training sequence corresponding to a first part of a training rap, wherein in the training sequence, the words in the same sentence of the first part are arranged in reverse order, and a beat identifier of a beat of the first part is adjacent to the word corresponding to the beat;
determining a text representation sequence and a rhyme representation sequence respectively corresponding to the training sequence;
generating a prediction of a second portion of the training rap according to a rap generation model based on the text representation sequence and the rhyme representation sequence; and
training the rap generation model based on the prediction and the second portion of the training rap.
16. The electronic device of claim 15, wherein determining the rhyme representation sequence corresponding to the training sequence comprises:
generating a vowel representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on a vowel of each word in the training sequence and a predetermined vowel representation for the beat identifier;
generating an intra-sentence position representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on a position of each word in the sentence to which the word belongs and a predetermined position representation for the beat identifier; and
generating a sentence representation sequence corresponding to the training sequence, as part of the rhyme representation sequence, based on the sentence to which each word and beat identifier in the training sequence belongs.
17. The electronic device of claim 15, the acts further comprising:
obtaining audio and lyrics of the training rap, the audio including accompaniment and a human voice expressing the lyrics;
determining a first time at which a word in the lyrics is represented in the audio based on the lyrics and the human voice;
determining a second time of the beat in the audio from the accompaniment; and
determining that the beat corresponds to a word whose first time matches the second time.
18. The electronic device of claim 15, wherein prior to training the rap generating model, the rap generating model is pre-trained based on at least one of:
training lyrics without beats, or
a training song with beats of a genre different from the training rap.
19. A computer program product comprising machine executable instructions that, when executed by a device, cause the device to perform acts comprising:
obtaining an input sequence corresponding to a first part of a rap, wherein in the input sequence, the words in the same sentence of the first part are arranged in reverse order, and a beat identifier of a beat of the first part is adjacent to the word corresponding to the beat;
determining a text representation sequence and a rhyme representation sequence respectively corresponding to the input sequence; and
generating a second part of the rap according to a rap generation model based on the text representation sequence and the rhyme representation sequence.
20. A computer program product comprising machine executable instructions that, when executed by a device, cause the device to perform acts comprising:
obtaining a training sequence corresponding to a first part of a training rap, wherein in the training sequence, the words in the same sentence of the first part are arranged in reverse order, and a beat identifier of a beat of the first part is adjacent to the word corresponding to the beat;
determining a text representation sequence and a rhyme representation sequence respectively corresponding to the training sequence;
generating a prediction of a second portion of the training rap according to a rap generation model based on the text representation sequence and the rhyme representation sequence; and
training the rap generation model based on the prediction and the second portion of the training rap.
CN202110732470.7A 2021-06-30 2021-06-30 Rap generation Pending CN115547278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110732470.7A CN115547278A (en) 2021-06-30 2021-06-30 Rap generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110732470.7A CN115547278A (en) 2021-06-30 2021-06-30 Rap generation

Publications (1)

Publication Number Publication Date
CN115547278A true CN115547278A (en) 2022-12-30

Family

ID=84717216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110732470.7A Pending CN115547278A (en) 2021-06-30 2021-06-30 Rap generation

Country Status (1)

Country Link
CN (1) CN115547278A (en)

Similar Documents

Publication Publication Date Title
US10891928B2 (en) Automatic song generation
EP3803846B1 (en) Autonomous generation of melody
CN107464559B (en) Combined prediction model construction method and system based on Chinese prosody structure and accents
CN108806655B (en) Automatic generation of songs
US11210470B2 (en) Automatic text segmentation based on relevant context
CN110264991A (en) Training method, phoneme synthesizing method, device, equipment and the storage medium of speech synthesis model
Kaur et al. Automatic speech recognition system for tonal languages: State-of-the-art survey
Kruspe et al. Bootstrapping a System for Phoneme Recognition and Keyword Spotting in Unaccompanied Singing.
CN112151015B (en) Keyword detection method, keyword detection device, electronic equipment and storage medium
Zheng et al. BLSTM-CRF Based End-to-End Prosodic Boundary Prediction with Context Sensitive Embeddings in a Text-to-Speech Front-End.
Goel et al. Cross lingual cross corpus speech emotion recognition
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN113813609A (en) Game music style classification method and device, readable medium and electronic equipment
Lin et al. Hierarchical prosody modeling for Mandarin spontaneous speech
Niu et al. A study on landmark detection based on CTC and its application to pronunciation error detection
Li et al. GCF2-Net: Global-aware cross-modal feature fusion network for speech emotion recognition
CN115547278A (en) Rap generation
Gong et al. Singing voice phoneme segmentation by hierarchically inferring syllable and phoneme onset positions
Kominek Tts from zero: Building synthetic voices for new languages
CN113470612A (en) Music data generation method, device, equipment and storage medium
Mishra et al. Speech emotion recognition and classification using hybrid deep CNN and BiLSTM model
Avci A Pattern Mining Approach for Improving Speech Emotion Recognition
Liu et al. A model of extended paragraph vector for document categorization and trend analysis
Kim Automatic Music Transcription in the Deep Learning Era: Perspectives on Generative Neural Networks
CN117711444B (en) Interaction method, device, equipment and storage medium based on talent expression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination