CN117648575A - Model training method, device and equipment - Google Patents

Model training method, device and equipment

Info

Publication number
CN117648575A
CN117648575A (application number CN202311663744.7A)
Authority
CN
China
Prior art keywords
song
sample
word segmentation
song list
scene
Prior art date
Legal status
Pending
Application number
CN202311663744.7A
Other languages
Chinese (zh)
Inventor
Chen Xiaofeng (陈晓锋)
Current Assignee
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd
Priority to CN202311663744.7A
Publication of CN117648575A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a model training method, device and equipment, and belongs to the technical field of artificial intelligence. The model training method comprises the following steps: obtaining N sample song lists, wherein each sample song list comprises at least one sample song, and N is an integer greater than 1; processing each sample song list to obtain sample scene characteristics and sample song list characteristics of the sample song list; training a first generation model based on the sample scene characteristics and the sample song list characteristics to obtain a second generation model, wherein the second generation model is used for generating display information of the song list, and the display information is associated with the scene characteristics of the song list.

Description

Model training method, device and equipment
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a model training method, device and equipment.
Background
A song list may be a collection of songs, and in a music application, the presentation information of a song list, such as its title and profile, may affect the user's music playing experience. Therefore, for a song list, the generation of the corresponding display information is of great significance.
At present, the generation of song list display information mainly depends on user operations, which is time-consuming, labor-intensive and low in accuracy.
Disclosure of Invention
The embodiments of the application aim to provide a model training method, device and equipment, which can solve the problem that the related art relies on user operations to generate song list display information, which is time-consuming, labor-intensive and low in accuracy.
In a first aspect, an embodiment of the present application provides a model training method, including:
obtaining N sample song lists, wherein each sample song list comprises at least one sample song, and N is an integer greater than 1;
processing each sample song list to obtain sample scene characteristics and sample song list characteristics of the sample song list;
training a first generation model based on the sample scene characteristics and the sample song list characteristics to obtain a second generation model, wherein the second generation model is used for generating display information of the song list, and the display information is associated with the scene characteristics of the song list.
In a second aspect, an embodiment of the present application provides a model training apparatus, including:
the acquisition module is used for acquiring N sample song lists, wherein each sample song list comprises at least one sample song, and N is an integer greater than 1;
the processing module is used for processing each sample song list to obtain sample scene characteristics and sample song list characteristics of the sample song list;
the training module is used for training the first generation model based on the sample scene characteristics and the sample song list characteristics to obtain a second generation model, wherein the second generation model is used for generating display information of the song list, and the display information is associated with the scene characteristics of the song list.
In a third aspect, embodiments of the present application provide an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, the program or instructions implementing the steps of the method as described in the first aspect when executed by the processor.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the method as described in the first aspect.
In a fifth aspect, embodiments of the present application provide a chip, the chip including a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to execute programs or instructions to implement the steps of the method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executed by at least one processor to implement the steps of the method as described in the first aspect.
According to the embodiments of the application, N sample song lists are obtained, each sample song list comprises at least one sample song, and N is an integer greater than 1; each sample song list is processed to obtain sample scene characteristics and sample song list characteristics of the sample song list; and a first generation model is trained based on the sample scene characteristics and the sample song list characteristics to obtain a second generation model, wherein the second generation model is used for generating display information of a song list, and the display information is associated with the scene characteristics of the song list. That is, the embodiments of the application train the first generation model with the sample scene characteristics and the sample song list characteristics of the sample song lists to obtain the second generation model, so the display information of a song list can be generated automatically based on the second generation model, which reduces user operations and saves the user's time and energy. Moreover, the model is trained on song lists from multiple scenes, that is, the scene characteristics of each song list are fully considered, so the song list display information generated by the second generation model can match the scene characteristics of the song list, thereby improving the accuracy of the display information.
Drawings
FIG. 1 is a flowchart of a model training method according to an embodiment of the present application;
FIG. 2 is a flow chart of another model training method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of an attention matrix according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another attention matrix provided by embodiments of the present application;
FIG. 5 is a schematic structural diagram of a self-attention layer according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a training process of a first generation model according to an embodiment of the present application;
fig. 7 is a schematic diagram of a determining process of a sample song name according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a song title augmentation process according to an embodiment of the present application;
FIG. 9 is a flowchart of another model training method provided in an embodiment of the present application;
fig. 10 is a flowchart of an information generating method according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a model training device according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of an information generating apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 14 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application are within the scope of the protection of the present application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in sequences other than those illustrated or otherwise described herein, and that the objects identified by "first," "second," etc. are generally of a type and do not limit the number of objects, for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
Taking the case where the presentation information of a song list includes the song list title as an example, the related art mainly generates the song list title manually. For example, the song list title can be written manually based on the song information contained in the song list, or an appropriate title can be selected for the song list from a candidate set, where the candidate set containing a large number of song list titles needs to be built in advance by a user.
That is, when the related art generates the display information of a song list, it depends on user operations, which is time-consuming, labor-intensive and low in accuracy.
Therefore, the embodiments of the application provide a model training method, device and equipment, which can solve the problem that the related art relies on user operations to generate song list display information, which is time-consuming, labor-intensive and low in accuracy.
The model training method, device and equipment provided by the embodiment of the application are described in detail below with reference to specific embodiments.
Fig. 1 is a flowchart of a model training method according to an embodiment of the present application, and as shown in fig. 1, the model training method may include the following steps:
s110, acquiring N sample songs.
Each sample song includes at least one sample song, N being an integer greater than 1.
S120, processing each sample song list to obtain sample scene characteristics and sample song list characteristics of the sample song list.
S130, training a first generation model based on the sample scene characteristics and the sample song list characteristics to obtain a second generation model.
The second generation model is used for generating display information of the song list, and the display information is associated with scene characteristics of the song list.
According to the embodiments of the application, the first generation model is trained with the sample scene characteristics and the sample song list characteristics of the sample song lists to obtain the second generation model, so the display information of a song list can be generated automatically based on the second generation model, which reduces user operations and saves the user's time and energy. Moreover, because the model is trained on song lists from multiple scenes, that is, the scene characteristics of each song list are fully considered, the song list display information generated by the second generation model can match the scene characteristics of the song list, which improves the accuracy of the display information.
The following describes the above steps in detail, as follows:
In S110, a sample song list may be a song list used for training the model that already contains presentation information. Each song list may include information such as a song list ID, a song list title, a song list profile, a song list category, and a list of songs.
The category of the song list is the scene to which the song list belongs. The list of songs may include, but is not limited to, song name, lyrics, singer, song tag, audio, etc. for each song.
The sample song list can be a song list created by a user, and can also be a song list in a song list library, wherein the song list library can be an online song library or an offline song library.
Each sample song list may include one or more songs, and typically a sample song list contains a large number of songs, which may enhance the training effect of the model. The embodiments of the application record each song contained in a sample song list as one sample song.
Illustratively, the sample song lists selected in the embodiments of the application come from at least two scenes, and may include, for example, ancient-style song lists, grassland-style song lists, square-dance-style song lists, and the like.
In S120, the sample scene features, i.e., the scene features of the sample song lists, may include, but are not limited to, an ancient-style scene, a grassland-style scene, an ethnic-style scene, a children's song scene, and the like.
The word style of songs also differs across scenes; for example, the word style of songs in ancient-style scenes is restrained and subtle, while the word style of songs in grassland-style scenes is bold and expansive. The sample scene features of the N sample song lists in the embodiments of the application cover at least two different scenes. By extracting the scene features of each sample song list, the model can learn the commonalities of song lists of different scenes during subsequent training, which improves the diversity of the model's generation space.
The sample song list feature may be a feature of the sample song list, which may include, but is not limited to, the presentation information of the song list and the song features of the sample songs in the sample song list. The song features may include, but are not limited to, the song name and lyrics of each song, and the like.
The processing manner of the sample song list is not particularly limited, and any processing manner capable of obtaining the display information of the sample song list and the song characteristics of the sample song can be applied to the embodiment of the application.
In S130, the first generation model may be a model that has not yet been trained on the sample scene features and the sample song list features, and is used to generate presentation information of a song list. The structure of the first generation model is not limited in this embodiment; for example, a Transformer, i.e., a neural network model with a self-attention mechanism, may be used. Before being applied to the embodiments of the application, the Transformer may be pre-trained on publicly available natural-language text, so that training time can be shortened and training efficiency improved when it is trained again based on the sample scene characteristics and the sample song list characteristics.
The second generation model is the trained first generation model. The second generation model obtained by the training method provided by the embodiments of the application can make the display information match the scene characteristics of a song list as closely as possible when generating the display information of the song list, so the accuracy of the display information can be improved.
In some embodiments, as shown in fig. 2, the model training method may include the steps of:
s210, acquiring N sample songs.
S220, processing each sample song list to obtain sample scene characteristics and sample song list characteristics of the sample song list.
S230, inputting the sample scene characteristics and the sample song list characteristics into a first generation model to obtain first scene characteristics and first display information.
S240, determining a first loss function value based on the first scene feature and the reference scene feature.
The reference scene features are determined based on the sample scene features.
S250, determining a second loss function value based on the first display information and sample display information of the sample song list.
S260, training the first generation model based on the weighted sum of the first loss function value and the second loss function value to obtain a second generation model.
The processes of S210 and S220 may be referred to the above embodiments, and are not repeated for brevity.
The following describes the other steps in detail, and is specifically as follows:
In S230, the first scene feature and the first display information are the predicted results of the first generation model. That is, the first generation model in the embodiments of the present application may predict the display information of a song list based on the input sample scene feature and sample song list feature, and may further predict a scene feature based on the sample scene feature, the sample song list feature and the predicted display information. In other words, after the sample scene feature and the sample song list feature are input into the first generation model, the first scene feature and the first display information output by the first generation model are obtained.
The learning target of the scene feature is the mean of the same scene features over the plurality of sample song lists. This learning target makes the generation results for scenes of the same type close to each other, that is, for scenes of the same type, the first scene features output by the first generation model tend to be close.
In S240, the reference scene feature may be determined based on the sample scene features of the sample song lists of the same scene. Taking ancient-style song lists as an example, the reference scene feature may be the mean of the sample scene features of the ancient-style sample song lists.
The first loss function value is used to characterize a difference between a first scene feature and a reference scene feature output by the first generation model. Illustratively, the first loss function value may be determined in combination with the following loss function:
Loss1 = -z·log(P(z|(x,y)))
wherein Loss1 is the first loss function value, z is the first scene feature, x is the sample song list feature, y is the first display information, and P(z|(x,y)) represents the conditional probability of the first scene feature given that the first generation model takes the sample song list feature as input and outputs the first display information.
In S250, the sample presentation information is the actual presentation information of the sample song. The second loss function value is used for representing a difference value between the first display information and the sample display information output by the first generation model. Illustratively, the second loss function value may be determined in combination with the following loss function:
Loss2 = -∑(i=1 to N) log(P(yi|x))
wherein Loss2 is the second loss function value, N is the number of word vectors corresponding to the first display information, P(yi|x) represents the conditional probability of obtaining the i-th word vector of the first display information given that the first generation model takes the sample song list feature as input, and 1 ≤ i ≤ N.
It will be appreciated that, when the sample scene feature and the sample song list feature are actually input into the first generation model, they are input in the form of word vectors, so when the first display information is output, the first generation model also outputs it in the form of word vectors, that is, as individual word segments. For example, if the first display information contains 3 word segments, N = 3.
In S260, the first loss function value and the second loss function value may be weighted and summed to obtain a weighted and summed result, and based on the weighted and summed result, model parameters of the first generation model may be updated to gradually optimize the model parameters, and finally converge to the second generation model for which training is completed.
For example, Loss = α×Loss1 + β×Loss2, where Loss is the weighted summation result, α and β are the weights of Loss1 and Loss2, respectively, and the magnitudes of α and β may be determined according to actual needs.
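A minimal sketch of how this weighted objective might be assembled is given below, assuming PyTorch-style tensors; the tensor names, the probability-vector form of the scene prediction and the use of token-level cross-entropy for Loss2 are illustrative assumptions rather than the patent's exact implementation.

import torch
import torch.nn.functional as F

def combined_loss(scene_probs, ref_scene_feature, info_logits, sample_info_ids,
                  alpha=0.5, beta=0.5):
    # Loss1 = -z * log(P(z | (x, y))): pull the predicted scene distribution
    # toward the reference scene feature z (mean of the same-scene features).
    loss1 = -(ref_scene_feature * torch.log(scene_probs + 1e-9)).sum(dim=-1).mean()
    # Loss2: negative log-likelihood of the sample display information,
    # accumulated over its N word-segmentation positions.
    loss2 = F.cross_entropy(info_logits.reshape(-1, info_logits.size(-1)),
                            sample_info_ids.reshape(-1))
    # Weighted sum: Loss = alpha * Loss1 + beta * Loss2.
    return alpha * loss1 + beta * loss2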
Loss updates the model parameters of the first generation model through a back propagation algorithm, so that, on the one hand, the generation result of the model becomes closer and closer to the real output (the first display information output by the model converges toward the sample display information), and on the other hand, the style of the generation result becomes closer and closer to the average style corresponding to the scene feature (the first scene features converge for song lists of the same scene).
That is, the embodiments of the application introduce the scene feature into the loss function, adjusting it from a loss function that only minimizes the error of the first display information to one that simultaneously minimizes the errors of the first display information and the first scene feature. In this way, the scene feature is effectively integrated into the training of the first generation model, the first generation model can independently identify the scene feature, and the generation style can be guided by the scene feature, so that the display information output by the trained second generation model fits the scene feature of the song list better, improving the accuracy of the display information.
Taking the example that the first generation model includes a word segmentation processing layer and a self-attention layer, the above S230 may include the following steps:
s2301, inputting the sample scene features and the sample song features into a word segmentation processing layer for word segmentation processing to obtain a first word segmentation vector and at least one second word segmentation vector.
The first word segmentation vector is a word segmentation vector formed by taking sample scene characteristics as words, and each second word segmentation vector is a word segmentation vector formed by at least one word segmentation contained in sample song list characteristics.
S2302, inputting the first word segmentation vector and each second word segmentation vector into a self-attention layer for processing, and obtaining first scene features and first display information of the sample song.
In the embodiments of the present application, the sample scene feature and the sample song list feature may be features in text form. For example, after the sample scene feature and the sample song list feature are input into the first generation model, the word segmentation processing layer may perform word segmentation processing on them to obtain the word segmentation vectors corresponding to each text feature.
For example, word segmentation may be performed on the sample song list feature: a tokenizer may be used to decompose the text corresponding to the sample song list feature into tokens, where each token is a subsequence of characters in the text. In the embodiments of the present application each token is also referred to as a word segment, so that multiple word segments corresponding to the sample song list feature are obtained.
For example, where the sample song list feature includes lyrics, the lyrics may be decomposed into a plurality of tokens; where the sample song list feature includes the song list title, the title may be decomposed into a plurality of tokens.
In order to enable the model to generate the presentation information of a song list for a specific scene feature while also drawing on the diversity of different scene features, the whole sample scene feature may be used as one word segment. For example, for the sample scene feature "category: [category name]", the whole string "category: [category name]" is directly taken as one word segment rather than being split into the two word segments "category" and "[category name]". In this way, the first generation model can learn a bidirectional attention association between the display information and the unique token corresponding to the scene feature during training, so the sample scene feature is handled specially.
For each word segment obtained above, the word segmentation processing layer can convert the word segment into a word vector, so that the first word segmentation vector corresponding to the sample scene feature and the second word segmentation vectors corresponding to the sample song list feature are obtained.
The first word segmentation vector and each second word segmentation vector are then input into the self-attention layer for processing, so that the first scene feature and the first display information of the sample song list can be obtained.
In the embodiments of the application, when the self-attention model is used to process the sample scene feature and the sample song list feature, the sample scene feature is fixed into the model input as a unique token, so the model can learn the bidirectional attention association between the display information and this unique token during training. The model can therefore generate corresponding display information for a specific scene feature while also learning the commonalities of different scenes, which further improves the training effect of the model. A tokenization sketch is given below.
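As an illustration of treating the whole sample scene feature as a single word segment, the sketch below registers the scene string as one extra token in a generic subword tokenizer; the tokenizer library, the base model and the token format are assumptions made only for illustration.

from transformers import AutoTokenizer  # assumed library, any subword tokenizer would do

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder base tokenizer

# The whole scene feature, e.g. "category: ancient-style", is registered as ONE
# token, so it is never split into "category" and the category name.
scene_token = "category: ancient-style"
tokenizer.add_tokens([scene_token], special_tokens=True)

sample_input = scene_token + " [SEP] sample song names: ..."
ids = tokenizer(sample_input)["input_ids"]
# The id of the scene token corresponds to the first word segmentation vector;
# the remaining ids correspond to the second word segmentation vectors.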
In some embodiments, S2302 may include the steps of:
generating an attention matrix according to the first word segmentation vector and each second word segmentation vector, wherein the positions corresponding to the first word segmentation vector in the attention matrix are all preset values, and the preset values are used for representing that each second word segmentation vector associated with the first word segmentation vector is not shielded when the first word segmentation vector is processed;
multiplying the first word segmentation vector by a first weight matrix to obtain a scene matrix;
multiplying a first matrix formed by each second word segmentation vector with a second weight matrix, a third weight matrix and a fourth weight matrix respectively to obtain a query matrix, a key matrix and a value matrix;
and generating first scene characteristics and first display information of the sample song according to the attention matrix, the scene matrix, the query matrix, the key matrix and the value matrix.
The values at different locations in the self-attention matrix are used to characterize whether the model can see other word-segmentation vectors associated with the word-segmentation vector when processing the word-segmentation vector associated with the corresponding location.
In the embodiment of the present application, an attention matrix may be generated based on the first word-segmentation vector and each second word-segmentation vector, where positions corresponding to the first word-segmentation vector in the attention matrix are all preset values, and the preset values are used for characterizing that each second word-segmentation vector associated with the first word-segmentation vector is not masked when the first word-segmentation vector is processed.
The preset value may be set to 1, or may be set to other values, which are not limited in this embodiment.
Illustratively, the attention matrix may refer to fig. 3, where the dimension of the attention matrix is the same as the input dimension of the first generation model, so that the input of the model can be conveniently processed by using the attention matrix to obtain the output of the model.
Fig. 3 is an attention matrix constructed taking as an example the input vectors of the model (sample scene features and sample song features), i.e. without considering the output vectors of the model (presentation information). In some embodiments, the attention matrix may also be generated based on input and output vectors, fig. 4 being a self-attention matrix constructed by taking the input and output vectors of the model as examples.
The row and the column where the sample scene feature is located are all set to 1. That is, when the sample scene feature is processed, the other related features can be seen at the same time: in fig. 3, the sample song list features located after the sample scene feature can be seen; in fig. 4, both the sample song list features located after the scene feature and the display information predicted by the model can be seen. The model can therefore learn the characteristics of the sample song list features, or of the sample song list features together with the display information, which further improves the training effect of the model. A sketch of how such a matrix could be constructed is given below.
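The following sketch shows one way such an attention matrix could be built, assuming a causal (lower-triangular) baseline mask and that the scene-feature word segment sits at position 0; both of these choices are illustrative assumptions not stated in the description.

import numpy as np

def build_attention_matrix(seq_len, scene_pos=0, preset=1):
    # Assumed baseline: a causal mask where each word segment can only attend
    # to itself and to earlier word segments.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=np.int64))
    # The row and the column of the scene-feature word segment are set to the
    # preset value, so the scene feature is never masked: it can see, and be
    # seen by, every other word segmentation vector.
    mask[scene_pos, :] = preset
    mask[:, scene_pos] = preset
    return mask

print(build_attention_matrix(5))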
The structure of the self-attention layer may refer to fig. 5. In practical applications, the first generation model may include a plurality of self-attention layers, and the structures of the self-attention layers are similar.
The first weight matrix Wt, the second weight matrix Wq, the third weight matrix Wk and the fourth weight matrix Wv are learnable parameters of the self-attention layer, can be continuously optimized through a back propagation algorithm in the training process, and are fixed in the process of using the model after model training is completed.
Firstly, multiplying a first word segmentation vector by a first weight matrix Wt to obtain a scene matrix T; and multiplying the first matrix formed by each second word segmentation vector with the second weight matrix Wq, the third weight matrix Wk and the fourth weight matrix Wv respectively to obtain a query matrix Q, a key matrix K and a value matrix V.
The query matrix Q and the key matrix K are then multiplied and normalized to obtain a matrix P, illustratively P = Q·K^T/√d, as shown in fig. 5, where d is the length of the word segmentation vectors input into the first generation model, that is, the length of the first word segmentation vector or of a second word segmentation vector (the two lengths are the same). Illustratively, d = 4096.
After the matrix P is obtained, the matrix T and the matrix P can be spliced to obtain a spliced matrix L. And then performing mask filling operation on the spliced matrix L by using the obtained attention matrix to obtain a mask filling matrix U.
That is, for each position in the attention matrix, if the value at that position is 1, the value at the position corresponding to the concatenation matrix L is retained, and if the value at that position is 0, the value at the position corresponding to the concatenation matrix L is replaced with minus infinity, whereby the mask filling operation for the concatenation matrix L is achieved.
After the mask filling matrix U is obtained, the mask filling matrix U can be multiplied by the value matrix V, and then the first scene characteristics and the first display information can be obtained through two full-connection layers and one vocabulary probability layer respectively.
The vocabulary probability layer is used to output the word segments with the highest probability in the vocabulary as the first scene feature and the first display information. The vocabulary is a pre-stored data table containing a large number of word segments; by performing the above processing on the first word segmentation vector and the second word segmentation vectors, the first generation model predicts the word segments with the highest probability in the vocabulary and outputs them as the first scene feature and the first display information. The forward computation of one such layer is sketched below.
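The per-layer computation described above might be sketched as follows; the placement of the softmax after mask filling, the tensor shapes and the random toy inputs are assumptions introduced only to obtain a runnable illustration.

import torch
import torch.nn.functional as F

def scene_aware_attention(x_scene, X, attn_mask, Wt, Wq, Wk, Wv):
    # x_scene: (1, d) first word segmentation vector (sample scene feature)
    # X:       (n, d) matrix formed by the second word segmentation vectors
    # attn_mask: (n + 1, n) attention matrix of ones and zeros
    d = X.size(-1)
    T = x_scene @ Wt                              # scene matrix T, here (1, n)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # query, key and value matrices
    P = (Q @ K.t()) / d ** 0.5                    # P = Q.K^T / sqrt(d)
    L = torch.cat([T, P], dim=0)                  # spliced matrix L, (n + 1, n)
    U = L.masked_fill(attn_mask == 0, float("-inf"))  # mask filling matrix U
    return F.softmax(U, dim=-1) @ V               # fed to the fully connected
                                                  # and vocabulary probability layers

# Toy usage with assumed sizes (n = 4 word segments, d = 8).
n, d = 4, 8
out = scene_aware_attention(torch.randn(1, d), torch.randn(n, d),
                            torch.ones(n + 1, n), torch.randn(d, n),
                            torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))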
According to the embodiment of the application, the attention matrix is modified, so that all information of the sample scene characteristics, the sample song list characteristics and the predicted first display information can be seen when the sample scene characteristics are processed, the correlation between the sample scene characteristics and the sample song list characteristics and the first display information can be learned, and further, the model can generate song list display information which is more fit with the scene characteristics, and the training effect of the model is improved.
Taking the example that the presentation information of the song list includes a song list title and a song list profile, the first generation model may include a first song title generation model and a first song profile generation model, and the second generation model may correspondingly include a second song title generation model and a second song profile generation model;
taking the sample song list feature including the sample title of the sample song list and the sample lyrics and sample song names of the respective sample songs as an example, the above S130 may include the following steps:
training a first song title generation model based on sample scene characteristics, sample lyrics and sample song names to obtain a second song title generation model;
training a first song list profile generation model based on the sample scene characteristics, the sample song names and the sample titles to obtain a second song list profile generation model.
The sample song names are the names of songs contained in the sample song list. For example, the names of all songs contained in the sample song list can be taken as the sample song names, or only part of them can be selected; for example, the song names with higher correlation with the title of the sample song list can be selected as the sample song names.
The sample lyrics are lyrics in the song corresponding to the sample song name, for example, all the lyrics of the song corresponding to the sample song name may be used as the sample lyrics, or a part of the lyrics may be selected therefrom, for example, the lyrics similar to the sample song name may be used as the sample lyrics.
The sample title can be the title of the sample song list, or a title similar to the title of the sample song list, which increases the diversity of the sample titles and improves the training effect of the model.
Illustratively, the second song title generation model is used to generate the title of a song list. As shown in fig. 6, it is obtained by training the first song title generation model with the sample scene features, the sample lyrics, and the sample song names as inputs and the sample title as the output.
Illustratively, the input forms of the first song title generation model are: "category: [category name] [SEP] sample song names: [sample song name list]" and "category: [category name] [SEP] sample lyrics: [sample lyrics list]", and the output form is: "[sample title]".
Here, "category: [category name]" represents the sample scene feature, and [SEP] is a separator.
Taking the song list "Saving the song drought: the post-90s favorite ancient-style songs" as an example, the inputs of the first song title generation model are: "category: ancient style [SEP] sample song names: Red Zhao Wish / Miscanthus / …" and "category: ancient style [SEP] sample lyrics: the spring wind brushes past the hair tips, the red yarn … ", and the output is: "Saving the song drought: the post-90s favorite ancient-style songs".
Similarly, the second song profile generation model is used to generate the profile of a song list. As shown in fig. 6, it is obtained by training the first song profile generation model with the sample scene features, the sample song names, and the sample title as inputs and the sample profile as the output.
Illustratively, the input form of the first song profile generation model is: "category: [category name] [SEP] sample song names: [sample song name list] [SEP] sample title: [sample title]", and the output form is: "[sample profile]". Here, "category: [category name]" represents the sample scene feature, and [SEP] is a separator.
Still taking "Saving the song drought: the post-90s favorite ancient-style songs" as an example, the input of the first song profile generation model is: "category: ancient style [SEP] sample song names: Red Zhao Wish / Miscanthus / … [SEP] sample title: Saving the song drought: the post-90s favorite ancient-style songs", and the output is the sample profile: "… red … white hair … ".
Because the class names (sample scene characteristics) are embedded into the input of the model, the information mapping relation between the class names and the model output (song title and song introduction) can be learned in the model training process, so that the commonality of different scene data can be learned, the diversity of the song title and song introduction can be improved, and the personalized song title and song introduction can be generated based on the characteristics of each scene.
In some embodiments, prior to S130, the model training method may further include the steps of:
for each sample song list, determining sample song names from candidate song names according to the first correlation between at least one candidate song name in the sample song list and the sample title, and determining the songs corresponding to the sample song names as sample songs;
for each sample song, splitting the lyrics of the sample song into clauses to obtain at least one lyric clause;
and determining the sample lyrics of the sample song from the lyric clauses according to the first similarity between each lyric clause and the sample song name of the sample song.
The candidate song names are the song names contained in the sample song list. In order to improve the training effect of the model, when extracting the sample song names and the sample lyrics from the sample song list, it is necessary to extract text information that is both highly related to the sample song list and diverse.
Therefore, for each sample song list, the first correlation between each candidate song name and the sample title in the sample song list needs to be determined, and the sample song name is determined from each candidate song name based on the first correlation between each candidate song name and the sample title. For example, a candidate song title with a first degree of correlation greater than a preset threshold may be determined as the sample song title.
The specific determination manner of the first correlation is not limited, for example, the cosine distance between the text vector corresponding to each candidate song name and the text vector corresponding to the sample title may be calculated, and the cosine distance between each candidate song name and the sample title may be determined as the first correlation between each candidate song name and the sample title.
Illustratively, as shown in fig. 7, the candidate song names may be sorted in order of the first degree of correlation from high to low, and then 8 candidate song names may be selected from the candidate song names as sample song names.
For example, the first 8 candidate song names may be used as the sample song names, or the first 3 and the last 5 candidate song names may be used as the sample song names, so that the extracted sample song names are not only representative but also cover the tail information of the sample song list.
Taking the candidate song names shown in fig. 7 as an example, the top 8 song names finally screened out may include: red Zhaowang (0.75)/how to have poetry debt to change wine money (0.73)/Jianzhen Jiuxiao (0.7)/Dongfengzhi (0.68)/Baijiu (0.65)/Tibet (0.6)/mango seed (0.55)/palace center (0.55).
The first 3 song names may include: red Zhaojun (0.75)/how must the poem debt trades for wine money (0.73)/Jianzhen Jiuxiao (0.7), the last 5 song names may include: miscanthus sinensis (0.55)/uterine core (0.55)/cinnabar tear (0.5)/white coat (0.5)/autumn on core (0.5).
The number in brackets indicates the first correlation between the song name and the sample title of the corresponding sample song list. When the sample song names are input into the first song title generation model or the first song profile generation model, the first 8 song names, and the first 3 song names plus the last 5 song names, may be input into the model as two separate inputs. A selection sketch is given below.
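A sketch of this selection step is given below, assuming a generic text-embedding function embed() and cosine similarity as the first correlation; both are assumptions consistent with the description above.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_sample_song_names(candidate_names, sample_title, embed, head=3, tail=5, k=8):
    # embed() is an assumed text-embedding function returning a numpy vector.
    title_vec = embed(sample_title)
    ranked = sorted(candidate_names,
                    key=lambda name: cosine(embed(name), title_vec),
                    reverse=True)
    # Either the first k names, or the first `head` plus the last `tail`
    # names of the ranking, so that tail information is also covered.
    return ranked[:k], ranked[:head] + ranked[-tail:]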
After the sample song names are determined, the songs corresponding to the sample song names can be determined as sample songs, and the sample lyrics can be further determined for each sample song.
For example, for each sample song, the lyrics of the sample song may be split into clauses to obtain at least one lyric clause; the first similarity between each lyric clause and the sample song name of the sample song is then determined, and the lyric clauses whose first similarity is greater than a preset threshold are used as the sample lyrics of the sample song. For example, the lyric clause with the highest first similarity may be determined as the sample lyrics, so that the sample lyrics are highly representative of the sample song.
The determining process of the first similarity between each lyric clause and the sample song name is similar to the determining process of the first correlation between each candidate song name and the sample title, namely, text vectors corresponding to each lyric clause and text vectors corresponding to the sample song name can be respectively determined, and cosine distances of the two text vectors are determined as the first similarity between each lyric clause and the sample song name.
Of course, the first similarity between each lyric clause and the sample song name may be determined in other manners, and the embodiment of the present application is not limited specifically.
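Continuing with the cosine-based approach described above, the clause-level selection might look like the following sketch; the punctuation-based clause splitting, the embed() helper and the threshold value are assumptions.

import re
import numpy as np

def select_sample_lyrics(lyrics_text, song_name, embed, threshold=0.5):
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # Split the lyrics into clauses on common punctuation (assumed delimiters).
    clauses = [c.strip() for c in re.split(r"[，。！？,.!?\n]", lyrics_text) if c.strip()]
    name_vec = embed(song_name)
    scored = [(cosine(embed(c), name_vec), c) for c in clauses]
    # Keep the clauses above the threshold, falling back to the single
    # most similar clause when none exceeds it.
    kept = [c for s, c in scored if s > threshold]
    return kept if kept else [max(scored)[1]]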
According to the embodiments of the application, the first correlation between each candidate song name and the sample title of the sample song list is used to first determine, from the sample song list, candidate song names that are representative and cover the tail information of the sample song list as the sample song names; the lyrics with higher similarity to the sample song names are then further determined from the corresponding sample songs as the sample lyrics. The screened sample song names and sample lyrics therefore have high correlation with, and diversity within, the sample song list, which can improve the training effect of the first song title generation model and the first song profile generation model.
Considering that a song list title is generally short, usually within 20 words, the original title of the sample song list may be augmented in order to increase the diversity of song list titles. Based on this, in some embodiments, the model training method may further include the following steps before S130:
performing word segmentation processing on the original title of the sample song list to obtain a third word segmentation vector of at least one title word segment;
determining, for each third word segmentation vector, a second similarity between the third word segmentation vector and a first center vector and a third similarity between the third word segmentation vector and a second center vector, wherein the first center vector is the center vector of a first word stock corresponding to the sample scene feature of the sample song list, and the second center vector is the center vector of the word segments in the first word stock whose part of speech matches that of the title word segment;
and determining the sample title of the sample song list according to the second similarity and the third similarity.
The first word stock is a word stock corresponding to the sample scene of the sample song list, and may contain a large number of word segments belonging to the same scene.
The first center vector is the average of the word segmentation vectors corresponding to the word segments in the first word stock. Given that different scenes have different word styles, and in order to prevent the augmented title from changing the original word style, the embodiments of the application augment the original title in combination with its scene.
It should be understood that even within the same scene the parts of speech of the word segments may differ, that is, the first word stock may contain word segments of multiple parts of speech; the second center vector is the average of the word segmentation vectors of the word segments in the first word stock that have the same part of speech as the title word segment.
For each third word-segmentation vector, the second similarity between the third word-segmentation vector and the first center vector, and the third similarity between the third word-segmentation vector and the second center vector may be determined, and the determination process of the second similarity and the third similarity may refer to the determination process of the first similarity, which is not repeated herein for brevity.
Based on the second similarity and the third similarity, the sample title of the sample song list may be determined. For example, synonyms for each title word segment may be determined from the first word stock based on the second similarity and the third similarity, and the title obtained after synonym replacement and the original title are collectively used as the sample titles.
When the original titles are augmented, the word styles of different scenes are fully considered, that is, the original titles of different scenes are augmented according to the word stocks of the corresponding scenes, so the diversity of the augmented titles is ensured without changing the word style of the original titles, which can improve the training effect of the model.
In some embodiments, the "determining the sample title of the sample song list according to the second similarity and the third similarity" may include the following steps:
when the second similarity is larger than a first threshold and the third similarity is larger than a second threshold, performing synonym replacement processing on the title word segment to obtain a second title;
and determining the second title and the original title as sample titles of the sample song list under the condition that the similarity between the second title and the original title is larger than a third threshold value.
A second similarity between the third word segmentation vector and the first center vector greater than the first threshold indicates that the third word segmentation vector is similar to the first center vector; similarly, a third similarity between the third word segmentation vector and the second center vector greater than the second threshold indicates that the third word segmentation vector is similar to the second center vector. In this case, synonym replacement can be performed on the title word segment corresponding to the third word segmentation vector. Each title word segment is processed in the same way, and after the processing is finished, the second title can be obtained from the resulting synonyms, as sketched below.
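A sketch of this augmentation step is given below; the dictionary-style interfaces for the word vectors, the scene word stock and the part-of-speech lookup, as well as the threshold values, are assumptions made for illustration.

import numpy as np

def augment_title(title_segments, vectors, scene_lexicon, pos_of, t1=0.6, t2=0.6):
    # vectors: word segment -> embedding; scene_lexicon: part of speech ->
    # same-scene word segments; pos_of: word segment -> part of speech.
    # All of these interfaces and the thresholds t1/t2 are assumptions.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    all_words = [w for words in scene_lexicon.values() for w in words]
    center1 = np.mean([vectors[w] for w in all_words], axis=0)   # first center vector
    out = []
    for seg in title_segments:
        candidates = scene_lexicon.get(pos_of.get(seg, ""), [])
        if candidates and seg in vectors:
            center2 = np.mean([vectors[w] for w in candidates], axis=0)  # second center vector
            v = vectors[seg]
            if cos(v, center1) > t1 and cos(v, center2) > t2:
                # Replace the title word segment with its closest same-scene,
                # same-part-of-speech synonym.
                seg = max(candidates, key=lambda w: cos(vectors[w], v))
        out.append(seg)
    return "".join(out)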
To further enhance the training effect of the model, the second title may be quality-checked after it is obtained, so that the similarity between the augmented second title and the original title is as high as possible.
The embodiments of the present application do not limit the specific manner of quality inspection. For example, the original title and the second title may be converted into a vector A = (a1, a2, …, an) and a vector B = (b1, b2, …, bn) respectively by using a Word2Vec model or another text vector model, and the cosine distance cos(A, B) of the two vectors is then calculated, where a larger value indicates a higher similarity.
Illustratively, when cos(A, B) is greater than the third threshold, the second title is similar to the original title and its quality is satisfactory. The magnitude of the third threshold may be set according to actual needs, for example to 0.9. A quality-check sketch is given below.
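The quality check described above might be sketched as follows, assuming a word-vector mapping (for example from a trained Word2Vec model) and a word segmentation function; taking the sentence vector as the mean of the word vectors is one common but not mandated choice.

import numpy as np

def title_quality_ok(original_title, second_title, word_vectors, cut, threshold=0.9):
    # word_vectors: assumed word -> vector mapping (e.g. from a trained Word2Vec
    # model); cut: assumed word segmentation function returning a list of words.
    def sentence_vector(text):
        vecs = [word_vectors[w] for w in cut(text) if w in word_vectors]
        return np.mean(vecs, axis=0)

    a, b = sentence_vector(original_title), sentence_vector(second_title)
    cos_ab = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    # Keep the augmented second title only when it stays close to the original.
    return cos_ab > threshold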
Illustratively, as shown in FIG. 8, taking the original title "Saving the song drought: the post-90s favorite ancient-style songs" as an example, word segmentation yields the title word segments "saving", "song", "drought", "post-90s favorite", "ancient-style" and "songs". By comparing the third word segmentation vector of each title word segment with the first center vector and the second center vector of the ancient-style word stock, synonyms of each title word segment can be obtained, and synonym replacement yields a set of candidate second titles, each differing from the original title in one or more word segments, with similarities to the original title of 0.8, 0.91, 0.7, 0.89, 0.88 and 0.95 respectively.
Through quality inspection, the second titles meeting the quality requirement can finally be obtained, namely those with similarities of 0.91, 0.95 and 0.9 to the original title.
At this time, the original title and the screened second title meeting the quality requirement can be determined as the sample title.
After same-scene synonym replacement is carried out on each title word segment of the original title, quality inspection is further performed on the obtained second titles by determining their similarity to the original title; the second titles whose similarity is greater than the third threshold are combined with the original title as the sample titles, which improves the quality of the sample titles.
Since the song list profile is generally a long sentence and contains a rich amount of information, the first song list profile generation model is generally capable of generating an output with diversity, and thus the song list profile of the sample song list can be directly used as the sample profile.
After model training is finished, a song list title and a song list profile can be generated, based on the trained model, for a song list uploaded by a user or for a newly added song list in the song list library, and the song list can be put online after its title and profile are generated. The trained model can be further optimized by collecting user feedback data, so that the song list titles and song list profiles generated by the model match the user's preferences.
Based on this, in some embodiments, as shown in fig. 9, the model training method may include the steps of:
s910, obtaining N sample songs.
S920, processing each sample song list to obtain sample scene characteristics and sample song list characteristics of the sample song list.
S930, training the first generation model based on the sample scene characteristics and the sample song list characteristics to obtain a second generation model.
S940, determining similar song lists of a target song list from the song list library, and adding the target song list and the similar song lists into a first set.
The display information of the target song list and the similar song list is obtained based on the second generation model.
S950, ranking the song lists in the first set according to the user's usage feedback data for the song lists in the first set to obtain a first ranking result.
The usage feedback data is generated by the user based on the presentation information of each song list.
S960, determining a first score of a first song list and a second score of a second song list in the first ranking result by using a pre-trained reinforcement model.
The user's usage feedback data for the first song list is better than that for the second song list, and the reinforcement model is generated based on the second generation model.
S970, determining a third loss function value based on the first score and the second score.
And S980, updating the second generation model based on the third loss function value.
The process of S910 to S930 may be referred to the above embodiments, and are not repeated here for brevity.
The following describes the other steps in detail, and is specifically as follows:
in S940, to train an enhancement model that can evaluate user preferences, similar songs for the target song may be determined from a library of songs. Similar songs may be songs that have the same content but different titles and/or profiles, and the same content may be the same (similar) or mostly the same (similar) songs that are included.
The determination of similar songs may be found in the following examples.
After the similar songs are determined, the target song and the similar songs can be added to the first set for further optimizing the second generation model. The display information of the target song list and the similar song list is obtained based on the second generation model.
In S950, the usage feedback data is the feedback data produced by the user when using the song lists in the first set, generated based on the song list titles and song list profiles of the used song lists. The usage feedback data reflects, to some extent, the user's preference for the song list titles and song list profiles.
Illustratively, the usage feedback data may include, but is not limited to, click-through rate, collection, average time of play, and the like.
After the song lists are put online, the user's usage feedback data for the song lists in the first set can be collected. Based on the user's usage feedback data for each song list, the song lists in the first set can be ranked to obtain the first ranking result.
For example, the user's score for each song list may be calculated based on the user's usage feedback data for the song list, and the song lists may be ranked based on these scores to obtain the first ranking result.
Illustratively, the song lists may be arranged in order of score from high to low, resulting in the first ranking result. A ranking sketch is given below.
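A small sketch of this ranking step is given below; the particular feedback fields and the weighting of click-through rate, collection rate and average play time are illustrative assumptions, since the description does not fix a scoring formula.

def rank_by_feedback(song_list_ids, feedback, weights=(0.5, 0.3, 0.2)):
    # feedback[sid] is assumed to contain a click-through rate, a collection
    # (favourite) rate and an average play time; the weights are illustrative.
    def score(sid):
        f = feedback[sid]
        return (weights[0] * f["ctr"] + weights[1] * f["fav_rate"]
                + weights[2] * f["avg_play_time"])
    return sorted(song_list_ids, key=score, reverse=True)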
In S960, the reinforcement model is used to score pre-generated song titles and song profiles, and the reinforcement model may include, for example, a song title reinforcement model for optimizing the second song title generation model and a song profile reinforcement model for optimizing the second song profile generation model.
Illustratively, the reinforcement model may be generated based on the second generation model, for example, the vocabulary probability layer shown in fig. 5 may be replaced by a linear full-connection layer to obtain the reinforcement model.
The first song list may be a song list that is ranked ahead of the second song list in the first ranking result, that is, the user's usage feedback data for the first song list is better than that for the second song list; in other words, the user considers the song list title or song list profile of the first song list to be better than that of the second song list.
Using the reinforcement model, a first score of the first song list and a second score of the second song list may be determined. The first score and the second score may also be referred to as model scores, that is, scores given by the reinforcement model.
Illustratively, the first score and the second score may be determined by:
determining a second scene feature and a first song list feature of the first song list, and a third scene feature and a second song list feature of the second song list;
inputting the second scene feature and the first song list feature into the pre-trained reinforcement model to obtain the first score of the first song list;
and inputting the third scene feature and the second song list feature into the pre-trained reinforcement model to obtain the second score of the second song list.
For example, when the second song list title generation model is optimized based on the song list title reinforcement model, the first song list feature may include the song list title of the first song list and the song names and lyrics in the first song list; when the second song list profile generation model is optimized based on the song list profile reinforcement model, the first song list feature may include the song list title of the first song list, the song list profile, and the song names in the first song list. The determination of the song names and lyrics may refer to the determination of the sample song names and sample lyrics.
That is, the song list title reinforcement model is trained with the scene feature, the song list title, the song names and the lyrics as inputs and the title score of the song list title as output, and the song list profile reinforcement model is trained with the scene feature, the song list title, the song names and the song list profile as inputs and the profile score of the song list profile as output.
Thus, inputting the second scene feature of the first song list, the song list title of the first song list, and the song names and lyrics in the first song list into the song list title reinforcement model gives the song list title score of the first song list; similarly, inputting the third scene feature of the second song list, the song list title of the second song list, and the song names and lyrics in the second song list into the song list title reinforcement model gives the song list title score of the second song list.
Inputting the second scene feature of the first song list, the song list title of the first song list, the song list profile of the first song list, and the song names in the first song list into the song list profile reinforcement model gives the song list profile score of the first song list; similarly, inputting the third scene feature of the second song list, the song list title of the second song list, the song list profile of the second song list, and the song names in the second song list into the song list profile reinforcement model gives the song list profile score of the second song list.
The song list title score and the song list profile score of the first song list are collectively referred to as the first score, and the song list title score and the song list profile score of the second song list are collectively referred to as the second score.
In practical application, the features are input into the reinforcement model as word segmentation vectors, so the reinforcement model outputs a score for each title token and each profile token; the average of the title token scores can therefore be used as the song list title score, and the average of the profile token scores as the song list profile score.
That is, the embodiment of the application builds the reinforcement model from the scene feature, the song list title, the song list profile and other information, so that the reinforcement model can predict, on the basis of the scene feature, the quality of the song list title and the song list profile output by the second generation model, which makes the optimization of the second generation model more accurate.
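For illustration only, the following Python sketch shows one way such a reinforcement model head could be built and used. Replacing the vocabulary probability layer with a linear fully-connected layer follows the description above; the class name, the hidden size, and the use of PyTorch are assumptions made for the example and are not part of this application.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Hypothetical reinforcement-model head: the vocabulary-probability layer of the
    generation model is replaced with a linear fully-connected layer that outputs one
    scalar score per input token (word segmentation vector)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)  # one score per token

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (seq_len, hidden_size) hidden states from the generation model body
        return self.score(token_states).squeeze(-1)  # (seq_len,)

def pooled_score(token_scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average only the scores of the tokens belonging to the title (or profile)."""
    return (token_scores * mask).sum() / mask.sum()

# Example: 10 tokens, of which the last 4 belong to the song list title
hidden = torch.randn(10, 768)
head = RewardHead(768)
scores = head(hidden)
title_mask = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=torch.float)
title_score = pooled_score(scores, title_mask)  # song list title score
```

In this sketch the per-token scores play the role of the title-token and profile-token scores described above, and the masked average corresponds to taking the mean of the title (or profile) token scores as the song list title (or profile) score.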
In S970, a third loss function value may be calculated based on the first score and the second score. Illustratively, the third loss function value may be determined in combination with the following loss function:
Loss3=-log[rank(A)-rank(B)]
where Loss3 is the third loss function value, rank(A) is the song list title score (or song list profile score) of the first song list, and rank(B) is the song list title score (or song list profile score) of the second song list. The objective of this loss function is to maximize the difference between the song list title scores of the first song list and the second song list, or between their song list profile scores.
In the embodiment of the application, the song lists are ranked using only the user's usage feedback data, and the loss function value is then calculated from the ranking result in combination with the ranking loss function. This avoids the inaccuracy introduced when a traditional loss function computes a mean squared error between the model output score and a reference score, and thus improves the optimization effect of the model.
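As an illustration only, a minimal sketch of this pairwise ranking loss is given below; passing the score difference through a sigmoid is an added assumption (not stated above) used to keep the logarithm well defined when the difference is not positive.

```python
import torch

def ranking_loss(score_first: torch.Tensor, score_second: torch.Tensor) -> torch.Tensor:
    # score_first: reinforcement-model score of the first (better-ranked) song list, rank(A)
    # score_second: reinforcement-model score of the second song list, rank(B)
    # Loss3 = -log[rank(A) - rank(B)]; the sigmoid is an assumed numerical safeguard.
    return -torch.log(torch.sigmoid(score_first - score_second))

loss3 = ranking_loss(torch.tensor(0.8), torch.tensor(0.3))  # smaller when the gap is larger
```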
In S980, the second generation model may be updated based on the third loss function value, so that the song title and the song profile output by the updated second generation model better conform to the preference of the user.
According to the method and the device for generating the song list, the user is combined with feedback data of the song list, the song lists are ordered, a third loss function value is calculated by combining the ordering loss function, and the second generation model is updated based on the third loss function value, so that the song list title and the song list profile output by the second generation model can be more in accordance with the preference of the user.
In some embodiments, the step S950 may include the steps of:
for each song list in the first set, scoring the display information of the song list according to the user's usage feedback data for the song list to obtain a third score, wherein the usage feedback data comprises at least one of the click rate, the collection and the average play duration;
and ranking the song lists in the first set according to the third scores of the song lists in the first set to obtain the first ranking result.
For example, the song list title and the song list profile may each be scored based on the user's usage feedback data for the song list, giving the third score.
Based on the third scores of the song lists, the song lists in the first set can be ranked to obtain the first ranking result.
For example, the song list titles may be ranked based on the song list title scores to obtain a song list title ranking result, and the song list profiles may be ranked based on the song list profile scores to obtain a song list profile ranking result. The song list title ranking result and the song list profile ranking result are collectively referred to as the first ranking result.
The usage feedback data may include at least one of the click rate, the collection, and the average play duration.
In the embodiment of the application, the song list titles and the song list profiles of the song lists are ranked separately based on the user's usage feedback data, so that the loss function value can subsequently be calculated in combination with the ranking loss function, which improves the quality of the second generation model.
Taking as an example that the usage feedback data includes the click rate, the collection and the average play duration, the above step of "scoring the display information of the song list according to the user's usage feedback data for the song list to obtain a third score" may include the following steps:
normalizing the click rate, the collection and the average play duration respectively;
performing a weighted summation of the normalized click rate, collection and average play duration to obtain a first summation result;
and determining the first summation result as the third score of the song list.
Because the items of usage feedback data have different measurement units, they need to be normalized before the third score of each song list can be calculated; that is, the click rate, the collection and the average play duration are each normalized.
The click rate, the collection and the average play duration are normalized in a similar manner; taking the click rate as an example, click_norm = (click - click_ave) / (click_max - click_min).
Here, click_norm is the normalized click rate, click is the original click rate, click_ave is the average click rate, and click_max and click_min are the maximum and minimum click rates.
The normalized click rate, collection and average play duration are then weighted and summed to obtain the third score.
Illustratively, s1 = a1×click_norm + b1×like_norm + c1×play_norm, and s2 = a2×click_norm + b2×like_norm + c2×play_norm.
Here, s1 is the song list title score, s2 is the song list profile score, click_norm, like_norm and play_norm are the normalized click rate, collection and average play duration respectively, a1, b1 and c1 are the weights in the song list title score, and a2, b2 and c2 are the weights in the song list profile score.
In practical application, the click rate largely reflects the user's feedback on the song list title, and the collection reflects, to a certain extent, the user's feedback on the song list profile. Based on this, the weights may, for example, be set as follows: a1 = 0.9, b1 = 0.05, c1 = 0.05; a2 = -0.1, b2 = 1.2, c2 = -0.1. Of course, other values may be set; the embodiments of the present application are not specifically limited in this respect.
In the embodiment of the application, the click rate, the collection and the average play duration of the song list are normalized, and the normalized values are weighted and summed to obtain the third score, which provides a basis for the subsequent ranking of the song lists.
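A minimal sketch of this scoring step is given below; the function names are illustrative, and the weight values are simply the example values quoted above.

```python
def normalize(value: float, average: float, maximum: float, minimum: float) -> float:
    # Normalization as exemplified above, e.g. click_norm = (click - click_ave) / (click_max - click_min)
    return (value - average) / (maximum - minimum)

def third_scores(click_norm: float, like_norm: float, play_norm: float) -> tuple:
    # Weighted sums for the song list title score s1 and the song list profile score s2,
    # using the example weights above; other values may of course be chosen.
    s1 = 0.9 * click_norm + 0.05 * like_norm + 0.05 * play_norm
    s2 = -0.1 * click_norm + 1.2 * like_norm - 0.1 * play_norm
    return s1, s2
```

The song lists in the first set can then be ranked in descending order of these scores to obtain the first ranking result.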
In some embodiments, the step S940 may include the following steps:
determining song attribute parameters of first songs contained in each third song list in the song list library and song attribute parameters of second songs contained in the target song list, wherein the song attribute parameters comprise at least one of a song label and lyrics;
and determining the similar song lists of the target song list from the song list library according to the song attribute parameters of the first songs and the song attribute parameters of the second songs.
The third song list is any song list in the song list library. A first song is a song contained in the third song list, and there may be one or more first songs. A second song is a song contained in the target song list, and there may be one or more second songs.
Illustratively, the song attribute parameters may include, but are not limited to, song labels and lyrics. A song label characterizes a song; a song may have one or more labels, and when there are several, the label that best represents the song may be selected.
For example, similar song lists of the target song list may be determined from the song list library based on the song labels of the songs. For example, if the number of song labels in the third song list that are identical to those of the target song list is greater than a preset threshold, the third song list may be determined to be a similar song list of the target song list.
For example, similar song lists of the target song list may also be determined from the song list library based on the lyrics of the songs. For example, if the number of songs in the third song list whose lyrics are the same as or similar to those of songs in the target song list is greater than a preset threshold, the third song list may be determined to be a similar song list of the target song list.
For example, similar song lists of the target song list may also be determined from the song list library based on both the song labels and the lyrics. For example, candidate similar song lists may first be selected from the song list library based on the song labels, and the similar song lists of the target song list may then be further determined from the candidates based on the lyrics of the songs they contain.
In the embodiment of the application, similar song lists of the target song list are determined from the song list library based on the song attribute parameters of the songs, which enriches the data volume of the target song list, so that usage feedback data of users can be collected and a reinforcement model capable of evaluating the user's preference can be trained.
Taking as an example that the song attribute parameters comprise a song label and lyrics, the song attribute parameters of a first song comprise a first song label and first lyrics, and the song attribute parameters of a second song comprise a second song label and second lyrics;
accordingly, the above step of "determining the similar song lists of the target song list from the song list library according to the song attribute parameters of the first songs and the song attribute parameters of the second songs" may include the following steps:
for each third song list, matching each first song label of the third song list with each second song label of the target song list to obtain a first song number, i.e. the number of songs in the third song list whose labels are the same as the second song labels of the target song list;
in the case where the first song number is greater than a fourth threshold, matching the first lyrics of the first songs in the third song list with the second lyrics of the second songs in the target song list to obtain a second song number, i.e. the number of songs whose first lyrics have a similarity greater than a fourth similarity with the second lyrics of the target song list;
and in the case where the second song number is greater than a fifth threshold, determining the third song list as a similar song list of the target song list.
For each third song list, the first song labels of the third song list can be matched with the second song labels of the target song list to determine the number of identical song labels shared by the third song list and the target song list. Since one song label corresponds to one song, the first song number, i.e. the number of songs in the third song list whose labels are the same as those of the target song list, can be obtained from this count.
If the first song number is greater than the fourth threshold, whether the third song list is a similar song list of the target song list may be further determined based on the lyrics. Illustratively, the fourth threshold may be set to s×85%, where s is the number of songs contained in the third song list.
For example, for each first song in the third song list, the similarity between the first lyrics of that song and the second lyrics of the second songs may be determined, and the number of songs whose similarity is greater than the fourth similarity, i.e. the second song number, may then be counted for the third song list.
If the second song number is greater than the fifth threshold, the third song list may be determined to be a similar song list of the target song list. Illustratively, the fifth threshold may be set to s×80%.
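For illustration only, the following sketch captures the two-stage check described above. The data layout (one label and one lyric string per song) and the lyric_similarity helper are assumptions, and the thresholds are simply the example values s×85% and s×80%.

```python
from typing import Callable, List, Tuple

def is_similar_song_list(third_songs: List[Tuple[str, str]],
                         target_songs: List[Tuple[str, str]],
                         lyric_similarity: Callable[[str, str], float],
                         fourth_similarity: float) -> bool:
    # third_songs: (song label, lyrics) of the first songs in the third song list
    # target_songs: (song label, lyrics) of the second songs in the target song list
    s = len(third_songs)
    target_labels = {label for label, _ in target_songs}

    # Stage 1: first song number = songs in the third song list whose label also
    # appears among the second song labels of the target song list.
    first_song_number = sum(1 for label, _ in third_songs if label in target_labels)
    if first_song_number <= s * 0.85:          # fourth threshold, e.g. s x 85%
        return False

    # Stage 2: second song number = songs whose lyrics are sufficiently similar
    # to the lyrics of some second song in the target song list.
    second_song_number = sum(
        1 for _, lyrics in third_songs
        if any(lyric_similarity(lyrics, t_lyrics) > fourth_similarity
               for _, t_lyrics in target_songs)
    )
    return second_song_number > s * 0.80       # fifth threshold, e.g. s x 80%
```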
In practical application, the similar song lists of the target song list can also be determined in other ways; the embodiments of the present application are not limited in this respect.
In the embodiment of the application, similar song lists of the target song list are determined from the song list library based on the song labels and the lyrics, which facilitates subsequently training a reinforcement model capable of evaluating the user's preference, optimizing the second generation model, and improving the quality of the second generation model.
For example, the reinforcement model may be used periodically (e.g., every month) to improve the generation quality of the second generation model.
That is, in the embodiment of the application, the second generation model is first used to generate the song list title and song list profile of a song list; user usage feedback data is then accumulated online, a user feedback score is determined from that data, the song lists are ranked according to the user feedback scores, model scores output by the reinforcement model are obtained, a loss function value is derived, and the second generation model is updated based on the loss function value, forming a cycle.
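The cycle described above can be outlined roughly as follows. Every callable in this sketch is an assumed stand-in for a component described elsewhere in this application, and pairing adjacent entries of the ranking is only one possible reading of how the first and second song lists are chosen.

```python
import math
from typing import Callable, List

def optimize_once(first_set: List[dict],
                  feedback_score: Callable[[dict], float],
                  reward_score: Callable[[dict], float],
                  apply_update: Callable[[float], None]) -> None:
    # 1. Rank the song lists in the first set by the user feedback score (first ranking result).
    ranking = sorted(first_set, key=feedback_score, reverse=True)
    # 2. For each adjacent (better, worse) pair, score both with the reinforcement model
    #    and update the second generation model with the ranking loss (third loss function value).
    for better, worse in zip(ranking, ranking[1:]):
        a, b = reward_score(better), reward_score(worse)     # first score, second score
        loss3 = -math.log(1.0 / (1.0 + math.exp(-(a - b))))  # sigmoid clamp is an assumption
        apply_update(loss3)
```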
In the embodiment of the application, the first generation model is trained with the information of existing song lists (song list title, song list profile, scene feature and song list feature), so that the model acquires the ability to generate song list titles and song list profiles according to the scene feature; meanwhile, the reinforcement model is trained based on the online users' feedback data on the song lists, and the generation quality of the second generation model is further improved through the reinforcement model, so that the optimized second generation model can generate song list titles and song list profiles that match the users' preferences.
As shown in fig. 10, the embodiment of the application further provides an information generating method, which includes the following steps:
s1010, acquiring a fourth song list.
S1020, processing the fourth song list to obtain fourth scene characteristics and fourth song list characteristics of the fourth song list.
S1030, inputting the fourth scene feature and the fourth song list feature into the second generation model to obtain the display information of the fourth song list.
Wherein the second generation model is trained based on the model training method described in the above embodiments.
After the fourth scene feature and the fourth song list feature are determined, they may be input into the second generation model, so that the song list title and the song list profile corresponding to the scene of the fourth song list are obtained, which reduces user operations and improves the efficiency of generating song list titles and song list profiles.
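For illustration, the information generation step can be pictured as follows; the generate() interface on the trained second generation model and the extract_features helper are assumptions used only to show the data flow of S1010 to S1030.

```python
from typing import Callable, Tuple

def generate_display_info(second_generation_model,
                          extract_features: Callable[[dict], Tuple[str, str]],
                          fourth_song_list: dict):
    # S1020: process the fourth song list to obtain its fourth scene feature and
    #        fourth song list feature (e.g. song names, lyrics, labels).
    scene_feature, song_list_feature = extract_features(fourth_song_list)
    # S1030: the trained second generation model produces the song list title and profile
    #        matching that scene; generate() is an assumed interface, not named above.
    return second_generation_model.generate(scene_feature, song_list_feature)
```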
It should be noted that, in the model training method provided in the embodiments of the present application, the execution subject may be a model training device, or a processing module in the model training device for executing the model training method. In the embodiments of the present application, the model training device is described by taking as an example the case in which the model training device executes the model training method.
Fig. 11 is a schematic structural diagram of a model training device according to an embodiment of the present application.
As shown in fig. 11, the model training apparatus 1100 may include:
an obtaining module 1101, configured to obtain N sample song lists, each sample song list including at least one sample song, where N is an integer greater than 1;
the processing module 1102 is configured to process each sample song list to obtain sample scene features and sample song list features of the sample song list;
the training module 1103 is configured to train the first generation model based on the sample scene features and the sample song list features to obtain a second generation model, where the second generation model is used to generate display information of a song list, and the display information is associated with the scene features of the song list.
In the embodiment of the application, the first generation model is trained with the sample scene features and the sample song list features of the sample song lists to obtain the second generation model, so that the display information of a song list can be generated automatically based on the second generation model, which reduces user operations and saves the user's time and energy. Moreover, the training is based on song lists of multiple scenes, that is, the scene features of each song list are fully considered, so that the song list display information generated based on the second generation model matches the scene features of the song list, thereby improving the accuracy of the display information.
In some possible implementations of embodiments of the present application, the training module 1103 is specifically configured to:
inputting sample scene characteristics and sample song list characteristics into a first generation model to obtain first scene characteristics and first display information;
determining a first loss function value based on the first scene feature and a reference scene feature, the reference scene feature being determined based on the sample scene feature;
determining a second loss function value based on the first presentation information and the reference presentation information;
and training the first generation model based on the weighted sum of the first loss function value and the second loss function value to obtain a second generation model.
In some possible implementations of embodiments of the present application, the first generation model includes a word segmentation processing layer and a self-attention layer;
the training module 1103 is specifically configured to:
inputting the sample scene characteristics and the sample song list characteristics into a word segmentation processing layer for word segmentation processing to obtain a first word segmentation vector and at least one second word segmentation vector, wherein the first word segmentation vector is a word segmentation vector formed by taking the sample scene characteristics as word segmentation, and each second word segmentation vector is a word segmentation vector formed based on at least one word segmentation contained in the sample song list characteristics;
and inputting the first word segmentation vector and each second word segmentation vector into the self-attention layer for processing to obtain the first scene features and the first display information of the sample song list.
In some possible implementations of embodiments of the present application, the training module 1103 is specifically configured to:
generating an attention matrix according to the first word segmentation vector and each second word segmentation vector, wherein the positions of the attention matrix corresponding to the first word segmentation vector are all preset values, and the preset values are used for representing that each second word segmentation vector associated with the first word segmentation vector is not shielded when the first word segmentation vector is processed;
multiplying the first word segmentation vector by a first weight matrix to obtain a scene matrix;
multiplying a first matrix formed by each second word segmentation vector with a second weight matrix, a third weight matrix and a fourth weight matrix respectively to obtain a query matrix, a key matrix and a value matrix;
and generating the first scene features and the first display information of the sample song list according to the attention matrix, the scene matrix, the query matrix, the key matrix and the value matrix.
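Purely as an illustration of the attention step listed above, the sketch below builds the attention matrix (with the positions of the first, scene, word segmentation vector left unmasked), forms the scene, query, key and value matrices, and computes the attention output. Prepending the scene matrix as an extra token, and the way the output is split into a scene feature and token features, are assumptions, since the exact combination is not spelled out in this passage.

```python
import math
import torch
import torch.nn.functional as F

def scene_masked_attention(scene_vec, content_vecs, W1, W2, W3, W4):
    # scene_vec: first word segmentation vector (scene feature as a token), shape (d,)
    # content_vecs: second word segmentation vectors from the song list feature, shape (n, d)
    n, d = content_vecs.shape

    scene = scene_vec @ W1        # scene matrix
    q = content_vecs @ W2         # query matrix
    k = content_vecs @ W3         # key matrix
    v = content_vecs @ W4         # value matrix

    # Attention matrix: causal over the content tokens, but the positions corresponding to
    # the scene vector are preset so the tokens associated with it are never shielded.
    mask = torch.full((n + 1, n + 1), float("-inf")).triu(1)
    mask[0, :] = 0.0              # the scene position attends to every token
    mask[:, 0] = 0.0              # every token may attend to the scene position

    # Assumption: the scene matrix participates as an extra, unmasked token.
    q_all = torch.cat([scene.unsqueeze(0), q], dim=0)
    k_all = torch.cat([scene.unsqueeze(0), k], dim=0)
    v_all = torch.cat([scene.unsqueeze(0), v], dim=0)

    attn = F.softmax(q_all @ k_all.T / math.sqrt(d) + mask, dim=-1)
    out = attn @ v_all
    return out[0], out[1:]        # (first scene feature, token features for the display information)
```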
In some possible implementations of the embodiments of the present application, the processing module 1102 is further configured to, after the training module 1103 trains the first generation model based on the sample scene feature and the sample song list feature, obtain the second generation model, determine a similar song list of the target song list from the song list library, and add the target song list and the similar song list to the first set, where the display information of the target song list and the similar song list is obtained based on the second generation model;
ranking each song list in the first set according to the user's usage feedback data for each song list in the first set to obtain a first ranking result, wherein the usage feedback data is generated by the user based on the display information of each song list;
determining a first score of a first song list and a second score of a second song list in the first ranking result by using a pre-trained reinforcement model, wherein the user's usage feedback data for the first song list is better than that for the second song list, and the reinforcement model is generated based on the second generation model;
determining a third loss function value based on the first score and the second score;
the second generation model is updated based on the third loss function value.
In some possible implementations of embodiments of the present application, the processing module 1102 is specifically configured to:
for each song list in the first set, scoring the display information of the song list according to the user's usage feedback data for the song list to obtain a third score, wherein the usage feedback data comprises at least one of the click rate, the collection and the average play duration;
and ranking the song lists in the first set according to the third scores of the song lists in the first set to obtain the first ranking result.
In the embodiment of the application, the first generation model is trained with the information of existing song lists (song list title, song list profile, scene feature and song list feature), so that the model acquires the ability to generate song list titles and song list profiles according to the scene feature; meanwhile, the reinforcement model is trained based on the online users' feedback data on the song lists, and the generation quality of the second generation model is further improved through the reinforcement model, so that the optimized second generation model can generate song list titles and song list profiles that match the users' preferences.
In the information generating method provided in the embodiments of the present application, the execution subject may be an information generating apparatus, or a processing module in the information generating apparatus for executing the information generating method. In the embodiments of the present application, the information generating apparatus provided in the embodiments of the present application is described by taking as an example the case in which the information generating apparatus executes the information generating method.
Fig. 12 is a schematic structural diagram of an information generating apparatus according to an embodiment of the present application.
As shown in fig. 12, the information generating apparatus 1200 may include:
an obtaining module 1201, configured to obtain a fourth song list;
the processing module 1202 is configured to process the fourth song list to obtain a fourth scene feature and a fourth song list feature of the fourth song list;
The generating module 1203 is configured to input the fourth scene feature and the fourth song list feature into the second generating model to obtain display information of the fourth song list;
wherein the second generation model is trained based on the model training method of the above embodiments.
The model training device and the information generating device in the embodiments of the present application may be devices, or may be components in an electronic device, for example, an integrated circuit or a chip. The electronic device may be a terminal, or may be other devices than a terminal. By way of example, the electronic device may be a mobile phone, tablet computer, notebook computer, palm computer, vehicle-mounted electronic device, mobile internet appliance (Mobile Internet Device, MID), augmented reality (augmented reality, AR)/Virtual Reality (VR) device, robot, wearable device, ultra-mobile personal computer, UMPC, netbook or personal digital assistant (personal digital assistant, PDA), etc., but may also be a server, network attached storage (Network Attached Storage, NAS), personal computer (personal computer, PC), television (TV), teller machine or self-service machine, etc., and the embodiments of the present application are not limited in particular.
The electronic device in the embodiment of the application may be an electronic device having an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, which are not specifically limited in the embodiments of the present application.
The model training device provided in the embodiment of the present application can implement each process in the embodiment of the model training method in fig. 1 to 9, and in order to avoid repetition, a detailed description is omitted here. The information generating device provided in the embodiment of the present application can implement each process in the embodiment of the information generating method in fig. 10, and in order to avoid repetition, a detailed description is omitted here.
As shown in fig. 13, the embodiment of the present application further provides an electronic device 1300, including a processor 1301 and a memory 1302, where the memory 1302 stores a program or instructions that can be executed on the processor 1301, and the program or instructions, when executed by the processor 1301, implement each step of the above model training method or information generating method embodiments and achieve the same technical effects; to avoid repetition, details are not described here again.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 14 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
The electronic device 1400 includes, but is not limited to: radio frequency unit 1401, network module 1402, audio output unit 1403, input unit 1404, sensor 1405, display unit 1406, user input unit 1407, interface unit 1408, memory 1409, and processor 1410.
Those skilled in the art will appreciate that the electronic device 1400 may also include a power source (e.g., a battery) for powering the various components, which may be logically connected to the processor 1410 by a power management system to perform functions such as managing charging, discharging, and power consumption by the power management system. The electronic device structure shown in fig. 14 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than shown, or may combine certain components, or may be arranged in different components, which are not described in detail herein.
In some embodiments, when the electronic device 1400 performs the model training method illustrated in fig. 1-9, the components may perform the following functions:
the processor 1410 is configured to obtain N sample song lists, where each sample song list includes at least one sample song, and N is an integer greater than 1;
Processing each sample song list to obtain sample scene characteristics and sample song list characteristics of the sample song list;
training a first generation model based on the sample scene characteristics and the sample song list characteristics to obtain a second generation model, wherein the second generation model is used for generating display information of the song list, and the display information is associated with the scene characteristics of the song list.
In the embodiment of the application, the first generation model is trained with the sample scene features and the sample song list features of the sample song lists to obtain the second generation model, so that the display information of a song list can be generated automatically based on the second generation model, which reduces user operations and saves the user's time and energy. Moreover, the training is based on song lists of multiple scenes, that is, the scene features of each song list are fully considered, so that the song list display information generated based on the second generation model matches the scene features of the song list, thereby improving the accuracy of the display information.
In some possible implementations of embodiments of the present application, the processor 1410 is specifically configured to:
inputting sample scene characteristics and sample song list characteristics into a first generation model to obtain first scene characteristics and first display information;
Determining a first loss function value based on the first scene feature and a reference scene feature, the reference scene feature being determined based on the sample scene feature;
determining a second loss function value based on the first presentation information and the reference presentation information;
and training the first generation model based on the weighted sum of the first loss function value and the second loss function value to obtain a second generation model.
In some possible implementations of embodiments of the present application, the first generation model includes a word segmentation processing layer and a self-attention layer;
the processor 1410 is specifically configured to:
inputting the sample scene characteristics and the sample song list characteristics into a word segmentation processing layer for word segmentation processing to obtain a first word segmentation vector and at least one second word segmentation vector, wherein the first word segmentation vector is a word segmentation vector formed by taking the sample scene characteristics as word segmentation, and each second word segmentation vector is a word segmentation vector formed based on at least one word segmentation contained in the sample song list characteristics;
and inputting the first word segmentation vector and each second word segmentation vector into the self-attention layer for processing to obtain the first scene features and the first display information of the sample song list.
In some possible implementations of embodiments of the present application, the processor 1410 is specifically configured to:
generating an attention matrix according to the first word segmentation vector and each second word segmentation vector, wherein the positions of the attention matrix corresponding to the first word segmentation vector are all preset values, and the preset values are used for representing that each second word segmentation vector associated with the first word segmentation vector is not shielded when the first word segmentation vector is processed;
Multiplying the first word segmentation vector by a first weight matrix to obtain a scene matrix;
multiplying a first matrix formed by each second word segmentation vector with a second weight matrix, a third weight matrix and a fourth weight matrix respectively to obtain a query matrix, a key matrix and a value matrix;
and generating the first scene features and the first display information of the sample song list according to the attention matrix, the scene matrix, the query matrix, the key matrix and the value matrix.
In some possible implementations of embodiments of the present application, the processor 1410 is specifically configured to:
determining a similar song list of the target song list from the song list library, adding the target song list and the similar song list into the first set, wherein the display information of the target song list and the similar song list is obtained based on the second generation model;
ranking each song list in the first set according to the user's usage feedback data for each song list in the first set to obtain a first ranking result, wherein the usage feedback data is generated by the user based on the display information of each song list;
determining a first score of a first song list and a second score of a second song list in the first ranking result by using a pre-trained reinforcement model, wherein the user's usage feedback data for the first song list is better than that for the second song list, and the reinforcement model is generated based on the second generation model;
Determining a third loss function value based on the first score and the second score;
the second generation model is updated based on the third loss function value.
In some possible implementations of embodiments of the present application, the processor 1410 is specifically configured to:
for each song list in the first set, scoring the display information of the song list according to the user's usage feedback data for the song list to obtain a third score, wherein the usage feedback data comprises at least one of the click rate, the collection and the average play duration;
and ranking the song lists in the first set according to the third scores of the song lists in the first set to obtain the first ranking result.
In the embodiment of the application, the first generation model is trained with the information of existing song lists (song list title, song list profile, scene feature and song list feature), so that the model acquires the ability to generate song list titles and song list profiles according to the scene feature; meanwhile, the reinforcement model is trained based on the online users' feedback data on the song lists, and the generation quality of the second generation model is further improved through the reinforcement model, so that the optimized second generation model can generate song list titles and song list profiles that match the users' preferences.
In some embodiments, when the electronic device 1400 performs the information generation method shown in fig. 10, the components may implement the following functions:
Wherein, the processor 1410 is configured to obtain a fourth song list;
processing the fourth song list to obtain a fourth scene characteristic and a fourth song list characteristic of the fourth song list;
inputting the fourth scene characteristics and the fourth song list characteristics into a second generation model to obtain display information of the fourth song list;
wherein the second generation model is trained based on the model training method of the above embodiments.
By inputting the fourth scene feature and the fourth song list feature into the second generation model, the song list title and the song list profile corresponding to the scene of the fourth song list can be obtained, which reduces user operations and improves the efficiency of generating song list titles and song list profiles.
It should be appreciated that in embodiments of the present application, the input unit 1404 may include a graphics processor (Graphics Processing Unit, GPU) 14041 and a microphone 14042, with the graphics processor 14041 processing image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The display unit 1406 may include a display panel 14061, and the display panel 14061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1407 includes at least one of a touch panel 14071 and other input devices 14072. The touch panel 14071 is also referred to as a touch screen. The touch panel 14071 may include two parts, a touch detection device and a touch controller. Other input devices 14072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and so forth, which are not described in detail herein.
Memory 1409 may be used to store software programs as well as various data. The memory 1409 may mainly include a first memory area storing programs or instructions and a second memory area storing data, wherein the first memory area may store an operating system, application programs or instructions required for at least one function (such as a sound playing function, an image playing function, etc.), and the like. Further, the memory 1409 may include volatile memory or nonvolatile memory, or the memory 1409 may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be Random Access Memory (RAM), Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), or Direct Rambus RAM (DRRAM). Memory 1409 in embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
Processor 1410 may include one or more processing units; optionally, the processor 1410 integrates an application processor that primarily processes operations involving an operating system, user interface, application programs, etc., and a modem processor that primarily processes wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into the processor 1410.
The embodiment of the application further provides a readable storage medium, and the readable storage medium stores a program or an instruction, which when executed by a processor, implements each process of the embodiment of the model training method or the information generating method, and can achieve the same technical effect, so that repetition is avoided, and no further description is provided herein.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.
The embodiment of the application further provides a chip, the chip includes a processor and a communication interface, the communication interface is coupled with the processor, the processor is used for running a program or instructions, each process of the embodiment of the model training method or the information generating method can be realized, the same technical effect can be achieved, and in order to avoid repetition, the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-on-chip chips, chip systems, or system-on-chip chips, etc.
The embodiments of the present application provide a computer program product stored in a storage medium, where the program product is executed by at least one processor to implement the respective processes of the foregoing model training method or the information generating method embodiment, and achieve the same technical effects, and are not repeated herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the related art in the form of a computer software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims (13)

1. A method of model training, comprising:
obtaining N sample song lists, wherein each sample song list comprises at least one sample song, and N is an integer greater than 1;
processing each sample song list to obtain sample scene characteristics and sample song list characteristics of the sample song list;
training a first generation model based on the sample scene characteristics and the sample song list characteristics to obtain a second generation model, wherein the second generation model is used for generating display information of the song list, and the display information is associated with the scene characteristics of the song list.
2. The method of claim 1, wherein training a first generation model based on the sample scene features and the sample song list features to obtain a second generation model comprises:
inputting the sample scene characteristics and the sample song list characteristics into the first generation model to obtain first scene characteristics and first display information;
determining a first loss function value based on the first scene feature and a reference scene feature, the reference scene feature determined based on the sample scene feature;
determining a second loss function value based on the first display information and sample display information of the sample song list;
And training the first generation model based on the weighted sum of the first loss function value and the second loss function value to obtain a second generation model.
3. The method of claim 2, wherein the first generation model comprises a word segmentation processing layer and a self-attention layer;
and the inputting the sample scene features and the sample song list features into the first generation model to obtain first scene features and first display information comprises:
inputting the sample scene characteristics and the sample song list characteristics into the word segmentation processing layer for word segmentation processing to obtain a first word segmentation vector and at least one second word segmentation vector, wherein the first word segmentation vector is a word segmentation vector formed by taking the sample scene characteristics as words, and each second word segmentation vector is a word segmentation vector formed based on at least one word segmentation contained in the sample song list characteristics;
and inputting the first word segmentation vector and each second word segmentation vector into the self-attention layer for processing to obtain first scene features and first display information of the sample song list.
4. The method of claim 3, wherein the inputting the first word segmentation vector and each second word segmentation vector into the self-attention layer for processing to obtain the first scene features and the first display information of the sample song list comprises:
Generating an attention matrix according to the first word segmentation vector and each second word segmentation vector, wherein the positions corresponding to the first word segmentation vector in the attention matrix are all preset values, and the preset values are used for representing that each second word segmentation vector associated with the first word segmentation vector is not shielded when the first word segmentation vector is processed;
multiplying the first word segmentation vector by a first weight matrix to obtain a scene matrix;
multiplying a first matrix formed by each second word segmentation vector with a second weight matrix, a third weight matrix and a fourth weight matrix respectively to obtain a query matrix, a key matrix and a value matrix;
and generating first scene features and first display information of the sample song list according to the attention matrix, the scene matrix, the query matrix, the key matrix and the value matrix.
5. The method of any of claims 1-4, wherein after the training a first generation model based on the sample scene features and the sample song list features to obtain a second generation model, the method further comprises:
determining a similar song list of a target song list from a song list library, adding the target song list and the similar song list into a first set, wherein the display information of the target song list and the similar song list is obtained based on the second generation model;
Ranking each song list in the first set according to the use feedback data of the user on each song list in the first set to obtain a first ranking result, wherein the use feedback data is generated by the user based on the display information of each song list;
determining a first score of a first song list and a second score of a second song list in the first ranking result by using a pre-trained reinforcement model, wherein the user's usage feedback data for the first song list is better than the usage feedback data for the second song list, and the reinforcement model is generated based on the second generation model;
determining a third loss function value based on the first score and the second score;
updating the second generation model based on the third loss function value.
6. The method of claim 5, wherein the ranking each song list in the first set according to the user's usage feedback data for each song list in the first set to obtain a first ranking result comprises:
for each song list in the first set, scoring the display information of the song list according to the use feedback data of the user on the song list to obtain a third score, wherein the use feedback data comprises at least one of click rate, collection and average playing duration;
and ranking the song lists in the first set according to the third scores of the song lists in the first set to obtain the first ranking result.
7. A model training device, comprising:
the acquisition module is used for acquiring N sample song lists, wherein each sample song list comprises at least one sample song, and N is an integer greater than 1;
the processing module is used for processing each sample song list to obtain sample scene characteristics and sample song list characteristics of the sample song list;
the training module is used for training the first generation model based on the sample scene characteristics and the sample song list characteristics to obtain a second generation model, wherein the second generation model is used for generating display information of the song list, and the display information is associated with the scene characteristics of the song list.
8. The device according to claim 7, wherein the training module is specifically configured to:
inputting the sample scene characteristics and the sample song list characteristics into the first generation model to obtain first scene characteristics and first display information;
determining a first loss function value based on the first scene feature and a reference scene feature, the reference scene feature determined based on the sample scene feature;
determining a second loss function value based on the first display information and sample display information of the sample song list;
and training the first generation model based on the weighted sum of the first loss function value and the second loss function value to obtain a second generation model.
9. The apparatus of claim 8, wherein the first generation model comprises a word segmentation processing layer and a self-attention layer;
the training module is specifically configured to:
inputting the sample scene characteristics and the sample song list characteristics into the word segmentation processing layer for word segmentation processing to obtain a first word segmentation vector and at least one second word segmentation vector, wherein the first word segmentation vector is a word segmentation vector formed by taking the sample scene characteristics as words, and each second word segmentation vector is a word segmentation vector formed based on at least one word segmentation contained in the sample song list characteristics;
and inputting the first word segmentation vector and each second word segmentation vector into the self-attention layer for processing to obtain first scene features and first display information of the sample song list.
10. The device according to claim 9, wherein the training module is specifically configured to:
Generating an attention matrix according to the first word segmentation vector and each second word segmentation vector, wherein the positions corresponding to the first word segmentation vector in the attention matrix are all preset values, and the preset values are used for representing that each second word segmentation vector associated with the first word segmentation vector is not shielded when the first word segmentation vector is processed;
multiplying the first word segmentation vector by a first weight matrix to obtain a scene matrix;
multiplying a first matrix formed by each second word segmentation vector with a second weight matrix, a third weight matrix and a fourth weight matrix respectively to obtain a query matrix, a key matrix and a value matrix;
and generating first scene features and first display information of the sample song list according to the attention matrix, the scene matrix, the query matrix, the key matrix and the value matrix.
11. The apparatus according to any one of claims 7-10, wherein the processing module is further configured to determine a similar song list of a target song list from a song list library, and add the target song list and the similar song list to the first set, where presentation information of the target song list and the similar song list are obtained based on the second generation model;
Ranking each song list in the first set according to the use feedback data of the user on each song list in the first set to obtain a first ranking result, wherein the use feedback data is generated by the user based on the display information of each song list;
determining a first score of a first song list and a second score of a second song list in the first ranking result by using a pre-trained reinforcement model, wherein the user's usage feedback data for the first song list is better than the usage feedback data for the second song list, and the reinforcement model is generated based on the second generation model;
determining a third loss function value based on the first score and the second score;
updating the second generation model based on the third loss function value.
12. The apparatus according to claim 11, wherein the processing module is specifically configured to:
for each song list in the first set, scoring the display information of the song list according to the use feedback data of the user on the song list to obtain a third score, wherein the use feedback data comprises at least one of click rate, collection and average playing duration;
and ranking the song lists in the first set according to the third scores of the song lists in the first set to obtain the first ranking result.
13. An electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method of any one of claims 1 to 6.
CN202311663744.7A 2023-12-05 2023-12-05 Model training method, device and equipment Pending CN117648575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311663744.7A CN117648575A (en) 2023-12-05 2023-12-05 Model training method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311663744.7A CN117648575A (en) 2023-12-05 2023-12-05 Model training method, device and equipment

Publications (1)

Publication Number Publication Date
CN117648575A true CN117648575A (en) 2024-03-05

Family

ID=90043036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311663744.7A Pending CN117648575A (en) 2023-12-05 2023-12-05 Model training method, device and equipment

Country Status (1)

Country Link
CN (1) CN117648575A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination