CN112231511A - Neural network model training method and song mining method and device - Google Patents


Info

Publication number
CN112231511A
CN112231511A (application CN202011124015.0A)
Authority
CN
China
Prior art keywords
song
target
training
neural network
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011124015.0A
Other languages
Chinese (zh)
Inventor
夏志强 (Xia Zhiqiang)
吴斌 (Wu Bin)
雷兆恒 (Lei Zhaoheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202011124015.0A
Publication of CN112231511A
Current legal status: Pending


Classifications

    • G — Physics
    • G06 — Computing; calculating or counting
    • G06F — Electric digital data processing
    • G06F 16/00 — Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/60 — Information retrieval of audio data; database structures therefor; file system structures therefor
    • G06F 16/63 — Querying
    • G06F 16/635 — Filtering based on additional data, e.g. user or group profiles
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F 18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N — Computing arrangements based on specific computational models
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods

Abstract

The application discloses a song mining method, apparatus, device and computer-readable storage medium. A target song to be mined is obtained; historical songs that satisfy a similarity condition with the target song are obtained; first target input information is determined based on the historical songs, and second target input information is determined based on the target song; the first and second target input information are transmitted to a pre-trained neural network model, and target interaction information of the target song output by the model is obtained; finally, a mining result of the target song is determined based on the target interaction information. Because the target interaction information is the pre-trained model's prediction of how users will interact with the target song, made from the target song and historical songs similar to it, the mining result matches users' actual demand for songs and the song mining performance is good.

Description

Neural network model training method and song mining method and device
Technical Field
The application relates to the technical field of information processing, in particular to a neural network model training method, a song mining method and a song mining device.
Background
Currently, with the development of communication technology and the popularization of music, singers produce more and more songs and users can browse more and more of them; since users' time and attention are limited, it is difficult for them to find the songs they need among so many. Songs therefore need to be mined to surface those that meet user demand. For example, a song value index can be defined and songs meeting user demand can be mined from it with deep learning methods. However, this scheme requires the song value index to be defined manually, and the songs finally mined may not be the ones users actually want, so its song mining performance is poor.
In view of the above, how to improve the performance of song mining is a problem to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a neural network model training method, a song mining method, an apparatus, a device and a computer readable storage medium, which can effectively improve the performance of song mining. The specific scheme is as follows:
in a first aspect, the present application discloses a song mining method, including:
acquiring a target song to be mined;
acquiring historical songs meeting similar conditions with the target songs;
determining first target input information based on the historical songs, and determining second target input information based on the target songs;
transmitting the first target input information and the second target input information to a pre-trained neural network model, and acquiring target interaction information of the target song output by the pre-trained neural network model;
determining a mining result of the target song based on the target interaction information;
wherein the target interaction information is used for representing the result of interaction between the user and the target song.
Optionally, the outputting, by the pre-trained neural network model, the target interaction information based on the first target input information and the second target input information includes:
performing feature extraction on the first target input information based on a learnable CNN network structure to obtain a first target CNN feature;
performing feature extraction on the second target input information based on the learnable CNN network structure to obtain a second target CNN feature;
and calculating the similarity between the first target CNN feature and the second target CNN feature, and determining the target interaction information based on the similarity.
Optionally, the calculating a similarity between the first target CNN feature and the second target CNN feature, and determining the target interaction information based on the similarity, includes:
calculating a dot product of the first target CNN feature and the second target CNN feature;
concatenating the dot product value and the first target CNN feature into a long feature;
calculating, based on the long feature and through a fully connected layer, the similarity between the target song and each of the historical songs;
taking each similarity as the feature weight of the corresponding historical song, and performing a weighted summation of the first target CNN features based on these weights to obtain a composite feature of the historical songs;
concatenating the composite feature and the second target CNN feature to obtain a concatenated feature;
and classifying the concatenated feature through a fully connected layer to obtain the target interaction information of the target song.
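The similarity-and-weighting steps above can be illustrated numerically. The following NumPy sketch is not the patent's actual network: the fully connected layers are reduced to single weight vectors (`w_sim`, `w_cls`), the softmax normalisation of the weights is an assumption, and all names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_interaction(history_feats, target_feat, w_sim, w_cls):
    # Step 1: dot product of each historical CNN feature with the target CNN feature.
    dots = history_feats @ target_feat                      # shape (N,)
    # Step 2: concatenate each dot product with its historical feature ("long
    # feature") and score it with a fully connected layer (here one weight vector).
    long_feats = np.hstack([dots[:, None], history_feats])  # shape (N, 1 + D)
    sims = long_feats @ w_sim                               # shape (N,)
    # Step 3: use the similarities as feature weights (softmax-normalised, an
    # assumption) and sum the historical features into one composite feature.
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()
    composite = weights @ history_feats                     # shape (D,)
    # Step 4: concatenate composite and target features, then classify with a
    # second fully connected layer to get the predicted interaction score.
    concatenated = np.concatenate([composite, target_feat]) # shape (2 * D,)
    return sigmoid(concatenated @ w_cls)

N, D = 8, 16
score = predict_interaction(rng.normal(size=(N, D)), rng.normal(size=D),
                            rng.normal(size=1 + D), rng.normal(size=2 * D))
```

In a trained model the two weight vectors would of course be learned jointly with the CNN rather than sampled at random.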
In a second aspect, the present application discloses a song mining apparatus, comprising:
the target song acquisition module is used for acquiring a target song to be mined;
a historical song acquisition module used for acquiring historical songs meeting similar conditions with the target songs;
the target input information acquisition module is used for determining first target input information based on the historical songs and determining second target input information based on the target songs;
the target interaction information acquisition module is used for transmitting the first target input information and the second target input information to a pre-trained neural network model and acquiring target interaction information of the target song output by the pre-trained neural network model;
the mining result determining module is used for determining the mining result of the target song based on the target interaction information;
wherein the target interaction information is used for representing the result of interaction between the user and the target song.
In a third aspect, the application discloses a training method of a neural network model, comprising:
acquiring sample songs whose interaction information is known;
dividing a training set from the sample songs;
selecting, from the training set, a training song set and a first song set that satisfies a similarity condition with the training song set;
determining first training input information based on the first set of songs, and determining second training input information based on the training set of songs;
taking the first training input information and the second training input information as the input of an initial neural network model, training the initial neural network model, and inputting the known interaction information of the training song set and the predicted interaction information of the training song set output by the initial neural network model to a preset loss function to obtain a loss value;
judging whether the loss value has converged; if it has not converged, adjusting the network parameters of the initial neural network model according to the loss value and returning to the step of selecting a training song set and a first song set that satisfies the similarity condition with the training song set; and if the loss value has converged, finishing the training of the initial neural network model to obtain a pre-trained neural network model, and performing song mining based on the pre-trained neural network model.
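The train-judge-adjust loop described above can be sketched with a toy model. The sketch below replaces the CNN with a linear model, uses mean squared error as the "preset loss function", and fabricates the data; the learning rate, convergence tolerance and all names are assumptions made only to render the control flow runnable.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-ins: X plays the role of the model's input features for the
# training song set, y the known interaction information (both fabricated).
X = rng.normal(size=(256, 8))
y = sigmoid(X @ rng.normal(size=8))

w = np.zeros(8)                           # parameters of the initial model
lr, prev_loss, losses = 0.2, np.inf, []
for step in range(5000):
    pred = sigmoid(X @ w)                 # predicted interaction information
    loss = np.mean((pred - y) ** 2)       # preset loss function (MSE, an assumption)
    losses.append(loss)
    if abs(prev_loss - loss) < 1e-9:      # loss value has converged -> stop training
        break
    prev_loss = loss
    # Adjust the network parameters according to the loss value (gradient descent);
    # pred * (1 - pred) is the derivative of the sigmoid.
    grad = (2.0 / len(X)) * X.T @ ((pred - y) * pred * (1.0 - pred))
    w -= lr * grad
```

The patent additionally reselects a training song set and first song set on each pass; here the same batch is reused since the data are synthetic.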
Optionally, after obtaining the pre-trained neural network model, the method further includes:
dividing a verification set from the sample songs;
performing performance evaluation on the pre-trained neural network model based on the verification set to obtain a performance evaluation result;
judging whether the performance evaluation result meets a preset requirement or not;
if the performance evaluation result meets the preset requirement, allowing the pre-trained neural network model to be applied;
and if the performance evaluation result does not meet the preset requirement, changing a training strategy to continue training the pre-trained neural network model.
Optionally, the performing performance evaluation on the pre-trained neural network model based on the verification set to obtain a performance evaluation result includes:
selecting, from the verification set, a verification song set and a second song set that satisfies the similarity condition with the verification song set;
determining first verification input information based on the second set of songs, determining second verification input information based on the verification set of songs;
taking the first verification input information and the second verification input information as the input of the pre-trained neural network model, and acquiring the prediction interaction information output by the pre-trained neural network model;
determining the performance evaluation result based on the predicted interaction information and the known interaction information of the verification song set.
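A minimal sketch of such a performance evaluation, assuming the "preset requirement" is a bound on the mean absolute error between predicted and known interaction values (both the metric and the threshold are assumptions, since the patent leaves them open):

```python
def evaluate(predicted, known, max_mae=0.1):
    # Mean absolute error between the model's predicted interaction values
    # and the verification song set's known interaction values.
    mae = sum(abs(p - k) for p, k in zip(predicted, known)) / len(known)
    # The evaluation result records whether the preset requirement is met.
    return {"mae": mae, "meets_requirement": mae <= max_mae}

result = evaluate([0.9, 0.4, 0.7], [0.85, 0.5, 0.65])
```

If `meets_requirement` were false, the patent's flow would change the training strategy and continue training instead of deploying the model.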
Optionally, the determining first training input information based on the first song set and determining second training input information based on the training song set includes:
determining a target Mel spectrum feature of the first song set, and taking it as the first training input information;
and determining a target Mel spectrum feature of the training song set, and taking it as the second training input information.
Optionally, the determining process of the target mel-frequency spectrum feature of the song includes:
carrying out short-time Fourier transform on the audio of the song to obtain a short-time Fourier transform result;
carrying out Mel frequency spectrum coefficient conversion on the short-time Fourier transform result to obtain an initial Mel frequency spectrum characteristic;
and truncating the initial Mel frequency spectrum characteristic to obtain a target Mel frequency spectrum characteristic of the song.
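The three steps above (STFT, Mel conversion, truncation) can be sketched with NumPy alone; a real implementation would more likely call an audio library such as librosa. The frame size, hop, filter count and truncation length below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        if centre > left:
            fb[i, left:centre] = np.linspace(0.0, 1.0, centre - left, endpoint=False)
        if right > centre:
            fb[i, centre:right] = np.linspace(1.0, 0.0, right - centre, endpoint=False)
    return fb

def target_mel_feature(audio, sr=22050, n_fft=1024, hop=512, n_mels=64, max_frames=128):
    # Step 1: short-time Fourier transform (Hann-windowed frames).
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # power spectrogram
    # Step 2: Mel spectrum conversion via the triangular filterbank.
    mel = spec @ mel_filterbank(sr, n_fft, n_mels).T    # (n_frames, n_mels)
    # Step 3: truncate to a fixed number of frames.
    return mel[:max_frames]

# 4 s of a 440 Hz tone as a stand-in for a song's audio.
audio = np.sin(2 * np.pi * 440.0 * np.arange(4 * 22050) / 22050.0)
feat = target_mel_feature(audio)
```

Truncating to a fixed frame count gives every song an input of identical shape, which is what allows the CNN to consume songs of different durations.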
In a fourth aspect, the present application provides a training apparatus for a neural network model, including:
the sample song acquisition module is used for acquiring sample songs with known interactive information;
the training set dividing module is used for dividing a training set from the sample songs;
the first song set selection module is used for selecting, from the training set, a training song set and a first song set that satisfies a similarity condition with the training song set;
a training input information determination module for determining first training input information based on the first set of songs, and second training input information based on the training set of songs;
a loss value obtaining module, configured to take the first training input information and the second training input information as inputs of an initial neural network model, train the initial neural network model, and input known interaction information of the training song set and predicted interaction information of the training song set output by the initial neural network model to a preset loss function, so as to obtain a loss value;
an adjusting module, configured to judge whether the loss value has converged; if it has not converged, adjust the network parameters of the initial neural network model according to the loss value and prompt the first song set selection module to again select, from the training set, a training song set and a first song set that satisfies the similarity condition with the training song set; and if the loss value has converged, finish the training of the initial neural network model to obtain a pre-trained neural network model, and perform song mining based on the pre-trained neural network model.
In a fifth aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the neural network model training method or the song mining method as described in any one of the above.
In a sixth aspect, the present application discloses a computer readable storage medium for storing a computer program which, when executed by a processor, implements a method for training a neural network model or a method for song mining as described in any one of the above.
According to the song mining method provided by the application, a target song to be mined is obtained first; historical songs that satisfy a similarity condition with the target song are obtained; first target input information is then determined based on the historical songs, and second target input information is determined based on the target song; both are transmitted to a pre-trained neural network model, and target interaction information of the target song output by the model is obtained; finally, a mining result of the target song is determined based on the target interaction information. Because the target interaction information is the model's prediction of how users will interact with the target song, made from the target song and historical songs similar to it, no song value index needs to be defined manually; the mining result adapts to the target interaction information, which reflects users' demand for the target song, so the mining result matches users' actual demand for songs and the song mining performance is good. The song mining apparatus, electronic device and computer-readable storage medium provided by the application solve the corresponding technical problems in the same way.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a block diagram of a system framework to which the song mining scheme provided herein is applicable;
fig. 2 is a flowchart of a song mining method according to an embodiment of the present application;
FIG. 3 is a flow chart of the training of neural network models in the present application;
FIG. 4 is another training flow diagram of the neural network model of the present application;
FIG. 5 is a flow chart of the determination of a target Mel frequency spectrum characteristic in the present application;
FIG. 6 is a flow chart of the neural network model determining target interaction information in the present application;
FIG. 7 is a schematic diagram of a deployment of a neural network model;
fig. 8 is a schematic structural diagram of a song mining device according to the present application;
FIG. 9 is a schematic structural diagram of a training apparatus for a neural network model provided in the present application;
FIG. 10 is a block diagram illustrating an electronic device 20 according to an example embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Currently, with the development of communication technology and the popularization of music, singers produce more and more songs and users can browse more and more of them; since users' time and attention are limited, it is difficult for them to find the songs they need among so many. Songs therefore need to be mined to surface those that meet user demand. For example, a song value index can be defined and songs users like can be mined from it with deep learning methods. However, this scheme requires the song value index to be defined manually, and a manually defined index easily deviates from users' actual demand for songs. For example, if the index is defined according to song quality, good and poor songs can be distinguished, but the songs finally mined may still not be the ones users want, so the mining accuracy is poor. To overcome this technical problem, the application provides a song mining scheme that can improve the accuracy of song mining.
In the song mining scheme of the present application, a system framework adopted may specifically refer to fig. 1, and may specifically include: a backend server 01 and a number of clients 02 establishing a communication connection with the backend server 01.
In the application, the background server 01 is used for executing the steps of the song mining method, including acquiring a target song to be mined; acquiring historical songs meeting similar conditions with target songs; determining first target input information based on the historical songs, and determining second target input information based on the target songs; transmitting the first target input information and the second target input information to a pre-trained neural network model, and acquiring target interaction information of a target song output by the pre-trained neural network model; determining a mining result of the target song based on the target interaction information; and the target interaction information is used for representing the interaction result of the user and the target song.
Further, the background server 01 may further be provided with a target song database, a history song database, a target interaction information database, and a mining result database. The target song database is used for storing target songs to be mined, the historical song database is used for storing songs with known interaction information, the target interaction information database is used for storing interaction information output by a pre-trained neural network model, and the mining result database is used for storing mining results finally obtained. It can be understood that after the target song is mined by the song mining scheme of the application, the target song can be transferred from the target song database to the history song database, so that the target song can be subsequently used as a history song to mine a new target song.
Of course, databases such as the target song database may also be set up on a third-party service server, which specially collects the data uploaded by clients. In this way, when the background server 01 needs certain data, it can obtain them by initiating a corresponding data call request to the service server, for example by sending a historical song call request to obtain historical songs.
In the application, the background server 01 may respond to the song mining requests of one or more user terminals 02, and it can be understood that the song mining requests initiated by different user terminals 02 in the application may be on-demand requests initiated for the same song or on-demand requests initiated for different songs.
Fig. 2 is a flowchart of a song mining method according to an embodiment of the present application. Referring to fig. 2, the song mining method includes:
step S11: and acquiring the target song to be mined.
In this embodiment, the target song refers to a song whose interaction information is unknown; it may be a song newly released by a singer, or a song already released that users have not listened to, or have rarely listened to.
It can be understood that, because the interaction information of the target song is unknown, it cannot be determined whether the target song meets user demand. If the target song is not mined, songs meeting user demand are easily missed on the one hand, and the value of the song is easily diminished on the other; the target song to be mined therefore needs to be acquired and mined.
It should be noted that the interaction information of a song characterizes the result of interaction between users and the song; for example, it may be song popularity information, the time at which users play the song, and so on, and its type may be determined according to actual needs. In addition, the audio data volume of songs can be reduced to ensure the operating efficiency of the song mining method; that is, the target song and other songs may be stored in a compressed format such as MP3 (MPEG-1 Audio Layer III).
Step S12: and acquiring historical songs meeting similar conditions with the target songs.
In this embodiment, after the target song to be mined is acquired, its song value features are not determined directly or by a deep learning method. Instead, historical songs that satisfy a similarity condition with the target song are acquired first. The similarity condition judges whether a historical song is similar to the target song, and a historical song is a song whose interaction information is known; what is acquired here is therefore a set of songs that are similar to the target song and whose interaction information is known, so that the target interaction information of the target song can subsequently be predicted from them.
It should be noted that the similarity condition for determining whether the history song is similar to the target song may be determined according to actual needs, for example, the similarity condition may be one or more of a language similarity condition, a genre similarity condition, a singer similarity condition, a year similarity condition, and the like.
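As a concrete illustration, a similarity condition combining language, genre and year might be checked as follows. The metadata field names and the five-year window are hypothetical, not from the patent.

```python
def satisfies_similarity(target, candidate, max_year_gap=5):
    # Hypothetical combined check: same language, same genre, release years
    # within max_year_gap of each other.
    return (candidate["language"] == target["language"]
            and candidate["genre"] == target["genre"]
            and abs(candidate["year"] - target["year"]) <= max_year_gap)

target = {"language": "zh", "genre": "pop", "year": 2020}
history = [
    {"language": "zh", "genre": "pop", "year": 2018},   # similar
    {"language": "en", "genre": "pop", "year": 2019},   # different language
    {"language": "zh", "genre": "rock", "year": 2020},  # different genre
]
similar_history = [s for s in history if satisfies_similarity(target, s)]
```

Any subset of these conditions (or others, such as singer similarity) could be combined in the same way, since the patent leaves the choice to actual needs.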
Step S13: first target input information is determined based on the history songs, and second target input information is determined based on the target songs.
Step S14: and transmitting the first target input information and the second target input information to the pre-trained neural network model, and acquiring target interaction information of the target song output by the pre-trained neural network model.
In this embodiment, since there is no manually defined song value index, no specific value index information needs to be extracted from a song; instead, the application predicts the target interaction information of the target song by means of a neural network model. For example, after the historical songs satisfying the similarity condition with the target song are acquired, first target input information is determined based on the historical songs and second target input information is determined based on the target song; both are transmitted to the pre-trained neural network model, and the target interaction information of the target song output by the model is acquired. The first and second target input information are the inputs with which the neural network model predicts the interaction information of the target song. The first target input information is the song input information corresponding to the historical songs: it may be the historical songs themselves, or song features obtained by feature extraction from the historical songs, and so on. Likewise, the second target input information is the song input information corresponding to the target song: it may be the target song itself, or song features obtained by feature extraction from the target song. The application is not specifically limited herein.
It should be noted that, since the interaction information of the history song is known, if the first target input information is determined based on the history song and transmitted to the pre-trained neural network model, the target interaction information of the target song can be constrained by the interaction information of the history song, and the authenticity of the target interaction information can be ensured.
Step S15: determining a mining result of the target song based on the target interaction information; the target interaction information comprises a result of interaction between the user and the target song predicted by the pre-trained neural network model.
In this embodiment, the target interaction information obtained from the pre-trained neural network model is only the predicted interaction between users and the target song; the mining result of the target song cannot be read off from it directly. After the target interaction information is obtained, the mining result of the target song must therefore be determined based on it. Specifically, the target interaction information may be used directly as the mining result, or it may be extended to obtain the mining result, for example by determining a corresponding grade of the song according to the target interaction information and using that grade as the mining result. In that case, the correspondence between the target interaction information and song grades can be determined from the specific values of the interaction information output by the pre-trained neural network model and from the grade definitions of songs, and the grade corresponding to the target interaction information is then determined by means of this correspondence.
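For example, extending the target interaction information into a graded mining result could look like the following sketch, where both the thresholds (on a predicted completion rate in [0, 1]) and the grade names are assumptions, not values from the patent.

```python
# Hypothetical correspondence between interaction values and song grades,
# ordered from highest threshold to lowest.
GRADE_THRESHOLDS = [(0.8, "hot"), (0.5, "promising"), (0.0, "ordinary")]

def mining_result(predicted_interaction):
    # Return the grade of the first threshold the predicted value reaches.
    for threshold, grade in GRADE_THRESHOLDS:
        if predicted_interaction >= threshold:
            return grade
    return "ordinary"
```

The same table-lookup shape works for any interaction type whose output is a scalar score; only the thresholds and labels change per deployment scenario.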
It should be noted that the type of the mining result may be determined according to the scenario in which the application is applied and the type of the interaction information. For example, if the song mining method is applied to a hot-song mining scenario, the interaction information may be limited to song popularity information, such as song completion rate, play count, click count, forwarding count, and the like. In this scenario, a promotion strategy for the target song, or corresponding signing information, can further be set according to the mining result: if the mining result indicates that the popularity of the target song is high, the target song can be promoted as a top song or as a high-quality song with increased promotion effort, the song can be signed, its author can be signed or invited to join the music platform, and so on.
If the song mining method provided by the application is applied to a scenario in which songs are recommended to a user for different time periods, the interaction information may be limited to the times at which users play songs. In that case, the predicted time at which the user plays the target song can be used directly as the mining result, or it can be compared with the period in which the user regularly listens to songs. For example, if the predicted play time of the target song is 5 p.m. but the user regularly listens to songs at 6 p.m., information indicating that the target song should be played at 6 p.m. can be taken as the mining result; the user can then listen to the target song at 6 p.m., which enriches the user's listening experience.
In addition, it should be further noted that the users served by the song mining method provided by the present application may be single users, or group users, and the present application is not limited specifically herein.
According to the song mining method, a target song to be mined is obtained first; historical songs satisfying a similarity condition with the target song are acquired; first target input information is then determined based on the historical songs, and second target input information is determined based on the target song; the first target input information and the second target input information are transmitted to a pre-trained neural network model, and target interaction information of the target song output by the pre-trained neural network model is acquired; finally, a mining result of the target song is determined based on the target interaction information. Because the target interaction information is the user-song interaction predicted by the pre-trained neural network model from the target song and from historical songs similar to it, no song value index needs to be manually defined in this process. The mining result is adapted to the target interaction information, and the target interaction information reflects the user's demand for the target song, so the mining result is consistent with the user's actual demand for songs and the song mining performance is good.
Fig. 3 is a flowchart of training a neural network model according to the present application.
In the song mining method provided in the embodiment of the present application, before transmitting the first target input information and the second target input information to the pre-trained neural network model, the neural network model may be further trained, and the training process may include:
step S201: a sample song is obtained for which the interaction information is known.
In this embodiment, in the process of training the neural network model, a sample song with known interaction information may be obtained first, so that the neural network model is trained by the sample song later.
It will be appreciated that, in a particular application scenario, in the process of obtaining sample songs for which the interaction information is known, interaction records between users and songs can be collected, for example, through music software. The sample songs and their interaction information are then determined from these interaction records: a song played by a user in the music software may be used as a sample song, and the interaction record between the user and the sample song may be converted into corresponding interaction information. For example, if the record shows that the user played the sample song three times, the interaction information of the sample song may be a play count of three; if the record shows that the user started playing the sample song at 8 am, the interaction information may be a playing time of 8 am. It should be noted that the information of the sample song may be determined according to actual needs; for example, the sample song may include a song ID, song information, singer information, and the like. The type of the interaction information may also be determined according to actual needs; for example, it may be song popularity information, the time at which the user played the song, and the like.
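As a minimal sketch of the conversion described above, the following Python function aggregates raw (user, song) interaction records into play-count interaction information; the record format and function name are illustrative assumptions, not part of the patent:

```python
from collections import defaultdict

def interaction_info_from_records(records):
    """Convert raw user-song interaction records of the form
    (user_id, song_id) into interaction information: a per-song play count."""
    plays = defaultdict(int)
    for user_id, song_id in records:
        plays[song_id] += 1
    return dict(plays)
```

For example, if one user played song "s1" twice and another user played it once, the resulting interaction information for "s1" is a play count of three.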
Step S202: a training set is partitioned among the sample songs.
In this embodiment, after a sample song with known interaction information is obtained, a training set may be partitioned from the sample song, and since the training set is partitioned from the sample song, the interaction information of the training set is also known, so that the initial neural network model may be trained subsequently by applying the training set.
Step S203: and in the training set, selecting the training song set and a first song set meeting similar conditions with the training song set.
In this embodiment, in the process of outputting the target interaction information, the neural network model requires the participation of historical songs similar to the target song, and the interaction information of the historical songs is required to constrain the interaction information of the target song. Therefore, in the training process, a song set similar to the training song set must also be determined: after the training set is partitioned from the sample songs, a training song set and a first song set satisfying a similarity condition with the training song set can be selected from the training set. Because the first song set is also drawn from the training set, its interaction information is likewise known, so the first song set can subsequently be used together with the training song set to train the neural network model.
It should be noted that the numbers of sample songs, songs in the training song set, and songs in the first song set can be determined according to actual needs, and the present application is not specifically limited herein. For example, 80% of the sample songs can be used as the training set; within the training set, 256 songs can be selected as the training song set each time, and 5000 songs other than those in the training song set can be selected as the first song set each time.
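The partitioning and selection in steps S202-S203 can be sketched as follows. Note that this sketch uses random sampling as a stand-in for the patent's similarity condition between the training song set and the first song set; the function names and the random-sampling simplification are assumptions:

```python
import random

def split_sample_songs(sample_songs, train_ratio=0.8, seed=0):
    """Step S202 (sketch): randomly partition the sample songs into a
    training set and the remainder (e.g. a verification set)."""
    rng = random.Random(seed)
    songs = list(sample_songs)
    rng.shuffle(songs)
    cut = int(len(songs) * train_ratio)
    return songs[:cut], songs[cut:]

def select_batch(training_set, batch_size=256, first_set_size=5000, seed=0):
    """Step S203 (sketch): pick a training song set and a disjoint first
    song set from the training set. The similarity condition is simplified
    here to plain random sampling."""
    rng = random.Random(seed)
    pool = list(training_set)
    rng.shuffle(pool)
    training_song_set = pool[:batch_size]
    first_song_set = pool[batch_size:batch_size + first_set_size]
    return training_song_set, first_song_set
```

Because the two sets come from disjoint slices of the shuffled pool, no song appears in both the training song set and the first song set within one round.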
Step S204: first training input information is determined based on the first set of songs and second training input information is determined based on the training set of songs.
Step S205: and taking the first training input information and the second training input information as the input of an initial neural network model, training the initial neural network model, and inputting the known interaction information of the training song set and the predicted interaction information of the training song set output by the initial neural network model to a preset loss function to obtain a loss value.
In this embodiment, after the training dataset and the first dataset are determined, the first training input information may be determined based on the first dataset, the second training input information may be determined based on the training dataset, the first training input information and the second training input information are used as inputs of the initial neural network model, the initial neural network model is trained, the known interaction information of the training dataset and the predicted interaction information of the training dataset output by the initial neural network model are input to a preset loss function, and a loss value is obtained, so that the network parameters of the initial neural network model are adjusted according to the loss value in the following process.
Note that the first training input information and the second training input information are the inputs used when training the neural network model. The first training input information is the song input information corresponding to the first song set: it may be the first song set itself, or the song features obtained by performing feature extraction on the first song set, and the like. Likewise, the second training input information is the song input information corresponding to the training song set: it may be the training song set itself, or the song features obtained by performing feature extraction on the training song set, and the like. The present application is not specifically limited herein. In addition, the type of the loss function can be determined according to actual needs; for example, it may be an L2 loss function, a pairwise loss function, a cross-entropy loss function, or the like.
Step S206: judging whether the loss value is converged, if not, executing step S207; if the loss value converges, step S208 is performed.
Step S207: the network parameters of the initial neural network model are adjusted according to the loss values, and the process returns to step S203.
Step S208: and finishing the training of the initial neural network model to obtain a pre-trained neural network model.
In this embodiment, after the loss value is obtained, whether training of the initial neural network model is complete is determined by judging whether the loss value has converged. If the loss value has not converged, the initial neural network model does not yet meet the training standard; in this case, the network parameters of the initial neural network model are adjusted according to the loss value, and the process returns to step S203 to start another round of training. It should be noted that when step S203 is executed again, the selected training song set may differ from the previously selected one, so as to achieve a new training effect. If the loss value has converged, the initial neural network model has reached the training standard, and the pre-trained neural network model is obtained.
It should be noted that the type of neural network model applied in the present application can be determined according to actual needs; for example, the neural network model can be a convolutional neural network, a recurrent neural network, a Transformer, etc.
Therefore, in the embodiment, the initial neural network model is trained through the interactive information, the training song set and the first song set similar to the training song set, so that the pre-trained neural network model can be quickly obtained, the song value index does not need to be manually defined in the training process, and the condition that the mining result does not meet the user requirement due to manual definition is avoided.
Fig. 4 is another training flowchart of the neural network model of the present application.
In the song mining method provided in the embodiment of the present application, before transmitting the first target input information and the second target input information to the pre-trained neural network model, the neural network model may be further trained, and the training process may include:
step S201: a sample song is obtained for which the interaction information is known.
Step S202: a training set is partitioned among the sample songs.
Step S203: and in the training set, selecting the training song set and a first song set meeting similar conditions with the training song set.
Step S204: first training input information is determined based on the first set of songs and second training input information is determined based on the training set of songs.
Step S205: and taking the first training input information and the second training input information as the input of an initial neural network model, training the initial neural network model, and inputting the known interaction information of the training song set and the predicted interaction information of the training song set output by the initial neural network model to a preset loss function to obtain a loss value.
Step S206: judging whether the loss value is converged, if not, executing step S207; if the loss value converges, step S208 is performed.
Step S207: the network parameters of the initial neural network model are adjusted according to the loss values, and the process returns to step S203.
Step S208: the training of the initial neural network model is completed to obtain a pre-trained neural network model, and step S209 is performed.
Step S209: and marking out a verification set in the sample song, performing performance evaluation on the pre-trained neural network model by using the verification set, if the performance evaluation result meets the preset requirement, allowing the pre-trained neural network model to be applied, and if the performance evaluation result does not meet the preset requirement, changing the training strategy to continue training the neural network model to obtain the pre-trained neural network model allowed to be applied.
In this embodiment, after the initial neural network model is trained with the interaction information of the training set, the first song set, and the training song set, a neural network model meeting the requirements may be obtained directly, or it may not be. In the latter case, the trained neural network model needs to be verified and adjusted to obtain a pre-trained neural network model that meets the requirements. In this process, for convenience of operation, a verification set may be partitioned directly from the sample songs. It should be noted that the verification set and the training set should be different song sets; that is, no song appears in both the verification set and the training set.
In practical application, in the process of evaluating the performance of the pre-trained neural network model by using the verification set, the verification song set and a second song set meeting similar conditions with the verification song set can be selected in the verification set; determining first verification input information based on the second set of songs, and determining second verification input information based on the verification set of songs; taking the first verification input information and the second verification input information as the input of the trained neural network model, and acquiring the prediction interaction information output by the trained neural network model; a performance assessment result is determined based on the predicted interaction information and the known interaction information of the verified album.
It should be noted that the size of the verification set, the number of songs selected as the verification song set, and the number of songs in the second song set may be determined according to actual needs; for example, the remaining 20% of the sample songs may be used as the verification set, and within the verification set, 256 songs may be selected as the verification song set and 5000 songs as the second song set each time. In addition, the first verification input information and the second verification input information are the inputs used when verifying the neural network model. The first verification input information is the song input information corresponding to the second song set: it may be the second song set itself, or the song features obtained by performing feature extraction on the second song set, and the like. The second verification input information is the song input information corresponding to the verification song set: it may be the verification song set itself, or the song features obtained by performing feature extraction on the verification song set. The present application is not specifically limited herein. Furthermore, if the trained neural network model meets the requirements, the difference between the predicted interaction information and the known interaction information of the verification song set should not be too large; therefore, performance evaluation can be performed based on the predicted interaction information and the known interaction information of the verification song set, and the preset requirement can be that this difference is within a preset range, and the like.
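One way to model the preset requirement described above is as a bound on the mean absolute difference between the predicted and known interaction information of the verification song set. The specific metric and threshold below are assumptions for illustration; the patent only requires that "the difference is within a preset range":

```python
import numpy as np

def evaluate_model(predicted, known, max_mean_abs_error=0.1):
    """Performance evaluation on the verification song set (sketch): the
    preset requirement is modelled as a bound on the mean absolute
    difference between predicted and known interaction information."""
    predicted = np.asarray(predicted, dtype=float)
    known = np.asarray(known, dtype=float)
    error = float(np.mean(np.abs(predicted - known)))
    return error, error <= max_mean_abs_error
```

If the returned flag is false, the training strategy would be changed and training continued, as described in step S209.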
Therefore, in the embodiment, only the training set and the verification set need to be marked out from the sample song, so that the training and the adjustment of the neural network model can be completed by means of the training set and the verification set, the pre-trained neural network model meeting the requirements can be quickly obtained, the training efficiency of the neural network model is accelerated, and the song mining efficiency is further accelerated.
In the song mining method provided by the embodiment of the present application, although the corresponding songs can be used directly as the input information of the neural network model, the large data volume of the songs would increase the operation load of the model. To avoid this, song information can be extracted from the corresponding songs and used as the input of the neural network model; the extracted song information may be the Mel spectrum feature of the song. That is, the process of determining the first training input information based on the first song set and the second training input information based on the training song set may specifically be: determining the target Mel spectrum feature of the first song set and using it as the first training input information; and determining the target Mel spectrum feature of the training song set and using it as the second training input information. Correspondingly, the target Mel spectrum feature of the second song set is determined as the first verification input information, the target Mel spectrum feature of the verification song set as the second verification input information, the target Mel spectrum feature of the historical songs as the first target input information, and the target Mel spectrum feature of the target song as the second target input information.
Fig. 5 is a flow chart of the determination of the target mel-frequency spectrum feature in the present application.
In the song mining method provided by the embodiment of the application, the process for determining the target mel-frequency spectrum feature of the song may include the following steps:
step S31: and carrying out short-time Fourier transform on the audio of the song to obtain a short-time Fourier transform result.
In this embodiment, in the process of determining the target Mel spectrum feature of a song, a short-time Fourier transform may first be performed on the audio of the song to obtain a short-time Fourier transform result. The parameters of the short-time Fourier transform may be determined according to actual needs; for example, the time window parameter W1 of the short-time Fourier transform may be 1024, and the size of the result may be R^(T×F), where R represents the real numbers, T represents time, and F represents the frequency domain. The form of the short-time Fourier transform result can likewise be determined according to actual needs.
Step S32: and carrying out Mel frequency spectrum coefficient conversion on the short-time Fourier transform result to obtain initial Mel frequency spectrum characteristics.
In this embodiment, if the short-time Fourier transform result were used directly as the input of the neural network model, the input information would still be large. To further reduce the data volume of the input information, after the short-time Fourier transform result is obtained, Mel spectrum coefficient conversion may be performed on it to obtain an initial Mel spectrum feature, whose structure may be R^(T×F), where R represents the real numbers, T represents time, and F represents the frequency domain.
Step S33: and cutting the initial Mel frequency spectrum characteristic to obtain the target Mel frequency spectrum characteristic of the song.
In this embodiment, if the initial mel-frequency spectrum feature is directly used as the input information of the neural network model, since the initial mel-frequency spectrum feature may carry invalid information and the data volume of the initial mel-frequency spectrum feature may be large, in order to further reduce the data volume of the input information of the neural network model, the initial mel-frequency spectrum feature may be truncated to obtain the target mel-frequency spectrum feature of the song, and the target mel-frequency spectrum feature is used as the input information of the neural network model.
It should be noted that truncating the initial Mel spectrum feature refers to cutting it according to time information. For example, if the time length of the initial Mel spectrum feature is 3 minutes and the required time length of the target Mel spectrum feature is 2 minutes, the initial Mel spectrum feature can be truncated at the 2-minute mark to obtain the target Mel spectrum feature, whose size may be R^(5167×F).
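Steps S31-S33 can be sketched in NumPy as follows. The magnitude STFT and the time-axis truncation follow the text; the Mel conversion is heavily simplified (frequency bins are merely averaged into bands), since a real implementation would apply a proper Mel filterbank, e.g. via librosa. All parameter values besides the window size 1024 are illustrative assumptions:

```python
import numpy as np

def stft(audio, window=1024, hop=512):
    """Step S31 (sketch): magnitude short-time Fourier transform,
    returning an array of shape (T, F) with F = window // 2 + 1."""
    win = np.hanning(window)
    frames = []
    for start in range(0, len(audio) - window + 1, hop):
        frame = audio[start:start + window] * win
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

def to_mel_like(spec, n_bands=80):
    """Step S32 (simplified stand-in): compress frequency bins into fewer
    bands by averaging; a real implementation would use a Mel filterbank."""
    T, F = spec.shape
    edges = np.linspace(0, F, n_bands + 1).astype(int)
    return np.stack([spec[:, a:b].mean(axis=1)
                     for a, b in zip(edges[:-1], edges[1:])], axis=1)

def truncate_feature(feature, target_frames):
    """Step S33: truncate along the time axis to a fixed frame count,
    discarding everything after the cut point."""
    return feature[:target_frames]
```

Truncating along axis 0 corresponds to cutting the feature at a fixed time mark, so every song yields a feature of identical shape for the neural network.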
Therefore, in the embodiment, by determining the target mel frequency spectrum characteristics of the song and using the target mel frequency spectrum characteristics of the song as the input information of the neural network model, the data volume of the input information of the neural network model can be reduced, the operation efficiency of the neural network model is accelerated, and the song mining efficiency is further improved.
In the song mining method provided in the embodiment of the present application, the process of outputting the target interaction information based on the first target input information and the second target input information by the pre-trained neural network model may specifically be: performing feature extraction on the first target input information based on a learnable CNN (Convolutional Neural Network) structure to obtain a first target CNN feature; performing feature extraction on the second target input information based on the learnable CNN network structure to obtain a second target CNN feature; and calculating the similarity between the first target CNN feature and the second target CNN feature, and determining the target interaction information based on the similarity.
That is, in this embodiment, the neural network model may extract the CNN features of the history song and the CNN features of the target song, determine the target interaction information according to the similarity between the CNN features of the target song and the CNN features of the history song, and further enhance the constraint of the history song on the target interaction information.
It should be noted that the learnable CNN network structure does not require a manually defined similarity standard or degree calculation method: the CNN features it outputs are learned automatically by the network. Because high-quality songs share similarities and poor-quality songs also have commonalities, this embodiment can output CNN features representing the commonality of songs through the learnable CNN network structure without manual definition, so that the subsequent neural network model can predict interaction information based on these CNN features. The features used for predicting interaction information are carried in the CNN features output by the learnable CNN network structure, and with their help a song can be converted into a data type that is easy for the neural network model to process, facilitating subsequent interaction information prediction based on the CNN features. In addition, the first target CNN feature is the CNN feature of the historical songs, the second target CNN feature is the CNN feature of the target song, and the learnable CNN network structure may be part of the neural network model.
FIG. 6 is a flow chart of the neural network model for determining target interaction information in the present application.
In the song mining method provided by the embodiment of the application, the process of determining the target interaction information by the neural network model may include the following steps:
step S41: and performing feature extraction on the first target input information based on the learnable CNN network structure to obtain a first target CNN feature.
Step S42: and performing feature extraction on the second target input information based on the learnable CNN network structure to obtain a second target CNN feature.
In this embodiment, the neural network model may first perform feature extraction on the first target input information and perform feature extraction on the second target input information based on the learnable CNN network structure to obtain corresponding CNN features, so as to determine target interaction information according to the CNN features of the target song and the CNN features of the history songs in the following.
Step S43: and calculating a dot product value of the first target CNN characteristic and the second target CNN characteristic.
Step S44: concatenating the dot product value and the first target CNN feature into a long feature.
Step S45: and based on the long characteristics, the similarity between the target song and each song in the history songs is obtained through full-link calculation.
In this embodiment, in order to quickly calculate the similarity value, the neural network model may calculate the dot product of the first target CNN feature and the second target CNN feature, concatenate the dot product value with the first target CNN feature into a long feature, and, based on the long feature, calculate the similarity between the target song and each song in the historical songs through the full link. Correspondingly, a concatenation layer and a full link layer need to be built in the neural network model; of course, the structure of the neural network model can be enriched according to actual needs, for example with attention, softmax, and the like.
It should be noted that the calculation manner of the similarity may be determined according to actual needs, for example, the similarity may be determined by a cosine distance, an L1 distance, an L2 distance, and the like.
Step S46: and taking the similarity as a characteristic weight value of a corresponding song in the historical song, and carrying out weighted summation on the first target CNN characteristic based on the characteristic weight value to obtain the comprehensive characteristic of the historical song.
Step S47: and connecting the comprehensive characteristic and the second target CNN characteristic in series to obtain a series characteristic.
Step S48: and classifying the series connection characteristics through the full link to obtain the target interaction information of the target song.
In this embodiment, in the process of determining the target interaction information based on the similarity, the neural network model may use the similarity as a feature weight value of a corresponding song in the history song, perform weighted summation on the first target CNN feature based on the feature weight value to obtain a comprehensive feature of the history song, connect the comprehensive feature and the second target CNN feature in series to obtain a series feature, and classify the series feature through full links to obtain the target interaction information of the target song.
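The weighting-and-fusion pipeline of steps S43-S48 can be sketched in NumPy as below. This sketch collapses the patent's learned full-link similarity (steps S44-S45) into a direct softmax over dot products, and replaces the full-link classification head with a single assumed weight vector `w_cls`; it shows only the data flow, not the trained network:

```python
import numpy as np

def fuse_and_predict(history_feats, target_feat, w_cls):
    """Simplified sketch of steps S43-S48.
    history_feats: (N, D) CNN features of N historical songs.
    target_feat:   (D,)  CNN feature of the target song.
    w_cls:         (2*D,) assumed linear classification head."""
    sims = history_feats @ target_feat          # step S43: dot-product values
    sims = np.exp(sims - sims.max())
    sims = sims / sims.sum()                    # similarities as feature weights
    fused = sims @ history_feats                # step S46: weighted sum -> comprehensive feature
    concat = np.concatenate([fused, target_feat])  # step S47: series feature
    return concat @ w_cls                       # step S48: "full link" classification (sketch)
```

The softmax guarantees the feature weight values sum to one, so the comprehensive feature of the historical songs is a convex combination dominated by the songs most similar to the target song.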
It should be noted that in a specific application scenario, because the historical songs are existing songs, their CNN features may be stored in advance and, when needed, input directly to the neural network model; the model can then obtain the CNN features of the historical songs without computation, which accelerates its processing. Referring to fig. 7, fig. 7 is a schematic diagram of a deployment of the neural network model. In fig. 7, Seed Features represents the CNN features of the historical songs; Audio Backbone represents the network for extracting the CNN features of the target song; Seed Feature Fusion represents the network for calculating the similarity and outputting the target interaction information; Concat represents a concatenation layer; FC represents a full link layer; Pooling represents pooling; Out Product represents the dot product value; Similarity represents the similarity; MelSpectrum represents extraction of the target Mel spectrum feature; ConvBlock represents a convolution block; Average Pooling represents average pooling.
It can be seen that, in this embodiment, the neural network model may determine the similarity between the target song and the history song through the CNN characteristics, and may increase the coupling degree between the similarity and the target interaction information through weighted summation, concatenation, and classification, thereby improving the accuracy of the neural network model in determining the target interaction information according to the similarity.
For convenience of understanding, the song mining method provided by the present application is described with reference to a hot song mining scenario, and assuming that the interaction information based on the song mining process is the song playing completion rate, the song mining method provided by the present application may include the following steps:
acquiring specific information of each song to be played from an original massive user playing record, wherein the specific information comprises a song ID, a user ID, playing time, song time and the like;
taking all songs in the user playing records as sample songs, downloading the sample songs in MP3 format, and calculating the play completion rate of the sample songs, where the play completion rate is the ratio of the number of completed plays to the number of valid plays; a completed play may be a play whose played percentage is greater than or equal to 90%, and a valid play may be a play whose played percentage is greater than or equal to 30%, and the like;
randomly dividing the sample songs into a training set and a verification set in a ratio of 8:2;
in the training set, selecting a training song set and a first song set meeting similar conditions with the training song set, for example, 256 songs in the training set are used as the training song set, and 5000 songs in the training set except the training song set are used as the first song set;
determining the target Mel spectrum feature of the first song set as the first training input information and the target Mel spectrum feature of the training song set as the second training input information, where the size of the first training input information may be R^(M×1×5167×F) with M = 5000, and the size of the second training input information may be R^(B×1×5167×F) with B = 256;
taking the first training input information and the second training input information as the input of an initial neural network model, training the initial neural network model, and inputting the known playing completion rate of the training song set and the predicted playing completion rate of the training song set output by the initial neural network model to a preset loss function to obtain a loss value;
judging whether the loss value has converged; if not, adjusting the network parameters of the initial neural network model according to the loss value and returning to the step of selecting, in the training set, a training song set and a first song set that satisfies the similarity condition with the training song set; if the loss value has converged, finishing the training of the initial neural network model to obtain a pre-trained neural network model and executing the subsequent steps;
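The train-until-converged loop of the two steps above can be sketched as follows. The quadratic one-weight "model" and the fixed batch are toy stand-ins, not the patent's CNN; the convergence test (change in loss below a tolerance) is one common reading of "judging whether the loss value is converged".

```python
def train_until_converged(predict, update, batches, loss_fn,
                          tol=1e-6, max_iters=10_000):
    """Sample a batch, predict, compute loss; stop when the loss stops changing,
    otherwise adjust the parameters from the loss value and repeat."""
    prev = None
    for i in range(max_iters):
        x, y = batches[i % len(batches)]
        loss = loss_fn(predict(x), y)
        if prev is not None and abs(prev - loss) < tol:
            break        # loss has converged: training is finished
        update(loss)     # adjust network parameters according to the loss
        prev = loss
    return prev

# Toy example: fit a single weight w so that w * x ~= y (optimum w = 3).
w = [0.0]
predict = lambda x: w[0] * x
def update(_):
    x, y = 2.0, 6.0                  # the single toy batch
    grad = 2 * (w[0] * x - y) * x    # gradient of the squared error
    w[0] -= 0.05 * grad
final = train_until_converged(predict, update, [(2.0, 6.0)],
                              lambda p, t: (p - t) ** 2)
```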
performing performance evaluation on the pre-trained neural network model by using the verification set; if the performance evaluation result meets the preset requirement, allowing the pre-trained neural network model to be applied, and if it does not, changing the training strategy and continuing to train the neural network model until a pre-trained neural network model that is allowed to be applied is obtained;
after the pre-trained neural network model is obtained, determining a playing completion rate threshold for distinguishing songs of different quality according to the model's predicted playing completion rates on the verification set and the verification set's known playing completion rates; suppose this threshold is TH;
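The patent does not specify how TH is derived from the predicted and known rates. One plausible sketch, offered only as an assumption, is to sweep candidate thresholds over the predicted rates and keep the one whose split best agrees with a quality label derived from the known rates (the 0.5 quality bar below is likewise illustrative):

```python
def choose_threshold(predicted, known, quality_bar=0.5):
    """Pick the threshold over predicted rates that best separates songs
    whose known completion rate meets an assumed quality bar."""
    labels = [k >= quality_bar for k in known]   # "good" songs by known rate
    best_th, best_acc = 0.0, -1.0
    for th in sorted(set(predicted)):
        preds = [p >= th for p in predicted]
        acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_th, best_acc = th, acc
    return best_th
```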
deploying a pre-trained neural network model as shown in fig. 7;
acquiring a target song to be mined;
acquiring history songs that satisfy the similarity condition with the target song;
determining the target mel frequency spectrum features of the history songs as the first target input information, and the target mel frequency spectrum features of the target song as the second target input information;
transmitting the first target input information and the second target input information to a pre-trained neural network model, and acquiring a target playing completion rate of a target song output by the pre-trained neural network model;
determining a mining result for the quality of the target song based on the target playing completion rate and the playing completion rate threshold;
and determining the promotion mode and the corresponding subscription information of the target song according to the mining result.
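The inference steps above can be summarized in one function. All of the helpers (the feature extractor, the model callable) and the promotion labels are hypothetical stand-ins for the patent's components; only the structure — features in, predicted completion rate out, compared against TH — comes from the text.

```python
def mine_song(target_song, history_songs, model, extract_features, th):
    """Mine one target song: build both model inputs, predict the playing
    completion rate, and compare it with the threshold TH."""
    first_input = [extract_features(s) for s in history_songs]   # history songs
    second_input = extract_features(target_song)                 # target song
    rate = model(first_input, second_input)   # predicted completion rate
    result = "high-quality" if rate >= th else "ordinary"
    # The promotion mode follows from the mining result; these labels are
    # illustrative, as the patent leaves the concrete modes unspecified.
    promotion = "priority promotion" if result == "high-quality" else "standard pool"
    return result, promotion
```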
Referring to fig. 8, a song mining apparatus correspondingly disclosed in the embodiments of the present application is applied to a background server, and includes:
a target song obtaining module 11, configured to obtain a target song to be mined;
a history song obtaining module 12, configured to obtain history songs that satisfy the similarity condition with the target song;
a target input information acquisition module 13, configured to determine first target input information based on the history songs, and determine second target input information based on the target songs;
a target interaction information obtaining module 14, configured to transmit the first target input information and the second target input information to the pre-trained neural network model, and obtain target interaction information of the target song output by the pre-trained neural network model;
a mining result determining module 15, configured to determine a mining result of the target song based on the target interaction information;
and the target interaction information is used for representing the interaction result of the user and the target song.
As can be seen, in the present application, a target song to be mined is obtained first, together with history songs that satisfy the similarity condition with the target song; then, first target input information is determined based on the history songs, and second target input information is determined based on the target song; the first and second target input information are transmitted to a pre-trained neural network model, and the target interaction information of the target song output by the model is acquired; finally, a mining result of the target song is determined based on the target interaction information. Because the target interaction information is the model's prediction of how users will interact with the target song, it is inferred from the target song and from history songs similar to it, so no song value index needs to be manually defined. The mining result is adapted to the target interaction information, and since that information reflects users' demand for the target song, the mining result is consistent with users' actual demand for songs, giving good song mining performance.
The CNN characteristic determining module is used for extracting the characteristics of the first target input information based on the learnable CNN network structure to obtain first target CNN characteristics; performing feature extraction on the second target input information based on the learnable CNN network structure to obtain a second target CNN feature;
and the target interaction information determining module is used for calculating the similarity between the first target CNN characteristic and the second target CNN characteristic and determining target interaction information based on the similarity.
In some specific embodiments, the target interaction information determining module may be specifically configured to: calculate a dot product value of the first target CNN feature and the second target CNN feature; concatenate the dot product value and the first target CNN feature into a long feature; calculate, from the long feature through a fully connected layer, the similarity between the target song and each song in the history songs; take each similarity as the feature weight of the corresponding history song, and perform a weighted summation of the first target CNN feature based on these weights to obtain a comprehensive feature of the history songs; concatenate the comprehensive feature and the second target CNN feature to obtain a series feature; and classify the series feature through a fully connected layer to obtain the target interaction information of the target song.
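The sequence above is an attention-like mechanism, and can be sketched at toy scale as follows. The feature width, the number of history songs, and the weight matrices W_sim / W_cls stand in for trained fully connected layers whose sizes the patent does not give; random values replace real CNN features.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 8, 5                              # feature width, number of history songs

first = rng.standard_normal((M, d))      # first target CNN features (history songs)
second = rng.standard_normal(d)          # second target CNN feature (target song)
W_sim = rng.standard_normal((d + 1, 1))  # fully connected layer -> similarities
W_cls = rng.standard_normal((2 * d, 1))  # fully connected layer -> output

dots = first @ second                                        # dot products, (M,)
long_feat = np.concatenate([first, dots[:, None]], axis=1)   # long feature, (M, d+1)
sim = (long_feat @ W_sim).ravel()        # similarity of target to each history song
composite = sim @ first                  # similarity-weighted sum -> comprehensive feature
series = np.concatenate([composite, second])  # series feature, (2d,)
score = float(series @ W_cls)            # classified into an interaction score
```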
Referring to fig. 9, an embodiment of the present application further discloses a training apparatus for a neural network model, applied to a background server, and including:
a sample song obtaining module 111, configured to obtain a sample song with known interaction information;
a training set dividing module 112, configured to divide a training set from the sample song;
a first song set selecting module 113, configured to select, in the training set, a training song set and a first song set that satisfies the similarity condition with the training song set;
a training input information determination module 114 to determine first training input information based on the first set of songs, and second training input information based on the training set of songs;
a loss value obtaining module 115, configured to take the first training input information and the second training input information as inputs of an initial neural network model, train the initial neural network model, and input known interaction information of the training song set and predicted interaction information of the training song set output by the initial neural network model to a preset loss function, so as to obtain a loss value;
an adjusting module 116, configured to determine whether the loss value converges; if the loss value does not converge, adjust the network parameters of the initial neural network model according to the loss value and prompt the first song set selecting module to again select, in the training set, a training song set and a first song set that satisfies the similarity condition with the training song set; and if the loss value converges, finish the training of the initial neural network model to obtain a pre-trained neural network model, and perform song mining based on the pre-trained neural network model.
In some embodiments, the training device of the neural network model may further include:
the verification set dividing module is used for dividing a verification set in the sample song after the pre-trained neural network model is obtained by the adjusting module;
the performance evaluation module is used for evaluating the performance of the pre-trained neural network model based on the verification set to obtain a performance evaluation result;
the judging module is used for judging whether the performance evaluation result meets a preset requirement or not; if the performance evaluation result meets the preset requirement, allowing the pre-trained neural network model to be applied; and if the performance evaluation result does not meet the preset requirement, changing a training strategy to continue training the pre-trained neural network model.
In some embodiments, the performance evaluation module may include:
a song set selecting unit, configured to select, in the verification set, a verification song set and a second song set that satisfies the similarity condition with the verification song set;
a verification input information determination unit, configured to determine first verification input information based on the second song set, and second verification input information based on the verification song set;
a prediction interaction information obtaining unit, configured to obtain the prediction interaction information output by the pre-trained neural network model by using the first verification input information and the second verification input information as inputs of the pre-trained neural network model;
a performance evaluation result determination unit, configured to determine the performance evaluation result based on the predicted interaction information and the known interaction information of the verification song set.
In some embodiments, the training input information determination module may include:
a first training input information determining unit, configured to determine a target mel-frequency spectrum feature of the first song set, and determine the target mel-frequency spectrum feature of the first song set as the first training input information;
and a second training input information determining unit configured to determine a target mel-frequency spectrum feature of the training set, and determine the target mel-frequency spectrum feature of the training set as the second training input information.
In some embodiments, the training device of the neural network model may include a target mel frequency spectrum determination module, configured to: perform a short-time Fourier transform on the audio of the song to obtain a short-time Fourier transform result; perform mel frequency spectrum coefficient conversion on the short-time Fourier transform result to obtain an initial mel frequency spectrum feature; and truncate the initial mel frequency spectrum feature to obtain the target mel frequency spectrum feature of the song.
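The STFT-to-mel-to-truncation pipeline can be sketched compactly. This is a deliberately crude stand-in: the "mel" step below just averages adjacent frequency bins instead of applying a real mel filter bank, the window/hop sizes are assumed, and toy frame counts replace the embodiment's 5167 frames.

```python
import numpy as np

def stft_mag(y, n_fft=1024, hop=512):
    """Magnitude of a simple short-time Fourier transform (Hann window)."""
    win = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * win
              for i in range(0, len(y) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))  # (frames, freq bins)

def mel_like(mag, n_mels=8):
    """Crude mel-style reduction: average adjacent frequency bins.
    (A real implementation uses a mel filter bank.)"""
    bands = np.array_split(mag, n_mels, axis=1)
    return np.stack([b.mean(axis=1) for b in bands], axis=1)

def target_mel(y, n_frames=16):
    """STFT -> mel-style conversion -> truncate to a fixed frame count."""
    m = mel_like(stft_mag(y))
    return m[:n_frames]
```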
Furthermore, an embodiment of the present application provides an electronic device. FIG. 10 is a block diagram of an electronic device 20 according to an exemplary embodiment; nothing in the figure should be taken as a limitation on the scope of use of the present application.
Fig. 10 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is used for storing a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the song mining method or the neural network model training method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically a server.
In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
In addition, the memory 22, as a carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like; the resources stored thereon may include an operating system 221, a computer program 222, data 223, and the like, and the storage may be transient or permanent.
The operating system 221, which may be Windows Server, Netware, Unix, Linux, or the like, manages and controls the hardware devices and the computer program 222 on the electronic device 20, enabling the processor 21 to operate on and process the mass data 223 in the memory 22. In addition to the computer program for performing the song mining method or the neural network model training method disclosed in any of the foregoing embodiments, the computer program 222 may include computer programs for other specific tasks. The data 223 may include various data collected by the electronic device 20.
Further, an embodiment of the present application further discloses a storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, the steps of the song mining method or the neural network model training method disclosed in any of the foregoing embodiments are implemented.
The computer-readable storage media to which this application relates include random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
For a description of a relevant part in the song mining apparatus, the electronic device, and the computer-readable storage medium provided in the embodiment of the present application, reference is made to detailed descriptions of a corresponding part in the song mining method provided in the embodiment of the present application, and details are not repeated here. In addition, parts of the above technical solutions provided in the embodiments of the present application, which are consistent with the implementation principles of corresponding technical solutions in the prior art, are not described in detail so as to avoid redundant description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A song mining method, comprising:
acquiring a target song to be mined;
acquiring historical songs meeting similar conditions with the target songs;
determining first target input information based on the historical songs, and determining second target input information based on the target songs;
transmitting the first target input information and the second target input information to a pre-trained neural network model, and acquiring target interaction information of the target song output by the pre-trained neural network model;
determining a mining result of the target song based on the target interaction information;
and the target interaction information is used for representing the interaction result of the user and the target song.
2. The method of claim 1, wherein the pre-trained neural network model outputs the target interaction information based on the first target input information and the second target input information, comprising:
performing feature extraction on the first target input information based on a learnable CNN network structure to obtain a first target CNN feature;
performing feature extraction on the second target input information based on the learnable CNN network structure to obtain a second target CNN feature;
and calculating the similarity between the first target CNN characteristic and the second target CNN characteristic, and determining the target interaction information based on the similarity.
3. The method of claim 2, wherein the calculating a similarity between the first target CNN feature and the second target CNN feature and determining the target interaction information based on the similarity comprises:
calculating a dot product value of the first target CNN feature and the second target CNN feature;
concatenating the dot product value and the first target CNN feature into a long feature;
calculating, from the long feature through a fully connected layer, the similarity between the target song and each song in the historical songs;
taking the similarity as a characteristic weight value of a corresponding song in the historical song, and carrying out weighted summation on the first target CNN characteristic based on the characteristic weight value to obtain a comprehensive characteristic of the historical song;
connecting the comprehensive characteristic and the second target CNN characteristic in series to obtain a series characteristic;
and classifying the series feature through a fully connected layer to obtain the target interaction information of the target song.
4. A song mining apparatus, comprising:
the target song acquisition module is used for acquiring a target song to be mined;
a historical song acquisition module used for acquiring historical songs meeting similar conditions with the target songs;
the target input information acquisition module is used for determining first target input information based on the historical songs and determining second target input information based on the target songs;
the target interaction information acquisition module is used for transmitting the first target input information and the second target input information to a pre-trained neural network model and acquiring target interaction information of the target song output by the pre-trained neural network model;
the mining result determining module is used for determining the mining result of the target song based on the target interaction information;
and the target interaction information is used for representing the interaction result of the user and the target song.
5. A training method of a neural network model is characterized by comprising the following steps:
acquiring a sample song with known interaction information;
dividing a training set in the sample song;
selecting a training song set and a first song set which meets similar conditions with the training song set from the training set;
determining first training input information based on the first set of songs, and determining second training input information based on the training set of songs;
taking the first training input information and the second training input information as the input of an initial neural network model, training the initial neural network model, and inputting the known interaction information of the training song set and the predicted interaction information of the training song set output by the initial neural network model to a preset loss function to obtain a loss value;
judging whether the loss value is converged, if not, adjusting the network parameters of the initial neural network model according to the loss value, returning to execute the step of selecting a training song set and a first song set meeting similar conditions with the training song set; and if the loss value is converged, finishing the training of the initial neural network model to obtain a pre-trained neural network model.
6. The method of claim 5, wherein after obtaining the pre-trained neural network model, further comprising:
dividing a verification set in the sample song;
performing performance evaluation on the pre-trained neural network model based on the verification set to obtain a performance evaluation result;
judging whether the performance evaluation result meets a preset requirement or not;
if the performance evaluation result meets the preset requirement, allowing the pre-trained neural network model to be applied;
and if the performance evaluation result does not meet the preset requirement, changing a training strategy to continue training the pre-trained neural network model.
7. The method of claim 6, wherein said performing a performance evaluation on said pre-trained neural network model based on said validation set, resulting in a performance evaluation result, comprises:
selecting a verification song set and a second song set which meets the similar condition with the verification song set in the verification set;
determining first verification input information based on the second song set, and determining second verification input information based on the verification song set;
taking the first verification input information and the second verification input information as the input of the pre-trained neural network model, and acquiring the prediction interaction information output by the pre-trained neural network model;
determining the performance evaluation result based on the predicted interaction information and the known interaction information of the verification song set.
8. The method of claim 5, wherein determining first training input information based on the first set of songs and second training input information based on the training set comprises:
determining a target Mel frequency spectrum characteristic of the first song set, and determining the target Mel frequency spectrum characteristic of the first song set as the first training input information;
and determining the target Mel frequency spectrum characteristics of the training set, and determining the target Mel frequency spectrum characteristics of the training set as the second training input information.
9. The method of claim 8, wherein the determining the target mel-frequency spectrum characteristic of the song comprises:
carrying out short-time Fourier transform on the audio of the song to obtain a short-time Fourier transform result;
carrying out Mel frequency spectrum coefficient conversion on the short-time Fourier transform result to obtain an initial Mel frequency spectrum characteristic;
and truncating the initial Mel frequency spectrum characteristic to obtain a target Mel frequency spectrum characteristic of the song.
10. An apparatus for training a neural network model, comprising:
the sample song acquisition module is used for acquiring sample songs with known interactive information;
the training set dividing module is used for dividing a training set from the sample songs;
the first song set selection module is used for selecting, in the training set, a training song set and a first song set that satisfies the similarity condition with the training song set;
a training input information determination module for determining first training input information based on the first set of songs, and second training input information based on the training set of songs;
a loss value obtaining module, configured to take the first training input information and the second training input information as inputs of an initial neural network model, train the initial neural network model, and input known interaction information of the training song set and predicted interaction information of the training song set output by the initial neural network model to a preset loss function, so as to obtain a loss value;
an adjusting module, configured to determine whether the loss value converges; if the loss value does not converge, adjust the network parameters of the initial neural network model according to the loss value and prompt the first song set selection module to again select, in the training set, a training song set and a first song set that satisfies the similarity condition with the training song set; and if the loss value converges, finish the training of the initial neural network model to obtain a pre-trained neural network model, and perform song mining based on the pre-trained neural network model.
CN202011124015.0A 2020-10-20 2020-10-20 Neural network model training method and song mining method and device Pending CN112231511A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011124015.0A CN112231511A (en) 2020-10-20 2020-10-20 Neural network model training method and song mining method and device


Publications (1)

Publication Number Publication Date
CN112231511A true CN112231511A (en) 2021-01-15

Family

ID=74118073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011124015.0A Pending CN112231511A (en) 2020-10-20 2020-10-20 Neural network model training method and song mining method and device

Country Status (1)

Country Link
CN (1) CN112231511A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080133441A1 (en) * 2006-12-01 2008-06-05 Sun Microsystems, Inc. Method and system for recommending music
CN108595550A (en) * 2018-04-10 2018-09-28 南京邮电大学 A kind of music commending system and recommendation method based on convolutional neural networks
CN108984731A (en) * 2018-07-12 2018-12-11 腾讯音乐娱乐科技(深圳)有限公司 Sing single recommended method, device and storage medium
CN109408665A (en) * 2018-12-29 2019-03-01 咪咕音乐有限公司 A kind of information recommendation method and device, storage medium
CN109858625A (en) * 2019-02-01 2019-06-07 北京奇艺世纪科技有限公司 Model training method and equipment, prediction technique and equipment, data processing equipment, medium
CN111563593A (en) * 2020-05-08 2020-08-21 北京百度网讯科技有限公司 Training method and device of neural network model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination