CN111737414A - Song recommendation method and device, server and storage medium

Publication number
CN111737414A
Authority
CN
China
Prior art keywords
emotion, song, information, label, determining
Legal status
Pending
Application number
CN202010500439.6A
Other languages
Chinese (zh)
Inventor
徐东
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202010500439.6A
Publication of CN111737414A

Classifications

    • G06F16/635: Information retrieval of audio data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F16/3344: Information retrieval of unstructured textual data; query execution using natural language analysis
    • G06F40/30: Handling natural language data; semantic analysis
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G10L15/26: Speech recognition; speech-to-text systems

Abstract

The embodiment of the application discloses a song recommendation method and apparatus, a server, and a storage medium. The song recommendation method includes: matching received retrieval information against the emotion words in an emotion text database to obtain an emotion word matched with the retrieval information, and determining the matched emotion word as an emotion keyword; acquiring at least one emotion label of each song in a song set to be selected, where the song set to be selected comprises at least one song; determining the emotion category corresponding to the emotion keyword in the song set to be selected, determining the at least one emotion label in the song set to be selected that belongs to that emotion category as at least one emotion label matched with the emotion keyword, determining the at least one song corresponding to the at least one emotion label as an initial song set, and screening a recommended song set out of the initial song set based on song attributes. The method and apparatus improve the user's song-listening experience and reduce the manual curation cost of the song library.

Description

Song recommendation method and device, server and storage medium
Technical Field
The present application relates to the field of music recommendation systems, and in particular, to a song recommendation method and apparatus, a server, and a storage medium.
Background
At present, when listening to songs, a user usually selects songs by text-searching for song titles, singer names, album names, and the like. This works for exact matching, but because song titles, singer names, and album names carry no information about the emotion type of the song content, a user who wants to obtain songs of a specific emotion type accurately and quickly cannot be satisfied in this way. How to accurately push songs of the emotion types that interest a user, based on the user's input information, is therefore a problem to be solved.
Disclosure of Invention
The embodiment of the application provides a song recommendation method and apparatus, a server, and a storage medium, so that a user can quickly obtain songs of the emotion types that interest them, which improves the user's song-listening experience and reduces the manual curation cost of the song library.
In a first aspect, an embodiment of the present application provides a song recommendation method, including:
matching the received retrieval information with emotion words in an emotion text database to obtain emotion words matched with the retrieval information, and determining the emotion words matched with the retrieval information as emotion keywords;
acquiring at least one emotion label of each song in a song set to be selected, wherein the song set to be selected comprises at least one song, and the at least one emotion label comprises at least one emotion label of at least one emotion category;
determining an emotion category corresponding to the emotion keyword in the song set to be selected, determining at least one emotion label belonging to the emotion category in the song set to be selected as at least one emotion label matched with the emotion keyword, determining at least one song corresponding to the at least one emotion label as an initial song set, and screening a recommended song set from the initial song set based on song attributes.
Optionally, before the acquiring of the at least one emotion label of each song in the song set to be selected, the method includes:
extracting emotion text information and melody information of each song in a sample song set to obtain emotion text information and melody information of the sample song set, wherein each song in the sample song set carries a corresponding actual emotion label;
training an initial convolution cyclic neural network model according to emotion text information and melody information of the sample song set and actual emotion labels carried by each song corresponding to the emotion text information and the melody information to obtain a first convolution cyclic neural network model and a predicted emotion label of each song in the sample song set;
adjusting the first convolution recurrent neural network model according to the predicted emotion label and the actual emotion label of each song in the sample song set;
and when the adjusted first convolution recurrent neural network model meets the convergence condition, determining the adjusted first convolution recurrent neural network model as an emotion label prediction model.
Wherein the emotion label prediction model comprises a convolutional layer, a recurrent layer and a transcription layer;
optionally, after determining the adjusted first convolutional recurrent neural network model as an emotion label prediction model, the method includes:
extracting emotion text information and melody information of each song in a song set to be selected to obtain emotion text information and melody information of the song set to be selected, and inputting the emotion text information and melody information of the song set to be selected into the emotion label prediction model;
extracting features of the emotion text information and melody information of the song set to be selected through the convolutional layer to obtain an emotion feature sequence of the song set to be selected;
predicting the emotion feature sequence of the song set to be selected through the recurrent layer to obtain a predicted sequence of the song set to be selected;
and converting the predicted sequence of the song set to be selected into an emotion label sequence of the song set to be selected through the transcription layer, so as to obtain at least one emotion label of each song in the song set to be selected.
Optionally, the extracting emotional text information of each song in the set of songs to be selected includes:
the method comprises the steps of obtaining a lyric text of a target song in a song collection to be selected, splitting the lyric text of the target song into at least one lyric word, calculating the degree of correlation between the target lyric word in the at least one lyric word and each emotion word in an emotion text database, obtaining a plurality of degree of correlation values between the target lyric word and each emotion word, determining the maximum value of the degree of correlation values as the emotion score of the target lyric word, thus obtaining the emotion score of each lyric word in the target song, and determining the lyric word with the emotion score larger than a first preset threshold value as the emotion text information of the target song.
Optionally, the method further includes:
calculating a first matching value of the emotion keyword and each emotion label in the song set to be selected, determining the emotion label with the first matching value being greater than or equal to a second preset threshold value as at least one emotion label matched with the emotion keyword, determining at least one song corresponding to the at least one emotion label matched with the emotion keyword as an initial song set, and screening the recommended song set from the initial song set based on song attributes.
Optionally, the song attributes include song playing times and song online time;
the screening of the recommended song set from the initial song set based on song attributes includes:
sorting the songs in the initial song set whose playing times are greater than a preset playing-times threshold and whose online time is less than a preset online-time threshold, in descending order of playing times, to obtain the recommended song set.
Wherein the retrieval information comprises voice information;
optionally, the matching the received search information with the emotion words in the emotion text database to obtain emotion words matched with the search information, and determining the emotion words matched with the search information as emotion keywords includes:
detecting the language type of the voice information, acquiring a voice standardization model matched with the language type, and converting the voice information into standardized voice information through the voice standardization model;
acquiring a voice-to-text conversion model matched with the language type, and converting the standardized voice information into retrieval text information through the voice-to-text conversion model;
and matching the retrieval text information with the emotion words in an emotion text database to obtain an emotion word matched with the retrieval text information, and determining the emotion word matched with the retrieval text information as the emotion keyword.
In a second aspect, an embodiment of the present application provides a song recommending apparatus, including:
the receiving and matching unit is used for matching the received retrieval information with the emotion words in the emotion text database to obtain the emotion words matched with the retrieval information, and determining the emotion words matched with the retrieval information as emotion keywords;
the acquisition unit is used for acquiring at least one emotion label of each song in a song set to be selected, wherein the song set to be selected comprises at least one song, and the at least one emotion label comprises at least one emotion label of at least one emotion category;
and the determining and screening unit is used for determining an emotion category corresponding to the emotion keyword in the song set to be selected, determining at least one emotion label belonging to the emotion category in the song set to be selected as at least one emotion label matched with the emotion keyword, determining at least one song corresponding to the at least one emotion label as an initial song set, and screening the recommended song set from the initial song set based on song attributes.
Optionally, the apparatus further comprises:
the extraction unit is used for extracting emotion text information and melody information of each song in a sample song set to obtain the emotion text information and melody information of the sample song set, wherein each song in the sample song set carries a corresponding actual emotion label;
the model training unit is used for training the initial convolution cyclic neural network model according to emotion text information and melody information of the sample song set and actual emotion labels carried by each song corresponding to the emotion text information and the melody information, so as to obtain a first convolution cyclic neural network model and a predicted emotion label of each song in the sample song set;
the model adjusting unit is used for adjusting the first convolution recurrent neural network model according to the predicted emotion label and the actual emotion label of each song in the sample song set;
and the model determining unit is used for determining the adjusted first convolution recurrent neural network model as the emotion label prediction model when the adjusted first convolution recurrent neural network model meets the convergence condition.
Wherein the emotion label prediction model comprises a convolutional layer, a recurrent layer and a transcription layer;
optionally, the apparatus further comprises:
the extraction and input unit is used for extracting emotion text information and melody information of each song in a song set to be selected to obtain the emotion text information and melody information of the song set to be selected, and inputting the emotion text information and melody information of the song set to be selected into the emotion label prediction model;
the characteristic extraction unit is used for extracting the characteristics of the emotion text information and the melody information of the song set to be selected through the convolutional layer to obtain an emotion characteristic sequence of the song set to be selected;
the prediction unit is used for predicting the emotion feature sequence of the song set to be selected through the recurrent layer to obtain a prediction sequence of the song set to be selected;
and the transcription unit is used for converting the predicted sequence of the song set to be selected into an emotion label sequence of the song set to be selected through the transcription layer, so that at least one emotion label of each song in the song set to be selected is obtained.
Optionally, the extraction input unit is specifically configured to:
the method comprises the steps of obtaining a lyric text of a target song in a song collection to be selected, splitting the lyric text of the target song into at least one lyric word, calculating the degree of correlation between the target lyric word in the at least one lyric word and each emotion word in an emotion text database, obtaining a plurality of degree of correlation values between the target lyric word and each emotion word, determining the maximum value of the degree of correlation values as the emotion score of the target lyric word, thus obtaining the emotion score of each lyric word in the target song, and determining the lyric word with the emotion score larger than a first preset threshold value as the emotion text information of the target song.
Optionally, the apparatus further comprises:
and the calculation determination screening unit is used for calculating a first matching value of the emotion keyword and each emotion label in the song set to be selected, determining the emotion label of which the first matching value is greater than or equal to a second preset threshold value as at least one emotion label matched with the emotion keyword, determining at least one song corresponding to the at least one emotion label matched with the emotion keyword as an initial song set, and screening the recommended song set from the initial song set based on song attributes.
Optionally, the song attributes include song playing times and song online time;
the determining and screening unit or the calculation determination screening unit is specifically configured to: sort the songs in the initial song set whose playing times are greater than a preset playing-times threshold and whose online time is less than a preset online-time threshold, in descending order of playing times, to obtain the recommended song set.
Wherein the retrieval information comprises voice information;
optionally, the receiving matching unit is specifically configured to:
detecting the language type of the voice information, acquiring a voice standardization model matched with the language type, and converting the voice information into standardized voice information through the voice standardization model;
acquiring a voice-to-text conversion model matched with the language type, and converting the standardized voice information into retrieval text information through the voice-to-text conversion model;
and matching the retrieved text information with emotion words in an emotion text database to obtain emotion words matched with the retrieved text information, and determining the emotion words matched with the retrieved text information as emotion keywords.
In a third aspect, an embodiment of the present application provides a server, including a processor, a memory, and a transceiver connected to one another, where the memory is used to store a computer program that enables the server to execute the above song recommendation method, the computer program comprising program instructions; the processor is configured to invoke the program instructions to perform the song recommendation method described in the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a storage medium storing a computer program, the computer program comprising program instructions; when executed by a processor, the program instructions cause the processor to perform the song recommendation method described in the first aspect of the embodiments of the present application.
In the embodiment of the application, the received retrieval information is matched against the emotion words in the emotion text database to obtain the emotion word matched with the retrieval information, and that emotion word is determined as the emotion keyword; at least one emotion label of each song in the song set to be selected is acquired, where the song set to be selected comprises at least one song; the emotion category corresponding to the emotion keyword in the song set to be selected is determined, the at least one emotion label belonging to that category is determined as the at least one emotion label matched with the emotion keyword, the at least one corresponding song is determined as the initial song set, and the recommended song set is screened out of the initial song set based on song attributes. In this way, the user can quickly obtain songs of the emotion types that interest them, the user's song-listening experience is improved, and the manual curation cost of the song library is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a diagram of one possible system architecture provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a song recommendation method according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of another song recommendation method according to an embodiment of the present application;
FIGS. 4a and 4b are schematic diagrams of human-computer interaction interfaces for prompting input according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a song recommending apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments herein without creative effort shall fall within the protection scope of the present application.
Referring to FIG. 1, a diagram of one possible system architecture to which the present application is applicable is shown. As shown in FIG. 1, the system architecture includes user terminals and a server, where the user terminals comprise a plurality of terminals, each corresponding to a user. A first user terminal presents an input prompt message to a first user through a human-computer interaction interface, and the first user inputs retrieval information according to the prompt. The first user terminal obtains the retrieval information and sends it to the server. The server matches the received retrieval information against an emotion text database to obtain an emotion keyword matched with the retrieval information, matches the emotion keyword against each emotion label in a preset song set to be selected to obtain at least one emotion label matched with the emotion keyword, determines a recommended song set from the at least one song corresponding to the matched emotion labels, and sends the recommended song set to the first user terminal, which displays it through the human-computer interaction interface.
The user terminal may be a user equipment (UE), such as a handheld terminal, a notebook computer, a subscriber unit, a cellular phone, a smartphone, a personal digital assistant (PDA), a tablet computer, a wireless modem, a laptop computer, a cordless phone, a wireless local loop (WLL) station, a machine type communication (MTC) terminal, or another device.
Please refer to FIG. 2, which is a schematic flowchart of a song recommendation method according to an embodiment of the present application. As shown in FIG. 2, the method embodiment comprises the following steps:
s101, matching the received retrieval information with each emotion word in an emotion text database to obtain emotion words matched with the retrieval information, and determining the emotion words matched with the retrieval information as emotion keywords.
Wherein the retrieval information may comprise text information and the mood text database comprises at least one mood word.
For the specific implementation of how the server receives the retrieval information in this method embodiment, refer to the description of the subsequent embodiment.
In an optional embodiment, the server calculates a matching value between the retrieval information and each emotion word in the emotion text database, and if the matching value of a target emotion word in the emotion text database is the largest, determines the target emotion word as the emotion keyword matched with the retrieval information.
Specifically, if the retrieval text information consists of the words A, B and C and an emotion word in the emotion text database is a, the matching degree between the retrieval text information and the emotion word a may be calculated as the sum of the matching degrees between each of A, B, C and a, where the matching degree between a single retrieval word and a single emotion word may be looked up in a preset matching degree table. For example, assume the retrieval text information includes "excited" and "happy", and the emotion words in the emotion text database include "joy", "excitement" and "sadness". If the matching values of "excited" and "happy" with the emotion word "joy" are 90 and 95, with "excitement" are 100 and 90, and with "sadness" are 10 and 0, then the total matching values for "joy", "excitement" and "sadness" are 185, 190 and 10, respectively, and the emotion word "excitement", which has the largest matching value, is determined as the emotion keyword.
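A minimal sketch of this keyword-selection step, assuming a preset pairwise matching table; the words and values below are the hypothetical numbers from the example above:

```python
# Hypothetical pairwise matching-degree table, as described above.
MATCH_TABLE = {
    ("excited", "joy"): 90, ("happy", "joy"): 95,
    ("excited", "excitement"): 100, ("happy", "excitement"): 90,
    ("excited", "sadness"): 10, ("happy", "sadness"): 0,
}

def emotion_keyword(retrieval_words, emotion_words):
    # Sum the per-word matching values for each emotion word and
    # return the emotion word with the largest total.
    totals = {e: sum(MATCH_TABLE.get((w, e), 0) for w in retrieval_words)
              for e in emotion_words}
    return max(totals, key=totals.get)

# "excitement" (190) beats "joy" (185) and "sadness" (10).
print(emotion_keyword(["excited", "happy"], ["joy", "excitement", "sadness"]))
```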
S102, acquiring at least one emotion label of each song in a song set to be selected, wherein the song set to be selected comprises at least one song, and the at least one emotion label comprises at least one emotion label of at least one emotion category.
Before executing step S102, the server may obtain the emotion label prediction model by means of a convolutional recurrent neural network.
In a possible embodiment, before obtaining at least one emotion label of each song in the candidate song set, the method includes:
extracting emotion text information and melody information of each song in a sample song set to obtain emotion text information and melody information of the sample song set, wherein each song in the sample song set carries a corresponding actual emotion label;
training an initial convolution cyclic neural network model according to emotion text information and melody information of the sample song set and actual emotion labels carried by each song corresponding to the emotion text information and the melody information to obtain a first convolution cyclic neural network model and a predicted emotion label of each song in the sample song set;
adjusting the first convolution recurrent neural network model according to the predicted emotion label and the actual emotion label of each song in the sample song set;
and when the adjusted first convolution recurrent neural network model meets the convergence condition, determining the adjusted first convolution recurrent neural network model as an emotion label prediction model.
The emotion text information of each song may be lyric words with emotion in a lyric text corresponding to the song audio, the melody information may be Musical Instrument Digital Interface (MIDI) information of the song audio, and the convergence condition may include that the output accuracy of the model reaches a preset accuracy.
Specifically, the server randomly selects a sample song set from a song library, where the sample song set comprises at least one song and the lyric text of each song, and the actual emotion label of each song in the sample song set is obtained through manual classification. The server then extracts the emotion text information and MIDI information of each song in the sample song set (the specific implementation is the same as that described below for determining the at least one emotion label of each song in the song set to be selected according to the emotion label prediction model, and is not repeated here), obtains the emotion text information and MIDI information of the sample set, and divides the sample set into a training set and a validation set at a certain ratio. For example, if the sample song set has 1000 songs and the lyric text corresponding to each song, and manual classification yields 300 songs with positive emotion, 400 with neutral emotion and 300 with negative emotion, dividing at a ratio of 8:2 gives a training set of 800 songs and a validation set of 200 songs.
Then, the server inputs the emotion text information and MIDI information of the training set, together with the actual emotion label carried by each song in the training set, into an initial convolutional recurrent neural network model for training, obtaining a first convolutional recurrent neural network model. The server inputs the emotion text information and MIDI information of the validation set into the first convolutional recurrent neural network model to obtain the predicted emotion labels of the validation set, calculates the proportion of validation songs whose predicted emotion label is consistent with the actual emotion label, and judges from this proportion whether the first convolutional recurrent neural network model has reached the convergence condition. For example, if the computed proportion is 75% while the preset accuracy is 95%, the model has not converged and continues to be adjusted; once the proportion reaches 95%, the first convolutional recurrent neural network model at that moment is determined as the emotion label prediction model. The emotion label prediction model comprises a convolutional layer, a recurrent layer and a transcription layer.
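A minimal training-loop sketch of this convergence check, assuming a PyTorch classifier model and data loaders yielding (features, label) batches; the 95% target and all names here are illustrative assumptions, not the patent's exact procedure:

```python
import torch
from torch import nn

def train_until_converged(model, train_loader, val_loader,
                          target_acc=0.95, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        model.train()
        for features, labels in train_loader:   # emotion text + MIDI features
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()                     # backpropagation adjustment
            optimizer.step()
        # Validation: proportion of predicted labels consistent with actual.
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for features, labels in val_loader:
                pred = model(features).argmax(dim=1)
                correct += (pred == labels).sum().item()
                total += labels.numel()
        if correct / total >= target_acc:       # convergence condition met
            return model                        # emotion label prediction model
    return model
```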
After obtaining the emotion label prediction model, the server determines at least one emotion label of each song in the song set to be selected according to the emotion label prediction model in the following specific implementation manner:
extracting emotion text information and melody information of each song in a song set to be selected to obtain emotion text information and melody information of the song set to be selected, and inputting the emotion text information and melody information of the song set to be selected into the emotion label prediction model;
extracting features of the emotion text information and melody information of the song set to be selected through the convolutional layer to obtain an emotion feature sequence of the song set to be selected;
predicting the emotion feature sequence of the song set to be selected through the recurrent layer to obtain a predicted sequence of the song set to be selected;
and converting the predicted sequence of the song set to be selected into an emotion label sequence of the song set to be selected through the transcription layer, so as to obtain at least one emotion label of each song in the song set to be selected.
Wherein, the extracting of the emotion text information of each song in the song collection to be selected comprises:
the method comprises the steps of obtaining a lyric text of a target song in a song collection to be selected, splitting the lyric text of the target song into at least one lyric word, calculating the degree of correlation between the target lyric word in the at least one lyric word and each emotion word in an emotion text database, obtaining a plurality of degree of correlation values between the target lyric word and each emotion word, determining the maximum value of the degree of correlation values as the emotion score of the target lyric word, thus obtaining the emotion score of each lyric word in the target song, and determining the lyric word with the emotion score larger than a first preset threshold value as the emotion text information of the target song.
Specifically, the server obtains a lyric text of the target song from the song collection to be selected according to the song ID (such as the name of the song, the singer, and the like) of the target song, splits the lyric text to obtain at least one lyric word, and calculates a degree of correlation between the target lyric word in the target song and each emotion word in the emotion text database, wherein the calculation formula is as follows:
χ²(T, cᵢ) = N(AD - BC)² / [(A+B)(C+D)(A+C)(B+D)], where N = A + B + C + D
in the formula, A represents the number of lyric samples that contain the target lyric word T and belong to the emotion word cᵢ, B represents the number of lyric samples that contain the target lyric word T but do not belong to the emotion word cᵢ, C represents the number of lyric samples that do not contain the target lyric word T but belong to the emotion word cᵢ, and D represents the number of lyric samples that neither contain the target lyric word T nor belong to the emotion word cᵢ.
Then, the maximum of the multiple correlation values of the target lyric word is determined as the emotion score of the target lyric word. The emotion score of each lyric word in the target song is obtained in the same way, the lyric words of the target song are sorted by emotion score, and the first k lyric words whose emotion scores are greater than or equal to a first preset threshold are determined as the emotion text information of the target song, i.e., a k-dimensional emotion text information of the target song. Further, the emotion text information of each song in the song set to be selected is obtained in this manner.
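A hedged sketch of this lyric-word scoring, reading the reconstructed formula above as the standard chi-square correlation statistic over the A/B/C/D counts; the counts table, k and threshold are assumptions for illustration:

```python
def chi_square(a, b, c, d):
    # Correlation of a lyric word with an emotion word from contingency counts.
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def emotion_text_info(lyric_words, emotion_words, counts, k, threshold):
    # counts[(t, c)] -> (A, B, C, D) for lyric word t and emotion word c.
    scores = {t: max(chi_square(*counts[(t, c)]) for c in emotion_words)
              for t in lyric_words}
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Keep the first k words whose emotion score reaches the first threshold.
    return [t for t in ranked[:k] if scores[t] >= threshold]
```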
Further, the server may process the songs in the song set to be selected through MIDI conversion software to obtain the MIDI information, i.e., melody information, of each song in the song set to be selected, where the melody information includes the timing, frequency, and the like of the song.
Then, the emotion text information and melody information of the song set to be selected are input into the emotion label prediction model, and features are extracted through the convolutional layer of the model. The convolutional layer can be understood as a standard convolutional neural network without the fully connected layer: the emotion text information and melody information first enter the convolution sublayer, where a small part of the input information is randomly selected as a sample, some feature information is learned from this small sample, and a convolution operation is then performed between the learned feature information and the whole input. After the convolution operation, the features of the input information have been extracted, but their number is large; to reduce the amount of computation, a pooling operation is needed, that is, the features extracted by convolution are passed to the pooling sublayer for aggregation statistics, whose output is of a far lower order of magnitude than the convolution features while also improving the classification effect. The commonly used pooling methods are average pooling and max pooling: average pooling computes an average feature of a feature set to represent that set, and max pooling takes the maximum feature to represent it. Through the convolution processing and pooling processing, the static structural feature information of the input can be extracted, giving the emotion feature sequence of the song set to be selected.
Thereafter, the emotion feature sequence of the song set to be selected is input into the recurrent layer of the emotion label prediction model, which can be understood as a recurrent neural network. Although an ordinary recurrent neural network can model time series, its ability to learn long-term dependencies is limited: if the current output depends on a sequence far in the past, the dependency is hard to learn, because over a long sequence the gradient may vanish or explode. Therefore, in this embodiment a special recurrent neural network, a Long Short-Term Memory (LSTM) model, is selected to predict the emotion feature sequence of the song set to be selected and obtain its predicted sequence, i.e., T probability vectors, where each vector consists of the probabilities corresponding to N different words in a dictionary; the predicted sequence can be understood as a probability matrix of T rows and N columns, where T is the total number of songs in the song set to be selected.
Then, the probability matrix of T rows and N columns is sent to the transcription layer, which can be understood as using a Connectionist Temporal Classification (CTC) algorithm to find, from the probability vector of each song in the matrix, the emotion label with the highest probability for that song, i.e., the emotion label of each song.
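The three layers described above could be sketched as the following PyTorch module; the channel sizes, label count and the greedy argmax stand-in for CTC decoding are illustrative assumptions, not the patent's exact architecture:

```python
import torch
from torch import nn

class EmotionCRNN(nn.Module):
    def __init__(self, in_channels=1, hidden=128, n_labels=32):
        super().__init__()
        # Convolutional layer with pooling: extracts static structural features.
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),                    # max-pooling aggregation
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)  # recurrent layer
        self.fc = nn.Linear(hidden, n_labels)

    def forward(self, x):                       # x: (batch, channels, length)
        feats = self.conv(x).transpose(1, 2)    # emotion feature sequence
        seq, _ = self.lstm(feats)               # predicted sequence
        return self.fc(seq).log_softmax(-1)     # per-step label probabilities

# Transcription-layer stand-in: pick the most probable label per song.
model = EmotionCRNN()
probs = model(torch.randn(4, 1, 16))            # 4 songs, toy features
labels = probs.mean(dim=1).argmax(dim=-1)       # greedy decode, one label each
```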
S103, determining the emotion category corresponding to the emotion keyword in the song set to be selected, determining at least one emotion label belonging to the emotion category in the song set to be selected as at least one emotion label matched with the emotion keyword, determining at least one song corresponding to the at least one emotion label as an initial song set, and screening out a recommended song set from the initial song set based on song attributes.
For example, the server classifies the songs in the song set to be selected into three emotion categories, positive, negative and neutral, according to the emotion label prediction model. The positive category includes emotion labels such as "happy" and "excited", the negative category includes emotion labels such as "sad" and "sorrowful", and the neutral category includes emotion labels such as "calm" and "relaxing". The server matches the emotion keyword "relaxing" against the at least one emotion label of each song in the song set to be selected, finds that it is consistent with the label "relaxing" in the set, and thereby determines that the emotion keyword "relaxing" belongs to the neutral emotion category; the emotion labels of the neutral category, such as "calm" and "relaxing", are determined as the plurality of emotion labels matched with the emotion keyword "relaxing", and the at least one song corresponding to these emotion labels is determined as the initial song set.
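A small sketch of this category-based matching, assuming an in-memory category map and a label-to-songs index; all names here are illustrative:

```python
# Hypothetical emotion-category map and label-to-songs index.
CATEGORY_OF = {"happy": "positive", "excited": "positive",
               "sad": "negative", "sorrowful": "negative",
               "calm": "neutral", "relaxing": "neutral"}

def initial_song_set(keyword, songs_by_label):
    category = CATEGORY_OF[keyword]            # e.g. "relaxing" -> "neutral"
    songs = set()
    for label, cat in CATEGORY_OF.items():     # every label of that category
        if cat == category:
            songs.update(songs_by_label.get(label, ()))
    return songs

index = {"calm": ["song1"], "relaxing": ["song2", "song3"], "sad": ["song4"]}
print(initial_song_set("relaxing", index))     # {'song1', 'song2', 'song3'}
```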
And then, screening the initial song set according to the song attributes to obtain a recommended song set.
Optionally, the song attributes include song playing times and song online time;
the screening of the recommended song set from the initial song set based on song attributes includes:
sorting the songs in the initial song set whose playing times are greater than a preset playing-times threshold and whose online time is less than a preset online-time threshold, in descending order of playing times, to obtain the recommended song set.
For example, the server obtains from the initial song set the songs A, B, C and D whose playing times exceed 10,000 and whose online time is less than 6 months; the playing times of A, B, C and D are 100,000, 20,000, 50,000 and 90,000 respectively, so sorting them in descending order of playing times gives the recommended song set {A, D, C, B}.
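A sketch of this attribute-based screening with the example's thresholds (10,000 plays, six months online); the Song type is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass
class Song:
    name: str
    play_count: int
    months_online: int

def recommend(initial_set, min_plays=10_000, max_months=6):
    # Keep popular, recently released songs and sort by descending play count.
    eligible = [s for s in initial_set
                if s.play_count > min_plays and s.months_online < max_months]
    return sorted(eligible, key=lambda s: s.play_count, reverse=True)

songs = [Song("A", 100_000, 3), Song("B", 20_000, 2),
         Song("C", 50_000, 5), Song("D", 90_000, 1)]
print([s.name for s in recommend(songs)])        # ['A', 'D', 'C', 'B']
```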
Further, the server may send the recommended song set to the first user terminal, which displays it through the human-computer interaction interface; the recommended song set may also be stored in a folder named "relaxing emotion recommended songs" and saved to a disk or other medium, so that the user can conveniently obtain, manage and distribute it.
In the embodiment of the application, the server can identify the emotion content that interests the user from the user's input, obtain an emotion keyword by matching the identified content against the emotion words in the emotion text database, obtain the emotion labels of each song in the song set to be selected using the emotion label prediction model built with a convolutional recurrent neural network, match the emotion keyword against those labels to obtain at least one matched emotion label, determine the at least one corresponding song as the initial song set, and screen the initial song set by song attributes to obtain the recommended song set. In this way, the user can quickly obtain songs of the emotion types that interest them, the user's song-listening experience is improved, and the manual curation cost of the song library is reduced.
Please refer to FIG. 3, which is a schematic flowchart of another song recommendation method according to an embodiment of the present application. As shown in FIG. 3, this method embodiment includes the following steps:
s201, receiving retrieval information sent by a first user terminal.
The retrieval information may include voice information or text information.
Specifically, FIGS. 4a and 4b are schematic diagrams of human-computer interaction interfaces for prompting input according to an embodiment of the present application. The first user terminal may present different forms of input prompt messages to the first user through the human-computer interaction interface, so that the first user can enter the retrieval information either as text, by typing characters in the input box of the interface shown in FIG. 4a, or as voice, by pressing the voice key of the interface shown in FIG. 4b and speaking. The first user terminal receives the retrieval information (text or voice) entered by the first user and sends it to the server, and the server examines the received retrieval information and executes step S202 or step S203 according to the result.
And S202, if the retrieval information is voice information, converting the voice information into retrieval text information.
In one possible implementation, the specific implementation manner of converting the voice information into the retrieval text information is as follows:
detecting the language type of the voice information, acquiring a voice standardization model matched with the language type, and converting the voice information into standardized voice information through the voice standardization model;
and acquiring a voice-to-text conversion model matched with the language type, and converting the standardized voice information into retrieval text information through the voice-to-text conversion model.
Here, the language type may be a language or a dialect. The languages may include Mandarin, English, French, and the like, and the dialects may include Cantonese, the Henan dialect, and the like.
Specifically, the server may use dedicated voice detection tools to detect the language type of the voice information. It can be understood that a plurality of voice detection tools for specific language types, such as a Cantonese detection tool and a Shanghainese detection tool, are arranged in the server; each tool detects the voice information, the language type of the voice information is determined from the detection results, and a voice standardization model matched with that language type is obtained. For example, if the language type of the voice information is detected to be Cantonese, a Cantonese standardization model is obtained. The voice information is then standardized by the voice standardization model, which can be understood as removing the tone-indicating part of the voice information and adjusting its speech speed, loudness and fundamental frequency, so as to obtain voice information in declarative-sentence form with the preset speech speed, volume and pitch, i.e., the standardized voice information.
The server then inputs the standardized voice information into a voice-to-text conversion model matched with the language type, which converts the standardized voice information into retrieval text information. For example, if the language type of the voice information is the Sichuan dialect and the speech is a Sichuan-dialect expression meaning "particularly comfortable", the retrieval text information obtained after processing by the Sichuan-dialect voice-to-text conversion model is the text "particularly comfortable". Step S204 is then performed.
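The voice handling of step S202 could be outlined as below; every helper here (detect_language, load_normalizer, load_speech_to_text) is a hypothetical placeholder standing in for the dialect-specific tools described above, not a real library API:

```python
from typing import Callable

def detect_language(voice: bytes) -> str:
    ...  # hypothetical: run the per-dialect detection tools, pick the best match

def load_normalizer(language: str) -> Callable[[bytes], bytes]:
    ...  # hypothetical: voice standardization model for that language type

def load_speech_to_text(language: str) -> Callable[[bytes], str]:
    ...  # hypothetical: voice-to-text conversion model for that language type

def voice_to_retrieval_text(voice_info: bytes) -> str:
    language = detect_language(voice_info)                # e.g. "cantonese"
    standardized = load_normalizer(language)(voice_info)  # preset pace/volume/pitch
    return load_speech_to_text(language)(standardized)    # retrieval text information
```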
And S203, if the search information is text information, determining the search information as search text information.
Step S204 is then performed.
And S204, matching the retrieved text information with each emotion word in the emotion text database to obtain emotion words matched with the retrieved text information, and determining the emotion words matched with the retrieved text information as emotion keywords.
S205, at least one emotion label of each song in the song collection to be selected is obtained.
Here, for the specific implementation of steps S204 to S205, refer to the description of steps S101 to S102 in the embodiment corresponding to FIG. 2, which is not repeated here.
S206, calculating a first matching value of the emotion keyword and each emotion label in the song set to be selected, determining the emotion label with the first matching value being greater than or equal to a second preset threshold value as at least one emotion label matched with the emotion keyword, determining at least one song corresponding to the at least one emotion label matched with the emotion keyword as an initial song set, and screening out a recommended song set from the initial song set based on song attributes.
For example, the server calculates the first matching values between the emotion keyword "happy" and the emotion labels "happy", "calm", "relaxed" and "sad" in the song set to be selected, where the first matching value between a single emotion keyword and a single emotion label can be obtained from a preset first-matching-value table. Suppose that, by querying the table, the first matching values of "happy" with the labels "happy", "calm", "relaxed" and "sad" are 50, 90, 100 and 10 respectively; the labels whose first matching values are greater than or equal to the second preset threshold 80 are then "calm" and "relaxed", so "calm" and "relaxed" are determined as the emotion labels matched with the emotion keyword "happy", the at least one song corresponding to these labels is determined as the initial song set, and the initial song set is screened according to song attributes such as song popularity, song shelf time and song genre to obtain the recommended song set.
In the embodiment of the application, the server converts the received retrieval information into retrieval text information, matches the retrieval text information against each emotion word in the emotion text database to obtain the emotion word matched with the retrieval text information, and determines that emotion word as the emotion keyword; it acquires at least one emotion label of each song in the song set to be selected, matches the emotion keyword against those labels to obtain at least one matched emotion label, and determines the recommended song set from the at least one song corresponding to the matched labels. In this way, the user can quickly obtain songs of the emotion types that interest them, the user's song-listening experience is improved, and the manual curation cost of the song library is reduced.
Please refer to FIG. 5, which is a schematic structural diagram of a song recommending apparatus according to an embodiment of the present application. As shown in FIG. 5, the song recommending apparatus includes a receiving and matching unit 501, an acquisition unit 502 and a determining and screening unit 503.
A receiving matching unit 501, configured to match the received search information with emotion words in an emotion text database to obtain emotion words matched with the search information, and determine the emotion words matched with the search information as emotion keywords;
an obtaining unit 502, configured to obtain at least one emotion tag of each song in a candidate song set, where the candidate song set includes at least one song, and the at least one emotion tag includes at least one emotion tag of at least one emotion category;
a determining and screening unit 503, configured to determine an emotion category corresponding to the emotion keyword in the to-be-selected song set, determine at least one emotion tag in the to-be-selected song set that belongs to the emotion category as at least one emotion tag that matches the emotion keyword, determine at least one song corresponding to the at least one emotion tag as an initial song set, and screen out the recommended song set from the initial song set based on song attributes.
Optionally, the apparatus further comprises:
an extracting unit 504, configured to extract emotion text information and melody information of each song in a sample song set to obtain emotion text information and melody information of the sample song set, where each song in the sample song set carries a corresponding actual emotion tag;
a model training unit 505, configured to train the initial convolutional recurrent neural network model according to the emotion text information and the melody information of the sample song set and the actual emotion label carried by each song, to obtain a first convolutional recurrent neural network model and a predicted emotion label of each song in the sample song set;
a model adjusting unit 506, configured to adjust the first convolutional recurrent neural network model according to the predicted emotion label and the actual emotion label of each song in the sample song set;
a model determining unit 507, configured to determine, when the adjusted first convolutional recurrent neural network model satisfies a convergence condition, the adjusted first convolutional recurrent neural network model as an emotion label prediction model.
Wherein the emotion label prediction model comprises a convolutional layer, a recurrent layer and a transcription layer;
optionally, the apparatus further comprises:
the extraction and input unit 508 is used for extracting emotion text information and melody information of each song in a song set to be selected to obtain emotion text information and melody information of the song set to be selected, and inputting the emotion text information and melody information of the song set to be selected into the emotion label prediction model;
a feature extraction unit 509, configured to perform feature extraction on the emotion text information and melody information of the song set to be selected through the convolutional layer to obtain the emotion feature sequence of the song set to be selected;
a prediction unit 510, configured to predict the emotion feature sequence of the song set to be selected through the recurrent layer to obtain the predicted sequence of the song set to be selected;
a transcription unit 511, configured to convert the predicted sequence of the song set to be selected into the emotion label sequence of the song set to be selected through the transcription layer, so as to obtain the at least one emotion label of each song in the song set to be selected.
Optionally, the extraction input unit 508 is specifically configured to:
the method comprises the steps of obtaining a lyric text of a target song in a song collection to be selected, splitting the lyric text of the target song into at least one lyric word, calculating the degree of correlation between the target lyric word in the at least one lyric word and each emotion word in an emotion text database, obtaining a plurality of degree of correlation values between the target lyric word and each emotion word, determining the maximum value of the degree of correlation values as the emotion score of the target lyric word, thus obtaining the emotion score of each lyric word in the target song, and determining the lyric word with the emotion score larger than a first preset threshold value as the emotion text information of the target song.
Optionally, the apparatus further comprises:
the calculation determination screening unit 512 is configured to calculate a first matching value between the emotion keyword and each emotion tag in the song set to be selected, determine an emotion tag of which the first matching value is greater than or equal to a second preset threshold as at least one emotion tag matched with the emotion keyword, determine at least one song corresponding to the at least one emotion tag matched with the emotion keyword as an initial song set, and screen the recommended song set from the initial song set based on song attributes.
Optionally, the song attributes include song playing times and song online time;
the determination screening unit 503 or the calculation determination screening unit 512 is specifically configured to: sort, in descending order of song playing times, the at least one song in the initial song set whose song playing times are greater than a preset playing times threshold value and whose song online time is less than a preset online time threshold value, to obtain the recommended song set.
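As a minimal sketch of this screening, assuming song attributes are kept in a dict keyed by song id and taking 10,000 plays and 90 days as stand-ins for the two preset thresholds:

```python
def screen_recommended_songs(initial_song_set, attrs, min_plays=10_000, max_days=90):
    """attrs maps song id -> {"plays": int, "days_online": int} (assumed shape).
    Keep songs over the play-count threshold and under the online-time threshold,
    sorted by playing times in descending order."""
    eligible = [s for s in initial_song_set
                if attrs[s]["plays"] > min_plays and attrs[s]["days_online"] < max_days]
    return sorted(eligible, key=lambda s: attrs[s]["plays"], reverse=True)
```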
Wherein the retrieval information comprises voice information;
optionally, the receiving and matching unit 501 is specifically configured to:
detect the language type of the voice information, acquire a speech normalization model matched with the language type, and convert the voice information into standardized voice information through the speech normalization model;
acquire a speech-to-text model matched with the language type, and convert the standardized voice information into retrieval text information through the speech-to-text model;
and match the retrieval text information with each emotion word in the emotion text database to obtain the emotion word matched with the retrieval text information, and determine the emotion word matched with the retrieval text information as the emotion keyword.
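The voice path can be pictured end to end as below. The four injected callables are hypothetical placeholders for an actual language detector, normalization model, speech-to-text model, and lexicon matcher; the patent names the stages but does not specify any concrete models.

```python
def voice_query_to_keywords(audio, detect_language, load_normalizer,
                            load_transcriber, match_keywords):
    """Detect the language type, normalize the speech with a language-matched
    model, transcribe the result, then match emotion keywords from the text."""
    lang = detect_language(audio)                    # language type of the voice info
    normalized = load_normalizer(lang)(audio)        # standardized voice information
    query_text = load_transcriber(lang)(normalized)  # retrieval text information
    return match_keywords(query_text)                # emotion keyword(s) from the lexicon
```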
It can be understood that the song recommending apparatus 500 is configured to implement the steps performed by the server in the embodiments of fig. 2 to fig. 3. For the specific implementation and corresponding advantageous effects of the functional blocks included in the song recommending apparatus 500 of fig. 5, reference may be made to the detailed description of the embodiments of fig. 2 to fig. 3, which is not repeated herein.
The song recommending apparatus 500 in the embodiment shown in fig. 5 may be implemented by the server 600 shown in fig. 6. Please refer to fig. 6, which provides a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 6, the server 600 may include: one or more processors 601, a memory 602, and a transceiver 603, connected by a bus 604. The transceiver 603 is configured to receive retrieval information or to send the recommended song set; the memory 602 is configured to store a computer program, which includes program instructions; and the processor 601 is configured to execute the program instructions stored in the memory 602 to perform the following operations:
matching the received retrieval information with emotion words in an emotion text database to obtain emotion words matched with the retrieval information, and determining the emotion words matched with the retrieval information as emotion keywords;
acquiring at least one emotion label of each song in a song set to be selected, wherein the song set to be selected comprises at least one song, and the at least one emotion label comprises at least one emotion label of at least one emotion category;
determining an emotion category corresponding to the emotion keyword in the song set to be selected; determining at least one emotion label belonging to the emotion category in the song set to be selected as at least one emotion label matched with the emotion keyword; determining at least one song corresponding to the at least one emotion label as an initial song set; and screening the recommended song set from the initial song set based on song attributes.
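Tying the pieces together, a sketch of this category-based top-level flow might read as follows; the dict-shaped inputs and the reuse of the screening sketch above are editorial assumptions.

```python
def recommend_songs(emotion_keyword, category_of, labels_in_category,
                    songs_by_label, attrs):
    """Map the emotion keyword to its emotion category, treat every label of
    that category as matched, collect the corresponding songs as the initial
    song set, then screen the recommended set by song attributes."""
    category = category_of[emotion_keyword]          # emotion category of the keyword
    matched_labels = labels_in_category[category]    # labels belonging to the category
    initial_set = set()
    for label in matched_labels:
        initial_set.update(songs_by_label.get(label, ()))
    return screen_recommended_songs(initial_set, attrs)  # from the earlier sketch
```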
Optionally, before the processor 601 obtains at least one emotion tag of each song in the song set to be selected, the following operations are specifically performed:
extracting emotion text information and melody information of each song in a sample song set to obtain emotion text information and melody information of the sample song set, wherein each song in the sample song set carries a corresponding actual emotion label;
training an initial convolutional recurrent neural network model according to the emotion text information and melody information of the sample song set and the actual emotion label carried by each corresponding song, to obtain a first convolutional recurrent neural network model and a predicted emotion label of each song in the sample song set;
adjusting the first convolutional recurrent neural network model according to the predicted emotion label and the actual emotion label of each song in the sample song set;
and when the adjusted first convolutional recurrent neural network model meets the convergence condition, determining the adjusted first convolutional recurrent neural network model as an emotion label prediction model.
Wherein the emotion label prediction model comprises a convolutional layer, a recurrent layer and a transcription layer;
optionally, after the processor 601 determines the adjusted first convolutional recurrent neural network model as an emotion label prediction model, the following operation is specifically performed:
extracting emotion text information and melody information of each song in a song set to be selected to obtain emotion text information and melody information of the song set to be selected, and inputting the emotion text information and melody information of the song set to be selected into the emotion label prediction model;
extracting features of the emotion text information and melody information of the song set to be selected through the convolutional layer to obtain an emotion feature sequence of the song set to be selected;
predicting, through the recurrent layer, the emotion feature sequence of the song set to be selected to obtain a predicted sequence of the song set to be selected;
and converting the predicted sequence of the song set to be selected into an emotion label sequence of the song set to be selected through the transcription layer, so as to obtain at least one emotion label of each song in the song set to be selected.
Optionally, the processor 601 extracts emotion text information of each song in the song set to be selected, and specifically performs the following operations:
acquiring the lyric text of a target song in the song set to be selected, and splitting the lyric text of the target song into at least one lyric word; calculating the correlation degree between a target lyric word in the at least one lyric word and each emotion word in the emotion text database to obtain a plurality of correlation degree values between the target lyric word and the emotion words; determining the maximum value of the plurality of correlation degree values as the emotion score of the target lyric word, thereby obtaining the emotion score of each lyric word in the target song; and determining the lyric words whose emotion score is greater than a first preset threshold value as the emotion text information of the target song.
Optionally, the processor 601 further specifically performs the following operations:
calculating a first matching value between the emotion keyword and each emotion label in the song set to be selected; determining the emotion labels whose first matching value is greater than or equal to a second preset threshold value as the at least one emotion label matched with the emotion keyword; determining at least one song corresponding to the at least one emotion label matched with the emotion keyword as an initial song set; and screening the recommended song set from the initial song set based on song attributes.
Optionally, the song attributes include song playing times and song online time;
the processor 601 selects the recommended song set from the initial song set based on the song attributes, and specifically performs the following operations:
and sorting, in descending order of song playing times, the at least one song in the initial song set whose song playing times are greater than a preset playing times threshold value and whose song online time is less than a preset online time threshold value, to obtain the recommended song set.
Wherein the retrieval information comprises voice information;
optionally, the processor 601 matches the received retrieval information with each emotion word in an emotion text database to obtain an emotion word matched with the retrieval information, and determines the emotion word matched with the retrieval information as an emotion keyword, specifically performing the following operations:
detecting the language type of the voice information, acquiring a speech normalization model matched with the language type, and converting the voice information into standardized voice information through the speech normalization model;
acquiring a speech-to-text model matched with the language type, and converting the standardized voice information into retrieval text information through the speech-to-text model;
and matching the retrieval text information with each emotion word in the emotion text database to obtain the emotion word matched with the retrieval text information, and determining the emotion word matched with the retrieval text information as the emotion keyword.
An embodiment of the present application may further provide a computer storage medium for storing the computer software instructions used by the song recommending apparatus in the embodiment shown in fig. 5, including a program designed to perform the song recommendation method of the foregoing embodiments. The storage medium includes, but is not limited to, flash memory, a hard disk, and a solid-state drive.
An embodiment of the present application further provides a computer program product which, when executed by a computing device, performs the song recommendation method designed for the apparatus in the embodiment shown in fig. 5.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
In the present application, "A and/or B" means one of the following three cases: A alone, B alone, or both A and B. "At least one of ..." refers to any one of the listed items or any combination thereof; for example, "at least one of A, B and C" refers to any one of seven cases: A; B; C; A and B; B and C; A and C; or A, B and C.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure is only a preferred embodiment of the present application and is not intended to limit the scope of the claims of the present application; therefore, equivalent variations and modifications made according to the claims of the present application shall still fall within the scope covered by the present application.

Claims (10)

1. A song recommendation method, comprising:
matching the received retrieval information with emotion words in an emotion text database to obtain emotion words matched with the retrieval information, and determining the emotion words matched with the retrieval information as emotion keywords;
acquiring at least one emotion label of each song in a song set to be selected, wherein the song set to be selected comprises at least one song, and the at least one emotion label comprises at least one emotion label of at least one emotion category;
determining an emotion category corresponding to the emotion keyword in the song set to be selected; determining at least one emotion label belonging to the emotion category in the song set to be selected as at least one emotion label matched with the emotion keyword; determining at least one song corresponding to the at least one emotion label as an initial song set; and screening a recommended song set from the initial song set based on song attributes.
2. The method of claim 1, wherein the acquiring at least one emotion label of each song in the song set to be selected comprises:
extracting emotion text information and melody information of each song in a sample song set to obtain emotion text information and melody information of the sample song set, wherein each song in the sample song set carries a corresponding actual emotion label;
training an initial convolutional recurrent neural network model according to the emotion text information and melody information of the sample song set and the actual emotion label carried by each corresponding song, to obtain a first convolutional recurrent neural network model and a predicted emotion label of each song in the sample song set;
adjusting the first convolutional recurrent neural network model according to the predicted emotion label and the actual emotion label of each song in the sample song set;
and when the adjusted first convolutional recurrent neural network model meets the convergence condition, determining the adjusted first convolutional recurrent neural network model as an emotion label prediction model.
3. The method of claim 2, wherein the emotion label prediction model comprises a convolutional layer, a recurrent layer and a transcription layer;
after the determining the adjusted first convolutional recurrent neural network model as the emotion label prediction model, the method includes:
extracting emotion text information and melody information of each song in a song set to be selected to obtain emotion text information and melody information of the song set to be selected, and inputting the emotion text information and melody information of the song set to be selected into the emotion label prediction model;
extracting features of the emotion text information and melody information of the song set to be selected through the convolutional layer to obtain an emotion feature sequence of the song set to be selected;
predicting, through the recurrent layer, the emotion feature sequence of the song set to be selected to obtain a predicted sequence of the song set to be selected;
and converting the predicted sequence of the song set to be selected into an emotion label sequence of the song set to be selected through the transcription layer, so as to obtain at least one emotion label of each song in the song set to be selected.
4. The method according to claim 3, wherein the extracting emotion text information of each song in the song set to be selected comprises:
acquiring the lyric text of a target song in the song set to be selected, and splitting the lyric text of the target song into at least one lyric word; calculating the correlation degree between a target lyric word in the at least one lyric word and each emotion word in the emotion text database to obtain a plurality of correlation degree values between the target lyric word and the emotion words; determining the maximum value of the plurality of correlation degree values as the emotion score of the target lyric word, thereby obtaining the emotion score of each lyric word in the target song; and determining the lyric words whose emotion score is greater than a first preset threshold value as the emotion text information of the target song.
5. The method of claim 1, further comprising:
calculating a first matching value between the emotion keyword and each emotion label in the song set to be selected; determining the emotion labels whose first matching value is greater than or equal to a second preset threshold value as the at least one emotion label matched with the emotion keyword; determining at least one song corresponding to the at least one emotion label matched with the emotion keyword as an initial song set; and screening the recommended song set from the initial song set based on song attributes.
6. The method according to claim 1 or 5, wherein the song attributes comprise song playing times and song online time;
the selecting the recommended set of songs from the initial set of songs based on song attributes includes:
and sorting, in descending order of song playing times, the at least one song in the initial song set whose song playing times are greater than a preset playing times threshold value and whose song online time is less than a preset online time threshold value, to obtain the recommended song set.
7. The method of claim 1, wherein the retrieval information comprises voice information;
the matching of the received retrieval information and the emotion words in the emotion text database to obtain the emotion words matched with the retrieval information, and the determining of the emotion words matched with the retrieval information as emotion keywords include:
detecting the language type of the voice information, acquiring a speech normalization model matched with the language type, and converting the voice information into standardized voice information through the speech normalization model;
acquiring a speech-to-text model matched with the language type, and converting the standardized voice information into retrieval text information through the speech-to-text model;
and matching the retrieval text information with each emotion word in the emotion text database to obtain the emotion word matched with the retrieval text information, and determining the emotion word matched with the retrieval text information as the emotion keyword.
8. A song recommendation apparatus, comprising:
a receiving and matching unit, configured to match the received retrieval information with each emotion word in an emotion text database to obtain the emotion word matched with the retrieval information, and to determine the emotion word matched with the retrieval information as an emotion keyword;
an acquisition unit, configured to acquire at least one emotion label of each song in a song set to be selected, wherein the song set to be selected comprises at least one song, and the at least one emotion label comprises at least one emotion label of at least one emotion category;
and a determining and screening unit, configured to: determine an emotion category corresponding to the emotion keyword in the song set to be selected; determine at least one emotion label belonging to the emotion category in the song set to be selected as at least one emotion label matched with the emotion keyword; determine at least one song corresponding to the at least one emotion label as an initial song set; and screen a recommended song set from the initial song set based on song attributes.
9. A server, comprising a processor, a memory and a transceiver, the processor, the memory and the transceiver being interconnected, wherein the transceiver is configured to receive or transmit data, the memory is configured to store program code, and the processor is configured to invoke the program code to perform a song recommendation method according to any one of claims 1-7.
10. A storage medium, characterized in that the storage medium stores a computer program comprising program instructions; the program instructions, when executed by a processor, cause the processor to perform a song recommendation method according to any one of claims 1-7.
CN202010500439.6A 2020-06-04 2020-06-04 Song recommendation method and device, server and storage medium Pending CN111737414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010500439.6A CN111737414A (en) 2020-06-04 2020-06-04 Song recommendation method and device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010500439.6A CN111737414A (en) 2020-06-04 2020-06-04 Song recommendation method and device, server and storage medium

Publications (1)

Publication Number Publication Date
CN111737414A true CN111737414A (en) 2020-10-02

Family

ID=72649954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010500439.6A Pending CN111737414A (en) 2020-06-04 2020-06-04 Song recommendation method and device, server and storage medium

Country Status (1)

Country Link
CN (1) CN111737414A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070250319A1 (en) * 2006-04-11 2007-10-25 Denso Corporation Song feature quantity computation device and song retrieval system
US20090182736A1 (en) * 2008-01-16 2009-07-16 Kausik Ghatak Mood based music recommendation method and system
US20090281906A1 (en) * 2008-05-07 2009-11-12 Microsoft Corporation Music Recommendation using Emotional Allocation Modeling
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 The song emotion identification method of multi-modal fusion based on degree of depth study
WO2018232623A1 (en) * 2017-06-21 2018-12-27 Microsoft Technology Licensing, Llc Providing personalized songs in automated chatting
CN108536803A (en) * 2018-03-30 2018-09-14 百度在线网络技术(北京)有限公司 Song recommendations method, apparatus, equipment and computer-readable medium
CN108804609A (en) * 2018-05-30 2018-11-13 平安科技(深圳)有限公司 Song recommendations method and apparatus
WO2020052135A1 (en) * 2018-09-10 2020-03-19 珠海格力电器股份有限公司 Music recommendation method and apparatus, computing apparatus, and storage medium
CN109885722A (en) * 2019-01-07 2019-06-14 平安科技(深圳)有限公司 Music recommended method, device and computer equipment based on natural language processing
CN111090771A (en) * 2019-10-31 2020-05-01 腾讯音乐娱乐科技(深圳)有限公司 Song searching method and device and computer storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
毋亚男; 刘德然; 许小可: "Design of a real-time music recommendation system based on bidirectional sentiment analysis" (基于双向情感分析的实时性音乐推荐系统设计), Journal of Dalian Minzu University (大连民族大学学报), no. 01
琚春华; 汪澍: "A comprehensive music recommendation method incorporating user emotion factors" (一种融入用户情绪因素的综合音乐推荐方法), Journal of the China Society for Scientific and Technical Information (情报学报), no. 06
薛亮; 黄美帆: "Lyric emotion analysis of Chinese pop music: a big-data analysis method based on new-media music terminals" (华语流行音乐的歌词情绪分析――基于新媒体音乐终端的大数据分析方法), Music Communication (音乐传播), no. 04

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112785993A (en) * 2021-01-15 2021-05-11 杭州网易云音乐科技有限公司 Music generation method, device, medium and computing equipment
CN112785993B (en) * 2021-01-15 2024-04-12 杭州网易云音乐科技有限公司 Music generation method, device, medium and computing equipment
CN113094541A (en) * 2021-04-16 2021-07-09 网易(杭州)网络有限公司 Audio playing method, electronic equipment and storage medium
CN114564604A (en) * 2022-03-01 2022-05-31 北京字节跳动网络技术有限公司 Media collection generation method and device, electronic equipment and storage medium
CN114564604B (en) * 2022-03-01 2023-08-08 抖音视界有限公司 Media collection generation method and device, electronic equipment and storage medium
WO2023165368A1 (en) * 2022-03-01 2023-09-07 北京字节跳动网络技术有限公司 Media collection generation method and apparatus, electronic device, and storage medium
CN114417050A (en) * 2022-03-09 2022-04-29 深圳市云动创想科技有限公司 Intelligent caching method and device, storage medium and electronic equipment
CN115600764A (en) * 2022-11-17 2023-01-13 中船重工(武汉)凌久高科有限公司(Cn) Rolling time domain energy consumption prediction method based on weight neighborhood rough set rapid reduction

Similar Documents

Publication Publication Date Title
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN109918680B (en) Entity identification method and device and computer equipment
CN111737414A (en) Song recommendation method and device, server and storage medium
CN106406806B (en) Control method and device for intelligent equipment
CN107818781B (en) Intelligent interaction method, equipment and storage medium
CN106156204B (en) Text label extraction method and device
CN108647205B (en) Fine-grained emotion analysis model construction method and device and readable storage medium
KR100775585B1 (en) Method for recommending music about character message and system thereof
Mayer et al. Musical genre classification by ensembles of audio and lyrics features
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN108288468A (en) Audio recognition method and device
CN110597952A (en) Information processing method, server, and computer storage medium
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN108538294B (en) Voice interaction method and device
CN109920409B (en) Sound retrieval method, device, system and storage medium
JPWO2008023470A1 (en) SENTENCE UNIT SEARCH METHOD, SENTENCE UNIT SEARCH DEVICE, COMPUTER PROGRAM, RECORDING MEDIUM, AND DOCUMENT STORAGE DEVICE
CN110188356B (en) Information processing method and device
CN107145509B (en) Information searching method and equipment thereof
CN112417102A (en) Voice query method, device, server and readable storage medium
CN113094552A (en) Video template searching method and device, server and readable storage medium
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN111090771A (en) Song searching method and device and computer storage medium
CN114003682A (en) Text classification method, device, equipment and storage medium
CN108345694B (en) Document retrieval method and system based on theme database
CN107424612A (en) Processing method, device and machine readable media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination