CN113297412A - Music recommendation method and device, electronic equipment and storage medium - Google Patents

Music recommendation method and device, electronic equipment and storage medium

Info

Publication number
CN113297412A
CN113297412A (application CN202010113506.9A); granted as CN113297412B
Authority
CN
China
Prior art keywords
music
recommended
audio
audio information
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010113506.9A
Other languages
Chinese (zh)
Other versions
CN113297412B (en)
Inventor
刘永起
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010113506.9A priority Critical patent/CN113297412B/en
Publication of CN113297412A publication Critical patent/CN113297412A/en
Application granted granted Critical
Publication of CN113297412B publication Critical patent/CN113297412B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • G06F16/635Filtering based on additional data, e.g. user or group profiles
    • G06F16/637Administration of user profiles, e.g. generation, initialization, adaptation or distribution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure relates to a music recommendation method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring audio information of target music; performing feature extraction on the audio information to obtain the audio features of the target music; inputting the audio features into a preset feedforward network model to obtain a first output result corresponding to the target music; determining the distance between the target music and each piece of music to be recommended according to the first output result and a second output result corresponding to each of a plurality of pieces of music to be recommended; and selecting, according to those distances, the music to be recommended that meets a preset distance requirement and determining it as the music to recommend to the user. Because the feedforward network model has the property that music of the same style is mapped closer together than music of different styles, and the model training process based on this property is supervised, the recommendation accuracy for music of similar styles can be improved.

Description

Music recommendation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of music technologies, and in particular, to a music recommendation method and apparatus, an electronic device, and a storage medium.
Background
Music is an art form: different musical expressions evoke distinct feelings in listeners, and listeners in turn express their own emotions through music. Music is now ubiquitous, and people's demand for it keeps growing. With the rapid development of technology and the wide reach of the internet, music websites and music software are multiplying, online music resources are increasingly abundant, and streaming and downloading music over the network have become part of everyday entertainment. Although these vast music resources offer great convenience, they also make it difficult for users to quickly find the music they need.
In the related art, content-based music recommendation is typically adopted: it tries to make the recommended music match the style of the music the user has listened to as closely as possible, so that music in the user's preferred style can be quickly found and recommended.
However, such methods rely on manually engineered features, which often results in poor recommendation accuracy.
Disclosure of Invention
The present disclosure provides a music recommendation method, apparatus, electronic device and storage medium, to at least solve the problem of poor recommendation accuracy in the related art. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a music recommendation method, including:
acquiring audio information of target music;
extracting the characteristics of the audio information to obtain the audio characteristics of the target music;
inputting the audio features into a preset feedforward network model to obtain a first output result corresponding to the target music, wherein the feedforward network model is any one branch of a trained triplet network model, and the triplet network model is obtained by training on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style, under the constraint that the distance between the first music and the second music is smaller than the distance between the first music and the third music;
determining the distance between the target music and each piece of music to be recommended according to a first output result corresponding to the target music and a second output result corresponding to each piece of music to be recommended in a plurality of pieces of music to be recommended;
and selecting, from the plurality of pieces of music to be recommended, the music that meets a preset distance requirement according to the distance between the target music and each piece of music to be recommended, and determining it as the music to be recommended to the user.
In one embodiment, the step of determining the distance between the target music and each piece of music to be recommended according to the first output result corresponding to the target music and the second output result corresponding to each piece of music to be recommended in a plurality of pieces of music to be recommended includes:
calculating the cosine distance between a first output vector corresponding to the target music and a second output vector corresponding to each piece of music to be recommended in the plurality of pieces of music to be recommended, and determining the cosine distance as the distance between the target music and that piece of music to be recommended.
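As a minimal sketch (not part of the patent text; the vector values are illustrative), the cosine-distance step can be written as:

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance = 1 - cosine similarity; 0 means identical
    # direction, larger values mean less similar styles.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Illustrative output vectors from the feedforward network model.
target_vec = np.array([0.2, 0.9, 0.1])
candidate_vecs = [np.array([0.2, 0.9, 0.1]),   # same direction -> distance ~0
                  np.array([0.9, 0.1, 0.4])]

distances = [cosine_distance(target_vec, c) for c in candidate_vecs]
```

A candidate whose output vector points in the same direction as the target's gets distance near zero, so it would rank first for recommendation.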
In one embodiment, the step of selecting, according to the distance between the target music and each piece of music to be recommended, music to be recommended that meets a preset distance requirement from the plurality of pieces of music to be recommended, and determining music to be recommended to the user includes:
sorting the plurality of distances corresponding to the plurality of pieces of music to be recommended in ascending order according to the distance between the target music and each piece of music to be recommended;
and selecting the top-ranked pieces of music to be recommended up to a preset number, and determining them as the music to be recommended to the user.
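A sketch of this ascending-sort, top-N selection step, with hypothetical song IDs and distances:

```python
import numpy as np

def top_n_recommendations(distances, song_ids, n=3):
    # Sort candidate songs by ascending distance to the target music
    # and keep the first n as the music to recommend.
    order = np.argsort(distances)
    return [song_ids[i] for i in order[:n]]

song_ids = ["s1", "s2", "s3", "s4", "s5"]
distances = [0.42, 0.05, 0.77, 0.13, 0.60]
recommended = top_n_recommendations(distances, song_ids, n=3)
# s2 (0.05), s4 (0.13), and s1 (0.42) are closest in style
```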
In one embodiment, the step of extracting the features of the audio information to obtain the audio features of the target music includes:
framing the audio information to obtain frames of audio information;
extracting mel-frequency cepstral coefficients from each frame of audio information to obtain the audio features of each frame;
and splicing the audio features of the first preset number of frames to obtain the audio features of the target music.
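A simplified sketch of this framing-and-splicing pipeline, assuming NumPy. The per-frame feature here is a truncated log-spectrum stand-in for real MFCCs (a production system would use something like `librosa.feature.mfcc`); frame lengths and counts are illustrative assumptions:

```python
import numpy as np

def frame_signal(audio, frame_len=400, hop=160):
    # Split the waveform into overlapping frames.
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    return np.stack([audio[i * hop : i * hop + frame_len] for i in range(n_frames)])

def mfcc_like_features(frames, n_coeffs=13):
    # Placeholder per-frame feature: truncated log power spectrum.
    # A real pipeline would apply a mel filterbank and DCT.
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log1p(spectrum)[:, :n_coeffs]

def target_music_features(audio, n_keep=100):
    frames = frame_signal(audio)
    feats = mfcc_like_features(frames)
    # Keep the first n_keep frames (zero-pad short clips) and splice
    # them into one fixed-length feature vector.
    if feats.shape[0] < n_keep:
        pad = np.zeros((n_keep - feats.shape[0], feats.shape[1]))
        feats = np.vstack([feats, pad])
    return feats[:n_keep].ravel()
```

Splicing a fixed number of leading frames yields a fixed-length vector, which is what the feedforward network model expects as input.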
In one embodiment, the training process of the triplet network model comprises the following steps:
acquiring first music of a first style, second music of the first style and third music of a second style;
constructing a triplet sample from the first music of the first style, the second music of the first style, and the third music of the second style;
performing feature extraction on the audio information of the first music, the second music, and the third music in the triplet sample to obtain the audio features corresponding to the triplet sample;
inputting the audio features corresponding to the triplet sample into a triplet network model to be trained, and training it based on a constraint condition and an objective function to obtain the trained triplet network model, wherein the constraint condition is that the distance between the first music and the second music is smaller than the distance between the first music and the third music, and the objective function requires that the loss accumulated over all triplet samples be minimized.
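The constraint and objective described above correspond to a standard hinge-style triplet loss. The following sketch is one way to write it (the `margin` value is an illustrative assumption, not from the patent):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge-style triplet loss: penalize whenever the anchor-positive
    # distance is not smaller than the anchor-negative distance by at
    # least `margin` -- i.e. the constraint d(a, p) < d(a, n).
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def objective(triplets, margin=0.2):
    # Objective: the loss accumulated over all triplet samples,
    # which training drives toward its minimum.
    return sum(triplet_loss(a, p, n, margin) for a, p, n in triplets)
```

When the negative (third music, second style) is already far from the anchor (first music), the triplet contributes zero loss; only violating triplets push the network parameters to change.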
In one embodiment, the step of inputting the audio features corresponding to the triplet sample into the triplet network model to be trained and training it based on the constraint condition and the objective function to obtain the trained triplet network model includes:
inputting the audio features of the first music, the second music, and the third music into the three feedforward network branches of the triplet network model to be trained, respectively, according to their correspondence;
performing convolution processing on the audio features of the first, second, and third music through the convolution layer in each of the three feedforward network branches;
further processing the corresponding convolution results through the hidden layer in each feedforward network branch to obtain the output results of the first, second, and third music;
calculating the distance between the first music and the second music and the distance between the first music and the third music from these output results;
and training the triplet network model according to those two distances, the constraint condition, and the objective function to obtain the trained triplet network model.
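A toy NumPy sketch of this three-branch forward pass with tied weights (layer sizes, activations, and random initial values are illustrative assumptions; a real implementation would use a deep-learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared parameters: the three branches of the triplet network are the
# same feedforward network (conv layer + hidden layer) with tied weights.
conv_kernel = rng.standard_normal(5) * 0.1       # illustrative 1-D kernel
w_hidden = rng.standard_normal((16, 8)) * 0.1    # hidden (fully connected) layer

def branch_forward(features):
    # Convolution layer over the audio-feature sequence ...
    conv = np.convolve(features, conv_kernel, mode="valid")
    conv = np.maximum(conv, 0.0)                 # ReLU
    # ... then a hidden layer producing the branch's output vector.
    return np.tanh(conv[:16] @ w_hidden)

anchor_feat = rng.standard_normal(20)  # first music  (style A)
pos_feat    = rng.standard_normal(20)  # second music (style A)
neg_feat    = rng.standard_normal(20)  # third music  (style B)

out_a, out_p, out_n = (branch_forward(f) for f in (anchor_feat, pos_feat, neg_feat))
d_pos = np.sum((out_a - out_p) ** 2)
d_neg = np.sum((out_a - out_n) ** 2)
```

Because the branches share weights, at inference time any single branch can be used as the "preset feedforward network model" of the recommendation method.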
In one embodiment, extracting features from the audio information of the first music, the second music, and the third music in the triplet sample to obtain the audio features corresponding to the triplet sample includes:
framing the audio information of the first music to obtain frames of audio information; extracting mel-frequency cepstral coefficients from each frame to obtain the audio features of each frame; and splicing the audio features of the first preset number of frames to obtain the audio features of the first music;
and/or framing the audio information of the second music to obtain frames of audio information; extracting mel-frequency cepstral coefficients from each frame to obtain the audio features of each frame; and splicing the audio features of the first preset number of frames to obtain the audio features of the second music;
and/or framing the audio information of the third music to obtain frames of audio information; extracting mel-frequency cepstral coefficients from each frame to obtain the audio features of each frame; and splicing the audio features of the first preset number of frames to obtain the audio features of the third music.
According to a second aspect of the embodiments of the present disclosure, there is provided a music recommendation apparatus including:
an audio information acquisition unit configured to acquire audio information of a target music;
an audio characteristic determination unit configured to perform characteristic extraction on the audio information to obtain an audio characteristic of the target music;
an output result obtaining unit configured to input the audio features into a preset feedforward network model to obtain a first output result corresponding to the target music, wherein the feedforward network model is any one branch of a trained triplet network model, and the triplet network model is obtained by training on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style, under the constraint that the distance between the first music and the second music is smaller than the distance between the first music and the third music;
a distance determining unit configured to determine a distance between the target music and each piece of music to be recommended according to a first output result corresponding to the target music and a second output result corresponding to each piece of music to be recommended in a plurality of pieces of music to be recommended;
and the recommended music determining unit is configured to select music to be recommended, which meets a preset distance requirement, from the plurality of music to be recommended according to the distance between the target music and each piece of music to be recommended, and determine the music to be recommended to the user.
In one embodiment, the distance determining unit is specifically configured to calculate the cosine distance between a first output vector corresponding to the target music and a second output vector corresponding to each of the plurality of pieces of music to be recommended, and to determine the cosine distance as the distance between the target music and that piece of music to be recommended.
In one embodiment, the recommended music determining unit is specifically configured to sort the plurality of distances corresponding to the plurality of pieces of music to be recommended in ascending order according to the distance between the target music and each piece of music to be recommended, select the top-ranked pieces of music to be recommended up to a preset number, and determine them as the music to be recommended to the user.
In one embodiment, the audio feature determination unit is specifically configured to frame the audio information to obtain frames of audio information; extract mel-frequency cepstral coefficients from each frame to obtain the audio features of each frame; and splice the audio features of the first preset number of frames to obtain the audio features of the target music.
In one embodiment, the apparatus further comprises:
a music data acquisition unit configured to acquire first music of a first genre, second music of the first genre, and third music of a second genre;
a sample construction unit configured to construct a triplet sample from the first music of the first style, the second music of the first style, and the third music of the second style;
an audio feature acquisition unit configured to perform feature extraction on the audio information of the first music, the second music, and the third music in the triplet sample to obtain the audio features corresponding to the triplet sample;
and a model training unit configured to input the audio features corresponding to the triplet sample into a triplet network model to be trained and train it based on a constraint condition and an objective function to obtain the trained triplet network model, wherein the constraint condition is that the distance between the first music and the second music is smaller than the distance between the first music and the third music, and the objective function requires that the loss accumulated over all triplet samples be minimized.
In one embodiment, the model training unit is specifically configured to input the audio features of the first, second, and third music into the three feedforward network branches of the triplet network model to be trained, respectively, according to their correspondence; perform convolution processing on those audio features through the convolution layer in each of the three branches; further process the corresponding convolution results through the hidden layer in each branch to obtain the output results of the first, second, and third music; calculate the distance between the first music and the second music and the distance between the first music and the third music from these output results; and train the triplet network model according to those two distances, the constraint condition, and the objective function to obtain the trained triplet network model.
In one embodiment, the audio feature acquisition unit is specifically configured to frame the audio information of the first music to obtain frames of audio information, extract mel-frequency cepstral coefficients from each frame to obtain the audio features of each frame, and splice the audio features of the first preset number of frames to obtain the audio features of the first music; and/or do the same for the audio information of the second music to obtain the audio features of the second music; and/or do the same for the audio information of the third music to obtain the audio features of the third music.
In one embodiment, the first music of the first style comprises first music of a first singer, the second music of the first style comprises second music of the first singer, and the third music of the second style comprises third music of a second singer;
or, the first music of the first genre includes first music of a first composer, the second music of the first genre includes second music of the first composer, and the third music of the second genre includes third music of a second composer.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the music recommendation method as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the music recommendation method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
In the music recommendation method and apparatus, electronic device, and storage medium described above, the feedforward network model is any one branch of a trained triplet network model, and the triplet network model is obtained by training on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style, under the constraint that the distance between the first music and the second music is smaller than the distance between the first music and the third music. It is easy to see that the trained triplet network model therefore maps music of the same style closer together than music of different styles, and that the training process based on this property is supervised; the feedforward network model inherits these properties. Taking the audio features of the target music as the input of the feedforward network model, the resulting first output result contains highly discriminative music-style features, which improves the recommendation accuracy for music of similar styles.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a music recommendation method according to an example embodiment.
FIG. 2 is a flow diagram illustrating a training process for a triple network model according to an exemplary embodiment.
FIG. 3 is a block diagram illustrating a framework of a triple network model in accordance with an exemplary embodiment.
FIG. 4 is a block diagram illustrating a feed-forward network model in accordance with an exemplary embodiment.
Fig. 5 is a block diagram illustrating a music recommendation device according to an example embodiment.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
FIG. 1 is a flow diagram illustrating a music recommendation method according to an example embodiment. In one embodiment, referring to fig. 1, the music recommendation method includes the following steps:
in step S21, audio information of the target music is acquired.
In step S22, feature extraction is performed on the audio information to obtain the audio features of the target music.
In step S23, the audio features are input into a preset feed-forward network model, and a first output result corresponding to the target music is obtained.
In step S24, a distance between the target music and each of the plurality of pieces of music to be recommended is determined according to the first output result corresponding to the target music and the second output result corresponding to each of the plurality of pieces of music to be recommended.
In step S25, music to be recommended that meets a preset distance requirement is selected from the plurality of music to be recommended according to the distance between the target music and each music to be recommended, and is determined as the music to be recommended to the user.
The target music may be the music the user listened to most recently or the music the user listens to most frequently, but it is not limited to these: it may be any music in the user's listening history.
The feedforward network model is any one branch of a trained triplet network model, and the triplet network model is obtained by training on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style, under the constraint that the distance between the first music and the second music is smaller than the distance between the first music and the third music.
Specifically, target music is first selected from the user's listening history, and its audio information is acquired. Feature extraction is then performed on the audio information to obtain the audio features of the target music. Optionally, audio feature extraction techniques such as mel-frequency cepstral coefficients (MFCC), linear predictive coefficients (LPC), linear predictive cepstral coefficients (LPCC), and line spectral frequencies (LSF) may be used to extract the audio features. The audio features are then input into a preset feedforward network model, which is any one branch of the trained triplet network model; in other words, the network parameters of the feedforward network model (for example, its weights) are those of the trained triplet network model. The output of the feedforward network model therefore reflects the properties of the trained triplet network model: it contains highly discriminative music-style features and better describes stylistic differences between pieces of music, improving the accuracy of music recommendation. Finally, the distance between the target music and each piece of music to be recommended is determined from the first output result corresponding to the target music and the second output result corresponding to each of the plurality of pieces of music to be recommended, and the music that meets the preset distance requirement is selected according to those distances and determined as the music to be recommended to the user.
Optionally, a preset distance threshold may be set, and the music to be recommended whose distance is less than that threshold recommended to the user.
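A sketch of this threshold-based variant (the threshold value, song IDs, and distances are hypothetical):

```python
def filter_by_threshold(distances, song_ids, threshold=0.3):
    # Keep only the candidates whose distance to the target music
    # is below the preset distance threshold.
    return [s for d, s in zip(distances, song_ids) if d < threshold]

# Illustrative distances produced by comparing output vectors.
shortlist = filter_by_threshold([0.42, 0.05, 0.13], ["s1", "s2", "s3"])
```

Unlike top-N selection, the threshold variant may return any number of candidates, including none, depending on how close the candidates are in style.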
In this music recommendation method, the feedforward network model is any one branch of the trained triplet network model, which is trained on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style under the constraint that the distance between the first music and the second music is smaller than the distance between the first music and the third music. The triplet network model therefore maps music of the same style closer together than music of different styles, and its training based on this property is supervised; the feedforward network model inherits these properties. With the audio features of the target music as its input, the first output result contains highly discriminative music-style features, improving the recommendation accuracy for music of similar styles.
In one embodiment, the method relates to a possible implementation process of determining the distance between the target music and each piece of music to be recommended according to the first output result corresponding to the target music and the second output result corresponding to each piece of music to be recommended in the plurality of pieces of music to be recommended. On the basis of the above embodiment, step S24 includes the steps of:
in step S241, a cosine distance between the first output vector and the second output vector is calculated according to the first output vector corresponding to the target music and the second output vector corresponding to each of the plurality of pieces of music to be recommended, and the cosine distance is determined as a distance between the target music and each piece of music to be recommended.
Wherein the first output result may be a first output vector, and the second output result may be a second output vector. Specifically, the cosine distance between the first output vector and the second output vector is calculated from the first output vector corresponding to the target music and the second output vector corresponding to each piece of music to be recommended in the plurality of pieces of music to be recommended. The cosine distance measures the similarity between the two vectors: the smaller the cosine distance, the higher the similarity between the first output vector and the second output vector, that is, the more similar the target music is to the music to be recommended; the larger the cosine distance, the lower the similarity, that is, the more the target music differs from the music to be recommended. The cosine distance can therefore be used to represent the distance between the target music and each piece of music to be recommended.
In this embodiment, the cosine distance is used as the similarity measure between different pieces of music, so that their similarity can be accurately reflected and the accuracy of judging similar music is improved.
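The cosine-distance computation described above can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation; the vector dimension and the example values are hypothetical:

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine distance = 1 - cosine similarity; smaller means more similar."""
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - float(sim)

# Hypothetical output vectors produced by the feedforward network model
# for the target music and one piece of music to be recommended.
target_vec = np.array([0.9, 0.1, 0.3])
candidate_vec = np.array([0.8, 0.2, 0.4])

d = cosine_distance(target_vec, candidate_vec)
```

A distance of zero indicates identical direction (maximally similar), consistent with the text's convention that a smaller cosine distance means higher similarity.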
In one embodiment, the method relates to a possible implementation process of selecting music to be recommended, which meets a preset distance requirement, from a plurality of pieces of music to be recommended according to the distance between a target music and each piece of music to be recommended, and determining the music to be recommended to a user. On the basis of the above embodiment, step S25 includes the steps of:
in step S251, a plurality of distances corresponding to a plurality of pieces of music to be recommended are arranged in ascending order according to the distance between the target music and each piece of music to be recommended.
In step S252, music to be recommended that is located in the front row and satisfies the preset number is selected and determined as music recommended to the user.
Specifically, because each piece of music to be recommended differs from the target music to a different degree, the distance between the target music and each piece of music to be recommended also varies. The distances corresponding to the plurality of pieces of music to be recommended are arranged in ascending order, and the pieces of music to be recommended that rank at the front and satisfy a preset number are selected and determined as the music recommended to the user; that is, the first N pieces of music to be recommended are selected as the music recommended to the user, where N is a positive integer.
In this embodiment, music whose style is highly similar to that of the target music is selected as the recommended music, so the recommendation accuracy is higher and user satisfaction is improved.
In another embodiment, the music recommended to the user may instead be determined by setting a preset distance threshold, comparing the distances corresponding to the plurality of pieces of music to be recommended with the preset distance threshold, and selecting the pieces of music to be recommended whose distance is smaller than the preset distance threshold as the music recommended to the user.
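Both selection strategies described above (the top-N strategy and the threshold strategy) can be sketched in a few lines. This is an illustrative sketch, not the patent's implementation; the distance values are hypothetical:

```python
import numpy as np

def recommend(distances, n=None, threshold=None):
    """Select candidate indices by ascending distance.

    With n: return the N closest candidates (top-N strategy).
    With threshold: return all candidates whose distance is below it.
    """
    order = np.argsort(distances)  # ascending order: most similar first
    if n is not None:
        return order[:n].tolist()
    return [i for i in order.tolist() if distances[i] < threshold]

dists = np.array([0.42, 0.10, 0.77, 0.25])  # hypothetical distances
top2 = recommend(dists, n=2)                # -> [1, 3]
near = recommend(dists, threshold=0.3)      # -> [1, 3]
```

Either strategy returns the candidates most similar in style to the target music; the top-N variant guarantees a fixed number of recommendations, while the threshold variant guarantees a minimum similarity.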
In one embodiment, the method relates to a possible implementation process of extracting the characteristics of the audio information to obtain the audio characteristics of the target music. On the basis of the above embodiment, step S22 includes the steps of:
in step S221, framing the audio information to obtain each frame of audio information;
in step S222, extracting mel-frequency cepstrum coefficients of each frame of audio information to obtain audio features of each frame of audio information;
in step S223, the audio features that are located in the front row and satisfy the preset number of frames in all the frames of audio information are selected for splicing, so as to obtain the audio features of the target music.
Specifically, first, the audio information is framed, and each frame of audio information is obtained. For each frame of audio information, a mel-frequency cepstrum coefficient (MFCC) is extracted, which is an audio feature of each frame of audio information. Then, the audio features that are in the front of all the frames of audio information and meet the preset number of frames are selected for splicing, for example, the MFCCs of the first 20 frames are selected for splicing, and an MFCC matrix, which is the audio feature of the target music, is obtained.
In this embodiment, the mel-frequency cepstral coefficients are used as the audio features. Because they take human auditory characteristics into account, they are closer to the sound actually heard, which improves the audio recognition rate and thus the accuracy of similar-music judgment.
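The framing-and-splicing procedure of steps S221 to S223 can be sketched as follows. The frame length, hop size, and coefficient count are assumptions (typical 25 ms / 10 ms frames at 16 kHz), and the per-frame feature here is a simple log-spectrum stand-in: a real system would compute MFCCs at that point (e.g. with librosa.feature.mfcc):

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D audio signal into overlapping frames (step S221)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def frame_feature(frame, n_coeff=13):
    """Stand-in per-frame feature: log magnitude of the first FFT bins.
    A real implementation would extract MFCCs here (step S222)."""
    mag = np.abs(np.fft.rfft(frame))[:n_coeff]
    return np.log(mag + 1e-8)

signal = np.random.default_rng(0).standard_normal(16000)  # 1 s of fake audio
frames = frame_signal(signal)
features = np.stack([frame_feature(f) for f in frames])
music_feature = features[:20].reshape(-1)  # splice the first 20 frames (step S223)
```

The spliced matrix of the first 20 frames (flattened here for simplicity) plays the role of the target music's audio feature fed to the feedforward network model.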
FIG. 2 is a flow diagram illustrating a training process for a triple network model according to an exemplary embodiment. In one embodiment, referring to fig. 2, the training process of the triple network model includes the following steps:
in step S31, acquiring first music of a first genre, second music of the first genre, and third music of a second genre;
in step S32, constructing a triple sample from the first music of the first genre, the second music of the first genre, and the third music of the second genre;
in step S33, performing feature extraction on the audio information of the first music, the audio information of the second music, and the audio information of the third music in the triple sample to obtain audio features corresponding to the triple sample;
in step S34, the audio features corresponding to the triple samples are input into the triple network model to be trained, and the triple network model to be trained is trained based on the constraint condition and the objective function of the triple network model to be trained, so as to obtain the trained triple network model, where the constraint condition includes that the distance between the first music and the second music is smaller than the distance between the first music and the third music, and the objective function includes that the sum of the accumulated distances of all the triple samples should meet the minimum requirement.
Optionally, before step S31, the audio information, singer information, and composer information of a plurality of music samples are first collected, and a sample library is constructed from this information. It should be noted that, for a given singer, the styles of the songs the singer performs are basically consistent; similarly, the styles of the pieces written by a given composer are basically consistent. Of course, to handle exceptional cases, singers whose work spans multiple styles, or songs whose style deviates from the same singer's usual style, can be excluded to ensure the representativeness of the samples. Therefore, in the sample library, the music sung by the same singer may be considered consistent in style, and the music composed by the same composer may likewise be considered consistent in style. On this basis, the selection process of the first music of the first style, the second music of the first style, and the third music of the second style may be as follows:
the first music is randomly selected from the constructed music set corresponding to a first singer, and then second music different from the first music is randomly selected from the same singer's music set; third music is then randomly selected from the constructed music set of a second singer. Using the characteristic that "different songs by the same singer" are more similar in style than "songs by different singers", supervised learning of the triplet network model can be performed, so that features more effective for judging music similarity are extracted and better similar-music recommendation is achieved. Similarly, the first music may be randomly selected from the constructed music set corresponding to a first composer, second music different from the first music may be randomly selected from the same composer's music set, and third music may be randomly selected from the constructed music set of a second composer. Using the characteristic that "different pieces by the same composer" are more similar in style than "pieces by different composers", supervised learning of the triplet network model can likewise be performed.
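The triplet selection just described can be sketched as follows. The library contents and singer names are hypothetical; the only assumption carried over from the text is that songs grouped under one singer share a style:

```python
import random

# Hypothetical sample library: songs grouped by singer; songs by the
# same singer are assumed to share a style (see the discussion above).
library = {
    "a_singer": ["a1", "a2", "a3"],
    "b_singer": ["b1", "b2"],
    "c_singer": ["c1", "c2", "c3"],
}

def sample_triplet(library, rng=random):
    """Pick (first, second, third) music: two distinct songs from one
    singer plus one song from a different singer."""
    first_singer, second_singer = rng.sample(sorted(library), 2)
    first = rng.choice(library[first_singer])
    second = rng.choice([s for s in library[first_singer] if s != first])
    third = rng.choice(library[second_singer])
    return first, second, third

rng = random.Random(42)
a, p, n = sample_triplet(library, rng)
```

The same sketch applies unchanged if the library is keyed by composer instead of singer.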
After obtaining the first music of the first style, the second music of the first style, and the third music of the second style, a triplet sample is constructed from them. For example, suppose the first music is denoted as q_ij, the second music as q_it, and the third music as q_ks; the triplet sample may then be represented as (q_ij, q_it, q_ks). Then, feature extraction is performed on the audio information of the first music, the audio information of the second music, and the audio information of the third music in the triplet sample to obtain the audio features corresponding to the triplet sample. Optionally, the audio information of the first music is framed to obtain each frame of audio information; the mel-frequency cepstral coefficients of each frame are extracted to obtain the audio features corresponding to each frame; and the audio features of the frames that rank at the front and satisfy a preset number of frames are selected and spliced to obtain the audio features of the first music. The same framing, extraction, and splicing are applied to the audio information of the second music and/or the third music to obtain the audio features of the second music and the third music, respectively.
After the audio features corresponding to the triple samples are obtained, the audio features corresponding to the triple samples are input into a triple network model to be trained, the triple network model to be trained is trained based on constraint conditions and an objective function of the triple network model to be trained, and the trained triple network model is obtained, wherein the constraint conditions comprise that the distance between first music and second music is smaller than the distance between the first music and third music, and the objective function comprises that the sum of the accumulated distances of all the triple samples meets the minimum requirement.
In this embodiment, the triplet network learns useful features by comparing distances, so that even when music data are limited, the recommendation results are more accurate than with other recommendation approaches. Moreover, the triplet network is trained in a supervised manner and can extract features that are more effective for judging music similarity, thereby improving the recommendation of similar music.
More specifically, in one embodiment, as shown in FIG. 3, the training of the triplet network model is implemented as follows. The audio features of the first music, the second music, and the third music in the audio features corresponding to the triplet sample are respectively input, according to their correspondence, into the three feedforward network models in the triplet network model to be trained. The audio features of the first, second, and third music are respectively convolved by the convolution layers in each of the three feedforward network models, where the convolution layers of the different branches share weights. Then, the corresponding convolution results are further processed by the hidden layer in each feedforward network model to obtain the output result of the first music, the output result of the second music, and the output result of the third music. Further, the distance between the first music and the second music and the distance between the first music and the third music are calculated from these output results. Optionally, when the output results are one-dimensional vectors, the cosine distance between the vectors may be calculated. Finally, the triplet network model to be trained is trained according to the distance between the first music and the second music, the distance between the first music and the third music, the constraint condition, and the objective function, so as to obtain the trained triplet network model.
Illustratively, the constraint may be expressed as: Loss(q_ij, q_it, q_ks) = max(sim2 − sim1, 0), where q_ij represents the first music, q_it represents the second music, q_ks represents the third music, sim1 represents the cosine similarity between q_ij and q_it, and sim2 represents the cosine similarity between q_ij and q_ks; the loss is zero once the first and second music are more similar than the first and third music. The objective function may be expressed as:

Loss_total = Σ_{i=1}^{n} Loss(q_ij, q_it, q_ks),

where n represents the number of triplet samples, that is, the sum of the losses accumulated over all triplet samples should be minimized.
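The loss and objective above can be sketched numerically as follows. This is an illustrative sketch that assumes cosine similarity as the measure (consistent with the constraint that same-style pairs should be closer); the example vectors are hypothetical:

```python
import numpy as np

def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_loss(anchor, positive, negative):
    """Loss(q_ij, q_it, q_ks) = max(sim2 - sim1, 0): zero once the
    anchor-positive pair is more similar than the anchor-negative pair."""
    sim1 = cos_sim(anchor, positive)   # first vs. second music
    sim2 = cos_sim(anchor, negative)   # first vs. third music
    return max(sim2 - sim1, 0.0)

def total_loss(triplets):
    """Objective: accumulated loss over all n triplet samples."""
    return sum(triplet_loss(a, p, n) for a, p, n in triplets)

a = np.array([1.0, 0.0])   # first music embedding
p = np.array([0.9, 0.1])   # second music (same style)
n = np.array([0.0, 1.0])   # third music (different style)
loss = triplet_loss(a, p, n)   # 0.0: the constraint is already satisfied
```

Minimizing the accumulated loss pushes same-style embeddings together and different-style embeddings apart, which is exactly the property the feedforward branch later exploits at recommendation time.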
After the training of the triplet network model is completed, any one of its branch feedforward networks can be selected as the feedforward network model shown in fig. 4, so as to obtain the vector representation of the target music.
In this embodiment, the triplet network model learns that music of similar style is closer than music of dissimilar style, so that the similarity between pieces of music can be judged based on distance, improving the accuracy of music recommendation.
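The weight-sharing property of the three branches can be sketched as follows. This is a toy dense-network stand-in for the convolution-plus-hidden-layer branches of FIG. 3: the layer sizes, the dense layers, and the random inputs are all assumptions, but the key point it demonstrates is real, namely that all three branches apply identical parameters, so any one branch can later serve as the standalone feedforward model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared parameters: the three branches of the triplet network use the
# very same weight matrices. Sizes here are hypothetical.
W1 = rng.standard_normal((260, 32)) * 0.1   # stand-in for the conv layer
W2 = rng.standard_normal((32, 16)) * 0.1    # stand-in for the hidden layer

def branch(x):
    """One branch of the triplet network; identical for anchor,
    positive, and negative inputs because W1 and W2 are shared."""
    h = np.tanh(x @ W1)
    return np.tanh(h @ W2)

# Hypothetical spliced audio features for the three triplet members.
anchor_feat, pos_feat, neg_feat = rng.standard_normal((3, 260))
emb_a, emb_p, emb_n = branch(anchor_feat), branch(pos_feat), branch(neg_feat)
```

Because the branches share weights, the embedding of the target music at recommendation time is computed by exactly the same function that produced the training-time embeddings.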
It should be understood that although the steps in the flowcharts of figs. 1-2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 1-2 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
FIG. 5 is a block diagram illustrating a music recommendation device according to an example embodiment. Referring to fig. 5, the apparatus 40 includes an audio information acquisition unit 411, an audio feature determination unit 412, an output result obtaining unit 413, a distance determining unit 414, and a recommended music determining unit 415.
The audio information acquisition unit 411 is configured to acquire audio information of target music;
the audio feature determination unit 412 is configured to perform feature extraction on the audio information to obtain an audio feature of the target music;
the output result obtaining unit 413 is configured to input the audio features into a preset feed-forward network model to obtain a first output result corresponding to the target music, where the feed-forward network model is any one branch network of a trained triplet network model, and the triplet network model is obtained by training triplet samples constructed from first music of a first style, second music of the first style and third music of a second style, under the constraint that the distance between the first music and the second music is smaller than the distance between the first music and the third music;
the distance determining unit 414 is configured to determine a distance between the target music and each piece of music to be recommended according to a first output result corresponding to the target music and a second output result corresponding to each piece of music to be recommended in the plurality of pieces of music to be recommended;
the recommended music determining unit 415 is configured to select music to be recommended, which satisfies a preset distance requirement, from a plurality of pieces of music to be recommended, according to a distance between the target music and each piece of music to be recommended, and determine the music to be recommended to the user.
In the music recommendation device, the feedforward network model is any one branch network of the trained triplet network model, and the triplet network model is obtained by training on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style, under the constraint that the distance between the first music and the second music is smaller than the distance between the first music and the third music. The trained triplet network model therefore captures the characteristic that music of the same style is more similar than music of different styles, and since the training process based on this characteristic is supervised, the feedforward network model inherits the same characteristic. With the audio features of the target music as the input of the feedforward network model, the obtained first output result contains music style features with high discrimination, thereby improving the recommendation accuracy for music of similar styles.
In one embodiment, the distance determining unit 414 is specifically configured to calculate a cosine distance between the first output vector and the second output vector according to the first output vector corresponding to the target music and the second output vector corresponding to each of the plurality of pieces of music to be recommended, and determine the cosine distance as a distance between the target music and each piece of music to be recommended.
In one embodiment, the recommended music determining unit 415 is specifically configured to arrange a plurality of distances corresponding to a plurality of pieces of music to be recommended in an ascending order according to the distance between the target music and each piece of music to be recommended; and selecting the music to be recommended which is positioned in the front row and meets the preset number, and determining the music to be recommended to the user.
In one embodiment, the audio feature determining unit 412 is specifically configured to frame the audio information to obtain each frame of audio information; extracting a Mel frequency cepstrum coefficient of each frame of audio information to obtain audio features of each frame of audio information; and selecting audio features which are positioned in the front row and meet the preset frame number from all the frame audio information for splicing to obtain the audio features of the target music.
In one embodiment, the output result obtaining unit 413 is specifically configured to obtain a first music of a first style, a second music of the first style, and a third music of a second style; constructing a triple sample according to first music of a first style, second music of the first style and third music of a second style; extracting the characteristics of the audio information of the first music, the audio information of the second music and the audio information of the third music in the triple sample to obtain the audio characteristics corresponding to the triple sample; inputting the audio features corresponding to the triple samples into a triple network model to be trained, and training the triple network model to be trained based on constraint conditions and an objective function of the triple network model to be trained to obtain the trained triple network model, wherein the constraint conditions comprise that the distance between first music and second music is smaller than the distance between the first music and third music, and the objective function comprises that the sum of the accumulated distances of all the triple samples meets the minimum requirement.
In one embodiment, the output result obtaining unit 413 is specifically configured to input the audio feature of the first music, the audio feature of the second music, and the audio feature of the third music in the audio features corresponding to the triple samples into a three-way feedforward network model in the triple network model to be trained according to the corresponding relationship; performing convolution processing on the audio features of the corresponding first music, the audio features of the corresponding second music and the audio features of the corresponding third music through a convolution layer in each feedforward network model in the three feedforward network models; respectively reprocessing the corresponding convolution processing results through a hidden layer in each feed-forward network model to obtain an output result of the first music, an output result of the second music and an output result of the third music; calculating the distance between the first music and the second music and the distance between the first music and the third music according to the output result of the first music, the output result of the second music and the output result of the third music; and training the triple network model to be trained according to the distance between the first music and the second music, the distance between the first music and the third music, the constraint condition and the objective function to obtain the trained triple network model.
For specific limitations of the music recommendation device, reference may be made to the above limitations of the music recommendation method, which are not described herein again. The modules in the music recommendation device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment. The electronic device may be a server. The electronic device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the electronic device is used for storing data of user interaction with the article. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a music recommendation method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, there is provided an electronic device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to perform the steps of:
acquiring audio information of target music;
extracting the characteristics of the audio information to obtain the audio characteristics of the target music;
inputting the audio features into a preset feedforward network model to obtain a first output result corresponding to the target music, wherein the feedforward network model is any one branch network of a trained triplet network model, and the triplet network model is obtained by training triplet samples constructed from first music of a first style, second music of the first style and third music of a second style, under the constraint that the distance between the first music and the second music is smaller than the distance between the first music and the third music;
determining the distance between the target music and each piece of music to be recommended according to a first output result corresponding to the target music and a second output result corresponding to each piece of music to be recommended in the plurality of pieces of music to be recommended;
and selecting the music to be recommended meeting the preset distance requirement from the plurality of music to be recommended according to the distance between the target music and each music to be recommended, and determining the music to be recommended to the user.
In the electronic device, the feedforward network model is any one branch network of the trained triplet network model, and the triplet network model is obtained by training on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style, under the constraint that the distance between the first music and the second music is smaller than the distance between the first music and the third music. The trained triplet network model therefore captures the characteristic that music of the same style is more similar than music of different styles, and since the training process based on this characteristic is supervised, the feedforward network model inherits the same characteristic. With the audio features of the target music as the input of the feedforward network model, the obtained first output result contains music style features with high discrimination, thereby improving the recommendation accuracy for music of similar styles.
In one embodiment, the processor when executing the instructions further performs the steps of: according to a first output vector corresponding to the target music and a second output vector corresponding to each piece of music to be recommended in the plurality of pieces of music to be recommended, calculating the cosine distance between the first output vector and the second output vector, and determining the cosine distance as the distance between the target music and each piece of music to be recommended.
In one embodiment, the processor when executing the instructions further performs the steps of: arranging a plurality of distances corresponding to a plurality of pieces of music to be recommended according to the distance between the target music and each piece of music to be recommended in an ascending order; and selecting the music to be recommended which is positioned in the front row and meets the preset number, and determining the music to be recommended to the user.
In one embodiment, the processor when executing the instructions further performs the steps of: framing the audio information to obtain each frame of audio information; extracting a Mel frequency cepstrum coefficient of each frame of audio information to obtain audio features of each frame of audio information; and selecting audio features which are positioned in the front row and meet the preset frame number from all the frame audio information for splicing to obtain the audio features of the target music.
In one embodiment, the processor when executing the instructions further performs the steps of: acquiring first music of a first style, second music of the first style and third music of a second style; constructing a triple sample according to first music of a first style, second music of the first style and third music of a second style; extracting the characteristics of the audio information of the first music, the audio information of the second music and the audio information of the third music in the triple sample to obtain the audio characteristics corresponding to the triple sample; inputting the audio features corresponding to the triple samples into a triple network model to be trained, and training the triple network model to be trained based on constraint conditions and an objective function of the triple network model to be trained to obtain the trained triple network model, wherein the constraint conditions comprise that the distance between first music and second music is smaller than the distance between the first music and third music, and the objective function comprises that the sum of the accumulated distances of all the triple samples meets the minimum requirement.
In one embodiment, the processor when executing the instructions further performs the steps of: respectively inputting the audio features of the first music, the second music and the third music in the audio features corresponding to the triple samples into a three-way feedforward network model in a triple network model to be trained according to the corresponding relation; performing convolution processing on the audio features of the corresponding first music, the audio features of the corresponding second music and the audio features of the corresponding third music through a convolution layer in each feedforward network model in the three feedforward network models; respectively reprocessing the corresponding convolution processing results through a hidden layer in each feed-forward network model to obtain an output result of the first music, an output result of the second music and an output result of the third music; calculating the distance between the first music and the second music and the distance between the first music and the third music according to the output result of the first music, the output result of the second music and the output result of the third music; and training the triple network model to be trained according to the distance between the first music and the second music, the distance between the first music and the third music, the constraint condition and the objective function to obtain the trained triple network model.
In an exemplary embodiment, there is also provided a storage medium comprising instructions, such as a memory comprising instructions, executable by a processor of an apparatus to perform the above method.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring audio information of target music;
extracting the characteristics of the audio information to obtain the audio characteristics of the target music;
inputting the audio features into a preset feedforward network model to obtain a first output result corresponding to the target music, wherein the feedforward network model is any one branch of a trained triplet network model, and the triplet network model is trained on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style, based on a constraint condition that the distance between the first music and the second music is smaller than the distance between the first music and the third music;
determining the distance between the target music and each piece of music to be recommended according to a first output result corresponding to the target music and a second output result corresponding to each piece of music to be recommended in the plurality of pieces of music to be recommended;
and selecting, from the plurality of pieces of music to be recommended, the music that meets a preset distance requirement according to the distance between the target music and each piece of music to be recommended, and determining the selected music as the music to be recommended to the user.
With this computer-readable storage medium, the feedforward network model is one branch of a trained triplet network model, and the triplet network model is trained on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style, under the constraint that the distance between the first music and the second music is smaller than the distance between the first music and the third music. Through this supervised training process, the trained triplet network model learns that music of the same style is more similar than music of different styles, and the feedforward network model inherits this property. As a result, when the audio features of the target music are used as the input of the feedforward network model, the first output result contains highly discriminative music-style features, which improves the recommendation accuracy for music of a similar style.
In one embodiment, the computer program when executed by the processor further performs the steps of: calculating the cosine distance between a first output vector corresponding to the target music and a second output vector corresponding to each piece of music to be recommended in the plurality of pieces of music to be recommended, and determining the cosine distance as the distance between the target music and that piece of music to be recommended.
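As a sketch, the cosine distance can be computed as one minus the cosine similarity of the two output vectors; the example vectors below are illustrative:

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two output vectors: 1 minus cosine similarity.
    A smaller distance means the two pieces of music are closer in style."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

target_vec = np.array([1.0, 0.0, 1.0])        # first output vector (target music)
candidate_vecs = [np.array([2.0, 0.0, 2.0]),  # same direction -> distance ~0
                  np.array([0.0, 1.0, 0.0])]  # orthogonal    -> distance ~1
distances = [cosine_distance(target_vec, v) for v in candidate_vecs]
```

Cosine distance depends only on the direction of the output vectors, not their magnitude, which makes it a natural fit for comparing style embeddings.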
In one embodiment, the computer program when executed by the processor further performs the steps of: arranging the plurality of distances corresponding to the plurality of pieces of music to be recommended in ascending order of the distance between the target music and each piece of music to be recommended; and selecting a preset number of pieces of music to be recommended from the front of the ordering, and determining them as the music to be recommended to the user.
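The ascending-order selection can be sketched as follows; the music identifiers and distance values are hypothetical:

```python
def select_recommendations(distances, preset_number=2):
    """Sort candidates by ascending distance to the target music and keep the
    first `preset_number` entries at the front of the ordering."""
    ranked = sorted(distances.items(), key=lambda kv: kv[1])  # ascending order
    return [music_id for music_id, _ in ranked[:preset_number]]

# Hypothetical candidate ids and their distances to the target music.
candidates = {"song_a": 0.42, "song_b": 0.05, "song_c": 0.88, "song_d": 0.13}
nearest = select_recommendations(candidates)  # ["song_b", "song_d"]
```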
In one embodiment, the computer program when executed by the processor further performs the steps of: framing the audio information to obtain frames of audio information; extracting the Mel-frequency cepstral coefficients of each frame to obtain the audio features of each frame; and selecting the audio features of the first preset number of frames from all frames and splicing them to obtain the audio features of the target music.
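A minimal numpy sketch of the framing-and-splicing step; the frame length, hop size, and per-frame feature are stand-in assumptions (the patent extracts Mel-frequency cepstral coefficients per frame, which in practice would come from an audio library such as librosa):

```python
import numpy as np

def frame_signal(audio, frame_len=400, hop=160):
    """Framing step: split a 1-D audio signal into overlapping frames."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    return np.stack([audio[i * hop:i * hop + frame_len] for i in range(n_frames)])

def spliced_features(audio, preset_frames=64, n_coeffs=13):
    frames = frame_signal(audio)
    # Per-frame feature: the patent uses Mel-frequency cepstral coefficients;
    # a truncated log-magnitude spectrum stands in here so the sketch runs
    # without an audio library (librosa.feature.mfcc or similar in practice).
    feats = np.log1p(np.abs(np.fft.rfft(frames, axis=1))[:, :n_coeffs])
    # Keep the audio features of the first `preset_frames` frames and splice
    # them into a single feature vector for the piece of music.
    return feats[:preset_frames].ravel()

audio = np.sin(np.linspace(0.0, 100.0 * np.pi, 16000))  # one second of toy audio
vec = spliced_features(audio)
```

Truncating every piece of music to the same preset number of frames gives every input to the feedforward network a fixed dimensionality.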
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring first music of a first style, second music of the first style, and third music of a second style; constructing a triplet sample from the first music of the first style, the second music of the first style, and the third music of the second style; performing feature extraction on the audio information of the first music, the audio information of the second music, and the audio information of the third music in the triplet sample to obtain the audio features corresponding to the triplet sample; and inputting the audio features corresponding to the triplet sample into a triplet network model to be trained, and training the triplet network model to be trained based on a constraint condition and an objective function of the triplet network model to be trained to obtain the trained triplet network model, wherein the constraint condition includes that the distance between the first music and the second music is smaller than the distance between the first music and the third music, and the objective function includes minimizing the sum of the accumulated distances over all triplet samples.
In one embodiment, the computer program when executed by the processor further performs the steps of: respectively inputting the audio features of the first music, the second music, and the third music, from the audio features corresponding to the triplet sample, into the three feedforward network branches of the triplet network model to be trained according to their correspondence; performing convolution processing on the corresponding audio features of the first music, the second music, and the third music through a convolution layer in each of the three feedforward network models; further processing the corresponding convolution results through a hidden layer in each feedforward network model to obtain an output result of the first music, an output result of the second music, and an output result of the third music; calculating the distance between the first music and the second music and the distance between the first music and the third music according to these output results; and training the triplet network model to be trained according to the two distances, the constraint condition, and the objective function to obtain the trained triplet network model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A music recommendation method, comprising:
acquiring audio information of target music;
extracting the characteristics of the audio information to obtain the audio characteristics of the target music;
inputting the audio features into a preset feedforward network model to obtain a first output result corresponding to the target music, wherein the feedforward network model is any one branch of a trained triplet network model, and the triplet network model is trained on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style, based on a constraint condition that the distance between the first music and the second music is smaller than the distance between the first music and the third music;
determining the distance between the target music and each piece of music to be recommended according to a first output result corresponding to the target music and a second output result corresponding to each piece of music to be recommended in a plurality of pieces of music to be recommended;
and selecting, from the plurality of pieces of music to be recommended, the music that meets a preset distance requirement according to the distance between the target music and each piece of music to be recommended, and determining the selected music as the music to be recommended to the user.
2. The music recommendation method according to claim 1, wherein the step of determining the distance between the target music and each of the plurality of pieces of music to be recommended according to the first output result corresponding to the target music and the second output result corresponding to each of the plurality of pieces of music to be recommended comprises:
according to a first output vector corresponding to the target music and a second output vector corresponding to each piece of music to be recommended in the plurality of pieces of music to be recommended, calculating the cosine distance between the first output vector and the second output vector, and determining the cosine distance as the distance between the target music and each piece of music to be recommended.
3. The music recommendation method according to claim 1, wherein the step of selecting, from the plurality of pieces of music to be recommended, music to be recommended that meets a preset distance requirement according to the distance between the target music and each piece of music to be recommended, and determining the music to be recommended to the user comprises:
arranging the plurality of distances corresponding to the plurality of pieces of music to be recommended in ascending order of the distance between the target music and each piece of music to be recommended;
and selecting a preset number of pieces of music to be recommended from the front of the ordering, and determining them as the music to be recommended to the user.
4. The music recommendation method according to claim 1, wherein the step of extracting the features of the audio information to obtain the audio features of the target music comprises:
framing the audio information to obtain frames of audio information;
extracting the Mel-frequency cepstral coefficients of each frame of audio information to obtain the audio features of each frame;
and selecting the audio features of the first preset number of frames from all frames of audio information and splicing them to obtain the audio features of the target music.
5. The music recommendation method of claim 1, wherein the training process of the triplet network model comprises:
acquiring first music of a first style, second music of the first style and third music of a second style;
constructing a triplet sample from the first music of the first style, the second music of the first style, and the third music of the second style;
performing feature extraction on the audio information of the first music, the audio information of the second music, and the audio information of the third music in the triplet sample to obtain audio features corresponding to the triplet sample;
inputting the audio features corresponding to the triplet sample into a triplet network model to be trained, and training the triplet network model to be trained based on a constraint condition and an objective function of the triplet network model to be trained to obtain the trained triplet network model, wherein the constraint condition includes that the distance between the first music and the second music is smaller than the distance between the first music and the third music, and the objective function includes minimizing the sum of the accumulated distances over all triplet samples.
6. The music recommendation method according to claim 5, wherein the step of inputting the audio features corresponding to the triplet sample into the triplet network model to be trained, and training the triplet network model to be trained based on the constraint condition and the objective function of the triplet network model to be trained to obtain the trained triplet network model comprises:
respectively inputting the audio features of the first music, the second music, and the third music, from the audio features corresponding to the triplet sample, into the three feedforward network branches of the triplet network model to be trained according to their correspondence;
performing convolution processing on the corresponding audio features of the first music, the audio features of the second music, and the audio features of the third music through a convolution layer in each of the three feedforward network models;
further processing the corresponding convolution results through the hidden layer in each feedforward network model to obtain an output result of the first music, an output result of the second music, and an output result of the third music;
calculating the distance between the first music and the second music and the distance between the first music and the third music according to the output result of the first music, the output result of the second music, and the output result of the third music;
and training the triplet network model to be trained according to the distance between the first music and the second music, the distance between the first music and the third music, the constraint condition, and the objective function to obtain the trained triplet network model.
7. The music recommendation method according to claim 5, wherein performing feature extraction on the audio information of the first music, the audio information of the second music, and the audio information of the third music in the triplet sample to obtain the audio features corresponding to the triplet sample comprises:
framing the audio information of the first music to obtain frames of audio information; extracting the Mel-frequency cepstral coefficients of each frame to obtain the audio features corresponding to each frame; and selecting the audio features of the first preset number of frames from all frames and splicing them to obtain the audio features of the first music;
and/or framing the audio information of the second music to obtain frames of audio information; extracting the Mel-frequency cepstral coefficients of each frame to obtain the audio features corresponding to each frame; and selecting the audio features of the first preset number of frames from all frames and splicing them to obtain the audio features of the second music;
and/or framing the audio information of the third music to obtain frames of audio information; extracting the Mel-frequency cepstral coefficients of each frame to obtain the audio features corresponding to each frame; and selecting the audio features of the first preset number of frames from all frames and splicing them to obtain the audio features of the third music.
8. A music recommendation device, comprising:
an audio information acquisition unit configured to acquire audio information of a target music;
an audio characteristic determination unit configured to perform characteristic extraction on the audio information to obtain an audio characteristic of the target music;
the output result obtaining unit is configured to input the audio features into a preset feedforward network model to obtain a first output result corresponding to the target music, wherein the feedforward network model is any one branch of a trained triplet network model, and the triplet network model is trained on triplet samples constructed from first music of a first style, second music of the first style, and third music of a second style, based on a constraint condition that the distance between the first music and the second music is smaller than the distance between the first music and the third music;
a distance determining unit configured to determine a distance between the target music and each piece of music to be recommended according to a first output result corresponding to the target music and a second output result corresponding to each piece of music to be recommended in a plurality of pieces of music to be recommended;
and the recommended music determining unit is configured to select, from the plurality of pieces of music to be recommended, music that meets a preset distance requirement according to the distance between the target music and each piece of music to be recommended, and determine the selected music as the music to be recommended to the user.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the music recommendation method of any of claims 1-7.
10. A storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the music recommendation method of any of claims 1-7.
CN202010113506.9A 2020-02-24 2020-02-24 Music recommendation method, device, electronic equipment and storage medium Active CN113297412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010113506.9A CN113297412B (en) 2020-02-24 2020-02-24 Music recommendation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010113506.9A CN113297412B (en) 2020-02-24 2020-02-24 Music recommendation method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113297412A true CN113297412A (en) 2021-08-24
CN113297412B CN113297412B (en) 2023-09-29

Family

ID=77317770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010113506.9A Active CN113297412B (en) 2020-02-24 2020-02-24 Music recommendation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113297412B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020181711A1 (en) * 2000-11-02 2002-12-05 Compaq Information Technologies Group, L.P. Music similarity function based on signal analysis
CN110188236A (en) * 2019-04-22 2019-08-30 北京达佳互联信息技术有限公司 A kind of recommended method of music, apparatus and system
CN110209869A (en) * 2018-08-13 2019-09-06 腾讯科技(深圳)有限公司 A kind of audio file recommended method, device and storage medium
WO2019233360A1 (en) * 2018-06-05 2019-12-12 安克创新科技股份有限公司 Deep learning-based audio equalization method, device and system


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022230425A1 (en) * 2021-04-27 2022-11-03 株式会社Nttドコモ Feature amount output model generation system
CN114664316A (en) * 2022-05-17 2022-06-24 深圳市盛天龙视听科技有限公司 Audio restoration method, device, equipment and medium based on automatic pickup
CN114664316B (en) * 2022-05-17 2022-10-04 深圳市盛天龙视听科技有限公司 Audio restoration method, device, equipment and medium based on automatic pickup

Also Published As

Publication number Publication date
CN113297412B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
US11551708B2 (en) Label generation device, model learning device, emotion recognition apparatus, methods therefor, program, and recording medium
CN110223673B (en) Voice processing method and device, storage medium and electronic equipment
CN109785859B (en) Method, device and computer equipment for managing music based on voice analysis
WO2019233360A1 (en) Deep learning-based audio equalization method, device and system
CN111311327A (en) Service evaluation method, device, equipment and storage medium based on artificial intelligence
TWI396105B (en) Digital data processing method for personalized information retrieval and computer readable storage medium and information retrieval system thereof
CN111309965B (en) Audio matching method, device, computer equipment and storage medium
US10573311B1 (en) Generating self-support metrics based on paralinguistic information
US20170054779A1 (en) Media Feature Determination for Internet-based Media Streaming
CN108899033B (en) Method and device for determining speaker characteristics
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
Zewoudie et al. The use of long-term features for GMM-and i-vector-based speaker diarization systems
CN113314119B (en) Voice recognition intelligent household control method and device
CN106302987A (en) A kind of audio frequency recommends method and apparatus
Yoshimura et al. A hierarchical predictor of synthetic speech naturalness using neural networks
CN113297412B (en) Music recommendation method, device, electronic equipment and storage medium
Shankar et al. A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective.
CN111833842A (en) Synthetic sound template discovery method, device and equipment
CN113223485A (en) Training method of beat detection model, beat detection method and device
CN112052686A (en) Voice learning resource pushing method for user interactive education
Chaudhary et al. Automatic music emotion classification using hashtag graph
CN108777804B (en) Media playing method and device
CN110070891A (en) A kind of song recognition method, apparatus and storage medium
CN111859008A (en) Music recommending method and terminal
JP5961048B2 (en) Auditory impression estimation device and program thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant