CN111611432A

CN111611432A - Singer classification method based on Labeled LDA model

Info

Publication number: CN111611432A
Application number: CN202010477122.5A
Authority: CN
Inventors: 籍汉超; 王丹; 张力; 齐保峰
Original assignee: Beijing Kuwo Technology Co Ltd
Current assignee: Beijing Kuwo Technology Co Ltd
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2020-09-01
Anticipated expiration: 2040-05-29
Also published as: CN111611432B

Abstract

The invention relates to a singer classification method based on a Labeled LDA model, which comprises the following steps: S1, collecting manual labels of singers and preprocessing them; S2, establishing a singer classification model based on user behavior and collecting user behavior data; S3, cleaning the user behavior data and filtering out data that does not benefit model training; S4, assigning, for each user in the user behavior data, a weight with respect to each singer; S5, merging the user behavior data and the manual label data to generate training data; and S6, training the Labeled LDA model on the training data using optimized Gibbs sampling that takes the label grouping relationships into account. The invention uses users' song-listening behavior as training data, so user coverage is high and the preferences of every user group are taken into account. Because changes in user behavior reflect changes in social trends and public perception, the model can be retrained periodically to follow them. The method is adaptable and accurate, improves label coverage, and yields a sufficiently fine-grained classification.

Description

Singer classification method based on Labeled LDA model

Technical Field

The invention relates to the technical field of internet personalized services, in particular to a singer classification method based on a Labeled LDA model.

Background

In the last decade, internet music has developed rapidly and has gradually eroded the traditional music market. Internet music services such as Tencent Music Entertainment Group, NetEase Cloud Music, and Xiami Music have entered countless households. In the traditional music market, apart from limited promotional channels such as television, film, and the internet, users generally discovered new music (songs) in record stores.

Internet music uses the music app as its channel, and the selection available to users in a music app is unprecedentedly rich. Users cannot, however, know every song and every singer, so an effective means of information filtering is needed to help them screen songs. Internet services today typically have millions to billions of users, each with somewhat different listening tastes, and it is clearly unrealistic for human operators to screen music for every user. In short, personalized music recommendation is very important to the user experience in the internet music era.

In internet industries such as advertising and e-commerce, items naturally carry category attribute labels that identify the categories they belong to. For example, a pair of running shoes belongs to the category 'clothing -> shoes and boots -> sports shoes -> running shoes', and a Tian Wang watch advertisement belongs to the category 'luxury goods -> clocks and watches -> mechanical watches'. The categories of such items are relatively clear and objective, and many current recommendation algorithms are built on this objective category label system.

Internet music is also part of the internet industry, but unlike other internet industries, the classification of music (also called music genre) is often subjective, fuzzy, and abstract. The most basic attribute of a piece of music (a song) is its singer. If singers can be classified with reasonable accuracy, the music genre can be determined to some extent, which is very important for the design of recommendation algorithms.

In the prior art, singers are classified to obtain singer labels in two common ways: one extracts machine labels as singer labels by machine learning, and the other obtains manual labels as singer labels through expert judgment.

Extracting machine labels as singer labels, by collecting indirect data and building machine learning models such as clustering and topic extraction, has the following limitations:

(1) The resulting categories are abstract and have little interpretability. After the model extracts labels for singers, the genre attributes of the singers or their music cannot be assessed intuitively, so the approach adapts poorly to some application scenarios.

(2) The classification is not controllable: the number of categories is uncertain, and so is the number of items in each category. This is unfriendly to many recommendation algorithms and may affect both whether they can be implemented and how efficiently they run.

(3) When the same data is used for learning, the results of repeated runs are often inconsistent. The learned result may depend on differences in the initial values and on the order in which the data is input. When several results are obtained, it is hard to judge which is better, and manual screening is likely to be needed.

Obtaining manual labels as singer labels through expert tagging has the following limitations:

(1) For a given cost, experts can label only a small number of well-known, representative singers. For the majority of singers, who are less popular or whose style is more ambiguous, the coverage of manual labeling is very limited.

(2) Many singers have several manual labels at once, and experts cannot assign a weight to each of a singer's labels. In practice, however, a singer's work usually has a particular emphasis, so the labels differ in importance.

(3) In many cases experts and users are two distinct groups who inevitably differ in how they classify music genres. The labels that experts assign to a singer therefore do not necessarily match users' perception.

(4) The labels that experts assign to singers are generally fixed and do not change over time, whereas from the user's perspective a singer's genre is likely to change over time.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a singer classification method based on a Labeled LDA model. The method uses users' song-listening behavior as training data, so user coverage is high and the preferences of every user group are taken into account. Because changes in user behavior reflect changes in social trends and public perception, the model can be retrained periodically to follow them. The method is adaptable and accurate, improves label coverage, and yields a sufficiently fine-grained classification.

In order to achieve the above purposes, the technical scheme adopted by the invention is as follows:

A singer classification method based on a Labeled LDA model, characterized by comprising the following steps:

S1, collecting manual labels of singers and preprocessing them to serve as a training reference;

S2, establishing a singer classification model based on user behavior, and collecting user behavior data;

S3, cleaning the user behavior data and filtering out data that does not benefit model training;

S4, assigning, for each user in the user behavior data, a weight with respect to each singer;

S5, merging the user behavior data and the manual label data to generate training data;

and S6, training the Labeled LDA model on the training data using optimized Gibbs sampling that takes the label grouping relationships into account.

On the basis of the above technical solution, in step S1, the collected manual labels of singers are divided by dimension, where the dimensions include any of the following:

singer language classification dimension; this dimension includes, but is not limited to, the following manual labels: Chinese, Cantonese, European and American;

singer genre classification dimension; this dimension includes, but is not limited to, the following manual labels: pop, classical, rock, ballad;

singer style classification dimension; this dimension includes, but is not limited to, the following manual labels: Chinese style, dance, ancient style, grassland style;

singer primary instrument classification dimension; this dimension includes, but is not limited to, the following manual labels: saxophone, violin, piano;

era classification dimension; this dimension includes, but is not limited to, the following manual labels: early, middle, late;

in step S1, the preprocessing includes: evaluating the accuracy of the manual labels in each dimension, and forming a label system from the manual labels of those dimensions that reach or exceed a preset threshold;

dimensions other than those listed above are added to the label system once their accuracy has been evaluated and reaches or exceeds the preset threshold;

the preprocessing further includes: optimizing the granularity of the manual labels by dimension, specifically:

evaluating the granularity of each manual label,

for manual labels that cover a wide range, subdividing them further by combining dimensions,

where, by default, subdivision does not go deeper than three levels.

On the basis of the above technical solution, in step S1, the preprocessing further includes: generating tag groups, in which similar manual labels are merged into the same tag group.

On the basis of the above technical solution, in step S2, establishing the singer classification model based on user behavior means converting the singer classification problem into a document topic classification problem and then applying a document topic classification model to classify singers. Specifically:

users are aggregated, as words, into an article that represents a singer, and these articles serve as training data,

a user here is a user who has played a song by the singer in its entirety,

each singer is treated as one article, with articles and singers in one-to-one correspondence,

a behavior time window w is set, and user behavior within the window is interpreted as follows: if a user plays a song by the singer in its entirety within the behavior time window w, the user is treated as a word and the word is merged into the article corresponding to that singer,

that is: an article corresponds to a singer, the content of the article consists of words, each word corresponds to a user, and the user satisfies the condition of having played a song by the singer in its entirety within the behavior time window w;

collecting the user behavior data means: obtaining, within the behavior time window w, the behavior of every user who has played a song by a singer in its entirety,

where a user behavior record specifically includes: a song that was played in its entirety and the singer of that song.
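The conversion of play records into articles can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout `(user, singer, played_at, completed)` and the function name `build_articles` are hypothetical, and the sketch assumes that each complete play contributes one occurrence of the user word, so that repeated plays raise the word's count in the article.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_articles(play_events, window_end, window_days):
    """Return {singer: [user, ...]} from complete plays inside the window.

    play_events is an iterable of (user, singer, played_at, completed)
    tuples; this layout is an assumption made for the example.
    """
    window_start = window_end - timedelta(days=window_days)
    articles = defaultdict(list)
    for user, singer, played_at, completed in play_events:
        # Only complete plays inside the behavior time window w count.
        if completed and window_start <= played_at <= window_end:
            articles[singer].append(user)  # the user becomes one "word"
    return dict(articles)


events = [
    ("u1", "singer_a", datetime(2020, 5, 20), True),
    ("u1", "singer_a", datetime(2020, 5, 21), True),
    ("u2", "singer_a", datetime(2020, 5, 22), False),  # not a complete play
    ("u2", "singer_b", datetime(2020, 3, 1), True),    # outside the window
    ("u3", "singer_b", datetime(2020, 5, 25), True),
]
articles = build_articles(events, window_end=datetime(2020, 5, 29), window_days=30)
print(articles)
```

With a 30-day window ending on 2020-05-29, only the plays by u1 and u3 survive, so `singer_a` becomes an article of two u1 words and `singer_b` an article of one u3 word.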

On the basis of the above technical solution, the length of the behavior time window w can be adjusted within a certain range, where:

the longer w is, the more stable the training result, but the weaker the adaptability, and the less the result can follow changes in users' overall interests;

the shorter w is, the less stable the training result and the greater the randomness of the classification result, but the stronger the adaptability, so that the classification result can follow changes in users' overall interests;

the length of w is chosen according to the following principles:

if system traffic is high and many user behaviors are generated per unit time, a relatively short behavior time window is selected, and the resulting classification better reflects the latest trends;

if high interpretability is required, for example because specific categories are to be shown to users, a relatively long behavior time window is selected to generate a more interpretable classification result.

On the basis of the foregoing technical solution, in step S3, the cleaning the user behavior data specifically includes:

for users who have listened to songs by many singers, that is, words (users) contained in many articles (singers), a threshold L_user-max is set; a word contained in more than L_user-max articles indicates a user whose interests are too broad, which hinders convergence of the model, so the word is removed from all articles;

for users who have listened to only a few singers, that is, words contained in only a very limited number of articles, a threshold L_user-min is set; a word contained in fewer than L_user-min articles contributes little to model training (in the extreme case, a user who listens to only one singer's songs provides no help at all in classifying singers), so the word is removed from all articles;

after these two cleaning steps, a threshold L_artist-min is set for articles that contain too few words; for an article containing fewer than L_artist-min words, the available data is considered insufficient to determine the singer's classification accurately, and the article is removed.
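The three cleaning rules can be sketched as below. The function and parameter names are hypothetical, and the sketch assumes that the word count of an article is its number of word occurrences (tokens) rather than its number of distinct users, which the text leaves open.

```python
from collections import Counter


def clean_articles(articles, l_user_max, l_user_min, l_artist_min):
    """Apply the user-max, user-min and artist-min filters in that order."""
    # Number of distinct articles (singers) each word (user) appears in.
    doc_freq = Counter(u for words in articles.values() for u in set(words))
    # Keep words that are neither too widespread nor too rare.
    keep = {u for u, n in doc_freq.items() if l_user_min <= n <= l_user_max}
    cleaned = {s: [u for u in words if u in keep] for s, words in articles.items()}
    # Drop articles left with fewer than l_artist_min words (token count,
    # an assumption of this sketch).
    return {s: words for s, words in cleaned.items() if len(words) >= l_artist_min}


raw = {
    "a": ["u1", "u2", "u3"],
    "b": ["u1", "u2"],
    "c": ["u1", "u4"],
    "d": ["u1"],
}
cleaned = clean_articles(raw, l_user_max=3, l_user_min=2, l_artist_min=1)
print(cleaned)
```

Here u1 appears in four articles and is removed as too broad, u3 and u4 appear in one article each and are removed as too narrow, and articles `c` and `d` are dropped because no words remain.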

On the basis of the above technical solution, in step S4, weights are assigned to the user behavior data to further highlight its features;

an optimized TF-IDF formula is used as the weight of each word; the terms of that formula are:

the weight of the word u in article a,

the number of occurrences in article a of the word that occurs most often in a,

the number of articles in D that contain the word appearing in the most articles;

On the basis of the above technical solution, in step S5, the number of classification labels of the model is set to the number of manual label categories + n, where n is the number of default labels and can be chosen empirically between 5 and 10 according to the actual effect;

for an article that has some manual labels, the article is labeled with those manual labels + the n default labels;

for an article without any manual labels, the article is labeled with all classification labels, that is, all manual labels and the n default labels.
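The labeling rule of step S5 can be illustrated with a short sketch. The function name `candidate_labels` and the `default_i` naming of the n default labels are hypothetical choices made for the example.

```python
def candidate_labels(manual_labels, all_manual_labels, n_default):
    """Return the labels an article may be assigned during training."""
    defaults = [f"default_{i}" for i in range(n_default)]
    if manual_labels:
        # Article with manual labels: those labels plus the n default labels.
        return sorted(manual_labels) + defaults
    # Article without manual labels: every manual label plus the n defaults.
    return sorted(all_manual_labels) + defaults


all_labels = {"pop", "rock", "classical"}
labeled = candidate_labels({"rock"}, all_labels, n_default=2)
unlabeled = candidate_labels(set(), all_labels, n_default=2)
print(labeled)
print(unlabeled)
```

The total number of model labels in this toy setting is 3 manual labels + 2 default labels = 5.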

On the basis of the above technical solution, in step S6, the Labeled LDA is trained with the Gibbs sampling algorithm; after the optimized TF-IDF weight is added, the terms of the sampling probability formula are:

the probability that the current word u_i belongs to classification label k,

the probability that label k occurs in document a, excluding the current word,

the probability that the word u_i corresponds to label k, excluding the current word, where α_k and β_i are hyperparameters of the model,

the improved TF-IDF weight of the current word in the current document;

taking into account the correlation between the labels within a tag group, the sampling probability formula is further optimized, where:

the result is the probability, at sampling time, that the current word u_i belongs to classification label k; T is the set of all tag groups; |T| is the size of a tag group; I(·) is the indicator function, whose value is 1 if its argument is true and 0 otherwise; and λ is a hyperparameter greater than 0 and less than 1.

On the basis of the above technical solution, when training starts, each word of each article is first randomly assigned a classification label k,

if the article has manual labels, each word in the article is initialized within the range of those manual labels; if the article has no manual labels, its words are initialized over the range of all labels;

each word of each article is then traversed, and each word is reassigned a classification label,

the sampling probability for the reassignment is calculated from the other words of the current article (excluding the current word), all words of the other articles, and the sampling probability formula; if the article has manual labels, each of its words is reassigned only within the range of those manual labels,

after several rounds of training, training stops; the number of words under each classification label in each article is counted and then normalized, and the result serves as the classification of the article, that is, the classification of the singer.
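The training loop described above can be sketched as follows. Since the sampling formulas themselves are not shown here, this sketch substitutes the textbook collapsed Gibbs update for LDA, with every count replaced by a weighted count, and it omits the tag-group term governed by λ; it is therefore an approximation under stated assumptions, not the method's exact formula. The names `train_labeled_lda`, `allowed` and `weights`, the scalar hyperparameters `alpha` and `beta`, and their default values are all hypothetical.

```python
import random
from collections import defaultdict


def train_labeled_lda(docs, allowed, weights, alpha=0.1, beta=0.01, rounds=50, seed=0):
    """docs: {doc: [word, ...]}; allowed: {doc: [label, ...]};
    weights: {(doc, word): float}. Returns {doc: {label: share}}."""
    rng = random.Random(seed)
    vocab_size = len({u for words in docs.values() for u in words})
    doc_label = defaultdict(float)    # weighted count of label k in doc a
    label_word = defaultdict(float)   # weighted count of word u under label k
    label_total = defaultdict(float)  # weighted count of all words under label k
    assign = {}

    def update(a, u, k, sign):
        w = sign * weights[(a, u)]
        doc_label[(a, k)] += w
        label_word[(k, u)] += w
        label_total[k] += w

    # Random initialization, restricted to each article's allowed labels.
    for a, words in docs.items():
        for i, u in enumerate(words):
            k = rng.choice(allowed[a])
            assign[(a, i)] = k
            update(a, u, k, +1)

    for _ in range(rounds):
        for a, words in docs.items():
            for i, u in enumerate(words):
                update(a, u, assign[(a, i)], -1)  # exclude the current word
                scores = [
                    (doc_label[(a, k)] + alpha)
                    * (label_word[(k, u)] + beta)
                    / (label_total[k] + vocab_size * beta)
                    for k in allowed[a]
                ]
                k = rng.choices(allowed[a], weights=scores)[0]
                assign[(a, i)] = k
                update(a, u, k, +1)

    # Count words per label in each article and normalize.
    result = {}
    for a, words in docs.items():
        counts = defaultdict(int)
        for i in range(len(words)):
            counts[assign[(a, i)]] += 1
        n = len(words) or 1
        result[a] = {k: counts[k] / n for k in allowed[a]}
    return result


docs = {"singer_a": ["u1", "u1", "u2"], "singer_b": ["u2", "u3"], "singer_c": ["u1", "u3"]}
allowed = {"singer_a": ["rock"], "singer_b": ["pop"], "singer_c": ["rock", "pop"]}
weights = {(a, u): 1.0 for a, words in docs.items() for u in words}
result = train_labeled_lda(docs, allowed, weights)
print(result)
```

An article with a single allowed label always ends with its full share on that label, while an article with several allowed labels receives a normalized distribution over them that depends on how its users are distributed across the other articles.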

The singer classification method based on the Labeled LDA model has the following beneficial effects:

1. Manual labels and a machine learning model are combined for classification, which reduces labor cost, ensures interpretability of the labels, improves label coverage, and improves prediction accuracy. Because the classification model is based on an existing manual label system, the resulting classification is highly interpretable; the total number of labels exceeds one hundred, so the categories are sufficiently fine-grained. The classification result is closely tied to users' actual behavior, so it fits users' real preferences well and is very helpful for designing subsequent recommendation algorithms.

2. The accuracy of the model is ensured by building on users' song-listening behavior, which is the largest in volume and widest in coverage of all data on an online music platform. Because listening behavior is used as training data, the amount of data that can be collected is very large and enough data can be accumulated in a short time. For most models, the shorter the accumulation time, the more timely the model, and the more training data, the higher the accuracy. The data set used by the invention covers more than 90 percent of users, so the preferences of every user group can be taken into account.

3. The Labeled LDA classification method is optimized and improved for the characteristics of the data, and achieves good results in a production environment. The invention optimizes the Labeled LDA formula, gives different weights to the behaviors of different users, and takes the correlation between labels into account. The model can be retrained periodically: since users' overall tastes change with social trends and public perception, and such changes are quickly reflected in user behavior, the results produced by the model change accordingly.

The singer classification method based on the Labeled LDA model has the following characteristics:

1. a machine learning model and manual labeling are used together, which addresses both the poor interpretability of machine-learned classifications and the low coverage of manual labeling;

2. user behavior is used as the basic data in a novel way: it is converted into a text-like form, and Labeled LDA is then used for semi-supervised classification learning, which avoids problems of text data such as low accuracy and poor representativeness;

3. the training target labels of the Labeled LDA are optimized to achieve a better classification effect and to ease use of the model in a production environment;

4. based on the characteristics of the service data, the empirical TF-IDF formula is optimized and incorporated into the classification model, so that key user behaviors receive appropriate attention while individual users' behavior is prevented from interfering excessively with the model result;

5. manually merged tag groups are used to optimize the Labeled LDA sampling algorithm, so that the model's judgment of similar labels is closer to users' perception.

Drawings

The invention has the following drawings:

FIG. 1 is a flow chart of a singer classification method based on a Labeled LDA model according to a first embodiment of the present invention.

Fig. 2 is a schematic diagram of a behavior time window w.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

In the field of text topic classification, the Labeled LDA model offers a way to combine manual labels with a machine learning model, for example to divide news into categories by title or content, or songs into categories by description or comments. In recommendation for online music playing software, however, a text-based classification model is not suitable:

(1) the amount of text generated on a music platform is very limited and comes only from channels such as playlist descriptions and user comments, which limits the accuracy of the model;

(2) a large part of this text does not describe music at all, and what does describe music is vague, so the text can support only coarse-grained division, not fine-grained classification;

(3) this text is usually produced by a small fraction of core users of the music platform, so, much like expert opinion, it does not necessarily represent the preferences of the whole user population.

As shown in fig. 1, the singer classification method based on the Labeled LDA model according to the present invention includes the following steps:

singer language classification dimension; this dimension includes, but is not limited to, the following manual labels: Chinese, Cantonese, European and American; singer genre classification dimension; this dimension includes, but is not limited to, the following manual labels: pop, classical, rock, ballad; singer style classification dimension; this dimension includes, but is not limited to, the following manual labels: Chinese style, dance, ancient style, grassland style; singer primary instrument classification dimension; this dimension includes, but is not limited to, the following manual labels: saxophone, violin, piano; era classification dimension; this dimension includes, but is not limited to, the following manual labels: early, middle, late.

On the basis of the above technical solution, in step S1, the preprocessing includes: evaluating the accuracy of the manual labels in each dimension, and forming a label system from the manual labels of those dimensions that reach or exceed a preset threshold;

dimensions other than those listed above are added to the label system once their accuracy has been evaluated and reaches or exceeds the preset threshold.

On the basis of the above technical solution, in step S1, the preprocessing further includes: optimizing the granularity of the manual labels by dimension, specifically:

evaluating the granularity of each manual label,

manual labels that cover a narrow range and refer specifically to one category of music, such as ancient style and grassland style, are not optimized;

for manual labels that cover a wide range, the label is further subdivided by combining dimensions. For example, pop can be further divided into Chinese pop, Cantonese pop, European and American pop, and so on; Chinese pop can in turn be subdivided into early Chinese pop, middle Chinese pop, late Chinese pop, and so on;

by default, subdivision does not go deeper than three levels. In the example above, the subdivision into Chinese pop, Cantonese pop, and European and American pop is the first level, and the subdivision into early, middle, and late Chinese pop is the second level. What counts as a narrow or a wide range for a manual label can be set from empirical values.
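Subdividing a broad label by combining dimensions can be sketched as below. The function name `subdivide`, the way a label name is composed from dimension values, and the shortened value lists are illustrative assumptions.

```python
def subdivide(base_labels, dimensions, max_levels=3):
    """Return one list of labels per subdivision level."""
    if len(dimensions) > max_levels:
        raise ValueError("subdivision deeper than max_levels")
    levels = []
    current = list(base_labels)
    for values in dimensions:
        # Combine every label of the previous level with every value
        # of the next dimension.
        current = [f"{value} {label}" for label in current for value in values]
        levels.append(current)
    return levels


levels = subdivide(["pop"], [["Chinese", "Cantonese"], ["early", "middle", "late"]])
print(levels[0])
print(levels[1])
```

The first level combines the genre with the language dimension, and the second level combines each first-level label with the era dimension, giving six labels such as "early Chinese pop".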

For example, for finer-grained manual labels such as pop ballad, campus ballad, and traditional ballad, the boundaries are not very clear, and the way the user population divides these labels does not necessarily agree with the way experts divide songs. These similar manual labels are therefore merged into one tag group and considered together in the classification calculation. Which manual labels are similar enough to be grouped can be set from empirical values.
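A tag group can be represented as a simple mapping from group to member labels, as in this minimal sketch. The group names and the helper names `build_label_to_group` and `same_group` are hypothetical; the actual grouping is set from empirical values.

```python
def build_label_to_group(tag_groups):
    """Invert {group: [label, ...]} into {label: group}."""
    label_to_group = {}
    for group, labels in tag_groups.items():
        for label in labels:
            label_to_group[label] = group
    return label_to_group


def same_group(label_to_group, a, b):
    """True if both labels belong to the same known tag group."""
    group_a = label_to_group.get(a)
    return group_a is not None and group_a == label_to_group.get(b)


tag_groups = {
    "ballad_group": ["pop ballad", "campus ballad", "traditional ballad"],
    "rock_group": ["rock"],
}
label_to_group = build_label_to_group(tag_groups)
print(same_group(label_to_group, "pop ballad", "campus ballad"))
print(same_group(label_to_group, "pop ballad", "rock"))
```

A lookup of this kind is what an indicator function over tag groups would need when the sampler takes label correlation into account.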

the user is a user who plays the singer's song in its entirety,

namely: the article corresponds to a singer, the content of the article is composed of words, the words correspond to a user, and the user meets the condition that the singer songs are completely played within the action time window w.

It should be noted that the articles in the training data are generated from user behavior (specifically, the playing of songs) and are not articles in the ordinary sense. Because music playback currently has no tiered pricing barrier, all users can enjoy almost all music on the platform, so the songs played by an individual user are highly random. The user behavior data therefore needs to be filtered and cleaned in subsequent processing.

On the basis of the above technical solution, the length of the behavior time window w can be adjusted within a certain range.

The behavior time window w is an observation window, as shown in Fig. 2, where:

the shorter w is, the less stable the training result and the greater the randomness of the classification result, but the stronger the adaptability, so that the classification result can follow changes in users' overall interests.

As an alternative embodiment, the specific value of the behavior time window w is chosen by balancing the rate at which the actual system generates user data against the adaptability requirement.

As shown in Fig. 2, if system traffic is high and many user behaviors are generated per unit time, a relatively short behavior time window can be selected, and the resulting classification better reflects the latest trends; if high interpretability is required, for example because specific categories need to be shown to users, a relatively long behavior time window should be selected to generate a more interpretable classification result.

On the basis of the above technical solution, collecting the user behavior data means: obtaining, within the behavior time window w, the behavior of every user who has played a song by a singer in its entirety.

It should be noted that, as an alternative embodiment, each threshold should initially be set, from experience, to a relatively tolerant value. That is, L_user-max can first be set to a larger value (for example, 10% of the total number of singers), and L_user-min and L_artist-min can first be set to a small value greater than 1 (for example, 2 to 5); the thresholds are then adjusted according to the model's performance. The more tolerant the thresholds, the more singers the classification model can cover, but the lower the accuracy; the less tolerant the thresholds, the fewer singers the model can cover, but the higher the accuracy. When the model is trained, if accuracy meets expectations, one may try to increase L_user-max further or reduce L_user-min and L_artist-min; if accuracy falls short of expectations, L_user-max needs to be reduced and L_user-min and L_artist-min increased.

On the basis of the above technical solution, in step S4, weights are assigned to the user behavior data to further highlight its features; for a document topic classification model, word weight assignment is an indispensable step in ensuring the model's effectiveness;

intuitively, of two words within one article, the word that occurs more often is more representative of the article's topic than the word that occurs less often; of two words across all articles, the word that occurs in fewer articles is more representative of the topic of a single article than the word that occurs in more;

for this reason, it is common practice in text processing to use TF-IDF as the weight of a word, computed with an empirical formula

in which W_TF-IDF(u, a) is the weight of the word u in article a, N_a,u is the number of times the word u occurs in article a, N_a is the total number of words in article a, D is the total number of articles, and D_u is the number of articles among the D that contain the word u;
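A common textbook form of TF-IDF that matches these definitions is given below for orientation. The exact form of the empirical formula intended here, including the logarithm base and any smoothing, is an assumption.

\[
W_{\text{TF-IDF}}(u, a) = \frac{N_{a,u}}{N_a} \cdot \log\frac{D}{D_u}
\]

The first factor is the term frequency (TF) of the word u in article a, and the second is the inverse document frequency (IDF) of u over all D articles.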

however, for the model described in the present invention, this empirical formula does not apply, since:

1. For the first (TF) term of the formula: users' listening behavior shows a strong Matthew effect, so the generated articles are severely imbalanced in length; some articles contain hundreds of thousands of words, while others contain only a few dozen.

For an article with few words, a single word that occurs too many times seriously distorts the judgment of that article's topic, so the ratio needs to be log-scaled; for an article with a very large number of words, however, the weight differences between words become very small after log scaling.

2. For the second (IDF) term of the formula: apart from the rare cases that have already been removed, most users cannot listen to many singers within a given period and concentrate on a few dozen singers, whereas the total number of singers is on the order of ten thousand. After taking the logarithm, the second-term scores of all users are therefore almost the same.

To address these two problems, the invention uses an optimized TF-IDF formula as the weight of each word. The terms of the optimized formula are:

the weight of the word u in article a,

the number of articles in D that contain the word appearing in the most articles.

On the basis of the above technical solution, for a general Labeled LDA model, all manual labels can be collected and numbered, and every article in the existing training data is then labeled:

for an article that has some manual labels, the article is labeled with those manual labels;

for an article without any manual labels, the article is labeled with all manual labels.

In addition, in some cases Labeled LDA uses the number of manual label categories + 1 as the model training target. The +1 provides a default category for all articles: during training the model can decide that an article which fits none of the existing categories belongs to the default category.

Further optimization is required for the model described in the present invention.

In the invention, singers do not have clear genre boundaries, fuzzy zones exist between genres, and users' perception of a singer's genre does not necessarily match the academic division of genres. A niche category that exists in users' perception may therefore be missing from the manual labels, and assigning such a singer to any existing genre would cause considerable trouble for subsequent recommendation.

To avoid this situation as far as possible, the following optimization scheme, verified by experimental tests, is adopted in step S5 and achieves a better classification effect:

in step S5, the number of model classification labels is set to the number of manual label categories + n, where n is the number of default labels; a value of 5 to 10 can be chosen empirically according to the actual effect;

wherein

the probability that the current word u_i belongs to the class label k,

the improved TF-IDF weight of the current word for the current document;

wherein

Here, λ needs to be adjusted according to the actual effect: when the value of λ is too small, the labels in a label group tend to behave as completely independent labels, and the correlation among the labels cannot be reflected; when the value of λ is too large, the sampling probability is biased excessively toward label groups that contain many labels.

One specific example is as follows.

As shown in FIG. 1, the present invention provides a singer classification method based on the Labeled LDA model, comprising the following steps:

S1, collecting the label data manually assigned to singers and determining the training reference.

The manual singer labels adopted herein come partly from the labels recorded when the singer was entered into the database, and partly from genre labels specially assigned to singers by relevant experts. In total, approximately 3000 singers are manually tagged, using 153 tags.

Table 1 example of manual tagging of labels by singers:

The tags are sorted, and finely divided tags that are close to one another are merged, establishing 26 tag groups.

Table 2 singer tag grouping example:

S2, establishing a singer classification model based on user behavior and collecting data.

The singer classification problem is converted into a document topic classification problem. Each singer is taken as an article, and a behavior time window of 30 days is selected. Each user who played a song of the singer in full within the behavior time window is taken as a word of the article, and the user words are combined to form the article corresponding to the singer, so that articles and singers are in one-to-one correspondence.
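The conversion above can be sketched as follows. This is a minimal sketch under stated assumptions: the event tuple layout `(user_id, singer_id, played_at, completed)` and the function name `build_articles` are hypothetical, and each complete play is assumed to contribute one occurrence of the user word.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def build_articles(play_events, window_end, window_days=30):
    """Group complete plays inside the time window into one article per singer.

    play_events: iterable of (user_id, singer_id, played_at, completed),
    a hypothetical log layout assumed for illustration.
    """
    window_start = window_end - timedelta(days=window_days)
    articles = defaultdict(list)
    for user_id, singer_id, played_at, completed in play_events:
        # Only complete plays inside the behavior time window become words.
        if completed and window_start <= played_at <= window_end:
            articles[singer_id].append(user_id)
    return dict(articles)
```

A user who played the same singer several times appears several times in that singer's article, which is what later allows each word to be counted per article.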

This process generates a total of 20000+ articles containing 1000000+ words.

Table 3 example of a singer's article generated based on a user listening to a song:

S3, cleaning the behavior data and filtering out the data that are not beneficial to model training.

Words that appear in more than 1000 articles or in fewer than 5 articles are filtered out. Articles containing fewer than 5 words are then filtered out. This yields 15000+ articles and 500000+ words.

The articles and words contained in the cleaned data set are then encoded to facilitate training. The original article ids and word ids are discrete values; after encoding, all ids are converted into consecutive numerical sequences. The occurrences of each word in each article are also counted.
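The cleaning and encoding just described can be sketched as follows. This is a minimal sketch: the function name `clean_and_encode` and the input layout are hypothetical, "fewer than 5 words" is assumed to mean fewer than 5 distinct words, and ids are assumed to be renumbered in sorted order.

```python
from collections import Counter

def clean_and_encode(articles, max_df=1000, min_df=5, min_words=5):
    """articles: dict mapping article (singer) id -> list of word (user) ids."""
    # Document frequency: number of articles in which each word appears.
    df = Counter()
    for words in articles.values():
        df.update(set(words))
    keep = {w for w, n in df.items() if min_df <= n <= max_df}

    # Drop filtered words, then drop articles left with too few distinct words.
    cleaned = {}
    for a, words in articles.items():
        kept = [w for w in words if w in keep]
        if len(set(kept)) >= min_words:
            cleaned[a] = kept

    # Map the discrete original ids onto consecutive integers.
    article_index = {a: i for i, a in enumerate(sorted(cleaned))}
    vocab = sorted({w for ws in cleaned.values() for w in ws})
    word_index = {w: i for i, w in enumerate(vocab)}

    # Count each word in each article, keyed by the encoded ids.
    counts = {
        article_index[a]: Counter(word_index[w] for w in ws)
        for a, ws in cleaned.items()
    }
    return counts, article_index, word_index
```

The thresholds default to the values of this example (1000, 5 and 5) and are exposed as parameters so that other values can be tried.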

Table 4 example of a singer's article encoded after cleaning:

S4, carrying out weight distribution on the behavior data, namely evaluating the importance of the words in each article by using TF-IDF.

The weight of each word in each article is calculated using the following formula:

where N_{a,u} represents the number of times the word u occurs in article a,

represents the number of occurrences of the most frequent word in article a, D_u represents the number of articles in D containing the word u,

S5, labeling all articles (singers) in the existing behavior data by using the manual label data.

There are 153 manual tags plus 5 default tags, 158 tags in total. For a singer with manual tags, the corresponding article is labeled with those tags, and the default tags are added. For a singer without manual tags, the corresponding article is labeled with all 158 tags.
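The labeling rule of this step can be sketched as follows. This is a minimal sketch: the zero-based numbering, with manual tags 0 to 152 followed by default tags 153 to 157, and the function name `allowed_labels` are assumptions made for illustration.

```python
NUM_MANUAL = 153   # manual tags in this example
NUM_DEFAULT = 5    # default tags in this example

# Assumed numbering: manual tags first, default tags last.
ALL_LABELS = list(range(NUM_MANUAL + NUM_DEFAULT))
DEFAULT_LABELS = ALL_LABELS[NUM_MANUAL:]

def allowed_labels(manual_label_ids):
    """Return the selectable label range for one article (singer)."""
    if manual_label_ids:
        # Singer with manual tags: those tags plus the default tags.
        return sorted(set(manual_label_ids)) + DEFAULT_LABELS
    # Singer without manual tags: all 158 tags.
    return list(ALL_LABELS)
```

For example, `allowed_labels([12, 3])` yields the two manual tags followed by the five default tags, while `allowed_labels([])` yields all 158 tags.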

S6, training a Labeled LDA model.

a. LDA

LDA (Latent Dirichlet Allocation) is a document topic generation model comprising three elements: articles, topics, and words. In principle it is a three-layer Bayesian probability model. It assumes that the topic probability distribution θ of each document and the word probability distribution φ of each topic are mutually independent and are sampled from Dirichlet distributions with parameters α and β respectively. The probability distribution formula of the LDA model is as follows:

where u represents the word, k represents the topic assigned to the word, P(u, k, θ, φ | α, β) is the generation probability when the topic of the current word u is assigned as k, α and β are hyperparameters, and θ and φ can be estimated from α, β and the current training data.

b. Gibbs sampling

Based on the assumption of LDA, it can be assumed that the topics of all other words except the current word are determined, the probability distribution of the topic of the current word is generated according to the information provided by the other words, and the process of randomly allocating topics to the current word according to the probability distribution is called Gibbs sampling. The probability distribution calculation formula used in Gibbs sampling is as follows:

wherein

the probability, at sampling time, that the current word u_i belongs to the class label k,

the probability, computed with the current word excluded, that the word u_i corresponds to label k; α_k and β_i are hyperparameters.

θ and φ can be estimated according to the following equations:

c. Labeled LDA

LDA is similar in nature to clustering and is an unsupervised machine learning algorithm. After the clustering is completed, the central point of each class is uncertain, and the meaning represented by each class is also uncertain, so the generated results are unstable and lack interpretability.

Labeled LDA specifies, for some of the articles, the range of topics that the words in them may select during Gibbs sampling, thus converting unsupervised LDA into a semi-supervised learning algorithm. The central points of the resulting clusters fluctuate only within a certain range, so the results are stable and also interpretable.

d. Optimized Gibbs sampling

Adding the weight obtained by calculation of the optimized TF-IDF into a Gibbs sampling process, wherein an improved Gibbs sampling probability formula is as follows:

wherein

the probability that the current word u_i belongs to class label k (without regard to label correlation),

the weight of the current word for the current document.

Then, according to the correlation among the labels within each label group, the sampling probability formula is further optimized as follows:

wherein

the probability, at sampling time, that the current word u_i belongs to classification label k; T is the set of all label groups, |T| is the label group size, I(·) is the indicator function, whose value is 1 if its argument is true and 0 otherwise, and λ is a hyperparameter greater than 0 and less than 1.

Training is started:

(1) First, α is set to 0.1, β to 0.1, and λ to 0.2, based on experience.

(2) For each word of each article, a label is randomly selected from the selectable label range of the article and assigned to the word.

(3) The Gibbs sampling probability distribution is calculated for each word of each article in turn. The probabilities are calculated only over the selectable label range of the article, and one label is then drawn at random according to these probabilities and assigned to the word. One traversal of all the words of all the articles counts as one round of training.

(4) Step (3) is repeated a plurality of times.

(5) Training is finished. The number of words under each classification label in each article is counted and used as the probability that the article belongs to that classification, namely the score of the singer for that classification.
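Steps (1) to (5) can be sketched as follows. This is a simplified sketch, not the optimized sampler of the invention: it uses the standard collapsed Gibbs conditional restricted to each article's selectable labels, and it omits the TF-IDF weight and the label-group term with λ, whose formulas are not reproduced in the text above. The function name, argument layout and the normalization of word counts into shares are assumptions.

```python
import random

def train_labeled_lda(docs, allowed, num_labels, vocab_size,
                      alpha=0.1, beta=0.1, rounds=50, seed=0):
    """docs: list of word-id lists; allowed[d]: selectable label ids of article d."""
    rng = random.Random(seed)
    doc_label = [[0] * num_labels for _ in docs]                 # words per (article, label)
    label_word = [[0] * vocab_size for _ in range(num_labels)]   # words per (label, word)
    label_total = [0] * num_labels                               # words per label
    assign = []

    # Step (2): random initial label within the article's selectable range.
    for d, words in enumerate(docs):
        row = []
        for w in words:
            k = rng.choice(allowed[d])
            row.append(k)
            doc_label[d][k] += 1
            label_word[k][w] += 1
            label_total[k] += 1
        assign.append(row)

    # Steps (3)-(4): one full traversal of all words is one round of training.
    for _ in range(rounds):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                k = assign[d][i]
                # Exclude the current word from the counts.
                doc_label[d][k] -= 1
                label_word[k][w] -= 1
                label_total[k] -= 1
                # Standard collapsed Gibbs conditional over allowed labels only.
                probs = [
                    (doc_label[d][c] + alpha)
                    * (label_word[c][w] + beta)
                    / (label_total[c] + vocab_size * beta)
                    for c in allowed[d]
                ]
                k = rng.choices(allowed[d], weights=probs)[0]
                assign[d][i] = k
                doc_label[d][k] += 1
                label_word[k][w] += 1
                label_total[k] += 1

    # Step (5): share of each article's words under each label, used as the score.
    return [[n / max(len(words), 1) for n in doc_label[d]]
            for d, words in enumerate(docs)]
```

Because sampling is restricted to `allowed[d]`, an article with manual labels can only ever score on its own labels, while an article allowed all labels is placed by the co-occurrence of its words with the labeled articles.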

Two ways can be adopted to judge how many rounds of training are needed to achieve a relatively ideal effect. First, the singer classifications can be spot-checked by sampling: on the one hand, checking whether the classification scores calculated for singers with manual labels are reasonable, and on the other hand, checking whether the classifications of singers without manual labels are reasonable. Second, the singers with manual labels are randomly divided into a training set and a test set at a ratio of 7:3; the test set takes part in training as singers without manual labels, and after training the difference between the manual labels of the test set and the labels calculated by the model is checked.
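The second evaluation method can be sketched as follows. This is a minimal sketch: the function names are hypothetical, and the hit rate of the highest-scoring label is only one assumed way of measuring the difference between the manual labels and the model's labels; the text does not prescribe a specific metric.

```python
import random

def split_labeled_singers(labeled_ids, train_ratio=0.7, seed=0):
    """Randomly split manually labeled singers into training and test sets (7:3)."""
    ids = sorted(labeled_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

def top_label_hit_rate(test_ids, manual_labels, scores):
    """Share of test singers whose highest-scoring label is one of their manual labels.

    manual_labels: dict singer id -> set of manual label ids (held out in training).
    scores: dict singer id -> list of per-label scores produced by the model.
    """
    hits = 0
    for s in test_ids:
        best = max(range(len(scores[s])), key=scores[s].__getitem__)
        hits += best in manual_labels[s]
    return hits / len(test_ids)
```

During training the test singers would be given the full selectable label range, exactly as singers without manual labels are, and their held-out manual labels are consulted only when the hit rate is computed.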

Matters not described in detail in this specification are within the common knowledge of those skilled in the art. The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiment; equivalent modifications or changes made by those skilled in the art according to the present disclosure should be included in the scope of the present invention as set forth in the appended claims.

Claims

1. A singer classification method based on a Labeled LDA model is characterized by comprising the following steps:

2. The singer classifying method based on Labeled LDA model according to claim 1, wherein: in step S1, the collecting of the artificial tags of singers divides the artificial tags into different artificial tags according to dimensions, where the dimensions include any one of the following:

the fine granularity of the manual label is evaluated,

the dimensions are further subdivided by default not to exceed three levels.

3. The singer classifying method based on Labeled LDA model according to claim 2, wherein: in step S1, the preprocessing further includes: and generating a tag group, and merging the similar artificial tags into the same tag group.

4. The singer classifying method based on Labeled LDA model according to claim 1, wherein: in step S2, the establishing of the singer classification model based on the user behavior converts the singer classification problem into a document topic classification problem, and then the applying of the document topic classification model classifies the singer, which specifically includes:

the user is a user who plays the singer's song in its entirety,

5. The singer classifying method based on Labeled LDA model according to claim 4, wherein: the length of the behavior time window w can be adjusted within a certain range, and:

the value length of w follows the following principle:

6. The singer classifying method based on Labeled LDA model according to claim 1, wherein: in step S3, the cleaning of the user behavior data specifically includes:

for users who have listening behavior toward only a few singers, i.e. words contained in only a very limited number of articles, a threshold L_user-min is set; when a word is contained in fewer than L_user-min articles, it is of little help to model training (in the extreme case, a user who listens to the songs of only one singer provides no help for singer classification), and the word is removed from all the articles;

7. The singer classifying method based on Labeled LDA model according to claim 1, wherein: in step S4, the features are further highlighted by performing weight assignment on the user behavior data;

wherein

For the weight of the word u in the article a,

8. The singer classifying method based on Labeled LDA model according to claim 1, wherein: in step S5, the number of model classification labels is set to the number of manual label categories + n, where n is the number of default labels, and a value of 5 to 10 can be chosen empirically according to the actual effect;

9. The singer classifying method based on Labeled LDA model according to claim 1, wherein: in step S6, the Labeled LDA is trained using the Gibbs sampling algorithm, and after the optimized TF-IDF weight is added, the probability formula of the sampling model is:

wherein

the probability that the current word u_i belongs to the class label k,

the improved TF-IDF weight of the current word for the current document;

wherein

10. The singer classifying method based on Labeled LDA model according to claim 9, wherein: at the beginning of training, each word of each article is first randomly assigned a class label k,

then each word of each article is traversed and reassigned a classification label; during assignment, the sampling probability is calculated by the sampling probability formula from all words of the current article other than the current word and all words of the other articles; if the article has manual labels, each word in the article is initialized within the range specified by those manual labels,