CN111611432B - Singer classification method based on Labeled LDA model


Info

Publication number
CN111611432B
Authority
CN
China
Prior art keywords
singer
user
classification
label
word
Prior art date
Legal status
Active
Application number
CN202010477122.5A
Other languages
Chinese (zh)
Other versions
CN111611432A (en)
Inventor
籍汉超
王丹
张力
齐保峰
Current Assignee
Beijing Kuwo Technology Co Ltd
Original Assignee
Beijing Kuwo Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuwo Technology Co Ltd filed Critical Beijing Kuwo Technology Co Ltd
Priority to CN202010477122.5A
Publication of CN111611432A
Application granted
Publication of CN111611432B

Classifications

    • G06F 16/65 Information retrieval of audio data: Clustering; Classification
    • G06F 16/686 Information retrieval of audio data: Retrieval characterised by using metadata generated manually, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G06N 20/00 Computing arrangements based on specific computational models: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a singer classification method based on a Labeled LDA model, which comprises the following steps: S1, collecting and preprocessing singers' manual labels; S2, establishing a singer classification model based on user behavior and collecting user behavior data; S3, cleaning the user behavior data and filtering out data that is unfavorable for model training; S4, assigning, in the user behavior data, a weight to each user for each singer; S5, combining the user behavior data with the manual label data to generate training data; S6, based on the training data and referring to the label combination relations, performing Labeled LDA model training based on optimized Gibbs sampling. Because users' song-playing behavior is used as training data, user coverage is high and the preference characteristics of every user group are taken into account; changes in user behavior reflect changes in social hot spots and public perception, and the model can be retrained periodically to follow these changes, so adaptability is strong, accuracy is high, label coverage is improved, and the classification granularity is sufficiently fine.

Description

Singer classification method based on Labeled LDA model
Technical Field
The invention relates to the technical field of internet personalized services, in particular to a singer classification method based on a Labeled LDA model.
Background
Over the last decade, internet music has developed rapidly and has gradually eroded the traditional music market. Internet music providers such as Tencent Music, NetEase Cloud Music and Xiami Music have entered countless households. In the traditional music market, apart from limited promotion channels such as television, film and radio networks, users generally discovered new music (songs) in record shops.
Internet music reaches users through music apps, where the choice of music available to a user is unprecedented; however, a user cannot know every song and every singer, so an effective information filtering mechanism is needed to help users screen songs. Current internet user bases are generally on the order of millions to billions, and each user's listening taste differs more or less from the others', so having operators curate music for each user is clearly impractical. In short, personalized music recommendation is very important to the user experience in the internet music era.
In internet industries such as advertising and e-commerce, an item naturally carries category attribute tags that identify the categories it belongs to, for example: a pair of Li Ning running shoes is classified as "Clothing > Shoes > Sports shoes > Running shoes"; an advertisement for a high-end watch belongs to the category "Luxury goods > Watches > Mechanical watches". The classification of these items is relatively explicit and objective, and many current recommendation algorithms are based on such objective category attribute tags.
Internet music also belongs to the internet industry, but unlike other internet industries, the classification of music (also called the music genre) is often subjective, fuzzy and abstract. The most basic attribute of a piece of music (a song) is its singer; if singers can be classified with relatively accurate judgment, the music genre can be inferred to a certain extent, which supports the design of recommendation algorithms.
In the prior art, singer labels can be obtained in two general ways: one is to extract machine labels as singer labels through machine learning methods; the other is to obtain manual labels as singer labels through expert judgment.
Collecting indirect data, building machine learning models such as clustering and topic extraction, and then extracting machine labels as singer labels has the following limitations:
(1) The resulting categories are abstract and therefore have little interpretability. After the model assigns labels to a singer, the genre attributes of the singer or the music cannot be evaluated from an intuitive perspective, so the suitability for certain application scenarios is poor.
(2) The classification process is not controllable, so the number of categories is uncertain, as is the number of items contained in each category. This is very unfriendly to many recommendation algorithms and may affect both their implementation and their efficiency.
(3) Learning on the same data tends to produce inconsistent results. The result may depend on differences in initial values or on the order in which the data is fed in. Once several candidate results have been obtained, it is almost impossible to judge which is better, and manual screening is likely to be required.
Obtaining manual labels as singer labels through expert labeling has the following limitations:
(1) Because of cost, experts can only label the small portion of singers who are relatively famous and representative. For the majority of less popular or obscure singers, the coverage of manual labels is very limited.
(2) Many singers carry several manual labels at the same time, and experts cannot assign a weight to each of a singer's labels. In practice, however, a singer's work generally has a focus, and the relative importance of different labels should be taken into account.
(3) In many cases, experts and ordinary users represent two groups of people whose divisions of music genres necessarily differ. The labels experts assign to singers therefore do not necessarily match users' perception.
(4) The labels that experts assign to singers are typically fixed and do not change over time, yet from the users' perspective a singer's genre is likely to shift over time.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a singer classification method based on the Labeled LDA model that takes users' song-playing behavior as training data; it has high user coverage and takes the preference characteristics of each user group into account, changes in user behavior reflect social hot spots and public perception, and the model can be retrained periodically to follow these changes, so adaptability is strong, accuracy is high, label coverage is improved, and the classification categories are sufficiently fine-grained.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A singer classification method based on the Labeled LDA model, characterized by comprising the following steps:
s1, collecting manual labels of singers and preprocessing the manual labels to serve as training benchmarks;
s2, establishing a singer classification model based on user behaviors, and collecting user behavior data;
s3, cleaning user behavior data, and filtering data which is unfavorable for model training;
s4, distributing the weight of each singer corresponding to each user in the user behavior data;
s5, combining the user behavior data and the manual label data to generate training data;
s6, based on training data, referring to the label combination relation, performing Labeled LDA model training based on optimized Gibbs sampling.
Based on the above technical solution, in step S1, the manual labels of singers are collected, and the different manual labels are divided by dimension, where the dimensions include any of the following:
singer language classification dimension; this dimension includes, but is not limited to, the following manual labels: Chinese, European and American;
singer genre classification dimension; this dimension includes, but is not limited to, the following manual labels: pop, classical, rock, folk;
singer style classification dimension; this dimension includes, but is not limited to, the following manual labels: dance music, ancient style, grassland style;
singer's main instrument classification dimension; this dimension includes, but is not limited to, the following manual labels: saxophone, violin, piano;
era classification dimension; this dimension includes, but is not limited to, the following manual labels: early, middle, late;
in step S1, the preprocessing includes: evaluating the accuracy of the manual labels included in each dimension, and taking the manual labels of the dimensions that meet or exceed a preset threshold to form a label system;
after accuracy evaluation is carried out on dimensions other than the above, those that meet or exceed the preset threshold are added to the label system;
the preprocessing further comprises: performing fine-granularity optimization of the manual labels by dimension, which specifically includes:
evaluating the granularity of the manual labels,
further subdividing manual labels with an overly broad scope by combining dimensions,
where the subdivision by default goes no deeper than three levels.
Based on the above technical solution, in step S1, the preprocessing further includes: generating label groups, i.e. merging closely similar manual labels into the same label group.
Based on the above technical solution, in step S2, establishing the singer classification model based on user behavior converts the singer classification problem into a document topic classification problem, and a document topic classification model is then applied to classify singers; this specifically includes:
aggregating users, as words, into articles that represent singers, to serve as training data,
where a user is someone who has played the singer's songs in full,
each singer is treated as one article, with articles and singers in one-to-one correspondence,
a behavior time window w is set, and user behavior within the window w is interpreted as follows: if a singer's song is played in full within the behavior time window w, the user is treated as a word and that word is merged into the article corresponding to the singer,
that is: an article corresponds to a singer, the content of the article consists of words, and each word corresponds to a user who satisfies the condition of having played that singer's songs in full within the behavior time window w;
collecting user behavior data means: within the behavior time window w, collecting the behavior of all users who have played a given singer's songs in full,
where the user behavior specifically comprises the song that was played in full and the singer corresponding to that song. A minimal sketch of this aggregation is given below.
On the basis of the above technical scheme, the length of the behavior time window w can be adjusted within a certain range, and:
the longer w is, the more stable the training result, but the weaker the adaptability, and the result cannot follow shifts in users' overall interests;
the shorter w is, the less stable the training result and the more random the classification result, but the stronger the adaptability, and the result can follow shifts in users' overall interests;
the length of w follows these principles:
if the system traffic is high and a large amount of user behavior is generated per unit time, a relatively short behavior time window is selected, and the resulting classification better reflects current trends;
if high interpretability is required to present an explicit classification to users, a relatively long behavior time window is selected to produce a more interpretable classification result.
Based on the above technical solution, in step S3, cleaning the user behavior data specifically includes:
for users who have listening behavior for the songs of very many singers, i.e. words (users) contained in very many articles (singers), a threshold L_user-max is set; a word contained in more than L_user-max articles is considered to reflect an overly broad user interest, which does not help the model converge, and the word is removed from all articles;
for users who listen to only a very few singers, i.e. words contained in a very limited number of articles, a threshold L_user-min is set; when a word is contained in fewer than L_user-min articles it likewise contributes little to model training (in the extreme case, if a user listens to only one singer's songs, this provides no help in classifying that singer), and the word is removed from all articles;
after the above two cleaning passes, articles with too few words are cleaned: a threshold L_artist-min is set, and articles containing fewer than L_artist-min words cannot support an accurate singer classification from the available data, so these articles are removed.
On the basis of the above technical solution, in step S4, features are further emphasized by assigning weights to the user behavior data;
an optimized TF-IDF value is used as the weight of each word; it is computed for word u in article a from N_a,u, the number of times word u is contained in article a, N_a,max, the number of occurrences of the most frequent word in article a, D, the total number of articles, D_u, the number of articles in D containing word u, and D_max, the number of articles containing the word that appears in the most articles of D;
on the basis of the above technical scheme, in step S5, the number of model classification labels is set to the number of manual label classes plus n, where n is the number of default labels; n is chosen empirically in the range 5-10, depending on the actual effect;
for articles that carry some manual labels, they are labeled with those manual labels plus the n default labels;
for articles without any manual labels, all classification labels are applied, i.e. all manual labels plus the n default labels.
Based on the above technical solution, in step S6, the Labeled LDA model learns with a Gibbs sampling algorithm; after the optimized TF-IDF weights are added, the sampling probability of assigning classification label k to the current word u_i is built from: the probability that label k occurs in document a when the current word is excluded, the probability that word u_i corresponds to label k when the current word is excluded, the model hyperparameters α_k and β_i, and the optimized TF-IDF weight of the current word in the current document;
taking into account the correlation of the labels within each label group, the sampling probability is further optimized, where T is the set of all label groups, t is the size of a label group, II(·) is an indicator function whose value is 1 when its argument is true and 0 otherwise, and λ is a hyperparameter with 0 < λ < 1.
On the basis of the above technical solution, when training starts, each word of each article is first randomly assigned a classification label k;
if the article has manual labels, each word in the article is initialized within the range of those manual labels; if the article has no manual labels, it is initialized within the range of all labels;
each word of each article is then traversed and each word is reassigned a classification label,
where the sampling probability used for the assignment is computed, according to the sampling probability formula, from all words of the current article other than the current word and from all words of the other articles; if the article has manual labels, each word in the article is restricted to the range specified by the manual labels;
after several rounds of training, training is stopped, the number of words under each classification label in each article is counted, and the counts are normalized to serve as the classification of the article, i.e. the classification of the singer.
The singer classification method based on the Labeled LDA model has the following beneficial effects:
1. Manual labels and a machine learning model are combined for classification, which reduces manual labeling cost, preserves label interpretability, raises label coverage and improves prediction accuracy. Because the classification model is built on the existing manual label system, the generated classification result is well interpretable; the total number of labels reaches the hundreds, and the classification categories are sufficiently subdivided. The classification result is closely tied to users' concrete behavior, so it fits users' actual preferences well and aids the design of downstream recommendation algorithms.
2. The song-playing behavior of users, the largest and most widely covering data source on an online music platform, is used as the basis, which ensures model accuracy. Because users' song-playing behavior is used as training data, the collectable data volume is huge and sufficient data can be accumulated in a short time; for most models, the shorter the accumulation time, the better the timeliness, and the more training data, the higher the accuracy. The data set used here covers more than 90% of users, and the preference characteristics of every user group can be taken into account.
3. The Labeled LDA classification method is optimized and improved for the characteristics of the data, and finally achieves good results in the production environment. The Labeled LDA algorithm formula is optimized, different weights are given to the behaviors of different users, and the correlation relations among labels are also referenced. The model can be retrained periodically; users' overall taste changes with social hot spots and public perception, these changes are quickly reflected in user behavior, and the model's results change with them.
The singer classification method based on the Labeled LDA model has the following characteristics:
1. It comprehensively uses a machine learning model and manual labeling, simultaneously addressing the poor interpretability of machine-learned classifications and the low coverage of manual labeling;
2. It creatively takes user behavior as the basic data, converts it into a text-like form, and then applies Labeled LDA for semi-supervised classification learning, addressing problems such as the low accuracy and poor representativeness of textual data;
3. It optimizes the training target labels of Labeled LDA to achieve a better classification effect and to make the model easier to apply in a production environment;
4. Based on the characteristics of the business data, it optimizes the empirical TF-IDF formula and incorporates it into the classification model, which on the one hand emphasizes important user behavior and on the other hand prevents the behavior of individual users from interfering excessively with the model results;
5. It uses manually merged label groups to optimize the Labeled LDA sampling algorithm, so that the model's discrimination between similar labels is closer to users' perception.
Drawings
The invention has the following drawings:
FIG. 1 is a flowchart of a singer classification method based on the Labeled LDA model according to an embodiment of the present invention.
FIG. 2 shows the behavior time window w.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
In the field of text topic classification, the Labeled LDA model provides the idea of combining manual labels with a machine learning model, for example classifying news by headline/content or classifying songs by their descriptions/comments. In the recommendation domain of online music playing software, however, such a text-based classification model is not suitable:
(1) The amount of text data generated by a music platform is very limited; the only channels are song descriptions, user comments and the like, so the accuracy of such a model is limited;
(2) In these texts, a large portion of the content is often not a description of the music at all, and the descriptions of music that do exist are often ambiguous, so they can only be used for coarse-grained division, not fine-grained classification;
(3) These texts are often produced by a very small core fraction of the platform's users, so, like the experts' perspective, they are not necessarily representative of the preferences of the entire user population.
As shown in FIG. 1, the singer classification method based on the Labeled LDA model according to the present invention comprises the following steps:
s1, collecting manual labels of singers and preprocessing the manual labels to serve as training benchmarks;
s2, establishing a singer classification model based on user behaviors, and collecting user behavior data;
s3, cleaning user behavior data, and filtering data which is unfavorable for model training;
s4, distributing the weight of each singer corresponding to each user in the user behavior data;
s5, combining the user behavior data and the manual label data to generate training data;
S6, based on training data, referring to the label combination relation, performing Labeled LDA model training based on optimized Gibbs sampling.
Based on the above technical solution, in step S1, the manual labels of singers are collected, and the different manual labels are divided by dimension, where the dimensions include any of the following:
singer language classification dimension, including but not limited to the manual labels: Chinese, European and American; singer genre classification dimension, including but not limited to: pop, classical, rock, folk; singer style classification dimension, including but not limited to: dance music, ancient style, grassland style; singer's main instrument classification dimension, including but not limited to: saxophone, violin, piano; era classification dimension, including but not limited to: early, middle, late.
Based on the above technical solution, in step S1, the preprocessing includes: evaluating the accuracy of the manual labels included in each dimension, and taking the manual labels of the dimensions that meet or exceed a preset threshold to form a label system;
after accuracy evaluation is carried out on dimensions other than the above, those that meet or exceed the preset threshold are added to the label system.
Based on the above technical solution, in step S1, the preprocessing further includes: performing fine-granularity optimization of the manual labels by dimension, which specifically includes:
evaluating the granularity of the manual labels;
manual labels that are already very narrow and specific to a certain kind of music, such as ancient style or grassland style, are not optimized;
manual labels with a broad scope are further subdivided by combining dimensions, for example: pop can be further subdivided into Chinese pop, European and American pop, and so on; likewise, Chinese pop can be further subdivided into early Chinese pop, mid-period Chinese pop, late Chinese pop, and so on;
by default the subdivision goes no deeper than three levels; in the example above, "subdivided into Chinese pop, European and American pop" is the first level, and "subdivided into early Chinese pop, mid-period Chinese pop, late Chinese pop" is the second level. Which manual labels count as narrow and which as broad can be set according to empirical values.
Based on the above technical solution, in step S1, the preprocessing further includes: generating label groups, i.e. merging closely similar manual labels into the same label group.
For example: for finer-grained manual labels such as pop folk, campus folk and traditional folk, the boundaries between them are not very sharp, and the user population does not necessarily agree completely with the labels and the experts' assignment of songs. Such closely similar manual labels are therefore merged into label groups, which are taken into account in the classification computation. Which manual labels count as closely similar can be set according to empirical values. A minimal sketch of such a grouping follows.
Based on the above technical solution, in step S2, establishing the singer classification model based on user behavior converts the singer classification problem into a document topic classification problem, and a document topic classification model is then applied to classify singers; this specifically includes:
aggregating users, as words, into articles that represent singers, to serve as training data,
where a user is someone who has played the singer's songs in full,
each singer is treated as one article, with articles and singers in one-to-one correspondence,
a behavior time window w is set, and user behavior within the window w is interpreted as follows: if a singer's song is played in full within the behavior time window w, the user is treated as a word and that word is merged into the article corresponding to the singer,
that is: an article corresponds to a singer, the content of the article consists of words, and each word corresponds to a user who satisfies the condition of having played that singer's songs in full within the behavior time window w.
It should be noted that, because users are aggregated as words into articles representing singers to form the training data, the articles in the training data are obviously generated from user behavior (specifically, song-playing behavior) and are not articles in the ordinary sense. Since current music playback has no price tier threshold, every user can listen to almost all the music on the platform, so the song-playing behavior of a single user is strongly random, and the user behavior data must be filtered and cleaned in the subsequent processing.
Based on the above technical scheme, the length of the behavior time window w can be adjusted within a certain range.
The behavior time window w is an observation window, as shown in FIG. 2, in which:
the longer w is, the more stable the training result, but the weaker the adaptability, and the result cannot follow shifts in users' overall interests;
the shorter w is, the less stable the training result and the more random the classification result, but the stronger the adaptability, and the result can follow shifts in users' overall interests.
As an alternative embodiment, the specific value of the behavior time window w balances the adaptability requirement against the rate at which the actual system generates user data.
As shown in FIG. 2, if the system traffic is high and a large amount of user behavior is generated per unit time, a relatively short behavior time window can be chosen, and the resulting classification better reflects current trends; if high interpretability is required, for example when an explicit classification must be presented to users, a relatively long behavior time window is needed to produce a more interpretable classification result.
On the basis of the above technical solution, collecting user behavior data means: within the behavior time window w, collecting the behavior of all users who have played a given singer's songs in full,
where the user behavior specifically comprises the song that was played in full and the singer corresponding to that song.
Based on the above technical solution, in step S3, cleaning the user behavior data specifically includes:
for users who have listening behavior for the songs of very many singers, i.e. words (users) contained in very many articles (singers), a threshold L_user-max is set; a word contained in more than L_user-max articles is considered to reflect an overly broad user interest, which does not help the model converge, and the word is removed from all articles;
for users who listen to only a very few singers, i.e. words contained in a very limited number of articles, a threshold L_user-min is set; when a word is contained in fewer than L_user-min articles it likewise contributes little to model training (in the extreme case, if a user listens to only one singer's songs, this provides no help in classifying that singer), and the word is removed from all articles;
after the above two cleaning passes, a threshold L_artist-min is set for articles with too few words; articles containing fewer than L_artist-min words cannot support an accurate singer classification from the available data, so these articles are removed.
It should be noted that, as one alternative embodiment, each threshold should initially be set empirically to a relatively tolerant value, that is: L_user-max can be given a fairly large value (e.g. 10% of the total number of singers), and L_user-min and L_artist-min can first be given fairly small values greater than 1 (e.g. 2-5), which are then adjusted according to the model's effect. The more tolerant the thresholds, the more singers the classification model can cover, but the lower the accuracy; the stricter the thresholds, the fewer singers it can cover, but the higher the accuracy. After the model has been trained, if the accuracy meets expectations, one may try to further increase L_user-max or decrease L_user-min and L_artist-min; if the accuracy falls short of expectations, L_user-max needs to be decreased and L_user-min and L_artist-min increased. A cleaning sketch under these assumptions is given below.
On the basis of the above technical solution, in step S4, features are further emphasized by assigning weights to the user behavior data; for a document topic classification model, word weight assignment is an indispensable step in ensuring the model's effectiveness.
Intuitively, for two words within one article, the word that occurs more often is more representative of the article's topic than the one that occurs less often; across all articles, the word that occurs in fewer articles is more representative of an individual article's topic than the one that occurs in more.
For this, the common practice in the text-processing field is to use TF-IDF as the word weight, with the empirical formula
W_TF-IDF(u, a) = (N_a,u / N_a) · log(D / D_u),
where W_TF-IDF(u, a) is the weight of word u in article a, N_a,u is the number of times word u is contained in article a, N_a is the total number of words in article a, D is the total number of articles, and D_u is the number of articles in D that contain word u;
however, for the model described in this invention, this empirical formula is not applicable, because:
1. For the first term (the TF part): users' song-playing behavior has a strong Matthew effect, so the word counts of the generated articles are severely unbalanced; some articles contain hundreds of thousands of words while others have only a few dozen.
For articles with few words, if an individual word occurs too many times it seriously distorts the judgment of that article's topic, so the ratio term needs to be log-processed; yet for articles with very many words, after a direct log transform the weight differences between words become very small.
2. For the second term (the IDF part): apart from the extreme cases that have already been removed, most users do not listen to very many singers within a given period; most concentrate on a few dozen singers, while the total number of singers is on the order of tens of thousands, so after taking the log the second-term scores of all users are almost indistinguishable.
To address these two problems, the invention uses an optimized TF-IDF formula as the word weight; it is computed for word u in article a from N_a,u, the number of times word u is contained in article a, N_a,max, the number of occurrences of the most frequent word in article a, D_u, the number of articles in D containing word u, and D_max, the number of articles containing the word that appears in the most articles of D.
Based on the above technical scheme, for an ordinary Labeled LDA model, all manual labels can be collected and numbered, and then all articles in the existing training data are labeled:
for articles that carry some manual labels, they are labeled with those manual labels;
for articles without any manual labels, all manual labels are applied.
In addition, in some cases Labeled LDA uses the number of manual label classes plus 1 as the model training target. The "+1" provides a default classification for all articles: the model can judge during training, and if the current article cannot be placed into any existing classification it can be placed into the default classification.
For the model described in the present invention, further optimization is required.
In this invention, singers have no sharp genre boundaries; fuzzy zones exist between different genres, and users' perception of a singer does not necessarily match the academic genre division, so it is quite likely that some classification recognized by users is simply absent from the manual labels, and forcing such singers into any existing genre would cause great trouble for the subsequent recommendation work.
To avoid this situation as far as possible, testing showed that the following optimization scheme in step S5 achieves a better classification effect:
in step S5, the number of model classification labels is set to the number of manual label classes plus n, where n is the number of default labels; n can be chosen empirically in the range 5-10, depending on the actual effect;
for articles that carry some manual labels, they are labeled with those manual labels plus the n default labels;
for articles without any manual labels, all classification labels are applied, i.e. all manual labels plus the n default labels. A sketch of this label assignment follows.
Based on the above technical solution, in step S6, Labeled LDA learns with a Gibbs sampling algorithm; after the optimized TF-IDF weights are added, the sampling probability of assigning classification label k to the current word u_i is built from: the probability that label k occurs in document a when the current word is excluded, the probability that word u_i corresponds to label k when the current word is excluded, the model hyperparameters α_k and β_i, and the optimized TF-IDF weight of the current word in the current document;
taking into account the correlation of the labels within each label group, the sampling probability is further optimized, where T is the set of all label groups, t is the size of a label group, II(·) is an indicator function whose value is 1 when its argument is true and 0 otherwise, and λ is a hyperparameter with 0 < λ < 1.
λ needs to be adjusted according to the actual effect: if λ is too small, the labels within a label group tend to behave as fully independent labels and the correlation between them is not reflected; if λ is too large, the sampling probabilities of the labels in a group that contains many labels become too high.
On the basis of the above technical solution, when training starts, each word of each article is first randomly assigned a classification label k;
if the article has manual labels, each word in the article is initialized within the range of those manual labels; if the article has no manual labels, it is initialized within the range of all labels;
each word of each article is then traversed and each word is reassigned a classification label,
where the sampling probability used for the assignment is computed, according to the sampling probability formula, from all words of the current article other than the current word and from all words of the other articles; if the article has manual labels, each word in the article is restricted to the range specified by the manual labels;
after several rounds of training, training is stopped, the number of words under each classification label in each article is counted, and the counts are normalized to serve as the classification of the article, i.e. the classification of the singer.
The following is a specific example.
As shown in FIG. 1, the invention provides a singer classification method based on the Labeled LDA model, comprising the following steps:
S1, collecting the singers' manually assigned label data and determining the training benchmark.
The singer manual labels used here come partly from the labels assigned when a singer is entered into the library, and partly from labels specifically assigned by relevant experts to genre-representative singers. In total, about 3000 singers carry manual labels, covering 153 labels.
Table 1. Example of singers' manual labels:
The labels are sorted, and labels that are finely divided and close to one another are merged, establishing 26 label groups.
Table 2. Example of singer label groups:
s2, establishing a singer classification model based on user behaviors and collecting data.
The singer classification problem is converted into a document topic classification problem. Each singer is treated as an article; a behavior time window of 30 days is selected; the users who played a singer's songs in full within the behavior time window are taken as the words of that article; these user words are combined to form the article corresponding to the singer, and each article corresponds to exactly one singer.
This process generates more than 20,000 articles in total, containing more than 1,000,000 words.
Table 3. Example of a singer article generated from users' song-listening behavior:
s3, cleaning behavior data, and filtering data which are unfavorable for model training.
Words that occur in more than 1000 articles or in fewer than 5 articles are filtered out. Articles containing fewer than 5 words are then filtered out. This leaves more than 15,000 articles and more than 500,000 words.
For convenience of training, the articles and words in the cleaned data set are encoded. The original article ids and word ids are discrete values; after encoding, all ids are converted into consecutive integer sequences, and the occurrences of each word in each article are counted, as sketched below.
Table 4. Example of cleaned and encoded singer articles:
s4, weight distribution is carried out on the behavior data, namely, the importance degree of the words in the articles is evaluated by using TF-IDF.
The weight of each word in each article is calculated using the following formula
Where Na, u represents the number of times the word u is contained in article a,representing the number of occurrences of the word most contained in article a, du representing the number of articles containing word u in D, +.>Representing the number of articles in D in which the word appearing in the most articles appears.
S5, labeling all articles (singers) in the existing behavior data by using the manual label data.
There are 153 manual labels in total; with 5 default labels added, there are 158 labels. For singers with manual labels, the corresponding labels are applied to their articles and the default labels are added. For singers without manual labels, all 158 labels are applied to their articles.
S6, training a LabeledLDA model.
a. LDA
LDA (Latent Dirichlet Allocation) is a document topic generation model comprising three elements: articles, topics and words. In principle it is a three-layer Bayesian probability model; it assumes that the topic probability distribution θ of each document and the word probability distribution φ of each topic are independent and are generated by sampling from Dirichlet distributions with parameters α and β respectively.
In the LDA model, u represents a word, k represents the topic assigned to the word, P(u, k, θ, φ | α, β) is the generation probability of assigning topic k to the current word u, α and β are hyperparameters, and θ and φ can be estimated from α, β and the current training data.
b. Gibbs sampling
Under the LDA assumptions, one may suppose that the topics of all words other than the current word are fixed, generate a probability distribution over the current word's topic from the information provided by the other words, and then randomly assign a topic to the current word according to that distribution; this process is called Gibbs sampling. The probability distribution used in Gibbs sampling is built from:
the probability that label k occurs in document a when the current word u_i is excluded, the probability that word u_i corresponds to label k when the current word is excluded, and the hyperparameters α_k and β_i.
θ and φ can be estimated according to the following equation:
c. Labeled LDA
LDA is similar in nature to clustering and is an unsupervised machine learning algorithm. After clustering is finished, the center point of each class is uncertain and so is the meaning each class represents, so the stability of the generated result is poor and it has no interpretability.
Labeled LDA specifies, for (the words of) a portion of the articles, the range of topics that may be selected during Gibbs sampling, thereby turning unsupervised LDA into a semi-supervised learning algorithm. The center points of the generated clusters then fluctuate only within a certain range, so stability is high and interpretability is also obtained.
d. Optimized Gibbs sampling
The weights calculated by the optimized TF-IDF are added to the Gibbs sampling process; the improved Gibbs sampling probability of assigning classification label k to the current word u_i (before considering label correlation) is built from: the probability that label k occurs in document a when the current word is excluded, the probability that word u_i corresponds to label k when the current word is excluded, the model hyperparameters α_k and β_i, and the weight of the current word in the current document.
Then, according to the correlation of the labels within each label group, the sampling probability is further optimized, where T is the set of all label groups, t is the size of a label group, II(·) is an indicator function whose value is 1 when its argument is true and 0 otherwise, and λ is a hyperparameter with 0 < λ < 1.
Training is started:
(1) First, α = 0.1, β = 0.1 and λ = 0.2 are set empirically.
(2) For each word of each article, a label is selected at random from the article's selectable label range and assigned.
(3) The Gibbs sampling probability distribution is computed for each word of each article one by one, where the probability is likewise computed only over the article's selectable label range; one label is then drawn at random from that range according to the probabilities and assigned. One full traversal of all words of all articles constitutes one round of training.
(4) Step (3) is repeated a number of times.
(5) Training is finished, and the number of words under each classification label in each article is counted as the probability of the article belonging to that classification, i.e. the singer's score for that classification. A rough sketch of this training loop is given below.
To judge how many rounds of training are needed to reach a reasonably good result, two kinds of evaluation can be used. 1. The singer classifications can be spot-checked: verify whether the classification scores computed for singers that already have manual labels are reasonable, and whether the classification scores of singers without manual labels are reasonable. 2. The singers with manual labels can be split at random into a training set and a test set at a 7:3 ratio, with the test-set singers treated as unlabeled singers during training; after training, the difference between the test set's manual labels and the labels computed by the model is examined.
What is not described in detail in this specification is prior art known to those skilled in the art. The above description is merely of the preferred embodiments of the present invention, the protection scope of the present invention is not limited to the above embodiments, but all equivalent modifications or variations according to the disclosure of the present invention should be included in the protection scope of the claims.

Claims (7)

1. A singer classification method based on the Labeled LDA model, characterized by comprising the following steps:
s1, collecting manual labels of singers and preprocessing the manual labels to serve as training benchmarks;
Collecting manual labels of singers, and dividing different manual labels according to dimensions;
the preprocessing comprises: evaluating the accuracy of the manual labels included in each dimension, and taking the manual labels of the dimensions that meet or exceed a preset threshold to form a label system;
s2, establishing a singer classification model based on user behaviors, and collecting user behavior data;
the establishing of the singer classification model based on user behavior converts the singer classification problem into a document topic classification problem and then applies a document topic classification model to classify singers, specifically comprising:
aggregating users, as words, into articles representing singers, to serve as training data, wherein a user is someone who has played the singer's songs in full;
each singer is treated as one article, with articles and singers in one-to-one correspondence, namely: an article corresponds to a singer, the content of the article consists of words, and each word corresponds to a user who satisfies the condition of having played that singer's songs in full within a behavior time window w; specifically: the behavior time window w is set, and user behavior within the window w is interpreted as follows: if a singer's song is played in full within the behavior time window w, the user is treated as a word and that word is merged into the article corresponding to the singer;
collecting user behavior data means: within the behavior time window w, collecting the behavior of all users who have played a given singer's songs in full; the user behavior specifically comprises the song that was played in full and the singer corresponding to that song;
s3, cleaning user behavior data, and filtering data which is unfavorable for model training;
s4, distributing the weight of each singer corresponding to each user in the user behavior data; features are further emphasized by assigning weights to the user behavior data; wherein an optimized TF-IDF value is used as the weight of each word,
computed for word u in article a from N_a,u, the number of times word u is contained in article a, N_a,max, the number of occurrences of the most frequent word in article a, D, the total number of articles, D_u, the number of articles in D containing word u, and D_max, the number of articles containing the word that appears in the most articles of D;
s5, combining the user behavior data and the manual label data to generate training data;
s6, based on training data, referring to the label combination relation, performing Labeled LDA model training based on optimized Gibbs sampling;
the Labeled LDA model learns with a Gibbs sampling algorithm; after the optimized TF-IDF weights are added, the sampling probability of assigning classification label k to the current word u_i is built from: the probability that label k occurs in document a when the current word is excluded, the probability that word u_i corresponds to label k when the current word is excluded, the model hyperparameters α_k and β_i, and the optimized TF-IDF weight of the current word in the current document;
considering the correlation of the labels within each label group, the sampling probability is further optimized, where T is the set of all label groups, t is the size of a label group, II(·) is an indicator function whose value is 1 when its argument is true and 0 otherwise, and λ is a hyperparameter with 0 < λ < 1.
2. The singer classification method based on a Labeled LDA model of claim 1, wherein: in step S1, the dimension includes any one of the following:
singer language classification dimension; this dimension includes, but is not limited to, the following manual labels: Chinese, European and American;
singer genre classification dimension; this dimension includes, but is not limited to, the following manual labels: pop, classical, rock, folk;
singer style classification dimension; this dimension includes, but is not limited to, the following manual labels: dance music, ancient style, grassland style;
singer's main instrument classification dimension; this dimension includes, but is not limited to, the following manual labels: saxophone, violin, piano;
era classification dimension; this dimension includes, but is not limited to, the following manual labels: early, middle, late;
in step S1, after the label system is formed, accuracy evaluation is carried out on dimensions other than the above, and those that meet or exceed a preset threshold are added to the label system;
the preprocessing further comprises: performing fine-granularity optimization of the manual labels by dimension, which specifically includes:
evaluating the granularity of the manual labels,
further subdividing manual labels with an overly broad scope by combining dimensions,
where the subdivision by default goes no deeper than three levels.
3. The singer classification method based on a Labeled LDA model of claim 2, wherein: in step S1, the preprocessing further includes: generating a tag group, and combining the more similar manual tags into the same tag group.
4. The singer classification method based on a Labeled LDA model of claim 1, wherein: the length of the behavior time window w is adjusted within a certain range, and:
the longer w is, the more stable the training result, but the weaker the adaptability, and the result cannot follow shifts in users' overall interests;
the shorter w is, the less stable the training result and the more random the classification result, but the stronger the adaptability, and the result can follow shifts in users' overall interests;
the length of w follows these principles:
if the system traffic is high and a large amount of user behavior is generated per unit time, a relatively short behavior time window is selected, and the resulting classification better reflects current trends;
if high interpretability is required to present an explicit classification to users, a relatively long behavior time window is selected to produce a more interpretable classification result.
5. The singer classification method based on a Labeled LDA model of claim 1, wherein: in step S3, cleaning the user behavior data specifically includes:
for a user who has listened to songs of many singers, i.e., a word contained in many articles, a threshold L_user-max is set; if the number of articles containing the word is greater than L_user-max, the user's interests are considered too broad, the word does not benefit model convergence, and it is removed from all articles;
for a user who has listening behavior for only a few singers, i.e., a word contained in a very limited number of articles, a threshold L_user-min is set; when a word is contained in fewer than L_user-min articles it likewise contributes little to model training (in the extreme case, a user who listens to only one singer provides no help in classifying that singer), and the word is removed from all articles;
after the above two cleaning steps, a threshold L_artist-min is set for articles with too few words; articles containing fewer than L_artist-min words cannot support an accurate singer classification from the available data and are removed.
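By way of example only, the three cleaning thresholds described above might be applied as in the following Python sketch; the representation of singers as articles whose words are user IDs follows the method, while the function and argument names (clean_behavior_data, l_user_max, l_user_min, l_artist_min) are hypothetical:

```python
from collections import Counter

def clean_behavior_data(articles, l_user_max, l_user_min, l_artist_min):
    """articles: {singer_id: [user_id, ...]} -- each singer is an 'article'
    whose 'words' are the IDs of users who listened to that singer."""
    # document frequency of each user ("word"): in how many articles it appears
    df = Counter()
    for users in articles.values():
        df.update(set(users))
    # drop users whose interests are too broad (> max) or too narrow (< min)
    keep = {u for u, n in df.items() if l_user_min <= n <= l_user_max}
    cleaned = {
        singer: [u for u in users if u in keep]
        for singer, users in articles.items()
    }
    # drop articles that are now too short to classify reliably
    return {s: users for s, users in cleaned.items() if len(users) >= l_artist_min}
```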
6. The singer classification method based on a Labeled LDA model of claim 1, wherein: in step S5, the number of model classification labels is set to the number of manual label classes plus n, where n is the number of default labels, chosen empirically between 5 and 10 depending on actual results;
articles that have some manual labels are labeled with those manual labels plus the n default labels;
articles without any manual labels are labeled with all classification labels, i.e., all manual labels plus the n default labels.
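By way of example only, the label assignment rule of this claim could be expressed as in the following Python sketch; the helper name label_set_for_article and the indexing convention (manual labels indexed 0..num_manual-1, default labels after them) are hypothetical:

```python
def label_set_for_article(manual_labels, num_manual, n_default):
    """Labels an article may receive during training: its manual labels plus the
    n default labels, or every label if it has no manual labels."""
    default_labels = list(range(num_manual, num_manual + n_default))
    if manual_labels:                                 # some manual labels present
        return sorted(manual_labels) + default_labels
    return list(range(num_manual)) + default_labels   # no manual labels: all labels
```

For instance, label_set_for_article([2, 5], num_manual=20, n_default=8) would return [2, 5, 20, 21, ..., 27].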
7. The singer classification method based on a Labeled LDA model of claim 1, wherein: at the beginning of training, each word of each article is first randomly assigned a classification label k:
if the article has manual labels, each word in the article is initialized within the range of those manual labels; if the article has no manual labels, initialization is performed over the full label range;
each word of each article is then traversed and reassigned a classification label;
the sampling probability used in the reassignment is calculated, according to the sampling probability formula, from all words of the current article except the current word and from all words of the other articles; if the article has manual labels, each word in the article remains restricted to the range specified by those manual labels;
after several rounds of training, training is stopped, the number of words under each classification label in each article is counted and then normalized, and the result is used as the classification of the article, i.e., the classification of the singer.
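By way of example only, the initialization, Gibbs sweeps and final normalization described in this claim could be outlined as in the following Python sketch; draw_label stands in for the per-word sampling step (for instance built on the sampling_probs sketch given earlier), and all names are hypothetical:

```python
import random
import numpy as np

def train(articles, allowed, K, n_iter, draw_label):
    """articles: {singer: [user_word, ...]}; allowed: {singer: [label, ...]}.
    draw_label(singer, position, word) is assumed to resample a label for one
    word from its (re-computed) sampling distribution."""
    # random initialization, restricted to each article's allowed label range
    z = {s: [random.choice(allowed[s]) for _ in words]
         for s, words in articles.items()}
    for _ in range(n_iter):                           # several Gibbs sweeps
        for s, words in articles.items():
            for pos, w in enumerate(words):
                z[s][pos] = draw_label(s, pos, w)      # reassign a classification label
    # normalized per-article label counts = the singer's classification
    result = {}
    for s, labels in z.items():
        counts = np.bincount(labels, minlength=K).astype(float)
        result[s] = counts / counts.sum()
    return result
```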
CN202010477122.5A 2020-05-29 2020-05-29 Singer classification method based on Labeled LDA model Active CN111611432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010477122.5A CN111611432B (en) 2020-05-29 2020-05-29 Singer classification method based on Labeled LDA model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010477122.5A CN111611432B (en) 2020-05-29 2020-05-29 Singer classification method based on Labeled LDA model

Publications (2)

Publication Number Publication Date
CN111611432A CN111611432A (en) 2020-09-01
CN111611432B (en) 2023-09-15

Family

ID=72200361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010477122.5A Active CN111611432B (en) 2020-05-29 2020-05-29 Singer classification method based on Labeled LDA model

Country Status (1)

Country Link
CN (1) CN111611432B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069807A (en) * 2020-11-11 2020-12-11 平安科技(深圳)有限公司 Text data theme extraction method and device, computer equipment and storage medium
CN113220934B (en) * 2021-06-01 2023-06-23 平安科技(深圳)有限公司 Singer recognition model training and singer recognition method and device and related equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101984437A (en) * 2010-11-23 2011-03-09 亿览在线网络技术(北京)有限公司 Music resource individual recommendation method and system thereof
CN107977370A (en) * 2016-10-21 2018-05-01 北京酷我科技有限公司 A kind of singer recommends method and system
CN109710758A (en) * 2018-12-11 2019-05-03 浙江工业大学 A kind of user's music preferences classification method based on Labeled-LDA model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Youngmoo E. Kim et al. Singer Identification in Popular Music Recordings Using Voice Coding Features. IRCAM, 2002, pp. 1-6. *

Also Published As

Publication number Publication date
CN111611432A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
Lo et al. Cost-sensitive multi-label learning for audio tag annotation and retrieval
Paul et al. A survey of music recommendation systems with a proposed music recommendation system
CN109271550B (en) Music personalized recommendation method based on deep learning
CN111611432B (en) Singer classification method based on Labeled LDA model
WO2007077991A1 (en) Information processing device and method, and program
Mandel et al. Contextual tag inference
US20110202530A1 (en) Information processing device, method and program
JP2007207218A (en) Information processing device and method, and program
Wang A hybrid recommendation for music based on reinforcement learning
Wang et al. Research on intelligent recognition and classification algorithm of music emotion in complex system of music performance
CN108280165A (en) Reward value music recommendation algorithm based on state transfer
Baccigalupo et al. Uncovering Affinity of Artists to Multiple Genres from Social Behaviour Data.
Jin et al. A music recommendation algorithm based on clustering and latent factor model
Saaidin et al. Recommender system: rating predictions of steam games based on genre and topic modelling
Kühl et al. Automatically quantifying customer need tweets: Towards a supervised machine learning approach
CN108810640A (en) A kind of recommendation method of TV programme
Dwiyani et al. Classification of explicit songs based on lyrics using random forest algorithm
Su et al. Ubiquitous music retrieval by context-brain awareness techniques
JP2007183927A (en) Information processing apparatus, method and program
KR20220113221A (en) Method And System for Trading Video Source Data
Xiao et al. Learning a music similarity measure on automatic annotations with application to playlist generation
CN112800270A (en) Music recommendation method and system based on music labels and time information
Siddiquee et al. An Effective Machine Learning Approach for Music Genre Classification with Mel Spectrograms and KNN
Zhang Design of the piano score recommendation image analysis system based on the big data and convolutional neural network
Fuhrmann et al. Quantifying the Relevance of Locally Extracted Information for Musical Instrument Recognition from Entire Pieces of Music.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant