CN109726286B - Automatic book classification method based on LDA topic model - Google Patents

Automatic book classification method based on LDA topic model

Info

Publication number
CN109726286B
CN109726286B
Authority
CN
China
Prior art keywords
book
label
classified
books
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811584226.5A
Other languages
Chinese (zh)
Other versions
CN109726286A (en)
Inventor
符俊涛
王超芸
李曲
应文佳
马堃
沈钦壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinxun Digital Technology Hangzhou Co ltd
Original Assignee
EB INFORMATION TECHNOLOGY Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by EB INFORMATION TECHNOLOGY Ltd filed Critical EB INFORMATION TECHNOLOGY Ltd
Priority to CN201811584226.5A priority Critical patent/CN109726286B/en
Publication of CN109726286A publication Critical patent/CN109726286A/en
Application granted granted Critical
Publication of CN109726286B publication Critical patent/CN109726286B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automatic book classification method based on an LDA topic model comprises the following steps: establishing a classification system; selecting books of known categories as training books, forming a book label total set from the labels of all the training books, and allocating a unique serial number to each label in the book label total set; constructing and training a multinomial distribution model, wherein the input of the model is the book labels contained in each training book together with that book's category, and the output is the probability of each label in the book label total set under each category; extracting the book labels of a book to be classified to form its label set, then, based on the LDA topic model, using the Gibbs sampling method to sample and assign a category to each book label of the book to be classified, and, after convergence is reached, counting the score of each category for the book to be classified, thereby obtaining the category to which the book belongs. The invention belongs to the technical field of information, and can realize automatic book classification based on an LDA topic model.

Description

Automatic book classification method based on LDA topic model
Technical Field
The invention relates to an automatic book classification method based on an LDA topic model, and belongs to the technical field of information.
Background
Book classification has long been of great significance to both online and offline organizations that hold large numbers of books. For network literature platforms and online bookstores popular with emerging reading groups, accurate book classification is the basis for accurate recommendation of all kinds of books; for libraries and physical bookstores carrying traditional published literature, accurate book classification improves management efficiency and user experience. Because these organizations have many existing books to classify and new books continuously coming onto the shelves, the current manual classification approach suffers from heavy workload, low efficiency, and subjective, inaccurate results, so an efficient and accurate automatic book classification method has become increasingly important.
Current automatic book classification work mainly relies on machine learning algorithms such as naive Bayes, support vector machines and neural networks. However, a book is essentially a collection of texts, and the books to be classified may include both web literature and traditional literature, so the above methods do not achieve good results.
The traditional LDA topic model from NLP (natural language processing) is an unsupervised learning method, and applying it directly would merely cluster the books, which runs contrary to the goal of classification. How to adapt the LDA topic model so that it can be applied to automatic book classification has therefore become a technical problem that practitioners urgently need to solve.
Disclosure of Invention
In view of the above, the present invention provides an automatic book classification method based on an LDA topic model, which can realize automatic book classification based on the LDA topic model.
In order to achieve the above object, the present invention provides an automatic book classification method based on an LDA topic model, comprising:
step one, establishing a classification system comprising K categories;
step two, selecting books of known categories as training books, extracting book labels from each training book, forming a book label total set from the book labels of all the training books, and allocating a unique serial number to each book label in the book label total set;
step three, constructing and training a multinomial distribution model by taking the training books as samples, wherein the input of the multinomial distribution model is all the book labels contained in each training book together with the category to which that training book belongs, and the output is the probability of each book label in the book label total set under each category;
step four, extracting those book labels of the book to be classified that appear in the book label total set, forming a label set W = (w_1, w_2, …, w_d) of the book to be classified, wherein d is the number of book labels contained in the book to be classified and w_1, w_2, …, w_d are those book labels; then, based on the LDA topic model, using the Gibbs sampling method to sample and assign a category to each book label of the book to be classified according to the probability of each book label in the book label total set under the different categories; when convergence is reached, calculating the probability distribution over categories for each book label of the book to be classified and counting the score of each category for the book to be classified, thereby obtaining the category to which the book to be classified belongs,
the fourth step further comprises:
step 41, randomly initializing a category for each book label in the book to be classified, and initializing i to 1;
step 42, extracting the ith book label from the label set W of the book to be classified;
step 43, calculating the probability distribution of the different categories to which the extracted i-th book label belongs:
p(z_i = k, w_i) = p_{kv} \cdot \frac{n_{k(-i)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{k'(-i)} + \alpha_{k'} \right)}
wherein p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, k = 1, 2, …, K, z_i is the category of w_i, v is the serial number of the i-th book label of the book to be classified within the book label total set, p_kv is the probability of the v-th book label of the book label total set under the k-th category, whose value is obtained by the calculation of step three, n_k(-i) and n_k'(-i) are the numbers of labels currently assigned to the k-th and k'-th categories among all book labels in the label set W of the book to be classified, excluding the i-th book label, and α_k, α_k' are the adjustment parameters of the k-th and k'-th categories;
step 44, randomly sampling to obtain a category according to the probability distribution of different categories to which the ith book label belongs, and updating the category of the ith book label into the category obtained after sampling;
step 45, updating i to i + 1 and judging whether the updated i is greater than d; if so, all book labels in the label set W have been updated, so continue to the next step; if not, go to step 42;
step 46, judging whether the agreement between the categories of the book labels in W after the current update and their categories after the previous update reaches a convergence threshold; if so, convergence is reached; if not, resetting i to 1 and returning to step 42 to perform the next round of category updates for each book label in W.
Compared with the prior art, the invention has the following beneficial effects: the traditional unsupervised LDA topic model is transformed into a supervised LDA topic model algorithm; the probability of each book label in the book label total set under each category is obtained by training on the labels of books whose categories are known, and the score of each category for a book to be classified is computed with the Gibbs sampling method, thereby realizing automatic classification of books. When computing the score of each category for the book to be classified, the invention does not directly count the number of sample labels assigned to each category as in the LDA model, but sums the probabilities of each category using the probability distributions, and additionally adjusts the weight of each book label with IDF, so that the book category is identified more accurately and the error caused by a single sampling pass is reduced.
Drawings
FIG. 1 is a flow chart of an automatic book classification method based on LDA topic model according to the present invention.
Fig. 2 is a flowchart showing the detailed steps of step four in fig. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides an automatic book classification method based on LDA topic model, which comprises:
step one, establishing a classification system comprising K categories;
step two, selecting books of known categories as training books, extracting book labels from each training book, forming a book label total set from the book labels of all the training books, and allocating a unique serial number to each book label in the book label total set;
step three, constructing and training a multinomial distribution model by taking the training books as samples, wherein the input of the multinomial distribution model is all the book labels contained in each training book together with the category to which that training book belongs, and the output is the probability of each book label in the book label total set under each category;
the types of all book labels contained in each training book are consistent with the type of the training book, and the probability sum of all book labels in the book label total set under each type is 1;
step four, extracting those book labels of the book to be classified that appear in the book label total set, forming a label set W = (w_1, w_2, …, w_d) of the book to be classified, wherein d is the number of book labels contained in the book to be classified and w_1, w_2, …, w_d are those book labels; then, based on the LDA topic model, using the Gibbs sampling method to sample and assign a category to each book label of the book to be classified according to the probability of each book label in the book label total set under the different categories; when convergence is reached, calculating the probability distribution over categories for each book label of the book to be classified and counting the score of each category for the book to be classified, thereby obtaining the category to which the book to be classified belongs.
For a plurality of books to be classified, the category of each one can be obtained in turn simply by repeating step four.
In step two, word segmentation and part-of-speech tagging can be performed on the body chapters of each training book using NLP techniques, and the valid nouns are extracted to serve as book labels.
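For illustration only, a minimal Python sketch of such label extraction is given below; it assumes the jieba library for Chinese word segmentation and part-of-speech tagging, and the stop-word filtering and minimum-length threshold are illustrative choices rather than part of the claimed method.

```python
# Sketch: extract noun book labels from a chapter of text, assuming jieba.
import jieba.posseg as pseg

def extract_book_labels(chapter_text, stopwords=frozenset(), min_len=2):
    """Segment the chapter text, keep nouns (POS tags starting with 'n'),
    and return the de-duplicated nouns as the book's label set."""
    labels = set()
    for token in pseg.cut(chapter_text):
        if (token.flag.startswith("n")
                and len(token.word) >= min_len
                and token.word not in stopwords):
            labels.add(token.word)
    return labels
```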
In step three, according to probability statistics, a preferred formula for the probability of each book label in the book label total set under each category output by the multinomial distribution model is: the probability p_kv of the v-th book label in the book label total set under the k-th category is the ratio of the number of occurrences of the v-th book label in all training books belonging to the k-th category to the total number of book labels in all training books belonging to the k-th category.
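A minimal sketch of this counting rule follows, assuming the training data are supplied as (category, label list) pairs; the function and variable names are illustrative only.

```python
# Sketch of step three: p[k][v] = count of label v in category k
#                                 / total label count in category k.
from collections import Counter, defaultdict

def train_label_probabilities(training_books):
    """training_books: iterable of (category, [label, ...]) pairs."""
    counts = defaultdict(Counter)
    for category, labels in training_books:
        counts[category].update(labels)
    p = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        p[category] = {label: n / total for label, n in counter.items()}
    return p
```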
As shown in fig. 2, the step four may further include:
step 41, randomly initializing a category for each book label in the book to be classified, and initializing i to 1;
step 42, extracting the ith book label from the label set W of the book to be classified;
step 43, calculating the probability distribution of the different categories to which the extracted i-th book label belongs:
p(z_i = k, w_i) = p_{kv} \cdot \frac{n_{k(-i)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{k'(-i)} + \alpha_{k'} \right)}
wherein p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, k = 1, 2, …, K, z_i is the category of w_i, v is the serial number of the i-th book label of the book to be classified within the book label total set, p_kv is the probability of the v-th book label of the book label total set under the k-th category, whose value is obtained by the calculation of step three, n_k(-i) and n_k'(-i) are the numbers of labels currently assigned to the k-th and k'-th categories among all book labels in the label set W of the book to be classified, excluding the i-th book label, and α_k, α_k' are the adjustment parameters of the k-th and k'-th categories, whose values can be set according to actual service needs, for example both set to 1;
step 44, randomly sampling to obtain a category according to the probability distribution of different categories to which the ith book label belongs, and updating the category of the ith book label into the category obtained after sampling;
step 45, updating i to i + 1 and judging whether the updated i is greater than d; if so, all book labels in the label set W have been updated, so continue to the next step; if not, go to step 42;
step 46, judging whether the agreement between the categories of the book labels in W after the current update and their categories after the previous update reaches a convergence threshold; if so, convergence is reached and the next step continues; if not, resetting i to 1 and returning to step 42 to perform the next round of category updates for each book label in W;
step 47, calculating the probability of different categories of each book label in the label set of the book to be classified:
p(z_i = k, w_i) = p_{kv} \cdot \frac{n_{k(-i)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{k'(-i)} + \alpha_{k'} \right)}
and 48, calculating the score of each category of the books to be classified according to the probability of different categories of each book label in the label set of the books to be classified, and then selecting the maximum score value from the categories, wherein the category corresponding to the maximum score value is the category of the books to be classified.
It is worth mentioning that the invention does not directly count the number of sample labels assigned to each category as in the LDA model, but sums the probabilities of each category using the probability distributions, so that the book category is identified more accurately and the error caused by a single sampling pass is reduced. Meanwhile, because some book labels are very common and appear in almost every category, their discriminative power is small; therefore, the invention further uses IDF as a weight to adjust the score of each category for the book to be classified, and accordingly also comprises:
calculating the IDF of each book label in the book label total set according to the probability of each book label in the book label total set under different categories, which is obtained by the calculation in the step three:
idf_v = \log \frac{K}{\mathrm{num\_type}(b_v)}
wherein idf_v is the IDF value of the v-th book label b_v in the book label total set, and num_type(b_v) is the number of categories whose training books (the sample data input in step three) contain the v-th book label b_v,
Accordingly, in step four, the formula for counting the score of each category to which the book to be classified belongs is:
score_k = \sum_{i=1}^{d} idf_v \cdot p(z_i = k, w_i)
wherein score_k is the score of the k-th category for the book to be classified, p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, and v is the serial number of the i-th book label of the book to be classified within the book label total set.
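A short sketch of this IDF weighting follows, under the assumption (consistent with the reconstruction above) that idf_v = log(K / num_type(b_v)), where num_type counts the categories whose training books contain the label; the helper names are illustrative.

```python
import math

def compute_idf(p, num_categories):
    """p: dict mapping category -> {label: probability} from step three."""
    label_category_count = {}
    for category_probs in p.values():
        for label in category_probs:
            label_category_count[label] = label_category_count.get(label, 0) + 1
    # Labels that appear in every category receive an IDF of 0 (no discriminative power).
    return {label: math.log(num_categories / n)
            for label, n in label_category_count.items()}

def idf_weighted_scores(per_label_probs, idf):
    """per_label_probs: list of (label, {category: p(z_i = k, w_i)}) pairs.
    Returns the IDF-weighted score of each category."""
    scores = {}
    for label, probs in per_label_probs:
        weight = idf.get(label, 0.0)
        for category, prob in probs.items():
            scores[category] = scores.get(category, 0.0) + weight * prob
    return scores
```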
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. An automatic book classification method based on an LDA topic model is characterized by comprising the following steps:
step one, establishing a classification system comprising K categories;
step two, selecting books of known categories as training books, extracting book labels from each training book, forming a book label total set from the book labels of all the training books, and allocating a unique serial number to each book label in the book label total set;
step three, constructing and training a multinomial distribution model by taking the training books as samples, wherein the input of the multinomial distribution model is all the book labels contained in each training book together with the category to which that training book belongs, and the output is the probability of each book label in the book label total set under each category;
step four, extracting those book labels of the book to be classified that appear in the book label total set, forming a label set W = (w_1, w_2, …, w_d) of the book to be classified, wherein d is the number of book labels contained in the book to be classified and w_1, w_2, …, w_d are those book labels; then, based on the LDA topic model, using the Gibbs sampling method to sample and assign a category to each book label of the book to be classified according to the probability of each book label in the book label total set under the different categories; when convergence is reached, calculating the probability distribution over categories for each book label of the book to be classified and counting the score of each category for the book to be classified, thereby obtaining the category to which the book to be classified belongs according to the score,
the fourth step further comprises:
step 41, randomly initializing a category for each book label in the book to be classified, and initializing i to 1;
step 42, extracting the ith book label from the label set W of the book to be classified;
step 43, calculating the probability distribution of the different categories to which the extracted i-th book label belongs:
p(z_i = k, w_i) = p_{kv} \cdot \frac{n_{k(-i)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{k'(-i)} + \alpha_{k'} \right)}
wherein p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, k = 1, 2, …, K, z_i is the category of w_i, v is the serial number of the i-th book label of the book to be classified within the book label total set, p_kv is the probability of the v-th book label of the book label total set under the k-th category, whose value is obtained by the calculation of step three, n_k(-i) and n_k'(-i) are the numbers of labels currently assigned to the k-th and k'-th categories among all book labels in the label set W of the book to be classified, excluding the i-th book label, and α_k, α_k' are the adjustment parameters of the k-th and k'-th categories;
step 44, randomly sampling to obtain a category according to the probability distribution of different categories to which the ith book label belongs, and updating the category of the ith book label into the category obtained after sampling;
step 45, updating i to i + 1 and judging whether the updated i is greater than d; if so, all book labels in the label set W have been updated, so continue to the next step; if not, go to step 42;
step 46, judging whether the agreement between the categories of the book labels in W after the current update and their categories after the previous update reaches a convergence threshold; if so, convergence is reached; if not, resetting i to 1 and returning to step 42 to perform the next round of category updates for each book label in W.
2. The method as claimed in claim 1, wherein in step two, word segmentation and part-of-speech tagging are performed on the body chapters of each training book using NLP techniques, and the valid nouns are extracted as book labels.
3. The method of claim 1, wherein in step three, the formula for calculating the probability of each book label in the book label total set under each category output by the multinomial distribution model is: the probability p_kv of the v-th book label in the book label total set under the k-th category is the ratio of the number of occurrences of the v-th book label in all training books belonging to the k-th category to the total number of book labels in all training books belonging to the k-th category.
4. The method as claimed in claim 1, wherein in step four, calculating the probability distribution over categories for each book label of the book to be classified after convergence is reached, and counting the score of each category for the book to be classified so as to obtain the category to which the book to be classified belongs according to the score, further comprises:
step A1, calculating the probability of different categories of each book label in the label set of the book to be classified:
p(z_i = k, w_i) = p_{kv} \cdot \frac{n_{k(-i)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{k'(-i)} + \alpha_{k'} \right)}
wherein p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, k = 1, 2, …, K, z_i is the category of w_i, v is the serial number of the i-th book label of the book to be classified within the book label total set, p_kv is the probability of the v-th book label of the book label total set under the k-th category, whose value is obtained by the calculation of step three, n_k(-i) and n_k'(-i) are the numbers of labels currently assigned to the k-th and k'-th categories among all book labels in the label set W of the book to be classified, excluding the i-th book label, and α_k, α_k' are the adjustment parameters of the k-th and k'-th categories;
step A2, calculating the score of each category for the book to be classified according to the probabilities of the categories of every book label in the label set of the book to be classified, and then selecting the maximum score, wherein the category corresponding to the maximum score is the category to which the book to be classified belongs.
5. The method of claim 1, further comprising:
calculating the IDF of each book label in the book label total set according to the probability of each book label in the book label total set under different categories, which is obtained by the calculation in the step three:
idf_v = \log \frac{K}{\mathrm{num\_type}(b_v)}
wherein idf_v is the IDF value of the v-th book label b_v in the book label total set, and num_type(b_v) is the number of categories whose training books (the sample data input in step three) contain the v-th book label b_v,
in the fourth step, the calculation formula for counting the score of each category of the books to be classified is as follows:
score_k = \sum_{i=1}^{d} idf_v \cdot p(z_i = k, w_i)
wherein score_k is the score of the k-th category for the book to be classified, p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, and v is the serial number of the i-th book label of the book to be classified within the book label total set.
CN201811584226.5A 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model Active CN109726286B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811584226.5A CN109726286B (en) 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811584226.5A CN109726286B (en) 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model

Publications (2)

Publication Number Publication Date
CN109726286A CN109726286A (en) 2019-05-07
CN109726286B true CN109726286B (en) 2020-10-16

Family

ID=66296376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811584226.5A Active CN109726286B (en) 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model

Country Status (1)

Country Link
CN (1) CN109726286B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569270B (en) * 2019-08-15 2022-07-05 中国人民解放军国防科技大学 Bayesian-based LDA topic label calibration method, system and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN103473309A (en) * 2013-09-10 2013-12-25 浙江大学 Text categorization method based on probability word selection and supervision subject model
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
US9342591B2 (en) * 2012-02-14 2016-05-17 International Business Machines Corporation Apparatus for clustering a plurality of documents
CN106326495A (en) * 2016-09-27 2017-01-11 浪潮软件集团有限公司 Topic model based automatic Chinese text classification method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
US9342591B2 (en) * 2012-02-14 2016-05-17 International Business Machines Corporation Apparatus for clustering a plurality of documents
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN103473309A (en) * 2013-09-10 2013-12-25 浙江大学 Text categorization method based on probability word selection and supervision subject model
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
CN106326495A (en) * 2016-09-27 2017-01-11 浪潮软件集团有限公司 Topic model based automatic Chinese text classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Automatic classification of medical literature based on the Labeled LDA topic model; 宫小翠 et al.; 《中华医学图书情报杂志》 (Chinese Journal of Medical Library and Information Science); 2018-10-31; Vol. 27, No. 10; pp. 53-58 *

Also Published As

Publication number Publication date
CN109726286A (en) 2019-05-07

Similar Documents

Publication Publication Date Title
CN110347835B (en) Text clustering method, electronic device and storage medium
CN110033281B (en) Method and device for converting intelligent customer service into manual customer service
CN107025284B (en) Network comment text emotional tendency recognition method and convolutional neural network model
CN108073568B (en) Keyword extraction method and device
CN111414479B (en) Label extraction method based on short text clustering technology
CN109165294B (en) Short text classification method based on Bayesian classification
CN106055673B (en) A kind of Chinese short text sensibility classification method based on text feature insertion
CN111104526A (en) Financial label extraction method and system based on keyword semantics
CN110196908A (en) Data classification method, device, computer installation and storage medium
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN110807086B (en) Text data labeling method and device, storage medium and electronic equipment
CN111914159B (en) Information recommendation method and terminal
CN113626607B (en) Abnormal work order identification method and device, electronic equipment and readable storage medium
CN113434688B (en) Data processing method and device for public opinion classification model training
WO2023065642A1 (en) Corpus screening method, intention recognition model optimization method, device, and storage medium
Van et al. Vietnamese news classification based on BoW with keywords extraction and neural network
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
CN116186268A (en) Multi-document abstract extraction method and system based on Capsule-BiGRU network and event automatic classification
CN113505154B (en) Digital reading statistical analysis method and system based on big data
CN116882414B (en) Automatic comment generation method and related device based on large-scale language model
CN109726286B (en) Automatic book classification method based on LDA topic model
CN113987161A (en) Text sorting method and device
CN111767404A (en) Event mining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: Xinxun Digital Technology (Hangzhou) Co.,Ltd.

Address before: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province

Patentee before: EB Information Technology Ltd.