CN109726286B - Automatic book classification method based on LDA topic model - Google Patents

Automatic book classification method based on LDA topic model

Info

Publication number
CN109726286B
CN109726286B
Authority
CN
China
Prior art keywords
book
label
classified
books
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811584226.5A
Other languages
Chinese (zh)
Other versions
CN109726286A (en)
Inventor
符俊涛
王超芸
李曲
应文佳
马堃
沈钦壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinxun Digital Technology Hangzhou Co ltd
Original Assignee
EB INFORMATION TECHNOLOGY Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by EB INFORMATION TECHNOLOGY Ltd filed Critical EB INFORMATION TECHNOLOGY Ltd
Priority to CN201811584226.5A priority Critical patent/CN109726286B/en
Publication of CN109726286A publication Critical patent/CN109726286A/en
Application granted granted Critical
Publication of CN109726286B publication Critical patent/CN109726286B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automatic book classification method based on an LDA topic model comprises the following steps: establishing a classification system; selecting books of known categories as training books, forming a book label total set from the labels of all the training books, and allocating a unique serial number to each label in the book label total set; constructing and training a multinomial distribution model, wherein the input of the model is the book labels contained in each training book together with that book's category, and the output is the probability of each label in the book label total set under each category; extracting the book labels of a book to be classified to form its label set, then, based on the LDA topic model, using the Gibbs sampling method to sample and assign a category to each book label of the book to be classified, and, after convergence is reached, counting the score of each category for the book to be classified, thereby obtaining the category to which the book belongs. The invention belongs to the technical field of information, and can realize automatic book classification based on an LDA topic model.

Description

Automatic book classification method based on LDA topic model
Technical Field
The invention relates to an automatic book classification method based on an LDA topic model, and belongs to the technical field of information.
Background
Book classification has long been of great significance to both online and offline organizations that hold large numbers of books. For network literature platforms and online bookstores popular with emerging reading groups, accurate book classification is the basis for accurate recommendation of all kinds of books; for libraries and physical bookstores carrying traditional published literature, accurate book classification improves management efficiency and user experience. Because these organizations have many existing books to classify and new books continuously coming onto the shelves, the current manual classification approach suffers from heavy workload, low efficiency, and subjective, inaccurate results, so an efficient and accurate automatic book classification method has become increasingly important.
Current automatic book classification work mainly relies on machine learning algorithms such as naive Bayes, support vector machines and neural networks. However, a book is essentially a collection of texts, and the books to be classified may include both web literature and traditional literature, so the above methods do not achieve good results.
The traditional LDA topic model from NLP (natural language processing) is an unsupervised learning method, and applying it directly would merely cluster the books, which runs contrary to the goal of classification. How to adapt the LDA topic model so that it can be applied to automatic book classification has therefore become a technical problem that practitioners urgently need to solve.
Disclosure of Invention
In view of the above, the present invention provides an automatic book classification method based on an LDA topic model, which can realize automatic book classification based on the LDA topic model.
In order to achieve the above object, the present invention provides an automatic book classification method based on an LDA topic model, comprising:
step one, establishing a classification system comprising K categories;
step two, selecting books of known categories as training books, extracting book labels from each training book, forming a book label total set from the book labels of all the training books, and allocating a unique serial number to each book label in the book label total set;
step three, constructing and training a multinomial distribution model by taking the training books as samples, wherein the input of the multinomial distribution model is all the book labels contained in each training book together with the category to which that training book belongs, and the output is the probability of each book label in the book label total set under each category;
step four, extracting those book labels of the book to be classified that appear in the book label total set, forming a label set W = (w_1, w_2, …, w_d) of the book to be classified, wherein d is the number of book labels contained in the book to be classified and w_1, w_2, …, w_d are those book labels; then, based on the LDA topic model, using the Gibbs sampling method to sample and assign a category to each book label of the book to be classified according to the probability of each book label in the book label total set under the different categories; when convergence is reached, calculating the probability distribution over categories for each book label of the book to be classified and counting the score of each category for the book to be classified, thereby obtaining the category to which the book to be classified belongs,
the fourth step further comprises:
step 41, randomly initializing a category for each book label in the book to be classified, and initializing i to 1;
step 42, extracting the ith book label from the label set W of the book to be classified;
step 43, calculating the probability distribution of the different categories to which the extracted i-th book label belongs:
p(z_i = k, w_i) = p_{kv} \cdot \frac{n_{k(-i)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{k'(-i)} + \alpha_{k'} \right)}
wherein p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, k = 1, 2, …, K, z_i is the category of w_i, v is the serial number of the i-th book label of the book to be classified within the book label total set, p_kv is the probability of the v-th book label of the book label total set under the k-th category, whose value is obtained by the calculation of step three, n_k(-i) and n_k'(-i) are the numbers of labels currently assigned to the k-th and k'-th categories among all book labels in the label set W of the book to be classified, excluding the i-th book label, and α_k, α_k' are the adjustment parameters of the k-th and k'-th categories;
step 44, randomly sampling to obtain a category according to the probability distribution of different categories to which the ith book label belongs, and updating the category of the ith book label into the category obtained after sampling;
step 45, updating i to i + 1 and judging whether the updated i is greater than d; if so, all book labels in the label set W have been updated, so continue to the next step; if not, go to step 42;
step 46, judging whether the agreement between the categories of the book labels in W after the current update and their categories after the previous update reaches a convergence threshold; if so, convergence is reached; if not, resetting i to 1 and returning to step 42 to perform the next round of category updates for each book label in W.
Compared with the prior art, the invention has the following beneficial effects: the traditional unsupervised LDA topic model is transformed into a supervised LDA topic model algorithm; the probability of each book label in the book label total set under each category is obtained by training on the labels of books whose categories are known, and the score of each category for a book to be classified is computed with the Gibbs sampling method, thereby realizing automatic classification of books. When computing the score of each category for the book to be classified, the invention does not directly count the number of sample labels assigned to each category as in the LDA model, but sums the probabilities of each category using the probability distributions, and additionally adjusts the weight of each book label with IDF, so that the book category is identified more accurately and the error caused by a single sampling pass is reduced.
Drawings
FIG. 1 is a flow chart of an automatic book classification method based on LDA topic model according to the present invention.
Fig. 2 is a flowchart showing the detailed steps of step four in fig. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides an automatic book classification method based on LDA topic model, which comprises:
step one, establishing a classification system comprising K categories;
step two, selecting books of known categories as training books, extracting book labels from each training book, forming a book label total set from the book labels of all the training books, and allocating a unique serial number to each book label in the book label total set;
step three, constructing and training a multinomial distribution model by taking the training books as samples, wherein the input of the multinomial distribution model is all the book labels contained in each training book together with the category to which that training book belongs, and the output is the probability of each book label in the book label total set under each category;
the types of all book labels contained in each training book are consistent with the type of the training book, and the probability sum of all book labels in the book label total set under each type is 1;
step four, extracting those book labels of the book to be classified that appear in the book label total set, forming a label set W = (w_1, w_2, …, w_d) of the book to be classified, wherein d is the number of book labels contained in the book to be classified and w_1, w_2, …, w_d are those book labels; then, based on the LDA topic model, using the Gibbs sampling method to sample and assign a category to each book label of the book to be classified according to the probability of each book label in the book label total set under the different categories; when convergence is reached, calculating the probability distribution over categories for each book label of the book to be classified and counting the score of each category for the book to be classified, thereby obtaining the category to which the book to be classified belongs.
For a plurality of books to be classified, the category of each one can be obtained in turn simply by repeating step four.
In step two, word segmentation and part-of-speech tagging can be performed on the body chapters of each training book using NLP techniques, and the valid nouns are extracted to serve as book labels.
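For illustration only, a minimal Python sketch of such label extraction is given below; it assumes the jieba library for Chinese word segmentation and part-of-speech tagging, and the stop-word filtering and minimum-length threshold are illustrative choices rather than part of the claimed method.

```python
# Sketch: extract noun book labels from a chapter of text, assuming jieba.
import jieba.posseg as pseg

def extract_book_labels(chapter_text, stopwords=frozenset(), min_len=2):
    """Segment the chapter text, keep nouns (POS tags starting with 'n'),
    and return the de-duplicated nouns as the book's label set."""
    labels = set()
    for token in pseg.cut(chapter_text):
        if (token.flag.startswith("n")
                and len(token.word) >= min_len
                and token.word not in stopwords):
            labels.add(token.word)
    return labels
```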
In step three, according to probability statistics, a preferred formula for the probability of each book label in the book label total set under each category output by the multinomial distribution model is: the probability p_kv of the v-th book label in the book label total set under the k-th category is the ratio of the number of occurrences of the v-th book label in all training books belonging to the k-th category to the total number of book labels in all training books belonging to the k-th category.
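A minimal sketch of this counting rule follows, assuming the training data are supplied as (category, label list) pairs; the function and variable names are illustrative only.

```python
# Sketch of step three: p[k][v] = count of label v in category k
#                                 / total label count in category k.
from collections import Counter, defaultdict

def train_label_probabilities(training_books):
    """training_books: iterable of (category, [label, ...]) pairs."""
    counts = defaultdict(Counter)
    for category, labels in training_books:
        counts[category].update(labels)
    p = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        p[category] = {label: n / total for label, n in counter.items()}
    return p
```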
As shown in fig. 2, the step four may further include:
step 41, randomly initializing a category for each book label in the book to be classified, and initializing i to 1;
step 42, extracting the ith book label from the label set W of the book to be classified;
step 43, calculating the probability distribution of the different categories to which the extracted i-th book label belongs:
p(z_i = k, w_i) = p_{kv} \cdot \frac{n_{k(-i)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{k'(-i)} + \alpha_{k'} \right)}
wherein p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, k = 1, 2, …, K, z_i is the category of w_i, v is the serial number of the i-th book label of the book to be classified within the book label total set, p_kv is the probability of the v-th book label of the book label total set under the k-th category, whose value is obtained by the calculation of step three, n_k(-i) and n_k'(-i) are the numbers of labels currently assigned to the k-th and k'-th categories among all book labels in the label set W of the book to be classified, excluding the i-th book label, and α_k, α_k' are the adjustment parameters of the k-th and k'-th categories, whose values can be set according to actual service needs, for example both set to 1;
step 44, randomly sampling to obtain a category according to the probability distribution of different categories to which the ith book label belongs, and updating the category of the ith book label into the category obtained after sampling;
step 45, updating i to i + 1 and judging whether the updated i is greater than d; if so, all book labels in the label set W have been updated, so continue to the next step; if not, go to step 42;
step 46, judging whether the agreement between the categories of the book labels in W after the current update and their categories after the previous update reaches a convergence threshold; if so, convergence is reached and the next step continues; if not, resetting i to 1 and returning to step 42 to perform the next round of category updates for each book label in W;
step 47, calculating the probability of different categories of each book label in the label set of the book to be classified:
p(z_i = k, w_i) = p_{kv} \cdot \frac{n_{k(-i)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{k'(-i)} + \alpha_{k'} \right)}
and 48, calculating the score of each category of the books to be classified according to the probability of different categories of each book label in the label set of the books to be classified, and then selecting the maximum score value from the categories, wherein the category corresponding to the maximum score value is the category of the books to be classified.
It is worth mentioning that the invention does not directly count the number of sample labels assigned to each category as in the LDA model, but sums the probabilities of each category using the probability distributions, so that the book category is identified more accurately and the error caused by a single sampling pass is reduced. Meanwhile, because some book labels are very common and appear in almost every category, their discriminative power is small; therefore, the invention further uses IDF as a weight to adjust the score of each category for the book to be classified, and accordingly also comprises:
calculating the IDF of each book label in the book label total set according to the probability of each book label in the book label total set under different categories, which is obtained by the calculation in the step three:
idf_v = \log \frac{K}{\mathrm{num\_type}(b_v)}
wherein idf_v is the IDF value of the v-th book label b_v in the book label total set, and num_type(b_v) is the number of categories whose training books (the sample data input in step three) contain the v-th book label b_v,
Accordingly, in step four, the formula for counting the score of each category to which the book to be classified belongs is:
score_k = \sum_{i=1}^{d} idf_v \cdot p(z_i = k, w_i)
wherein score_k is the score of the k-th category for the book to be classified, p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, and v is the serial number of the i-th book label of the book to be classified within the book label total set.
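A short sketch of this IDF weighting follows, under the assumption (consistent with the reconstruction above) that idf_v = log(K / num_type(b_v)), where num_type counts the categories whose training books contain the label; the helper names are illustrative.

```python
import math

def compute_idf(p, num_categories):
    """p: dict mapping category -> {label: probability} from step three."""
    label_category_count = {}
    for category_probs in p.values():
        for label in category_probs:
            label_category_count[label] = label_category_count.get(label, 0) + 1
    # Labels that appear in every category receive an IDF of 0 (no discriminative power).
    return {label: math.log(num_categories / n)
            for label, n in label_category_count.items()}

def idf_weighted_scores(per_label_probs, idf):
    """per_label_probs: list of (label, {category: p(z_i = k, w_i)}) pairs.
    Returns the IDF-weighted score of each category."""
    scores = {}
    for label, probs in per_label_probs:
        weight = idf.get(label, 0.0)
        for category, prob in probs.items():
            scores[category] = scores.get(category, 0.0) + weight * prob
    return scores
```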
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. An automatic book classification method based on an LDA topic model is characterized by comprising the following steps:
step one, establishing a classification system comprising K categories;
step two, selecting books of known categories as training books, extracting book labels from each training book, forming a book label total set from the book labels of all the training books, and allocating a unique serial number to each book label in the book label total set;
step three, constructing and training a multinomial distribution model by taking the training books as samples, wherein the input of the multinomial distribution model is all the book labels contained in each training book together with the category to which that training book belongs, and the output is the probability of each book label in the book label total set under each category;
step four, extracting those book labels of the book to be classified that appear in the book label total set, forming a label set W = (w_1, w_2, …, w_d) of the book to be classified, wherein d is the number of book labels contained in the book to be classified and w_1, w_2, …, w_d are those book labels; then, based on the LDA topic model, using the Gibbs sampling method to sample and assign a category to each book label of the book to be classified according to the probability of each book label in the book label total set under the different categories; when convergence is reached, calculating the probability distribution over categories for each book label of the book to be classified and counting the score of each category for the book to be classified, thereby obtaining the category to which the book to be classified belongs according to the score,
the fourth step further comprises:
step 41, randomly initializing a category for each book label in the book to be classified, and initializing i to 1;
step 42, extracting the ith book label from the label set W of the book to be classified;
step 43, calculating the probability distribution of the different categories to which the extracted i-th book label belongs:
p(z_i = k, w_i) = p_{kv} \cdot \frac{n_{k(-i)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{k'(-i)} + \alpha_{k'} \right)}
wherein p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, k = 1, 2, …, K, z_i is the category of w_i, v is the serial number of the i-th book label of the book to be classified within the book label total set, p_kv is the probability of the v-th book label of the book label total set under the k-th category, whose value is obtained by the calculation of step three, n_k(-i) and n_k'(-i) are the numbers of labels currently assigned to the k-th and k'-th categories among all book labels in the label set W of the book to be classified, excluding the i-th book label, and α_k, α_k' are the adjustment parameters of the k-th and k'-th categories;
step 44, randomly sampling to obtain a category according to the probability distribution of different categories to which the ith book label belongs, and updating the category of the ith book label into the category obtained after sampling;
step 45, updating i to i + 1 and judging whether the updated i is greater than d; if so, all book labels in the label set W have been updated, so continue to the next step; if not, go to step 42;
step 46, judging whether the agreement between the categories of the book labels in W after the current update and their categories after the previous update reaches a convergence threshold; if so, convergence is reached; if not, resetting i to 1 and returning to step 42 to perform the next round of category updates for each book label in W.
2. The method as claimed in claim 1, wherein in step two, word segmentation and part-of-speech tagging are performed on the body chapters of each training book using NLP techniques, and the valid nouns are extracted as book labels.
3. The method of claim 1, wherein in step three, the formula for calculating the probability of each book label in the book label total set under each category output by the multinomial distribution model is: the probability p_kv of the v-th book label in the book label total set under the k-th category is the ratio of the number of occurrences of the v-th book label in all training books belonging to the k-th category to the total number of book labels in all training books belonging to the k-th category.
4. The method as claimed in claim 1, wherein in step four, calculating the probability distribution over categories for each book label of the book to be classified after convergence is reached, and counting the score of each category for the book to be classified so as to obtain the category to which the book to be classified belongs according to the score, further comprises:
step A1, calculating the probability of different categories of each book label in the label set of the book to be classified:
p(z_i = k, w_i) = p_{kv} \cdot \frac{n_{k(-i)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{k'(-i)} + \alpha_{k'} \right)}
wherein p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, k = 1, 2, …, K, z_i is the category of w_i, v is the serial number of the i-th book label of the book to be classified within the book label total set, p_kv is the probability of the v-th book label of the book label total set under the k-th category, whose value is obtained by the calculation of step three, n_k(-i) and n_k'(-i) are the numbers of labels currently assigned to the k-th and k'-th categories among all book labels in the label set W of the book to be classified, excluding the i-th book label, and α_k, α_k' are the adjustment parameters of the k-th and k'-th categories;
step A2, calculating the score of each category for the book to be classified according to the probabilities of the categories of every book label in the label set of the book to be classified, and then selecting the maximum score, wherein the category corresponding to the maximum score is the category to which the book to be classified belongs.
5. The method of claim 1, further comprising:
calculating the IDF of each book label in the book label total set according to the probability of each book label in the book label total set under different categories, which is obtained by the calculation in the step three:
idf_v = \log \frac{K}{\mathrm{num\_type}(b_v)}
wherein idf_v is the IDF value of the v-th book label b_v in the book label total set, and num_type(b_v) is the number of categories whose training books (the sample data input in step three) contain the v-th book label b_v,
in the fourth step, the calculation formula for counting the score of each category of the books to be classified is as follows:
score_k = \sum_{i=1}^{d} idf_v \cdot p(z_i = k, w_i)
wherein score_k is the score of the k-th category for the book to be classified, p(z_i = k, w_i) is the probability that the i-th book label w_i belongs to the k-th category in the classification system, and v is the serial number of the i-th book label of the book to be classified within the book label total set.
CN201811584226.5A 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model Active CN109726286B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811584226.5A CN109726286B (en) 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811584226.5A CN109726286B (en) 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model

Publications (2)

Publication Number Publication Date
CN109726286A CN109726286A (en) 2019-05-07
CN109726286B true CN109726286B (en) 2020-10-16

Family

ID=66296376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811584226.5A Active CN109726286B (en) 2018-12-24 2018-12-24 Automatic book classification method based on LDA topic model

Country Status (1)

Country Link
CN (1) CN109726286B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569270B (en) * 2019-08-15 2022-07-05 中国人民解放军国防科技大学 Bayesian-based LDA topic label calibration method, system and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN103473309A (en) * 2013-09-10 2013-12-25 浙江大学 Text categorization method based on probability word selection and supervision subject model
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
US9342591B2 (en) * 2012-02-14 2016-05-17 International Business Machines Corporation Apparatus for clustering a plurality of documents
CN106326495A (en) * 2016-09-27 2017-01-11 浪潮软件集团有限公司 Topic model based automatic Chinese text classification method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
US9342591B2 (en) * 2012-02-14 2016-05-17 International Business Machines Corporation Apparatus for clustering a plurality of documents
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN103473309A (en) * 2013-09-10 2013-12-25 浙江大学 Text categorization method based on probability word selection and supervision subject model
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
CN106326495A (en) * 2016-09-27 2017-01-11 浪潮软件集团有限公司 Topic model based automatic Chinese text classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Automatic classification of medical literature based on the Labeled LDA topic model; 宫小翠 et al.; 《中华医学图书情报杂志》 (Chinese Journal of Medical Library and Information Science); 2018-10-31; Vol. 27, No. 10; pp. 53-58 *

Also Published As

Publication number Publication date
CN109726286A (en) 2019-05-07

Similar Documents

Publication Publication Date Title
CN110347835B (en) Text clustering method, electronic device and storage medium
CN110033281B (en) Method and device for converting intelligent customer service into manual customer service
CN107025284B (en) Network comment text emotional tendency recognition method and convolutional neural network model
CN108073568B (en) Keyword extraction method and device
CN111414479B (en) Label extraction method based on short text clustering technology
CN109165294B (en) Short text classification method based on Bayesian classification
CN106055673B (en) A kind of Chinese short text sensibility classification method based on text feature insertion
CN111104526A (en) Financial label extraction method and system based on keyword semantics
CN110196908A (en) Data classification method, device, computer installation and storage medium
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN110807086B (en) Text data labeling method and device, storage medium and electronic equipment
CN111914159B (en) Information recommendation method and terminal
CN113626607B (en) Abnormal work order identification method and device, electronic equipment and readable storage medium
CN113434688B (en) Data processing method and device for public opinion classification model training
WO2023065642A1 (en) Corpus screening method, intention recognition model optimization method, device, and storage medium
Van et al. Vietnamese news classification based on BoW with keywords extraction and neural network
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
CN116186268A (en) Multi-document abstract extraction method and system based on Capsule-BiGRU network and event automatic classification
CN113505154B (en) Digital reading statistical analysis method and system based on big data
CN116882414B (en) Automatic comment generation method and related device based on large-scale language model
CN109726286B (en) Automatic book classification method based on LDA topic model
CN113987161A (en) Text sorting method and device
CN111767404A (en) Event mining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: Xinxun Digital Technology (Hangzhou) Co.,Ltd.

Address before: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province

Patentee before: EB Information Technology Ltd.