CN109726286B: Automatic book classification method based on LDA topic model
 Publication number
 CN109726286B (application CN201811584226.5A)
 Authority
 CN
 China
 Legal status
 Active
Abstract
An automatic book classification method based on an LDA topic model comprises the following steps: establishing a classification system; selecting books of known categories as training books, forming a book label total set from the labels of all the training books, and allocating a unique serial number to each label in the book label total set; constructing and training a multinomial distribution model, wherein the input of the model is the book labels contained in the training books and the categories of those books, and the output is the probability of each label in the book label total set under each category; selecting book labels from a book to be classified to form its label set, then, based on the LDA topic model, using Gibbs sampling to sample and assign a category to each book label in that set, and, after convergence is reached, computing the score of each category for the book to be classified and thereby obtaining its category. The invention belongs to the technical field of information and realizes automatic book classification based on an LDA topic model.
Description
Technical Field
The invention relates to an automatic book classification method based on an LDA topic model, and belongs to the technical field of information.
Background
Book classification has long been important to organizations, online and offline, that hold large numbers of books. For online literature platforms and online bookstores, which are popular with new reading audiences, accurate classification is the basis for accurate book recommendation; for libraries and physical bookstores carrying traditionally published literature, accurate classification improves management efficiency and user experience. Because these organizations have many old books to classify and new books continuously arriving on the shelves, manual classification suffers from heavy workload, low efficiency, subjectivity, and inaccuracy, so an efficient and accurate automatic book classification method is increasingly important.
Current automatic book classification algorithms mainly use machine learning algorithms such as naive Bayes, support vector machines, and neural networks. Because a book is essentially a collection of texts, and the books to be classified can include both web literature and traditional literature, these methods do not achieve good results.
The traditional LDA topic model from NLP (natural language processing) learns without supervision, so applying it directly amounts to clustering books, which runs contrary to the goal of classifying them. How to adapt the LDA topic model and apply it to automatic book classification is therefore a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of the above, the present invention provides an automatic book classification method based on an LDA topic model, which can realize automatic book classification based on the LDA topic model.
In order to achieve the above object, the present invention provides an automatic book classification method based on an LDA topic model, comprising:
step one, establishing a classification system comprising K categories;
step two, selecting books of known categories as training books, extracting book labels from each training book, forming a book label total set from the book labels of all the training books, and allocating a unique serial number to each book label in the book label total set;
step three, constructing and training a multinomial distribution model using the training books as samples, wherein the input of the model is all the book labels contained in each training book together with the category to which that training book belongs, and the output is the probability of each book label in the book label total set under each category;
step four, selecting, from a book to be classified, the book labels that are in the book label total set, and forming the label set W = (w_{1}, w_{2}, …, w_{d}) of the book to be classified, wherein d is the number of book labels contained in the book to be classified and w_{1}, w_{2}, …, w_{d} are those book labels; then, based on the LDA topic model and according to the probability of each book label in the book label total set under each category, using Gibbs sampling to sample and assign a category to each book label of the book to be classified; when convergence is reached, calculating the probability distribution over categories for each book label of the book to be classified, computing the score of each category for the book, and obtaining the category of the book accordingly,
wherein step four further comprises:
step 43, calculating the probability distribution over categories for the extracted i-th book label: [formula not reproduced] wherein p(z_{i} = k, w_{i}) is the probability that the i-th book label w_{i} belongs to the k-th category of the classification system, k = 1, 2, …, K; z_{i} is the category of w_{i}; v is the serial number, in the book label total set, of the i-th book label of the book to be classified; p_{kv} is the probability of the v-th book label in the book label total set under the k-th category, obtained in step three; n_{k(i)} and n_{k'(i)} are the numbers of book labels assigned to the k-th and k'-th categories, respectively, among all book labels in the label set W of the book to be classified excluding the i-th book label; and α_{k} and α_{k'} are the adjustment parameters of the k-th and k'-th categories.
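The symbols defined for step 43 match the usual Gibbs update for an LDA model whose label-per-category probabilities are held fixed. One form consistent with those definitions is written out below. It is offered as an assumption, not as the patent's exact formula, and the normalization over k' in particular is a guess based on the phrase "probability distribution":

```latex
p(z_i = k, w_i) \;=\;
  \frac{p_{kv}\,\bigl(n_{k(i)} + \alpha_k\bigr)}
       {\sum_{k'=1}^{K} p_{k'v}\,\bigl(n_{k'(i)} + \alpha_{k'}\bigr)},
  \qquad k = 1, 2, \dots, K
```

Under this reading, a category is favored for label w_{i} when the label is frequent in that category during training (large p_{kv}) and when many of the book's other labels are currently assigned to it (large n_{k(i)}).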
Compared with the prior art, the invention has the following beneficial effects: the traditional unsupervised LDA topic model is transformed into a supervised algorithm; the probability of each book label in the book label total set under each category is obtained by training on the labels of books of known categories, and the score of each category for a book to be classified is computed using Gibbs sampling, thereby realizing automatic book classification. When computing those scores, the invention does not simply count how many sampled labels fall into each category, as a direct use of the LDA model would; instead it sums, for each category, the probabilities from the per-label probability distributions, and additionally weights each book label by its IDF, which identifies the book category more accurately and reduces the error caused by a single sampling.
Drawings
FIG. 1 is a flow chart of an automatic book classification method based on LDA topic model according to the present invention.
FIG. 2 is a flowchart showing the detailed steps of step four in FIG. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides an automatic book classification method based on LDA topic model, which comprises:
step one, establishing a classification system comprising K categories;
step two, selecting books of known categories as training books, extracting book labels from each training book, forming a book label total set from the book labels of all the training books, and allocating a unique serial number to each book label in the book label total set;
step three, constructing and training a multinomial distribution model using the training books as samples, wherein the input of the model is all the book labels contained in each training book together with the category to which that training book belongs, and the output is the probability of each book label in the book label total set under each category;
wherein the category of every book label contained in a training book is taken to be the category of that training book, and the probabilities of all book labels in the book label total set under each category sum to 1;
step four, selecting, from a book to be classified, the book labels that are in the book label total set, and forming the label set W = (w_{1}, w_{2}, …, w_{d}) of the book to be classified, wherein d is the number of book labels contained in the book to be classified and w_{1}, w_{2}, …, w_{d} are those book labels; then, based on the LDA topic model and according to the probability of each book label in the book label total set under each category, using Gibbs sampling to sample and assign a category to each book label of the book to be classified; after convergence is reached, calculating the probability distribution over categories for each book label of the book to be classified, and computing the score of each category for the book, thereby obtaining the category of the book.
For multiple books to be classified, the category of each can be obtained in turn simply by repeating step four.
In step two, NLP techniques can be applied to perform word segmentation and part-of-speech tagging on selected chapters of the text of each training book, and effective nouns are extracted as book labels.
In step three, based on probabilistic statistics, the probability of each book label in the book label total set under each category, as output by the multinomial distribution model, is preferably calculated as follows: the probability p_{kv} of the v-th book label in the book label total set under the k-th category is the ratio of the number of occurrences of the v-th book label in all books belonging to the k-th category to the number of all book labels in all books belonging to the k-th category.
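The ratio described for step three can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `train_label_probabilities`, the 0-based category indices, and the toy training data are all assumptions made for the example.

```python
from collections import Counter


def train_label_probabilities(training_books, num_categories):
    """Estimate p[k][v], the share of label v among all labels of category k.

    training_books: list of (labels, category) pairs (hypothetical format),
    where labels is a list of strings and category is in range(num_categories).
    Returns (vocab, p); vocab maps each label to its serial number v.
    """
    # Step two: give every label seen in training a unique serial number.
    vocab = {}
    for labels, _ in training_books:
        for label in labels:
            vocab.setdefault(label, len(vocab))

    # Step three: every label inherits the category of its training book.
    counts = [Counter() for _ in range(num_categories)]
    for labels, category in training_books:
        counts[category].update(vocab[label] for label in labels)

    # p[k][v] = occurrences of label v in category k / all labels in category k.
    p = []
    for k in range(num_categories):
        total = sum(counts[k].values())
        p.append([counts[k][v] / total if total else 0.0
                  for v in range(len(vocab))])
    return vocab, p


# Toy data: two books in category 0, one book in category 1.
books = [
    (["dragon", "sword", "magic"], 0),
    (["magic", "dragon"], 0),
    (["stock", "market", "magic"], 1),
]
vocab, p = train_label_probabilities(books, num_categories=2)
```

With this toy data, "dragon" accounts for 2 of the 5 labels in category 0, and the probabilities within each category sum to 1, as the description requires.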
As shown in FIG. 2, step four may further include:
step 43, calculating the probability distribution over categories for the extracted i-th book label: [formula not reproduced] wherein p(z_{i} = k, w_{i}) is the probability that the i-th book label w_{i} belongs to the k-th category of the classification system, k = 1, 2, …, K; z_{i} is the category of w_{i}; v is the serial number, in the book label total set, of the i-th book label of the book to be classified; p_{kv} is the probability of the v-th book label in the book label total set under the k-th category, obtained in step three; n_{k(i)} and n_{k'(i)} are the numbers of book labels assigned to the k-th and k'-th categories, respectively, among all book labels in the label set W of the book to be classified excluding the i-th book label; and α_{k} and α_{k'} are the adjustment parameters of the k-th and k'-th categories, whose values can be set according to actual service needs, for example both set to 1;
step 48, calculating the score of each category for the book to be classified from the per-category probabilities of each book label in its label set, and then selecting the maximum score, the category corresponding to the maximum score being the category of the book to be classified.
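The sampling loop of step four can be sketched as follows. This is a hedged illustration under several assumptions: the per-label distribution is taken to be proportional to p_{kv}(n_{k(i)} + α_{k}) and normalized over the categories, convergence is measured as the fraction of labels whose category did not change between two consecutive sweeps, and the names `gibbs_classify` and `category_scores` and the toy probabilities are invented for the example.

```python
import random


def gibbs_classify(label_ids, p, alpha=1.0, max_sweeps=200, threshold=1.0, seed=0):
    """Assign a category to each label by Gibbs sampling.

    label_ids: serial numbers v of the labels of the book to classify.
    p: p[k][v], label probabilities per category from training.
    Returns (z, dists): final assignments and per-label distributions.
    """
    rng = random.Random(seed)
    K, d = len(p), len(label_ids)
    if d == 0:
        return [], []

    def distribution(i, z):
        # n[k]: labels currently assigned to category k, excluding label i.
        n = [0] * K
        for j, k in enumerate(z):
            if j != i:
                n[k] += 1
        weights = [p[k][label_ids[i]] * (n[k] + alpha) for k in range(K)]
        total = sum(weights)
        if total == 0:  # label unseen in every category: fall back to uniform
            return [1.0 / K] * K
        return [w / total for w in weights]

    z = [rng.randrange(K) for _ in range(d)]  # random initial categories
    for _ in range(max_sweeps):
        previous = list(z)
        for i in range(d):  # visit each label in turn
            probs = distribution(i, z)
            z[i] = rng.choices(range(K), weights=probs)[0]  # resample category
        agreement = sum(a == b for a, b in zip(z, previous)) / d
        if agreement >= threshold:  # assumed convergence criterion
            break
    return z, [distribution(i, z) for i in range(d)]


def category_scores(dists, num_categories):
    """Score of a category = sum of the per-label probabilities for it."""
    return [sum(dist[k] for dist in dists) for k in range(num_categories)]


# Toy probabilities; label order: dragon, sword, magic, stock, market.
p = [[0.4, 0.2, 0.4, 0.0, 0.0],
     [0.0, 0.0, 1 / 3, 1 / 3, 1 / 3]]
label_ids = [0, 2, 1]  # a book labelled dragon, magic, sword
z, dists = gibbs_classify(label_ids, p)
scores = category_scores(dists, num_categories=2)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

Summing probabilities rather than counting the sampled assignments means that a label such as "magic", which is plausible in both categories, contributes fractionally to each score instead of casting a single all-or-nothing vote.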
It is worth noting that the invention does not simply count, as in the LDA model, how many sampled labels fall into each category; instead it uses the probability distributions to compute, for each category, the sum of the probabilities that the labels of the book to be classified belong to it, which identifies the book category more accurately and reduces the error caused by a single sampling. In addition, some book labels are common and appear in almost all categories, so they discriminate poorly between categories. The invention therefore further uses IDF as a weight to adjust the score of each category, and further comprises:
calculating the IDF of each book label in the book label total set from the probabilities of each book label under each category obtained in step three: [formula not reproduced] wherein idf_{v} is the IDF value of the v-th book label b_{v} in the book label total set, and numtype(b_{v}) is the number of categories that, after the sample data are input in step three, contain the v-th book label b_{v};
accordingly, in step four, the score of each category for the book to be classified is calculated by the formula: [formula not reproduced] wherein score_{k} is the score of the k-th category for the book to be classified, p(z_{i} = k, w_{i}) is the probability that the i-th book label w_{i} belongs to the k-th category of the classification system, and v is the serial number, in the book label total set, of the i-th book label of the book to be classified.
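The IDF weighting can be illustrated with a short Python sketch. Both formulas used here are assumptions modeled on the conventional IDF definition and on the symbols the text defines: idf_{v} = log(K / numtype(b_{v})) and score_{k} = Σ_{i} p(z_{i} = k, w_{i}) · idf_{v}. The function names and the example distributions are hypothetical.

```python
import math


def idf_weights(p):
    """Assumed form: idf[v] = log(K / numtype(b_v)).

    numtype(b_v) is the number of categories in which label v has non-zero
    probability after training; p[k][v] comes from step three.
    """
    K, V = len(p), len(p[0])
    idf = []
    for v in range(V):
        numtype = sum(1 for k in range(K) if p[k][v] > 0)
        idf.append(math.log(K / numtype) if numtype else 0.0)
    return idf


def weighted_scores(label_ids, dists, idf, num_categories):
    """Assumed form: score[k] = sum over labels i of dists[i][k] * idf[v_i]."""
    return [sum(dists[i][k] * idf[v] for i, v in enumerate(label_ids))
            for k in range(num_categories)]


# Toy probabilities; label order: dragon, sword, magic, stock, market.
p = [[0.4, 0.2, 0.4, 0.0, 0.0],
     [0.0, 0.0, 1 / 3, 1 / 3, 1 / 3]]
idf = idf_weights(p)

label_ids = [0, 2, 1]                             # dragon, magic, sword
dists = [[1.0, 0.0], [0.75, 0.25], [1.0, 0.0]]    # example per-label distributions
scores = weighted_scores(label_ids, dists, idf, num_categories=2)
```

Under this assumed form, a label that occurs in every category (here "magic") receives weight log(1) = 0 and drops out of the score entirely. A smoothed variant would keep a small contribution from such labels, and the text does not say which behavior is intended.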
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (5)
1. An automatic book classification method based on an LDA topic model is characterized by comprising the following steps:
step one, establishing a classification system comprising K categories;
step two, selecting books of known categories as training books, extracting book labels from each training book, forming a book label total set from the book labels of all the training books, and allocating a unique serial number to each book label in the book label total set;
step three, constructing and training a multinomial distribution model using the training books as samples, wherein the input of the model is all the book labels contained in each training book together with the category to which that training book belongs, and the output is the probability of each book label in the book label total set under each category;
step four, selecting, from a book to be classified, the book labels that are in the book label total set, and forming the label set W = (w_{1}, w_{2}, …, w_{d}) of the book to be classified, wherein d is the number of book labels contained in the book to be classified and w_{1}, w_{2}, …, w_{d} are those book labels; then, based on the LDA topic model and according to the probability of each book label in the book label total set under each category, using Gibbs sampling to sample and assign a category to each book label of the book to be classified; after convergence is reached, calculating the probability distribution over categories for each book label of the book to be classified, computing the score of each category for the book, and obtaining the category of the book according to the scores,
wherein step four further comprises:
step 41, randomly initializing a category for each book label in the book to be classified, and initializing i to 1;
step 42, extracting the ith book label from the label set W of the book to be classified;
step 43, calculating the probability distribution over categories for the extracted i-th book label: [formula not reproduced] wherein p(z_{i} = k, w_{i}) is the probability that the i-th book label w_{i} belongs to the k-th category of the classification system, k = 1, 2, …, K; z_{i} is the category of w_{i}; v is the serial number, in the book label total set, of the i-th book label of the book to be classified; p_{kv} is the probability of the v-th book label in the book label total set under the k-th category, obtained in step three; n_{k(i)} and n_{k'(i)} are the numbers of book labels assigned to the k-th and k'-th categories, respectively, among all book labels in the label set W of the book to be classified excluding the i-th book label; and α_{k} and α_{k'} are the adjustment parameters of the k-th and k'-th categories;
step 44, randomly sampling to obtain a category according to the probability distribution of different categories to which the ith book label belongs, and updating the category of the ith book label into the category obtained after sampling;
step 45, updating i to i +1, judging whether the updated i is larger than d, if so, indicating that all book labels in the label set W are updated, and continuing the next step; if not, go to step 42;
step 46, judging whether the agreement between the categories of the book labels in W after the current round of updating and their categories after the previous round reaches a convergence threshold; if so, convergence is reached; if not, updating i to 1 and going to step 42 to continue with the next round of category updates for each book label in W.
2. The method as claimed in claim 1, wherein in step two, NLP techniques are applied to perform word segmentation and part-of-speech tagging on selected chapters of the text of each training book, and effective nouns are extracted as book labels.
3. The method of claim 1, wherein in step three, the probability of each book label in the book label total set under each category, as output by the multinomial distribution model, is calculated as follows: the probability p_{kv} of the v-th book label in the book label total set under the k-th category is the ratio of the number of occurrences of the v-th book label in all books belonging to the k-th category to the number of all book labels in all books belonging to the k-th category.
4. The method as claimed in claim 1, wherein in step four, after convergence is reached, the probability distribution of different categories to which each book tag of the books to be classified belongs is calculated, and the score of each category to which the books to be classified belong is counted, so as to obtain the category to which the books to be classified belong according to the score, further comprising:
step A1, calculating the probability that each book label in the label set of the book to be classified belongs to each category: [formula not reproduced] wherein p(z_{i} = k, w_{i}) is the probability that the i-th book label w_{i} belongs to the k-th category of the classification system, k = 1, 2, …, K; z_{i} is the category of w_{i}; v is the serial number, in the book label total set, of the i-th book label of the book to be classified; p_{kv} is the probability of the v-th book label in the book label total set under the k-th category, obtained in step three; n_{k(i)} and n_{k'(i)} are the numbers of book labels assigned to the k-th and k'-th categories, respectively, among all book labels in the label set W of the book to be classified excluding the i-th book label; and α_{k} and α_{k'} are the adjustment parameters of the k-th and k'-th categories;
step A2, calculating the score of each category of the books to be classified according to the probability of different categories of each book label in the label set of the books to be classified, and then selecting the maximum score value from the categories, wherein the category corresponding to the maximum score value is the category of the books to be classified.
5. The method of claim 1, further comprising:
calculating the IDF of each book label in the book label total set from the probabilities of each book label under each category obtained in step three: [formula not reproduced] wherein idf_{v} is the IDF value of the v-th book label b_{v} in the book label total set, and numtype(b_{v}) is the number of categories that, after the sample data are input in step three, contain the v-th book label b_{v},
in step four, the score of each category for the book to be classified is calculated by the formula: [formula not reproduced] wherein score_{k} is the score of the k-th category for the book to be classified, p(z_{i} = k, w_{i}) is the probability that the i-th book label w_{i} belongs to the k-th category of the classification system, and v is the serial number, in the book label total set, of the i-th book label of the book to be classified.
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title
CN201811584226.5A (CN109726286B, en)  2018-12-24  2018-12-24  Automatic book classification method based on LDA topic model
Publications (2)
Publication Number  Publication Date
CN109726286A (en)  2019-05-07
CN109726286B (en)  2020-10-16
Family
ID=66296376
Family Applications (1)
Application Number  Title  Priority Date  Filing Date
CN201811584226.5A (Active)  Automatic book classification method based on LDA topic model  2018-12-24  2018-12-24
Country Status (1)
Country  Link 

CN (1)  CN109726286B (en) 
Legal Events
Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant
CP01  Change in the name or title of a patent holder
Address after: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province
Patentee after: Xinxun Digital Technology (Hangzhou) Co.,Ltd.
Address before: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province
Patentee before: EB Information Technology Ltd.