CN108920508A - Textual classification model training method and system based on LDA algorithm - Google Patents


Info

Publication number
CN108920508A
CN108920508A (application CN201810535046.1A)
Authority
CN
China
Prior art keywords
training
text
word
lda
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810535046.1A
Other languages
Chinese (zh)
Inventor
冯广辉
王雷
居燕峰
李福�
周小华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Original Assignee
FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd filed Critical FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Priority to CN201810535046.1A priority Critical patent/CN108920508A/en
Publication of CN108920508A publication Critical patent/CN108920508A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a text classification model training method based on the LDA algorithm. The method includes obtaining unordered text in real time; performing word segmentation on the input unordered text according to the segmentation maintenance information of a configured LDA topic training model, which includes synonym maintenance, stop-word maintenance, and the like; converting the segmented unordered text into vectorized word-frequency vectors; extracting 10% of the word-frequency vectors as the classification input condition; and returning the classification prediction result obtained by Bayesian training. The present invention changes the storage of data from a traditional disk medium to a distributed HDFS-based storage scheme, ensuring the safety of the data and reducing the time needed to load data into memory. Using the Hadoop-based MapReduce distributed computing framework offers better scalability and fault tolerance than a single machine, allows larger sample sizes to be loaded, and saves program running time.

Description

Textual classification model training method and system based on LDA algorithm
Technical field
The present invention relates to the field of software technology, and in particular to a text classification model training method and system based on the LDA algorithm.
Background technique
With the comprehensive arrival of the information age, the Internet occupies an ever more important place in people's lives, and people's dependence on it grows increasingly high. At this stage, across the many applications of the Internet and their forms of interaction, text remains an important medium of presentation. With social progress, the development of the times, and the accumulation of information, more and more text information is stored and preserved; how to correctly analyze and use this historical information has also gradually attracted attention. In the analysis of text information and the mining of data, the classification of text information is of central importance.
In real life, more and more text classification scenarios need to be solved, such as recommending news content of interest to users, rapidly classifying manuscripts written by reporters, and distributing and archiving content obtained by web crawlers. Therefore, in the follow-up data processing systems of news portals, the manuscript entry systems of newspapers, and web crawlers, the demand for text classification is increasingly urgent. For ordinary users, it is even more important to be able to quickly customize the classification categories their scenario needs without having to understand the complicated technical principles and implementation of the classification algorithm behind them.
In tackling the practical problem of text classification, at the beginning of this century Blei, David M., Andrew Ng, and Jordan, Michael I. proposed a topic-model classification method, latent Dirichlet allocation, abbreviated LDA. The advantage of this method is that it is an unsupervised learning algorithm: before performing classification training on text, one does not need to spend a large amount of labor arranging the correspondence between content and each class; instead, the various text files can be used directly as the input condition of the algorithm, and after the number of classes is manually specified, everything else is left to the algorithm itself to handle.
However, precisely because of the convenience of the LDA topic-model algorithm, it often cannot satisfy users' actual needs in real-life scenarios. For example, when classifying the news of a certain news website, a user may wish to divide the news content into categories such as "sports", "finance", "automobile information", and "automobile complaints". But the LDA algorithm can only specify the number of classes, and since the essence of the algorithm is a clustering process, the actual scenario may be that news under "finance" contains financial reports related to automobiles, or that descriptions about economics, such as car purchase and maintenance, appear under the "automobile" channel. Classification according to the LDA algorithm therefore often merges the contents of "automobile information" and "automobile complaints" together, or produces classes that are not sufficiently clear.
Furthermore, existing algorithms implemented according to Bayes' theorem can solve classification demands very well in most cases, but because the algorithm itself takes the assumption of conditional independence as a premise, the independence of events between training data must be improved as much as possible; in Chinese classification, the problem that event independence feeds back into is the synonym problem.
Summary of the invention
It is an object of the present invention to propose a mass text classification model training method based on the LDA algorithm, covering training data acquisition, classification model training, and directed classification prediction for Chinese text, that can classify text simply and accurately, solving the problems that existing text classification technologies have complicated principles, are difficult to implement, perform poorly on Chinese classification, and cannot quickly customize the classification categories needed according to one's own demands.
To achieve the goals above, the technical scheme adopted by the invention is as follows:
A text classification model training method based on the LDA algorithm includes the following steps:
obtaining unordered text as training data;
setting an LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, and training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file;
saving the preliminary classification file, processing it with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS;
extracting test training samples from the word-frequency vectors for training-result feedback, and, based on Bayesian training, using the Hadoop MapReduce training process to cut the test training samples into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process;
whereupon the classification model training is finished.
Wherein, processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered text content into word-frequency vectors and storing them into HDFS.
Wherein, the test training samples are 10% of the word-frequency vectors.
Wherein, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
Wherein, before saving the preliminary classification file to the distributed storage system HDFS, the method further includes:
screening the preliminary classification file, where new classification information is defined in a directed manner and stored to the HDFS system.
The invention additionally discloses a text classification model training system based on the LDA algorithm, including:
an input unit, which obtains unordered text as training data; and
a storage medium for executing the following instructions:
setting an LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, and training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file;
saving the preliminary classification file, processing it with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS;
extracting test training samples from the word-frequency vectors for training-result feedback, and, based on Bayesian training, using the Hadoop MapReduce training process to cut the test training samples into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process.
Wherein, processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered text content into word-frequency vectors and storing them into HDFS.
Wherein, the test training samples are 10% of the word-frequency vectors.
Wherein, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
Wherein, before saving the preliminary classification file to the distributed storage system HDFS, the system further performs:
screening the preliminary classification file, where new classification information is defined in a directed manner and stored to the HDFS system.
The beneficial effects of the present invention are as follows:
Through the preliminary screening of the LDA topic model combined with a manual re-selection function, the present invention uniformly summarizes and cleans small amounts of unordered text content, guaranteeing event independence between attributes and improving the accuracy of text classification prediction.
The storage of data is changed from a traditional disk medium to a distributed HDFS-based storage scheme, ensuring the safety of the data and reducing the time needed to load data into memory. Using the Hadoop-based MapReduce distributed computing framework offers better scalability and fault tolerance than a single machine, allows larger sample sizes to be loaded, and saves program running time.
Detailed description of the invention
Fig. 1 is a schematic diagram of the unordered text collection of the invention;
Fig. 2 is a flowchart of a text classification model training method based on the LDA algorithm of the invention.
Specific embodiment
The present invention will be described in detail below with reference to the specific embodiments shown in the drawings. However, these embodiments do not limit the present invention; structural, methodological, or functional transformations made by those skilled in the art according to these embodiments are all included within the protection scope of the present invention.
One embodiment of the present invention discloses a text classification model training method based on the LDA algorithm which, as shown in Fig. 2, includes the following steps:
S001: obtaining unordered text as training data. In this embodiment, unordered text can be obtained from a designated location, with data sources including those following the FTP protocol and the HTTP protocol. Parsing of user-uploaded files is supported, including the data formats of TXT compressed packages, CSV compressed packages, Excel, and Word compressed packages. HTTP data uploaded by way of a specified link, or processed as a set of links, is also handled; for HTTP data, the breadth and depth of data acquisition can be specified. Text content in HTML is obtained by directed extraction of data in XPath format. For data dynamically loaded in web pages or added through AJAX asynchronous requests, dynamic data acquisition is carried out based on a Selenium solution.
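The directed XPath extraction described above can be sketched as follows. This is a minimal illustration using Python's standard library, whose ElementTree module supports a limited XPath subset; the page fragment, tag names, and class attribute are invented for the example. A real crawler would use a tolerant HTML parser and, as the description notes, Selenium for dynamically loaded content.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; a real crawler would fetch this over HTTP/FTP.
page = """
<html><body>
  <div class="news"><p>LDA topic models cluster documents.</p></div>
  <div class="ads"><p>buy now</p></div>
</body></html>
"""

def extract_text(html: str, xpath: str) -> list:
    """Directed extraction: keep only the text of nodes matched by the XPath."""
    root = ET.fromstring(html.strip())
    return [node.text for node in root.findall(xpath)]

# Only the news content is kept; the ad block is ignored.
news = extract_text(page, ".//div[@class='news']/p")
print(news)  # ['LDA topic models cluster documents.']
```

The XPath expression is what gives the acquisition its "direction": everything outside the matched nodes is discarded before segmentation.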
S002: setting the LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file, and saving it. In this step, the obtained unordered text is segmented; the word segmentation includes business-related stop-word maintenance and business-related synonym maintenance performed beforehand, so that the data is cleaned for subsequent classification. For the LDA topic training model for Chinese word segmentation, the topic-number configuration tnum, the iteration-number configuration inum, and the model input location sloc are set. The main function of the LDA topic training model is to cluster documents that are not necessarily related, finding the topic distribution in each document and the distribution over the words in each document; how the topics are automatically generated and how they are analyzed are both tasks left to the LDA topic training model. In probabilistic terms, each word in an article is produced by first selecting a topic with a certain probability and then selecting a word from the chosen topic; the probabilistic relation between words and documents is as follows:

p(word | document) = Σ_topic p(word | topic) · p(topic | document)
The main function of the LDA topic training model is thus to find tnum topics in the unordered text and establish the corresponding word relations. As shown in Fig. 1, Doc1, Doc2, ..., Docm represent the unordered text collection, where Docm represents the m-th unordered text, and the Wordn of the corresponding row indicates that there are n words in that unordered text. Combined with the tnum and inum parameters configured in the LDA topic training model, the program computes over all the unordered texts on the basis of the tnum topic number; the word probability distribution proceeds in the following way:

p(θ, z, w | α, β) = p(θ | α) · ∏_{n=1..N} p(z_n | θ) · p(w_n | z_n, β)
In the above process, θ denotes a topic vector, each component of which represents the probability with which the corresponding topic occurs in the document; p(θ) denotes the distribution of θ, taken here to be a Dirichlet distribution, i.e. a continuous multivariate probability distribution; N and w_n likewise denote the corresponding quantities (the number of words and the n-th word); z_n denotes the best topic selected after the algorithm finishes running; p(z | θ) denotes the probability distribution of topic z given θ, which is simply the value of θ, i.e. p(z = i | θ) = θ_i; and p(w | z) is analogous.
Marginalizing over θ and z, the corresponding probability formula is as follows:

p(w | α, β) = ∫ p(θ | α) · ∏_{n=1..N} Σ_{z_n} p(z_n | θ) · p(w_n | z_n, β) dθ
The unordered text after word segmentation is preliminarily trained in combination with the configured LDA topic training model, generating the preliminary classification file.
The preliminary classification file is screened, and new classification information is defined in a directed manner and stored. The preliminary classification file generated above is stored by class and can be presented to the user in a visual manner, which is convenient for access and confirmation. Screening the preliminary classification file includes deleting some contents and retaining other contents. The preliminary classification file is then put through a secondary relation association, new classification information is defined in a directed manner according to user demand, and the result is stored to HDFS.
In one embodiment, the preliminary classification file can be binary-serialized and its content compressed before storage. Binary-serializing the text information and saving it after compressing the content can save data storage space, save the time of loading data into memory, and reduce the time of data preprocessing.
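The serialize-then-compress step can be sketched as below. The patent does not name a serialization format, so the use of Python's pickle and gzip modules, and the classification payload itself, are illustrative assumptions; the compressed blob is what would be written to HDFS.

```python
import gzip
import pickle

# Hypothetical preliminary classification file content: topic id -> member texts.
preliminary = {
    "topic_0": ["car maintenance costs rise", "new vehicle purchase tax adjusted"],
    "topic_1": ["central bank lowers interest rates"],
}

# Binary-serialize the text information, then compress before storage.
blob = gzip.compress(pickle.dumps(preliminary))

# Reading reverses the two steps; the round trip is lossless.
restored = pickle.loads(gzip.decompress(blob))
assert restored == preliminary
```

Because the stored object deserializes directly into an in-memory structure, no re-parsing of text is needed at load time, which is the preprocessing-time saving the embodiment claims.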
S003: processing the preliminary classification file with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS. In this embodiment of the invention, applying TF-IDF to the unordered text includes the following steps:
numbering the words after word segmentation and storing the numbered text content, for example key1-China, key2-…, key3-emerge, and so on;
counting the frequency with which each word occurs, for example key1-100, key2-50, key3-30, and so on.
The calculation formula of the word-frequency statistic is:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} denotes the number of times word t_i appears in document d_j, and Σ_k n_{k,j} denotes the total number of occurrences of all words in document d_j.
The inverse document frequency of each word is then obtained, for example key1-0.005, key2-0.004, key3-0.002, and so on.
The calculation formula of the inverse document frequency is:

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where |D| denotes the total number of documents and |{j : t_i ∈ d_j}| denotes the number of documents containing word t_i. For a word not in the corpus, |{j : t_i ∈ d_j}| would be 0 and the formula would be meaningless; in that case 1 + |{j : t_i ∈ d_j}| can be used in its place.
The numbered text content is converted into word-frequency vectors and stored into HDFS; this step carries out the vectorization operation.
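The two formulas above can be realized directly. The sketch below is a plain-Python illustration with invented toy documents (the patent runs this step at scale as part of a Hadoop job); it applies the 1 + |{j : t_i ∈ d_j}| substitution only when a word is absent from the corpus, as the description specifies.

```python
import math
from collections import Counter

# Toy segmented corpus (invented): each document is a list of words.
docs = [
    ["china", "economy", "china"],
    ["economy", "car"],
]

def tf(word: str, doc: list) -> float:
    # tf_{i,j} = n_{i,j} / sum_k n_{k,j}
    return Counter(doc)[word] / len(doc)

def idf(word: str, corpus: list) -> float:
    # idf_i = log(|D| / |{j : t_i in d_j}|), substituting 1 + df when the
    # word never occurs so the formula stays defined.
    df = sum(1 for d in corpus if word in d)
    return math.log(len(corpus) / (df if df > 0 else 1 + df))

def tfidf(word: str, doc: list, corpus: list) -> float:
    return tf(word, doc) * idf(word, corpus)

print(tf("china", docs[0]))   # 2/3: "china" is 2 of the 3 words in doc 0
print(idf("economy", docs))   # log(2/2) = 0: the word appears in every document
print(idf("unseen", docs))    # log(2/1): the substitution avoids division by zero
```

Note that a word occurring in every document gets an IDF of 0, which is exactly the intended effect: such words carry no class-discriminating signal.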
S004: performing training-result feedback. The test training samples for training-result feedback are extracted from the word-frequency vectors; based on Bayesian training, the Hadoop MapReduce training process is used to cut the test training samples into multiple map tasks, and the results of the map tasks are finally collected and arranged in the reduce process. Specifically, in this process, the extracted test training samples are 10% of the word-frequency vectors; the subsequent training is based on Bayes, whose operation logic is as follows:
P(B[i] | A) = P(B[i]) · P(A | B[i]) / { P(B[1]) · P(A | B[1]) + P(B[2]) · P(A | B[2]) + … + P(B[n]) · P(A | B[n]) }
In order to accelerate training and increase the capacity of the training input samples, the implementation of the Bayesian algorithm is revised to an implementation based on Hadoop MapReduce: the training task is cut, according to certain segmentation conditions, into multiple map tasks, and the results of the map tasks are finally collected and arranged in the reduce process. The training data is divided into m parts, with m = total_size / block_size by default, where total_size denotes the size of the whole input file and block_size is the HDFS file block size, which defaults to 64 MB and can be set by the parameter dfs.block.size. This process uses the vectorized word-frequency vectors as the Bayesian input condition to train the classification model, and uses the LDA model generated by training to predict the classification of sample data and compute classification-accuracy statistics; at this point the classification model training is finished.
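The Bayesian operation logic above, the posterior of each class B[i] given evidence A, can be checked with a small numerical sketch. The priors and likelihoods here are invented for illustration; in the patent's pipeline, A would be a word-frequency vector and B[i] one of the tnum classes, and each map task would compute the per-split count statistics that feed these probabilities.

```python
def posterior(priors, likelihoods):
    """P(B[i] | A) = P(B[i]) P(A | B[i]) / sum_k P(B[k]) P(A | B[k])."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)            # the denominator of the formula
    return [j / evidence for j in joint]

# Two hypothetical classes, e.g. "finance" and "automobile".
priors = [0.5, 0.5]        # P(B[1]), P(B[2])
likelihoods = [0.25, 0.75] # P(A | B[1]), P(A | B[2])

post = posterior(priors, likelihoods)
print(post)  # [0.25, 0.75]
```

The predicted class is simply the index of the largest posterior; because the denominator is shared, the map tasks only need to ship the per-class joint counts to the reducer.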
Classification prediction is carried out using the classification model formed by the above embodiment. Data is crawled from the network in real time by a crawler, for example from NetEase's public web pages; according to the word-segmentation maintenance information of the LDA topic training model set in the above embodiment, including synonym maintenance, stop-word maintenance, and the like, the input unordered text is segmented. The segmented unordered text is converted into vectorized word-frequency vectors, 10% of the word-frequency vectors is extracted as the classification input condition, and the classification prediction result obtained by Bayesian training is returned.
One embodiment of the present invention also discloses a text classification model training system based on the LDA algorithm; the system uses all the processes and steps of the text classification model training method based on the LDA algorithm, and includes:
an input unit, which obtains unordered text as training data; and
a storage medium for executing the following instructions:
setting the LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file, and saving it;
processing the preliminary classification file with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS;
extracting test training samples from the word-frequency vectors for training-result feedback, and, based on Bayesian training, using the Hadoop MapReduce training process to cut the test training samples into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process. In yet another embodiment, processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered text content into word-frequency vectors and storing them into HDFS.
In another preferred embodiment, the test training samples are 10% of the word-frequency vectors.
In order to save data storage space, save the time of loading data into memory, and reduce the data-preprocessing time for large amounts of data, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
In another preferred embodiment, before saving the preliminary classification file to the distributed storage system HDFS, the method further includes: screening the preliminary classification file, where new classification information is defined in a directed manner and stored to the HDFS system.
Through the preliminary screening of the LDA topic model combined with a manual re-selection function, the method and system described in the above embodiments uniformly summarize and clean small amounts of unordered text content, guaranteeing event independence between attributes and improving the accuracy of text classification prediction.
The storage of data is changed from a traditional disk medium to a distributed HDFS-based storage scheme, ensuring the safety of the data and reducing the time needed to load data into memory. Using the Hadoop-based MapReduce distributed computing framework offers better scalability and fault tolerance than a single machine, allows larger sample sizes to be loaded, and saves program running time.
It should be appreciated that, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity. Those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.
The series of detailed descriptions listed above are only specific illustrations of feasible embodiments of the invention; they are not intended to limit the protection scope of the invention, and all equivalent embodiments or changes made without departing from the technical spirit of the invention should be included within the protection scope of the invention.

Claims (10)

1. A text classification model training method based on the LDA algorithm, characterized by including the following steps:
obtaining unordered text as training data;
setting an LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file, and saving it;
processing the preliminary classification file with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS;
extracting test training samples from the word-frequency vectors for training-result feedback, and, based on Bayesian training, using the Hadoop MapReduce training process to cut the test training samples into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process;
whereupon the classification model training is finished.
2. The text classification model training method based on the LDA algorithm according to claim 1, characterized in that processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered text content into word-frequency vectors and storing them into HDFS.
3. The text classification model training method based on the LDA algorithm according to claim 1 or 2, characterized in that the test training samples are 10% of the word-frequency vectors.
4. The text classification model training method based on the LDA algorithm according to claim 1 or 2, characterized in that, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
5. The text classification model training method based on the LDA algorithm according to any one of claims 1 to 4, characterized in that, before the TF-IDF processing is performed, the method further includes:
screening the preliminary classification file, where new classification information is defined in a directed manner.
6. A text classification model training system based on the LDA algorithm, characterized by including:
an input unit, which obtains unordered text as training data; and
a storage medium for executing the following instructions:
setting an LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file, and saving it;
processing the preliminary classification file with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS;
extracting test training samples from the word-frequency vectors for training-result feedback, and, based on Bayesian training, using the Hadoop MapReduce training process to cut the test training samples into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process.
7. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered text content into word-frequency vectors and storing them into HDFS.
8. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that the test training samples are 10% of the word-frequency vectors.
9. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
10. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that saving the preliminary classification file to the distributed storage system HDFS includes:
screening the preliminary classification file, where new classification information is defined in a directed manner and stored to the HDFS system.
CN201810535046.1A 2018-05-29 2018-05-29 Textual classification model training method and system based on LDA algorithm Pending CN108920508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810535046.1A CN108920508A (en) 2018-05-29 2018-05-29 Textual classification model training method and system based on LDA algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810535046.1A CN108920508A (en) 2018-05-29 2018-05-29 Textual classification model training method and system based on LDA algorithm

Publications (1)

Publication Number Publication Date
CN108920508A true CN108920508A (en) 2018-11-30

Family

ID=64411043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810535046.1A Pending CN108920508A (en) 2018-05-29 2018-05-29 Textual classification model training method and system based on LDA algorithm

Country Status (1)

Country Link
CN (1) CN108920508A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977119A (en) * 2019-03-25 2019-07-05 浙江大学 Data classification and storage method for bioelectronics mixing man-made organ system
CN110222179A (en) * 2019-05-28 2019-09-10 深圳市小赢信息技术有限责任公司 A kind of address list file classification method, device and electronic equipment
CN111695020A (en) * 2020-06-15 2020-09-22 广东工业大学 Hadoop platform-based information recommendation method and system
CN112580355A (en) * 2020-12-30 2021-03-30 中科院计算技术研究所大数据研究院 News information topic detection and real-time aggregation method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810293A (en) * 2014-02-28 2014-05-21 广州云宏信息科技有限公司 Text classification method and device based on Hadoop
CN105912525A (en) * 2016-04-11 2016-08-31 天津大学 Sentiment classification method for semi-supervised learning based on theme characteristics
CN105975478A (en) * 2016-04-09 2016-09-28 北京交通大学 Word vector analysis-based online article belonging event detection method and device
US20170056764A1 (en) * 2015-08-31 2017-03-02 Omniscience Corporation Event categorization and key prospect identification from storylines
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107943824A (en) * 2017-10-17 2018-04-20 广东广业开元科技有限公司 A kind of big data news category method, system and device based on LDA


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hu Xiaodong, Gao Jiawei: "Naive Bayes text classification algorithm based on MapReduce for big data", Bulletin of Science and Technology *
Dong Shuai: "Research on text classification algorithms based on semi-supervised learning", China Master's Theses Full-text Database, Information Science and Technology *


Similar Documents

Publication Publication Date Title
Vysotska et al. Web Content Support Method in Electronic Business Systems.
Ma et al. An ontology-based text-mining method to cluster proposals for research project selection
CA3033859C (en) Method and system for automatically extracting relevant tax terms from forms and instructions
CN108920508A (en) Textual classification model training method and system based on LDA algorithm
Hussain et al. Approximation of COSMIC functional size to support early effort estimation in Agile
Basiri et al. Sentiment prediction based on dempster‐shafer theory of evidence
CN110458324A (en) Calculation method, device and the computer equipment of risk probability
Chyrun et al. Content monitoring method for cut formation of person psychological state in social scoring
CN110442872B (en) Text element integrity checking method and device
Zhang et al. A multiclassification model of sentiment for E-commerce reviews
CN113312480A (en) Scientific and technological thesis level multi-label classification method and device based on graph convolution network
KR20230052609A (en) Review analysis system using machine reading comprehension and method thereof
US20220004718A1 (en) Ontology-Driven Conversational Interface for Data Analysis
CN112307336A (en) Hotspot information mining and previewing method and device, computer equipment and storage medium
Habbat et al. Topic modeling and sentiment analysis with lda and nmf on moroccan tweets
Bonfitto et al. Semi-automatic column type inference for CSV table understanding
Bhole et al. Extracting named entities and relating them over time based on Wikipedia
Lin et al. Ensemble making few-shot learning stronger
CN116541517A (en) Text information processing method, apparatus, device, software program, and storage medium
Han et al. An evidence-based credit evaluation ensemble framework for online retail SMEs
CN110222179A (en) A kind of address list file classification method, device and electronic equipment
Gillmann et al. Quantification of Economic Uncertainty: a deep learning approach
CN114861655A (en) Data mining processing method, system and storage medium
Anastasopoulos et al. Computational text analysis for public management research: An annotated application to county budgets
KR20230059364A (en) Public opinion poll system using language model and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181130