CN108920508A - Textual classification model training method and system based on LDA algorithm - Google Patents


Info

Publication number
CN108920508A
CN108920508A (application CN201810535046.1A)
Authority
CN
China
Prior art keywords
training
text
word
lda
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810535046.1A
Other languages
Chinese (zh)
Inventor
冯广辉
王雷
居燕峰
李福�
周小华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Original Assignee
FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd filed Critical FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Priority to CN201810535046.1A priority Critical patent/CN108920508A/en
Publication of CN108920508A publication Critical patent/CN108920508A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a text classification model training method based on the LDA algorithm. The method includes obtaining unordered text in real time; performing word segmentation on the input unordered text according to the segmentation maintenance information of a configured LDA topic training model, which includes synonym maintenance, stop-word maintenance, and the like; converting the segmented unordered text into vectorized word-frequency vectors; extracting 10% of the word-frequency vectors as the classification input condition; and returning the classification prediction result obtained by Bayesian training. The present invention changes the storage of data from a traditional disk medium to a distributed HDFS-based storage scheme, ensuring the safety of the data and reducing the time needed to load data into memory. Using the Hadoop-based MapReduce distributed computing framework offers better scalability and fault tolerance than a single machine, allows larger sample sizes to be loaded, and saves program running time.

Description

Textual classification model training method and system based on LDA algorithm
Technical field
The present invention relates to the field of software technology, and in particular to a text classification model training method and system based on the LDA algorithm.
Background technique
With the comprehensive arrival of the information age, the Internet occupies an ever more important place in people's lives, and people's dependence on it grows increasingly high. At this stage, across the many applications of the Internet and their forms of interaction, text remains an important medium of presentation. With social progress, the development of the times, and the accumulation of information, more and more text information is stored and preserved; how to correctly analyze and use this historical information has also gradually attracted attention. In the analysis of text information and the mining of data, the classification of text information is of central importance.
In real life, more and more text classification scenarios need to be solved, such as recommending news content of interest to users, rapidly classifying manuscripts written by reporters, and distributing and archiving content obtained by web crawlers. Therefore, in the follow-up data processing systems of news portals, the manuscript entry systems of newspapers, and web crawlers, the demand for text classification is increasingly urgent. For ordinary users, it is even more important to be able to quickly customize the classification categories their scenario needs without having to understand the complicated technical principles and implementation of the classification algorithm behind them.
In tackling the practical problem of text classification, at the beginning of this century Blei, David M., Andrew Ng, and Jordan, Michael I. proposed a topic-model classification method, latent Dirichlet allocation, abbreviated LDA. The advantage of this method is that it is an unsupervised learning algorithm: before performing classification training on text, one does not need to spend a large amount of labor arranging the correspondence between content and each class; instead, the various text files can be used directly as the input condition of the algorithm, and after the number of classes is manually specified, everything else is left to the algorithm itself to handle.
However, precisely because of the convenience of the LDA topic-model algorithm, it often cannot satisfy users' actual needs in real-life scenarios. For example, when classifying the news of a certain news website, a user may wish to divide the news content into categories such as "sports", "finance", "automobile information", and "automobile complaints". But the LDA algorithm can only specify the number of classes, and since the essence of the algorithm is a clustering process, the actual scenario may be that news under "finance" contains financial reports related to automobiles, or that descriptions about economics, such as car purchase and maintenance, appear under the "automobile" channel. Classification according to the LDA algorithm therefore often merges the contents of "automobile information" and "automobile complaints" together, or produces classes that are not sufficiently clear.
Furthermore, existing algorithms implemented according to Bayes' theorem can solve classification demands very well in most cases, but because the algorithm itself takes the assumption of conditional independence as a premise, the independence of events between training data must be improved as much as possible; in Chinese classification, the problem that event independence feeds back into is the synonym problem.
Summary of the invention
It is an object of the present invention to propose a mass text classification model training method based on the LDA algorithm, covering training data acquisition, classification model training, and directed classification prediction for Chinese text, that can classify text simply and accurately, solving the problems that existing text classification technologies have complicated principles, are difficult to implement, perform poorly on Chinese classification, and cannot quickly customize the classification categories needed according to one's own demands.
To achieve the goals above, the technical scheme adopted by the invention is as follows:
A text classification model training method based on the LDA algorithm includes the following steps:
obtaining unordered text as training data;
setting an LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, and training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file;
saving the preliminary classification file, processing it with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS;
extracting test training samples from the word-frequency vectors for training-result feedback, and, based on Bayesian training, using the Hadoop MapReduce training process to cut the test training samples into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process;
whereupon the classification model training is finished.
Wherein, processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered text content into word-frequency vectors and storing them into HDFS.
Wherein, the test training samples are 10% of the word-frequency vectors.
Wherein, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
Wherein, before saving the preliminary classification file to the distributed storage system HDFS, the method further includes:
screening the preliminary classification file, where new classification information is defined in a directed manner and stored to the HDFS system.
The invention additionally discloses a text classification model training system based on the LDA algorithm, including:
an input unit, which obtains unordered text as training data; and
a storage medium for executing the following instructions:
setting an LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, and training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file;
saving the preliminary classification file, processing it with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS;
extracting test training samples from the word-frequency vectors for training-result feedback, and, based on Bayesian training, using the Hadoop MapReduce training process to cut the test training samples into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process.
Wherein, processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered text content into word-frequency vectors and storing them into HDFS.
Wherein, the test training samples are 10% of the word-frequency vectors.
Wherein, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
Wherein, before saving the preliminary classification file to the distributed storage system HDFS, the system further performs:
screening the preliminary classification file, where new classification information is defined in a directed manner and stored to the HDFS system.
The beneficial effects of the present invention are as follows:
Through the preliminary screening of the LDA topic model combined with a manual re-selection function, the present invention uniformly summarizes and cleans small amounts of unordered text content, guaranteeing event independence between attributes and improving the accuracy of text classification prediction.
The storage of data is changed from a traditional disk medium to a distributed HDFS-based storage scheme, ensuring the safety of the data and reducing the time needed to load data into memory. Using the Hadoop-based MapReduce distributed computing framework offers better scalability and fault tolerance than a single machine, allows larger sample sizes to be loaded, and saves program running time.
Detailed description of the invention
Fig. 1 is a schematic diagram of the unordered text collection of the invention;
Fig. 2 is a flowchart of a text classification model training method based on the LDA algorithm of the invention.
Specific embodiment
The present invention will be described in detail below with reference to the specific embodiments shown in the drawings. However, these embodiments do not limit the present invention; structural, methodological, or functional transformations made by those skilled in the art according to these embodiments are all included within the protection scope of the present invention.
One embodiment of the present invention discloses a text classification model training method based on the LDA algorithm which, as shown in Fig. 2, includes the following steps:
S001: obtaining unordered text as training data. In this embodiment, unordered text can be obtained from a designated location, with data sources including those following the FTP protocol and the HTTP protocol. Parsing of user-uploaded files is supported, including the data formats of TXT compressed packages, CSV compressed packages, Excel, and Word compressed packages. HTTP data uploaded by way of a specified link, or processed as a set of links, is also handled; for HTTP data, the breadth and depth of data acquisition can be specified. Text content in HTML is obtained by directed extraction of data in XPath format. For data dynamically loaded in web pages or added through AJAX asynchronous requests, dynamic data acquisition is carried out based on a Selenium solution.
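The directed XPath extraction described above can be sketched as follows. This is a minimal illustration using Python's standard library, whose ElementTree module supports a limited XPath subset; the page fragment, tag names, and class attribute are invented for the example. A real crawler would use a tolerant HTML parser and, as the description notes, Selenium for dynamically loaded content.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; a real crawler would fetch this over HTTP/FTP.
page = """
<html><body>
  <div class="news"><p>LDA topic models cluster documents.</p></div>
  <div class="ads"><p>buy now</p></div>
</body></html>
"""

def extract_text(html: str, xpath: str) -> list:
    """Directed extraction: keep only the text of nodes matched by the XPath."""
    root = ET.fromstring(html.strip())
    return [node.text for node in root.findall(xpath)]

# Only the news content is kept; the ad block is ignored.
news = extract_text(page, ".//div[@class='news']/p")
print(news)  # ['LDA topic models cluster documents.']
```

The XPath expression is what gives the acquisition its "direction": everything outside the matched nodes is discarded before segmentation.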
S002: setting the LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file, and saving it. In this step, the obtained unordered text is segmented; the word segmentation includes business-related stop-word maintenance and business-related synonym maintenance performed beforehand, so that the data is cleaned for subsequent classification. For the LDA topic training model for Chinese word segmentation, the topic-number configuration tnum, the iteration-number configuration inum, and the model input location sloc are set. The main function of the LDA topic training model is to cluster documents that are not necessarily related, finding the topic distribution in each document and the distribution over the words in each document; how the topics are automatically generated and how they are analyzed are both tasks left to the LDA topic training model. In probabilistic terms, each word in an article is produced by first selecting a topic with a certain probability and then selecting a word from the chosen topic; the probabilistic relation between words and documents is as follows:

p(word | document) = Σ_topic p(word | topic) · p(topic | document)
The main function of the LDA topic training model is thus to find tnum topics in the unordered text and establish the corresponding word relations. As shown in Fig. 1, Doc1, Doc2, ..., Docm represent the unordered text collection, where Docm represents the m-th unordered text, and the Wordn of the corresponding row indicates that there are n words in that unordered text. Combined with the tnum and inum parameters configured in the LDA topic training model, the program computes over all the unordered texts on the basis of the tnum topic number; the word probability distribution proceeds in the following way:

p(θ, z, w | α, β) = p(θ | α) · ∏_{n=1..N} p(z_n | θ) · p(w_n | z_n, β)
In the above process, θ denotes a topic vector, each component of which represents the probability with which the corresponding topic occurs in the document; p(θ) denotes the distribution of θ, taken here to be a Dirichlet distribution, i.e. a continuous multivariate probability distribution; N and w_n likewise denote the corresponding quantities (the number of words and the n-th word); z_n denotes the best topic selected after the algorithm finishes running; p(z | θ) denotes the probability distribution of topic z given θ, which is simply the value of θ, i.e. p(z = i | θ) = θ_i; and p(w | z) is analogous.
Marginalizing over θ and z, the corresponding probability formula is as follows:

p(w | α, β) = ∫ p(θ | α) · ∏_{n=1..N} Σ_{z_n} p(z_n | θ) · p(w_n | z_n, β) dθ
The unordered text after word segmentation is preliminarily trained in combination with the configured LDA topic training model, generating the preliminary classification file.
The preliminary classification file is screened, and new classification information is defined in a directed manner and stored. The preliminary classification file generated above is stored by class and can be presented to the user in a visual manner, which is convenient for access and confirmation. Screening the preliminary classification file includes deleting some contents and retaining other contents. The preliminary classification file is then put through a secondary relation association, new classification information is defined in a directed manner according to user demand, and the result is stored to HDFS.
In one embodiment, the preliminary classification file can be binary-serialized and its content compressed before storage. Binary-serializing the text information and saving it after compressing the content can save data storage space, save the time of loading data into memory, and reduce the time of data preprocessing.
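The serialize-then-compress step can be sketched as below. The patent does not name a serialization format, so the use of Python's pickle and gzip modules, and the classification payload itself, are illustrative assumptions; the compressed blob is what would be written to HDFS.

```python
import gzip
import pickle

# Hypothetical preliminary classification file content: topic id -> member texts.
preliminary = {
    "topic_0": ["car maintenance costs rise", "new vehicle purchase tax adjusted"],
    "topic_1": ["central bank lowers interest rates"],
}

# Binary-serialize the text information, then compress before storage.
blob = gzip.compress(pickle.dumps(preliminary))

# Reading reverses the two steps; the round trip is lossless.
restored = pickle.loads(gzip.decompress(blob))
assert restored == preliminary
```

Because the stored object deserializes directly into an in-memory structure, no re-parsing of text is needed at load time, which is the preprocessing-time saving the embodiment claims.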
S003: processing the preliminary classification file with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS. In this embodiment of the invention, applying TF-IDF to the unordered text includes the following steps:
numbering the words after word segmentation and storing the numbered text content, for example key1-China, key2-…, key3-emerge, and so on;
counting the frequency with which each word occurs, for example key1-100, key2-50, key3-30, and so on.
The calculation formula of the word-frequency statistic is:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} denotes the number of times word t_i appears in document d_j, and Σ_k n_{k,j} denotes the total number of occurrences of all words in document d_j.
The inverse document frequency of each word is then obtained, for example key1-0.005, key2-0.004, key3-0.002, and so on.
The calculation formula of the inverse document frequency is:

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where |D| denotes the total number of documents and |{j : t_i ∈ d_j}| denotes the number of documents containing word t_i. For a word not in the corpus, |{j : t_i ∈ d_j}| would be 0 and the formula would be meaningless; in that case 1 + |{j : t_i ∈ d_j}| can be used in its place.
The numbered text content is converted into word-frequency vectors and stored into HDFS; this step carries out the vectorization operation.
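The two formulas above can be realized directly. The sketch below is a plain-Python illustration with invented toy documents (the patent runs this step at scale as part of a Hadoop job); it applies the 1 + |{j : t_i ∈ d_j}| substitution only when a word is absent from the corpus, as the description specifies.

```python
import math
from collections import Counter

# Toy segmented corpus (invented): each document is a list of words.
docs = [
    ["china", "economy", "china"],
    ["economy", "car"],
]

def tf(word: str, doc: list) -> float:
    # tf_{i,j} = n_{i,j} / sum_k n_{k,j}
    return Counter(doc)[word] / len(doc)

def idf(word: str, corpus: list) -> float:
    # idf_i = log(|D| / |{j : t_i in d_j}|), substituting 1 + df when the
    # word never occurs so the formula stays defined.
    df = sum(1 for d in corpus if word in d)
    return math.log(len(corpus) / (df if df > 0 else 1 + df))

def tfidf(word: str, doc: list, corpus: list) -> float:
    return tf(word, doc) * idf(word, corpus)

print(tf("china", docs[0]))   # 2/3: "china" is 2 of the 3 words in doc 0
print(idf("economy", docs))   # log(2/2) = 0: the word appears in every document
print(idf("unseen", docs))    # log(2/1): the substitution avoids division by zero
```

Note that a word occurring in every document gets an IDF of 0, which is exactly the intended effect: such words carry no class-discriminating signal.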
S004: performing training-result feedback. The test training samples for training-result feedback are extracted from the word-frequency vectors; based on Bayesian training, the Hadoop MapReduce training process is used to cut the test training samples into multiple map tasks, and the results of the map tasks are finally collected and arranged in the reduce process. Specifically, in this process, the extracted test training samples are 10% of the word-frequency vectors; the subsequent training is based on Bayes, whose operation logic is as follows:
P(B[i] | A) = P(B[i]) · P(A | B[i]) / { P(B[1]) · P(A | B[1]) + P(B[2]) · P(A | B[2]) + … + P(B[n]) · P(A | B[n]) }
In order to accelerate training and increase the capacity of the training input samples, the implementation of the Bayesian algorithm is revised to an implementation based on Hadoop MapReduce: the training task is cut, according to certain segmentation conditions, into multiple map tasks, and the results of the map tasks are finally collected and arranged in the reduce process. The training data is divided into m parts, with m = total_size / block_size by default, where total_size denotes the size of the whole input file and block_size is the HDFS file block size, which defaults to 64 MB and can be set by the parameter dfs.block.size. This process uses the vectorized word-frequency vectors as the Bayesian input condition to train the classification model, and uses the LDA model generated by training to predict the classification of sample data and compute classification-accuracy statistics; at this point the classification model training is finished.
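The Bayesian operation logic above, the posterior of each class B[i] given evidence A, can be checked with a small numerical sketch. The priors and likelihoods here are invented for illustration; in the patent's pipeline, A would be a word-frequency vector and B[i] one of the tnum classes, and each map task would compute the per-split count statistics that feed these probabilities.

```python
def posterior(priors, likelihoods):
    """P(B[i] | A) = P(B[i]) P(A | B[i]) / sum_k P(B[k]) P(A | B[k])."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)            # the denominator of the formula
    return [j / evidence for j in joint]

# Two hypothetical classes, e.g. "finance" and "automobile".
priors = [0.5, 0.5]        # P(B[1]), P(B[2])
likelihoods = [0.25, 0.75] # P(A | B[1]), P(A | B[2])

post = posterior(priors, likelihoods)
print(post)  # [0.25, 0.75]
```

The predicted class is simply the index of the largest posterior; because the denominator is shared, the map tasks only need to ship the per-class joint counts to the reducer.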
Classification prediction is carried out using the classification model formed by the above embodiment. Data is crawled from the network in real time by a crawler, for example from NetEase's public web pages; according to the word-segmentation maintenance information of the LDA topic training model set in the above embodiment, including synonym maintenance, stop-word maintenance, and the like, the input unordered text is segmented. The segmented unordered text is converted into vectorized word-frequency vectors, 10% of the word-frequency vectors is extracted as the classification input condition, and the classification prediction result obtained by Bayesian training is returned.
One embodiment of the present invention also discloses a text classification model training system based on the LDA algorithm; the system uses all the processes and steps of the text classification model training method based on the LDA algorithm, and includes:
an input unit, which obtains unordered text as training data; and
a storage medium for executing the following instructions:
setting the LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file, and saving it;
processing the preliminary classification file with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS;
extracting test training samples from the word-frequency vectors for training-result feedback, and, based on Bayesian training, using the Hadoop MapReduce training process to cut the test training samples into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process. In yet another embodiment, processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered text content into word-frequency vectors and storing them into HDFS.
In another preferred embodiment, the test training samples are 10% of the word-frequency vectors.
In order to save data storage space, save the time of loading data into memory, and reduce the data-preprocessing time for large amounts of data, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
In another preferred embodiment, before saving the preliminary classification file to the distributed storage system HDFS, the method further includes: screening the preliminary classification file, where new classification information is defined in a directed manner and stored to the HDFS system.
Through the preliminary screening of the LDA topic model combined with a manual re-selection function, the method and system described in the above embodiments uniformly summarize and clean small amounts of unordered text content, guaranteeing event independence between attributes and improving the accuracy of text classification prediction.
The storage of data is changed from a traditional disk medium to a distributed HDFS-based storage scheme, ensuring the safety of the data and reducing the time needed to load data into memory. Using the Hadoop-based MapReduce distributed computing framework offers better scalability and fault tolerance than a single machine, allows larger sample sizes to be loaded, and saves program running time.
It should be appreciated that, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity. Those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.
The series of detailed descriptions listed above are only specific illustrations of feasible embodiments of the invention; they are not intended to limit the protection scope of the invention, and all equivalent embodiments or changes made without departing from the technical spirit of the invention should be included within the protection scope of the invention.

Claims (10)

1. A text classification model training method based on the LDA algorithm, characterized by including the following steps:
obtaining unordered text as training data;
setting an LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file, and saving it;
processing the preliminary classification file with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS;
extracting test training samples from the word-frequency vectors for training-result feedback, and, based on Bayesian training, using the Hadoop MapReduce training process to cut the test training samples into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process;
whereupon the classification model training is finished.
2. The text classification model training method based on the LDA algorithm according to claim 1, characterized in that processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered text content into word-frequency vectors and storing them into HDFS.
3. The text classification model training method based on the LDA algorithm according to claim 1 or 2, characterized in that the test training samples are 10% of the word-frequency vectors.
4. The text classification model training method based on the LDA algorithm according to claim 1 or 2, characterized in that, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
5. The text classification model training method based on the LDA algorithm according to any one of claims 1 to 4, characterized in that, before the TF-IDF processing is performed, the method further includes:
screening the preliminary classification file, where new classification information is defined in a directed manner.
6. A text classification model training system based on the LDA algorithm, characterized by including:
an input unit, which obtains unordered text as training data; and
a storage medium for executing the following instructions:
setting an LDA topic training model for Chinese word segmentation, performing word segmentation on the unordered text to obtain the distribution information of the set topics and the word distribution in each unordered text, training the segmented unordered text according to the LDA topic training model to generate a preliminary classification file, and saving it;
processing the preliminary classification file with the TF-IDF algorithm, converting it to word-frequency vectors, and storing them into HDFS;
extracting test training samples from the word-frequency vectors for training-result feedback, and, based on Bayesian training, using the Hadoop MapReduce training process to cut the test training samples into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process.
7. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered text content into word-frequency vectors and storing them into HDFS.
8. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that the test training samples are 10% of the word-frequency vectors.
9. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
10. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that saving the preliminary classification file to the distributed storage system HDFS includes:
screening the preliminary classification file, where new classification information is defined in a directed manner and stored to the HDFS system.
CN201810535046.1A 2018-05-29 2018-05-29 Textual classification model training method and system based on LDA algorithm Pending CN108920508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810535046.1A CN108920508A (en) 2018-05-29 2018-05-29 Textual classification model training method and system based on LDA algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810535046.1A CN108920508A (en) 2018-05-29 2018-05-29 Textual classification model training method and system based on LDA algorithm

Publications (1)

Publication Number Publication Date
CN108920508A true CN108920508A (en) 2018-11-30

Family

ID=64411043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810535046.1A Pending CN108920508A (en) 2018-05-29 2018-05-29 Textual classification model training method and system based on LDA algorithm

Country Status (1)

Country Link
CN (1) CN108920508A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977119A (en) * 2019-03-25 2019-07-05 浙江大学 Data classification and storage method for bioelectronics mixing man-made organ system
CN110222179A (en) * 2019-05-28 2019-09-10 深圳市小赢信息技术有限责任公司 A kind of address list file classification method, device and electronic equipment
CN111695020A (en) * 2020-06-15 2020-09-22 广东工业大学 Hadoop platform-based information recommendation method and system
CN112580355A (en) * 2020-12-30 2021-03-30 中科院计算技术研究所大数据研究院 News information topic detection and real-time aggregation method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810293A (en) * 2014-02-28 2014-05-21 广州云宏信息科技有限公司 Text classification method and device based on Hadoop
CN105912525A (en) * 2016-04-11 2016-08-31 天津大学 Sentiment classification method for semi-supervised learning based on theme characteristics
CN105975478A (en) * 2016-04-09 2016-09-28 北京交通大学 Word vector analysis-based online article belonging event detection method and device
US20170056764A1 (en) * 2015-08-31 2017-03-02 Omniscience Corporation Event categorization and key prospect identification from storylines
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107943824A (en) * 2017-10-17 2018-04-20 广东广业开元科技有限公司 A kind of big data news category method, system and device based on LDA


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hu Xiaodong, Gao Jiawei: "Naive Bayes text classification algorithm based on MapReduce for big data", Bulletin of Science and Technology *
Dong Shuai: "Research on text classification algorithms based on semi-supervised learning", China Master's Theses Full-text Database, Information Science and Technology *


Similar Documents

Publication Publication Date Title
Vysotska et al. Web Content Support Method in Electronic Business Systems.
Ma et al. An ontology-based text-mining method to cluster proposals for research project selection
CA3033859C (en) Method and system for automatically extracting relevant tax terms from forms and instructions
CN108920508A (en) Textual classification model training method and system based on LDA algorithm
Hussain et al. Approximation of COSMIC functional size to support early effort estimation in Agile
Basiri et al. Sentiment prediction based on dempster‐shafer theory of evidence
CN110458324A (en) Calculation method, device and the computer equipment of risk probability
Chyrun et al. Content monitoring method for cut formation of person psychological state in social scoring
CN110442872B (en) Text element integrity checking method and device
Zhang et al. A multiclassification model of sentiment for E-commerce reviews
CN113312480A (en) Scientific and technological thesis level multi-label classification method and device based on graph convolution network
KR20230052609A (en) Review analysis system using machine reading comprehension and method thereof
US20220004718A1 (en) Ontology-Driven Conversational Interface for Data Analysis
CN112307336A (en) Hotspot information mining and previewing method and device, computer equipment and storage medium
Habbat et al. Topic modeling and sentiment analysis with lda and nmf on moroccan tweets
Bonfitto et al. Semi-automatic column type inference for CSV table understanding
Bhole et al. Extracting named entities and relating them over time based on Wikipedia
Lin et al. Ensemble making few-shot learning stronger
CN116541517A (en) Text information processing method, apparatus, device, software program, and storage medium
Han et al. An evidence-based credit evaluation ensemble framework for online retail SMEs
CN110222179A (en) A kind of address list file classification method, device and electronic equipment
Gillmann et al. Quantification of Economic Uncertainty: a deep learning approach
CN114861655A (en) Data mining processing method, system and storage medium
Anastasopoulos et al. Computational text analysis for public management research: An annotated application to county budgets
KR20230059364A (en) Public opinion poll system using language model and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181130