CN108920508A - Textual classification model training method and system based on LDA algorithm - Google Patents
Textual classification model training method and system based on LDA algorithm
- Publication number
- CN108920508A CN108920508A CN201810535046.1A CN201810535046A CN108920508A CN 108920508 A CN108920508 A CN 108920508A CN 201810535046 A CN201810535046 A CN 201810535046A CN 108920508 A CN108920508 A CN 108920508A
- Authority
- CN
- China
- Prior art keywords
- training
- text
- word
- lda
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000012549 training Methods 0.000 title claims abstract description 99
- 238000000034 method Methods 0.000 title claims abstract description 52
- 238000013145 classification model Methods 0.000 title claims abstract description 20
- 230000011218 segmentation Effects 0.000 claims abstract description 32
- 238000012545 processing Methods 0.000 claims abstract description 29
- 239000000284 extract Substances 0.000 claims abstract description 7
- 230000008569 process Effects 0.000 claims description 31
- 238000012360 testing method Methods 0.000 claims description 14
- 238000013507 mapping Methods 0.000 claims description 11
- 238000013480 data collection Methods 0.000 claims 2
- 238000012423 maintenance Methods 0.000 abstract description 7
- 238000006243 chemical reaction Methods 0.000 abstract 1
- 230000006870 function Effects 0.000 description 4
- 238000004321 preservation Methods 0.000 description 3
- 238000012216 screening Methods 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 230000006835 compression Effects 0.000 description 2
- 238000007906 compression Methods 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 1
- 238000009412 basement excavation Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000012790 confirmation Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 230000001376 precipitating effect Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a text classification model training method based on the LDA algorithm. The method includes obtaining unsorted text in real time and performing word segmentation on the input text according to the segmentation maintenance information of a configured LDA topic training model, including synonym maintenance, stop-word maintenance, and the like. The segmented unsorted text is then converted into vectorized word-frequency vectors; 10% of the word-frequency vectors are extracted as the classification input condition, and the classification prediction result is returned after naive Bayes training. The invention replaces traditional disk storage with a storage scheme based on the distributed file system HDFS as the medium, ensuring the safety of the data and reducing the time needed to load data into memory. By adopting the MapReduce distributed computing framework based on Hadoop, which offers better scalability and fault tolerance than a single machine, larger sample sizes can be loaded and program running time is saved.
Description
Technical field
The present invention relates to the field of software technology, and in particular to a text classification model training method and system based on the LDA algorithm.
Background technique
With the full arrival of the information age, the Internet occupies an ever more important place in people's lives, and people's dependence on it keeps growing. At this stage, across the many applications of the Internet and their interactive forms, text remains the presentation format of one of the most important media. With social progress, the development of the times, and the accumulation of information, more and more textual information is stored and preserved; how to correctly analyze and use this historical information has also gradually attracted attention. In the analysis of textual information and the mining of data, the classification of text information is especially important.
In real life, more and more text classification scenarios need to be addressed, such as recommending news content of interest to users, quickly classifying manuscripts written by reporters, and filing and distributing content obtained by web crawlers. Therefore, in news portals, newspaper manuscript entry systems, and the downstream processing systems of web crawlers, the demand for text classification is increasingly urgent. For ordinary users, it is even more important to be able to quickly customize the classification categories they need for their own scenarios without having to understand the complicated technical principles and implementation of the underlying classification algorithms.
To address the practical problem of text classification, at the beginning of this century David M. Blei, Andrew Ng, and Michael I. Jordan proposed a topic-model classification method, Latent Dirichlet Allocation (LDA). The advantage of this method is that it is an unsupervised learning algorithm: before training a classifier on text, there is no need to spend a large amount of labor arranging the correspondence between content and each class. Instead, various text files can be used directly as the input of the algorithm; after the number of classes is specified manually, everything else is left to the algorithm itself.
However, precisely because of the convenience of the LDA topic-model algorithm, it often cannot satisfy users' actual needs in real-life scenarios. For example, when classifying the news of a certain news website, a user may wish to divide the news content into categories such as "sports", "finance", "automobile information", and "automobile complaints". But the LDA algorithm can only specify the number of classes, and since the essence of the algorithm is a clustering process, the actual situation is that news under "finance" may contain financial reports related to automobiles, and descriptions of economic matters such as car purchases and maintenance may appear under the "automobile" channel. Classification according to the LDA algorithm therefore often merges the content of "automobile information" and "automobile complaints" together, or produces categories that are not sufficiently distinct.
Furthermore, existing algorithms implemented according to Bayes' theorem can solve the classification demand in most cases very well, but because the algorithm itself takes the assumption of conditional independence as its premise, the independence between training data needs to be improved as much as possible. In Chinese classification, the independence problem manifests, for example, as the synonym problem.
Summary of the invention
An object of the present invention is to propose a mass text classification model training method based on the LDA algorithm that covers training data acquisition, classification model training, and directed classification prediction for Chinese text, and that can perform text classification simply and accurately. It solves the problems that existing text classification technology has complicated principles, is difficult to implement, performs poorly on Chinese classification, and cannot quickly customize the required classification categories according to one's own needs.
To achieve the goals above, the technical scheme adopted by the invention is as follows:
A text classification model training method based on the LDA algorithm includes the following steps:
obtaining unsorted text as training data;
setting up an LDA topic training model with Chinese word segmentation, performing word segmentation on the unsorted text to obtain the distribution information of the configured topics and the word distribution in each unsorted text, and training the segmented unsorted text according to the LDA topic training model to generate a preliminary classification file;
saving the preliminary classification file, processing it with the TF-IDF algorithm to convert it into word-frequency vectors, and storing them into HDFS;
extracting a test training sample from the word-frequency vectors for training result feedback, and, based on naive Bayes training, using the Hadoop MapReduce training process to cut the test training sample into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process;
whereupon the classification model training is finished.
Specifically, processing the saved content with the TF-IDF algorithm includes the following steps:
numbering all the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered and stored text content into word-frequency vectors and storing them into HDFS.
Specifically, the test training sample is 10% of the word-frequency vectors.
Specifically, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
Specifically, before the preliminary classification file is saved to the distributed storage system HDFS, the method further includes:
screening the preliminary classification file, and defining new classification information in a directed manner and storing it to the HDFS system.
The invention additionally discloses a text classification model training system based on the LDA algorithm, including:
an input unit that obtains unsorted text as training data; and
a storage medium for executing the following instructions:
setting up an LDA topic training model with Chinese word segmentation, performing word segmentation on the unsorted text to obtain the distribution information of the configured topics and the word distribution in each unsorted text, and training the segmented unsorted text according to the LDA topic training model to generate a preliminary classification file;
saving the preliminary classification file, processing it with the TF-IDF algorithm to convert it into word-frequency vectors, and storing them into HDFS;
extracting a test training sample from the word-frequency vectors for training result feedback, and, based on naive Bayes training, using the Hadoop MapReduce training process to cut the test training sample into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process.
Specifically, processing the saved content with the TF-IDF algorithm includes the following steps:
numbering all the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered and stored text content into word-frequency vectors and storing them into HDFS.
Specifically, the test training sample is 10% of the word-frequency vectors.
Specifically, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
Specifically, before the preliminary classification file is saved to the distributed storage system HDFS, the method further includes:
screening the preliminary classification file, and defining new classification information in a directed manner and storing it to the HDFS system.
The beneficial effects of the present invention are as follows:
Through the preliminary screening of the LDA topic model combined with a manual re-selection function, the present invention unifies and cleans a small amount of unsorted text content, ensuring the independence between attributes and improving the accuracy of text classification prediction.
Replacing traditional disk storage with a storage scheme based on the distributed file system HDFS as the medium ensures the safety of the data and reduces the time needed to load data into memory. The MapReduce distributed computing framework based on Hadoop offers better scalability and fault tolerance than a single machine, can load larger sample sizes, and saves program running time.
Description of the drawings
Fig. 1 is a schematic diagram of an unsorted text collection of the present invention;
Fig. 2 is a flowchart of a text classification model training method based on the LDA algorithm of the present invention.
Specific embodiment
Below with reference to specific embodiment shown in the drawings, the present invention will be described in detail.But these embodiments are simultaneously
The present invention is not limited, structure that those skilled in the art are made according to these embodiments, method or functionally
Transformation is included within the scope of protection of the present invention.
A text classification model training method based on the LDA algorithm is disclosed in one embodiment of the present invention. Referring to Fig. 2, it includes the following steps:
S001: obtain unsorted text as training data. In this embodiment, unsorted text can be obtained from designated locations, with data sources including the FTP protocol and the HTTP protocol. Parsing of user-uploaded files is supported, including compressed packages in the TXT, CSV, Excel, and Word data formats. HTTP data uploaded by way of a specified link or a set of links is processed; based on the HTTP data, the breadth and depth of data acquisition can be specified. Text content in HTML is obtained through directed data extraction in XPath format. For data dynamically loaded or requested asynchronously via AJAX in web pages, dynamic data acquisition is carried out based on a Selenium solution.
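The directed XPath extraction mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the patent's crawler: the HTML snippet, the `class="news"` attribute, and the paths are assumptions, and `xml.etree.ElementTree` only supports a limited XPath subset (a real crawler would use a full HTML parser).

```python
# Minimal sketch of directed text extraction with XPath-style paths.
# The markup and paths below are illustrative assumptions.
import xml.etree.ElementTree as ET

page = """
<html>
  <body>
    <div class="news">
      <h1>Sample headline</h1>
      <p>First paragraph of the article.</p>
      <p>Second paragraph of the article.</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)
# ElementTree supports a limited XPath subset: tag paths, ".//" for
# descendants, and [@attr='value'] predicates.
title = root.find(".//div[@class='news']/h1").text
paragraphs = [p.text for p in root.findall(".//div[@class='news']/p")]

print(title)            # Sample headline
print(len(paragraphs))  # 2
```

For real pages, which are rarely well-formed XML, a tolerant parser and a richer XPath engine would replace `ElementTree` while keeping the same directed-path idea.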
S002: set up an LDA topic training model with Chinese word segmentation, perform word segmentation on the unsorted text to obtain the distribution information of the configured topics and the word distribution in each unsorted text, train the segmented unsorted text according to the LDA topic training model to generate a preliminary classification file, and save it. In this step, the acquired unsorted text is segmented into words; the word segmentation includes business-related stop-word maintenance and business-related synonym maintenance, thereby performing data cleaning for the subsequent classification. Using the LDA topic training model with Chinese word segmentation, the topic number configuration tnum, the iteration number configuration inum, and the model input location sloc are set. The main function of the LDA topic training model is to take some documents that are not necessarily related and find the topic distribution in each document and the word distribution under each topic; how topics are automatically generated and how they are analyzed is entirely left to the LDA topic training model. In probabilistic terms, every word in every article is generated by first selecting a topic with a certain probability and then selecting a word from that chosen topic. The probabilistic relation between words and documents is as follows:

p(word | document) = Σ_topic p(word | topic) · p(topic | document)
The main function of the LDA topic training model is to find tnum topics in the unsorted text and establish the corresponding word correspondences. Referring to Fig. 1, Doc1, Doc2, ..., Docm represent the unsorted text collection, Docm represents the m-th unsorted text, and the Wordn of the corresponding row indicates that there are n words in that unsorted text. Combining the tnum and inum parameters configured in the LDA topic training model, the program computes over all the unsorted texts on the basis of the tnum topic number. The word probability distribution process proceeds as follows:

θ ~ Dirichlet(α);  z_n ~ Multinomial(θ);  w_n ~ p(w_n | z_n)

where θ denotes a topic vector, each component of which represents the probability with which the corresponding topic occurs in the document; p(θ) denotes the distribution of θ, here a Dirichlet distribution, i.e. a continuous multivariate probability distribution; N and w_n denote the number of words in the document and the n-th word respectively; z_n denotes the best topic selected for the n-th word when the algorithm finishes; p(z | θ) denotes the probability distribution of topic z given θ, which is exactly the value of θ, i.e. p(z = i | θ) = θ_i; and p(w | z) is defined analogously. The corresponding probability formula is as follows:

p(θ, z, w) = p(θ) · Π_{n=1}^{N} p(z_n | θ) · p(w_n | z_n)
The unsorted text after word segmentation is then preliminarily trained in combination with the configured LDA topic training model, generating the preliminary classification file.
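The topic-training step above can be sketched with a small collapsed Gibbs sampler; this is a generic LDA illustration rather than the patent's implementation. The parameter names tnum and inum mirror the configuration names in the text, while the toy corpus, alpha, beta, and the random seed are assumptions.

```python
# Minimal collapsed Gibbs sampling sketch of LDA topic training.
# tnum = topic count, inum = iteration count (as configured in the text).
import random

def lda_train(docs, tnum, inum, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    widx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # Count tables: document-topic, topic-word, and topic totals.
    ndk = [[0] * tnum for _ in docs]
    nkw = [[0] * V for _ in range(tnum)]
    nk = [0] * tnum
    assign = []
    for d, doc in enumerate(docs):          # random initial assignment
        zs = []
        for w in doc:
            z = rng.randrange(tnum)
            zs.append(z)
            ndk[d][z] += 1; nkw[z][widx[w]] += 1; nk[z] += 1
        assign.append(zs)
    for _ in range(inum):                   # Gibbs sweeps
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z = assign[d][n]
                ndk[d][z] -= 1; nkw[z][widx[w]] -= 1; nk[z] -= 1
                # p(z=k | rest) ∝ (ndk+alpha) * (nkw+beta) / (nk+V*beta)
                weights = [(ndk[d][k] + alpha) * (nkw[k][widx[w]] + beta)
                           / (nk[k] + V * beta) for k in range(tnum)]
                z = rng.choices(range(tnum), weights=weights)[0]
                assign[d][n] = z
                ndk[d][z] += 1; nkw[z][widx[w]] += 1; nk[z] += 1
    # Per-document topic distribution theta (the θ of the formulas above).
    theta = [[(ndk[d][k] + alpha) / (len(doc) + tnum * alpha)
              for k in range(tnum)] for d, doc in enumerate(docs)]
    return theta, assign

docs = [["apple", "banana", "apple"],
        ["stock", "market", "stock"],
        ["apple", "market"]]
theta, assign = lda_train(docs, tnum=2, inum=50)
print(len(theta), len(theta[0]))  # 3 2
print(all(abs(sum(row) - 1.0) < 1e-9 for row in theta))  # True
```

Each row of theta is a topic distribution for one document, which is exactly the per-document output the preliminary classification file would be built from.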
The preliminary classification file is screened, and new classification information is defined in a directed manner and stored. The generated preliminary classification file is stored by category and can be presented to the user in a visual manner for easy access and confirmation. The preliminary classification file is screened; the screening includes deleting some content and keeping part of the content. The preliminary classification file is then put through a second round of relational association: new classification information oriented by user demand is defined and stored to HDFS.
In one embodiment, the preliminary classification file can be binary-serialized and its content compressed before storage. Binary-serializing the text information and saving it after compressing the content can save data storage space, reduce the time needed to load data into memory, and shorten the data preprocessing time.
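The serialize-then-compress step can be sketched with the standard library (pickle for binary serialization, gzip for compression). The sample content is an assumption, and the patent's HDFS write is replaced here by an in-memory round trip for brevity.

```python
# Sketch of binary serialization plus compression of a preliminary
# classification file. The dictionary content is illustrative only.
import gzip
import pickle

preliminary_classes = {
    "topic_0": ["doc1 segmented text ...", "doc2 segmented text ..."],
    "topic_1": ["doc3 segmented text ..."],
}

raw = pickle.dumps(preliminary_classes)           # binary serialization
packed = gzip.compress(raw)                       # compression before storage
restored = pickle.loads(gzip.decompress(packed))  # loading reverses both steps

print(restored == preliminary_classes)  # True
```

Writing `packed` to HDFS instead of plain text is what saves storage space and load time in the method described above.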
S003: process the preliminary classification file with the TF-IDF algorithm, convert it into word-frequency vectors, and store them into HDFS. In this embodiment of the invention, applying TF-IDF to the unsorted text includes the following steps:
Number all the words after word segmentation and store the numbered text content, for example: key1 - China, key2 - in, key3 - emerge, and so on.
Count the frequency with which each word occurs, for example key1-100, key2-50, key3-30, and so on. The term frequency is calculated as:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of times word t_i occurs in document d_j, and Σ_k n_{k,j} is the total number of occurrences of all words in document d_j.
Obtain the inverse document frequency of each word, for example key1-0.005, key2-0.004, key3-0.002, and so on. The inverse document frequency is calculated as:

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where |D| is the total number of documents and |{j : t_i ∈ d_j}| is the number of documents containing word t_i. For words not in the corpus, however, |{j : t_i ∈ d_j}| would be 0 and the formula would be meaningless, so 1 + |{j : t_i ∈ d_j}| can be used instead.
Convert the numbered and stored text content into word-frequency vectors and store them into HDFS. This step performs the vectorization operation.
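The TF-IDF steps above can be sketched directly from the two formulas, including the 1 + |{j : t_i ∈ d_j}| smoothing. The toy segmented documents are assumptions for illustration.

```python
# Minimal TF-IDF sketch following the formulas above:
# tf = n_ij / sum_k n_kj,  idf = log(|D| / (1 + df)).
import math
from collections import Counter

docs = [
    ["china", "economy", "growth", "economy"],
    ["china", "sports", "match"],
    ["market", "economy", "stock"],
]

vocab = sorted({w for d in docs for w in d})
D = len(docs)
# Document frequency per word, with the 1+df smoothing from the text.
df = {w: sum(1 for d in docs if w in d) for w in vocab}
idf = {w: math.log(D / (1 + df[w])) for w in vocab}

def tfidf_vector(doc):
    counts = Counter(doc)
    total = sum(counts.values())
    return [counts[w] / total * idf[w] for w in vocab]

vec = tfidf_vector(docs[0])
print(len(vec) == len(vocab))  # True
```

A word appearing in most documents (here "china", with df = 2 out of 3) gets idf = log(3/3) = 0, so it contributes nothing to the vector, which is the intended down-weighting of common words.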
S004: then perform training result feedback: extract a test training sample from the word-frequency vectors for training result feedback, and, based on naive Bayes training, use the Hadoop MapReduce training process to cut the test training sample into multiple map tasks and finally collect and arrange the results of the map tasks in the reduce process. Specifically, in this process, the extracted test training sample is 10% of the word-frequency vectors; the training process is then based on naive Bayes, whose operating logic is as follows:

P(B[i] | A) = P(B[i]) · P(A | B[i]) / { P(B[1]) · P(A | B[1]) + P(B[2]) · P(A | B[2]) + ... + P(B[n]) · P(A | B[n]) }
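The Bayes formula above can be computed directly once priors and likelihoods are available; the class names and probability values below are illustrative assumptions, not trained quantities.

```python
# Sketch of the posterior above:
# P(B_i|A) = P(B_i) * P(A|B_i) / sum_j P(B_j) * P(A|B_j).
priors = {"sports": 0.5, "finance": 0.3, "auto": 0.2}          # P(B[i])
likelihoods = {"sports": 0.10, "finance": 0.40, "auto": 0.05}  # P(A | B[i])

evidence = sum(priors[c] * likelihoods[c] for c in priors)     # denominator
posterior = {c: priors[c] * likelihoods[c] / evidence for c in priors}

best = max(posterior, key=posterior.get)
print(best)  # finance
print(abs(sum(posterior.values()) - 1.0) < 1e-12)  # True
```

In the method, A would be the observed word-frequency vector of a text and B[i] the candidate classes, with the argmax class returned as the prediction.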
In order to accelerate training and increase the training input sample capacity, the implementation of the Bayes algorithm is revised into an implementation based on Hadoop MapReduce: the training task is cut, according to certain segmentation conditions, into multiple map tasks, and the map results are finally collected and arranged in the reduce process. The training data is divided into m parts, by default m = total_size / block_size, where total_size is the overall size of the input file and block_size is the HDFS file block size, which defaults to 64 MB and can be set through the parameter dfs.block.size. In this process, the vectorized word-frequency vectors are used as the naive Bayes input condition for classification model training, and the LDA model generated by training is used for the prediction of sample data classification and the statistics of classification accuracy. The classification model training is thereby finished.
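The MapReduce split described above can be illustrated in a single process: the training samples are cut into m chunks, each map task produces partial (class, word) counts, and the reduce step merges them. Chunking by sample index stands in for HDFS block splitting, and the labeled samples are assumptions.

```python
# Single-process sketch of the map/reduce training split.
from collections import Counter

samples = [
    ("sports", ["match", "goal"]),
    ("finance", ["stock", "market"]),
    ("sports", ["goal", "team"]),
    ("finance", ["stock", "bank"]),
]

def map_task(chunk):
    """Count (class, word) pairs over one chunk of training data."""
    counts = Counter()
    for label, words in chunk:
        for w in words:
            counts[(label, w)] += 1
    return counts

def reduce_task(partials):
    """Merge the partial counts emitted by all map tasks."""
    merged = Counter()
    for c in partials:
        merged.update(c)
    return merged

m = 2  # number of splits; in the patent, m = total_size / block_size
chunks = [samples[i::m] for i in range(m)]
merged = reduce_task(map_task(c) for c in chunks)

print(merged[("finance", "stock")])  # 2
print(merged[("sports", "goal")])    # 2
```

Because the merged counts are identical regardless of how the samples are chunked, the same naive Bayes statistics come out of the distributed run as would come out of a single-machine run — which is what makes the MapReduce rewrite safe.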
Classification prediction is carried out using the classification model formed in the above embodiment. Data is crawled in real time from the network by a crawler, for example from NetEase public web pages. Word segmentation is performed on the input unsorted text according to the segmentation maintenance information of the LDA topic training model set up in the above embodiment, including synonym maintenance, stop-word maintenance, and the like. The segmented unsorted text is converted into vectorized word-frequency vectors; 10% of the word-frequency vectors are extracted as the classification input condition, and the classification prediction result is returned after naive Bayes training.
A text classification model training system based on the LDA algorithm is also disclosed in one embodiment of the present invention. The system employs all the processes and steps of the text classification model training method based on the LDA algorithm, and includes:
an input unit that obtains unsorted text as training data; and
a storage medium for executing the following instructions:
setting up an LDA topic training model with Chinese word segmentation, performing word segmentation on the unsorted text to obtain the distribution information of the configured topics and the word distribution in each unsorted text, training the segmented unsorted text according to the LDA topic training model to generate a preliminary classification file, and saving it;
processing the preliminary classification file with the TF-IDF algorithm, converting it into word-frequency vectors, and storing them into HDFS;
extracting a test training sample from the word-frequency vectors for training result feedback, and, based on naive Bayes training, using the Hadoop MapReduce training process to cut the test training sample into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process.
In another embodiment, processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering all the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered and stored text content into word-frequency vectors and storing them into HDFS.
In another preferred embodiment, the test training sample is 10% of the word-frequency vectors.
In order to save data storage space, reduce the time needed to load data into memory, and shorten the preprocessing time for large amounts of data, when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
In another preferred embodiment, before the preliminary classification file is saved to the distributed storage system HDFS, the method further includes: screening the preliminary classification file, and defining new classification information in a directed manner and storing it to the HDFS system.
Through the method and system described in the above embodiments, the preliminary screening of the LDA topic model combined with a manual re-selection function unifies and cleans a small amount of unsorted text content, ensuring the independence between attributes and improving the accuracy of text classification prediction.
Replacing traditional disk storage with a storage scheme based on the distributed file system HDFS as the medium ensures the safety of the data and reduces the time needed to load data into memory. The MapReduce distributed computing framework based on Hadoop offers better scalability and fault tolerance than a single machine, can load larger sample sizes, and saves program running time.
It should be appreciated that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is merely for the sake of clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.
The series of detailed descriptions listed above are only specific illustrations of feasible embodiments of the invention; they are not intended to limit the protection scope of the invention, and all equivalent implementations or changes made without departing from the technical spirit of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A text classification model training method based on the LDA algorithm, characterized by including the following steps:
obtaining unsorted text as training data;
setting up an LDA topic training model with Chinese word segmentation, performing word segmentation on the unsorted text to obtain the distribution information of the configured topics and the word distribution in each unsorted text, training the segmented unsorted text according to the LDA topic training model to generate a preliminary classification file, and saving it;
processing the preliminary classification file with the TF-IDF algorithm, converting it into word-frequency vectors, and storing them into HDFS;
extracting a test training sample from the word-frequency vectors for training result feedback, and, based on naive Bayes training, using the Hadoop MapReduce training process to cut the test training sample into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process;
whereupon the classification model training is finished.
2. The text classification model training method based on the LDA algorithm according to claim 1, characterized in that processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering all the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered and stored text content into word-frequency vectors and storing them into HDFS.
3. The text classification model training method based on the LDA algorithm according to claim 1 or 2, characterized in that the test training sample is 10% of the word-frequency vectors.
4. The text classification model training method based on the LDA algorithm according to claim 1 or 2, characterized in that when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
5. The text classification model training method based on the LDA algorithm according to any one of claims 1 to 4, characterized by further including, before the TF-IDF processing:
screening the preliminary classification file, and defining new classification information in a directed manner.
6. A text classification model training system based on the LDA algorithm, characterized by including:
an input unit that obtains unsorted text as training data; and
a storage medium for executing the following instructions:
setting up an LDA topic training model with Chinese word segmentation, performing word segmentation on the unsorted text to obtain the distribution information of the configured topics and the word distribution in each unsorted text, training the segmented unsorted text according to the LDA topic training model to generate a preliminary classification file, and saving it;
processing the preliminary classification file with the TF-IDF algorithm, converting it into word-frequency vectors, and storing them into HDFS;
extracting a test training sample from the word-frequency vectors for training result feedback, and, based on naive Bayes training, using the Hadoop MapReduce training process to cut the test training sample into multiple map tasks and finally collecting and arranging the results of the map tasks in the reduce process.
7. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that processing the saved content with the TF-IDF algorithm specifically includes the following steps:
numbering all the words after word segmentation and storing the numbered text content;
counting the frequency with which each word occurs and obtaining the inverse document frequency of each word;
converting the numbered and stored text content into word-frequency vectors and storing them into HDFS.
8. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that the test training sample is 10% of the word-frequency vectors.
9. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that when the preliminary classification file is saved to the distributed storage system HDFS, the preliminary classification file is binary-serialized and its content is compressed before storage.
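The serialize-then-compress step of claim 9 can be sketched with Python's standard library; the dict content is an assumption (the patent does not specify the file's internal structure), and in-memory bytes stand in for the HDFS write:

```python
import gzip
import pickle

# Stand-in for the preliminary classification file's content.
preliminary_classification = {"doc_1": [("topic_0", 0.8), ("topic_1", 0.2)]}

# Claim 9: binary serialization, then compression, before storing to HDFS.
raw = pickle.dumps(preliminary_classification)   # binary serialization
compressed = gzip.compress(raw)                  # compression before storage

# Round trip: decompress and deserialize to recover the original content.
restored = pickle.loads(gzip.decompress(compressed))
assert restored == preliminary_classification
```

Compressing after serialization reduces both HDFS storage and network transfer, at the cost of a decompress-and-deserialize step on every read.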
10. The text classification model training system based on the LDA algorithm according to claim 6, characterized in that saving the preliminary classification file to the distributed storage system HDFS comprises:
screening the preliminary classification file, defining new classification information in a targeted manner, and storing it to the HDFS system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810535046.1A CN108920508A (en) | 2018-05-29 | 2018-05-29 | Textual classification model training method and system based on LDA algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108920508A true CN108920508A (en) | 2018-11-30 |
Family
ID=64411043
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810535046.1A Pending CN108920508A (en) | 2018-05-29 | 2018-05-29 | Textual classification model training method and system based on LDA algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108920508A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109977119A (en) * | 2019-03-25 | 2019-07-05 | 浙江大学 | Data classification and storage method for bioelectronics mixing man-made organ system |
CN110222179A (en) * | 2019-05-28 | 2019-09-10 | 深圳市小赢信息技术有限责任公司 | A kind of address list file classification method, device and electronic equipment |
CN111695020A (en) * | 2020-06-15 | 2020-09-22 | 广东工业大学 | Hadoop platform-based information recommendation method and system |
CN112580355A (en) * | 2020-12-30 | 2021-03-30 | 中科院计算技术研究所大数据研究院 | News information topic detection and real-time aggregation method |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103810293A (en) * | 2014-02-28 | 2014-05-21 | 广州云宏信息科技有限公司 | Text classification method and device based on Hadoop |
CN105912525A (en) * | 2016-04-11 | 2016-08-31 | 天津大学 | Sentiment classification method for semi-supervised learning based on theme characteristics |
CN105975478A (en) * | 2016-04-09 | 2016-09-28 | 北京交通大学 | Word vector analysis-based online article belonging event detection method and device |
US20170056764A1 (en) * | 2015-08-31 | 2017-03-02 | Omniscience Corporation | Event categorization and key prospect identification from storylines |
CN106844424A (en) * | 2016-12-09 | 2017-06-13 | 宁波大学 | A kind of file classification method based on LDA |
CN107943824A (en) * | 2017-10-17 | 2018-04-20 | 广东广业开元科技有限公司 | A kind of big data news category method, system and device based on LDA |
- 2018-05-29: Application CN201810535046.1A filed (publication CN108920508A/en), legal status: active, Pending
Non-Patent Citations (2)
Title |
---|
HU Xiaodong, GAO Jiawei: "Naive Bayes text classification algorithm based on MapReduce for big data", Bulletin of Science and Technology (《科技通报》) * |
DONG Shuai: "Research on text classification algorithms based on semi-supervised learning", China Masters' Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Vysotska et al. | Web Content Support Method in Electronic Business Systems. | |
Ma et al. | An ontology-based text-mining method to cluster proposals for research project selection | |
CA3033859C (en) | Method and system for automatically extracting relevant tax terms from forms and instructions | |
CN108920508A (en) | Textual classification model training method and system based on LDA algorithm | |
Hussain et al. | Approximation of COSMIC functional size to support early effort estimation in Agile | |
Basiri et al. | Sentiment prediction based on dempster‐shafer theory of evidence | |
CN110458324A (en) | Calculation method, device and the computer equipment of risk probability | |
Chyrun et al. | Content monitoring method for cut formation of person psychological state in social scoring | |
CN110442872B (en) | Text element integrity checking method and device | |
Zhang et al. | A multiclassification model of sentiment for E-commerce reviews | |
CN113312480A (en) | Scientific and technological thesis level multi-label classification method and device based on graph convolution network | |
KR20230052609A (en) | Review analysis system using machine reading comprehension and method thereof | |
US20220004718A1 (en) | Ontology-Driven Conversational Interface for Data Analysis | |
CN112307336A (en) | Hotspot information mining and previewing method and device, computer equipment and storage medium | |
Habbat et al. | Topic modeling and sentiment analysis with lda and nmf on moroccan tweets | |
Bonfitto et al. | Semi-automatic column type inference for CSV table understanding | |
Bhole et al. | Extracting named entities and relating them over time based on Wikipedia | |
Lin et al. | Ensemble making few-shot learning stronger | |
CN116541517A (en) | Text information processing method, apparatus, device, software program, and storage medium | |
Han et al. | An evidence-based credit evaluation ensemble framework for online retail SMEs | |
CN110222179A (en) | A kind of address list file classification method, device and electronic equipment | |
Gillmann et al. | Quantification of Economic Uncertainty: a deep learning approach | |
CN114861655A (en) | Data mining processing method, system and storage medium | |
Anastasopoulos et al. | Computational text analysis for public management research: An annotated application to county budgets | |
KR20230059364A (en) | Public opinion poll system using language model and method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181130 |