CN105787004A - Text classification method and device - Google Patents
- Publication number
- CN105787004A (application CN201610096316.4A)
- Authority
- CN
- China
- Prior art keywords
- classification
- characteristic vector
- word
- text
- training text
- Prior art date
- Legal status
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention provides a text classification method and device. The method comprises: obtaining multiple training texts, wherein the multiple training texts belong to multiple categories; determining a feature vector for each category according to the training texts that the category contains; performing a dimensionality-reduction operation on the feature vector of each category; calculating, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category; and assigning the text to be classified to the category with the maximum probability. Because the feature vector of each category undergoes dimensionality reduction after it is determined, each category is simplified and the efficiency of text classification can be improved.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a text classification method and device.
Background technology
With the rapid development of computer technology, obtaining targeted information from vast resource repositories has become a basic need of modern society, and the information-processing technologies this relies on have become indispensable tools. Automatic text classification refers to the process of automatically deciding, given a set of predefined categories, which category a text to be classified belongs to according to its content. Because a text is composed of many words, the number of words in the resulting corpus is considerable; when a text is represented as a vector, its dimensionality is therefore huge, which degrades computational performance. There is thus a need for a text classification method that improves computational performance.
Summary of the invention
Embodiments of the present invention provide a text classification method and device to reduce the vector dimensionality of texts.
In a first aspect, an embodiment of the present invention provides a text classification method, comprising:
obtaining multiple training texts, wherein the multiple training texts belong to multiple categories;
determining a feature vector for each category according to the training texts that the category contains;
performing a dimensionality-reduction operation on the feature vector of each category;
calculating, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category;
assigning the text to be classified to the category with the maximum probability.
Preferably, determining the feature vector of each category comprises:
for each training text in each category, performing the following operations: performing word segmentation on the training text, and calculating the feature weight of each word in the training text;
forming the feature vector of the category from the feature weights of all the words contained in the category.
Preferably, calculating the feature weight of each word in the current training text comprises:
calculating the feature weight w(t, d_i) of word t in training text d_i by the following formula:

w(t, d_i) = \frac{tf(t, d_i) \cdot \log(N / n_t)}{\sqrt{\sum_{s \in d_i} \left[ tf(s, d_i) \cdot \log(N / n_s) \right]^2}}

where tf(t, d_i) characterizes the word frequency of word t in training text d_i, N characterizes the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
Preferably, performing the dimensionality-reduction operation on the feature vector of each category comprises:
performing the following operations on the feature vector of each category: calculating the mutual information A between the word corresponding to each feature and the category, and selecting, from the features of the category, the predetermined number of features with the largest mutual information to form the category, thereby realizing the dimensionality reduction of the feature vector of the category; wherein

P(W \mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D1|} N(W_s, d_i)}

P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D2|} N(W_s, d_i)}

A(W, C_j) = \log \frac{P(W \mid C_j)}{P(W)}

where P(W | C_j) is the proportion in which word W appears in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the total number of words, and the double sum in the denominator of the first formula is the total word frequency of all words in category C_j; |D2| is the number of training texts of all categories together, and the double sum in the denominator of the second formula is the total word frequency of all words in all categories.
Preferably, calculating the probability that the text to be classified belongs to each category comprises:
calculating the probability that the text d_m to be classified belongs to category C_j by the following formula:

P(C_j \mid d_m) = \frac{P(C_j) \prod_{k=1}^{n} P(W_k \mid C_j)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r) \prod_{k=1}^{n} P(W_k \mid C_r)^{N(W_k, d_m)}}

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in the text d_m, and n is the total number of feature words.
In a second aspect, an embodiment of the present invention provides a text classification device, comprising:
an acquiring unit, configured to obtain multiple training texts, wherein the multiple training texts belong to multiple categories, and to send the obtained training texts to a determining unit;
the determining unit, configured to determine the feature vector of each category according to the training texts that the category contains, and to send the feature vector of each category to a dimensionality-reduction unit;
the dimensionality-reduction unit, configured to perform a dimensionality-reduction operation on the feature vector of each category, and to send each category after the dimensionality-reduction operation to a computing unit;
the computing unit, configured to calculate, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category, and to send the calculated probabilities to an allocating unit;
the allocating unit, configured to assign the text to be classified to the category with the maximum probability.
Preferably, the determining unit is specifically configured to perform, for each training text in each category, the following operations: performing word segmentation on the training text; calculating the feature weight of each word in the training text; and forming the feature vector of the category from the feature weights of all the words contained in the category.
Preferably, the determining unit is specifically configured to calculate the feature weight w(t, d_i) of word t in training text d_i by the following formula:

w(t, d_i) = \frac{tf(t, d_i) \cdot \log(N / n_t)}{\sqrt{\sum_{s \in d_i} \left[ tf(s, d_i) \cdot \log(N / n_s) \right]^2}}

where tf(t, d_i) characterizes the word frequency of word t in training text d_i, N characterizes the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
Preferably, the dimensionality-reduction unit is specifically configured to perform the following operations on the feature vector of each category: calculating the mutual information A between the word corresponding to each feature and the category, and selecting, from the features of the category, the predetermined number of features with the largest mutual information to form the category, thereby realizing the dimensionality reduction of the feature vector of the category; wherein

P(W \mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D1|} N(W_s, d_i)}

P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D2|} N(W_s, d_i)}

A(W, C_j) = \log \frac{P(W \mid C_j)}{P(W)}

where P(W | C_j) is the proportion in which word W appears in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the total number of words, and the double sum in the denominator of the first formula is the total word frequency of all words in category C_j; |D2| is the number of training texts of all categories together, and the double sum in the denominator of the second formula is the total word frequency of all words in all categories.
Preferably, the computing unit is specifically configured to calculate the probability that the text d_m to be classified belongs to category C_j by the following formula:

P(C_j \mid d_m) = \frac{P(C_j) \prod_{k=1}^{n} P(W_k \mid C_j)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r) \prod_{k=1}^{n} P(W_k \mid C_r)^{N(W_k, d_m)}}

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in the text d_m, and n is the total number of feature words.
Embodiments of the present invention provide a text classification method and device in which, after the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.
Accompanying drawing explanation
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method provided by an embodiment of the present invention;
Fig. 3 is a hardware architecture diagram of the equipment in which a device provided by an embodiment of the present invention resides;
Fig. 4 is a schematic structural diagram of a device provided by an embodiment of the present invention.
Detailed description of the invention
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a text classification method, which may comprise the following steps:
Step 101: obtain multiple training texts, wherein the multiple training texts belong to multiple categories;
Step 102: determine a feature vector for each category according to the training texts that the category contains;
Step 103: perform a dimensionality-reduction operation on the feature vector of each category;
Step 104: calculate, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category;
Step 105: assign the text to be classified to the category with the maximum probability.
According to the scheme provided by the above embodiment, after the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.
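As an illustration only, steps 101 to 105 can be sketched in Python as follows. All function and variable names are hypothetical, and for brevity the dimensionality reduction here keeps the most frequent words per category, whereas the detailed embodiments rank features by mutual information:

```python
import math
from collections import Counter, defaultdict

def classify_text(train_texts, train_labels, new_text, k=2000):
    # Steps 101-102: group the training texts by category and build one
    # word-frequency profile per category (words serve as features).
    by_class = defaultdict(Counter)
    for text, label in zip(train_texts, train_labels):
        by_class[label].update(text.split())

    # Step 103: dimensionality reduction - keep only k features per
    # category (frequency stands in for mutual-information ranking).
    reduced = {c: dict(cnt.most_common(k)) for c, cnt in by_class.items()}

    # Steps 104-105: score the new text against each reduced category and
    # assign it to the most probable one (a naive-Bayes-style log score
    # with add-one smoothing).
    doc = Counter(new_text.split())
    def log_score(c):
        total = sum(reduced[c].values())
        vocab = len(reduced[c])
        return sum(n * math.log((reduced[c].get(w, 0) + 1) / (total + vocab + 1))
                   for w, n in doc.items())
    return max(reduced, key=log_score)

label = classify_text(
    ["cat pet fur", "cat pet meow", "stock price market", "stock trade market"],
    ["animal", "animal", "finance", "finance"],
    "cat fur pet")
```

Each stage of this sketch is refined by the embodiments described below.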
In an embodiment of the present invention, the feature vector of each category is an important parameter for deciding whether a new text belongs to that category. Determining the feature vector of each category can be implemented as follows: for each training text in each category, perform the following operations: perform word segmentation on the training text, and calculate the feature weight of each word in the training text; then form the feature vector of the category from the feature weights of all the words contained in the category.
To make the objective, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 2, an embodiment of the present invention provides a text classification method, which may comprise the following steps:
Step 201: obtain multiple training texts, wherein the multiple training texts belong to multiple categories.
In this embodiment, to determine the category of a new text, multiple categories must first be defined, each containing multiple training texts; the training texts contained in each category serve as reference samples for deciding whether a new text belongs to that category.
Step 202: for each training text in each category, perform word segmentation, remove stop words, merge words such as numerals and person names, and count word frequencies.
In this embodiment, some words in the training samples carry no real meaning for classification, such as stop words, and need to be removed. The literal weights of other words, such as numerals or person names, may lead to classification errors, so these words are merged so that their weights can be calculated in a practically meaningful way.
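Step 202 can be sketched as follows. The stop-word list, the `<NUM>` placeholder token, and the whitespace tokenizer are stand-ins: an actual implementation of Chinese word segmentation and of name merging would replace them.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is"}  # placeholder stop-word list

def preprocess(text):
    # Word segmentation: for Chinese text a real segmenter would be used
    # here; whitespace splitting stands in for it.
    tokens = text.lower().split()
    normalized = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue                    # remove stop words
        if tok.isdigit():
            normalized.append("<NUM>")  # merge all numerals into one token
        else:
            normalized.append(tok)
    return Counter(normalized)          # per-text word frequencies

counts = preprocess("the price is 42 and 17 dollars")
```

The returned frequency table feeds the feature-weight calculation of step 203.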
Step 203: for each category, calculate the feature weight of each word in each training text.
In this embodiment, the feature weight w(t, d_i) of word t in training text d_i can be calculated by the following formula (1):

w(t, d_i) = \frac{tf(t, d_i) \cdot \log(N / n_t)}{\sqrt{\sum_{s \in d_i} \left[ tf(s, d_i) \cdot \log(N / n_s) \right]^2}}    (1)

where tf(t, d_i) characterizes the word frequency of word t in training text d_i, N characterizes the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
Step 204: form the feature vector of each category from the feature weights of all the words contained in the category.
The feature vector of each category thus consists of the feature weights of the words that occur in that category.
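Steps 203 and 204 can be sketched as follows. The exact form of formula (1) is taken here to be a standard TF-IDF weight, tf(t, d_i) · log(N / n_t), divided by a cosine normalization factor; this is an assumption consistent with the variables the description defines (tf, N, n_t, and a normalizing denominator):

```python
import math
from collections import Counter

def feature_weights(class_texts):
    """class_texts: one token list per training text in the current category."""
    N = len(class_texts)             # total training texts in the category
    docs = [Counter(toks) for toks in class_texts]
    n_t = Counter()                  # number of texts containing each word
    for d in docs:
        n_t.update(d.keys())

    weights = []
    for d in docs:
        # tf(t, d_i) * log(N / n_t), then divide by a normalization factor
        raw = {t: tf * math.log(N / n_t[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weights.append({t: v / norm for t, v in raw.items()})
    return weights

w = feature_weights([["cat", "cat", "pet"], ["dog", "pet"]])
```

Collecting the weights of all words of a category yields that category's feature vector as in step 204.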
Step 205: for each category, calculate the mutual information A between the word corresponding to each feature and the category.
In this embodiment, the mutual information A can be calculated by the following formulas (2), (3), and (4):

P(W \mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D1|} N(W_s, d_i)}    (2)

P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D2|} N(W_s, d_i)}    (3)

A(W, C_j) = \log \frac{P(W \mid C_j)}{P(W)}    (4)

where P(W | C_j) is the proportion in which word W appears in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the total number of words, and the double sum in the denominator of formula (2) is the total word frequency of all words in category C_j; |D2| is the number of training texts of all categories together, and the double sum in the denominator of formula (3) is the total word frequency of all words in all categories.
Step 206: select, from the features of each category, the predetermined number of features with the largest mutual information to form the category, thereby realizing the dimensionality reduction of the feature vector of the category.
Because each category contains a large number of words, the dimensionality of its feature vector is high; to reduce it, the features with large mutual information are selected to form the category.
The predetermined number may typically be set to a few thousand, for example 2000. It can also be tuned to an optimum according to experimental tests and statistics.
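Steps 205 and 206 can be sketched as follows, assuming the mutual information of formulas (2) to (4) takes the common form A = log(P(W | C_j) / P(W)) with add-one smoothing (the figures carrying the exact formulas are not reproduced in this text):

```python
import math
from collections import Counter

def select_by_mutual_information(class_counts, all_counts, k):
    """class_counts: word frequencies within one category C_j;
    all_counts: word frequencies over every category;
    k: the predetermined number of features to keep."""
    class_total = sum(class_counts.values())
    all_total = sum(all_counts.values())
    V = len(all_counts)               # vocabulary size, used for smoothing

    def mutual_information(w):
        p_w_given_c = (class_counts.get(w, 0) + 1) / (class_total + V)  # (2)
        p_w = (all_counts[w] + 1) / (all_total + V)                     # (3)
        return math.log(p_w_given_c / p_w)                              # (4)

    # keep the k features whose mutual information with C_j is largest
    return sorted(class_counts, key=mutual_information, reverse=True)[:k]

kept = select_by_mutual_information(
    Counter({"cat": 10, "market": 1}),
    Counter({"cat": 11, "market": 20}),
    k=1)
```

Words concentrated in the category ("cat" here) score above words spread evenly across categories, which is the intended selection behavior.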
Step 207: calculate, according to each category after the dimensionality-reduction operation, the probability that the text to be classified belongs to each category.
In this embodiment, the probability that the text d_m to be classified belongs to category C_j can be calculated by the following formula:

P(C_j \mid d_m) = \frac{P(C_j) \prod_{k=1}^{n} P(W_k \mid C_j)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r) \prod_{k=1}^{n} P(W_k \mid C_r)^{N(W_k, d_m)}}

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in the text d_m, and n is the total number of feature words.
Step 208: assign the text to be classified to the category with the maximum probability.
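Steps 207 and 208 can be sketched as follows, assuming the probability takes the usual multinomial naive Bayes form, P(C_j | d_m) ∝ P(C_j) · Π_k P(W_k | C_j)^N(W_k, d_m); the shared denominator cancels when taking the maximum, so only the numerators need to be compared:

```python
import math

def most_probable_category(doc_counts, class_word_probs, class_priors):
    """doc_counts: N(W_k, d_m), frequency of each feature word in the text;
    class_word_probs: per category C_j, a map word -> P(W_k | C_j);
    class_priors: per category, P(C_j)."""
    log_scores = {}
    for c, probs in class_word_probs.items():
        s = math.log(class_priors[c])
        for w, n in doc_counts.items():
            # each feature contributes P(W_k | C_j) ** N(W_k, d_m); summing
            # logs avoids underflow; unseen words get a small floor value
            s += n * math.log(probs.get(w, 1e-9))
        log_scores[c] = s
    # step 208: assign the text to the category with the maximum probability
    return max(log_scores, key=log_scores.get)

best = most_probable_category(
    {"cat": 2, "pet": 1},
    {"animal": {"cat": 0.5, "pet": 0.3}, "finance": {"stock": 0.6}},
    {"animal": 0.5, "finance": 0.5})
```

The word probabilities and priors would come from the dimensionality-reduced categories produced in steps 201 to 206.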
As shown in Figs. 3 and 4, an embodiment of the present invention provides a text classification device. The device embodiment may be implemented in software, in hardware, or in a combination of software and hardware. From the hardware perspective, Fig. 3 is a hardware architecture diagram of the equipment in which the text classification device resides; besides the processor, memory, network interface, and nonvolatile storage shown in Fig. 3, the equipment may also include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in Fig. 4, the device in the logical sense is formed by the CPU of the equipment reading the corresponding computer program instructions from nonvolatile storage into memory and running them. The text classification device provided by this embodiment comprises:
an acquiring unit 401, configured to obtain multiple training texts, wherein the multiple training texts belong to multiple categories, and to send the obtained training texts to a determining unit 402;
the determining unit 402, configured to determine the feature vector of each category according to the training texts that the category contains, and to send the feature vector of each category to a dimensionality-reduction unit 403;
the dimensionality-reduction unit 403, configured to perform a dimensionality-reduction operation on the feature vector of each category, and to send each category after the dimensionality-reduction operation to a computing unit 404;
the computing unit 404, configured to calculate, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category, and to send the calculated probabilities to an allocating unit 405;
the allocating unit 405, configured to assign the text to be classified to the category with the maximum probability.
Wherein the determining unit 402 is specifically configured to perform, for each training text in each category, the following operations: performing word segmentation on the training text; calculating the feature weight of each word in the training text; and forming the feature vector of the category from the feature weights of all the words contained in the category.
Wherein the determining unit 402 is specifically configured to calculate the feature weight w(t, d_i) of word t in training text d_i by the following formula:

w(t, d_i) = \frac{tf(t, d_i) \cdot \log(N / n_t)}{\sqrt{\sum_{s \in d_i} \left[ tf(s, d_i) \cdot \log(N / n_s) \right]^2}}

where tf(t, d_i) characterizes the word frequency of word t in training text d_i, N characterizes the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
Wherein the dimensionality-reduction unit 403 is specifically configured to perform the following operations on the feature vector of each category: calculating the mutual information A between the word corresponding to each feature and the category, and selecting, from the features of the category, the predetermined number of features with the largest mutual information to form the category, thereby realizing the dimensionality reduction of the feature vector of the category; wherein

P(W \mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D1|} N(W_s, d_i)}

P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D2|} N(W_s, d_i)}

A(W, C_j) = \log \frac{P(W \mid C_j)}{P(W)}

where P(W | C_j) is the proportion in which word W appears in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the total number of words, and the double sum in the denominator of the first formula is the total word frequency of all words in category C_j; |D2| is the number of training texts of all categories together, and the double sum in the denominator of the second formula is the total word frequency of all words in all categories.
Wherein the computing unit 404 is specifically configured to calculate the probability that the text d_m to be classified belongs to category C_j by the following formula:

P(C_j \mid d_m) = \frac{P(C_j) \prod_{k=1}^{n} P(W_k \mid C_j)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r) \prod_{k=1}^{n} P(W_k \mid C_r)^{N(W_k, d_m)}}

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in the text d_m, and n is the total number of feature words.
In summary, the embodiments of the present invention can achieve at least the following beneficial effects:
1. In the embodiments of the present invention, after the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.
2. In the embodiments of the present invention, texts are represented as vectors with words as features, and the resulting feature vectors are given a reduced-dimension representation, which preserves the important information of the texts while facilitating subsequent computation. Classification criteria are summarized through training, and new texts are then classified automatically according to those criteria. By reserving an interface in the cloud, a secure and controllable open API service can be provided externally.
3. In the embodiments of the present invention, after large-scale text information is effectively classified, a targeted personal search engine can be built, improving the precision of the system and allowing users to retrieve target information quickly and effectively.
Since the information exchange between the units of the above device and their execution processes are based on the same concept as the method embodiments of the present invention, reference may be made to the description of the method embodiments for details, which are not repeated here.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relation or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or piece of equipment that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or equipment. Without further limitation, an element defined by the statement "including a ..." does not exclude the existence of other identical elements in the process, method, article, or equipment that includes that element.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be completed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical discs.
Finally, it should be understood that the above are merely preferred embodiments of the present invention, intended only to illustrate its technical solutions and not to limit its protection scope. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A text classification method, characterized by comprising:
obtaining multiple training texts, wherein the multiple training texts belong to multiple categories;
determining a feature vector for each category according to the training texts that the category contains;
performing a dimensionality-reduction operation on the feature vector of each category;
calculating, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category;
assigning the text to be classified to the category with the maximum probability.
2. The method according to claim 1, characterized in that determining the feature vector of each category comprises:
for each training text in each category, performing the following operations: performing word segmentation on the training text, and calculating the feature weight of each word in the training text;
forming the feature vector of the category from the feature weights of all the words contained in the category.
3. The method according to claim 2, characterized in that calculating the feature weight of each word in the current training text comprises:
calculating the feature weight w(t, d_i) of word t in training text d_i by the following formula:

w(t, d_i) = \frac{tf(t, d_i) \cdot \log(N / n_t)}{\sqrt{\sum_{s \in d_i} \left[ tf(s, d_i) \cdot \log(N / n_s) \right]^2}}

where tf(t, d_i) characterizes the word frequency of word t in training text d_i, N characterizes the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
4. The method according to claim 1, characterized in that performing the dimensionality-reduction operation on the feature vector of each category comprises:
performing the following operations on the feature vector of each category: calculating the mutual information A between the word corresponding to each feature and the category, and selecting, from the features of the category, the predetermined number of features with the largest mutual information to form the category, thereby realizing the dimensionality reduction of the feature vector of the category; wherein

P(W \mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D1|} N(W_s, d_i)}

P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D2|} N(W_s, d_i)}

A(W, C_j) = \log \frac{P(W \mid C_j)}{P(W)}

where P(W | C_j) is the proportion in which word W appears in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the total number of words, and the double sum in the denominator of the first formula is the total word frequency of all words in category C_j; |D2| is the number of training texts of all categories together, and the double sum in the denominator of the second formula is the total word frequency of all words in all categories.
5. The method according to any one of claims 1 to 4, characterized in that calculating the probability that the text to be classified belongs to each category comprises:
calculating the probability that the text d_m to be classified belongs to category C_j by the following formula:

P(C_j \mid d_m) = \frac{P(C_j) \prod_{k=1}^{n} P(W_k \mid C_j)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r) \prod_{k=1}^{n} P(W_k \mid C_r)^{N(W_k, d_m)}}

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in the text d_m, and n is the total number of feature words.
6. A text classification device, characterized by comprising:
an acquiring unit, configured to obtain multiple training texts, wherein the multiple training texts belong to multiple categories, and to send the obtained training texts to a determining unit;
the determining unit, configured to determine the feature vector of each category according to the training texts that the category contains, and to send the feature vector of each category to a dimensionality-reduction unit;
the dimensionality-reduction unit, configured to perform a dimensionality-reduction operation on the feature vector of each category, and to send each category after the dimensionality-reduction operation to a computing unit;
the computing unit, configured to calculate, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category, and to send the calculated probabilities to an allocating unit;
the allocating unit, configured to assign the text to be classified to the category with the maximum probability.
7. The text classification device according to claim 6, characterized in that the determining unit is specifically configured to perform, for each training text in each category, the following operations: performing word segmentation on the training text; calculating the feature weight of each word in the training text; and forming the feature vector of the category from the feature weights of all the words contained in the category.
8. The text classification device according to claim 7, characterized in that the determining unit is specifically configured to calculate the feature weight w(t, d_i) of word t in training text d_i by the following formula:

w(t, d_i) = \frac{tf(t, d_i) \cdot \log(N / n_t)}{\sqrt{\sum_{s \in d_i} \left[ tf(s, d_i) \cdot \log(N / n_s) \right]^2}}

where tf(t, d_i) characterizes the word frequency of word t in training text d_i, N characterizes the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
9. The text classification device according to claim 6, characterized in that the dimensionality-reduction unit is specifically configured to perform the following operations on the feature vector of each category: calculating the mutual information A between the word corresponding to each feature and the category, and selecting, from the features of the category, the predetermined number of features with the largest mutual information to form the category, thereby realizing the dimensionality reduction of the feature vector of the category; wherein

P(W \mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D1|} N(W_s, d_i)}

P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D2|} N(W_s, d_i)}

A(W, C_j) = \log \frac{P(W \mid C_j)}{P(W)}

where P(W | C_j) is the proportion in which word W appears in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the total number of words, and the double sum in the denominator of the first formula is the total word frequency of all words in category C_j; |D2| is the number of training texts of all categories together, and the double sum in the denominator of the second formula is the total word frequency of all words in all categories.
10. The text classification device according to any one of claims 6 to 9, characterized in that the computing unit is specifically configured to calculate the probability that the text d_m to be classified belongs to category C_j by the following formula:

P(C_j \mid d_m) = \frac{P(C_j) \prod_{k=1}^{n} P(W_k \mid C_j)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r) \prod_{k=1}^{n} P(W_k \mid C_r)^{N(W_k, d_m)}}

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in the text d_m, and n is the total number of feature words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610096316.4A CN105787004A (en) | 2016-02-22 | 2016-02-22 | Text classification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105787004A true CN105787004A (en) | 2016-07-20 |
Family
ID=56403606
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610096316.4A Pending CN105787004A (en) | 2016-02-22 | 2016-02-22 | Text classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105787004A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101290626A (en) * | 2008-06-12 | 2008-10-22 | Kunming University of Science and Technology | Text categorization feature selection and weight computation method based on domain knowledge |
CN102930063A (en) * | 2012-12-05 | 2013-02-13 | University of Electronic Science and Technology of China | Text classification method based on feature item selection and weight calculation |
CN103593418A (en) * | 2013-10-30 | 2014-02-19 | Institute of Computing Technology, Chinese Academy of Sciences | Distributed topic discovery method and system for big data |
US20150113388A1 (en) * | 2013-10-22 | 2015-04-23 | Qualcomm Incorporated | Method and apparatus for performing topic-relevance highlighting of electronic text |
CN105159998A (en) * | 2015-09-08 | 2015-12-16 | Hainan University | Keyword calculation method based on document clustering |
- 2016-02-22: Application CN201610096316.4A filed (CN); publication CN105787004A (en), status: active, Pending
Non-Patent Citations (2)
Title |
---|
Zhao Jie: "3.7.1 Feature Selection", in Search Engine Technology * |
Chen Huifang: "Feature Vector Space Dimensionality Reduction Methods in Text Classification", China Masters' Theses Full-text Database, Information Science and Technology * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372117A (en) * | 2016-08-23 | 2017-02-01 | University of Electronic Science and Technology of China | Word co-occurrence-based text classification method and apparatus |
CN106372117B (en) * | 2016-08-23 | 2019-06-14 | University of Electronic Science and Technology of China | Text classification method and device based on word co-occurrence |
CN107329999A (en) * | 2017-06-09 | 2017-11-07 | Jiangxi University of Technology | Document classification method and device |
CN107329999B (en) * | 2017-06-09 | 2020-10-20 | Jiangxi University of Technology | Document classification method and device |
CN109284377A (en) * | 2018-09-13 | 2019-01-29 | Yunnan Power Grid Co., Ltd. | Text classification method and device based on vector space |
CN109408636A (en) * | 2018-09-29 | 2019-03-01 | New H3C Big Data Technologies Co., Ltd. | Text classification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107844559A (en) | Text classification method, device and electronic device | |
US20220147023A1 (en) | Method and device for identifying industry classification of enterprise and particular pollutants of enterprise | |
CN106033416A (en) | A string processing method and device | |
CN105787004A (en) | Text classification method and device | |
CN110334356A (en) | Article quality determination method, article screening method, and corresponding devices | |
CN108241867B (en) | Classification method and device | |
CN109918658A (en) | Method and system for obtaining target vocabulary from text | |
CN106991090A (en) | Analysis method and device for public opinion event entities | |
US11288266B2 (en) | Candidate projection enumeration based query response generation | |
CN113934848B (en) | Data classification method and device and electronic equipment | |
CN107908649B (en) | Text classification control method | |
CN106997340A (en) | Dictionary generation, and document classification method and device using the dictionary | |
CN114611850A (en) | Service analysis method and device and electronic equipment | |
CN112818110A (en) | Text filtering method, text filtering equipment and computer storage medium | |
CN110019556A (en) | Topic news acquisition method, device and equipment | |
CN115563268A (en) | Text abstract generation method and device, electronic equipment and storage medium | |
CN110895703A (en) | Legal document routing identification method and device | |
CN105677677A (en) | Information classification method and device | |
CN105512145A (en) | Method and device for information classification | |
CN108520012A (en) | Machine learning-based mining method for mobile Internet user comments | |
JP5310196B2 (en) | Classification system revision support program, classification system revision support device, and classification system revision support method | |
CN114021716A (en) | Model training method and system and electronic equipment | |
CN110321435B (en) | Data source dividing method, device, equipment and storage medium | |
CN113468339A (en) | Label extraction method, system, electronic device and medium based on knowledge graph | |
Zou et al. | An improved model for spam user identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20160720 |