CN106294689B - Method and apparatus for dimensionality reduction based on text feature selection
- Publication number: CN106294689B
- Application number: CN201610639904.8A
- Authority: CN (China)
- Prior art keywords: text, feature, frequency, probability, vector
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/355: Information retrieval of unstructured textual data; clustering; classification; class or cluster creation or modification
- G06F16/335: Information retrieval of unstructured textual data; querying; filtering based on additional data, e.g. user or group profiles
- G06F16/36: Information retrieval of unstructured textual data; creation of semantic tools, e.g. ontology or thesauri
Abstract
The present invention provides a method and apparatus for dimensionality reduction based on text feature selection. The method comprises the steps of: obtaining the text to be processed; segmenting it with HanLP to obtain a plurality of terms, and removing the stop words among those terms; counting the word frequency, the term document frequency, and the document word count; storing the terms, word frequency, term document frequency, and document word count to form a primary text vector; performing an information gain calculation on the primary text vector, sorting by the size of the information gain, and forming the vocabulary that meets a preset requirement into a reference vector for feature selection; and reducing the dimensionality of the text to be processed according to the reference vector to form the dimension-reduced text vector. The apparatus comprises an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimension reduction module. The method and apparatus perform text feature selection based on the information gain algorithm and apply dimension reduction to the feature word-set vector, reducing the computational burden caused by excessive dimensionality.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a method and a device for dimension reduction based on text feature selection.
Background
With the rapid development of the internet, continuous innovation in internet-related technology has transformed the cost and efficiency of society-wide informatization compared with ten or twenty years ago. In addition, the growing popularity of the internet has produced data in many different formats (text, multimedia, etc.) from many different sources. Faced with this huge volume of information, people can no longer rely on manual labor alone to process all information resources; instead, auxiliary tools are needed to help people better discover, filter, and manage these electronic information data and resources.
Traditional text processing software handled plain text files. With the emergence of numerous text formats, however, the files carrying electronic information are no longer limited to a single file type. Especially with the development of the Internet, each of these formats has shown its own advantages, and the limitations of processing systems built for a single file format have become increasingly obvious.
A text is commonly abstracted as a vector in the space of a feature word set; the original candidate feature word set, however, can run to hundreds of thousands of dimensions, and such high-dimensional text representations place a huge burden on computation.
Disclosure of Invention
The invention provides a method and a device for dimensionality reduction based on text feature selection, which aim to solve the technical problem described above.
The invention provides a method for reducing dimension based on text feature selection, which comprises the following steps:
step A, acquiring and storing detailed information of a data source text to be processed;
step B, performing word segmentation on the data source text by adopting HanLP to obtain a plurality of terms, and removing stop words from the terms;
step C, counting word frequency, term document frequency and document word number;
step D, storing the terms, the word frequency, the term document frequency and the document word number to form a primary text vector;
step E, performing an information gain calculation on the primary text vector to obtain the information gain of each term, sorting by the size of the information gain, and forming the words that meet a preset requirement into a reference vector for feature selection;
step F, reducing the dimension of the text to be processed according to the reference vector to form a text vector after dimension reduction.
Wherein the information gain calculation in step E comprises the steps of:

taking each text as a category and the terms in the text as features, and calculating the information gain according to the following formula:

$$IG(T) = -\sum_{i=1}^{N} P(C_i)\log P(C_i) + P(T)\sum_{i=1}^{N} P(C_i \mid T)\log P(C_i \mid T) + P(\bar{T})\sum_{i=1}^{N} P(C_i \mid \bar{T})\log P(C_i \mid \bar{T})$$

wherein N denotes the total number of categories, $P(C_i)$ the probability that category $C_i$ occurs, $P(T)$ the probability that feature (T) occurs, $P(\bar{T})$ the probability that feature (T) does not occur, and $P(C_i \mid T)$ the probability that a text contains feature (T) and belongs to category $C_i$.

Wherein, in step E, $P(T) = DF_T/N$, where $DF_T$ denotes the document frequency of feature (T);

and $P(T \mid C_i) = \dfrac{TF_T}{\sum_i TF_i}$, where $TF_i$ denotes the frequency of occurrence of each term and $TF_T$ the frequency of occurrence of feature (T).
the embodiment of the invention also provides a device for reducing the dimension by selecting the text type characteristics, which comprises an acquisition module, a word segmentation module, a statistic module, a vector module, an information gain calculation module and a dimension reduction module;
the acquisition module is used for acquiring and storing the detailed information of the data source text to be processed;
the word segmentation module is used for performing word segmentation on the data source text by adopting HanLP to obtain a plurality of terms and removing stop words in the terms;
the statistics module is used for counting the word frequency (the number of occurrences of each term), the term document frequency, and the document word count;
the vector module is used for storing the terms, the word frequency, the term document frequency and the document word number and forming a primary text vector;
the information gain calculation module is used for performing the information gain calculation on the primary text vector to obtain the information gain of each term and, after sorting by information gain, forming the words that meet a preset requirement into a reference vector for feature selection;
and the dimension reduction module is used for reducing the dimension of the text to be processed according to the reference vector to form a text vector after dimension reduction.
Wherein the information gain calculation module is configured to:

take each text as a category and the terms in the text as features, and calculate the information gain according to the following formula:

$$IG(T) = -\sum_{i=1}^{N} P(C_i)\log P(C_i) + P(T)\sum_{i=1}^{N} P(C_i \mid T)\log P(C_i \mid T) + P(\bar{T})\sum_{i=1}^{N} P(C_i \mid \bar{T})\log P(C_i \mid \bar{T})$$

wherein N denotes the total number of categories, $P(C_i)$ the probability that category $C_i$ occurs, $P(T)$ the probability that feature (T) occurs, $P(\bar{T})$ the probability that feature (T) does not occur, and $P(C_i \mid T)$ the probability that a text contains feature (T) and belongs to category $C_i$;

wherein $P(T) = DF_T/N$, where $DF_T$ denotes the document frequency of feature (T);

wherein $P(T \mid C_i) = \dfrac{TF_T}{\sum_i TF_i}$, where $TF_i$ denotes the frequency of occurrence of each term;
the embodiment of the invention provides a method and a device for reducing dimension based on text feature selection, which are a document feature dimension reduction processing method based on an information gain algorithm, and are used for reducing the dimension of a document feature word set and reducing the calculation load caused by hundreds of thousands of dimension feature word sets.
Drawings
FIG. 1 is a flowchart of an embodiment of the method for dimensionality reduction based on text feature selection according to the present invention;
FIG. 2 is a schematic flow chart of text feature selection according to the second embodiment of the present invention.
Detailed Description
The embodiments of the invention provide a method and a device for dimensionality reduction based on text feature selection, namely an algorithm that selects text features based on Information Gain (IG) and reduces the dimensionality of the data set by extracting the most representative and effective features from the text. Under information gain, importance is measured by how much information a feature brings to the classification system: the more information it brings, the more important the feature.
The embodiments of the invention adopt the HanLP word segmentation technique to segment the text. Its principle is to build a dictionary base large enough to contain all possible Chinese words, then check whether each Chinese character string of the text to be processed appears in the dictionary base; once a string is found, it is recognized as a word and split off, and this continues until the whole character string has been segmented. HanLP is characterized by complete functionality, high performance, a clear architecture, up-to-date corpora, and customizability. While providing rich functionality, HanLP keeps its internal modules loosely coupled, loads models lazily, provides services statically, and publishes its dictionaries in plain text, which makes it very convenient to use; it also ships with corpus-processing tools that help users train on their own corpora. Its greatest disadvantage is that segmentation accuracy depends entirely on the dictionary base, which must therefore be kept up to date.
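The dictionary-matching principle just described can be illustrated with a toy forward-maximum-matching segmenter; this is a minimal sketch for illustration only, not HanLP's actual implementation, and the dictionary and sample sentence are invented:

```python
def fmm_segment(text, dictionary, max_len=5):
    """Toy forward maximum matching: scan left to right and always take
    the longest dictionary word that starts at the current position."""
    terms = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        terms.append(match)
        i += len(match)
    return terms

# Invented dictionary and sentence, purely for illustration.
dictionary = {"机器", "学习", "机器学习", "文本", "分类"}
print(fmm_segment("机器学习文本分类", dictionary))  # ['机器学习', '文本', '分类']
```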
The information gain method measures, by the presence or absence of a feature in a text, the amount of information the feature provides about the category to which the text belongs. In filtering problems, it measures how much knowing whether a feature appears in the text relevant to a topic contributes to predicting that topic. Computing the information gain favors features that occur frequently in positive samples and rarely in negative samples. Information gain involves substantial mathematical theory and entropy formulas; the embodiments of the invention define it as the amount of information a certain feature item can provide for the classification as a whole, namely the difference between the entropy when the feature is not considered and the entropy after the feature is considered. In the embodiments, the information gain of each feature item is computed from the training data, items with small information gain are deleted, and the remaining items are sorted from largest to smallest gain and screened.
Example one
Specifically, referring to fig. 1, the method includes the steps of:
Step S110, acquiring and storing the detailed information of the data source text to be processed.
The detailed information of the data source text is acquired and stored in HDFS, keeping a backup for later inspection and data tracing.
Step S111, performing word segmentation on the data source text by adopting HanLP to obtain a plurality of terms, and removing stop words from the terms.
The effective information of a text is mainly carried by content words such as nouns, adjectives, verbs, and quantifiers, and the category a text belongs to is likewise mainly distinguished by these content words; words that appear frequently in all texts, and function words without concrete meaning, contribute almost nothing to classification. These stop words usually carry little practical meaning yet occur throughout the text. If they are not removed, two texts with completely different content may become inseparable because of the large amount of information they share; the subsequent feature selection stage is also affected, the computational cost of the system rises, and the construction of the classifier ultimately suffers. Therefore, using a stop-word lexicon, words present in the lexicon are filtered out directly after the text has been segmented, as the sketch below illustrates.
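A minimal sketch of this segmentation-plus-filtering step, assuming the pyhanlp Python binding of HanLP is available and that stopwords.txt is a hypothetical stop-word file with one word per line:

```python
from pyhanlp import HanLP  # Python binding of HanLP; requires a JDK

def load_stopwords(path="stopwords.txt"):
    # Hypothetical stop-word lexicon, one word per line.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def tokenize(text, stopwords):
    # HanLP.segment returns Term objects carrying .word (the token)
    # and .nature (its part of speech).
    return [t.word for t in HanLP.segment(text)
            if t.word.strip() and t.word not in stopwords]

stopwords = load_stopwords()
terms = tokenize("互联网的快速发展改变了信息处理的成本和效率", stopwords)
```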
Step S112, counting the word frequency, the term document frequency, and the document word count.
The text is segmented with HanLP, and the word frequency, term (Term) document frequency, and document word count are counted. The word frequency is the number of times each term occurs in a document, the term document frequency is the number of documents in which the term appears, and the document word count is the number of terms a document contains; see the statistics sketch below.
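These three statistics can be gathered in a single pass over the segmented corpus; a minimal sketch, assuming each document has already been segmented and stop-word-filtered into a list of terms:

```python
from collections import Counter

def corpus_stats(docs_terms):
    """docs_terms: one list of terms per document."""
    term_freq = Counter()   # occurrences of each term across the corpus
    doc_freq = Counter()    # number of documents containing each term
    per_doc_tf = []         # term frequencies within each document
    doc_word_count = []     # number of terms in each document
    for terms in docs_terms:
        tf = Counter(terms)
        per_doc_tf.append(tf)
        doc_word_count.append(len(terms))
        term_freq.update(tf)
        doc_freq.update(tf.keys())  # each term counted once per document
    return term_freq, doc_freq, per_doc_tf, doc_word_count
```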
Step S113, storing the terms, the term frequency, the term document frequency and the document word number and forming a primary text vector.
The terms, their occurrence frequency (word frequency), and the term document frequency are stored in an in-memory database to form the vectorized text, ready for the reads and writes of the subsequent information gain calculation.
Step S114, performing the information gain calculation on the primary text vector to obtain the information gain of each term, sorting by the size of the information gain, and forming the words that meet a preset requirement into a reference vector for feature selection.
The information gain is calculated over the text vectors, the terms are sorted by information content, and the required number of words is retained as the reference vector for feature selection; all texts are then reduced in dimension according to this reference vector to form the final dimension-reduced text vectors.
Mathematical definition of entropy: suppose there is a variable $X$ with $n$ possible values $x_1, x_2, \ldots, x_n$, taken with probabilities $P_1, P_2, \ldots, P_n$ respectively. The entropy of the variable $X$ is then defined as:

$$H(X) = -\sum_{i=1}^{n} P_i \log P_i$$
entropy of the classification system: for a classification system, class C is a variable, which may take the value C1,C2,…,CnThe probability of occurrence of each class is P (C)1),P(C2),…,P(Cn) Where n represents the number of categories. The entropy of the classification system is defined as:
wherein: p (C)i) Represents class CiProbability of occurrence, class C can be usediThe number of records (number of documents) contained is divided by the total number of records (total number of documents) for estimation. Namely:
wherein N represents the total number of records,represents class CiThe number of records contained.
Conditional entropy: suppose feature $X$ has $n$ possible values $(x_1, x_2, \ldots, x_n)$; then, given $X$, the entropy of the system is defined as:

$$H(C \mid X) = \sum_{j=1}^{n} P(x_j)\, H(C \mid x_j)$$

wherein

$$H(C \mid x_j) = -\sum_{i=1}^{n} P(C_i \mid x_j)\log P(C_i \mid x_j)$$
the information gain is for each feature, that is, looking at a feature (T), what the amount of information the system has and does not have, and the difference between the two is the amount of information the feature brings to the system, i.e., the gain.
The information gain that feature (T) brings to the system can thus be written as the difference between the original entropy of the system and the conditional entropy once feature (T) is fixed:

$$IG(T) = H(C) - H(C \mid T)$$
In a text classification system, a feature (T) corresponds to a term, which takes only two values: it either occurs or does not occur. The occurrence of feature (T) is denoted by $T$, and $\bar{T}$ denotes that feature (T) does not occur.
Then:

$$H(C \mid T) = P(T)\,H(C \mid T\ \text{present}) + P(\bar{T})\,H(C \mid T\ \text{absent})$$

wherein $P(T)$ denotes the probability that feature (T) occurs and $P(\bar{T})$ the probability that feature (T) does not occur.
Expanding the two conditional entropies:

$$H(C \mid T\ \text{present}) = -\sum_{i=1}^{n} P(C_i \mid T)\log P(C_i \mid T), \qquad H(C \mid T\ \text{absent}) = -\sum_{i=1}^{n} P(C_i \mid \bar{T})\log P(C_i \mid \bar{T})$$

so $IG(T)$ can be further expanded to:

$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log P(C_i) + P(T)\sum_{i=1}^{n} P(C_i \mid T)\log P(C_i \mid T) + P(\bar{T})\sum_{i=1}^{n} P(C_i \mid \bar{T})\log P(C_i \mid \bar{T})$$
the feature selection of the text is to extract important terms from the whole text set, wherein there is no concept of category, so that the problem needs to be generalized, and each text is taken as a category. At this time, the number of categories is equal to the number N of texts in the text set. The relevant parameters of the information gain formula are estimated based on this assumption.
Description of the symbols:

$N$ denotes the total number of texts, namely the total number of categories;

$P(C_i)$ denotes the probability that category $C_i$ occurs, namely the probability that text $D_i$ occurs, equal to $1/N$;

$P(T)$ denotes the probability that feature (T) occurs, namely the number of texts containing feature (T) divided by the total number of texts $N$: $P(T) = \dfrac{DF_T}{N}$, where $DF_T$ denotes the document frequency of feature (T);

$P(\bar{T})$ denotes the probability that feature (T) does not occur, equal to $1 - P(T)$;

$P(C_i \mid T)$ denotes the probability that a text contains feature (T) and belongs to category $C_i$; here there are two ways of estimation:

dividing the number of texts that contain feature (T) and belong to category $C_i$ by the number of texts that contain feature (T), which, each text being its own category, gives the value 0 or $1/DF_T$; or

expanding according to the Bayes formula, $P(C_i \mid T) = \dfrac{P(T \mid C_i)\,P(C_i)}{P(T)}$, where $P(T \mid C_i)$ denotes the probability that feature (T) occurs in category $C_i$, namely the probability that feature (T) occurs in document $D_i$, estimated as $P(T \mid C_i) = \dfrac{TF_T}{\sum_i TF_i}$, where $TF_i$ denotes the frequency of occurrence of each term and $TF_T$ the frequency of occurrence of feature (T);

$P(C_i \mid \bar{T})$ denotes the probability that a text does not contain feature (T) and belongs to category $C_i$; here there are two ways of estimation:

dividing the number of texts that do not contain feature (T) and belong to category $C_i$ by the number of texts that do not contain feature (T), which gives the value 0 or $1/(N - DF_T)$; or

expanding according to the Bayes formula, $P(C_i \mid \bar{T}) = \dfrac{P(\bar{T} \mid C_i)\,P(C_i)}{P(\bar{T})}$, where $P(\bar{T} \mid C_i) = 1 - P(T \mid C_i)$.
It should be noted that:

when estimating $P(T)$, the value may be 1 (when $DF_T = N$), which would make $P(\bar{T}) = 1 - P(T)$ equal to 0 and leave $P(C_i \mid \bar{T})$ incomputable; $P(T)$ is therefore estimated in practice with a smoothed form kept strictly below 1, for example $P(T) = \dfrac{DF_T}{N + 1}$;

when $P(T \mid C_i)$ is estimated with $\dfrac{TF_T}{\sum_i TF_i}$ and $TF_T$ is 0, the estimate would be 0; $P(T \mid C_i)$ is therefore estimated in practice with a smoothed form kept strictly above 0, for example $P(T \mid C_i) = \dfrac{TF_T + 1}{\sum_i TF_i + 2}$.
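Putting these estimates together, the per-term information gain can be sketched as follows, treating each document as its own category as the patent prescribes; the smoothing constants follow the example forms above and are assumptions, and per_doc_tf, doc_freq, and doc_word_count are the outputs of the statistics sketch earlier:

```python
import math

def information_gain(term, per_doc_tf, doc_freq, doc_word_count):
    n = len(per_doc_tf)                 # N texts, each its own category
    p_t = doc_freq[term] / (n + 1)      # smoothed P(T), kept strictly below 1
    p_not_t = 1.0 - p_t
    h_c = math.log(n)                   # H(C) with P(C_i) = 1/N
    h_c_t = h_c_not_t = 0.0
    for tf, total in zip(per_doc_tf, doc_word_count):
        # Smoothed P(T|C_i), kept strictly inside (0, 1).
        p_t_given_c = (tf.get(term, 0) + 1) / (total + 2)
        # Bayes: P(C_i|T) = P(T|C_i) P(C_i) / P(T); clamp so that the
        # smoothing cannot push an estimate past 1.
        p_c_given_t = min(p_t_given_c * (1.0 / n) / p_t, 1.0)
        p_c_given_not_t = min((1.0 - p_t_given_c) * (1.0 / n) / p_not_t, 1.0)
        h_c_t -= p_c_given_t * math.log(p_c_given_t)
        h_c_not_t -= p_c_given_not_t * math.log(p_c_given_not_t)
    return h_c - (p_t * h_c_t + p_not_t * h_c_not_t)
```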
The features mentioned in the embodiments of the present invention all refer to terms of text.
Those skilled in the art can derive the definitions of the parameters not listed here from the technical solutions of the embodiments of the present invention.
Step S115, reducing the dimension of the text to be processed according to the reference vector to form the dimension-reduced text vector.
The text feature selection based on the information gain algorithm provided by the embodiments of the invention sorts and screens the features according to their importance to the whole system, achieving dimensionality reduction and lowering the computational burden. A sketch of the selection and projection step follows.
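Step S115 can then be sketched as scoring every candidate term, keeping the top k as the reference vector, and projecting each text onto it; information_gain is the function sketched above, and k (the preset number of features to retain) is a free parameter:

```python
def reduce_dimensions(vocab, per_doc_tf, doc_freq, doc_word_count, k):
    # Score every candidate term by its information gain.
    gains = {t: information_gain(t, per_doc_tf, doc_freq, doc_word_count)
             for t in vocab}
    # Sort descending and keep the top k words as the reference vector.
    reference = sorted(gains, key=gains.get, reverse=True)[:k]
    # Each text becomes a k-dimensional vector of its term frequencies
    # over the retained vocabulary only.
    reduced = [[tf.get(t, 0) for t in reference] for tf in per_doc_tf]
    return reference, reduced
```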
Example two
In the second embodiment of the present invention, the main flow of the method for dimensionality reduction based on text feature selection is the same as in the first embodiment; the text feature selection process, shown in FIG. 2, includes the steps of:
step S210, an initial text is acquired.
Step S211, a tokenizer is obtained and used to segment the initial text.
Step S212, a noun filter is obtained and used to screen the segmented text for nouns, yielding a noun set.
Step S213, document frequency statistics are computed and stored in redis.
Step S214, word frequency statistics are computed and stored in redis.
Step S215, a forward index of the documents is built.
Step S216, the IG calculation is performed based on the statistics from steps S213 and S214.
Step S217, the obtained feature words are persisted (see the Redis sketch below).
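A minimal sketch of the Redis bookkeeping in steps S213 to S215 and the persistence in step S217, using the redis-py client; all key names here are hypothetical:

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

def record_document(doc_id, terms):
    pipe = r.pipeline()
    for term in set(terms):
        pipe.hincrby("df", term, 1)            # S213: document frequency
    for term in terms:
        pipe.hincrby(f"tf:{doc_id}", term, 1)  # S214: word frequency
    pipe.rpush(f"doc:{doc_id}:terms", *terms)  # S215: forward index
    pipe.execute()

def persist_features(features):
    r.sadd("features", *features)              # S217: persist feature words
```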
EXAMPLE III
The third embodiment of the invention provides a device for dimensionality reduction based on text feature selection, which comprises an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimension reduction module.
The acquisition module is used for acquiring and storing the detailed information of the data source text to be processed.
And the word segmentation module is used for performing word segmentation on the data source text by adopting HanLP to obtain a plurality of terms and removing stop words in the terms.
The statistics module is used for counting the word frequency (the number of occurrences of each term), the term document frequency, and the document word count.
And the vector module is used for storing the terms, the word frequency, the term document frequency and the document word number and forming a primary text vector.
And the information gain calculation module is used for performing information gain calculation on the primary text vector to obtain the information gain quantity of each term, and forming a reference vector for feature selection by a plurality of words meeting the preset requirement according to the sorting of the information gain quantity.
And the dimension reduction module is used for reducing the dimension of the text to be processed according to the reference vector to form a text vector after dimension reduction.
Abstracting the text representation into a space vector over a feature word set leaves the original candidate feature word set at hundreds of thousands of dimensions; such a high-dimensional representation not only burdens computation but also degrades classification performance through heavy feature redundancy.
It should be noted that the apparatus or system embodiments of the invention may be implemented in software, in hardware, or in a combination of the two. In hardware terms, besides the CPU, memory, network interface, and non-volatile storage, the device hosting the apparatus of this embodiment may include other hardware, such as a forwarding chip responsible for packet processing. Taking a software implementation as an example, the apparatus, as a logical device, is formed by the CPU of its host device reading the corresponding computer program instructions from non-volatile storage into memory and running them.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (2)
1. A method for reducing dimension based on text feature selection is characterized by comprising the following steps:
step A, acquiring and storing detailed information of a data source text to be processed;
step B, performing word segmentation on the data source text to obtain a plurality of terms, and removing stop words from the terms;
step C, counting word frequency, term document frequency and document word number;
step D, storing the terms, the word frequency, the term document frequency and the document word number to form a primary text vector;
step E, performing an information gain calculation on the primary text vector to obtain the information gain of each term, sorting by the size of the information gain, and forming the words that meet a preset requirement into a reference vector for feature selection;
step F, reducing the dimension of the text to be processed according to the reference vector to form a text vector after dimension reduction;
the step E of calculating the information gain includes the steps of:
taking each text as a category, taking terms in the text as characteristics, and calculating the information gain according to the following formula
In the step E, the step of the method is carried out,wherein DFTA document frequency representing the characteristic (T);
wherein TFiRepresenting the frequency of occurrence of each term;
n, representing the total text number, namely the total category number;
P(Ci) Represents a class CiProbability of occurrence, i.e. text DiProbability of occurrence equal to
And P (T) represents the probability of the feature (T) to appear, and the number of texts containing the feature (T) is divided by the total number of texts N, namely:wherein DFTA document frequency representing the characteristic (T);
representing the probability of the feature (T) not appearing, equal to 1-P (T);
P(Cit), meaning that the text contains a feature (T) and belongs to category CiThe probability of (d); here, there are two ways of estimation:
by including features (T) and belonging to class CiIs divided by the total number of texts and has a value of 0 or
The expansion is carried out according to a Bayesian formula,P(t|Ci) Represents class CiProbability of occurrence of middle feature (T), i.e. feature (T) in document DiProbability of occurrence in, adoptWherein TFiRepresenting the frequency of occurrence of each term; TFTRepresenting the frequency of occurrence of each feature T;
indicating that the text does not contain a feature (T) and belongs to the category CiThe probability of (d); here, there are two ways of estimation:
by not including a feature (T) and belonging to class CiIs divided by the total number of texts and has a value of 0 or
The expansion is carried out according to a Bayesian formula,wherein
It should be noted that:
when estimating P (t), P (t) may be 1, which will causeIs 0, thereby causingThe calculation cannot be carried out; so that P (t) is actually adoptedCarrying out estimation;
according toIf TFTIs 0, this will cause P (t | C)i) Is 0; so P (t | C)i) Practical applicationAnd (6) estimating.
2. A device for dimensionality reduction based on text feature selection, characterized by comprising an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimension reduction module;
the acquisition module is used for acquiring and storing the detailed information of the data source text to be processed;
the word segmentation module is used for performing word segmentation on the data source text by adopting HanLP to obtain a plurality of terms and removing stop words in the terms;
the statistics module is used for counting the word frequency, the term document frequency, and the document word count;
the vector module is used for storing the terms, the word frequency, the term document frequency and the document word number and forming a primary text vector;
the information gain calculation module is used for performing the information gain calculation on the primary text vector to obtain the information gain of each term and, after sorting by information gain, forming the words that meet a preset requirement into a reference vector for feature selection;
the dimension reduction module is used for reducing the dimension of the text to be processed according to the reference vector to form a text vector after dimension reduction;
the information gain calculation module is configured to:
take each text as a category and the terms in the text as features, and calculate the information gain according to the following formula:

$$IG(T) = -\sum_{i=1}^{N} P(C_i)\log P(C_i) + P(T)\sum_{i=1}^{N} P(C_i \mid T)\log P(C_i \mid T) + P(\bar{T})\sum_{i=1}^{N} P(C_i \mid \bar{T})\log P(C_i \mid \bar{T})$$

wherein $DF_T$ denotes the document frequency of feature (T), and $TF_i$ denotes the frequency of occurrence of each term;
$N$ denotes the total number of texts, namely the total number of categories;
$P(C_i)$ denotes the probability that category $C_i$ occurs, namely the probability that text $D_i$ occurs, equal to $1/N$;
$P(T)$ denotes the probability that feature (T) occurs, namely the number of texts containing feature (T) divided by the total number of texts $N$: $P(T) = \dfrac{DF_T}{N}$;
$P(\bar{T})$ denotes the probability that feature (T) does not occur, equal to $1 - P(T)$;
$P(C_i \mid T)$ denotes the probability that a text contains feature (T) and belongs to category $C_i$; here there are two ways of estimation:
dividing the number of texts that contain feature (T) and belong to category $C_i$ by the number of texts that contain feature (T), which, each text being its own category, gives the value 0 or $1/DF_T$; or
expanding according to the Bayes formula, $P(C_i \mid T) = \dfrac{P(T \mid C_i)\,P(C_i)}{P(T)}$, wherein $P(T \mid C_i)$ denotes the probability that feature (T) occurs in category $C_i$, namely the probability that feature (T) occurs in document $D_i$, estimated as $P(T \mid C_i) = \dfrac{TF_T}{\sum_i TF_i}$, wherein $TF_i$ denotes the frequency of occurrence of each term and $TF_T$ the frequency of occurrence of feature (T);
$P(C_i \mid \bar{T})$ denotes the probability that a text does not contain feature (T) and belongs to category $C_i$; here there are two ways of estimation:
dividing the number of texts that do not contain feature (T) and belong to category $C_i$ by the number of texts that do not contain feature (T), which gives the value 0 or $1/(N - DF_T)$; or
expanding according to the Bayes formula, $P(C_i \mid \bar{T}) = \dfrac{P(\bar{T} \mid C_i)\,P(C_i)}{P(\bar{T})}$, wherein $P(\bar{T} \mid C_i) = 1 - P(T \mid C_i)$;
it should be noted that:
when estimating $P(T)$, the value may be 1, which would make $P(\bar{T})$ equal to 0 and leave $P(C_i \mid \bar{T})$ incomputable; $P(T)$ is therefore estimated in practice with a smoothed form kept strictly below 1, for example $\dfrac{DF_T}{N + 1}$;
according to $P(T \mid C_i) = \dfrac{TF_T}{\sum_i TF_i}$, if $TF_T$ is 0 then $P(T \mid C_i)$ would be 0; $P(T \mid C_i)$ is therefore estimated in practice with a smoothed form kept strictly above 0, for example $\dfrac{TF_T + 1}{\sum_i TF_i + 2}$.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610639904.8A (CN106294689B) | 2016-08-05 | 2016-08-05 | Method and apparatus for dimensionality reduction based on text feature selection |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN106294689A | 2017-01-04 |
| CN106294689B | 2018-09-25 |
Family
ID=57665827

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201610639904.8A (CN106294689B, active) | Method and apparatus for dimensionality reduction based on text feature selection | 2016-08-05 | 2016-08-05 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN106294689B (en) |
Families Citing this family (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109308317A * | 2018-09-07 | 2019-02-05 | 浪潮软件股份有限公司 | A kind of hot spot word extracting method of the non-structured text based on cluster |
| CN110472240A * | 2019-07-26 | 2019-11-19 | 北京影谱科技股份有限公司 | Text feature and device based on TF-IDF |
| CN112906386B * | 2019-12-03 | 2023-08-11 | 深圳无域科技技术有限公司 | Method and device for determining text characteristics |
Patent Citations (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102033949A * | 2010-12-23 | 2011-04-27 | 南京财经大学 | Correction-based K nearest neighbor text classification method |
| CN102662952A * | 2012-03-02 | 2012-09-12 | 成都康赛电子科大信息技术有限责任公司 | Chinese text parallel data mining method based on hierarchy |
| CN105095162A * | 2014-05-19 | 2015-11-25 | 腾讯科技(深圳)有限公司 | Text similarity determining method and device, electronic equipment and system |
| CN105512104A * | 2015-12-02 | 2016-04-20 | 上海智臻智能网络科技股份有限公司 | Dictionary dimension reducing method and device and information classifying method and device |
Legal Events

| Date | Code | Title |
|---|---|---|
| | C06 | Publication |
| | PB01 | Publication |
| | C10 | Entry into substantive examination |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |