CN106294689B - Method and apparatus for dimensionality reduction based on text feature selection - Google Patents

Method and apparatus for dimensionality reduction based on text feature selection

Info

Publication number
CN106294689B
Authority
CN
China
Prior art keywords
text
feature
term
frequency
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610639904.8A
Other languages
Chinese (zh)
Other versions
CN106294689A (en)
Inventor
张达
亓开元
苏志远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IEIT Systems Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN201610639904.8A
Publication of CN106294689A
Application granted
Publication of CN106294689B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Creation or modification of classes or clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method and a device for dimensionality reduction based on text feature selection. The method comprises the steps of: obtaining the text to be processed; segmenting it with HanLP to obtain multiple terms and removing stop words from those terms; counting term frequency, term document frequency, and document word count; storing the terms together with these statistics to form a primary text vector; performing an information gain calculation on the primary text vector, sorting the terms by information gain, and forming the words that meet a preset requirement into a reference vector for feature selection; and reducing the dimensionality of the text to be processed according to the reference vector to form a dimensionality-reduced text vector. The device comprises an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module. By selecting text features with an information gain algorithm and reducing the dimensionality of the feature-word vectors, the method and device lessen the computational burden caused by excessive dimensionality.

Description

A Method and Device for Dimensionality Reduction Based on Text Feature Selection

Technical Field

The present invention relates to the technical field of machine learning, and in particular to a method and device for dimensionality reduction based on text feature selection.

Background Art

With the rapid development of the Internet and the continuous innovation of Internet-related technologies, the cost and efficiency of informatization across society have changed dramatically compared with ten or twenty years ago. Moreover, the growing reach of the Internet has produced data in many formats (text, multimedia, etc.) from many different sources. Faced with this volume of information, people can no longer process all information resources manually; they need auxiliary tools to help them discover, filter, and manage electronic information and resources.

Traditional text-processing software targets plain text files. With the emergence of diverse text formats, however, the files that carry electronic information are no longer limited to a single file type. Particularly with the growth of the Internet, each of these formats has shown its own advantages, and the limitations of systems that handle only a single format have become increasingly apparent.

A text is commonly represented in abstract form as a vector over a feature-word set, but the original candidate feature-word set can reach hundreds of thousands of dimensions, and such high-dimensional text representations impose an enormous computational burden.

Summary of the Invention

The present invention provides a method and device for dimensionality reduction based on text feature selection to solve the above technical problems.

The method for dimensionality reduction based on text feature selection provided by the present invention comprises the steps of:

Step A: obtaining and storing detailed information of the data source text to be processed;

Step B: segmenting the data source text with HanLP to obtain a plurality of terms, and removing stop words from the terms;

Step C: counting term frequency, term document frequency, and document word count;

Step D: storing the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;

Step E: performing an information gain calculation on the primary text vector to obtain the information gain of each term, sorting the terms by information gain, and forming the words that meet a preset requirement into a reference vector for feature selection;

Step F: reducing the dimensionality of the text to be processed according to the reference vector to form a dimensionality-reduced text vector.

The information gain calculation in step E comprises the steps of:

treating each text as a category and the terms in the text as features, and computing the information gain according to the following formula:

$$IG(T) = -\sum_{i=1}^{N} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{N} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{N} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t})$$

where N is the total number of categories, $P(C_i)$ is the probability that category $C_i$ occurs, $P(t)$ is the probability that feature T occurs, $P(\bar{t})$ is the probability that feature T does not occur, and $P(C_i \mid t)$ is the probability that a text contains feature T and belongs to category $C_i$.

In step E, $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;

and $P(t \mid C_i) = \frac{TF_T}{\sum_i TF_i}$, where $TF_i$ is the occurrence frequency of each term.

An embodiment of the present invention further provides a device for dimensionality reduction through text feature selection, comprising an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module;

the acquisition module is configured to obtain and store detailed information of the data source text to be processed;

the word segmentation module is configured to segment the data source text with HanLP to obtain a plurality of terms and to remove stop words from the terms;

the statistics module is configured to count term frequency (the occurrence frequency of each term), term document frequency, and document word count;

the vector module is configured to store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;

the information gain calculation module is configured to perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by information gain, and form the words that meet a preset requirement into a reference vector for feature selection;

the dimensionality reduction module is configured to reduce the dimensionality of the text to be processed according to the reference vector, forming the dimensionality-reduced text vector.

The information gain calculation module is configured to:

treat each text as a category and the terms in the text as features, and compute the information gain according to the following formula:

$$IG(T) = -\sum_{i=1}^{N} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{N} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{N} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t})$$

where N is the total number of categories, $P(C_i)$ is the probability that category $C_i$ occurs, $P(t)$ is the probability that feature T occurs, $P(\bar{t})$ is the probability that feature T does not occur, and $P(C_i \mid t)$ is the probability that a text contains feature T and belongs to category $C_i$;

with $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;

and $P(t \mid C_i) = \frac{TF_T}{\sum_i TF_i}$, where $TF_i$ is the occurrence frequency of each term.

An embodiment of the present invention provides a method and device for dimensionality reduction based on text feature selection. A reference vector is obtained through HanLP word segmentation, stop-word removal, information gain calculation with terms as features, and sorting by information gain; documents are then reduced in dimensionality against that reference vector. This document-feature dimensionality reduction method, built on the information gain algorithm, lowers the dimensionality of the document feature-word set and reduces the computational burden that a feature-word set of hundreds of thousands of dimensions would otherwise impose.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of an embodiment of the method for dimensionality reduction based on text feature selection according to the present invention;

FIG. 2 is a schematic flowchart of an embodiment of text feature selection provided by Embodiment 2 of the present invention.

Detailed Description of the Embodiments

Embodiments of the present invention provide a method and device for dimensionality reduction based on text feature selection. The approach is a text feature selection algorithm based on Information Gain (IG): it extracts the most representative and effective features from the text in order to reduce the dimensionality of the data set. Under information gain, the measure of a feature's importance is how much information it contributes to the classification system; the more information it brings, the more important the feature.

Embodiments of the present invention use the HanLP word segmentation technique to segment text. Its principle is to build a dictionary large enough to contain every Chinese word that may occur and to check whether substrings of the Chinese text to be processed appear in that dictionary; once a word is found, it is recognized and split off from the character string, and this continues until the whole string has been segmented. HanLP is full-featured, efficient, cleanly architected, backed by an up-to-date corpus, and customizable. While providing rich functionality, HanLP keeps its internal modules loosely coupled, loads its models lazily, provides its services statically, and publishes its dictionaries in plain text; it is very convenient to use and ships with corpus-processing tools that help users train on their own corpora. Its biggest drawback is that segmentation accuracy depends entirely on the dictionary, which must therefore be kept up to date.
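To make the dictionary-matching principle concrete, here is a minimal sketch of forward maximum matching in Python. It is not HanLP's implementation (HanLP's segmenters are far more sophisticated), and the toy dictionary is invented for illustration.

```python
# Minimal sketch of dictionary-based segmentation (forward maximum matching).
# This illustrates the principle only; the toy dictionary is invented.

def fmm_segment(text, dictionary, max_word_len=4):
    """Greedily match the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest candidate first, shrinking until a dictionary hit.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                matched = candidate
                break
        words.append(matched)
        i += len(matched)
    return words

toy_dict = {"文本", "特征", "选择", "降维", "文本特征"}
print(fmm_segment("文本特征选择降维", toy_dict))
# ['文本特征', '选择', '降维']
```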

The information gain method determines how much information a feature's presence or absence in a text contributes to deciding the text's category. In filtering problems it measures how much knowing whether a feature appears in topic-relevant text contributes to predicting that topic. Computing information gain identifies features that appear frequently in positive examples and rarely in non-positive ones. Although information gain involves substantial mathematical theory and complex entropy formulas, embodiments of the present invention define it simply as the amount of information a feature item provides for the classification as a whole: the difference between the entropy without the feature and the entropy once the feature is taken into account. Based on the training data, embodiments of the present invention compute the information gain of each feature item, delete items whose information gain is very small, and sort and filter the rest by information gain in descending order.

Embodiment 1

Specifically, as shown in FIG. 1, the method comprises the steps of:

Step S110: obtain and store detailed information of the data source text to be processed.

The detailed data source text is retrieved and stored in HDFS, with a backup retained for later verification or data tracing.

Step S111: segment the data source text with HanLP to obtain a plurality of terms, and remove stop words from the terms.

The useful information in a text is generally carried by content words such as nouns, adjectives, verbs, and classifiers, and these content words are also what chiefly distinguish its category; by contrast, words that appear frequently in every text, and function words without real meaning, contribute almost nothing to text classification. These stop words usually carry little practical meaning yet occur constantly in text. If they are not removed, two texts with completely different content may become indistinguishable because of this large amount of shared information; the later feature selection stage suffers as well, increasing the system's computational overhead and ultimately hurting the construction of the classifier. Therefore, after the text has been segmented, words that appear in a stop-word dictionary are filtered out directly.
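As a sketch of this segmentation-and-filtering step, the snippet below uses the pyhanlp bindings for HanLP; the stop-word file path and helper names are assumptions for illustration, not part of the patent.

```python
# Sketch of step S111: HanLP segmentation followed by stop-word removal.
# Assumes the pyhanlp package is installed and a local stop-word list exists
# at the (hypothetical) path below.
from pyhanlp import HanLP

def load_stopwords(path="stopwords.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def segment_and_filter(text, stopwords):
    """Segment text with HanLP and drop stop words."""
    terms = [term.word for term in HanLP.segment(text)]
    return [w for w in terms if w not in stopwords]

stopwords = load_stopwords()
tokens = segment_and_filter("随着互联网的高速发展，文本数据的规模不断增长。", stopwords)
print(tokens)
```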

Step S112: count term frequency, term document frequency, and document word count.

The text is segmented with HanLP, and the term frequencies, term document frequencies, and document word counts are tallied. Here the term frequency is how often each term occurs across all texts, the term document frequency is how often each term occurs within a single document, and the document word count is the number of terms a document contains.
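A minimal sketch of these statistics, assuming `docs` holds the token lists produced by the previous step (all names are illustrative):

```python
# Sketch of step S112: corpus statistics over tokenized documents.
from collections import Counter

def corpus_statistics(docs):
    """Count corpus-wide term frequency, document frequency, and per-document sizes."""
    term_freq = Counter()      # occurrences of each term across all texts
    doc_freq = Counter()       # number of documents containing each term
    doc_word_counts = []       # number of terms in each document
    per_doc_tf = []            # per-document term-frequency Counters
    for tokens in docs:
        tf = Counter(tokens)
        per_doc_tf.append(tf)
        term_freq.update(tf)
        doc_freq.update(tf.keys())
        doc_word_counts.append(len(tokens))
    return term_freq, doc_freq, doc_word_counts, per_doc_tf
```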

Step S113: store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector.

The terms, their occurrence counts (term frequencies), and their document frequencies are stored in an in-memory database to form vectorized text, ready for the information gain calculation to read and write.
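A sketch of storing the statistics in Redis as the in-memory database (Embodiment 2 below uses Redis); the key layout is an assumption, and a running Redis server plus the redis-py package are required.

```python
# Sketch of step S113: persisting the statistics to Redis.
# Key names ("tf", "df", "doc_word_count") are invented for illustration.
import redis

def store_primary_vectors(r, term_freq, doc_freq, doc_word_counts):
    r.hset("tf", mapping={t: int(c) for t, c in term_freq.items()})
    r.hset("df", mapping={t: int(c) for t, c in doc_freq.items()})
    for i, n in enumerate(doc_word_counts):
        r.hset("doc_word_count", str(i), int(n))

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
# store_primary_vectors(r, term_freq, doc_freq, doc_word_counts)
```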

Step S114: perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by information gain, and form the words that meet a preset requirement into a reference vector for feature selection.

The information gain of the text vector is computed, the terms are sorted by information content, and the N words required are retained as the reference vector for feature selection; all texts are then reduced in dimensionality against the reference vector, yielding the final dimensionality-reduced text vectors.

Mathematical definition of entropy: suppose a variable X has n possible values $x_1, x_2, \ldots, x_n$, taken with probabilities $P_1, P_2, \ldots, P_n$; then the entropy of X is defined as:

$$H(X) = -\sum_{i=1}^{n} P_i \log P_i$$

Entropy of the classification system: for a classification system, the category C is a variable with possible values $C_1, C_2, \ldots, C_n$, occurring with probabilities $P(C_1), P(C_2), \ldots, P(C_n)$, where n is the number of categories. The entropy of the classification system is defined as:

$$H(C) = -\sum_{i=1}^{n} P(C_i)\log P(C_i)$$

Here $P(C_i)$ is the probability that category $C_i$ occurs, which can be estimated as the number of records (documents) in category $C_i$ divided by the total number of records (total documents), i.e.:

$$P(C_i) = \frac{N_{C_i}}{N}$$

where N is the total number of records and $N_{C_i}$ is the number of records contained in category $C_i$.
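To make the definition concrete, a tiny numeric illustration (the probability vectors are invented): a uniform distribution over four categories attains the maximum entropy $\log 4$, while a highly skewed one is close to 0.

```python
# Tiny illustration of the entropy definition above.
import math

def entropy(probs):
    """H = -sum(p * log p) over nonzero probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # log(4) ~= 1.386 (maximal)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ~= 0.168 (nearly certain)
```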

Conditional entropy: suppose feature X has n possible values $(x_1, x_2, \ldots, x_n)$; then, given X, the entropy of the system is defined as:

$$H(C \mid X) = \sum_{i=1}^{n} P(x_i)\,H(C \mid x_i)$$

where

$$H(C \mid x_i) = -\sum_{j} P(C_j \mid x_i)\log P(C_j \mid x_i)$$

Information gain is defined per feature: for a feature T, one compares the amount of information in the system with and without it, and the difference between the two is the amount of information the feature brings to the system, i.e., the gain.

The information gain that feature T brings to the system can be written as the difference between the system's original entropy and the conditional entropy once feature T is fixed:

$$IG(T) = H(C) - H(C \mid T)$$

In a text classification system, a feature T corresponds to a term and takes only two values, "present" or "absent". Let t denote that feature T is present and $\bar{t}$ that it is absent.

Then:

$$H(C \mid T) = P(t)\,H(C \mid t) + P(\bar{t})\,H(C \mid \bar{t})$$

where $P(t)$ is the probability that feature T appears and $P(\bar{t})$ the probability that it does not. Expanding the formula further:

$$H(C \mid t) = -\sum_{i=1}^{N} P(C_i \mid t)\log P(C_i \mid t), \qquad H(C \mid \bar{t}) = -\sum_{i=1}^{N} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t})$$

so $IG(T)$ can be further expanded into:

$$IG(T) = -\sum_{i=1}^{N} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{N} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{N} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t})$$

Feature selection for text extracts the important terms from the whole text collection, where there is no notion of category; the problem is therefore generalized by treating each text as its own category. The number of categories then equals the number of texts N in the collection. The parameters of the information gain formula are estimated under this assumption.

Notation:

N: the total number of texts, i.e., the total number of categories;

$P(C_i)$: the probability that category $C_i$ occurs, i.e., the probability that text $D_i$ occurs, equal to $\frac{1}{N}$;

$P(t)$: the probability that feature T occurs, estimated as the number of texts containing T divided by the total number of texts N, i.e. $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;

$P(\bar{t})$: the probability that feature T does not occur, equal to $1 - P(t)$;

$P(C_i \mid t)$: the probability that a text contains feature T and belongs to category $C_i$; two estimates are possible:

dividing the number of texts that contain T and belong to $C_i$ by the total number of texts, which gives either 0 or $\frac{1}{N}$; or

expanding by Bayes' rule, $P(C_i \mid t) = \frac{P(t \mid C_i)\,P(C_i)}{P(t)}$, where $P(t \mid C_i)$ is the probability that feature T occurs in category $C_i$, i.e., that T occurs in document $D_i$, estimated as $P(t \mid C_i) = \frac{TF_T}{\sum_i TF_i}$, where $TF_i$ is the occurrence frequency of each term and $TF_T$ is the occurrence frequency of feature T.

$P(C_i \mid \bar{t})$: the probability that a text does not contain feature T yet belongs to category $C_i$; two estimates are possible:

dividing the number of texts that do not contain T and belong to $C_i$ by the total number of texts, which gives either 0 or $\frac{1}{N}$; or

expanding by Bayes' rule, $P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i)\,P(C_i)}{P(\bar{t})}$, where $P(\bar{t} \mid C_i) = 1 - P(t \mid C_i)$.

Note that:

when estimating $P(t)$, the value may be 1, which makes $P(\bar{t}) = 1 - P(t)$ equal to 0, so that $P(C_i \mid \bar{t})$ cannot be computed; $P(t)$ is therefore estimated in practice with a smoothed form that keeps it strictly below 1;

$P(t \mid C_i)$ is estimated as $\frac{TF_T}{\sum_i TF_i}$, which is 0 whenever $TF_T$ is 0; $P(t \mid C_i)$ is therefore estimated in practice with a smoothed form that avoids zero values.
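Putting the estimates together, the following is a minimal sketch of the per-term information gain under the one-text-per-category assumption, followed by top-N selection for the reference vector. The inputs `per_doc_tf` and `doc_freq` come from the statistics sketch above, and the smoothing constants ($\frac{DF_T}{N+1}$ and add-one) are assumptions standing in for the patent's elided smoothed forms.

```python
# Sketch of the information gain computation of step S114, treating each
# document as its own category. Smoothing constants are assumed.
import math

def information_gain(per_doc_tf, doc_freq, term):
    N = len(per_doc_tf)                       # total texts = total categories
    p_t = doc_freq[term] / (N + 1)            # assumed smoothing: keeps P(t) < 1
    p_not_t = 1.0 - p_t
    h_c = math.log(N)                         # H(C) with P(C_i) = 1/N
    sum_t, sum_not_t = 0.0, 0.0
    for tf in per_doc_tf:                     # tf is a Counter per document
        total = sum(tf.values())
        p_t_given_c = (tf[term] + 1) / (total + 1)   # assumed add-one smoothing
        p_c = 1.0 / N
        p_c_given_t = p_t_given_c * p_c / p_t        # Bayes' rule
        p_c_given_not_t = (1.0 - p_t_given_c) * p_c / p_not_t
        if p_c_given_t > 0:
            sum_t += p_c_given_t * math.log(p_c_given_t)
        if p_c_given_not_t > 0:
            sum_not_t += p_c_given_not_t * math.log(p_c_given_not_t)
    return h_c + p_t * sum_t + p_not_t * sum_not_t

def top_n_terms(per_doc_tf, doc_freq, n):
    """Rank the vocabulary by information gain; keep the top n as the reference vector."""
    scored = {t: information_gain(per_doc_tf, doc_freq, t) for t in doc_freq}
    return sorted(scored, key=scored.get, reverse=True)[:n]
```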

All features referred to in the embodiments of the present invention are terms of the text.

Those skilled in the art can determine the definition of each parameter from the technical solutions of the embodiments of the present invention; the embodiments do not enumerate them all.

Step S115: reduce the dimensionality of the text to be processed according to the reference vector, forming the dimensionality-reduced text vector.
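A sketch of this projection step: the reduced vector keeps one component per reference term (raw counts here; any weighting scheme is left open). The example values are invented.

```python
# Sketch of step S115: projecting a document onto the reference vector.
from collections import Counter

def reduce_document(tokens, reference):
    """Project a tokenized document onto the reference vector (term counts)."""
    tf = Counter(tokens)
    return [tf[term] for term in reference]

reference = ["特征", "降维", "文本"]   # illustrative reference vector
print(reduce_document(["文本", "特征", "特征"], reference))   # -> [2, 0, 1]
```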

Embodiment 1 of the present invention performs feature selection on text based on the information gain algorithm and sorts and filters the features by their importance to the whole system, thereby achieving dimensionality reduction and lightening the computational burden.

Embodiment 2

In Embodiment 2 of the present invention, the main flow of the method for dimensionality reduction based on text feature selection is the same as in Embodiment 1. The text feature selection flow, shown in FIG. 2, comprises the steps below; a pipeline sketch follows the step list.

Step S210: obtain the initial text.

Step S211: obtain a tokenizer and use it to segment the initial text.

Step S212: obtain a noun filter and use it to screen the segmented text for nouns, obtaining a noun set.

Step S213: compute document frequency statistics and store them in Redis.

Step S214: compute term frequency statistics and store them in Redis.

Step S215: build a forward index of the documents.

Step S216: perform the IG calculation from the statistics of steps S213 and S214.

Step S217: persist the resulting feature words.
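As a sketch wiring these steps together, the snippet below reuses the helper functions from the earlier sketches (`corpus_statistics`, `store_primary_vectors`, `top_n_terms`); the noun filter keeps HanLP terms whose part-of-speech tag starts with "n", and all Redis key names are assumptions.

```python
# Sketch of the Embodiment 2 pipeline (FIG. 2); assumes the earlier helper
# functions are in scope and a Redis connection `r` is available.
from pyhanlp import HanLP

def noun_filter(text):
    """S211-S212: segment with HanLP and keep nouns (POS tags starting with 'n')."""
    return [t.word for t in HanLP.segment(text) if str(t.nature).startswith("n")]

def run_pipeline(r, texts, n_features):
    docs = [noun_filter(t) for t in texts]
    term_freq, doc_freq, counts, per_doc_tf = corpus_statistics(docs)   # S213-S214
    store_primary_vectors(r, term_freq, doc_freq, counts)
    for i, tokens in enumerate(docs):                                   # S215: forward index
        if tokens:
            r.rpush(f"doc:{i}:terms", *tokens)
    reference = top_n_terms(per_doc_tf, doc_freq, n_features)           # S216: IG
    if reference:
        r.rpush("feature_words", *reference)                            # S217: persist
    return reference
```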

Embodiment 3

Embodiment 3 of the present invention provides a device for dimensionality reduction based on text feature selection, comprising an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module.

The acquisition module is configured to obtain and store detailed information of the data source text to be processed.

The word segmentation module is configured to segment the data source text with HanLP to obtain a plurality of terms and to remove stop words from the terms.

The statistics module is configured to count term frequency (the occurrence frequency of each term), term document frequency, and document word count.

The vector module is configured to store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector.

The information gain calculation module is configured to perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by information gain, and form the words that meet a preset requirement into a reference vector for feature selection.

The dimensionality reduction module is configured to reduce the dimensionality of the text to be processed according to the reference vector, forming the dimensionality-reduced text vector.

A text is represented in abstract form as a vector over a feature-word set, but the original candidate feature-word set can reach hundreds of thousands of dimensions. High-dimensional text representations not only impose a computational burden; substantial feature redundancy also degrades classification performance. Embodiments of the present invention provide a method and device for feature extraction based on the information gain algorithm that lower the dimensionality of the feature-word set, reduce the corresponding computational burden, and improve classification performance by removing redundant features.

It should be noted that the device or system embodiments of the present invention may be implemented in software, in hardware, or in a combination of both. At the hardware level, besides a CPU, memory, network interface, and non-volatile storage, the equipment hosting the device of an embodiment may generally include other hardware, such as a forwarding chip responsible for packet processing. Taking a software implementation as an example, the device, as a logical entity, is formed by the CPU of its host equipment reading the corresponding computer program instructions from non-volatile storage into memory and executing them.

The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (2)

1. A method for dimensionality reduction based on text feature selection, comprising the steps of:

Step A: obtaining and storing detailed information of the data source text to be processed;

Step B: segmenting the data source text to obtain a plurality of terms, and removing stop words from the terms;

Step C: counting term frequency, term document frequency, and document word count;

Step D: storing the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;

Step E: performing an information gain calculation on the primary text vector to obtain the information gain of each term, sorting the terms by information gain, and forming the words that meet a preset requirement into a reference vector for feature selection;

Step F: reducing the dimensionality of the text to be processed according to the reference vector to form a dimensionality-reduced text vector;

wherein the information gain calculation in step E comprises: treating each text as a category and the terms in the text as features, and computing the information gain as

$$IG(T) = -\sum_{i=1}^{N} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{N} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{N} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t})$$

wherein in step E, $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T, and $P(t \mid C_i) = \frac{TF_T}{\sum_i TF_i}$, where $TF_i$ is the occurrence frequency of each term;

N is the total number of texts, i.e., the total number of categories;

$P(C_i)$ is the probability that category $C_i$ occurs, i.e., the probability that text $D_i$ occurs, equal to $\frac{1}{N}$;

$P(t)$ is the probability that feature T occurs, estimated as the number of texts containing T divided by the total number of texts N;

$P(\bar{t})$ is the probability that feature T does not occur, equal to $1 - P(t)$;

$P(C_i \mid t)$ is the probability that a text contains feature T and belongs to category $C_i$, for which two estimates exist: dividing the number of texts that contain T and belong to $C_i$ by the total number of texts, giving 0 or $\frac{1}{N}$; or expanding by Bayes' rule, $P(C_i \mid t) = \frac{P(t \mid C_i)\,P(C_i)}{P(t)}$, where $P(t \mid C_i)$ is the probability that feature T occurs in category $C_i$, i.e., in document $D_i$, estimated as $\frac{TF_T}{\sum_i TF_i}$, where $TF_T$ is the occurrence frequency of feature T;

$P(C_i \mid \bar{t})$ is the probability that a text does not contain feature T yet belongs to category $C_i$, for which two estimates exist: dividing the number of texts that do not contain T and belong to $C_i$ by the total number of texts, giving 0 or $\frac{1}{N}$; or expanding by Bayes' rule, $P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i)\,P(C_i)}{P(\bar{t})}$, where $P(\bar{t} \mid C_i) = 1 - P(t \mid C_i)$;

noting that: when estimating $P(t)$, the value may be 1, which makes $1 - P(t)$ equal to 0, so that $P(C_i \mid \bar{t})$ cannot be computed, and $P(t)$ is therefore estimated in practice with a smoothed form that keeps it strictly below 1; and since $P(t \mid C_i) = \frac{TF_T}{\sum_i TF_i}$ is 0 whenever $TF_T$ is 0, $P(t \mid C_i)$ is likewise estimated in practice with a smoothed form that avoids zero values.

2. A device for dimensionality reduction through text feature selection, comprising an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module;

the acquisition module being configured to obtain and store detailed information of the data source text to be processed;

the word segmentation module being configured to segment the data source text with HanLP to obtain a plurality of terms and to remove stop words from the terms;

the statistics module being configured to count term frequency, term document frequency, and document word count;

the vector module being configured to store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;

the information gain calculation module being configured to perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by information gain, and form the words that meet a preset requirement into a reference vector for feature selection;

the dimensionality reduction module being configured to reduce the dimensionality of the text to be processed according to the reference vector, forming the dimensionality-reduced text vector;

the information gain calculation module being configured to: treat each text as a category and the terms in the text as features, and compute the information gain as

$$IG(T) = -\sum_{i=1}^{N} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{N} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{N} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t})$$

with $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T, and $P(t \mid C_i) = \frac{TF_T}{\sum_i TF_i}$, where $TF_i$ is the occurrence frequency of each term and $TF_T$ is the occurrence frequency of feature T; and with the same parameter definitions, estimates, and smoothed forms as recited in claim 1.
CN201610639904.8A 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text feature selection Active CN106294689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610639904.8A CN106294689B (en) 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610639904.8A CN106294689B (en) 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text feature selection

Publications (2)

Publication Number Publication Date
CN106294689A CN106294689A (en) 2017-01-04
CN106294689B true CN106294689B (en) 2018-09-25

Family

Family ID: 57665827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610639904.8A Active CN106294689B (en) Method and apparatus for dimensionality reduction based on text feature selection

Country Status (1)

Country Link
CN (1) CN106294689B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308317A (en) * 2018-09-07 2019-02-05 浪潮软件股份有限公司 A clustering-based method for hot word extraction from unstructured text
CN110472240A (en) * 2019-07-26 2019-11-19 北京影谱科技股份有限公司 Text feature and device based on TF-IDF
CN112906386B (en) * 2019-12-03 2023-08-11 深圳无域科技技术有限公司 Method and device for determining text characteristics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033949A (en) * 2010-12-23 2011-04-27 南京财经大学 Correction-based K nearest neighbor text classification method
CN102662952A (en) * 2012-03-02 2012-09-12 成都康赛电子科大信息技术有限责任公司 Chinese text parallel data mining method based on hierarchy
CN105095162A (en) * 2014-05-19 2015-11-25 腾讯科技(深圳)有限公司 Text similarity determining method and device, electronic equipment and system
CN105512104A (en) * 2015-12-02 2016-04-20 上海智臻智能网络科技股份有限公司 Dictionary dimension reducing method and device and information classifying method and device


Also Published As

Publication number Publication date
CN106294689A (en) 2017-01-04

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN112329836B (en) Text classification method, device, server and storage medium based on deep learning
AU2023248112B2 (en) Method and system for key phrase extraction and generation from text
RU2628431C1 (en) Selection of text classifier parameter based on semantic characteristics
CN113407679B (en) Text topic mining method and device, electronic equipment and storage medium
WO2019174132A1 (en) Data processing method, server and computer storage medium
EP2092419B1 (en) Method and system for high performance data metatagging and data indexing using coprocessors
CN107463548B (en) Phrase mining method and device
CN109960724A (en) A Text Summarization Method Based on TF-IDF
CN103761264B (en) Concept hierarchy establishing method based on product review document set
CN110019792A (en) File classification method and device and sorter model training method
RU2618374C1 (en) Identifying collocations in the texts in natural language
CN110019820B (en) Method for detecting time consistency of complaints and symptoms of current medical history in medical records
AU2019203783B2 (en) Extraction of tokens and relationship between tokens from documents to form an entity relationship map
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN112445912A (en) Fault log classification method, system, device and medium
Rathod Extractive text summarization of Marathi news articles
CN106294689B (en) Method and apparatus for dimensionality reduction based on text feature selection
CN113468339A (en) Label extraction method, system, electronic device and medium based on knowledge graph
CN118585618A (en) Intelligent reply method, device, electronic device and storage medium
CN114492390B (en) Data expansion method, device, equipment and medium based on keyword recognition
CN111324705B (en) System and method for adaptively adjusting associated search terms
CN115577082A (en) Document keyword extraction method and device, electronic equipment and storage medium
CN118277549A (en) A device, method and medium for automatically extracting summary information from massive texts
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant