CN106250910B - Semi-structured data classification method based on label sequence and nGrams - Google Patents
- Publication number
- CN106250910B (application CN201610555498.7A)
- Authority
- CN
- China
- Prior art keywords
- tsgrams
- semi
- structured data
- feature
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a semi-structured data classification method based on tag sequences and nGrams, which addresses the poor accuracy of existing semi-structured data classification methods. In this scheme, TSGrams features serve as the basic unit for representing semi-structured data: a tag sequence captures the structural information of the data, nGrams capture its content information, and the two are fused into a single feature that captures the containment relationship between structure and content while also accounting for the interrelationships among different keywords in the content. Information gain is used to screen the TSGrams features, yielding a feature space built from TSGrams features with strong discriminative power, and a classification model is constructed for each category from the mutual information between TSGrams features and categories. Because classification also takes the similarity between different structures into account, the accuracy of semi-structured data classification is improved.
Description
Technical Field
The invention relates to a semi-structured data classification method, in particular to a semi-structured data classification method based on a label sequence and nGrams.
Background
Semi-structured data classification generally proceeds in three steps: first, features are extracted from a semi-structured data set of known classes; then a classification model is constructed from the extracted features; finally, data of unknown class is classified with the constructed model. Semi-structured data contains both structure and content information, and classification based on structure and content must consider the following factors during feature extraction:
1. the containment relationship between structure and content, i.e., the content is organized into, and contained within, different levels of the hierarchy;
2. the interrelationships among the elements of the structure, i.e., the sibling, parent-child, ancestor-descendant, and similar relationships between elements;
3. the interrelationships among the keywords in the content.
Most existing structure-and-content-based semi-structured data classification methods extend the traditional vector space model for unstructured text documents so that it also carries structural information, and then use the extended model for classification. For example, both Document 1, "Tran T, Nayak R, Bruza P. Combining Structure and Content Similarities for XML Document Clustering. Proceedings of the 7th Australasian Data Mining Conference (AusDM'08), 2008: 219-226," and "Ghosh S, Mitra P. Combining Content and Structure Similarity for XML Document Classification Using Composite SVM Kernels. Proceedings of the 19th International Conference on Pattern Recognition (ICPR'08), Tampa, 2008: 1-4," adopt this approach: structural information and content information are represented separately, so the relationship between structure and content is broken, i.e., the first factor above is ignored.
Other methods do consider the relationship between structure and content, e.g., Document 2, "Yuanjia, Goodde, Bohong. XML web document classification research based on the correlation between structure and text keywords. Journal of Computer Research and Development, 2006, 43(8): 1361-." Although these methods consider the containment relationship between structure and content, the structure is modeled as paths, and a path only expresses the hierarchical (parent-child, ancestor-descendant) relationships between elements while neglecting the interrelationships among different paths and the similarity between paths; that is, the second factor is not handled well.
Document 4, "Yang J, Zhang F. XML Document Classification Using Extended VSM. Focused Access to XML Documents, 6th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX'07), Dagstuhl Castle, Germany: Springer Berlin/Heidelberg, 2008: 234-244," extends the vector space model to represent semi-structured data as a matrix, thereby capturing the association between structure and content, with the relationships among the elements inside the structure embodied in a kernel matrix. Document 5, "Yang J, Wang S. Extended VSM for XML Document Classification Using Frequent Subtrees. Focused Retrieval and Evaluation, 8th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX'09), Springer Berlin/Heidelberg, 2010: 441-448," further replaces structural elements with frequent structural subtrees to better embody these relationships. Although this line of work takes the first two factors into account to some extent, it still does not explicitly address the third factor.
In summary, existing semi-structured data classification methods do not fully consider these three factors when extracting features, so the classification models they construct lack part of the intrinsic information relating semi-structured data features to categories, which degrades the accuracy of semi-structured data classification.
Disclosure of Invention
To overcome the poor accuracy of existing semi-structured data classification methods, the invention provides a semi-structured data classification method based on tag sequences and nGrams. The method takes TSGrams features as the basic unit for representing semi-structured data: a tag sequence captures the structural information of the data, nGrams capture its content information, and the two are fused into a single feature that captures the containment relationship between structure and content while also accounting for the interrelationships among different keywords in the content. Information gain is used to screen the TSGrams features, yielding a feature space built from TSGrams features with strong discriminative power; a classification model is constructed for each category from the mutual information between TSGrams features and categories; and because classification also takes the similarity between different structures into account, the accuracy of semi-structured data classification is improved.
The technical scheme adopted by the invention for solving the technical problems is as follows: a semi-structured data classification method based on a tag sequence and nGrams is characterized by comprising the following steps:
Step 1: construct the TSGrams feature space.
(1) TSGrams feature extraction. For each document d in the data set D, traverse all text nodes of d using a tree model; the path from the root node to a text node's parent forms a tag sequence, and all nGrams of length no greater than 3 are extracted from the text content. The tag sequences and nGrams are then combined to form TSGrams features, and the set of all TSGrams features is denoted TSGramsSet.
(2) Information gain calculation. Compute the information gain IG(f) of each TSGrams feature f: <s, g> in TSGramsSet. In the standard feature-selection form,
IG(f) = −Σ_{i=1..k} P(C_i)·log P(C_i) + P(f)·Σ_{i=1..k} P(C_i|f)·log P(C_i|f) + P(¬f)·Σ_{i=1..k} P(C_i|¬f)·log P(C_i|¬f),
where k is the number of classes, P(C_i) is the prior probability of class C_i, P(f) and P(¬f) are the probabilities that a document does or does not contain f, and P(C_i|f) and P(C_i|¬f) are the corresponding class-conditional probabilities.
(3) Sort all TSGrams features of length 1 in TSGramsSet by information gain in descending order, and let the information gain of the N-th feature be IG_N, where N is a parameter.
(4) Select all features in TSGramsSet whose information gain is greater than IG_N; these features constitute the feature space Ω.
Step 2: construct the classification model.
(1) Compute the mutual information MI(f, C_i) between each TSGrams feature f: <s, g> in the feature space Ω and each class C_i. In the standard pointwise form,
MI(f, C_i) = log( P(f, C_i) / (P(f)·P(C_i)) ),
where P(f, C_i) is the probability that a document belongs to class C_i and contains f.
(2) Partition the feature space Ω into k disjoint subsets, one per class; the subset corresponding to class C_i is called the classification model of C_i and is denoted Φ_{C_i}. Each feature f in Ω is placed in the classification model Φ_{C*} with which it has the highest mutual information, i.e., C* = argmax_{C_i} MI(f, C_i).
(3) Under this partition, the classification model Φ_{C_i} of any class C_i can be expressed as a vector in the TSGrams feature space Ω:
Φ_{C_i} = (w_{i,1}, w_{i,2}, …, w_{i,|Ω|}),
where w_{i,j} is the weight of TSGrams feature f_j: <s_j, g_j> in class C_i: the weight is 0 if f_j does not belong to the model of C_i; otherwise it is the mutual information MI(f_j, C_i) after normalization.
Step 3: classify data of unknown class.
(1) Preprocess the unknown-class semi-structured data d to be classified and extract its TSGrams features by the method above, discarding any feature that appears in d but is not contained in the feature space Ω, so that d is represented as a vector in Ω:
d = (w_1, w_2, …, w_{|Ω|}),
where w_j is the frequency with which the j-th feature f_j: <s_j, g_j> of the TSGrams feature space Ω appears in document d, after normalization.
(2) Using the vector representations, compute the similarity between the unknown-class semi-structured data d and the classification model Φ_{C_i} of each class C_i. The similarity is a structure-aware form of cosine similarity:
Sim(d, Φ_{C_i}) = ( Σ_{j,k} w_d(<s_j, g_j>) · w_{Φ_{C_i}}(<s_k, g_j>) · sim(s_j, s_k) ) / ( ‖d‖ · ‖Φ_{C_i}‖ ),
where the sum runs over pairs of features that share the same nGram g_j, w_d(<s_j, g_j>) is the weight of TSGrams feature <s_j, g_j> in the unknown-class semi-structured data d, w_{Φ_{C_i}}(<s_k, g_j>) is the weight of TSGrams feature <s_k, g_j> in the class model Φ_{C_i}, ‖d‖ and ‖Φ_{C_i}‖ are the Euclidean norms of d and Φ_{C_i}, and sim(s_j, s_k) is the similarity between tag sequences s_j and s_k, defined as
sim(s_j, s_k) = 1 − ed(s_j, s_k) / max(m, n),
where m and n are the lengths of tag sequences s_j and s_k respectively, and ed(s_j, s_k) is the edit distance between s_j and s_k.
(3) Assign document d to the class C* with the highest similarity, i.e., C* = argmax_{C_i} Sim(d, Φ_{C_i}).
The invention has the beneficial effects that: the method takes TSGrams features as the basic unit for representing semi-structured data, captures the structural information of the data with a tag sequence and its content information with nGrams, and fuses the two into a single feature that captures the containment relationship between structure and content while also accounting for the interrelationships among different keywords in the content. It screens the TSGrams features with information gain to obtain a feature space with strong discriminative power, constructs a classification model for each category from the mutual information between TSGrams features and categories, and takes the similarity between different structures into account during classification, thereby improving the accuracy of semi-structured data classification.
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Drawings
FIG. 1 is a flow chart of the semi-structured data classification method based on tag sequences and nGrams according to the present invention.
Detailed Description
Referring to FIG. 1, the specific steps of the present method for classifying semi-structured data based on tag sequences and nGrams are as follows:
1. Construct the TSGrams feature space.
1> TSGrams feature extraction: for each document d in the data set D, traverse all text nodes of d using a tree model; the path from the root node to a text node's parent forms a tag sequence, and all nGrams of length no greater than 3 are extracted from the text content by a method similar to that of "Tesar R, Strnad V, Jezek K, Poesio M. Extending the Single Words-Based Document Model: A Comparison of Bigrams and 2-Itemsets. Proceedings of the 2006 ACM Symposium on Document Engineering (DocEng'06), 2006." The tag sequences and nGrams are then combined to form TSGrams features, and the set of all TSGrams features is denoted TSGramsSet.
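The extraction step above can be sketched as follows in Python; this is an illustrative reading of the step, not code from the patent, and the sample document, function names, and whitespace tokenization of the text content are assumptions:

```python
import xml.etree.ElementTree as ET

def extract_tsgrams(xml_text, max_n=3):
    """Pair each text node's tag sequence (path from the root to the
    text node's parent) with every nGram of length <= max_n drawn
    from that node's text content."""
    root = ET.fromstring(xml_text)
    tsgrams = set()

    def walk(node, path):
        path = path + (node.tag,)          # tag sequence so far
        if node.text and node.text.strip():
            words = node.text.split()      # naive whitespace tokenization
            for n in range(1, max_n + 1):
                for i in range(len(words) - n + 1):
                    gram = tuple(words[i:i + n])
                    tsgrams.add((path, gram))  # <tag sequence, nGram> pair
        for child in node:
            walk(child, path)

    walk(root, ())
    return tsgrams

doc = "<movie><title>star wars</title><cast><actor>mark hamill</actor></cast></movie>"
feats = extract_tsgrams(doc)
```

Each element of `feats` is one TSGrams feature `<s, g>`: the tag sequence identifies where in the hierarchy the content sits, preserving the containment relationship the method relies on.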
2> Information gain calculation: compute the information gain IG(f) of each TSGrams feature f: <s, g> in TSGramsSet. In the standard feature-selection form,
IG(f) = −Σ_{i=1..k} P(C_i)·log P(C_i) + P(f)·Σ_{i=1..k} P(C_i|f)·log P(C_i|f) + P(¬f)·Σ_{i=1..k} P(C_i|¬f)·log P(C_i|¬f),
where k is the number of classes, P(C_i) is the prior probability of class C_i, P(f) and P(¬f) are the probabilities that a document does or does not contain f, and P(C_i|f) and P(C_i|¬f) are the corresponding class-conditional probabilities.
3> Sort all TSGrams features of length 1 in TSGramsSet by information gain in descending order, and let the information gain of the N-th feature be IG_N. Here N is a parameter; it takes different values for different training data sets and needs to be tuned according to experimental results.
4> Select all features in TSGramsSet whose information gain is greater than IG_N; these features constitute the feature space Ω.
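Since the patent's information gain formula is an image, the sketch below assumes the standard feature-selection form IG(f) = H(C) − H(C | f present/absent); function names and data layout are illustrative:

```python
import math

def entropy(labels, classes):
    """Shannon entropy (base 2) of the class distribution in `labels`."""
    if not labels:
        return 0.0
    h = 0.0
    for c in classes:
        p = sum(1 for l in labels if l == c) / len(labels)
        if p > 0:
            h -= p * math.log2(p)
    return h

def information_gain(has_feature, labels, classes):
    """IG(f) = H(C) - H(C | f), with f treated as present/absent per
    document. `has_feature[i]` says whether document i contains f and
    `labels[i]` is its class."""
    n = len(labels)
    present = [l for l, h in zip(labels, has_feature) if h]
    absent = [l for l, h in zip(labels, has_feature) if not h]
    cond = (len(present) / n) * entropy(present, classes) \
         + (len(absent) / n) * entropy(absent, classes)
    return entropy(labels, classes) - cond
```

Features whose gain exceeds IG_N (the gain of the N-th best length-1 feature) would then be kept to form the feature space Ω.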
2. Construct the classification model.
1> Compute the mutual information MI(f, C_i) between each TSGrams feature f: <s, g> in the feature space Ω and each class C_i. In the standard pointwise form,
MI(f, C_i) = log( P(f, C_i) / (P(f)·P(C_i)) ),
where P(f, C_i) is the probability that a document belongs to class C_i and contains f.
2> Partition the feature space Ω into k disjoint subsets, one per class; the subset corresponding to class C_i is called the classification model of C_i and is denoted Φ_{C_i}. Each feature f in Ω is placed in the classification model Φ_{C*} with which it has the highest mutual information, i.e., C* = argmax_{C_i} MI(f, C_i).
3> Under this partition, the classification model Φ_{C_i} of any class C_i can be expressed as a vector in the TSGrams feature space Ω:
Φ_{C_i} = (w_{i,1}, w_{i,2}, …, w_{i,|Ω|}),
where w_{i,j} is the weight of TSGrams feature f_j: <s_j, g_j> in class C_i: the weight is 0 if f_j does not belong to the model of C_i; otherwise it is the mutual information MI(f_j, C_i) after normalization.
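Steps 2 1>-3> can be sketched as follows; the pointwise form of the mutual information is an assumption (the patent's formula is an image), and all names are illustrative:

```python
import math

def pointwise_mi(n_fc, n_f, n_c, n_docs):
    """MI(f, C) = log( P(f, C) / (P(f) * P(C)) ), estimated from counts:
    n_fc documents of class C containing f, n_f containing f overall,
    n_c in class C, n_docs in total."""
    if n_fc == 0:
        return float("-inf")
    return math.log(n_fc * n_docs / (n_f * n_c))

def build_class_models(mi, features, classes):
    """Assign each feature to the class with which it has the highest MI,
    then use the normalized MI values as that class model's weights.
    `mi` maps (feature, class) pairs to mutual information values."""
    models = {c: {} for c in classes}
    for f in features:
        best = max(classes, key=lambda c: mi[(f, c)])
        models[best][f] = mi[(f, best)]
    for weights in models.values():          # normalize within each model
        total = sum(weights.values())
        if total > 0:
            for f in weights:
                weights[f] /= total
    return models

mi = {('f1', 'A'): 2.0, ('f1', 'B'): 0.5,
      ('f2', 'A'): 0.1, ('f2', 'B'): 1.0,
      ('f3', 'A'): 2.0, ('f3', 'B'): 0.0}
models = build_class_models(mi, ['f1', 'f2', 'f3'], ['A', 'B'])
```

The resulting `models` are the disjoint subsets Φ_{C_i}: each feature appears in exactly one class model, weighted by normalized mutual information.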
3. Classify data of unknown class.
1> Preprocess the unknown-class semi-structured data d to be classified and extract its TSGrams features by the method above, discarding any feature that appears in d but is not contained in the feature space Ω, so that d can be represented as a vector in Ω:
d = (w_1, w_2, …, w_{|Ω|}),
where w_j is the frequency with which the j-th feature f_j: <s_j, g_j> of the TSGrams feature space Ω appears in document d, after normalization.
2> Using the vector representations, compute the similarity between the unknown-class semi-structured data d and the classification model Φ_{C_i} of each class C_i. The similarity is a structure-aware form of cosine similarity:
Sim(d, Φ_{C_i}) = ( Σ_{j,k} w_d(<s_j, g_j>) · w_{Φ_{C_i}}(<s_k, g_j>) · sim(s_j, s_k) ) / ( ‖d‖ · ‖Φ_{C_i}‖ ),
where the sum runs over pairs of features that share the same nGram g_j, w_d(<s_j, g_j>) is the weight of TSGrams feature <s_j, g_j> in d, w_{Φ_{C_i}}(<s_k, g_j>) is the weight of TSGrams feature <s_k, g_j> in the class model Φ_{C_i}, ‖d‖ and ‖Φ_{C_i}‖ are the Euclidean norms of d and Φ_{C_i}, and sim(s_j, s_k) is the similarity between tag sequences s_j and s_k, defined as
sim(s_j, s_k) = 1 − ed(s_j, s_k) / max(m, n),
where m and n are the lengths of tag sequences s_j and s_k respectively, and ed(s_j, s_k) is the edit distance between s_j and s_k. For the computation of the edit distance, see "Levenshtein VI. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Problems of Information Transmission, 1965"; the difference here is that the basic unit of editing is the tag rather than the character.
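The tag-sequence similarity can be sketched as follows; normalizing the edit distance by max(m, n) is an assumption consistent with the description, and, per the patent, the unit of editing is a whole tag rather than a character:

```python
def edit_distance(s, t):
    """Levenshtein distance between two tag sequences (tuples of tags);
    each insertion, deletion, or substitution of a whole tag costs 1."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a tag
                           dp[i][j - 1] + 1,        # insert a tag
                           dp[i - 1][j - 1] + cost) # substitute a tag
    return dp[m][n]

def tag_seq_similarity(s, t):
    """sim(s_j, s_k) = 1 - ed(s_j, s_k) / max(m, n)."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))
```

Identical tag sequences score 1.0, completely different ones of equal length score 0.0, so structurally similar paths still contribute to the similarity even when they are not identical.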
3> Assign document d to the class C* with the highest similarity, i.e., C* = argmax_{C_i} Sim(d, Φ_{C_i}).
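Step 3 as a whole can be sketched as follows, assuming the structure-aware cosine similarity described above; `sim` can be any tag-sequence similarity function, and all names and the sparse-dictionary representation are illustrative:

```python
import math

def classify(doc_vec, models, sim):
    """Pick the class whose model has the highest structure-aware cosine
    similarity with the document vector. `doc_vec` and each model map
    (tag_sequence, ngram) pairs to weights; pairs contribute only when
    they share the same nGram, scaled by tag-sequence similarity."""
    def score(model):
        num = sum(wd * wm * sim(sj, sk)
                  for (sj, gj), wd in doc_vec.items()
                  for (sk, gk), wm in model.items()
                  if gj == gk)                       # same nGram required
        nd = math.sqrt(sum(w * w for w in doc_vec.values()))
        nm = math.sqrt(sum(w * w for w in model.values()))
        return num / (nd * nm) if nd and nm else 0.0
    return max(models, key=lambda c: score(models[c]))

doc_vec = {(('a', 'b'), ('x',)): 1.0}
models = {'A': {(('a', 'b'), ('x',)): 1.0},
          'B': {(('a', 'b'), ('y',)): 1.0}}
label = classify(doc_vec, models, lambda s, t: 1.0 if s == t else 0.5)
```

Because `sim` rewards near-identical tag sequences, two documents that carry the same keywords under slightly different structures can still be matched to the same class.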
Claims (1)
1. A semi-structured data classification method based on label sequences and nGrams is characterized by comprising the following steps:
Step 1: construct the TSGrams feature space:
(1) TSGrams feature extraction: for each document d in the data set D, traverse all text nodes of d using a tree model; the path from the root node to a text node's parent forms a tag sequence, and nGrams of length no greater than 3 are extracted from the text content; tag sequences and nGrams are combined to form TSGrams features, and the set of all TSGrams features is denoted TSGramsSet;
(2) information gain calculation: compute the information gain IG(f) of each TSGrams feature f: <s, g> in TSGramsSet; in the standard feature-selection form,
IG(f) = −Σ_{i=1..k} P(C_i)·log P(C_i) + P(f)·Σ_{i=1..k} P(C_i|f)·log P(C_i|f) + P(¬f)·Σ_{i=1..k} P(C_i|¬f)·log P(C_i|¬f),
where k is the number of classes, P(C_i) is the prior probability of class C_i, P(f) and P(¬f) are the probabilities that a document does or does not contain f, and P(C_i|f) and P(C_i|¬f) are the corresponding class-conditional probabilities;
(3) sort all TSGrams features of length 1 in TSGramsSet by information gain in descending order, and let the information gain of the N-th feature be IG_N, where N is a parameter;
(4) select all features in TSGramsSet whose information gain is greater than IG_N; these features constitute the feature space Ω;
Step 2: construct the classification model:
(1) compute the mutual information MI(f, C_i) between each TSGrams feature f: <s, g> in the feature space Ω and each class C_i; in the standard pointwise form, MI(f, C_i) = log( P(f, C_i) / (P(f)·P(C_i)) ), where P(f, C_i) is the probability that a document belongs to class C_i and contains f;
(2) partition the feature space Ω into k disjoint subsets, one per class; the subset corresponding to class C_i is called the classification model of C_i and is denoted Φ_{C_i}; each feature f in Ω is placed in the classification model Φ_{C*} with which it has the highest mutual information, i.e., C* = argmax_{C_i} MI(f, C_i);
(3) under this partition, the classification model Φ_{C_i} of any class C_i can be expressed as a vector in the TSGrams feature space Ω: Φ_{C_i} = (w_{i,1}, w_{i,2}, …, w_{i,|Ω|}), where w_{i,j} is the weight of TSGrams feature f_j: <s_j, g_j> in class C_i: the weight is 0 if f_j does not belong to the model of C_i, and otherwise it is the mutual information MI(f_j, C_i) after normalization;
Step 3: classify data of unknown class:
(1) preprocess the unknown-class semi-structured data d to be classified and extract its TSGrams features by the method above, discarding any feature that appears in d but is not contained in the feature space Ω, so that d is represented as a vector in Ω: d = (w_1, w_2, …, w_{|Ω|}), where w_j is the frequency with which the j-th feature f_j: <s_j, g_j> of the TSGrams feature space Ω appears in document d, after normalization;
(2) using the vector representations, compute the similarity between the unknown-class semi-structured data d and the classification model Φ_{C_i} of each class C_i, as a structure-aware form of cosine similarity: Sim(d, Φ_{C_i}) = ( Σ_{j,k} w_d(<s_j, g_j>) · w_{Φ_{C_i}}(<s_k, g_j>) · sim(s_j, s_k) ) / ( ‖d‖ · ‖Φ_{C_i}‖ ), where the sum runs over pairs of features that share the same nGram g_j, w_d(<s_j, g_j>) is the weight of TSGrams feature <s_j, g_j> in the unknown-class semi-structured data d, w_{Φ_{C_i}}(<s_k, g_j>) is its weight in the class model Φ_{C_i}, ‖d‖ and ‖Φ_{C_i}‖ are the Euclidean norms of d and Φ_{C_i}, and sim(s_j, s_k) is the similarity between tag sequences s_j and s_k, defined as sim(s_j, s_k) = 1 − ed(s_j, s_k) / max(m, n), where m and n are the lengths of s_j and s_k respectively, and ed(s_j, s_k) is their edit distance;
(3) assign document d to the class C* with the highest similarity, i.e., C* = argmax_{C_i} Sim(d, Φ_{C_i}).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2016100599996 | 2016-01-28 | ||
CN201610059999 | 2016-01-28 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106250910A CN106250910A (en) | 2016-12-21 |
CN106250910B true CN106250910B (en) | 2021-01-05 |
Family
ID=57613103
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610555498.7A Active CN106250910B (en) | 2016-01-28 | 2016-07-14 | Semi-structured data classification method based on label sequence and nGrams |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106250910B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109993235A (en) * | 2019-04-10 | 2019-07-09 | 苏州浪潮智能科技有限公司 | A kind of multivariate data classification method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102033867A (en) * | 2010-12-14 | 2011-04-27 | 西北工业大学 | Semantic-similarity measuring method for XML (Extensible Markup Language) document classification |
CN102890698A (en) * | 2012-06-20 | 2013-01-23 | 杜小勇 | Method for automatically describing microblogging topic tag |
CN103577452A (en) * | 2012-07-31 | 2014-02-12 | 国际商业机器公司 | Website server and method and device for enriching content of website |
CN104063472A (en) * | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
CN104794169A (en) * | 2015-03-30 | 2015-07-22 | 明博教育科技有限公司 | Subject term extraction method and system based on sequence labeling model |
- 2016-07-14 CN CN201610555498.7A patent/CN106250910B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102033867A (en) * | 2010-12-14 | 2011-04-27 | 西北工业大学 | Semantic-similarity measuring method for XML (Extensible Markup Language) document classification |
CN102890698A (en) * | 2012-06-20 | 2013-01-23 | 杜小勇 | Method for automatically describing microblogging topic tag |
CN103577452A (en) * | 2012-07-31 | 2014-02-12 | 国际商业机器公司 | Website server and method and device for enriching content of website |
CN104063472A (en) * | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
CN104794169A (en) * | 2015-03-30 | 2015-07-22 | 明博教育科技有限公司 | Subject term extraction method and system based on sequence labeling model |
Non-Patent Citations (4)
Title |
---|
Extending the Single Words-Based Document Model: A Comparison of Bigrams and 2-Itemsets; Roman Tesar; Proceedings of the 2006 ACM Symposium on Document Engineering (DocEng'06); 2006-10-13; Sections 2, 4, 4.4, and 7.1 *
Karl-Michael Schneider. A New Feature Selection Score for Multinomial Naïve Bayes Text Classification Based on KL-Divergence. Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions (ACLdemo'04). 2004. *
Semi-structured data similarity measurement based on tag sequences; Zhang Lijun et al.; Journal of Huazhong University of Science and Technology (Natural Science Edition); 2012-08-23; Vol. 40, No. 8; abstract *
Inductive learning algorithms and descriptions for text classification; Zheng Dongfei et al.; Computer Engineering and Design; 2006-02-28; Vol. 27, No. 4; pp. 679-681 *
Also Published As
Publication number | Publication date |
---|---|
CN106250910A (en) | 2016-12-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109471938B (en) | Text classification method and terminal | |
AU2011326430B2 (en) | Learning tags for video annotation using latent subtags | |
US9087297B1 (en) | Accurate video concept recognition via classifier combination | |
CN108256104B (en) | Comprehensive classification method of internet websites based on multidimensional characteristics | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN104239553A (en) | Entity recognition method based on Map-Reduce framework | |
CN109446804B (en) | Intrusion detection method based on multi-scale feature connection convolutional neural network | |
CN110222218A (en) | Image search method based on multiple dimensioned NetVLAD and depth Hash | |
CN107341199B (en) | Recommendation method based on document information commonality mode | |
CN112784031B (en) | Method and system for classifying customer service conversation texts based on small sample learning | |
CN112699953A (en) | Characteristic pyramid neural network architecture searching method based on multi-information path aggregation | |
CN115048464A (en) | User operation behavior data detection method and device and electronic equipment | |
CN116582300A (en) | Network traffic classification method and device based on machine learning | |
CN114676346A (en) | News event processing method and device, computer equipment and storage medium | |
CN106250910B (en) | Semi-structured data classification method based on label sequence and nGrams | |
Adami et al. | Bootstrapping for hierarchical document classification | |
CN116578708A (en) | Paper data name disambiguation algorithm based on graph neural network | |
US11514233B2 (en) | Automated nonparametric content analysis for information management and retrieval | |
CN114265954B (en) | Graph representation learning method based on position and structure information | |
CN111768214A (en) | Product attribute prediction method, system, device and storage medium | |
Saund | A graph lattice approach to maintaining and learning dense collections of subgraphs as image features | |
Asirvatham et al. | Web page categorization based on document structure | |
CN107729557A (en) | A kind of classification of inventory information, search method and device | |
CN114611668A (en) | Vector representation learning method and system based on heterogeneous information network random walk | |
CN110717100B (en) | Context perception recommendation method based on Gaussian embedded representation technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 2021-09-15
Address after: Room 660, Building 5, No. 16, Zhuantang Science and Technology Economic Block, Xihu District, Hangzhou City, Zhejiang Province, 310000
Patentee after: Yunyao Technology (Zhejiang) Co., Ltd.
Address before: No. 127 Youyi West Road, Xi'an, Shaanxi, 710072
Patentee before: Northwestern Polytechnical University