CN106250910B

CN106250910B - Semi-structured data classification method based on label sequence and nGrams

Info

Publication number: CN106250910B
Application number: CN201610555498.7A
Authority: CN
Inventors: 张利军; 李宁; 高锦涛
Original assignee: Northwestern Polytechnical University
Current assignee: Yunyao Technology (Zhejiang) Co.,Ltd.
Priority date: 2016-01-28
Filing date: 2016-07-14
Publication date: 2021-01-05
Anticipated expiration: 2036-07-14
Also published as: CN106250910A

Abstract

The invention discloses a semi-structured data classification method based on a label sequence and nGrams, which is used for solving the technical problem of poor accuracy of the existing semi-structured data classification method. The technical scheme includes that TSGrams features are used as a basic unit for representing semi-structured data, structural information of the semi-structured data is captured through a label sequence, content information of the semi-structured data is captured through nGrams, the TSGrams features and the content information are fused to serve as inclusion relation between feature capture structures and content, mutual relation among different keywords in the content information is considered, information gain is used for screening the TSGrams features, a TSGrams feature construction feature space with strong classification capacity is obtained, a category classification model is built according to the mutual information between the TSGrams features and categories, similarity among different structures is considered during classification, and accuracy of classification of the semi-structured data is improved.

Description

Semi-structured data classification method based on label sequence and nGrams

Technical Field

The invention relates to a semi-structured data classification method, in particular to a semi-structured data classification method based on a label sequence and nGrams.

Background

Semi-structured data classification is generally divided into three steps: firstly, extracting features from a semi-structured data set of known classes, then constructing a classification model by using the extracted features, and finally classifying the data of the unknown classes by using the constructed model. The semi-structured data contains structure and content information, and for the classification based on the structure and the content, the following factors need to be considered when extracting the features:

1. the containment relationship between structure and content, i.e., the content is organized to be contained in different hierarchies;

2. the mutual relations among the elements in the structure, namely brother relations, father-son relations, ancestor-descendant relations and the like among the elements;

3. the interrelationship between keywords in the content.

Most of the existing semi-structured data classification methods based on structures and contents extend the traditional unstructured text document vector space model on the basis of the vector space model, so that the structured text document vector space model contains structural information and is then used for classification. For example, documents 1 "train T, Nayak R, Bruza P d. combining Structure and Content properties for XML Document clustering. proceedings of the 7th architecture Data Mining Conference (AusDM '08)," 2008.219-226 "and" ghos, Mitra P. combining Content and Structure Similarity for XML Document Classification Composite SVM kernels. proceedings of19th International Conference on Pattern Recognition (ICPR'08), Tampa,2008.1-4 "all employ such methods in that the Structure information and the Content information respectively indicate that the mutual relationship between the Structure and the Content is broken, i.e., the first factor mentioned above is ignored.

There are also methods that consider the relationship between structure and Content, such as document 2 "Yuanjia, Goodde, Bohong. XML Web Classification research based on the correlation between structure and text keywords. computer research and development, 2006,43(8): 1361-. Although these methods consider the inclusion relationship between the structure and the content, the structure is modeled as a path, and the path represents the hierarchical (parent-child, ancestor-descendant) relationship between elements, but neglects the mutual relationship between different paths and the similarity of paths, and the like, i.e. the second factor cannot be processed well.

Document 4 "Yang J, Zhang F. XML Document Classification Using Extended VSM. proceedings of Focused Access to XML Documents, the 6th International works for the Evaluation of XML recommendation (INEX'07), Dagstuhl cast, Germany: springer Berlin/Heidelberg,2008.234-244, "extends the vector space model, represents semi-structured data as a matrix, thereby capturing the association between structure and content, and the relationship among the internal (elements) of the structure is embodied by a core matrix, and the structure element structure subtrees are replaced by structure subtrees in the Document 5 'Yang J, Wang S.extended VSM for XML Document Classification Using frequency subtrees of Focused Retrieval and Evaluation, the 8th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX'09), spring Berlin/Heidelberg,2010.441-448 ], so as to further embody the relevance. This approach, while taking into account the first two factors to some extent, has not yet explicitly addressed the third factor.

In summary, when the existing semi-structured data classification method extracts features from semi-structured data, the three factors are not fully considered, so that the semi-structured data classification model constructed by the methods lacks part of intrinsic information between the semi-structured data features and the semi-structured data categories, thereby affecting the accuracy of semi-structured data classification.

Disclosure of Invention

In order to overcome the defect of poor accuracy of the existing semi-structured data classification method, the invention provides a semi-structured data classification method based on a label sequence and nGrams. The method comprises the steps of taking TSGrams characteristics as a basic unit for representing semi-structured data, capturing structure information of the semi-structured data by using a label sequence, capturing content information of the semi-structured data by using nGrams, fusing the TSGrams characteristics and the content information as an inclusion relation between a characteristic capture structure and the content, considering the mutual relation between different keywords in the content information, screening the TSGrams characteristics by using information gain, obtaining a TSGrams characteristic structure characteristic space with strong classification capability, constructing a category classification model according to the mutual information between the TSGrams characteristics and the categories, and considering the similarity between different structures during classification, so that the accuracy of classification of the semi-structured data is improved.

The technical scheme adopted by the invention for solving the technical problems is as follows: a semi-structured data classification method based on a tag sequence and nGrams is characterized by comprising the following steps:

step one, constructing a TSGrams characteristic space.

(1) And (5) extracting TSGrams characteristics. For each data document D in the data set D, traversing all text nodes of D by using a tree model, forming a label sequence by using paths from a root node to a parent node of the root node, and extracting all nGrams with the length less than or equal to 3 from text contents. And then combining the tag sequences and nGrams to form TSGrams characteristics, and recording a set formed by all the TSGrams characteristics as TSGramsSet.

(2) An information gain is calculated. An information gain IG (f) is calculated for each TSGrams feature f: < s, g > in TSGramsSet. The calculation method is as follows:

wherein the content of the first and second substances,

(3) sorting all TSGrams characteristics with the length of1 in the TSGramsSet from large to small according to information gain, and setting the information gain of the Nth characteristic as IG_N. Wherein N is a parameter.

(4) Selecting all information gain values in TSGramsSet to be larger than IG_NConstitutes a feature space omega.

And step two, constructing a classification model.

(1) Calculating each TSGrams feature f in the feature space omega:<s,g>and class C_iMutual information MI (f, C)_i). The calculation method is as follows:

(2) the feature space Ω is divided into k disjoint subsets, each subset representing a class C_iThe subset is called the class C_iIs recorded as a classification model of

Any feature f in the feature space omega is divided into a classification model phi with the highest mutual information_C*In (1). Namely:

(3) according to the above division, any one of the classes C_iClassification model of

Can be expressed as a vector in the TSGrams eigenspace Ω:

wherein, w_i,jFor TSGrams feature f_j:<s_j,g_j>In class C_iWeight in (1), if f_jNot being C_iIf the weight is 0, otherwise, the two information are mutually obtained and normalized, that is to say

And step three, classifying the unknown class data.

(1) Preprocessing the unknown class semi-structured data d to be classified, obtaining the TSGrams features in the unknown class semi-structured data d by the method, and discarding the features which appear in the unknown class semi-structured data d but are not contained in the feature space omega, so that the unknown class semi-structured data d is represented as a vector in the feature space omega:

wherein, w_jIs the jth feature f in the TSGrams feature space omega_j:<s_j,g_j>The values obtained after normalization of the frequencies that appear in document d.

(2) Computing the unknown class semi-structured data d and any class C using its vector representation_iClassification model of

The similarity between the two is calculated as follows:

wherein, w_d(＜s_j,g_j>) For TSGrams characteristic < s_j,g_j>The weights in the unknown class semi-structured data d,

for TSGrams characteristic < s_k,g_j>In-category model

The weights in (1), d, and

are respectively d and

euclidean norm of, and sim(s)_j,s_k) Is a tag sequence s_jAnd s_kThe similarity between them is defined as:

wherein m and n are respectively a tag sequence s_jAnd s_kLength of (d), and ed(s)_j,s_k) Is s is_jAnd s_kThe edit distance of (1).

(3) Assigning the category of document d to C with the highest similarity_*I.e. by

The invention has the beneficial effects that: the method takes TSGrams characteristics as a basic unit for representing semi-structured data, captures structural information of the semi-structured data by using a label sequence, captures content information of the semi-structured data by using nGrams, fuses the two to be used as an inclusion relation between a characteristic capture structure and the content, considers the mutual relation between different keywords in the content information, screens the TSGrams characteristics by using information gain, obtains a TSGrams characteristic structure characteristic space with strong classification capability, constructs a category classification model according to the mutual information between the TSGrams characteristics and the categories, considers the similarity between different structures during classification, and improves the accuracy of classification of the semi-structured data.

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

Drawings

FIG. 1 is a flow chart of the present invention label sequence and nGrams based semi-structured data classification method.

Detailed Description

Refer to fig. 1. The method for classifying semi-structured data based on the label sequence and nGrams comprises the following specific steps:

1. a TSGrams feature space is constructed.

1> TSGrams feature extraction: for each data Document D in The data set D, traversing all text nodes of D in a tree model, constructing a tag sequence from The path of The root node to its parent node, and extracting all nGrams with a length less than or equal to 3 from The text content by a method similar to The documents "Tesar R, Strnad V, Jezek K, et al. And then combining the tag sequences and nGrams to form TSGrams characteristics, and recording a set formed by all the TSGrams characteristics as TSGramsSet.

And 2, calculating information gain. An information gain IG (f) is calculated for each TSGrams feature f: < s, g > in TSGramsSet.

The calculation method is as follows:

wherein:

3>general in TSGramsSetThe TSGrams features with the length of1 are sorted from large to small according to the information gain, and the information gain of the Nth feature is set as IG_N. Wherein N is a parameter, different values are taken according to different training data sets, and adjustment is needed according to experimental results.

4>Selecting all information gain values in TSGramsSet to be larger than IG_NConstitutes a feature space omega.

2. And constructing a classification model.

1>Calculating each TSGrams feature f in the feature space omega:<s,g>and class C_iMutual information MI (f, C)_i). The calculation method is as follows:

2>the feature space Ω is divided into k disjoint subsets, each subset representing a class C_iThe subset is called the class C_iIs recorded as a classification model of

3>according to the above division, any one of the classes C_iClassification model of

Can be expressed as a vector in the TSGrams eigenspace Ω:

3. And classifying the unknown class data.

1, preprocessing unknown class semi-structured data d to be classified, acquiring TSGrams features in the unknown class semi-structured data d by the method, and discarding features which appear in the unknown class semi-structured data d but are not included in a feature space Ω, so that the unknown class semi-structured data d can be represented as a vector in the feature space Ω:

2>Computing the unknown class semi-structured data d and any class C using its vector representation_iClassification model of

The similarity between the two is calculated as follows:

wherein, w_d(＜s_j,g_j>) For TSGrams characteristic < s_j,g_j>The weight in d is such that,

for TSGrams characteristic < s_k,g_j>In-category model

The weights in (1), d, and

are respectively d and

wherein m and n are respectively a tag sequence s_jAnd s_kLength of (d), and ed(s)_j,s_k) Is s is_jAnd s_kFor the calculation of the edit distance, please refer to the document "Levenshtein VI. binary codes capable of correcting spectral errors and deletions of ones. schemes of Information transmission.1965", which is different from the basic unit edited in the method is the label (tag).

3>Assigning the category of document d to C with the highest similarity_*I.e. by

Claims

1. A semi-structured data classification method based on label sequences and nGrams is characterized by comprising the following steps:

step one, constructing a TSGrams characteristic space:

(1) TSGrams feature extraction: for each data document D in the data set D, traversing all text nodes of D by using a tree model, forming a label sequence from a path from a root node to a parent node of the data document D, extracting nGrams with the length less than or equal to 3 from text content, combining the label sequence and the nGrams to form TSGrams characteristics, and recording a set formed by all the TSGrams characteristics as TSGramsSet;

(2) calculating information gain: and calculating the information gain IG (f) of each TSGrams characteristic f: < s, g > in the TSGramsSet, wherein the calculation method comprises the following steps:

wherein the content of the first and second substances,

(3) sorting all TSGrams characteristics with the length of1 in the TSGramsSet from large to small according to information gain, and setting the information gain of the Nth characteristic as IG_NWherein N is a parameter;

(4) selecting all information gain values in TSGramsSet to be larger than IG_NThe characteristic of (a) constitutes a characteristic space Ω;

step two, constructing a classification model:

(1) calculating each TSGrams feature f in the feature space omega:<s,g>and class C_iMutual information MI (f, C)_i) The calculation method comprises the following steps:

Any feature f in feature space omega is divided into classification models with highest mutual information

In (1), namely:

Can be expressed as a vector in the TSGrams eigenspace Ω:

Step three, classifying the unknown class data:

wherein, w_jIs the jth feature f in the TSGrams feature space omega_j:<s_j,g_j>The value obtained after normalization of the frequency of occurrence in the document d;

The similarity between the two groups is calculated by the following method:

wherein, w_d(＜s_j,g_j>) is TSGrams characteristic < s_j,g_jWeight in the unknown class semi-structured data d,

for TSGrams characteristic < s_k,g_jIn class model

The weights in (1), d, and

are respectively d and

wherein m and n are respectively a tag sequence s_jAnd s_kLength of (e) andd(s_j,s_k) Is s is_jAnd s_kThe edit distance of (d);