CN109710725A

CN109710725A - Method and system for recovering Chinese table column labels based on text classification

Info

Publication number: CN109710725A
Application number: CN201811524302.3A
Authority: CN
Inventors: 曹聪; 谢洁; 刘燕兵; 曹亚男; 谭建龙; 郭莉
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2018-12-13
Filing date: 2018-12-13
Publication date: 2019-05-03

Abstract

The present invention relates to a method and system for recovering Chinese table column labels based on text classification. The method comprises: 1) extracting an entity from each row of a table, searching for the extracted entity on an online encyclopedia knowledge platform, and obtaining the information detail page corresponding to the entity; 2) for each attribute of the entity, extracting the sentences containing the attribute value from the entity's information detail page to form the related text of the attribute value; 3) inputting the related text of the attribute value into a text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value; 4) for each attribute column of the table, determining the column label of the attribute column by majority voting over the categories of the cells in that column. The invention can effectively recover column labels for web tables; tables with recovered column labels can be used to construct and extend Chinese knowledge graphs, and also for applications such as data extraction and table search.

Description

Method and system for recovering Chinese table column labels based on text classification

Technical field

The invention belongs to the field of software technology and of knowledge acquisition based on web tables. It relates to methods for recovering the semantics of web tables, and in particular to a method and system for recovering Chinese table column labels based on text classification.

Background art

The hundreds of millions of tables on the Internet have good structural features and latent semantic features, and are easier to analyze and understand than unstructured text data. Knowledge acquisition based on web tables has therefore become a research hotspot in recent years, and table data has been used in research on knowledge base extension, table search, table merging, and so on.

Typically, a table has one entity column containing a group of entities; the other columns are attribute columns that describe attributes of those entities. Each row of the table consists of an entity and its associated attribute values, and the cells in the same column have similar content. However, web tables follow no unified specification, and a large number of tables lack key semantic information such as an explicit table name, column names, and the relationships between columns, so a computer cannot acquire knowledge from such a table directly. How to recover table semantics has therefore become an important research problem in table-based knowledge acquisition. Web table semantic recovery mainly covers three research areas: entity column detection, column label recovery, and determination of the relationships between columns. The present invention addresses column label recovery for Chinese web tables.

At present there is very little research on column label recovery for Chinese tables. For English tables, most existing algorithms rely on large-scale knowledge bases (for example, YAGO, DBpedia, Probase) or on databases crawled from the Web (for example, an isA database). Candidate column labels are obtained by mapping the cell contents of a column to concepts in the knowledge base (or database), and an algorithm then determines the most suitable column label for the table. However, many existing Chinese knowledge bases are either available for internal use only or contain too little knowledge to support Chinese table column label recovery. Moreover, existing English column label recovery techniques annotate tables using facts already present in a knowledge base, so they can hardly discover new, unknown knowledge; because the knowledge in any knowledge base is limited, these techniques are significantly constrained. In addition, cells in Chinese web tables follow no unified specification and may contain whole sentences, so existing techniques cannot effectively solve the Chinese table column label recovery problem.

Summary of the invention

In view of the above problems, the present invention proposes a method and system for recovering Chinese table column labels based on text classification, which treats column label recovery as a text classification problem and solves it with machine learning.

The basic idea of the column label recovery of the present invention is to take each column of the table as a unit, find the label to which the cell contents of that column belong, and then annotate the column with that label. Because the cells within each column are similar, the column label recovery problem can be converted into the problem of determining the labels of the cells in the column.

The present invention limits its scope to Chinese web tables whose rows and columns do not branch, so a table can be regarded as an m × n two-dimensional array in which each element may be a word or a sentence. Given a table T, the proposed method determines the category of the content of each cell in a column and then determines the column label from the categories of the cells.

The technical solution adopted by the invention is as follows:

A method for recovering Chinese table column labels based on text classification, comprising the following steps:

1) extracting an entity from each row of the table, searching for the extracted entity on an online encyclopedia knowledge platform, and obtaining the information detail page corresponding to the entity;

2) for each attribute of the entity, extracting the sentences containing the attribute value from the entity's information detail page to form the related text of the attribute value;

3) inputting the related text of the attribute value into a text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value;

4) for each attribute column of the table, determining the column label of the attribute column by majority voting over the categories of the cells in that column.

Further, the online encyclopedia knowledge platform of step 1) includes one or more of the following: Baidu Baike, Wikipedia, Sogou Baike, Hudong Baike.

Further, in step 2), if the attribute value is a sentence, the sentence is segmented into words and stop words are removed, converting the sentence into a word set; the sentences containing words from that word set are then retrieved to form the related text.

Further, the training data used to train the text classifier of step 3) is obtained as follows: "attribute name - attribute value" pairs are extracted from a semi-structured infobox, the body text of the entry containing the infobox is split into sentences, and back-labeling is performed with the attribute value as the keyword, extracting the sentences that contain the keyword to form the training corpus.

Further, two heuristic rules are used during back-labeling:

a) if multiple sentences contain the keyword, these sentences together form the related text of the keyword, which, together with the attribute name, constitutes one training sample;

b) if no sentence contains the complete keyword, the keyword is segmented into words, the words that conflict with other attribute values are removed, and the remaining words are used as keywords to back-label sentences, extracting the sentences that contain one or more of these keywords.

Further, in step 3), before the text classifier is trained, text preprocessing and text vectorization are applied to the training corpus, and the classification model is then trained.

Further, the text preprocessing includes word segmentation and stop word removal; the text vectorization represents text features using the vector space model and uses feature selection to pick the features that contribute most to text classification to represent a text, thereby reducing the number of features and the vector dimension.

Further, feature selection is performed using the chi-square test, and feature weights are then calculated using the TF-IDF algorithm; a feature weight measures the importance or discriminative power of a feature term for a text.

Further, the classification model is trained with the constructed feature vectors as input, using the naive Bayes algorithm and the support vector machine algorithm; during training, k-fold cross-validation combined with grid search is used to tune the model parameters, and the classification model is then trained with the optimized parameters.

A system for recovering Chinese table column labels based on text classification, comprising:

an information detail page acquisition module, responsible for extracting an entity from each row of the table, searching for the extracted entity on an online encyclopedia knowledge platform, and obtaining the information detail page corresponding to the entity;

a related text extraction module, responsible for extracting, for each attribute of the entity, the sentences containing the attribute value from the entity's information detail page to form the related text of the attribute value;

an attribute value classification module, responsible for inputting the related text of the attribute value into the text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value;

a column label determination module, responsible for determining, for each attribute column of the table, the column label of the attribute column by majority voting over the categories of the cells in that column.

The key points of the invention include:

1. The rich information resources on online encyclopedia knowledge platforms such as Baidu Baike are used to supplement sparse table content: related text is obtained for each table cell, and the table column label is determined by text classification.

2. The "attribute name - attribute value" pairs in the semi-structured infoboxes of online encyclopedia knowledge platforms such as Baidu Baike are used to back-label unstructured text, yielding a large amount of training data.

3. Majority voting is used to consider the categories of all cells in each column together and determine the label of that column.

The method of the present invention for recovering Chinese table column labels based on text classification can effectively recover column labels for web tables; tables with recovered column labels can be used to construct and extend Chinese knowledge graphs, and also for applications such as data extraction and table search. The technical advantages of the invention mainly include the following:

1. Table content can be supplemented with the rich information resources on online encyclopedia knowledge platforms such as Baidu Baike, which solves the problem of sparse table context information.

2. The problem of table cells containing long sentences can be solved: by processing long sentences, cells that contain them can be classified and labeled. This means that the proposed method can obtain a large amount of descriptive knowledge (for example, representative works and main achievements) from web tables to expand a knowledge base.

3. The method does not depend on an existing knowledge base, so it can discover new, unknown knowledge and use that new knowledge to supplement a knowledge base.

Brief description of the drawings

Fig. 1 is a flow chart of table column label recovery.

Fig. 2 is a flow chart of classifier training.

Fig. 3 is an example of dataset construction.

Fig. 4 shows table annotation results.

Fig. 5 shows classifier accuracy.

Fig. 6 shows table column label annotation results.

Specific embodiment

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below through specific embodiments and the accompanying drawings.

The present invention provides a table column label recovery method that treats table column label recovery as a text classification problem and uses the rich information resources on online encyclopedia knowledge platforms such as Baidu Baike to supplement sparse table content. Because the problem addressed is column label recovery and not entity column detection in web tables, the invention assumes that the entity column of the table is known and is the first column of the table.

The workflow of the text-classification-based table column label recovery method provided in this embodiment is shown in Fig. 1 and comprises the following steps:

1) For each row of the table, extract the entity from the row (assumed to be in the first column), search Baidu Baike for the entity, and obtain the information detail page corresponding to the entity.

2) For each attribute of the entity, extract the sentences containing the attribute value from the entity's information detail page to form the related text of the attribute value. If the attribute value is a sentence, the sentence is segmented into words, stop words are removed, and so on, converting the sentence into a word set; the sentences containing words from that word set are then retrieved to form the related text of the sentence.

3) Input the related text of the attribute value into the text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value.

4) For each attribute column of the table, determine the column label by majority voting over the categories of the cells in the column.
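The four steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: `fetch_page_text` (returns the detail page text for an entity) and `classify` (returns a category for a related text) are hypothetical callables supplied by the caller, and the sentence-splitting pattern is an assumption.

```python
import re
from collections import Counter
from typing import Callable, List, Optional


def split_sentences(text: str) -> List[str]:
    """Split page text on common Chinese and Western sentence terminators."""
    return [s.strip() for s in re.split(r"[。！？!?\n]+", text) if s.strip()]


def recover_column_labels(
    table: List[List[str]],
    fetch_page_text: Callable[[str], str],  # hypothetical: entity -> page text
    classify: Callable[[str], str],         # hypothetical: related text -> category
) -> List[Optional[str]]:
    """Return one label per attribute column (every column after the first)."""
    if not table:
        return []
    votes = [Counter() for _ in table[0][1:]]
    for row in table:
        entity = row[0]  # step 1: the entity is assumed to be in the first column
        sentences = split_sentences(fetch_page_text(entity))
        for counter, value in zip(votes, row[1:]):
            # step 2: sentences containing the attribute value form the related text
            related = [s for s in sentences if value and value in s]
            if not related:
                continue  # no evidence on the page: this cell casts no vote
            # step 3: the classifier output is the category of the cell
            counter[classify("".join(related))] += 1
    # step 4: the most frequent cell category becomes the column label
    return [c.most_common(1)[0][0] if c else None for c in votes]
```

A column whose cells find no supporting sentence is returned as `None` here; how such columns should be handled is not stated in the description and is left to the caller.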

The whole process can be divided into three parts: pre-processing, table cell labeling, and post-processing. Each part is described in detail below:

1) Pre-processing: the attribute cells of the table are enriched with information from Baidu Baike. Each row R_i of the table can be expressed as (e, p₁, p₂, …, p_m), where e is the entity and p₁, p₂, …, p_m are the attributes of entity e. For each row of the table, Baidu Baike is searched for entity e, the text of the corresponding page is obtained, and the text is split into sentences. Each attribute p_i is then used as a keyword to search for the sentences containing it, which form the related text RT(p_i). If the attribute value is a sentence, the sentence is segmented into words, the words that conflict with other attribute cells are removed, and the sentences containing one or more of the remaining words are retrieved to form the related text. Chinese word segmentation is then applied to the related text RT(p_i) of each attribute p_i, yielding a set of words.
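The construction of the related text RT(p_i) can be illustrated with a short Python sketch. The punctuation set used for sentence splitting, the `tokenize` callable, and the `excluded_words` argument (standing in for the words that conflict with other attribute cells) are assumptions made for illustration; the description does not fix them.

```python
import re
from typing import Callable, Iterable, List, Optional

SENTENCE_END = re.compile(r"[。！？；!?;\n]+")  # assumed sentence terminators


def split_sentences(text: str) -> List[str]:
    """Split page text into non-empty sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def related_text(
    value: str,
    sentences: Iterable[str],
    tokenize: Optional[Callable[[str], List[str]]] = None,
    excluded_words: Iterable[str] = (),
) -> List[str]:
    """Return the sentences that form RT(p_i) for one attribute value.

    Without a tokenizer the whole value is the keyword. With a tokenizer
    (the sentence-valued case) the value is segmented, excluded words are
    dropped, and any sentence containing a remaining word is kept.
    """
    sentences = list(sentences)
    if tokenize is None:
        return [s for s in sentences if value in s]
    excluded = set(excluded_words)
    keywords = [w for w in tokenize(value) if w.strip() and w not in excluded]
    return [s for s in sentences if any(k in s for k in keywords)]
```

In practice `tokenize` would be a Chinese word segmenter; it is injected here so that the sketch stays independent of any particular segmentation library.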

2) Table cell labeling: the set of related words obtained for each attribute p_i by the pre-processing above is fed into the text classifier, which outputs the category of the attribute. The key questions for the text-classification-based table column label recovery method provided by the invention are how to obtain the training dataset and how to train the text classifier.

A. Training dataset: the invention extracts "attribute name - attribute value" pairs from the semi-structured infobox of Baidu Baike, splits the body text of the Baidu Baike entry containing the infobox into sentences, and performs back-labeling with the attribute value as the keyword, extracting the sentences that contain the keyword to form the training corpus. Two heuristic rules are used during back-labeling:

A1) If multiple sentences contain the keyword, these sentences together form the related text of the keyword, which, together with the attribute name, constitutes one training sample.

A2) If no sentence contains the complete keyword, the keyword is segmented into words, the words that conflict with other attribute values are removed, and the remaining words are used as keywords to back-label sentences, extracting the sentences that contain one or more of these keywords.

Of the back-labeled corpus, 80% is used as the training set and 20% as the test set.
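Back-labeling with rules A1 and A2, followed by the 80/20 split, might look like the sketch below. Several points are assumptions rather than statements of the description: a word is treated as "conflicting" when it also occurs in another attribute value of the same infobox, `tokenize` is a caller-supplied segmenter, and the corpus is shuffled with a fixed seed before it is split.

```python
import random
import re
from typing import Callable, Dict, List, Sequence, Tuple


def split_sentences(text: str) -> List[str]:
    return [s.strip() for s in re.split(r"[。！？!?\n]+", text) if s.strip()]


def back_label(
    infobox: Dict[str, str],
    body_text: str,
    tokenize: Callable[[str], List[str]],
) -> List[Tuple[str, str]]:
    """Build (attribute name, related text) training samples for one entry."""
    sentences = split_sentences(body_text)
    samples = []
    for name, value in infobox.items():
        # Rule A1: every sentence containing the full value joins the related text.
        hits = [s for s in sentences if value in s]
        if not hits:
            # Rule A2: fall back to the words of the value, minus words that
            # also appear in other attribute values (assumed meaning of "conflict").
            other = {w for n, v in infobox.items() if n != name for w in tokenize(v)}
            keywords = [w for w in tokenize(value) if w.strip() and w not in other]
            hits = [s for s in sentences if any(k in s for k in keywords)]
        if hits:
            samples.append((name, "。".join(hits)))
    return samples


def split_corpus(samples: Sequence, train_fraction: float = 0.8, seed: int = 0):
    """Shuffle (an assumption) and split the corpus into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Attributes for which neither rule finds a sentence produce no sample in this sketch.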

B. Classifier training: before the text classifier is trained, the training corpus must be preprocessed and vectorized; the classifier is then trained on the training data and its performance is evaluated. The process is shown in Fig. 2, and each step is described in detail below:

B1) Text preprocessing: text preprocessing includes word segmentation, stop word removal, and similar operations. Word segmentation is an indispensable step of text preprocessing: unlike English text, Chinese text does not separate words with spaces, so processing Chinese text first requires a segmentation technique to split the text into individual words, converting the text into a set of words that can later be used to represent it. The present invention uses the jieba segmenter to segment the corpus. Words that carry almost no information and only reflect the grammatical structure of a sentence are then removed, such as grammatical particles and the words for "this" and "that".
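A minimal sketch of this preprocessing step is given below. The stop word list is a tiny illustrative sample chosen for this example, not the list used by the invention, and the segmenter is passed in as `tokenize` so that the sketch runs without third-party packages; with the jieba package installed, `jieba.lcut` could be supplied for that argument.

```python
from typing import Callable, FrozenSet, List

# Illustrative sample only; a practical stop word list is much longer.
STOP_WORDS: FrozenSet[str] = frozenset({"的", "得", "这", "那"})


def preprocess(
    text: str,
    tokenize: Callable[[str], List[str]],  # e.g. jieba.lcut when jieba is available
    stop_words: FrozenSet[str] = STOP_WORDS,
) -> List[str]:
    """Segment text into words and drop stop words and whitespace tokens."""
    return [w for w in tokenize(text) if w.strip() and w not in stop_words]
```

The word order is preserved, so the result can be used either as a word list or, after deduplication, as the word set mentioned in the description.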

B2) Text vectorization: for a computer to process text content, the segmented words must also be mapped through a dictionary into a mathematical representation. In the preprocessed text, each word is regarded as a feature, and the invention represents text features using the vector space model (VSM). The core idea of VSM is that a document D composed of a series of features can be represented by its feature terms and their corresponding weights, i.e. D = D(t₁, ω₁; t₂, ω₂; …; t_n, ω_n), abbreviated as D = D(ω₁, ω₂, …, ω_n), where ω_i is the weight of feature t_i. A document can therefore be expressed as an n-dimensional vector. Because a large number of distinct words leads to an excessively high vector dimension and hence excessive computational complexity, feature selection is used to pick the features that contribute most to text classification to represent a text, which reduces the number of features and the vector dimension and improves the generalization ability of the model. The invention uses the chi-square test (χ²) for feature selection. The chi-square statistic measures the degree of correlation between feature term t_i and category C_j, under the assumption that the relationship between t_i and C_j follows a χ² distribution. The degree of correlation is measured by the χ² statistic (CHI): the higher the CHI of a feature term for a category, the stronger the correlation between the feature term and the category, and the more information about the category the feature carries; conversely, the less. The TF-IDF algorithm is then used to calculate feature weights; a feature weight measures the importance or discriminative power of a feature term for a text. The resulting mathematical representation of the text can be used as input for classifier training and testing.
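The feature selection and weighting described above can be sketched in plain Python. The description does not give the exact formulas, so two common textbook forms are assumed here: the 2 × 2 contingency form of the χ² statistic, χ²(t, c) = N(AD − BC)² / ((A + C)(B + D)(A + B)(C + D)), and TF-IDF with relative term frequency and idf = log(N / df). Taking the maximum χ² over categories as a term's score is likewise an assumption.

```python
import math
from collections import Counter
from typing import List, Sequence


def chi_square(docs: Sequence[Sequence[str]], labels: Sequence[str],
               term: str, category: str) -> float:
    """Chi-square statistic of a term for a category from document counts."""
    a = b = c = d = 0
    for words, label in zip(docs, labels):
        has_term, in_cat = term in words, label == category
        if has_term and in_cat:
            a += 1      # term present, document in category
        elif has_term:
            b += 1      # term present, document in another category
        elif in_cat:
            c += 1      # term absent, document in category
        else:
            d += 1      # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, labels, k: int) -> List[str]:
    """Keep the k terms with the highest chi-square score over all categories."""
    vocab = sorted({w for doc in docs for w in doc})
    cats = set(labels)
    score = {t: max(chi_square(docs, labels, t, c) for c in cats) for t in vocab}
    return sorted(vocab, key=lambda t: -score[t])[:k]  # ties stay alphabetical


def tfidf_vectors(docs, features: Sequence[str]) -> List[List[float]]:
    """Represent each document as TF-IDF weights over the selected features."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            counts[t] / len(doc) * math.log(n / df[t]) if doc and df[t] else 0.0
            for t in features
        ])
    return vectors
```

Each returned vector corresponds to D(ω₁, ω₂, …, ω_n) restricted to the selected features.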

B3) Model training: the invention uses the constructed feature vectors as input and trains classification models with the naive Bayes algorithm (BAYES) and the support vector machine algorithm (SVM); the trained classifiers are then evaluated to select the most suitable model for table column label recovery. During model training, k-fold cross-validation combined with grid search is used to tune the model parameters. The classification model is then trained with the optimized parameters and saved for use in model testing and table cell labeling.
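To make the role of the naive Bayes smoothing parameter concrete, the following sketch implements a small multinomial naive Bayes classifier with additive smoothing `alpha`. It is a simplification for illustration: it works on raw word counts rather than the TF-IDF vectors used in the description, and the multinomial variant is an assumption, since the description does not name the variant.

```python
import math
from collections import Counter, defaultdict
from typing import List, Sequence


class NaiveBayesText:
    """Multinomial naive Bayes over word lists with additive smoothing."""

    def __init__(self, alpha: float = 1.0):
        if alpha <= 0:
            raise ValueError("alpha must be positive")
        self.alpha = alpha

    def fit(self, docs: Sequence[Sequence[str]], labels: Sequence[str]):
        self.vocab = {w for doc in docs for w in doc}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.class_totals = {
            c: sum(self.word_counts[c].values()) for c in self.class_counts
        }
        return self

    def predict(self, words: Sequence[str]) -> str:
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / n_docs)  # log prior
            denom = self.class_totals[label] + self.alpha * len(self.vocab)
            for w in words:
                if w in self.vocab:  # words unseen in training are ignored
                    score += math.log(
                        (self.word_counts[label][w] + self.alpha) / denom
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

With `alpha` close to zero the model trusts the observed counts almost entirely; larger values pull the word probabilities of every class toward a uniform distribution.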

C. Cell labeling: the set of related words obtained for each attribute p_i by pre-processing is used as input to the trained text classifier, which outputs the category of the attribute, taken as the category of the cell containing the attribute.

3) Post-processing: because the cells in the same column have similar content, the invention uses the consistency of the cells in a column to exclude erroneous labels. For the j-th column of the table, the invention considers the labels of all cells in the column together and determines the column label by majority voting: if most cells in the column are labeled with category t^(k), the column label is set to t^(k).
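The voting rule can be sketched in a few lines of Python. Two details are assumptions, because the description does not address them: unlabeled cells (represented as `None`) cast no vote, and a tie is resolved in favor of the label that was seen first.

```python
from collections import Counter
from typing import Optional, Sequence


def column_label(cell_labels: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent cell label in a column, or None if no cell voted."""
    votes = Counter(label for label in cell_labels if label is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For example, a column whose cells were labeled nationality, birthplace, nationality receives the label nationality, which overrides the single misclassified cell.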

An example is given below to further illustrate a concrete implementation of the text-classification-based Chinese table column label recovery method.

1) Training dataset construction: taking Fig. 3 as an example, "attribute name - attribute value" pairs such as "location - Beijing, China" and "collection highlights - Along the River During the Qingming Festival" are extracted from the infobox. The free text of the page containing the infobox is back-labeled with the attribute value as the keyword, yielding the sentences that contain the keyword (the underlined sentences in the figure). If the free text does not contain the complete attribute value, the attribute value is segmented into words, the words that conflict with other attribute values are removed, and the remaining words are used as keywords to back-label sentences. Partial back-labeling results are shown in Table 1.

Table 1. Partial back-labeling results

2) Classifier training: the related text in the training dataset is segmented with the jieba segmentation tool, stop words and low-frequency words are removed, feature selection is performed with the chi-square test (χ²), and the text is vectorized with the TF-IDF method. The vectorized data is used as input to train classification models with the naive Bayes algorithm (BAYES) and the support vector machine algorithm (SVM). During training, the naive Bayes model requires the smoothing parameter alpha to be set; the support vector machine model requires the kernel parameter, which selects the kernel function, the C parameter, which is the penalty coefficient, and, when rbf is chosen as the kernel function, the γ parameter. Model parameters are tuned by grid search combined with cross-validation; the optimal parameters are shown in Table 2:

Table 2. Optimal model parameter settings
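Grid search combined with k-fold cross-validation can be sketched generically as follows. The function names, the contiguous fold layout, and the `fit_predict` callback (which trains a model with the given parameters and returns predictions for the validation samples) are all hypothetical choices for this illustration; the description only states that the two techniques are combined.

```python
import itertools
from typing import Any, Callable, Dict, Iterator, List, Sequence, Tuple


def k_fold_splits(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train indices, validation indices) for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def grid_search_cv(
    param_grid: Dict[str, Sequence[Any]],
    fit_predict: Callable[[Dict[str, Any], list, list, list], list],
    samples: Sequence[Any],
    labels: Sequence[Any],
    k: int = 5,
) -> Tuple[Dict[str, Any], float]:
    """Return the parameter combination with the best cross-validated accuracy."""
    names = sorted(param_grid)
    best_params, best_acc = None, -1.0
    for combo in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        correct = 0
        for train, val in k_fold_splits(len(samples), k):
            preds = fit_predict(
                params,
                [samples[i] for i in train],
                [labels[i] for i in train],
                [samples[i] for i in val],
            )
            correct += sum(p == labels[i] for p, i in zip(preds, val))
        accuracy = correct / len(samples)
        if accuracy > best_acc:  # the first best combination wins ties
            best_params, best_acc = params, accuracy
    return best_params, best_acc
```

For the models in this example the grid would range over `alpha` for BAYES and over `kernel`, `C`, and `γ` for SVM. Because the folds are contiguous, the samples should be shuffled before calling `grid_search_cv`.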

3) Cell labeling: for each row (e, p₁, p₂, …, p_m) of an unlabeled table, Baidu Baike is searched for the row's entity e and the text of the corresponding page is obtained. The sentences in the text that contain attribute value p_i are then retrieved to form the related text, and the trained classifier predicts on the related text to give the category label of attribute value p_i. For each column of the table, the column label is determined by majority voting. Fig. 4 shows the annotation results for a web table: the bold black text in each cell is the cell annotation result, and the bold black text in the last row of the table is the column label determined by majority voting over the cell annotation results.

Based on the above scheme, the beneficial effects of the proposed method are illustrated here. Data from the person category is used for the experiments, and five common attribute types are selected: date of birth, nationality, birthplace, occupation, and graduated school. The experiment uses these attribute types as target attributes and obtains a large amount of labeled data from Baidu Baike; 80% of the data is used as training data TR and 20% as test data TE. Table 3 lists the per-category statistics of the dataset.

Table 3. Training dataset and test dataset

Attribute type	Training dataset	Test dataset
Date of birth	13620	3431
Nationality	12210	3000
Birthplace	13062	3317
Occupation	12302	3005
Graduated school	8048	2018
Total	59242	14771
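As a quick consistency check on these figures, the short Python snippet below recomputes the totals and the training share from the per-category counts in Table 3. The variable names are illustrative only.

```python
# Per-category sample counts copied from Table 3.
train = {"Date of birth": 13620, "Nationality": 12210, "Birthplace": 13062,
         "Occupation": 12302, "Graduated school": 8048}
test = {"Date of birth": 3431, "Nationality": 3000, "Birthplace": 3317,
        "Occupation": 3005, "Graduated school": 2018}

train_total = sum(train.values())   # 59242, matching the Total row
test_total = sum(test.values())     # 14771, matching the Total row
train_share = train_total / (train_total + test_total)

# Training share per attribute type; each is close to the stated 80%.
per_type_share = {name: train[name] / (train[name] + test[name]) for name in train}
```

The overall training share is about 0.80, in line with the 80/20 split described for TR and TE.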

Classification models are trained with the naive Bayes algorithm (BAYES) and the support vector machine algorithm (SVM), with the optimal parameter settings shown in Table 2: the BAYES smoothing parameter is alpha = 1; the SVM uses the RBF kernel function with C = 0.5 and γ = 2. Fig. 5 shows the accuracy of the BAYES and SVM classifiers trained on the above data. The experimental results show that the classifier trained with the SVM algorithm is more accurate than the one trained with the BAYES algorithm on most attribute types, and that on the "nationality" and "birthplace" attributes the accuracy improvement exceeds 19%.

Tables containing person entities are crawled from web pages with a web crawler and screened, and 104 tables are selected. Each row of a table contains one entity and several attribute values. For the five target attribute types of the experiment, a total of 1807 instances are obtained, as shown in Table 4. All tables are manually annotated for experimental evaluation.

Table 4. Statistics of table attribute instances

Attribute type	Number of entities
Date of birth	126
Nationality	833
Birthplace	353
Occupation	346
Graduated school	149
Total	1807

The trained classifiers are first used to label table cells, to test the ability of the proposed text-classification-based method to handle real table data. The post-processing operation is then added to exclude misclassified cells and determine the table column labels. Table 5 compares the experimental results of table cell labeling with those of column label annotation after post-processing is added. For both algorithms, accuracy improves considerably once post-processing is added, which demonstrates that the majority voting in the post-processing step can effectively exclude misclassified cells.

Table 5. Accuracy of cell labeling and column label annotation
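The two quantities compared in this evaluation can be computed as in the sketch below. The exact accuracy definitions used in the experiment are not given in the description, so both functions are assumptions: cell accuracy is taken as the fraction of cells whose predicted label equals the manually annotated label of their column, and column accuracy as the fraction of columns whose majority-voted label equals that annotation.

```python
from collections import Counter
from typing import Sequence


def cell_accuracy(columns: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Fraction of cells whose predicted label matches the column's gold label."""
    pairs = [(pred, g) for col, g in zip(columns, gold) for pred in col]
    return sum(pred == g for pred, g in pairs) / len(pairs)


def column_accuracy(columns: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Fraction of columns whose majority-voted label matches the gold label.

    Every column is assumed to contain at least one predicted cell label.
    """
    voted = [Counter(col).most_common(1)[0][0] for col in columns]
    return sum(v == g for v, g in zip(voted, gold)) / len(gold)
```

In a toy case with the columns (a, a, b) and (b, b, b) and gold labels a and b, five of six cells are correct while both columns are correct after voting, which mirrors the direction of the improvement reported for post-processing.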

Fig. 6 shows the experimental results of table column label annotation using the BAYES-based and SVM-based methods. The annotation accuracy of both methods exceeds 90% on most categories, demonstrating the effectiveness of the proposed text-classification-based Chinese table column label recovery method.

The Baidu Baike used above may also be replaced by other online encyclopedia knowledge platforms, such as Wikipedia, Sogou Baike, and Hudong Baike.

The naive Bayes and support vector machine classification algorithms used above may be replaced by other classification algorithms, such as decision trees, logistic regression, k-nearest neighbors, and neural networks.

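Another embodiment of the present invention provides a system for recovering Chinese table column labels based on text classification, comprising the information detail page acquisition module, the related text extraction module, the attribute value classification module, and the column label determination module described above.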

For the specific implementation of each of the above modules, see the description of the method of the invention above.

The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the invention or replace it with equivalents without departing from the spirit and scope of the invention; the scope of protection of the invention shall be as defined in the claims.

Claims

1. A method for recovering Chinese table column labels based on text classification, characterized by comprising the following steps:

1) extracting an entity from each row of the table, searching for the extracted entity on an online encyclopedia knowledge platform, and obtaining the information detail page corresponding to the entity;

2) for each attribute of the entity, extracting the sentences containing the attribute value from the entity's information detail page to form the related text of the attribute value;

3) inputting the related text of the attribute value into a text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value;

4) for each attribute column of the table, determining the column label of the attribute column by majority voting over the categories of the cells in that column.

2. The method according to claim 1, characterized in that the online encyclopedia knowledge platform of step 1) includes one or more of the following: Baidu Baike, Wikipedia, Sogou Baike, Hudong Baike.

3. The method according to claim 1, characterized in that, in step 2), if the attribute value is a sentence, the sentence is segmented into words and stop words are removed, converting the sentence into a word set; the sentences containing words from that word set are then retrieved to form the related text.

4. The method according to claim 1, characterized in that the training data used to train the text classifier of step 3) is obtained as follows: "attribute name - attribute value" pairs are extracted from a semi-structured infobox, the body text of the entry containing the infobox is split into sentences, and back-labeling is performed with the attribute value as the keyword, extracting the sentences that contain the keyword to form the training corpus.

5. The method according to claim 4, characterized in that two heuristic rules are used during back-labeling:

a) if multiple sentences contain the keyword, these sentences together form the related text of the keyword, which, together with the attribute name, constitutes one training sample;

b) if no sentence contains the complete keyword, the keyword is segmented into words, the words that conflict with other attribute values are removed, and the remaining words are used as keywords to back-label sentences, extracting the sentences that contain one or more of these keywords.

6. The method according to claim 1 or 5, characterized in that, in step 3), before the text classifier is trained, text preprocessing and text vectorization are applied to the training corpus, and the classification model is then trained.

7. The method according to claim 6, characterized in that the text preprocessing includes word segmentation and stop word removal; the text vectorization represents text features using the vector space model and uses feature selection to pick the features that contribute most to text classification to represent a text, thereby reducing the number of features and the vector dimension.

8. The method according to claim 7, characterized in that feature selection is performed using the chi-square test, and feature weights are then calculated using the TF-IDF algorithm, a feature weight measuring the importance or discriminative power of a feature term for a text.

9. The method according to claim 6, characterized in that the classification model is trained with the constructed feature vectors as input, using the naive Bayes algorithm and the support vector machine algorithm; during training, k-fold cross-validation combined with grid search is used to tune the model parameters, and the classification model is then trained with the optimized parameters.

10. A system for recovering Chinese table column labels based on text classification, characterized by comprising:

an information detail page acquisition module, responsible for extracting an entity from each row of the table, searching for the extracted entity on an online encyclopedia knowledge platform, and obtaining the information detail page corresponding to the entity;

a related text extraction module, responsible for extracting, for each attribute of the entity, the sentences containing the attribute value from the entity's information detail page to form the related text of the attribute value;

an attribute value classification module, responsible for inputting the related text of the attribute value into the text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value;

a column label determination module, responsible for determining, for each attribute column of the table, the column label of the attribute column by majority voting over the categories of the cells in that column.