CN109710725A - Chinese table column label recovery method and system based on text classification

Publication number
CN109710725A
CN109710725A CN201811524302.3A CN201811524302A CN109710725A CN 109710725 A CN109710725 A CN 109710725A CN 201811524302 A CN201811524302 A CN 201811524302A CN 109710725 A CN109710725 A CN 109710725A
Authority
CN
China
Prior art keywords
text
attribute
column
entity
attribute value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811524302.3A
Other languages
Chinese (zh)
Inventor
曹聪
谢洁
刘燕兵
曹亚男
谭建龙
郭莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201811524302.3A
Publication of CN109710725A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a Chinese table column label recovery method and system based on text classification. The method comprises the following steps: 1) extracting an entity from each row of a table, searching for the extracted entity on an online encyclopedia platform, and obtaining the detail page corresponding to the entity; 2) for each attribute of the entity, extracting the sentences containing the attribute value from the entity's detail page to form the related text of the attribute value; 3) feeding the related text of the attribute value into a text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value; 4) for each attribute column of the table, determining the column label by majority voting over the categories of the cells in the column. The invention can effectively recover column labels for web tables; tables with recovered column labels can be used to build and extend Chinese knowledge graphs, and also for applications such as data extraction and table search.

Description

Chinese table column label recovery method and system based on text classification
Technical field
The invention belongs to the field of software technology, specifically knowledge acquisition from web tables. It relates to methods for recovering the semantics of web tables, and in particular to a Chinese table column label recovery method and system based on text classification.
Background
The hundreds of millions of tables on the Internet have good structure and latent semantics, and are easier to analyze and understand than unstructured text. Knowledge acquisition from web tables has therefore become a research hotspot in recent years, and table data has been used in research on knowledge base extension, table search, table merging, and so on.
A table usually has one entity column containing a group of entities; the other columns are attribute columns that describe attributes of the entities. Each row consists of an entity and its attribute values, and the cells of a given column hold similar content. However, web tables follow no unified specification, and many of them lack key semantic information such as an explicit table name, column names, and the relationships between columns, so a computer cannot acquire knowledge from them directly. How to recover table semantics has thus become an important research problem in table-based knowledge acquisition. Web table semantic recovery mainly covers three research topics: entity column detection, column label recovery, and determination of the relationships between columns. The present invention addresses column label recovery for Chinese web tables.
At present there is very little research on column label recovery for Chinese tables. For English tables, most existing algorithms rely on large-scale knowledge bases (for example, YAGO, DBpedia, Probase) or on databases crawled from the Web (for example, isA databases). Candidate column labels are obtained by mapping the cell contents of a column to concepts in the knowledge base (or database), and an algorithm then selects the most suitable label for the column. However, many existing Chinese knowledge bases are either for internal use only or contain too little knowledge to support Chinese table column label recovery. Moreover, existing techniques for English tables annotate tables with facts already present in a knowledge base, so they have difficulty discovering new, unknown knowledge; since the knowledge in any knowledge base is limited, this is a significant limitation. In addition, cells in Chinese web tables follow no unified format and may contain whole sentences, so existing techniques cannot effectively solve the Chinese table column label recovery problem.
Summary of the invention
To address the above problems, the present invention proposes a Chinese table column label recovery method and system based on text classification, which treats column label recovery as a text classification problem and solves it with machine learning.
The basic idea is to process the table column by column: for each column, find the label that the cell contents belong to, and then annotate the column with that label. Because the cells within a column are similar, the column label recovery problem can be converted into the problem of determining the labels of the cells in the column.
The invention restricts its scope to Chinese web tables without split rows or columns, so a table can be regarded as an m × n two-dimensional array in which each element is a word or a sentence. Given a table T, the proposed method determines the category of the content of each cell in a column and then determines the column label from the cell categories.
The technical solution adopted by the invention is as follows:
A Chinese table column label recovery method based on text classification, comprising the following steps:
1) extracting an entity from each row of the table, searching for the extracted entity on an online encyclopedia platform, and obtaining the detail page corresponding to the entity;
2) for each attribute of the entity, extracting the sentences containing the attribute value from the entity's detail page to form the related text of the attribute value;
3) feeding the related text of the attribute value into a text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value;
4) for each attribute column of the table, determining the column label of the attribute column by majority voting over the categories of the cells in the column.
Further, the online encyclopedia platform in step 1) includes one or more of the following: Baidu Baike, Wikipedia, Sogou Baike, Hudong Baike.
Further, in step 2), if the attribute value is a sentence, the sentence is segmented into words and stop words are removed, converting the sentence into a word set; the sentences containing words from the word set are then retrieved to form the related text.
Further, the training data for the text classifier in step 3) is obtained as follows: "attribute name - attribute value" pairs are extracted from the semi-structured infobox, the text of the entry containing the infobox is split into sentences, and, using each attribute value as a keyword, back-labeling is performed by extracting the sentences that contain the keyword to form the training corpus.
Further, two heuristic rules are used during back-labeling:
a) if several sentences contain the keyword, these sentences together form the related text of the keyword, which together with the attribute name forms one training sample;
b) if no sentence contains the complete keyword, the keyword is segmented into words, the words that conflict with other attribute values are removed, and the remaining words are used as keywords for back-labeling, extracting the sentences that contain one or more of them.
Further, before the text classifier is trained in step 3), text preprocessing and text vectorization are applied to the training corpus, and the classification model is then trained.
Further, the text preprocessing includes word segmentation and stop word removal; the text vectorization represents text features with a vector space model and uses feature selection to keep the features that contribute most to classification, thereby reducing the number of features and the vector dimension.
Further, feature selection is performed with the chi-square test, and feature weights are then computed with the TF-IDF algorithm; a feature weight measures the importance or discriminative power of a feature term for a text.
Further, the classification model is trained with the constructed feature vectors as input, using the naive Bayes algorithm and the support vector machine algorithm; during training, model parameters are tuned by k-fold cross-validation combined with grid search, and the classification model is then trained with the tuned parameters.
A Chinese table column label recovery system based on text classification, comprising:
a detail page acquisition module, responsible for extracting an entity from each row of the table, searching for the extracted entity on an online encyclopedia platform, and obtaining the detail page corresponding to the entity;
a related text extraction module, responsible for extracting, for each attribute of the entity, the sentences containing the attribute value from the entity's detail page to form the related text of the attribute value;
an attribute value classification module, responsible for feeding the related text of the attribute value into a text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value;
a column label determination module, responsible for determining, for each attribute column of the table, the column label of the attribute column by majority voting over the categories of the cells in the column.
The key points of the invention include:
1. The rich information available on online encyclopedia platforms such as Baidu Baike is used to supplement sparse table content: related text is obtained for each table cell, and column labels are determined by text classification.
2. The "attribute name - attribute value" pairs in the semi-structured infoboxes of online encyclopedia platforms such as Baidu Baike are used to back-label unstructured text, yielding a large amount of training data.
3. Majority voting takes the categories of all cells in a column into account when determining the column label.
The Chinese table column label recovery method based on text classification of the invention can effectively recover column labels for web tables; tables with recovered column labels can be used to build and extend Chinese knowledge graphs, and for applications such as data extraction and table search. The technical advantages of the invention mainly include the following:
1. Table content can be supplemented with the rich information on online encyclopedia platforms such as Baidu Baike, which alleviates the sparsity of table context information.
2. Table cells containing long sentences can be handled: by processing the long sentence, such cells can still be classified. This means the proposed method can acquire a large amount of descriptive knowledge (for example, representative works or main achievements) from web tables to extend knowledge bases.
3. The method does not depend on an existing knowledge base, so it can discover new, unknown knowledge, which can in turn be used to supplement knowledge bases.
Brief description of the drawings
Fig. 1 is a flowchart of table column label recovery.
Fig. 2 is a flowchart of classifier training.
Fig. 3 is an example of data set construction.
Fig. 4 shows table annotation results.
Fig. 5 shows classifier accuracy.
Fig. 6 shows table column label annotation results.
Detailed description of the embodiments
To make the above objects, features, and advantages of the invention clearer, the invention is described in further detail below through specific embodiments and the drawings.
The invention provides a table column label recovery method that treats column label recovery as a text classification problem and supplements sparse table content with the rich information on online encyclopedia platforms such as Baidu Baike. Because the invention addresses column label recovery and not entity column identification for web tables, the entity column is assumed to be known and to be the first column of the table.
The workflow of the text-classification-based table column label recovery method of this embodiment is shown in Fig. 1 and comprises the following steps:
1) For each row of the table, extract the entity (assumed to be in the first column) and search for it on Baidu Baike to obtain the detail page corresponding to the entity.
2) For each attribute of the entity, extract the sentences containing the attribute value from the entity's detail page to form the related text of the attribute value. If the attribute value is a sentence, segment it into words and remove stop words to convert it into a word set, then retrieve the sentences containing words from the word set to form the related text of the sentence.
3) Feed the related text of the attribute value into the text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value.
4) For each attribute column of the table, determine the column label by majority voting over the categories of the cells in the column.
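The four steps above can be sketched as a single function. This is only a minimal illustration under stated assumptions: `fetch_sentences` (returns the sentences of an entity's encyclopedia page) and `classify` (maps related text to a category) are hypothetical callables supplied by the caller, and neither name comes from the patent.

```python
from collections import Counter
from typing import Callable, List, Sequence


def recover_column_labels(
    table: Sequence[Sequence[str]],
    fetch_sentences: Callable[[str], List[str]],  # hypothetical page lookup
    classify: Callable[[str], str],               # hypothetical text classifier
) -> List[str]:
    """Return one predicted label per attribute column (all columns but the first)."""
    n_attr_cols = len(table[0]) - 1
    votes = [Counter() for _ in range(n_attr_cols)]
    for row in table:
        entity, attributes = row[0], row[1:]      # entity column assumed to be first
        sentences = fetch_sentences(entity)       # step 1: entity detail page
        for j, value in enumerate(attributes):
            # Step 2: related text = sentences that mention the attribute value.
            related = "".join(s for s in sentences if value in s)
            if not related:
                continue                          # no evidence: the cell casts no vote
            votes[j][classify(related)] += 1      # step 3: category of the cell
    # Step 4: the most frequent cell category becomes the column label.
    return [v.most_common(1)[0][0] if v else "" for v in votes]
```

Columns for which no cell could be matched against any page receive an empty label here; how such columns should be treated is not specified in the source.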
The whole workflow can be divided into three parts: preprocessing, table cell labeling, and post-processing. Each is described in detail below.
1) Preprocessing: the attribute cells of the table are enriched with information from Baidu Baike. Each row $R_i$ of the table can be expressed as $(e, p_1, p_2, \ldots, p_m)$, where $e$ is the entity and $p_1, p_2, \ldots, p_m$ are the attributes of $e$. For each row, Baidu Baike is searched for the entity $e$, the text of the corresponding page is obtained, and the text is split into sentences. Each attribute $p_i$ is then used as a keyword to find the sentences containing it, which form the related text $RT(p_i)$. If the attribute value is a sentence, it is segmented into words, the words that conflict with other attribute cells are removed, and the sentences containing one or more of the remaining words form the related text. Chinese word segmentation is then applied to the related text $RT(p_i)$ of each attribute $p_i$, yielding a set of words.
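The construction of the related text $RT(p_i)$ described above can be sketched as follows. The sketch makes several assumptions that are not in the source: sentences are split on a small set of punctuation marks, the word segmenter is passed in as a hypothetical `tokenize` callable (a real system might wrap a Chinese segmenter), and a word is treated as "conflicting" when it also occurs inside another attribute value of the same row.

```python
import re
from typing import Callable, Iterable, List, Optional


def split_sentences(text: str) -> List[str]:
    """Split after Chinese or ASCII sentence-ending punctuation (Python 3.7+)."""
    parts = re.split(r"(?<=[。！？!?；;])", text)
    return [p.strip() for p in parts if p.strip()]


def related_text(
    page_text: str,
    value: str,
    other_values: Iterable[str] = (),
    tokenize: Optional[Callable[[str], Iterable[str]]] = None,  # hypothetical segmenter
    stop_words: frozenset = frozenset(),
) -> List[str]:
    """Return the sentences of the page that mention the attribute value."""
    sentences = split_sentences(page_text)
    hits = [s for s in sentences if value in s]
    if hits or tokenize is None:
        return hits
    # Fallback for long values: match on individual words instead,
    # dropping stop words and words that also occur in other attribute values.
    others = list(other_values)
    words = {
        w for w in tokenize(value)
        if w.strip() and w not in stop_words and not any(w in o for o in others)
    }
    return [s for s in sentences if any(w in s for w in words)]
```

Returning a list of sentences rather than one concatenated string keeps the later segmentation step independent of how the sentences are joined.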
2) Table cell labeling: the word set of each attribute $p_i$ obtained by preprocessing is fed into the text classifier to obtain the category of the attribute. The key questions for the proposed method are how to obtain a training data set and how to train the text classifier.
A. Training data set: the invention extracts "attribute name - attribute value" pairs from the semi-structured infoboxes of Baidu Baike, splits the text of the Baidu Baike entry containing the infobox into sentences, and, using each attribute value as a keyword, performs back-labeling by extracting the sentences that contain the keyword to form the training corpus. Two heuristic rules are used during back-labeling:
A1) If several sentences contain the keyword, these sentences together form the related text of the keyword, which together with the attribute name forms one training sample.
A2) If no sentence contains the complete keyword, the keyword is segmented into words, the words that conflict with other attribute values are removed, and the remaining words are used as keywords for back-labeling, extracting the sentences that contain one or more of them.
Of the back-labeled corpus, 80% is used as the training set and 20% as the test set.
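Rule A1 and the 80/20 split above can be sketched as follows. This is an illustration only: the function names, the shuffling with a fixed seed, and the choice to join sentences by plain concatenation are assumptions, and rule A2 is omitted because it needs a word segmenter.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (related text, attribute name used as the class label)


def build_samples(infobox: Dict[str, str], sentences: List[str]) -> List[Sample]:
    """Back-label the entry text with the infobox values (rule A1 only)."""
    samples: List[Sample] = []
    for name, value in infobox.items():
        hits = [s for s in sentences if value in s]
        if hits:
            # All sentences containing the value form one training sample.
            samples.append(("".join(hits), name))
    return samples


def split_corpus(
    samples: List[Sample], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle and split the back-labeled corpus into training and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Attribute values that never occur in the entry text simply produce no sample in this sketch.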
B. Classifier training: before the text classifier is trained, the training corpus must be preprocessed and vectorized; the classifier is then trained on the training data and its performance is evaluated. The process is shown in Fig. 2, and each step is described in detail below.
B1) Text preprocessing: text preprocessing includes word segmentation, stop word removal, and similar operations. Word segmentation is indispensable: unlike English, Chinese text does not separate words with spaces, so Chinese text must first be split into individual words by a segmentation technique, turning the text into a set of words that can later be used to represent it. The invention uses the jieba segmenter to segment the corpus. Words that carry almost no information and only reflect the grammatical structure of a sentence, such as particles and demonstratives (for example, the words for "this" and "that"), are then removed.
B2) Text vectorization: for a computer to process text, the segmented words must also be mapped through a dictionary into a mathematical representation. In the preprocessed text each word is treated as a feature, and the invention represents text with the vector space model (VSM). The core idea of VSM is that a document $D$ composed of a series of features can be represented by its feature terms and their weights, i.e. $D = D(t_1, \omega_1; t_2, \omega_2; \ldots; t_n, \omega_n)$, abbreviated as $D = D(\omega_1, \omega_2, \ldots, \omega_n)$, where $\omega_i$ is the weight of feature $t_i$. A document can therefore be expressed as an $n$-dimensional vector. Because a large number of distinct words makes the vector dimension too high and vector computation too expensive, feature selection is used to keep the features that contribute most to classification, which reduces the number of features and the vector dimension and improves the generalization ability of the model. The invention performs feature selection with the chi-square test ($\chi^2$). The chi-square statistic measures the degree of correlation between a feature term $t_i$ and a category $C_j$, under the assumption that the relationship between $t_i$ and $C_j$ follows a $\chi^2$ distribution. The higher the $\chi^2$ statistic (CHI) of a feature term for a category, the stronger the correlation between the term and the category and the more information about the category the term carries; the lower the statistic, the less information it carries. The TF-IDF algorithm is then used to compute feature weights, which measure the importance or discriminative power of a feature term for a text. The resulting vector representation of the text serves as the input for classifier training and testing.
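The feature selection and weighting described in B2 can be sketched in plain Python. The source does not give its exact formulas, so this sketch assumes the common 2×2 contingency form of the chi-square statistic, $\chi^2(t, c) = N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D))$, and one common TF-IDF variant (relative term frequency times $\log(N / df)$); library implementations often add smoothing and normalization. All function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def chi_square(term: str, label: str,
               docs: Sequence[Sequence[str]], labels: Sequence[str]) -> float:
    """Chi-square statistic of a term/category pair over segmented documents."""
    a = b = c = d = 0
    for words, y in zip(docs, labels):
        has_term, in_class = term in words, y == label
        if has_term and in_class:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_class:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs: Sequence[Sequence[str]], labels: Sequence[str], k: int) -> Set[str]:
    """Keep the k terms with the highest chi-square score for any category."""
    vocab = {w for words in docs for w in words}
    score = {t: max(chi_square(t, c, docs, labels) for c in set(labels)) for t in vocab}
    return set(sorted(score, key=lambda t: (-score[t], t))[:k])


def tf_idf(docs: Sequence[Sequence[str]], vocabulary: Set[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors restricted to the selected vocabulary."""
    n = len(docs)
    df = Counter(t for words in docs for t in set(words) if t in vocabulary)
    vectors = []
    for words in docs:
        tf = Counter(w for w in words if w in vocabulary)
        vectors.append({t: (tf[t] / len(words)) * math.log(n / df[t]) for t in tf})
    return vectors
```

Each document becomes a sparse dictionary of weights $\omega_i$ keyed by feature term $t_i$, which corresponds to the vector $D(\omega_1, \ldots, \omega_n)$ with zero entries left out.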
B3) Model training: the invention uses the constructed feature vectors as input and trains classification models with the naive Bayes algorithm (BAYES) and the support vector machine algorithm (SVM); the trained classifiers are then evaluated so that the most suitable model can be selected for column label recovery. During training, model parameters are tuned by k-fold cross-validation combined with grid search. The classification model is then trained with the tuned parameters and saved for use in model testing and table cell labeling.
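Parameter tuning by k-fold cross-validation combined with grid search can be illustrated with a small self-contained sketch. To stay dependency-free it uses a toy multinomial naive Bayes over word counts as a stand-in for the classifiers of the source, tunes only the smoothing parameter alpha, and forms folds by taking every k-th sample; all of these choices and all names are assumptions, not the patented implementation.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def train_nb(docs: Sequence[Sequence[str]], labels: Sequence[str], alpha: float) -> dict:
    """Toy multinomial naive Bayes with additive smoothing (alpha must be > 0)."""
    vocab = {w for d in docs for w in d}
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    model = {}
    for y, n_y in Counter(labels).items():
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_like = {w: math.log((word_counts[y][w] + alpha) / total) for w in vocab}
        model[y] = (math.log(n_y / len(labels)), log_like)
    return model


def predict_nb(model: dict, doc: Sequence[str]) -> str:
    """Pick the class with the highest log posterior; unseen words are ignored."""
    return max(model, key=lambda y: model[y][0] + sum(model[y][1].get(w, 0.0) for w in doc))


def cross_val_accuracy(docs, labels, alpha: float, k: int) -> float:
    """Mean accuracy over k folds; fold f holds samples f, f+k, f+2k, ..."""
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [y for i, y in enumerate(labels) if i not in test_idx]
        model = train_nb(train_docs, train_labels, alpha)
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)


def grid_search(docs, labels, alphas: List[float], k: int = 5) -> Tuple[float, Dict[float, float]]:
    """Score every candidate alpha by cross-validation and return the best one."""
    scores = {a: cross_val_accuracy(docs, labels, a, k) for a in alphas}
    return max(scores, key=scores.get), scores
```

After the search, the final model would be retrained on the full training set with the selected value, as the paragraph above describes. For an SVM the grid would range over the kernel, C, and γ instead of alpha.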
C. Cell labeling: the word set of each attribute $p_i$ obtained by preprocessing is fed into the trained text classifier to obtain the category of the attribute, which is taken as the category of the cell containing the attribute.
3) Post-processing: because the cells of a column have similar content, the invention uses the consistency of the cells within a column to exclude wrong labels. For the j-th column of the table, the invention considers the labels of all cells in the column and determines the column label by majority voting: if most cells in the column are labeled with category $t^{(k)}$, the column label is set to $t^{(k)}$.
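The voting step can be sketched as below. As an assumption for illustration, the sketch takes the most frequent label (a plurality) as the column label, skips cells that received no label, and also reports which cells disagree with the winner so that they can be treated as misclassified; the function name is hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def vote_column_label(cell_labels: List[Optional[str]]) -> Tuple[Optional[str], List[int]]:
    """Return the most frequent cell label and the indices of dissenting cells."""
    counts = Counter(label for label in cell_labels if label is not None)
    if not counts:
        return None, []                      # no cell in the column was labeled
    # On a tie, Counter.most_common keeps the label that was seen first.
    winner, _ = counts.most_common(1)[0]
    outliers = [i for i, label in enumerate(cell_labels)
                if label is not None and label != winner]
    return winner, outliers
```

For a column whose cells were labeled `["nationality", "birthplace", "nationality", "nationality"]`, the sketch returns `"nationality"` as the column label and flags cell 1 as the dissenting cell.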
An example is given below to further illustrate a concrete implementation of the Chinese table column label recovery method based on text classification.
1) Training data set construction: taking Fig. 3 as an example, "attribute name - attribute value" pairs such as "location - Beijing, China" and "collection highlights - Riverside Scene at Qingming Festival" are extracted from the infobox. Using each attribute value as a keyword, the free text of the page containing the infobox is back-labeled to obtain the sentences containing the keyword (the underlined sentences in the figure). If the free text does not contain the complete attribute value, the value is segmented into words, the words that conflict with other attribute values are removed, and the remaining words are used as keywords for back-labeling. Part of the back-labeling results are shown in Table 1.
Table 1. Partial back-labeling results
2) Classifier training: the related text in the training data set is segmented with the jieba segmentation tool, stop words and low-frequency words are removed, feature selection is performed with the chi-square test ($\chi^2$), and the text is vectorized with the TF-IDF method. With the vectorized data as input, classification models are trained with the naive Bayes algorithm (BAYES) and the support vector machine algorithm (SVM). For the naive Bayes model, the smoothing parameter alpha must be set during training. For the support vector machine model, three parameters must be set: the kernel parameter, which selects the kernel function; the C parameter, which is the penalty coefficient; and the γ parameter, which applies when rbf is chosen as the kernel function. The parameters are tuned by grid search combined with cross-validation, and the optimal parameters are shown in Table 2:
Table 2. Optimal model parameter settings
3) Cell labeling: for each row $(e, p_1, p_2, \ldots, p_m)$ of an unlabeled table, Baidu Baike is searched for the row's entity $e$ and the text of the corresponding page is obtained. The sentences in the text that contain the attribute value $p_i$ are then collected to form the related text, and the trained classifier predicts on the related text to give the category label of $p_i$. For each column of the table, the column label is determined by majority voting. Fig. 4 shows the annotation results for a web table: the bold black text in each cell is the cell's label, and the bold black text in the last row of the table is the column label determined by majority voting over the cell labels.
Based on the above design, the positive effects of the proposed method are illustrated here. Experiments use data of the person category, with five common attribute types selected: date of birth, nationality, birthplace, occupation, and graduated school. These attribute types are the target attributes, and a large amount of labeled data for them is obtained from Baidu Baike; 80% of the data is used as training data TR and 20% as test data TE. Table 3 lists the statistics for each category in the data set.
Table 3. Training data set and test data set
Attribute type      Training data set   Test data set
Date of birth       13620               3431
Nationality         12210               3000
Birthplace          13062               3317
Occupation          12302               3005
Graduated school    8048                2018
Total               59242               14771
Classification models are trained with the naive Bayes algorithm (BAYES) and the support vector machine algorithm (SVM), with the optimal parameter settings shown in Table 2: the smoothing parameter of BAYES is alpha = 1, and SVM uses the RBF kernel with C = 0.5 and γ = 2. Fig. 5 shows the accuracy of the BAYES and SVM classifiers trained on the above data. The experimental results show that the classifier trained with SVM is more accurate than the one trained with BAYES on most attribute types, and on the "nationality" and "birthplace" attributes the accuracy gain exceeds 19%.
Tables containing person entities are crawled from web pages with a web crawler and then screened, and 104 tables are selected. Each row of a table contains one entity and several attribute values. For the five target attribute types of the experiment, 1807 instances are obtained in total, as shown in Table 4. All tables are manually annotated for experimental evaluation.
Table 4. Statistics of table attribute instances
Attribute type      Number of entities
Date of birth       126
Nationality         833
Birthplace          353
Occupation          346
Graduated school    149
Total               1807
The trained classifiers are first used to label table cells, in order to test the ability of the proposed text-classification-based method to handle real table data. The post-processing operation is then added to exclude misclassified cells and determine the column labels. Table 5 compares the results of cell labeling with the results of column labeling after post-processing is added. For both algorithms, accuracy improves considerably once post-processing is added, which shows that the majority voting in the post-processing step effectively excludes misclassified cells.
Table 5. Accuracy of cell labeling and column label annotation
Fig. 6 shows the results of table column label annotation using the BAYES-based and SVM-based methods. The annotation accuracy of both methods is above 90% on most categories, which demonstrates the effectiveness of the proposed Chinese table column label recovery method based on text classification.
Baidu Baike as used above can also be replaced with other online encyclopedia platforms, such as Wikipedia, Sogou Baike, or Hudong Baike.
The naive Bayes and support vector machine classification algorithms used above can be replaced with other classification algorithms, such as decision trees, logistic regression, k-nearest neighbors, or neural networks.
Another embodiment of the invention provides a Chinese table column label recovery system based on text classification, comprising:
a detail page acquisition module, responsible for extracting an entity from each row of the table, searching for the extracted entity on an online encyclopedia platform, and obtaining the detail page corresponding to the entity;
a related text extraction module, responsible for extracting, for each attribute of the entity, the sentences containing the attribute value from the entity's detail page to form the related text of the attribute value;
an attribute value classification module, responsible for feeding the related text of the attribute value into a text classifier to obtain the category of the attribute value, which is taken as the category of the cell containing the attribute value;
a column label determination module, responsible for determining, for each attribute column of the table, the column label of the attribute column by majority voting over the categories of the cells in the column.
For the specific implementation of each of the above modules, see the description of the method of the invention above.
The above embodiments merely illustrate the technical solution of the invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the invention; the scope of protection of the invention is defined by the claims.

Claims (10)

1. a kind of Chinese table column label restoration methods based on text classification, which comprises the following steps:
1) entity is extracted from every a line in table, the entity of extraction is searched in network encyclopaedic knowledge platform, obtain entity The corresponding message details page;
2) to each attribute of entity, the sentence comprising attribute value is extracted in the message details page of entity, forms attribute The related text of value;
3) by the related text input text classifier of attribute value, classification belonging to attribute value is obtained, as where attribute value The classification of cell;
4) attribute column of table is determined according to classification belonging to each unit lattice in attribute column using the rule of majority ballot The column label of the attribute column.
2. the method according to claim 1, wherein step 1) the network encyclopaedic knowledge platform includes in following It is one or more: Baidupedia, wikipedia, search dog encyclopaedia, interaction encyclopaedia.
3. the method according to claim 1, wherein if attribute value is sentence, being carried out to sentence in step 2) It segments, stop words is gone to handle, sentence is converted into word set, then obtain the sentence comprising the word in the word set again, composition is related Text.
4. The method according to claim 1, characterized in that the training data used to train the text classifier in step 3) are obtained as follows: "attribute name - attribute value" pairs are extracted from semi-structured infoboxes; the text of the entry containing the infobox is then split into sentences; with the attribute value as the keyword, back-labeling is performed and the sentences containing the keyword are extracted to form the training corpus.
5. The method according to claim 4, characterized in that two heuristic rules are used in the back-labeling process:
a) if multiple sentences contain the keyword, these sentences together form the related text of the keyword, which, together with the attribute name, forms one training sample;
b) if no sentence contains the complete keyword, word segmentation is performed on the keyword and the words that conflict with other attribute values are removed; the remaining words are used as keywords for back-labeling the sentences, and the sentences containing one or more of these keywords are extracted.
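The two back-labeling rules can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: `segment` is a hypothetical stand-in for a Chinese word segmenter, and a word is treated as "conflicting" when it also occurs inside the value of another attribute, which is only one possible reading of rule b).

```python
import re


def split_sentences(text):
    """Split entry text after sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[。！？!?])", text) if s.strip()]


def back_label(attr_name, attr_value, entry_text, other_values, segment):
    """Build one (related_text, attr_name) training sample, or None.

    other_values: values of the entry's other attributes
    segment:      callable str -> list of words (hypothetical segmenter)
    """
    sentences = split_sentences(entry_text)
    # Rule a: every sentence containing the full keyword joins the related text.
    hits = [s for s in sentences if attr_value in s]
    if not hits:
        # Rule b: fall back to the keyword's words, dropping any word that
        # also appears in another attribute's value (assumed conflict test).
        words = [w for w in segment(attr_value)
                 if not any(w in v for v in other_values)]
        hits = [s for s in sentences if any(w in s for w in words)]
    return ("".join(hits), attr_name) if hits else None
```

Returning `None` when neither rule finds a sentence simply drops the pair from the training corpus in this sketch.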
6. The method according to claim 1 or 5, characterized in that, in step 3), before the text classifier is trained, text preprocessing and text vectorization are performed on the training corpus, and the classification model is then trained.
7. The method according to claim 6, characterized in that the text preprocessing comprises word segmentation and stop-word removal; the text vectorization represents the features of a text using a vector space model, and uses feature selection to select the features that contribute most to text classification to represent the text, thereby reducing the number of features and the vector dimensionality.
8. The method according to claim 7, characterized in that feature selection is performed using the chi-square test, and feature weights are then calculated using the TF-IDF algorithm, the feature weight serving as a measure of the importance of a feature term to a text or of its discriminative power.
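The feature selection and weighting described in claims 7 and 8 can be sketched in plain Python. The sketch assumes documents are already segmented into token lists, scores each term by its largest chi-square value over all categories, and uses the plain `log(N / df)` form of IDF; the patent text does not fix these details, so they are illustrative choices, as are the function names.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, category):
    """Chi-square statistic of a term/category pair (2x2 contingency table)."""
    n = len(docs)
    a = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    b = sum(1 for d, y in zip(docs, labels) if term in d and y != category)
    c = sum(1 for d, y in zip(docs, labels) if term not in d and y == category)
    d_ = n - a - b - c  # documents with neither the term nor the category
    denom = (a + c) * (b + d_) * (a + b) * (c + d_)
    return n * (a * d_ - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over any category."""
    vocab = sorted({t for d in docs for t in d})
    cats = set(labels)
    score = {t: max(chi_square(docs, labels, t, c) for c in cats) for t in vocab}
    return sorted(vocab, key=lambda t: -score[t])[:k]


def tfidf(doc, docs, features):
    """TF-IDF weights of one tokenized document over the selected features."""
    tf = Counter(doc)
    n = len(docs)
    weights = []
    for t in features:
        df = sum(1 for d in docs if t in d)
        idf = math.log(n / df) if df else 0.0
        weights.append(tf[t] / len(doc) * idf if doc else 0.0)
    return weights
```

Selecting features first and weighting afterwards follows the order stated in claim 8, so the TF-IDF vector has exactly `k` dimensions.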
9. The method according to claim 6, characterized in that the classification model is trained with the constructed feature vectors as input, using the naive Bayes algorithm and the support vector machine algorithm; during training, k-fold cross-validation combined with grid search is used to tune the model parameters, and the classification model is then trained with the optimized parameters.
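The tuning loop of claim 9 can be illustrated with a small, dependency-free sketch. It covers only a multinomial naive Bayes classifier on token lists and searches a one-dimensional grid over the smoothing parameter `alpha`; the support vector machine and the actual parameter grid of the patent are not reproduced here. All names are hypothetical, `alpha` must be positive, and `k` must not exceed the number of samples.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha):
    """Multinomial naive Bayes with additive smoothing alpha (> 0)."""
    prior = Counter(labels)
    counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        counts[y].update(d)
    vocab = {t for d in docs for t in d}
    return prior, counts, vocab, alpha


def predict_nb(model, doc):
    """Return the category with the highest log posterior."""
    prior, counts, vocab, alpha = model
    n = sum(prior.values())

    def log_post(y):
        total = sum(counts[y].values()) + alpha * len(vocab)
        # Words unseen in training are ignored.
        return math.log(prior[y] / n) + sum(
            math.log((counts[y][t] + alpha) / total) for t in doc if t in vocab)

    return max(sorted(prior), key=log_post)


def kfold_accuracy(docs, labels, alpha, k):
    """Mean accuracy over k folds; fold i holds out every k-th sample."""
    accs = []
    for i in range(k):
        test_idx = set(range(i, len(docs), k))
        train = [j for j in range(len(docs)) if j not in test_idx]
        model = train_nb([docs[j] for j in train],
                         [labels[j] for j in train], alpha)
        hits = sum(predict_nb(model, docs[j]) == labels[j] for j in test_idx)
        accs.append(hits / len(test_idx))
    return sum(accs) / k


def grid_search(docs, labels, alphas, k=3):
    """Pick the alpha with the best cross-validated accuracy, then refit."""
    best = max(alphas, key=lambda a: kfold_accuracy(docs, labels, a, k))
    return best, train_nb(docs, labels, best)
```

Refitting on the full training set after the search mirrors the last clause of the claim, where the final model is trained with the optimized parameters.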
10. A Chinese table column label restoration system based on text classification, characterized by comprising:
an information detail page acquisition module, responsible for extracting an entity from each row of the table, searching for the extracted entity on an online encyclopedia knowledge platform, and obtaining the information detail page corresponding to the entity;
a related text extraction module, responsible for extracting, for each attribute of the entity, the sentences containing the attribute value from the information detail page of the entity to form the related text of the attribute value;
an attribute value classification module, responsible for inputting the related text of the attribute value into a text classifier to obtain the category to which the attribute value belongs, which is taken as the category of the cell containing the attribute value;
a column label determination module, responsible for determining, for each attribute column of the table, the column label of the attribute column by majority voting over the categories of the cells in the attribute column.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811524302.3A CN109710725A (en) 2018-12-13 2018-12-13 A kind of Chinese table column label restoration methods and system based on text classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811524302.3A CN109710725A (en) 2018-12-13 2018-12-13 A kind of Chinese table column label restoration methods and system based on text classification

Publications (1)

Publication Number Publication Date
CN109710725A (en) 2019-05-03

Family

ID=66255787

Family Applications (1)

Application Number Priority Date Filing Date Title
CN201811524302.3A Pending CN109710725A (en) 2018-12-13 2018-12-13 A kind of Chinese table column label restoration methods and system based on text classification

Country Status (1)

Country Link
CN (1) CN109710725A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190503