CN109710725A - Chinese table column label recovery method and system based on text classification - Google Patents
- Publication number
- CN109710725A (CN201811524302.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- attribute
- column
- entity
- attribute value
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a Chinese table column label recovery method and system based on text classification. The method comprises the following steps: 1) extract the entity from each row of the table, search for the extracted entity on an online encyclopedia platform, and obtain the entity's information detail page; 2) for each attribute of the entity, extract the sentences containing the attribute value from the entity's information detail page to form the related text of the attribute value; 3) input the related text of the attribute value into a text classifier to obtain the class of the attribute value, which is taken as the class of the cell containing that value; 4) for each attribute column of the table, determine the column label by majority vote over the classes of the cells in the column. The invention can effectively recover column labels for web tables; tables with recovered column labels can be used to build and extend Chinese knowledge graphs, as well as in applications such as data extraction and table search.
Description
Technical field
The invention belongs to the field of software technology, specifically knowledge acquisition from web tables. It relates to methods for recovering the semantics of web tables, and in particular to a Chinese table column label recovery method and system based on text classification.
Background
The hundreds of millions of tables on the Internet have good structural features and latent semantic features, and are easier to analyze and understand than unstructured text. Knowledge acquisition from web tables has therefore become a research hotspot in recent years, and table data has been used in research on knowledge base extension, table search, table merging, and so on.
Typically, a table has one entity column containing a group of entities; the other columns are attribute columns that describe attributes of the entities. Each row of the table consists of an entity and its attribute values, and the cells in the same column have similar content. However, web tables follow no unified standard, and a large number of them lack key semantic information such as an explicit table name, column names, and the relationships between columns. As a result, computers cannot acquire knowledge from tables directly, and recovering table semantics has become an important research problem in table-based knowledge acquisition. Semantic recovery of web tables mainly covers three research topics: entity column detection, column label recovery, and determining the relationships between columns. The present invention addresses column label recovery for Chinese web tables.
At present there is very little research on column label recovery for Chinese tables. For English tables, most existing algorithms rely on large-scale knowledge bases (for example, YAGO, DBpedia, Probase) or on databases crawled from the Web (for example, an isA database). Candidate column labels are obtained by mapping the cell contents of a column to concepts in the knowledge base (or database), and an algorithm then selects the most suitable column label. However, many existing Chinese knowledge bases are either for internal use only or contain too little knowledge, so they cannot supply sufficient knowledge for Chinese column label recovery. Moreover, existing English column label recovery techniques annotate tables with facts already present in a knowledge base, so they have difficulty discovering new, unknown knowledge; since the knowledge in any knowledge base is limited, these techniques are significantly constrained. In addition, cells in Chinese web tables follow no unified format and may contain whole sentences, so existing techniques cannot effectively solve the Chinese table column label recovery problem.
Summary of the invention
In view of the above problems, the present invention proposes a Chinese table column label recovery method and system based on text classification, which treats column label recovery as a text classification task and solves it with machine learning.
The basic idea of column label recovery in the present invention is to work column by column: for each column, find the label to which the contents of its cells belong, and then annotate the column with that label. Because the cells within a column are similar, column label recovery can be reduced to determining the labels of the cells in the column.
The present invention restricts its scope to Chinese web tables without branching columns, so a table can be regarded as an m × n two-dimensional array in which each element is a word or a sentence. Given a table T, the proposed method determines the class of the content of each cell in a column and then determines the column label from the classes of the cells.
The technical solution adopted by the invention is as follows:
A Chinese table column label recovery method based on text classification, comprising the following steps:
1) extract the entity from each row of the table, search for the extracted entity on an online encyclopedia platform, and obtain the information detail page corresponding to the entity;
2) for each attribute of the entity, extract the sentences containing the attribute value from the information detail page of the entity to form the related text of the attribute value;
3) input the related text of the attribute value into a text classifier to obtain the class of the attribute value, which is the class of the cell containing the attribute value;
4) for each attribute column of the table, determine the column label of the attribute column by majority vote according to the classes of the cells in the attribute column.
Further, the online encyclopedia platform in step 1) includes one or more of the following: Baidu Baike, Wikipedia, Sogou Baike, Hudong Baike.
Further, in step 2), if the attribute value is a sentence, the sentence is segmented into words and stop words are removed, converting it into a word set; the sentences containing words from this set are then retrieved to form the related text.
Further, the training data used to train the text classifier in step 3) is obtained as follows: extract "attribute name - attribute value" pairs from the semi-structured infobox, split the text of the entry containing the infobox into sentences, and back-label the text using each attribute value as a keyword, extracting the sentences that contain the keyword to form the training corpus.
Further, two heuristic rules are used during back-labeling:
a) if several sentences contain the keyword, these sentences together form the related text of the keyword, which together with the attribute name constitutes one training sample;
b) if no sentence contains the complete keyword, the keyword is segmented into words, words that conflict with other attribute values are removed, and the remaining words are used as keywords for back-labeling, extracting the sentences that contain one or more of them.
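The two back-labeling rules can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the source: the function names `split_sentences` and `back_label` are hypothetical, the tokenizer is passed in as a parameter (the toy one below simply splits on "、"), a word is treated as "conflicting" when it also occurs inside another attribute's value, and the sample entry is invented for illustration.

```python
import re

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    return [s.strip() for s in re.findall(r"[^。！？!?]+[。！？!?]?", text) if s.strip()]

def back_label(infobox, article_text, tokenize):
    """Build (related_text, attribute_name) training samples from one entry."""
    sentences = split_sentences(article_text)
    samples = []
    for name, value in infobox.items():
        # Rule a: all sentences containing the full value form the related text.
        hits = [s for s in sentences if value in s]
        if not hits:
            # Rule b: fall back to the words of the value, dropping words that
            # also occur in another attribute's value (assumed meaning of "conflict").
            others = [v for n, v in infobox.items() if n != name]
            words = [w for w in tokenize(value) if not any(w in o for o in others)]
            hits = [s for s in sentences if any(w in s for w in words)]
        if hits:
            samples.append(("".join(hits), name))
    return samples

# Invented toy entry, for illustration only.
infobox = {"地点": "中国北京", "馆藏精品": "清明上河图、千里江山图"}
article = "该馆位于中国北京。馆内藏有清明上河图。该馆每日开放。"
samples = back_label(infobox, article, tokenize=lambda v: v.split("、"))
```

Here the first attribute is matched by rule a and the second by rule b, because no sentence contains the complete value of the second attribute.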
Further, before the text classifier is trained in step 3), the training corpus undergoes text preprocessing and text vectorization, after which the classification model is trained.
Further, text preprocessing includes word segmentation and stop word removal. Text vectorization represents the features of a text with the vector space model and uses feature selection to keep the features that contribute most to classification, thereby reducing the number of features and the vector dimension.
Further, feature selection is performed with the chi-square test, and feature weights are then computed with the TF-IDF algorithm; a feature weight measures how important or discriminative a feature term is for a text.
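As a rough sketch of chi-square feature selection followed by TF-IDF weighting, the following pure-Python code may help. The source does not give formulas, so this assumes the common 2 × 2 contingency form of the chi-square statistic and one common TF-IDF variant (relative term frequency times log(N / df)); the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def chi_square(docs, labels, term, cls):
    """Chi-square statistic between a term and a class (2 x 2 contingency table)."""
    a = b = c = d = 0
    for words, label in zip(docs, labels):
        has_term = term in words
        if label == cls:
            a += has_term        # in class, contains term
            c += not has_term    # in class, lacks term
        else:
            b += has_term        # other class, contains term
            d += not has_term    # other class, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over all classes."""
    vocab = sorted({w for words in docs for w in words})
    classes = sorted(set(labels))
    score = {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}
    return sorted(vocab, key=lambda t: -score[t])[:k]

def tfidf_vectors(docs, features):
    """Represent each document as a vector of TF-IDF weights over the features."""
    n = len(docs)
    df = Counter(t for words in docs for t in set(words))
    vectors = []
    for words in docs:
        tf = Counter(words)
        length = max(len(words), 1)
        vectors.append([(tf[t] / length) * math.log(n / df[t]) if df[t] else 0.0
                        for t in features])
    return vectors

# Invented, already segmented toy documents.
docs = [["born", "1990", "beijing"], ["born", "1985", "shanghai"],
        ["works", "as", "actor"], ["works", "as", "singer"]]
labels = ["birth", "birth", "job", "job"]
features = select_features(docs, labels, 3)
vectors = tfidf_vectors(docs, features)
```

Terms that occur in every document of one class and in none of the other get the highest chi-square score, so they are the ones retained as features.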
Further, the classification model is trained with the constructed feature vectors as input, using the Naive Bayes algorithm and the support vector machine algorithm. During training, k-fold cross-validation combined with grid search is used to tune the model parameters, and the classification model is then trained with the optimized parameters.
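The tuning procedure can be sketched as follows, under stated assumptions: a small multinomial Naive Bayes classifier with a smoothing parameter alpha stands in for the models (the support vector machine is omitted for brevity), folds are assigned by index modulo k, and the class name, function names, candidate values, and toy data are all hypothetical rather than taken from the source.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over word lists, with additive smoothing alpha > 0."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for words in docs for w in words}
        self.prior = Counter(labels)
        self.total = len(labels)
        self.counts = defaultdict(Counter)
        for words, label in zip(docs, labels):
            self.counts[label].update(words)
        return self

    def predict(self, words):
        def log_prob(c):
            denom = sum(self.counts[c].values()) + self.alpha * len(self.vocab)
            lp = math.log(self.prior[c] / self.total)
            for w in words:
                if w in self.vocab:  # words unseen in training are ignored
                    lp += math.log((self.counts[c][w] + self.alpha) / denom)
            return lp
        return max(self.classes, key=log_prob)

def k_fold_accuracy(docs, labels, alpha, k):
    """Average accuracy over k folds; sample i belongs to fold i % k."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(docs)) if i % k != fold]
        test = [i for i in range(len(docs)) if i % k == fold]
        model = NaiveBayes(alpha).fit([docs[i] for i in train],
                                      [labels[i] for i in train])
        correct += sum(model.predict(docs[i]) == labels[i] for i in test)
    return correct / len(docs)

def grid_search(docs, labels, alphas, k=3):
    """Score every candidate alpha by cross-validation and return the best one."""
    scores = {a: k_fold_accuracy(docs, labels, a, k) for a in alphas}
    return max(scores, key=scores.get), scores

# Invented toy corpus.
docs = [["born", "in", "1990"], ["works", "as", "actor"], ["born", "in", "1985"],
        ["works", "as", "singer"], ["born", "in", "1979"], ["works", "as", "writer"]]
labels = ["birth", "job", "birth", "job", "birth", "job"]
alphas = [0.1, 0.5, 1.0]
best, scores = grid_search(docs, labels, alphas)
model = NaiveBayes(best).fit(docs, labels)  # retrain on all data with the best alpha
```

The final line mirrors the last step of the paragraph above: once the parameter has been chosen by cross-validation, the model is retrained on the full training set.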
A Chinese table column label recovery system based on text classification, comprising:
an information detail page acquisition module, responsible for extracting the entity from each row of the table, searching for the extracted entity on an online encyclopedia platform, and obtaining the information detail page corresponding to the entity;
a related text extraction module, responsible for extracting, for each attribute of the entity, the sentences containing the attribute value from the information detail page of the entity to form the related text of the attribute value;
an attribute value classification module, responsible for inputting the related text of the attribute value into a text classifier to obtain the class of the attribute value, which is the class of the cell containing the attribute value;
a column label determination module, responsible for determining, for each attribute column of the table, the column label of the attribute column by majority vote according to the classes of the cells in the attribute column.
The key points of the invention include:
1. The rich information resources on online encyclopedia platforms such as Baidu Baike are used to supplement sparse table content: related text is obtained for each table cell, and the column label is determined by text classification.
2. The "attribute name - attribute value" pairs in the semi-structured infoboxes of online encyclopedia platforms such as Baidu Baike are used to back-label unstructured text, yielding a large amount of training data.
3. Majority voting is used to take the classes of all cells in a column into account when determining the column label.
The Chinese table column label recovery method based on text classification of the present invention can effectively recover column labels for web tables. Tables with recovered column labels can be used to build and extend Chinese knowledge graphs, as well as in applications such as data extraction and table search. The technical advantages of the invention mainly include the following:
1. Table content can be supplemented with the rich information resources on online encyclopedia platforms such as Baidu Baike, which addresses the sparsity of table context information.
2. Table cells containing long sentences can be handled: by processing the long sentence, a cell containing it can be classified and labeled. This means the proposed method can obtain a large amount of descriptive knowledge (for example, representative works, main achievements) from web tables to expand a knowledge base.
3. The method does not depend on an existing knowledge base, so it can discover new, unknown knowledge and use that knowledge to supplement knowledge bases.
Brief description of the drawings
Fig. 1 is a flowchart of table column label recovery.
Fig. 2 is a flowchart of classifier training.
Fig. 3 is an example of data set construction.
Fig. 4 shows table annotation results.
Fig. 5 shows classifier accuracy.
Fig. 6 shows table column label annotation results.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the drawings.
The present invention provides a table column label recovery method that treats column label recovery as a text classification problem and supplements sparse table content with the rich information resources on online encyclopedia platforms such as Baidu Baike. Because the problem addressed is column label recovery rather than entity column identification in web tables, the invention assumes that the entity column of the table is known and is the first column.
The workflow of the table column label recovery method based on text classification provided in this embodiment is shown in Fig. 1 and comprises the following steps:
1) For each row of the table, extract the entity from the row (assumed to be in the first column), search for the entity on Baidu Baike, and obtain the information detail page corresponding to the entity.
2) For each attribute of the entity, extract the sentences containing the attribute value from the entity's information detail page to form the related text of the attribute value. If the attribute value is a sentence, segment it into words and remove stop words, converting the sentence into a word set; then retrieve the sentences containing words from this set to form the related text of the sentence.
3) Input the related text of the attribute value into the text classifier to obtain the class of the attribute value, which is the class of the cell containing the attribute value.
4) For each attribute column of the table, determine the column label by majority vote according to the classes of the cells in the column.
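Steps 1) to 3) above can be sketched in Python as follows. This is only an illustration under assumptions: the page lookup, the tokenizer, and the classifier are injected as the hypothetical callables `fetch_page`, `tokenize`, and `classify` (the real system would query an encyclopedia platform and call a trained model), the word-level fallback is triggered whenever no sentence contains the complete value, which simplifies the "attribute value is a sentence" condition, and the toy page and table are invented.

```python
import re

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    return [s.strip() for s in re.findall(r"[^。！？!?]+[。！？!?]?", text) if s.strip()]

def related_text(page_text, value, tokenize):
    """Collect the sentences of the page that mention the attribute value."""
    sentences = split_sentences(page_text)
    hits = [s for s in sentences if value in s]
    if not hits:
        # Fallback for sentence-like values: match on the words of the value.
        # The tokenizer is expected to have removed stop words already.
        words = set(tokenize(value))
        hits = [s for s in sentences if any(w in s for w in words)]
    return "".join(hits)

def label_cells(table, fetch_page, tokenize, classify):
    """Return the predicted class (or None) for every attribute cell."""
    labels = []
    for row in table:
        entity, attributes = row[0], row[1:]  # entity column assumed to be first
        page = fetch_page(entity)
        row_labels = []
        for value in attributes:
            text = related_text(page, value, tokenize)
            row_labels.append(classify(text) if text else None)
        labels.append(row_labels)
    return labels

# Toy stand-ins, invented for illustration.
pages = {"张三": "张三出生于北京。张三是一名教师。"}
table = [["张三", "北京", "教师"]]
toy_tokenize = lambda v: [v[i:i + 2] for i in range(0, len(v), 2)]
toy_classify = lambda text: "出生地" if "出生" in text else "职业"
cell_labels = label_cells(table, lambda e: pages.get(e, ""), toy_tokenize, toy_classify)
```

Cells for which no related text can be found are left as None, so that a later voting step can ignore them.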
The whole process can be divided into three parts: preprocessing, table cell labeling, and post-processing. Each is described in detail below:
1) Preprocessing: the attribute cells of the table are enriched with information from Baidu Baike. Each row $R_i$ of the table can be expressed as $(e, p_1, p_2, \dots, p_m)$, where $e$ is the entity and $p_1, p_2, \dots, p_m$ are the attributes of $e$. For each row, Baidu Baike is searched for the entity $e$, the text of the entity's page is retrieved, and the text is split into sentences. Each attribute $p_i$ is then used as a keyword to find the sentences that contain it, which form the related text $RT(p_i)$. If the attribute value is a sentence, it is segmented into words, words that conflict with other attribute cells are removed, and the sentences containing one or more of the remaining words form the related text. Chinese word segmentation is then applied to the related text $RT(p_i)$ of each attribute $p_i$ to obtain a set of words.
2) Table cell labeling: the related word set obtained for each attribute $p_i$ in preprocessing is fed into the text classifier, which yields the class of that attribute. The key questions for the proposed column label recovery method are how to obtain a training data set and how to train the text classifier.
A. Training data set: the invention extracts "attribute name - attribute value" pairs from the semi-structured infoboxes of Baidu Baike, splits the text of the Baidu Baike entry containing the infobox into sentences, and back-labels the text using each attribute value as a keyword, extracting the sentences that contain the keyword to form the training corpus. Two heuristic rules are used during back-labeling:
A1) If several sentences contain the keyword, these sentences together form the related text of the keyword, which together with the attribute name constitutes one training sample.
A2) If no sentence contains the complete keyword, the keyword is segmented into words, words that conflict with other attribute values are removed, and the remaining words are used as keywords for back-labeling, extracting the sentences that contain one or more of them.
Of the back-labeled corpus, 80% is used as the training set and 20% as the test set.
B. Classifier training: before the text classifier is trained, the training corpus must be preprocessed and vectorized; the classifier is then trained on the training data and its performance is evaluated. The process is shown in Fig. 2, and each step is described in detail below:
B1) Text preprocessing: text preprocessing includes operations such as word segmentation and stop word removal. Word segmentation is an indispensable step: unlike English text, Chinese text does not separate words with spaces, so processing Chinese text first requires a segmentation technique to split the text into individual words, turning the text into a set of words that can later be used to represent it. The invention uses the jieba segmenter to segment the corpus. Words that carry almost no information and only reflect the grammatical structure of a sentence are also removed, for example, particles and the Chinese equivalents of "this" and "that".
B2) Text vectorization: for a computer to process the content of a text, the segmented words must also be mapped with a dictionary and represented mathematically. In the preprocessed text, each word is regarded as a feature. The invention uses the vector space model (VSM) to represent the features of a text. The core idea of VSM is that a document D composed of a series of features can be represented by its feature terms and their weights, that is, $D = D(t_1, \omega_1; t_2, \omega_2; \dots; t_n, \omega_n)$, abbreviated as $D = D(\omega_1, \omega_2, \dots, \omega_n)$, where $\omega_i$ is the weight of feature $t_i$. A document can therefore be expressed as an n-dimensional vector. Because a large number of distinct words makes the vector dimension too high and vector computation too expensive, feature selection is used to keep the features that contribute most to classification, which reduces the number of features and the vector dimension and improves the generalization ability of the model. The invention uses the chi-square test ($\chi^2$) for feature selection. The chi-square statistic measures the degree of correlation between a feature term $t_i$ and a class $C_j$, under the assumption that $t_i$ and $C_j$ follow a $\chi^2$ distribution. The correlation is measured by the $\chi^2$ statistic (CHI): the higher the CHI of a feature term for a class, the stronger the correlation between the term and the class, and the more information about the class the feature carries; the reverse also holds. The TF-IDF algorithm is then used to compute feature weights, which measure how important or discriminative a feature term is for a text. The resulting numerical representation of the text serves as the input for classifier training and testing.
B3) Model training: the invention takes the constructed feature vectors as input and trains classification models with both the Naive Bayes algorithm (BAYES) and the support vector machine algorithm (SVM), so that the trained classifiers can be evaluated and the more suitable model chosen for column label recovery. During training, k-fold cross-validation combined with grid search is used to tune the model parameters. The classification model is then trained with the optimized parameters and saved for model testing and table cell labeling.
C. Cell labeling: the related word set of each attribute $p_i$ obtained in preprocessing is used as input to the trained text classifier, which outputs the class of the attribute, that is, the class of the cell containing the attribute.
3) Post-processing: because the cells in the same column have similar content, the invention uses this column-wise consistency to eliminate incorrect labels. For the j-th column of the table, the invention considers the labels of all cells in the column and determines the column label by majority vote: if most cells in the column are labeled with class $t^{(k)}$, the column label is set to $t^{(k)}$.
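The post-processing vote can be sketched in a few lines of Python. The function name `column_labels` and the toy labels are hypothetical, and two simplifying assumptions are made: the most frequent label wins (a plurality vote, with ties going to the label seen first in the column), and cells without a prediction are marked None and ignored.

```python
from collections import Counter

def column_labels(cell_labels):
    """Pick the most frequent cell label in each attribute column.

    cell_labels is a list of rows; each row lists the predicted class of every
    attribute cell, with None for cells that could not be classified.
    """
    result = []
    for column in zip(*cell_labels):
        votes = Counter(label for label in column if label is not None)
        result.append(votes.most_common(1)[0][0] if votes else None)
    return result

# Invented cell predictions for a table with four rows and two attribute columns.
cells = [
    ["birthplace", "occupation"],
    ["birthplace", "occupation"],
    ["nationality", "occupation"],  # one misclassified cell in the first column
    ["birthplace", None],
]
labels = column_labels(cells)
```

The single misclassified cell in the first column is outvoted, which is the error-correcting effect the post-processing step relies on.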
An example is given below to further illustrate the implementation of the Chinese table column label recovery method based on text classification.
1) Training data set construction: taking Fig. 3 as an example, "attribute name - attribute value" pairs such as "location - Beijing, China" and "featured collection - Along the River During the Qingming Festival" are extracted from the infobox. Each attribute value is used as a keyword to back-label the free text of the page containing the infobox, yielding the sentences that contain the keyword (the underlined sentences in the figure). If the free text does not contain the complete attribute value, the value is segmented into words, words that conflict with other attribute values are removed, and the remaining words are used as keywords for back-labeling. Partial back-labeling results are shown in Table 1.
Table 1. Partial back-labeling results
2) Classifier training: the related texts in the training data set are segmented with the jieba segmentation tool, stop words and low-frequency words are removed, feature selection is performed with the chi-square test ($\chi^2$), and the texts are vectorized with the TF-IDF method. The vectorized data is used as input to train classification models with the Naive Bayes algorithm (BAYES) and the support vector machine algorithm (SVM). The Naive Bayes model requires setting a smoothing parameter alpha during training. The support vector machine model requires setting the kernel parameter, which selects the kernel function; the C parameter, which is the penalty coefficient; and, when rbf is chosen as the kernel function, the γ parameter. Parameter tuning is done by grid search combined with cross-validation; the optimal parameters are shown in Table 2:
Table 2. Optimal model parameter settings
3) Cell labeling: for each row $(e, p_1, p_2, \dots, p_m)$ of an unlabeled table, Baidu Baike is searched for the row's entity $e$ and the text of the entity's page is retrieved. The sentences in the text that contain the attribute value $p_i$ are then collected to form the related text, and the trained classifier predicts on the related text to obtain the class label of $p_i$. For each column of the table, the column label is determined by majority vote. Fig. 4 shows the labeling results for a web table: the bold black text in each cell is the cell's label, and the bold black text in the last row of the table is the column label determined by majority vote over the cell labels.
The beneficial effects of the proposed method, designed as described above, are illustrated here. The experiments use data of the person category and select five common attribute types: date of birth, nationality, birthplace, occupation, and alma mater. These attribute types serve as the target attributes, and a large amount of labeled data is obtained from Baidu Baike; 80% of the data is used as training data TR and 20% as test data TE. Table 3 lists the statistics for each class in the data set.
Table 3. Training data set and test data set
Attribute type | Training data set | Test data set |
Date of birth | 13620 | 3431 |
Nationality | 12210 | 3000 |
Birthplace | 13062 | 3317 |
Occupation | 12302 | 3005 |
Alma mater | 8048 | 2018 |
Total | 59242 | 14771 |
Classification models are trained with the Naive Bayes algorithm (BAYES) and the support vector machine algorithm (SVM); the optimal parameter settings are shown in Table 2. The smoothing parameter of BAYES is alpha = 1; SVM uses the RBF kernel function with parameters C = 0.5 and γ = 2. Fig. 5 shows the accuracy of the BAYES and SVM classifiers trained on the above data. The experimental results show that the classifier trained with SVM achieves higher accuracy than the one trained with BAYES on most attribute types, and on the two attributes "nationality" and "birthplace" the improvement in accuracy exceeds 19%.
Tables containing person entities are crawled from web pages with a web crawler and then screened, yielding 104 selected tables. Each row of a table contains an entity and several attributes. For the five target attribute types in the experiment, 1807 instances are obtained in total, as shown in Table 4. All tables are manually labeled for use in the experimental evaluation.
Table 4. Table attribute instance statistics
Attribute type | Number of entities |
Date of birth | 126 |
Nationality | 833 |
Birthplace | 353 |
Occupation | 346 |
Alma mater | 149 |
Total | 1807 |
The trained classifiers are first used to label table cells, to test the ability of the proposed text classification based method to handle real table data. The post-processing operation is then added to eliminate misclassified cells and determine the column labels. Table 5 compares the experimental results of table cell labeling with those of column label annotation after post-processing is added. For both algorithms, accuracy improves considerably once post-processing is added, which shows that the majority voting in the post-processing step effectively eliminates misclassified cells.
Table 5. Accuracy of cell labeling and column label annotation
Fig. 6 shows the experimental results of column label annotation with the BAYES-based and SVM-based methods. Both methods achieve annotation accuracy above 90% on most classes, which demonstrates the effectiveness of the proposed Chinese table column label recovery method based on text classification.
Baidu Baike, as used above, can be replaced with other online encyclopedia platforms, such as Wikipedia, Sogou Baike, and Hudong Baike.
The Naive Bayes and support vector machine classification algorithms used above can be replaced with other classification algorithms, such as decision trees, logistic regression, k-nearest neighbors, and neural networks.
Another embodiment of the present invention provides a Chinese table column label recovery system based on text classification, comprising:
an information detail page acquisition module, responsible for extracting the entity from each row of the table, searching for the extracted entity on an online encyclopedia platform, and obtaining the information detail page corresponding to the entity;
a related text extraction module, responsible for extracting, for each attribute of the entity, the sentences containing the attribute value from the information detail page of the entity to form the related text of the attribute value;
an attribute value classification module, responsible for inputting the related text of the attribute value into a text classifier to obtain the class of the attribute value, which is the class of the cell containing the attribute value;
a column label determination module, responsible for determining, for each attribute column of the table, the column label of the attribute column by majority vote according to the classes of the cells in the attribute column.
For the specific implementation of each module, see the description of the method of the invention above.
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the invention. The scope of protection of the invention is defined by the claims.
Claims (10)
1. A Chinese table column label recovery method based on text classification, characterized by comprising the following steps:
1) extracting the entity from each row of the table, searching for the extracted entity on an online encyclopedia platform, and obtaining the information detail page corresponding to the entity;
2) for each attribute of the entity, extracting the sentences containing the attribute value from the information detail page of the entity to form the related text of the attribute value;
3) inputting the related text of the attribute value into a text classifier to obtain the class of the attribute value, which is the class of the cell containing the attribute value;
4) for each attribute column of the table, determining the column label of the attribute column by majority vote according to the classes of the cells in the attribute column.
2. The method according to claim 1, characterized in that the online encyclopedia platform in step 1) includes one or more of the following: Baidu Baike, Wikipedia, Sogou Baike, Hudong Baike.
3. The method according to claim 1, characterized in that, in step 2), if the attribute value is a sentence, the sentence is segmented into words and stop words are removed, converting it into a word set, and the sentences containing words from this set are then retrieved to form the related text.
4. the method according to claim 1, wherein the instruction that the step 3) text classifier is used in training
Practice data are as follows: " attribute-name-attribute value " is extracted from semi-structured message box to information, then by the item where the message box
Mesh text is divided into sentence, using attribute value as keyword, carries out back mark, extracts the sentence comprising the keyword, forms training corpus.
5. The method according to claim 4, wherein two heuristic rules are applied during the back-labeling:
a) if multiple sentences contain the keyword, these sentences together form the related text of the keyword, which, combined with the attribute name, constitutes one training sample;
b) if no sentence contains the complete keyword, the keyword is segmented into words, words that conflict with other attribute values are removed, and the remaining words are used as keywords to back-label the sentences, extracting the sentences that contain one or more of these keywords.
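The two back-labeling rules above could be sketched as follows. This is an illustration under stated assumptions: a word is treated as "conflicting" when it also occurs in another attribute's value (the claim does not define the term), whitespace splitting stands in for Chinese word segmentation, and the names `back_label` and `build_corpus` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Sample = Tuple[str, str]  # (attribute name, related text)


def back_label(attr_name: str, attr_value: str, sentences: List[str],
               other_values: Iterable[str],
               tokenize: Callable[[str], Iterable[str]] = str.split) -> Optional[Sample]:
    """Build one training sample for an infobox attribute, or None if no match."""
    # Rule a): every sentence containing the complete keyword joins the sample.
    hits = [s for s in sentences if attr_value in s]
    if not hits:
        # Rule b): segment the keyword and drop words shared with other values.
        conflicting = {w for v in other_values for w in tokenize(v)}
        words = [w for w in tokenize(attr_value) if w not in conflicting]
        hits = [s for s in sentences if any(w in s for w in words)]
    if not hits:
        return None
    return attr_name, " ".join(hits)


def build_corpus(infobox: Dict[str, str], sentences: List[str]) -> List[Sample]:
    """Back-label every attribute of one entry against its sentences."""
    samples = []
    for name, value in infobox.items():
        others = [v for n, v in infobox.items() if n != name]
        sample = back_label(name, value, sentences, others)
        if sample is not None:
            samples.append(sample)
    return samples


# Invented entry data, used only to exercise both rules.
sentences = ["She was born in Warsaw",
             "She won the Nobel Prize in physics",
             "Her later research was in chemistry"]
infobox = {"birthplace": "Warsaw",
           "field": "physics chemistry",
           "award": "Nobel Prize physics"}
for sample in build_corpus(infobox, sentences):
    print(sample)
```

In this toy entry, "physics" appears in both the field and award values, so rule b) drops it and each attribute is matched only through its unambiguous words.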
6. The method according to claim 1 or 5, wherein in step 3), before the text classifier is trained, text preprocessing and text vectorization are performed on the training corpus, after which the classification model is trained.
7. The method according to claim 6, wherein the text preprocessing comprises word segmentation and stop-word removal; and the text vectorization represents text features using a vector space model and uses feature selection to choose the features that contribute most to text classification to represent a text, thereby reducing the number of features and the vector dimension.
8. The method according to claim 7, wherein feature selection is performed using the chi-square test, and feature weights are then calculated using the TF-IDF algorithm, a feature weight being a measure of a feature term's importance to, or ability to discriminate, a text.
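Chi-square feature selection followed by TF-IDF weighting, as named in claim 8, can be illustrated with the short sketch below. It uses the standard 2x2 contingency form of the chi-square statistic and one common unsmoothed TF-IDF variant; the claim does not say which variants are used, so these choices, the function names, and the toy data are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def chi_square(term: str, docs: List[Sequence[str]], labels: List[str], target: str) -> float:
    """Chi-square statistic between a term and one category."""
    a = b = c = d = 0
    for words, label in zip(docs, labels):
        present, in_class = term in words, label == target
        if present and in_class:
            a += 1          # term present, document in category
        elif present:
            b += 1          # term present, document in another category
        elif in_class:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs: List[Sequence[str]], labels: List[str], k: int) -> List[str]:
    """Keep the k terms with the highest chi-square score over all categories."""
    vocab = sorted({w for words in docs for w in words})
    classes = sorted(set(labels))
    score = {t: max(chi_square(t, docs, labels, c) for c in classes) for t in vocab}
    return sorted(vocab, key=lambda t: -score[t])[:k]  # ties stay alphabetical


def tfidf_vector(doc: Sequence[str], features: List[str], docs: List[Sequence[str]]) -> List[float]:
    """Weight each selected feature by term frequency times log(N / df)."""
    counts = Counter(doc)
    vector = []
    for term in features:
        df = sum(1 for words in docs if term in words)
        idf = math.log(len(docs) / df) if df else 0.0
        vector.append(counts[term] / len(doc) * idf if doc else 0.0)
    return vector


docs = [["born", "in", "Warsaw"], ["born", "in", "Paris"],
        ["works", "as", "physicist"], ["works", "as", "chemist"]]
labels = ["birthplace", "birthplace", "occupation", "occupation"]
features = select_features(docs, labels, 4)
print(features)
print(tfidf_vector(docs[0], features, docs))
```

Words that occur in only one category score highest, so a small selected vocabulary can still separate the categories while keeping the vector dimension low.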
9. The method according to claim 6, wherein the classification model is trained with the constructed feature vectors as input, using the naive Bayes algorithm and the support vector machine algorithm; during training, k-fold cross-validation combined with grid search is used to tune the model parameters, and the classification model is then trained with the optimized parameters.
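The tuning procedure of claim 9, k-fold cross-validation combined with grid search, is sketched below for the naive Bayes case only. To stay self-contained it uses a small hand-written multinomial naive Bayes on raw word counts instead of a library classifier or the TF-IDF vectors, and it omits the support vector machine; the smoothing parameter `alpha`, the fold assignment, and all function names are assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict
from typing import List, Sequence


def train_nb(docs: List[Sequence[str]], labels: List[str], alpha: float):
    """Fit multinomial naive Bayes with additive smoothing alpha (> 0)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
    vocab = {w for words in docs for w in words}
    return class_counts, word_counts, vocab, alpha


def predict_nb(model, words: Sequence[str]) -> str:
    """Return the category with the highest log-posterior score."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        score = math.log(class_counts[label] / total_docs)
        for w in words:
            if w in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][w] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def k_fold_accuracy(docs, labels, alpha: float, k: int) -> float:
    """Average accuracy over k folds; fold i holds out every k-th sample."""
    correct = 0
    for fold in range(k):
        held_out = set(range(fold, len(docs), k))
        train = [i for i in range(len(docs)) if i not in held_out]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train], alpha)
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in held_out)
    return correct / len(docs)


def grid_search(docs, labels, alphas: List[float], k: int):
    """Pick the alpha with the best cross-validated accuracy, then retrain on all data."""
    scores = {a: k_fold_accuracy(docs, labels, a, k) for a in alphas}
    best_alpha = max(scores, key=scores.get)
    return train_nb(docs, labels, best_alpha), best_alpha


docs = [["born", "in", "Warsaw"], ["works", "as", "physicist"],
        ["born", "in", "Paris"], ["works", "as", "chemist"],
        ["born", "in", "Berlin"], ["works", "as", "teacher"]]
labels = ["birthplace", "occupation"] * 3
model, best_alpha = grid_search(docs, labels, alphas=[0.1, 0.5, 1.0], k=3)
print(best_alpha, predict_nb(model, ["born", "in", "Rome"]))
```

Retraining on the full corpus after the search mirrors the last clause of the claim: the cross-validation folds are used only to choose the parameters, and the final model sees all of the training data.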
10. A Chinese table column label recovery system based on text classification, characterized by comprising:
an information details page acquisition module, responsible for extracting an entity from each row of the table, searching for the extracted entity on an online encyclopedia knowledge platform, and obtaining the information details page corresponding to the entity;
a related text extraction module, responsible for extracting, for each attribute of the entity, the sentences containing the attribute value from the entity's information details page to form the related text of the attribute value;
an attribute value classification module, responsible for inputting the related text of the attribute value into a text classifier to obtain the category to which the attribute value belongs, which is taken as the category of the cell containing the attribute value;
a column label determining module, responsible for determining, for each attribute column of the table, the column label of the attribute column by majority voting over the categories of the cells in that column.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811524302.3A CN109710725A (en) | 2018-12-13 | 2018-12-13 | A kind of Chinese table column label restoration methods and system based on text classification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109710725A (en) | 2019-05-03 |
Family
ID=66255787
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811524302.3A Pending CN109710725A (en) | 2018-12-13 | 2018-12-13 | A kind of Chinese table column label restoration methods and system based on text classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109710725A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103778238B (en) * | 2014-01-27 | 2015-03-04 | 西安交通大学 | Method for automatically building classification tree from semi-structured data of Wikipedia |
US20170103107A1 (en) * | 2015-10-09 | 2017-04-13 | Informatica Llc | Method, apparatus, and computer-readable medium to extract a referentially intact subset from a database |
CN106503148A (en) * | 2016-10-21 | 2017-03-15 | 东南大学 | A kind of form entity link method based on multiple knowledge base |
CN108090070A (en) * | 2016-11-22 | 2018-05-29 | 北京高地信息技术有限公司 | A kind of Chinese entity attribute abstracting method |
CN107133208A (en) * | 2017-03-24 | 2017-09-05 | 南京缘长信息科技有限公司 | The method and device that a kind of entity is extracted |
Non-Patent Citations (3)
Title |
---|
DENG DONG,JIANG YU,LI GUOLIANG,ET AL.: "《Scalable column concept determination for web tables using large knowledge bases》", 《PROCEEDINGS OF THE VLDB ENDOWMENT》 * |
JIE XIE,CONG CAO,YANBING LIU,YANAN CAO,BAOKE LI: "Column Concept Determination for Chinese Web Tables via Convolutional Neural Network", 《INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE(ICCS)2018》 * |
罗静: "互联网表格数据的语义恢复", 《中国优秀硕士学位论文全文数据库》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111324609A (en) * | 2020-02-17 | 2020-06-23 | 腾讯云计算(北京)有限责任公司 | Knowledge graph construction method and device, electronic equipment and storage medium |
CN111324609B (en) * | 2020-02-17 | 2023-07-14 | 腾讯云计算(北京)有限责任公司 | Knowledge graph construction method and device, electronic equipment and storage medium |
CN111611799A (en) * | 2020-05-07 | 2020-09-01 | 北京智通云联科技有限公司 | Dictionary and sequence labeling model based entity attribute extraction method, system and equipment |
CN111611799B (en) * | 2020-05-07 | 2023-06-02 | 北京智通云联科技有限公司 | Entity attribute extraction method, system and equipment based on dictionary and sequence labeling model |
CN111931229A (en) * | 2020-07-10 | 2020-11-13 | 深信服科技股份有限公司 | Data identification method and device and storage medium |
CN111931229B (en) * | 2020-07-10 | 2023-07-11 | 深信服科技股份有限公司 | Data identification method, device and storage medium |
CN112381143A (en) * | 2020-11-13 | 2021-02-19 | 长城计算机软件与系统有限公司 | Variable automatic classification method and system based on machine learning |
CN112381143B (en) * | 2020-11-13 | 2023-12-05 | 新长城科技有限公司 | Automatic variable classification method and system based on machine learning |
CN113486177A (en) * | 2021-07-12 | 2021-10-08 | 贵州电网有限责任公司 | Electric power field table column labeling method based on text classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190503 |