CN110019829A - Data attribute determination method and apparatus - Google Patents

Data attribute determination method and apparatus

Info

Publication number
CN110019829A
Authority
CN
China
Prior art keywords
attribute
data
column
column data
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710848242.XA
Other languages
Chinese (zh)
Other versions
CN110019829B (en
Inventor
宋奇
王思睿
姜萌芽
钟磊
秦锋剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Green Bay Network Technology Co., Ltd.
Original Assignee
Grass Count Language (beijing) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Grass Count Language (beijing) Technology Co Ltd filed Critical Grass Count Language (beijing) Technology Co Ltd
Priority to CN201710848242.XA priority Critical patent/CN110019829B/en
Publication of CN110019829A publication Critical patent/CN110019829A/en
Application granted granted Critical
Publication of CN110019829B publication Critical patent/CN110019829B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data attribute determination method and apparatus. The method comprises: splitting formatted raw data to obtain multiple column data; if the column data does not include a column header, determining the candidate attribute set corresponding to the column data according to the data type of the column data; determining the attribute of each cell content of the column data according to the candidate attribute set; and counting the attributes of the cell contents of the column data to obtain the confidence of each attribute, then determining the attribute of the column data according to the confidence. By looking up the candidate attribute set of the column data by category, and by determining the attribute of the column data from statistics over the attributes of its cell contents, the method reduces the amount of computation in data attribute recognition as far as possible and improves the efficiency and accuracy of data attribute recognition.

Description

Data attribute determination method and apparatus
Technical field
The present invention relates to the field of data mining technology, and in particular to a data attribute determination method and apparatus.
Background
In big data analysis related to relationship graphs, structured raw data must be split and identified item by item so that the data can be mapped to entity-attribute pairs, which facilitates entity modeling and analysis.
A typical application scenario is as follows: for a well-structured input Excel table, the attribute of each column of the table can be inferred from the file name, the table header, and the content of each column, and then mapped to the attributes of an E-R (Entity-Relationship) model. In this way the original input can be mapped onto the graph model, which facilitates deeper subsequent operations such as graph mining. However, how to improve the recognition rate of data attributes remains a technical problem to be solved.
Summary of the invention
The present invention aims to solve at least one of the above technical problems, at least to some extent.
To this end, the first object of the present invention is to propose a data attribute determination method that looks up the candidate attribute set of the column data by category and determines the attribute of the column data by counting the attributes of its cell contents, thereby reducing the amount of computation in data attribute recognition as far as possible and improving the efficiency and accuracy of data attribute recognition.
The second object of the present invention is to propose a data attribute determination apparatus.
The third object of the present invention is to propose a computer device.
The fourth object of the present invention is to propose a computer-readable storage medium.
To achieve the above objects, the data attribute determination method of an embodiment of the first aspect of the present invention comprises: splitting formatted raw data to obtain multiple column data;
if the column data does not include a column header, determining the candidate attribute set corresponding to the column data according to the data type of the column data;
determining the attribute of each cell content of the column data according to the candidate attribute set;
counting the attributes of the cell contents of the column data to obtain the confidence of each attribute, and determining the attribute of the column data according to the confidence.
In the method described above, determining the candidate attribute set corresponding to the column data according to the data type of the column data comprises:
when the data type is the alphanumeric type, obtaining the candidate regular expression set corresponding to the column data and determining the candidate regular expression set as the candidate attribute set corresponding to the column data, wherein the candidate regular expression set includes multiple regular expressions and each regular expression is associated with a data attribute;
and determining the attribute of each cell content of the column data according to the candidate attribute set comprises:
matching the cell content against each regular expression in the candidate regular expression set one by one;
when a regular expression matches the cell content, determining the data attribute associated with that regular expression as the attribute of the cell content.
In the method described above, determining the candidate attribute set corresponding to the column data according to the data type of the column data comprises:
when the data type is the non-alphanumeric type, obtaining the candidate hash dictionary set corresponding to the column data and determining the candidate hash dictionary set as the candidate attribute set corresponding to the column data, wherein the candidate hash dictionary set includes multiple hash dictionaries and each hash dictionary is associated with a data attribute;
and determining the attribute of each cell content of the column data according to the candidate attribute set comprises:
querying each hash dictionary in the candidate hash dictionary set with the cell content one by one;
when the cell content is found in a hash dictionary, determining the data attribute associated with that hash dictionary as the attribute of the cell content.
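The hash dictionary lookup described above can be illustrated with a minimal sketch. It assumes each hash dictionary is a Python set of known cell values keyed by its associated attribute; the dictionary contents and the function name `lookup_cell_attribute` are hypothetical and chosen only to mirror the values of Table 1.

```python
# Hypothetical candidate hash dictionary set: each attribute is associated
# with a hash-based container (a Python set) of known cell values.
CANDIDATE_HASH_DICTS = {
    "outbox company": {
        "Beijing company", "Hainan company", "Hunan company", "Shanghai company",
    },
    "addressee company": {
        "Jiangxi company", "Shandong company", "Shanxi company",
    },
}


def lookup_cell_attribute(cell, candidate_dicts):
    """Query each hash dictionary in turn; return the attribute of the first hit."""
    for attribute, known_values in candidate_dicts.items():
        if cell in known_values:  # average O(1) membership test
            return attribute
    return None  # no dictionary contains this cell content


column = [
    "Jiangxi company", "Beijing company", "Hainan company",
    "Hunan company", "Shanghai company",
]
cell_attributes = [lookup_cell_attribute(c, CANDIDATE_HASH_DICTS) for c in column]
```

What to do with a cell content that no dictionary contains is not stated in the text; returning `None` here is one possible choice.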
In the method described above, counting the attributes of the cell contents of the column data to obtain the confidence of each attribute and determining the attribute of the column data according to the confidence comprises:
counting the hash-dictionary-associated attributes of the cell contents of the column data to obtain the confidence of each attribute associated with a hash dictionary;
determining the hash-dictionary-associated attribute of the column data according to the confidence associated with the hash dictionary;
and, after determining the hash-dictionary-associated attribute of the column data according to the confidence associated with the hash dictionary, the method further comprises:
when the non-alphanumeric type is short text and the confidence of the hash-dictionary-associated attribute of the column data is lower than a set threshold, determining the Trie-associated attribute of the column data;
comparing the hash-dictionary-associated attribute of the column data with the Trie-associated attribute of the column data, and determining the attribute with the higher confidence as the target attribute of the column data.
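The fallback from hash dictionaries to a Trie can be sketched as follows. The text does not state how a cell content is matched against the Trie, so this sketch assumes a cell matches when some key stored in the Trie is a prefix of it. The class `Trie`, the helper names, the sample data, and the threshold value 0.5 are all hypothetical.

```python
from collections import Counter


class Trie:
    """Minimal character trie; a text matches if any stored key is a prefix of it."""

    def __init__(self, keys=()):
        self.root = {}
        for key in keys:
            self.insert(key)

    def insert(self, key):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node["$end"] = True  # multi-character marker, cannot clash with a single char

    def has_prefix_of(self, text):
        node = self.root
        for ch in text:
            if "$end" in node:
                return True
            if ch not in node:
                return False
            node = node[ch]
        return "$end" in node


def best_attribute(cells, matchers):
    """Return (attribute, confidence) for the attribute matching the most cells."""
    counts = Counter()
    for cell in cells:
        for attribute, matches in matchers.items():
            if matches(cell):
                counts[attribute] += 1
                break
    if not counts:
        return None, 0.0
    attribute, hits = counts.most_common(1)[0]
    return attribute, hits / len(cells)


def determine_short_text_attribute(cells, hash_dicts, tries, threshold=0.5):
    """Use the hash dictionaries first; consult the Tries only below the threshold."""
    hash_result = best_attribute(
        cells, {a: d.__contains__ for a, d in hash_dicts.items()})
    if hash_result[1] >= threshold:
        return hash_result
    trie_result = best_attribute(
        cells, {a: t.has_prefix_of for a, t in tries.items()})
    # Keep whichever result has the higher confidence.
    return max(hash_result, trie_result, key=lambda r: r[1])


cells = ["Jiangxi Logistics Co.", "Beijing Express Ltd.", "Hainan company"]
hash_dicts = {"outbox company": {"Hainan company"}}
tries = {"outbox company": Trie(["Jiangxi", "Beijing", "Hainan"])}
result = determine_short_text_attribute(cells, hash_dicts, tries)
```

In this sample the exact-match dictionary recognizes only one of three cells, so the prefix-based Trie result, which recognizes all three, is selected.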
The method described above further comprises:
when it is determined that the column data includes a column header, querying a preset attribute mapping dictionary with the column header to obtain the attribute that matches the column header, and determining the attribute that matches the column header as the attribute of the column data.
To achieve the above objects, the data attribute determination apparatus of an embodiment of the second aspect of the present invention comprises: a splitting module, configured to split formatted raw data to obtain multiple column data;
a first determining module, configured to determine, if the column data does not include a column header, the candidate attribute set corresponding to the column data according to the data type of the column data;
a second determining module, configured to determine the attribute of each cell content of the column data according to the candidate attribute set;
a third determining module, configured to count the attributes of the cell contents of the column data to obtain the confidence of each attribute, and to determine the attribute of the column data according to the confidence.
To achieve the above objects, the computer device of an embodiment of the third aspect of the present invention comprises a processor and a memory;
wherein the processor reads the executable program code stored in the memory and runs the program corresponding to the executable program code, so as to implement the data attribute determination method of the first aspect.
To achieve the above objects, the computer-readable storage medium of an embodiment of the fourth aspect of the present invention stores a computer program which, when executed by a processor, implements the data attribute determination method of the first aspect.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the data attribute determination method of one embodiment of the present invention;
Fig. 2 is a flowchart of the data attribute determination method of a further embodiment of the present invention;
Fig. 3 is a flowchart of the data attribute determination method of another embodiment of the present invention;
Fig. 4 is a flowchart of the data attribute determination method of yet another embodiment of the present invention;
Fig. 5 is a flowchart of the data attribute determination method of one embodiment of the present invention;
Fig. 6 is a structural schematic diagram of the data attribute determination apparatus of one embodiment of the present invention;
Fig. 7 is a structural schematic diagram of the computer device of one embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
The data attribute determination method and apparatus of the embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the data attribute determination method of one embodiment of the present invention. The method of this embodiment is executed by a data attribute determination apparatus, which can be integrated in a server.
As shown in Fig. 1, the data attribute determination method of this embodiment comprises:
Step S101: splitting formatted raw data to obtain multiple column data.
Specifically, formatted raw data contains a vast store of valuable information, and mining it can help users make more scientific decisions. The formatted raw data in this embodiment can be reports of various kinds, such as customer lists, product lists, item lists, orders, and shipping lists; the reports can be in PDF, Word, Excel, PowerPoint, or similar formats.
Take Table 1 as an example. The file is named "logistics management report", and Table 1 has 3 column data whose column headers are Air Way Bill No., outbox company, and addressee company. In general, the column header is the attribute of the corresponding column data: the attribute of the Air Way Bill No. column in Table 1 is Air Way Bill No., the attribute of the outbox company column is outbox company, and so on.
When a server integrated with the data attribute determination apparatus collects a large number of logistics management reports like Table 1, it first identifies the structure of the report and determines its number of columns. It then splits the report by column to obtain 3 column data: Air Way Bill No., outbox company, and addressee company. Finally, it mines the attribute of each column data from the column header or from the cell contents of the column data.
Table 1
Air Way Bill No. | Outbox company   | Addressee company
4506442377787    | Jiangxi company  | Shandong company
4523447706787    | Beijing company  | Shanxi company
8744235077647    | Hainan company   | Henan company
7643507474287    | Hunan company    | Hebei company
3587442077647    | Shanghai company | Jiangxi company
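The column split of step S101 can be sketched as below. The embodiment mentions PDF, Word, Excel, and PowerPoint reports; this sketch instead assumes the report has already been exported as comma-separated text, and the function name `split_into_columns` and the `has_header` flag are hypothetical.

```python
import csv
import io

# Table 1 expressed as CSV text (an assumed input format).
RAW_TABLE = """\
Air Way Bill No.,Outbox company,Addressee company
4506442377787,Jiangxi company,Shandong company
4523447706787,Beijing company,Shanxi company
8744235077647,Hainan company,Henan company
7643507474287,Hunan company,Hebei company
3587442077647,Shanghai company,Jiangxi company
"""


def split_into_columns(text, has_header=True):
    """Split row-oriented CSV text into a list of column records."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    columns = []
    for values in zip(*rows):  # transpose: rows become columns
        if has_header:
            columns.append({"header": values[0], "cells": list(values[1:])})
        else:
            columns.append({"header": None, "cells": list(values)})
    return columns


columns = split_into_columns(RAW_TABLE)
```

Each record keeps the header apart from the cell contents, so later steps can branch on whether a header is present.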
Step S102: if the column data does not include a column header, determining the candidate attribute set corresponding to the column data according to the data type of the column data.
For example, when the server collects column data like those in Table 1, which include both a column header and cell contents, it can preferentially identify the column header to mine the attribute of the column data, because the column header characterizes that attribute. However, not all collected formatted raw data include column headers; in that case, the server mines the attribute of the column data from its cell contents.
Specifically, the data types of column data attributes are divided into the alphanumeric type and the non-alphanumeric type. The cell contents of an alphanumeric-type attribute consist of digits, the 26 English letters, underscores, spaces, tabs, form feeds, and the like; examples are the cell contents of attributes such as Air Way Bill No., postcode, identity card, and MAC address. The non-alphanumeric type can be an enumeration type or a short text type. Enumeration-type attributes include nationality, gender, and so on; short-text-type attributes include company, school, and so on.
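One way to test for the alphanumeric type is a single character-class check, sketched below. The character set follows the list given in the text (digits, English letters, underscore, space, tab, form feed); treating letters of both cases as allowed, requiring every cell of the column to pass, and the name `column_data_type` are assumptions. The text ends its list with "and the like", so a practical set may need more characters, such as the separators in a MAC address.

```python
import re

# Digits, English letters (both cases assumed), underscore, space, tab, form feed.
ALPHANUMERIC_CELL = re.compile(r"[0-9A-Za-z_ \t\f]+")


def column_data_type(cells):
    """Classify a column as 'alphanumeric' only if every cell fully matches."""
    if cells and all(ALPHANUMERIC_CELL.fullmatch(cell) for cell in cells):
        return "alphanumeric"
    return "non-alphanumeric"


waybill_cells = ["4506442377787", "4523447706787", "8744235077647"]
# Hypothetical original-language company names; the translated names in
# Table 1 use only Latin letters and spaces and would pass the check above.
company_cells = ["江西公司", "北京公司", "海南公司"]

waybill_type = column_data_type(waybill_cells)
company_type = column_data_type(company_cells)
```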
For example, when the server collects the column data corresponding to Air Way Bill No. in Table 1, it first performs operations such as identification, classification, and clustering on the 5 cell contents and infers that the attribute of this column data may be an alphanumeric-type attribute such as Air Way Bill No., postcode, or identity card. The server then retrieves the pre-stored Air Way Bill No. attribute recognition model, postcode attribute recognition model, and identity card attribute recognition model. If the server has established different Air Way Bill No. attribute recognition models for different logistics companies, it retrieves all of them. Together, all the Air Way Bill No. attribute recognition models, the postcode attribute recognition model, and the identity card attribute recognition model constitute the candidate attribute set of this embodiment for the column data (that is, the candidate attribute set corresponding to the column data associated with Air Way Bill No.).
As another example, when the server collects the column data corresponding to outbox company in Table 1, it first performs operations such as identification, classification, and clustering on the 5 cell contents and infers that the attribute of this column data may be a non-alphanumeric-type attribute such as outbox company, addressee company, or transit company. The server then retrieves the pre-stored outbox company attribute recognition model, addressee company attribute recognition model, and transit company attribute recognition model. If the server has established separate attribute recognition models for different outbox companies, addressee companies, and transit companies, it retrieves all of them. Together, all the outbox company, addressee company, and transit company attribute recognition models constitute the candidate attribute set of this embodiment for the column data (that is, the candidate attribute set corresponding to the column data associated with outbox company).
It should be noted that the attribute recognition models named above (Air Way Bill No., postcode, identity card, outbox company, addressee company, and transit company) are exemplary and are established by the data mining developer according to actual needs. For example, the developer may collect raw data from a large volume of historical reports and build the corresponding attribute recognition models using analysis methods such as machine learning; alternatively, the developer may build them with a preset algorithm from statistical analysis of that raw data. An attribute recognition model can be, for example, a hash dictionary, a Trie, or a regular expression established for the attribute.
In this embodiment, when the column data is determined not to include a column header, the data type of the column data is determined first and the candidate attribute set is then looked up by category, which reduces the amount of computation in data attribute recognition as far as possible and improves the efficiency and accuracy of data attribute recognition.
Step S103: determining the attribute of each cell content of the column data according to the candidate attribute set.
Step S104: counting the attributes of the cell contents of the column data to obtain the confidence of each attribute, and determining the attribute of the column data according to the confidence.
For example, when the server identifies the attributes of the 5 cell contents in the outbox company column of Table 1, each cell content is matched one by one against the candidate attribute set, which includes the outbox company attribute recognition model, the addressee company attribute recognition model, and the transit company attribute recognition model.
Specifically, when "Jiangxi company" is input in turn to the outbox company, addressee company, and transit company attribute recognition models, it is found to match the addressee company attribute recognition model, so the attribute of the cell content "Jiangxi company" is determined to be addressee company. By analogy, the attribute of the cell content "Beijing company" is determined to be outbox company, as are the attributes of the cell contents "Hainan company", "Hunan company", and "Shanghai company".
The statistics then show two attributes for the column data associated with outbox company: outbox company and addressee company. Four cell contents are identified as outbox company and one as addressee company. By calculation, the probability that the attribute of the column data is outbox company is 80%, and the probability that it is addressee company is 20%. In this embodiment, the confidence of an attribute can be understood as the probability of that attribute: a probability of 80% that the attribute of the column data is outbox company means that the confidence of outbox company is 80%, and a probability of 20% that it is addressee company means that the confidence of addressee company is 20%.
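The confidence calculation in this example is a relative frequency, which can be sketched in a few lines. The function name `attribute_confidences` is hypothetical, and the sketch assumes every cell content has already been assigned an attribute.

```python
from collections import Counter


def attribute_confidences(cell_attributes):
    """Confidence of an attribute = share of cells identified with that attribute."""
    counts = Counter(cell_attributes)
    total = len(cell_attributes)
    return {attribute: n / total for attribute, n in counts.items()}


# Per-cell results for the outbox company column, as in the example above.
cell_attributes = [
    "addressee company",  # Jiangxi company
    "outbox company",     # Beijing company
    "outbox company",     # Hainan company
    "outbox company",     # Hunan company
    "outbox company",     # Shanghai company
]
confidences = attribute_confidences(cell_attributes)
```

With four of five cells identified as outbox company, the resulting confidences are 0.8 and 0.2, matching the 80% and 20% in the text.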
The attribute of the column data can be determined from the confidence in the following ways:
In the first implementation, the confidences corresponding to the same column data are compared, and the attribute with the highest confidence is determined as the attribute of the column data.
In the second implementation, it is determined whether each confidence corresponding to the same column data satisfies a set condition, and the attribute whose confidence satisfies the condition is determined as the attribute of the column data. The set condition can be a comparison of the confidence with a set threshold, or a check of whether the confidence lies within a set range, but is not limited to these. Note that several confidences may satisfy the condition, in which case several attributes may be determined for the column data.
In the third implementation, the confidences corresponding to the same column data are presented to the user, who selects the attribute of the column data according to the confidences.
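The three implementations can be sketched as three small functions. The function names are hypothetical, and treating the threshold comparison as "greater than or equal to" is an assumption, since the text does not fix the direction of the comparison.

```python
def select_max(confidences):
    """First implementation: the attribute with the highest confidence."""
    return max(confidences, key=confidences.get)


def select_by_threshold(confidences, threshold):
    """Second implementation: every attribute whose confidence meets the threshold."""
    return [a for a, c in confidences.items() if c >= threshold]


def rank_for_user(confidences):
    """Third implementation: attributes sorted by confidence for the user to choose."""
    return sorted(confidences.items(), key=lambda item: item[1], reverse=True)


confidences = {"outbox company": 0.8, "addressee company": 0.2}
top = select_max(confidences)
above_half = select_by_threshold(confidences, 0.5)
above_tenth = select_by_threshold(confidences, 0.1)
ranking = rank_for_user(confidences)
```

The lower threshold illustrates the case noted in the text where more than one attribute satisfies the condition.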
The data attribute determination method provided by this embodiment comprises: splitting formatted raw data to obtain multiple column data; if the column data does not include a column header, determining the candidate attribute set corresponding to the column data according to the data type of the column data; determining the attribute of each cell content of the column data according to the candidate attribute set; and counting the attributes of the cell contents of the column data to obtain the confidence of each attribute, then determining the attribute of the column data according to the confidence. By looking up the candidate attribute set of the column data by category and determining the attribute of the column data from statistics over the attributes of its cell contents, the method reduces the amount of computation in data attribute recognition as far as possible and improves the efficiency and accuracy of data attribute recognition.
Fig. 2 is a flowchart of the data attribute determination method of a further embodiment of the present invention. On the basis of the above embodiment, when the column data has a column header, the column header can be used to determine the attribute of the column data simply, efficiently, and more accurately, which also speeds up recognition of the attribute of the column data.
As shown in Fig. 2, the data attribute determination method of this embodiment comprises:
Step S201: splitting formatted raw data to obtain multiple column data, then executing step S202 or step S205.
Step S202: if the column data does not include a column header, determining the candidate attribute set corresponding to the column data according to the data type of the column data, then executing step S203.
Step S203: determining the attribute of each cell content of the column data according to the candidate attribute set, then executing step S204.
Step S204: counting the attributes of the cell contents of the column data to obtain the confidence of each attribute, and determining the attribute of the column data according to the confidence.
Steps S201, S202, S203, and S204 of this embodiment are implemented in the same way as steps S101, S102, S103, and S104 of the above embodiment, respectively, and are not described again here.
Step S205: when it is determined that the column data includes a column header, querying the preset attribute mapping dictionary with the column header to obtain the attribute that matches the column header, and determining the attribute that matches the column header as the attribute of the column data.
For example, when a user designs a report, the column header characterizes the attribute of the column data; the column data shown in Table 1 include the column headers Air Way Bill No., outbox company, and addressee company.
Specifically, the preset attribute mapping dictionary is stored in the server in advance. It is designed by the data mining developer according to the characteristics of various industries and is continually updated. The attributes in the preset attribute mapping dictionary have been certified by professionals and are therefore highly authoritative. In this embodiment, the column header is looked up in the preset attribute mapping dictionary, and the attribute found there is determined as the attribute of the column data. The method is simple and efficient, and the column data attribute it determines is more accurate.
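The header lookup of step S205 can be sketched as a plain dictionary query. The entries of `ATTRIBUTE_MAPPING`, including the header variant "waybill number", are hypothetical, and normalizing the header by trimming whitespace and lowercasing is an assumption not stated in the text.

```python
# Hypothetical preset attribute mapping dictionary: normalized header -> attribute.
ATTRIBUTE_MAPPING = {
    "air way bill no.": "Air Way Bill No.",
    "waybill number": "Air Way Bill No.",
    "outbox company": "outbox company",
    "addressee company": "addressee company",
}


def attribute_from_header(header, mapping=ATTRIBUTE_MAPPING):
    """Return the attribute matching a column header, or None if there is no match."""
    if header is None:
        return None  # the column has no header; the cell-based path applies
    return mapping.get(header.strip().lower())


headers = ["Air Way Bill No.", "Outbox company", "Addressee company"]
column_attributes = [attribute_from_header(h) for h in headers]
```

A `None` result signals that the header path found nothing; how the method proceeds for a header that is present but not in the dictionary is not specified in the text.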
In the data attribute determination method provided by this embodiment, when it is determined that the column data includes a column header, the preset attribute mapping dictionary is queried with the column header to obtain the attribute that matches the column header, and that attribute is determined as the attribute of the column data. The method is simple and efficient, and the column data attribute it determines is more accurate.
Fig. 3 is a flowchart of the data attribute determination method of another embodiment of the present invention. On the basis of the above embodiments, when the data type is the alphanumeric type, the attribute of each cell content is determined by testing the cell content against regular expressions, and the attribute of the column data is then determined from the cell attributes.
As shown in Fig. 3, the data attribute determination method of this embodiment comprises:
Step S301: splitting formatted raw data to obtain multiple column data, then executing step S302.
Step S302: if the column data does not include a column header, determining the candidate attribute set corresponding to the column data according to the data type of the column data, then executing step S303.
Step S303: when the data type is the alphanumeric type, obtaining the candidate regular expression set corresponding to the column data and determining the candidate regular expression set as the candidate attribute set corresponding to the column data, then executing step S304.
The candidate regular expression set includes multiple regular expressions, and each regular expression is associated with a data attribute.
Specifically, a regular expression is a logical formula for operating on character strings: predefined special characters and combinations of them form a "pattern string" that expresses a filtering rule for character strings. Regular expressions are flexible, logical, and powerful, and they are practical and efficient for tasks such as string processing and form validation.
For example, the server stores various regular expressions in advance, each associated with one attribute, such as an Air Way Bill No. regular expression, a postcode regular expression, and an identity card regular expression.
When the server collects the column data corresponding to Air Way Bill No. in Table 1, it first performs operations such as identification, classification, and clustering on the 5 cell contents and infers that the attribute of this column data may be an alphanumeric-type attribute such as Air Way Bill No., postcode, or identity card. The server then retrieves the pre-stored Air Way Bill No. regular expression, postcode regular expression, and identity card regular expression; that is, the candidate regular expression set of this embodiment includes the Air Way Bill No. regular expression, the postcode regular expression, and the identity card regular expression.
Step S304: match the unit content against each regular expression in the candidate regular expression set one by one, then execute step S305.
Step S305: when a regular expression matches the unit content, determine the data attribute associated with that regular expression as the attribute of the unit content, then execute step S306.
Step S306: count the attributes of all unit contents of the column data to obtain the confidence of each attribute, and determine the attribute of the column data according to the confidence.
For example, when the server identifies the attributes of the 5 unit contents of the waybill number column in Table 1, each unit content is matched one by one against the candidate regular expression set.
Specifically, the statistics show three attributes for the column data associated with the waybill number: waybill number, postal code, and ID card. Of the 5 unit contents, 3 are identified as waybill number, 1 as postal code, and 1 as ID card.
By calculation, the probability (which can be understood as the confidence) that the attribute of the column data is waybill number is 60% (3 of 5 unit contents), the probability that it is postal code is 20%, and the probability that it is ID card is 20%.
The confidences corresponding to the same column data are then compared, and the attribute with the highest confidence is determined as the attribute of the column data. In the example above, 60% is the highest confidence, so the attribute of the column data is determined to be waybill number.
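The counting step in this example can be sketched as follows. The function name `column_attribute` is hypothetical, and two behaviours are assumptions not stated in the source: an empty column yields no attribute, and a tie is resolved in favour of the attribute seen first.

```python
from collections import Counter

def column_attribute(unit_attributes):
    """Return (best_attribute, confidences) for one column.

    The confidence of an attribute is the number of unit contents assigned
    to it divided by the total number of unit contents in the column.
    """
    if not unit_attributes:
        return None, {}
    counts = Counter(unit_attributes)
    total = len(unit_attributes)
    confidences = {attr: n / total for attr, n in counts.items()}
    best = max(confidences, key=confidences.get)
    return best, confidences

# The 3 / 1 / 1 split from the waybill number example.
labels = ["waybill number"] * 3 + ["postal code", "ID card"]
best, confidences = column_attribute(labels)
print(best, confidences)
```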
In the data attribute determination method of this embodiment, when the data type of the column data is determined to be the alphanumeric type, the candidate regular expression set corresponding to the column data is first obtained; each unit content is then matched one by one against each regular expression in the candidate regular expression set to determine the attribute of the unit content; finally, the attributes of all unit contents of the column data are counted to determine the attribute of the column data. The method uses regular expressions to determine the attribute of alphanumeric-type column data, which is practical and efficient. Because regular expressions are flexible, logical, and powerful, a new regular expression can be written for each newly added attribute, so the method extends well.
Fig. 4 is a flowchart of a data attribute determination method according to yet another embodiment of the present invention. On the basis of the above embodiments, when the data type is the non-alphanumeric type, the attribute of each unit content is determined by a logical judgment of the unit content against hash dictionaries, and the attribute of the column data is then determined.
As shown in Fig. 4, the data attribute determination method of this embodiment includes:
Step S401: split the formatted raw data to obtain multiple column data, then execute step S402;
Step S402: if the column data does not include a column header, determine the candidate attribute set corresponding to the column data according to the data type of the column data, then execute step S403;
Step S403: when the data type is the non-alphanumeric type, obtain the candidate hash dictionary set corresponding to the column data and determine the candidate hash dictionary set as the candidate attribute set corresponding to the column data, where the candidate hash dictionary set includes multiple hash dictionaries and each hash dictionary is associated with a data attribute; then execute step S404.
A brief introduction to hash tables: a hash table is a data structure that accesses a memory storage location directly according to a key. That is, a hash table computes a function of the key to map the data being queried to a position in the table and accesses the record there, which speeds up lookup. The mapping function is called the hash function, and the array that stores the records is called the hash table; in theory, the keys and the function rule can be chosen arbitrarily.
To speed up queries of data attributes, the hash dictionaries in this embodiment can contain initial-letter hash tables established for different industries. Taking Table 1 as an example, the hash dictionaries to be established are the sending company hash dictionary and the receiving company hash dictionary. In the sending company hash dictionary, initial-letter hash tables are established for Jiangxi company, Beijing company, Hainan company, Hunan company, and Shanghai company respectively. When the attribute of a unit content needs to be determined, a fast search by initial letter in the hash dictionary suffices. For example, to determine the attribute of the unit content "Jiangxi company", the initial j is input into the sending company hash dictionary; if Jiangxi company is found in the hash dictionary, the unit content "Jiangxi company" is determined to be a sending company.
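One way to picture a hash dictionary split into initial-letter hash tables is the sketch below. The class name `InitialHashDictionary` and its methods are hypothetical, and taking the lowercased first character of the entry as the "initial" is an assumption (the original text presumably uses the pinyin initial).

```python
from collections import defaultdict

class InitialHashDictionary:
    """Hash dictionary for one attribute, split into per-initial hash tables."""

    def __init__(self, attribute, entries=()):
        self.attribute = attribute
        self._tables = defaultdict(set)  # initial letter -> set of entries
        for entry in entries:
            self.add(entry)

    def add(self, entry):
        if entry:  # ignore empty strings, which have no initial
            self._tables[entry[0].lower()].add(entry)

    def contains(self, unit_content):
        """Look only in the hash table selected by the initial letter."""
        if not unit_content:
            return False
        table = self._tables.get(unit_content[0].lower())
        return table is not None and unit_content in table

sending = InitialHashDictionary(
    "sending company",
    ["Jiangxi company", "Beijing company", "Hainan company",
     "Hunan company", "Shanghai company"],
)
print(sending.contains("Jiangxi company"))  # looked up in the "j" table
```

Python's built-in `set` is itself a hash table, so the per-initial split mainly mirrors the structure described above rather than adding speed.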
Step S404: input the unit content into each hash dictionary in the candidate hash dictionary set and query them one by one, then execute step S405;
Step S405: when the unit content is found in a hash dictionary, determine the data attribute associated with that hash dictionary as the attribute of the unit content, then execute step S406.
Step S406: count the attributes of all unit contents of the column data to obtain the confidence of each attribute, and determine the attribute of the column data according to the confidence.
For example, when the server identifies the attributes of the 5 unit contents of the sending company column in Table 1, each unit content is input into the candidate hash dictionary set and queried one by one.
Specifically, the statistics show two attributes for the column data associated with the sending company: sending company and receiving company. Of the 5 unit contents, 4 are identified as sending company and 1 as receiving company.
By calculation, the probability (which can be understood as the confidence) that the attribute of the column data is sending company is 80%, and the probability that it is receiving company is 20%.
The confidences corresponding to the same column data are then compared, and the attribute with the highest confidence is determined as the attribute of the column data. In the example above, 80% is the highest confidence, so the attribute of the column data is determined to be sending company.
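The query-and-count flow of this example can be sketched with plain Python sets standing in for the hash dictionaries. The dictionary contents are hypothetical and were chosen only to reproduce the 4-to-1 split above. Dividing by the total number of unit contents, including any that no dictionary recognises, is an assumption, since the source does not say how unmatched units are treated.

```python
from collections import Counter

# Hypothetical contents; each set is a hash-based lookup structure.
CANDIDATE_HASH_DICTIONARIES = {
    "sending company": {"Jiangxi company", "Beijing company",
                        "Hainan company", "Hunan company"},
    "receiving company": {"Shanghai company"},
}

def lookup_attribute(unit_content):
    """Query each candidate hash dictionary in turn; None if nothing matches."""
    for attribute, entries in CANDIDATE_HASH_DICTIONARIES.items():
        if unit_content in entries:
            return attribute
    return None

def classify_column(units):
    """Return (attribute, confidence) for the most frequent matched attribute."""
    labels = [a for a in map(lookup_attribute, units) if a is not None]
    if not labels:
        return None, 0.0
    attribute, count = Counter(labels).most_common(1)[0]
    return attribute, count / len(units)

column = ["Jiangxi company", "Beijing company", "Hainan company",
          "Hunan company", "Shanghai company"]
print(classify_column(column))
```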
In the data attribute determination method of this embodiment, when the data type is the non-alphanumeric type, the candidate hash dictionary set corresponding to the column data is obtained; each unit content is then input into each hash dictionary in the candidate hash dictionary set and queried one by one; when the unit content is found in a hash dictionary, the data attribute associated with that hash dictionary is determined as the attribute of the unit content; finally, the attributes of all unit contents of the column data are counted to determine the attribute of the column data. The method uses hash dictionaries to look up the attribute of non-alphanumeric-type column data quickly. When a hash dictionary is established, the keys and function rule of its hash tables can in theory be chosen arbitrarily, and a new hash table can be compiled for each newly added attribute, so the method extends well.
Fig. 5 is a flowchart of a data attribute determination method according to one embodiment of the present invention. On the basis of the above embodiments, for short-text attributes such as company and school, company names and school names change continually, so the hash dictionary may not contain newly appearing company names, school names, and so on, and the column data attribute selected may then have low confidence. For this situation, after determining the attribute of the column data with the hash dictionary, this embodiment further determines the attribute of the column data with a Trie tree and selects the column data attribute with the higher confidence as the target attribute of the column data.
As shown in Fig. 5, the data attribute determination method of this embodiment includes:
Step S501: split the formatted raw data to obtain multiple column data, then execute step S502;
Step S502: if the column data does not include a column header, determine the candidate attribute set corresponding to the column data according to the data type of the column data, then execute step S503;
Step S503: when the data type is the non-alphanumeric type, obtain the candidate hash dictionary set corresponding to the column data and determine the candidate hash dictionary set as the candidate attribute set corresponding to the column data, where the candidate hash dictionary set includes multiple hash dictionaries and each hash dictionary is associated with a data attribute; then execute step S504.
Step S504: input the unit content into each hash dictionary in the candidate hash dictionary set and query them one by one, then execute step S505;
Step S505: when the unit content is found in a hash dictionary, determine the data attribute associated with that hash dictionary as the attribute of the unit content, then execute step S506.
Step S506: count the hash-dictionary-associated attributes of all unit contents of the column data to obtain the confidence of each hash-dictionary-associated attribute; determine the hash-dictionary-associated attribute of the column data according to those confidences, then execute step S507.
Step S507: when the non-alphanumeric type is short text and the confidence of the hash-dictionary-associated attribute of the column data is lower than a set threshold, determine the Trie-tree-associated attribute of the column data, then execute step S508.
Specifically, the non-alphanumeric type in this embodiment may be an enumeration type or a short text type. Attributes such as nationality and gender are enumeration-type attributes; for example, when the attribute of the column data is gender, each unit content of the column data is either female or male. Enumeration-type attributes can be determined unambiguously with a hash dictionary.
For short-text attributes such as company and school, however, company names and school names change continually, so the hash dictionary may not contain newly appearing company names, school names, and so on, and the column data attribute selected may then have low confidence.
For this situation, after determining the attribute of the column data with the hash dictionary, this embodiment further determines the attribute of the column data with a Trie tree and selects the column data attribute with the higher confidence as the target attribute of the column data.
A Trie tree, also called a word lookup tree, is a tree structure and a variant of the hash tree. It is typically used to count, sort, and store large numbers of character strings (though not only character strings), so search engine systems often use it for word-frequency statistics on text. Its advantage is that it uses the common prefixes of character strings to reduce query time and minimizes unnecessary string comparisons, so its query efficiency is higher than that of a hash tree.
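A minimal Trie can be sketched as follows to show how shared prefixes are stored once and walked character by character. The class and method names are hypothetical, and the `attribute` field is an assumption that mirrors the idea of each Trie tree being associated with a data attribute.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.is_end = False  # True if a stored string ends at this node

class Trie:
    def __init__(self, attribute=None):
        self.attribute = attribute
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def _walk(self, text):
        """Follow text from the root; return the last node, or None if the path breaks."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_end

    def starts_with(self, prefix):
        return self._walk(prefix) is not None

trie = Trie("sending company")
for name in ["Hainan company", "Hunan company"]:
    trie.insert(name)
print(trie.contains("Hunan company"), trie.starts_with("Hu"))
```

Here "Hainan company" and "Hunan company" share the root-level node for "H", which is the common-prefix saving described above.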
In this embodiment, the Trie-tree-associated attribute of the column data is determined as follows:
S1: obtain the candidate Trie tree set corresponding to the column data and determine the candidate Trie tree set as the candidate attribute set corresponding to the column data, where the candidate Trie tree set includes multiple Trie trees and each Trie tree is associated with a data attribute.
Taking Table 1 as an example, the Trie trees to be established are the sending company Trie tree and the receiving company Trie tree. The sending company Trie tree is built from the many sending companies obtained by big data mining, such as Jiangxi company, Beijing company, Hainan company, Hunan company, and Shanghai company.
S2: input the unit content into each Trie tree in the candidate Trie tree set and query them one by one.
S3: when the unit content is found in a Trie tree, determine the data attribute associated with that Trie tree as the attribute of the unit content.
Step S508: compare the hash-dictionary-associated attribute of the column data with the Trie-tree-associated attribute of the column data, and determine the column data attribute with the higher confidence as the target attribute of the column data.
For example, consider determining the attribute of the sending company column. Suppose the column data attribute determined with the hash dictionary is receiving company with a confidence of 40% (the set threshold being 50%), and the column data attribute determined with the Trie tree is sending company with a confidence of 30%; then the attribute determined with the hash dictionary (receiving company) is taken as the target attribute of the column data. If instead the hash dictionary gives receiving company with a confidence of 40% (the set threshold being 50%) and the Trie tree gives sending company with a confidence of 50%, then the attribute determined with the Trie tree (sending company) is taken as the target attribute of the column data.
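The threshold check and comparison in steps S507 and S508 can be sketched as below. The function names, the `(attribute, confidence)` tuple shape, and the `trie_lookup` callable are hypothetical. Keeping the hash dictionary result when the two confidences are equal is an assumption, since the source does not address ties.

```python
def pick_more_confident(hash_result, trie_result):
    """Each result is an (attribute, confidence) pair; keep the more confident one."""
    return trie_result if trie_result[1] > hash_result[1] else hash_result

def target_attribute(hash_result, trie_lookup, is_short_text, threshold=0.5):
    """Consult the Trie tree only for short text whose hash-dictionary
    confidence is below the threshold; otherwise keep the hash result."""
    if is_short_text and hash_result[1] < threshold:
        return pick_more_confident(hash_result, trie_lookup())
    return hash_result

# The two cases from the example: hash dictionary gives 40%, threshold is 50%.
print(target_attribute(("receiving company", 0.40),
                       lambda: ("sending company", 0.30), True))
print(target_attribute(("receiving company", 0.40),
                       lambda: ("sending company", 0.50), True))
```

Passing the Trie lookup as a callable means the more expensive Trie query runs only when the hash dictionary result is not trusted.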
In the data attribute determination method of this embodiment, when the non-alphanumeric type is short text and the confidence of the hash-dictionary-associated attribute of the column data is lower than the set threshold, the Trie-tree-associated attribute of the column data is determined, the hash-dictionary-associated attribute of the column data is compared with the Trie-tree-associated attribute of the column data, and the column data attribute with the higher confidence is determined as the target attribute of the column data. Where the column data attribute selected earlier with the hash dictionary has low confidence, the attribute of the column data is further determined with the Trie tree and the column data attribute with the higher confidence is selected as the target attribute, which improves the accuracy of data attribute identification.
Fig. 6 is a schematic structural diagram of a data attribute determination apparatus according to one embodiment of the present invention. As shown in Fig. 6, the data attribute determination apparatus provided by this embodiment includes:
a splitting module 01, configured to split formatted raw data to obtain multiple column data;
a first determining module 02, configured to, if the column data does not include a column header, determine the candidate attribute set corresponding to the column data according to the data type of the column data;
a second determining module 03, configured to determine the attribute of each unit content of the column data according to the candidate attribute set;
a third determining module 04, configured to count the attributes of all unit contents of the column data to obtain the confidence of each attribute, and to determine the attribute of the column data according to the confidence.
Further, the first determining module includes a first unit, and the second determining module includes a second unit;
the first unit is configured to, when the data type is the alphanumeric type, obtain the candidate regular expression set corresponding to the column data and determine the candidate regular expression set as the candidate attribute set corresponding to the column data, where the candidate regular expression set includes multiple regular expressions and each regular expression is associated with a data attribute;
the second unit is configured to match the unit content one by one against each regular expression in the candidate regular expression set and, when a regular expression matches the unit content, determine the data attribute associated with that regular expression as the attribute of the unit content.
Further, the first determining module also includes a third unit, and the second determining module also includes a fourth unit;
the third unit is configured to, when the data type is the non-alphanumeric type, obtain the candidate hash dictionary set corresponding to the column data and determine the candidate hash dictionary set as the candidate attribute set corresponding to the column data, where the candidate hash dictionary set includes multiple hash dictionaries and each hash dictionary is associated with a data attribute;
the fourth unit is configured to input the unit content into each hash dictionary in the candidate hash dictionary set and query them one by one and, when the unit content is found in a hash dictionary, determine the data attribute associated with that hash dictionary as the attribute of the unit content.
Further, the third determining module is also configured to count the hash-dictionary-associated attributes of all unit contents of the column data to obtain the confidence of each hash-dictionary-associated attribute;
determine the hash-dictionary-associated attribute of the column data according to the confidences of the hash-dictionary-associated attributes;
and, after determining the hash-dictionary-associated attribute of the column data according to those confidences, further:
when the non-alphanumeric type is short text and the confidence of the hash-dictionary-associated attribute of the column data is lower than a set threshold, determine the Trie-tree-associated attribute of the column data;
compare the hash-dictionary-associated attribute of the column data with the Trie-tree-associated attribute of the column data, and determine the column data attribute with the higher confidence as the target attribute of the column data.
Further, the first determining module is also configured to, when the column data is determined to include a column header, look up a preset attribute mapping dictionary according to the column header to obtain the attribute that matches the column header, and determine the attribute that matches the column header as the attribute of the column data.
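The column header path can be sketched as a single dictionary lookup. The mapping entries and the function name are hypothetical, and normalising the header by trimming whitespace and lowercasing is an assumption not described in the source.

```python
# Hypothetical preset attribute mapping dictionary: header text -> attribute.
ATTRIBUTE_MAPPING = {
    "waybill no.": "waybill number",
    "sender": "sending company",
    "recipient": "receiving company",
}

def attribute_from_header(header):
    """Return the attribute mapped to the header, or None if it is unknown
    (the caller could then fall back to the content-based steps)."""
    return ATTRIBUTE_MAPPING.get(header.strip().lower())

print(attribute_from_header("  Waybill No. "))
```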
Regarding the apparatus in this embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.
The data attribute determination apparatus provided by this embodiment splits formatted raw data to obtain multiple column data; if the column data does not include a column header, determines the candidate attribute set corresponding to the column data according to the data type of the column data; determines the attribute of each unit content of the column data according to the candidate attribute set; and counts the attributes of all unit contents of the column data to obtain the confidence of each attribute, determining the attribute of the column data according to the confidence. By looking up the candidate attribute set of the column data by category and determining the attribute of the column data from statistics over the attribute of each unit content, the apparatus reduces the amount of computation in data attribute identification as far as possible and improves the efficiency and accuracy of data attribute identification.
To achieve the above object, an embodiment of the present invention further provides a computer device.
Fig. 7 is a schematic structural diagram of a computer device according to one embodiment of the present invention.
As shown in Fig. 7, the computer device includes a memory 11, a processor 12, and a computer program stored in the memory 11 and executable on the processor 12.
When executing the program, the processor 12 implements the data attribute determination method provided in any of the embodiments shown in Fig. 1 to Fig. 5.
Further, the computer device also includes:
a communication interface 13, used for communication between the memory 11 and the processor 12.
The memory 11 is used to store the computer program executable on the processor 12.
The memory 11 may include high-speed RAM and may also include non-volatile memory, for example at least one disk memory.
The processor 12 is used to implement, when executing the program, the data attribute determination method provided in the embodiments shown in Fig. 1 to Fig. 5.
If the memory 11, the processor 12, and the communication interface 13 are implemented independently, they can be connected to one another through a bus and communicate with one another over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 7, but this does not mean that there is only one bus or only one type of bus.
Optionally, in a specific implementation, if the memory 11, the processor 12, and the communication interface 13 are integrated on a single chip, they can communicate with one another through internal interfaces.
The processor 12 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
To achieve the above object, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the data attribute determination method provided in any of the embodiments shown in Fig. 1 to Fig. 5.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those different embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless otherwise specifically defined.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logic functions, may be embodied in any computer-readable medium for use by, or in combination with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented with software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, each unit may exist physically on its own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims (10)

1. A data attribute determination method, characterized by comprising:
splitting formatted raw data to obtain multiple column data;
if the column data does not include a column header, determining the candidate attribute set corresponding to the column data according to the data type of the column data;
determining the attribute of each unit content of the column data according to the candidate attribute set;
counting the attributes of all unit contents of the column data to obtain the confidence of each attribute, and determining the attribute of the column data according to the confidence.
2. The method according to claim 1, characterized in that determining the candidate attribute set corresponding to the column data according to the data type of the column data comprises:
when the data type is the alphanumeric type, obtaining the candidate regular expression set corresponding to the column data and determining the candidate regular expression set as the candidate attribute set corresponding to the column data, wherein the candidate regular expression set includes multiple regular expressions and each regular expression is associated with a data attribute;
and determining the attribute of each unit content of the column data according to the candidate attribute set comprises:
matching the unit content one by one against each regular expression in the candidate regular expression set;
when a regular expression matches the unit content, determining the data attribute associated with that regular expression as the attribute of the unit content.
3. The method according to claim 1, characterized in that determining the candidate attribute set corresponding to the column data according to the data type of the column data comprises:
when the data type is a non-alphanumeric type, obtaining a candidate hash dictionary set corresponding to the column data, and determining the candidate hash dictionary set as the candidate attribute set corresponding to the column data, wherein the candidate hash dictionary set includes multiple hash dictionaries, and each hash dictionary is associated with a data attribute;
and determining the attribute of each cell content of the column data according to the candidate attribute set comprises:
looking up the cell content one by one in each hash dictionary of the candidate hash dictionary set;
when the cell content is found in a hash dictionary, determining the data attribute associated with that hash dictionary as the attribute corresponding to the cell content.
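The dictionary lookup in claim 3 can be sketched with Python sets, which are hash-based and so give constant-time membership tests on average. The dictionary contents and attribute names are hypothetical, and stopping at the first dictionary that contains the cell is an assumption.

```python
# Hypothetical candidate hash dictionaries: attribute -> set of known values.
CANDIDATE_DICTIONARIES = {
    "province": {"Guangdong", "Sichuan", "Zhejiang"},
    "gender": {"male", "female"},
}


def cell_attribute_by_dictionary(cell, dictionaries=CANDIDATE_DICTIONARIES):
    """Look the cell up in each candidate dictionary in turn.

    Returns the attribute of the first dictionary that contains the
    whitespace-stripped cell, or None if no dictionary contains it.
    """
    text = cell.strip()
    for attribute, known_values in dictionaries.items():
        if text in known_values:
            return attribute
    return None
```

Exact-match lookup is fast but brittle: a cell such as "Sichuan Province" would miss the `province` dictionary, which is the kind of case the Trie fallback of claim 4 appears intended to cover.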
4. The method according to claim 3, characterized in that performing statistics on the attributes of the cell contents of the column data to obtain a confidence level for each attribute, and determining the attribute of the column data according to the confidence levels, comprises:
performing statistics on the hash-dictionary-associated attributes of the cell contents of the column data to obtain a confidence level for each hash-dictionary-associated attribute;
determining the hash-dictionary-associated attribute of the column data according to the confidence levels of the hash-dictionary-associated attributes;
and, after determining the hash-dictionary-associated attribute of the column data according to those confidence levels, the method further comprises:
when the non-alphanumeric type is short text and the confidence level of the hash-dictionary-associated attribute of the column data is lower than a set threshold, determining a Trie-tree-associated attribute of the column data;
comparing the hash-dictionary-associated attribute of the column data with the Trie-tree-associated attribute of the column data, and determining the attribute with the greater confidence level as the target attribute of the column data.
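The fallback logic of claim 4 can be sketched as below. The claim does not say how the Trie tree assigns an attribute to a cell, so the longest-prefix lookup here is purely an assumption, as are the class and function names and the default threshold of 0.6. Only the control flow (use the Trie result when the column is short text and the hash-dictionary confidence is below the threshold, then keep the higher-confidence attribute) follows the claim.

```python
class Trie:
    """Minimal character trie mapping stored terms to an attribute."""

    def __init__(self):
        self.children = {}
        self.attribute = None  # set on the node that ends a stored term

    def insert(self, term, attribute):
        node = self
        for ch in term:
            node = node.children.setdefault(ch, Trie())
        node.attribute = attribute

    def longest_prefix_attribute(self, text):
        """Attribute of the longest stored term that is a prefix of text."""
        node, found = self, None
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                break
            if node.attribute is not None:
                found = node.attribute
        return found


def target_attribute(hash_result, trie_vote, is_short_text, threshold=0.6):
    """Pick the column's target attribute.

    hash_result is an (attribute, confidence) pair from the hash
    dictionaries; trie_vote is a zero-argument callable returning the
    same kind of pair from the Trie, evaluated only when needed.
    """
    if not is_short_text or hash_result[1] >= threshold:
        return hash_result
    trie_result = trie_vote()
    return trie_result if trie_result[1] > hash_result[1] else hash_result
```

Passing the Trie vote as a callable keeps the more expensive Trie pass off the common path where the hash dictionaries are already confident.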
5. The method according to any one of claims 1 to 4, characterized by further comprising:
when it is determined that the column data includes column header content, querying a preset attribute mapping dictionary according to the column header content to obtain an attribute matching the column header content, and determining the attribute matching the column header content as the attribute of the column data.
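The header path of claim 5 amounts to a dictionary lookup keyed by the column header. The sketch below assumes headers are normalized by trimming whitespace and lowercasing before the lookup; the mapping entries and the name `attribute_from_header` are hypothetical, and how a column is judged to contain a header is outside this sketch.

```python
# Hypothetical preset attribute mapping dictionary:
# normalized header text -> data attribute.
HEADER_TO_ATTRIBUTE = {
    "phone": "mobile_number",
    "mobile": "mobile_number",
    "e-mail": "email",
    "email": "email",
    "birthday": "date",
}


def attribute_from_header(header, mapping=HEADER_TO_ATTRIBUTE):
    """Return the attribute mapped to a column header, or None if unmapped."""
    return mapping.get(header.strip().lower())
```

A `None` result would signal that the header is not in the preset mapping, in which case a caller could fall back to the cell-level voting of claim 1.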
6. A data attribute determining apparatus, characterized by comprising:
a splitting module, configured to split formatted raw data to obtain multiple column data;
a first determining module, configured to, if the column data does not include column header content, determine a candidate attribute set corresponding to the column data according to the data type of the column data;
a second determining module, configured to determine the attribute of each cell content of the column data according to the candidate attribute set;
a third determining module, configured to perform statistics on the attributes of the cell contents of the column data to obtain a confidence level for each attribute, and to determine the attribute of the column data according to the confidence levels.
7. The apparatus according to claim 6, characterized in that the first determining module includes a first unit, and the second determining module includes a second unit;
the first unit is configured to, when the data type is an alphanumeric type, obtain a candidate regular expression set corresponding to the column data and determine the candidate regular expression set as the candidate attribute set corresponding to the column data, wherein the candidate regular expression set includes multiple regular expressions, and each regular expression is associated with a data attribute;
the second unit is configured to match the cell content one by one against each regular expression in the candidate regular expression set and, when a regular expression matches the cell content, determine the data attribute associated with that regular expression as the attribute corresponding to the cell content.
8. The apparatus according to claim 6, characterized in that the first determining module further includes a third unit, and the second determining module further includes a fourth unit;
the third unit is configured to, when the data type is a non-alphanumeric type, obtain a candidate hash dictionary set corresponding to the column data and determine the candidate hash dictionary set as the candidate attribute set corresponding to the column data, wherein the candidate hash dictionary set includes multiple hash dictionaries, and each hash dictionary is associated with a data attribute;
the fourth unit is configured to look up the cell content one by one in each hash dictionary of the candidate hash dictionary set and, when the cell content is found in a hash dictionary, determine the data attribute associated with that hash dictionary as the attribute corresponding to the cell content.
9. A computer device, characterized by comprising a processor and a memory;
wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the data attribute determination method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the data attribute determination method according to any one of claims 1 to 5.
CN201710848242.XA 2017-09-19 2017-09-19 Data attribute determination method and device Active CN110019829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710848242.XA CN110019829B (en) 2017-09-19 2017-09-19 Data attribute determination method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710848242.XA CN110019829B (en) 2017-09-19 2017-09-19 Data attribute determination method and device

Publications (2)

Publication Number Publication Date
CN110019829A true CN110019829A (en) 2019-07-16
CN110019829B CN110019829B (en) 2021-05-07

Family

ID=67186310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710848242.XA Active CN110019829B (en) 2017-09-19 2017-09-19 Data attribute determination method and device

Country Status (1)

Country Link
CN (1) CN110019829B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110503378A (en) * 2019-08-27 2019-11-26 云汉芯城(上海)互联网科技股份有限公司 A kind of BOM standardized method, system and electronic equipment and storage medium
CN110609928A (en) * 2019-08-28 2019-12-24 宁波市智慧城市规划标准发展研究院 Name feature recognition system based on government affair data

Also Published As

Publication number Publication date
CN110019829B (en) 2021-05-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190903

Address after: 100192 Dongsheng Science Park, Zhongguancun, 66 Xixiaokou Road, Haidian District, Beijing

Applicant after: Green Bay Network Technology Co., Ltd.

Address before: 100089 Beijing Haidian District Xixiaokou Road 66 Zhongguancun Dongsheng Science Park B-6 Building B 5 floors

Applicant before: Grass count language (Beijing) Technology Co., Ltd.

GR01 Patent grant