CN110019829B - Data attribute determination method and device - Google Patents

Publication number
CN110019829B
Authority
CN
China
Prior art keywords
attribute
data
column data
column
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710848242.XA
Other languages
Chinese (zh)
Other versions
CN110019829A (en
Inventor
宋奇
王思睿
姜萌芽
钟磊
秦锋剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Green Bay Network Technology Co., Ltd.
Original Assignee
Green Bay Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Green Bay Network Technology Co ltd
Priority to CN201710848242.XA
Publication of CN110019829A
Application granted
Publication of CN110019829B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data attribute determining method and a data attribute determining device, wherein the method comprises the following steps: splitting formatted original data to obtain a plurality of column data; if the column data does not comprise column header contents, determining a candidate attribute set corresponding to the column data according to the data type of the column data; determining the attribute of each unit content of the column data according to the candidate attribute set; and counting the attributes of the content of each unit of the column data to obtain the confidence coefficient of each attribute, and determining the attributes of the column data according to the confidence coefficient. According to the method, the candidate attribute set of the column data is searched in a classified manner, and the attribute of the column data is determined by counting the attribute of each unit content, so that the calculation amount in the data attribute identification process is reduced as much as possible, and the identification efficiency and accuracy of the data attribute are improved.

Description

Data attribute determination method and device
Technical Field
The invention relates to the technical field of data mining, in particular to a data attribute determining method and device.
Background
In big data analysis based on relationship graphs, structured original data is split and its items are identified so that the data can be mapped to entities and attributes, which facilitates entity modeling and analysis.
A typical application scenario is as follows: for a well-structured Excel input table, the attribute of each column of the table can be inferred from the file name, the table header, and the content of each column, and these attributes correspond to the model attributes of an E-R (entity-relationship) diagram. In this way, the original input information can be mapped to the graph model, which facilitates subsequent graph mining and other deeper operations. However, how to improve the recognition rate of data attributes remains a technical problem to be solved urgently.
Disclosure of Invention
The present invention aims to solve, at least to some extent, one of the technical problems mentioned above.
Therefore, a first object of the present invention is to provide a data attribute determining method that searches the candidate attribute set of the column data by category and determines the attribute of the column data by counting the attributes of each unit content, thereby reducing the amount of computation in the data attribute identification process as much as possible and improving the efficiency and accuracy of data attribute identification.
A second object of the present invention is to provide a data attribute determining apparatus.
A third object of the invention is to propose a computer device.
A fourth object of the invention is to propose a computer-readable storage medium.
In order to achieve the above object, a data attribute determining method according to an embodiment of the first aspect of the present invention includes: splitting formatted original data to obtain a plurality of column data;
if the column data does not comprise column header contents, determining a candidate attribute set corresponding to the column data according to the data type of the column data;
determining the attribute of each unit content of the column data according to the candidate attribute set;
and counting the attributes of the content of each unit of the column data to obtain the confidence coefficient of each attribute, and determining the attributes of the column data according to the confidence coefficient.
In the method described above, determining the candidate attribute set corresponding to the column data according to the data type of the column data includes:
when the data type is an alphanumeric type, acquiring a candidate regular expression set corresponding to the column data, and determining the candidate regular expression set as the candidate attribute set corresponding to the column data, wherein the candidate regular expression set comprises a plurality of regular expressions, and each regular expression is associated with a data attribute;
the determining the attribute of each unit content of the column data according to the candidate attribute set includes:
matching the unit content with each regular expression in the candidate regular expression set one by one;
and when the regular expression is matched with the unit content, determining the data attribute associated with the regular expression as the attribute corresponding to the unit content.
In the method described above, determining the candidate attribute set corresponding to the column data according to the data type of the column data includes:
when the data type is a non-alphanumeric type, acquiring a candidate hash dictionary set corresponding to the column data, and determining the candidate hash dictionary set as the candidate attribute set corresponding to the column data, wherein the candidate hash dictionary set comprises a plurality of hash dictionaries, each of which is associated with a data attribute;
the determining the attribute of each unit content of the column data according to the candidate attribute set includes:
querying each hash dictionary in the candidate hash dictionary set, one by one, for the unit content;
and when the unit content is found in a hash dictionary, determining the data attribute associated with that hash dictionary as the attribute corresponding to the unit content.
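The hash dictionary lookup described above might be sketched in Python as follows. This is only an illustration: the dictionary contents, the names `HASH_DICTIONARIES` and `attribute_of_cell`, and the choice to return the first matching attribute are assumptions, not details given in the text.

```python
from typing import Dict, Optional, Set

# Hypothetical hash dictionaries: one set of known values per data attribute.
# Python sets are hash-based, so each membership test is O(1) on average.
HASH_DICTIONARIES: Dict[str, Set[str]] = {
    "gender": {"male", "female"},
    "school": {"Peking University", "Tsinghua University"},
}


def attribute_of_cell(cell: str, dictionaries: Dict[str, Set[str]]) -> Optional[str]:
    """Query each hash dictionary in turn; return the first attribute containing the cell."""
    value = cell.strip()
    for attribute, known_values in dictionaries.items():
        if value in known_values:
            return attribute
    return None  # the unit content was not found in any dictionary


cells = ["male", "female", "Tsinghua University", "unknown"]
cell_attributes = [attribute_of_cell(c, HASH_DICTIONARIES) for c in cells]
```

Restricting the lookup to the candidate dictionaries chosen for the column's data type keeps the number of queries per unit content small.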
In the method described above, counting the attributes of each unit content of the column data to obtain the confidence of each attribute, and determining the attribute of the column data according to the confidence, includes:
counting the attributes, associated with a hash dictionary, of each unit content of the column data to obtain the confidence of each attribute associated with the hash dictionary;
determining the attribute of the column data associated with the hash dictionary according to the confidence associated with the hash dictionary;
after determining the attribute of the column data associated with the hash dictionary according to the confidence associated with the hash dictionary, the method further comprises:
when the non-alphanumeric type is a short text type and the confidence of the attribute of the column data associated with the hash dictionary is lower than a set threshold, determining the attribute of the column data associated with a Trie tree;
and comparing the attribute of the column data associated with the hash dictionary with the attribute of the column data associated with the Trie tree, and determining the attribute with the higher confidence as the target attribute of the column data.
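The fallback logic above can be illustrated with a short Python sketch. How the Trie tree produces its attribute and confidence is not described in this passage, so the sketch takes that step as a caller-supplied function. The names `resolve_target_attribute` and `trie_lookup` and the default threshold of 0.6 are hypothetical.

```python
from typing import Callable, Tuple

Result = Tuple[str, float]  # (attribute, confidence)


def resolve_target_attribute(
    hash_result: Result,
    is_short_text: bool,
    trie_lookup: Callable[[], Result],
    threshold: float = 0.6,  # hypothetical value for the "set threshold"
) -> Result:
    """Keep the hash dictionary result unless the Trie fallback applies and scores higher."""
    # The Trie tree is consulted only for short text whose hash-based confidence is too low.
    if not is_short_text or hash_result[1] >= threshold:
        return hash_result
    trie_result = trie_lookup()
    # Of the two results, the one with the higher confidence becomes the target attribute.
    return trie_result if trie_result[1] > hash_result[1] else hash_result


target = resolve_target_attribute(
    ("sending company", 0.4),
    is_short_text=True,
    trie_lookup=lambda: ("receiving company", 0.7),
)
```

Passing the Trie step as a callable means it runs only when the fallback condition is met.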
The method as described above, further comprising:
and when the column data is determined to comprise column header content, inquiring a preset attribute mapping dictionary according to the column header content to obtain an attribute matched with the column header content, and determining the attribute matched with the column header content as the attribute of the column data.
In order to achieve the above object, a data attribute determining apparatus according to an embodiment of a second aspect of the present invention includes: the splitting module is used for splitting the formatted original data to obtain a plurality of column data;
a first determining module, configured to determine, if the column data does not include column header content, a candidate attribute set corresponding to the column data according to a data type of the column data;
a second determining module, configured to determine, according to the candidate attribute set, an attribute of each unit content of the column data;
and the third determining module is used for counting the attributes of the content of each unit of the column data to obtain the confidence coefficient of each attribute, and determining the attributes of the column data according to the confidence coefficient.
To achieve the above object, a computer apparatus according to an embodiment of the third aspect of the present invention includes: a processor and a memory;
wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the data attribute determination method of the first aspect.
In order to achieve the above object, a computer-readable storage medium according to a fourth aspect of the present invention has a computer program stored thereon, and when executed by a processor, implements the data attribute determining method according to the first aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a data attribute determination method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data attribute determination method according to yet another embodiment of the present invention;
FIG. 3 is a flow chart of a data attribute determination method of another embodiment of the present invention;
FIG. 4 is a flow chart of a data attribute determination method according to yet another embodiment of the present invention;
FIG. 5 is a flow chart of a data attribute determination method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a data attribute determining apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.
A data attribute determining method and apparatus according to an embodiment of the present invention will be described below with reference to the drawings.
Fig. 1 is a flowchart of a data attribute determining method according to an embodiment of the present invention. The data attribute determination method of the present embodiment is executed by a data attribute determination device, which may be integrated in a server.
As shown in fig. 1, the data attribute determining method of the present embodiment includes:
step S101, splitting the formatted original data to obtain a plurality of column data.
Specifically, formatted original data contains a large amount of valuable information, and mining it can help a user make better-informed decisions. The formatted original data in this embodiment may be various reports, such as customer lists, product lists, article lists, orders, and invoices, and the reports may be in formats such as PDF, Word, Excel, and PowerPoint.
Taking Table 1 as an example, the file name of Table 1 is "logistics management report". Table 1 has 3 columns of data, and the column header contents are: waybill number, sending company, and receiving company. Generally, the column header content is the attribute corresponding to the column data; that is, the attribute of the column containing the waybill numbers in Table 1 is waybill number, the attribute of the column containing the sending companies is sending company, and so on.
When a server integrated with the data attribute determining device collects a large number of logistics management reports such as the one shown in Table 1, it first identifies the structure of the report and determines its number of columns. It then splits the report by column to obtain 3 columns of data, namely waybill number, sending company, and receiving company. Finally, it mines the attribute of each column of data according to the column header content or each unit content in the column data.
TABLE 1
Waybill number | Sending company | Receiving company
4506442377787 | Jiangxi Co Ltd | Shandong Co Ltd
4523447706787 | Beijing Corp Ltd | Shanxi Co Ltd
8744235077647 | Hainan Co Ltd | Henan Co Ltd
7643507474287 | Hunan corporation | Hebei Co Ltd
3587442077647 | Shanghai Co Ltd | Jiangxi Co Ltd
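As a rough illustration of step S101, the following Python sketch splits a small table, already parsed into rows, into column data. The function name `split_into_columns` and the list-of-rows input format are assumptions made for this example; the embodiment does not specify how a report is parsed.

```python
from typing import List, Sequence


def split_into_columns(rows: Sequence[Sequence[str]]) -> List[List[str]]:
    """Transpose row-oriented table data into a list of columns.

    Assumes every row has the same number of cells (hypothetical input format).
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of cells")
    # zip(*rows) groups the i-th cell of every row together, i.e. one column.
    return [list(column) for column in zip(*rows)]


# The first two data rows of Table 1, without the column header row.
rows = [
    ["4506442377787", "Jiangxi Co Ltd", "Shandong Co Ltd"],
    ["4523447706787", "Beijing Corp Ltd", "Shanxi Co Ltd"],
]
columns = split_into_columns(rows)
```

Each element of `columns` can then be processed independently in the later steps.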
Step S102, if the column data does not include column header content, determining a candidate attribute set corresponding to the column data according to the data type of the column data.
For example, when the server collects the column data shown in table 1, including the column header content and the unit content, since the column header content represents the attribute of the column data, the server preferentially identifies the column header content to mine the attribute of the column data. However, not all the collected formatted raw data includes the column header, and in this case, the server mines the attributes of the column data according to the content of each unit of each column data.
Specifically, the data types of column data attributes may be classified into an alphanumeric type and a non-alphanumeric type. Unit content of the alphanumeric type may consist of digits, the 26 English letters, underscores, spaces, tab characters, form feeds, and the like; attributes of this type include waybill number, postal code, identity card number, MAC address, and the like. The non-alphanumeric type may be an enumerated type or a short text type. Enumerated-type attributes include ethnicity, gender, and the like; short-text-type attributes include company, school, and the like.
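One possible way to make this classification concrete is sketched below in Python. The character class follows the list given above, while the function name `classify_data_type` and the `min_ratio` tolerance are assumptions. The sample company names are written in Chinese on the assumption that the original data is Chinese text; English translations such as "Jiangxi Co Ltd" contain only letters and spaces and would be classified as alphanumeric by this simple rule.

```python
import re
from typing import Iterable

# Digits, English letters, underscores, and whitespace (spaces, tabs, form feeds).
# A fuller class (for example one admitting ':' for MAC addresses) may be needed in practice.
_ALPHANUMERIC = re.compile(r"[A-Za-z0-9_\s]+")


def classify_data_type(cells: Iterable[str], min_ratio: float = 0.8) -> str:
    """Return "alphanumeric" or "non_alphanumeric" for one column.

    min_ratio is a hypothetical tolerance: the share of non-empty cells that
    must consist only of alphanumeric-type characters.
    """
    values = [c for c in cells if c.strip()]
    if not values:
        return "non_alphanumeric"
    hits = sum(1 for v in values if _ALPHANUMERIC.fullmatch(v))
    return "alphanumeric" if hits / len(values) >= min_ratio else "non_alphanumeric"


waybill_type = classify_data_type(["4506442377787", "4523447706787"])
company_type = classify_data_type(["江西有限公司", "北京有限公司"])
```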
For example, when the server acquires the column data corresponding to the waybill number in Table 1, the server performs operations such as identification, classification, and clustering on the 5 unit contents, infers that the attribute of this column data may be an alphanumeric-type attribute such as waybill number, postal code, or identity card number, and calls the pre-stored waybill number attribute identification model, postal code attribute identification model, and identity card attribute identification model. If the server has established different waybill number attribute identification models for different logistics companies, it calls all of them. Together, all the waybill number attribute identification models, the postal code attribute identification model, and the identity card attribute identification model form the candidate attribute set corresponding to the column data in this embodiment (that is, the candidate attribute set for the column data associated with the waybill number).
For example, when the server acquires the column data corresponding to the sending company in Table 1, the server first performs operations such as identification, classification, and clustering on the 5 unit contents, infers that the attribute of this column data may be a non-alphanumeric-type attribute such as sending company, receiving company, or transit company, and then calls the pre-stored sending company attribute identification model, receiving company attribute identification model, and transit company attribute identification model. Together, these three models form the candidate attribute set corresponding to the column data in this embodiment (that is, the candidate attribute set for the column data associated with the sending company).
It should be noted that the attribute identification models mentioned as examples, such as the waybill number, postal code, identity card, sending company, receiving company, and transit company attribute identification models, are established by the data mining development company according to actual requirements. For example, the company may collect massive amounts of historical formatted original data and build the corresponding attribute identification model with an analysis method such as machine learning, or it may statistically analyze that data and build the model based on a preset algorithm. An attribute identification model may be, for example, a hash dictionary, a Trie tree, or a regular expression established for the attribute, or another identification model.
In this embodiment, when it is determined that the column data does not include the column header content, the candidate attribute set of the column data is searched for in a classified manner by determining the data type of the column data, so that the computation amount in the data attribute identification process is reduced as much as possible, and the identification efficiency and accuracy of the data attribute are improved.
And step S103, determining the attribute of each unit content of the column data according to the candidate attribute set.
Step S104, counting the attributes of each unit content of the column data to obtain the confidence of each attribute, and determining the attribute of the column data according to the confidence.
For example, when the server identifies the attributes of the 5 unit contents corresponding to the sending company in Table 1, each unit content is input into the candidate attribute set and matched one by one. The candidate attribute set comprises the sending company attribute identification model, the receiving company attribute identification model, and the transit company attribute identification model.
Specifically, "Jiangxi Co Ltd" is input in turn into the sending company, receiving company, and transit company attribute identification models and matched against each. If it matches the receiving company attribute identification model, the attribute of that unit content is determined to be receiving company. By analogy, the attributes of the unit contents corresponding to the Beijing, Hainan, Hunan, and Shanghai companies are each determined to be sending company.
Statistics then show two attributes for the column data associated with the sending company, namely sending company and receiving company: four unit contents are identified as sending company and one as receiving company. The probability that the attribute of the column data is sending company is therefore calculated as 80%, and the probability that it is receiving company as 20%. The confidence of an attribute in this embodiment may be understood as the probability of the attribute; that is, the confidence that the attribute of the column data is sending company is 80%, and the confidence that it is receiving company is 20%.
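The statistics in this example amount to a frequency count, which the following Python sketch reproduces. The function name `attribute_confidences` is hypothetical; the per-cell results are those of the example above.

```python
from collections import Counter
from typing import Dict, Iterable


def attribute_confidences(cell_attributes: Iterable[str]) -> Dict[str, float]:
    """Map each attribute to the fraction of unit contents assigned to it."""
    counts = Counter(cell_attributes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {attribute: n / total for attribute, n in counts.items()}


# One unit content identified as receiving company, four as sending company.
cell_attributes = ["receiving company"] + ["sending company"] * 4
confidences = attribute_confidences(cell_attributes)
```

Here `confidences` maps "sending company" to 0.8 and "receiving company" to 0.2, matching the 80% and 20% figures in the text.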
Specific implementations of determining the attribute of the column data according to the confidence are as follows:
In a first implementation, the confidences corresponding to the same column of data are compared, and the attribute with the highest confidence is determined as the attribute of the column data.
In a second implementation, it is determined whether each confidence corresponding to the same column of data meets a set condition, and the attribute whose confidence meets the condition is determined as the attribute of the column data. The set condition may be a comparison of the confidence with a set threshold, or a check of whether the confidence falls within a set range, but is not limited to these. Note that several confidences may meet the condition, in which case several attributes may be determined for the column data.
In a third implementation, each confidence corresponding to the same column of data is presented to the user, and the user selects the attribute of the column data according to the confidences.
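The three implementations can be compared side by side in a short Python sketch. The function names and the threshold values are illustrative assumptions; for the third implementation the sketch only prepares a ranked list, since how the choices are presented to the user is not specified.

```python
from typing import Dict, List, Tuple


def pick_max(confidences: Dict[str, float]) -> str:
    """First implementation: the attribute with the highest confidence."""
    return max(confidences, key=confidences.get)


def pick_above_threshold(confidences: Dict[str, float], threshold: float) -> List[str]:
    """Second implementation: every attribute whose confidence meets the threshold."""
    return [attr for attr, conf in confidences.items() if conf >= threshold]


def rank_for_user(confidences: Dict[str, float]) -> List[Tuple[str, float]]:
    """Third implementation: attributes sorted by confidence, for a user to choose from."""
    return sorted(confidences.items(), key=lambda item: item[1], reverse=True)


confidences = {"sending company": 0.8, "receiving company": 0.2}
best = pick_max(confidences)
qualifying = pick_above_threshold(confidences, threshold=0.5)
ranking = rank_for_user(confidences)
```

With a lower threshold such as 0.1, `pick_above_threshold` returns both attributes, which corresponds to the case where several attributes are determined for one column.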
The data attribute determining method provided by the embodiment includes: splitting formatted original data to obtain a plurality of column data; if the column data does not comprise column header contents, determining a candidate attribute set corresponding to the column data according to the data type of the column data; determining the attribute of each unit content of the column data according to the candidate attribute set; and counting the attributes of the content of each unit of the column data to obtain the confidence coefficient of each attribute, and determining the attributes of the column data according to the confidence coefficient. According to the method, the candidate attribute set of the column data is searched in a classified manner, and the attribute of the column data is determined by counting the attribute of each unit content, so that the calculation amount in the data attribute identification process is reduced as much as possible, and the identification efficiency and accuracy of the data attribute are improved.
Fig. 2 is a flowchart of a data attribute determination method according to another embodiment of the present invention. On the basis of the above embodiment, when the column data has column header content, the attribute of the column data can be determined simply, efficiently, and more accurately from the column header content, which also improves the efficiency of identifying the attribute of the column data.
As shown in fig. 2, the data attribute determining method of the present embodiment includes:
step S201, splitting the formatted original data to obtain a plurality of column data, and executing step S202 or step S205;
step S202, if the column data does not include column header content, determining a candidate attribute set corresponding to the column data according to the data type of the column data, and executing step S203;
step S203, determining the attribute of each unit content of the column data according to the candidate attribute set, and executing step S204.
Step S204, counting the attributes of each unit content of the column data to obtain the confidence of each attribute, and determining the attributes of the column data according to the confidence.
The implementation manners of steps S201, S202, S203, and S204 in this embodiment are the same as the implementation manners of S101, S102, S103, and S104 in the above embodiment, and are not described again here.
Step S205, when it is determined that the column data includes column header content, querying a preset attribute mapping dictionary according to the column header content to obtain an attribute matched with the column header content, and determining the attribute matched with the column header content as the attribute of the column data.
For example, when a user designs a report, the column header content generally represents the attribute of the column data; the column data shown in Table 1, for instance, includes the column header contents waybill number, sending company, and receiving company.
Specifically, a preset attribute mapping dictionary is stored in the server in advance. It is designed by the data mining development company according to the characteristics of various industries and is continuously updated. The attributes contained in the preset attribute mapping dictionary have been verified by professionals and are therefore highly authoritative. In this embodiment, the column header content is looked up in the preset attribute mapping dictionary, and the attribute found there is determined as the attribute of the column data.
In the data attribute determining method provided by this embodiment, when it is determined that the column data includes the column header content, the preset attribute mapping dictionary is queried according to the column header content to obtain the attribute matched with the column header content, and the attribute matched with the column header content is determined as the attribute of the column data.
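Step S205 is essentially a dictionary lookup, as the following Python sketch suggests. The dictionary contents, the attribute identifiers, and the case and whitespace normalization are assumptions for illustration. The text does not say what happens when a header is present but missing from the dictionary, so the sketch returns `None` and leaves that decision to the caller.

```python
from typing import Dict, Optional

# Hypothetical preset attribute mapping dictionary: normalized header text -> attribute.
ATTRIBUTE_MAPPING: Dict[str, str] = {
    "waybill number": "waybill_number",
    "freight note number": "waybill_number",
    "sending company": "sending_company",
    "receiving company": "receiving_company",
}


def attribute_from_header(header: Optional[str]) -> Optional[str]:
    """Return the mapped attribute, or None when there is no header or no match."""
    if header is None:
        return None
    # Normalizing case and surrounding whitespace is an assumption of this sketch.
    return ATTRIBUTE_MAPPING.get(header.strip().lower())


attribute = attribute_from_header("Waybill Number")
```

Several header spellings can map to one attribute, which is one way such a dictionary could absorb differences in how reports name their columns.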
Fig. 3 is a flowchart of a data attribute determination method according to another embodiment of the present invention. On the basis of the above embodiment, when the data type is an alphanumeric type, the attribute of each unit content is determined by matching the unit content against regular expressions, and the attribute of the column data is then determined.
As shown in fig. 3, the data attribute determining method of the present embodiment includes:
step S301, splitting the formatted original data to obtain a plurality of column data, and executing step S302;
step S302, if the column data does not include column header content, determining a candidate attribute set corresponding to the column data according to the data type of the column data, and executing step S303;
step S303, when the data type is an alphanumeric type, obtaining a candidate regular expression set corresponding to the column data, determining the candidate regular expression set as a candidate attribute set corresponding to the column data, and executing step S304.
The candidate regular expression set comprises a plurality of regular expressions, and the regular expressions are associated with data attributes.
Specifically, a regular expression is a logical formula for operating on character strings: predefined special characters and combinations of them form a "rule string", and the rule string expresses a filtering logic for character strings. Regular expressions are flexible and expressive, and they are practical and efficient for tasks such as string processing and form validation.
For example, the server stores various regular expressions in advance, and one regular expression is associated with one attribute. For example, a regular expression of the waybill number, a regular expression of the postal code and a regular expression of the identity card are prestored.
When the server collects the column data corresponding to the waybill number in Table 1, the server may perform operations such as identification, classification, and clustering on the 5 unit contents and infer that the attribute of this column data may be an alphanumeric-type attribute such as waybill number, postal code, or identity card number. The server then calls the pre-stored regular expressions for the waybill number, the postal code, and the identity card; that is, the candidate regular expression set in this embodiment comprises these three regular expressions.
Step S304, matching the unit content with each regular expression in the candidate regular expression set one by one, and executing step S305.
Step S305, when the regular expression is matched with the unit content, determining the data attribute associated with the regular expression as the attribute corresponding to the unit content, and executing step S306.
Step S306, counting the attributes of each unit content of the column data to obtain the confidence of each attribute, and determining the attribute of the column data according to the confidence.
For example, when the server identifies attributes in 5 unit contents corresponding to the waybill number in table 1, the unit contents are input into the candidate regular expression set for one-by-one matching.
Specifically, statistics show that the column data for the waybill number yields three attributes, namely waybill number, postal code and identity card: 3 unit contents are identified as waybill number, 1 unit content is identified as postal code, and 1 unit content is identified as identity card.
By calculation, the probability (which can be understood as the confidence) that the attribute of the column data is the waybill number is 3/5 = 60%, that it is the postal code is 1/5 = 20%, and that it is the identity card is 1/5 = 20%.
For example, the confidences corresponding to the same column data are compared, and the attribute with the highest confidence is determined as the attribute of the column data. In the above example, 60% is the highest confidence, so the attribute of the column data is determined to be the waybill number.
In the data attribute determining method of this embodiment, when it is determined that the data type of the column data is a numeric letter type, a candidate regular expression set corresponding to the column data is first obtained; the unit content is then matched with each regular expression in the candidate regular expression set one by one to determine the attribute corresponding to the unit content; finally, the attributes of the unit contents of the column data are counted to determine the attribute of the column data. The method determines the attribute of numeric letter type column data by using regular expressions, and is practical and efficient. Because regular expressions are highly flexible, logical and functional, a new regular expression can be written for each added attribute, so the method has good expandability.
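As a minimal sketch of steps S304 to S306, the following Python fragment matches each unit content against a candidate regular expression set and then votes on the attribute of the column. The patterns, the sample values and the function names `attribute_of_unit` and `column_attribute` are illustrative assumptions and do not come from the embodiment; real waybill number, postal code and identity card formats would need their own expressions.

```python
import re
from collections import Counter

# Hypothetical candidate regular expression set: attribute -> pattern.
# These patterns are assumptions made for illustration only.
CANDIDATE_REGEXES = {
    "waybill number": re.compile(r"[A-Z]{2}\d{10}"),
    "postal code": re.compile(r"\d{6}"),
    "identity card": re.compile(r"\d{17}[\dX]"),
}

def attribute_of_unit(unit_content):
    """Return the attribute of the first expression that fully matches, or None."""
    for attribute, pattern in CANDIDATE_REGEXES.items():
        if pattern.fullmatch(unit_content):
            return attribute
    return None

def column_attribute(units):
    """Count per-unit attributes and return (best attribute, confidence table)."""
    counts = Counter(attribute_of_unit(u) for u in units)
    counts.pop(None, None)  # unmatched units do not vote for any attribute
    confidences = {a: n / len(units) for a, n in counts.items()}
    best = max(confidences, key=confidences.get) if confidences else None
    return best, confidences

# Made-up column: 3 waybill-like values, 1 postal-code-like, 1 identity-card-like.
column = ["SF1234567890", "YT0987654321", "ZT1122334455",
          "330000", "11010519900101123X"]
best, confidences = column_attribute(column)
print(best, confidences)
```

With this made-up column the waybill number wins with a confidence of 0.6, mirroring the 60% example above. Full-string matching is used so that a six-digit postal code pattern does not accidentally match part of a longer number.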
Fig. 4 is a flowchart of a data attribute determination method according to still another embodiment of the present invention. On the basis of the above embodiment, when the data type is a non-numeric letter type, the attribute of the unit content is determined by performing logical judgment on the unit content and the hash dictionary, and further the attribute of the column data is determined.
As shown in fig. 4, the data attribute determining method of the present embodiment includes:
step S401, splitting the formatted original data to obtain a plurality of column data, and executing step S402;
step S402, if the column data does not include column header content, determining a candidate attribute set corresponding to the column data according to the data type of the column data, and executing step S403;
step S403, when the data type is a non-numeric letter type, obtaining a candidate hash dictionary set corresponding to the column data, determining the candidate hash dictionary set as a candidate attribute set corresponding to the column data, where the candidate hash dictionary set includes a plurality of hash dictionaries, and the hash dictionaries are associated with data attributes, and then step S404 is executed.
A brief introduction to the hash table: a hash table is a data structure that directly accesses a storage location in memory according to a key. That is, it accesses a record by computing a function of the key that maps the data to be queried to a position in the table, which speeds up lookup. The mapping function is called the hash function, the array that stores the records is called the hash table, and in theory the keys and the function rule can be chosen arbitrarily.
In order to speed up the query of data attributes, the hash dictionary in this embodiment may establish corresponding first-letter hash tables for different industries. Taking table 1 as an example, the hash dictionaries to be established are those of the sending company and the receiving company. In the hash dictionary of the sending company, first-letter hash tables are established for Jiangxi company, Beijing company, Hainan company, Hunan company and Shanghai company respectively. When the attribute of a unit content needs to be determined, it is only necessary to look up its first letter in the hash dictionary. For example, to determine the attribute of the unit content "Jiangxi company", the first letter j is input into the hash dictionary of the sending company; if Jiangxi company is found in that hash dictionary, the attribute of the unit content is determined to be the sending company.
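The first-letter lookup described above can be sketched as follows. This is only an illustration: it assumes romanized company names and uses a Python dict of sets as the hash dictionary, and the class name `FirstLetterHashDictionary` and its methods are hypothetical rather than taken from the embodiment.

```python
from collections import defaultdict

class FirstLetterHashDictionary:
    """Hash dictionary for one attribute, bucketed by the first letter of each entry."""

    def __init__(self, attribute, entries):
        self.attribute = attribute
        self.buckets = defaultdict(set)  # first letter -> entries starting with it
        for entry in entries:
            self.add(entry)

    def add(self, entry):
        self.buckets[entry[0].lower()].add(entry)

    def contains(self, unit_content):
        if not unit_content:
            return False
        # Use .get so that a failed lookup does not create an empty bucket.
        bucket = self.buckets.get(unit_content[0].lower(), set())
        return unit_content in bucket

sending = FirstLetterHashDictionary(
    "sending company",
    ["Jiangxi company", "Beijing company", "Hainan company",
     "Hunan company", "Shanghai company"],
)
print(sending.contains("Jiangxi company"))    # True: found in the "j" bucket
print(sending.contains("Guangdong company"))  # False: there is no "g" bucket
```

Bucketing by first letter keeps each lookup confined to a small set of names, which is the speed-up the paragraph above is aiming at.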
Step S404, inputting unit content into each hash dictionary in the candidate hash dictionary set to carry out one-by-one inquiry, and executing step S405;
step S405, when the unit content is found in the hash dictionary, determining the data attribute associated with the hash dictionary as the attribute corresponding to the unit content, and executing step S406.
Step S406, counting the attributes of each unit content of the column data to obtain the confidence of each attribute, and determining the attributes of the column data according to the confidence.
For example, when the server identifies the attributes of the 5 unit contents corresponding to the sending company in table 1, the unit contents are input into the candidate hash dictionary set and queried one by one.
Specifically, statistics show that the column data for the sending company yields two attributes, namely sending company and receiving company: 4 unit contents are identified as sending company, and 1 unit content is identified as receiving company.
By calculation, the probability (which can be understood as the confidence) that the attribute of the column data is the sending company is 4/5 = 80%, and that it is the receiving company is 1/5 = 20%.
For example, the confidences corresponding to the same column data are compared, and the attribute with the highest confidence is determined as the attribute of the column data. In the above example, 80% is the highest confidence, so the attribute of the column data is determined to be the sending company.
In the data attribute determining method of this embodiment, when the data type is a non-numeric letter type, a candidate hash dictionary set corresponding to the column data is first obtained; next, the unit content is input into each hash dictionary in the candidate hash dictionary set and queried one by one; when the unit content is found in a hash dictionary, the data attribute associated with that hash dictionary is determined as the attribute corresponding to the unit content; finally, the attributes of the unit contents of the column data are counted to determine the attribute of the column data. The method uses hash dictionaries to quickly query the attribute of non-numeric letter type column data. Because the keys and the function rule of a hash table can in theory be chosen arbitrarily when a hash dictionary is established, a new hash table can be written for each added attribute, so the method has good expandability.
Fig. 5 is a flowchart of a data attribute determination method according to an embodiment of the present invention. On the basis of the above embodiment, for attributes of short texts such as companies and schools, since names of companies and schools are constantly updated, names of companies and schools which have just appeared may not be included in the hash dictionary, and a column data attribute with low confidence may be selected. For the above situation, in this embodiment, after the hash dictionary is used to determine the attributes of the column data, the attributes of the column data are further determined through the Trie tree, and the column data attributes with higher confidence are selected as the target attributes of the column data.
As shown in fig. 5, the data attribute determining method of the present embodiment includes:
step S501, splitting the formatted original data to obtain a plurality of column data, and executing step S502;
step S502, if the column data does not comprise column header content, determining a candidate attribute set corresponding to the column data according to the data type of the column data, and executing step S503;
step S503, when the data type is a non-numeric letter type, acquiring a candidate hash dictionary set corresponding to the column data, determining the candidate hash dictionary set as a candidate attribute set corresponding to the column data, where the candidate hash dictionary set includes a plurality of hash dictionaries, and the hash dictionaries are associated with data attributes, and then step S504 is executed.
Step S504, inputting the unit content into each hash dictionary in the candidate hash dictionary set, inquiring one by one, and executing step S505;
step S505, when the unit content is found in the hash dictionary, determining the data attribute associated with the hash dictionary as the attribute corresponding to the unit content, and executing step S506.
Step S506, counting the attributes associated with the hash dictionary of each unit content of the column data to obtain the confidence of each attribute associated with the hash dictionary; the attribute of the column data associated with the hash dictionary is determined according to the confidence associated with the hash dictionary, and step S507 is performed.
Step S507, when the non-numeric letter type is a short text and the confidence of the attribute of the column data associated with the hash dictionary is lower than a set threshold, determining the attribute of the column data associated with the Trie tree, and executing step S508.
Specifically, the non-numeric letter type in this embodiment may be an enumeration type or a short text type. Attributes such as nationality and gender are enumeration types. Taking gender as the attribute of the column data as an example, each unit content of the column data is either female or male. Attributes of the enumeration type can be determined unambiguously by means of a hash dictionary.
For attributes of short texts such as companies and schools, since names of companies and schools are constantly updated and changed, names of companies and schools which have just appeared may not be included in the hash dictionary, and a column data attribute with low confidence may be selected.
For the above situation, in this embodiment, after the hash dictionary is used to determine the attributes of the column data, the attributes of the column data are further determined through the Trie tree, and the column data attributes with higher confidence are selected as the target attributes of the column data.
The Trie, also known as the word lookup tree, is a tree structure and a variant of the hash tree. It is typically used to count, sort and store large numbers of character strings (though not only character strings), and is therefore often used by search engine systems for text word frequency statistics. Its advantage is that it uses the common prefixes of character strings to reduce query time and minimizes unnecessary string comparisons, so its query efficiency is higher than that of a hash tree.
In this embodiment, a specific implementation manner of determining the attribute of the column data associated with the Trie tree is as follows:
S1, acquiring a candidate Trie tree set corresponding to the column data, and determining the candidate Trie tree set as a candidate attribute set corresponding to the column data, wherein the candidate Trie tree set comprises a plurality of Trie trees which are associated with data attributes.
Taking table 1 as an example, the Trie trees to be established are the Trie tree of the sending company and the Trie tree of the receiving company. The Trie tree of the sending company is built from the many sending companies obtained through big data mining, such as Jiangxi company, Beijing company, Hainan company, Hunan company and Shanghai company.
S2, inputting the unit content into each Trie tree in the candidate Trie tree set to carry out one-by-one query.
S3, when the unit content is found in the Trie tree, determining the data attribute associated with the Trie tree as the attribute corresponding to the unit content.
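Steps S1 to S3 can be sketched as below. This is a minimal character-level Trie under stated assumptions: names are romanized, a query succeeds only on an exact, complete name, and the class `Trie`, the helper `attribute_of_unit` and the sample entries are hypothetical illustrations rather than the embodiment's own structures.

```python
class Trie:
    """Prefix tree storing the known names for one data attribute."""

    _END = ""  # sentinel key; it cannot collide with a one-character key

    def __init__(self, attribute, names=()):
        self.attribute = attribute
        self.root = {}  # each node is a dict: character -> child node
        for name in names:
            self.insert(name)

    def insert(self, name):
        node = self.root
        for ch in name:
            node = node.setdefault(ch, {})
        node[self._END] = True  # mark the end of a complete name

    def contains(self, unit_content):
        node = self.root
        for ch in unit_content:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node

# S1: hypothetical candidate Trie tree set, one Trie per attribute.
candidate_tries = [
    Trie("sending company", ["Jiangxi company", "Beijing company", "Hainan company"]),
    Trie("receiving company", ["Shanghai company"]),
]

def attribute_of_unit(unit_content, tries):
    """S2 and S3: query each Trie in turn and return the attribute of the first hit."""
    for trie in tries:
        if trie.contains(unit_content):
            return trie.attribute
    return None

print(attribute_of_unit("Beijing company", candidate_tries))
```

Names that share a prefix, such as the many entries ending in "company" preceded by a place name, share nodes only along their common leading characters, which is where the reduction in string comparisons comes from.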
Step S508, the attribute of the column data associated with the hash dictionary is compared with the attribute of the column data associated with the Trie tree, and the attribute of the column data with high confidence is determined as the target attribute of the column data.
For example, consider determining the attribute of the column in which the sending company is located, with the set threshold at 50%. If the column data attribute determined according to the hash dictionary is the receiving company with a confidence of 40%, and the column data attribute determined according to the Trie tree is the sending company with a confidence of 30%, then the column data attribute determined according to the hash dictionary is taken as the target attribute of the column data (the attribute is the receiving company). If the column data attribute determined according to the hash dictionary is the receiving company with a confidence of 40%, and the column data attribute determined according to the Trie tree is the sending company with a confidence of 50%, then the column data attribute determined according to the Trie tree is taken as the target attribute of the column data (the attribute is the sending company).
In the data attribute determining method of this embodiment, when the non-numeric letter type is a short text and the confidence of the attribute of the column data associated with the hash dictionary is lower than a set threshold, the attribute of the column data associated with the Trie tree is determined, the attribute of the column data associated with the hash dictionary is compared with the attribute of the column data associated with the Trie tree, and the attribute of the column data with the higher confidence is determined as the target attribute of the column data. In cases where the confidence of the column data attribute selected by using the hash dictionary is not high, the attribute of the column data is further determined through the Trie tree, and the column data attribute with the higher confidence is selected as the target attribute of the column data, which improves the identification accuracy of the data attribute.
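The decision in steps S507 and S508 can be sketched as a small function. The name `select_target_attribute`, its parameters and the tie rule are assumptions made for illustration; the embodiment does not state what happens when the two confidences are equal, so this sketch keeps the hash dictionary result in that case.

```python
def select_target_attribute(hash_result, determine_with_trie,
                            is_short_text, threshold=0.5):
    """Pick the target attribute of a column.

    hash_result: (attribute, confidence) obtained from the hash dictionaries.
    determine_with_trie: zero-argument callable returning (attribute, confidence);
        it is only called when the Trie fallback is actually needed.
    """
    _, hash_conf = hash_result
    # S507: fall back to the Trie only for short text below the threshold.
    if not is_short_text or hash_conf >= threshold:
        return hash_result
    trie_result = determine_with_trie()
    # S508: keep whichever result has the higher confidence.
    # Assumption: ties are resolved in favour of the hash dictionary.
    return trie_result if trie_result[1] > hash_conf else hash_result

# The two cases from the example above (threshold 50%).
print(select_target_attribute(("receiving company", 0.40),
                              lambda: ("sending company", 0.30), True))
print(select_target_attribute(("receiving company", 0.40),
                              lambda: ("sending company", 0.50), True))
```

Passing the Trie step as a callable means the Trie query is skipped entirely for enumeration-type columns and for columns whose hash dictionary confidence already meets the threshold.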
Fig. 6 is a schematic structural diagram of a data attribute determining apparatus according to an embodiment of the present invention. As shown in fig. 6, the data attribute determining apparatus provided in this embodiment includes:
the splitting module 01 is configured to split the formatted original data to obtain a plurality of column data;
a first determining module 02, configured to determine, if the column data does not include column header content, a candidate attribute set corresponding to the column data according to a data type of the column data;
a second determining module 03, configured to determine, according to the candidate attribute set, an attribute of each unit content of the column data;
a third determining module 04, configured to count the attributes of each unit content of the column data to obtain a confidence of each attribute, and determine the attribute of the column data according to the confidence.
Further, the first determination module comprises a first unit; the second determination module comprises a second unit;
the first unit is configured to, when the data type is a numeric letter type, obtain a candidate regular expression set corresponding to the column data, and determine the candidate regular expression set as a candidate attribute set corresponding to the column data, where the candidate regular expression set includes a plurality of regular expressions, and the regular expressions are associated with data attributes;
the second unit is used for matching the unit content with each regular expression in the candidate regular expression set one by one; and when the regular expression is matched with the unit content, determining the data attribute associated with the regular expression as the attribute corresponding to the unit content.
Further, the first determining module further comprises a third unit; the second determination module further comprises a fourth unit;
the third unit is configured to, when the data type is a non-numeric letter type, obtain a candidate hash dictionary set corresponding to the column data, and determine the candidate hash dictionary set as a candidate attribute set corresponding to the column data, where the candidate hash dictionary set includes a plurality of hash dictionaries, and the hash dictionaries are associated with data attributes;
the fourth unit is used for inputting the unit content into each hash dictionary in the candidate hash dictionary set to inquire one by one; and when the unit content is inquired in the hash dictionary, determining the data attribute associated with the hash dictionary as the attribute corresponding to the unit content.
Further, the third determining module is further configured to count attributes associated with a hash dictionary of each unit content of the column data to obtain a confidence of each attribute associated with the hash dictionary;
determining attributes of column data associated with the hash dictionary based on the confidence associated with the hash dictionary;
after determining the attribute of the column data associated with the hash dictionary according to the confidence degree associated with the hash dictionary, the method further comprises:
when the non-numeric letter type is a short text and the confidence of the attribute of the column data associated with the Hash dictionary is lower than a set threshold, determining the attribute of the column data associated with the Trie tree;
and comparing the attribute of the column data associated with the hash dictionary with the attribute of the column data associated with the Trie tree, and determining the attribute of the column data with high confidence as the target attribute of the column data.
Further, the first determining module is further configured to, when it is determined that the column data includes column header content, query a preset attribute mapping dictionary according to the column header content to obtain an attribute matched with the column header content, and determine the attribute matched with the column header content as an attribute of the column data.
With regard to the apparatus in this embodiment, the specific manner in which the respective modules perform operations has been described in detail in the embodiments of the method, and will not be elaborated upon here.
The data attribute determining apparatus provided in this embodiment includes: splitting formatted original data to obtain a plurality of column data; if the column data does not comprise column header contents, determining a candidate attribute set corresponding to the column data according to the data type of the column data; determining the attribute of each unit content of the column data according to the candidate attribute set; and counting the attributes of the content of each unit of the column data to obtain the confidence coefficient of each attribute, and determining the attributes of the column data according to the confidence coefficient. The device finds the candidate attribute set of the column data by classification and determines the attribute of the column data by counting the attribute of each unit content, so that the operation amount in the data attribute identification process is reduced as much as possible, and the identification efficiency and accuracy of the data attribute are improved.
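The routing performed by the first determining module can be sketched as follows. Several points here are assumptions rather than statements of the embodiment: the contents of `ATTRIBUTE_MAPPING`, the rule that a column counts as numeric letter type when every unit consists only of ASCII digits and letters, and the choice to fall through to type-based routing when a header is present but not found in the mapping dictionary.

```python
import re

# Hypothetical preset attribute mapping dictionary: header text -> attribute.
ATTRIBUTE_MAPPING = {"waybill no.": "waybill number", "sender": "sending company"}

_ALNUM = re.compile(r"[0-9A-Za-z]+")

def data_type_of(units):
    """Assumed rule: 'numeric letter' if every unit is ASCII digits and letters only."""
    if units and all(_ALNUM.fullmatch(u) for u in units):
        return "numeric letter"
    return "non-numeric letter"

def first_determining_module(header, units):
    """Return ('attribute', name) from the header, or ('candidates', set kind)."""
    if header is not None:
        attribute = ATTRIBUTE_MAPPING.get(header.strip().lower())
        if attribute is not None:
            return ("attribute", attribute)
        # Assumption: an unknown header falls through to type-based routing.
    if data_type_of(units) == "numeric letter":
        return ("candidates", "regular expression set")
    return ("candidates", "hash dictionary set")

print(first_determining_module("Waybill No.", ["SF1234567890"]))
print(first_determining_module(None, ["SF1234567890", "330000"]))
print(first_determining_module(None, ["江西公司", "北京公司"]))
```

Classifying the column before any matching is what keeps the candidate attribute set small, since a column is only ever tested against either the regular expressions or the hash dictionaries.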
In order to achieve the above object, an embodiment of the present invention further provides a computer device.
Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
As shown in fig. 7, the computer apparatus includes: a memory 11, a processor 12 and a computer program stored on the memory 11 and executable on the processor 12.
The processor 12, when executing the program, implements the data attribute determination method provided in the embodiment shown in any of fig. 1 to 5.
Further, the computer device further comprises:
a communication interface 13 for communication between the memory 11 and the processor 12.
A memory 11 for storing a computer program operable on the processor 12.
The memory 11 may comprise a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
And a processor 12, configured to implement the data attribute determining method provided in the embodiments shown in fig. 1 to 5 when executing the program.
If the memory 11, the processor 12 and the communication interface 13 are implemented independently, the communication interface 13, the memory 11 and the processor 12 may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 7, but this does not mean that there is only one bus or one type of bus.
Alternatively, in practical implementation, if the memory 11, the processor 12 and the communication interface 13 are integrated on one chip, the memory 11, the processor 12 and the communication interface 13 may complete communication with each other through an internal interface.
The processor 12 may be a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present invention.
To achieve the above object, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data attribute determining method provided in the embodiment shown in any one of fig. 1 to 5.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (9)

1. A method for determining data attributes, comprising:
splitting formatted original data to obtain a plurality of column data;
if the column data does not comprise column header contents, determining a candidate attribute set corresponding to the column data according to the data type of the column data;
determining the attribute of each unit content of the column data according to the candidate attribute set;
counting the attributes of the content of each unit of the column data to obtain the confidence of each attribute, and determining the attributes of the column data according to the confidence;
the determining a candidate attribute set corresponding to the column data according to the data type of the column data includes:
when the data type is a numeric letter type, acquiring a candidate regular expression set corresponding to the column data, and determining the candidate regular expression set as a candidate attribute set corresponding to the column data, wherein the candidate regular expression set comprises a plurality of regular expressions, and the regular expressions are associated with data attributes;
the determining the attribute of each unit content of the column data according to the candidate attribute set includes:
matching the unit content with each regular expression in the candidate regular expression set one by one;
and when the regular expression is matched with the unit content, determining the data attribute associated with the regular expression as the attribute corresponding to the unit content.
2. The method of claim 1, further comprising:
and when the column data is determined to comprise column header content, inquiring a preset attribute mapping dictionary according to the column header content to obtain an attribute matched with the column header content, and determining the attribute matched with the column header content as the attribute of the column data.
3. A method for determining data attributes, comprising:
splitting formatted original data to obtain a plurality of column data;
if the column data does not comprise column header contents, determining a candidate attribute set corresponding to the column data according to the data type of the column data;
determining the attribute of each unit content of the column data according to the candidate attribute set;
counting the attributes of the content of each unit of the column data to obtain the confidence of each attribute, and determining the attributes of the column data according to the confidence;
the determining a candidate attribute set corresponding to the column data according to the data type of the column data includes:
when the data type is a non-numeric letter type, acquiring a candidate hash dictionary set corresponding to the column data, and determining the candidate hash dictionary set as a candidate attribute set corresponding to the column data, wherein the candidate hash dictionary set comprises a plurality of hash dictionaries which are associated with data attributes;
the determining the attribute of each unit content of the column data according to the candidate attribute set includes:
querying each hash dictionary in the candidate hash dictionary set for the unit content, one by one;
and when the unit content is found in a hash dictionary, determining the data attribute associated with that hash dictionary as the attribute corresponding to the unit content.
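The hash dictionary branch of claim 3 can be sketched in Python, where a `set` is a hash-based container with constant-time membership tests on average. The dictionary contents, the attribute names, and the confidence definition (hits divided by the number of cells) are assumptions for illustration.

```python
from collections import Counter

# Hypothetical candidate hash dictionary set: one hash-based set per attribute.
CANDIDATE_HASH_DICTIONARIES = {
    "city": {"Beijing", "Shanghai", "Tianjin"},
    "province": {"Hebei", "Shandong", "Sichuan"},
}


def lookup_cell(cell):
    """Query every hash dictionary; return the attributes whose dictionary holds the cell."""
    return [a for a, entries in CANDIDATE_HASH_DICTIONARIES.items() if cell in entries]


def column_attribute_by_lookup(cells):
    """Count attribute hits over all cells and report the most frequent one."""
    counts = Counter(a for cell in cells for a in lookup_cell(cell))
    if not counts:
        return None, 0.0
    attribute, hits = counts.most_common(1)[0]
    return attribute, hits / len(cells)


cities = ["Beijing", "Shanghai", "Hebei", "n/a"]
lookup_result = column_attribute_by_lookup(cities)  # two of four cells are cities
```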
4. The method of claim 3,
the counting the attributes of the content of each unit of the column data to obtain the confidence of each attribute, and determining the attributes of the column data according to the confidence includes:
counting attributes associated with a hash dictionary of each unit content of the column data to obtain confidence degrees of the attributes associated with the hash dictionary;
determining attributes of column data associated with the hash dictionary based on the confidence associated with the hash dictionary;
after determining the attribute of the column data associated with the hash dictionary according to the confidence degree associated with the hash dictionary, the method further comprises:
when the non-numeric letter type is short text and the confidence of the attribute of the column data associated with the hash dictionary is lower than a set threshold, determining the attribute of the column data associated with a Trie tree;
and comparing the attribute of the column data associated with the hash dictionary with the attribute of the column data associated with the Trie tree, and determining the attribute with the higher confidence as the target attribute of the column data.
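The Trie fallback of claim 4 can be sketched as follows. The claim does not say how the Trie tree is queried, so this sketch assumes a prefix test (a cell counts as a hit when some stored entry is a prefix of it); the threshold value, the entries, and all function names are likewise hypothetical.

```python
class Trie:
    """Minimal character trie; the "$end" key marks the end of a stored word."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True

    def has_prefix_of(self, text):
        """True if any stored word is a prefix of text (or equal to it)."""
        node = self.root
        for ch in text:
            if "$end" in node:
                return True
            if ch not in node:
                return False
            node = node[ch]
        return "$end" in node


def trie_column_attribute(cells, tries):
    """Pick the attribute whose trie matches the largest share of cells."""
    best = (None, 0.0)
    if not cells:
        return best
    for attribute, trie in tries.items():
        conf = sum(trie.has_prefix_of(c) for c in cells) / len(cells)
        if conf > best[1]:
            best = (attribute, conf)
    return best


def target_attribute(cells, hash_result, tries, is_short_text=True, threshold=0.8):
    """Keep the hash result unless it is weak short text; then take the higher confidence."""
    if not is_short_text or hash_result[1] >= threshold:
        return hash_result
    trie_result = trie_column_attribute(cells, tries)
    return max(hash_result, trie_result, key=lambda r: r[1])


cells = ["Beijing City", "Shanghai City", "Tianjin", "n/a"]
tries = {"city": Trie(["Beijing", "Shanghai", "Tianjin"])}
# Exact hash lookup would hit only "Tianjin", so its assumed confidence is 0.25.
chosen = target_attribute(cells, ("city", 0.25), tries)
```

In this example the prefix test recovers "Beijing City" and "Shanghai City", which an exact dictionary lookup misses, so the Trie-based result wins the comparison.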
5. The method of any of claims 3 to 4, further comprising:
and when the column data is determined to comprise column header content, querying a preset attribute mapping dictionary according to the column header content to obtain an attribute matching the column header content, and determining the attribute matching the column header content as the attribute of the column data.
6. A data attribute determination apparatus, comprising:
a splitting module, configured to split formatted original data to obtain a plurality of column data;
a first determining module, configured to determine, if the column data does not include column header content, a candidate attribute set corresponding to the column data according to a data type of the column data;
a second determining module, configured to determine, according to the candidate attribute set, an attribute of each unit content of the column data;
a third determining module, configured to count attributes of each unit content of the column data to obtain a confidence level of each attribute, and determine the attribute of the column data according to the confidence level;
the first determining module comprises a first unit; the second determination module comprises a second unit;
the first unit is configured to, when the data type is a numeric letter type, obtain a candidate regular expression set corresponding to the column data, and determine the candidate regular expression set as a candidate attribute set corresponding to the column data, where the candidate regular expression set includes a plurality of regular expressions, and the regular expressions are associated with data attributes;
the second unit is configured to match the unit content against each regular expression in the candidate regular expression set one by one, and, when a regular expression matches the unit content, determine the data attribute associated with that regular expression as the attribute corresponding to the unit content.
7. A data attribute determination apparatus, comprising:
a splitting module, configured to split formatted original data to obtain a plurality of column data;
a first determining module, configured to determine, if the column data does not include column header content, a candidate attribute set corresponding to the column data according to a data type of the column data;
a second determining module, configured to determine, according to the candidate attribute set, an attribute of each unit content of the column data;
a third determining module, configured to count attributes of each unit content of the column data to obtain a confidence level of each attribute, and determine the attribute of the column data according to the confidence level;
the first determination module further comprises a third unit; the second determination module further comprises a fourth unit;
the third unit is configured to, when the data type is a non-numeric letter type, obtain a candidate hash dictionary set corresponding to the column data, and determine the candidate hash dictionary set as a candidate attribute set corresponding to the column data, where the candidate hash dictionary set includes a plurality of hash dictionaries, and the hash dictionaries are associated with data attributes;
the fourth unit is configured to query each hash dictionary in the candidate hash dictionary set for the unit content, one by one, and, when the unit content is found in a hash dictionary, determine the data attribute associated with that hash dictionary as the attribute corresponding to the unit content.
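The splitting module and the data-type decision made by the first determining module in claims 6 and 7 can be sketched as below. The CSV input format and the rule used to tell a numeric letter type from a non-numeric letter type (all cells are pure ASCII) are assumptions; the claims do not define either.

```python
import csv
import io


def split_columns(raw_text):
    """Split formatted (here: CSV) original data into a list of columns."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    return [list(col) for col in zip(*rows)]


def data_type(cells):
    """Assumed rule: ASCII-only columns are 'numeric_letter', others are not."""
    return "numeric_letter" if "".join(cells).isascii() else "non_numeric_letter"


raw = "13812345678,北京\n13987654321,上海\n"
columns = split_columns(raw)
types = [data_type(col) for col in columns]
```

Under this rule the first column would be routed to the regular expression set and the second to the hash dictionary set.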
8. A computer device, comprising: a processor and a memory;
wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory for implementing the data attribute determination method according to any one of claims 1 to 2 and implementing the data attribute determination method according to any one of claims 3 to 5.
9. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, carries out the data attribute determination method according to any one of claims 1 to 2 and carries out the data attribute determination method according to any one of claims 3 to 5.
CN201710848242.XA 2017-09-19 2017-09-19 Data attribute determination method and device Active CN110019829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710848242.XA CN110019829B (en) 2017-09-19 2017-09-19 Data attribute determination method and device

Publications (2)

Publication Number Publication Date
CN110019829A CN110019829A (en) 2019-07-16
CN110019829B CN110019829B (en) 2021-05-07

Family

ID=67186310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710848242.XA Active CN110019829B (en) 2017-09-19 2017-09-19 Data attribute determination method and device

Country Status (1)

Country Link
CN (1) CN110019829B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110503378A (en) * 2019-08-27 2019-11-26 云汉芯城(上海)互联网科技股份有限公司 A kind of BOM standardized method, system and electronic equipment and storage medium
CN110609928A (en) * 2019-08-28 2019-12-24 宁波市智慧城市规划标准发展研究院 Name feature recognition system based on government affair data

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102419744A (en) * 2010-10-20 2012-04-18 微软公司 Semantic analysis of information
CN102637180A (en) * 2011-02-14 2012-08-15 汉王科技股份有限公司 Character post processing method and device based on regular expression
CN104636466A (en) * 2015-02-11 2015-05-20 中国科学院计算技术研究所 Entity attribute extraction method and system oriented to open web page
KR101582050B1 (en) * 2014-10-24 2015-12-31 이화여자대학교 산학협력단 Apparatus and method for searching name using bloom filter pre-searching
CN106484675A (en) * 2016-09-29 2017-03-08 北京理工大学 Fusion distributed semantic and the character relation abstracting method of sentence justice feature
WO2017044409A1 (en) * 2015-09-07 2017-03-16 Voicebox Technologies Corporation System and method of annotating utterances based on tags assigned by unmanaged crowds
CN106649557A (en) * 2016-11-09 2017-05-10 北京大学(天津滨海)新代信息技术研究院 Semantic association mining method for defect report and mail list
CN107092675A (en) * 2017-04-12 2017-08-25 新疆大学 A kind of Uighur semanteme string abstracting method based on statistics and shallow-layer language analysis

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8234268B2 (en) * 2008-11-25 2012-07-31 Teradata Us, Inc. System, method, and computer-readable medium for optimizing processing of distinct and aggregation queries on skewed data in a database system
CN101464905B (en) * 2009-01-08 2011-03-23 中国科学院计算技术研究所 Web page information extraction system and method
CN101702167A (en) * 2009-11-03 2010-05-05 上海第二工业大学 Method for extracting attribution and comment word with template based on internet
CN103399854B (en) * 2013-06-28 2017-02-08 中国中医科学院中医临床基础医学研究所 Data positioning identifying and storing method and system
CN103617290B (en) * 2013-12-13 2017-02-15 江苏名通信息科技有限公司 Chinese machine-reading system
CN105573971B (en) * 2014-10-10 2018-09-25 富士通株式会社 Table reconfiguration device and method
CN104794222B (en) * 2015-04-29 2017-12-12 北京交通大学 Network form semanteme restoration methods
CN105138637A (en) * 2015-08-24 2015-12-09 浪潮软件股份有限公司 Data processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ontology guided autonomous label assignment in wrapper induced tables with missing column names;Mohammad Shafkat Amin等;《2009 IEEE International Conference on Information Reuse & Integration》;20090821;424-425 *

Also Published As

Publication number Publication date
CN110019829A (en) 2019-07-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190903

Address after: 100192 Dongsheng Science Park, Zhongguancun, 66 Xixiaokou Road, Haidian District, Beijing

Applicant after: Green Bay Network Technology Co., Ltd.

Address before: 100089 Beijing Haidian District Xixiaokou Road 66 Zhongguancun Dongsheng Science Park B-6 Building B 5 floors

Applicant before: Grass count language (Beijing) Technology Co., Ltd.

GR01 Patent grant