CN110019829B

CN110019829B - Data attribute determination method and device

Info

Publication number: CN110019829B
Application number: CN201710848242.XA
Authority: CN
Inventors: 宋奇; 王思睿; 姜萌芽; 钟磊; 秦锋剑
Original assignee: Green Bay Network Technology Co ltd
Current assignee: Green Bay Network Technology Co., Ltd.
Priority date: 2017-09-19
Filing date: 2017-09-19
Publication date: 2021-05-07
Anticipated expiration: 2037-09-19
Also published as: CN110019829A

Abstract

The invention discloses a data attribute determining method and a data attribute determining device, wherein the method comprises the following steps: splitting formatted original data to obtain a plurality of column data; if the column data does not comprise column header contents, determining a candidate attribute set corresponding to the column data according to the data type of the column data; determining the attribute of each unit content of the column data according to the candidate attribute set; and counting the attributes of the content of each unit of the column data to obtain the confidence coefficient of each attribute, and determining the attributes of the column data according to the confidence coefficient. According to the method, the candidate attribute set of the column data is searched in a classified manner, and the attribute of the column data is determined by counting the attribute of each unit content, so that the calculation amount in the data attribute identification process is reduced as much as possible, and the identification efficiency and accuracy of the data attribute are improved.

Description

Data attribute determination method and device

Technical Field

The invention relates to the technical field of data mining, in particular to a data attribute determining method and device.

Background

In big data analysis involving relationship graphs, structured original data is split and its items are identified so that the data can be mapped to entity attributes, which facilitates entity modeling and analysis.

A typical application scenario is as follows: for a well-structured Excel input table, the attribute corresponding to each column of the table can be inferred from the file name, the table header, and the content of each row, and these attributes correspond to the relevant model attributes of an E-R (entity-relationship) diagram. In this way, a relational mapping from the original input information to the graph model can be established, which facilitates subsequent graph mining and other deeper operations. However, how to improve the recognition rate of data attributes remains a technical problem to be solved urgently.

Disclosure of Invention

The object of the present invention is to solve at least to some extent one of the above mentioned technical problems.

Therefore, a first object of the present invention is to provide a data attribute determining method that searches for the candidate attribute set of the column data by category and determines the attribute of the column data by counting the attribute of each unit content, thereby reducing the amount of computation in data attribute identification as much as possible and improving the efficiency and accuracy of data attribute identification.

A second object of the present invention is to provide a data attribute determining apparatus.

A third object of the invention is to propose a computer device.

A fourth object of the invention is to propose a computer-readable storage medium.

In order to achieve the above object, a data attribute determining method according to an embodiment of the first aspect of the present invention includes: splitting formatted original data to obtain a plurality of column data;

if the column data does not comprise column header contents, determining a candidate attribute set corresponding to the column data according to the data type of the column data;

determining the attribute of each unit content of the column data according to the candidate attribute set;

and counting the attributes of the content of each unit of the column data to obtain the confidence coefficient of each attribute, and determining the attributes of the column data according to the confidence coefficient.

The method as described above, wherein determining the candidate attribute set corresponding to the column data according to the data type of the column data includes:

when the data type is an alphanumeric type, acquiring a candidate regular expression set corresponding to the column data, and determining the candidate regular expression set as the candidate attribute set corresponding to the column data, wherein the candidate regular expression set comprises a plurality of regular expressions, and the regular expressions are associated with data attributes;

the determining the attribute of each unit content of the column data according to the candidate attribute set includes:

matching the unit content with each regular expression in the candidate regular expression set one by one;

and when the regular expression is matched with the unit content, determining the data attribute associated with the regular expression as the attribute corresponding to the unit content.

when the data type is a non-alphanumeric type, acquiring a candidate hash dictionary set corresponding to the column data, and determining the candidate hash dictionary set as the candidate attribute set corresponding to the column data, wherein the candidate hash dictionary set comprises a plurality of hash dictionaries, and the hash dictionaries are associated with data attributes;

inputting unit content into each hash dictionary in the candidate hash dictionary set to carry out one-by-one query;

and when the unit content is inquired in the hash dictionary, determining the data attribute associated with the hash dictionary as the attribute corresponding to the unit content.

The method as described above, wherein counting the attributes of each unit content of the column data to obtain the confidence of each attribute, and determining the attribute of the column data according to the confidence, includes:

counting attributes associated with a hash dictionary of each unit content of the column data to obtain confidence degrees of the attributes associated with the hash dictionary;

determining attributes of column data associated with the hash dictionary based on the confidence associated with the hash dictionary;

after determining the attribute of the column data associated with the hash dictionary according to the confidence degree associated with the hash dictionary, the method further comprises:

when the non-alphanumeric type is a short text and the confidence of the attribute of the column data associated with the hash dictionary is lower than a set threshold, determining the attribute of the column data associated with the Trie tree;

and comparing the attribute of the column data associated with the hash dictionary with the attribute of the column data associated with the Trie tree, and determining the attribute of the column data with high confidence as the target attribute of the column data.

The method as described above, further comprising:

and when the column data is determined to comprise column header content, inquiring a preset attribute mapping dictionary according to the column header content to obtain an attribute matched with the column header content, and determining the attribute matched with the column header content as the attribute of the column data.

In order to achieve the above object, a data attribute determining apparatus according to an embodiment of a second aspect of the present invention includes: the splitting module is used for splitting the formatted original data to obtain a plurality of column data;

a first determining module, configured to determine, if the column data does not include column header content, a candidate attribute set corresponding to the column data according to a data type of the column data;

a second determining module, configured to determine, according to the candidate attribute set, an attribute of each unit content of the column data;

and the third determining module is used for counting the attributes of the content of each unit of the column data to obtain the confidence coefficient of each attribute, and determining the attributes of the column data according to the confidence coefficient.

To achieve the above object, a computer apparatus according to an embodiment of the third aspect of the present invention includes: a processor and a memory;

wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the data attribute determination method of the first aspect.

In order to achieve the above object, a computer-readable storage medium according to a fourth aspect of the present invention has a computer program stored thereon, and when executed by a processor, implements the data attribute determining method according to the first aspect.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flow chart of a data attribute determination method according to an embodiment of the present invention;

FIG. 2 is a flow chart of a data attribute determination method according to yet another embodiment of the present invention;

FIG. 3 is a flow chart of a data attribute determination method of another embodiment of the present invention;

FIG. 4 is a flow chart of a data attribute determination method according to yet another embodiment of the present invention;

FIG. 5 is a flow chart of a data attribute determination method according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a data attribute determining apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.

A data attribute determining method and apparatus according to an embodiment of the present invention will be described below with reference to the drawings.

Fig. 1 is a flowchart of a data attribute determining method according to an embodiment of the present invention. The data attribute determination method of the present embodiment is executed by a data attribute determination device, which may be integrated in a server.

As shown in fig. 1, the data attribute determining method of the present embodiment includes:

step S101, splitting the formatted original data to obtain a plurality of column data.

Specifically, formatted original data contains a large amount of valuable information, and mining this data can help a user make more scientific decisions. The formatted original data in this embodiment may be various reports, such as a customer list, a product list, an article list, an order, or an invoice, and the reports may be in formats such as PDF, Word, Excel, or PowerPoint.

Taking Table 1 as an example, the file name of Table 1 is "logistics management report". Table 1 has 3 columns of data, and the column header contents of the columns are, respectively, the waybill number, the sending company, and the receiving company. Generally, the column header content is the attribute corresponding to the column data; that is, the attribute of the column headed by the waybill number in Table 1 is the waybill number, the attribute of the column headed by the sending company is the sending company, and so on.

When a server integrated with the data attribute determining device collects a large number of logistics management reports such as the one shown in Table 1, it first identifies the structural form of the logistics management report and determines its number of columns; it then splits the report by column to obtain 3 columns of data, namely the waybill number, the sending company, and the receiving company; finally, it mines the attribute of each column of data according to the column header content or each unit content in the column data.

TABLE 1

Waybill number	Sending company	Receiving company
4506442377787	Jiangxi Co Ltd	Shandong Co Ltd
4523447706787	Beijing Corp Ltd	Shanxi Co Ltd
8744235077647	Hainan Co Ltd	Henan Co Ltd
7643507474287	Hunan corporation	Hebei Co Ltd
3587442077647	Shanghai Co Ltd	Jiangxi Co Ltd
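The column split of step S101 can be pictured with a minimal Python sketch. It assumes the report has already been exported as tab-separated text; the function name `split_columns` and the header wording are illustrative and do not come from the patent.

```python
from typing import List

# Rows of Table 1 as tab-separated text (header row first).
# The header wording is an assumption made for this illustration.
RAW = "\n".join([
    "Waybill number\tSending company\tReceiving company",
    "4506442377787\tJiangxi Co Ltd\tShandong Co Ltd",
    "4523447706787\tBeijing Corp Ltd\tShanxi Co Ltd",
    "8744235077647\tHainan Co Ltd\tHenan Co Ltd",
    "7643507474287\tHunan corporation\tHebei Co Ltd",
    "3587442077647\tShanghai Co Ltd\tJiangxi Co Ltd",
])


def split_columns(raw: str, sep: str = "\t") -> List[List[str]]:
    """Split row-oriented text into a list of columns (hypothetical helper)."""
    rows = [line.split(sep) for line in raw.splitlines() if line.strip()]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Pad short rows so that every column has one cell per row.
    rows = [row + [""] * (width - len(row)) for row in rows]
    # Transpose rows into columns.
    return [list(col) for col in zip(*rows)]


columns = split_columns(RAW)
for col in columns:
    print(col[0], "->", len(col) - 1, "unit contents")
```

Each resulting list holds the column header (when one exists) followed by the unit contents, which is the form the later steps operate on.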

Step S102, if the column data does not include column header content, determining a candidate attribute set corresponding to the column data according to the data type of the column data.

For example, when the server collects the column data shown in table 1, including the column header content and the unit content, since the column header content represents the attribute of the column data, the server preferentially identifies the column header content to mine the attribute of the column data. However, not all the collected formatted raw data includes the column header, and in this case, the server mines the attributes of the column data according to the content of each unit of each column data.

Specifically, the data types of the attributes of the column data may be classified into an alphanumeric type and a non-alphanumeric type. The unit content corresponding to an attribute of the alphanumeric type may consist of digits, the 26 English letters, underscores, spaces, tab characters, form feeds, and the like; examples of such attributes are a waybill number, a postal code, an identity card number, and a MAC address. The non-alphanumeric type may be an enumerated type or a short text type. Attributes of the enumerated type include ethnicity, gender, and the like; attributes of the short text type include company, school, and the like.
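The type split described above might be approximated as follows. This is only a sketch: the character class, the majority rule, and the `min_share` parameter are assumptions, since the text does not say how the data type of a whole column is decided. The company names are written in Chinese script on the assumption that the source data is Chinese, because the translated names in Table 1 consist solely of English letters and spaces.

```python
import re
from typing import Iterable

# Digits, English letters, underscore, space, tab, form feed.
# This character class is an assumption based on the list in the text.
ALNUM_CELL = re.compile(r"[0-9A-Za-z_ \t\f]+")


def column_data_type(cells: Iterable[str], min_share: float = 0.5) -> str:
    """Classify a column as 'alphanumeric' or 'non-alphanumeric'.

    A column counts as alphanumeric when more than `min_share` of its
    non-empty cells consist only of the characters above (assumed rule).
    """
    non_empty = [c for c in cells if c]
    if not non_empty:
        return "non-alphanumeric"
    hits = sum(1 for c in non_empty if ALNUM_CELL.fullmatch(c))
    return "alphanumeric" if hits / len(non_empty) > min_share else "non-alphanumeric"


print(column_data_type(["4506442377787", "4523447706787"]))
print(column_data_type(["江西公司", "北京公司"]))
```

The result selects which family of candidate models is consulted, so only regular expressions or only dictionaries need to be tried for a given column.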

For example, when the server acquires the column data corresponding to the waybill number in Table 1, the server first performs operations such as identification, classification, and clustering on the 5 unit contents and infers that the attribute of this column data may be an alphanumeric type attribute such as a waybill number, a postal code, or an identity card. The server then calls the pre-stored waybill number attribute identification model, postal code attribute identification model, and identity card attribute identification model. If the server has established different waybill number attribute identification models for different logistics companies, it calls all of them. Together, all the waybill number attribute identification models, the postal code attribute identification model, and the identity card attribute identification model form the candidate attribute set corresponding to the column data in this embodiment (that is, the candidate attribute set of the column data associated with the waybill number).

For example, when the server acquires the column data corresponding to the sending company in Table 1, the server first performs operations such as identification, classification, and clustering on the 5 unit contents and infers that the attribute of this column data may be a non-alphanumeric type attribute such as a sending company, a receiving company, or a transit company. The server then calls the pre-stored sending company attribute identification model, receiving company attribute identification model, and transit company attribute identification model. Together, these three models form the candidate attribute set corresponding to the column data in this embodiment (that is, the candidate attribute set of the column data associated with the sending company).

It should be noted that the attribute identification models mentioned as examples, such as the waybill number, postal code, identity card, sending company, receiving company, and transit company attribute identification models, are established by the data mining development company according to actual requirements. For example, the data mining development company may collect massive historical formatted original data and build the corresponding attribute identification model with an analysis method such as machine learning, or it may statistically analyze the massive historical formatted original data and build the corresponding attribute identification model based on a preset algorithm. For example, an attribute identification model may be a hash dictionary established for the attribute, a Trie tree established for the attribute, a regular expression established for the attribute, or another identification model.

In this embodiment, when it is determined that the column data does not include the column header content, the candidate attribute set of the column data is searched for in a classified manner by determining the data type of the column data, so that the computation amount in the data attribute identification process is reduced as much as possible, and the identification efficiency and accuracy of the data attribute are improved.

And step S103, determining the attribute of each unit content of the column data according to the candidate attribute set.

Step S104, counting the attributes of each unit content of the column data to obtain the confidence of each attribute, and determining the attribute of the column data according to the confidence.

For example, when the server identifies the attributes of the 5 unit contents in the column corresponding to the sending company in Table 1, each unit content is input into the candidate attribute set and matched against its members one by one. Here the candidate attribute set comprises the sending company attribute identification model, the receiving company attribute identification model, and the transit company attribute identification model.

Specifically, the unit content "Jiangxi company" is input in turn into the sending company attribute identification model, the receiving company attribute identification model, and the transit company attribute identification model and matched against each one; if it matches the receiving company attribute identification model, the attribute of the unit content "Jiangxi company" is determined to be the receiving company. By analogy, the attributes of the unit contents "Beijing company", "Hainan company", "Hunan company", and "Shanghai company" are each determined to be the sending company.

Specifically, the statistics show that the column data associated with the sending company has two attributes, the sending company and the receiving company: the attribute of four unit contents is identified as the sending company, and the attribute of one unit content is identified as the receiving company. It follows that the probability that the attribute of the column data is the sending company is 80% (4 of 5), and the probability that it is the receiving company is 20% (1 of 5). The confidence of an attribute in this embodiment may be understood as the probability of that attribute: the confidence that the attribute of the column data is the sending company is 80%, and the confidence that it is the receiving company is 20%.
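The counting step in this example amounts to a relative frequency over the per-cell attributes. A minimal sketch, with the function name `attribute_confidences` chosen for illustration:

```python
from collections import Counter
from typing import Dict, List


def attribute_confidences(cell_attributes: List[str]) -> Dict[str, float]:
    """Confidence of each attribute = share of unit contents labelled with it."""
    total = len(cell_attributes)
    if total == 0:
        return {}
    counts = Counter(cell_attributes)
    return {attr: n / total for attr, n in counts.items()}


# Per-cell results from the example: one receiving company, four sending company.
cells = [
    "receiving company",
    "sending company",
    "sending company",
    "sending company",
    "sending company",
]
conf = attribute_confidences(cells)
print(conf)  # {'receiving company': 0.2, 'sending company': 0.8}
```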

The specific implementation manner of determining the attribute of the column data according to the confidence coefficient is as follows:

in a first implementation manner, the confidence degrees corresponding to the same column of data are compared, and the attribute corresponding to the maximum confidence degree is selected to be determined as the attribute of the column of data.

In a second implementation manner, it is determined whether each confidence corresponding to the same column of data meets a set condition, and the attribute corresponding to a confidence that meets the condition is determined as the attribute of the column of data. The set condition may be a comparison of the confidence with a set threshold, or a check of whether the confidence lies within a set range, but it is not limited thereto. Note that more than one confidence may meet the condition, in which case more than one attribute may be determined for the column data.

In a third implementation manner, each confidence corresponding to the same column of data is presented to the user, and the user autonomously selects the attribute of the column of data according to the confidence.
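The three implementation manners can be contrasted in a short sketch. The function names and the default threshold of 0.5 are assumptions; the text leaves the set condition open.

```python
from typing import Dict, List, Optional, Tuple


def pick_max(conf: Dict[str, float]) -> Optional[str]:
    """First manner: the attribute with the largest confidence."""
    return max(conf, key=conf.get) if conf else None


def pick_meeting_condition(conf: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Second manner: every attribute whose confidence reaches the threshold.

    The threshold value is an assumed example of a 'set condition';
    the result may contain zero, one, or several attributes.
    """
    return [attr for attr, c in conf.items() if c >= threshold]


def rank_for_user(conf: Dict[str, float]) -> List[Tuple[str, float]]:
    """Third manner: confidences sorted for display; the user makes the choice."""
    return sorted(conf.items(), key=lambda item: item[1], reverse=True)


conf = {"sending company": 0.8, "receiving company": 0.2}
print(pick_max(conf))
print(pick_meeting_condition(conf, threshold=0.1))
print(rank_for_user(conf))
```

With a low threshold the second manner returns both attributes, which illustrates the remark that a column may end up with several attributes.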

The data attribute determining method provided by the embodiment includes: splitting formatted original data to obtain a plurality of column data; if the column data does not comprise column header contents, determining a candidate attribute set corresponding to the column data according to the data type of the column data; determining the attribute of each unit content of the column data according to the candidate attribute set; and counting the attributes of the content of each unit of the column data to obtain the confidence coefficient of each attribute, and determining the attributes of the column data according to the confidence coefficient. According to the method, the candidate attribute set of the column data is searched in a classified manner, and the attribute of the column data is determined by counting the attribute of each unit content, so that the calculation amount in the data attribute identification process is reduced as much as possible, and the identification efficiency and accuracy of the data attribute are improved.

Fig. 2 is a flowchart of a data attribute determination method according to another embodiment of the present invention. On the basis of the above embodiment, when the column data has column header content, the attribute of the column data can be determined simply, efficiently, and more accurately from the column header content, which also speeds up the identification of the column data attribute.

As shown in fig. 2, the data attribute determining method of the present embodiment includes:

step S201, splitting the formatted original data to obtain a plurality of column data, and executing step S202 or step S205;

step S202, if the column data does not include column header content, determining a candidate attribute set corresponding to the column data according to the data type of the column data, and executing step S203;

step S203, determining the attribute of each unit content of the column data according to the candidate attribute set, and executing step S204.

Step S204, counting the attributes of each unit content of the column data to obtain the confidence of each attribute, and determining the attributes of the column data according to the confidence.

The implementation manners of steps S201, S202, S203, and S204 in this embodiment are the same as the implementation manners of S101, S102, S103, and S104 in the above embodiment, and are not described again here.

Step S205, when it is determined that the column data includes column header content, querying a preset attribute mapping dictionary according to the column header content to obtain an attribute matched with the column header content, and determining the attribute matched with the column header content as the attribute of the column data.

For example, when a user designs a report, the column header content represents the attribute of the column data; the column data shown in Table 1, for instance, has the column header contents waybill number, sending company, and receiving company.

Specifically, a preset attribute mapping dictionary is stored in the server in advance; it is designed by the data mining development company according to the characteristics of various industries and is continuously updated. The attributes contained in the preset attribute mapping dictionary have been verified by professionals and are therefore highly authoritative. In this embodiment, the column header content is input into the preset attribute mapping dictionary, and the attribute found in the dictionary is determined as the attribute of the column data.
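A header lookup of this kind could look like the sketch below. The dictionary entries, the attribute identifiers, and the whitespace and case normalization are all hypothetical; the patent does not specify the contents or format of the preset attribute mapping dictionary.

```python
from typing import Optional

# Hypothetical mapping from header spellings to canonical attributes.
ATTRIBUTE_MAP = {
    "waybill number": "waybill_number",
    "freight note number": "waybill_number",
    "sending company": "sending_company",
    "sender": "sending_company",
    "receiving company": "receiving_company",
    "addressee company": "receiving_company",
}


def attribute_from_header(header: str) -> Optional[str]:
    """Return the attribute mapped to a column header, or None if unknown.

    Collapsing whitespace and lowercasing are assumptions of this sketch.
    """
    key = " ".join(header.split()).lower()
    return ATTRIBUTE_MAP.get(key)


print(attribute_from_header("Waybill  Number"))
print(attribute_from_header("Unknown header"))
```

A `None` result corresponds to the case in which the header gives no usable attribute and the content-based steps would be needed instead.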

In the data attribute determining method provided by this embodiment, when it is determined that the column data includes the column header content, the preset attribute mapping dictionary is queried according to the column header content to obtain the attribute matched with the column header content, and the attribute matched with the column header content is determined as the attribute of the column data.

Fig. 3 is a flowchart of a data attribute determination method according to another embodiment of the present invention. On the basis of the above embodiment, when the data type is an alphanumeric type, the attribute of each unit content is determined by matching the unit content against regular expressions, and the attribute of the column data is then determined.

As shown in fig. 3, the data attribute determining method of the present embodiment includes:

step S301, splitting the formatted original data to obtain a plurality of column data, and executing step S302;

step S302, if the column data does not include column header content, determining a candidate attribute set corresponding to the column data according to the data type of the column data, and executing step S303;

step S303, when the data type is an alphanumeric type, obtaining a candidate regular expression set corresponding to the column data, determining the candidate regular expression set as the candidate attribute set corresponding to the column data, and executing step S304.

The candidate regular expression set comprises a plurality of regular expressions, and the regular expressions are associated with data attributes.

Specifically, the regular expression is a logic formula for operating on a character string, that is, a "rule character string" is formed by using specific characters defined in advance and a combination of the specific characters, and the "rule character string" is used for expressing a filtering logic for the character string. Regular expressions are very flexible, logical and functional; the regular expression is used for occasions such as character string processing, form verification and the like, and is practical and efficient.

For example, the server stores various regular expressions in advance, and one regular expression is associated with one attribute. For example, a regular expression of the waybill number, a regular expression of the postal code and a regular expression of the identity card are prestored.

When the server collects the column data corresponding to the waybill number in Table 1, the server may perform operations such as identification, classification, and clustering on the 5 unit contents and infer that the attribute of this column data may be an alphanumeric type attribute such as a waybill number, a postal code, or an identity card. The server then calls the pre-stored regular expression of the waybill number, regular expression of the postal code, and regular expression of the identity card; that is, the candidate regular expression set in this embodiment comprises these three regular expressions.

Step S304, matching the unit content with each regular expression in the candidate regular expression set one by one, and executing step S305.

Step S305, when the regular expression is matched with the unit content, determining the data attribute associated with the regular expression as the attribute corresponding to the unit content, and executing step S306.

Step S306, counting the attributes of each unit content of the column data to obtain the confidence of each attribute, and determining the attribute of the column data according to the confidence.

For example, when the server identifies attributes in 5 unit contents corresponding to the waybill number in table 1, the unit contents are input into the candidate regular expression set for one-by-one matching.

Specifically, through statistics, three attributes of the column data related to the waybill number exist, namely the waybill number, the postal code and the identity card; the attribute of 3 unit contents is identified as the waybill number, the attribute of 1 unit content is identified as the zip code, and the attribute of 1 unit content is identified as the identity card.

Through calculation, the probability (which can be understood as the confidence) that the attribute of the column data is the waybill number is 60%, the probability that it is the postal code is 20%, and the probability that it is the identity card is 20%.

For example, the confidences corresponding to the same column of data are compared, and the attribute corresponding to the highest confidence is determined as the attribute of the column of data. In the above example, 60% is the highest confidence, so the attribute of the column data is determined to be the waybill number.
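Steps S304 to S306 can be sketched as follows. The three patterns and the sample column are hypothetical stand-ins chosen to reproduce the 60% / 20% / 20% split of the example; the patent does not disclose the actual regular expressions, and taking the first matching pattern as the cell's attribute is an assumption of this sketch.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical candidate regular expression set, each tied to one attribute.
CANDIDATE_PATTERNS = [
    ("waybill number", re.compile(r"[0-9]{13}")),
    ("postal code", re.compile(r"[0-9]{6}")),
    ("identity card", re.compile(r"[0-9]{17}[0-9Xx]")),
]


def cell_attribute(cell: str) -> Optional[str]:
    """Match one unit content against each pattern in turn (first match wins)."""
    for attribute, pattern in CANDIDATE_PATTERNS:
        if pattern.fullmatch(cell):
            return attribute
    return None


def column_attribute(cells: List[str]) -> Tuple[Optional[str], Dict[str, float]]:
    """Count per-cell attributes and pick the one with the highest confidence."""
    labels = [cell_attribute(c) for c in cells]
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None, {}
    conf = {attr: n / len(cells) for attr, n in counts.items()}
    return max(conf, key=conf.get), conf


# Hypothetical column: three 13-digit values, one 6-digit value, one 18-character value.
column = ["4506442377787", "4523447706787", "8744235077647",
          "100080", "11010519491231002X"]
best, conf = column_attribute(column)
print(best, conf)
```

`fullmatch` is used so that a 13-digit value is not mistaken for a postal code merely because it contains six consecutive digits.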

In the data attribute determining method of this embodiment, when the data type of the column data is determined to be an alphanumeric type, a candidate regular expression set corresponding to the column data is obtained first; each unit content is then matched against every regular expression in the candidate regular expression set one by one to determine the attribute corresponding to the unit content; finally, the attributes of the unit contents of the column data are counted to determine the attribute of the column data. The method determines the attribute of alphanumeric column data by means of regular expressions and is practical and efficient. Because regular expressions are highly flexible and expressive, a new regular expression can be written for each newly added attribute, which gives the method good extensibility.

Fig. 4 is a flowchart of a data attribute determination method according to still another embodiment of the present invention. On the basis of the above embodiment, when the data type is a non-numeric letter type, the attribute of each unit content is determined by looking the unit content up in hash dictionaries, and the attribute of the column data is then determined.

As shown in fig. 4, the data attribute determining method of the present embodiment includes:

step S401, splitting the formatted original data to obtain a plurality of column data, and executing step S402;

step S402, if the column data does not include column header content, determining a candidate attribute set corresponding to the column data according to the data type of the column data, and executing step S403;

step S403, when the data type is a non-numeric letter type, obtaining a candidate hash dictionary set corresponding to the column data, determining the candidate hash dictionary set as a candidate attribute set corresponding to the column data, where the candidate hash dictionary set includes a plurality of hash dictionaries, and the hash dictionaries are associated with data attributes, and then step S404 is executed.

A hash table is briefly introduced here. A hash table is a data structure that accesses a memory storage location directly according to a key: a function of the key is computed to map the data being queried to a position in the table where the record is accessed, which speeds up lookup. The mapping function is called the hash function, and the array that stores the records is called the hash table; in theory, the keys and the function rule can be chosen arbitrarily.

To speed up querying of data attributes, the hash dictionary in this embodiment may establish corresponding first-letter hash tables for different industries. Taking table 1 as an example, the hash dictionaries to be established are a hash dictionary of sending companies and a hash dictionary of receiving companies. In the hash dictionary of sending companies, first-letter hash tables are established for Jiangxi company, Beijing company, Hainan company, Hunan company and Shanghai company respectively. When the attribute of a unit content needs to be determined, only a quick lookup by first letter in the hash dictionary is required. For example, to determine the attribute of the unit content "Jiangxi company", the first letter j is input into the hash dictionary of sending companies; if Jiangxi company is found in that dictionary, the attribute of the unit content is determined to be sending company.

Step S404, inputting unit content into each hash dictionary in the candidate hash dictionary set to carry out one-by-one inquiry, and executing step S405;

step S405, when the unit content is found in the hash dictionary, determining the data attribute associated with the hash dictionary as the attribute corresponding to the unit content, and executing step S406.

Step S406, counting the attributes of each unit content of the column data to obtain the confidence of each attribute, and determining the attributes of the column data according to the confidence.
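Steps S404 to S406, together with the first-letter hash tables described above, can be sketched in Python as follows. This is only an illustration under stated assumptions: the receiving-company names are hypothetical, the built-in `dict` and `set` types stand in for the hash tables, and unit contents not found in any dictionary are assumed to still count toward the total when the confidence is calculated.

```python
from collections import Counter, defaultdict


def build_hash_dictionary(names):
    """Group names into buckets keyed by their lowercase first letter."""
    buckets = defaultdict(set)
    for name in names:
        buckets[name[0].lower()].add(name)
    return dict(buckets)


# Candidate hash dictionary set: attribute -> first-letter hash tables.
# The receiving-company entries are hypothetical.
HASH_DICTIONARIES = {
    "sending company": build_hash_dictionary(
        ["Jiangxi company", "Beijing company", "Hainan company",
         "Hunan company", "Shanghai company"]),
    "receiving company": build_hash_dictionary(
        ["Guangdong company", "Sichuan company"]),
}


def cell_attribute(cell):
    """S404/S405: query each hash dictionary in turn by first letter."""
    for attribute, buckets in HASH_DICTIONARIES.items():
        if cell and cell in buckets.get(cell[0].lower(), ()):
            return attribute
    return None


def column_confidences(cells):
    """S406: count per-cell attributes and turn the counts into confidences."""
    found = [a for a in map(cell_attribute, cells) if a is not None]
    return {a: n / len(cells) for a, n in Counter(found).items()}


column = ["Jiangxi company", "Beijing company", "Hainan company",
          "Hunan company", "Guangdong company"]
print(column_confidences(column))
# {'sending company': 0.8, 'receiving company': 0.2}
```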

For example, when the server identifies the attributes of the 5 unit contents in the sending-company column of table 1, each unit content is input into the candidate hash dictionary set and queried one by one.

Specifically, statistics show that the unit contents of the sending-company column were identified with two attributes: sending company and receiving company. Four unit contents were identified as sending company and one as receiving company.

By calculation, the probability (which can be understood as the confidence) that the attribute of the column data is the sending company is 80%, and the probability that it is the receiving company is 20%.

Next, the confidences corresponding to the same column of data are compared, and the attribute with the highest confidence is determined as the attribute of the column data. In the above example, 80% is the highest confidence, so the attribute of the column data is determined to be the sending company.

In the data attribute determining method of this embodiment, when the data type is a non-numeric letter type, a candidate hash dictionary set corresponding to the column data is obtained first; the unit content is then input into each hash dictionary in the candidate hash dictionary set and queried one by one; when the unit content is found in a hash dictionary, the data attribute associated with that hash dictionary is determined as the attribute corresponding to the unit content; finally, the attributes of the unit contents of the column data are counted to determine the attribute of the column data. The method uses hash dictionaries to query the attribute of non-numeric-letter-type column data quickly. Because the keys and the function rule of a hash table can in theory be chosen arbitrarily when a hash dictionary is established, a new hash table can be written for each newly added attribute, so the method is readily extensible.

Fig. 5 is a flowchart of a data attribute determination method according to an embodiment of the present invention. On the basis of the above embodiment, for attributes of short texts such as companies and schools, since names of companies and schools are constantly updated, names of companies and schools which have just appeared may not be included in the hash dictionary, and a column data attribute with low confidence may be selected. For the above situation, in this embodiment, after the hash dictionary is used to determine the attributes of the column data, the attributes of the column data are further determined through the Trie tree, and the column data attributes with higher confidence are selected as the target attributes of the column data.

As shown in fig. 5, the data attribute determining method of the present embodiment includes:

step S501, splitting the formatted original data to obtain a plurality of column data, and executing step S502;

step S502, if the column data does not comprise column header content, determining a candidate attribute set corresponding to the column data according to the data type of the column data, and executing step S503;

step S503, when the data type is a non-numeric letter type, acquiring a candidate hash dictionary set corresponding to the column data, determining the candidate hash dictionary set as a candidate attribute set corresponding to the column data, where the candidate hash dictionary set includes a plurality of hash dictionaries, and the hash dictionaries are associated with data attributes, and then step S504 is executed.

Step S504, inputting the unit content into each hash dictionary in the candidate hash dictionary set, inquiring one by one, and executing step S505;

step S505, when the unit content is found in the hash dictionary, determining the data attribute associated with the hash dictionary as the attribute corresponding to the unit content, and executing step S506.

Step S506, counting the attributes associated with the hash dictionary of each unit content of the column data to obtain the confidence of each attribute associated with the hash dictionary; the attribute of the column data associated with the hash dictionary is determined according to the confidence associated with the hash dictionary, and step S507 is performed.

Step S507, when the non-numeric letter type is a short text and the confidence of the attribute of the column data associated with the hash dictionary is lower than a set threshold, determining the attribute of the column data associated with the Trie tree, and executing step S508.

Specifically, the non-numeric letter type in this embodiment may be an enumeration type or a short text type. Attributes such as nationality and gender are of the enumeration type; taking gender as an example, each unit content of the column data is either female or male. Attributes of the enumeration type can be determined unambiguously by means of a hash dictionary.

For attributes of short texts such as companies and schools, since names of companies and schools are constantly updated and changed, names of companies and schools which have just appeared may not be included in the hash dictionary, and a column data attribute with low confidence may be selected.

For the above situation, in this embodiment, after the hash dictionary is used to determine the attributes of the column data, the attributes of the column data are further determined through the Trie tree, and the column data attributes with higher confidence are selected as the target attributes of the column data.

The Trie, also known as a word-lookup tree, is a tree structure and a variant of the hash tree. It is typically used to count, sort and store large numbers of strings (though not only strings), and is therefore often used by search engine systems for text word-frequency statistics. Its advantage is that it uses the common prefixes of strings to reduce query time, minimizing unnecessary string comparisons, so its query efficiency is higher than that of a hash tree.

In this embodiment, a specific implementation manner of determining the attribute of the column data associated with the Trie tree is as follows:

S1, acquiring a candidate Trie tree set corresponding to the column data, and determining the candidate Trie tree set as a candidate attribute set corresponding to the column data, wherein the candidate Trie tree set comprises a plurality of Trie trees which are associated with data attributes.

Taking table 1 as an example, the Trie trees to be established are a Trie tree of sending companies and a Trie tree of receiving companies; the Trie tree of sending companies is built from the many sending companies obtained through big data mining, such as Jiangxi company, Beijing company, Hainan company, Hunan company and Shanghai company.

S2, inputting the unit content into each Trie tree in the candidate Trie tree set to carry out one-by-one query.

S3, when the unit content is found in a Trie tree, determining the data attribute associated with that Trie tree as the attribute corresponding to the unit content.

Step S508, the attribute of the column data associated with the hash dictionary is compared with the attribute of the column data associated with the Trie tree, and the attribute of the column data with high confidence is determined as the target attribute of the column data.

For example, suppose the attribute of the column containing the sending companies is to be determined and the set threshold is 50%. If the column data attribute determined according to the hash dictionary is receiving company with a confidence of 40%, and the column data attribute determined according to the Trie tree is sending company with a confidence of 30%, the attribute determined according to the hash dictionary is determined as the target attribute of the column data (that is, receiving company). If the column data attribute determined according to the hash dictionary is receiving company with a confidence of 40%, and the column data attribute determined according to the Trie tree is sending company with a confidence of 50%, the attribute determined according to the Trie tree is determined as the target attribute of the column data (that is, sending company).
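The decision made in steps S507 and S508 can be written as a short Python sketch. The function and parameter names are hypothetical, and keeping the hash dictionary result when the two confidences are equal is an assumption, since the specification does not address ties.

```python
def target_attribute(hash_attr, hash_conf, trie_lookup,
                     is_short_text=True, threshold=0.5):
    """Choose the target attribute of a column.

    hash_attr, hash_conf: attribute and confidence from the hash dictionaries.
    trie_lookup: zero-argument callable returning (attribute, confidence)
        from the Trie trees; it is only called when the fallback is needed.
    """
    # S507: the Trie trees are consulted only for short text whose
    # hash dictionary confidence is below the set threshold.
    if not is_short_text or hash_conf >= threshold:
        return hash_attr
    trie_attr, trie_conf = trie_lookup()
    # S508: keep whichever attribute has the higher confidence
    # (assumption: the hash dictionary result wins a tie).
    return trie_attr if trie_conf > hash_conf else hash_attr


# The two cases from the example above (threshold 50%).
print(target_attribute("receiving company", 0.4,
                       lambda: ("sending company", 0.3)))  # receiving company
print(target_attribute("receiving company", 0.4,
                       lambda: ("sending company", 0.5)))  # sending company
```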

In the data attribute determining method of this embodiment, when the non-numeric letter type is a short text and the confidence of the attribute of the column data associated with the hash dictionary is lower than a set threshold, the attribute of the column data associated with the Trie tree is determined, the attribute of the column data associated with the hash dictionary is compared with the attribute of the column data associated with the Trie tree, and the attribute with the higher confidence is determined as the target attribute of the column data. For cases where the confidence of the column data attribute initially selected with the hash dictionary is not high, the attribute of the column data is further determined through the Trie tree and the column data attribute with the higher confidence is selected as the target attribute, which improves the identification accuracy of the data attribute.
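For readers unfamiliar with the structure, a minimal character-level Trie and the query of sub-steps S1 to S3 might be sketched in Python as below. The class layout, the empty-string end marker and the receiving-company names are assumptions for illustration; the specification does not describe how its Trie trees are implemented.

```python
END = ""  # end-of-word marker; cannot collide with a single-character key


class Trie:
    """Minimal Trie storing whole names as paths of characters."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            # Names with a common prefix share the same path of nodes.
            node = node.setdefault(ch, {})
        node[END] = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return END in node


# S1: candidate Trie tree set, one Trie per attribute (contents hypothetical).
CANDIDATE_TRIES = {
    "sending company": Trie(["Jiangxi company", "Beijing company",
                             "Hainan company", "Hunan company",
                             "Shanghai company"]),
    "receiving company": Trie(["Guangdong company", "Sichuan company"]),
}


def cell_attribute(cell):
    """S2/S3: query each Trie in turn; return the attribute of the first hit."""
    for attribute, trie in CANDIDATE_TRIES.items():
        if trie.contains(cell):
            return attribute
    return None


print(cell_attribute("Hunan company"))  # sending company
print(cell_attribute("Hunan"))          # None (a prefix only, not a stored name)
```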

Fig. 6 is a schematic structural diagram of a data attribute determining apparatus according to an embodiment of the present invention. As shown in fig. 6, the data attribute determining apparatus provided in this embodiment includes:

the splitting module 01 is configured to split the formatted original data to obtain a plurality of column data;

a first determining module 02, configured to determine, if the column data does not include column header content, a candidate attribute set corresponding to the column data according to a data type of the column data;

a second determining module 03, configured to determine, according to the candidate attribute set, an attribute of each unit content of the column data;

a third determining module 04, configured to count attributes of each unit content of the column data to obtain a confidence level of each attribute, and determine the attribute of the column data according to the confidence level.

Further, the first determination module comprises a first unit; the second determination module comprises a second unit;

the first unit is configured to, when the data type is a numeric letter type, obtain a candidate regular expression set corresponding to the column data, and determine the candidate regular expression set as a candidate attribute set corresponding to the column data, where the candidate regular expression set includes a plurality of regular expressions, and the regular expressions are associated with data attributes;

the second unit is used for matching the unit content with each regular expression in the candidate regular expression set one by one; and when the regular expression is matched with the unit content, determining the data attribute associated with the regular expression as the attribute corresponding to the unit content.

Further, the first determining module further comprises a third unit; the second determination module further comprises a fourth unit;

the third unit is configured to, when the data type is a non-numeric letter type, obtain a candidate hash dictionary set corresponding to the column data, and determine the candidate hash dictionary set as a candidate attribute set corresponding to the column data, where the candidate hash dictionary set includes a plurality of hash dictionaries, and the hash dictionaries are associated with data attributes;

the fourth unit is used for inputting the unit content into each hash dictionary in the candidate hash dictionary set to inquire one by one; and when the unit content is inquired in the hash dictionary, determining the data attribute associated with the hash dictionary as the attribute corresponding to the unit content.

Further, the third determining module is further configured to count attributes associated with a hash dictionary of each unit content of the column data to obtain a confidence of each attribute associated with the hash dictionary;

Further, the first determining module is further configured to, when it is determined that the column data includes column header content, query a preset attribute mapping dictionary according to the column header content to obtain an attribute matched with the column header content, and determine the attribute matched with the column header content as an attribute of the column data.
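The column-header branch handled by the first determining module can be illustrated with a small Python sketch. The entries of the mapping dictionary and the normalization of the header text (trimming and lowercasing) are assumptions; the specification only says that a preset attribute mapping dictionary is queried with the column header content.

```python
# Hypothetical preset attribute mapping dictionary: header text -> attribute.
ATTRIBUTE_MAPPING = {
    "waybill no.": "waybill number",
    "waybill number": "waybill number",
    "sender": "sending company",
    "recipient": "receiving company",
}


def attribute_from_header(header):
    """Return the attribute matched by a column header, or None if unmatched."""
    # Assumption: headers are compared after trimming and lowercasing.
    return ATTRIBUTE_MAPPING.get(header.strip().lower())


print(attribute_from_header(" Waybill No. "))  # waybill number
print(attribute_from_header("Remarks"))        # None
```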

With regard to the apparatus in this embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiments and will not be elaborated here.

The data attribute determining apparatus provided in this embodiment splits formatted original data to obtain a plurality of column data; if the column data does not comprise column header content, determines a candidate attribute set corresponding to the column data according to the data type of the column data; determines the attribute of each unit content of the column data according to the candidate attribute set; and counts the attributes of the unit contents of the column data to obtain the confidence of each attribute, and determines the attribute of the column data according to the confidence. By finding the candidate attribute set of the column data by classification and determining the attribute of the column data by counting the attribute of each unit content, the apparatus reduces the amount of computation in the data attribute identification process as far as possible and improves the identification efficiency and accuracy of the data attribute.
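How the four modules fit together can be sketched end to end in Python. Everything concrete here is an assumption for illustration: the input is taken to be header-less CSV text, a column is treated as numeric letter type when every cell consists only of ASCII letters and digits, and the patterns and dictionary contents are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical candidate sets.
REGEXES = {"postal code": re.compile(r"\d{6}"),
           "waybill number": re.compile(r"\d{12}")}
DICTIONARIES = {"sending company": {"Jiangxi company", "Beijing company"}}


def split_columns(text):
    """Splitting module: header-less CSV text -> list of columns."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    # zip stops at the shortest row, so ragged rows are truncated.
    return [list(col) for col in zip(*rows)]


def candidate_set(column):
    """First determining module: pick candidates by data type."""
    if all(re.fullmatch(r"[0-9A-Za-z]+", cell) for cell in column):
        return [(attr, pat.fullmatch) for attr, pat in REGEXES.items()]
    return [(attr, names.__contains__) for attr, names in DICTIONARIES.items()]


def cell_attribute(cell, candidates):
    """Second determining module: first candidate that accepts the cell."""
    for attribute, matches in candidates:
        if matches(cell):
            return attribute
    return None


def column_attribute(column):
    """Third determining module: count attributes, return (best, confidence)."""
    candidates = candidate_set(column)
    counts = Counter(cell_attribute(cell, candidates) for cell in column)
    counts.pop(None, None)  # unidentified cells carry no attribute
    if not counts:
        return None, 0.0
    attribute, n = counts.most_common(1)[0]
    return attribute, n / len(column)


data = ("123456789012,Jiangxi company\n"
        "330100,Beijing company\n"
        "210987654321,Unknown firm\n")
for column in split_columns(data):
    print(column_attribute(column))
```

With this input, the first column is identified as waybill number and the second as sending company, each with a confidence of two thirds.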

In order to achieve the above object, an embodiment of the present invention further provides a computer device.

As shown in fig. 7, the computer apparatus includes: a memory 11, a processor 12 and a computer program stored on the memory 11 and executable on the processor 12.

The processor 12, when executing the program, implements the data attribute determination method provided in the embodiment shown in any of fig. 1 to 5.

Further, the computer device further comprises:

a communication interface 13 for communication between the memory 11 and the processor 12.

A memory 11 for storing a computer program operable on the processor 12.

The memory 11 may comprise a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.

The processor 12 is configured to implement the data attribute determining method provided in any of the embodiments shown in fig. 1 to 5 when executing the program.

If the memory 11, the processor 12 and the communication interface 13 are implemented independently, the communication interface 13, the memory 11 and the processor 12 may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 7, but this does not mean that there is only one bus or one type of bus.

Alternatively, in practical implementation, if the memory 11, the processor 12 and the communication interface 13 are integrated on one chip, the memory 11, the processor 12 and the communication interface 13 may complete communication with each other through an internal interface.

The processor 12 may be a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present invention.

To achieve the above object, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data attribute determining method provided in the embodiment shown in any one of fig. 1 to 5.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps of the method embodiments described above may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method for determining data attributes, comprising:

splitting formatted original data to obtain a plurality of column data;

counting the attributes of the content of each unit of the column data to obtain the confidence of each attribute, and determining the attributes of the column data according to the confidence;

the determining a candidate attribute set corresponding to the column data according to the data type of the column data includes:

2. The method of claim 1, further comprising:

3. A method for determining data attributes, comprising:

splitting formatted original data to obtain a plurality of column data;

4. The method of claim 3,

the counting the attributes of the content of each unit of the column data to obtain the confidence of each attribute, and determining the attributes of the column data according to the confidence includes:

5. The method of any of claims 3 to 4, further comprising:

6. A data attribute determination apparatus, comprising:

the splitting module is used for splitting the formatted original data to obtain a plurality of column data;

a third determining module, configured to count attributes of each unit content of the column data to obtain a confidence level of each attribute, and determine the attribute of the column data according to the confidence level;

the first determining module comprises a first unit; the second determination module comprises a second unit;

7. A data attribute determination apparatus, comprising:

the first determination module further comprises a third unit; the second determination module further comprises a fourth unit;

8. A computer device, comprising: a processor and a memory;

wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory for implementing the data attribute determination method according to any one of claims 1 to 2 and implementing the data attribute determination method according to any one of claims 3 to 5.

9. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the data attribute determination method according to any one of claims 1 to 2 and the data attribute determination method according to any one of claims 3 to 5.