CN107291672B

CN107291672B - Data table processing method and device

Info

Publication number: CN107291672B
Application number: CN201610197071.4A
Authority: CN
Inventors: 纪丽娟
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2016-03-31
Filing date: 2016-03-31
Publication date: 2020-11-20
Anticipated expiration: 2036-03-31
Also published as: CN107291672A

Abstract

The application discloses a data table processing method and device. The method comprises: comparing a first field in a first data table with a second field in a second data table; when the comparison shows that the identification information of the first field differs from that of the second field, obtaining processing information of the first field and processing information of the second field, wherein the processing information records a plurality of processing logics in the processing path of the corresponding field; comparing the processing logics of the corresponding fields one by one along the processing path; and, if the processing logics currently being compared are inconsistent, determining that they are the logics in which the difference lies. The method and the device solve the technical problem of low efficiency in comparing the contents of data tables.

Description

Data table processing method and device

Technical Field

The present application relates to the field of data processing, and in particular, to a method and an apparatus for processing a data table.

Background

In the prior art, data tables are compared by directly comparing their data contents, and when a difference is found, the error is located by manually tracing back up the processing link.

Once a content difference is found, the differing data must be obtained manually and traced along its processing link, with the data checked and compared item by item to locate the error. Because the number of comparison tasks is large, the manual workload is heavy and the error rate of manual operation is high.

No effective solution has yet been proposed for the problem of low efficiency in comparing the contents of data tables.

Disclosure of Invention

The embodiment of the application provides a method and a device for processing a data table, which are used for at least solving the problem of low efficiency when comparing the contents of the data table.

According to an aspect of an embodiment of the present application, there is provided a method for processing a data table, the method including: comparing a first field in the first data table with a second field in the second data table; under the condition that the identification information of the first field and the identification information of the second field are different, processing information of the first field and processing information of the second field are obtained, wherein the processing information is used for recording a plurality of processing logics in a processing path of the corresponding field; comparing each processing logic of each corresponding field according to the processing path; and if the currently compared processing logics are not consistent, determining that the currently compared processing logics are logics with differences.

According to another aspect of the embodiments of the present application, there is also provided a device for processing a data table, the device including: the first comparison unit is used for comparing a first field in the first data table with a second field in the second data table; the information acquisition unit is used for acquiring the processing information of the first field and the processing information of the second field under the condition that the identification information of the first field and the identification information of the second field are compared to be different, wherein the processing information is used for recording a plurality of processing logics in the processing path of the corresponding field; the second comparison unit is used for comparing each processing logic of each corresponding field according to the processing path; and the difference positioning unit is used for determining the currently compared processing logic as the logic with the difference if the currently compared processing logic is inconsistent.

Further, the apparatus further comprises: a field determining unit, used for acquiring the identification information of the first field and determining a second field in the second data table which has the same identification information as the first field.

Further, the identification information includes a field name, wherein the first comparison unit includes: a first comparison module, used for comparing whether the field names of the first field and the second field are the same; and a first difference determining module, used for determining that the identification information of the first field and the second field differs if their field names are different.

Further, the identification information includes field metadata and processing logic, wherein the first comparison unit includes: a second comparison module, used for comparing whether the field metadata and the processing logic of the first field and the second field are the same; and a second difference determining module, used for determining that the identification information of the first field and the second field differs if their field metadata and processing logic are different.

Further, the apparatus further comprises: the information acquisition unit is used for acquiring the processing information of each field of each data table in the data tables to be analyzed before comparing the first field in the first data table with the second field in the second data table; the judging unit is used for judging whether each field is the field with the same identification information by using the processing logic in the processing information to obtain a judgment result; the statistical unit is used for counting the number of fields with the same identification information between every two data tables in the data table to be analyzed according to the judgment result; the calculating unit is used for calculating the similarity of the two data tables based on the number; and the table acquisition unit is used for acquiring a plurality of second data tables of which the similarity with the first data table meets a preset similarity condition.

With this embodiment, when a first field and a second field that should have the same identification information in the first data table and the second data table are found to differ, the processing logics of the two fields are compared automatically; any processing logic that differs is the cause of the difference between the fields. Thus, when fields that are supposed to be the same in two data tables differ, the cause of the difference can be located automatically from the processing logic of the corresponding fields, which improves processing accuracy. The application thereby solves the prior-art problem of low efficiency in comparing data table contents and improves the processing efficiency of data table comparison.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a block diagram of the hardware structure of a computer terminal for a data table processing method according to an embodiment of the present application;

FIG. 2 is a flow chart of a method of processing a data table according to an embodiment of the present application;

FIG. 3 is a first flowchart of an alternative method for processing a data table according to an embodiment of the present application;

FIG. 4 is a second flowchart of an alternative method for processing a data table according to an embodiment of the present application;

FIG. 5 is a flow chart of a processing method applied to a data table of scenario one according to an embodiment of the present application;

FIG. 6 is a flowchart of a processing method applied to a data table of scenario two according to an embodiment of the present application;

FIG. 7 is a flow chart of an alternative method of obtaining processing information for a data table field according to an embodiment of the present application;

FIG. 8 is a schematic diagram of an alternative data table processing apparatus according to an embodiment of the present application;

FIG. 9 is a schematic diagram of an alternative data table processing device according to an embodiment of the present application;

FIG. 10 is a network environment diagram of a computer terminal according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

First, terms related to embodiments of the present application are explained as follows, but the terms are not construed to limit the embodiments of the present application:

Online table: a table generated by the business system, whose data is written to the database as a result of business operations or business flow.

Blood margin of data (i.e., data lineage): data extracted from an online table can be calculated or processed to form new data, and the link between the online data and the new data is called the blood margin.

Processing path: records the name of each processing node (i.e., processing step), the order of the processing nodes, the links between the processing nodes, and the processing logic of each processing node during the processing of the data. In the embodiments of the present application, processing means performing calculation or logical processing operations on data extracted from the online table.

Processing logic: records the source field, the result field, the filter conditions, and the processing function (which may be a logical processing function) of a processing node.

Data caliber: the actual business meaning that the data represents.

Table similarity: whether the fields of two tables are the same is judged mainly from attribute information such as data source, data processing procedure, and data granularity, and the similarity of the tables is then calculated from the number of fields that are the same.

Quality score of table: measures the data quality of a table, mainly the integrity and reliability of its information.

Health score of table: measures how healthily a table is used, based on the table's storage and its consumption of computing resources.

Access heat of table: describes the number of times a table is used over a period of time; the more often the table is used, the hotter it is.
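The processing path and processing logic defined above can be pictured as a small data structure. The following is only a minimal sketch: the class names, attribute names, and the example table and column names are hypothetical and do not come from the application, and the links between processing nodes are assumed to be implied by their order.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ProcessingLogic:
    """Processing logic of one processing node (hypothetical layout)."""
    source_field: str
    result_field: str
    filter_condition: Optional[str]
    function: str


@dataclass(frozen=True)
class ProcessingNode:
    name: str
    logic: ProcessingLogic


@dataclass(frozen=True)
class ProcessingPath:
    """Ordered processing nodes; links are assumed to follow the tuple order."""
    nodes: Tuple[ProcessingNode, ...]

    def node_names(self) -> List[str]:
        return [node.name for node in self.nodes]


# Hypothetical example: an online amount is filtered, then summed per day.
path = ProcessingPath(nodes=(
    ProcessingNode("extract", ProcessingLogic(
        "orders.amount", "stg.amount", "status = 'paid'", "copy")),
    ProcessingNode("aggregate", ProcessingLogic(
        "stg.amount", "dw.daily_amount", None, "sum")),
))
print(path.node_names())
```

In this layout the result field of one node is the source field of the next, which is one simple way to represent the link between online data and the new data derived from it.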

Example 1

According to an embodiment of the present application, an embodiment of a method for processing a data table is also provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, such as one executing a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.

The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, FIG. 1 is a block diagram of the hardware structure of a computer terminal for a data table processing method according to an embodiment of the present application. As shown in FIG. 1, the computer terminal 10 may include one or more (only one is shown) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the electronic device. For example, the computer terminal 10 may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.

The memory 104 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the processing method of the data table in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implementing the processing method of the data table. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network Interface Controller (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

Under the operating environment, the application provides a method for processing a data table as shown in fig. 2. Fig. 2 is a flowchart of a processing method of a data table according to an embodiment of the present application.

As shown in fig. 2, the method may include the steps of:

Step S202: compare a first field in the first data table with a second field in the second data table;

Step S204: when the comparison shows that the identification information of the first field differs from that of the second field, obtain processing information of the first field and processing information of the second field, wherein the processing information records a plurality of processing logics in the processing path of the corresponding field, and the corresponding field is the first field or the second field;

Step S206: compare the processing logics of the corresponding fields one by one along the processing path;

Step S208: if the processing logics currently being compared are inconsistent, determine that they are the logics in which the difference lies.
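Steps S202 to S208 can be sketched roughly as follows. This is an illustrative sketch, not the application's implementation: the `Logic` tuple, the function names, and the handling of paths of unequal length are assumptions introduced here.

```python
from typing import List, NamedTuple, Optional


class Logic(NamedTuple):
    """Processing logic of one node (hypothetical field names)."""
    source_field: str
    result_field: str
    filter_condition: Optional[str]
    function: str


def locate_difference(path_a: List[Logic], path_b: List[Logic]) -> Optional[int]:
    """S206/S208: walk both processing paths in order and return the index
    of the first node whose logic differs, or None if the paths match.

    ASSUMPTION: if one path is a strict prefix of the other, the first
    unmatched node is reported as the difference.
    """
    for index, (a, b) in enumerate(zip(path_a, path_b)):
        if a != b:
            return index
    if len(path_a) != len(path_b):
        return min(len(path_a), len(path_b))
    return None


def process_fields(id_a: str, id_b: str,
                   path_a: List[Logic], path_b: List[Logic]) -> Optional[int]:
    """S202/S204: compare processing logic only when the identification
    information of the two fields has been found to differ."""
    if id_a == id_b:
        return None
    return locate_difference(path_a, path_b)


first = [Logic("orders.amount", "stg.amount", "status = 'paid'", "copy"),
         Logic("stg.amount", "dw.daily_amount", None, "sum")]
second = [Logic("orders.amount", "stg.amount", "status = 'paid'", "copy"),
          Logic("stg.amount", "dw.daily_amount", None, "avg")]
differing_node = process_fields("daily_amount", "daily_amt", first, second)
print(differing_node)  # 1: the second node uses a different function
```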

With this embodiment, when a first field and a second field that should have the same identification information in the data table to be analyzed are found to differ, the processing logics of the two fields are compared automatically; any processing logic that differs is the cause of the difference between the fields. Thus, when fields that are supposed to be the same in two data tables differ, the cause of the difference can be located automatically from the processing logic of the corresponding fields, which improves processing accuracy. The application thereby solves the prior-art problem of low efficiency in comparing data table contents and improves the processing efficiency of data table comparison.

The identification information is information for identifying a field; the identification information of a field points to that one field and may be, for example, a field name or field processing logic.

Step S202, comparing the first field in the first data table with the second field in the second data table, may be implemented by determining whether the fields with the same identification information in the data table to be analyzed still satisfy a predetermined comparison condition. If they still satisfy the condition, the method returns to step S202 and continues the comparison; if they no longer satisfy it, the fields with the same identification information in the data table to be analyzed have come to differ. In essence, such a difference arises because one of the fields has been subjected to further, new processing, and the processing logic that causes the difference can be located through steps S204 to S208.

The predetermined comparison condition in the above embodiment may be determined based on the comparison scenario, and may include any of the following: the field names are the same; the field processing logic is the same; or the field metadata and the processing logic are the same.

In the above embodiment, the processing logic in the processing path of the corresponding field is recorded in the processing information acquired in step S204, and the processing logic may include at least one of: the source field, the result field, the filter condition and the processing function of the corresponding processing node.

When the processing logics are compared, the processing logics can be compared through the source field, the result field, the filtering condition and the processing function of the processing node, and if the currently compared processing logics are not consistent, the currently compared processing logics are determined to be difference logics, so that the positions with differences can be located.
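As a rough illustration of comparing two processing logics component by component, the sketch below reports which of the four components differ. The class and function names and the sample values are hypothetical; the application does not prescribe this representation.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class ProcessingLogic:
    source_field: str
    result_field: str
    filter_condition: Optional[str]
    function: str


def differing_components(a: ProcessingLogic, b: ProcessingLogic) -> List[str]:
    """Return the names of the components on which the two logics disagree.
    An empty list means the two processing logics are consistent."""
    return [f.name for f in fields(ProcessingLogic)
            if getattr(a, f.name) != getattr(b, f.name)]


old = ProcessingLogic("stg.amount", "dw.daily_amount", "status = 'paid'", "sum")
new = ProcessingLogic("stg.amount", "dw.daily_amount",
                      "status in ('paid', 'refunded')", "sum")
print(differing_components(old, new))  # ['filter_condition']
```

Reporting the component rather than a bare yes or no narrows the located difference from the node down to, for example, a changed filter condition.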

In an alternative, before comparing the first field in the first data table with the second field in the second data table, the specifying information is obtained, wherein the specifying information is used for specifying the first field and the second field.

That is, the user may specify the comparison field of the target field in the data table to be analyzed, and determine the target field and the comparison field as the first field and the second field in the first data table and the second data table.

Through the embodiment, the fields needing data comparison can be directly specified. The scheme can be applied to data migration, data before and after the data migration can be monitored, whether the data migration is complete or not is verified, and under the condition that the data migration is not complete, automatic positioning of difference reasons is carried out by means of field blood margins.

In another alternative, before comparing the first field in the first data table with the second field in the second data table, the identification information of the first field is obtained, and the second field in the second data table having the same identification information as the first field is determined by using the identification information of the first field.

Specifically, the identification information (such as processing information) of the target field of the first data table is obtained, and a matching field is determined in another data table, i.e., the second field in the other data table whose identification information is the same as that of the target field.

In addition to specifying the comparison field directly, a matching field that is the same as the target field of the first data table may be derived from the field blood relationship (blood margin) and the blood relationship path of the target field, and located in another data table. In this case the processing paths of the target field and the matching field are identical, and the predetermined comparison rule (i.e., the predetermined comparison condition) is abstracted in the course of determining the matching field.

Optionally, the identification information being the same includes: the field names being the same, or the field metadata and the processing logic being the same (e.g., the same processing logic is used to process the same metadata into the fields with the same identification information).

In an alternative embodiment, the identification information includes a field name, wherein comparing the first field in the first data table with the second field in the second data table includes: comparing whether the field names of the first field and the second field are the same; and, if the field names are different, determining that the identification information of the first field and the second field differs.

In another alternative embodiment, the identification information includes field metadata and processing logic, wherein comparing the first field in the first data table with the second field in the second data table includes: comparing whether the field metadata and the processing logic of the first field and the second field are the same; and, if the field metadata and the processing logic are different, determining that the identification information of the first field and the second field differs.
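The two alternative forms of identification information can be sketched as two comparison modes. The `FieldInfo` layout, the mode names, and the sample values below are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FieldInfo:
    name: str
    metadata: Tuple[Tuple[str, str], ...]  # e.g. (("type", "decimal"),)
    logic: Tuple[str, ...]                 # one entry per processing node


def identification_differs(first: FieldInfo, second: FieldInfo, mode: str) -> bool:
    """Return True when the identification information of the fields differs.

    mode "name": the identification information is the field name.
    mode "metadata_and_logic": it is the field metadata plus processing logic.
    """
    if mode == "name":
        return first.name != second.name
    if mode == "metadata_and_logic":
        return first.metadata != second.metadata or first.logic != second.logic
    raise ValueError(f"unknown mode: {mode}")


a = FieldInfo("daily_amount", (("type", "decimal"),), ("copy", "sum"))
b = FieldInfo("daily_amt", (("type", "decimal"),), ("copy", "sum"))
print(identification_differs(a, b, "name"))                # True
print(identification_differs(a, b, "metadata_and_logic"))  # False
```

The same pair of fields can therefore count as different under one form of identification information and as the same under the other, which is why the comparison condition depends on the scenario.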

The comparison method of the embodiment of the present application is described in detail below with reference to FIG. 3, taking two data tables as an example. As shown in FIG. 3, the embodiment may be implemented by the following steps:

Step S301: start the data comparison mode.

Step S302: detect whether comparison data has been specified in the system for the target field.

If it is detected that comparison data has been specified for the target field, step S304 is executed; otherwise, step S303 is executed.

The user may specify a comparison field b (i.e., the comparison data) for the target field a in the first data table A (the target field a being the first field of the above embodiment), where the comparison field b belongs to the second data table B. When comparison data is specified for the target field, the system takes the target field a as the first field and the comparison field b as the second field.

Step S303: acquire a matching field according to the field blood margin of the target field.

Specifically, the system may determine the matching field that is the same as the target field from the field blood margin and the blood margin path of the field. If the field blood margin of some field is the same as that of the target field, the system determines that field to be a matching field of the target field.

Step S304: acquire a monitoring rule.

In the process of executing step S302 and step S303, the monitoring rule (i.e., the predetermined comparison rule) may be abstracted based on the blood relationship of the field, for example a predetermined comparison rule relating the processing logic and the online data to the offline field, or a rule that field names between tables are the same.

Step S305: it is determined whether the same field violates a monitoring rule.

If the same field (i.e. the first field and the second field) violates the monitoring rule, determining that the same field (i.e. the first field and the second field) is different, then executing step S306; if the same field does not violate the monitoring rule, the monitoring is continued.

For example, in the case that the monitoring rule is that the field names between tables are the same, if the field names are different, i.e., the rule is violated, the blood margin of the original field may be changed, and therefore the blood margin needs to be obtained again.

Step S306: acquire the blood margins of the first field and the second field.

Optionally, since the field blood margin may have changed, it may be recalculated.

Step S307: use the field blood margins to trace the path and locate the problem.

By comparing the blood margins before and after the change, and comparing the output of each step along the blood margin, a processing node whose information is inconsistent is automatically located as the problem node.

Through the above embodiment, data is cross-checked according to the blood relationships among the data (which may be recorded in the processing path). For example, comparison rules are configured between the first-layer source data (such as data extracted from an online table, or the data in the online table itself) and the end consumption data; once the rules are configured, a problem triggers an early warning, and the cause of the difference is located automatically by means of the field blood relationship.
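The flow of FIG. 3 can be approximated by the sketch below, which covers finding a matching field by blood margin (S303), checking a monitoring rule (S305), and locating the inconsistent node (S306 to S307). All names, the string representation of a blood margin, and the sample rule are assumptions for illustration; the application does not fix these details.

```python
from typing import Callable, Dict, Optional, Tuple

Lineage = Tuple[str, ...]  # simplified blood margin: one string per node


def find_matching_field(target_lineage: Lineage,
                        candidates: Dict[str, Lineage]) -> Optional[str]:
    """S303: return the candidate whose blood margin equals the target's."""
    for name, lineage in candidates.items():
        if lineage == target_lineage:
            return name
    return None


def same_field_name(target: str, comparison: str) -> bool:
    """Sample monitoring rule: field names between tables are the same.
    Fields are written as "table.field" in this sketch."""
    return target.split(".", 1)[1] == comparison.split(".", 1)[1]


def monitor_once(target: str, comparison: str,
                 rule: Callable[[str, str], bool],
                 get_lineage: Callable[[str], Lineage]) -> Optional[int]:
    """S305 to S307: if the rule is violated, re-read both blood margins and
    return the index of the first inconsistent processing node."""
    if rule(target, comparison):
        return None  # rule satisfied: nothing to locate, keep monitoring
    a, b = get_lineage(target), get_lineage(comparison)  # S306
    for index, (x, y) in enumerate(zip(a, b)):           # S307
        if x != y:
            return index
    return min(len(a), len(b)) if len(a) != len(b) else None


candidates = {"B.total": ("extract", "sum"),
              "B.amount": ("extract", "filter: paid", "sum")}
match = find_matching_field(("extract", "filter: paid", "sum"), candidates)

lineages = {"A.amount": ("extract", "filter: paid", "sum"),
            "B.amount_net": ("extract", "filter: paid or refunded", "sum")}
problem_node = monitor_once("A.amount", "B.amount_net",
                            same_field_name, lineages.__getitem__)
print(match, problem_node)  # B.amount 1
```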

Based on the above embodiment, the present application further provides a method for determining the similarity of data tables.

Specifically, before comparing the first field in the first data table and the second field in the second data table, the method may further include: acquiring processing information of each field of each data table in the data table to be analyzed, wherein the processing information of the field is at least used for recording each processing logic in the processing path of the corresponding field; judging whether each field is the field with the same identification information by using processing logic in the processing information to obtain a judgment result; counting the number of fields with the same identification information between every two data tables in the data tables to be analyzed according to the judgment result; calculating the similarity of every two data tables based on the number of fields with the same identification information between every two data tables; and acquiring a plurality of second data tables of which the similarity with the first data table meets the preset similarity condition.
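The counting and similarity steps can be sketched as follows. The application states only that the similarity is calculated from the number of fields with the same identification information; the ratio used here (same fields divided by distinct fields across both tables) and the threshold are assumptions, as are all names and sample tables.

```python
from itertools import combinations
from typing import Dict, List, Tuple

FieldLogic = Tuple[str, ...]   # processing logic along a field's path
Table = Dict[str, FieldLogic]  # field name -> processing logic


def count_same_fields(t1: Table, t2: Table) -> int:
    """Fields are judged the same when their processing logics are consistent."""
    return len(set(t1.values()) & set(t2.values()))


def similarity(t1: Table, t2: Table) -> float:
    # ASSUMPTION: ratio of same fields to distinct fields; not from the source.
    distinct = len(set(t1.values()) | set(t2.values()))
    return count_same_fields(t1, t2) / distinct if distinct else 0.0


def pairwise_similarity(tables: Dict[str, Table]) -> Dict[Tuple[str, str], float]:
    """Similarity between every two data tables to be analyzed."""
    return {(a, b): similarity(tables[a], tables[b])
            for a, b in combinations(sorted(tables), 2)}


def similar_tables(first: str, tables: Dict[str, Table],
                   threshold: float) -> List[str]:
    """Second data tables whose similarity to `first` meets the condition."""
    return [name for name in sorted(tables)
            if name != first
            and similarity(tables[first], tables[name]) >= threshold]


tables = {
    "A": {"amt": ("copy", "sum"), "cnt": ("copy", "count")},
    "B": {"amount": ("copy", "sum"), "n": ("copy", "count")},
    "C": {"amt": ("copy", "avg")},
}
print(similar_tables("A", tables, threshold=0.5))  # ['B']
```

Tables A and B share both fields despite different field names, because sameness is judged from processing logic here, while table C shares none.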

In an alternative embodiment, under the condition that the data granularity of the two fields is the same, if the source field of the first processing node of the two fields is the same and the result field of the last processing node is the same, the two fields are the fields with the same identification information.
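This endpoint rule is simple enough to state directly in code. The sketch below is illustrative; the `Node` tuple, the granularity strings, and the sample paths are hypothetical.

```python
from typing import NamedTuple, Sequence


class Node(NamedTuple):
    source_field: str
    result_field: str


def same_by_endpoints(path_a: Sequence[Node], path_b: Sequence[Node],
                      granularity_a: str, granularity_b: str) -> bool:
    """Two fields have the same identification information when their data
    granularity matches, the first nodes share a source field, and the last
    nodes share a result field."""
    if granularity_a != granularity_b or not path_a or not path_b:
        return False
    return (path_a[0].source_field == path_b[0].source_field
            and path_a[-1].result_field == path_b[-1].result_field)


short_path = [Node("q", "s"), Node("s", "m")]
long_path = [Node("q", "s"), Node("s", "t"), Node("t", "m")]
print(same_by_endpoints(short_path, long_path, "day", "day"))    # True
print(same_by_endpoints(short_path, long_path, "day", "month"))  # False
```

Because only the endpoints are inspected, paths with different numbers of intermediate nodes can still be judged the same under this rule.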

In the above method for determining similarity of data tables, in another alternative embodiment, it may be determined whether two fields are fields with the same identification information based on respective processing logics on the processing paths of the fields.

Specifically, the determining, by using the processing logic in the processing information, whether each field is a field with the same identification information, and the obtaining of the determination result may include: if the processing logics of the two fields are consistent, judging that the two fields are the fields with the same identification information; and if the two fields have different processing logics, judging that the two fields are fields with different identification information, wherein the judgment result comprises information of the fields with the same identification information and information of the fields with different identification information.

It is further noted that the processing logic may include the filter conditions, processing functions, source data, and result data in the corresponding processing paths.

Optionally, if all the information in the processing logic is consistent, the processing logic is consistent; if the source data is consistent but the result data is inconsistent in the processing logic, the processing logic must be inconsistent.

In another alternative, two fields a and b may be the same field even though their numbers of processing nodes differ. For example, the source of field a is q and the source of field b is also q; field a has 4 processing nodes and field b has 5; the first 3 processing nodes of the two fields are identical; the result of the fourth processing node of field a is m (i.e., the attribute value of field a), while the result of the fourth processing node of field b is n and the result of its fifth processing node is m. In this case the two fields are also fields with the same identification information.

With the above embodiment, when comparing the blood margins (i.e., the processing information) of two fields, one of the two fields may be taken as the reference field, i.e., the target field, and the other as the comparison field; the processing logic of each processing node of the target field is then compared with the processing logic in the processing path of the comparison field. For example, the target field has n processing nodes and the comparison field has m processing nodes. In the embodiments of the present application, the source field is the source data and the result field is the result data.

Optionally, the field names may be compared first. If the field names of the two fields are different, the two fields are fields with different identification information. If the field names are the same, the processing logic of the first processing node of each field is compared, for example by checking the source data of the first processing node (processing node 1) of the two fields; if the source data of processing node 1 is inconsistent, the two fields are different fields.

Further, to ensure the accuracy of the fields determined to have the same identification information, lineage verification is performed on the intermediate processing nodes. If the source data of the first processing node of the two fields is consistent, the result data of processing node x of the target field can be compared with the result data in each processing logic of the comparison field; if the result of the y-th processing node of the comparison field is consistent with the result data of processing node x, the remaining (n - x) processing nodes of the target field are verified against the processing logic of the remaining (m - y) processing nodes of the comparison field.

An embodiment of data cleansing is detailed below with reference to fig. 4. As shown in fig. 4, the embodiment may include the following steps:

Step S401: Acquire the processing information of each field of each data table among the data tables to be analyzed, wherein the processing information of a field is at least used for recording each processing logic in the processing path of the corresponding field.

Step S402: Judge, by using the processing logic in the processing information, whether the fields are fields with the same identification information, to obtain a judgment result.

Optionally, when the data granularities of two fields are the same, if the source field of the first processing node of the two fields is the same and the result field of the last processing node is the same, the two fields are fields with the same identification information; otherwise, the two fields are fields with different identification information.

Step S403: Count, according to the judgment result, the number of fields with the same identification information between every two data tables among the data tables to be analyzed.

Step S404: Calculate the similarity of every two data tables based on the number of fields with the same identification information between them.

After the number of fields with the same identification information between two data tables to be analyzed is acquired, this step can be specifically implemented as follows:

The similarity P of every two data tables is calculated according to the following formula:

P = Y × 2 / (M + N), where, in this embodiment, Y indicates the number of fields having the same identification information between the two data tables, M indicates the number of fields of one of the two data tables, and N indicates the number of fields of the other of the two data tables.
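As a worked illustration of the similarity formula P = Y × 2 / (M + N), the following Python sketch computes P for two tables. The function name `table_similarity` and the sample counts are hypothetical and chosen only for the example.

```python
def table_similarity(same_fields: int, fields_m: int, fields_n: int) -> float:
    """Similarity P = Y * 2 / (M + N).

    same_fields (Y): number of fields with the same identification
                     information between the two tables.
    fields_m (M), fields_n (N): field counts of the two tables.
    """
    if fields_m + fields_n == 0:
        raise ValueError("at least one table must contain fields")
    return same_fields * 2 / (fields_m + fields_n)


# Two tables with 10 fields each, 9 of which match: P = 18 / 20.
print(table_similarity(9, 10, 10))
```

When every field of both tables matches (Y = M = N), the formula gives P = 1; when no field matches, it gives P = 0.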

The similarity of any two data tables can be calculated through the method, and the processing method of the similarity can be applied to scenes of data recommendation and data purification.

After obtaining a plurality of second data tables whose similarity to the first data table meets a preset similarity condition, the method may further include: sorting the plurality of second data tables according to the health attribute and the quality attribute to obtain reverse ordering information, wherein the health attribute is used for representing the resource consumption value of a data table, and the quality attribute is at least used for representing the information completeness and reliability of a data table.

The preset similarity condition includes: the similarity is greater than a preset threshold, and the data table ranks among the top N when the data tables similar to the first data table are sorted by similarity.

For example, after the similarity between every two data tables is determined by the above scheme, each data table whose similarity to the first data table is greater than the preset threshold (e.g., 90%) is taken as a second data table. The health attribute (e.g., the health score of the table) and the quality attribute (e.g., the quality score of the table) of each second data table are obtained, and the second data tables are sorted according to the health score and the quality score (a weighted combination of the health score and the quality score can be used as the sorting score of the table) to obtain the ordering information of the second data tables. The data tables ranked in the top positions of the ordering information are those that are highly similar to the first data table and have better quality and health.
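The filtering and ranking just described can be sketched as follows. This is a minimal sketch, not the application's implementation: the `Candidate` class, the `rank_candidates` function, the equal weights, and the sample scores are all assumptions made for illustration; the description only states that a weighted result of the health score and quality score may serve as the sorting score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A candidate second data table (hypothetical structure)."""
    name: str
    similarity: float  # similarity to the first data table, 0..1
    health: float      # health score of the table
    quality: float     # quality score of the table


def rank_candidates(candidates: List[Candidate],
                    threshold: float = 0.9,
                    health_weight: float = 0.5,
                    quality_weight: float = 0.5) -> List[Candidate]:
    """Keep tables whose similarity exceeds the threshold, then sort them
    from highest to lowest weighted health/quality score."""
    kept = [c for c in candidates if c.similarity > threshold]
    return sorted(
        kept,
        key=lambda c: health_weight * c.health + quality_weight * c.quality,
        reverse=True,
    )


tables = [
    Candidate("t1", similarity=0.95, health=80, quality=70),   # score 75.0
    Candidate("t2", similarity=0.99, health=60, quality=95),   # score 77.5
    Candidate("t3", similarity=0.50, health=100, quality=100), # filtered out
]
ranked = rank_candidates(tables)
print([c.name for c in ranked])
```

The weights are exposed as parameters because the description leaves the weighting open; any other monotone combination of the two scores could be substituted.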

The data processing method can be applied to the following scenarios:

before the processing information of each field of each data table among the data tables to be analyzed is obtained, a push request for obtaining tables similar to a first data table is received, and the data tables to be analyzed are obtained based on the push request, wherein the data tables to be analyzed include the first data table; that is, the method can be applied to a data push scenario;

a processing task for processing data is received, an identifier of a first data table is extracted from the processing task, and the data tables to be analyzed are obtained by using the identifier of the first data table; that is, the method can be applied to replacing the data table in a task;

a cleaning task for cleaning the first data table is received, and the data tables to be analyzed are obtained based on the cleaning task; that is, the method can be applied to data cleaning.

Specifically, after obtaining the reverse ordering information, the method may further include: when the push request has been received, using the reverse ordering information as the push information responding to the push request; when the processing task has been received, replacing the first data table in the processing task with the top-ranked second data table in the reverse ordering information; and when the cleaning task has been received, merging the first q second data tables in the reverse ordering information with the first data table, wherein q is a natural number.
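The three ways of using the reverse ordering information can be sketched as a small dispatch function. Everything here is hypothetical: the function name `apply_ordering`, the request-kind strings, and the dictionary-shaped return values are illustrative stand-ins, and table names are used in place of real tables.

```python
from typing import Dict, List


def apply_ordering(kind: str, first_table: str,
                   ranked: List[str], q: int = 1) -> Dict[str, object]:
    """Decide what to do with the ranked second data tables.

    kind: "push", "processing", or "cleaning" (assumed labels).
    ranked: second data tables, best first (the reverse ordering information).
    q: how many top-ranked tables to merge in the cleaning case.
    """
    if kind == "push":
        # The ranking itself is returned as the push information.
        return {"push": list(ranked)}
    if kind == "processing":
        # Replace the first data table with the top-ranked second table.
        replacement = ranked[0] if ranked else None
        return {"replace": first_table, "with": replacement}
    if kind == "cleaning":
        # Merge the first q second tables with the first data table.
        return {"merge": [first_table] + ranked[:q]}
    raise ValueError(f"unknown request kind: {kind}")


ranked_tables = ["t2", "t1", "t4"]
print(apply_ordering("push", "t0", ranked_tables))
print(apply_ordering("processing", "t0", ranked_tables))
print(apply_ordering("cleaning", "t0", ranked_tables, q=2))
```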

The data push application scenario is described in detail below with reference to fig. 5.

As shown in fig. 5, this embodiment may include the steps of:

Step S501: Acquire the data table name in the push request.

Step S502: Acquire the lineage of each field in the data table according to the data table name.

Step S503: Calculate the fields with the same identification information between tables according to the lineage of the fields.

The way of calculating the fields with the same identification information is consistent with the implementation in the above embodiments, and is not described herein again.

Step S504: Calculate the similarity of two data tables according to the number of fields with the same identification information between the two data tables.

Step S505: Recommend tables in descending order according to the similarity, the health score and the quality score.

The processing manner of this step is the same as that in the above embodiment, and is not described herein again.

In the above embodiment, the similarity of two tables is calculated based on the number of fields with the same identification information: similarity = (number of fields with the same identification information × 2) / (number of fields in table A + number of fields in table B). When a user searches for a table, the tables whose similarity is greater than a threshold are recommended to the user in descending order of quality score and health score.

Through this embodiment, tables with high similarity can be found and ranked according to health score and quality score, so that better data is recommended to consumers. Through the consumers' choices, data of the same type that has few downstream applications can be gradually cleaned up, achieving intelligent optimization of data application.

The application scenario of replacing the data table in a processing task is described in detail below with reference to fig. 6.

Step S601: Acquire the data table name in the processing task request.

The data table name in all embodiments of the present application may be an ID.

Step S602: Acquire the lineage of each field in the data table according to the data table name.

Step S603: Calculate the fields with the same identification information between tables according to the lineage of each field.

The manner of determining the fields with the same identification information is consistent with the implementation manner in the above embodiments, and is not described herein again.

Step S604: Calculate the similarity of two data tables according to the number of fields with the same identification information between the two data tables.

A table whose similarity is greater than a certain threshold may be used as a replacement table for replacing the data table in the task.

Step S605: Recommend tables in descending order according to the health score and the quality score.

Step S606: whether all tasks are traversed.

If so, the process is terminated, otherwise, the process returns to step S602.

And replacing the data table in the task with the data table with high health score and high quality score in the replacement table with high similarity.

Through this embodiment, the similarity between every two data tables can be used to determine, for the tables and fields referenced by the tasks, whether a replacement table with a higher quality score and health score exists, so that users can be guided to use a better table.

The application also provides a scheme for periodically cleaning tables with low access frequency; its specific processing is consistent with the processing described above. Through this application scenario, storage and computing resources can be released and the data architecture optimized, for example by merging tables with high similarity and making them compatible (compatibility can be achieved through table joins). For example, suppose the similarity of the first data table and the second data table is 99%, which is greater than the preset threshold of 90%. If the health score and the quality score of the second data table are both higher than those of the first data table, the second data table can replace the first data table; likewise, if the evaluation score determined from the health score and the quality score of the second data table is greater than that of the first data table, the second data table can replace the first data table. Alternatively, in either case, instead of replacing the first data table with the second data table, the two tables can be joined and the join result used in place of both the first data table and the second data table.

Specifically, when the fields of a table referenced in an existing task are the same as those of other tables, the other tables can be used as replacements. Users are prompted to switch to a better table according to the health score and quality score of the tables, so that data of the same type with few downstream applications can be gradually cleaned up, achieving intelligent optimization of data application.

In the prior art, when homologous tables are cleaned synchronously, only one layer of lineage is used. That is, in the process of extracting data from online to offline, only whether the online tables from which the offline data is extracted are the same is judged; tables that have the same source and were extracted repeatedly are thereby identified, one of them is retained, and the remaining tables are taken offline.

In the present application, the fields with the same identification information are determined through the lineage of the fields in the data tables, and the similarity of two data tables is determined based on the number of fields with the same identification information, rather than judging two tables to be the same table simply because they share the same source table. The judgment method used by the application compares and analyzes the processing of the fields in the two data tables, and can therefore distinguish homologous data tables that record different content even though their sources are the same.

The obtaining of the processing information in the embodiment of the present application may include: analyzing the source table of each processing node in the processing path of the corresponding field by using the processing code of the data table where the corresponding field is located until the source table is an extraction table of the online table; recording the processing logic of each processing node, wherein the processing logic comprises: the source field and the result field, and the processing logic also comprises a filtering condition and/or a processing function.

It should be noted that, in any embodiment of the present application, the field lineage (blood relationship) of a field, that is, the processing information of the field, can be determined in the above manner.

An embodiment of the present application is described in detail below with reference to fig. 7, and as shown in fig. 7, the embodiment may include the following steps:

Step S701: Input a table field.

In the embodiment of the present application, the following operation needs to be performed once for each field in the table.

Step S702: a primary key of the table is determined based on the table field.

For example, if the data table is an order table that records the order number, the purchaser and other information, the primary key of the data table can be determined from the number of records in the data table and the number of distinct entries of each field. If the number of entries of a field is consistent with the number of records in the data table, the field is a primary key of the data table. If 100 orders are recorded in the order table described above, with 100 order numbers but only 60 purchasers, the order number field is the primary key of the order table.
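The primary key check in the order table example can be sketched as follows. This sketch reads "number of entries of a field" as the number of distinct values of that field, which is an interpretation rather than something the description states explicitly; the function name `find_primary_keys` and the sample rows are hypothetical.

```python
from typing import Dict, Hashable, List


def find_primary_keys(rows: List[Dict[str, Hashable]]) -> List[str]:
    """Return the fields whose number of distinct values equals the
    number of records, i.e. the candidate primary keys of the table."""
    if not rows:
        return []
    total = len(rows)
    return [
        field for field in rows[0]
        if len({row[field] for row in rows}) == total
    ]


# A small order table: every order number is unique, purchasers repeat.
orders = [
    {"order_no": 1001, "purchaser": "alice"},
    {"order_no": 1002, "purchaser": "bob"},
    {"order_no": 1003, "purchaser": "alice"},
    {"order_no": 1004, "purchaser": "carol"},
    {"order_no": 1005, "purchaser": "bob"},
]
print(find_primary_keys(orders))
```

On a small sample a non-key field can be unique by coincidence, so in practice the check would be run over the full table.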

Step S703: record the source table of the upper layer of the primary key.

The processing code of the data table can be obtained, the source table of the upper layer of the data table can be analyzed from the processing code of the data table, and similarly, the source field of each field can also be read from the processing code of the data table.

Step S704: the filter condition for the processing node is recorded.

In the process of processing the data table, filtering the data table may be involved, the filtering condition based on the source table of the previous layer is read from the processing code, and the association table corresponding to the filtering condition of the processing node is obtained.

The filtering conditions in the embodiments of the present application may include both direct field filtering and join filtering between tables.

Step S705: Judge whether the association table of the processing node filters data.

If the association table of the processing node filters the data, executing step S706; and if the association table of the processing node does not filter the data, judging that the association table is an extraction table, and ending.

The extraction tables in the embodiment of the present application are tables generated from data extracted from online tables.

Step S706: the association table and the fields in the table are recorded.

Step S707: the processing function on the field is recorded.

Step S708: Judge whether the source table of the previous layer and the association table are both extraction tables of the online table.

If yes, the lineage analysis is complete; if not, the process returns to step S703.

In the above-described embodiments of the present application, fields of the same attribute refer to fields whose identification information is the same. The field lineage analysis may specifically start from a field of a table, determine the primary key of the table, and analyze the source table of the layer above the primary key field (which may include the source field), the result field, the filtering conditions (including direct field filtering and join filtering between tables), and the function used (i.e., the processing function in the above embodiment). If the source table of the previous layer is not an extraction table of the online table, or the association table in the filtering condition is not an extraction table of the online table, the analysis continues tracing upstream in the manner shown in fig. 7 until all upstream tables and all tables in the filtering conditions are extraction tables of the online table. The path traced in each step is recorded to generate the processing information.
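The upstream tracing of fig. 7 can be sketched as a worklist traversal. This is an illustrative sketch only: the `Step` record, the `trace_lineage` function, and the table names are hypothetical, and the sketch assumes that the processing code has already been parsed into one `Step` per derived table, which the description does not detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Step:
    """Processing logic recorded for one derived table (hypothetical)."""
    result_table: str
    source_table: str
    filter_condition: Optional[str] = None
    association_table: Optional[str] = None  # table joined in the filter
    function: Optional[str] = None           # processing function applied


def trace_lineage(table: str, steps: Dict[str, Step],
                  extraction_tables: Set[str]) -> List[Step]:
    """Trace upstream from `table`, recording each step, until every source
    table and association table reached is an extraction table."""
    path: List[Step] = []
    pending = [table]
    visited: Set[str] = set()
    while pending:
        current = pending.pop()
        if current in extraction_tables or current in visited:
            continue  # reached an extraction table, or already recorded
        if current not in steps:
            raise KeyError(f"no processing code known for table {current!r}")
        visited.add(current)
        step = steps[current]
        path.append(step)
        # Push the association table first so the source table is traced next.
        if step.association_table is not None:
            pending.append(step.association_table)
        pending.append(step.source_table)
    return path


steps = {
    "report": Step("report", "orders_clean", function="SUM(amount)"),
    "orders_clean": Step("orders_clean", "ods_orders",
                         filter_condition="status = 'paid'",
                         association_table="dim_users"),
    "dim_users": Step("dim_users", "ods_users"),
}
lineage = trace_lineage("report", steps, {"ods_orders", "ods_users"})
print([s.result_table for s in lineage])
```

The `visited` set is a safeguard added in this sketch so that a table reachable through both a source and an association is recorded once and a cyclic definition cannot loop forever.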

Through the above embodiment, not only can data differences be located automatically, but data can also be purified intelligently. By recommending to consumers data that is highly similar but has better system performance, a mechanism is formed in which better data displaces worse data: data with poor system performance is used less and less and can eventually be taken offline, which reduces the existing data storage and optimizes the data architecture.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.

Example 2

According to an embodiment of the present application, there is also provided a data table processing apparatus for implementing the above data table processing method, as shown in fig. 8, the apparatus includes: a first comparing unit 81, an information acquiring unit 83, a second comparing unit 85, and a difference locating unit 87.

The first comparison unit is used for comparing a first field in the first data table with a second field in the second data table;

the information acquisition unit is used for acquiring the processing information of the first field and the processing information of the second field under the condition that the identification information of the first field and the identification information of the second field are compared to be different, wherein the processing information is used for recording a plurality of processing logics in the processing path of the corresponding field;

the second comparison unit is used for comparing each processing logic of each corresponding field according to the processing path;

and the difference positioning unit is used for determining the currently compared processing logic as the logic with the difference if the currently compared processing logic is inconsistent.
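The cooperation of the second comparing unit and the difference locating unit can be sketched as a node-by-node comparison of two processing paths. This is a minimal sketch under stated assumptions: the `ProcessingLogic` record and the `locate_difference` function are hypothetical names, and both paths are assumed to be listed in the same order along the processing path.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProcessingLogic:
    """Processing logic of one processing node (hypothetical structure)."""
    source_field: str
    result_field: str
    filter_condition: str = ""
    function: str = ""


def locate_difference(path_a: List[ProcessingLogic],
                      path_b: List[ProcessingLogic]) -> Optional[int]:
    """Compare the two paths node by node and return the index of the first
    inconsistent processing logic, or None if the paths are consistent."""
    for index, (logic_a, logic_b) in enumerate(zip(path_a, path_b)):
        if logic_a != logic_b:
            return index
    if len(path_a) != len(path_b):
        # One path has extra processing nodes after the common prefix.
        return min(len(path_a), len(path_b))
    return None


first_field = [
    ProcessingLogic("amount", "amount_paid", "status = 'paid'"),
    ProcessingLogic("amount_paid", "total", function="SUM"),
]
second_field = [
    ProcessingLogic("amount", "amount_paid", "status = 'paid'"),
    ProcessingLogic("amount_paid", "total", function="AVG"),
]
print(locate_difference(first_field, second_field))
```

Because the dataclass compares all four attributes, a mismatch in the source field, result field, filtering condition, or processing function each counts as an inconsistent processing logic.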

The identification information is information used to identify a field, that is, information that points to a specific field, such as a field name or the processing logic of the field.

The comparison between the first field in the first data table and the second field in the second data table may be implemented by determining whether the fields with the same identification information in the data tables to be analyzed always satisfy a predetermined comparison condition; if they always satisfy the predetermined comparison condition, the operation of comparing the first field in the first data table with the second field in the second data table is performed again. In essence, a difference arises because a field with the same identification information has undergone further processing, and the above apparatus can locate the processing logic that causes the difference between the fields with the same identification information.

In the above embodiment, the obtained processing information of the corresponding field has a processing logic in the processing path of the corresponding field recorded therein, and the processing logic may include at least one of: the source field, the result field, the filter condition and the processing function of the corresponding processing node.

In another alternative embodiment, a field determining unit is configured to obtain identification information (e.g., processing information) of the first field, and to determine a second field in the second data table that has the same identification information as the first field.

In addition to specifying the comparison fields, a matching field whose identification information is identical to that of the target field of the first data table may be derived from the field lineage and the lineage path of the target field. The matching field may be located in another data table; in this case, the processing paths of the target field and the matching field are identical, and the predetermined comparison rule (i.e., the predetermined comparison condition) is extracted through the step of determining the matching field.

The identification information includes a field name, wherein the first comparing unit includes: a first comparison module, used for comparing whether the field names of the first field and the second field are the same; and a first difference determining module, used for determining that the identification information of the first field and the second field differs if the field names of the first field and the second field are different.

In an alternative embodiment, the identification information includes field metadata and processing logic, wherein the first comparing unit includes: a second comparison module, used for comparing whether the field metadata and the processing logic of the first field and the second field are the same; and a second difference determining module, used for determining that the identification information of the first field and the second field differs if the field metadata and the processing logic of the first field and the second field are different.

In an optional embodiment, the apparatus further comprises: and the field specifying unit is used for acquiring the specifying information before comparing the first field in the first data table with the second field in the second data table, wherein the specifying information is used for specifying the first field and the second field.

According to the above-mentioned embodiment of the present application, the apparatus may further include, as shown in fig. 9: an information obtaining unit 91, configured to obtain, before the first field in the first data table is compared with the second field in the second data table, processing information of each field of each data table among the data tables to be analyzed, where the processing information of a field is at least used to record each processing logic in the processing path of the corresponding field; a judging unit 93, configured to judge, by using the processing logic in the processing information, whether the fields are fields with the same identification information, so as to obtain a judgment result; a counting unit 95, configured to count, according to the judgment result, the number of fields with the same identification information between every two data tables among the data tables to be analyzed; a calculating unit 97, configured to calculate the similarity between every two data tables based on the number; and a table acquiring unit 99, configured to acquire a plurality of second data tables whose similarity to the first data table meets a preset similarity condition.

Through this embodiment, mutual verification is performed according to the lineage among data. For example, comparison rules are configured between the first-layer source data (for example, data extracted from an online table, or data in the online table) and the end consumption data, such as rules between fields with the same identification information; an early warning is issued when a problem is found, and the cause of the difference is located automatically by means of the lineage of the fields.

Based on the above embodiment, the application also provides a device for determining similarity of data tables.

Specifically, the judgment unit includes: the first judgment module is used for judging that the two fields are fields with the same identification information if the processing logics of the two fields are consistent; and the second judging module is used for judging that the two fields are fields with different identification information if the two fields have different processing logics, wherein the judgment result comprises information of the fields with the same identification information and information of the fields with different identification information.

Specifically, the calculating unit is specifically configured to calculate the similarity P of every two data tables according to the following formula:

P = Y × 2 / (M + N), wherein Y indicates the number of fields with the same identification information between the two data tables, M indicates the number of fields of one of the two data tables, and N indicates the number of fields of the other of the two data tables.

According to the above-mentioned embodiments of the present application, the apparatus may further include: and the sorting unit is used for sorting the second data tables according to the health attribute and the quality attribute after acquiring the second data tables of which the similarity with the first data table meets the preset similarity condition to obtain reverse sorting information, wherein the health attribute is used for representing the resource consumption value of the data tables, and the quality attribute is at least used for representing the information integrity and reliability of the data tables.

Further, the device further comprises a receiving unit, configured to receive at least one of the following before obtaining the processing information of each field of each of the data tables to be analyzed: receiving a pushing request for acquiring a similar table of a first data table, and acquiring a data table to be analyzed based on the pushing request, wherein the data table to be analyzed comprises the first data table; receiving a processing task for processing data, extracting an identifier of a first data table from the processing task, and acquiring a data table to be analyzed by using the identifier of the first data table; and receiving a cleaning task for cleaning the first data table, and acquiring the data table to be analyzed based on the cleaning task.

It should be further noted that the apparatus further includes an information output unit, configured to output information in one of the following manners after the reverse ordering information is obtained: when the push request has been received, using the reverse ordering information as the push information responding to the push request; when the processing task has been received, replacing the first data table in the processing task with the top-ranked second data table in the reverse ordering information; and when the cleaning task has been received, merging the first q second data tables in the reverse ordering information with the first data table, wherein q is a natural number.

Through this embodiment, tables with high similarity can be found and ranked according to health score and quality score, so that better data is recommended to consumers; through the consumers' choices, data of the same type with few downstream applications can be gradually cleaned up and taken offline, achieving intelligent optimization of data application. In addition, the similarity between every two data tables can be used to determine, for the tables and fields referenced by the tasks, whether a replacement table with a higher quality score and health score exists, so that users can be guided to use a better table.

Specifically, the information acquisition unit includes: the analysis module is used for analyzing the source table of each processing node in the processing path of the corresponding field by using the processing code of the data table where the corresponding field is located until the source table is an extraction table of the online table; the recording module is used for recording the processing logic of each processing node, wherein the processing logic comprises: the source field and the result field, and the processing logic also comprises a filtering condition and/or a processing function.

In the above embodiment of the present application, the field lineage analysis may specifically start from a field of a table, determine the primary key of the table, and analyze the source table of the layer above the primary key field (which may include the above source field), the result field, the filtering conditions (including direct field filtering and join filtering between tables), and the function used (i.e., the processing function in the above embodiment). If the source table of the previous layer is not an extraction table of the online table, or the association table in the filtering condition is not an extraction table of the online table, the analysis continues tracing upstream in the manner shown in fig. 7 until all upstream tables and all tables in the filtering conditions are extraction tables of the online table. The path traced in each step is recorded to generate the processing information.

It should be noted that, the modules or units in the above embodiments of the present application are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the above-mentioned units may be executed in the terminal provided in the first embodiment as a part of the apparatus, and may be implemented by software or hardware.

It should be noted that, as is clear to those skilled in the art, for convenience and brevity of description, the specific working process and description of the processing apparatus of the data table described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

Example 3

The embodiment of the application can provide a computer terminal, and the computer terminal can be any one computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.

Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.

In this embodiment, the computer terminal may perform the following steps in the processing method of the data table:

comparing a first field in the first data table with a second field in the second data table; under the condition that the identification information of the first field and the identification information of the second field are different, processing information of the first field and processing information of the second field are obtained, wherein the processing information is used for recording a plurality of processing logics in a processing path of the corresponding field; comparing each processing logic of each corresponding field according to the processing path; and if the currently compared processing logics are not consistent, determining that the currently compared processing logics are logics with differences.
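As a non-limiting illustration, the steps above can be sketched in Python. The sketch assumes that the identification information of each field and the processing logics along its processing path are already available as plain values. The names `locate_difference`, `IDENTIFICATION`, and `PROCESSING_INFO`, and all sample tables, fields, and values, are hypothetical. The sketch stops at the first inconsistent pair of processing logics, which is a simplification chosen here.

```python
from itertools import zip_longest

# Hypothetical identification information for each (table, field).
IDENTIFICATION = {
    ("table_a", "gmv"): "sum-of-paid-amount",
    ("table_b", "gmv"): "avg-of-paid-amount",
}

# Hypothetical processing information: for each (table, field), the processing
# logics along its processing path, each written as
# (source_field, result_field, filter_condition, processing_function).
PROCESSING_INFO = {
    ("table_a", "gmv"): [
        ("ods_orders.amount", "mid_orders.amount", "status = 'paid'", None),
        ("mid_orders.amount", "gmv", None, "SUM"),
    ],
    ("table_b", "gmv"): [
        ("ods_orders.amount", "mid_orders.amount", "status = 'paid'", None),
        ("mid_orders.amount", "gmv", None, "AVG"),
    ],
}


def locate_difference(first, second, identification, processing_info):
    """Return (step_index, first_logic, second_logic) for the first pair of
    inconsistent processing logics, or None if no difference is found."""
    # Compare the two fields by their identification information.
    if identification[first] == identification[second]:
        return None
    # The fields differ: obtain the recorded processing path of each field.
    first_path = processing_info[first]
    second_path = processing_info[second]
    # Compare the processing logics pairwise along the path. zip_longest pads
    # the shorter path with None, so a missing step also counts as a difference.
    for index, (a, b) in enumerate(zip_longest(first_path, second_path)):
        if a != b:
            return index, a, b
    return None


result = locate_difference(
    ("table_a", "gmv"), ("table_b", "gmv"), IDENTIFICATION, PROCESSING_INFO)
print(result)
```

With the sample data, the first step of both paths is identical and the second step differs only in the processing function, so the sketch reports the second step as the logic with the difference.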

Optionally, fig. 10 is a network environment diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 10, the computer terminal 101 may be connected to the server 102 via a network, and the computer terminal may include one or more processors (only one is shown) and a memory as shown in fig. 1.

The memory may be configured to store software programs and modules, such as the program instructions/modules corresponding to the data table processing method and apparatus in the embodiments of the present application. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the data table processing method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and such remote memory may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

It can be understood by those skilled in the art that the structure shown in fig. 10 is only illustrative, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 10 does not limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 10, or have a different configuration than shown in FIG. 10.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

Example 4

Embodiments of the present application also provide a storage medium. Optionally, in this embodiment, the storage medium may be configured to store the program code for executing the data table processing method provided in the first embodiment.

Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: comparing a first field in the first data table with a second field in the second data table; under the condition that the identification information of the first field and the identification information of the second field are different, processing information of the first field and the second field is obtained, wherein the processing information is used for recording a plurality of processing logics in a processing path of the corresponding field; comparing each processing logic of each corresponding field according to the processing path; and if the currently compared processing logics are not consistent, determining that the currently compared processing logics are logics with differences.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.

The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as falling within the protection scope of the present application.

Claims

1. A method for processing a data table, comprising:

comparing a first field in the first data table with a second field in the second data table;

under the condition that the identification information of the first field and the identification information of the second field are different, processing information of the first field and processing information of the second field are obtained, wherein the processing information is used for recording a plurality of processing logics in a processing path of the corresponding field;

comparing each processing logic of each corresponding field according to the processing path;

if the currently compared processing logics are not consistent, determining that the currently compared processing logics are logics with differences;

prior to comparing the first field in the first data table and the second field in the second data table, the method further comprises: acquiring processing information of each field of each data table in the data table to be analyzed; judging whether each field is the field with the same identification information by using the processing logic in the processing information to obtain a judgment result; counting the number of fields with the same identification information between every two data tables in the data tables to be analyzed according to the judgment result; calculating the similarity of the two data tables based on the number; and acquiring a plurality of second data tables of which the similarity with the first data table meets a preset similarity condition.

2. The method of claim 1, wherein prior to comparing the first field in the first data table to the second field in the second data table, the method further comprises:

and acquiring the identification information of the first field, and determining a second field which has the same identification information as the first field in the second data table.

3. The method of claim 2, wherein the identification information comprises a field name, and wherein comparing the first field in the first data table to the second field in the second data table comprises:

comparing whether the field names of the first field and the second field are the same;

and if the field names of the first field and the second field are different, determining by comparison that the identification information of the first field and the identification information of the second field are different.

4. The method of claim 2, wherein the identification information comprises field metadata and processing logic, and wherein comparing the first field in the first data table to the second field in the second data table comprises:

comparing whether the field metadata and the processing logic of the first field and the second field are the same;

and if the field metadata and the processing logic of the first field and the second field are different, determining by comparison that the identification information of the first field and the identification information of the second field are different.

5. The method of claim 1, wherein determining whether the fields are fields with the same identification information by using the processing logic in the processing information comprises:

if each processing logic of the two fields is consistent, judging that the two fields are the fields with the same identification information;

and if the two fields have different processing logics, judging that the two fields are fields with different identification information.

6. The method according to claim 1, wherein after obtaining a plurality of second data tables whose similarity to the first data table meets a preset similarity condition, the method further comprises:

sorting the plurality of second data tables according to the health attribute and the quality attribute to obtain reverse ordering information,

wherein the health attribute is used for representing the resource consumption value of the data table, and the quality attribute is at least used for representing the information integrity and reliability degree of the data table.

7. The method of claim 6, wherein prior to obtaining the processing information for the respective field of each of the data tables to be analyzed, the method further comprises at least one of:

receiving a push request for acquiring a similar table of the first data table, and acquiring the data table to be analyzed based on the push request, wherein the data table to be analyzed comprises the first data table;

receiving a processing task for processing data, extracting the identifier of the first data table from the processing task, and acquiring the data table to be analyzed by using the identifier of the first data table;

and receiving a cleaning task for cleaning the first data table, and acquiring the data table to be analyzed based on the cleaning task.

8. The method of claim 7, wherein after obtaining the reverse ordering information, the method further comprises:

taking the reverse ordering information as the push information responding to the push request under the condition of receiving the push request;

under the condition that the processing task is received, replacing the first data table in the processing task with the second data table ranked first in the reverse ordering information;

and under the condition of receiving the cleaning task, combining the first q second data tables and the first data table in the reverse ordering information, wherein q is a natural number.

9. The method according to any one of claims 1 to 8, wherein acquiring the processing information of the first field and the processing information of the second field comprises:

analyzing a source table of each processing node in a processing path of a corresponding field by using a processing code of a data table where the corresponding field is located until the source table is an extraction table of an online table;

recording the processing logic of each processing node, wherein the processing logic comprises: a source field and a result field, and a filter condition and/or a processing function are also included in the processing logic.

10. A data table processing apparatus, comprising:

an information acquisition unit, configured to acquire processing information of the first field and processing information of the second field under the condition that the identification information of the first field and the identification information of the second field are found by comparison to be different, wherein the processing information is used for recording a plurality of processing logics in a processing path of the corresponding field;

a second comparing unit, configured to compare the processing logics of the corresponding fields according to the processing path;

the difference positioning unit is used for determining the currently compared processing logic as the logic with the difference if the currently compared processing logic is inconsistent;

before comparing a first field in a first data table with a second field in a second data table, acquiring processing information of each field of each data table in the data tables to be analyzed; judging whether each field is the field with the same identification information by using the processing logic in the processing information to obtain a judgment result; counting the number of fields with the same identification information between every two data tables in the data tables to be analyzed according to the judgment result; calculating the similarity of the two data tables based on the number; and acquiring a plurality of second data tables of which the similarity with the first data table meets a preset similarity condition.