CN117057329A

CN117057329A - Table data processing method and device and computing equipment

Info

Publication number: CN117057329A
Application number: CN202311327790.XA
Authority: CN
Inventors: 赵愉; 陈杨; 代玉成
Original assignee: Zanta Hangzhou Technology Co ltd
Current assignee: Zanta Hangzhou Technology Co ltd
Priority date: 2023-10-13
Filing date: 2023-10-13
Publication date: 2023-11-14
Anticipated expiration: 2043-10-13
Also published as: CN117057329B

Abstract

The embodiments of this specification provide a table data processing method and apparatus and a computing device. The method comprises the following steps: for a target table to be processed, identifying a header of the target table; for each header field in the header, matching a similar field in a field set; and determining a target canonical name corresponding to each header field based on the correspondence between the fields in the field set and canonical names, where the target canonical name is used for processing the data in the target table. The method can improve the data processing efficiency for tables.

Description

Table data processing method and device and computing equipment

Technical Field

The embodiments of this specification relate to the technical field of computers, and in particular to a table data processing method. They also relate to a table data processing apparatus, a computing device, and a computer-readable storage medium.

Background

With the development of computer technology, the requirements on data processing efficiency are also increasing.

At present, tables are used to record related data for different services. Different tables may be created by different personnel, and many tables may follow different header specifications, so that the same type of data is named differently in the headers of different tables. When data analysis is performed on a plurality of existing tables, the header of each table needs to be manually modified so that the names of same-type data are changed to specified uniform names.

However, since the number of tables involved in performing the table data processing is generally large, the labor cost required for performing the table data processing is high, and the data processing efficiency is low.

Disclosure of Invention

In view of this, the embodiments of this specification provide a table data processing method. One or more embodiments of this specification also relate to a table data processing apparatus, a computing device, a computer-readable storage medium, and a computer program, which can improve the data processing efficiency for tables.

According to an aspect of embodiments of the present specification, there is provided a table data processing method, the method including:

for a target table to be processed, identifying a header of the target table;

for each header field in the header, matching a similar field in a field set;

determining a target canonical name corresponding to each header field based on the correspondence between the fields in the field set and canonical names, where the target canonical name is used for processing the data in the target table.

According to another aspect of the embodiments of the present specification, there is provided a table data processing apparatus including:

an identifying module, configured to identify, for a target table to be processed, a header of the target table;

a matching module, configured to match, for each header field in the header, a similar field in a field set;

a first determining module, configured to determine a target canonical name corresponding to each header field based on the correspondence between the fields in the field set and canonical names, where the target canonical name is used for processing the data in the target table.

According to yet another aspect of embodiments of the present specification, there is provided a computing device comprising: a memory and a processor;

the memory is configured to store computer-executable instructions that, when executed by the processor, perform the steps of the method described above.

According to yet another aspect of embodiments of the present description, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the above-described method.

According to a further aspect of the embodiments of the present description, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-described method.

In one embodiment of this specification, the header of a target table to be processed may be identified, and a similar field may be determined in a field set for each header field in the header, so that the target canonical name corresponding to each header field is determined based on the canonical name corresponding to its similar field. The data in the target table can then be processed based on the target canonical names. Even when a plurality of tables with different header specifications are processed, the fields in the headers do not need to be given uniform names and the header of each table does not need to be manually modified, which reduces labor cost and improves the data processing efficiency for tables.

Drawings

FIG. 1 is a flow chart of a method for processing tabular data according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of another method for processing tabular data according to an embodiment of the present disclosure;

FIG. 3 is a block diagram of a target table according to an embodiment of the present disclosure;

FIG. 4 is a simplified flowchart of a table data processing method according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a table data processing apparatus according to an embodiment of the present disclosure;

FIG. 6 is a block diagram of a computing device according to one embodiment of the present disclosure.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of this specification. However, this specification can be implemented in many forms other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; this specification is therefore not limited by the specific implementations disclosed below.

The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items. The term "at least one" in one or more embodiments of the present specification refers to "one or more" and "a plurality" refers to "two or more". The term "comprising" is an open description and should be understood as "including but not limited to" and may include other content in addition to what has been described.

It should be understood that although the terms "first," "second," and the like may be used in one or more embodiments of the present description to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, "first" may also be referred to as "second" and, similarly, "second" may also be referred to as "first" without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.

Furthermore, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in one or more embodiments of the present description are information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of the relevant data must comply with the relevant standards and requirements, and a corresponding operation entry is provided for the user to choose to authorize or refuse.

With the development of computer technology, the volume of data keeps growing and the requirements for processing it keep rising. Currently, much data is recorded in the form of tables. The formats of tables that store different subject information may differ, and so may the formats of tables from different sources. For public safety services, for example, three-party data and bank card data tend to come from complex sources, and the table formats of the data obtained from different platforms and institutions tend to differ greatly. Errors can easily occur if the acquired data is imported directly into a data processing device (such as a server), so the header of each table needs to be manually adjusted after the data is acquired. Typically, the product side (the side that actually processes the data in the table) provides a set of header specifications, and the application side (the side that obtains the data) modifies the header of each data file (i.e., table) to be transmitted to the product side according to those specifications, thereby manually unifying the data import standard.

This approach is reliable and effective for small batches of data files, but the labor cost for large batches is huge and seriously affects data processing efficiency. If the header of every data file is modified manually, the labor cost is too high, the barrier to use rises sharply, and users become less willing to use the product. Moreover, data files in public safety scenarios often require more than one set of header specifications. Different types of tables usually have different header specifications. For example, tables whose stored subject information is bank cards, three-party data, penta-linked lists, and tickets each correspond to a different header specification, and each set of header specifications involves from around ten to several tens of header fields, which further increases the difficulty of manually modifying headers. In addition, some data files have no original header at all, and users are even more at a loss with such files. Furthermore, manually modifying the header does not correct data that is already erroneous or contradictory, and the data outside the header is underused, so the best opportunity for data cleaning is missed. Data cleaning includes deleting duplicate data, supplementing missing data, data consistency processing, data sorting, abnormal data processing, and the like. Therefore, the efficiency and effect of table data processing still need to be improved.

This specification provides a table data processing method that can improve the data processing efficiency for tables. This specification also relates to a table data processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments. The table data processing method can be used by the table data processing apparatus and the computing device.

Fig. 1 is a flowchart of a table data processing method according to an embodiment of the present disclosure, where the method is applied to a table data processing apparatus. As shown in fig. 1, the table data processing method includes the steps of:

Step 102, for a target table to be processed, identifying the header of the target table.

The table data processing apparatus may acquire a target table to be processed. For example, the target table may be uploaded by a user, or may be acquired automatically by the table data processing apparatus from a set channel. The table data processing apparatus may then analyze the target table to identify its header. If the header of the target table is identified, the data included in the header may be determined and the subsequent steps performed. In the embodiments of this specification, the header of the target table may also fail to be identified; the handling of that case is described in detail later and is not elaborated here.

The data in a table is arranged in rows and columns, and the header may be one row or one column of the table. When data of the same type is arranged in columns, the header may be a row of data; when data of the same type is arranged in rows, the header may be a column of data. The embodiments of this specification take a header that is one row of data as an example. The header is most often the first row of the table, but in some cases it may be located in the middle or at the end of the table.

The header may include a plurality of fields, referred to in these embodiments as header fields, and each header field may be located in one cell. A header field may characterize the nature of the data in the column in which it is located, for example the name or type of that data. When the table data processing apparatus determines the data included in the header, it determines each header field included in the header.

Step 104, for each header field in the header, matching similar fields in the field set.

In the embodiments of this specification, a field set may be preset. The field set includes a plurality of fields that may appear as header fields; these fields may also be referred to as header field aliases. For each header field in the header of the target table, the table data processing apparatus may determine the similarity between each field in the field set and the header field, and then determine the similar field of the header field based on the similarity.

Step 106, determining the target canonical name corresponding to each header field based on the correspondence between the fields in the field set and canonical names, where the target canonical name is used for processing the data in the target table.

Each field in the field set may correspond to a canonical name. A canonical name serves the same purpose as a header field: it characterizes the nature or type of a kind of data. The same canonical name may correspond to multiple fields, and the fields corresponding to the same canonical name may express the same or similar meanings. The canonical name may be one of its corresponding fields, or it may differ from all of them. For example, if the header fields "user name" and "home name" both actually denote the owner of the current table, they may correspond to the same canonical name. That canonical name may be "user name" or "home name", or it may be "table owner". As another example, if a canonical name is "query card number", the fields in the field set corresponding to it may include "client card number", "user account number", and "query account number".

The table data processing apparatus may determine the canonical name corresponding to the similar field of a header field as the target canonical name corresponding to that header field.
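Steps 104 and 106 can be sketched as follows. This is a minimal illustration, not the method as claimed: the field set contents, the mapping `FIELD_TO_CANONICAL`, and the function names are hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in similarity measure because this passage does not fix one.

```python
from difflib import SequenceMatcher

# Hypothetical field set: each known field (alias) maps to one canonical name.
FIELD_TO_CANONICAL = {
    "client card number": "query card number",
    "user account number": "query card number",
    "query account number": "query card number",
    "user name": "table owner",
    "home name": "table owner",
}


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption, not the patent's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def target_canonical_name(header_field: str) -> str:
    """Return the canonical name of the field most similar to header_field."""
    similar_field = max(FIELD_TO_CANONICAL,
                        key=lambda field: similarity(field, header_field))
    return FIELD_TO_CANONICAL[similar_field]


# A header with inconsistent casing and one misspelled field.
header = ["Client Card Number", "user acount number", "home name"]
canonical = {field: target_canonical_name(field) for field in header}
```

In this sketch every header field is always mapped to its nearest field; a real implementation would presumably also need a minimum similarity below which no canonical name is assigned.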

Optionally, the table data processing apparatus may execute the method provided in the embodiments of this specification (e.g., steps 102 to 106) after receiving a processing request for the data in the target table, and process the data in the target table after step 106. Or the table data processing apparatus may execute the method directly after acquiring the target table, so that the target canonical names are readily available when the data in the target table needs to be processed later. The processing of the data in the target table may include data extraction, data summarization, or any other kind of data processing; the embodiments of this specification are not limited in this respect.

The data in the target table may be processed by the table data processing apparatus or by a device other than the table data processing apparatus. For example, after determining the target canonical name corresponding to each header field in the header of the target table, the table data processing apparatus may send both the target table and the target canonical names to another device (such as a server), so that the other device processes the data in the target table based on the target canonical names.

In the embodiments of this specification, the table data processing apparatus may automatically map the header fields of a table to canonical names, thereby normalizing the header fields. The headers of the tables therefore do not need to be manually modified; even if different tables follow different header specifications, their header fields can all be mapped to one unified set of canonical names. Each table can then be processed based on the unified canonical names, which improves the efficiency and reliability of data processing.

In summary, in the table data processing method provided in the embodiments of this specification, the header of the target table to be processed may be identified, and a similar field may be determined in the field set for each header field in the header, so that the target canonical name corresponding to each header field is determined based on the canonical name corresponding to its similar field. The data in the target table can then be processed based on the target canonical names. Even when a plurality of tables with different header specifications are processed, the fields in the headers do not need to be given uniform names and the header of each table does not need to be manually modified, which reduces labor cost and improves the data processing efficiency for tables.

Fig. 2 is a flowchart of another method for processing tabular data according to an embodiment of the present disclosure, where the method is applied to a tabular data processing apparatus. As shown in fig. 2, the table data processing method includes the steps of:

Step 202, for a target table to be processed, identifying the header of the target table to obtain an identification result.

For step 202, refer to step 102; the details are not repeated here.

The table data processing apparatus may identify the header of the target table. The identification result may be that a header is identified, together with the content of the identified header, or that no header is identified. The target table may be any table to be processed, and the table data processing apparatus may apply the same processing procedure to other tables to be processed.

In one identification manner, the table data processing apparatus may analyze all of the data in the target table to identify the header. For the data analysis in this manner, refer to the analysis in the other identification manner described below.

In another identification manner, the table data processing apparatus takes only part of the data in the target table as the analysis object and identifies the header within that part. This part of the data may include a plurality of candidate rows or columns in the target table. Since data of the same type in a table may be arranged in rows or in columns, a candidate may be a data column or a data row, respectively. The row or column number of each candidate may be less than a second threshold. For example, if each candidate is a data row and the second threshold is 21, the table data processing apparatus may take the first 20 rows of the target table as the analysis object, that is, each of the first 20 rows is one candidate data row, and determine whether a header exists among them.

For a target table, the table data processing apparatus may first analyze the candidate data rows to identify whether a header exists. If a header is identified, the apparatus can determine that data of the same type in the target table is arranged in columns, and the candidate data columns need not be analyzed. If no header is identified, the candidate data columns are then analyzed. Because data of the same type is more likely to be arranged in columns, this order keeps the amount of data analyzed for header identification small. Alternatively, the table data processing apparatus may analyze the candidate data columns first. If no header is identified in any candidate data row or candidate data column, the apparatus may analyze the remaining, not yet analyzed data in the target table to further try to identify a header, or it may directly conclude that the target table has no header.

In the embodiments of this specification, when analyzing each candidate row or column in the target table, the table data processing apparatus may, for each field in the candidate, look up the words included in the field in a header word segmentation library. Based on the matching results for the fields, the target words in the candidate can be obtained; a target word is a word that has a matching word in the header word segmentation library. If the number of target words in a candidate is greater than a third threshold, the candidate is determined to be the header.

The words in the header word segmentation library may be obtained by segmenting historical header fields, and the library may be an a priori table. For example, if an existing historical header field is "opposite side card number", the field may be segmented into the three words "opposite side", "card", and "number", and these three words may then be stored in the header word segmentation library. The words in the library may come from header fields of various tables, and the source of each word need not be recorded. Optionally, the header word segmentation library may be expanded or reduced as the service changes, that is, words may be added to or deleted from the library.

The table data processing apparatus may analyze the candidates one by one. For each field in each candidate, it determines, word by word, whether the word has a matching word in the header word segmentation library; two words match only if they are identical. If a word in the field has a matching word in the library, the word is determined to be a target word. The apparatus may determine the number of target words in each candidate and determine as the header a candidate whose number of target words exceeds the third threshold. If no candidate has a number of target words exceeding the third threshold, no header is considered to be identified in the target table.

If the numbers of target words in a plurality of candidates all exceed the third threshold, the candidate with the largest number of target words may be determined as the header. If a plurality of candidates exceed the third threshold with equal numbers of target words, the candidate with the smallest row or column number among them may be determined as the header. For example, when the candidates are data rows, the row closer to the top of the table is preferred as the header.
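The header identification described above, for candidate data rows, might be sketched as follows. The contents of `HEADER_WORDS`, the default threshold values, and the function names are hypothetical; whitespace splitting stands in for a real word segmenter, which the text does not specify.

```python
from typing import Optional

# Hypothetical header word segmentation library built from historical header fields.
HEADER_WORDS = {"opposite", "side", "card", "number", "name",
                "amount", "date", "transaction"}


def count_target_words(candidate: list) -> int:
    """Count words in a candidate row that have an identical word in the library."""
    count = 0
    for field in candidate:
        # Assumption: whitespace splitting replaces real word segmentation.
        for word in str(field).lower().split():
            if word in HEADER_WORDS:
                count += 1
    return count


def identify_header(rows: list, second_threshold: int = 21,
                    third_threshold: int = 2) -> Optional[int]:
    """Return the 0-based index of the header row, or None if none is identified."""
    best_index, best_count = None, third_threshold
    # Candidate rows are those whose 1-based row number is below the second threshold.
    for index, row in enumerate(rows[: second_threshold - 1]):
        count = count_target_words(row)
        # Strict '>' keeps the earliest row when several rows tie.
        if count > best_count:
            best_index, best_count = index, count
    return best_index


table = [
    ["Bank statement export"],
    ["Card Number", "Transaction Date", "Amount"],
    ["6222", "2023-01-01", "100"],
]
header_index = identify_header(table)
```

Candidate data columns could be handled by transposing the table and calling the same function when this call returns `None`.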

In the embodiment of the present disclosure, after the header of the target table is identified, the arrangement manner of the data in the target table may be determined, that is, whether the data of the same type are arranged in rows or columns is determined.

Step 204, based on the identification result of the header, the pre-header data and post-header data outside the header in the target table are determined.

The table data processing apparatus may partition the data in the target table into blocks based on the identification result of the header. For example, the target table may be divided into three blocks (header, pre-header data, and post-header data), or into only pre-header data and post-header data. Each block may be analyzed differently to obtain an analysis result for that block, and the analysis results can be used in the subsequent processing of the data in the target table. Illustratively, steps 206 to 210 below are the analysis of the header, steps 212 to 216 are the analysis of the pre-header data, and steps 218 and 220 are the analysis of the post-header data.

In the case where the identification result indicates that the table data processing apparatus has identified the header of the target table, the identification result may include information about the header, such as its position in the target table. The apparatus may determine the pre-header data and the post-header data in the target table based on the position of the header. In this case, no additional analysis need be performed on the post-header data.

Taking a header that is a data row as an example, the table data processing apparatus may determine, as the pre-header data of the target table, the data whose row number is smaller than the row number of the header and whose column number is smaller than the column number of the header. The apparatus may determine, as the post-header data of the target table, the data whose row number is greater than the row number of the header. Fig. 3 is a block diagram of a target table according to an embodiment of the present disclosure. As shown in fig. 3, the data in area Q1 of the target table is the pre-header data, the data in area Q2 is the header, and the data in area Q3 is the post-header data.
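The partitioning into the three areas of fig. 3 can be illustrated with a short sketch. It assumes the header is a data row at a known index, and it reads the column condition on the pre-header data as "within the header's columns", which is an interpretation rather than something the text states outright; `partition_table` is a hypothetical name.

```python
def partition_table(rows: list, header_index: int) -> tuple:
    """Split rows into (pre_header, header, post_header) around the header row."""
    header = rows[header_index]
    width = len(header)
    # Assumption: pre-header cells beyond the header's width are ignored.
    pre_header = [row[:width] for row in rows[:header_index]]
    post_header = rows[header_index + 1:]
    return pre_header, header, post_header


rows = [
    ["Report", "2023", "x", "extra"],
    ["name", "card number", "amount"],
    ["a", "1", "2"],
    ["b", "3", "4"],
]
pre_header, header, post_header = partition_table(rows, header_index=1)
```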

In the case where the identification result indicates that the table data processing apparatus has not identified a header in the target table, the data in the target table may be treated as both pre-header data and post-header data. Only the analysis of the post-header data in this case is described below; for the analysis of the pre-header data, refer to the analysis of the pre-header data in the case where the header of the target table is identified.

The table data processing apparatus may further determine the number of columns of the target table based on the number of columns of the post-header data. The number of columns of the target table is the minimum number of columns among the rows of post-header data. For example, if the target table includes three rows of data with 5, 4, and 3 columns respectively, the apparatus may determine that the whole target table has 3 columns. The number of columns of the target table can later be used to locate the pre-header data and the post-header data, so that extra areas of the table need not be analyzed during data analysis and additional consumption of processing performance is avoided.
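The column count rule is simple enough to state in code. A minimal sketch, with the hypothetical helper name `table_column_count` and an assumed result of 0 for a table with no rows:

```python
def table_column_count(post_header_rows: list) -> int:
    """Number of columns of the table: the minimum row width after the header."""
    # Assumption: an empty table is reported as having 0 columns.
    return min((len(row) for row in post_header_rows), default=0)


# The example from the text: rows with 5, 4, and 3 columns give 3 columns.
example_rows = [[0] * 5, [0] * 4, [0] * 3]
column_count = table_column_count(example_rows)
```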

Step 206, in the case where the identification result indicates that the header is identified, determining the target service type corresponding to the target table based on the header fields in the header.

Tables may be used in different service scenarios, and the service types of those scenarios may differ. For example, the service types may include a bank card service type, a three-party service type, a ticket service type, and so on. Data of different service types usually differs considerably, and so do the names of the data. To ensure the effect of analyzing the data of the target table, in the embodiments of this specification the target service type corresponding to the target table may be determined first, and more accurate data analysis may then be performed under that service type. The table data processing apparatus may analyze the header fields of the target table to determine the target service type corresponding to the target table.

In a first manner of determining the target service type, the table data processing apparatus may determine the target service type based on the field set. Optionally, the fields in the field set may include words in the header word segmentation library. Each field in the field set may correspond to a service type, and multiple fields in the field set may correspond to the same service type. For example, the field "customer card number" in the field set may correspond to the bank card service type. For each header field in the header, the apparatus may determine the similarity between each field in the field set and the header field, and then determine the auxiliary fields in the field set, namely those whose similarity to the header field is greater than a first threshold. In this way the auxiliary fields corresponding to each header field in the header are obtained. The target service type corresponding to the target table can then be determined based on the service types corresponding to the auxiliary fields. Optionally, either one auxiliary field or a plurality of auxiliary fields may be determined for each header field.

For example, the table data processing apparatus may use the Levenshtein distance algorithm to determine the similarity between fields, and thus the auxiliary fields. This algorithm determines the edit distance between two strings, that is, the minimum number of edit operations required to transform one into the other. The permitted edit operations are replacing one character with another, inserting one character, and deleting one character. The smaller the edit distance, the greater the similarity of the two strings. The algorithm is not a traditional absolute similarity evaluation algorithm. Determining similarity with the Levenshtein distance tolerates occasional spelling errors in fields to a certain extent while remaining sensitive to the word order of fields, which makes it well suited to the scenarios addressed by the embodiments of this specification.
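For reference, the edit distance can be computed with the standard dynamic programming recurrence. The sketch below is a generic implementation; the normalization of the distance into a similarity in [0, 1] is an assumption of this example, since the text does not say how the distance is converted into a similarity.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character replacements, insertions, and deletions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace char_a with char_b (or keep)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 minus the distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Under this normalization a single typo such as "card numbr" stays close to "card number", while the reordered "number card" scores much lower, which matches the tolerance to spelling errors and the sensitivity to word order noted above.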

The table data processing device can determine the similar field of each header field, namely the auxiliary field corresponding to each header field, based on the similarity between the header field and each field in the field set. The auxiliary field may reflect characteristics of the header field, and based on a service type corresponding to the auxiliary field, a service type to which the header field in the target table may belong may be determined, and accordingly, a target service type corresponding to the target table may be determined.

For example, the target service type may be determined based on the sum of the similarities of the auxiliary fields corresponding to the same service type; the service type with the largest sum of similarities may be determined as the target service type. As another example, only one auxiliary field may be determined for each header field, and the target service type may be determined based on the number of auxiliary fields corresponding to the same service type.
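The first example above can be sketched as follows. The input format (a flat list of service type and similarity pairs, one pair per auxiliary field) and the first-threshold value of 0.5 are assumptions made for illustration only.

```python
from collections import defaultdict


def target_service_type(auxiliary_fields, first_threshold=0.5):
    """auxiliary_fields: iterable of (service_type, similarity) pairs
    collected over all header fields. Returns the service type with the
    largest sum of similarities, or None when nothing passes the threshold."""
    totals = defaultdict(float)
    for service_type, sim in auxiliary_fields:
        if sim > first_threshold:  # only auxiliary fields are counted
            totals[service_type] += sim
    if not totals:
        return None
    return max(totals, key=totals.get)
```

Replacing `totals[service_type] += sim` with `totals[service_type] += 1` would give the count-based variant described in the second example.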

In a second way of determining the target service type, the target service type corresponding to the target table is determined based on the number of header fields in the header (i.e., the number of columns of the header). For example, a reference range may be set for the number of header fields, and the reference range corresponds to a service type. In the case that the number of header fields in the header is within the reference range, the service type corresponding to the reference range may be determined as the target service type corresponding to the target table. For example, for a table in a public safety scenario, if the number of header fields of the table is less than or equal to 4, the table is considered to correspond to the bank card service type.

Alternatively, the table data processing apparatus may first determine the target service type based on the reference range in the second determination manner described above. When the number of the header fields is not in the reference range, the first determination mode is adopted to determine the target service type.
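A minimal sketch of this combined strategy is given below. The representation of reference ranges as a mapping from inclusive `(low, high)` bounds to a service type, and the `fallback` callable standing in for the first determination way, are hypothetical choices.

```python
def choose_service_type(header_fields, reference_ranges, fallback):
    """Try the reference ranges on the number of header fields first and
    fall back to the similarity-based determination otherwise."""
    count = len(header_fields)
    for (low, high), service_type in reference_ranges.items():
        if low <= count <= high:
            return service_type
    return fallback(header_fields)


# Example range from the text: at most 4 header fields -> bank card.
REFERENCE_RANGES = {(1, 4): "bank card"}
```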

In another optional way of determining the target service type, the table data processing apparatus may directly analyze the header fields of the target table to determine the target service type. For example, features may be extracted from the header fields, and the extracted features may be matched against the features of each service type to obtain the target service type.

Step 208, for each header field in the header, matching similar fields under the target service type in the field set.

After determining the target service type, the table data processing device may match the similar fields of each header field again, this time only among the fields in the field set that correspond to the target service type. The similar fields may be matched in the same manner as the auxiliary fields are determined in step 206 described above. For each header field, the table data processing device determines the similarity between the header field and each field corresponding to the target service type, and determines the similar field of the header field based on the similarity. For example, the similar field is the field with the highest similarity to the header field among the fields corresponding to the target service type.

Alternatively, a minimum similarity threshold may be set, for example, the minimum similarity threshold is 0.62, or any other value or empirical value. The table data processing apparatus may determine a field having the highest similarity with the header field and higher than the minimum similarity threshold as a similar field to the header field.

In step 208, the table data processing apparatus may redetermine the similarity of the header field to each field corresponding to the target service type. Or, the similarity of each field corresponding to the target service type may be screened from the similarity of the header field and each field in the field set, which is obtained when the auxiliary field is determined.

Optionally, each field in the field set has a weight. For each header field in the header, in the case where a plurality of alternative fields are matched in the field set based on similarity, the similar field of the header field may be determined based on the weights of the plurality of alternative fields, for example by determining the alternative field with the highest weight as the similar field of the header field. The alternative fields may be the fields in the field set that share the highest similarity to the header field.
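The matching in step 208, including the minimum similarity threshold and the weight-based choice among equally similar alternative fields, might look like the sketch below. For brevity it uses `difflib.SequenceMatcher.ratio` from the Python standard library as a stand-in similarity rather than the Levenshtein-based similarity named in the text; the function names and the mapping from field to weight are likewise assumptions.

```python
from difflib import SequenceMatcher

MIN_SIMILARITY = 0.62  # example value given in the text


def ratio(a, b):
    # Stand-in similarity in [0, 1]; not the Levenshtein-based measure.
    return SequenceMatcher(None, a, b).ratio()


def match_similar_field(header_field, fields, similarity=ratio,
                        min_similarity=MIN_SIMILARITY):
    """fields: mapping of field -> weight, already restricted to the
    target service type. Returns the similar field or None."""
    scored = [(similarity(header_field, name), weight, name)
              for name, weight in fields.items()]
    scored = [item for item in scored if item[0] > min_similarity]
    if not scored:
        return None
    # Highest similarity first; equally similar fields are ranked by weight.
    return max(scored, key=lambda item: (item[0], item[1]))[2]
```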

In an optional implementation of the embodiment of the present disclosure, the table data processing apparatus may determine, directly, as the similar field of the header field, the field with the highest similarity to the header field in the field set without determining the target service type.

Step 210, determining the target canonical names corresponding to the header fields based on the correspondence between the fields in the field set and the canonical names.

For the canonical names, reference may be made to the related description in step 104, and the details of step 104 are not repeated in this embodiment. Each field in the field set may correspond to a canonical name, and the table data processing device may determine the canonical name corresponding to the similar field of a header field as the target canonical name corresponding to that header field. In this way, each header field in the target table is normalized. After determining the target canonical name corresponding to each header field, the table data processing device may further record the correspondence between the header field and the target canonical name for subsequent use.

Optionally, the canonical name corresponding to each field in the field set may also correspond to mandatory words and forbidden words, where the mandatory words may be called a white list and the forbidden words may be called a black list. A mandatory word is a word that must be included in a header field corresponding to the canonical name, and a forbidden word is a word that must not be included in a header field corresponding to the canonical name. Each canonical name may correspond to a single mandatory word or to a plurality of mandatory words. When a canonical name corresponds to a plurality of mandatory words, it is sufficient for the header field corresponding to the canonical name to include one of them. Each canonical name may correspond to one or more forbidden words, and when it corresponds to a plurality of forbidden words, the header field corresponding to the canonical name must not include any of them.

In the case that the canonical name corresponds to mandatory words and forbidden words, the table data processing apparatus may first determine the alternative canonical name corresponding to each header field based on the correspondence between the fields in the field set and the canonical names. The alternative canonical name is the canonical name corresponding to the similar field of the header field. It can then be determined whether the header field meets the mandatory-word and forbidden-word requirements of the alternative canonical name: when the header field includes a mandatory word corresponding to the alternative canonical name and does not include any forbidden word corresponding to the alternative canonical name, the alternative canonical name is determined to be the target canonical name.

If the header field does not include a mandatory word corresponding to the alternative canonical name and/or includes a forbidden word corresponding to the alternative canonical name, the table data processing apparatus may determine that the matching of the header field has failed, and the alternative canonical name cannot be used as the target canonical name. In this case, the alternative canonical name may be redetermined; for example, the similar field of the header field may be redetermined among the fields other than those corresponding to the rejected alternative canonical name.
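The white list and black list check, together with the redetermination step, can be sketched as follows. The data layout (a mapping from canonical name to a pair of word tuples), the example canonical names, and the treatment of an empty mandatory list as "no requirement" are all assumptions made for illustration.

```python
def passes_word_rules(header_field, mandatory_words, forbidden_words):
    """True when the header field contains at least one mandatory word
    (if any are defined) and none of the forbidden words."""
    has_mandatory = (not mandatory_words
                     or any(word in header_field for word in mandatory_words))
    has_forbidden = any(word in header_field for word in forbidden_words)
    return has_mandatory and not has_forbidden


def resolve_canonical_name(header_field, ranked_candidates, rules):
    """ranked_candidates: alternative canonical names ordered from the most
    to the least similar. The first one whose rules pass is the target."""
    for name in ranked_candidates:
        mandatory, forbidden = rules.get(name, ((), ()))
        if passes_word_rules(header_field, mandatory, forbidden):
            return name
    return None  # matching of this header field failed


# Hypothetical rules: canonical name -> (mandatory words, forbidden words)
RULES = {
    "query card number": (("card",), ("counterparty",)),
    "counterparty card number": (("counterparty",), ()),
}
```

Here the header field "counterparty card no" is rejected for "query card number" because it contains the forbidden word "counterparty", and the next alternative is tried.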

In the embodiment of the present disclosure, the weights of the similar fields may be the weights of the corresponding canonical names, and the weights of the fields in the field set corresponding to the same canonical name may be the same.

The way in which the pre-header data is analyzed is described below by steps 212 through 216.

Step 212, performing field segmentation on the pre-header data based on multiple character formats, so as to obtain multiple split fields.

In one example, the plurality of character formats may include: Chinese characters, alphanumeric characters (English letters and digits), and characters that are neither Chinese nor alphanumeric. The table data processing device may judge, character by character, which of the plurality of character formats each character in the pre-header data belongs to, so as to segment the pre-header data into a plurality of split fields. For example, the characters that are neither Chinese nor alphanumeric are used as separators, and the pre-header data is split at these separators into a plurality of split fields.

For example, the pre-header data is the string "aaaa account name: 12345678, card number name: identity card number: ". Here the space after "aaaa", the colon ":" and the comma "," can serve as separators, and the pre-header data can be split into five split fields, namely "aaaa", "account name", "12345678", "card number name" and "identity card number".

In the embodiment of the present disclosure, the three character formats above are only an example, and the character formats may be adjusted for different application scenarios. For example, in some scenarios, certain special characters may be grouped with the alphanumeric characters into one character format.
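A minimal sketch of the segmentation in step 212 is shown below, assuming that Chinese characters are those in the CJK Unified Ideographs range U+4E00 to U+9FFF and that alphanumeric characters are ASCII letters and digits. The sample string is a hypothetical Chinese rendering of the example above, since the multi-word English names only contain spaces as a result of translation.

```python
import re

# One or more characters that are neither Chinese, ASCII letters nor digits.
SEPARATORS = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def split_pre_header(text):
    """Split pre-header text at separator runs and drop empty pieces."""
    return [part for part in SEPARATORS.split(text) if part]


# Hypothetical sample; spaces and full-width punctuation act as separators.
SAMPLE = "aaaa 账户名称：12345678，卡号名称：身份证号："
```

Calling `split_pre_header(SAMPLE)` yields five split fields, corresponding to the five fields of the translated example.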

Step 214, determining the field attribute of each split field based on the field set.

The field attribute of a split field may be a name attribute or a content attribute. A split field with the name attribute is similar to a header field, so the field attributes of the split fields can be determined based on the field set. The table data processing apparatus may match each split field against the fields in the field set, and then determine the field attribute from the matching result.

For any split field, in the case where the field set contains a field whose similarity to the split field is greater than a fourth threshold, the field attribute of the split field may be determined to be the name attribute. In the case where the similarity between every field in the field set and the split field is less than or equal to the fourth threshold, the field attribute of the split field may be determined to be the content attribute. In other words, if the field set has a field whose similarity to the split field is greater than the fourth threshold, the split field may be considered close to a field alias of a canonical name; otherwise, the split field may be considered not close to any field alias of a canonical name. Optionally, the fourth threshold may be equal to the minimum similarity threshold described above.

Step 216, determining the analysis result of the data before the header based on the field attribute of each split field.

In the case where, of two adjacent split fields, the former split field has the name attribute and the latter split field has the content attribute, the table data processing apparatus may regard the latter split field as the content of the former split field. The table data processing apparatus may determine, for the former split field, a corresponding canonical name based on the field set, and then determine the latter split field as the content corresponding to that canonical name. The canonical name may be determined in the same manner as the target canonical name described above. The correspondence between the canonical name and the content may serve as the analysis result of the pre-header data. The analysis result of the pre-header data may be referred to as supplementary information of the target table. Still taking the pre-header data illustrated in step 212 as an example, if the canonical name corresponding to the split field "account name" is "query account", the analysis result of the pre-header data may be the content "12345678" corresponding to "query account".
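Steps 214 and 216 can be sketched together as below. The sketch again uses `difflib.SequenceMatcher.ratio` as a stand-in similarity, takes the fourth threshold to be 0.62 (the text only says it may equal the minimum similarity threshold), and uses hypothetical field-set entries and canonical names other than "query account".

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in similarity in [0, 1]; not the Levenshtein-based measure.
    return SequenceMatcher(None, a, b).ratio()


def analyze_pre_header(split_fields, field_set, threshold=0.62, similarity=ratio):
    """field_set: mapping of known field -> canonical name.
    Returns {canonical name: content} for each name field that is
    immediately followed by a content field."""
    labelled = []
    for text in split_fields:
        best = max(field_set, key=lambda known: similarity(text, known),
                   default=None)
        is_name = best is not None and similarity(text, best) > threshold
        # Name fields carry their canonical name, content fields carry None.
        labelled.append((text, field_set[best] if is_name else None))
    result = {}
    for (_, canonical), (content, next_canonical) in zip(labelled, labelled[1:]):
        if canonical is not None and next_canonical is None:
            result[canonical] = content
    return result


FIELD_SET = {
    "account name": "query account",
    "card number name": "query card number",     # hypothetical mapping
    "identity card number": "identity card number",
}
```

For the five split fields of the example in step 212, only "account name" is directly followed by a content field, so the result contains the single entry mapping "query account" to "12345678".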

Optionally, when processing the data in the target table, the names corresponding to the content to be extracted may be specified. For example, if the specified names include the query card number, the client name, the account-opening bank and the query account, the table data processing device can extract the content of the query account from the pre-header data.

The following describes the analysis of the post header data by steps 218 and 220.

Step 218, extracting features of each row of data in the post-header data when the identification result of the header indicates that the header is not identified.

For example, if the table data processing apparatus has not determined how data of the same type is arranged in the table, feature extraction may be performed on at least one row of data in the post-header data, and also on at least one column of data in the post-header data. Based on the extracted features, whichever of the row direction and the column direction has the more uniform data format is determined, and that direction is regarded as the arrangement direction of data of the same type in the table. Optionally, the table data processing device may also assume by default that data of the same type in the table is arranged in columns, so that feature extraction may be performed on each column of data in the post-header data.

In the embodiment of the present disclosure, the features extracted by the table data processing apparatus for each column of data may include the number of specific characters, the number of non-repeated character strings, the length of character strings, the number of character strings satisfying specific conditions, and the like. For example, the features computed over the multiple rows of a single column may include: the number of "-" characters contained; the number of ":" characters contained; the number of "/" characters contained; the number of non-repeated character strings; the difference between the actual number of rows and the number of non-repeated character strings; the number of character strings longer than 10 characters; the number of purely numeric character strings; the number of purely Chinese character strings; and the number of purely alphabetic character strings.
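By way of illustration only, the column features listed above might be computed as follows. The dictionary keys are hypothetical names, and the character features are interpreted here as the number of entries in the column that contain the character, which is one possible reading of the text.

```python
import re


def column_features(values):
    """values: the strings of one column, one entry per data row."""
    distinct = set(values)
    return {
        "contains_dash": sum("-" in v for v in values),
        "contains_colon": sum(":" in v for v in values),
        "contains_slash": sum("/" in v for v in values),
        "distinct": len(distinct),
        "rows_minus_distinct": len(values) - len(distinct),
        "longer_than_10": sum(len(v) > 10 for v in values),
        "pure_digits": sum(bool(re.fullmatch(r"[0-9]+", v)) for v in values),
        "pure_chinese": sum(bool(re.fullmatch(r"[\u4e00-\u9fff]+", v)) for v in values),
        "pure_letters": sum(bool(re.fullmatch(r"[A-Za-z]+", v)) for v in values),
    }
```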

Step 220, determining the canonical name of each row of data based on the extracted data features, and obtaining the analysis result of the post-header data.

The table data processing apparatus may determine, among the feature conditions of the plurality of canonical names, a target feature condition that is satisfied by the data features extracted for each row of data, and determine the canonical name corresponding to the target feature condition as the canonical name of that row of data. For example, the feature condition of the canonical name "time" is that the data includes two ":" characters. If each field in a column of data contains two ":" characters, the table data processing apparatus may determine the canonical name of the column of data as "time". The analysis result of the post-header data may include the correspondence between each row of data and its canonical name.
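The feature-condition lookup can be sketched as below. Only the "time" condition (two ":" characters) comes from the text; the "date" condition and the representation of conditions as per-entry predicates are assumptions.

```python
FEATURE_CONDITIONS = {
    "time": lambda value: value.count(":") == 2,  # condition from the text
    "date": lambda value: value.count("-") == 2,  # hypothetical condition
}


def canonical_name_for_column(values, feature_conditions=FEATURE_CONDITIONS):
    """Return the first canonical name whose condition holds for every
    entry of the column, or None when no condition is satisfied."""
    for name, condition in feature_conditions.items():
        if values and all(condition(value) for value in values):
            return name
    return None
```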

In another optional implementation, the table data processing apparatus may input the extracted data features into a canonical name analysis model, so that the canonical name of each row of data is determined by the canonical name analysis model. The model may be trained in advance on a large amount of table data to perform the classification. If the same canonical name is determined for multiple rows of data, the row of data with the highest confidence may be determined to correspond to that canonical name, and the canonical name determination may be performed again for the other rows of data.
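The conflict resolution described above, in which the most confident row keeps a contested canonical name and the others are determined again, can be approximated by a greedy assignment over the model's ranked outputs. The input format below (each row or column identifier mapped to a list of name and confidence pairs) is a hypothetical interface; the model itself is not sketched.

```python
def assign_unique_names(predictions):
    """predictions: {column id: [(canonical name, confidence), ...]}.
    Each canonical name is given to the column that is most confident
    about it; the remaining columns fall back to their next-best guess."""
    triples = sorted(
        ((confidence, column, name)
         for column, ranked in predictions.items()
         for name, confidence in ranked),
        reverse=True,
    )
    assigned, taken = {}, set()
    for confidence, column, name in triples:
        if column not in assigned and name not in taken:
            assigned[column] = name
            taken.add(name)
    return assigned
```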

Optionally, the table data processing device may determine a header field name corresponding to each row of data based on the extracted data features, and further determine a corresponding canonical name based on the header field name.

Optionally, in the case where the header of the target table is not identified, the target table may be considered to correspond to a default service type, and the canonical name of each row of data can then be determined from the canonical names corresponding to the default service type. For example, in public safety scenarios the default service type may be the bank card service type.

Alternatively, in the case where the target table does not have a header, the table data processing apparatus may generate a header for the target table based on the analysis result of the data after the header. Each header field in the generated header is a canonical name for a row of data.

Step 222, processing the data of the target table.

The processing of the data in the target table may include data extraction, data summarization, or any other manner of data processing. Step 222 may be performed after the form data processing apparatus receives a data processing instruction for the target form. The data processing instruction may be received before step 202, after step 220, or at any intermediate time, which is not limited in this embodiment of the present disclosure. Step 222 may refer to the related description in step 106, and details in step 106 are not described in the embodiment of the present disclosure.

In the case that the header is identified in step 202, the data of the target table may be processed in step 222 based on the target canonical names and the analysis result of the pre-header data. In the case that no header is identified in step 202, the data of the target table may be processed in step 222 based on the analysis results of the pre-header data and the post-header data. In this case the pre-header data and the post-header data are the same data, which is analyzed in the two different manners to obtain two analysis results.

In embodiments of the present description, the data processing instruction may carry a specified header field. The specified header field may also be normalized by determining its corresponding canonical name, so that the header fields in the table can be associated with the header field carried by the data processing instruction through the canonical name, which facilitates data processing.

Fig. 4 is a simplified flowchart of a table data processing method according to an embodiment of the present disclosure, which is a simplified schematic diagram of the method shown in fig. 2, and the description of fig. 4 may be referred to with the descriptions of steps 202 to 222. The information in the box in fig. 4 is the information involved in the analysis of the table or the execution of the action, and the information in the oval frame is the information to be acquired by the table data processing method.

As shown in fig. 4, the form data processing apparatus may process the original form data (e.g., the target form). First, a header search may be performed on the original table data, and the process corresponds to header identification in step 202, where the identification result may be that there is a header or that there is no header. The target table may then be partitioned to determine header data, pre-header data, and post-header data. In the case of a header, the rank (e.g., line number) of the header may be determined based on the header data, and in the case of no header, the rank of the header may be directly considered to be 0.
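The partitioning step can be sketched as follows, assuming the table is held as a list of rows and the header occupies a single row. Returning the whole table as both pre-header and post-header data when no header is found reflects the statement in step 222 that the two are the same in that case; the function name and return layout are hypothetical.

```python
def partition_table(rows, header_index):
    """Split a table into (pre-header data, header, post-header data).
    header_index is the 0-based row of the header, or None if no header
    was identified."""
    if header_index is None:
        # No header: the same rows are analysed in both manners.
        return rows, None, rows
    return rows[:header_index], rows[header_index], rows[header_index + 1:]
```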

In both cases, the pre-header data is analyzed to obtain a corresponding analysis result, and the post-header data is obtained. In the case where no header exists, the number of columns of the table and the table type are also determined based on the post-header data. The post-header data may be analyzed to determine the type of each column of data, which is equivalent to matching a corresponding header field for each column of data. The canonical names can then be mapped from the matched header fields to obtain the canonical name corresponding to each column of data, and thereby the schema and structure of the table.

In the case of a header, the number of table columns and the table type (e.g., the target traffic type described above) may be determined from the header data. And, the header fields need to be matched in the field set to obtain similar fields corresponding to the header fields. If there is a missing header field in the header, the corresponding post-header data may be analyzed to complement the header field.

In the embodiment of the specification, the table data processing device can normalize the headers of a large number of table files efficiently and automatically, which avoids time-consuming and laborious manual operation and improves data processing efficiency. The failure rate of users' data imports can be reduced, and the volume of data uploaded can be increased. Because the canonical name of a header field is determined through similarity, the problem that a required header field cannot be found due to misspellings or extra or missing words can be avoided. The method is suitable for table data of most service types in current public security scenarios, such as bank card data, three-party data, five-linked list data, bill data, bank card main body data and the like. The method can match thousands of header fields based on the field set; the field set can be expanded, and matching does not require a field to be strictly identical to the header field, so the method has good extensibility and robustness.

In addition, for a table without a header, the table data processing device can analyze the format and features of the post-header data to infer the header field to which the data may belong, thereby ensuring that the data of the table is effectively used and improving the utilization rate of the table data. The table data processing device can also reasonably supplement and correct the data to be extracted by means of the pre-header data, which improves the reliability of data processing and meets the requirements of public safety scenarios with strict standards. For most table files, data extraction takes little time, for example about 500 milliseconds, even for files with thousands of rows of data, so the processing performance for table files is high.

In summary, in the table data processing method provided in this embodiment, the header of the target table to be processed may be identified, and the similar fields of the header fields in the header may be determined in the field set, so as to determine the target canonical name corresponding to each header field based on the canonical name corresponding to the similar field. The data in the target table can thus be processed based on the target canonical names. Even when a plurality of tables with different header specifications are processed, the fields in the headers do not need to be given uniform names and the header of each table does not need to be manually modified, so labor cost can be reduced and the data processing efficiency of the tables is improved.

Corresponding to the method embodiment, the present disclosure further provides an embodiment of a table data processing apparatus, and fig. 5 is a schematic structural diagram of a table data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the form data processing apparatus includes:

an identifying module 501, configured to identify a header of a target table for a target table to be processed;

a matching module 502, configured to match similar fields in the field set for each header field in the header;

a first determining module 503, configured to determine, based on a correspondence between fields in the field set and the canonical names, a target canonical name corresponding to each header field; the target standard name is used for processing the data in the target table.

Optionally, the matching module 502 includes:

the first determining submodule is used for determining a target service type corresponding to the target form based on a header field in the header;

and the matching sub-module is used for matching similar fields under the target service type in the field set for each header field in the header.

Optionally, the first determining submodule is configured to:

determining auxiliary fields with similarity larger than a first threshold value in a field set for each header field in the header to obtain auxiliary fields corresponding to each header field in the header;

and determining the target service type based on the service type corresponding to the auxiliary field.

Optionally, the first determining submodule is configured to:

and determining the target service type based on the sum of the similarity of the auxiliary fields corresponding to the same service type.

Optionally, the first determining submodule is configured to:

and under the condition that the number of header fields in the header is in the reference range, determining the service type corresponding to the reference range as the target service type corresponding to the target table.

Optionally, each field in the field set has a weight; the matching module 502 is configured to:

for each header field in the header, in the case that the set of fields matches to a plurality of candidate fields based on similarity, a similar field for each header field is determined based on weights for the plurality of candidate fields.

Optionally, the canonical name corresponding to each field in the field set corresponds to a mandatory word and a forbidden word; the first determining module 503 is configured to:

based on the corresponding relation between the fields in the field set and the standard names, determining the alternative standard names corresponding to the header fields;

for any header field, if the header field includes a mandatory word corresponding to the alternative canonical name and does not include a forbidden word corresponding to the alternative canonical name, determining the alternative canonical name as the target canonical name.

Optionally, the identification module 501 is configured to:

for each field in an alternative rank of the target table, matching the words included in the field against a header word segmentation library; the alternative rank is a data row or a data column whose rank number is smaller than a second threshold, and the words in the header word segmentation library are obtained by word segmentation of historical header fields;

determining the alternative rank as the header if the number of target words in the alternative rank is greater than a third threshold; wherein a target word is a word that has a matching word in the header word segmentation library.
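As a rough illustration of this header identification, the sketch below scans only data rows, uses whitespace splitting as a stand-in for word segmentation, and takes hypothetical values for the second and third thresholds; none of these choices is fixed by the specification.

```python
def find_header_row(rows, header_words, second_threshold=10,
                    third_threshold=2, tokenize=str.split):
    """Return the 0-based index of the first row, among those whose index
    is below second_threshold, containing more than third_threshold words
    found in the header word segmentation library; None if there is none."""
    for index, row in enumerate(rows[:second_threshold]):
        words = [word for cell in row for word in tokenize(cell)]
        target_words = sum(word in header_words for word in words)
        if target_words > third_threshold:
            return index
    return None
```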

Optionally, the form data processing apparatus further includes:

the second determining module is used for determining, after the header of the target table is identified for the target table to be processed, the pre-header data and the post-header data outside the header in the target table based on the position of the header in the target table;

the analysis module is used for analyzing the pre-header data and the post-header data respectively in different analysis manners to obtain analysis results; the analysis results are used for processing the data in the target table.

Optionally, the analysis module includes:

the segmentation sub-module is used for performing field segmentation on the pre-header data based on a plurality of character formats to obtain a plurality of split fields;

a second determining submodule, configured to determine a field attribute of each split field based on the field set;

and the third determination submodule is used for determining the analysis result of the data before the header based on the field attribute of each split field.

Optionally, the second determining submodule is configured to:

for any split field, determining that the field attribute of the split field is the name attribute when the field set contains a field whose similarity to the split field is greater than a fourth threshold;

for any split field, determining that the field attribute of the split field is the content attribute when the similarity between every field in the field set and the split field is less than or equal to the fourth threshold.

Optionally, the second determining module is configured to:

under the condition that the header is not identified, determining the data in the target table as post-header data;

the analysis module is used for:

extracting features of each row of data in the data behind the header under the condition that the header is not identified;

and determining the standard name of each row of data based on the extracted data characteristics to obtain the analysis result of the data behind the header.

Optionally, the analysis module is configured to:

among the feature conditions of the plurality of canonical names, determining a target feature condition satisfied by the data feature extracted for each row of data;

and determining the canonical name corresponding to the target feature condition as the canonical name of each row of data.

In summary, the table data processing apparatus provided in this embodiment may identify the header of the target table to be processed, and determine the similar fields of the header fields in the header in the field set, so as to determine the target canonical name corresponding to each header field based on the canonical name corresponding to the similar field. The data in the target table can thus be processed based on the target canonical names. Even when a plurality of tables with different header specifications are processed, the fields in the headers do not need to be given uniform names and the header of each table does not need to be manually modified, so labor cost can be reduced and the data processing efficiency of the tables is improved.

In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the table data processing apparatus embodiment is substantially similar to the table data processing method embodiment, its description is relatively brief, and for relevant details reference may be made to the corresponding description of the method embodiment.

FIG. 6 is a block diagram of a computing device according to one embodiment of the present disclosure. The components of computing device 600 include, but are not limited to, memory 610 and processor 620. The processor 620 is coupled to the memory 610 via a bus 630 and a database 650 is used to hold data.

Computing device 600 also includes an access device 640, which enables computing device 600 to communicate via one or more networks 660. Examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 640 may include one or more of any type of wired or wireless network interface, for example a network interface card (NIC), an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, or a near field communication (NFC) interface.

In one embodiment of the present description, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 6 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.

Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet, personal digital assistant, laptop, notebook, or netbook), a mobile phone (e.g., a smartphone), a wearable computing device (e.g., a smart watch or smart glasses), or another type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC). Computing device 600 may also be a mobile or stationary server.

The processor 620 is configured to execute computer-executable instructions which, when executed by the processor, implement the method illustrated in FIG. 1 or FIG. 2 described above.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the computing device embodiment is substantially similar to the table data processing method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.

An embodiment of the present specification also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the table data processing method described above. The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer-readable storage medium may be appropriately increased or decreased according to the requirements of the relevant jurisdiction; for example, in some jurisdictions the computer-readable storage medium does not include electrical carrier signals and telecommunications signals.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the computer-readable storage medium embodiment is substantially similar to the table data processing method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.

An embodiment of the present specification also provides a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the table data processing method described above.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the computer program embodiment is substantially similar to the table data processing method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

It should further be noted that those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules involved are not necessarily all required by the embodiments of the specification.

In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

The preferred embodiments of the present specification disclosed above are merely intended to help explain the present specification. The alternative embodiments do not set out every detail exhaustively, nor do they limit the invention to the precise forms described. Obviously, many modifications and variations are possible in light of the content of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, thereby enabling others skilled in the art to best understand and utilize the invention. This specification is limited only by the claims and their full scope and equivalents.

Claims

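1. A method of table data processing, the method comprising:

identifying, for a target table to be processed, a header of the target table;

matching similar fields in a field set for each header field in the header;

determining a target canonical name corresponding to each header field based on the correspondence between the fields in the field set and canonical names; the target canonical name being used for processing the data in the target table.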

2. The method of claim 1, wherein said matching similar fields in a field set for each header field in said header comprises:

determining a target service type corresponding to the target table based on a header field in the header;

for each header field in the header, matching similar fields under the target service type in the field set.

3. The method of claim 2, wherein determining the target service type corresponding to the target table based on the header field in the header comprises:

determining auxiliary fields with similarity larger than a first threshold value in the field set for each header field in the header to obtain auxiliary fields corresponding to each header field in the header;

4. The method of claim 3, wherein the determining the target service type based on the service type corresponding to the auxiliary field comprises:

5. The method of claim 2, wherein determining the target service type corresponding to the target table based on the header field in the header comprises:

and under the condition that the number of header fields in the header is in a reference range, determining the service type corresponding to the reference range as the target service type corresponding to the target table.

6. The method of claim 1, wherein each field in the set of fields has a weight; the matching similar fields in the field set for each header field in the header includes:

for each header field in the header, in a case where a plurality of candidate fields are matched in the field set based on similarity, determining the similar field of the header field based on the weights of the plurality of candidate fields.
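By way of a non-limiting sketch, the weight-based selection recited in claim 6 might look as follows. The field weights, the 0.7 similarity threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical field set in which each field carries a weight.
FIELD_WEIGHTS = {"order no": 0.9, "order note": 0.2, "customer name": 0.7}


def similar_field(header_field: str, threshold: float = 0.7) -> Optional[str]:
    """Pick the similar field; several matches are resolved by weight."""
    matched = [
        field for field in FIELD_WEIGHTS
        if SequenceMatcher(None, header_field.lower(), field).ratio() >= threshold
    ]
    if not matched:
        return None
    # When several fields match on similarity, the highest weight decides.
    return max(matched, key=FIELD_WEIGHTS.get)
```

For the header field "order not", both "order no" and "order note" pass the similarity threshold, and the higher weight of "order no" decides the outcome.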

7. The method of claim 1, wherein the canonical name for each field in the set of fields corresponds to a mandatory word and a forbidden word; the determining the target canonical name corresponding to each header field based on the correspondence between the fields in the field set and the canonical names comprises:

for any header field, determining the candidate canonical name as the target canonical name when the header field includes the mandatory word corresponding to the candidate canonical name and does not include the forbidden word corresponding to the candidate canonical name.
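As a hedged illustration of the check recited in claim 7, the following sketch confirms a candidate canonical name only when the header field contains the mandatory words and none of the forbidden words. The rule table is hypothetical, whitespace splitting stands in for whatever word segmentation is actually used, and requiring all mandatory words (rather than any one of them) is an assumption of this sketch.

```python
# Hypothetical rules: canonical name -> mandatory and forbidden words.
RULES = {
    "payment_amount": {"mandatory": {"amount"}, "forbidden": {"refund"}},
    "refund_amount": {"mandatory": {"amount", "refund"}, "forbidden": set()},
}


def confirm_canonical_name(header_field: str, candidate: str) -> bool:
    """Return True if the candidate canonical name is accepted as the target."""
    words = set(header_field.lower().split())  # stand-in word segmentation
    rule = RULES[candidate]
    has_mandatory = rule["mandatory"] <= words       # all mandatory words present
    has_forbidden = bool(rule["forbidden"] & words)  # any forbidden word present
    return has_mandatory and not has_forbidden
```

Here "Refund Amount" is rejected for the candidate `payment_amount` because it contains a forbidden word, even though it contains the mandatory word.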

8. The method of claim 1, wherein the identifying the header of the target table comprises:

for each field in a candidate row of the target table, matching words included in the field in a header word segmentation library; the candidate row is a data row or a data column, the row number of the candidate row is smaller than a second threshold, and the words in the header word segmentation library are obtained by word segmentation of historical header fields;

determining the candidate row as the header if the number of target words in the candidate row is greater than a third threshold; wherein a target word is a word that has a matching word in the header word segmentation library.
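The header identification of claim 8 can be sketched as follows, for rows only. The word library, the two threshold values, and whitespace-based word segmentation are all hypothetical choices for illustration; a real header word segmentation library would be built from historical header fields as the claim describes.

```python
from typing import List, Optional

# Hypothetical library of words obtained from historical header fields.
HEADER_WORDS = {"order", "number", "customer", "name", "amount", "date"}


def identify_header(rows: List[List[str]],
                    second_threshold: int = 5,
                    third_threshold: int = 2) -> Optional[int]:
    """Return the index of the first candidate row recognized as the header."""
    # Only rows whose index is below the second threshold are candidates.
    for index, row in enumerate(rows[:second_threshold]):
        target_words = 0
        for cell in row:
            words = str(cell).lower().split()  # stand-in word segmentation
            target_words += sum(1 for w in words if w in HEADER_WORDS)
        # The row is the header if it holds more target words than allowed.
        if target_words > third_threshold:
            return index
    return None  # header not identified
```

A return value of `None` corresponds to the case in which the header is not identified and all data in the table is treated as post-header data.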

9. The method according to any one of claims 1 to 8, wherein after the identifying of the header of the target table for the target table to be processed, the method further comprises:

determining pre-header data and post-header data outside the header in the target table based on the position of the header in the target table;

analyzing the pre-header data and the post-header data by using different analysis modes respectively, to obtain analysis results; the analysis results being used for processing the data in the target table.

10. The method of claim 9, wherein the analyzing the pre-header data and the post-header data by using different analysis modes to obtain analysis results comprises:

performing field segmentation on the pre-header data based on a plurality of character formats to obtain a plurality of split fields;

determining field attributes of the split fields based on the field set;

and determining an analysis result of the pre-header data based on the field attribute of each split field.

11. The method of claim 10, wherein determining a field attribute for each of the partitioned fields based on a field set comprises:

for any of the split fields, determining that the field attribute of the split field is a name attribute when a field whose similarity to the split field is greater than a fourth threshold exists in the field set;

and for any of the split fields, determining that the field attribute of the split field is a content attribute when no field whose similarity to the split field is greater than the fourth threshold exists in the field set.
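The analysis of pre-header data in claims 10 and 11 can be illustrated with the sketch below. It rests on several assumptions that are not stated in the claims: the "character formats" are interpreted as delimiter characters, the field set and the 0.8 threshold are invented, `difflib.SequenceMatcher` serves as the similarity measure, and a content field is paired with the name field that precedes it.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List

# Hypothetical field set used to recognize name-like split fields.
FIELD_SET = {"customer name", "order date", "store"}


def split_fields(text: str) -> List[str]:
    """Split pre-header text on several delimiter characters."""
    return [p.strip() for p in re.split(r"[:：;；,，\t]", text) if p.strip()]


def field_attribute(split_field: str, fourth_threshold: float = 0.8) -> str:
    """Return 'name' if a sufficiently similar field exists, else 'content'."""
    best = max(SequenceMatcher(None, split_field.lower(), f).ratio()
               for f in FIELD_SET)
    return "name" if best > fourth_threshold else "content"


def parse_pre_header(text: str) -> Dict[str, str]:
    """Pair each name field with the content field that follows it."""
    result: Dict[str, str] = {}
    current_name = None
    for part in split_fields(text):
        if field_attribute(part) == "name":
            current_name = part
            result.setdefault(current_name, "")
        elif current_name is not None:
            result[current_name] = part
            current_name = None
    return result
```

For example, the text "Customer name: Alice; Order date: 2023-10-13" is split into four fields, two of which receive the name attribute and two the content attribute.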

12. The method of claim 9, wherein the determining pre-header data and post-header data in the target table other than the header based on the position of the header in the target table comprises:

the analyzing the pre-header data and the post-header data by using different analysis modes respectively to obtain analysis results comprises:

and determining the canonical name of each row of data based on the extracted data features, to obtain an analysis result of the post-header data.

13. The method of claim 12, wherein determining the canonical name for each row of data based on the extracted data features comprises:

among the feature conditions of the plurality of canonical names, determining a target feature condition satisfied by the data features extracted for each row of data;

14. A form data processing apparatus, characterized in that the form data processing apparatus comprises:

15. A computing device, comprising: a memory and a processor;

the memory is configured to store computer-executable instructions, the processor being configured to execute the computer-executable instructions, which when executed by the processor implement the method of any one of claims 1 to 13.

16. A computer readable storage medium, characterized in that computer executable instructions are stored, which when executed by a processor implement the method of any one of claims 1 to 13.