WO2020057021A1

WO2020057021A1 - Data table processing method and device, computer device and storage medium

Info

Publication number: WO2020057021A1
Application number: PCT/CN2019/071126
Authority: WO
Inventors: 柳明辉; 徐国强; 黄北辰; 杨镭; 付晓
Original assignee: 深圳壹账通智能科技有限公司
Priority date: 2018-09-18
Filing date: 2019-01-10
Publication date: 2020-03-26
Also published as: CN109299094A

Abstract

A data table processing method, comprising: acquiring a data table uploaded by a user; parsing the data table to obtain table structure information of the data table; recognizing the table structure information by means of a trained labeling model, and outputting a label result for each field name in the table structure information, each label result being one of: retrieval range only, retrieval dimension only, or both retrieval range and retrieval dimension; and storing the label results in correspondence with the data table.

Description

Data table processing method, device, computer equipment and storage medium

This application claims priority to Chinese patent application No. 201811090036.8, filed with the Chinese Patent Office on September 18, 2018 and entitled "Data Sheet Processing Method, Device, Computer Equipment, and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of computer technology, and in particular, to a data table processing method, device, computer device, and storage medium.

Background

At present, big data platforms are available on the market for all walks of life. These data platforms can obtain data and compute statistics based on user input, and can present the statistical results to users in the form of reports to meet users' data analysis requirements.

In order to obtain data that matches the user's input, it is usually necessary to pre-process the data in the data source database. However, the inventors have realized that existing data platforms usually can only perform simple operations on the data in the data source database, such as normalizing field names. When field names need to be marked as dimensions or ranges, the platforms usually rely on manual processing, which requires a large amount of repetitive manual work and results in very low processing efficiency.

Summary of the Invention

According to various embodiments disclosed in the present application, a data table processing method, apparatus, computer device, and storage medium are provided.

A data table processing method includes:

Obtaining a data table uploaded by a user;

Parsing the data table to obtain table structure information of the data table;

Identifying the table structure information through a trained labeling model, and outputting a labeling result for each field name in the table structure information, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and

Storing the labeling result in correspondence with the data table.

A data table processing device includes:

An acquisition module for acquiring a data table uploaded by a user;

An analysis module, configured to parse the data table to obtain table structure information of the data table;

A labeling module, configured to identify the table structure information through a trained labeling model and output a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and

A storage module, configured to store the labeling result in correspondence with the data table.

A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps: obtaining a data table uploaded by a user;

Parsing the data table to obtain table structure information of the data table;

Identifying the table structure information through a trained labeling model, and outputting a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and

Storing the labeling result in correspondence with the data table.

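One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:

Obtaining a data table uploaded by a user;

Parsing the data table to obtain table structure information of the data table;

Identifying the table structure information through a trained labeling model, and outputting a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and

Storing the labeling result in correspondence with the data table.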

Details of one or more embodiments of the present application are set forth in the accompanying drawings and description below. Other features and advantages of the application will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is an application scenario diagram of a data table processing method according to one or more embodiments.

FIG. 2 is a schematic flowchart of a data table processing method according to one or more embodiments.

FIG. 3 is a schematic flowchart of steps for filtering report data according to an annotation result in one or more embodiments.

FIG. 4 is a schematic flowchart of a data table processing method according to one or more specific embodiments.

FIG. 5 is a block diagram of a data table processing apparatus according to one or more embodiments.

FIG. 6 is a block diagram of a computer device according to one or more embodiments.

Detailed Description

In order to make the technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the application and are not intended to limit it.

The data table processing method provided in this application can be applied to the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 through a network. The terminal 102 can obtain the data table uploaded by the user and parse the data table to obtain the table structure information of the data table. The terminal 102 can also identify the table structure information through the trained labeling model and output a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension. The terminal 102 may also store the obtained labeling result in correspondence with the data table and store the result in the server 104. The terminal 102 may be, but is not limited to, a personal computer, laptop, smartphone, tablet, or portable wearable device. The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers, and may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, and cloud storage.

It should be noted that the above application environment is only an example. In some embodiments, the terminal 102 may instead send the obtained table structure information to the server 104, and the server 104 identifies the table structure information through the trained labeling model to obtain the labeling result corresponding to each field name in the data table. The server 104 stores the labeling result in correspondence with the data table, and the terminal 102 can obtain the labeling result corresponding to the data table from the server 104.

In some embodiments, as shown in FIG. 2, a data table processing method is provided. The method is described using its application to the terminal 102 in FIG. 1 as an example, and includes the following steps:

Step 202: Obtain a data table uploaded by a user.

A data table is a structured data table. For example, it can be a table in CSV (Comma-Separated Values) format. The CSV data table stores table data in plain text. The stored table data includes numeric and character types. Specifically, a web interface can be provided, and the user uploads the data table through the web interface, and the terminal can obtain the data table uploaded by the user. In one embodiment, each user needs to generate a data table containing report data according to a preset file format or form template, so that the terminal can parse out the table structure information of the uploaded data table.

Table 1 below is a schematic diagram of a CSV format data table uploaded in an embodiment.

Table 1

As can be seen from Table 1 above, the elements in each row of the data table are separated by commas. The elements in the first row indicate the column names, which are also called the header or field names of the data table. The remaining elements in each column are the field values corresponding to that column's field name, and one field name corresponds to multiple field values.

Step 204: Parse the data table to obtain table structure information of the data table.

The table structure information is information that indicates the contents included in the data table. Specifically, the terminal may restrict the format or form of the data table uploaded by the user so that the uploaded data table conforms to a fixed format. The terminal can then parse the data table according to the preset format or form to obtain the table structure information of the data table.

In one embodiment, the table structure information includes field names and field value types. In step 204, parsing the data table to obtain the table structure information of the data table specifically includes: extracting the field names included in the header of the data table; counting the enumeration values corresponding to each field name; taking the character type of the field values corresponding to each field name as the field value type of that field name; and determining the table structure information of the data table according to the field names and the corresponding enumeration values and field value types.

Specifically, the terminal may extract each field name included in the header of the data table uploaded by the current user to obtain all field names. The terminal may also collect the field values corresponding to each field name and classify them; the number of distinct categories is the number of enumeration values of that field name. For example, in Table 1 above, the field values corresponding to the field name "Gender" fall into only two categories, "Male" and "Female", so the enumeration values corresponding to the field name "Gender" are "Male, Female". Similarly, in Table 1, the enumeration values corresponding to the field name "Education" are "College, Undergraduate, Master".

It can be understood that for field names whose field values cannot be exhaustively listed, or whose number of distinct field values exceeds a preset number (such as 20), the field values are very scattered, so the terminal can determine that the field name has no enumeration values. For example, for the field name "Name" in Table 1, every row is different, so counting its enumeration values is meaningless. Likewise, for the field name "Loan Amount", the loan amount differs across regions, genders, and academic degrees, so this field name has no enumeration values. The terminal may mark the enumeration value of a field name that has no enumeration values as "none".

The field value type is the character type of the field value, and is either character or numeric. For example, in Table 1 above, the field values of name, gender, education, and region are characters, so their field value types are character types; the field values of age, loan amount, loan time, and ID number are purely numeric, so their field value types are numeric types.

The terminal parses the data table uploaded by the user to obtain each field name. For each field name that has enumeration values, the corresponding enumeration values and field value type can be obtained. In this way, the terminal obtains table structure information that expresses the contents of the entire data table.
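The parsing described in step 204 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the application: the sample data is hypothetical (Table 1 is not reproduced here), the names `parse_table_structure` and `MAX_ENUM_VALUES` are invented for the example, and the rule "a column has no enumeration values if every row is distinct or the distinct count exceeds the preset number" is one possible reading of "cannot be exhausted".

```python
import csv
import io

MAX_ENUM_VALUES = 20  # the "preset number" from the text (20 is its example)

# Hypothetical sample data; the actual Table 1 is not reproduced here.
SAMPLE_CSV = """Name,Gender,Education,Loan Amount
Zhang San,Male,Undergraduate,50000
Li Si,Female,Master,80000
Wang Wu,Male,College,30000
Zhao Liu,Female,Undergraduate,120000
"""


def is_numeric(value):
    """Return True if the value parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def parse_table_structure(csv_text):
    """Map each field name to its enumeration values and field value type."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    structure = {}
    for col, name in enumerate(header):
        values = [row[col] for row in body]
        distinct = sorted(set(values))
        # Assumed heuristic: scattered values (all rows distinct, or more
        # distinct values than the preset number) have no enumeration.
        enumerable = len(distinct) <= MAX_ENUM_VALUES and len(distinct) < len(values)
        structure[name] = {
            "enum_values": distinct if enumerable else "none",
            "value_type": "numeric" if all(is_numeric(v) for v in values) else "character",
        }
    return structure


structure = parse_table_structure(SAMPLE_CSV)
```

With this sample, "Gender" and "Education" receive enumeration values, while "Name" and "Loan Amount" are marked "none" because every row differs.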

Step 206: Identify the table structure information through the trained labeling model, and output a labeling result for each field name in the table structure information; the labeling result is one of: search range only, search dimension only, or both search range and search dimension.

Specifically, the terminal may recognize the input table structure information by using a trained machine learning model and output a labeling result for each field name in the data table. The labeling result is the result of automatically labeling each field name in the data table through the labeling model, and indicates one of the following: the field name can be used only as a search range, only as a search dimension, or as both a search range and a search dimension.

The search range can be used as a condition for screening data from the large amount of data stored in the data source database: the terminal can filter, from the data source database, the report data required to generate the report according to the search range. The search dimension can be used as the display dimension for presenting the filtered data. If a field name can be used as a search range, then when filtering report data from the data source database, the terminal can filter on the field values corresponding to that field name. Similarly, if a field name can be used as a search dimension, the terminal can filter on the field values corresponding to that field name, which improves the degree of matching between the filtered data and the search terms entered by the user.

In one embodiment, step 206, identifying the table structure information through the trained labeling model and outputting the labeling result of each field name in the data table, includes: obtaining the business scenario category selected by the user; inputting the table structure information into the trained labeling model corresponding to the business scenario category, and obtaining, through the labeling model, the feature vector corresponding to each field name in the data table according to the table structure information; and transforming the feature vector corresponding to each field name to output the labeling result corresponding to each field name in the data table.

Business scenario categories are used to distinguish different business scenarios. Different business scenarios correspond to different data source libraries and different labeling models. The business scenarios include loan business, insurance business, wealth management business, banking business, etc. The data of the data source database involved in these different business scenarios are different, and the training corpus used is different when training the labeling model. That is, different business scenarios need to use different labeling models to identify table structure information.

Specifically, when the user enters the retrieval platform, the terminal may provide business scenario categories for the user to select. After the user selects a business scenario category and uploads the data table, the terminal parses the data table to obtain the table structure information, inputs it into the labeling model corresponding to the business scenario category, obtains the feature vector corresponding to each field name according to the parsed table structure information, and then transforms the obtained feature vector using the model parameters of the hidden layer of the model to output a labeling result indicating whether each field name can be used as a search range or a search dimension.

In one embodiment, after obtaining the table structure information of the data table, the terminal may determine the feature vector corresponding to each field name according to the field name, the enumeration values corresponding to the field name, and the field value type corresponding to the field name. Specifically, each field name in the data table can be vectorized to obtain a vectorized representation of the field name, and a word vector corresponding to the enumeration values of each field name can be obtained. The feature vector corresponding to each field name is then generated from features such as the field name, whether enumeration values exist, the enumeration values themselves and their number, and the field value type. That is, the information expressed by the feature vector of a field name includes the various table structure features associated with that field name.
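As a rough illustration of how such a feature vector might be assembled, the sketch below concatenates a vector for the field name, a vector for the enumeration values, and three scalar features. Everything here is an assumption made for the example: the application does not specify the vectorization method, so a hashed character count (`hash_text`) stands in for a learned word vector, and `NAME_DIMS` and `build_feature_vector` are hypothetical names.

```python
import zlib

NAME_DIMS = 8  # hypothetical size of each hashed text vector


def hash_text(text, dims=NAME_DIMS):
    """Stand-in for a word vector: count characters into hashed buckets."""
    vec = [0.0] * dims
    for ch in text:
        vec[zlib.crc32(ch.encode("utf-8")) % dims] += 1.0
    return vec


def build_feature_vector(field_name, enum_values, value_type):
    """Concatenate name vector, enum-value vector, and scalar features."""
    has_enum = enum_values != "none"
    enum_list = enum_values if has_enum else []
    scalars = [
        1.0 if has_enum else 0.0,                   # whether enumeration values exist
        float(len(enum_list)),                      # number of enumeration values
        1.0 if value_type == "numeric" else 0.0,    # field value type
    ]
    return hash_text(field_name) + hash_text(" ".join(enum_list)) + scalars


gender_vec = build_feature_vector("Gender", ["Female", "Male"], "character")
amount_vec = build_feature_vector("Loan Amount", "none", "numeric")
```

A real labeling model would feed vectors of this kind through its hidden layer; the fixed layout here only shows which table structure features end up in the vector.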

Step 208: Store the labeling result corresponding to the data table.

Specifically, the terminal may store the identified labeling result corresponding to the data table, so as to filter out report data required for generating a report from a large number of data tables stored in a data source database according to a search term input by a user.

In one embodiment, the terminal may generate a corresponding data table identifier for the data table uploaded by the user. The data table identifier uniquely identifies a data table and may include at least one of characters, numbers, and symbols. The terminal may obtain each field name included in the table structure information extracted in step 204 and the labeling result of each field name obtained in step 206, and store the data table identifier in correspondence with the labeling result of each field name in the data table.

In this way, when a search term is obtained, the terminal can traverse the data tables corresponding to each data table identifier in the data source database, obtain the corresponding labeling results according to the data table identifiers, and, according to the labeling results, filter from the data tables the report data that matches the search term and is needed to generate the report. As shown in FIG. 3, in one embodiment, the above data table processing method further includes a step of filtering report data according to the labeling result:
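Storing labeling results under a generated data table identifier could look like the following sketch. The in-memory dictionary, the use of a UUID as the identifier, and the label strings "range", "dimension", and "both" are assumptions for illustration; the application only requires that the identifier be unique and that the labeling results be stored in correspondence with the data table.

```python
import uuid

# Hypothetical in-memory store: data table identifier -> {field name: label}
label_store = {}


def store_labels(labels, store):
    """Generate a unique data table identifier and store the labels under it."""
    table_id = uuid.uuid4().hex  # assumed identifier scheme
    store[table_id] = dict(labels)
    return table_id


# Hypothetical labeling results for one uploaded table.
table_id = store_labels(
    {"Gender": "both", "Region": "both", "Education": "dimension"},
    label_store,
)
```

The stored mapping can later be looked up by `table_id` when a search term has to be matched against the labeled field names.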

Step 302: Obtain a search term input by a user.

Specifically, the search platform includes a search box for a user to input a search term, and when the terminal detects a report search event, the terminal can obtain the content entered by the user in the search box as the search term. For example, the user enters "How are the academic qualifications of male borrowers in Shanghai distributed?" and clicks the search icon, which triggers the terminal to obtain the search term.

Step 304: Identify a search range and a search dimension corresponding to the search term.

Specifically, the terminal may recognize the search term through a trained intent recognition model to obtain the search range and search dimension corresponding to the search term. The terminal may vectorize the obtained search term to produce a search term vector and then input the search term vector into the intent recognition model. The hidden layer of the intent recognition model encodes and transforms the search term vector and outputs the search range and search dimension corresponding to the search term.

Step 306: Obtain a labeling result corresponding to each data table in the data source database.

A data source database is a database that stores a large number of data tables. The data stored in the data source database can correspond to the business scenario category. Different business scenarios correspond to data source libraries that store different types of data. When the report data needs to be filtered from the data source database, the terminal may first obtain the labeling results corresponding to each stored data table.

Step 308: According to the labeling result, filter report data matching the search range and search dimension from the data source database.

Specifically, after the terminal identifies the search range and search dimension corresponding to the search term and obtains the labeling results corresponding to each data table, it can filter the report data required for report generation from the data tables in the data source database according to the labeling results, the search range, and the search dimension.

In one embodiment, step 308, filtering report data matching the search range and search dimension from the data source database according to the labeling result, specifically includes: matching the search range against the field names that can be used as a search range in the labeling result; matching the search dimension against the field names that can be used as a search dimension in the labeling result; and filtering the report data from the data source database according to the matched field names.

Specifically, the terminal can obtain the labeling results corresponding to each data table in the data source database, match the search range identified from the search term against the field names that can be used as a search range in the labeling result, and match the search dimension identified from the search term against the field names that can be used as a search dimension in the labeling result. If they match, the report data is filtered from the data source database according to the matched field names. If they do not match, the data source database does not store report data that matches the search term. In one embodiment, the terminal may also identify the retrieval intent from the search term, further summarize the filtered report data statistically to obtain the statistical data for generating a report, and draw a report according to the search dimension and retrieval intent to display the statistical data.
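The matching in step 308 can be sketched as follows. The data shapes are assumptions made for the example: the search range is taken to be a mapping from field name to required value (for instance, Gender = Male and Region = Shanghai), the search dimension is a list of field names, and the labels are the hypothetical strings "range", "dimension", and "both". The function names are likewise invented.

```python
def usable_as(label, role):
    """A field can play a role if labeled with that role or with 'both'."""
    return label == role or label == "both"


def filter_report_data(tables, labels_by_table, search_range, search_dimensions):
    """Return rows (restricted to the search dimensions) from matching tables."""
    results = []
    for table_id, rows in tables.items():
        labels = labels_by_table.get(table_id, {})
        range_ok = all(usable_as(labels.get(f), "range") for f in search_range)
        dim_ok = all(usable_as(labels.get(f), "dimension") for f in search_dimensions)
        if not (range_ok and dim_ok):
            continue  # this table stores no data matching the search term
        for row in rows:
            if all(row.get(f) == v for f, v in search_range.items()):
                results.append({d: row.get(d) for d in search_dimensions})
    return results


# Hypothetical data source with two tables and their stored labeling results.
tables = {
    "t1": [
        {"Gender": "Male", "Region": "Shanghai", "Education": "Master"},
        {"Gender": "Female", "Region": "Shanghai", "Education": "Undergraduate"},
        {"Gender": "Male", "Region": "Beijing", "Education": "College"},
        {"Gender": "Male", "Region": "Shanghai", "Education": "Undergraduate"},
    ],
    "t2": [{"Gender": "Male", "Region": "Shanghai", "Education": "College"}],
}
labels_by_table = {
    "t1": {"Gender": "both", "Region": "both", "Education": "both"},
    "t2": {"Gender": "dimension", "Region": "both", "Education": "both"},
}

report_rows = filter_report_data(
    tables, labels_by_table,
    search_range={"Gender": "Male", "Region": "Shanghai"},
    search_dimensions=["Education"],
)
```

Table "t2" is skipped because its "Gender" field is labeled as a search dimension only, so it cannot serve as a search range.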

In the above data table processing method, when the data table uploaded by the user is obtained, the data table is parsed to obtain its table structure information, which reflects the content and field names included in the data table. The trained labeling model then recognizes the table structure information and automatically outputs the labeling result corresponding to each field name in the data table; the labeling result determines whether a field name can be used as a search range or a search dimension. Because the field names in the uploaded data table are labeled automatically, labeling efficiency is greatly improved compared with manual labeling. Storing the labeling results in correspondence with the data table also makes it easy to obtain, from the data table, data that matches the user's search term.

In one embodiment, the training step of the labeling model includes: obtaining a training sample corpus and a test sample corpus; obtaining the labeling result corresponding to each training sample in the training sample corpus and each test sample in the test sample corpus; cyclically performing the following steps until all samples in the training sample corpus have been used for training: inputting the labeled current training sample into the machine learning model, outputting the prediction result corresponding to the current training sample, comparing the prediction result with the corresponding labeling result, adjusting the model parameters of the machine learning model when the difference does not meet a preset condition, and accepting the previously adjusted model parameters when the difference meets the preset condition; inputting each test sample in the test sample corpus into the trained machine learning model and outputting the prediction result corresponding to each test sample; computing the accuracy rate of the machine learning model based on the differences between the prediction result and the labeling result of each test sample; and obtaining the trained labeling model when the computed accuracy rate meets the training stop condition.

The training sample corpus is a corpus used for training the model, and the test sample corpus is a corpus used for testing the model. In one embodiment, business scenario categories need to be distinguished when training the machine learning model. For each business scenario category, the training sample corpus and test sample corpus corresponding to that category are obtained, and the machine learning model is trained on them to obtain a labeling model corresponding to that category. After the user selects a business scenario category and uploads the data table, the uploaded data table can be automatically labeled by the labeling model corresponding to the selected business scenario category.

Specifically, in order to train the model, the labeling results of the obtained training sample corpus and test sample corpus may be produced manually. Manually produced labeling results are accurate, which helps to obtain a labeling model with high labeling accuracy. During training, the labeled current training sample is input into the machine learning model, and the prediction result corresponding to the current training sample is output and compared with the corresponding labeling result. When the difference between the two does not meet the preset condition, the model parameters of the machine learning model are adjusted; when the difference meets the preset condition, the previously adjusted model parameters are accepted. The process is then repeated with the next training sample until all training samples in the training sample corpus have been used.

Next, the test samples in the test sample corpus are input into the trained model, and the accuracy rate of the predictions on the test samples is computed. When the computed accuracy rate meets the training stop condition, the trained labeling model is obtained.

When the computed accuracy rate does not meet the training stop condition, the machine learning model can continue to be trained based on the training sample corpus and test sample corpus until the computed accuracy rate meets the training stop condition and the trained labeling model is obtained.
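The training procedure above can be sketched with a small stand-in model. The application does not name a model type, so a multi-class perceptron is swapped in here purely for illustration; the three-element feature layout, the toy samples and their labels, the accuracy threshold, and all function names are assumptions, and the toy labels make no claim about which real fields deserve which label.

```python
LABELS = ["range", "dimension", "both"]


def predict(weights, features):
    """Score each label with a linear function and return the best one."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return LABELS[scores.index(max(scores))]


def train_one_pass(weights, samples, lr=1.0):
    """Feed each labeled sample in turn; adjust parameters only on a mismatch."""
    for features, label in samples:
        guess = predict(weights, features)
        if guess != label:  # difference does not meet the preset condition
            t, g = LABELS.index(label), LABELS.index(guess)
            for i, x in enumerate(features):
                weights[t][i] += lr * x
                weights[g][i] -= lr * x
        # otherwise the previously adjusted parameters are kept as they are


def accuracy(weights, samples):
    """Fraction of samples whose prediction equals the labeling result."""
    return sum(predict(weights, f) == y for f, y in samples) / len(samples)


def train(train_samples, test_samples, target_accuracy=0.9, max_rounds=50):
    """Repeat training passes until test accuracy meets the stop condition."""
    weights = [[0.0] * len(train_samples[0][0]) for _ in LABELS]
    for _ in range(max_rounds):
        train_one_pass(weights, train_samples)
        if accuracy(weights, test_samples) >= target_accuracy:
            break
    return weights


# Toy manually labeled corpora; features = [has_enum, is_numeric, bias].
train_samples = [([1, 0, 1], "both"), ([0, 1, 1], "dimension"), ([0, 0, 1], "range")]
test_samples = [([1, 0, 1], "both"), ([0, 0, 1], "range")]

weights = train(train_samples, test_samples)
test_accuracy = accuracy(weights, test_samples)
```

The `max_rounds` cap is an added safeguard so the loop terminates even if the stop condition is never met; the application itself describes training until the accuracy condition is satisfied.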

In this embodiment, the machine learning model can be continuously trained by manually labeling the sample corpus to obtain a labeling model with a labeling accuracy rate that satisfies a preset condition, so as to realize automatic labeling of the data table uploaded by the user.

In one embodiment, the above data table processing method further includes the following steps: displaying each field name and the corresponding labeling result; obtaining at least two field names selected by the user from the displayed field names; obtaining an intermediate field name, entered by the user, that is associated with the at least two field names; and storing the intermediate field name in correspondence with the data table, where the labeling result of the intermediate field name is the same as that of the at least two selected field names.

The terminal can also display the identified labeling results to the user who uploaded the data table, so that the user can define intermediate field names according to the displayed labeling results. Specifically, the terminal may obtain at least two field names selected by the user from the displayed field names. The selected field names are original field names that appear in the data table uploaded by the user. An intermediate field name is not an original field name in the data table but is defined in terms of the original field names. The user can enter the intermediate field name and associate the at least two selected original field names with it. The terminal can then obtain the association between the intermediate field name and the at least two field names and store the intermediate field name in correspondence with the data table; the labeling result of the intermediate field name is the same as the labeling result of the at least two selected field names.

For example, suppose the data table uploaded by the user includes the original field names "overdue amount" and "overdue principal", and the labeling result of both is "both search range and search dimension". The user can then define the intermediate field name "overdue rate" with "overdue rate = overdue amount / overdue principal". The terminal can store the intermediate field name "overdue rate" in correspondence with this data table, and its labeling result likewise allows it to be used as either a search range or a search dimension. Then, when the user enters the search term "difference between overdue rates in Shanghai and Beijing", the intent recognition model can identify the search dimension as "overdue rate". According to that search dimension, the definition "overdue rate = overdue amount / overdue principal" can be obtained, the report data can be filtered from the data source database according to "overdue amount" and "overdue principal", and the "overdue rate" can be calculated to obtain the statistical data required to generate the report.

In this embodiment, by displaying the labeling results, the user can define intermediate field names based on the original field names and store them, so that the field names available for matching the user's search terms are richer and more diverse. Even when the search range or search dimension identified from the search term does not exist among the original field names of the data table, report data matching the search term can still be filtered, which improves the accuracy of the matching.
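The "overdue rate" example can be worked through in code. This is a sketch under assumptions: the intermediate field is stored as a numerator/denominator pair, the rate per region is computed from summed amounts and summed principals (the application does not say how rows are aggregated), and the sample figures and function names are invented.

```python
from collections import defaultdict

# Hypothetical stored definition: overdue rate = overdue amount / overdue principal
OVERDUE_RATE = {"numerator": "overdue amount", "denominator": "overdue principal"}


def intermediate_label(labels, definition):
    """The intermediate field inherits the shared label of its source fields."""
    source_labels = {labels[definition["numerator"]], labels[definition["denominator"]]}
    if len(source_labels) != 1:
        raise ValueError("source fields must share the same labeling result")
    return source_labels.pop()


def rate_by_group(rows, group_field, definition):
    """Sum numerator and denominator per group, then divide."""
    sums = defaultdict(lambda: [0, 0])
    for row in rows:
        sums[row[group_field]][0] += row[definition["numerator"]]
        sums[row[group_field]][1] += row[definition["denominator"]]
    return {group: n / d for group, (n, d) in sums.items() if d}


labels = {"overdue amount": "both", "overdue principal": "both", "Region": "both"}
rows = [
    {"Region": "Shanghai", "overdue amount": 10, "overdue principal": 100},
    {"Region": "Shanghai", "overdue amount": 30, "overdue principal": 100},
    {"Region": "Beijing", "overdue amount": 5, "overdue principal": 50},
]

label = intermediate_label(labels, OVERDUE_RATE)
rates = rate_by_group(rows, "Region", OVERDUE_RATE)
```

Here Shanghai's rate is 40 / 200 and Beijing's is 5 / 50, which is the kind of statistic a report comparing the two regions would display.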

As shown in FIG. 4, in a specific embodiment, the data table processing method specifically includes the following steps:

Step 402: Obtain a data table uploaded by a user;

Step 404: Extract field names included in the header of the data table;

Step 406: Count the enumeration values corresponding to the field names.

Step 408: Use the character type of the field value corresponding to each field name as the field value type of the field name.

Step 410: Determine the table structure information of the data table according to the field names and corresponding enumeration values and field value types.

Step 412: Obtain a business scenario category selected by the user;

Step 414: Input the table structure information into the trained labeling model corresponding to the business scenario category, and obtain, through the labeling model, the feature vector corresponding to each field name in the data table according to the table structure information;

Step 416: Transform the feature vector corresponding to each field name, and output the labeling result corresponding to each field name in the data table; the labeling result is one of: search range only, search dimension only, or both search range and search dimension;

Step 418: Display each field name and the corresponding labeling result;

Step 420: Acquire at least two field names input by the user from the displayed field names;

Step 422: Obtain an intermediate field name associated with at least two field names input by the user;

Step 424: Store the intermediate field name in correspondence with the data table; the labeling result of the intermediate field name is the same as that of the at least two selected field names;

Step 426: Store the labeling result in correspondence with the data table.

Step 428: Obtain a search term input by the user;

Step 430: Identify a search range and a search dimension corresponding to the search term;

Step 432: Obtain a labeling result corresponding to each data table in the data source database.

Step 434: Match the search range against the field names that the labeling results mark as usable as a search range;

Step 436: Match the search dimension against the field names that the labeling results mark as usable as a search dimension;

Step 438: Filter the report data from the data source database according to the matched field names.
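Steps 404 to 410 (deriving table structure information from the uploaded table) can be sketched as below. This is an illustrative sketch under assumptions: the table is assumed to arrive as CSV text, "enumeration values" are taken to mean the distinct values of a column, and the field value type is reduced to a two-way `numeric` / `text` split. The function names and the `max_enum` cap are hypothetical.

```python
import csv
import io
from typing import Dict, List


def infer_value_type(values: List[str]) -> str:
    """Return 'numeric' if every non-empty value parses as a float, else 'text'."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "text"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "text"
    return "numeric"


def parse_table(csv_text: str, max_enum: int = 20) -> Dict[str, dict]:
    """Build table structure information: field names, enumeration values, value types."""
    reader = csv.DictReader(io.StringIO(csv_text))
    field_names = reader.fieldnames or []           # Step 404: header field names
    columns: Dict[str, List[str]] = {name: [] for name in field_names}
    for row in reader:
        for name in field_names:
            columns[name].append(row[name] or "")   # short rows yield None; treat as empty

    structure: Dict[str, dict] = {}
    for name, values in columns.items():
        distinct = sorted(set(values))              # Step 406: enumeration values
        structure[name] = {
            "enum_values": distinct[:max_enum],     # capped so large columns stay small
            "enum_count": len(distinct),
            "value_type": infer_value_type(values), # Step 408: field value type
        }
    return structure                                # Step 410: table structure information


sample = "city,overdue amount\nShanghai,30\nBeijing,10\nShanghai,12.5\n"
info = parse_table(sample)
```

The resulting dictionary is the kind of input a labeling model could consume in Step 414; the exact encoding used by the trained model is not specified in the text.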

It should be understood that although the steps in the flowcharts of FIGS. 2 to 4 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 2 to 4 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time and may be performed at different times; nor are they necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 5, a data table processing apparatus 500 is provided, including: an obtaining module 502, a parsing module 504, a labeling module 506, and a storage module 508, where:

The obtaining module 502 is configured to obtain a data table uploaded by a user.

The parsing module 504 is configured to parse the data table to obtain table structure information of the data table.

The labeling module 506 is configured to identify the table structure information through the trained labeling model and output the labeling result for each field name in the data table; the labeling result is one of: search range only, search dimension only, or both search range and search dimension.

The storage module 508 is configured to store the labeling result in correspondence with the data table.

In one embodiment, the data table processing apparatus 500 further includes a search term acquisition module, a recognition module, a labeling result acquisition module, and a report data filtering module. The search term acquisition module is configured to obtain a search term input by a user; the recognition module is configured to identify the search range and search dimension corresponding to the search term; the labeling result acquisition module is configured to obtain the labeling results corresponding to each data table in the data source database; and the report data filtering module is configured to filter out, from the data source database and based on the labeling results, report data that matches the search range and search dimension.

In one embodiment, the report data filtering module is further configured to match the search range against the field names that can be used as a search range in the labeling results; match the search dimension against the field names that can be used as a search dimension in the labeling results; and filter report data from the data source database according to the matched field names.
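The matching performed by the report data filtering module can be sketched as follows. This is a hedged illustration: it assumes the recognized search range and search dimension have already been normalized to candidate field names, uses exact name matching, and requires every requested name to match before a table is selected. None of these choices is stated in the text, and `LabelStore` and `match_tables` are hypothetical names standing in for the stored labeling results and the filtering logic.

```python
from typing import Dict, List, Set

# Hypothetical in-memory stand-in for stored labeling results:
# table name -> field name -> labels ("range", "dimension", or both).
LabelStore = Dict[str, Dict[str, Set[str]]]


def match_tables(
    store: LabelStore,
    search_ranges: List[str],
    search_dimensions: List[str],
) -> Dict[str, Dict[str, List[str]]]:
    """Return the tables whose labeled field names cover the search range and dimension."""
    matches: Dict[str, Dict[str, List[str]]] = {}
    for table, labels in store.items():
        # A field matches the search range only if it is labeled as usable as a range.
        range_hits = [f for f in search_ranges if "range" in labels.get(f, set())]
        # Likewise for the search dimension.
        dim_hits = [f for f in search_dimensions if "dimension" in labels.get(f, set())]
        if len(range_hits) == len(search_ranges) and len(dim_hits) == len(search_dimensions):
            matches[table] = {"range": range_hits, "dimension": dim_hits}
    return matches


store: LabelStore = {
    "loans": {"city": {"range"}, "overdue amount": {"range", "dimension"}},
    "branches": {"city": {"dimension"}},
}
selected = match_tables(store, search_ranges=["city"], search_dimensions=["overdue amount"])
```

Only the matched field names of the selected tables would then need to be read from the data source database to build the report.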

In one embodiment, the table structure information includes field names and field value types; the parsing module is further configured to extract the field names included in the header of the data table; count the enumeration values corresponding to each field name; use the character type of the field value corresponding to each field name as the field value type of that field name; and determine the table structure information of the data table according to the field names and the corresponding enumeration values and field value types.

In one embodiment, the labeling module is further configured to obtain the business scenario category selected by the user; input the table structure information into the trained labeling model corresponding to the business scenario category, and obtain, through the labeling model, the feature vector corresponding to each field name in the data table according to the table structure information; and transform the feature vector corresponding to each field name to output the labeling result corresponding to each field name in the data table.
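One way to picture "transforming the feature vector" into one of the three labeling results is a linear layer followed by a softmax. The text does not specify the transformation, so this is an assumed stand-in; the weights, the two-element feature vectors, and the `"lending"` category key are arbitrary placeholders, not trained values.

```python
import math
from typing import Dict, List, Sequence, Tuple

LABELS = ("range only", "dimension only", "range and dimension")

# Hypothetical registry: one (weights, bias) pair per business scenario category.
MODELS: Dict[str, Tuple[List[List[float]], List[float]]] = {
    "lending": ([[2.0, 0.0], [0.0, 2.0], [1.5, 1.5]], [0.0, 0.0, 0.0]),
}


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_field(feature: Sequence[float], weights, bias) -> str:
    """Apply the linear transform and pick the most probable of the three labels."""
    scores = [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weights, bias)]
    probs = softmax(scores)
    return LABELS[probs.index(max(probs))]


def label_fields(category: str, features: Dict[str, Sequence[float]]) -> Dict[str, str]:
    """Label every field name using the model registered for the chosen category."""
    weights, bias = MODELS[category]
    return {name: label_field(vec, weights, bias) for name, vec in features.items()}


# Placeholder feature vectors keyed by field name.
features = {"city": [1.0, 0.0], "loan date": [0.0, 1.0], "overdue amount": [1.0, 1.0]}
result = label_fields("lending", features)
```

Keeping one model per business scenario category mirrors the selection step described above: the same field name may warrant a different labeling result in a different scenario.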

In one embodiment, the data table processing apparatus 500 further includes a training module configured to: obtain a training sample corpus and a test sample corpus; obtain the labeling results corresponding to each training sample in the training sample corpus and each test sample in the test sample corpus; cyclically perform the steps of inputting the current labeled training sample into the machine learning model, outputting the prediction result corresponding to the current training sample, comparing that prediction result with the corresponding labeling result, adjusting the model parameters of the machine learning model when the difference does not meet the preset condition, and keeping the previously adjusted model parameters when the difference meets the preset condition, until the training sample corpus has been fully trained on; input each test sample in the test sample corpus into the trained machine learning model and output the prediction result corresponding to each test sample; calculate the accuracy of the machine learning model from the differences between the prediction results of the test samples and the corresponding labeling results; and obtain the trained labeling model when the calculated accuracy meets the training stop condition.
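The training procedure can be sketched with a multiclass perceptron as a stand-in learner; the text does not name a learning algorithm, so the update rule is an assumption. The "preset condition" is taken here to be an exact match between prediction and label, and the "training stop condition" to be a target accuracy on the test corpus. The sample data and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, index of the labeling result)


def predict(weights: List[List[float]], x: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return scores.index(max(scores))


def train_epoch(weights: List[List[float]], samples: List[Sample]) -> None:
    """One pass over the training corpus, adjusting parameters only on mismatches."""
    for x, y in samples:
        guess = predict(weights, x)
        if guess != y:  # difference does not meet the preset condition: adjust
            for i, v in enumerate(x):
                weights[y][i] += v
                weights[guess][i] -= v
        # otherwise the previously adjusted parameters are kept unchanged


def accuracy(weights: List[List[float]], samples: List[Sample]) -> float:
    """Fraction of samples whose prediction equals the labeling result."""
    return sum(predict(weights, x) == y for x, y in samples) / len(samples)


def fit(train: List[Sample], test: List[Sample], n_classes: int,
        target: float = 0.9, max_epochs: int = 50) -> List[List[float]]:
    """Train until the test accuracy meets the stop condition or epochs run out."""
    dim = len(train[0][0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(max_epochs):
        train_epoch(weights, train)
        if accuracy(weights, test) >= target:
            break
    return weights


# Made-up, linearly separable samples for the three labeling results.
train_corpus: List[Sample] = [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 1), ([0.0, 0.0, 1.0], 2)]
test_corpus: List[Sample] = [([2.0, 0.0, 0.0], 0), ([0.0, 3.0, 0.0], 1), ([0.0, 0.0, 1.0], 2)]
model = fit(train_corpus, test_corpus, n_classes=3)
test_accuracy = accuracy(model, test_corpus)
```

Evaluating on a separate test corpus, rather than on the training corpus, is what lets the stop condition reflect how the model behaves on field names it has not seen.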

In one embodiment, the data table processing apparatus 500 further includes a labeling result display module, a field name acquisition module, an intermediate field name definition module, and an intermediate field name storage module. The labeling result display module is configured to display each field name and its corresponding labeling result; the field name acquisition module is configured to obtain at least two field names that the user selects from the displayed field names; the intermediate field name definition module is configured to obtain the intermediate field name, input by the user, that is associated with the at least two field names; and the intermediate field name storage module is configured to store the intermediate field name in correspondence with the data table. The labeling result of the intermediate field name is the same as that of the at least two selected field names.

When the data table uploaded by the user is obtained, the above data table processing apparatus 500 parses the data table to obtain its table structure information, which reflects the content and field names of the data table. The trained labeling model then identifies the table structure information and automatically outputs the labeling result for each field name in the data table. The labeling results determine whether each field name can be used as a search range or a search dimension, so the field names in the uploaded data table are labeled automatically. Compared with manual labeling, this greatly improves labeling efficiency. Because the labeling results are stored in correspondence with the data table, data matching a user's search term can be easily obtained from the data table.

For specific limitations on the data table processing apparatus 500, reference may be made to the foregoing limitations on the data table processing method, and details are not repeated here. Each module in the data table processing apparatus 500 may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile computer-readable storage medium and an internal memory. The non-volatile computer-readable storage medium stores an operating system and computer-readable instructions. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile computer-readable storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions, when executed by the processor, implement a data table processing method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen; a button, trackball, or touchpad provided on the casing of the computer device; or an external keyboard, trackpad, or mouse.

Those skilled in the art can understand that the structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of the present application, and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, the data table processing apparatus 500 provided in the present application may be implemented in the form of computer-readable instructions, and the computer-readable instructions may run on a computer device as shown in FIG. 6. The memory of the computer device may store the program modules constituting the data table processing apparatus 500, such as the obtaining module 502, the parsing module 504, the labeling module 506, and the storage module 508 shown in FIG. 5. The computer-readable instructions constituted by the program modules cause the processor to execute the steps of the data table processing method of each embodiment of the present application described in this specification.

For example, the computer device shown in FIG. 6 may execute step S202 through the obtaining module 502 in the data table processing apparatus 500 shown in FIG. 5. The computer device may execute step S204 through the parsing module 504. The computer device may execute step S206 through the labeling module 506. The computer device may execute step S208 through the storage module 508.
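The correspondence between the four program modules and steps S202 to S208 can be pictured as a small composition of callables. This is a structural sketch only: the `DataTableProcessor` class, its attribute names, and the stub functions are hypothetical, and the stubs stand in for the real obtaining, parsing, and labeling logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DataTableProcessor:
    """Hypothetical wiring of modules 502-508 to steps S202-S208."""
    obtain: Callable[[str], str]                        # obtaining module 502, step S202
    parse: Callable[[str], Dict[str, dict]]             # parsing module 504, step S204
    label: Callable[[Dict[str, dict]], Dict[str, str]]  # labeling module 506, step S206
    store: Dict[str, Dict[str, str]] = field(default_factory=dict)  # storage module 508

    def process(self, table_id: str) -> Dict[str, str]:
        raw = self.obtain(table_id)         # S202: get the uploaded data table
        structure = self.parse(raw)         # S204: table structure information
        labels = self.label(structure)      # S206: labeling result per field name
        self.store[table_id] = labels       # S208: store results with the table
        return labels


# Stub implementations, for illustration only.
def obtain_stub(table_id: str) -> str:
    return "city,amount\nShanghai,30\n"


def parse_stub(raw: str) -> Dict[str, dict]:
    return {name: {} for name in raw.splitlines()[0].split(",")}


def label_stub(structure: Dict[str, dict]) -> Dict[str, str]:
    return {name: "range and dimension" for name in structure}


processor = DataTableProcessor(obtain_stub, parse_stub, label_stub)
labels = processor.process("t1")
```

Passing the modules in as callables reflects the statement that each module may be implemented in software, hardware, or a combination: the pipeline only depends on what each module returns.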

In one embodiment, a computer device is provided, which includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, implement the steps of the data table processing method provided in any one of the embodiments of the present application. The steps of the data table processing method here may be the steps in the data table processing method of each of the foregoing embodiments.

In one embodiment, one or more non-volatile computer-readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps of the data table processing method provided in any one of the embodiments of the present application. The steps of the data table processing method here may be the steps in the data table processing method of each of the foregoing embodiments.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features have been described. However, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.

The above embodiments express only several implementations of the present application. Although their descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims

A data table processing method includes:

Get the data table uploaded by the user;

Parse the data table to obtain table structure information of the data table;

The table structure information is identified through a trained labeling model, and a labeling result is output for each field name in the data table; the labeling result is one of: search range only, search dimension only, or both search range and search dimension; and

The labeling result is stored in correspondence with the data table.
The method according to claim 1, further comprising:

Get search terms entered by the user;

Identifying a search range and a search dimension corresponding to the search term;

Obtain the annotation results corresponding to each data table in the data source database; and

According to the labeling result, report data matching the search range and the search dimension is filtered from the data source database.
The method according to claim 2, wherein, according to the labeling result, filtering out report data matching the search range and the search dimension from the data source database comprises:

Matching the search range with a field name that can be used as the search range in the labeled result;

Matching the search dimension with a field name that can be used as the search dimension in the labeled result; and

Filter report data from the data source database according to the matched field names.
The method according to claim 1, wherein the table structure information includes field names and field value types; and parsing the data table to obtain table structure information of the data table includes:

Extracting field names included in a header of the data table;

Counting the enumeration values corresponding to each of the field names;

Use the character type of the field value corresponding to each of the field names as the field value type of the field name; and

The table structure information of the data table is determined according to the field name and the corresponding enumeration value and field value type.
The method according to claim 1, wherein the identifying the table structure information by using a trained annotation model, and outputting an annotation result of each field name in the data table comprises:

Get the business scenario category selected by the user;

Inputting the table structure information into a trained labeling model corresponding to the business scene category, and obtaining a feature vector corresponding to each field name in the data table according to the table structure information through the labeling model; and

The feature vector corresponding to each of the field names is transformed, and a labeling result corresponding to each field name in the data table is output.
The method according to claim 1, wherein the training step of the labeling model comprises:

Obtain training sample corpus and test sample corpus;

Obtaining annotation results corresponding to each training sample in the training sample corpus and each test sample in the test sample corpus;

Cyclically performing the steps of: inputting the current labeled training sample into the machine learning model and outputting a prediction result corresponding to the current training sample; comparing the prediction result with the corresponding labeling result; adjusting the model parameters of the machine learning model when the difference does not meet a preset condition; and keeping the previously adjusted model parameters when the difference meets the preset condition, until the training sample corpus has been fully trained on;

Inputting each test sample in the test sample corpus into a trained machine learning model, and outputting prediction results corresponding to each test sample;

Calculating the accuracy of the machine learning model based on the differences between the prediction result corresponding to each test sample and the corresponding labeling result; and

When the calculated accuracy meets the training stop condition, obtaining the trained labeling model.
The method according to any one of claims 1 to 6, further comprising:

Display each field name and corresponding labeling results;

Acquiring at least two field names input by the user from the displayed field names;

Obtaining an intermediate field name associated with the at least two field names input by the user; and

The intermediate field name is stored in correspondence with the data table; the labeling result of the intermediate field name is the same as that of the at least two selected field names.
A data table processing device, the device includes:

An acquisition module for acquiring a data table uploaded by a user;

An analysis module, configured to parse the data table to obtain table structure information of the data table;

A labeling module, configured to identify the table structure information through a trained labeling model and output a labeling result for each field name in the data table; the labeling result is one of: search range only, search dimension only, or both search range and search dimension; and

A storage module, configured to store the labeling result in correspondence with the data table.
The apparatus according to claim 8, further comprising:

A search term acquisition module for acquiring a search term input by a user;

A recognition module, configured to identify a search range and a search dimension corresponding to the search term;

Annotation result acquisition module, for acquiring the annotation results corresponding to each data table in the data source database; and

A report data filtering module is configured to filter report data matching the search range and the search dimension from the data source database according to the annotation result.
The device according to claim 8, wherein the report data filtering module is further configured to match the search range against the field names that can be used as a search range in the labeling result; match the search dimension against the field names that can be used as a search dimension in the labeling result; and filter report data from the data source database according to the matched field names.
The device according to claim 8, wherein the table structure information includes field names and field value types; and the parsing module is further configured to extract the field names included in a header of the data table; count the enumeration values corresponding to each of the field names; use the character type of the field value corresponding to each of the field names as the field value type of that field name; and determine the table structure information of the data table according to the field names and the corresponding enumeration values and field value types.
The device according to claim 8, wherein the labeling module is further configured to obtain a business scenario category selected by a user; input the table structure information into a trained labeling model corresponding to the business scenario category, and obtain, through the labeling model, the feature vector corresponding to each field name in the data table according to the table structure information; and transform the feature vector corresponding to each field name to output the labeling result corresponding to each field name in the data table.
A computer device includes a memory and one or more processors. Computer-readable instructions are stored in the memory. When the computer-readable instructions are executed by the processor, the one or more processors execute the following steps:

Get the data table uploaded by the user;

Parse the data table to obtain table structure information of the data table;

The table structure information is identified through a trained labeling model, and a labeling result is output for each field name in the data table; the labeling result is one of: search range only, search dimension only, or both search range and search dimension; and

The labeling result is stored in correspondence with the data table.
The computer device of claim 13, wherein the processor further executes the following steps when executing the computer-readable instructions:

Get search terms entered by the user;

Identifying a search range and a search dimension corresponding to the search term;

Obtain the annotation results corresponding to each data table in the data source database; and

According to the labeling result, report data matching the search range and the search dimension is filtered from the data source database.
The computer device according to claim 13, wherein the table structure information includes field names and field value types; and when the processor executes the computer-readable instructions, the following steps are further performed:

Extracting field names included in a header of the data table;

Counting the enumeration values corresponding to each of the field names;

Use the character type of the field value corresponding to each of the field names as the field value type of the field name; and

The table structure information of the data table is determined according to the field name and the corresponding enumeration value and field value type.
The computer device of claim 13, wherein the processor further executes the following steps when executing the computer-readable instructions:

Get the business scenario category selected by the user;

Inputting the table structure information into a trained labeling model corresponding to the business scene category, and obtaining a feature vector corresponding to each field name in the data table according to the table structure information through the labeling model; and

The feature vector corresponding to each of the field names is transformed, and the annotation result corresponding to each field name in the data table is output.
One or more non-volatile computer-readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:

Get the data table uploaded by the user;

Parse the data table to obtain table structure information of the data table;

The table structure information is identified through a trained labeling model, and a labeling result is output for each field name in the data table; the labeling result is one of: search range only, search dimension only, or both search range and search dimension; and

The labeling result is stored in correspondence with the data table.
The storage medium according to claim 17, wherein when the computer-readable instructions are executed by the processor, the following steps are further performed:

Get search terms entered by the user;

Identifying a search range and a search dimension corresponding to the search term;

Obtain the annotation results corresponding to each data table in the data source database; and

According to the labeling result, report data matching the search range and the search dimension is filtered from the data source database.
The storage medium according to claim 16, wherein the table structure information includes field names and field value types; when the computer-readable instructions are executed by the processor, the following steps are further performed:

Extracting field names included in a header of the data table;

Counting the enumeration values corresponding to each of the field names;

Use the character type of the field value corresponding to each of the field names as the field value type of the field name; and

The table structure information of the data table is determined according to the field name and the corresponding enumeration value and field value type.
The storage medium according to claim 16, wherein when the computer-readable instructions are executed by the processor, the following steps are further performed:

Get the business scenario category selected by the user;

Inputting the table structure information into a trained labeling model corresponding to the business scene category, and obtaining a feature vector corresponding to each field name in the data table according to the table structure information through the labeling model; and

The feature vector corresponding to each of the field names is transformed, and a labeling result corresponding to each field name in the data table is output.