CN112784557A - Method and device for determining pivot table - Google Patents

Publication number
CN112784557A
Authority
CN
China
Prior art keywords
column
columns
data
predetermined
characteristic value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911088571.4A
Other languages
Chinese (zh)
Other versions
CN112784557B (en)
Inventor
苏奕虹
辛洋
皮霞林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Office Software Inc and Zhuhai Kingsoft Office Software Co Ltd
Priority to CN201911088571.4A
Publication of CN112784557A
Application granted
Publication of CN112784557B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Abstract

A method for determining a pivot table comprises: after receiving an instruction to create a pivot table for a current table, obtaining the selected data columns in the current table; using a predetermined random forest model to determine, among the obtained data columns, a first data column for generating the pivot table rows and a second data column for generating the pivot table values; and generating the pivot table from the determined first data column and second data column according to a predetermined rule. By using the random forest model, the invention can determine the pivot table automatically, helping the user process and analyze data, lowering the barrier to use, and giving the user a more convenient way to work.

Description

Method and device for determining pivot table
Technical Field
The present disclosure relates to computer technology, and more particularly, to a method and apparatus for determining a pivot table.
Background
The "pivot table" feature of spreadsheet software has a high barrier to use: no more than two percent of all spreadsheet software users use it. In many spreadsheet documents, users need to count, sum, or average columns of data in a worksheet. These operations are most conveniently done with a pivot table, but because the feature is hard to use, many users fall back on clumsy methods.
Disclosure of Invention
The application provides a method and a device for determining a pivot table, which can help a user process and analyze data, lower the barrier to use, and give the user a more convenient way to work.
The application provides a method for determining a pivot table, comprising: after receiving an instruction to create a pivot table for a current table, acquiring the selected data columns in the current table; using a predetermined random forest model to determine, among the acquired data columns, a first data column for generating the pivot table rows and a second data column for generating the pivot table values; and generating the pivot table from the determined first data column and second data column according to a predetermined rule.
In an exemplary embodiment, using the predetermined random forest model to determine, among the acquired data columns, the first data column for generating the pivot table rows and the second data column for generating the pivot table values includes: traversing the selected data columns in at least one predetermined order, and acquiring at least one preset first characteristic value and at least one preset second characteristic value of each selected data column; inputting the preset first characteristic value of each acquired data column into a pre-generated first random forest model to obtain a first analysis result for each data column; inputting the preset second characteristic value of each acquired data column into a pre-generated second random forest model to obtain a second analysis result for each data column;
determining, from the data columns whose first analysis result satisfies a first preset condition, the first data column used for generating the rows of the pivot table; and determining, from the data columns whose second analysis result satisfies a second preset condition, the second data column used for generating the values of the pivot table.
In an exemplary embodiment, generating the pivot table from the determined first data column and second data column according to the predetermined rule includes: merging the cells with identical content in the first data column, and taking each merged value as a row header of the pivot table; and, for each row header of the pivot table, summing the values of the corresponding cells of the second data column in the current table, and taking the sum as the value of the corresponding cell in the pivot table.
In an exemplary embodiment, acquiring the selected data columns in the current table includes: acquiring the data columns selected by the user in the current table and determining the size of the selected region, expressed as m × n, where m is the number of rows and n is the number of columns; when the number of rows and the number of columns of the user's selection are both equal to 1, expanding the selected cell outward, and taking the region bounded on its upper, lower, left, and right sides by blank rows and columns as the selected data columns in the table; and when the number of rows or the number of columns of the user's selection is greater than 1, taking the user's selection itself as the selected data columns in the table.
In an exemplary embodiment, after acquiring the selected data column in the table, the method further includes: identifying the direction of the table according to the selected data columns in the acquired table; identifying the structure of the table when the identified table direction is arranged in a predetermined manner; and when the table structure is a preset table structure, executing the step of respectively traversing the selected data columns in at least one preset sequence, and respectively acquiring at least one preset first characteristic value and at least one preset second characteristic value of each selected data column.
In an exemplary embodiment, the predetermined at least one order includes a first left-to-right order, and when the traversal is performed in the first left-to-right order, acquiring the at least one preset first characteristic value of each selected data column includes acquiring: the column number of the whole data column, the index value, the data types contained in the whole column, the number of cells remaining after duplicate cells are removed, the variance of the occurrence counts of duplicated cell contents, the maximum cell character length, and the variance of the cell character length.
In an exemplary embodiment, the predetermined at least one order includes a second left-to-right order, and when the traversal is performed in the second left-to-right order, acquiring the at least one preset first characteristic value of each selected data column further includes acquiring: the number of columns, among the column itself and the columns to its left, that contain numbers; and the number of columns, among the column itself and the columns to its left, that contain Chinese, English, and dates.
In an exemplary embodiment, the predetermined at least one order includes a right-to-left order, and when the traversal is performed in the right-to-left order, acquiring the at least one preset first characteristic value of each selected data column further includes acquiring: the number of columns, among the column itself and the columns to its right, that contain numbers; and the number of columns, among the column itself and the columns to its right, that contain Chinese, English, and dates.
In an exemplary embodiment, the predetermined at least one order includes a first left-to-right order, and when the traversal is performed in the first left-to-right order, acquiring the at least one preset second characteristic value of each selected data column includes acquiring: keywords extracted from the title, the column number of the whole data column, the number of cells containing only numbers, and the variance, across cells, of the character length of the integer part of the number.
In an exemplary embodiment, the predetermined at least one order includes a second left-to-right order, and when the traversal is performed in the second left-to-right order, acquiring the at least one preset second characteristic value of each selected data column further includes acquiring: the number of columns, among the column itself and the columns to its left, that contain numbers; the number of columns, among the column itself and the columns to its left, that contain only numbers; and the number of columns, among the column itself and the columns to its left, that contain Chinese, English, and dates.
In an exemplary embodiment, the predetermined at least one order includes a right-to-left order, and when the traversal is performed in the right-to-left order, acquiring the at least one preset second characteristic value of each selected data column further includes acquiring: the number of columns, among the column itself and the columns to its right, that contain numbers; the number of columns, among the column itself and the columns to its right, that contain only numbers; and the number of columns, among the column itself and the columns to its right, that contain Chinese, English, and dates.
The present application further provides a device for determining a pivot table, comprising: an acquisition module, configured to acquire the selected data columns in the current table after receiving an instruction to create a pivot table for the current table; and an analysis module, configured to determine, using a predetermined random forest model, a first data column for generating the pivot table rows and a second data column for generating the pivot table values among the acquired data columns, and to generate the pivot table from the determined first data column and second data column according to a predetermined rule.
Compared with the related art, the method and the device determine the pivot table automatically by means of the random forest model, which helps the user process and analyze data, lowers the barrier to use, and gives the user a more convenient way to work.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. Other advantages of the application may be realized and attained by the instrumentalities and combinations particularly pointed out in the specification, claims, and drawings.
Drawings
The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.
FIG. 1 is a flow chart of a method for determining pivot tables according to the present application;
FIG. 2 is a table data diagram according to an embodiment of the present application;
FIG. 3 is a schematic diagram of generating a pivot table from the data of an embodiment of the present application using the prior art;
FIG. 4 is a diagram of the pivot table rows produced by the prior-art method for the data of an embodiment of the present application;
FIG. 5 is a table data diagram according to the second embodiment of the present application;
FIG. 6 is a flowchart illustrating a specific operation of the method for determining rows of a pivot table according to the present application;
FIG. 7 is a flowchart illustrating a specific operation of a method for determining pivot table values according to the present application;
FIG. 8 is a block diagram of an apparatus for determining a pivot table according to the present application.
Detailed Description
At least one embodiment is described herein, but the description is intended to be exemplary, rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible that are within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except as by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.
Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Further, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
The technical solutions of the present application will be described in more detail below with reference to the accompanying drawings and embodiments.
As shown in fig. 1, an embodiment of the present invention provides a method for determining a pivot table, including the following steps:
S101, after an instruction to create a pivot table for a current table is received, acquiring the selected data columns in the current table;
S102, using a predetermined random forest model to determine, among the acquired data columns, a first data column serving as the pivot table rows and a second data column serving as the pivot table values;
S103, generating the pivot table from the determined first data column and second data column according to a predetermined rule.
A pivot table is an interactive table that can perform calculations such as summing and counting. The calculations depend on how the data is arranged in the pivot table. The layout can be changed dynamically to analyze the data in different ways, and the row numbers, column labels, and page fields can be rearranged. Each time the layout changes, the pivot table immediately recalculates the data according to the new layout. In addition, the pivot table can be updated if the source data changes.
A random forest is a classifier that contains at least one decision tree and whose output class is the mode of the classes output by the individual trees. Leo Breiman and Adele Cutler developed the random forest algorithm, and "Random Forests" is their trademark. The term derives from the random decision forests proposed by Tin Kam Ho of Bell Laboratories in 1995. The approach combines Breiman's "bagging" idea with Ho's "random subspace method" to build a set of decision trees.
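The "mode of the classes" rule can be illustrated with a minimal sketch. The function name `forest_predict` and the class labels are hypothetical, chosen only for illustration; each entry in the input list stands for the class output by one decision tree.

```python
from collections import Counter


def forest_predict(tree_predictions):
    """Return the most common class among the per-tree predictions."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    # most_common(1) yields [(label, count)] for the modal label.
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one data column.
votes = ["row", "value", "row", "row", "other"]
result = forest_predict(votes)
```

With these votes, three of the five trees agree on "row", so that is the class the forest reports.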
In one exemplary embodiment, each tree is built according to the following algorithm. Let N be the number of training cases (samples) and M the number of features. A number m of features, used to determine the decision at each node of the decision tree, is given as input; m should be much smaller than M. Sample N times with replacement from the N training cases to form a training set (i.e., bootstrap sampling), and use the cases that were not drawn for prediction to estimate the error. For each node, randomly select m features; the decision at that node is determined from these features, and the optimal split is computed based on the m features. Each tree is grown fully without pruning, although pruning may be applied after an ordinary tree classifier is built.
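The two sources of randomness in this algorithm, bootstrap sampling of the N cases and random selection of m of the M features at a node, can be sketched as follows. This is only an illustration under assumptions: the helper names `bootstrap_sample` and `pick_split_features` are hypothetical, and the split computation itself is omitted.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    # Cases never drawn are held out to estimate the prediction error.
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def pick_split_features(n_features, m, rng):
    """Pick m distinct feature indices out of M to consider at one node."""
    if not 0 < m <= n_features:
        raise ValueError("m must satisfy 0 < m <= M")
    return rng.sample(range(n_features), m)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
in_bag, out_of_bag = bootstrap_sample(10, rng)
features = pick_split_features(n_features=7, m=2, rng=rng)
```

The in-bag indices may repeat, which is what sampling with replacement means; the out-of-bag indices are exactly the cases that never appear in the training set of that tree.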
In one exemplary embodiment, data in a Microsoft Office Excel worksheet is employed as the source for the list of tabular data columns.
In an exemplary embodiment, in step S101, the instruction for creating the pivot table may be a preset option in the Microsoft Office Excel worksheet, and when the option is clicked, the creation of the pivot table is triggered; or the pivot table may be automatically prompted when the user selects a column of data.
In an exemplary embodiment, in step S101, the selected data column in the table is obtained, where the obtained data column may be a directly selected data column, or a data column obtained by deleting or expanding the directly selected data column.
In an exemplary embodiment, in step S101, acquiring the selected data columns in the current table includes: acquiring the data columns selected by the user in the table and determining the size of the selected region, expressed as m × n, where m is the number of rows and n is the number of columns. When the number of rows and the number of columns of the user's selection are both equal to 1, the selected cell is expanded outward, and the region bounded on its upper, lower, left, and right sides by blank rows and columns is taken as the selected data columns in the table. When the number of rows or the number of columns of the user's selection is greater than 1, the user's selection itself is taken as the selected data columns in the table.
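A minimal sketch of this selection rule is shown below, assuming the table is held as a rectangular list of rows and that "blank" means empty or whitespace-only. The function names and the iterative growth strategy are assumptions for illustration; the source does not specify how the expansion is implemented.

```python
def is_blank(value):
    """A cell is blank if it is None or contains only whitespace."""
    return value is None or str(value).strip() == ""


def expand_selection(grid, row, col):
    """Grow a 1 x 1 selection at (row, col) into its surrounding data region.

    Returns inclusive, zero-based bounds (top, left, bottom, right).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    top = bottom = row
    left = right = col
    changed = True
    while changed:
        changed = False
        # Extend a side while the adjacent row/column holds data within the span.
        if top > 0 and any(not is_blank(grid[top - 1][c]) for c in range(left, right + 1)):
            top -= 1
            changed = True
        if bottom < n_rows - 1 and any(not is_blank(grid[bottom + 1][c]) for c in range(left, right + 1)):
            bottom += 1
            changed = True
        if left > 0 and any(not is_blank(grid[r][left - 1]) for r in range(top, bottom + 1)):
            left -= 1
            changed = True
        if right < n_cols - 1 and any(not is_blank(grid[r][right + 1]) for r in range(top, bottom + 1)):
            right += 1
            changed = True
    return top, left, bottom, right


def select_region(grid, top, left, bottom, right):
    """Apply the m x n rule: expand only when the selection is a single cell."""
    m, n = bottom - top + 1, right - left + 1
    if m == 1 and n == 1:
        return expand_selection(grid, top, left)
    return top, left, bottom, right


# Hypothetical sheet: a 3 x 2 data block surrounded by blank rows and columns.
grid = [
    ["", "", "", "", ""],
    ["", "Name", "Qty", "", "x"],
    ["", "A", "1", "", ""],
    ["", "B", "2", "", ""],
    ["", "", "", "", ""],
]
region = select_region(grid, 2, 1, 2, 1)  # the user clicked the single cell "A"
```

Here the stray "x" is not absorbed because a blank column separates it from the data block, while a multi-cell selection is returned unchanged.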
In an exemplary embodiment, in step S101, after acquiring the selected data column in the table, the method further includes:
identifying the direction of the table according to the selected data columns in the acquired table;
when the identified table direction is arranged in a non-predetermined mode, the pivot table is not recommended;
identifying the structure of the table when the identified table direction is arranged in a predetermined manner;
and when the table structure is a preset table structure, executing the step of respectively traversing the selected data columns in at least one preset sequence, and respectively acquiring at least one preset first characteristic value and at least one preset second characteristic value of each selected data column.
In an exemplary embodiment, the table direction is either "by row" or "by column". If it is "by row", the pivot table is not recommended. If it is "by column", the table structure is identified row by row, with each row classified as "row title", "table content", or "other"; if a table of the form "row title + table content" is obtained, processing continues, and otherwise the pivot table is not recommended.
In one exemplary embodiment, identifying the table structure may default to treating the first row as the row title and the other rows as table content, or the user may manually select the row title and the table content, and so on.
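The gating logic described above can be sketched as a small predicate. The function name `should_recommend_pivot` is hypothetical, and the reading of "row title + table content" as one title row followed only by content rows is an assumption made for this sketch.

```python
def should_recommend_pivot(direction, row_categories):
    """Return True only for a column-wise table shaped 'row title + table content'."""
    if direction != "by column":
        return False  # "by row" tables are not recommended
    if not row_categories or row_categories[0] != "row title":
        return False
    rest = row_categories[1:]
    # Assumption: every remaining row must be classified as table content.
    return bool(rest) and all(category == "table content" for category in rest)


ok = should_recommend_pivot("by column", ["row title", "table content", "table content"])
```

A table with any row classified as "other", or with no title row, is rejected under this reading.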
In another exemplary embodiment, identifying the table structure may proceed as follows. Obtain the selected region of the table and the table data. Traverse each cell of the table and record, for each cell in the selection, its content type (Chinese, English, time, date, or number), its word size, and whether it is a merged cell. Traverse each cell of the table again and tile the content of each merged cell into every column it spans, so that every row of the resulting table has the same number of cells. Traverse each row of the table, compute the similarity between the current row and the next row, and decide whether to merge them; each merged group is a Rows structure containing at least one table row, and the result is an array RowsList of Rows. Traverse the array RowsList and compute the following features for each Rows: the number of merged cells / the number of columns; the union of the cell content types of the Rows; the number of columns containing Chinese / the number of non-blank columns; the number of columns not containing Chinese / the number of non-blank columns; the number of columns containing digits / the number of non-blank columns; the number of columns containing a colon / the number of non-blank columns; the word-size difference from the nearest Rows that contains more than one row; and the number of cells whose content type differs from that of the corresponding cell of the nearest Rows that contains more than one row. From these features, a pre-established random forest model for identifying table structure classifies each Rows into one of three categories: row title, table content, and other.
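As an illustration of the content-type step, the sketch below classifies a cell's text into one of the listed types and computes the union of types over a group of rows. The regular expressions, the "blank" and "other" labels, and the function names are assumptions for this sketch; the source does not say how types are detected.

```python
import re

# Assumed patterns; checked in this order so that dates and times are not read as text.
_PATTERNS = [
    ("date", re.compile(r"^\d{4}[-/]\d{1,2}[-/]\d{1,2}$")),
    ("time", re.compile(r"^\d{1,2}:\d{2}(:\d{2})?$")),
    ("number", re.compile(r"^[-+]?\d+(\.\d+)?$")),
]
_CHINESE = re.compile(r"[\u4e00-\u9fff]")
_ENGLISH = re.compile(r"[A-Za-z]")


def cell_type(text):
    """Classify one cell's text as date, time, number, chinese, english, blank, or other."""
    text = text.strip()
    if not text:
        return "blank"
    for name, pattern in _PATTERNS:
        if pattern.match(text):
            return name
    if _CHINESE.search(text):
        return "chinese"
    if _ENGLISH.search(text):
        return "english"
    return "other"


def type_union(rows):
    """Union of the cell content types over a group of table rows."""
    return {cell_type(cell) for row in rows for cell in row}


types = type_union([["姓名", "Qty"], ["张三", "3"]])
```

The union is one of the per-Rows features listed above; the remaining ratio features could be derived from the same per-cell types.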
In an exemplary embodiment, for the table-structure identification of the table shown in fig. 2, the 30 rows of the table are merged into 2 Rows according to the content type of each cell, namely row 1 and rows 2-30; the types and the corresponding calculated characteristic values are given in table a.
Table a [figure BDA0002266175810000081]
In an exemplary embodiment of table-structure identification, the 16 rows of the table are merged into 2 Rows according to the content type of each cell, namely row 1 and rows 2-16; the corresponding calculated characteristic values are given in table b.
Table b [figure BDA0002266175810000082]
In an exemplary embodiment, the table direction may also be selected manually by the user, which is not described in detail here.
In another exemplary embodiment, identifying the table direction may proceed as follows. Obtain the selected region of the table and the table data, and delete the consecutive blank rows and blank columns on the upper, lower, left, and right sides of the table to obtain a new table area. The new table area has RowCount rows and ColumnCount columns. Compute the minimum crop size minLength from the row and column counts: minLength = min(RowCount, ColumnCount, 10). Crop a region of minLength rows by minLength columns from the upper left corner of the table to obtain a region newTable. Convert each cell in newTable to its content type: Chinese, English, number, date, or time. Traverse the minLength rows of newTable and merge consecutive similar rows, according to row similarity, into Rows, each containing at least one row; the resulting sequence contains similarRowCount Rows. Traverse the minLength columns of newTable and merge consecutive similar columns, according to column similarity, into Columns, each containing at least one column; the resulting sequence contains similarColumnCount Columns. The four features RowCount, ColumnCount, similarRowCount, and similarColumnCount are input to a preset random forest model for identifying table direction, which outputs the table direction: by row or by column.
In an exemplary embodiment, for the table direction identification of fig. 2, the table has 5 columns and 30 rows, and min(30, 5, 10) gives 5, so the 5 × 5 region at the upper left corner of the table is cropped. Merging similar rows according to the content type of the cells gives 2 Rows, and merging similar columns according to the content type of the cells gives 4 Columns. The 4 features [30, 5, 2, 4] are then submitted to the random forest model, which returns the result "by column". For the table direction identification of fig. 5, the table has 6 columns and 16 rows, and min(16, 6, 10) gives 6, so the 6 × 6 region at the upper left corner of the table is cropped. Merging similar rows according to the content type of the cells gives 2 Rows, and merging similar columns according to the content type of the cells gives 4 Columns. The 4 features [16, 6, 2, 4] are then submitted to the random forest model, which returns the result "by column".
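The feature computation for table direction can be sketched as below. Two points are assumptions: the input is a grid of already-converted cell types, and two neighbouring rows (or columns) count as "similar" only when their type sequences are identical, since the source does not define the similarity measure. The type grid is made up so that the counts match the [30, 5, 2, 4] example; it is not the actual data of fig. 2.

```python
def count_merged_groups(sequences):
    """Count groups of consecutive identical type sequences."""
    groups = 0
    previous = None
    for sequence in sequences:
        if sequence != previous:
            groups += 1
            previous = sequence
    return groups


def direction_features(types):
    """Return [RowCount, ColumnCount, similarRowCount, similarColumnCount].

    `types` is a rectangular list of rows, each a list of cell-type strings.
    """
    row_count = len(types)
    column_count = len(types[0])
    min_length = min(row_count, column_count, 10)
    # Crop the minLength x minLength region at the upper left corner.
    new_table = [row[:min_length] for row in types[:min_length]]
    rows = [tuple(row) for row in new_table]
    columns = [tuple(column) for column in zip(*new_table)]
    return [row_count, column_count, count_merged_groups(rows), count_merged_groups(columns)]


# Hypothetical 30 x 5 type grid: one all-Chinese title row plus 29 identical data rows.
header = ["chinese"] * 5
data = ["chinese", "number", "chinese", "number", "number"]
features = direction_features([header] + [data] * 29)
```

For this grid the cropped region is 5 × 5, the rows collapse into 2 groups, and the last two columns collapse into one, giving 4 column groups.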
In an exemplary embodiment, in step S101, the selected data columns in the table are obtained, where the data columns include titles of the data columns and data corresponding to the titles.
In an exemplary embodiment, in step S102, using the predetermined random forest model to determine, among the acquired data columns, the first data column for generating the pivot table rows and the second data column for generating the pivot table values includes:
respectively traversing the selected data columns in at least one preset sequence, and respectively acquiring at least one preset first characteristic value and at least one preset second characteristic value of each selected data column;
respectively inputting the preset first characteristic value of each acquired data column into a pre-generated first random forest model to obtain a first analysis result of each data column corresponding to the first characteristic value; inputting the second characteristic value of each acquired data column into a pre-generated second random forest model respectively to obtain a second analysis result of each data column corresponding to the second characteristic value;
determining a first data column serving as a row for generating the pivot table according to a data column of which a first analysis result meets a first preset condition; and determining a second data column as a value for generating the pivot table from the data columns of which the second analysis result satisfies a second preset condition.
The first random forest model and the second random forest model need not be run in any particular order; each simply produces its own result.
In an exemplary embodiment, the pre-generated first random forest model is created by collecting at least one pivot table as training data samples, extracting at least one first feature, building pivot-table-row decision trees according to the decision-tree construction steps, and building the first random forest model from those decision trees.
In an exemplary embodiment, the pre-generated second random forest model is created by collecting at least one pivot table as training data samples, extracting at least one second feature, building pivot-table-value decision trees according to the decision-tree construction steps, and building the second random forest model from those decision trees.
In an exemplary embodiment, in step S102, the first analysis result may take the form of a score: the acquired preset first characteristic value of each data column is input into the pre-generated first random forest model to compute a first predicted field score (a row-field score) for each data column, and a data column whose first predicted field score falls within a first preset range is determined to be the first data column used for generating the rows of the pivot table.
In an exemplary embodiment, in step S102, the second analysis result may take the form of a score: the acquired preset second characteristic value of each data column is input into the pre-generated second random forest model to compute a second predicted field score (a value-field score) for each data column, and a data column whose second predicted field score falls within a second preset range is determined to be the second data column used for generating the values of the pivot table.
In other embodiments, the inference may instead be a logical decision, for example with the model directly outputting "yes" or "no". This embodiment is merely exemplary and not limiting.
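The score-based selection can be sketched as a simple range filter. The scores and the range bounds below are invented for illustration, and `select_columns` is a hypothetical helper; the source does not state what the preset ranges are.

```python
def select_columns(scores, low, high):
    """Return the column titles whose score lies in the closed range [low, high]."""
    return [title for title, score in scores.items() if low <= score <= high]


# Hypothetical per-column scores from the two models (not taken from the source).
row_scores = {"managed mode": 0.91, "teacher's day": 0.12, "name": 0.40}
value_scores = {"managed mode": 0.05, "teacher's day": 0.88, "name": 0.10}

row_columns = select_columns(row_scores, 0.5, 1.0)      # assumed first preset range
value_columns = select_columns(value_scores, 0.5, 1.0)  # assumed second preset range
```

Because the two models score the columns independently, the two calls can run in either order.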
In an exemplary embodiment, in step S103, generating the pivot table by a predetermined rule according to the determined first data column used to generate the pivot table rows and the determined second data column used to generate the pivot table values includes:
merging the cells with the same content in the first data column serving as the pivot table rows, and taking each merged value as a row header of the pivot table;
and, for each row header, summing the values of the cells in the second data column of the current table that correspond to the cells of the first data column having the same content as that row header, and taking each sum as the pivot table value of the corresponding cell in the pivot table.
In an exemplary embodiment, as shown in fig. 2, the first data column meeting the condition is the "managed mode" column; for example, the multiple "noonto" cells in that column are merged and serve as one of the row headers of the pivot table.
In the second data column meeting the condition, "teacher days", all the values corresponding to "all day", for example, are summed, and the sum is the value corresponding to the row "all day" in the pivot table.
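The merge-and-sum rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `build_pivot` and the sample cell values are hypothetical and are not taken from fig. 2.

```python
def build_pivot(row_column, value_column):
    """Merge identical row-field cells and sum the matching value cells.

    row_column and value_column are parallel lists holding the cells of the
    first (row) data column and the second (value) data column.
    """
    pivot = {}
    for header, value in zip(row_column, value_column):
        # Cells with the same content share one row header; their values add up.
        pivot[header] = pivot.get(header, 0) + value
    return pivot


# Hypothetical sample data, used only to exercise the sketch.
managed_mode = ["noon", "all day", "noon", "all day", "evening"]
teacher_days = [3, 5, 2, 4, 1]

for header, total in build_pivot(managed_mode, teacher_days).items():
    print(header, total)
```

A dictionary keeps the row headers in first-seen order, which is one plausible ordering for the generated pivot table; the text does not specify an ordering.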
In an exemplary embodiment, the predetermined at least one order includes a first left-to-right order, and when traversing in the first left-to-right order, the at least one predetermined first characteristic value obtained for each selected data column includes: the total number of columns of the table, the index value, the data types contained in the whole column, the number of cells after duplicate cells are removed, the variance of the occurrence counts of repeated cell contents, the maximum cell character length, and the variance of the cell character length.
In an exemplary embodiment, the predetermined at least one order includes a second left-to-right order, and when traversing in the second left-to-right order, the at least one predetermined first characteristic value obtained for each selected data column further includes: the number of columns containing numbers among the column itself and the columns to its left, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its left.
In an exemplary embodiment, the predetermined at least one order includes a right-to-left order, and when traversing in the right-to-left order, the at least one predetermined first characteristic value obtained for each selected data column further includes: the number of columns containing numbers among the column itself and the columns to its right, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right.
The first left-to-right order and the second left-to-right order mean that the columns are traversed twice from left to right to obtain different characteristic values. One traversal obtains the total number of columns of the table, the index value, the data types contained in the whole column, the number of cells after duplicate cells are removed, the variance of the occurrence counts of repeated cell contents, the maximum cell character length, and the variance of the cell character length; the other traversal obtains the number of columns containing numbers among the column itself and the columns to its left, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its left.
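The directional traversals can be sketched as running counts over per-column flags. The function names and the flag lists below are assumptions made for illustration; the flags were chosen so that the output coincides with characteristic values 8 to 11 listed in Table 1, but how the original system decides that a column "contains numbers" or "contains Chinese, English, or dates" is not specified here.

```python
def cumulative_counts(flags):
    """Count flagged columns among each column and the columns before it."""
    counts, running = [], 0
    for flag in flags:
        if flag:
            running += 1
        counts.append(running)
    return counts


def directional_counts(flags):
    """Return (left, right) counts: itself plus left columns, itself plus right columns."""
    left = cumulative_counts(flags)
    # A right-to-left traversal is the same running count over the reversed list.
    right = cumulative_counts(flags[::-1])[::-1]
    return left, right


# Hypothetical per-column flags for a five-column table.
has_number = [False, False, False, True, True]
has_text_or_date = [False, True, True, False, False]

print(directional_counts(has_number))        # ([0, 0, 0, 1, 2], [2, 2, 2, 2, 1])
print(directional_counts(has_text_or_date))  # ([0, 1, 2, 2, 2], [2, 2, 1, 0, 0])
```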
The first characteristic values in the above embodiments may be obtained in a predetermined order, which is relatively simple for computer processing. Of course, a person skilled in the art may obtain the above characteristic values in other predetermined orders; the present application does not limit how the characteristic values are obtained, since what matters is the resulting characteristic values.
Wherein the predetermined first characteristic value 1: the number of columns of the whole data, specifically the total number of columns of the table; this characteristic value is the same for every column of the same table.
Predetermined first characteristic value 2: the index value, specifically the position of the column in the entire table, counting from the left.
Predetermined first characteristic value 3: the types contained in the whole column. The types include Chinese, English, date, and number. The types of all cells in the whole column are determined, each type is represented by a specific number, and the numbers corresponding to all types contained in the column are added to obtain the characteristic value. For example, for the first column of the data shown in fig. 2, adding the numbers of the types it contains gives a characteristic value 3 of 322. As another example, if the number type corresponds to 64 and the Chinese type corresponds to 128, a column containing both Chinese and numbers has an accumulated value of 192.
Predetermined first characteristic value 4: the number of cells after duplicate cells are removed; specifically, duplicate content among all cells in the column is removed and the remaining cells are counted.
Predetermined first characteristic value 5: the variance of the occurrence counts of repeated cell contents; specifically, the number of occurrences of each cell content in the column is counted, and the variance of these counts is calculated to obtain the characteristic value.
Predetermined first characteristic value 6: the maximum cell character length; specifically, the longest character length among the contents of the cells in the column is taken as the characteristic value.
Predetermined first characteristic value 7: the variance of the cell character length; specifically, the character length of each cell is counted, and the variance of these lengths is calculated to obtain the characteristic value.
Predetermined first characteristic value 8: the number of columns containing numbers among the column itself and the columns to its left; specifically, the columns containing numbers are counted from the first column on the left to the current column.
Predetermined first characteristic value 9: the number of columns containing Chinese, English, or dates among the column itself and the columns to its left; specifically, the columns containing Chinese, English, or dates are counted from the first column on the left to the current column.
Predetermined first characteristic value 10: the number of columns containing numbers among the column itself and the columns to its right; specifically, the columns containing numbers are counted from the first column on the right to the current column.
Predetermined first characteristic value 11: the number of columns containing Chinese, English, or dates among the column itself and the columns to its right; specifically, the columns containing Chinese, English, or dates are counted from the first column on the right to the current column.
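The per-column characteristic values 3 to 7 defined above might be computed as in the following sketch. Several points are assumptions rather than part of the description: only the codes 64 (number) and 128 (Chinese) appear in the text, so the codes for English and date are hypothetical; the type detection by regular expression is a simplification; and population variance is used because the text does not say which variance estimator is meant.

```python
import re
from collections import Counter
from statistics import pvariance

# 64 and 128 follow the example in the text; 256 and 512 are hypothetical.
TYPE_CODES = {"number": 64, "chinese": 128, "english": 256, "date": 512}


def cell_types(text):
    """Return the set of type names detected in one cell (simplified detection)."""
    types = set()
    if re.search(r"\d{4}-\d{1,2}-\d{1,2}", text):  # assumed date format
        types.add("date")
    elif re.search(r"\d", text):
        types.add("number")
    if re.search(r"[\u4e00-\u9fff]", text):
        types.add("chinese")
    if re.search(r"[A-Za-z]", text):
        types.add("english")
    return types


def column_features(cells):
    """Compute characteristic values 3 to 7 for one column of cells."""
    texts = [str(cell) for cell in cells]
    types = set()
    for text in texts:
        types |= cell_types(text)
    occurrences = list(Counter(texts).values())
    lengths = [len(text) for text in texts]
    return {
        "type_code": sum(TYPE_CODES[name] for name in types),
        "distinct_count": len(occurrences),
        "occurrence_variance": pvariance(occurrences) if occurrences else 0.0,
        "max_length": max(lengths, default=0),
        "length_variance": pvariance(lengths) if lengths else 0.0,
    }


print(column_features(["张三", "12"])["type_code"])  # 192, as in the text's example
```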
In an exemplary embodiment, inputting the acquired predetermined first characteristic values of each data column into the pre-generated first random forest model includes the following step:
sequentially inputting, for each acquired data column, the total number of columns of the table, the index value, the types contained in the whole column, the number of cells after duplicate cells are removed, the variance of the occurrence counts of repeated cell contents, the maximum cell character length, the variance of the cell character length, the number of columns containing numbers among the column itself and the columns to its left, the number of columns containing Chinese, English, or dates among the column itself and the columns to its left, the number of columns containing numbers among the column itself and the columns to its right, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right into the first random forest model.
In an exemplary embodiment, the predetermined at least one order includes a first left-to-right order, and when traversing in the first left-to-right order, the at least one predetermined second characteristic value obtained for each selected data column includes: the title keyword score, the total number of columns of the table, the number of cells containing only numbers, and the variance of the character length of the integer part of the numbers in the cells.
In an exemplary embodiment, the predetermined at least one order includes a second left-to-right order, and when traversing in the second left-to-right order, the at least one predetermined second characteristic value obtained for each selected data column includes: the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates, each counted among the column itself and the columns to its left.
In an exemplary embodiment, the predetermined at least one order includes a right-to-left order, and when traversing in the right-to-left order, the at least one predetermined second characteristic value obtained for each selected data column includes: the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates, each counted among the column itself and the columns to its right.
The first left-to-right order and the second left-to-right order mean that the columns are traversed twice from left to right to obtain different characteristic values. One traversal obtains the title keyword score, the total number of columns of the table, the number of cells containing only numbers, and the variance of the character length of the integer part of the numbers in the cells; the other traversal obtains the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates, each counted among the column itself and the columns to its left.
Wherein the predetermined second characteristic value 1: the title keyword score. For example, each of the following words counts as 1: "number", "amount", "total", "income", "expense", "amount", "fee", "sales"; and each of the following words counts as -1: "month", "year", "number", "association", "telephone", "code", "single number", "serial number", "unit price", "time", "date", "number", "unit"; one number is finally obtained as the characteristic value. More keywords can be extracted, and more results can be obtained by training on more samples.
Predetermined second characteristic value 2: the number of columns of the whole data, specifically the total number of columns of the table; this characteristic value is the same for every column of the same table.
Predetermined second characteristic value 3: the number of cells containing only numbers; specifically, the cells that contain only numbers are counted to obtain the value.
Predetermined second characteristic value 4: the variance of the character length of the integer part of the numbers in the cells; specifically, the character length of the integer in each cell of the column is counted and the variance is calculated; if a decimal is encountered, only its integer part is used in the calculation.
Predetermined second characteristic value 5: the number of columns containing numbers among the column itself and the columns to its left; specifically, the columns containing numbers are counted from the first column on the left to the current column.
Predetermined second characteristic value 6: the number of columns containing only numbers among the column itself and the columns to its left; specifically, the columns containing only numbers are counted from the first column on the left to the current column.
Predetermined second characteristic value 7: the number of columns containing Chinese, English, or dates among the column itself and the columns to its left; specifically, the columns containing Chinese, English, or dates are counted from the first column on the left to the current column.
Predetermined second characteristic value 8: the number of columns containing numbers among the column itself and the columns to its right; specifically, the columns containing numbers are counted from the first column on the right to the current column.
Predetermined second characteristic value 9: the number of columns containing only numbers among the column itself and the columns to its right; specifically, the columns containing only numbers are counted from the first column on the right to the current column.
Predetermined second characteristic value 10: the number of columns containing Chinese, English, or dates among the column itself and the columns to its right; specifically, the columns containing Chinese, English, or dates are counted from the first column on the right to the current column. A column is counted once if it contains any one of Chinese, English, or a date.
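Second characteristic values 1, 3, and 4 might be computed as in the sketch below. The keyword tuples are short hypothetical English stand-ins (the translated lists above contain overlapping words, so they are not copied verbatim), summing the matches into one score is an assumption about how "one number is finally obtained", and only number-only cells are considered for the integer-length variance, which is also an assumption.

```python
import re
from statistics import pvariance

# Hypothetical stand-in keyword lists; the real lists would come from training data.
POSITIVE_KEYWORDS = ("amount", "total", "income", "sales")
NEGATIVE_KEYWORDS = ("year", "month", "code", "unit price", "telephone")

NUMBER_ONLY = re.compile(r"^-?\d+(\.\d+)?$")


def title_keyword_score(title):
    """Add 1 per positive keyword and subtract 1 per negative keyword found in the title."""
    lowered = title.lower()
    score = sum(1 for word in POSITIVE_KEYWORDS if word in lowered)
    score -= sum(1 for word in NEGATIVE_KEYWORDS if word in lowered)
    return score


def number_only_count(cells):
    """Count the cells whose content is a plain integer or decimal."""
    return sum(1 for cell in cells if NUMBER_ONLY.match(str(cell).strip()))


def integer_length_variance(cells):
    """Variance of the integer-part length; decimals are truncated to their integer part."""
    lengths = []
    for cell in cells:
        text = str(cell).strip()
        if NUMBER_ONLY.match(text):
            integer_part = text.split(".")[0].lstrip("-")
            lengths.append(len(integer_part))
    return pvariance(lengths) if lengths else 0.0


print(title_keyword_score("Total amount"), title_keyword_score("Unit price"))
print(number_only_count(["12", "3.5", "abc", ""]))
print(integer_length_variance(["12", "3.5", "abc"]))
```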
In an exemplary embodiment, in step S102, inputting the acquired at least one predetermined second characteristic value of each data column into the pre-generated second random forest model includes:
sequentially inputting, for each acquired data column, the title keyword score, the total number of columns of the table, the number of cells containing only numbers, the variance of the character length of the integer part of the numbers in the cells, the number of columns containing numbers among the column itself and the columns to its left, the number of columns containing only numbers among the column itself and the columns to its left, the number of columns containing Chinese, English, or dates among the column itself and the columns to its left, the number of columns containing numbers among the column itself and the columns to its right, the number of columns containing only numbers among the column itself and the columns to its right, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right into the second random forest model.
The characteristic values in the above embodiments may be obtained in a predetermined order, which is relatively simple for computer processing. Of course, those skilled in the art may obtain the characteristic values in other predetermined orders; the present application does not limit how the characteristic values are obtained, since what matters is the resulting characteristic values.
It will be appreciated by those skilled in the art that the random forest algorithm only needs to be given the correct characteristic values and model in order to derive the score. The internal random forest procedure is an encapsulated algorithm and is outside the scope of this patent.
It can be understood by those skilled in the art that the trained random forest model requires the characteristic values in the corresponding order, and this order cannot be changed arbitrarily. For example, when the model is retrained with new user data in an optimization algorithm, the order remains unchanged for the retrained model as long as no characteristic values are added or removed. This order is used in the present application to obtain reasonably accurate results.
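One way to keep the input order fixed is to assemble the model input from named characteristic values through a single ordered list. The sketch below is an illustration only; the key names and the helper `to_vector` are hypothetical, and the sample values are those listed for the first column in Table 1.

```python
# Hypothetical names for the 11 first characteristic values, in model input order.
FIRST_FEATURE_ORDER = [
    "total_columns", "index", "type_code", "distinct_count",
    "occurrence_variance", "max_length", "length_variance",
    "left_number_columns", "left_text_columns",
    "right_number_columns", "right_text_columns",
]


def to_vector(features, order=FIRST_FEATURE_ORDER):
    """Arrange named characteristic values in the fixed order the model expects."""
    missing = [name for name in order if name not in features]
    if missing:
        raise KeyError(f"missing characteristic values: {missing}")
    return [features[name] for name in order]


# Values of the first column as listed in Table 1.
first_column = {
    "index": 0, "total_columns": 5, "type_code": 322, "distinct_count": 0,
    "occurrence_variance": 0, "max_length": 5, "length_variance": 0,
    "left_number_columns": 0, "left_text_columns": 0,
    "right_number_columns": 2, "right_text_columns": 2,
}
print(to_vector(first_column))
```

Because the vector is built from the ordered list rather than from the dictionary's own ordering, the same sequence is used at training time and at prediction time.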
According to the method and the device, at least one characteristic value of the table data columns, acquired in the predetermined order, is analyzed by the random forest models, and the data columns to be used for the pivot table are automatically found and determined for the user, which lowers the threshold of use and provides the user with a more convenient way of working.
As shown in fig. 2, the table data column list of the embodiment of the present application includes 5 columns A, B, C, D, and E, whose titles are "grade", "name", "managed mode", "number of days", and "teacher days", respectively. The user desires to sum "teacher days" by "managed mode".
As shown in figs. 3-4, the prior art uses a pivot table to sum "teacher days" by "managed mode". First, the current table area is selected and a pivot table is inserted; then the "managed mode" field is dragged from the upper right corner to the row area at the lower right corner, and the "teacher days" field is dragged to the value area at the lower right corner; data analysis is performed after the pivot table is obtained. The prior art thus requires multiple operations.
With the method for determining the pivot table, the system automatically acquires at least one predetermined first characteristic value of the table, namely 11 first characteristic values in this embodiment. As shown in Table 1, taking the first column as an example:
predetermined first characteristic value 1 of the first column: the total number of columns of the table is 5;
predetermined first characteristic value 2 of the first column: the index value is 0;
predetermined first characteristic value 3 of the first column: the type value of the whole column is 322;
predetermined first characteristic value 4 of the first column: the number of cells after duplicate cells are removed is 0;
predetermined first characteristic value 5 of the first column: the variance of the occurrence counts of repeated cell contents is 0;
predetermined first characteristic value 6 of the first column: the maximum cell character length is 5;
predetermined first characteristic value 7 of the first column: the variance of the cell character length is 0;
predetermined first characteristic value 8 of the first column: the number of columns containing numbers among the column itself and the columns to its left is 0;
predetermined first characteristic value 9 of the first column: the number of columns containing Chinese, English, or dates among the column itself and the columns to its left is 0;
predetermined first characteristic value 10 of the first column: the number of columns containing numbers among the column itself and the columns to its right is 2;
predetermined first characteristic value 11 of the first column: the number of columns containing Chinese, English, or dates among the column itself and the columns to its right is 2.
The 11 characteristic values are input into the first random forest model for analysis and calculation, and the calculation result for the first column is a predicted row field score of 0.1. In this embodiment, scores range from 0 to 1, and a data column whose calculated score is greater than 0.6 can be used to generate the rows of the pivot table. The predicted row field score of the first column, 0.1, is less than 0.6, so the first column cannot serve as a row of the pivot table. By analogy, the column whose predicted row field score is greater than 0.6 is the column titled "managed mode", and the rows of the pivot table are finally generated automatically from column C, "managed mode".
TABLE 1

| Column heading | Grade | Name | Managed mode | Number of days | Teacher days |
|---|---|---|---|---|---|
| Characteristic value 1 | 5 | 5 | 5 | 5 | 5 |
| Characteristic value 2 | 0 | 1 | 2 | 3 | 4 |
| Characteristic value 3 | 322 | 320 | 256 | 64 | 64 |
| Characteristic value 4 | 0 | 25 | 5 | 8 | 8 |
| Characteristic value 5 | 0 | 0.366606056 | 6.493073232 | 3.471251471 | 3.699452718 |
| Characteristic value 6 | 5 | 3 | 5 | 2 | 2 |
| Characteristic value 7 | 0 | 0.304543478 | 1.283759343 | 0.405080694 | 0.405080694 |
| Characteristic value 8 | 0 | 0 | 0 | 1 | 2 |
| Characteristic value 9 | 0 | 1 | 2 | 2 | 2 |
| Characteristic value 10 | 2 | 2 | 2 | 2 | 1 |
| Characteristic value 11 | 2 | 2 | 1 | 0 | 0 |
| Predicted row field score | 0.1 | 0.0 | 0.8 | 0.0 | 0.0 |
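The threshold step applied to the scores in Table 1 can be sketched as follows. The function name `select_columns` is hypothetical; the scores and the 0.6 threshold are the ones stated in this embodiment.

```python
def select_columns(scores, threshold):
    """Return the titles of the columns whose predicted score exceeds the threshold."""
    return [title for title, score in scores.items() if score > threshold]


# Predicted row field scores from Table 1.
row_scores = {
    "grade": 0.1,
    "name": 0.0,
    "managed mode": 0.8,
    "number of days": 0.0,
    "teacher days": 0.0,
}

print(select_columns(row_scores, 0.6))  # ['managed mode']
```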
With the method for determining the pivot table, the system automatically acquires at least one predetermined second characteristic value of the table, namely 10 characteristic values in this embodiment. As shown in Table 2, taking the first column as an example:
predetermined second characteristic value 1 of the first column: the title keyword score is 0;
predetermined second characteristic value 2 of the first column: the total number of columns of the table is 5;
predetermined second characteristic value 3 of the first column: the number of cells containing only numbers is 0;
predetermined second characteristic value 4 of the first column: the variance of the character length of the integer part of the numbers in the cells is 0;
predetermined second characteristic value 5 of the first column: the number of columns containing numbers among the column itself and the columns to its left is 0;
predetermined second characteristic value 6 of the first column: the number of columns containing only numbers among the column itself and the columns to its left is 0;
predetermined second characteristic value 7 of the first column: the number of columns containing Chinese, English, or dates among the column itself and the columns to its left is 0;
predetermined second characteristic value 8 of the first column: the number of columns containing numbers among the column itself and the columns to its right is 2;
predetermined second characteristic value 9 of the first column: the number of columns containing only numbers among the column itself and the columns to its right is 2;
predetermined second characteristic value 10 of the first column: the number of columns containing Chinese, English, or dates among the column itself and the columns to its right is 0.
The 10 characteristic values are input into the second random forest model for analysis and calculation, and the calculation result for the first column is a predicted value field score of 0.0. In this embodiment, scores range from 0 to 1, and a data column whose score is greater than 0.65 can be used to generate the pivot table values. The predicted value field score of the first column, 0.0, is less than 0.65, so the first column cannot be used to generate the pivot table values. By analogy, the data column whose predicted value field score is greater than 0.65 is the column titled "teacher days", and column E, "teacher days", is finally recommended for generating the values of the pivot table.
TABLE 2
[Table 2 is provided as images in the original publication: Figure BDA0002266175810000181 and Figure BDA0002266175810000191]
As shown in fig. 5, the second table data column list in the embodiment of the present application includes 6 columns A, B, C, D, E, and F, whose titles are "site name", "material name", "unit", "design quantity", "construction quantity", and "physical verification quantity", respectively.
The system automatically obtains at least one predetermined first characteristic value of the table, namely 11 characteristic values in this embodiment. As shown in Table 3, taking the first column as an example:
predetermined first characteristic value 1 of the first column: the total number of columns of the table is 6;
predetermined first characteristic value 2 of the first column: the index value is 0;
predetermined first characteristic value 3 of the first column: the type value of the whole column is 448;
predetermined first characteristic value 4 of the first column: the number of cells after duplicate cells are removed is 0;
predetermined first characteristic value 5 of the first column: the variance of the occurrence counts of repeated cell contents is 0;
predetermined first characteristic value 6 of the first column: the maximum cell character length is 11;
predetermined first characteristic value 7 of the first column: the variance of the cell character length is 0;
predetermined first characteristic value 8 of the first column: the number of columns containing numbers among the column itself and the columns to its left is 0;
predetermined first characteristic value 9 of the first column: the number of columns containing Chinese, English, or dates among the column itself and the columns to its left is 0;
predetermined first characteristic value 10 of the first column: the number of columns containing numbers among the column itself and the columns to its right is 3;
predetermined first characteristic value 11 of the first column: the number of columns containing Chinese, English, or dates among the column itself and the columns to its right is 2.
The characteristic values are input in sequence into the random forest model for analysis and calculation, and the calculation result for the first column is a predicted row field score of 0.0. In this embodiment, scores range from 0 to 1, and a data column whose calculated score is greater than 0.6 can be used to generate the rows of the pivot table. The predicted row field score of the first column, 0.0, is less than 0.6, so the first column cannot serve as a row of the pivot table. By analogy, the column whose predicted row field score is greater than 0.6 is the column titled "material name", and the rows of the pivot table are finally generated automatically from column B, "material name".
TABLE 3
[Table 3 is provided as an image in the original publication: Figure BDA0002266175810000201]
The system automatically obtains at least one predetermined second characteristic value of the table, namely 10 characteristic values in this embodiment. As shown in Table 4, taking the first column as an example:
predetermined second characteristic value 1 of the first column: the title keyword score is 0;
predetermined second characteristic value 2 of the first column: the total number of columns of the table is 6;
predetermined second characteristic value 3 of the first column: the number of cells containing only numbers is 0;
predetermined second characteristic value 4 of the first column: the variance of the character length of the integer part of the numbers in the cells is 0.330718914;
predetermined second characteristic value 5 of the first column: the number of columns containing numbers among the column itself and the columns to its left is 0;
predetermined second characteristic value 6 of the first column: the number of columns containing only numbers among the column itself and the columns to its left is 0;
predetermined second characteristic value 7 of the first column: the number of columns containing Chinese, English, or dates among the column itself and the columns to its left is 0;
predetermined second characteristic value 8 of the first column: the number of columns containing numbers among the column itself and the columns to its right is 3;
predetermined second characteristic value 9 of the first column: the number of columns containing only numbers among the column itself and the columns to its right is 2;
predetermined second characteristic value 10 of the first column: the number of columns containing Chinese, English, or dates among the column itself and the columns to its right is 0.
The 10 characteristic values are input into the random forest model for analysis and calculation, and the calculation result for the first column is a predicted value field score of 0.0. In this embodiment, scores range from 0 to 1, and a data column whose score is greater than 0.65 can be used to generate the pivot table values. The predicted value field score of the first column, 0.0, is less than 0.65, so the first column cannot be used as a pivot table value. By analogy, the data columns whose predicted value field score is greater than 0.65 are the columns titled "design quantity", "construction quantity", and "audit quantity", and columns D, E, and F, "design quantity", "construction quantity", and "audit quantity", are finally recommended for generating the pivot table values.
TABLE 4
[Table 4 is provided as an image in the original publication: Figure BDA0002266175810000211]
As shown in FIG. 6, the method of the present invention for obtaining the characteristic values used to determine the rows of the pivot table comprises the following steps:
1) starting;
2) acquiring a list of table data columns and a title of each column;
3) traversing all data columns and obtaining the following characteristic values of each column: the total number of columns of the table, the index value, the data types contained in the whole column, the number of cells after duplicate cells are removed, the variance of the occurrence counts of repeated cell contents, the maximum cell character length, and the variance of the cell character length;
4) traversing all data columns from the left and obtaining the following characteristic values of each column: the number of columns containing numbers among the columns to its left (including the current column), and the number of columns containing Chinese, English, or dates among the columns to its left (including the current column);
5) traversing all data columns from the right and obtaining the following characteristic values of each column: the number of columns containing numbers among the columns to its right (including the current column), and the number of columns containing Chinese, English, or dates among the columns to its right (including the current column);
6) ending.
As shown in FIG. 7, the method for determining the value of the pivot table of the data of the present invention comprises the following steps:
1) starting;
2) acquiring a list of table data columns and a title of each column;
3) traversing all data columns and obtaining the following characteristics of each column: the title keyword score, the total number of columns of the table, the number of cells containing only numbers, and the variance of the character length of the integer part of the numbers in the cells;
4) traversing all data columns from the left and obtaining the following characteristics of each column: the number of columns containing numbers among the columns to its left (including the current column), the number of columns containing only numbers among the columns to its left (including the current column), and the number of columns containing Chinese, English, or dates among the columns to its left (including the current column);
5) traversing all data columns from the right and obtaining the following characteristics of each column: the number of columns containing numbers among the columns to its right (including the current column), the number of columns containing only numbers among the columns to its right (including the current column), and the number of columns containing Chinese, English, or dates among the columns to its right (including the current column);
6) calculating, from the 10 obtained characteristics and the model, whether the column is a value field;
7) ending.
As shown in fig. 8, an embodiment of the present invention provides an apparatus for determining a pivot table, including:
the obtaining module 10 is configured to obtain a selected data column in a current table after receiving an instruction for establishing a pivot table for the current table;
an analysis module 20, configured to determine, among the obtained data columns, a first data column serving as the pivot table rows and a second data column serving as the pivot table values, respectively, by using predetermined random forest models; and to generate the pivot table by a predetermined rule according to the determined first data column used to generate the pivot table rows and the determined second data column used to generate the pivot table values.
A Pivot Table (Pivot Table) is an interactive Table that can perform certain calculations such as summing and counting. The calculations performed relate to the arrangement of the data in the pivot table, which can be dynamically changed in their layout to analyze the data in different ways, and can also rearrange the row numbers, column labels and page fields. Each time the layout is changed, the pivot table immediately recalculates the data according to the new layout. In addition, the pivot table may be updated if the original data is changed.
A random forest is a classifier that contains at least one decision tree, and its output class is determined by the mode of the classes output by the individual trees. Leo Breiman and Adele Cutler developed the algorithm for inducing random forests, and "Random Forests" is their trademark. The term derives from the random decision forests proposed by Tin Kam Ho of Bell Laboratories in 1995. The approach combines Breiman's "bootstrap aggregating" idea with Ho's "random subspace method" to build a set of decision trees.
In one exemplary embodiment, each tree is built according to the following algorithm. Let N denote the number of training cases (samples) and M the number of features. A number m of features is given as input and used to determine the decision at each node of the decision tree, where m should be much smaller than M. N samples are drawn with replacement from the N training cases to form a training set (i.e., bootstrap sampling), and the cases that were not drawn are used for prediction to estimate the error. For each node, m features are randomly selected, the decision at that node is based on these features, and the optimal split is calculated from these m features. Each tree is grown fully without pruning (pruning may be employed after a normal tree classifier is built).
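The two sources of randomness in this procedure, bootstrap sampling of the training cases and random selection of m candidate features at a node, can be sketched with the standard library alone. The function names are hypothetical, and the split search and tree growth themselves are omitted; this is not a complete random forest implementation.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag cases."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_features(n_features, m, rng):
    """Pick m distinct feature indices to consider when splitting one node."""
    return rng.sample(range(n_features), m)


rng = random.Random(0)  # fixed seed so repeated runs behave the same
in_bag, out_of_bag = bootstrap_sample(10, rng)  # N = 10 training cases
features = candidate_features(11, 3, rng)       # M = 11 features, m = 3

print("in-bag indices:", in_bag)
print("out-of-bag indices (used to estimate the error):", out_of_bag)
print("candidate features for this node:", features)
```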
In one exemplary embodiment, data in a Microsoft Office Excel worksheet is employed as the source for the list of tabular data columns.
In an exemplary embodiment, the instruction for creating the pivot table may be a preset option in a Microsoft Office Excel worksheet, and when the option is clicked, the creation of the pivot table is triggered; or the pivot table may be automatically prompted when the user selects a column of data.
In an exemplary embodiment, the obtaining module 10 obtains the selected data column in the table, where the obtained data column may be a directly selected data column, or a data column obtained by deleting or expanding the directly selected data column.
In an exemplary embodiment, acquiring the data columns selected in the current table by the acquiring module 10 means: acquiring the data columns selected by the user in the table and determining the size of the selected area, expressed as m × n, where m is the number of rows and n is the number of columns. When the number of rows or the number of columns of the user's selection equals 1, the selection is expanded outward from the selected cells until it is bounded above, below, left, and right by blank rows and blank columns, and the resulting area is taken as the selected data columns in the table. When both the number of rows and the number of columns of the user's selection are greater than 1, the user's selection itself is taken as the selected data columns in the table.
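The expansion rule above can be sketched as follows. This is a minimal sketch under stated assumptions: the worksheet is modeled as a rectangular list of lists, a cell counts as blank when it is an empty string or `None`, and the names `expand_selection` and `select_region` are hypothetical.

```python
def expand_selection(grid, top, left, bottom, right):
    """Grow an inclusive (top, left, bottom, right) box until every side
    is bordered by a blank row/column or by the edge of the sheet."""
    def blank(r, c):
        return grid[r][c] in ("", None)

    rows, cols = len(grid), len(grid[0])
    changed = True
    while changed:
        changed = False
        # Border lines are checked one cell beyond the current box.
        r0, r1 = max(top - 1, 0), min(bottom + 1, rows - 1)
        c0, c1 = max(left - 1, 0), min(right + 1, cols - 1)
        if top > 0 and any(not blank(top - 1, c) for c in range(c0, c1 + 1)):
            top -= 1
            changed = True
        if bottom < rows - 1 and any(not blank(bottom + 1, c) for c in range(c0, c1 + 1)):
            bottom += 1
            changed = True
        if left > 0 and any(not blank(r, left - 1) for r in range(r0, r1 + 1)):
            left -= 1
            changed = True
        if right < cols - 1 and any(not blank(r, right + 1) for r in range(r0, r1 + 1)):
            right += 1
            changed = True
    return top, left, bottom, right


def select_region(grid, top, left, bottom, right):
    """Expand only when the selection is a single row or a single column."""
    if bottom - top + 1 == 1 or right - left + 1 == 1:
        return expand_selection(grid, top, left, bottom, right)
    return top, left, bottom, right


grid = [
    ["", "",  "",  "", ""],
    ["", "a", "b", "", ""],
    ["", "1", "2", "", "x"],
    ["", "",  "",  "", ""],
]
region = select_region(grid, 1, 1, 1, 1)  # user clicked the single cell "a"
```

In this example the stray cell "x" stays outside the region because a blank column separates it from the data block.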
In an exemplary embodiment, after the obtaining module 10 obtains the selected data columns in the table, it is further configured to identify the table direction according to the obtained selected data columns in the table;
when the identified table direction is arranged in a non-predetermined mode, the pivot table is not recommended;
identifying the structure of the table when the identified table direction is arranged in a predetermined manner;
and when the table structure is a preset table structure, executing the step of respectively traversing the selected data columns in at least one preset sequence, and respectively acquiring at least one preset first characteristic value and at least one preset second characteristic value of each selected data column.
In an exemplary embodiment, the table direction is either "by row" or "by column". If it is "by row", the pivot table is not recommended. If it is "by column", the table structure is identified row by row, each row being classified as "row title", "table content", or "other". If the resulting table has the form "row title + table content", processing continues; otherwise the pivot table is not recommended.
In one exemplary embodiment, identifying the table structure may default to treating the first row as the row title and the other rows as table content, or the user may manually select the row title and the table content, and so on.
In another exemplary embodiment, identifying the table structure may proceed as follows. The selected area of the table and the table data are obtained. Each cell in the selected area is traversed to convert its content into a content type (Chinese, English, time, date, or number) and a character count, and to determine whether the cell is a merged cell. Each cell in the table is then traversed and the content of every merged cell is tiled into each column it covers, so that the resulting table has the same number of cells in every row. Each row in the table is traversed, the similarity between the current row and the next row is calculated, and it is decided whether to merge them. This yields Rows structures, each containing at least 1 table row, and finally an array RowsList of the Rows. The array RowsList is traversed and the following features of each Rows are computed: merged cell count / column count; the union of the cell content types of the Rows; number of columns containing Chinese / number of non-blank columns; number of columns not containing Chinese / number of non-blank columns; number of columns containing digits / number of non-blank columns; number of columns containing a colon / number of non-blank columns; the difference in character count from the nearest Rows containing more than 1 row; and the number of cells whose content type differs from that of the corresponding cell in the nearest Rows containing more than 1 row. Based on these features, a pre-established random forest model for identifying the table structure assigns one of three categories: row title, table content, or other.
In an exemplary embodiment, as shown in the table structure identification of fig. 2, the 30 rows of the table are merged into 2 groups according to the content type of each cell, namely row 1 and rows 2-30; the types and the corresponding calculated characteristic values are as follows (table a).
TABLE a
Figure BDA0002266175810000251
In an exemplary embodiment of table structure identification, the 16 rows of the table are merged into 2 groups of similar rows according to the content type of each cell, namely row 1 and rows 2-16, and the corresponding calculated characteristic values are as follows (table b).
Table b
Figure BDA0002266175810000261
In an exemplary embodiment, the table direction may also be selected manually by the user, which will not be described in detail herein.
In another exemplary embodiment, identifying the table direction may proceed as follows. The selected area of the table and the table data are acquired, and the consecutive blank rows and blank columns at the top, bottom, left, and right of the table are deleted to obtain a new table area and table data. The new table area has RowCount rows and ColumnCount columns. A minimum interception length minLength is obtained from the number of rows and columns by the formula: minLength = min(RowCount, ColumnCount, 10). An area whose length and width are both minLength is cut from the upper left corner of the table to obtain an area newTable. The content of each cell in newTable is converted to a type, the types being Chinese, English, number, date, and time. The minLength rows of newTable are traversed, and consecutive similar rows are merged according to row similarity into Rows, each Rows containing at least one row; this yields a sequence of Rows whose length is similarRowCount. The minLength columns of newTable are traversed, and consecutive similar columns are merged according to column similarity into Columns, each Columns containing at least one column; this yields a sequence of Columns whose length is similarColumnCount. The 4 features RowCount, ColumnCount, similarRowCount, and similarColumnCount are input to a preset random forest model for identifying the table direction, which calculates the table direction: by row or by column.
In an exemplary embodiment, as shown in the table direction identification of fig. 2, the table has 5 columns and 30 rows, and min(30, 5, 10) gives 5, so the 5 × 5 area at the upper left corner of the table is cut out for merging of similar rows and columns. Merging rows according to the cell content types yields 2 Rows, and merging columns according to the cell content types yields 4 Columns. The 4 features [30, 5, 2, 4] are then submitted to the random forest model, which gives the result "by column". Fig. 5 shows table direction identification for a table with 6 columns and 16 rows: min(16, 6, 10) gives 6, so the 6 × 6 area at the upper left corner of the table is cut out, merging rows according to the cell content types yields 2 Rows, and merging columns according to the cell content types yields 4 Columns. The 4 features [16, 6, 2, 4], given in the same order of row count, column count, similar row count, and similar column count, are then submitted to the random forest model, which gives the result "by column".
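The four direction features can be computed along the following lines. This is a simplified sketch, not the method of the present application: the type rules in `cell_type` are hypothetical, and since the text does not define row or column similarity, two adjacent lines are assumed here to be similar when their cell-type sequences are identical. The table is assumed to be rectangular and already trimmed of surrounding blank rows and columns.

```python
import re


def cell_type(value):
    """Classify a cell; simplified, hypothetical type rules."""
    text = str(value).strip()
    if text == "":
        return "blank"
    if re.fullmatch(r"-?\d+(\.\d+)?", text):
        return "number"
    if re.fullmatch(r"\d{4}-\d{1,2}-\d{1,2}", text):
        return "date"
    if re.search(r"[\u4e00-\u9fff]", text):
        return "chinese"
    return "english"


def count_similar_groups(lines):
    """Count runs of consecutive lines with identical type signatures
    (an assumed stand-in for the similarity measure)."""
    groups = 0
    previous = None
    for line in lines:
        signature = tuple(cell_type(cell) for cell in line)
        if signature != previous:
            groups += 1
        previous = signature
    return groups


def direction_features(table):
    """Return [RowCount, ColumnCount, similarRowCount, similarColumnCount]."""
    row_count = len(table)
    column_count = len(table[0])
    min_length = min(row_count, column_count, 10)
    # Cut the minLength x minLength area from the upper left corner.
    new_table = [row[:min_length] for row in table[:min_length]]
    columns = [list(column) for column in zip(*new_table)]
    return [row_count, column_count,
            count_similar_groups(new_table), count_similar_groups(columns)]


header = ["grade", "name", "mode", "days", "teacher_days"]
body = [["one", "amy", "noon", "3", "2"] for _ in range(29)]
features = direction_features([header] + body)
```

With this stricter similarity rule the sample table yields [30, 5, 2, 2]; the column count of 4 in the example above presumably reflects the actual data and similarity measure of fig. 2, which are not reproduced here.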
In an exemplary embodiment, the obtaining module 10 obtains the selected data column in the table, where the data column includes a title of the data column and data corresponding to the title.
In an exemplary embodiment, the analysis module 20 determines, in the acquired data columns, a first data column generating the data pivot table row and a second data column generating the data pivot table value, respectively, by using a predetermined random forest model, and refers to:
the analysis module 20 respectively traverses the selected data columns in at least one predetermined order, and for each selected data column, at least one predetermined first characteristic value and at least one predetermined second characteristic value of the data column are respectively obtained;
respectively inputting the preset first characteristic value of each acquired data column into a pre-generated first random forest model to obtain a first analysis result of each data column corresponding to the first characteristic value; inputting the second characteristic value of each acquired data column into a pre-generated second random forest model respectively to obtain a second analysis result of each data column corresponding to the second characteristic value;
determining a first data column serving as a row for generating the pivot table according to a data column of which a first analysis result meets a first preset condition; and determining a second data column as a value for generating the pivot table from the data columns of which the second analysis result satisfies a second preset condition.
The first random forest model and the second random forest model need not be run in any particular order; each simply produces its own result.
In an exemplary embodiment, the pre-generated first random forest model is created by collecting at least one pivot table as training data samples, extracting at least one first feature, building pivot table row decision trees according to the tree-building steps described above, and forming the first random forest model from the pivot table row decision trees.
In an exemplary embodiment, the pre-generated second random forest model is created by collecting at least one pivot table as training data samples, extracting at least one second feature, building pivot table value decision trees according to the tree-building steps described above, and forming the second random forest model from the pivot table value decision trees.
In an exemplary embodiment, the first analysis result produced by the analysis module 20 may take the form of a score. That is, the preset first characteristic values of each acquired data column are respectively input into the pre-generated first random forest model, and a first predicted row field score is calculated for each data column; a data column whose first predicted row field score is within a first preset range is determined as the first data column for generating the rows of the pivot table.
In an exemplary embodiment, the second analysis result produced by the analysis module 20 may take the form of a score. That is, the preset second characteristic values of each acquired data column are respectively input into the pre-generated second random forest model, and a second predicted field score is calculated for each data column; a data column whose second predicted field score is within a second preset range is determined as the second data column for generating the values of the pivot table.
In other embodiments, the determination may instead be a logical output, for example the model directly outputting "yes" or "no". The scoring form described above is merely an exemplary embodiment and is not limiting.
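The score-based selection can be sketched as below. This is an illustration only: `select_columns` is a hypothetical helper, the thresholds 0.6 and 0.65 are the ones used in the worked example later in this description, and all scores other than the 0.1 and 0.0 of the first column are invented placeholders standing in for the outputs of the two random forest models.

```python
def select_columns(scores, threshold):
    """Return the titles of the columns whose score exceeds the threshold."""
    return [title for title, score in scores.items() if score > threshold]


# Placeholder scores; only the first-column values come from the text.
row_scores = {"grade": 0.1, "name": 0.2, "hosting mode": 0.9,
              "days": 0.1, "teacher days": 0.05}
value_scores = {"grade": 0.0, "name": 0.0, "hosting mode": 0.1,
                "days": 0.3, "teacher days": 0.8}

row_columns = select_columns(row_scores, 0.6)        # first preset range
value_columns = select_columns(value_scores, 0.65)   # second preset range
```

The two calls are independent of each other, which mirrors the statement that the two models have no required order.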
In an exemplary embodiment, the analysis module 20 generates the pivot table, by applying a predetermined rule, from the determined first data column for generating the pivot table row and the determined second data column for generating the pivot table value, which means:
the analysis module 20 merges the cells having the same content in the first data column serving as the pivot table row, and uses each merged value as a row header of the pivot table;
and, for each row header of the pivot table, sums the values of the cells in the second data column that lie in the same table rows as the cells of the first data column having that content, and uses the resulting sum as the pivot table value of the corresponding cell in the pivot table.
In an exemplary embodiment, as shown in fig. 2, the first data column satisfying the condition is the "hosting mode" column; for example, the multiple "noonto" cells in the hosting mode column are merged and serve as one of the row headers of the pivot table.
In the second data column satisfying the condition, "teacher days", all the values corresponding to "all day", for example, are summed, and the sum is the value for the row "all day" in the pivot table.
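The merge-and-sum rule above amounts to grouping by the row column and summing the value column. A minimal sketch, assuming each table row is a dict keyed by column title; the function name `build_pivot` and the sample numbers are hypothetical.

```python
from collections import defaultdict


def build_pivot(records, row_field, value_field):
    """Merge equal row-field contents into row headers and sum the values
    of the value field for each header."""
    totals = defaultdict(int)
    for record in records:
        totals[record[row_field]] += record[value_field]
    return dict(totals)


records = [
    {"hosting mode": "noonto", "teacher days": 2},
    {"hosting mode": "all day", "teacher days": 3},
    {"hosting mode": "noonto", "teacher days": 1},
]
pivot = build_pivot(records, "hosting mode", "teacher days")
```

Row headers appear in order of first occurrence, since Python dicts preserve insertion order.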
In an exemplary embodiment, the predetermined at least one order includes a first left-to-right order, and when the traversal is performed in the first left-to-right order, the at least one predetermined first characteristic value acquired for each selected data column includes: the total number of columns, the index value, the data types contained in the whole column, the number of cells after duplicates are removed, the variance of the occurrence counts of the cell contents, the maximum cell character length, and the variance of the cell character length.
In an exemplary embodiment, the predetermined at least one order includes a second left-to-right order, and when the traversal is performed in the second left-to-right order, the at least one predetermined first characteristic value acquired for each selected data column further includes: the number of columns containing numbers among the column itself and the columns to its left, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its left.
In an exemplary embodiment, the predetermined at least one order includes a right-to-left order, and when the traversal is performed in the right-to-left order, the at least one predetermined first characteristic value acquired for each selected data column further includes: the number of columns containing numbers among the column itself and the columns to its right, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right.
The first left-to-right order and the second left-to-right order mean that the columns are traversed twice from left to right to obtain different characteristic values. One pass obtains the total number of columns, the index value, the data types contained in the whole column, the number of cells after duplicates are removed, the variance of the occurrence counts of the cell contents, the maximum cell character length, and the variance of the cell character length. The other pass obtains the number of columns containing numbers among the column itself and the columns to its left, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its left.
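The second left-to-right pass and the right-to-left pass are running counts over per-column flags. A minimal sketch, assuming a boolean flag per column (for example "this column contains numbers") has already been computed; the name `running_counts` is hypothetical.

```python
def running_counts(flags):
    """For each column, count the flagged columns among itself and the
    columns to its left, and among itself and the columns to its right."""
    left, total = [], 0
    for flag in flags:                 # left-to-right pass
        total += int(flag)
        left.append(total)
    right, total = [], 0
    for flag in reversed(flags):       # right-to-left pass
        total += int(flag)
        right.append(total)
    right.reverse()
    return left, right


# Five columns A-E where only D and E contain numbers.
left_counts, right_counts = running_counts([False, False, False, True, True])
```

Each pass visits every column once, which is why acquiring these values in a fixed traversal order is simple for computer processing.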
The first characteristic values in the above embodiments are acquired in the predetermined orders because this is relatively simple for computer processing. A person skilled in the art may of course acquire the same characteristic values in other orders; the present application does not limit how the characteristic values are acquired, since what matters is the resulting characteristic values.
Wherein the predetermined first characteristic value 1: the total number of columns, specifically the total number of columns of the table; this characteristic value is the same for every column of the same table.
Predetermined first characteristic value 2: the index value, specifically the position of the column within the whole table, counting from the left.
Predetermined first characteristic value 3: the types contained in the whole column. The types include Chinese, English, date, and number. The types of all the cells in the column are collected, each type is represented by a specific number, and the numbers corresponding to all the types present are added to obtain the characteristic value. For example, for the first column of the data shown in fig. 2, adding the numbers of the types present gives a characteristic value 3 of 322. As another example, if the number type corresponds to 64 and the Chinese type corresponds to 128, a column containing both Chinese and numbers has the value 192.
Predetermined first characteristic value 4: the number of cells after duplicates are removed, specifically the number of cells remaining once repeated content among all the cells of the column has been removed.
Predetermined first characteristic value 5: the variance of the occurrence counts of the cell contents, specifically the number of times each cell content occurs in the column is counted, and the variance of these counts is calculated to obtain the characteristic value.
Predetermined first characteristic value 6: the maximum cell character length, specifically the character length of the longest cell content in the column.
Predetermined first characteristic value 7: the variance of the cell character length, specifically the character length of each cell is counted, and the variance of these lengths is calculated to obtain the characteristic value.
Predetermined first characteristic value 8: the number of columns containing numbers among the column itself and the columns to its left, specifically counted from the first column on the left up to the current column.
Predetermined first characteristic value 9: the number of columns containing Chinese, English, or dates among the column itself and the columns to its left, specifically counted from the first column on the left up to the current column.
Predetermined first characteristic value 10: the number of columns containing numbers among the column itself and the columns to its right, specifically counted from the first column on the right up to the current column.
Predetermined first characteristic value 11: the number of columns containing Chinese, English, or dates among the column itself and the columns to its right, specifically counted from the first column on the right up to the current column.
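Characteristic values 3 to 7 depend only on the cells of one column and might be computed as sketched below. Several points are assumptions: the cell types are passed in rather than detected, only the two type codes given as an example in the text (number 64, Chinese 128) are included, the population variance is used because the text does not say which variance is meant, and the names `TYPE_CODES` and `column_features` are hypothetical.

```python
from collections import Counter
from statistics import pvariance

# Example codes from the text; codes for English and date are not stated
# there and would have to be added.
TYPE_CODES = {"number": 64, "chinese": 128}


def column_features(cells, cell_types):
    """Compute first characteristic values 3-7 for one non-empty column."""
    counts = Counter(cells)                  # occurrences per distinct content
    lengths = [len(cell) for cell in cells]  # character length per cell
    return {
        "type_code": sum(TYPE_CODES[t] for t in set(cell_types)),
        "distinct_count": len(counts),
        "occurrence_variance": pvariance(list(counts.values())),
        "max_length": max(lengths),
        "length_variance": pvariance(lengths),
    }


features = column_features(
    ["12", "午托", "12", "7"],
    ["number", "chinese", "number", "number"],
)
```

For this sample column the type code is 64 + 128 = 192, matching the example given for characteristic value 3.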
In an exemplary embodiment, the analysis module 20 respectively inputs the predetermined first characteristic values of each acquired data column into the pre-generated first random forest model, which means:
for each acquired data column, the total number of columns, the index value, the types contained in the whole column, the number of cells after duplicates are removed, the variance of the occurrence counts of the cell contents, the maximum cell character length, the variance of the cell character length, the number of columns containing numbers among the column itself and the columns to its left, the number of columns containing Chinese, English, or dates among the column itself and the columns to its left, the number of columns containing numbers among the column itself and the columns to its right, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right are input in that order into the first random forest model.
In an exemplary embodiment, the predetermined at least one order includes a first left-to-right order, and when the traversal is performed in the first left-to-right order, the at least one predetermined second characteristic value acquired for each selected data column includes: the title keyword value, the total number of columns, the number of cells containing only numbers, and the variance of the character length of the integer part of the cell values.
In an exemplary embodiment, the predetermined at least one order includes a second left-to-right order, and when the traversal is performed in the second left-to-right order, the at least one predetermined second characteristic value acquired for each selected data column includes: the number of columns containing numbers among the column itself and the columns to its left, the number of columns containing only numbers among the column itself and the columns to its left, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its left.
In an exemplary embodiment, the predetermined at least one order includes a right-to-left order, and when the traversal is performed in the right-to-left order, the at least one predetermined second characteristic value acquired for each selected data column includes: the number of columns containing numbers among the column itself and the columns to its right, the number of columns containing only numbers among the column itself and the columns to its right, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right.
The first left-to-right order and the second left-to-right order mean that the columns are traversed twice from left to right to obtain different characteristic values. One pass obtains the title keyword value, the total number of columns, the number of cells containing only numbers, and the variance of the character length of the integer part of the cell values. The other pass obtains the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates, each counted among the column itself and the columns to its left.
Wherein the predetermined second characteristic value 1: the title keyword value. Specifically, keywords are extracted from the column title; for example, the following words count as 1: "number", "amount", "total", "income", "expense", "amount", "fee", "sales"; and the following words count as -1: "month", "year", "number", "contact", "telephone", "code", "single number", "serial number", "unit price", "time", "date", "number", "unit"; one number is finally obtained as the characteristic value. More keywords may be extracted, and more results may be obtained by training on more samples.
Predetermined second characteristic value 2: the total number of columns, specifically the total number of columns of the table; this characteristic value is the same for every column of the same table.
Predetermined second characteristic value 3: the number of cells containing only numbers, specifically the cells of the column that contain only numbers are counted to obtain the value.
Predetermined second characteristic value 4: the variance of the character length of the integer part of the cell values, specifically the character length of the integer in each cell of the column is counted and the variance is calculated; if a decimal is encountered, only its integer part is used.
Predetermined second characteristic value 5: the number of columns containing numbers among the column itself and the columns to its left, specifically counted from the first column on the left up to the current column.
Predetermined second characteristic value 6: the number of columns containing only numbers among the column itself and the columns to its left, specifically counted from the first column on the left up to the current column.
Predetermined second characteristic value 7: the number of columns containing Chinese, English, or dates among the column itself and the columns to its left, specifically counted from the first column on the left up to the current column.
Predetermined second characteristic value 8: the number of columns containing numbers among the column itself and the columns to its right, specifically counted from the first column on the right up to the current column.
Predetermined second characteristic value 9: the number of columns containing only numbers among the column itself and the columns to its right, specifically counted from the first column on the right up to the current column.
Predetermined second characteristic value 10: the number of columns containing Chinese, English, or dates among the column itself and the columns to its right, specifically counted from the first column on the right up to the current column. A column is counted as long as it contains any one of Chinese, English, or a date.
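Second characteristic values 1, 3, and 4 could be computed roughly as follows. This sketch makes several assumptions: the keyword lists are an English subset of the examples above (words that appear in both lists in translation are left out), the keyword value is taken to be the sum of the scores of all keywords found in the title, the population variance is used, and the names `keyword_score` and `numeric_features` are hypothetical.

```python
import re
from statistics import pvariance

# Subset of the example keywords; scoring by summation is an assumption.
POSITIVE = ("amount", "total", "income", "expense", "fee", "sales")
NEGATIVE = ("month", "year", "telephone", "code", "unit price", "date", "time")
NUMBER = re.compile(r"-?(\d+)(?:\.\d+)?")  # group 1 is the integer part


def keyword_score(title):
    """+1 for each positive keyword in the title, -1 for each negative one."""
    text = title.lower()
    plus = sum(1 for word in POSITIVE if word in text)
    minus = sum(1 for word in NEGATIVE if word in text)
    return plus - minus


def numeric_features(cells):
    """Return (count of number-only cells, variance of integer-part length).

    For decimals only the integer part is measured, as described above.
    """
    matches = [NUMBER.fullmatch(cell.strip()) for cell in cells]
    lengths = [len(match.group(1)) for match in matches if match]
    variance = pvariance(lengths) if lengths else 0.0
    return len(lengths), variance


count, variance = numeric_features(["1", "22", "3.5", "abc"])
```

Here "3.5" contributes an integer-part length of 1 and "abc" is ignored because it is not a number-only cell.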
In an exemplary embodiment, the analysis module 20 respectively inputs the at least one predetermined second characteristic value of each acquired data column into the pre-generated second random forest model, which means:
for each acquired data column, the title keyword value, the total number of columns, the number of cells containing only numbers, the variance of the character length of the integer part of the cell values, the number of columns containing numbers among the column itself and the columns to its left, the number of columns containing only numbers among the column itself and the columns to its left, the number of columns containing Chinese, English, or dates among the column itself and the columns to its left, the number of columns containing numbers among the column itself and the columns to its right, the number of columns containing only numbers among the column itself and the columns to its right, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right are input in that order into the second random forest model.
The characteristic values in the above embodiments are acquired in the predetermined orders because this is relatively simple for computer processing. Those skilled in the art may of course acquire the same characteristic values in other orders; the present application does not limit how the characteristic values are acquired, since what matters is the resulting characteristic values.
It will be appreciated by those skilled in the art that the random forest algorithm only needs to be given the correct characteristic values and model in order to produce a score. The internal process of the random forest is a packaged algorithm and is outside the scope of this patent.
It can be understood by those skilled in the art that a trained random forest model expects its characteristic values in a fixed order, and this order cannot be changed arbitrarily. If, for example as part of optimizing the algorithm, the model is retrained with new user data, then as long as no characteristic value is added or removed, the retrained model takes the same characteristic values and the order is unchanged. The order described in the present application is used to obtain reasonably accurate results.
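One way to keep the input order fixed is to assemble every feature vector from a single ordered list of names. A minimal sketch; the feature names and the helper `to_vector` are hypothetical labels for the 11 first characteristic values, and the sample values are those listed for the first column in the worked example of this description.

```python
# Hypothetical names for first characteristic values 1-11, in model order.
FIRST_FEATURE_ORDER = (
    "column_count", "column_index", "type_code", "distinct_count",
    "occurrence_variance", "max_length", "length_variance",
    "left_number_columns", "left_text_columns",
    "right_number_columns", "right_text_columns",
)


def to_vector(features, order):
    """Lay out named features in the fixed order the model was trained on."""
    missing = [name for name in order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [features[name] for name in order]


first_column = {
    "right_text_columns": 2, "right_number_columns": 2,
    "left_text_columns": 0, "left_number_columns": 0,
    "length_variance": 0, "max_length": 5, "occurrence_variance": 0,
    "distinct_count": 0, "type_code": 322, "column_index": 0,
    "column_count": 5,
}
vector = to_vector(first_column, FIRST_FEATURE_ORDER)
```

Because the order lives in one place, retraining on new data with the same set of characteristic values leaves the vector layout unchanged.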
According to the method and the device, at least one characteristic value of each table data column, acquired in a predetermined order, is analyzed by random forest models, so that the data columns to be used for the rows and values of the pivot table are found and determined automatically for the user, which lowers the threshold for use and provides the user with a more convenient approach.
As shown in fig. 2, the table data of the embodiment of the present application includes 5 columns A, B, C, D, E, whose titles are "grade", "name", "hosting mode", "number of days", and "teacher days", respectively. The user wishes to sum "teacher days" by "hosting mode".
As shown in figs. 3-4, the prior art uses a pivot table to sum "teacher days" by "hosting mode" as follows. The user first selects the current table area and clicks Insert, then drags "hosting mode" from the upper right to the rows area at the lower right and drags "teacher days" to the values area at the lower right, and can perform data analysis only after the pivot table is obtained. The prior art thus requires multiple operations.
By adopting the method for determining the pivot table, the system automatically acquires at least one characteristic value of the table, namely 11 first characteristic values in the embodiment. As shown in table 1, taking the first column as an example:
predetermined first characteristic value 1 of the first column: the total number of columns is 5;
predetermined first characteristic value 2 of the first column: the index value is 0;
predetermined first characteristic value 3 of the first column: the type value of the whole column is 322;
predetermined first characteristic value 4 of the first column: the number of cells after duplicates are removed is 0;
predetermined first characteristic value 5 of the first column: the variance of the occurrence counts of the cell contents is 0;
predetermined first characteristic value 6 of the first column: the maximum cell character length is 5;
predetermined first characteristic value 7 of the first column: the variance of the cell character length is 0;
predetermined first characteristic value 8 of the first column: the number of columns containing numbers among the column itself and the columns to its left is 0;
predetermined first characteristic value 9 of the first column: the number of columns containing Chinese, English, or dates among the column itself and the columns to its left is 0;
predetermined first characteristic value 10 of the first column: the number of columns containing numbers among the column itself and the columns to its right is 2;
predetermined first characteristic value 11 of the first column: the number of columns containing Chinese, English, or dates among the column itself and the columns to its right is 2.
These 11 characteristic values are input into the first random forest model for analysis and calculation, and the result for the first column is a predicted row field score of 0.1. In this embodiment the score lies between 0 and 1, and a data column whose calculated score is greater than 0.6 can be used to generate the rows of the pivot table. The predicted row field score of the first column, 0.1, is less than 0.6, so the first column cannot serve as the rows of the pivot table. By analogy, the column whose predicted row field score is greater than 0.6 is the column titled "hosting mode", and the rows of the pivot table are finally generated automatically from column C, "hosting mode".
TABLE 1
Figure BDA0002266175810000341
Figure BDA0002266175810000351
Using the method for determining the pivot table, the system automatically acquires at least one predetermined second characteristic value of the table, namely 10 characteristic values in this embodiment. As shown in Table 2, taking the first column as an example:
predetermined second characteristic value 1 of the first column: the title keyword extraction value is 0;
predetermined second characteristic value 2 of the first column: the total number of data columns is 5;
predetermined second characteristic value 3 of the first column: the number of cells containing only numbers is 0;
predetermined second characteristic value 4 of the first column: the variance of the integer-digit character length of the cells is 0;
predetermined second characteristic value 5 of the first column: the number of columns containing numbers, counting the column itself and every column to its left, is 0;
predetermined second characteristic value 6 of the first column: the number of columns containing only numbers, counting the column itself and every column to its left, is 0;
predetermined second characteristic value 7 of the first column: the number of columns containing Chinese, English, or dates, counting the column itself and every column to its left, is 0;
predetermined second characteristic value 8 of the first column: the number of columns containing numbers, counting the column itself and every column to its right, is 2;
predetermined second characteristic value 9 of the first column: the number of columns containing only numbers, counting the column itself and every column to its right, is 2;
predetermined second characteristic value 10 of the first column: the number of columns containing Chinese, English, or dates, counting the column itself and every column to its right, is 0.
The 10 characteristic values are input into a random forest model for analysis and calculation, which yields a predicted row field score of 0.0 for the first column. In this embodiment, scores lie between 0 and 1, and a data column whose score is greater than 0.65 can be used to generate a value of the pivot table. The predicted row field score of the first column, 0.0, is less than 0.65, so the first column cannot be used to generate a pivot table value. By analogy, the data column whose score is greater than 0.65 is the one titled "teacher days", so column E, "teacher days", is ultimately recommended for generating the value of the pivot table.
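The directional characteristic values (5 and 8 above) can be illustrated with a short Python sketch. This is only one plausible reading of the description, not the patented implementation: the function names `column_has_number` and `directional_counts` are hypothetical, the rule "a column contains numbers if any cell has a digit" is an assumption, and the cell contents are made up. Only the five-column layout with two numeric columns at the right follows the first embodiment.

```python
import re
from typing import List, Tuple

HAS_DIGIT = re.compile(r"\d")


def column_has_number(cells: List[str]) -> bool:
    """Assumed rule: a column 'contains numbers' if any cell has a digit."""
    return any(HAS_DIGIT.search(cell) for cell in cells)


def directional_counts(columns: List[List[str]]) -> Tuple[List[int], List[int]]:
    """For each column, count the columns containing numbers among
    (a) itself and the columns to its left, and
    (b) itself and the columns to its right."""
    flags = [column_has_number(col) for col in columns]

    left: List[int] = []
    running = 0
    for flag in flags:  # left-to-right traversal
        running += int(flag)
        left.append(running)

    right: List[int] = []
    running = 0
    for flag in reversed(flags):  # right-to-left traversal
        running += int(flag)
        right.append(running)
    right.reverse()

    return left, right


# Hypothetical cell contents; columns 4 and 5 are numeric.
columns = [
    ["Grade One", "Grade Two"],  # grade
    ["Li", "Wang"],              # name
    ["full", "half"],            # custody mode
    ["20", "15"],                # number of days
    ["5", "3"],                  # teacher days
]
left, right = directional_counts(columns)
print(left)   # [0, 0, 0, 1, 2]
print(right)  # [2, 2, 2, 2, 1]
```

With these made-up cells, the two passes give the same counts as rows 5 and 8 of Table 2. The "only numbers" and "Chinese, English, or dates" counts would follow the same two-pass pattern with a different per-column predicate.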
TABLE 2
Column headings | Grade of year | Name | Supporting tube mode | Number of days | Days of teacher
Characteristic value 1 | 0 | 0 | 0 | 0 | 0
Characteristic value 2 | 5 | 5 | 5 | 5 | 5
Characteristic value 3 | 0 | 0 | 0 | 25 | 25
Characteristic value 4 | 0 | 0 | 0 | 0.405080694 | 0.405080694
Characteristic value 5 | 0 | 0 | 0 | 1 | 2
Characteristic value 6 | 0 | 1 | 2 | 2 | 2
Characteristic value 7 | 0 | 0 | 0 | 0 | 0
Characteristic value 8 | 2 | 2 | 2 | 2 | 1
Characteristic value 9 | 2 | 2 | 1 | 0 | 0
Characteristic value 10 | 0 | 0 | 0 | 0 | 0
Predicted row field score | 0.0 | 0.0 | 0.0 | 0.3 | 1.0
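The threshold comparison applied to the scores in Table 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `select_fields` is hypothetical, and the column letters A to E are assumed to map to the five columns of Table 2 in order. The scores and the 0.65 threshold are taken from the text.

```python
from typing import Dict, List


def select_fields(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the columns whose score is strictly greater than the threshold."""
    return [column for column, score in scores.items() if score > threshold]


# Scores from the last row of Table 2, keyed by assumed column letter.
table2_scores = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.3, "E": 1.0}

VALUE_THRESHOLD = 0.65  # threshold stated for pivot table values
value_fields = select_fields(table2_scores, VALUE_THRESHOLD)
print(value_fields)  # ['E']
```

The same function would serve for row fields by passing the first model's scores and the 0.6 threshold.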
As shown in fig. 5, the table of the second embodiment of the present application includes six data columns, A, B, C, D, E, and F, whose titles are "site name", "material name", "unit", "design quantity", "construction quantity", and "physical verification quantity", respectively.
The system automatically acquires at least one predetermined first characteristic value of the table, 11 characteristic values in this embodiment. As shown in Table 3, taking the first column as an example:
predetermined first characteristic value 1 of the first column: the total number of data columns is 6;
predetermined first characteristic value 2 of the first column: the index value is 0;
predetermined first characteristic value 3 of the first column: the data type contained in the whole column is 448;
predetermined first characteristic value 4 of the first column: the number of cells after removing duplicate cells is 0;
predetermined first characteristic value 5 of the first column: the variance of the number of occurrences of duplicated cell contents is 0;
predetermined first characteristic value 6 of the first column: the maximum cell character length is 11;
predetermined first characteristic value 7 of the first column: the variance of the cell character length is 0;
predetermined first characteristic value 8 of the first column: the number of columns containing numbers, counting the column itself and every column to its left, is 0;
predetermined first characteristic value 9 of the first column: the number of columns containing Chinese, English, or dates, counting the column itself and every column to its left, is 0;
predetermined first characteristic value 10 of the first column: the number of columns containing numbers, counting the column itself and every column to its right, is 3;
predetermined first characteristic value 11 of the first column: the number of columns containing Chinese, English, or dates, counting the column itself and every column to its right, is 3.
The characteristic values are input in sequence into a random forest model for analysis and calculation, which yields a predicted row field score of 0.0 for the first column. In this embodiment, scores lie between 0 and 1, and a data column whose score is greater than 0.6 can be used to generate a row of the pivot table. The predicted row field score of the first column, 0.0, is less than 0.6, so the first column cannot be used as a row of the pivot table. By analogy, the column whose predicted row field score is greater than 0.6 is the column titled "material name", so the rows of the pivot table are ultimately generated automatically from column B, "material name".
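The per-column statistics among the first characteristic values (values 4 to 7) can be sketched in Python. This is a hedged illustration, not the patent's implementation: the function name `column_stats` and the dictionary keys are hypothetical, the use of population variance is an assumption, and the text does not say whether the values are normalized, so the output need not match Table 3.

```python
from collections import Counter
from statistics import pvariance
from typing import Dict, List


def column_stats(cells: List[str]) -> Dict[str, float]:
    """Compute simple statistics for one non-empty data column.

    Assumed definitions (not given in detail by the source):
      distinct_cells      - number of cells left after removing duplicates
      occurrence_variance - population variance of how often each content occurs
      max_length          - maximum cell character length
      length_variance     - population variance of cell character lengths
    """
    if not cells:
        raise ValueError("column must contain at least one cell")
    counts = Counter(cells)
    lengths = [len(cell) for cell in cells]
    return {
        "distinct_cells": len(counts),
        "occurrence_variance": pvariance(list(counts.values())),
        "max_length": max(lengths),
        "length_variance": pvariance(lengths),
    }


# Hypothetical cells for a "material name" column.
stats = column_stats(["steel", "steel", "cement", "sand"])
print(stats["distinct_cells"], stats["max_length"], stats["length_variance"])
```

A column with many repeated, short text cells yields a small distinct count relative to its length, which is the kind of signal a row-field classifier could plausibly use.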
TABLE 3
Figure BDA0002266175810000371
Figure BDA0002266175810000381
The system automatically acquires at least one predetermined second characteristic value of the table, 10 characteristic values in this embodiment. As shown in Table 4, taking the first column as an example:
predetermined second characteristic value 1 of the first column: the title keyword extraction value is 0;
predetermined second characteristic value 2 of the first column: the total number of data columns is 6;
predetermined second characteristic value 3 of the first column: the number of cells containing only numbers is 0;
predetermined second characteristic value 4 of the first column: the variance of the integer-digit character length of the cells is 0.330718914;
predetermined second characteristic value 5 of the first column: the number of columns containing numbers, counting the column itself and every column to its left, is 0;
predetermined second characteristic value 6 of the first column: the number of columns containing only numbers, counting the column itself and every column to its left, is 0;
predetermined second characteristic value 7 of the first column: the number of columns containing Chinese, English, or dates, counting the column itself and every column to its left, is 0;
predetermined second characteristic value 8 of the first column: the number of columns containing numbers, counting the column itself and every column to its right, is 3;
predetermined second characteristic value 9 of the first column: the number of columns containing only numbers, counting the column itself and every column to its right, is 2;
predetermined second characteristic value 10 of the first column: the number of columns containing Chinese, English, or dates, counting the column itself and every column to its right, is 0.
The 10 characteristic values are input into a random forest model for analysis and calculation, which yields a predicted row field score of 0.0 for the first column. In this embodiment, scores lie between 0 and 1, and a data column whose score is greater than 0.65 can be used to generate a value of the pivot table. The predicted row field score of the first column, 0.0, is less than 0.65, so the first column cannot be used as a pivot table value. By analogy, the data columns whose scores are greater than 0.65 are those titled "design quantity", "construction quantity", and "physical verification quantity", so columns D, E, and F are ultimately recommended for generating the values of the pivot table.
TABLE 4
Figure BDA0002266175810000391
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have at least one function, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media, as is known to those skilled in the art.

Claims (12)

1. A method of generating a pivot table, the method comprising:
after an instruction for establishing a pivot table for a current table is received, acquiring a selected data column in the current table;
using a preset random forest model, determining, among the acquired data columns, a first data column for generating the pivot table rows and a second data column for generating the pivot table values, respectively;
and generating the pivot table, using a preset rule, from the determined first data column for generating the pivot table rows and the determined second data column for generating the pivot table values.
2. The method of claim 1, wherein determining, using the preset random forest model, the first data column for generating the pivot table rows and the second data column for generating the pivot table values among the acquired data columns, respectively, comprises:
traversing the selected data columns in at least one predetermined order, and acquiring at least one predetermined first characteristic value and at least one predetermined second characteristic value of each selected data column;
inputting the predetermined first characteristic value of each acquired data column into a pre-generated first random forest model to obtain a first analysis result for each data column corresponding to the first characteristic value; and inputting the predetermined second characteristic value of each acquired data column into a pre-generated second random forest model to obtain a second analysis result for each data column corresponding to the second characteristic value;
determining, from the data columns whose first analysis result meets a first preset condition, the first data column used to generate the rows of the pivot table; and determining, from the data columns whose second analysis result meets a second preset condition, the second data column used to generate the values of the pivot table.
3. The method of claim 1, wherein generating the pivot table, using the preset rule, from the determined first data column for generating the pivot table rows and the determined second data column for generating the pivot table values comprises:
merging the cells with the same content in the first data column, and taking each merged value as a row title of the pivot table;
and summing the values of the cells in the second data column of the current table for each row title of the pivot table, and taking each resulting sum as the value of the corresponding cell in the pivot table.
4. The method of claim 1, wherein acquiring the selected data column in the current table comprises:
acquiring the data column selected by a user in the current table, and determining the size of the region that the selected data column occupies in the table, wherein the size of the region is expressed as m × n, where m is the number of rows and n is the number of columns;
when the number of rows and the number of columns of the data column selected by the user are both equal to 1, expanding the selected cell upward, downward, leftward, and rightward, and taking the region that is not interrupted by continuous blank rows or blank columns as the selected data column in the table;
and when the number of rows or the number of columns of the data column selected by the user is greater than 1, taking the data column selected by the user as the selected data column in the table.
5. The method of claim 1, further comprising, after acquiring the selected data column in the table:
identifying the direction of the table according to the selected data columns acquired from the table;
when the identified direction of the table matches a predetermined arrangement, identifying the structure of the table;
and when the structure of the table is a preset table structure, executing the step of traversing the selected data columns in at least one predetermined order and acquiring at least one predetermined first characteristic value and at least one predetermined second characteristic value of each selected data column.
6. The method of claim 2, wherein the at least one predetermined order comprises a first left-to-right order, and wherein the at least one predetermined first characteristic value acquired for each selected data column when traversing in the first left-to-right order comprises: the total number of data columns, the index value, the data type contained in the whole column, the number of cells after removing duplicate cells, the variance of the number of occurrences of duplicated cell contents, the maximum cell character length, and the variance of the cell character length.
7. The method of claim 6, wherein the at least one predetermined order comprises a second left-to-right order, and wherein the at least one predetermined first characteristic value acquired for each selected data column when traversing in the second left-to-right order further comprises: the number of columns containing numbers among the column itself and the columns to its left, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its left.
8. The method of claim 7, wherein the at least one predetermined order comprises a right-to-left order, and wherein the at least one predetermined first characteristic value acquired for each selected data column when traversing in the right-to-left order further comprises: the number of columns containing numbers among the column itself and the columns to its right, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right.
9. The method of claim 2, wherein the at least one predetermined order comprises a first left-to-right order, and wherein the at least one predetermined second characteristic value acquired for each selected data column when traversing in the first left-to-right order comprises: the title keyword extraction value, the total number of data columns, the number of cells containing only numbers, and the variance of the integer-digit character length of the cells.
10. The method of claim 9, wherein the at least one predetermined order comprises a second left-to-right order, and wherein the at least one predetermined second characteristic value acquired for each selected data column when traversing in the second left-to-right order further comprises: the number of columns containing numbers among the column itself and the columns to its left, the number of columns containing only numbers among the column itself and the columns to its left, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its left.
11. The method of claim 10, wherein the at least one predetermined order comprises a right-to-left order, and wherein the at least one predetermined second characteristic value acquired for each selected data column when traversing in the right-to-left order further comprises: the number of columns containing numbers among the column itself and the columns to its right, the number of columns containing only numbers among the column itself and the columns to its right, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right.
12. An apparatus for targeted delivery of content, comprising:
an acquisition module, configured to acquire the selected data column in the current table after an instruction for establishing a pivot table for the current table is received;
an analysis module, configured to determine, using a preset random forest model, a first data column for generating the pivot table rows and a second data column for generating the pivot table values among the acquired data columns, respectively;
and generating the pivot table, using a preset rule, from the determined first data column for generating the pivot table rows and the determined second data column for generating the pivot table values.
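The generation rule recited in claim 3 (merge identical cells of the first data column into row titles, then sum the second data column per row title) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_pivot` is hypothetical, and the material names and quantities are made up rather than taken from the embodiments.

```python
from typing import Dict, List


def build_pivot(row_cells: List[str], value_cells: List[float]) -> Dict[str, float]:
    """Group by the row-field column and sum the value-field column.

    Cells with the same content in row_cells become one row title;
    the matching cells of value_cells are summed under that title.
    """
    if len(row_cells) != len(value_cells):
        raise ValueError("both columns must have the same number of cells")
    pivot: Dict[str, float] = {}
    for title, value in zip(row_cells, value_cells):
        pivot[title] = pivot.get(title, 0) + value
    return pivot


# Hypothetical data: a "material name" column and a "design quantity" column.
materials = ["cement", "steel", "cement", "sand"]
quantities = [10, 4, 5, 7]
print(build_pivot(materials, quantities))  # {'cement': 15, 'steel': 4, 'sand': 7}
```

Row titles appear in first-occurrence order because Python dictionaries preserve insertion order. The claim does not specify an ordering, so this is a choice made for the sketch.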
CN201911088571.4A 2019-11-08 2019-11-08 Method and device for determining pivot table Active CN112784557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088571.4A CN112784557B (en) 2019-11-08 2019-11-08 Method and device for determining pivot table

Publications (2)

Publication Number Publication Date
CN112784557A true CN112784557A (en) 2021-05-11
CN112784557B CN112784557B (en) 2023-06-30

Family

ID=75748988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088571.4A Active CN112784557B (en) 2019-11-08 2019-11-08 Method and device for determining pivot table

Country Status (1)

Country Link
CN (1) CN112784557B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102782675A (en) * 2009-10-09 2012-11-14 微软公司 Data analysis expressions
CN102799652A (en) * 2012-06-29 2012-11-28 用友软件股份有限公司 Multidimensional data presentation device and multidimensional data presentation method
US20180018383A1 (en) * 2016-07-18 2018-01-18 Sap Se Hierarchical Data Grouping in Main-Memory Relational Databases

Also Published As

Publication number Publication date
CN112784557B (en) 2023-06-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant