CN116010421A

CN116010421A - Searching method

Info

Publication number: CN116010421A
Application number: CN202310098907.5A
Authority: CN
Inventors: 张博航
Original assignee: Shanghai Eisoo Information Technology Co Ltd
Current assignee: Shanghai Eisoo Information Technology Co Ltd
Priority date: 2023-02-08
Filing date: 2023-02-08
Publication date: 2023-04-25

Abstract

The invention discloses a search method comprising the following steps: determining a table file to be searched, wherein the table file comprises at least one worksheet and the worksheet comprises at least one table; obtaining structured information of the table file, wherein the structured information comprises table information of the tables contained in the table file; and searching in the table file according to the structured information and a search statement, and displaying the search result obtained. By obtaining the structured information of the table file and searching the table file according to that information and the search statement, the method increases table recognition speed, enables conditional search across a large number of spreadsheets, and widens the search scope.

Description

Searching method

Technical Field

The invention relates to the technical field of data processing, in particular to a searching method.

Background

Because of their strong analysis and display capabilities, spreadsheets such as Excel are widely used for organizing, arranging, summarizing, analyzing, and displaying data. How to identify structured information in a table file and perform conditional search on it is therefore a problem to be solved.

Existing table recognition methods are mainly based on convolutional neural networks and OCR technology, such as Tablesense. Because of their complex network structures, they are unsuitable for scenarios with strict speed requirements, and in testing their practical performance did not meet usage requirements.

In addition, most existing search methods implement conditional search over the contents of a single table only, so their search scope is limited.

Disclosure of Invention

The invention provides a search method that increases table recognition speed while enabling conditional search across a large number of spreadsheets and widening the search scope.

According to an aspect of the present invention, there is provided a search method including:

determining a table file to be searched, wherein the table file comprises at least one worksheet, and the worksheet comprises at least one table;

obtaining structural information of the table file, wherein the structural information comprises table information of a table contained in the table file;

searching in the table file according to the structured information and the search statement, and displaying a search result obtained by searching.

Optionally, the obtaining the structured information of the table file includes:

Determining cell categories corresponding to cells in the table file, wherein the cell categories comprise at least one of a column head cell, a data cell and other cells, and the other cells are cells except the column head cell and the data cell in the table file;

determining a header area and a data area of an effective table contained in the table file based on cell categories corresponding to the cells;

and determining the structural information of the table file according to the header area and the data area of the effective table.

Optionally, the determining the cell category corresponding to each cell in the table file includes:

determining a feature vector corresponding to each cell in the table file, wherein the feature vector is used for representing a basic attribute of the cell, and the basic attribute comprises at least one of a text attribute, a style attribute, a font attribute and a space attribute;

and inputting the feature vectors corresponding to the cells into the classifier model to obtain cell categories corresponding to the cells.

Optionally, the determining the feature vector corresponding to each cell in the table file includes:

Analyzing the table file to obtain basic attributes of each cell in the table file;

converting the basic attribute into a feature vector.

Optionally, before the feature vector corresponding to each cell is input into the classifier model to obtain the cell category corresponding to each cell, the method further includes:

training a model to be trained based on a training sample to obtain the classifier model, wherein the model to be trained is a LightGBM model.

Optionally, the determining, based on the cell category corresponding to each cell, a header area and a data area of the valid table included in the table file includes:

determining a valid table contained in the table file based on a region growing algorithm;

and determining a column header area and a data area of the effective table according to the cell types of the cells contained in the effective table.

Optionally, the determining, based on the region growing algorithm, the valid table contained in the table file includes:

determining a growing area in the table file and first and second coordinates of the growing area based on an area growing algorithm, wherein the first coordinates comprise a first abscissa and a first ordinate corresponding to a first edge cell in the growing area, and the second coordinates comprise a second abscissa and a second ordinate corresponding to a second edge cell in the growing area;

Determining whether the cells contained in each growth area meet a preset condition according to the first coordinate and the second coordinate of the growth area; if yes, determining the table formed by the growth areas as a valid table.

Optionally, the determining the column header area and the data area of the valid table according to the cell category of the cells included in the valid table includes:

determining a label of each row in the effective table, and determining a header row number array, a data row number array and other row number arrays of the effective table according to the row number of each row and the label of each row, wherein the other row number arrays comprise row numbers which are not positioned in the header row number array and the data row number array in the effective table;

determining the title of the effective table based on the table head row number array and the data row number array of the effective table;

and verifying the head row number array, the data row number array and other row number arrays of the effective table according to a heuristic method to obtain a column head area and a data area of the effective table.

Optionally, the verifying the header row number array, the data row number array and the other row number arrays of the valid table according to the heuristic method to obtain a column header area and a data area of the valid table includes:

Determining an initial area of a column head in the effective table and a data area of the effective table based on the table head row number array, the data row number array and the other row number arrays;

verifying the validity of the initial area according to a heuristic method;

and if the verification is successful, determining the initial area as a column head area of the valid table.

Optionally, the verifying the validity of the initial area according to the heuristic method includes:

splitting the merging cells in the initial area, and copying the basic attribute of a third edge cell in the merging cells to other cells except the third edge cell in the merging cells;

judging whether the types of the contents stored in each cell in the initial area are all preset types or not;

if yes, determining that the initial area verification is successful.

Optionally, the displaying the search result obtained by the search includes:

if the search statement is a keyword search statement or a conditional search statement, displaying an effective table containing search results obtained by searching according to the structured information, and highlighting the search results;

And if the search statement is a statistical calculation statement, displaying the search result obtained by searching in a chart form.

The embodiment of the invention provides a search method, which comprises: determining a table file to be searched, wherein the table file comprises at least one worksheet and the worksheet comprises at least one table; obtaining structured information of the table file, wherein the structured information comprises table information of the tables contained in the table file; and searching in the table file according to the structured information and a search statement, and displaying the search result obtained. With this technical scheme, the structured information of the table file is obtained and the table file is searched according to the structured information and the search statement, so that table recognition speed is increased, conditional search across a large number of spreadsheets is enabled, and the search scope is widened.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a search method according to a first embodiment of the present invention;

FIG. 2 is a flow chart of a search method according to a second embodiment of the present invention;

FIG. 3 is a flowchart of another searching method according to the second embodiment of the present invention;

FIG. 4 is a flow chart for determining a table title according to a second embodiment of the present invention;

FIG. 5 is a flow chart for determining structured information according to a second embodiment of the present invention;

FIG. 6 is a diagram of a table file according to a second embodiment of the present invention;

FIG. 7 is a schematic diagram of a structured message according to a second embodiment of the present invention;

FIG. 8 is a flowchart of a search according to a second embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

Fig. 1 is a flowchart of a search method according to a first embodiment of the present invention, where the present embodiment is applicable to a case of searching data in a table file, and the method may be configured in an electronic device.

Spreadsheets such as Excel are widely used for organizing, sorting, summarizing, analyzing, and displaying data because of their strong analytical and presentation capabilities; as a result, such data is more "human friendly" than "machine friendly". How to identify tables in this data and extract structured information from them is a challenge.

Table detection and table recognition are mainly implemented with convolutional neural networks and OCR technology, such as Tablesense, but the network structure is too complex for scenarios with strict speed requirements. In addition, in testing, the practical performance did not meet usage requirements. Some techniques train on cell features to identify the layout of a spreadsheet, but they do not address problems such as multiple tables in one Sheet, an excessive number of classes, and complex post-processing, so they fall well short of practical needs.

In addition, most existing search methods implement conditional search over the contents of a single table; no related work yet screens specific tables out of a massive table data center according to conditions.

Based on the above, the embodiment of the invention provides a search method that extracts the structured information of the tables in a spreadsheet document through machine learning classification and a heuristic post-processing algorithm, and searches content based on the extracted structured information. Conditional search across a large number of spreadsheets is thereby enabled while table recognition speed is increased, and the method is simple to operate, easy to implement, and highly interpretable.

As shown in fig. 1, the method includes:

S110, determining a table file to be searched, wherein the table file comprises at least one worksheet, and the worksheet comprises at least one table.

The table file to be searched is the table file in which the search will be performed. It may be, for example, a file in a spreadsheet format such as Excel or CSV. The table file may include at least one worksheet, and each worksheet may include one or more tables.

In this embodiment, the table file to be searched may be determined, for example, the table file uploaded by the user may be acquired first, and then the table file to be searched is determined; it is also possible to edit a certain form file online and trigger a certain determination control to determine the form file to be searched after the editing is completed.

S120, obtaining structural information of the table file, wherein the structural information comprises table information of tables contained in the table file.

The structured information may include table information of a table included in the table file, where the table information may be, for example, the number of tables, a table title, and/or information of a cell included in the table.

Specifically, after the table file to be searched is determined, its structured information may be obtained. For example, the structured information may be extracted from the table file directly, or it may be retrieved from a database according to an index. The manner of obtaining the structured information is not limited, as long as the structured information can be obtained.

Optionally, in this embodiment, after determining the table file to be searched, the structural information of the table file may be extracted, and after the extraction is successful, the extracted structural information is stored in the database and an index is established, so that when the content search is performed on the same table file subsequently, the structural information of the table file may be directly obtained. The present embodiment does not further expand the process of storing structured information into a database.
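The extract-once, look-up-later behavior described above can be sketched as follows. This is only an illustrative sketch: the class `StructuredInfoStore`, the use of a content hash as the index, and the in-memory dictionary standing in for the database are all assumptions and are not specified by this embodiment.

```python
import hashlib


class StructuredInfoStore:
    """Hypothetical in-memory stand-in for the database and its index."""

    def __init__(self):
        self._by_key = {}

    @staticmethod
    def index_key(file_bytes):
        # Assumption: the index is a SHA-256 hash of the table file's content.
        return hashlib.sha256(file_bytes).hexdigest()

    def get_or_extract(self, file_bytes, extract):
        """Return stored structured information, extracting it only on first use."""
        key = self.index_key(file_bytes)
        if key not in self._by_key:
            self._by_key[key] = extract(file_bytes)
        return self._by_key[key]


extract_calls = []


def fake_extract(file_bytes):
    # Placeholder for the real extraction step; records each invocation.
    extract_calls.append(len(file_bytes))
    return {"tables": []}


store = StructuredInfoStore()
first = store.get_or_extract(b"same file", fake_extract)
second = store.get_or_extract(b"same file", fake_extract)
```

In this sketch the second call for the same file content reuses the stored result, so the extraction function runs only once.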

S130, searching in the table file according to the structural information and the search statement, and displaying a search result obtained by searching.

The search statement is what the user uses to search the table file. Its type is not limited: for example, it may be a keyword search statement, that is, a statement that searches for a keyword, or another kind of statement. The search statement may be built from instructions entered by the user; for example, when the user enters a keyword in an input box, a keyword search statement may be formed from that keyword.

In this embodiment, searching may be performed in a table file according to the structured information and the search statement, and the search result obtained by the searching may be displayed, so that the user may view the search result, where a manner of displaying the search result is not limited, e.g., different search statements may correspond to different display manners.

In one embodiment, displaying the search result obtained by the search includes:

if the search statement is a keyword search statement or a conditional search statement, displaying, according to the structured information, the valid table containing the search result, and highlighting the search result;

and if the search statement is a statistical calculation statement, displaying the search result in chart form.

A conditional search statement is a statement that searches based on conditions, which may be determined by the user, for example through user input. A statistical calculation statement is a statement that performs a statistical calculation, such as an operation (for example, a summation) on one or more cells in a table.

If the search statement is a keyword search statement or a conditional search statement, the valid table containing the search result may be displayed according to the structured information, with the search result highlighted. For example, when the search statement is a keyword search statement that searches the table file for keyword A, all valid tables containing keyword A may be displayed according to the structured information, and keyword A may be highlighted in those tables.

If the search statement is a statistical calculation statement, the search result obtained by searching can be displayed in the form of a chart, for example, simple statistical calculation can be performed at the same time of searching, and the chart is generated for display.
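The three kinds of search statement can be illustrated with a minimal sketch over already-extracted structured information. The dictionary layout (`title`, `header`, `rows`) and the function names are hypothetical; the embodiment does not prescribe a data format, and display or highlighting is left out.

```python
def keyword_search(tables, keyword):
    """Return (table title, row index, column index) for each cell containing the keyword."""
    hits = []
    for table in tables:
        for r, row in enumerate(table["rows"]):
            for c, value in enumerate(row):
                if keyword in str(value):
                    hits.append((table["title"], r, c))
    return hits


def conditional_search(tables, column, predicate):
    """Return positions of cells in the named column whose value satisfies the condition."""
    hits = []
    for table in tables:
        if column not in table["header"]:
            continue
        c = table["header"].index(column)
        for r, row in enumerate(table["rows"]):
            if predicate(row[c]):
                hits.append((table["title"], r, c))
    return hits


def column_sum(tables, column):
    """Simple statistical calculation: sum a numeric column across all tables that have it."""
    total = 0
    for table in tables:
        if column in table["header"]:
            c = table["header"].index(column)
            total += sum(row[c] for row in table["rows"])
    return total


# Hypothetical structured information for two valid tables.
tables = [
    {"title": "Sales", "header": ["Region", "Amount"],
     "rows": [["North", 120], ["South", 80]]},
    {"title": "Staff", "header": ["Name", "Region"],
     "rows": [["Li", "North"]]},
]

keyword_hits = keyword_search(tables, "North")
condition_hits = conditional_search(tables, "Amount", lambda v: v > 100)
amount_total = column_sum(tables, "Amount")
```

The returned positions are what a front end could use to highlight matching cells, and the computed total is the kind of value that could feed a chart.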

The first embodiment of the invention provides a search method, which comprises: determining a table file to be searched, wherein the table file comprises at least one worksheet and the worksheet comprises at least one table; obtaining structured information of the table file, wherein the structured information comprises table information of the tables contained in the table file; and searching in the table file according to the structured information and a search statement, and displaying the search result obtained. With this method, the structured information of the table file is obtained and the table file is searched according to the structured information and the search statement, so that table recognition speed is increased, conditional search across a large number of spreadsheets is enabled, and the search scope is widened.

Example 2

Fig. 2 is a flowchart of a search method according to a second embodiment of the present invention, where the second embodiment is optimized based on the above embodiments. In this embodiment, the obtaining the structured information of the table file is further specified as: determining cell categories corresponding to cells in the table file, wherein the cell categories comprise at least one of a column head cell, a data cell and other cells, and the other cells are cells except the column head cell and the data cell in the table file; determining a header area and a data area of an effective table contained in the table file based on cell categories corresponding to the cells; and determining the structural information of the table file according to the header area and the data area of the effective table.

For details not yet described in detail in this embodiment, refer to embodiment one.

As shown in fig. 2, the method includes:

S210, determining a table file to be searched, wherein the table file comprises at least one worksheet, and the worksheet comprises at least one table.

S220, determining cell categories corresponding to cells in the table file, wherein the cell categories comprise at least one of a column header cell, a data cell and other cells, and the other cells are cells except the column header cell and the data cell in the table file.

The cell category characterizes the type of a cell and may include at least one of a column header cell, a data cell, and an other cell. A column header cell is a cell belonging to the table header, which is located at the top of the table; a data cell is a cell belonging to the data portion of the table; and other cells are cells in the table file that are neither column header cells nor data cells, such as cells outside any table or cells of unknown meaning.

In this step, after the table file is determined, the cell type of each cell in the table file may be determined, so that the operation of the subsequent step is performed based on the cell type. The classifier model may refer to a model that classifies cells.

In one embodiment, determining the cell category corresponding to each cell in the table file includes:

determining a feature vector corresponding to each cell in the table file, wherein the feature vector represents basic attributes of the cell, and the basic attributes comprise at least one of a text attribute, a style attribute, a font attribute and a space attribute;

and inputting the feature vector corresponding to each cell into the classifier model to obtain the cell category corresponding to each cell.

The feature vector is the vector corresponding to a cell. It corresponds to the basic attributes of the cell; that is, each basic attribute may correspond to a feature vector. The basic attributes describe the cell and may include at least one of a text attribute, a style attribute, a font attribute, and a space attribute. A text attribute relates to the text of the cell, such as the length of the text it contains. A style attribute relates to the style of the cell, such as whether the cell is indented. A font attribute relates to the font of the cell, such as whether the cell uses the default font color. A space attribute relates to the spatial position of the cell, such as whether the cell is part of a merged cell.

Specifically, when determining the cell type corresponding to each cell in the table file, the feature vector corresponding to each cell in the table file may be determined first, and then the feature vector corresponding to each cell is input into the classifier model to obtain the cell type corresponding to each cell, where the means for determining the feature vector corresponding to each cell is not limited, for example, the basic attribute of each cell may be determined first, and then the corresponding feature vector may be determined based on the basic attribute.

In one embodiment, determining the feature vector corresponding to each cell in the table file includes:

parsing the table file to obtain the basic attributes of each cell in the table file;

converting the basic attributes into a feature vector.

In this embodiment, the basic attributes of each cell may be obtained by parsing the table file, for example with Apache POI. Apache POI is a free, open-source library written in Java that gives Java programs an API for reading and writing files in Microsoft Office formats such as Excel. Each basic attribute is then converted into a corresponding feature vector, yielding the feature vector of each cell. The conversion may differ by attribute type: a Boolean attribute may be mapped to 0 or 1 on the corresponding dimension of the vector, while a string attribute may be mapped to the corresponding dimensions by one-hot encoding. One-hot encoding is a common way of extracting text features: N states are encoded with an N-bit state register, each state has its own register bit, and only one bit is active at any time. Because the basic attributes of the cells are hand-crafted features, the approach is more interpretable, easier to refine in later algorithm work, and easier to implement, since features do not have to be learned from a large amount of data.
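The attribute-to-vector conversion can be sketched as follows. The attribute names (`text`, `indented`, `in_merged`, `font_color`), the font color vocabulary, and the use of the raw text length as a numeric feature are assumptions made for illustration; the embodiment only states that Boolean attributes map to 0 or 1 and string attributes are one-hot encoded.

```python
# Hypothetical vocabulary for a string-valued font attribute.
FONT_COLORS = ["default", "red", "blue"]


def one_hot(value, vocabulary):
    """Encode value as an N-bit vector with exactly one active bit (all zeros if unknown)."""
    return [1 if value == item else 0 for item in vocabulary]


def cell_to_feature_vector(cell):
    """Convert a cell's basic attributes into a flat numeric feature vector."""
    vector = [
        len(cell["text"]),       # text attribute: length of the cell text (assumed numeric feature)
        int(cell["indented"]),   # style attribute: Boolean mapped to 0 or 1
        int(cell["in_merged"]),  # space attribute: Boolean mapped to 0 or 1
    ]
    vector += one_hot(cell["font_color"], FONT_COLORS)  # font attribute: one-hot encoded
    return vector


cell = {"text": "Region", "indented": False, "in_merged": True, "font_color": "red"}
features = cell_to_feature_vector(cell)
```

Each dimension of the resulting vector corresponds to a named attribute, which is what makes the classifier's inputs easy to inspect.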

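In one embodiment, before the feature vector corresponding to each cell is input into the classifier model to obtain the cell category corresponding to each cell, the method further includes:

training a model to be trained based on training samples to obtain the classifier model, wherein the model to be trained is a LightGBM model.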

The training samples can be samples for training the model to be trained, and the specific content and number of the training samples are not limited, for example, the training samples can be preset by related personnel.

The model to be trained is used to obtain the classifier model; in this embodiment it may be a LightGBM model. The LightGBM model is chosen because it trains much faster than other models and converges more easily, while its cell classification accuracy can meet the preset requirement.

The training sample can be used for training the LightGBM model, so that the classifier model meeting the requirement can be obtained, and the specific training process is not described herein, so long as the classifier model can be obtained.

S230, determining a header area and a data area of an effective table contained in the table file based on cell categories corresponding to the cells.

A valid table is a meaningful table contained in the table file. The criterion for validity may be set by the relevant personnel; for example, a table area consisting of only one row, one column, or one cell is not considered a valid table, that is, it is an invalid table. The header area is the area storing the header of the valid table, and the data area is the area storing the data of the valid table.

In this embodiment, after determining the cell type corresponding to each cell in the table file based on the above steps, the header area and the data area of the valid table included in the table file may be determined based on the determined cell type, and the determining process may be, for example, determining the valid table first and then determining the header area and the data area of the valid table.

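In one embodiment, determining the header area and the data area of the valid table contained in the table file based on the cell category corresponding to each cell includes:

determining the valid table contained in the table file based on a region growing algorithm;

and determining the column header area and the data area of the valid table according to the cell categories of the cells contained in the valid table.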

Specifically, the valid table contained in the table file may be determined based on the region growing algorithm, and then, after the valid table is determined, the column header region and the data region of the valid table are determined according to the cell type of the cells contained in the valid table.

In one embodiment, determining the valid table contained in the table file based on the region growing algorithm includes:

determining the growth areas in the table file and the first and second coordinates of each growth area based on the region growing algorithm, wherein the first coordinate comprises a first abscissa and a first ordinate corresponding to a first edge cell in the growth area, and the second coordinate comprises a second abscissa and a second ordinate corresponding to a second edge cell in the growth area;

determining, according to the first coordinate and the second coordinate of each growth area, whether the cells contained in the growth area meet a preset condition; and if so, determining the table formed by the growth area as a valid table.

The region growing algorithm aggregates pixels or sub-regions into a larger region according to a predefined criterion to obtain a growth area. In this embodiment, growth starts from an effective cell and spreads to other cells until the edge cells are found, which completes the growth and yields a growth area. The first coordinate is the coordinate of the first edge cell in the growth area and may include a first abscissa and a first ordinate, which are the abscissa and the ordinate of the first edge cell. The second coordinate is the coordinate of the second edge cell in the growth area and may include a second abscissa and a second ordinate, which are the abscissa and the ordinate of the second edge cell. In this embodiment, the first edge cell may be the cell at the upper-left corner of the growth area, and the second edge cell may be the cell at the upper-right corner of the growth area.

The preset condition may be used to determine whether the table formed by the growth area is determinable as a valid table, and the specific content of the preset condition may be set by a relevant person, for example, the number of rows and columns of the cells included in the growth area is not less than 2, and the like.

It is considered that the growth area in the table file and the first and second coordinates of each growth area may be determined first by the area growth algorithm, then for each growth area, whether the cells included in the growth area satisfy the preset condition is determined based on the first and second coordinates of the growth area, if the preset condition is satisfied, the table constituted by the growth area may be determined as a valid table, otherwise the table constituted by the growth area may be determined as an invalid table. The process of determining the growth area and the first and second coordinates of the growth area in the table file is not further developed here.
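Since the embodiment does not expand on how the growth areas are found, the following is only one possible sketch: a breadth-first, 4-neighbor flood fill over non-empty cells. Several points are assumptions of the sketch rather than statements of the embodiment: an "effective cell" is taken to be any non-empty cell, coordinates are written as (row, column), and the two edge cells are taken as the upper-left and lower-right corners of the bounding box so that both the row count and the column count can be computed from them.

```python
from collections import deque

EMPTY = (None, "")


def grow_regions(grid):
    """Return [(first, second), ...] bounding corners of each connected non-empty area."""
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    seen = set()
    regions = []
    for r in range(n_rows):
        for c in range(n_cols):
            if grid[r][c] in EMPTY or (r, c) in seen:
                continue
            # Start growing from this effective cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, left = min(top, cr), min(left, cc)
                bottom, right = max(bottom, cr), max(right, cc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < n_rows and 0 <= nc < n_cols
                            and (nr, nc) not in seen
                            and grid[nr][nc] not in EMPTY):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            regions.append(((top, left), (bottom, right)))
    return regions


def is_valid_table(first, second):
    """Example preset condition: at least 2 rows and at least 2 columns."""
    rows = second[0] - first[0] + 1
    cols = second[1] - first[1] + 1
    return rows >= 2 and cols >= 2


grid = [
    ["Name", "Age", None, "note"],
    ["Li", 30, None, None],
    ["Wang", 25, None, None],
]
regions = grow_regions(grid)
valid_tables = [region for region in regions if is_valid_table(*region)]
```

In this example the isolated "note" cell forms its own one-cell growth area and is rejected by the preset condition, while the 3-by-2 block is kept as a valid table.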

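In one embodiment, determining the column header area and the data area of the valid table according to the cell categories of the cells contained in the valid table includes:

determining a label for each row in the valid table, and determining a header row number array, a data row number array, and an other row number array of the valid table according to the row number and label of each row, wherein the other row number array comprises the row numbers in the valid table that are in neither the header row number array nor the data row number array;

determining the title of the valid table based on the header row number array and the data row number array of the valid table;

and verifying the header row number array, the data row number array, and the other row number array of the valid table according to a heuristic method to obtain the column header area and the data area of the valid table.

The label characterizes the category of a row in the valid table; for example, a label may be header row, data row, or other row. A header row is a row of the valid table in which the header is located, a data row is a row of the valid table in which data is located, and other rows are the rows of the valid table that are neither header rows nor data rows. The header row number array records the header rows and their row numbers, the data row number array records the data rows and their row numbers, and the other row number array may contain the row numbers in the valid table that are in neither the header row number array nor the data row number array.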

In one embodiment, the label of each row in the valid table may be determined first, and the header row number array, the data row number array, and the other row number array of the valid table are then determined from the row number and label of each row. The label of a row may be determined from the cell categories of the cells in that row, for example by taking the mode of those categories. For example, if the first row contains cell 1, cell 2, and cell 3, whose cell categories are column header cell, column header cell, and data cell respectively, the label of the first row is header row.
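Labeling rows by the mode of their cell categories and collecting the three row number arrays can be sketched as below. The category strings (`"header"`, `"data"`, `"other"`) and the function name are hypothetical, and the embodiment does not say how ties are broken; in this sketch a tie goes to whichever category appears first in the row.

```python
from collections import Counter


def label_rows(cell_categories):
    """Map {row_number: [cell category, ...]} to (header_rows, data_rows, other_rows)."""
    header_rows, data_rows, other_rows = [], [], []
    for row_number in sorted(cell_categories):
        # The row label is the most frequent cell category in the row (the mode).
        label = Counter(cell_categories[row_number]).most_common(1)[0][0]
        if label == "header":
            header_rows.append(row_number)
        elif label == "data":
            data_rows.append(row_number)
        else:
            other_rows.append(row_number)
    return header_rows, data_rows, other_rows


categories = {
    1: ["header", "header", "data"],  # the example from the text: labeled as a header row
    2: ["data", "data", "data"],
    3: ["other", "other", "data"],
}
header_rows, data_rows, other_rows = label_rows(categories)
```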

The title of the valid table may then be determined based on the header row number array and the data row number array of the valid table. The process of determining the title may be, for example: judging whether the first row is in the header row number array, and then, according to the judgment result, further judging the number of cells contained in the first row, so that the title of the valid table can be obtained.

Finally, the header row number array, the data row number array and the other row number array of the valid table are verified according to a heuristic method to obtain the column header area and the data area of the valid table. On this basis, the column header area and the data area of the valid table are obtained accurately by the heuristic method, which further improves the accuracy of the structured information.

In one embodiment, the verifying the header row number array, the data row number array, and the other row number arrays of the valid table according to the heuristic method, to obtain a column header area and a data area of the valid table includes:

Verifying the validity of the initial area according to a heuristic method;

In this embodiment, the initial area may be regarded as an area where the column header is located in the preliminarily determined valid table.

In one embodiment, in the process of verifying the header row number array, the data row number array and the other row number array of the valid table according to the heuristic method to obtain the column header area and the data area of the valid table, the initial area of the column header in the valid table and the data area of the valid table may first be determined based on the header row number array, the data row number array and the other row number array. The specific means for determining the initial area and the data area may depend on the specific contents of the three arrays, and different contents may correspond to different means. Validity verification is then performed on the initial area according to the heuristic method: if the verification succeeds, the initial area is determined as the column header area of the valid table; otherwise, the initial area may be considered invalid and the valid table may be treated as invalid. The specific procedure of validity verification is not limited and may, for example, be set empirically. On this basis, the column header area and the data area of the valid table are further refined by performing validity verification on the initial area.

In one embodiment, the verifying the validity of the initial area according to the heuristic method includes:

if yes, determining that the initial area verification is successful.

The third edge cell may be considered to be the cell located at the upper left corner edge in the merged cell. The preset type can be determined by the related personnel according to an experience value, for example, the preset type can be text type and the like.

In one embodiment, the process of verifying the validity of the initial area may be, for example: first, splitting the merged cells in the initial area, and copying the basic attributes of the third edge cell of each merged cell to the other cells of that merged cell; then judging whether the types of the contents stored in the cells of the initial area are all the preset type. When the type of the content stored in every cell is the preset type, the initial area is considered valid, that is, the verification of the initial area is determined to be successful; when the type of the content stored in at least one cell is not the preset type, the initial area may be considered invalid, that is, the verification of the initial area is determined to have failed.
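As a rough illustration of this verification, the following Python sketch may be considered. It assumes a hypothetical grid representation (a dictionary mapping (row, column) to a cell record with a "type" key) and a hypothetical list of merged ranges; neither representation is specified by the embodiment, and the preset type is assumed here to be TEXT.

```python
from typing import Dict, List, Set, Tuple

Coord = Tuple[int, int]                   # (row, column)
Cell = Dict[str, str]                     # hypothetical record, e.g. {"text": "Name", "type": "TEXT"}
MergedRange = Tuple[int, int, int, int]   # (first_row, first_col, last_row, last_col), inclusive


def split_merged_cells(grid: Dict[Coord, Cell], merged: List[MergedRange]) -> Dict[Coord, Cell]:
    """Copy the attributes of the upper-left cell to every cell of each merged range."""
    result = {coord: dict(cell) for coord, cell in grid.items()}
    for r0, c0, r1, c1 in merged:
        anchor = grid[(r0, c0)]  # the "third edge cell" (upper-left corner)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                result[(r, c)] = dict(anchor)
    return result


def verify_initial_area(grid: Dict[Coord, Cell], merged: List[MergedRange],
                        rows: Set[int], preset_type: str = "TEXT") -> bool:
    """Return True when every cell in the candidate header rows has the preset type."""
    filled = split_merged_cells(grid, merged)
    cells = [cell for (r, _), cell in filled.items() if r in rows]
    return bool(cells) and all(cell["type"] == preset_type for cell in cells)
```

A caller would pass the row numbers of the initial area as `rows`; a return value of False corresponds to the case in which the initial area verification fails.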

S240, determining the structured information of the table file according to the header area and the data area of the valid table.

After the header area and the data area of the valid table are obtained based on the steps, the structured information of the table file may be determined based on the information of each cell, for example, the structured information of the table file is determined based on the basic attributes of the cells included in the header area and the data area, and the determining process is not further limited.

S250, searching in the table file according to the structured information and the search statement, and displaying a search result obtained by searching.

In the searching method provided by the second embodiment of the invention, a table file to be searched is determined, wherein the table file comprises at least one worksheet, and the worksheet comprises at least one table; cell categories corresponding to the cells in the table file are determined, wherein the cell categories comprise at least one of column header cell, data cell and other cell, and the other cells are cells in the table file other than the column header cells and the data cells; a header area and a data area of a valid table contained in the table file are determined based on the cell categories corresponding to the cells; the structured information of the table file is determined according to the header area and the data area of the valid table; and searching is performed in the table file according to the structured information and the search statement, and a search result obtained by searching is displayed. With this method, the header area and the data area of the valid table contained in the table file are determined based on the cell categories corresponding to the cells, and the structured information of the table file is then determined according to the header area and the data area of the valid table, so that an information basis is provided for subsequent searching in the table file and the accuracy of the search result is further improved.

Fig. 3 is a flowchart of another searching method according to the second embodiment of the present invention. As shown in fig. 3, step S1: the user may upload a table file (i.e., determine the table file to be searched);

step S2: Apache POI may be used to parse the table and obtain the basic attributes of the cells (i.e., parse the table file to obtain the basic attributes of each cell in the table file), where the basic attributes may be hand-crafted features such as text features of cells, style features of cells, font features of cell text, and spatial features of cells (i.e., the basic attributes include at least one of text attributes, style attributes, font attributes and spatial attributes). The extracted features may be as follows:

the cell text attributes include: the content of the cell text; the length of the text; the number of line breaks in the text; the number of leading spaces in the text; whether the text contains words such as SUM, TOTAL, MAX, MIN, AVG, MEAN, STATISTIC or their Chinese equivalents; whether the text contains words such as TABLE, SHEET, CHART, SPREADSHEET, TABULAR or their Chinese equivalents; whether the text is of the BLANK type, the NONE type, the NA type, the boolean type, the numerical type, a formula, the string type, the date type, the time type, the hyperlink type or the ERROR type; whether the text has notes; whether the text starts with a digit, with a special character or with an uppercase letter; the ratios of English characters, Chinese characters, digits, spaces, punctuation marks and other characters; whether the text consists entirely of English characters, of Chinese characters, of digits, of English characters and digits, of Chinese characters and digits, of English and Chinese characters, or of uppercase characters; and whether the text contains special characters, punctuation marks, colons, decimal points, thousands separators, currency symbols, year ranges or ordinal numbers. If the cell is a formula, the attributes also include the value type of the formula: whether it is a numeric type, a character string, a boolean type, a date or NA.

Since the cell text is always obtained as a character string, the UNICODE code points of the characters are used to distinguish Chinese characters, English characters, digits, punctuation marks and special symbols. The UNICODE code points concerned lie in the following ranges: 0x0040, 0x002D-0x002F, 0x0023-0x0026, 0x0028-0x002B, 0x003C-0x003E, 0x005B-0x0060, 0x007B-0x007E, 0x2010-0x2017, 0x2020-0x2027, 0x2B00-0x2BFF, 0x3000, 0x3004-0x301C, 0x301D-0x301F, 0xFF03-0xFF06, 0xFF08-0xFF0B, 0xFF0D, 0xFF0F, 0xFF1C-0xFF1E, 0xFF20, 0xFF65, 0xFF3B-0xFF40, 0xFF5B-0xFF60, 0xFF62, 0xFF63. Punctuation characters include both Chinese punctuation and English punctuation.
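By way of a non-limiting sketch, character ratios of the kind listed among the text attributes could be computed from code points as follows. The sketch uses only a subset of the ranges given above and treats them as one "symbol" class; the use of the CJK Unified Ideographs block (0x4E00-0x9FFF) for Chinese characters and the class names themselves are assumptions made for illustration and are not stated in the embodiment.

```python
from typing import Dict

# A subset of the code-point ranges given in the text (inclusive bounds).
SYMBOL_RANGES = [
    (0x0040, 0x0040), (0x002D, 0x002F), (0x0023, 0x0026), (0x0028, 0x002B),
    (0x003C, 0x003E), (0x005B, 0x0060), (0x2010, 0x2017), (0x3000, 0x3000),
    (0xFF03, 0xFF06),
]
# Assumption: the CJK Unified Ideographs block stands in for "Chinese characters".
CJK_RANGE = (0x4E00, 0x9FFF)
CLASSES = ["english", "chinese", "digit", "space", "symbol", "other"]


def char_class(ch: str) -> str:
    """Assign one character to a coarse class based on its code point."""
    cp = ord(ch)
    if ch.isascii() and ch.isalpha():
        return "english"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch == " ":
        return "space"
    if CJK_RANGE[0] <= cp <= CJK_RANGE[1]:
        return "chinese"
    if any(lo <= cp <= hi for lo, hi in SYMBOL_RANGES):
        return "symbol"
    return "other"


def char_ratios(text: str) -> Dict[str, float]:
    """Return the share of each character class in the text (all zeros for empty text)."""
    counts = dict.fromkeys(CLASSES, 0)
    for ch in text:
        counts[char_class(ch)] += 1
    total = len(text)
    return {name: (count / total if total else 0.0) for name, count in counts.items()}
```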

The cell style attributes include whether the cell is indented, whether it is rotated, whether it is locked, whether it is horizontally aligned, whether it is vertically aligned, whether it uses the default fill style, whether text wrapping is enabled, whether it has top, bottom, left and right borders, the styles of the top, bottom, left and right borders, the number of sides for which a border color is defined, the RGB values of the top, bottom, left and right border colors, and the RGB values of the background color.

The font properties of the cell text include font size, font size level, font, default font color, bold, italic, strikethrough, underline type, superscript, subscript, no offset type, number of styles, font color RGB values.

The spatial attributes of a cell include the row number, the column number, whether the cell is in a merged cell, the area of the merged cell in which the cell is located, the starting row number, the starting column number, the ending row number, the ending column number, the number of empty rows between the cell and the nearest cell with content, the number of adjacent cells, whether the styles of the cell and its adjacent cells are the same, whether the text types of the cell and its adjacent cells are the same, and whether the types of the four adjacent cells are the numeric type, the character string type, the boolean type, the date type, the formula type or the NA type.

Step S3: the basic attributes of the cells may be converted into 227-dimensional feature vectors (i.e., the basic attributes are converted into feature vectors). As a basic feature-processing step, the cell features extracted in the previous step are converted into vectors. A basic attribute of the boolean type is mapped to the corresponding dimension of the vector as 0 or 1; a basic attribute of the numerical type is converted into a double-precision floating point value; a basic attribute of the character string type is mapped onto the corresponding dimensions of the vector using one-hot encoding.
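The three conversion rules can be sketched in Python as follows. This is only a minimal illustration: the attribute names and the vocabulary of string values are hypothetical, and the actual 227 dimensions of the embodiment are not reproduced.

```python
from typing import Dict, List, Sequence, Union

Value = Union[bool, int, float, str]


def encode_cell(attrs: Dict[str, Value], vocab: Dict[str, Sequence[str]]) -> List[float]:
    """Convert basic attributes into a flat feature vector.

    Booleans become 0/1, numbers become floats, and strings are one-hot
    encoded against the allowed values listed in `vocab` (hypothetical).
    """
    vector: List[float] = []
    for name in sorted(attrs):            # fixed attribute order keeps dimensions aligned
        value = attrs[name]
        if isinstance(value, bool):       # bool must be tested before int
            vector.append(1.0 if value else 0.0)
        elif isinstance(value, (int, float)):
            vector.append(float(value))
        else:
            vector.extend(1.0 if value == option else 0.0 for option in vocab[name])
    return vector
```

For example, a cell with `{"is_bold": True, "text_length": 5, "h_align": "CENTER"}` and a vocabulary `{"h_align": ["LEFT", "CENTER", "RIGHT"]}` yields three one-hot dimensions followed by one boolean and one numeric dimension.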

Step S4: the cell classifier is trained (i.e., the model to be trained is trained with training samples to obtain the classifier model, the model to be trained being a LightGBM model). To reduce the complexity of the subsequent processing flow, the classifier may divide cells into three categories: Header cells (attributes of a table), Data cells (the data portion of a table), and Other cells (cells not in a table, or cells with unknown meaning) (i.e., the cell categories include at least one of column header cell, data cell and other cell, and the other cells are cells in the table file other than the column header cells and the data cells).

Illustratively, the cell classifier is trained using an ensemble learning approach in machine learning. At least 2000 tables are manually labeled in advance to obtain training samples of 110,000 cells, where each training sample comprises the vector of a cell and the label of the cell. 80% of the samples are randomly selected as the training dataset, and the remaining 20% form the validation dataset.
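The random 80%/20% division can be sketched as below. The function name, the fixed seed and the representation of a sample as a (vector, label) pair are assumptions for illustration; fitting the LightGBM model on the resulting training set is not shown.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T], train_fraction: float = 0.8,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the labeled samples and cut them into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded only so the sketch is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```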

Compared with a deep learning model based on CNN, training the cell classifier with LightGBM is faster and converges more easily, and the accuracy can meet the use requirement. A deep learning model based on a graph neural network, by contrast, is trained on whole table structures, so the amount of training data is greatly reduced and the model is difficult to train. The accuracy of the classifier on the validation set can reach more than 98%.

Step S5: the cells may be classified using a classification model, i.e., the trained cell classifier is used to classify the cells (i.e., the feature vectors corresponding to each cell are input into the classifier model to obtain the cell class corresponding to each cell).

Step S6: the region-growing-based method identifies the table region, such as confirming the position of the table in an Excel document based on a region-growing algorithm (i.e., determining the valid table contained in the table file based on the region-growing algorithm).

A table within a Sheet is generally assumed to be a contiguous block of data, and based on this assumption, a region growing algorithm from image semantic segmentation is used to identify the table regions in the Sheet. The coordinates of the valid cells in a Sheet are arranged from left to right and from top to bottom; region growing is then performed on the cells using 8-neighborhood reachability and 4-neighborhood reachability respectively, and after the region growing is completed, the coordinates of the cell in the upper left corner of the region and the coordinates of the cell in the lower right corner are obtained (i.e., the growth areas in the table file and the first coordinate and the second coordinate of each growth area are determined based on the region growing algorithm, wherein the first coordinate comprises a first abscissa and a first ordinate corresponding to a first edge cell in the growth area, and the second coordinate comprises a second abscissa and a second ordinate corresponding to a second edge cell in the growth area). Whether the region is valid is then judged according to the following criterion: if the region has only one row, one column or one cell, it is considered an invalid table; otherwise it is considered a valid table (i.e., for each growth area, whether the cells contained in the growth area satisfy the preset condition is determined according to the first coordinate and the second coordinate of the growth area, and if so, the table formed by the growth area is determined to be a valid table).
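A minimal Python sketch of such a region growing step is given below, under the assumption that the valid (non-empty) cells are supplied as (row, column) pairs. How the 8-neighborhood and 4-neighborhood passes are combined is not detailed in the text, so the sketch simply takes the neighborhood as a parameter; all names are illustrative.

```python
from collections import deque
from typing import Iterable, List, Sequence, Set, Tuple

Coord = Tuple[int, int]  # (row, column)

N4: List[Coord] = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8: List[Coord] = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def grow_regions(cells: Iterable[Coord],
                 neighbours: Sequence[Coord] = N8) -> List[Tuple[Coord, Coord]]:
    """Return the (upper-left, lower-right) coordinates of each grown region."""
    remaining: Set[Coord] = set(cells)
    boxes: List[Tuple[Coord, Coord]] = []
    for seed in sorted(remaining):          # top to bottom, left to right
        if seed not in remaining:
            continue                        # already absorbed by an earlier region
        remaining.discard(seed)
        queue = deque([seed])
        region = [seed]
        while queue:
            r, c = queue.popleft()
            for dr, dc in neighbours:
                nxt = (r + dr, c + dc)
                if nxt in remaining:
                    remaining.discard(nxt)
                    queue.append(nxt)
                    region.append(nxt)
        rows = [r for r, _ in region]
        cols = [c for _, c in region]
        boxes.append(((min(rows), min(cols)), (max(rows), max(cols))))
    return boxes


def is_valid_table(first: Coord, second: Coord) -> bool:
    """A region with only one row, one column or one cell is not a valid table."""
    return second[0] > first[0] and second[1] > first[1]
```

Each returned pair corresponds to the first and second coordinates of a growth area, which can then be passed to `is_valid_table`.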

Step S7: the cell categories are corrected based on heuristics (i.e., the column header area and the data area of the valid table are determined according to the cell categories of the cells contained in the valid table). Because users organize data in many different ways, structuring a table using the cell classifier alone yields large errors; the embodiment of the invention therefore corrects the classification results of the cells with a heuristic method based on the table area information and the cell categories obtained by the classifier. The implementation is as follows:

First, the labels of the valid cells in each row are counted, and the mode of the cell labels in each row of the table is taken as the label of that row. The row numbers of the table are then grouped by label name to obtain a Header row number array, a Data row number array and an Other row number array, and each array is sorted in ascending order of row number (i.e., the label of each row in the valid table is determined, and the header row number array, the data row number array and the other row number array of the valid table are determined according to the row number and the label of each row), wherein the other row number array comprises the row numbers in the valid table that are located in neither the header row number array nor the data row number array.
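The grouping of rows by the mode of their cell labels may be sketched as follows. The input format (a mapping from row number to the list of cell labels in that row) is an assumption, and how ties between equally frequent labels are resolved is not stated in the text; the sketch keeps whichever label is encountered first.

```python
from collections import Counter
from typing import Dict, List


def label_rows(cell_labels_by_row: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Group row numbers into Header, Data and Other arrays by the mode of each row's labels."""
    arrays: Dict[str, List[int]] = {"Header": [], "Data": [], "Other": []}
    for row_number in sorted(cell_labels_by_row):       # ascending row numbers
        labels = cell_labels_by_row[row_number]         # assumed non-empty
        row_label = Counter(labels).most_common(1)[0][0]
        arrays[row_label].append(row_number)
    return arrays
```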

Then, the title of the table is found in the area (i.e., the title of the valid table is determined based on the header row number array and the data row number array of the valid table). Fig. 4 is a flowchart for determining the title of the table according to the second embodiment of the present invention. As shown in fig. 4, it may first be determined whether the first row of the table is in the Header set (i.e., the header row number array).

If so, it is further judged whether the first row has only one cell. When the first row has one cell, the row number of the row is removed from the Header array, the first row number in the Header array after the removal is taken as the first row of the Header, and the content of the cell is stored as the table title. When the first row has more than one cell, it is judged whether the number of cells in the first row is smaller than the number of cells in the row having the most cells among all rows of the table; if so, the row number of the row is removed from the Header array, the first row number in the Header array after the removal is taken as the first row of the Header, and the content of the cells is stored as the table title; if not, the table title is defined as "anonymous table";

if not, it is likewise judged whether the first row has only one cell. When the first row has one cell, the row number of the row is removed from the Data array, the first row number in the Data array after the removal is taken as the first row of the Data, and the content of the cell is stored as the table title. When the first row has more than one cell, it is judged whether the number of cells in the first row is smaller than the number of cells in the row having the most cells among all rows of the table; if so, the row number of the row is removed from the Data array, the first row number in the Data array after the removal is taken as the first row of the Data, and the content of the cells is stored as the table title; if not, the table title is defined as "anonymous table".
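The two branches above follow the same pattern and may be sketched in Python as below. The function signature is hypothetical, and joining the texts of several cells with a space to form the title is an assumption, since the text only states that the content of the cells is stored as the table title.

```python
from typing import List, Tuple


def find_title(first_row: int, first_row_texts: List[str], max_cells_per_row: int,
               header_rows: List[int], data_rows: List[int]) -> Tuple[str, List[int], List[int]]:
    """Return (title, header_rows, data_rows) after applying the title rule to the first row."""
    header_rows, data_rows = list(header_rows), list(data_rows)
    # The branch taken depends on whether the first row is in the Header array.
    array = header_rows if first_row in header_rows else data_rows
    cell_count = len(first_row_texts)
    if cell_count == 1 or cell_count < max_cells_per_row:
        if first_row in array:
            array.remove(first_row)          # the title row leaves its array
        title = " ".join(first_row_texts)    # assumption: cell texts joined by a space
    else:
        title = "anonymous table"
    return title, header_rows, data_rows
```

After the call, the first element of the returned Header array (or Data array) is the first row of the Header (or Data) in the sense described above.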

Then, a heuristic method is used to find a Header row and a Data row in the area (i.e. the Header row number array, the Data row number array and other row number arrays of the valid table are verified according to the heuristic method, so as to obtain the column Header area and the Data area of the valid table), and the rule is as follows:

When the Header row number array and the Data row number array are both empty: if the Other row number array is not empty, the label of the row indicated by the first row number in the Other row number array is converted to Header, the row number is added to the Header row number array, and the row is then removed from the Other row number array; if the Other row number array is empty, the table is deemed invalid.

When the Header row number array is not empty and the Data row number array is empty, it is checked whether the rows in the Header row number array contain merged cells spanning rows. If there are merged cells spanning rows, the spanned rows are taken as the Header portion, and the other rows in the Header row number array are converted to Data, removed from the Header row number array and added to the Data row number array.

When the Data row number array is not empty and the Header row number array is empty, the row with the minimum row number in the Data row number array is taken as the Header, removed from the Data row number array and added to the Header row number array.

When neither the Header row number array nor the Data row number array is empty: if the first row number in the Header row number array is smaller than the first row number in the Data row number array, the labels of the rows between the two rows are converted to Header; if the first row number in the Header row number array is greater than the first row number in the Data row number array, a row resembling a Header is sought in the Other row number array.

The criteria for judging whether a row resembles a Header are: if the row has one and only one merged cell, the row is considered not necessarily a Header; if the row has merged cells but also other non-merged cells, the row is likely a Header; if the row has no merged cells, the row is likely a Header; otherwise, no Header is found and the table is invalid. If the Other row number array is empty, the first row of the Data row number array is taken as the Header; if a merged cell spanning rows exists in that row, all the spanned rows are taken as the Header. The new Header rows are removed from the Data row number array and added to the Header row number array, and the labels of all rows below them are converted to Data (i.e., the initial area of the column header in the valid table and the data area of the valid table are determined based on the header row number array, the data row number array and the other row number array).
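Only the simpler of the above rules are sketched here, as an illustration rather than a complete implementation: the case in which both arrays are empty, the case in which only the Data array is non-empty, and the case in which the first Header row precedes the first Data row. The rules that depend on merged cells and on the similarity judgment are left out, and the function name and the tuple representation are assumptions.

```python
from typing import List, Optional, Tuple

Arrays = Tuple[List[int], List[int], List[int]]  # (header, data, other), each sorted ascending


def apply_simple_rules(header: List[int], data: List[int],
                       other: List[int]) -> Optional[Arrays]:
    """Apply three of the row-level rules; return None when the table is deemed invalid."""
    header, data, other = list(header), list(data), list(other)
    if not header and not data:
        if not other:
            return None                      # nothing to promote: invalid table
        header.append(other.pop(0))          # first Other row becomes the Header
    elif data and not header:
        header.append(data.pop(0))           # smallest Data row number becomes the Header
    elif header and data and header[0] < data[0]:
        # Rows lying between the first Header row and the first Data row become Header.
        between = [r for r in other if header[0] < r < data[0]]
        other = [r for r in other if r not in between]
        header = sorted(header + between)
    return header, data, other               # cases not covered are returned unchanged
```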

Finally, the Header area is verified based on a heuristic method, and it is determined whether the Data area is valid (i.e., the validity of the initial area is verified according to the heuristic method, and if the verification succeeds, the initial area is determined as the column header area of the valid table). Specifically, before the Header area and the Data area are verified, the merged cells are split: the valid cell of a merged cell is the cell in the upper left corner of its area, and the attributes of that cell are copied to all cells in the merged cell area (i.e., the merged cells in the initial area are split, and the basic attributes of the third edge cell of each merged cell are copied to the other cells of that merged cell). The Header area may be verified as follows: it is judged whether the value types of all cells in the identified Header rows are TEXT types and, on the basis of the TEXT type, it is further judged whether the values are date types; if so, the Header area is considered valid, otherwise the area is considered invalid, that is, the table is invalid (i.e., it is judged whether the types of the contents stored in the cells of the initial area are all the preset type, and if so, the verification of the initial area is determined to be successful).

Step S8: the structured information of the table is extracted based on the designed structuring template (i.e., the structured information of the table file is determined according to the header area and the data area of the valid table). The embodiment of the invention designs a unified spreadsheet structuring method, and part of the structured content is as follows:

TableID: the ID of the table;

TableName: the name of the table;

headers: the attribute name of the table is a Map, wherein Key is the coordinate of an attribute cell, and Value is the attribute name;

cells: the information of all cells is an array, and each Cell information contains:

text: representing the content of the cell;

type: indicates whether the cell is a Header or Data;

DataType: the type of the cell content, which is one of TEXT, NUMBER, BLANK, BOOLEAN, ERROR;

NumberType: if the DataType of the cell is NUMBER, the NumberType represents the type of the value, which is one of GENERAL (ordinary number), TIME (time), DATE (date), CURRENCY (currency);

Relationships: conditional searching among a massive number of tables is implemented based on this attribute. The rules are: when the type of a cell is Header, the attribute is an array containing the text content of the cell; when the type of a cell is Data, the Relationships array stores not only the text content of the cell but also the text content of the Header cell corresponding to the Data cell.
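To make the Relationships rule concrete, a small Python sketch follows. It assumes that the Header cell corresponding to a Data cell is the one in the same column, and it spells the field names in lower case for illustration; both points are assumptions, and the remaining fields of the template are omitted.

```python
from typing import Dict, List


def build_cells(header_texts: Dict[int, str],
                data_rows: List[Dict[int, str]]) -> List[dict]:
    """Build cell records whose relationships array follows the two rules for Header and Data."""
    cells: List[dict] = []
    for text in header_texts.values():
        # Header cell: the array holds only the cell's own text.
        cells.append({"text": text, "type": "Header", "relationships": [text]})
    for row in data_rows:
        for column, text in row.items():
            # Data cell: own text plus the text of the Header cell in the same column.
            cells.append({"text": text, "type": "Data",
                          "relationships": [text, header_texts[column]]})
    return cells
```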

Step S9: the structured information of the table is saved. The structured information of the table may be stored in an unstructured database, so that tables can later be searched by condition among a plurality of tables (i.e., searching is performed in the table file according to the structured information and the search statement, and the search result obtained by the search is displayed). For example, the condition may take the form "attribute name=attribute value", where the attribute name comes from the Header part of a table and the attribute value comes from the Data part of the table. Whether the search condition input by the user is satisfied is judged by checking whether the attribute name and the attribute value both exist in the Relationships of some cell information, so that conditional search across a plurality of tables can be realized.
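The membership check described above might look as follows in Python. The sketch keeps the tables in an in-memory dictionary rather than a database, and the function names and the lower-case "relationships" key are assumptions made for illustration.

```python
from typing import Dict, List


def cell_matches(cell: dict, name: str, value: str) -> bool:
    """True when both the attribute name and the attribute value appear in one cell's array."""
    relationships = cell["relationships"]
    return name in relationships and value in relationships


def search_tables(tables: Dict[str, List[dict]], condition: str) -> List[str]:
    """Return the IDs of the tables having a cell that satisfies 'attribute name=attribute value'."""
    name, value = (part.strip() for part in condition.split("=", 1))
    return [table_id for table_id, cells in tables.items()
            if any(cell_matches(cell, name, value) for cell in cells)]
```

Because the name and the value must occur in the same cell's array, a value that appears under a different column does not match the condition.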

As can be seen from the above description, the searching method provided by the embodiment of the invention is more interpretable because it uses hand-crafted cell features; it supports scenarios with multiple tables on one page and therefore covers more scenarios; it requires neither a large deep-learning-based model nor a high-performance machine to determine and search the structured information; and it supports searching for tables by condition among a massive number of tables.

The following describes an exemplary search method provided by the embodiment of the present invention, which mainly comprises two parts, wherein the first part is the determination of the structured information of the electronic form, and the second part is the form search.

1. Spreadsheet structured information is determined.

The first part extracts the structured information of tables; files in spreadsheet formats including, but not limited to, Excel and CSV can be processed. FIG. 5 is a flowchart for determining structured information according to the second embodiment of the invention. As shown in FIG. 5, after the user uploads a spreadsheet file such as an Excel file (i.e., the table file to be searched is determined), the spreadsheet file is first parsed (i.e., the table file is parsed to obtain the basic attributes of the cells in the table file, and the basic attributes are converted into feature vectors), and the cells are classified based on the cell classification model (i.e., the feature vectors corresponding to the cells are input into the classifier model to obtain the cell categories corresponding to the cells). Next, the tables are identified and the cell categories are corrected (i.e., the valid tables contained in the table file are determined based on the region growing algorithm, and the column header area and the data area of each valid table are determined according to the cell categories of the cells contained in the valid table). The table structure is then determined (i.e., the structured information of the table file is determined according to the header area and the data area of the valid table), the structured information is stored in a database, and an index is established, which completes the extraction of the structured information of the table.

Fig. 6 is a schematic diagram of a table file according to the second embodiment of the present invention. As shown in fig. 6, one worksheet (Sheet) includes two tables, and the structured information of the tables can be extracted through the above steps. Fig. 7 is a schematic diagram of structured information according to the second embodiment of the present invention. As shown in fig. 7, the structured information includes the extracted results, such as the table title, the table type and the worksheet name.

2. Form search

FIG. 8 is a flowchart of a search provided according to the second embodiment of the present invention. As shown in FIG. 8, after a user inputs a query, if the query includes a keyword or a condition, the electronic device according to the embodiment of the present invention may combine search terms based on the query and search the tables based on the structured information stored in the database; after the search is completed, the tables may be sorted or a chart may be generated. The embodiment of the invention designs three search scenarios, including keyword search based on table contents, for example table contents, file names, sheet names and the like, in which the results are displayed in table form and the matched parts are highlighted.

The scenarios may also include conditional search based on table contents, that is, tables can be searched by condition. When the user inputs a column name, matching column names are automatically suggested based on the input; the search results are the tables meeting the condition together with the keyword search results, and the condition items are highlighted in the returned preview table (i.e., if the search statement is a keyword search statement or a conditional search statement, a valid table containing the search result is displayed according to the structured information, and the search result is highlighted).

The scenarios may further include converting the search results into a statistical chart: for example, simple statistical calculations such as counting, summing, averaging, and computing the maximum or minimum value can be performed during the search, and a chart is generated (i.e., if the search statement is a statistical calculation statement, the search result obtained by searching is displayed in the form of a chart).
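The simple statistical calculations named above can be illustrated with the following Python sketch, which assumes that the numeric values of the matched column have already been collected; the operation names are hypothetical, and chart generation is not shown.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# Hypothetical operation names mapped to the calculations listed in the text.
AGGREGATES: Dict[str, Callable[[Sequence[float]], float]] = {
    "count": len,
    "sum": sum,
    "avg": mean,
    "max": max,
    "min": min,
}


def aggregate(values: Sequence[float], operation: str) -> float:
    """Apply one of the simple statistical calculations to the matched values."""
    if operation not in AGGREGATES:
        raise ValueError(f"unsupported operation: {operation}")
    return AGGREGATES[operation](values)
```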

Claims

1. A search method, comprising:

2. The method of claim 1, wherein the obtaining the structured information of the table file comprises:

3. The method of claim 2, wherein determining the cell category corresponding to each cell in the table file comprises:

4. The method of claim 3, wherein determining the feature vector for each cell in the table file comprises:

converting the basic attribute into a feature vector.

5. The method of claim 3, further comprising, before the inputting the feature vector corresponding to each cell into the classifier model to obtain the cell class corresponding to each cell:

6. The method according to claim 2, wherein determining the header area and the data area of the valid table included in the table file based on the cell category corresponding to each cell includes:

7. The method of claim 6, wherein the determining the valid tables contained in the table file based on the region growing algorithm comprises:

8. The method of claim 6, wherein determining a column header area and a data area of the active table based on cell categories of cells contained within the active table comprises:

9. The method of claim 8, wherein verifying the header row number array, the data row number array, and the other row number arrays of the active table according to the heuristic method to obtain the column header area and the data area of the active table comprises:

verifying the validity of the initial area according to a heuristic method;

10. The method of claim 9, wherein said verifying the validity of the initial region according to a heuristic method comprises:

if yes, determining that the initial area verification is successful.

11. The method according to any one of claims 1-10, wherein the presenting search results from the search comprises: