CN112989050A

CN112989050A - Table classification method, device, equipment and storage medium

Info

Publication number: CN112989050A
Application number: CN202110349354.7A
Authority: CN
Inventors: 高宏华; 陈立捷; 崔莹琰; 魏翩翩; 苏建清
Original assignee: CCB Finetech Co Ltd
Current assignee: CCB Finetech Co Ltd
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2021-06-18
Anticipated expiration: 2041-03-31
Also published as: CN112989050B

Abstract

The embodiment of the invention discloses a table classification method, a table classification device, table classification equipment and a storage medium, and relates to the technical field of big data. The method comprises the following steps: acquiring text content of a target field of a table to be classified as the text of the table to be classified; performing word segmentation processing on the text of the table to be classified; performing vectorization processing on the text of the table to be classified after the word segmentation processing by using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified; generating a rule vector of the table to be classified according to a preset table classification rule and text contents in the table to be classified; splicing the basic text feature vector and the rule vector to obtain a text feature vector of the table to be classified; and inputting the text feature vector of the table to be classified into a pre-trained random forest classification model to obtain a classification label of the table to be classified. The embodiment of the invention can automatically complete the table classification process and improve the table classification efficiency.

Description

Table classification method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of big data, in particular to a table classification method, a table classification device, table classification equipment and a storage medium.

Background

The table has good structural characteristics and potential semantic characteristics, and is easier to analyze and understand compared with unstructured text data. In order to effectively manage the tables, the tables are generally classified and then managed according to the categories of the tables.

In the related art, tables are mainly classified by manually labeling the specific category to which each table belongs. When the tables number in the millions, manual labeling guarantees classification accuracy but is time-consuming, labor-intensive and inefficient.

Disclosure of Invention

Embodiments of the present invention provide a table classification method, apparatus, device, and storage medium, which can automatically complete the table classification process and improve table classification efficiency.

In a first aspect, an embodiment of the present invention provides a table classification method, including:

acquiring text content of a target field of a table to be classified as the text of the table to be classified;

performing word segmentation processing on the text of the table to be classified;

performing vectorization processing on the text of the table to be classified after the word segmentation processing by using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified;

generating a rule vector of the table to be classified according to a preset table classification rule and text contents in the table to be classified;

splicing the basic text feature vector and the rule vector to obtain a text feature vector of the table to be classified;

and inputting the text feature vector of the table to be classified into a pre-trained random forest classification model to obtain a classification label of the table to be classified.

In a second aspect, an embodiment of the present invention further provides a table classifying device, including:

the text acquisition module is used for acquiring the text content of the target field of the table to be classified as the text of the table to be classified;

the text word segmentation module is used for carrying out word segmentation processing on the text of the table to be classified;

the text vectorization module is used for carrying out vectorization processing on the text of the table to be classified after the word segmentation processing by using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified;

the rule vector generation module is used for generating a rule vector of the table to be classified according to a preset rule vector generation rule;

the vector splicing module is used for splicing the basic text feature vector and the rule vector to obtain a text feature vector of the table to be classified;

and the label acquisition module is used for inputting the text feature vector of the table to be classified into a pre-trained random forest classification model to obtain a classification label of the table to be classified.

In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the table classification method according to the embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the table classification method according to the embodiment of the present invention.

According to the technical scheme of the embodiment of the invention, the text content of the target field of the table to be classified is obtained as the text of the table to be classified; word segmentation processing is performed on that text, and the text after word segmentation is vectorized by using a text feature vectorization algorithm to obtain the basic text feature vector of the table to be classified; a rule vector of the table to be classified is generated according to a preset table classification rule and the text content in the table to be classified; the basic text feature vector and the rule vector are spliced to obtain the text feature vector of the table to be classified; and finally the text feature vector of the table to be classified is input into a pre-trained random forest classification model to obtain the classification label of the table to be classified. The text feature vector used for determining the table type is thus generated from the text content in the table by combining a text feature vectorization algorithm with rules, and is used as the input of the random forest classification model, which outputs the classification label of the table to be classified. The table classification process can therefore be completed automatically from the text content in the table, based on the random forest algorithm and the preset table classification rules. This greatly reduces the dependence on manual work, alleviates the low efficiency of manual labeling, improves table classification efficiency, and ensures a certain classification accuracy.

Drawings

Fig. 1 is a flowchart of a table classification method according to a first embodiment of the present invention.

Fig. 2 is a flowchart of a table classification method according to a second embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a table classifying device according to a third embodiment of the present invention.

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.

It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.

Example one

Fig. 1 is a flowchart of a table classification method according to a first embodiment of the present invention. The embodiment of the invention is applicable to scenarios in which tables are classified. The method can be executed by the table classifying device provided by the embodiment of the invention; the device can be implemented in software and/or hardware and can generally be integrated in a computer device. As shown in fig. 1, the method of the embodiment of the present invention specifically includes:

step 101, obtaining the text content of the target field of the table to be classified as the text of the table to be classified.

The table to be classified is the table that needs to be classified. The table to be classified comprises a plurality of fields. Each field has a corresponding field name, and the field value of each field is its text content. The target field is one or more fields that contain text content for determining the table type. The text of the table to be classified is the text used to determine the type of the table.

Optionally, the obtaining the text content of the target field of the table to be classified as the text of the table to be classified includes: determining a target field of a table to be classified; and extracting the text content of the target field of the table to be classified as the text of the table to be classified.

Optionally, the determining a target field of the table to be classified includes: determining the target field of the table to be classified according to the target field setting information input by the user.

The target field setting information is information for setting the target field of the table to be classified, and may include the field name of the target field. The user specifies the field name of the target field through the target field setting information. According to the target field setting information input by the user, the fields in the table to be classified whose field names are consistent with the field names in the target field setting information are determined as the target fields of the table to be classified.

In one specific example, the target field setting information includes the field names of the target fields: "primary key Chinese name", "physical subsystem", and "table Chinese name". By specifying these field names through the target field setting information, the user sets the fields of the table to be classified named "primary key Chinese name", "physical subsystem", and "table Chinese name" as the target fields. According to the target field setting information input by the user, the fields in the table to be classified whose field names are consistent with these field names are determined as the target fields of the table to be classified. The text content "accounting institution number | @ | institution number" of the field named "primary key Chinese name", the text content "escrow" of the field named "physical subsystem", and the text content "escrow fund account" of the field named "table Chinese name" are then extracted as the text of the table to be classified.

Optionally, the determining a target field of the table to be classified includes: determining the target field of the table to be classified according to preset key field information.

The preset key field information is information for setting the key fields of the table to be classified. A key field is a field containing important business information; such fields are typically the fields that contain text content for determining the table type. The preset key field information may include the field names of the key fields. According to the preset key field information, the fields in the table to be classified whose field names are consistent with the field names in the preset key field information are determined as the target fields of the table to be classified.

In one specific example, the preset key field information includes the field names of the key fields: "primary key Chinese name", "physical subsystem", and "table Chinese name". According to the preset key field information, the fields in the table to be classified named "primary key Chinese name", "physical subsystem", and "table Chinese name" are determined as the target fields of the table to be classified. The text content "accounting institution number | @ | institution number" of the field named "primary key Chinese name", the text content "escrow" of the field named "physical subsystem", and the text content "escrow fund account" of the field named "table Chinese name" are then extracted as the text of the table to be classified.
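The field selection in step 101 can be sketched as follows. This is only a minimal illustration, assuming a table is represented as a Python dict that maps field names to text content; the function name extract_table_text, the sample table and the extra "owner" field are hypothetical and do not come from the original disclosure.

```python
from typing import Dict, List


def extract_table_text(table: Dict[str, str], target_fields: List[str]) -> Dict[str, str]:
    """Return the text content of each target field found in the table.

    Fields are matched by exact field name, in the order given by
    target_fields; names that are missing from the table are skipped.
    """
    return {name: table[name] for name in target_fields if name in table}


# Hypothetical table to be classified: field name -> text content.
table = {
    "primary key Chinese name": "accounting institution number | @ | institution number",
    "physical subsystem": "escrow",
    "table Chinese name": "escrow fund account",
    "owner": "not used for classification",
}
target_fields = ["primary key Chinese name", "physical subsystem", "table Chinese name"]
table_text = extract_table_text(table, target_fields)
```

The same function serves both variants described above, since user-supplied target field setting information and preset key field information both reduce to a list of field names.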

Step 102, performing word segmentation processing on the text of the table to be classified.

The word segmentation processing of the text of the table to be classified refers to segmenting the text of one table to be classified into individual words.

Optionally, under the condition that the text of the table to be classified only includes the text content of one field, the word segmentation result of the field is the text of the table to be classified after the word segmentation processing.

Optionally, under the condition that the text of the table to be classified includes text contents of a plurality of fields, word segmentation processing is performed on the text contents of the fields respectively, and then word segmentation results of the text contents of the fields are spliced into a word segmentation result to be used as the text of the table to be classified after the word segmentation processing.

Optionally, the performing word segmentation processing on the text of the table to be classified includes: performing word segmentation processing on the text of the table to be classified through the jieba word segmentation tool.

In a specific example, the text of the table to be classified includes the text content of 3 fields: the text content of the field named "primary key Chinese name" is "accounting institution number | @ | institution number", the text content of the field named "physical subsystem" is "escrow", and the text content of the field named "table Chinese name" is "escrow fund account". Word segmentation processing is performed on the text content of each field through the jieba word segmentation tool. "accounting institution number | @ | institution number" becomes ["accounting", "institution", "number", "|", "@", "|", "institution", "number"]. "escrow" becomes ["escrow"]. "escrow fund account" becomes ["escrow", "fund", "account"]. The word segmentation results of the fields are then spliced into one word segmentation result ["accounting", "institution", "number", "|", "@", "|", "institution", "number", "escrow", "escrow", "fund", "account"], which is used as the text of the table to be classified after word segmentation processing.
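The per-field segmentation and splicing described in step 102 can be sketched as follows. The tokenizer here is a simple regular-expression stand-in chosen so that the sketch runs without third-party packages; it is an assumption for illustration, and for Chinese text a tokenizer such as jieba.lcut could be passed in instead. The names simple_tokenize and segment_table_text are hypothetical.

```python
import re
from typing import Callable, Dict, List


def simple_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: runs of word characters, or single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


def segment_table_text(
    field_texts: Dict[str, str],
    tokenize: Callable[[str], List[str]] = simple_tokenize,
) -> List[str]:
    """Segment each field separately, then splice the results in field order."""
    tokens: List[str] = []
    for text in field_texts.values():
        tokens.extend(tokenize(text))
    return tokens


field_texts = {
    "primary key Chinese name": "accounting institution number | @ | institution number",
    "physical subsystem": "escrow",
    "table Chinese name": "escrow fund account",
}
tokens = segment_table_text(field_texts)
```

Keeping the tokenizer as a parameter means the splicing logic does not depend on any particular segmentation tool.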

Step 103, performing vectorization processing on the text of the table to be classified after the word segmentation processing by using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified.

Optionally, the basic text feature vector of the table to be classified is the text feature vector obtained by performing vectorization processing on the text of the table to be classified after word segmentation processing through a text feature vectorization algorithm. A text feature vectorization algorithm is an algorithm that vectorizes a text to obtain a text feature vector. Text feature vectorization algorithms include, but are not limited to, the term frequency-inverse document frequency (TF-IDF) algorithm.

Optionally, the using a text feature vectorization algorithm to perform vectorization processing on the text of the table to be classified after the word segmentation processing to obtain the basic text feature vector of the table to be classified includes: performing vectorization processing on the text of the table to be classified after the word segmentation processing by using the TF-IDF algorithm to obtain the basic text feature vector of the table to be classified.
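A TF-IDF vectorization of word-segmented table text can be sketched as follows. The disclosure does not state which TF-IDF variant is used, so the smoothed inverse document frequency below, the relative term frequency, and the names fit_idf and tfidf_vector are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def fit_idf(corpus: List[List[str]]) -> Tuple[List[str], Dict[str, float]]:
    """Build a sorted vocabulary and IDF weights from tokenized texts."""
    n_docs = len(corpus)
    # Document frequency: number of texts in which each term occurs.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    vocab = sorted(doc_freq)
    # Assumed smoothed variant: idf = log((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def tfidf_vector(tokens: List[str], vocab: List[str], idf: Dict[str, float]) -> List[float]:
    """Map one tokenized text to a TF-IDF vector over the fitted vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for an empty text
    return [counts[t] / total * idf[t] for t in vocab]


# Hypothetical word-segmented texts of two tables.
corpus = [["escrow", "fund", "account"], ["loan", "contract", "account"]]
vocab, idf = fit_idf(corpus)
basic_vector = tfidf_vector(["escrow", "escrow", "fund"], vocab, idf)
```

Terms that occur in every text, such as "account" here, receive the lowest weight, while tokens outside the fitted vocabulary are ignored.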

Step 104, generating a rule vector of the table to be classified according to a preset table classification rule and text contents in the table to be classified.

The preset table classification rule is a rule for classifying tables that is set according to information input by the user. The text content in the table to be classified is the text content of each field in the table to be classified. The rule vector is a vector generated according to the preset table classification rule and the text content in the table to be classified.

Optionally, the method further includes: setting at least one table classification rule according to table classification rule setting information input by a user; each table classification rule comprises a field name, a keyword and a classification label.

The user may set one or more table classification rules through the input table classification rule setting information. Each table classification rule includes a field name, a keyword, and a classification label; that is, each "field name - keyword - classification label" group serves as one table classification rule. The classification label is a label for identifying a table type. Each type of table has a corresponding classification label.

For each table classification rule, if the text content of the field whose field name in the table is consistent with the field name in the table classification rule contains the keyword in the table classification rule, it may be determined that the text content in the table satisfies the table classification rule, and the classification label of the table is the classification label in the table classification rule.

Optionally, the generating a rule vector of the table to be classified according to a preset table classification rule and text contents in the table to be classified includes: determining the number of dimensions of the rule vector of the table to be classified according to the number of preset table classification rules, wherein each dimension in the rule vector corresponds to one table classification rule, and the initial value of each dimension in the rule vector is 0; and sequentially judging whether the text content in the table to be classified meets each table classification rule, and when detecting that the text content in the table to be classified meets a target table classification rule, setting the value of the dimension corresponding to the target table classification rule in the rule vector of the table to be classified to 1.

Optionally, according to the number of preset table classification rules, determining the dimension of the rule vector of the table to be classified to obtain an initial rule vector of the table to be classified. Illustratively, according to the number "3" of preset table classification rules, determining that the dimension of the rule vector of the table to be classified is 3, and obtaining an initial rule vector [0,0,0] of the table to be classified. The 3 dimensions in the rule vector correspond to 3 table classification rules, respectively.

Optionally, the sequentially determining whether the text content in the table to be classified meets the table classification rules includes: sequentially acquiring one table classification rule as the currently processed table classification rule; judging whether the text content of the field in the table to be classified whose field name is consistent with the field name in the currently processed table classification rule contains the keyword in the currently processed table classification rule; if it does, determining that the text content in the table to be classified meets the currently processed table classification rule; and returning to the operation of sequentially acquiring one table classification rule as the currently processed table classification rule until all the table classification rules have been processed.

Optionally, after judging whether the text content of the field in the table to be classified whose field name is consistent with the field name in the currently processed table classification rule contains the keyword in the currently processed table classification rule, the method further includes: if the text content of that field does not contain the keyword in the currently processed table classification rule, determining that the text content in the table to be classified does not meet the currently processed table classification rule.

Optionally, the method further includes: when detecting that the text content in the table to be classified does not meet the target table classification rule, determining that the value of the dimension corresponding to the target table classification rule in the rule vector of the table to be classified is kept at 0.

In a specific example, 3 table classification rules are set according to the table classification rule setting information input by the user: a first table classification rule, a second table classification rule, and a third table classification rule. The field name of the first table classification rule is "primary key keyword", its keyword is "account number", and its classification label is "contract". The field name of the second table classification rule is "physical subsystem", its keyword is "warranty", and its classification label is "fair". The field name of the third table classification rule is "physical subsystem", its keyword is "product line", and its classification label is "trade financing". According to the number "3" of preset table classification rules, the dimension of the rule vector of the table to be classified is determined to be 3, and an initial rule vector [0,0,0] of the table to be classified is obtained. The 3 dimensions in the rule vector correspond to the first, second and third table classification rules, respectively. Whether the text content in the table to be classified meets the first, second and third table classification rules is then judged in sequence. When it is detected that the text content in the table to be classified meets the first table classification rule, the value of the dimension corresponding to the first table classification rule in the rule vector is set to 1. When it is detected that the text content in the table to be classified does not meet the second table classification rule, the value of the dimension corresponding to the second table classification rule is kept at 0. When it is detected that the text content in the table to be classified does not meet the third table classification rule, the value of the dimension corresponding to the third table classification rule is kept at 0. The rule vector [1,0,0] of the table to be classified is thereby generated.
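The rule vector generation in step 104 can be sketched as follows. This is a minimal illustration, assuming a table is a dict from field names to text content; the class TableRule, the functions rule_is_met and build_rule_vector, and the sample table contents are hypothetical, while the three rules repeat the example above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TableRule:
    field_name: str  # field whose text content is inspected
    keyword: str     # keyword that must occur in that text content
    label: str       # classification label associated with the rule


def rule_is_met(table: Dict[str, str], rule: TableRule) -> bool:
    """A rule is met when the named field exists and its text contains the keyword."""
    return rule.keyword in table.get(rule.field_name, "")


def build_rule_vector(table: Dict[str, str], rules: List[TableRule]) -> List[int]:
    """One dimension per rule: start at 0, set to 1 when the rule is met."""
    vector = [0] * len(rules)
    for i, rule in enumerate(rules):
        if rule_is_met(table, rule):
            vector[i] = 1
    return vector


rules = [
    TableRule("primary key keyword", "account number", "contract"),
    TableRule("physical subsystem", "warranty", "fair"),
    TableRule("physical subsystem", "product line", "trade financing"),
]
# Hypothetical table content chosen so that only the first rule is met.
table = {"primary key keyword": "customer account number", "physical subsystem": "escrow"}
rule_vector = build_rule_vector(table, rules)
```

The classification label of a rule is not written into the vector; it only records which rule each dimension stands for, and the classifier learns how strongly each dimension indicates a label.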

Step 105, splicing the basic text feature vector and the rule vector to obtain a text feature vector of the table to be classified.

Optionally, the text feature vector of the table to be classified is the text feature vector used for determining the classification label of the table to be classified. Splicing means joining the basic text feature vector and the rule vector into a single vector; the splicing result is taken as the text feature vector of the table to be classified.

In a specific example, the basic text feature vector of the table to be classified is a, the rule vector of the table to be classified is b, and the basic text feature vector a and the rule vector b are spliced to obtain the text feature vector A = [a; b].

The basic text feature vector of the table to be classified obtained by the text feature vectorization algorithm could by itself be used as the input of the random forest classification model to classify table data. However, the text feature vectorization algorithm only measures the importance of words from a statistical point of view. The rules for classifying tables, set according to user input information, are therefore also added to the text feature vector. The specific method is as follows: at least one table classification rule is set according to table classification rule setting information input by a user, each table classification rule comprising a field name, a keyword and a classification label. The number of dimensions of the rule vector of the table to be classified is determined according to the number of preset table classification rules, each dimension in the rule vector corresponding to one table classification rule and having an initial value of 0. Whether the text content in the table to be classified meets each table classification rule is then judged in sequence: when the text content meets a target table classification rule, the value of the corresponding dimension in the rule vector is set to 1; when it does not, the value of the corresponding dimension is kept at 0. A rule vector is thus formed for each table to be classified, and it is spliced with the basic text feature vector obtained through the text feature vectorization algorithm to serve as the input of the random forest classification model.

Because the text feature vector of the table to be classified is generated by combining a text feature vectorization algorithm with rules, the importance of each word is measured in a statistical sense while the rules add a further constraint, so that the generated text feature vector is better suited to table data.
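The splicing in step 105 amounts to concatenating the two vectors. The sketch below is a minimal illustration; the function name splice_vectors and the numeric values are hypothetical.

```python
from typing import List, Sequence


def splice_vectors(basic: Sequence[float], rule: Sequence[int]) -> List[float]:
    """Concatenate the basic text feature vector and the rule vector."""
    return [float(x) for x in basic] + [float(x) for x in rule]


a = [0.0, 0.47, 0.23]  # hypothetical basic text feature vector
b = [1, 0, 0]          # rule vector for three table classification rules
spliced = splice_vectors(a, b)
```

The length of the spliced vector is the vocabulary size plus the number of rules, so the rule set has to stay fixed between training and prediction for the dimensions to line up.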

Step 106, inputting the text feature vector of the table to be classified into a pre-trained random forest classification model to obtain a classification label of the table to be classified.

The pre-trained random forest classification model is a classification model obtained by training a training sample formed by text feature vectors and classification labels of a preset number of tables by using a random forest algorithm. The input of the random forest classification model is a text feature vector of the table, and the output is a classification label of the table.

Optionally, the text feature vector of the table to be classified is input to a pre-trained random forest classification model, the pre-trained random forest classification model analyzes the text feature vector of the table to be classified, and the classification label of the table to be classified is output according to a mapping relationship between the text feature vector and the classification label. And the classification label of the table to be classified is the classification result of the table to be classified.

The random forest algorithm is an ensemble learning method in which a series of classifiers are combined to make a joint decision, with the expectation of obtaining the fairest result. Classifying tables with a pre-trained random forest classification model can therefore alleviate the limited accuracy of a single model.
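The idea of several classifiers deciding jointly can be sketched with plain majority voting. This is not a full random forest: the three hand-written decision functions stand in for trained decision trees, and the names majority_vote and stumps, as well as the thresholds, are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

Classifier = Callable[[Sequence[float]], str]


def majority_vote(classifiers: List[Classifier], features: Sequence[float]) -> str:
    """Return the label predicted by the largest number of classifiers.

    On a tie, the label that received its first vote earliest wins.
    """
    votes = Counter(clf(features) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees; each looks at one dimension
# of a spliced text feature vector (three TF-IDF values, then one rule flag).
stumps: List[Classifier] = [
    lambda x: "contract" if x[3] == 1 else "other",
    lambda x: "contract" if x[0] > 0.5 else "other",
    lambda x: "other" if x[1] > 0.1 else "contract",
]
label = majority_vote(stumps, [0.2, 0.0, 0.0, 1.0])
```

In this run the second classifier votes "other" but is outvoted by the other two, which is the sense in which an ensemble can offset the limited accuracy of a single model.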

Optionally, the method further includes: acquiring text feature vectors and classification labels of a preset number of tables as training samples; and training a random forest classification model according to the training samples, wherein the input of the random forest classification model is the text feature vector of a table and the output is the classification label of the table.

Optionally, obtaining the text feature vectors and the classification labels of a preset number of tables as training samples includes: acquiring a preset number of tables; performing word segmentation processing on the text of each table; performing vectorization processing on the text of each table after the word segmentation processing by using a text feature vectorization algorithm to obtain a basic text feature vector of each table; generating a rule vector of each table according to a preset table classification rule and text contents in each table; splicing the basic text feature vector and the rule vector of each table to obtain the text feature vector of each table; sending each table to a manual labeling platform so that the manual labeling platform labels each table with a classification label and feeds it back; obtaining the tables labeled with classification labels that are fed back by the manual labeling platform; and taking the text feature vector and the classification label of each table as a training sample. The text feature vector and the classification label of each table form one set of training data in the training samples.
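Training and applying the classification model can be sketched with scikit-learn's RandomForestClassifier. The disclosure does not name a library, so the use of scikit-learn, the hyperparameters, and the tiny hand-made training set are assumptions for illustration only; in practice the feature vectors would come from the vectorization and splicing steps and the labels from the manual labeling platform.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training samples: each row is a spliced text feature vector
# (three TF-IDF dimensions followed by a three-dimensional rule vector).
X_train = [
    [0.9, 0.0, 0.1, 1, 0, 0],
    [0.8, 0.1, 0.0, 1, 0, 0],
    [0.0, 0.7, 0.2, 0, 0, 1],
    [0.1, 0.9, 0.0, 0, 0, 1],
]
# Classification labels as they would be fed back by manual labeling.
y_train = ["contract", "contract", "trade financing", "trade financing"]

# n_estimators and random_state are illustrative choices, not from the source.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Classify a new table from its spliced text feature vector.
predicted = model.predict([[0.85, 0.05, 0.0, 1, 0, 0]])
```

The vector passed to predict must be built with the same vocabulary and the same rule order as the training vectors, since the model identifies features only by position.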

The embodiment of the invention provides a table classification method. Text content of a target field of a table to be classified is obtained as the text of the table to be classified; word segmentation processing is performed on the text; the segmented text is vectorized by using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified; a rule vector of the table to be classified is generated according to a preset table classification rule and text contents in the table to be classified; the basic text feature vector and the rule vector are spliced to obtain a text feature vector of the table to be classified; and finally the text feature vector is input into a pre-trained random forest classification model to obtain a classification label of the table to be classified. By combining a text feature vectorization algorithm with rules, a text feature vector for determining the table type is generated from the text content in the table and used as the input of the random forest classification model, which outputs the classification label of the table to be classified. The table classification process can thus be completed automatically from the text content in the table, based on the random forest algorithm and the preset table classification rules. This greatly reduces the dependence on manual work, alleviates the problem of low manual labeling efficiency, and improves table classification efficiency while ensuring a certain classification accuracy.

Example two

Fig. 2 is a flowchart of a table classification method according to a second embodiment of the present invention. Embodiments of the invention may be combined with various alternatives in one or more of the embodiments described above.

As shown in fig. 2, the method of the embodiment of the present invention specifically includes:

Step 201, acquiring text content of a target field of a table to be classified as a text of the table to be classified.

Step 202, performing word segmentation processing on the text of the table to be classified through the jieba word segmentation tool.

Step 203, using a term frequency-inverse document frequency (TF-IDF) algorithm to perform vectorization processing on the text of the table to be classified after the word segmentation processing, so as to obtain a basic text feature vector of the table to be classified.

Step 204, generating a rule vector of the table to be classified according to a preset table classification rule and the text content in the table to be classified.

Step 205, splicing the basic text feature vector and the rule vector to obtain a text feature vector of the table to be classified.

The basic text feature vector of the table to be classified, obtained through the term frequency-inverse document frequency (TF-IDF) algorithm, can by itself be used as the input of a random forest classification model to classify table data. However, the TF-IDF algorithm measures the importance of words only from a statistical perspective, so rules for classifying tables, set according to user input information, are also added to the text feature vector. The specific method is as follows. At least one table classification rule is set according to table classification rule setting information input by a user; each table classification rule comprises a field name, a keyword and a classification label. The dimensionality of the rule vector of the table to be classified is determined according to the number of preset table classification rules, wherein each dimension in the rule vector corresponds to one table classification rule, and the initial value of each dimension in the rule vector is 0. Whether the text content in the table to be classified meets each table classification rule is then judged in sequence: when it is detected that the text content meets a target table classification rule, the value of the dimension corresponding to that rule in the rule vector is set to 1; when it is detected that the text content does not meet the target table classification rule, the value of the corresponding dimension is kept at 0. A rule vector is thus formed for each table to be classified, and it is spliced with the basic text feature vector obtained through the TF-IDF algorithm to serve as the input of the random forest classification model.
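The rule check and the splicing step described above can be sketched as follows. The representation of a table as a dictionary from field name to text content, the `Rule` type, and the example field names and keywords are assumptions for illustration; the source only specifies that each rule carries a field name, a keyword and a classification label.

```python
from typing import Dict, List, NamedTuple, Sequence


class Rule(NamedTuple):
    field_name: str   # which field of the table to inspect
    keyword: str      # keyword that must appear in that field's text
    label: str        # classification label associated with the rule


def rule_vector(table: Dict[str, str], rules: Sequence[Rule]) -> List[int]:
    """One dimension per rule, initialised to 0 and set to 1 when the rule is met."""
    vector = [0] * len(rules)
    for i, rule in enumerate(rules):
        content = table.get(rule.field_name)
        if content is not None and rule.keyword in content:
            vector[i] = 1
    return vector


def splice(base: Sequence[float], rule_vec: Sequence[float]) -> List[float]:
    """Concatenate the basic text feature vector and the rule vector."""
    return list(base) + list(rule_vec)


rules = [
    Rule("table_comment", "customer", "customer_info"),
    Rule("table_name", "acct", "account_info"),
]
table = {"table_name": "t_cust_base", "table_comment": "customer basic information"}

rv = rule_vector(table, rules)            # first rule met, second not
features = splice([0.2, 0.0, 0.7], rv)    # basic vector followed by rule vector
```

Because the rule dimensions are appended after the statistical dimensions, the length of the final vector is the vocabulary size plus the number of rules, and it must be built the same way at training time and at prediction time.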

The text feature vector of the table to be classified is generated by combining the term frequency-inverse document frequency (TF-IDF) algorithm with rules: the importance of each word is measured in a statistical sense, and the rules introduce a further constraint, so that the generated text feature vector is better suited to table data.
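The statistical weighting mentioned above (word frequency combined with inverse text frequency, commonly abbreviated TF-IDF) can be sketched in plain Python. The sketch uses the unsmoothed textbook form, weight = (count / document length) × log(N / document frequency); library implementations often apply smoothing and normalisation, so their numbers may differ. The token lists are invented examples.

```python
import math
from collections import Counter
from typing import List, Tuple


def tf_idf(documents: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Return the sorted vocabulary and one TF-IDF vector per tokenised document."""
    vocabulary = sorted({token for doc in documents for token in doc})
    n_docs = len(documents)
    # Number of documents in which each token appears at least once.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty document
        vectors.append([
            (counts[token] / total) * math.log(n_docs / doc_freq[token])
            for token in vocabulary
        ])
    return vocabulary, vectors


# Each inner list stands for the segmented text of one table.
docs = [["customer", "name"], ["account", "balance"], ["customer", "id"]]
vocab, vectors = tf_idf(docs)
```

In the first document, `name` occurs in only one of the three documents while `customer` occurs in two, so `name` receives the higher weight even though both appear once.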

Step 206, inputting the text feature vector of the table to be classified into a pre-trained random forest classification model to obtain a classification label of the table to be classified.

The embodiment of the invention provides a table classification method. Text content of a target field of a table to be classified is obtained as the text of the table to be classified; word segmentation processing is performed on the text through a word segmentation tool; the segmented text is vectorized by using the term frequency-inverse document frequency (TF-IDF) algorithm to obtain a basic text feature vector of the table to be classified; a rule vector of the table to be classified is generated according to a preset table classification rule and text contents in the table to be classified; the basic text feature vector and the rule vector are spliced to obtain a text feature vector of the table to be classified; and finally the text feature vector is input into a pre-trained random forest classification model to obtain a classification label of the table to be classified. By combining the TF-IDF algorithm with rules, a text feature vector for determining the table type is generated from the text content in the table, so that the generated text feature vector is better suited to table data. The generated text feature vector is used as the input of the random forest classification model, which outputs the classification label of the table to be classified. The table classification process can thus be completed automatically from the text content in the table, based on the random forest algorithm and the preset table classification rules. This greatly reduces the dependence on manual work, alleviates the problem of low manual labeling efficiency, and improves table classification efficiency while ensuring a certain classification accuracy.

Example three

Fig. 3 is a schematic structural diagram of a table classifying device according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes: a text acquisition module 301, a text word segmentation module 302, a text vectorization module 303, a rule vector generation module 304, a vector concatenation module 305, and a label acquisition module 306.

The text obtaining module 301 is configured to obtain text content of a target field of a table to be classified as a text of the table to be classified; the text word segmentation module 302 is configured to perform word segmentation processing on the text of the table to be classified; the text vectorization module 303 is configured to perform vectorization processing on the text of the table to be classified after the word segmentation processing by using a text feature vectorization algorithm, so as to obtain a basic text feature vector of the table to be classified; the rule vector generation module 304 is configured to generate a rule vector of the table to be classified according to a preset table classification rule and text contents in the table to be classified; the vector splicing module 305 is configured to splice the basic text feature vector and the rule vector to obtain a text feature vector of the table to be classified; and the label obtaining module 306 is configured to input the text feature vector of the table to be classified into a pre-trained random forest classification model, so as to obtain a classification label of the table to be classified.
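One possible way to picture how the six modules cooperate is a small container that holds one callable per module and chains them in order. The class name `TableClassifier`, its attribute names and the stub callables below are hypothetical; they only mirror the division of responsibilities among modules 301 to 306 and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class TableClassifier:
    acquire_text: Callable[[Dict[str, str]], str]              # module 301
    segment: Callable[[str], List[str]]                        # module 302
    vectorize: Callable[[List[str]], List[float]]              # module 303
    make_rule_vector: Callable[[Dict[str, str]], List[float]]  # module 304
    predict: Callable[[Sequence[float]], str]                  # module 306

    def classify(self, table: Dict[str, str]) -> str:
        text = self.acquire_text(table)
        tokens = self.segment(text)
        base = self.vectorize(tokens)
        rules = self.make_rule_vector(table)
        features = list(base) + list(rules)                    # module 305: splicing
        return self.predict(features)


# Stub callables standing in for the real segmenter, vectorizer and model.
clf = TableClassifier(
    acquire_text=lambda t: t["table_comment"],
    segment=str.split,
    vectorize=lambda tokens: [float(len(tokens))],
    make_rule_vector=lambda t: [1.0 if "customer" in t["table_comment"] else 0.0],
    predict=lambda f: "customer_info" if f[-1] == 1.0 else "other",
)

result = clf.classify({"table_comment": "customer basic information"})
```

Keeping each module behind a plain callable makes it straightforward to swap the stubs for a real segmenter, vectorizer or trained model without touching the `classify` flow.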

The embodiment of the invention provides a table classification device. Text content of a target field of a table to be classified is obtained as the text of the table to be classified; word segmentation processing is performed on the text; the segmented text is vectorized by using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified; a rule vector of the table to be classified is generated according to a preset table classification rule and text contents in the table to be classified; the basic text feature vector and the rule vector are spliced to obtain a text feature vector of the table to be classified; and finally the text feature vector is input into a pre-trained random forest classification model to obtain a classification label of the table to be classified. By combining a text feature vectorization algorithm with rules, a text feature vector for determining the table type is generated from the text content in the table and used as the input of the random forest classification model, which outputs the classification label of the table to be classified. The table classification process can thus be completed automatically from the text content in the table, based on the random forest algorithm and the preset table classification rules. This greatly reduces the dependence on manual work, alleviates the problem of low manual labeling efficiency, and improves table classification efficiency while ensuring a certain classification accuracy.

In an optional implementation manner of the embodiment of the present invention, optionally, the table classifying device further includes: a training sample acquisition module, configured to acquire the text feature vectors and the classification labels of a preset number of tables as training samples; and a model training module, configured to train a random forest classification model according to the training samples, wherein the input of the random forest classification model is the text feature vector of a table and the output is the classification label of the table.

In an optional implementation manner of the embodiment of the present invention, optionally, when the text obtaining module 301 performs an operation of obtaining text content of a target field of a table to be classified as a text of the table to be classified, the operation is specifically configured to: determining a target field of a table to be classified; and extracting the text content of the target field of the table to be classified as the text of the table to be classified.

In an optional implementation manner of the embodiment of the present invention, optionally, when the text obtaining module 301 executes an operation of determining a target field of the table to be classified, specifically, the text obtaining module is configured to: and determining the target field of the form to be classified according to the target field setting information input by the user.

In an optional implementation manner of the embodiment of the present invention, optionally, when the text obtaining module 301 executes an operation of determining a target field of the table to be classified, the text obtaining module is specifically configured to: determine the target field of the table to be classified according to preset key field information.

In an optional implementation manner of the embodiment of the present invention, optionally, when performing an operation of performing word segmentation processing on the text of the table to be classified, the text word segmentation module 302 is specifically configured to: perform word segmentation processing on the text of the table to be classified through the jieba word segmentation tool.
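To show what word segmentation produces without depending on any external tool, the sketch below uses a swapped-in technique, dictionary-based forward maximum matching, rather than the segmentation tool named in this embodiment. The vocabulary and the sample text are invented; a real segmenter would use a far larger dictionary and statistical disambiguation.

```python
from typing import List, Set


def forward_max_match(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionary: "customer", "name", "account", "balance".
vocab = {"客户", "姓名", "账户", "余额"}
tokens = forward_max_match("客户姓名账户余额", vocab)
```

The resulting token list is the kind of input that the subsequent vectorization step expects: one list of words per table.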

In an optional implementation manner of the embodiment of the present invention, optionally, when the text vectorization module 303 performs an operation of performing vectorization processing on the text of the table to be classified after the word segmentation processing by using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified, the text vectorization module is specifically configured to: perform vectorization processing on the text of the table to be classified after the word segmentation processing by using a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a basic text feature vector of the table to be classified.

In an optional implementation manner of the embodiment of the present invention, optionally, when the rule vector generating module 304 executes an operation of generating a rule vector of the table to be classified according to a preset table classification rule and text contents in the table to be classified, the operation is specifically configured to: determining the dimensionality of a rule vector of the table to be classified according to the number of preset table classification rules, wherein each dimensionality in the rule vector corresponds to one table classification rule, and the initial value of each dimensionality in the rule vector is 0; and sequentially judging whether the text content in the table to be classified meets each table classification rule, and when detecting that the text content in the table to be classified meets a target table classification rule, setting the value of the dimension corresponding to the target table classification rule in the rule vector of the table to be classified as 1.

In an optional implementation manner of the embodiment of the present invention, optionally, the rule vector generation module 304 is further configured to: setting at least one form classification rule according to form classification rule setting information input by a user; each table classification rule comprises a field name, a keyword and a classification label.

In an optional implementation manner of the embodiment of the present invention, optionally, when the rule vector generation module 304 executes an operation of sequentially determining whether text content in the table to be classified satisfies the table classification rules, the rule vector generation module is specifically configured to: sequentially acquire a table classification rule as the currently processed table classification rule; judge whether the text content of the field, in the table to be classified, whose field name is consistent with the field name in the currently processed table classification rule contains the keyword in the currently processed table classification rule; if the text content of that field contains the keyword in the currently processed table classification rule, determine that the text content in the table to be classified meets the currently processed table classification rule; and return to the operation of sequentially acquiring a table classification rule as the currently processed table classification rule until all the table classification rules have been processed.

In an optional implementation manner of the embodiment of the present invention, optionally, the rule vector generation module 304 is further configured to: if the text content of the field, in the table to be classified, whose field name is consistent with the field name in the currently processed table classification rule does not contain the keyword in the currently processed table classification rule, determine that the text content in the table to be classified does not meet the currently processed table classification rule.

In an optional implementation manner of the embodiment of the present invention, optionally, the rule vector generation module 304 is further configured to: when detecting that the text content in the table to be classified does not meet the target table classification rule, determining that the value of the dimension corresponding to the target table classification rule in the rule vector of the table to be classified is kept at 0.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

The table classification device can execute the table classification method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the table classification method.

Example four

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. FIG. 4 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in FIG. 4 is only one example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.

As shown in FIG. 4, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors 16, a memory 28, and a bus 18 that connects the various system components (including the memory 28 and the processors 16).

Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.

Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.

Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be appreciated that although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

The processor 16 executes various functional applications and data processing by running the program stored in the memory 28, thereby implementing the table classification method provided by the embodiment of the present invention: acquiring text content of a target field of a form to be classified as a text of the form to be classified; performing word segmentation processing on the text of the table to be classified; performing vectorization processing on the text of the table to be classified after the word segmentation processing by using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified; generating a rule vector of the table to be classified according to a preset table classification rule and text contents in the table to be classified; splicing the basic text feature vector and the rule vector to obtain a text feature vector of the table to be classified; and inputting the text feature vector of the table to be classified into a pre-trained random forest classification model to obtain a classification label of the table to be classified.

Example five

Embodiment five of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the table classification method provided by the embodiments of the present invention is implemented: acquiring text content of a target field of a table to be classified as a text of the table to be classified; performing word segmentation processing on the text of the table to be classified; performing vectorization processing on the text of the table to be classified after the word segmentation processing by using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified; generating a rule vector of the table to be classified according to a preset table classification rule and text contents in the table to be classified; splicing the basic text feature vector and the rule vector to obtain a text feature vector of the table to be classified; and inputting the text feature vector of the table to be classified into a pre-trained random forest classification model to obtain a classification label of the table to be classified.

Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or computer device. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method of table classification, comprising:

2. The method of claim 1, further comprising:

acquiring text feature vectors and classification labels of a preset number of tables as training samples;

training a random forest classification model according to the training samples;

wherein the input of the random forest classification model is the text feature vector of a table, and the output is the classification label of the table.

3. The method according to claim 1, wherein the obtaining of the text content of the target field of the table to be classified as the text of the table to be classified comprises:

determining a target field of a table to be classified;

and extracting the text content of the target field of the table to be classified as the text of the table to be classified.

4. The method of claim 3, wherein determining the target field of the table to be classified comprises:

and determining the target field of the form to be classified according to the target field setting information input by the user.

5. The method of claim 3, wherein determining the target field of the table to be classified comprises:

and determining a target field of the table to be classified according to preset key field information.

6. The method according to claim 1, wherein the performing word segmentation processing on the text of the table to be classified comprises:

and performing word segmentation processing on the text of the table to be classified through the jieba word segmentation tool.

7. The method according to claim 1, wherein the using a text feature vectorization algorithm to perform vectorization processing on the text of the table to be classified after the word segmentation processing to obtain a basic text feature vector of the table to be classified comprises:

and carrying out vectorization processing on the text of the table to be classified after the word segmentation processing by using a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a basic text feature vector of the table to be classified.

8. The method according to claim 1, wherein the generating a rule vector of the table to be classified according to a preset table classification rule and text content in the table to be classified comprises:

determining the dimensionality of a rule vector of the table to be classified according to the number of preset table classification rules, wherein each dimensionality in the rule vector corresponds to one table classification rule, and the initial value of each dimensionality in the rule vector is 0;

and sequentially judging whether the text content in the table to be classified meets each table classification rule, and when detecting that the text content in the table to be classified meets a target table classification rule, setting the value of the dimension corresponding to the target table classification rule in the rule vector of the table to be classified as 1.

9. The method of claim 8, further comprising:

setting at least one form classification rule according to form classification rule setting information input by a user;

each table classification rule comprises a field name, a keyword and a classification label.

10. The method according to claim 9, wherein the sequentially determining whether the text content in the table to be classified satisfies the table classification rules comprises:

sequentially acquiring a form classification rule as a current processing form classification rule;

judging whether the text content of the field with the field name consistent with the field name in the classification rule of the currently processed form contains the key word in the classification rule of the currently processed form;

if the text content of the field, in the table to be classified, whose field name is consistent with the field name in the currently processed table classification rule contains the keyword in the currently processed table classification rule, determining that the text content in the table to be classified meets the currently processed table classification rule;

and returning to execute the operation of sequentially acquiring a form classification rule as the current processing form classification rule until the processing of all the form classification rules is completed.

11. The method of claim 10, further comprising, after determining whether the text content of the field with the field name consistent with the field name in the current processing form classification rule in the table to be classified contains the keyword in the current processing form classification rule:

and if the text content of the field, in the table to be classified, whose field name is consistent with the field name in the currently processed table classification rule does not contain the keyword in the currently processed table classification rule, determining that the text content in the table to be classified does not meet the currently processed table classification rule.

12. The method of claim 8, further comprising:

when detecting that the text content in the table to be classified does not meet the target table classification rule, determining that the value of the dimension corresponding to the target table classification rule in the rule vector of the table to be classified is kept at 0.

13. A table classification apparatus, comprising:

14. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the table classification method according to any one of claims 1 to 12 when executing the computer program.

15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the table classification method according to any one of claims 1 to 12.