CN112989050B - Form classification method, device, equipment and storage medium

Info

Publication number: CN112989050B
Application number: CN202110349354.7A
Authority: CN
Inventors: 高宏华; 陈立捷; 崔莹琰; 魏翩翩; 苏建清
Original assignee: CCB Finetech Co Ltd
Current assignee: CCB Finetech Co Ltd
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2023-05-30
Anticipated expiration: 2041-03-31
Also published as: CN112989050A

Abstract

The embodiment of the invention discloses a form classification method, device, equipment and storage medium, and relates to the technical field of big data. The method comprises the following steps: acquiring text content of a target field of a form to be classified as the text of the form to be classified; performing word segmentation on the text of the form to be classified; vectorizing the segmented text of the form to be classified with a text feature vectorization algorithm to obtain a basic text feature vector of the form to be classified; generating a rule vector of the form to be classified according to preset form classification rules and the text content in the form to be classified; splicing the basic text feature vector and the rule vector to obtain a text feature vector of the form to be classified; and inputting the text feature vector of the form to be classified into a pre-trained random forest classification model to obtain a classification label of the form to be classified. The embodiment of the invention can complete the form classification process automatically and improve form classification efficiency.

Description

Form classification method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of big data, in particular to a method, a device, equipment and a storage medium for classifying tables.

Background

Forms are well structured and carry latent semantics, which makes them easier to analyze and understand than unstructured text data. To manage forms effectively, they are generally classified first and then managed according to their type.

In the related art, forms are classified mainly by manually labeling the specific category to which each form belongs. When millions of forms are involved, manual labeling offers some guarantee of accuracy, but it is time-consuming, labor-intensive and inefficient.

Disclosure of Invention

The embodiment of the invention provides a method, a device, equipment and a storage medium for classifying forms, which can automatically complete the form classification process and improve the form classification efficiency.

In a first aspect, an embodiment of the present invention provides a method for classifying a table, including:

acquiring text content of a target field of a form to be classified as a text of the form to be classified;

performing word segmentation on the text of the form to be classified;

using a text feature vectorization algorithm to vectorize the text of the to-be-classified form after word segmentation to obtain a basic text feature vector of the to-be-classified form;

generating a rule vector of the form to be classified according to preset form classification rules and the text content in the form to be classified;

splicing the basic text feature vector and the rule vector to obtain the text feature vector of the form to be classified;

and inputting the text feature vector of the form to be classified into a pre-trained random forest classification model to obtain a classification label of the form to be classified.

In a second aspect, an embodiment of the present invention further provides a table classification apparatus, including:

the text acquisition module is used for acquiring text content of a target field of a form to be classified as a text of the form to be classified;

the text word segmentation module is used for carrying out word segmentation on the text of the form to be classified;

the text vectorization module is used for vectorizing the text of the to-be-classified form after word segmentation by using a text feature vectorization algorithm to obtain a basic text feature vector of the to-be-classified form;

the rule vector generation module is used for generating a rule vector of the form to be classified according to preset form classification rules and the text content in the form to be classified;

the vector splicing module is used for splicing the basic text feature vector and the rule vector to obtain the text feature vector of the form to be classified;

the label acquisition module is used for inputting the text feature vector of the form to be classified into a pre-trained random forest classification model to obtain the classification label of the form to be classified.

In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the table classification method according to the embodiment of the present invention when executing the computer program.

In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement a table classification method according to an embodiment of the present invention.

According to the technical scheme, text content of a target field of a form to be classified is acquired as the text of the form to be classified; word segmentation is performed on that text; and the segmented text is vectorized with a text feature vectorization algorithm to obtain a basic text feature vector of the form to be classified. A rule vector of the form to be classified is generated according to preset form classification rules and the text content in the form to be classified, and the basic text feature vector and the rule vector are spliced to obtain the text feature vector of the form to be classified. Finally, the text feature vector of the form to be classified is input into a pre-trained random forest classification model to obtain the classification label of the form to be classified. Because the text feature vector used to determine the form type is generated from the text content in the form by combining a text feature vectorization algorithm with rules, and is then used as the input of the random forest classification model, the form classification process can be completed automatically from the text content in the form, based on the random forest algorithm and the preset form classification rules. This greatly reduces the dependence on manual work, alleviates the low efficiency of manual labeling, improves form classification efficiency, and ensures a certain classification accuracy.

Drawings

Fig. 1 is a flowchart of a table classification method according to a first embodiment of the present invention.

Fig. 2 is a flowchart of a table classification method according to a second embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a table classifying device according to a third embodiment of the present invention.

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof.

It should be further noted that, for convenience of description, only some, but not all of the matters related to the present invention are shown in the accompanying drawings. Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.

Example 1

Fig. 1 is a flowchart of a table classification method according to a first embodiment of the present invention. This embodiment is applicable to the case of classifying tables. The method may be performed by the table classification apparatus provided by an embodiment of the present invention; the apparatus may be implemented in software and/or hardware and may generally be integrated in a computer device. As shown in Fig. 1, the method specifically includes:

Step 101, acquiring text content of a target field of a table to be classified as the text of the table to be classified.

The to-be-classified table is a table which needs to be classified. The table to be classified contains a plurality of fields. Each field has a corresponding field name. The field value of each field is text content. The target field is one or more fields containing text content for determining the form type. The text of the form to be classified is text for determining the form type.

Optionally, acquiring the text content of the target field of the form to be classified as the text of the form to be classified includes: determining the target field of the form to be classified; and extracting the text content of that target field as the text of the form to be classified.

Optionally, determining the target field of the form to be classified includes: determining the target field of the form to be classified according to target field setting information input by the user.

The target field setting information is information for setting the target field of the form to be classified, and may include the field name of the target field. The user can designate the field name of the target field through the target field setting information. According to the target field setting information input by the user, the field in the form to be classified whose field name is consistent with the field name in the target field setting information is determined as the target field of the form to be classified.

In one specific example, the field names of the target fields included in the target field setting information are "primary key Chinese name", "physical subsystem" and "form Chinese name". According to the target field setting information input by the user, the fields in the form to be classified whose field names are "primary key Chinese name", "physical subsystem" and "form Chinese name" are determined as the target fields of the form to be classified. Then the text content "accounting organization number @ |organization number" of the field named "primary key Chinese name", the text content "hosting" of the field named "physical subsystem", and the text content "managed fund account" of the field named "form Chinese name" are extracted as the text of the form to be classified.
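The selection of target fields by field name can be illustrated with a minimal Python sketch. It is not part of the described method: the dictionary representation of a form, the function name `extract_form_text` and the extra field "owner" are assumptions made only for illustration.

```python
# Hypothetical in-memory form: field name -> text content (field value).
form = {
    "primary key Chinese name": "accounting organization number @ |organization number",
    "physical subsystem": "hosting",
    "form Chinese name": "managed fund account",
    "owner": "unused here",  # hypothetical extra field that is not a target field
}

# Field names supplied by the user as target field setting information.
target_field_names = [
    "primary key Chinese name",
    "physical subsystem",
    "form Chinese name",
]


def extract_form_text(form, target_field_names):
    """Return the text content of every field whose name matches a target field name."""
    return {name: form[name] for name in target_field_names if name in form}


form_text = extract_form_text(form, target_field_names)
```

Fields whose names are not listed in the setting information, such as the hypothetical "owner" field, are simply left out of the text of the form to be classified.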

Optionally, determining the target field of the form to be classified includes: determining the target field of the form to be classified according to preset key field information.

The preset key field information is information for setting the key fields of the form to be classified. A key field is a field containing important business information, and such a field typically contains text content for determining the form type. The preset key field information may include the field names of the key fields. According to the preset key field information, the field in the form to be classified whose field name is consistent with a field name in the preset key field information is determined as a target field of the form to be classified.

In a specific example, the field names of the key fields included in the preset key field information are "primary key Chinese name", "physical subsystem" and "form Chinese name". According to the preset key field information, the fields in the form to be classified whose field names are "primary key Chinese name", "physical subsystem" and "form Chinese name" are determined as the target fields of the form to be classified. Then the text content "accounting organization number @ |organization number" of the field named "primary key Chinese name", the text content "hosting" of the field named "physical subsystem", and the text content "managed fund account" of the field named "form Chinese name" are extracted as the text of the form to be classified.

Step 102, performing word segmentation on the text of the form to be classified.

Performing word segmentation on the text of the form to be classified means splitting that text into individual words.

Optionally, in the case that the text of the form to be classified includes only text content of one field, the word segmentation result of the field is the text of the form to be classified after word segmentation processing.

Optionally, under the condition that the text of the table to be classified includes text contents of a plurality of fields, word segmentation processing is performed on the text contents of each field, and then word segmentation results of the text contents of each field are spliced into a word segmentation result to be used as the text of the table to be classified after the word segmentation processing.

Optionally, performing word segmentation on the text of the form to be classified includes: performing word segmentation on the text of the form to be classified with the jieba word segmentation tool.

In a specific example, the text of the form to be classified includes the text content of 3 fields: the text content of the field named "primary key Chinese name" is "accounting organization number @ |organization number", the text content of the field named "physical subsystem" is "hosting", and the text content of the field named "form Chinese name" is "managed fund account". Word segmentation is performed on the text content of each field separately with the jieba word segmentation tool. After word segmentation, "accounting organization number @ |organization number" becomes ["accounting", "organization", "number", "@", "|", "organization", "number"], "hosting" becomes ["hosting"], and "managed fund account" becomes ["managed", "fund", "account"]. The word segmentation results of the fields are then spliced into one word segmentation result, ["accounting", "organization", "number", "@", "|", "organization", "number", "hosting", "managed", "fund", "account"], which serves as the text of the form to be classified after word segmentation.
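The per-field segmentation and splicing described above can be sketched as follows. The sketch is illustrative only: `simple_tokenize` is a hypothetical regular-expression stand-in that works for the space-separated English rendering of the example, not the word segmentation tool named in the text, and `segment_form_text` is an assumed helper name.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer: runs of word characters, plus each symbol on its own."""
    return re.findall(r"\w+|[^\w\s]", text)


def segment_form_text(field_texts, tokenize=simple_tokenize):
    """Segment each field's text separately, then splice the results into one list."""
    tokens = []
    for text in field_texts:
        tokens.extend(tokenize(text))
    return tokens


field_texts = [
    "accounting organization number @ |organization number",
    "hosting",
    "managed fund account",
]
tokens = segment_form_text(field_texts)
```

For Chinese text, a real segmenter would be passed in as `tokenize`; with the jieba package installed, `jieba.lcut` returns a list of words and could serve in that role.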

Step 103, vectorizing the segmented text of the form to be classified by using a text feature vectorization algorithm to obtain a basic text feature vector of the form to be classified.

The basic text feature vector of the form to be classified is the text feature vector obtained by vectorizing the segmented text of the form to be classified through a text feature vectorization algorithm. A text feature vectorization algorithm is an algorithm that vectorizes text to obtain a text feature vector. Text feature vectorization algorithms include, but are not limited to, the term frequency-inverse document frequency (TF-IDF) algorithm.

Optionally, vectorizing the segmented text of the form to be classified by using a text feature vectorization algorithm to obtain a basic text feature vector of the form to be classified includes: vectorizing the segmented text of the form to be classified by using the TF-IDF algorithm to obtain the basic text feature vector of the form to be classified.
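As a hedged illustration of TF-IDF vectorization, the sketch below computes one vector per segmented text. The text does not state which TF-IDF variant is used; this sketch assumes the plain form tf = (count of the word in the text) / (length of the text) and idf = ln(N / df), where N is the number of texts and df is the number of texts containing the word. The function name and the two sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per tokenized text."""
    vocabulary = sorted({token for doc in tokenized_docs for token in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of texts that contain each token at least once.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vector = []
        for token in vocabulary:
            tf = counts[token] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[token])
            vector.append(tf * idf)
        vectors.append(vector)
    return vocabulary, vectors


# Hypothetical segmented texts of two forms.
docs = [
    ["managed", "fund", "account"],
    ["loan", "contract", "account"],
]
vocabulary, vectors = tfidf_vectors(docs)
```

Under this variant a word that occurs in every text, such as "account" here, receives weight 0, while words specific to one form receive positive weight. Library implementations often apply smoothing and normalization, so their values would differ from this sketch.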

Step 104, generating a rule vector of the form to be classified according to preset form classification rules and the text content in the form to be classified.

A preset form classification rule is a rule for classifying forms that is set according to information input by the user. The text content in the form to be classified is the text content of each field in the form to be classified. The rule vector is a vector generated according to the preset form classification rules and the text content in the form to be classified.

Optionally, the method further comprises: setting at least one form classification rule according to form classification rule setting information input by a user; wherein each table classification rule comprises a field name, a keyword and a classification label.

The user can set one or more form classification rules through the inputted form classification rule setting information. Each form classification rule includes a field name, a keyword, and a classification label. I.e. each "field name-keyword-class label" group acts as a table classification rule. The category labels are labels used to identify the type of form. Each category of the table has a corresponding category label.

For each table classification rule, if the text content of the field whose field name is identical to the field name in the table classification rule contains the keyword in the table classification rule, it may be determined that the text content in the table satisfies the table classification rule, and the classification label of the table is the classification label in the table classification rule.

Optionally, the generating a rule vector of the to-be-classified form according to a preset form classification rule and text content in the to-be-classified form includes: determining the dimension of a rule vector of the to-be-classified table according to the number of preset table classification rules, wherein each dimension in the rule vector corresponds to one table classification rule, and the initial value of each dimension in the rule vector is 0; and sequentially judging whether the text content in the to-be-classified table meets the table classification rules, and setting the numerical value of the dimension corresponding to the target table classification rule in the rule vector of the to-be-classified table to be 1 when detecting that the text content in the to-be-classified table meets the target table classification rule.

Optionally, the dimension of the rule vector of the table to be classified is determined according to the number of preset table classification rules, and the initial rule vector of the table to be classified is obtained. Illustratively, if the number of preset table classification rules is 3, the dimension of the rule vector of the table to be classified is determined to be 3, and the initial rule vector [0, 0, 0] of the table to be classified is obtained. The 3 dimensions in the rule vector correspond to the 3 table classification rules, respectively.

Optionally, sequentially judging whether the text content in the form to be classified meets each form classification rule includes: acquiring one form classification rule at a time as the currently processed form classification rule; judging whether the text content of the field in the form to be classified whose field name is consistent with the field name in the currently processed form classification rule contains the keyword in the currently processed form classification rule; if it does, determining that the text content in the form to be classified meets the currently processed form classification rule; and returning to the operation of acquiring one form classification rule as the currently processed form classification rule until all form classification rules have been processed.

Optionally, after judging whether the text content of the field in the form to be classified whose field name is consistent with the field name in the currently processed form classification rule contains the keyword in the currently processed form classification rule, the method further includes: if that text content does not contain the keyword in the currently processed form classification rule, determining that the text content in the form to be classified does not meet the currently processed form classification rule.

Optionally, the method further comprises: when it is detected that the text content in the table to be classified does not meet a target table classification rule, keeping the value of the dimension corresponding to the target table classification rule in the rule vector of the table to be classified at 0.

In one specific example, 3 table classification rules are set according to the table classification rule setting information input by the user: a first, a second and a third table classification rule. The first table classification rule has the field name "primary key keyword", the keyword "account number" and the classification label "contract". The second table classification rule has the field name "physical subsystem", the keyword "warranty" and the classification label "public". The third table classification rule has the field name "physical subsystem", the keyword "product line" and the classification label "trade financing". Since the number of preset table classification rules is 3, the dimension of the rule vector of the table to be classified is determined to be 3, and the initial rule vector [0, 0, 0] of the table to be classified is obtained. The 3 dimensions in the rule vector correspond to the first, second and third table classification rules, respectively. Whether the text content in the table to be classified meets the first, second and third table classification rules is then judged in sequence. When it is detected that the text content in the table to be classified meets the first table classification rule, the value of the dimension corresponding to the first table classification rule in the rule vector is set to 1. When it is detected that the text content does not meet the second table classification rule, the value of the dimension corresponding to the second table classification rule is kept at 0. When it is detected that the text content does not meet the third table classification rule, the value of the dimension corresponding to the third table classification rule is kept at 0. The rule vector [1, 0, 0] of the table to be classified is thereby generated.
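The rule vector generation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation of the embodiment: the tuple layout of a rule, the function name `build_rule_vector` and the sample table contents are assumptions.

```python
def build_rule_vector(table, rules):
    """Return a 0/1 vector with one dimension per table classification rule."""
    vector = [0] * len(rules)  # every dimension starts at 0
    for index, (field_name, keyword, _label) in enumerate(rules):
        # A rule is met when the named field exists and its text contains the keyword.
        if keyword in table.get(field_name, ""):
            vector[index] = 1
    return vector


# Rules as (field name, keyword, classification label), following the example above.
rules = [
    ("primary key keyword", "account number", "contract"),
    ("physical subsystem", "warranty", "public"),
    ("physical subsystem", "product line", "trade financing"),
]

# Hypothetical table contents chosen so that only the first rule is met.
table = {
    "primary key keyword": "customer account number",
    "physical subsystem": "hosting",
}

rule_vector = build_rule_vector(table, rules)
```

The classification label of a rule is not used when building the vector; only the position of the rule matters, so the order of the rules must be the same for every table.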

Step 105, splicing the basic text feature vector and the rule vector to obtain the text feature vector of the table to be classified.

The text feature vector of the table to be classified is the text feature vector used to determine the classification label of the table to be classified. Splicing the basic text feature vector and the rule vector means joining the two into a single vector, and the spliced result is taken as the text feature vector of the table to be classified.

In a specific example, the basic text feature vector of the table to be classified is a and its rule vector is b; splicing the basic text feature vector a and the rule vector b yields the text feature vector A = [a; b] of the table to be classified.
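A minimal sketch of the splicing step, assuming both vectors are held as plain Python lists; the numeric values of the basic text feature vector and the function name `splice_vectors` are hypothetical.

```python
def splice_vectors(basic_vector, rule_vector):
    """Join the basic text feature vector and the rule vector into one vector."""
    return list(basic_vector) + list(rule_vector)


a = [0.0, 0.231, 0.0]  # hypothetical basic text feature vector
b = [1, 0, 0]          # rule vector
A = splice_vectors(a, b)
```

The spliced vector has as many dimensions as a and b together, so every table must be vectorized over the same vocabulary and the same ordered rule list for the dimensions to line up.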

Form data can already be classified by taking the basic text feature vector of the form to be classified, obtained through the text feature vectorization algorithm, as the input of a random forest classification model. However, a text feature vectorization algorithm measures the importance of words only in a statistical sense. Rules for classifying forms, set according to information input by the user, are therefore taken into account and added to the text feature vector. The specific method is as follows: at least one form classification rule is set according to form classification rule setting information input by the user, where each form classification rule comprises a field name, a keyword and a classification label. The dimension of the rule vector of the form to be classified is determined according to the number of preset form classification rules; each dimension in the rule vector corresponds to one form classification rule and has an initial value of 0. Whether the text content in the form to be classified meets each form classification rule is then judged in sequence: when it is detected that the text content meets a target form classification rule, the value of the dimension corresponding to that rule in the rule vector is set to 1; when it is detected that the text content does not meet a target form classification rule, the value of the corresponding dimension is kept at 0. A rule vector is thus formed for each form to be classified and spliced with the basic text feature vector of that form, obtained through the text feature vectorization algorithm, to serve as the input of the random forest classification model.

The advantage of combining a text feature vectorization algorithm with rules is that the generated text feature vector of the form to be classified measures the importance of each word in a statistical sense, while the introduced rules add a further constraint, so that the generated text feature vector is better suited to form data.

Step 106, inputting the text feature vector of the form to be classified into a pre-trained random forest classification model to obtain a classification label of the form to be classified.

The pre-trained random forest classification model is a classification model obtained by training by using a random forest algorithm according to training samples formed by text feature vectors and classification labels of a preset number of tables. The input of the random forest classification model is the text feature vector of the table, and the output is the classification label of the table.

Optionally, the text feature vector of the form to be classified is input into the pre-trained random forest classification model, which analyzes the text feature vector and outputs the classification label of the form to be classified according to the mapping relation between text feature vectors and classification labels. The classification label of the form to be classified is the classification result of the form to be classified.

The random forest algorithm is an ensemble learning method that makes decisions by combining a series of classifiers, in the expectation of reaching the most "fair" result. Classifying forms with a pre-trained random forest classification model can mitigate the limited accuracy of a single model.
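The idea of combining a series of classifiers by voting can be illustrated with a deliberately simplified sketch. It is an assumption-laden toy, not the model of the embodiment: every tree has depth 1 and looks at a single randomly chosen feature, each tree is trained on a bootstrap sample, and the forest predicts by majority vote. All function names, the training rows and the labels are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature):
    """Fit a depth-1 tree: split one feature at the midpoint of its value range."""
    values = [row[feature] for row in rows]
    threshold = (min(values) + max(values)) / 2
    left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
    fallback = majority(labels)  # used when one side of the split is empty
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], feature))
    return forest


def predict(forest, row):
    """Let every tree vote and return the majority label."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Hypothetical spliced vectors: [one TF-IDF weight, one rule flag].
rows = [[0.0, 1]] * 4 + [[0.3, 0]] * 4
labels = ["contract"] * 4 + ["trade financing"] * 4
forest = train_forest(rows, labels)
```

A practical random forest grows deeper trees and considers a random subset of features at each split; a library implementation such as scikit-learn's `RandomForestClassifier` would normally be used instead of this toy.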

Optionally, the method further comprises: acquiring text feature vectors and classification labels of a preset number of tables as training samples; and training a random forest classification model according to the training samples, where the input of the random forest classification model is the text feature vector of a table and the output is the classification label of the table.

Optionally, acquiring text feature vectors and classification labels of a preset number of tables as training samples includes: acquiring a preset number of tables; performing word segmentation on the text of each table; vectorizing the segmented text of each table by using a text feature vectorization algorithm to obtain the basic text feature vector of each table; generating the rule vector of each table according to the preset table classification rules and the text content in each table; splicing the basic text feature vector and the rule vector of each table to obtain the text feature vector of each table; sending each table to a manual labeling platform, so that the manual labeling platform labels each table with a classification label and feeds the labels back; acquiring the labeled classification label of each table fed back by the manual labeling platform; and taking the text feature vector and the classification label of each table as training samples. The text feature vector and the classification label of each table form one set of training data in the training samples.
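The final pairing step, in which each table's text feature vector is combined with the label fed back by the manual labeling platform, might look like the sketch below. The table identifiers, the dictionary layout and the choice to skip tables without a returned label are assumptions made for illustration.

```python
def build_training_samples(feature_vectors, labels_by_table):
    """Pair each table's text feature vector with its manually assigned label."""
    samples = []
    for table_id, vector in feature_vectors.items():
        # Assumption: tables for which no label has been fed back are skipped.
        if table_id in labels_by_table:
            samples.append((vector, labels_by_table[table_id]))
    return samples


# Hypothetical text feature vectors keyed by a hypothetical table identifier.
feature_vectors = {
    "t1": [0.0, 0.231, 1, 0, 0],
    "t2": [0.116, 0.0, 0, 0, 1],
    "t3": [0.0, 0.0, 0, 0, 0],
}
# Hypothetical labels fed back by the manual labeling platform.
labels_by_table = {"t1": "contract", "t2": "trade financing"}

training_samples = build_training_samples(feature_vectors, labels_by_table)
```

Each element of `training_samples` is one set of training data: a text feature vector together with its classification label.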

The embodiment of the invention provides a table classification method. Text content of a target field of a table to be classified is acquired as the text of the table to be classified; word segmentation is performed on that text; and the segmented text is vectorized using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified. A rule vector of the table to be classified is generated according to preset table classification rules and the text content in the table to be classified, and the basic text feature vector and the rule vector are spliced to obtain the text feature vector of the table to be classified. Finally, the text feature vector is input into a pre-trained random forest classification model to obtain the classification label of the table to be classified. By combining a text feature vectorization algorithm with rules, a text feature vector for determining the table type is generated from the text content in the table and used as the input of the random forest classification model. The table classification process can therefore be completed automatically according to the text content in the table, based on the random forest algorithm and the preset table classification rules, which greatly reduces the dependence on manual work, alleviates the low efficiency of manual labeling, improves table classification efficiency, and ensures a certain classification accuracy.

Example two

Fig. 2 is a flowchart of a table classification method according to a second embodiment of the present invention. This embodiment may be combined with the various alternatives in one or more of the embodiments described above.

As shown in fig. 2, the method in the embodiment of the present invention specifically includes:

Step 201, acquiring text content of a target field of a table to be classified as the text of the table to be classified.

Step 202, performing word segmentation on the text of the table to be classified using the jieba word segmentation tool.
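A word segmentation step of this kind might look like the sketch below. It assumes the third-party `jieba` package and its `jieba.lcut` function. The regular-expression fallback and the function name `segment` are illustrative additions, not part of the source.

```python
import re

try:
    import jieba  # third-party package, assumed to be installed (pip install jieba)
except ImportError:
    jieba = None


def segment(text):
    """Split table text into a list of word tokens, dropping whitespace tokens."""
    if jieba is not None:
        return [token for token in jieba.lcut(text) if token.strip()]
    # Fallback for environments without jieba: split on runs of word characters.
    return re.findall(r"\w+", text)


# Example: text content taken from a hypothetical target field of a table.
tokens = segment("个人贷款合同表")
print(tokens)
```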

Step 203, vectorizing the segmented text of the table to be classified using a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a basic text feature vector of the table to be classified.
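For illustration, the vectorization step can be sketched in plain Python. The source does not give a formula, so the textbook weighting is assumed here: tf is the token count divided by the document length, and idf is log(N / df). The function name `tfidf_vectors` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors), one TF-IDF vector per tokenized text."""
    vocabulary = sorted({token for doc in tokenized_docs for token in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: in how many texts each token appears at least once.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty text
        vectors.append([
            (counts[token] / total) * math.log(n_docs / doc_freq[token])
            for token in vocabulary
        ])
    return vocabulary, vectors


# Segmented text of two hypothetical tables.
docs = [["loan", "contract"], ["deposit", "contract"]]
vocabulary, vectors = tfidf_vectors(docs)
print(vocabulary)
print(vectors)
```

A token that appears in every text, such as "contract" here, receives a weight of zero, which is the sense in which the weighting measures the importance of a word statistically. When a new table is classified, its vector would need to use the same vocabulary as the training texts.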

Step 204, generating rule vectors of the to-be-classified form according to preset form classification rules and text contents in the to-be-classified form.

Step 205, splicing the basic text feature vector and the rule vector to obtain the text feature vector of the table to be classified.

The basic text feature vector of the table to be classified, obtained through the term frequency-inverse document frequency (TF-IDF) algorithm, can by itself be used as the input of the random forest classification model to classify table data. However, the TF-IDF algorithm measures the importance of words only from a statistical perspective. Rule information set according to user input is therefore added to the text feature vector, specifically as follows. At least one table classification rule is set according to table classification rule setting information input by a user, wherein each table classification rule comprises a field name, a keyword and a classification label. The dimension of the rule vector of the table to be classified is determined according to the number of preset table classification rules, wherein each dimension in the rule vector corresponds to one table classification rule and the initial value of each dimension is 0. Whether the text content in the table to be classified meets each table classification rule is then judged in sequence: when the text content is detected to meet a target table classification rule, the value of the dimension corresponding to that rule in the rule vector is set to 1; when the text content is detected not to meet the target table classification rule, the value of that dimension is kept at 0. In this way, a rule vector is formed for each table to be classified, and the rule vector is spliced with the basic text feature vector obtained through the TF-IDF algorithm to serve as the input of the random forest classification model.
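The rule vector construction and the splicing step can be sketched as follows. This is a minimal illustration, assuming a table is represented as a dictionary mapping field names to text content. The names `TableRule`, `rule_vector` and `splice`, and the sample rules, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRule:
    """One table classification rule: field name, keyword and classification label."""
    field_name: str
    keyword: str
    label: str


def rule_vector(table, rules):
    """Build a 0/1 vector with one dimension per rule, initialized to 0."""
    vector = [0] * len(rules)
    for index, rule in enumerate(rules):
        # Look up the field whose name matches the field name in the rule.
        content = table.get(rule.field_name, "")
        if rule.keyword in content:
            vector[index] = 1  # the table meets this rule
    return vector


def splice(basic_vector, rule_vec):
    """Concatenate the basic text feature vector and the rule vector."""
    return list(basic_vector) + list(rule_vec)


rules = [
    TableRule("table_name", "loan", "loan"),
    TableRule("comment", "deposit", "deposit"),
]
table = {"table_name": "personal loan ledger", "comment": "monthly balances"}

rule_vec = rule_vector(table, rules)
features = splice([0.0, 0.35], rule_vec)
print(rule_vec, features)
```

The classification label stored in each rule is not used when the vector is built; the rule only contributes a 0 or 1 feature, and the random forest model decides how much weight that feature carries.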

The method combines the term frequency-inverse document frequency (TF-IDF) algorithm with rules. The generated text feature vector of the table to be classified not only measures the importance of each word in a statistical sense, but also introduces rules as a further constraint, so that the generated text feature vector is better suited to table data.

Step 206, inputting the text feature vector of the table to be classified into a pre-trained random forest classification model to obtain a classification label of the table to be classified.

The embodiment of the invention provides a table classification method. Text content of a target field of a table to be classified is acquired as the text of the table to be classified; word segmentation is performed on that text using the jieba word segmentation tool; and the segmented text is vectorized using a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a basic text feature vector of the table to be classified. A rule vector of the table to be classified is generated according to preset table classification rules and the text content in the table to be classified, and the basic text feature vector and the rule vector are spliced to obtain the text feature vector of the table to be classified. Finally, the text feature vector is input into a pre-trained random forest classification model to obtain the classification label of the table to be classified. By combining the TF-IDF algorithm with rules, a text feature vector for determining the table type is generated from the text content in the table, so that the generated text feature vector is better suited to table data, and is used as the input of the random forest classification model. The table classification process can therefore be completed automatically according to the text content in the table, based on the random forest algorithm and the preset table classification rules, which greatly reduces the dependence on manual work, alleviates the low efficiency of manual labeling, improves table classification efficiency, and ensures a certain classification accuracy.

Example three

Fig. 3 is a schematic structural diagram of a table classifying device according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes: a text acquisition module 301, a text word segmentation module 302, a text vectorization module 303, a rule vector generation module 304, a vector stitching module 305, and a tag acquisition module 306.

The text acquisition module 301 is configured to acquire text content of a target field of a table to be classified as the text of the table to be classified; the text word segmentation module 302 is configured to perform word segmentation on the text of the table to be classified; the text vectorization module 303 is configured to vectorize the segmented text of the table to be classified using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified; the rule vector generation module 304 is configured to generate a rule vector of the table to be classified according to preset table classification rules and the text content in the table to be classified; the vector splicing module 305 is configured to splice the basic text feature vector and the rule vector to obtain a text feature vector of the table to be classified; and the tag acquisition module 306 is configured to input the text feature vector of the table to be classified into a pre-trained random forest classification model to obtain a classification label of the table to be classified.
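One way to picture how the six modules fit together is the sketch below, in which each module is an injected callable. The class name `TableClassifier`, its field names and the stub functions are hypothetical and stand in for real implementations of modules 301 to 306.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class TableClassifier:
    """Wires the modules together; each module is supplied as a callable."""
    get_text: Callable[[Dict[str, str]], str]                # module 301
    segment: Callable[[str], List[str]]                      # module 302
    vectorize: Callable[[List[str]], List[float]]            # module 303
    make_rule_vector: Callable[[Dict[str, str]], List[int]]  # module 304
    predict_label: Callable[[Sequence[float]], str]          # module 306

    def classify(self, table: Dict[str, str]) -> str:
        tokens = self.segment(self.get_text(table))
        basic = self.vectorize(tokens)
        rules = self.make_rule_vector(table)
        features = list(basic) + list(rules)                 # module 305: splicing
        return self.predict_label(features)


# Stub modules for demonstration only; a real system would plug in a word
# segmentation tool, a text vectorizer and a trained random forest model.
classifier = TableClassifier(
    get_text=lambda table: table.get("table_name", ""),
    segment=str.split,
    vectorize=lambda tokens: [float(len(tokens))],
    make_rule_vector=lambda table: [1 if "loan" in table.get("table_name", "") else 0],
    predict_label=lambda features: "loan" if features[-1] == 1 else "other",
)

result = classifier.classify({"table_name": "personal loan ledger"})
print(result)
```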

The embodiment of the invention provides a table classification device. Text content of a target field of a table to be classified is acquired as the text of the table to be classified; word segmentation is performed on that text; and the segmented text is vectorized using a text feature vectorization algorithm to obtain a basic text feature vector of the table to be classified. A rule vector of the table to be classified is generated according to preset table classification rules and the text content in the table to be classified, and the basic text feature vector and the rule vector are spliced to obtain the text feature vector of the table to be classified. Finally, the text feature vector is input into a pre-trained random forest classification model to obtain the classification label of the table to be classified. By combining a text feature vectorization algorithm with rules, a text feature vector for determining the table type is generated from the text content in the table and used as the input of the random forest classification model. The table classification process can therefore be completed automatically according to the text content in the table, based on the random forest algorithm and the preset table classification rules, which greatly reduces the dependence on manual work, alleviates the low efficiency of manual labeling, improves table classification efficiency, and ensures a certain classification accuracy.

In an optional implementation manner of the embodiment of the present invention, optionally, the table classification device further includes: a training sample acquisition module, configured to acquire text feature vectors and classification labels of a preset number of tables as training samples; and a model training module, configured to train a random forest classification model according to the training samples; wherein the input of the random forest classification model is the text feature vector of a table, and the output is the classification label of the table.

In an optional implementation manner of the embodiment of the present invention, optionally, when performing an operation of acquiring text content of a target field of a table to be classified as text of the table to be classified, the text acquisition module 301 is specifically configured to: determining a target field of a form to be classified; and extracting the text content of the target field of the table to be classified as the text of the table to be classified.

In an optional implementation manner of the embodiment of the present invention, optionally, when performing an operation of determining a target field of a table to be classified, the text obtaining module 301 is specifically configured to: and determining the target field of the form to be classified according to the target field setting information input by the user.

In an optional implementation manner of the embodiment of the present invention, optionally, when performing an operation of determining a target field of a table to be classified, the text acquisition module 301 is specifically configured to: determine the target field of the table to be classified according to preset key field information.

In an optional implementation manner of the embodiment of the present invention, optionally, when performing an operation of word segmentation on the text of the table to be classified, the text word segmentation module 302 is specifically configured to: perform word segmentation on the text of the table to be classified using the jieba word segmentation tool.

In an optional implementation manner of the embodiment of the present invention, optionally, when performing the operation of vectorizing the segmented text of the table to be classified using a text feature vectorization algorithm, the text vectorization module 303 is specifically configured to: vectorize the segmented text of the table to be classified using a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a basic text feature vector of the table to be classified.

In an optional implementation manner of the embodiment of the present invention, optionally, when executing an operation of generating a rule vector of the to-be-classified form according to a preset form classification rule and text content in the to-be-classified form, the rule vector generation module 304 is specifically configured to: determining the dimension of a rule vector of the to-be-classified table according to the number of preset table classification rules, wherein each dimension in the rule vector corresponds to one table classification rule, and the initial value of each dimension in the rule vector is 0; and sequentially judging whether the text content in the to-be-classified table meets the table classification rules, and setting the numerical value of the dimension corresponding to the target table classification rule in the rule vector of the to-be-classified table to be 1 when detecting that the text content in the to-be-classified table meets the target table classification rule.

In an optional implementation of the embodiment of the present invention, optionally, the rule vector generation module 304 is further configured to: setting at least one form classification rule according to form classification rule setting information input by a user; wherein each table classification rule comprises a field name, a keyword and a classification label.

In an optional implementation manner of the embodiment of the present invention, optionally, when executing the operation of sequentially judging whether the text content in the table to be classified meets each table classification rule, the rule vector generation module 304 is specifically configured to: sequentially acquire one table classification rule as the currently processed table classification rule; judge whether the text content of the field in the table to be classified whose field name is consistent with the field name in the currently processed table classification rule contains the keyword in the currently processed table classification rule; if it does, determine that the text content in the table to be classified meets the currently processed table classification rule; and return to the operation of sequentially acquiring one table classification rule as the currently processed table classification rule until all table classification rules have been processed.

In an optional implementation of the embodiment of the present invention, optionally, the rule vector generation module 304 is further configured to: if the text content of the field in the table to be classified whose field name is consistent with the field name in the currently processed table classification rule does not contain the keyword in the currently processed table classification rule, determine that the text content in the table to be classified does not meet the currently processed table classification rule.

In an optional implementation of the embodiment of the present invention, optionally, the rule vector generation module 304 is further configured to: and when the text content in the table to be classified is detected to not meet the target table classification rule, determining that the numerical value of the dimension corresponding to the target table classification rule in the rule vector of the table to be classified is kept to be 0.

The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments, and will not be described in detail here.

The form classification device can execute the form classification method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the form classification method.

Example four

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. Fig. 4 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in fig. 4 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.

As shown in FIG. 4, the computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processor 16.

Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

Memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard disk drive"). Although not shown in fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.

The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown in fig. 4, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

The processor 16 implements the table classification method provided by the embodiment of the present invention by executing programs stored in the memory 28 to perform various functional applications and data processing: acquiring text content of a target field of a form to be classified as a text of the form to be classified; word segmentation processing is carried out on the text of the form to be classified; using a text feature vectorization algorithm to vectorize the text of the to-be-classified form after word segmentation to obtain a basic text feature vector of the to-be-classified form; generating rule vectors of the to-be-classified forms according to preset form classification rules and text contents in the to-be-classified forms; splicing the basic text feature vector and the rule vector to obtain the text feature vector of the form to be classified; and inputting the text feature vector of the form to be classified into a pre-trained random forest classification model to obtain a classification label of the form to be classified.

Example five

A fifth embodiment of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a table classification method provided by the embodiments of the present invention: acquiring text content of a target field of a form to be classified as a text of the form to be classified; word segmentation processing is carried out on the text of the form to be classified; using a text feature vectorization algorithm to vectorize the text of the to-be-classified form after word segmentation to obtain a basic text feature vector of the to-be-classified form; generating rule vectors of the to-be-classified forms according to preset form classification rules and text contents in the to-be-classified forms; splicing the basic text feature vector and the rule vector to obtain the text feature vector of the form to be classified; and inputting the text feature vector of the form to be classified into a pre-trained random forest classification model to obtain a classification label of the form to be classified.

Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present invention may be written in one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or computer device. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims

1. A method of classifying a form, comprising:

setting at least one form classification rule according to form classification rule setting information input by a user; wherein each table classification rule comprises a field name, a keyword and a classification label;

inputting the text feature vector of the form to be classified into a pre-trained random forest classification model to obtain a classification label of the form to be classified;

the generating a rule vector of the to-be-classified form according to a preset form classification rule and text content in the to-be-classified form includes: determining the dimension of a rule vector of the to-be-classified table according to the number of preset table classification rules, wherein each dimension in the rule vector corresponds to one table classification rule, and the initial value of each dimension in the rule vector is 0; sequentially judging whether text content in the to-be-classified table meets the table classification rules, and setting the numerical value of the dimension corresponding to the target table classification rule in the rule vector of the to-be-classified table to be 1 when detecting that the text content in the to-be-classified table meets the target table classification rule;

the sequentially judging whether the text content in the to-be-classified table meets the table classification rules comprises the following steps: sequentially acquiring a table classification rule as the currently processed table classification rule; judging whether the text content of the field in the to-be-classified table whose field name is consistent with the field name in the currently processed table classification rule contains the keyword in the currently processed table classification rule; if the text content of that field contains the keyword in the currently processed table classification rule, determining that the text content in the to-be-classified table meets the currently processed table classification rule; and returning to execute the operation of sequentially acquiring one table classification rule as the currently processed table classification rule until the processing of all table classification rules is completed.

2. The method as recited in claim 1, further comprising:

acquiring text feature vectors and classification labels of a preset number of tables as training samples;

training a random forest classification model according to the training sample;

wherein the input of the random forest classification model is the text feature vector of a table, and the output is the classification label of the table.

3. The method according to claim 1, wherein the obtaining text content of the target field of the form to be classified as the text of the form to be classified includes:

determining a target field of a form to be classified;

and extracting the text content of the target field of the table to be classified as the text of the table to be classified.

4. A method according to claim 3, wherein said determining the target field of the table to be classified comprises:

and determining the target field of the form to be classified according to the target field setting information input by the user.

5. A method according to claim 3, wherein said determining the target field of the table to be classified comprises:

and determining the target field of the table to be classified according to preset key field information.

6. The method according to claim 1, wherein the word segmentation of the text of the form to be classified comprises:

and performing word segmentation on the text of the form to be classified using the jieba word segmentation tool.

7. The method of claim 1, wherein the performing vectorization on the text of the to-be-classified form after word segmentation using a text feature vectorization algorithm to obtain a basic text feature vector of the to-be-classified form comprises:

and carrying out vectorization processing on the text of the to-be-classified form after word segmentation processing by using a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a basic text feature vector of the to-be-classified form.

8. The method according to claim 1, further comprising, after determining whether text contents of a field whose field name in the table to be classified coincides with a field name in the current process table classification rule contain keywords in the current process table classification rule:

and if the text content of the field in the table to be classified whose field name is consistent with the field name in the currently processed table classification rule does not contain the keyword in the currently processed table classification rule, determining that the text content in the table to be classified does not meet the currently processed table classification rule.

9. The method as recited in claim 1, further comprising:

and when it is detected that the text content in the form to be classified does not meet a target form classification rule, keeping the value of the dimension corresponding to the target form classification rule in the rule vector of the form to be classified at 0.
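As a non-limiting illustration, the rule vector generation referred to in claims 8 and 9 may be sketched as follows. The names `Rule` and `build_rule_vector`, and the representation of a form as a mapping from field name to text content, are assumptions made for this example only. Each rule carries a field name, a keyword and a classification label; the vector has one dimension per rule, every dimension starts at 0, and a dimension is set to 1 only when the field with the matching name contains the keyword.

```python
from typing import Dict, List, NamedTuple

class Rule(NamedTuple):
    """Hypothetical container for one form classification rule."""
    field_name: str
    keyword: str
    label: str

def build_rule_vector(form: Dict[str, str], rules: List[Rule]) -> List[int]:
    # One dimension per rule, each initialised to 0.
    vector = [0] * len(rules)
    # Process the rules one at a time, in order.
    for i, rule in enumerate(rules):
        text = form.get(rule.field_name)
        # Rule is met only if the named field exists and contains the keyword.
        if text is not None and rule.keyword in text:
            vector[i] = 1
        # Otherwise the dimension is left at its initial value of 0.
    return vector

# Illustrative form and rules (all values are made up for the example).
form = {"table_name": "customer loan contract", "remarks": "monthly"}
rules = [
    Rule("table_name", "loan", "credit"),
    Rule("table_name", "deposit", "savings"),
    Rule("owner", "loan", "credit"),
]
rule_vector = build_rule_vector(form, rules)
```

Here the first rule is met, the second fails because the keyword is absent, and the third fails because the form has no field named "owner", so the second and third dimensions stay at 0.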

10. A form classification apparatus, comprising:

the rule vector generation module is used for setting at least one form classification rule according to the form classification rule setting information input by the user; wherein each table classification rule comprises a field name, a keyword and a classification label; generating rule vectors of the to-be-classified forms according to preset form classification rules and text contents in the to-be-classified forms;

the label acquisition module is used for inputting the text feature vector of the form to be classified into a pre-trained random forest classification model to obtain a classification label of the form to be classified;

the rule vector generation module is specifically configured to, when executing an operation of generating a rule vector of the to-be-classified form according to a preset form classification rule and text content in the to-be-classified form: determining the dimension of a rule vector of the to-be-classified table according to the number of preset table classification rules, wherein each dimension in the rule vector corresponds to one table classification rule, and the initial value of each dimension in the rule vector is 0; sequentially judging whether text content in the to-be-classified table meets the table classification rules, and setting the numerical value of the dimension corresponding to the target table classification rule in the rule vector of the to-be-classified table to be 1 when detecting that the text content in the to-be-classified table meets the target table classification rule;

the rule vector generation module is specifically configured to, when executing the operation of sequentially judging whether text content in the to-be-classified form meets each form classification rule: sequentially acquiring a form classification rule as the currently processed form classification rule; judging whether the text content of the field in the to-be-classified form whose field name is consistent with the field name in the currently processed form classification rule contains the keyword in the currently processed form classification rule; if the text content of the field in the to-be-classified form whose field name is consistent with the field name in the currently processed form classification rule contains the keyword in the currently processed form classification rule, determining that the text content in the to-be-classified form meets the currently processed form classification rule; and returning to execute the operation of sequentially acquiring a form classification rule as the currently processed form classification rule until the processing of all form classification rules is completed.

11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the form classification method of any one of claims 1-9.

12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the form classification method according to any one of claims 1-9.