CN117131196A

CN117131196A - Text processing method and system

Info

Publication number: CN117131196A
Application number: CN202311227355.XA
Authority: CN
Inventors: 储铭钧
Original assignee: Shanghai Chenghu Information Technology Co ltd
Current assignee: China Unicom WO Music and Culture Co Ltd
Priority date: 2023-09-21
Filing date: 2023-09-21
Publication date: 2023-11-28
Anticipated expiration: 2043-09-21

Abstract

The application provides a text processing method and a text processing system, relating to the field of text processing. The method comprises: generating document table layout features and matching them against a table template library to obtain table template matching results; when the number of table template matching results is greater than 1, performing feature analysis on the table template matching results to generate template text semantic vectors and semantic vector layout features; sorting the table template matching results to generate first template sorting results; when the number of first template sorting results is greater than 1, traversing the first template sorting results to perform feature analysis and generate a first filling character attribute vector and a first attribute vector layout feature; analyzing the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature, and sorting the first template sorting results to generate second template sorting results; and when the number of second template sorting results is equal to 1, classifying the text of the first document to be processed according to the second template sorting result. The technical problem of low processing efficiency in the prior art is thereby solved.

Description

Text processing method and system

Technical Field

The application relates to the technical field of text processing, in particular to a text processing method and a text processing system.

Background

With the advance of informatization, the volume of data that is circulated, processed and stored has increased greatly. Accurate automatic classification of such data is an important prerequisite for rapid retrieval of related information at a later stage, and text processing is an important part of automatic classification.

Traditional text processing for automatic classification relies on a large model to perform text classification, so the demand for computing power is high; when documents are classified automatically in batches, response is slow and processing efficiency is low.

Disclosure of Invention

The application provides a text processing method and a text processing system, which are used to solve the technical problems of slow response and low processing efficiency caused by the high computing power demand of text processing for automatic classification in the prior art.

In view of the above problems, the present application provides a text processing method and system.

In a first aspect of the present application, there is provided a text processing method, comprising: performing primary feature analysis on a first document to be processed to generate document table layout features; traversing a table template library to perform template recognition in combination with the document table layout features to generate table template matching results; when the number of table template matching results is greater than 1, performing secondary feature analysis on the table template matching results to generate template text semantic vectors and semantic vector layout features; sorting the table template matching results in combination with the template text semantic vectors and the semantic vector layout features to generate first template sorting results; when the number of first template sorting results is greater than 1, traversing the first template sorting results to perform three-level feature analysis and generate a first filling character attribute vector and a first attribute vector layout feature; performing four-level feature analysis on the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature; sorting the first template sorting results by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature to generate second template sorting results; and when the number of second template sorting results is equal to 1, classifying the text of the first document to be processed according to the second template sorting result.

In a second aspect of the present application, there is provided a text processing system comprising: a first feature processing unit, used for performing primary feature analysis on a first document to be processed to generate document table layout features; a template matching unit, used for traversing a table template library to perform template recognition in combination with the document table layout features to generate table template matching results; a second feature processing unit, used for performing secondary feature analysis on the table template matching results when the number of table template matching results is greater than 1, so as to generate template text semantic vectors and semantic vector layout features; a first sorting unit, used for sorting the table template matching results in combination with the template text semantic vectors and the semantic vector layout features to generate first template sorting results; a third feature processing unit, used for traversing the first template sorting results to perform three-level feature analysis when the number of first template sorting results is greater than 1, and generating a first filling character attribute vector and a first attribute vector layout feature; a fourth feature processing unit, used for performing four-level feature analysis on the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature; a second sorting unit, used for sorting the first template sorting results by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature to generate second template sorting results; and a first execution unit, used for classifying the text of the first document to be processed according to the second template sorting result when the number of second template sorting results is equal to 1.

One or more technical schemes provided by the application have at least the following technical effects or advantages:

According to the application, text features are divided into three kinds of feature information, namely the table layout features, the template semantic features of the unfilled content and the semantic features of the filled content, and a text sorting algorithm over these three kinds of feature information is constructed so that sorting is carried out layer by layer. Sorting is first performed according to the table layout features; if the returned result is not unique, the template semantic features of the unfilled content are activated for sorting; and if the classification result is still not unique, the semantic features of the filled content are finally activated for sorting. The computing power demand increases stage by stage through the multi-stage sorting process, but the amount of data to be processed decreases stage by stage, so that, compared with traditional single-stage text processing, the method has the technical effect of higher processing efficiency.

Drawings

FIG. 1 is a schematic flow chart of a text processing method provided by the application;

FIG. 2 is a schematic flow chart of obtaining the table template matching result in a text processing method provided by the application;

FIG. 3 is a schematic structural diagram of a text processing system provided by the application.

Reference numerals illustrate: a first feature processing unit 100, a template matching unit 200, a second feature processing unit 300, a first sorting unit 400, a third feature processing unit 500, a fourth feature processing unit 600, a second sorting unit 700, and a first execution unit 800.

Detailed Description

The embodiment of the application provides a text processing method and a text processing system, which divide text features into three kinds of feature information, namely the table layout features, the template semantic features of the unfilled content and the semantic features of the filled content, and construct a text sorting algorithm over these three kinds of feature information so that sorting is carried out layer by layer. Sorting is first performed according to the table layout features; if the returned result is not unique, the template semantic features of the unfilled content are activated for sorting; and if the classification result is still not unique, the semantic features of the filled content are finally activated for sorting. The computing power demand increases stage by stage through the multi-stage sorting process, but the amount of data to be processed decreases stage by stage. Compared with traditional single-stage text processing, the method has the technical effect of higher processing efficiency, and solves the technical problems of slow response and low processing efficiency caused by the high computing power demand of text processing for automatic classification in the prior art.

Example 1

As shown in fig. 1, the present application provides a text processing method, including the steps of:

S10: carrying out primary feature analysis on the first document to be processed to generate document table layout features;

Specifically, the first document to be processed in the embodiment of the application refers to a type of document that is filled in on the basis of a table template, for example: table documents used in the workflows of various projects, banks and the like. The document table layout features refer to stored information representing the positions at which the cells of the table document are distributed. Preferably, any table document is segmented in order from left to right and from top to bottom to obtain cells arranged in that order, and the stored information of each cell is saved in sequence, wherein the stored information at least comprises the four corner coordinates and the center point coordinate of the cell.
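The stored information described above can be illustrated with a minimal sketch. The names `Cell`, `cell_from_box` and `layout_features` are hypothetical and do not appear in the application; the sketch further assumes axis-aligned cells and image-style coordinates in which y grows downward, so that ordering by the top edge and then the left edge yields the left-to-right, top-to-bottom order.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Cell:
    """One table cell, stored as its four corner coordinates."""
    corners: Tuple[Point, Point, Point, Point]

    @property
    def center(self) -> Point:
        # Center point taken as the mean of the four corners.
        xs = [p[0] for p in self.corners]
        ys = [p[1] for p in self.corners]
        return (sum(xs) / 4.0, sum(ys) / 4.0)

    @property
    def top_left(self) -> Point:
        return (min(p[0] for p in self.corners), min(p[1] for p in self.corners))


def cell_from_box(left: float, top: float, right: float, bottom: float) -> Cell:
    """Build a Cell from an axis-aligned bounding box (assumed input form)."""
    return Cell(((left, top), (right, top), (right, bottom), (left, bottom)))


def layout_features(cells: List[Cell]) -> List[Cell]:
    """Order cells top to bottom, and left to right within the same top edge."""
    return sorted(cells, key=lambda c: (c.top_left[1], c.top_left[0]))
```

Each ordered `Cell` carries both the corner coordinates and the center point, which are the two pieces of stored information named in the embodiment.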

S20: traversing a table template library to perform template recognition by combining the document table layout characteristics to generate a table template matching result;

Specifically, at any user end applying the text processing method, the various table templates that the user needs to archive are stored and recorded as a table template library; that is, any table sent by the user should have a unique corresponding matching template in the table template library. Preferably, the table template library is placed in cloud storage, and the user needs to update the table template library periodically to ensure that the templates are current and to avoid storing invalid templates or missing templates.

The table template matching result refers to the template types determined to be relatively similar by performing similarity analysis between the document table layout features and the table layout features of the templates in the table template library. Because templates with different fill-in content may have identical table layouts, the table template matching result sorted according to the document table layout features may not be unique.

If the number of table template matching results is equal to 1, the matched template is output directly to guide the classified storage of the text. Unlike traditional text processing, no complicated semantic analysis is needed in this case; only a comparison of table layouts is required, which improves document classification efficiency.

If the number of table template matching results is greater than 1, sorting by table layout has still reduced the number of candidate templates, which lowers the computational load of the semantic-level sorting in the subsequent steps and thereby improves document classification efficiency.
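The early-exit behaviour described in the two cases above can be sketched as follows. This is an illustrative sketch only: `cascade_sort` and the `Stage` callable type are hypothetical names, the templates are represented by plain strings, and each stage is assumed to be a function that returns the subset of candidates it retains.

```python
from typing import Callable, List, Sequence, Tuple

# A stage takes the document and the current candidates and returns the survivors.
Stage = Callable[[object, List[str]], List[str]]


def cascade_sort(document: object,
                 templates: Sequence[str],
                 stages: Sequence[Stage]) -> Tuple[List[str], int]:
    """Run stages in order of increasing cost; stop once at most one candidate remains.

    Returns the remaining candidates and the number of stages actually run.
    """
    candidates = list(templates)
    stages_run = 0
    for stage in stages:
        candidates = stage(document, candidates)
        stages_run += 1
        if len(candidates) <= 1:
            # Unique (or empty) result: the more expensive later stages are skipped.
            break
    return candidates, stages_run
```

Ordering the stages from cheapest to most expensive means the costly analysis only ever sees the candidates that the cheaper stages could not separate.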

Further, as shown in fig. 2, in combination with the document table layout feature, traversing a table template library to perform template recognition, and generating a table template matching result, including:

evaluating the character arrangement direction of the first document to be processed to generate a first character arrangement direction;

extracting the character arrangement direction of a first table template of the table template library to generate a second character arrangement direction;

performing homodirectional corner alignment on the document table layout features and the first template table layout features based on the first character arrangement direction and the second character arrangement direction to generate a corner alignment result;

analyzing the similarity of the document table layout characteristics and the first template table layout characteristics according to the corner alignment result to generate a first template similarity coefficient;

and when the first template similarity coefficient meets a first similarity coefficient threshold, adding the first table template into the table template matching result.

Specifically, to compare whether the table layouts are identical, the first document to be processed and the templates of the table template library first need to be aligned. The alignment mode given by the embodiment of the application is preferably as follows:

The first character arrangement direction refers to the layout direction of the text of the first document to be processed, that is, the orientation in which the text is easy to recognize, whether the text is English, Chinese or other characters. Preferably, the character arrangement direction is evaluated by a convolutional neural network: a plurality of character samples are collected, each character is labeled with its easy-to-recognize direction, and the convolutional neural network is then trained; when, for 50 consecutive training iterations, the deviation between the character arrangement direction output by the convolutional neural network and the labeled arrangement direction is less than or equal to a deviation threshold, a character arrangement direction evaluation model is generated. The arrangement directions of the characters of the first document to be processed are evaluated by the character arrangement direction evaluation model to generate a plurality of character arrangement directions, and the direction that appears most often among them is taken as the first character arrangement direction. Similarly, the character arrangement direction of a first table template of the table template library is extracted to generate a second character arrangement direction, wherein the first table template refers to any table template of the table template library. At the same time, the first template table layout features of the first table template are invoked.
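The final counting step, in which the most frequent per-character direction is taken as the arrangement direction, can be sketched as below. The function name `dominant_direction` is hypothetical, and directions are assumed to be encoded as rotation angles in degrees (for example 0, 90, 180, 270); the application does not specify the encoding. The convolutional neural network itself is not reproduced here.

```python
from collections import Counter
from typing import Iterable


def dominant_direction(per_character_directions: Iterable[int]) -> int:
    """Return the direction predicted most often across the characters."""
    counts = Counter(per_character_directions)
    if not counts:
        raise ValueError("no character directions were supplied")
    # most_common(1) returns a one-element list of (direction, frequency).
    return counts.most_common(1)[0][0]
```

Taking the majority vote makes the result tolerant of a few misjudged characters, since an isolated wrong prediction cannot change the most frequent direction.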

Further, according to the first character arrangement direction and the second character arrangement direction, the table direction of the first document to be processed is adjusted to be consistent with the table direction of the first table template. Then, for the document table layout features and the first template table layout features whose directions have been made consistent, the corner points of the corner cells are made to coincide, and the two sides adjoining each corner point are made to coincide, to obtain a corner alignment result, which facilitates accurate comparative analysis of the table layouts in the subsequent steps.

Further, according to the corner alignment result, the similarity of the document table layout features and the first template table layout features is analyzed to generate a first template similarity coefficient. After the corner points are aligned, the more cells of the document table layout features and the first template table layout features coincide, the larger the template similarity coefficient is; otherwise, the smaller it is. The first similarity coefficient threshold refers to a preset similarity coefficient threshold at or above which a template is regarded as matching, and is a user-defined threshold. When the first template similarity coefficient meets the first similarity coefficient threshold, namely the first template similarity coefficient is greater than or equal to the first similarity coefficient threshold, the first table template is added into the table template matching result. The other templates of the table template library are traversed and processed in the same way to complete the table template matching result.

Further, according to the corner alignment result, analyzing the similarity between the document table layout feature and the first template table layout feature to generate a first template similarity coefficient, including:

constructing a table layout feature similarity analysis function:

$X_1 = (x_{11}, x_{12}, \ldots, x_{1i}, \ldots, x_{1n})$;

$X_0 = (x_{01}, x_{02}, \ldots, x_{0i}, \ldots, x_{0m})$;

wherein $X_1$ represents the set of cell center point coordinates of the document table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{1i}$ represents the $i$-th cell center point coordinate of the document table layout features in that order; $n$ represents the number of cell center point coordinates of the document table layout features; $X_0$ represents the set of cell center point coordinates of the first template table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{0i}$ represents the $i$-th cell center point coordinate of the first template table layout features, in the same order as $x_{1i}$; $m$ represents the number of cell center point coordinates of the first template table layout features; $d_0$ represents the offset distance threshold; $a$ represents the offset quantity threshold; $\mathrm{count}()$ is a counting function; and $\mathrm{SIM}_1$ represents the table layout feature similarity;

and carrying out similarity analysis on the document table layout characteristics and the first template table layout characteristics according to the table layout characteristic similarity analysis function to generate the first template similarity coefficient.

Specifically, after the corner points are aligned, the higher the degree of similarity between the first document to be processed and the first table template, the more cells coincide, so a quantitative index for similarity analysis can be set based on the number of coinciding cells, which allows similar templates to be sorted quickly. A table layout feature similarity analysis function is constructed: $X_1 = (x_{11}, x_{12}, \ldots, x_{1i}, \ldots, x_{1n})$; $X_0 = (x_{01}, x_{02}, \ldots, x_{0i}, \ldots, x_{0m})$; wherein $X_1$ represents the set of cell center point coordinates of the document table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{1i}$ represents the $i$-th cell center point coordinate of the document table layout features in that order; $n$ represents the number of cell center point coordinates of the document table layout features; $X_0$ represents the set of cell center point coordinates of the first template table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{0i}$ represents the $i$-th cell center point coordinate of the first template table layout features, in the same order as $x_{1i}$; $m$ represents the number of cell center point coordinates of the first template table layout features; $d_0$ represents the offset distance threshold; $a$ represents the offset quantity threshold; $\mathrm{count}()$ is a counting function; and $\mathrm{SIM}_1$ represents the table layout feature similarity.

After the corner points are aligned, if two cells coincide, the center points of their diagonals also coincide, so the center point coordinates are used as the index for evaluating the degree of similarity between the table layout of the first document to be processed and that of the first table template. This is faster than analyzing entire cells. By comparing $|m - n|$ with the offset quantity threshold $a$, templates whose number of cells differs greatly from that of the document are eliminated directly, their similarity being regarded as 0; irrelevant data is thus removed quickly and calculation efficiency is improved. Similarity analysis is carried out on the document table layout features and the first template table layout features according to the table layout feature similarity analysis function to generate the first template similarity coefficient.
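One possible reading of the similarity analysis can be sketched as follows. The text states that the similarity is 0 when $|m - n|$ exceeds $a$ and that a count of center points within the offset distance $d_0$ is used; the exact expression for $\mathrm{SIM}_1$ is not reproduced in this text, so the Euclidean distance and the normalization by $\max(m, n)$ below are assumptions, and the function name `layout_similarity` is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def layout_similarity(doc_centers: List[Point],
                      tpl_centers: List[Point],
                      d0: float,
                      a: int) -> float:
    """Assumed form of SIM_1 over ordered, corner-aligned cell center points."""
    n, m = len(doc_centers), len(tpl_centers)
    # Cell counts differ by more than the offset quantity threshold: similarity 0.
    if abs(m - n) > a:
        return 0.0
    if max(m, n) == 0:
        return 0.0
    # count(): same-index center points whose offset is within d0.
    matched = sum(1 for p, q in zip(doc_centers, tpl_centers)
                  if math.dist(p, q) <= d0)
    # Normalizing by the larger cell count is an assumption of this sketch.
    return matched / max(m, n)
```

Because the count comparison runs before any distance is computed, templates with a very different number of cells are rejected at negligible cost.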

S30: when the number of the table template matching results is larger than 1, carrying out secondary feature analysis on the table template matching results to generate template text semantic vectors and semantic vector layout features;

Specifically, if the number of table template matching results is greater than 1, sorting cannot be completed by relying on the table layout alone, so partial text semantic features are added. The partial text semantic features are the text semantic vectors of the templates, stored as the template text semantic vectors, where a semantic vector is a code representing text content. The layout position of the semantic vector of each piece of text in the table is stored at the same time and recorded as the semantic vector layout features. Preferably, the semantic vector of any piece of text has two position parameters: the cell position and the serial number of the text within the cell; further, the cell position is preferably represented by the cell center coordinates. The template text refers to the initial text that has not been filled with user content. For example, for an identity registration form, the initial text may include "identity registration form", "name", "ethnicity" and so on. The table template matching results can be sorted a second time by calling these partial semantic features; compared with global semantic analysis, the analysis of partial semantic features reduces the computational load and helps improve overall document classification efficiency.

S40: sorting the table template matching result by combining the template text semantic vectors and the semantic vector layout features to generate a first template sorting result;

Further, sorting the table template matching result by combining the template text semantic vectors and the semantic vector layout features to generate a first template sorting result, including:

constructing a template text similarity analysis function:

$Y_0 = (y_{01}, y_{02}, \ldots, y_{0j}, \ldots, y_{0q})$;

$Y_1 = (y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1q})$;

wherein $Y_0$ represents the semantic binary-coded vectors at $q$ different table positions of the table template matching result; $Y_1$ represents the semantic binary-coded vectors of the first document to be processed at the $q$ table positions corresponding to the table template matching result; $A(y_{1j}, y_{0j})$ represents the Hamming distance between the semantic binary-coded vectors at the same position; $a_0$ represents the deviation distance threshold; and $\mathrm{SIM}_2$ represents the template text similarity;

according to the semantic vector layout characteristics, carrying out five-level characteristic analysis from the first document to be processed, and extracting semantic vectors of documents with the same layout;

according to the template text similarity analysis function, similarity analysis is carried out on the template semantic vector and the document semantic vector with the same layout, and a second template similarity coefficient is generated;

and when the second template similarity coefficient is greater than or equal to a second similarity coefficient threshold value, adding the table template matching result into the first template sorting result.

Specifically, the first template sorting result refers to the remaining templates obtained by further sorting the table template matching result according to the template text semantic vectors and the semantic vector layout features. The preferred sorting procedure is as follows:

A template text similarity analysis function is constructed: $Y_0 = (y_{01}, y_{02}, \ldots, y_{0j}, \ldots, y_{0q})$; $Y_1 = (y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1q})$; wherein $Y_0$ represents the semantic binary-coded vectors at $q$ different table positions of the table template matching result; $Y_1$ represents the semantic binary-coded vectors of the first document to be processed at the $q$ table positions corresponding to the table template matching result; $A(y_{1j}, y_{0j})$ represents the Hamming distance between the semantic binary-coded vectors at the same position; $a_0$ represents the deviation distance threshold; and $\mathrm{SIM}_2$ represents the template text similarity.

Based on the semantic vector layout features, the position coordinates of each template text semantic vector of any template are determined, namely the cell center point coordinates and the serial number within the cell. Based on the cell center point coordinates and the serial number within the cell, the semantic vector of the first document to be processed at the same coordinate position is matched; if the first document to be processed has no text at the same coordinate position, the semantic deviation at that position is regarded by default as greater than the deviation distance threshold. After the semantic vectors of the first document to be processed have been matched, they are stored as $Y_1$, and the semantic binary-coded vectors at the $q$ different table positions of the table template matching result are stored as $Y_0$. The template text similarity is then calculated according to the template text similarity analysis function and stored as the second template similarity coefficient. When the second template similarity coefficient is greater than or equal to the second similarity coefficient threshold, the analyzed table template matching result is added into the first template sorting result.
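A minimal sketch of the position-wise comparison follows. The text defines the Hamming distance $A(y_{1j}, y_{0j})$ and the deviation distance threshold $a_0$, but the closed form of $\mathrm{SIM}_2$ is not reproduced in this text; taking it as the fraction of the $q$ positions whose Hamming distance is within $a_0$ is an assumption. The names `hamming` and `template_text_similarity` are hypothetical, and binary codes are represented as equal-length strings of 0 and 1 for illustration.

```python
from typing import List, Optional


def hamming(code_a: str, code_b: str) -> int:
    """Number of positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have equal length")
    return sum(x != y for x, y in zip(code_a, code_b))


def template_text_similarity(template_codes: List[str],
                             document_codes: List[Optional[str]],
                             a0: int) -> float:
    """Assumed form of SIM_2: share of positions within Hamming distance a0.

    A document entry of None means no text exists at that position, which is
    treated as exceeding the deviation distance threshold.
    """
    if len(template_codes) != len(document_codes):
        raise ValueError("both sequences must cover the same q positions")
    q = len(template_codes)
    if q == 0:
        return 0.0
    close = 0
    for y0, y1 in zip(template_codes, document_codes):
        if y1 is None:
            continue  # missing text counts as a mismatch
        if hamming(y1, y0) <= a0:
            close += 1
    return close / q
```

With `a0 = 0` only exact code matches count, while a larger `a0` tolerates small encoding differences such as those caused by recognition noise.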

S50: when the number of the first template sorting results is greater than 1, traversing the first template sorting results to perform three-level feature analysis, and generating a first filling character attribute vector and a first attribute vector layout feature;

Specifically, when the number of first template sorting results is greater than 1, this indicates that the templates cannot be sorted by relying on the partial semantic vectors alone, so three-level feature analysis is performed on the first template sorting results to obtain a first filling character attribute vector and a first attribute vector layout feature. The filling character attribute refers to the data type of the text required to be filled in. For example, if the data attribute of a certain cell is gender but the content actually filled in is "XX ethnic group", the filled content is regarded as inconsistent. The attribute vector layout feature refers to the positions of the cells in which text is to be filled, and is preferably represented by the cell center point coordinates.

Attribute verification can be performed on the basis of the first filling character attribute vector and the filled text of the first document to be processed, and fill position verification can be performed on the basis of the first attribute vector layout feature and the filled text of the first document to be processed, so that the templates can be sorted further without semantic recognition of all the text. Preferably, the text attribute classification task is implemented by a text attribute calibration table: the attribute identifiers of a plurality of words or texts are stored in association to obtain the text attribute calibration table, which is convenient to call in later steps.
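The text attribute calibration table can be pictured as a simple lookup from words to attribute identifiers. The sketch below is illustrative only: the table contents, the attribute labels and the names `ATTRIBUTE_TABLE`, `attribute_of` and `attribute_matches` are hypothetical examples, not data from the application.

```python
from typing import Dict, Optional

# Hypothetical calibration table: word or text -> attribute identifier.
ATTRIBUTE_TABLE: Dict[str, str] = {
    "male": "gender",
    "female": "gender",
    "Han": "ethnicity",
    "Hui": "ethnicity",
}


def attribute_of(text: str,
                 table: Dict[str, str] = ATTRIBUTE_TABLE) -> Optional[str]:
    """Look up the attribute identifier of a filled-in text, if it is known."""
    return table.get(text.strip())


def attribute_matches(expected_attribute: str,
                      filled_text: str,
                      table: Dict[str, str] = ATTRIBUTE_TABLE) -> bool:
    """Check whether a filled-in text has the attribute the template cell requires."""
    return attribute_of(filled_text, table) == expected_attribute
```

A lookup of this kind only determines the data type of the filled text, which is much cheaper than full semantic recognition of the content.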

S60: performing four-level feature analysis on the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature;

Specifically, the second filling character attribute vector refers to the filling character attribute vector of each cell of the first document to be processed, and the second attribute vector layout feature refers to the cell center point coordinates corresponding to each filling character attribute vector of the first document to be processed.

S70: sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature to generate a second template sorting result;

Further, sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature, and generating a second template sorting result, including:

constructing a layout similarity analysis function and a vector similarity analysis function, wherein the layout similarity analysis function is the same as a table layout feature similarity analysis function, and the vector similarity analysis function is the same as the template text similarity analysis function;

performing similarity evaluation on the first attribute vector layout features and the second attribute vector layout features according to the layout similarity analysis function to generate a first layout similarity coefficient;

performing similarity evaluation on the first filling character attribute vector and the second filling character attribute vector according to the vector similarity analysis function to generate a first vector similarity coefficient;

extracting the first template sorting result that the first layout similarity coefficient is greater than or equal to a third similarity coefficient threshold value and the first vector similarity coefficient is greater than or equal to a fourth similarity coefficient threshold value, and adding the first template sorting result into the second template sorting result.

Specifically, the layout similarity analysis function is the same as the table layout feature similarity analysis function. That is, the first attribute vector layout features of any one template are ordered by cell from left to right and from top to bottom, the second attribute vector layout features are ordered by cell from left to right and from top to bottom, and the table layout feature similarity analysis function is then activated to calculate the first layout similarity coefficient.

The vector similarity analysis function is the same as the template text similarity analysis function. That is, the first filling character attribute vector of any one template is ordered by cell from left to right and from top to bottom, the second filling character attribute vector of the first document to be processed is ordered by cell from left to right and from top to bottom, and the template text similarity analysis function is then activated to obtain the first vector similarity coefficient. The first template sorting results whose first layout similarity coefficient is greater than or equal to the third similarity coefficient threshold and whose first vector similarity coefficient is greater than or equal to the fourth similarity coefficient threshold are extracted and added into the second template sorting result.
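The extraction step, in which a template is retained only when both coefficients meet their thresholds, can be sketched as below. The function name `second_sorting` and the dictionary representation of the per-template coefficients are hypothetical; the coefficients are assumed to have been computed beforehand.

```python
from typing import Dict, List, Tuple


def second_sorting(coefficients: Dict[str, Tuple[float, float]],
                   third_threshold: float,
                   fourth_threshold: float) -> List[str]:
    """Keep templates whose layout and vector coefficients both meet their thresholds.

    coefficients maps a template name to
    (first layout similarity coefficient, first vector similarity coefficient).
    """
    return [
        name
        for name, (layout_sim, vector_sim) in coefficients.items()
        if layout_sim >= third_threshold and vector_sim >= fourth_threshold
    ]
```

Requiring both conditions means a template with the right fill positions but the wrong attribute types, or the reverse, is excluded.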

S80: when the number of the second template sorting results is equal to 1, classifying the text of the first document to be processed according to the second template sorting result.

Further, the method further comprises the following steps:

when the number of the second template sorting results is greater than 1, adding the first layout similarity coefficient and the first vector similarity coefficient of the second template sorting results to generate a template sorting trend score;

and carrying out serialization adjustment on the second template sorting result based on the template sorting trend score, and sending the second template sorting result to a user side to obtain a third template sorting result, wherein the number of the third template sorting results is equal to 1.

Specifically, after multi-stage sorting, few second template sorting results should remain. When the number of the second template sorting results is equal to 1, text classification is performed on the first document to be processed according to the second template sorting result. When the number is greater than 1, the first layout similarity coefficient and the first vector similarity coefficient of each second template sorting result are added and stored as its template sorting trend score; the second template sorting results are then ordered by trend score from largest to smallest and sent to the user side, where the user screens them to obtain the third template sorting result, the number of which is equal to 1. At this point, if multi-level semantic analysis has not reduced the second template sorting results to a single one, a further round of complex semantic sorting would most likely also fail to separate them, so the results are handed directly to the user for selection; because few second template sorting results remain, the user's sorting time is short.
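The trend-score ordering can be illustrated with a short sketch. The tuple layout and the function name `rank_by_trend_score` are assumptions made for the example; only the rule itself (add the two coefficients, order from largest to smallest) comes from the description.

```python
from typing import List, Tuple


def rank_by_trend_score(
    candidates: List[Tuple[str, float, float]],
) -> List[Tuple[str, float]]:
    """Order templates by template sorting trend score, largest first.

    Each candidate is (template_id, layout_sim, vector_sim); the trend
    score is the sum of the two similarity coefficients.
    """
    scored = [(tid, layout_sim + vector_sim) for tid, layout_sim, vector_sim in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Illustrative coefficients only.
ranked = rank_by_trend_score([
    ("tpl_a", 0.75, 0.5),
    ("tpl_b", 0.75, 0.75),
    ("tpl_c", 0.5, 0.5),
])
```

The ranked list is what would be presented to the user side; the final choice of a single template is made by the user, not by the code.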

Further, the method further comprises the following steps:

when the number of the form template matching results is equal to 0, or the number of the first template sorting results is equal to 0, or the number of the second template sorting results is equal to 0, generating a text rechecking instruction, and sending the text rechecking instruction to a user side to recheck the first document to be processed;

and when the number of the table template matching results is equal to 1 or the number of the first template sorting results is equal to 1, classifying the text of the first document to be processed according to the table template matching results or the first template sorting results.

Specifically, when the number of the table template matching results is equal to 0, or the number of the first template sorting results is equal to 0, or the number of the second template sorting results is equal to 0, a text rechecking instruction is generated and sent to the user side to recheck the first document to be processed; that is, when no template matches, the document may have been uploaded in error and needs to be fed back to the user side for review. When the number of the table template matching results is equal to 1 or the number of the first template sorting results is equal to 1, text classification is performed on the first document to be processed according to the table template matching result or the first template sorting result.
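The handling of the candidate count after each stage can be summarized in a small dispatch function. The `Action` enum and the name `decide` are hypothetical; the sketch only encodes the three cases stated above (zero results, exactly one result, more than one result).

```python
from enum import Enum


class Action(Enum):
    RECHECK = "send text rechecking instruction to the user side"
    CLASSIFY = "classify the document with the single remaining template"
    CONTINUE = "more than one template remains; go to the next stage"


def decide(result_count: int) -> Action:
    """Map the number of remaining templates to the next action."""
    if result_count < 0:
        raise ValueError("result_count must be non-negative")
    if result_count == 0:
        return Action.RECHECK
    if result_count == 1:
        return Action.CLASSIFY
    return Action.CONTINUE
```

The same function applies after template matching, after the first sorting, and after the second sorting; after the second sorting, `CONTINUE` corresponds to ranking by trend score and handing the list to the user.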

In summary, the embodiment of the application has at least the following technical effects:

according to the embodiment of the application, text features are divided into three kinds of feature information: table layout features, semantic features of the template type (the unfilled content), and semantic features of the filled content. A text sorting algorithm is constructed over these three kinds of feature information and sorts layer by layer: sorting is first performed on the table layout features; if the returned result is not unique, the semantic features of the template type are then used for sorting; and if the result is still not unique, the semantic features of the filled content are finally used for sorting. The computing power required grows from stage to stage, but the amount of data processed shrinks from stage to stage, so the method achieves higher processing efficiency than traditional single-stage text processing.
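The efficiency argument rests on a cascade: cheap stages run on the whole template library, and costly stages run only on the survivors. A minimal sketch follows, assuming each stage is supplied as a callable that filters a list of template identifiers; the `Stage` type alias, the `cascade` function, and the toy filters are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A stage takes the current candidate templates and returns the survivors.
Stage = Callable[[List[str]], List[str]]


def cascade(templates: List[str], stages: Sequence[Stage]) -> Tuple[List[str], int]:
    """Run stages in order, stopping once 0 or 1 candidates remain.

    Returns the remaining candidates and the number of stages executed.
    """
    candidates = list(templates)
    stages_run = 0
    for stage in stages:
        candidates = stage(candidates)
        stages_run += 1
        if len(candidates) <= 1:
            break  # later, more expensive stages are skipped
    return candidates, stages_run


# Toy stand-ins for layout, template-text, and filled-content sorting.
toy_stages = [
    lambda c: [t for t in c if t.startswith("inv")],
    lambda c: [t for t in c if t.endswith("A")],
    lambda c: [],  # would only run if two or more candidates survived
]
remaining, used = cascade(["invA", "invB", "po"], toy_stages)
```

In this toy run the third stage is never executed, which mirrors the claim that the most expensive analysis is applied to the smallest amount of data, or not at all.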

Example two

Based on the same inventive concept as the text processing method in the previous embodiment, as shown in fig. 3, the present application provides a text processing system, including:

the first feature processing unit 100 is configured to perform primary feature analysis on a first document to be processed, and generate document table layout features;

the template matching unit 200 is used for traversing a table template library to perform template recognition in combination with the document table layout characteristics to generate a table template matching result;

the second feature processing unit 300 is configured to perform secondary feature analysis on the table template matching result when the number of the table template matching result is greater than 1, so as to generate a template text semantic vector and a semantic vector layout feature;

a first sorting unit 400, configured to sort the table template matching result by combining the template text semantic vector and the semantic vector layout feature, and generate a first template sorting result;

the third feature processing unit 500 is configured to traverse the first template sorting result to perform three-level feature analysis when the number of the first template sorting results is greater than 1, and generate a first filling text attribute vector and a first attribute vector layout feature;

a fourth feature processing unit 600, configured to perform four-level feature analysis on the first document to be processed, and generate a second filling text attribute vector and a second attribute vector layout feature;

a second sorting unit 700, configured to sort the first template sorting result by combining the first filling text attribute vector and the first attribute vector layout feature with the second filling text attribute vector and the second attribute vector layout feature, to generate a second template sorting result;

and the first execution unit 800 is configured to perform text classification on the first document to be processed according to the second template sorting result when the number of the second template sorting results is equal to 1.

Further, the template matching unit 200 performs the steps of:

constructing a table layout feature similarity analysis function:

$$X_1 = (x_{11}, x_{12}, \ldots, x_{1i}, \ldots, x_{1n});$$

$$X_0 = (x_{01}, x_{02}, \ldots, x_{0i}, \ldots, x_{0m});$$

wherein $X_1$ represents the set of table center point coordinates of the document table layout features, ordered from left to right and from top to bottom after corner alignment; $x_{1i}$ represents the $i$-th table center point coordinate of the document table layout features in that ordering; $n$ represents the number of table center point coordinates of the document table layout features; $X_0$ represents the set of table center point coordinates of the table layout features of the first template, ordered from left to right and from top to bottom after corner alignment; $x_{0i}$ represents the $i$-th table center point coordinate of the table layout features of the first template, in the same ordering as $x_{1i}$; $m$ represents the number of table center point coordinates of the table layout features of the first template; $d_0$ represents the offset distance threshold; $a$ represents the offset quantity threshold; count() is a counting function; and $SIM_1$ represents the table layout feature similarity;
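The expression for the table layout feature similarity itself is not reproduced in this passage, so the following is only a sketch under stated assumptions. It assumes that count() counts the co-ordered center points whose Euclidean offset exceeds the offset distance threshold, and that a template is treated as matching when the two coordinate sets have equal length and that count does not exceed the offset quantity threshold. Both the decision rule and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def count_offset_points(doc_centers: List[Point],
                        tpl_centers: List[Point],
                        d0: float) -> int:
    """Count co-ordered center points farther apart than d0 (assumed Euclidean)."""
    return sum(
        1 for p, q in zip(doc_centers, tpl_centers) if math.dist(p, q) > d0
    )


def layout_match(doc_centers: List[Point],
                 tpl_centers: List[Point],
                 d0: float,
                 a: int) -> bool:
    """Assumed rule: equal cell counts (n == m) and at most `a` offset points."""
    if len(doc_centers) != len(tpl_centers):
        return False
    return count_offset_points(doc_centers, tpl_centers, d0) <= a


# Inputs are assumed to be corner-aligned and already ordered consistently.
doc = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
tpl = [(0.0, 1.0), (10.0, 0.0), (0.0, 20.0)]
offsets = count_offset_points(doc, tpl, d0=2.0)
```

With these illustrative coordinates only the third center point is offset by more than the threshold, so the template matches when one offset point is tolerated and fails when none is.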

Further, the first sorting unit 400 performs the steps of:

constructing a template text similarity analysis function:

$$Y_0 = (y_{01}, y_{02}, \ldots, y_{0j}, \ldots, y_{0q});$$

$$Y_1 = (y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1q});$$
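The definitions that accompany this function elsewhere in the document describe the entries of the two vectors as semantic binary-coded vectors of co-located tables, compared by Hamming distance against a deviation distance threshold. The similarity expression itself is not reproduced here, so the sketch below makes an assumption: the similarity is taken as the fraction of positions whose Hamming distance does not exceed the threshold. Codes are represented as equal-length bit strings purely for illustration, and the function names are hypothetical.

```python
from typing import List


def hamming(code_a: str, code_b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def template_text_similarity(y0: List[str], y1: List[str], a0: int) -> float:
    """Assumed form: share of co-located codes within Hamming distance a0."""
    if len(y0) != len(y1) or not y0:
        raise ValueError("y0 and y1 must be non-empty and of equal length q")
    within = sum(1 for u, v in zip(y0, y1) if hamming(u, v) <= a0)
    return within / len(y0)


# Illustrative codes only: position 0 differs in 1 bit, position 1 in 4 bits.
sim = template_text_similarity(["1010", "1111"], ["1011", "0000"], a0=1)
```

Using a per-position threshold rather than one distance over the concatenated codes keeps a single badly matching cell from dominating the result, but whether the source normalizes in this way is not stated.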

Further, the second sorting unit 700 performs the steps of:

Further, the device also comprises a second execution unit, and the second execution unit executes the steps of:

Further, the device also comprises a third execution unit, and the third execution unit executes the steps of:

The specification and figures are merely exemplary illustrations of the present application and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the application. It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the scope of the application. Thus, the present application is intended to include such modifications and alterations insofar as they come within the scope of the application or the equivalents thereof.

Claims

1. A method of text processing, the method comprising:

carrying out primary feature analysis on the first document to be processed to generate document table layout features;

traversing a table template library to perform template recognition by combining the document table layout characteristics to generate a table template matching result;

when the number of the table template matching results is larger than 1, carrying out secondary feature analysis on the table template matching results to generate template text semantic vectors and semantic vector layout features;

sorting the form template matching result by combining the template text semantic vector and the semantic vector layout feature to generate a first template sorting result;

when the number of the first template sorting results is greater than 1, traversing the first template sorting results to perform three-level feature analysis, and generating a first filling character attribute vector and a first attribute vector layout feature;

performing four-level feature analysis on the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature;

sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature to generate a second template sorting result;

and when the number of the second template sorting results is equal to 1, classifying the text of the first document to be processed according to the second template sorting result.

2. The method of claim 1, wherein traversing a table template library for template recognition in combination with the document table layout features generates a table template matching result, comprising:

3. The method of claim 2, wherein parsing the similarity of the document table layout features and the first template table layout features to generate first template similarity coefficients based on the corner alignment results comprises:

constructing a table layout feature similarity analysis function:

$$X_1 = (x_{11}, x_{12}, \ldots, x_{1i}, \ldots, x_{1n});$$

$$X_0 = (x_{01}, x_{02}, \ldots, x_{0i}, \ldots, x_{0m});$$

4. The method of claim 1, wherein sorting the form template matching result by combining the template text semantic vector and the semantic vector layout feature to generate a first template sorting result comprises:

constructing a template text similarity analysis function:

$$Y_0 = (y_{01}, y_{02}, \ldots, y_{0j}, \ldots, y_{0q});$$

$$Y_1 = (y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1q});$$

wherein $Y_0$ represents the semantic binary-coded vectors of the $q$ tables at different positions of the table template matching result; $Y_1$ represents the semantic binary-coded vectors of the $q$ tables of the first document to be processed at the positions corresponding to the table template matching result; $A(y_{1j}, y_{0j})$ represents the Hamming distance between the co-located semantic binary-coded vectors; $a_0$ represents the deviation distance threshold; and $SIM_2$ represents the template text similarity;

5. The method of claim 1, wherein sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature to generate a second template sorting result comprises:

6. The method as recited in claim 1, further comprising:

7. The method as recited in claim 1, further comprising:

8. A text processing system, comprising:

the first feature processing unit is used for carrying out primary feature analysis on the first document to be processed to generate document table layout features;

the template matching unit is used for traversing a table template library to perform template recognition by combining the document table layout characteristics to generate a table template matching result;

the second feature processing unit is used for carrying out secondary feature analysis on the table template matching result when the number of the table template matching result is more than 1, so as to generate template text semantic vectors and semantic vector layout features;

the first sorting unit is used for sorting the form template matching result by combining the template text semantic vector and the semantic vector layout characteristic to generate a first template sorting result;

the third feature processing unit is used for traversing the first template sorting result to perform three-level feature analysis when the number of the first template sorting result is larger than 1, and generating a first filling character attribute vector and a first attribute vector layout feature;

the fourth feature processing unit is used for carrying out four-level feature analysis on the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature;

the second sorting unit is used for sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout characteristic with the second filling character attribute vector and the second attribute vector layout characteristic to generate a second template sorting result;

and the first execution unit is used for classifying the text of the first document to be processed according to the second template sorting result when the number of the second template sorting results is equal to 1.