CN117131196A - Text processing method and system - Google Patents

Publication number
CN117131196A
Authority
CN
China
Prior art keywords
template
layout
sorting
similarity
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311227355.XA
Other languages
Chinese (zh)
Other versions
CN117131196B (en)
Inventor
储铭钧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unicom WO Music and Culture Co Ltd
Original Assignee
Shanghai Chenghu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Chenghu Information Technology Co ltd
Priority to CN202311227355.XA
Priority claimed from CN202311227355.XA
Publication of CN117131196A
Application granted
Publication of CN117131196B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a text processing method and a text processing system in the field of text processing. The method comprises the following steps: generating document table layout features and matching them against a table template library to obtain a table template matching result; when the number of table template matching results is greater than 1, performing feature analysis on the table template matching results to generate template text semantic vectors and semantic vector layout features; sorting the table template matching results to generate a first template sorting result; when the number of first template sorting results is greater than 1, traversing the first template sorting results for feature analysis to generate a first filling character attribute vector and a first attribute vector layout feature; analyzing the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature, and sorting the first template sorting result to generate a second template sorting result; and when the number of second template sorting results is equal to 1, classifying the text of the first document to be processed according to the second template sorting result. The technical problem of low processing efficiency in the prior art is thereby addressed.

Description

Text processing method and system
Technical Field
The application relates to the technical field of text processing, in particular to a text processing method and a text processing system.
Background
With the spread of informatization, the volume of data being circulated, processed and stored has grown greatly. Accurate automatic classification of this data is an important prerequisite for quickly retrieving related information later, and text processing is an important part of automatic classification.
Traditional text processing for automatic classification relies on a large model to classify text, so its demand for computing power is high; when documents must be classified automatically in batches, it suffers from slow response and low processing efficiency.
Disclosure of Invention
The application provides a text processing method and a text processing system, which are used for solving the technical problems of slow response and low processing efficiency caused by the high computing power demand of text processing for automatic classification in the prior art.
In view of the above problems, the present application provides a text processing method and system.
In a first aspect of the present application, there is provided a text processing method, comprising: carrying out primary feature analysis on the first document to be processed to generate document table layout features; traversing a table template library to perform template recognition by combining the document table layout characteristics to generate a table template matching result; when the number of the table template matching results is larger than 1, carrying out secondary feature analysis on the table template matching results to generate template text semantic vectors and semantic vector layout features; sorting the form template matching result by combining the template semantic vector and the semantic vector layout characteristics to generate a first template sorting result; when the number of the first template sorting results is greater than 1, traversing the first template sorting results to perform three-level feature analysis, and generating a first filling character attribute vector and a first attribute vector layout feature; performing four-level feature analysis on the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature; sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature to generate a second template sorting result; and when the number of the second template sorting results is equal to 1, classifying the texts of the first to-be-processed documents according to the second template sorting results.
In a second aspect of the present application, there is provided a text processing system comprising: the first feature processing unit is used for carrying out primary feature analysis on the first document to be processed to generate document table layout features; the template matching unit is used for traversing a table template library to perform template recognition by combining the document table layout characteristics to generate a table template matching result; the second feature processing unit is used for carrying out secondary feature analysis on the table template matching result when the number of the table template matching result is more than 1, so as to generate template text semantic vectors and semantic vector layout features; the first sorting unit is used for sorting the form template matching result by combining the template semantic vector and the semantic vector layout characteristic to generate a first template sorting result; the third feature processing unit is used for traversing the first template sorting result to perform three-level feature analysis when the number of the first template sorting result is larger than 1, and generating a first filling character attribute vector and a first attribute vector layout feature; the fourth feature processing unit is used for carrying out four-level feature analysis on the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature; the second sorting unit is used for sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout characteristic with the second filling character attribute vector and the second attribute vector layout characteristic to generate a second template sorting result; and the first execution unit is used for classifying the texts of the first documents to be processed according to the second template sorting results when the number of the second template sorting results is equal to 1.
One or more technical schemes provided by the application have at least the following technical effects or advantages:
according to the application, text features are divided into three kinds of feature information, namely the form layout features, the semantic features of the unfilled template content, and the semantic features of the filled content. A text sorting algorithm over these three kinds of feature information is constructed, and sorting is carried out layer by layer: first, sorting is carried out according to the form layout features; if the returned result is not unique, the semantic features of the unfilled template content are activated for sorting; and if the result is still not unique, the semantic features of the filled content are finally activated for sorting. The computing power demanded increases stage by stage through the multistage sorting process, but the amount of data processed decreases stage by stage, so that, compared with traditional single-stage text processing, the method has the technical effect of higher processing efficiency.
Drawings
FIG. 1 is a schematic flow chart of a text processing method provided by the application;
FIG. 2 is a schematic flow chart of a method for obtaining a matching result of a form template in a text processing method according to the present application;
FIG. 3 is a schematic structural diagram of a text processing system according to the present application.
Reference numerals illustrate: a first feature processing unit 100, a template matching unit 200, a second feature processing unit 300, a first sorting unit 400, a third feature processing unit 500, a fourth feature processing unit 600, a second sorting unit 700, and a first execution unit 800.
Detailed Description
The embodiment of the application provides a text processing method and a text processing system, which divide text features into three kinds of feature information, namely the form layout features, the semantic features of the unfilled template content, and the semantic features of the filled content, and construct a text sorting algorithm over these three kinds of feature information that sorts layer by layer. Sorting is first carried out according to the form layout features; if the returned result is not unique, the semantic features of the unfilled template content are activated for sorting; and if the result is still not unique, the semantic features of the filled content are finally activated for sorting. The computing power demanded increases stage by stage through the multistage sorting process, but the amount of data processed decreases stage by stage. Compared with traditional single-stage text processing, the method has the technical effect of higher processing efficiency, and solves the technical problems of slow response and low processing efficiency caused by the high computing power demand of text processing for automatic classification in the prior art.
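The layer-by-layer sorting described here can be pictured as a cascade of filters that stops as soon as at most one template remains. The following is a minimal Python sketch under stated assumptions: `cascade_sort` and the three toy stage functions are hypothetical names introduced for illustration only and do not appear in the application.

```python
from typing import Callable, List

Stage = Callable[[List[str]], List[str]]


def cascade_sort(candidates: List[str], stages: List[Stage]) -> List[str]:
    """Run the stages in order, stopping once at most one candidate remains."""
    remaining = list(candidates)
    for stage in stages:
        remaining = stage(remaining)
        if len(remaining) <= 1:
            break  # later, more expensive stages are skipped
    return remaining


# Toy stages (illustrative assumptions): each keeps a subset of template names
# and records that it was called.
calls: List[str] = []


def by_layout(templates: List[str]) -> List[str]:
    calls.append("layout")
    return [t for t in templates if t != "C"]


def by_template_text(templates: List[str]) -> List[str]:
    calls.append("template_text")
    return [t for t in templates if t == "A"]


def by_filled_text(templates: List[str]) -> List[str]:
    calls.append("filled_text")
    return templates


result = cascade_sort(["A", "B", "C"], [by_layout, by_template_text, by_filled_text])
print(result, calls)
```

In this toy run the filled-content stage is never invoked, which is the efficiency argument of the application in miniature: the costliest analysis only runs when the cheaper stages leave more than one candidate.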
Example 1
As shown in fig. 1, the present application provides a text processing method, including the steps of:
s10: carrying out primary feature analysis on the first document to be processed to generate document table layout features;
specifically, the first document to be processed referred to in the embodiment of the application is a document of a type that is filled in on the basis of a form template, for example: form documents used in the workflows of various projects, banks and the like. The document table layout features refer to stored information representing the positions of the cells of the table document. Preferably, any table document is segmented from left to right and from top to bottom to obtain cells arranged in that order, and the stored information of each cell is saved in sequence, where the stored information at least comprises the four corner coordinates and the center point coordinate of the cell.
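As a hedged illustration of the stored cell information, the sketch below represents each cell by its four corner coordinates, derives the center point, and orders cells row by row. The names `Cell`, `rect` and `layout_features` are hypothetical, and the sketch assumes axis-aligned cells in image coordinates (y grows downward) where cells in the same row share a top edge.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Cell:
    # Corners in the order: top-left, top-right, bottom-right, bottom-left.
    corners: Tuple[Point, Point, Point, Point]

    @property
    def center(self) -> Point:
        """Center point, taken as the mean of the four corners."""
        xs = [p[0] for p in self.corners]
        ys = [p[1] for p in self.corners]
        return (sum(xs) / 4.0, sum(ys) / 4.0)


def rect(x0: float, y0: float, x1: float, y1: float) -> Cell:
    """Build an axis-aligned cell from its top-left and bottom-right corners."""
    return Cell(((x0, y0), (x1, y0), (x1, y1), (x0, y1)))


def layout_features(cells: List[Cell]) -> List[Cell]:
    """Order cells top to bottom, and left to right within a row."""
    return sorted(
        cells,
        key=lambda c: (min(p[1] for p in c.corners), min(p[0] for p in c.corners)),
    )


# A 2 x 2 table given in scrambled order.
cells = [rect(10, 10, 20, 20), rect(0, 0, 10, 10), rect(10, 0, 20, 10), rect(0, 10, 10, 20)]
ordered = layout_features(cells)
print([c.center for c in ordered])
```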
S20: traversing a table template library to perform template recognition by combining the document table layout characteristics to generate a table template matching result;
specifically, at any user end applying the text processing method, the various form templates that the user needs to archive are stored and recorded as a form template library; that is, any form sent by the user should have a unique corresponding matching template in the form template library. Preferably, the form template library is placed in cloud storage, and the user needs to update it periodically to keep the templates current and to avoid storing invalid templates or missing templates.
The form template matching result refers to a template type which is determined to be relatively similar by carrying out similarity analysis according to the document form layout characteristics and the form layout characteristics of the form template library. Because the templates of different filling-in contents may be identical in form layout, the number of form template matching results sorted according to the document form layout features may not be unique.
If the number of the form template matching results is equal to 1, the matched template is output directly and used to guide the classified storage of the text. Unlike traditional text processing, which requires complicated semantic analysis, only a comparison of the table layout is needed in this case, so the document classification efficiency is improved.
If the number of the form template matching results is larger than 1, sorting by form layout has still reduced the number of candidate templates, which lowers the computational load of the semantic-level sorting in the subsequent steps and thus improves the document classification efficiency.
Further, as shown in fig. 2, in combination with the document table layout feature, traversing a table template library to perform template recognition, and generating a table template matching result, including:
evaluating the character arrangement direction of the first document to be processed to generate a first character arrangement direction;
extracting the character arrangement direction of a first table template of the table template library to generate a second character arrangement direction;
performing homodirectional corner alignment on the document table layout features and the first template table layout features based on the first character arrangement direction and the second character arrangement direction to generate a corner alignment result;
analyzing the similarity of the document table layout characteristics and the first template table layout characteristics according to the corner alignment result to generate a first template similarity coefficient;
and when the first template similarity coefficient meets a first similarity coefficient threshold, adding the first form template into the form template matching result.
Specifically, to compare whether the table layouts are identical, the first document to be processed and the templates of the table template library first need to be aligned. The alignment procedure given by the embodiment of the application is preferably as follows:
the first character arrangement direction refers to the layout direction of the text of the first document to be processed, that is, the orientation in which the text is easily read, whether the text is English, Chinese or other characters. Preferably, the character arrangement direction is evaluated by a convolutional neural network: a plurality of character samples are collected and each is labeled with its easy-to-read direction, and the convolutional neural network is then trained on them; when the deviation between the character arrangement direction output by the network and the labeled arrangement direction is less than or equal to a deviation threshold for 50 consecutive training iterations, a character arrangement direction evaluation model is obtained. The characters of the first document to be processed are evaluated by the character arrangement direction evaluation model to generate a plurality of character arrangement directions, and the direction that occurs most often among them is taken as the first character arrangement direction. Similarly, the character arrangement direction of a first table template of the table template library is extracted to generate a second character arrangement direction, where the first table template refers to any table template of the table template library. At the same time, the first template table layout feature of the first table template is retrieved.
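The final voting step, in which the most frequent per-character direction becomes the document direction, can be sketched as follows. This is a minimal sketch: `dominant_direction` is a hypothetical helper, and encoding directions as angles in degrees (0, 90, 180, 270) is an assumption not stated in the application. The per-character predictions would come from the trained evaluation model, which is not modeled here.

```python
from collections import Counter
from typing import Iterable


def dominant_direction(directions: Iterable[int]) -> int:
    """Return the direction that occurs most often among per-character predictions."""
    counts = Counter(directions)
    if not counts:
        raise ValueError("no character directions were provided")
    # most_common(1) yields a one-element list of (value, count) pairs.
    return counts.most_common(1)[0][0]


# Per-character predictions (assumed output of the evaluation model), in degrees.
per_character = [0, 0, 90, 0, 180, 0]
first_direction = dominant_direction(per_character)
print(first_direction)
```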
Further, according to the first character arrangement direction and the second character arrangement direction, the form direction of the first document to be processed is adjusted to be consistent with the form direction of the first form template. Then, with the directions made consistent, the corner points of the corner cells of the document table layout features and of the first template table layout features are made to coincide, together with the two sides adjoining each of those corner points, to obtain a corner alignment result, which facilitates accurate comparison of the table layouts in the subsequent analysis.
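One simple way to picture direction adjustment followed by corner alignment is to rotate the document's points in quarter turns and then translate them so that a reference corner sits at the origin. The sketch below is an illustration under stated assumptions, not the method of the application: directions are multiples of 90 degrees, one quarter turn maps (x, y) to (-y, x), and `rotate_quarter_turns`, `move_corner_to_origin` and `align` are hypothetical names.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def rotate_quarter_turns(points: List[Point], turns: int) -> List[Point]:
    """Rotate points by turns * 90 degrees, mapping (x, y) to (-y, x) per turn."""
    rotated = list(points)
    for _ in range(turns % 4):
        rotated = [(-y, x) for x, y in rotated]
    return rotated


def move_corner_to_origin(points: List[Point]) -> List[Point]:
    """Translate points so the bounding-box corner with minimal x and y is (0, 0)."""
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    return [(x - min_x, y - min_y) for x, y in points]


def align(points: List[Point], doc_direction: int, template_direction: int) -> List[Point]:
    """Bring document points into the template's orientation and corner position."""
    turns = ((template_direction - doc_direction) // 90) % 4
    return move_corner_to_origin(rotate_quarter_turns(points, turns))


# A 2 x 1 rectangle whose text direction differs from the template's by 90 degrees.
aligned = align([(0, 0), (2, 0), (2, 1), (0, 1)], doc_direction=0, template_direction=90)
print(sorted(aligned))
```

After both the document and the template are brought to the same orientation and reference corner, their cell coordinates can be compared directly.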
Further, according to the corner alignment result, analyzing the similarity of the document table layout characteristics and the first template table layout characteristics to generate a first template similarity coefficient; after the corner points are aligned, the more the number of the overlapped cells of the document table layout features and the first template table layout features is, the larger the template similarity coefficient is, and otherwise, the smaller the template similarity coefficient is. The first similarity coefficient threshold refers to a preset similarity coefficient threshold regarded as template matching, and is a user-defined threshold. And when the first template similarity coefficient meets a first similarity coefficient threshold, namely the first template similarity coefficient is larger than or equal to the first similarity coefficient threshold, adding the first form template into a form template matching result. Traversing other templates of the table template library to perform the same process treatment, and enriching the matching result of the table template.
Further, according to the corner alignment result, analyzing the similarity between the document table layout feature and the first template table layout feature to generate a first template similarity coefficient, including:
constructing a table layout feature similarity analysis function:
$X_1 = (x_{11}, x_{12}, \ldots, x_{1i}, \ldots, x_{1n})$;
$X_0 = (x_{01}, x_{02}, \ldots, x_{0i}, \ldots, x_{0m})$;
wherein $X_1$ represents the set of table center point coordinates of the document table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{1i}$ represents the $i$-th table center point coordinate of the document table layout features in that order; $n$ represents the number of table center point coordinates of the document table layout features; $X_0$ represents the set of table center point coordinates of the first template table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{0i}$ represents the $i$-th table center point coordinate of the first template table layout features, at the same position in the order as $x_{1i}$; $m$ represents the number of table center point coordinates of the first template table layout features; $d_0$ represents the offset distance threshold; $a$ represents the offset quantity threshold; count() is a counting function; and $SIM_1$ represents the table layout feature similarity;
and carrying out similarity analysis on the document table layout characteristics and the first template table layout characteristics according to the table layout characteristic similarity analysis function to generate the first template similarity coefficient.
Specifically, after the corner points are aligned, the more similar the first document to be processed and the first form template are, the more cells overlap, so a quantitative index for similarity analysis can be based on the number of overlapping cells, which makes it quick to sort out similar templates. A table layout feature similarity analysis function is constructed: $X_1 = (x_{11}, x_{12}, \ldots, x_{1i}, \ldots, x_{1n})$; $X_0 = (x_{01}, x_{02}, \ldots, x_{0i}, \ldots, x_{0m})$; wherein $X_1$ represents the set of table center point coordinates of the document table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{1i}$ represents the $i$-th table center point coordinate of the document table layout features in that order; $n$ represents the number of table center point coordinates of the document table layout features; $X_0$ represents the set of table center point coordinates of the first template table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{0i}$ represents the $i$-th table center point coordinate of the first template table layout features, at the same position in the order as $x_{1i}$; $m$ represents the number of table center point coordinates of the first template table layout features; $d_0$ represents the offset distance threshold; $a$ represents the offset quantity threshold; count() is a counting function; and $SIM_1$ represents the table layout feature similarity.
After the corner points are aligned, if two cells overlap, the center points of their diagonals coincide, so the center point coordinates are used as the index for evaluating how similar the table layouts of the first document to be processed and the first table template are. This is faster than analyzing entire cells. By comparing $|m - n|$ with $a$, templates whose number of cells differs too much from that of the document are eliminated directly and their similarity is regarded as 0, so irrelevant data can be removed quickly and the calculation efficiency is improved. Similarity analysis is then carried out on the document table layout features and the first template table layout features according to the table layout feature similarity analysis function to generate a first template similarity coefficient.
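The center-point comparison can be sketched in Python as follows. The text defines the quantities involved but not the exact expression for the similarity, so this sketch makes two loudly stated assumptions: two centers count as overlapping when their Euclidean distance is at most `d0`, and the count is normalized by `max(m, n)`. The function name `layout_similarity` is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def layout_similarity(doc_centers: List[Point], tpl_centers: List[Point],
                      d0: float, a: int) -> float:
    """Fraction of same-rank cell centers that lie within d0 of each other."""
    n, m = len(doc_centers), len(tpl_centers)
    # Early rejection: the cell counts differ by more than the quantity threshold.
    if abs(m - n) > a:
        return 0.0
    if max(m, n) == 0:
        return 0.0
    # Count same-rank pairs whose centers nearly coincide (assumed Euclidean).
    matched = sum(
        1 for p, q in zip(doc_centers, tpl_centers) if math.dist(p, q) <= d0
    )
    # Normalizing by max(m, n) is an assumption made for this sketch.
    return matched / max(m, n)


doc = [(5.0, 5.0), (15.0, 5.0), (5.0, 15.0)]
tpl = [(5.0, 5.0), (15.0, 6.0), (5.0, 30.0)]
sim = layout_similarity(doc, tpl, d0=2.0, a=1)
print(round(sim, 3))
```

A template would then be added to the matching result when this coefficient is greater than or equal to the user-defined first similarity coefficient threshold.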
S30: when the number of the table template matching results is larger than 1, carrying out secondary feature analysis on the table template matching results to generate template text semantic vectors and semantic vector layout features;
specifically, if the number of the table template matching results is greater than 1, sorting cannot be completed by relying on the table layout alone, and partial text semantic features are further added. The partial text semantic features refer to the text semantic vectors of the templates, stored as the template text semantic vectors, where a semantic vector is a code representing text content. The layout position of each text semantic vector in the table is stored at the same time and recorded as the semantic vector layout feature. Preferably, the semantic vector of any text has two position parameters, namely the cell position and the sequence number of the text within the cell, and the cell position is preferably represented by the cell center coordinates. The template text refers to the initial text that has not been filled with user content. For example, for an identity registry, the initial text may include "identity registry", "name", "ethnicity" and the like. By calling partial semantic features, the form template matching results can be sorted a second time; compared with global semantic analysis, analyzing partial semantic features reduces the computational load and helps improve the overall document classification efficiency.
S40: sorting the form template matching result by combining the template semantic vector and the semantic vector layout characteristics to generate a first template sorting result;
further, sorting the form template matching result by combining the template semantic vector and the semantic vector layout feature to generate a first template sorting result, including:
constructing a template text similarity analysis function:
$Y_0 = (y_{01}, y_{02}, \ldots, y_{0j}, \ldots, y_{0q})$;
$Y_1 = (y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1q})$;
wherein $Y_0$ represents the semantic binary-coded vectors at the $q$ different table positions of the table template matching result; $Y_1$ represents the semantic binary-coded vectors of the first document to be processed at the $q$ table positions corresponding to the table template matching result; $A(y_{1j}, y_{0j})$ represents the Hamming distance between the semantic binary-coded vectors at the same position; $a_0$ represents the deviation distance threshold; and $SIM_2$ represents the template text similarity;
according to the semantic vector layout characteristics, carrying out five-level characteristic analysis from the first document to be processed, and extracting semantic vectors of documents with the same layout;
according to the template text similarity analysis function, similarity analysis is carried out on the template semantic vector and the document semantic vector with the same layout, and a second template similarity coefficient is generated;
and when the second template similarity coefficient is greater than or equal to a second similarity coefficient threshold value, adding the form template matching result into the first template sorting result.
Specifically, the first template sorting result refers to a residual template obtained by further sorting the table template matching result according to the template semantic vector and the semantic vector layout characteristics. The preferred sorting procedure is as follows:
constructing a template text similarity analysis function: $Y_0 = (y_{01}, y_{02}, \ldots, y_{0j}, \ldots, y_{0q})$; $Y_1 = (y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1q})$; wherein $Y_0$ represents the semantic binary-coded vectors at the $q$ different table positions of the table template matching result; $Y_1$ represents the semantic binary-coded vectors of the first document to be processed at the $q$ table positions corresponding to the table template matching result; $A(y_{1j}, y_{0j})$ represents the Hamming distance between the semantic binary-coded vectors at the same position; $a_0$ represents the deviation distance threshold; and $SIM_2$ represents the template text similarity;
determining, based on the semantic vector layout features, the position coordinates of each template text semantic vector of any template, namely the coordinates of the cell center point and the sequence number within the cell; based on the cell center point coordinates and the sequence number within the cell, matching the semantic vector of the first document to be processed at the same coordinate position, and if the first document to be processed has no text at that position, treating the semantic deviation at that position by default as larger than the deviation distance threshold. After the semantic vectors of the first document to be processed have been matched, they are stored as $Y_1$, and the semantic binary-coded vectors at the $q$ different table positions of the table template matching result are stored as $Y_0$; the template text similarity is then computed according to the template text similarity analysis function and stored as the second template similarity coefficient. When the second template similarity coefficient is greater than or equal to the second similarity coefficient threshold, the analyzed form template matching result is added to the first template sorting result.
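The position-wise Hamming comparison can be sketched as follows. The exact expression for the template text similarity is not given in the text, so this sketch assumes it is the fraction of template positions whose Hamming distance to the document's code is at most `a0`. Representing each binary-coded vector as a Python `int` and keying positions by (cell center, sequence number in cell) are also assumptions, and the function names are hypothetical.

```python
from typing import Dict, Tuple

# Position key: (cell center point, sequence number of the text inside the cell).
Position = Tuple[Tuple[float, float], int]


def hamming(code_a: int, code_b: int) -> int:
    """Number of differing bits between two non-negative binary codes."""
    return bin(code_a ^ code_b).count("1")


def template_text_similarity(template_codes: Dict[Position, int],
                             document_codes: Dict[Position, int],
                             a0: int) -> float:
    """Fraction of template positions whose document code is within a0 bits."""
    if not template_codes:
        return 0.0
    close = 0
    for position, template_code in template_codes.items():
        document_code = document_codes.get(position)
        if document_code is None:
            # No text at this position: treated as exceeding the threshold.
            continue
        if hamming(template_code, document_code) <= a0:
            close += 1
    return close / len(template_codes)


template = {((5.0, 5.0), 0): 0b1010, ((15.0, 5.0), 0): 0b1111, ((5.0, 15.0), 0): 0b0001}
document = {((5.0, 5.0), 0): 0b1011, ((15.0, 5.0), 0): 0b0000}
sim2 = template_text_similarity(template, document, a0=1)
print(round(sim2, 3))
```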
S50: when the number of the first template sorting results is greater than 1, traversing the first template sorting results to perform three-level feature analysis, and generating a first filling character attribute vector and a first attribute vector layout feature;
specifically, when the number of the first template sorting results is greater than 1, this indicates that the templates cannot be sorted by relying on the partial semantic vectors alone, and three-level feature analysis is performed on the first template sorting results to obtain a first filling character attribute vector and a first attribute vector layout feature. The filling character attribute refers to the data type of the characters required to be filled in. For example, if the data attribute of a certain cell is gender but the actual filled content is "xx family", the filled content is regarded as inconsistent. The attribute vector layout features refer to the positions of the cells that are to be filled with characters, and are preferably represented by the coordinates of the cell center points.
Attribute verification can be performed on the basis of the first filling character attribute vector and the filled characters of the first document to be processed, and filling position verification can be performed on the basis of the first attribute vector layout feature and the filled characters of the first document to be processed, so that the templates can be further sorted without semantic recognition of all characters. Preferably, the text attribute classification task is realized by a text attribute calibration table: the attribute identifiers of a plurality of words or texts are stored in association to obtain the text attribute calibration table, which is convenient to call in later steps.
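A text attribute calibration table can be pictured as a simple lookup from filled text to an attribute identifier. The sketch below is illustrative only: the table entries, the "unknown" fallback, and the names `ATTRIBUTE_TABLE`, `attribute_of` and `count_consistent` are assumptions introduced here, not contents of the application.

```python
from typing import Dict, Tuple

Center = Tuple[float, float]

# Hypothetical calibration table: filled text -> attribute identifier.
ATTRIBUTE_TABLE: Dict[str, str] = {
    "male": "gender",
    "female": "gender",
    "Han": "ethnicity",
    "Hui": "ethnicity",
}


def attribute_of(text: str) -> str:
    """Look up the attribute of a filled text; unseen text maps to 'unknown'."""
    return ATTRIBUTE_TABLE.get(text.strip(), "unknown")


def count_consistent(expected: Dict[Center, str], filled: Dict[Center, str]) -> int:
    """Count cells whose filled text carries the attribute the template expects."""
    return sum(
        1
        for center, attribute in expected.items()
        if center in filled and attribute_of(filled[center]) == attribute
    )


# The template expects a gender in one cell and an ethnicity in the other.
expected = {(5.0, 5.0): "gender", (15.0, 5.0): "ethnicity"}
filled = {(5.0, 5.0): "Han", (15.0, 5.0): "Hui"}
print(count_consistent(expected, filled))
```

Here the first cell is inconsistent because an ethnicity was filled where a gender was expected, mirroring the example in the text.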
S60: performing four-level feature analysis on the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature;
specifically, the second filled text attribute vector refers to the filled text attribute vector of each cell of the first document to be processed, and the second attribute vector layout feature refers to the cell center point coordinates of each of the filled text attribute vectors of the first document to be processed.
S70: sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature to generate a second template sorting result;
Further, sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature, to generate a second template sorting result, includes:
constructing a layout similarity analysis function and a vector similarity analysis function, wherein the layout similarity analysis function is the same as a table layout feature similarity analysis function, and the vector similarity analysis function is the same as the template text similarity analysis function;
performing similarity evaluation on the first attribute vector layout features and the second attribute vector layout features according to the layout similarity analysis function to generate a first layout similarity coefficient;
performing similarity evaluation on the first filling character attribute vector and the second filling character attribute vector according to the vector similarity analysis function to generate a first vector similarity coefficient;
extracting each first template sorting result whose first layout similarity coefficient is greater than or equal to a third similarity coefficient threshold and whose first vector similarity coefficient is greater than or equal to a fourth similarity coefficient threshold, and adding it to the second template sorting result.
Specifically, the layout similarity analysis function is the same as the table layout feature similarity analysis function. That is, the first attribute vector layout features of any one template are ordered by cell from left to right and from top to bottom, the second attribute vector layout features are ordered in the same way, and the table layout feature similarity analysis function is then invoked to calculate the first layout similarity coefficient.
The vector similarity analysis function is the same as the template text similarity analysis function. That is, the first filling character attribute vectors of any one template are ordered by cell from left to right and from top to bottom, the second filling character attribute vectors of the first document to be processed are ordered in the same way, and the template text similarity analysis function is then invoked to obtain the first vector similarity coefficient. Each first template sorting result whose first layout similarity coefficient is greater than or equal to the third similarity coefficient threshold and whose first vector similarity coefficient is greater than or equal to the fourth similarity coefficient threshold is extracted and added to the second template sorting result.
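The cell ordering and the two-threshold filter of this step can be sketched as follows. The sketch assumes that the two similarity coefficients have already been computed for each candidate template, that cell centers are `(x, y)` pairs with `y` growing downward, and that cells in one row share the same `y`. The names `Candidate`, `reading_order` and `second_sort` are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    template_id: str
    layout_coefficient: float  # first layout similarity coefficient
    vector_coefficient: float  # first vector similarity coefficient


def reading_order(centers):
    """Order cell centers top to bottom, then left to right within a row.

    Assumes y grows downward and cells in the same row share the same y.
    """
    return sorted(centers, key=lambda p: (p[1], p[0]))


def second_sort(candidates, third_threshold, fourth_threshold):
    """Keep only candidates that meet both thresholds (both comparisons are >=)."""
    return [
        c
        for c in candidates
        if c.layout_coefficient >= third_threshold
        and c.vector_coefficient >= fourth_threshold
    ]
```

Requiring both coefficients to pass means a template with a matching layout but mismatched attributes is rejected, and vice versa.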
S80: and when the number of the second template sorting results is equal to 1, classifying the text of the first document to be processed according to the second template sorting result.
Further, the method further comprises the following steps:
when the number of the second template sorting results is greater than 1, adding the first layout similarity coefficient and the first vector similarity coefficient of the second template sorting results to generate a template sorting trend score;
and carrying out serialization adjustment on the second template sorting result based on the template sorting trend score, and sending the second template sorting result to a user side to obtain a third template sorting result, wherein the number of the third template sorting results is equal to 1.
Specifically, after multi-stage sorting, the number of remaining second template sorting results should be small. When the number of second template sorting results is equal to 1, text classification is performed on the first document to be processed according to the second template sorting result. When the number is greater than 1, the first layout similarity coefficient and the first vector similarity coefficient of each second template sorting result are added and stored as its template sorting trend score. The second template sorting results are then ordered by template sorting trend score from largest to smallest and sent to the user side, where the user screens them to obtain a third template sorting result, the number of which is equal to 1. At this point, after multi-level semantic analysis, if the number of second template sorting results is still not equal to 1, further complex semantic sorting would most likely still fail to separate them, so the results are handed directly to the user for selection. Because few second template sorting results remain, the user's selection time is short.
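The trend score ordering described above amounts to summing the two coefficients and sorting in descending order. A minimal sketch, in which the function name `rank_by_trend_score` and the tuple layout of the input are assumptions:

```python
def rank_by_trend_score(candidates):
    """Rank remaining templates by template sorting trend score, largest first.

    candidates: list of (template_id, layout_coefficient, vector_coefficient).
    Returns a list of (template_id, trend_score) for presentation to the user.
    """
    scored = [(tid, layout + vector) for tid, layout, vector in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The ranked list only orders the options; the final choice of a single template is still made by the user.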
Further, the method further comprises the following steps:
when the number of the form template matching results is equal to 0, or the number of the first template sorting results is equal to 0, or the number of the second template sorting results is equal to 0, generating a text rechecking instruction, and sending the text rechecking instruction to a user side to recheck the first document to be processed;
and when the number of the table template matching results is equal to 1 or the number of the first template sorting results is equal to 1, classifying the text of the first document to be processed according to the table template matching results or the first template sorting results.
Specifically, when the number of table template matching results is equal to 0, or the number of first template sorting results is equal to 0, or the number of second template sorting results is equal to 0, a text review instruction is generated and sent to the user side to review the first document to be processed. That is, when no template matches, the document may have been uploaded in error and needs to be fed back to the user side for review. When the number of table template matching results is equal to 1, or the number of first template sorting results is equal to 1, text classification is performed on the first document to be processed according to the table template matching result or the first template sorting result.
In summary, the embodiment of the application has at least the following technical effects:
According to the embodiment of the application, text features are divided into three kinds of feature information: table layout features, semantic features of the template type of the unfilled content, and semantic features of the filled content. A text sorting algorithm is constructed over these three kinds of feature information and sorts layer by layer. Sorting is first performed according to the table layout features; if the returned result is not unique, the semantic features of the template type of the unfilled content are activated for sorting; if the classification result is still not unique, the semantic features of the filled content are finally activated for sorting. The computing power required by each stage of the multi-stage sorting process increases, but the amount of data processed decreases stage by stage, so the method achieves higher processing efficiency than traditional single-stage text processing.
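The layer-by-layer control flow summarized above can be sketched as a generic cascade that stops as soon as the outcome is decided. The function name `cascade_sort`, the status strings, and the representation of each stage as a callable are assumptions for illustration; the actual stages would be the layout, template text, and filled content comparisons described in the embodiment.

```python
def cascade_sort(document, templates, stages):
    """Run sorting stages in order and stop as soon as the outcome is decided.

    stages: callables (document, candidates) -> filtered candidates, ordered
    from cheapest to most expensive. Returns (status, candidates), where
    status is "review" (no match), "classified" (exactly one match), or
    "ask_user" (several candidates remain after the last stage).
    """
    candidates = list(templates)
    for stage in stages:
        candidates = stage(document, candidates)
        if not candidates:
            return "review", []
        if len(candidates) == 1:
            return "classified", candidates
    return "ask_user", candidates
```

Because each later stage only sees the survivors of the earlier ones, the more expensive comparisons run on progressively fewer templates.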
Example two
Based on the same inventive concept as the text processing method in the previous embodiment, as shown in fig. 3, the present application provides a text processing system, including:
the first feature processing unit 100 is configured to perform primary feature analysis on a first document to be processed, and generate document table layout features;
the template matching unit 200 is used for traversing a table template library to perform template recognition in combination with the document table layout characteristics to generate a table template matching result;
the second feature processing unit 300 is configured to perform secondary feature analysis on the table template matching result when the number of the table template matching result is greater than 1, so as to generate a template text semantic vector and a semantic vector layout feature;
a first sorting unit 400, configured to sort the table template matching result by combining the template text semantic vector and the semantic vector layout feature, and generate a first template sorting result;
the third feature processing unit 500 is configured to traverse the first template sorting result to perform three-level feature analysis when the number of the first template sorting results is greater than 1, and generate a first filling character attribute vector and a first attribute vector layout feature;
a fourth feature processing unit 600, configured to perform four-level feature analysis on the first document to be processed, and generate a second filling character attribute vector and a second attribute vector layout feature;
a second sorting unit 700, configured to sort the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature, to generate a second template sorting result;
and the first execution unit 800 is configured to perform text classification on the first document to be processed according to the second template sorting result when the number of the second template sorting results is equal to 1.
Further, the template matching unit 200 performs the steps of:
evaluating the character arrangement direction of the first document to be processed to generate a first character arrangement direction;
extracting the character arrangement direction of a first table template of the table template library to generate a second character arrangement direction;
performing homodirectional corner alignment on the document table layout features and the first template table layout features based on the first character arrangement direction and the second character arrangement direction to generate a corner alignment result;
analyzing the similarity of the document table layout characteristics and the first template table layout characteristics according to the corner alignment result to generate a first template similarity coefficient;
and when the first template similarity coefficient meets a first similarity coefficient threshold, adding the first form template into the form template matching result.
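One possible reading of the same-direction corner alignment in the steps above is sketched below: the document's cell center points are rotated so that its character arrangement direction matches the template's, then translated so that the bounding-box corner sits at the origin. The application does not define the alignment procedure; the restriction of directions to multiples of 90 degrees, the counter-clockwise convention, and the names `rotate_90`, `align_corner` and `align_to_template` are assumptions.

```python
def rotate_90(points, quarter_turns):
    """Rotate points counter-clockwise about the origin by quarter_turns * 90 degrees."""
    k = quarter_turns % 4
    rotated = []
    for x, y in points:
        for _ in range(k):
            x, y = -y, x
        rotated.append((x, y))
    return rotated


def align_corner(points):
    """Translate points so the minimum corner of their bounding box is the origin."""
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    return [(x - min_x, y - min_y) for x, y in points]


def align_to_template(doc_points, doc_direction, template_direction):
    """Bring document points into the template's orientation and corner frame.

    Directions are assumed to be text angles in degrees, multiples of 90.
    """
    quarter_turns = ((template_direction - doc_direction) // 90) % 4
    return align_corner(rotate_90(doc_points, quarter_turns))
```

Under this reading the template's own center points would be passed through `align_corner` as well, so that both coordinate sets share one origin before the similarity is evaluated.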
Further, the template matching unit 200 performs the steps of:
constructing a table layout feature similarity analysis function:
$X_1 = (x_{11}, x_{12}, \ldots, x_{1i}, \ldots, x_{1n})$;
$X_0 = (x_{01}, x_{02}, \ldots, x_{0i}, \ldots, x_{0m})$;
wherein $X_1$ represents the set of table center point coordinates of the document table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{1i}$ represents the i-th table center point coordinate of the document table layout features in that order; $n$ represents the number of table center point coordinates of the document table layout features; $X_0$ represents the set of table center point coordinates of the first template table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{0i}$ represents the i-th table center point coordinate of the first template table layout features, in the same arrangement order as $x_{1i}$; $m$ represents the number of table center point coordinates of the first template table layout features; $d_0$ represents the offset distance threshold; $a$ represents the offset quantity threshold; count() is a counting function; and $SIM_1$ represents the table layout feature similarity;
and carrying out similarity analysis on the document table layout characteristics and the first template table layout characteristics according to the table layout characteristic similarity analysis function to generate the first template similarity coefficient.
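The expression for $SIM_1$ itself is not present in this text, so the following is only a hypothetical function consistent with the listed symbols, not the application's formula. It assumes that the two ordered coordinate lists must have equal length (n = m), that an offset is counted when the Euclidean distance between co-indexed center points exceeds `d0`, and that the layouts are judged similar (1) when the offset count does not exceed `a`. The name `layout_similarity` is an assumption.

```python
import math


def layout_similarity(doc_centers, template_centers, d0, a):
    """Return 1 if the two ordered center-point lists are judged similar, else 0.

    Assumed rule (not taken from the application): the lists must have equal
    length, and at most `a` co-indexed points may be farther apart than `d0`.
    """
    if len(doc_centers) != len(template_centers):
        return 0
    offsets = sum(
        1
        for p, q in zip(doc_centers, template_centers)
        if math.dist(p, q) > d0
    )
    return 1 if offsets <= a else 0
```

Both lists are expected to be corner aligned and ordered left to right, top to bottom before the call, as the symbol definitions require.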
Further, the first sorting unit 400 performs the steps of:
constructing a template text similarity analysis function:
$Y_0 = (y_{01}, y_{02}, \ldots, y_{0j}, \ldots, y_{0q})$;
$Y_1 = (y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1q})$;
wherein $Y_0$ represents the semantic binary-coded vectors of the tables at q different positions of the table template matching result; $Y_1$ represents the semantic binary-coded vectors of the tables at the q corresponding positions of the first document to be processed; $A(y_{1j}, y_{0j})$ represents the Hamming distance between co-located semantic binary-coded vectors; $a_0$ represents the deviation distance threshold; and $SIM_2$ represents the template text similarity;
performing five-level feature analysis on the first document to be processed according to the semantic vector layout features, and extracting the document semantic vectors with the same layout;
performing similarity analysis on the template text semantic vector and the document semantic vectors with the same layout according to the template text similarity analysis function, to generate a second template similarity coefficient;
and when the second template similarity coefficient is greater than or equal to a second similarity coefficient threshold value, adding the form template matching result into the first template sorting result.
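The expression for $SIM_2$ is likewise not present in this text. As a hedged illustration of how the listed symbols might fit together, the sketch below computes the Hamming distance between co-located binary codes and reports the fraction of positions whose distance does not exceed `a0`. That aggregation rule, the string representation of the binary codes, and the names `hamming` and `text_similarity` are assumptions, not the application's formula.

```python
def hamming(u, v):
    """Hamming distance between two equal-length binary codes given as strings."""
    if len(u) != len(v):
        raise ValueError("codes must have equal length")
    return sum(b1 != b2 for b1, b2 in zip(u, v))


def text_similarity(template_codes, doc_codes, a0):
    """Assumed rule: fraction of co-located codes whose Hamming distance is <= a0."""
    if len(template_codes) != len(doc_codes) or not template_codes:
        return 0.0
    close = sum(
        1 for y0, y1 in zip(template_codes, doc_codes) if hamming(y1, y0) <= a0
    )
    return close / len(template_codes)
```

The resulting coefficient would then be compared with the second similarity coefficient threshold to decide whether the template enters the first template sorting result.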
Further, the second sorting unit 700 performs the steps of:
constructing a layout similarity analysis function and a vector similarity analysis function, wherein the layout similarity analysis function is the same as a table layout feature similarity analysis function, and the vector similarity analysis function is the same as the template text similarity analysis function;
performing similarity evaluation on the first attribute vector layout features and the second attribute vector layout features according to the layout similarity analysis function to generate a first layout similarity coefficient;
performing similarity evaluation on the first filling character attribute vector and the second filling character attribute vector according to the vector similarity analysis function to generate a first vector similarity coefficient;
extracting each first template sorting result whose first layout similarity coefficient is greater than or equal to a third similarity coefficient threshold and whose first vector similarity coefficient is greater than or equal to a fourth similarity coefficient threshold, and adding it to the second template sorting result.
Further, the system also comprises a second execution unit, which performs the steps of:
when the number of the second template sorting results is greater than 1, adding the first layout similarity coefficient and the first vector similarity coefficient of the second template sorting results to generate a template sorting trend score;
and carrying out serialization adjustment on the second template sorting result based on the template sorting trend score, and sending the second template sorting result to a user side to obtain a third template sorting result, wherein the number of the third template sorting results is equal to 1.
Further, the system also comprises a third execution unit, which performs the steps of:
when the number of the form template matching results is equal to 0, or the number of the first template sorting results is equal to 0, or the number of the second template sorting results is equal to 0, generating a text rechecking instruction, and sending the text rechecking instruction to a user side to recheck the first document to be processed;
and when the number of the table template matching results is equal to 1 or the number of the first template sorting results is equal to 1, classifying the text of the first document to be processed according to the table template matching results or the first template sorting results.
The specification and figures are merely exemplary illustrations of the present application and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the application. It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the scope of the application. Thus, the present application is intended to include such modifications and alterations insofar as they come within the scope of the application or the equivalents thereof.

Claims (8)

1. A method of text processing, the method comprising:
carrying out primary feature analysis on the first document to be processed to generate document table layout features;
traversing a table template library to perform template recognition by combining the document table layout characteristics to generate a table template matching result;
when the number of the table template matching results is larger than 1, carrying out secondary feature analysis on the table template matching results to generate template text semantic vectors and semantic vector layout features;
sorting the form template matching result by combining the template text semantic vector and the semantic vector layout feature to generate a first template sorting result;
when the number of the first template sorting results is greater than 1, traversing the first template sorting results to perform three-level feature analysis, and generating a first filling character attribute vector and a first attribute vector layout feature;
performing four-level feature analysis on the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature;
sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout feature with the second filling character attribute vector and the second attribute vector layout feature to generate a second template sorting result;
and when the number of the second template sorting results is equal to 1, classifying the text of the first document to be processed according to the second template sorting result.
2. The method of claim 1, wherein traversing a table template library for template recognition in combination with the document table layout features generates a table template matching result, comprising:
evaluating the character arrangement direction of the first document to be processed to generate a first character arrangement direction;
extracting the character arrangement direction of a first table template of the table template library to generate a second character arrangement direction;
performing homodirectional corner alignment on the document table layout features and the first template table layout features based on the first character arrangement direction and the second character arrangement direction to generate a corner alignment result;
analyzing the similarity of the document table layout characteristics and the first template table layout characteristics according to the corner alignment result to generate a first template similarity coefficient;
and when the first template similarity coefficient meets a first similarity coefficient threshold, adding the first form template into the form template matching result.
3. The method of claim 2, wherein parsing the similarity of the document table layout features and the first template table layout features to generate first template similarity coefficients based on the corner alignment results comprises:
constructing a table layout feature similarity analysis function:
$X_1 = (x_{11}, x_{12}, \ldots, x_{1i}, \ldots, x_{1n})$;
$X_0 = (x_{01}, x_{02}, \ldots, x_{0i}, \ldots, x_{0m})$;
wherein $X_1$ represents the set of table center point coordinates of the document table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{1i}$ represents the i-th table center point coordinate of the document table layout features in that order; $n$ represents the number of table center point coordinates of the document table layout features; $X_0$ represents the set of table center point coordinates of the first template table layout features after corner alignment, ordered from left to right and from top to bottom; $x_{0i}$ represents the i-th table center point coordinate of the first template table layout features, in the same arrangement order as $x_{1i}$; $m$ represents the number of table center point coordinates of the first template table layout features; $d_0$ represents the offset distance threshold; $a$ represents the offset quantity threshold; count() is a counting function; and $SIM_1$ represents the table layout feature similarity;
and carrying out similarity analysis on the document table layout characteristics and the first template table layout characteristics according to the table layout characteristic similarity analysis function to generate the first template similarity coefficient.
4. The method of claim 1, wherein sorting the form template matching results in combination with the template text semantic vector and the semantic vector layout features to generate a first template sorting result comprises:
constructing a template text similarity analysis function:
$Y_0 = (y_{01}, y_{02}, \ldots, y_{0j}, \ldots, y_{0q})$;
$Y_1 = (y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1q})$;
wherein $Y_0$ represents the semantic binary-coded vectors of the tables at q different positions of the table template matching result; $Y_1$ represents the semantic binary-coded vectors of the tables at the q corresponding positions of the first document to be processed; $A(y_{1j}, y_{0j})$ represents the Hamming distance between co-located semantic binary-coded vectors; $a_0$ represents the deviation distance threshold; and $SIM_2$ represents the template text similarity;
performing five-level feature analysis on the first document to be processed according to the semantic vector layout features, and extracting the document semantic vectors with the same layout;
performing similarity analysis on the template text semantic vector and the document semantic vectors with the same layout according to the template text similarity analysis function, to generate a second template similarity coefficient;
and when the second template similarity coefficient is greater than or equal to a second similarity coefficient threshold value, adding the form template matching result into the first template sorting result.
5. The method of claim 1, wherein sorting the first template sorting result in combination with the first filling character attribute vector and the first attribute vector layout feature and the second filling character attribute vector and the second attribute vector layout feature to generate a second template sorting result comprises:
constructing a layout similarity analysis function and a vector similarity analysis function, wherein the layout similarity analysis function is the same as a table layout feature similarity analysis function, and the vector similarity analysis function is the same as the template text similarity analysis function;
performing similarity evaluation on the first attribute vector layout features and the second attribute vector layout features according to the layout similarity analysis function to generate a first layout similarity coefficient;
performing similarity evaluation on the first filling character attribute vector and the second filling character attribute vector according to the vector similarity analysis function to generate a first vector similarity coefficient;
extracting each first template sorting result whose first layout similarity coefficient is greater than or equal to a third similarity coefficient threshold and whose first vector similarity coefficient is greater than or equal to a fourth similarity coefficient threshold, and adding it to the second template sorting result.
6. The method as recited in claim 1, further comprising:
when the number of the second template sorting results is greater than 1, adding the first layout similarity coefficient and the first vector similarity coefficient of the second template sorting results to generate a template sorting trend score;
and carrying out serialization adjustment on the second template sorting result based on the template sorting trend score, and sending the second template sorting result to a user side to obtain a third template sorting result, wherein the number of the third template sorting results is equal to 1.
7. The method as recited in claim 1, further comprising:
when the number of the form template matching results is equal to 0, or the number of the first template sorting results is equal to 0, or the number of the second template sorting results is equal to 0, generating a text rechecking instruction, and sending the text rechecking instruction to a user side to recheck the first document to be processed;
and when the number of the table template matching results is equal to 1 or the number of the first template sorting results is equal to 1, classifying the text of the first document to be processed according to the table template matching results or the first template sorting results.
8. A text processing system, comprising:
the first feature processing unit is used for carrying out primary feature analysis on the first document to be processed to generate document table layout features;
the template matching unit is used for traversing a table template library to perform template recognition by combining the document table layout characteristics to generate a table template matching result;
the second feature processing unit is used for carrying out secondary feature analysis on the table template matching result when the number of the table template matching result is more than 1, so as to generate template text semantic vectors and semantic vector layout features;
the first sorting unit is used for sorting the form template matching result by combining the template text semantic vector and the semantic vector layout characteristic to generate a first template sorting result;
the third feature processing unit is used for traversing the first template sorting result to perform three-level feature analysis when the number of the first template sorting result is larger than 1, and generating a first filling character attribute vector and a first attribute vector layout feature;
the fourth feature processing unit is used for carrying out four-level feature analysis on the first document to be processed to generate a second filling character attribute vector and a second attribute vector layout feature;
the second sorting unit is used for sorting the first template sorting result by combining the first filling character attribute vector and the first attribute vector layout characteristic with the second filling character attribute vector and the second attribute vector layout characteristic to generate a second template sorting result;
and the first execution unit is used for classifying the text of the first document to be processed according to the second template sorting result when the number of the second template sorting results is equal to 1.
CN202311227355.XA 2023-09-21 Text processing method and system Active CN117131196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311227355.XA CN117131196B (en) 2023-09-21 Text processing method and system

Publications (2)

Publication Number Publication Date
CN117131196A true CN117131196A (en) 2023-11-28
CN117131196B CN117131196B (en) 2024-05-10

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074169A1 (en) * 2018-08-31 2020-03-05 Accenture Global Solutions Limited System And Method For Extracting Structured Information From Image Documents
CN114882515A (en) * 2022-05-30 2022-08-09 深圳壹账通智能科技有限公司 Table type determination method, device and medium based on neural network model
CN116524527A (en) * 2023-03-21 2023-08-01 山东浪潮科学研究院有限公司 Table image text recognition method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117540720A (en) * 2024-01-09 2024-02-09 深圳市宝溢显示技术有限公司 Exhibition guide information generation method and system
CN117540720B (en) * 2024-01-09 2024-03-26 深圳市宝溢显示技术有限公司 Exhibition guide information generation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240418

Address after: Room 201, Building 5, Room 202, Building 5, Room 203, Building 5, Room 204, Building 5, Room 205, Building 5, No. 8, Science Avenue, Huangpu District, Guangzhou City, Guangdong Province, 510700

Applicant after: China Unicom WO Music and Culture Co., Ltd.

Country or region after: China

Address before: 200120 room 205, west area, 2f, No. 707, Zhangyang Road, Pudong New Area, Shanghai

Applicant before: Shanghai Chenghu Information Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant