CN113033170B - Form standardization processing method, device, equipment and storage medium - Google Patents

Info

Publication number
CN113033170B
Authority
CN
China
Prior art keywords
text
title
processed
feature vector
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110441015.1A
Other languages
Chinese (zh)
Other versions
CN113033170A (en
Inventor
戚思骅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110441015.1A priority Critical patent/CN113033170B/en
Publication of CN113033170A publication Critical patent/CN113033170A/en
Application granted granted Critical
Publication of CN113033170B publication Critical patent/CN113033170B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/177Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a table standardization processing method, which comprises the following steps: acquiring a table to be processed comprising N rows and M columns; performing similarity detection on the texts of two adjacent rows, starting from the 1st row of the table, and, when the texts of the i-th row and the (i+1)-th row are detected to be inconsistent, determining the 1st row to the i-th row as title rows and the (i+1)-th row to the N-th row as data rows, wherein i is greater than or equal to 1 and less than or equal to N-1; performing similarity detection on the texts of two adjacent columns, starting from the 1st column of the table, and, when the texts of the j-th column and the (j+1)-th column are detected to be inconsistent, determining the 1st column to the j-th column as title columns and the (j+1)-th column to the M-th column as data columns, wherein j is greater than or equal to 1 and less than or equal to M-1; classifying and screening the title texts in the determined title rows and title columns to obtain preset standardized titles; and obtaining the standardized table of the table to be processed according to the standardized titles and the data corresponding to the standardized titles in the table to be processed. The method provided by the application can reduce the complexity of table standardization processing.

Description

Form standardization processing method, device, equipment and storage medium
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a method, a device, equipment and a storage medium for standardized processing of a table.
Background
Tables are a common way of presenting statistical data. From the data in a table, an enterprise can learn information such as the operating state of each service and customer preferences, and make decisions accordingly. However, tables of different storage types use different data formats, so standardizing them requires a great deal of manual work. Tables also contain irrelevant, redundant information, which hinders analysis of the data in them.
Identifying the title portion and the data portion of a table is a key step in table standardization. Existing methods identify these portions as follows: first, each cell in the table is identified; then the text in each cell is extracted; finally, each text is classified and standardized based on various algorithms. Because such methods must classify the text in every cell to determine whether it belongs to the title portion or the data portion, the algorithms are complicated.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for standardized processing of a table, which can reduce the complexity of the standardized processing method of the table.
In a first aspect, an embodiment of the present application provides a method for table normalization processing, where the method includes:
acquiring a table to be processed, wherein the table to be processed comprises N rows and M columns; starting from the 1st row in the table to be processed, performing similarity detection on the texts of two adjacent rows until the texts of the i-th row and the (i+1)-th row are detected to be inconsistent, determining the 1st row to the i-th row as title rows and the (i+1)-th row to the N-th row as data rows, wherein i is greater than or equal to 1 and less than or equal to N-1; starting from the 1st column in the table to be processed, performing similarity detection on the texts of two adjacent columns until the texts of the j-th column and the (j+1)-th column are detected to be inconsistent, determining the 1st column to the j-th column as title columns and the (j+1)-th column to the M-th column as data columns, wherein j is greater than or equal to 1 and less than or equal to M-1; classifying and screening the title texts in the determined title rows and title columns to obtain preset standardized titles; and obtaining the standardized table of the table to be processed according to the standardized titles and the data corresponding to the standardized titles in the table to be processed.
Based on the table standardization processing method provided by the application, the difference between title text and data text in a table is exploited: similarity detection is performed on adjacent texts starting from the first two rows or the first two columns of the table to be processed, and as soon as an inconsistency is detected, the data portion and the title portion of the table are identified. The remaining rows or columns of the table to be processed therefore need not be examined, which reduces the complexity of the algorithm. The category of each title text is then determined according to the categories in the standard library, each title text is screened, and redundant information is removed as required to obtain the standardized titles. Finally, the data in the table to be processed and the corresponding standardized title rows and title columns are stored in a database, so that the enterprise can analyze the data conveniently.
Optionally, the text of any two adjacent rows in the to-be-processed table is a set of text pairs, the text of any two adjacent columns in the to-be-processed table is a set of text pairs, and the method for detecting the similarity of any text pair in the to-be-processed table includes:
inputting a first text and a second text in a text pair into a trained similarity detection model for processing to obtain a similarity detection result of the first text and the second text, wherein the similarity detection result indicates that the first text and the second text are consistent or inconsistent, and the first text and the second text are texts of two adjacent rows or two adjacent columns in a to-be-processed table;
the similarity detection model comprises a pre-trained language model, a cross attention layer and a fully connected layer, and processing the text pair with the similarity detection model comprises the following steps:
inputting the first text and the second text into the pre-trained language model for feature extraction to obtain a first feature vector and a second feature vector; inputting the first feature vector and the second feature vector into the cross attention layer for fusion processing to obtain a high-dimensional feature vector; and inputting the high-dimensional feature vector into the fully connected layer for fully connected calculation to obtain the similarity detection result.
Optionally, inputting the first feature vector and the second feature vector into the cross attention layer for fusion processing to obtain a high-dimensional feature vector, including:
calculating a first weight of each value in the first feature vector relative to each value in the second feature vector; performing a weighted summation over the first feature vector according to the first weight to obtain a third feature vector; calculating a second weight of each value in the second feature vector relative to each value in the first feature vector; performing a weighted summation over the second feature vector according to the second weight to obtain a fourth feature vector; and concatenating the third feature vector and the fourth feature vector to obtain the high-dimensional feature vector.
Based on the above optional manner, the correlation between the features of the two input texts is fully considered during text similarity detection. A cross attention layer is added on top of an existing pre-trained language model, and the high-dimensional feature vector is calculated in a crosswise manner. This effectively characterizes the mapping relation between the features of the two input texts and strengthens the correlation between them, thereby improving the accuracy with which the similarity detection model recognizes text.
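The fusion step described above can be illustrated with a minimal sketch in plain Python. Several points are assumptions not stated in the application: each "feature vector" is treated as a sequence of per-token vectors, the weights are a softmax over dot-product scores, and the attended sequences are mean-pooled before concatenation. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def attend(source, target):
    """For each vector in `target`, return a weighted sum of the `source` vectors.

    The weights are a softmax over dot-product scores (assumed scoring function).
    """
    dim = len(source[0])
    attended = []
    for t in target:
        weights = softmax([dot(s, t) for s in source])
        attended.append(
            [sum(w * s[k] for w, s in zip(weights, source)) for k in range(dim)]
        )
    return attended

def mean_pool(vectors):
    """Average a sequence of vectors into one vector (assumed pooling step)."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def cross_attention_fuse(first, second):
    """Fuse two feature sequences into one concatenated feature vector."""
    third = attend(first, second)    # first features weighted against second
    fourth = attend(second, first)   # second features weighted against first
    return mean_pool(third) + mean_pool(fourth)  # list concatenation

# Toy inputs: two token vectors for the first text, one for the second.
fused = cross_attention_fuse([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

In a full model, the fused vector would then be passed to the fully connected layer to produce the consistent or inconsistent decision.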
Optionally, the table standardization processing method further includes: if the similarity detection result of every group of text pairs in the table to be processed is that the first text is consistent with the second text, determining that the table to be processed is a cross-page table, and merging the cross-page table with a first table, the first table being the table immediately preceding the cross-page table.
Optionally, the method for merging the cross-page table with the first table immediately preceding it includes: respectively acquiring the numbers of rows and columns of the cross-page table and of the first table; if the number of rows of the cross-page table is the same as that of the first table, performing column merging; if the number of columns of the cross-page table is the same as that of the first table, performing row merging.
Optionally, classifying and screening the title text in the determined title row and title column to obtain a preset standardized title, which includes: classifying the title text by using a trained text classification model, and determining the corresponding category of the title text in a preset standard library; and carrying out text screening on the title text according to the category corresponding to the title text in a preset standard library to obtain a preset standardized title.
Based on the above optional manner, the header text in the form can be screened according to the service requirement, and invalid information is filtered, so that enterprises can conveniently and accurately analyze data according to the information in the form.
Optionally, acquiring a table to be processed includes: and acquiring a table to be processed from a table data source, wherein the table data source comprises a picture format type, a PDF type or a table type.
In a second aspect, an embodiment of the present application provides a form normalization apparatus, including:
an acquisition unit, configured to acquire a table to be processed, wherein the table to be processed comprises N rows and M columns;
a recognition unit, configured to: starting from the 1st row in the table to be processed, perform similarity detection on the texts of two adjacent rows until the texts of the i-th row and the (i+1)-th row are detected to be inconsistent, and determine the 1st row to the i-th row as title rows and the (i+1)-th row to the N-th row as data rows, wherein i is greater than or equal to 1 and less than or equal to N-1; and, starting from the 1st column in the table to be processed, perform similarity detection on the texts of two adjacent columns until the texts of the j-th column and the (j+1)-th column are detected to be inconsistent, and determine the 1st column to the j-th column as title columns and the (j+1)-th column to the M-th column as data columns, wherein j is greater than or equal to 1 and less than or equal to M-1;
a classifying unit, configured to classify and screen the title texts in the determined title rows and title columns to obtain preset standardized titles;
and a standardization unit, configured to obtain the standardized table of the table to be processed according to the standardized titles and the data corresponding to the standardized titles in the table to be processed.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement a method according to any one of the embodiments of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a method according to any one of the embodiments of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product for, when run on a terminal device, causing the terminal device to perform the method of any one of the first aspects.
It will be appreciated that the advantages of the second to fifth aspects may be found in the relevant description of the first aspect, and are not described here again.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required for the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for table extraction and parsing according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a table to be processed according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a standardized table provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a similarity detection model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a cross-attention layer according to an embodiment of the present application;
FIG. 6 is a flowchart of similarity detection for text pairs based on a similarity detection model according to one embodiment of the present application;
FIG. 7 is a training flowchart of a similarity detection model according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a table extraction and analysis device according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Tables are a common way of presenting statistical data. From the data in a table, an enterprise can learn information such as the operating conditions of its services and customer preferences, and make decisions accordingly. However, tables of different storage types use different data formats, so standardizing them requires a great deal of manual work. Tables also contain some invalid information, which hinders analysis of the data in them.
Identifying the title portion and the data portion of a table is a key step in table standardization. Existing methods identify these portions as follows: first, each cell in the table is identified; then the text in each cell is extracted; finally, each text is classified and standardized based on various algorithms. Because such methods must identify and classify the text in every cell to determine whether it belongs to the title portion or the data portion, the complexity of the algorithm is high.
In order to reduce the complexity of table extraction and parsing, the application provides a table extraction and parsing method, device, equipment and storage medium. Taking full account of the correlation between the features of two input texts, the application proposes a new text similarity detection model based on an existing pre-trained language model, and determines the data portion and the title portion of a table by judging the text consistency of two adjacent rows and of two adjacent columns in the table to be processed. The method checks text consistency starting from the first two rows or the first two columns of the table to be processed. Once the starting row and the starting column of the data are identified, the remaining rows and columns of the table to be processed need not be examined, which reduces the complexity of the algorithm.
The technical scheme of the present application is described in detail below with specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 1 shows a flowchart of a table extraction parsing method provided in the present application. The method comprises the following steps:
S11, acquiring a table to be processed, wherein the table to be processed comprises N rows and M columns.
In this embodiment, a table to be processed may be acquired from a plurality of table data sources. The tabular data source includes: picture format (e.g., jpg, png, etc.), PDF format, and tabular format (e.g., excel, csv, tsv, etc.).
In one embodiment, a data source of the picture format type is processed with an OCR (Optical Character Recognition) tool to obtain a table to be processed and a row number corresponding to the table to be processed.
In another embodiment, a PDF format data source is processed based on a computer vision algorithm and a PDF format tool to obtain a table to be processed and a row number corresponding to the table to be processed. Illustratively, the method includes:
step one, converting the PDF file into a PNG image;
step two, binarizing the PNG image, and denoising the binarized image through erosion and dilation in mathematical morphology to obtain the table outline in the PNG image;
step three, acquiring the coordinates of the top-left vertex and the bottom-right vertex of the table outline in the PNG image;
step four, mapping the coordinates of the top-left vertex and the bottom-right vertex of the table outline in the PNG image into the PDF file, determining the range of the table in the PDF file, and cropping the PDF file to that range;
step five, processing the table within the cropped range using the pdfplumber tool to obtain the table to be processed and a row number corresponding to the table.
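The coordinate mapping in step four can be sketched as a simple scaling. This is a minimal illustration under stated assumptions, not the application's own implementation: the image is assumed to cover the whole page, and both the image and the PDF library are assumed to measure coordinates from the top-left corner. The function name and the sample sizes are hypothetical.

```python
def image_box_to_pdf_box(box_px, image_size_px, page_size_pt):
    """Scale a (left, top, right, bottom) pixel box to PDF page units.

    Assumes the rendered image covers the full page and that both
    coordinate systems have their origin at the top-left corner.
    """
    left, top, right, bottom = box_px
    image_width, image_height = image_size_px
    page_width, page_height = page_size_pt
    scale_x = page_width / image_width
    scale_y = page_height / image_height
    return (left * scale_x, top * scale_y, right * scale_x, bottom * scale_y)

# A 595 x 842 point page rendered at twice its size (1190 x 1684 pixels);
# the pixel box stands in for a table outline found in steps two and three.
pdf_box = image_box_to_pdf_box((100, 200, 1000, 800), (1190, 1684), (595, 842))
```

The resulting box could then be handed to a PDF library's cropping routine before table extraction. If the library measures from the bottom-left corner instead, the vertical coordinates would have to be flipped first.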
S12, starting from the 1st row in the table to be processed, performing similarity detection on the texts of two adjacent rows until the texts of the i-th row and the (i+1)-th row are detected to be inconsistent, determining the 1st row to the i-th row as title rows and the (i+1)-th row to the N-th row as data rows, wherein i is greater than or equal to 1 and less than or equal to N-1.
Specifically, the table includes a header section and a data section, and the header section includes a header row and a header column. i is a variable from 1 to N-1, and i is a positive integer. The text of any two adjacent rows in the form to be processed is a set of text pairs. The text pair comprises a first text and a second text, and the first text and the second text are respectively the texts of two adjacent rows in the table to be processed.
In one embodiment, starting from the 1st row of the table to be processed, the text pairs of two adjacent rows may be input directly into a trained similarity detection model for similarity detection. Alternatively, starting from the 1st row, the texts in the cells of each of two adjacent rows may be concatenated to obtain a first text string and a second text string, which are then input into the trained similarity detection model for similarity detection. The similarity detection result indicates that the first text and the second text are consistent or inconsistent. By way of example, the similarity detection model may be an ALBERT model, an XLNet model, or the like.
Illustratively, assume that the table to be processed includes 6 rows. The first text corresponding to the 1st row and the second text corresponding to the 2nd row are input into the trained similarity detection model. If the detection result is that the first text is consistent with the second text, similarity detection continues on the text pairs of adjacent rows starting from the 2nd row. The first text corresponding to the 2nd row and the second text corresponding to the 3rd row are then input into the trained similarity detection model. If the detection result is that the texts are inconsistent, the 1st and 2nd rows are determined to be title rows and the 3rd to 6th rows are determined to be data rows.
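The row scan can be sketched as a short loop that stops at the first inconsistent pair. In this minimal sketch the trained similarity detection model is replaced by a toy stand-in (`toy_is_consistent`, which only checks whether both texts contain digits); that stand-in, the function names, and the table values are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence

def find_header_boundary(
    lines: Sequence[str], is_consistent: Callable[[str, str], bool]
) -> Optional[int]:
    """Return the 1-based index i such that lines 1..i are title lines.

    Scanning stops at the first inconsistent adjacent pair, so later lines
    are never examined. Returns None if every adjacent pair is consistent.
    """
    for i in range(1, len(lines)):
        if not is_consistent(lines[i - 1], lines[i]):
            return i
    return None

def row_texts(table: List[List[str]]) -> List[str]:
    """Join the cells of each row into one text string."""
    return [" ".join(row) for row in table]

def column_texts(table: List[List[str]]) -> List[str]:
    """Join the cells of each column into one text string."""
    return [" ".join(column) for column in zip(*table)]

def toy_is_consistent(first: str, second: str) -> bool:
    """Stand-in for the trained model: consistent when both texts
    contain a digit, or neither does."""
    def has_digit(text: str) -> bool:
        return any(ch.isdigit() for ch in text)
    return has_digit(first) == has_digit(second)

# Made-up 6-row table: rows 1-2 are titles, column 1 is a title column.
table = [
    ["Insurance period", "Insured age", "Insured age"],
    ["Period", "Male", "Female"],
    ["three years", "14716", "15020"],
    ["five years", "15102", "15377"],
    ["ten years", "16230", "16544"],
    ["twenty years", "18011", "18390"],
]
header_rows = find_header_boundary(row_texts(table), toy_is_consistent)
header_columns = find_header_boundary(column_texts(table), toy_is_consistent)
```

Here `header_rows` is 2 and `header_columns` is 1: the same function serves the column scan because the columns are simply read off the transposed table.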
S13, starting from the 1st column in the table to be processed, performing similarity detection on the texts of two adjacent columns until the texts of the j-th column and the (j+1)-th column are detected to be inconsistent, determining the 1st column to the j-th column as title columns and the (j+1)-th column to the M-th column as data columns, wherein j is greater than or equal to 1 and less than or equal to M-1.
Specifically, j is a variable from 1 to M-1, and j is a positive integer. The text in any two adjacent columns in the form to be processed is a set of text pairs. The text pair comprises a first text and a second text, and the first text and the second text are respectively the texts of two adjacent columns in the table to be processed. The specific process of inputting the texts of the two adjacent columns into the trained similarity detection model for similarity detection may refer to the specific process of inputting the texts of the two adjacent rows into the trained similarity detection model for similarity detection in S12, which is not described herein.
S14, classifying and screening the title texts in the determined title rows and the title columns to obtain preset standardized titles.
In one possible implementation, the text classification model is first used to classify the title text in the determined title line and title column, and determine the corresponding category of the title text in the preset standard library. Illustratively, the categories of classification include: insurance period, insurance age and sex, etc. The text classification model may be TextCNN or TextRNN.
The title text is then screened according to its category in the preset standard library to obtain the preset standardized title. In one embodiment, the specific text matching the category of the title text can be extracted from the title text with a regular expression; that is, the required part is extracted from the title text and the invalid information in the title text is removed.
Illustratively, assume that one of the title texts in the table to be processed is "the insured person is male", in which "the insured person is" is invalid information to be removed and "male" is the specific part to be retained. The category of the title text is first determined to be "gender" by a trained TextCNN text classification model. The text "male", which matches the "gender" category, is then extracted from the title text with a regular expression. Finally, the title text "the insured person is male" is converted into the preset standardized title "male".
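The screening step for the gender example can be sketched with Python's `re` module. The category names, the patterns in `CATEGORY_PATTERNS`, and the function name are hypothetical; the application does not disclose its actual regular expressions, and the category is assumed to have been produced already by the text classification model.

```python
import re

# Hypothetical per-category patterns; a real standard library of categories
# would define its own. "female" is listed first so it is tried before "male".
CATEGORY_PATTERNS = {
    "gender": re.compile(r"\b(female|male)\b", re.IGNORECASE),
    "insurance period": re.compile(r"\b\d+\s*years?\b", re.IGNORECASE),
}

def standardize_title(title_text, category):
    """Extract the part of `title_text` that matches its category.

    Returns the matched text in lower case, or None when the category is
    unknown or nothing in the title matches.
    """
    pattern = CATEGORY_PATTERNS.get(category)
    if pattern is None:
        return None
    match = pattern.search(title_text)
    return match.group(0).lower() if match else None

standardized = standardize_title("The insured person is male", "gender")
```

With these assumed patterns, the invalid leading words are dropped and only "male" is kept as the standardized title.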
S15, obtaining the standardized form of the form to be processed according to the standardized title and the data corresponding to the standardized title in the form to be processed.
In one possible implementation, as shown in FIG. 3 (a), the standardized titles and the data corresponding to them in the table to be processed may be extracted to form a standardized table. The standardized table may be stored in different formats as required, such as PDF, Excel, or PNG.
Alternatively, the standardized titles and the corresponding data in the data portion may be converted into a specific data format and stored in a database. Illustratively, the data format may be SQL, JSON, XML, or the like. For example, in the table shown in FIG. 3 (a), the data 14716 corresponding to the standardized row title "three years" and the standardized column title "apply for years 18" may be converted into the SQL data format shown in FIG. 3 (b).
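As a rough illustration of this conversion, the sketch below turns one data cell and its two standardized titles into a JSON record and stores it in an SQLite database using Python's standard library. The table name `standardized_cells`, the column names, and the helper functions are assumptions; the actual format of FIG. 3 (b) is not reproduced here.

```python
import json
import sqlite3

def to_record(row_title, column_title, value):
    """Pair one data value with its standardized row and column titles."""
    return {"row_title": row_title, "column_title": column_title, "value": value}

def store_records(conn, records):
    """Insert records into a hypothetical `standardized_cells` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS standardized_cells "
        "(row_title TEXT, column_title TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO standardized_cells VALUES (:row_title, :column_title, :value)",
        records,
    )
    conn.commit()

# Titles and value quoted from the example in the text above.
record = to_record("three years", "apply for years 18", 14716)
as_json = json.dumps(record)

conn = sqlite3.connect(":memory:")
store_records(conn, [record])
rows = conn.execute(
    "SELECT row_title, column_title, value FROM standardized_cells"
).fetchall()
```

Storing one record per data cell keeps the row and column titles attached to every value, which makes later queries by title straightforward.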
For a table data source of the PDF format type or the picture format type, a table in the data source may span more than one page. In this case, the cross-page table must be identified and merged with the first table, that is, the table immediately preceding the cross-page table, before S14 above is executed. Therefore, after S13 above, the table standardization processing method described in the present application further includes:
S16, if the similarity detection result of every group of text pairs in the table to be processed is that the texts are consistent, determining that the table is a cross-page table, and merging the cross-page table with the first table immediately preceding it.
The cross-page table and the first table immediately preceding it are adjacent tables obtained from the same table data source in S11. In S11, after a plurality of tables to be processed are extracted from the same table data source, they are stored in order. S12 and S13 are then executed: the trained similarity detection model performs similarity detection on the text pairs of two adjacent rows and of two adjacent columns in each table to be processed. If the similarity detection result of every group of text pairs in a table to be processed is that the texts are consistent, the table is determined to be a cross-page table and is merged with the first table immediately preceding it.
Illustratively, take a data source of the PDF format type as an example. Assume that two tables to be processed are obtained from the same PDF table data source, that the first and second tables to be processed are handled in order, and that the trained similarity detection model detects that the first table to be processed contains both a title portion and a data portion. If the trained similarity detection model detects that the text pairs of adjacent rows in the second table to be processed are all consistent while the text pairs of adjacent columns are inconsistent, that is, the table contains no title row but does contain a title column, the table is a cross-page table. If the model detects that the text pairs of adjacent columns in the second table to be processed are all consistent while the text pairs of adjacent rows are inconsistent, that is, the table contains a title row but no title column, the table is also a cross-page table. In both cases, the cross-page table must be merged with the first table to be processed, which immediately precedes it.
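The decision logic in this passage can be summarized in a small helper. This is one reading of the text, not a definitive rule: a table is flagged as cross-page when it lacks a title row, a title column, or both, and the case of lacking both is marked for manual review. The function name and the dictionary keys are hypothetical.

```python
def classify_table(row_pairs_consistent, column_pairs_consistent):
    """Classify a table from its adjacent-pair similarity results.

    Each argument is a list of booleans, one per adjacent pair that was
    checked (True means the two texts were judged consistent).
    """
    # A title row (or column) exists only if some adjacent pair differed.
    has_title_row = not all(row_pairs_consistent)
    has_title_column = not all(column_pairs_consistent)
    return {
        "has_title_row": has_title_row,
        "has_title_column": has_title_column,
        # Missing either kind of title suggests a continuation table.
        "is_cross_page": not (has_title_row and has_title_column),
        # With no titles at all, the text calls for a manual content check.
        "needs_manual_check": not has_title_row and not has_title_column,
    }

# Second table from the example: rows all consistent, columns inconsistent.
result = classify_table([True, True, True], [False])
```

Note that `all([])` is true in Python, so a table with a single row or column has no pairs to compare and is treated here as having no title in that direction.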
In one possible implementation, a method for merging a page crossing table with a first table ordered first in the page crossing table includes:
Step one: calculate the numbers of rows and columns of the page-crossing table and of the first table.
Illustratively, the numbers of rows and columns of a PDF table can be calculated with the pdfplumber parsing library. The numbers of rows and columns of a picture table can be calculated with an OCR tool.
Step two: match the numbers of rows and columns of the first table against those of the page-crossing table, and perform row merging or column merging according to the matching result.
In this embodiment, if the number of rows of the page-crossing table is the same as the number of rows of the first table, column merging is performed. If the number of columns of the page-crossing table is the same as the number of columns of the first table, row merging is performed.
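The two steps above can be sketched as follows. This is an illustrative sketch only: it assumes each table is already available as a list of rows, each row being a list of cell strings, and the names `table_shape` and `merge_page_crossing` are hypothetical. Checking the row count before the column count mirrors the order of the text; what to do when both match is not specified in the source.

```python
def table_shape(table):
    """Return (row_count, column_count) of a table stored as a list of rows."""
    return len(table), (len(table[0]) if table else 0)


def merge_page_crossing(first_table, crossing_table):
    """Merge the page-crossing table into the table that precedes it."""
    first_rows, first_cols = table_shape(first_table)
    cross_rows, cross_cols = table_shape(crossing_table)

    if first_rows == cross_rows:
        # Same number of rows: column merging (tables placed side by side).
        return [a + b for a, b in zip(first_table, crossing_table)]
    if first_cols == cross_cols:
        # Same number of columns: row merging (tables stacked vertically).
        return first_table + crossing_table
    # Neither dimension matches: leave the decision to a human reviewer.
    raise ValueError("row and column counts both differ; merge manually")
```

Raising an error when neither count matches is a design choice of this sketch; it corresponds to the special cases that the text hands over to manual merging.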
In another possible implementation, page-crossing tables may be merged manually when special cases occur.
In one embodiment, the trained similarity detection model detects that the text pairs of adjacent columns and the text pairs of adjacent rows in the table to be processed are all text-consistent; that is, the table to be processed is a page-crossing table that contains neither a header row nor a header column and contains only a data part. In this case, it is necessary to determine manually whether the content of the page-crossing table matches that of the first table immediately preceding it. If the content matches, the numbers of rows and columns of the data part in the page-crossing table are compared with those of the data part in the first table, and the two data parts are then merged. If the number of rows of the data part in the page-crossing table is the same as the number of rows of the data part in the first table, the data parts are column-merged. If the number of columns of the data part in the page-crossing table is the same as the number of columns of the data part in the first table, the data parts are row-merged.
According to the table standardization processing method, based on the difference between header text and data text in a table, text similarity is detected starting from the first two rows or the first two columns of the table to be processed; when the similarity detection result is that the texts are inconsistent, the boundary between the header part and the data part of the table to be processed is considered to have been found. Once the starting row and starting column of the data are identified, the remaining rows and columns of the table do not need to be checked, which reduces the complexity of the algorithm. The title rows and title columns corresponding to each data item are then screened according to the categories in the standard library to obtain standardized title text and remove irrelevant information. A corresponding standardized table is constructed from the data corresponding to the standardized title text in the table to be processed, so that an enterprise can analyze the information in the standardized table effectively and make decisions accordingly.
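The early-stopping scan described above can be sketched as follows. The predicate `is_consistent` stands in for the trained similarity detection model; `toy_consistent` is a deliberately crude placeholder (it only compares whether both strings contain digits) and is an assumption of this sketch, not the model of the application.

```python
def find_data_start(lines, is_consistent):
    """Return the 0-based index of the first data line, or None.

    `lines` holds the text of each row (or each column), in order.
    `is_consistent(a, b)` stands in for the trained similarity model.
    None means every adjacent pair was consistent, so no header/data
    boundary exists in this direction.
    """
    for i in range(len(lines) - 1):
        if not is_consistent(lines[i], lines[i + 1]):
            # Lines 0..i are titles; the remaining lines are not scanned.
            return i + 1
    return None


def toy_consistent(a, b):
    """Placeholder predicate: consistent if both or neither contain digits."""
    return any(c.isdigit() for c in a) == any(c.isdigit() for c in b)
```

Calling `find_data_start` once on the row texts and once on the column texts gives the starting row and starting column of the data part; the loop returns at the first inconsistent pair, which is where the reduction in work comes from.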
In another possible implementation, for the similarity detection performed in S12 and S13 on the texts of two adjacent rows or two adjacent columns in the table to be processed, the present application proposes a similarity detection model whose structure is shown in fig. 4.
The similarity detection model includes a pre-trained language model, a cross-attention layer (Cross Attention Layer), and a fully connected layer. The model has two input branches, each containing the same pre-trained language model. A first text string corresponding to the first text and a second text string corresponding to the second text in the table to be processed are input into the pre-trained language model separately. The pre-trained language model encodes the feature information of each text into an Embedding feature vector, yielding a first feature vector corresponding to the first text and a second feature vector corresponding to the second text. The first feature vector and the second feature vector are input into the cross-attention layer for fusion processing to obtain a high-dimensional feature vector. The high-dimensional feature vector is input into the fully connected layer for full-connection calculation to obtain the similarity detection result.
Fig. 5 is a schematic structural diagram of the cross-attention layer. The cross-attention layer can characterize the mapping between one feature vector and another. Inputting the first feature vector and the second feature vector into the cross-attention layer for fusion processing yields a high-dimensional feature vector. The fusion process is as follows:
First, a similarity calculation is performed on the first feature vector and the second feature vector. Illustratively, the first feature vector is denoted $q = \{q_1, q_2, \ldots, q_n\}$ and the second feature vector is denoted $p = \{p_1, p_2, \ldots, p_m\}$, where $n$ represents the number of words in the first text string and $m$ represents the number of words in the second text string. The first similar feature $a$ of the second feature vector with respect to the first feature vector can be expressed as formulas (1) and (2):
$$a^T = \{a_1, a_2, \ldots, a_n\} \tag{1}$$
$$a_k = \{q_k * p_1, q_k * p_2, \ldots, q_k * p_m\}, \quad k \in \{1, \ldots, n\} \tag{2}$$
The second similar feature $b$ of the first feature vector with respect to the second feature vector can be expressed as formulas (3) and (4):
$$b^T = \{b_1, b_2, \ldots, b_m\} \tag{3}$$
$$b_l = \{p_l * q_1, p_l * q_2, \ldots, p_l * q_n\}, \quad l \in \{1, \ldots, m\} \tag{4}$$
Then, the first similar feature $a$ and the second similar feature $b$ are each normalized with a Softmax function to obtain a first weight $w$ and a second weight $v$. The first weight $w$ can be expressed as formula (5):
$$w^T = \{w_1, w_2, \ldots, w_n\} \tag{5}$$
In formula (5), $w^T$ represents the transpose of $w$. The second weight $v$ can be expressed as formula (8):
$$v^T = \{v_1, v_2, \ldots, v_m\} \tag{8}$$
In formula (8), $v^T$ represents the transpose of $v$. A third feature vector $q'$ of size $1 \times n$ can be obtained from the first weight and the second feature vector, and a fourth feature vector $p'$ of size $1 \times m$ can be obtained from the second weight and the first feature vector. Specifically, they can be expressed as formulas (11) and (12):
$$q' = w p^T \tag{11}$$
$$p' = v q^T \tag{12}$$
In formulas (11) and (12), $p^T$ represents the transpose of $p$ and $q^T$ represents the transpose of $q$.
Finally, the third feature vector $q'$ and the fourth feature vector $p'$ are concatenated to obtain a high-dimensional feature vector $\{q', p'\}$ of size $1 \times (m+n)$, which is input into the fully connected layer to obtain the similarity detection result.
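The fusion process can be illustrated with a small pure-Python sketch. It treats each feature vector as a list of scalar values, and it assumes that the Softmax normalization is applied separately to each $a_k$ and each $b_l$; the exact normalization formulas are not reproduced in the text, so that choice, like the function names, is an assumption of this sketch.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, p):
    """Fuse q (length n) and p (length m) into a vector of length n + m."""
    # Formula (2): a_k holds the products of q_k with every p_l.
    a = [[qk * pl for pl in p] for qk in q]
    # Formula (4): b_l holds the products of p_l with every q_k.
    b = [[pl * qk for qk in q] for pl in p]

    # Softmax normalization (assumed here to be applied row by row).
    w = [softmax(row) for row in a]  # n rows of m weights
    v = [softmax(row) for row in b]  # m rows of n weights

    # Formula (11): q'_k is the w_k-weighted sum of the values of p.
    q_prime = [sum(wk * pl for wk, pl in zip(row, p)) for row in w]
    # Formula (12): p'_l is the v_l-weighted sum of the values of q.
    p_prime = [sum(vl * qk for vl, qk in zip(row, q)) for row in v]

    # Concatenation gives the 1 x (n + m) high-dimensional feature vector.
    return q_prime + p_prime
```

With `q` of length 2 and `p` of length 3, the result has length 5, matching the stated size of the concatenated vector.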
In the present application, the table to be processed is divided into a plurality of text pairs, and for each text pair, the similarity detection can be performed by using a similarity detection model as shown in fig. 4.
In one embodiment, the method of similarity detection for any text pair in the table to be processed using the similarity detection model shown in fig. 4 is shown in fig. 6. Illustratively, the method includes:
S61, acquiring the text strings of two adjacent rows or two adjacent columns.
The text string pair comprises a first text string and a second text string, which are the text strings corresponding to two adjacent rows of text or two adjacent columns of text in the table to be processed.
In this embodiment, starting from the 1st row of the table to be processed, the texts in the cells of each of two adjacent rows or columns are combined, with the texts of different cells in a row or column separated by commas, to form a complete text string: [CLS] cell text 1, cell text 2 [SEP].
Illustratively, take the rows of the table to be processed shown in FIG. 2 as an example. Combining the texts in the cells of the 1st row of the table to be processed gives a first text string, which can be expressed as: [CLS] insured age, payment period, male [SEP]. Combining the texts in the cells of the 2nd row of the table to be processed gives a second text string, which can be expressed as: [CLS] 18, 10 years, 1540 [SEP].
S62, starting from the 1 st row or the 1 st column in the table to be processed, inputting text strings of two adjacent rows or two adjacent columns into a trained similarity detection model to perform similarity detection.
The specific implementation manner of this step is shown above, and will not be described here again.
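The string construction in S61 can be sketched as follows, assuming the table is held as a list of rows of cell strings. The function names and the exact spacing around the [CLS] and [SEP] markers are illustrative choices of this sketch.

```python
def build_text_string(cells):
    """Join the cell texts of one row (or one column) into one string."""
    return "[CLS] " + ", ".join(cells) + " [SEP]"


def adjacent_row_pairs(table):
    """Return (first_text_string, second_text_string) for adjacent rows."""
    strings = [build_text_string(row) for row in table]
    return list(zip(strings, strings[1:]))


def adjacent_column_pairs(table):
    """Same as above, after transposing so that columns become rows."""
    columns = [list(col) for col in zip(*table)]
    return adjacent_row_pairs(columns)
```

Each returned pair is what S62 passes to the similarity detection model, one pair at a time, starting from the first row or the first column.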
The existing text similarity detection method generally utilizes a pre-training language model to respectively extract features of two input texts to obtain corresponding high-dimensional feature vectors, and the corresponding high-dimensional feature vectors are directly input into a fully-connected network for classification, so that the correlation of the features between the two input texts is not considered, and the classification accuracy of a deep learning model is low. Therefore, for similarity judgment of two input texts, a cross attention layer is added on the basis of the existing pre-training language model, a high-dimensional feature vector is calculated in a cross mode, the mapping relation between two input text features can be effectively represented, the relevance between the two input text features is enhanced, and therefore the accuracy of detection of a similarity detection model is improved.
The training flowchart of the similarity detection model shown in fig. 4 is shown in fig. 7, and the training process is as follows:
S71, acquiring a training sample set.
The training sample set comprises a plurality of training samples. Each training sample comprises a first text string sample and a second text string sample, which are the text string samples corresponding to two adjacent rows or two adjacent columns of text in a table sample.
Specifically, table samples are first acquired from different table data sources according to the method described in S11 above. The table samples are then processed according to the method described in S61 above to obtain the text string samples of two adjacent rows or two adjacent columns in each table sample.
In addition, if one of the first text string sample and the second text string sample is title text and the other is data text, the label of the text string pair sample is set to indicate that the first text and the second text are inconsistent. If the first text string sample and the second text string sample are both title text or both data text, the label of the text string pair sample is set to indicate that the first text and the second text are consistent.
S72, inputting the training sample set into an initial similarity detection model for iterative training.
The first text string sample and the second text string sample of each training sample are input separately into the initial pre-trained language model for feature extraction to obtain a first feature vector and a second feature vector. The first feature vector and the second feature vector are input into the cross-attention layer for fusion processing to obtain a high-dimensional feature vector. The high-dimensional feature vector is input into the fully connected layer, and the initial similarity detection model is iteratively trained with a binary cross-entropy loss function to obtain the trained similarity detection model. For the specific process, refer to the detailed description of the similarity detection model shown in fig. 4 above, which is not repeated here.
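The labelling rule of S71 and the loss used in S72 can be sketched as follows. Encoding "consistent" as 1 and "inconsistent" as 0, and modelling the fully connected layer as a single sigmoid output, are assumptions of this sketch; the source only states that a fully connected layer and a two-class cross-entropy loss are used.

```python
import math


def label_pair(first_is_title, second_is_title):
    """Return 1 (consistent) when both strings are titles or both are data,
    and 0 (inconsistent) when one is a title and the other is data."""
    return 1 if first_is_title == second_is_title else 0


def predict_consistent(features, weights, bias):
    """Single-output fully connected layer followed by a sigmoid."""
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(probability, label, eps=1e-12):
    """Binary cross-entropy loss for one training sample."""
    # Clamp to avoid log(0) when the prediction saturates.
    probability = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(probability)
             + (1 - label) * math.log(1.0 - probability))
```

During iterative training, the loss would be averaged over the samples of a batch and minimized with respect to the parameters of all three components; that optimization loop is omitted here.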
On the one hand, in the process of similarity detection, the method fully considers the correlation of the features between two input texts, adds a cross attention layer on the basis of the existing pre-training language model, calculates high-dimensional feature vectors in a cross mode, can effectively represent the mapping relation between the features of the two input texts, and enhances the correlation between the features of the two input texts, so that the accuracy of similarity detection model identification is improved. On the other hand, the consistency of the text is detected from the first two rows or the first two columns of the table to be processed, and once the initial row and the initial column of the data are identified, the rest rows and columns in the table to be processed are not required to be detected, so that the complexity of an algorithm is reduced. And screening the title row and the title column corresponding to each data according to the types in the standard library, and removing irrelevant information. The method is beneficial to enterprises to accurately analyze the information in the standardized form according to business requirements.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present application in any way.
Corresponding to the table normalization processing method described in the above embodiments, fig. 8 shows a block diagram of the table normalization apparatus provided in the embodiment of the present application, and for convenience of explanation, only the portion relevant to the embodiment of the present application is shown.
Referring to fig. 8, the apparatus 800 includes: an acquisition unit 801, an identification unit 802, a classification unit 803, and a normalization unit 804.
Specifically, the acquisition unit 801 is configured to acquire a table to be processed, where the table to be processed includes N rows and M columns.
The identification unit 802 is configured to: starting from the 1st row in the table to be processed, perform similarity detection on the texts of two adjacent rows until the texts of the i-th row and the (i+1)-th row are detected to be inconsistent, and determine the 1st row to the i-th row as title rows and the (i+1)-th row to the N-th row as data rows, where 1 ≤ i ≤ N-1; and, starting from the 1st column in the table to be processed, perform similarity detection on the texts of two adjacent columns until the texts of the j-th column and the (j+1)-th column are detected to be inconsistent, and determine the 1st column to the j-th column as header columns and the (j+1)-th column to the M-th column as data columns, where 1 ≤ j ≤ M-1.
The classification unit 803 is configured to perform classification and screening on the title text in the determined title rows and title columns to obtain a preset standardized title.
The normalization unit 804 is configured to obtain a standardized table of the table to be processed from the standardized title and the data corresponding to the standardized title in the table to be processed.
Fig. 9 is a schematic structural diagram of a form normalization device provided in the present application. The device 900 may be a terminal device, a server, or a chip. The device 900 comprises one or more processors 901, which may support the device 900 in implementing the methods described in the method embodiments above. The processor 901 may be a general-purpose processor or a special-purpose processor. For example, the processor 901 may be a central processing unit (CPU). The CPU may be used to control the device 900, execute software programs, and process the data of the software programs.
In one embodiment, the device 900 may include a communication unit 905 to enable input (reception) and output (transmission) of signals. For example, the device 900 may be a chip, the communication unit 905 may be an input and/or output circuit of the chip, or the communication unit 905 may be a communication interface of the chip, which may be an integral part of a terminal device or a network device or other electronic device. For another example, the device 900 may be a terminal device or a server, the communication unit 905 may be a transceiver of the terminal device or the server, or the communication unit 905 may be a transceiver circuit of the terminal device or the server.
In another embodiment, the apparatus 900 may include one or more memories 902, on which a program 904 is stored, where the program 904 is executable by the processor 901 to generate the instructions 903, so that the processor 901 performs the table normalization processing method described in the above method embodiment according to the instructions 903.
In other embodiments, the memory 902 may also have data stored therein. Alternatively, the processor 901 may also read data stored in the memory 902, which may be stored at the same memory address as the program 904, or which may be stored at a different memory address than the program 904.
The processor 901 and the memory 902 may be provided separately or may be integrated together, for example, on a System On Chip (SOC) of the terminal device.
For the specific manner in which the processor 901 performs the table normalization processing method provided in the above embodiments, refer to the relevant description in the above embodiments.
It should be understood that the steps of the above method embodiments may be completed by logic circuitry in the form of hardware or by instructions in the form of software in the processor 901. The processor 901 may be a CPU, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
The embodiment of the application also provides a network device, which comprises: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, which when executed by the processor performs the steps of any of the various method embodiments described above.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Embodiments of the present application provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to perform the steps of the method embodiments described above.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (7)

1. A method for normalizing a table, comprising:
acquiring a table to be processed, wherein the table to be processed comprises N rows and M columns;
starting from the 1st row in the table to be processed, performing similarity detection on the texts of two adjacent rows until it is detected that the texts of the i-th row and the (i+1)-th row are inconsistent, determining the 1st row to the i-th row as title rows and the (i+1)-th row to the N-th row as data rows, wherein 1 ≤ i ≤ N-1;
starting from the 1st column in the table to be processed, performing similarity detection on the texts of two adjacent columns until it is detected that the texts of the j-th column and the (j+1)-th column are inconsistent, determining the 1st column to the j-th column as header columns and the (j+1)-th column to the M-th column as data columns, wherein 1 ≤ j ≤ M-1;
the text of any two adjacent rows in the to-be-processed table is a group of text pairs, the text of any two adjacent columns in the to-be-processed table is a group of text pairs, and the method for detecting the similarity of any text pair of the to-be-processed table comprises the following steps:
inputting a first text and a second text in the text pair into a trained similarity detection model for processing to obtain a similarity detection result of the first text and the second text, wherein the similarity detection result indicates that the first text and the second text are consistent or inconsistent, and the first text and the second text are texts of two adjacent rows or two adjacent columns in the to-be-processed table;
wherein the similarity detection model comprises a pre-training language model, a cross-attention layer and a full-connection layer, and the processing the text pairs by using the similarity detection model comprises:
inputting the first text and the second text into the pre-training language model for feature extraction to obtain a first feature vector and a second feature vector, calculating a first weight of each value in the first feature vector relative to each value in the second feature vector, carrying out weighted summation calculation on the first feature vector according to the first weight to obtain a third feature vector, calculating a second weight of each value in the second feature vector relative to each value in the first feature vector, carrying out weighted summation calculation on the second feature vector according to the second weight to obtain a fourth feature vector, cascading the third feature vector and the fourth feature vector to obtain a high-dimensional feature vector, and inputting the high-dimensional feature vector into the full-connection layer for full-connection calculation to obtain the similarity detection result;
classifying the title texts in the determined title rows and the title columns by using a trained text classification model, and determining the corresponding categories of the title texts in a preset standard library;
performing text screening on the title text according to the category corresponding to the title text in the preset standard library to obtain a preset standardized title, wherein the standardized title is specific text that matches the category of the title text and is extracted from the title text through a regular expression;
and obtaining the standardized form of the form to be processed according to the standardized title and the data corresponding to the standardized title in the form to be processed.
2. The method of claim 1, wherein the method further comprises:
if the similarity detection result of every group of text pairs in the to-be-processed table is that the first text and the second text are consistent, determining that the to-be-processed table is a page-crossing table, and merging the page-crossing table with a first table that immediately precedes the page-crossing table.
3. The method of claim 2, wherein merging the page-crossing table with the first table that immediately precedes the page-crossing table comprises:
respectively acquiring the numbers of rows and columns of the page-crossing table and the first table;
if the number of rows of the page-crossing table is the same as that of the first table, performing column merging;
and if the number of columns of the page-crossing table is the same as that of the first table, performing row merging.
4. A method according to any one of claims 1 to 3, wherein the obtaining a table to be processed comprises: acquiring the table to be processed from a table data source, wherein the table data source comprises a picture format type, a PDF type or a table type.
5. A form normalization apparatus, comprising:
an obtaining unit, configured to obtain a table to be processed, where the table to be processed includes N rows and M columns;
a recognition unit, configured to: starting from the 1st row in the to-be-processed table, perform similarity detection on the texts of two adjacent rows until it is detected that the texts of the i-th row and the (i+1)-th row are inconsistent, and determine the 1st row to the i-th row as title rows and the (i+1)-th row to the N-th row as data rows, wherein 1 ≤ i ≤ N-1; and, starting from the 1st column in the table to be processed, perform similarity detection on the texts of two adjacent columns until it is detected that the texts of the j-th column and the (j+1)-th column are inconsistent, and determine the 1st column to the j-th column as header columns and the (j+1)-th column to the M-th column as data columns, wherein 1 ≤ j ≤ M-1;
the text of any two adjacent rows in the to-be-processed table is a group of text pairs, the text of any two adjacent columns in the to-be-processed table is a group of text pairs, and the method for detecting the similarity of any text pair of the to-be-processed table comprises the following steps:
inputting a first text and a second text in the text pair into a trained similarity detection model for processing to obtain a similarity detection result of the first text and the second text, wherein the similarity detection result indicates that the first text and the second text are consistent or inconsistent, and the first text and the second text are texts of two adjacent rows or two adjacent columns in the to-be-processed table;
wherein the similarity detection model comprises a pre-training language model, a cross-attention layer and a full-connection layer, and the processing the text pairs by using the similarity detection model comprises:
inputting the first text and the second text into the pre-training language model for feature extraction to obtain a first feature vector and a second feature vector, calculating a first weight of each value in the first feature vector relative to each value in the second feature vector, carrying out weighted summation calculation on the first feature vector according to the first weight to obtain a third feature vector, calculating a second weight of each value in the second feature vector relative to each value in the first feature vector, carrying out weighted summation calculation on the second feature vector according to the second weight to obtain a fourth feature vector, cascading the third feature vector and the fourth feature vector to obtain a high-dimensional feature vector, and inputting the high-dimensional feature vector into the full-connection layer for full-connection calculation to obtain the similarity detection result;
a classifying unit, configured to classify the title texts in the determined title rows and title columns by using a trained text classification model, and determine the categories corresponding to the title texts in a preset standard library; and to perform text screening on the title text according to the category corresponding to the title text in the preset standard library to obtain a preset standardized title, wherein the standardized title is specific text that matches the category of the title text and is extracted from the title text through a regular expression;
and a normalization unit, configured to obtain a standardized table of the table to be processed according to the standardized title and the data corresponding to the standardized title in the table to be processed.
6. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 4 when executing the computer program.
7. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 4.
Application number: CN202110441015.1A; priority date: 2021-04-23; filing date: 2021-04-23; title: Form standardization processing method, device, equipment and storage medium; status: Active; publication: CN113033170B (en)


Publications (2)

Publication Number Publication Date
CN113033170A (en) 2021-06-25
CN113033170B (en) 2023-08-04

Family

ID=76457473

Also Published As

Publication number Publication date
CN113033170A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
US20210224286A1 (en) Search result processing method and apparatus, and storage medium
US20210064860A1 (en) Intelligent extraction of information from a document
CN113657425B (en) Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN110555372A (en) Data entry method, device, equipment and storage medium
Singh et al. A study of moment based features on handwritten digit recognition
CN112632980A (en) Enterprise classification method and system based on big data deep learning and electronic equipment
CN107329954B (en) Topic detection method based on document content and mutual relation
CN109190698B (en) Classification and identification system and method for network digital virtual assets
CN110362798B (en) Method, apparatus, computer device and storage medium for judging information retrieval analysis
CN111125457A (en) Deep cross-modal Hash retrieval method and device
CN113627190A (en) Visualized data conversion method and device, computer equipment and storage medium
CN112786160A (en) Multi-image input multi-label gastroscope image classification method based on graph neural network
CN113033170B (en) Form standardization processing method, device, equipment and storage medium
CN111144453A (en) Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data
CN111460817A (en) Method and system for recommending criminal legal document related law provision
US20220358779A1 (en) Systems and Methods for Generating Document Numerical Representations
CN115640418A (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN116011810A (en) Regional risk identification method, device, equipment and storage medium
US11816909B2 (en) Document clusterization using neural networks
CN111898618B (en) Method, device and program storage medium for identifying ancient graphic characters
CN114637846A (en) Video data processing method, video data processing device, computer equipment and storage medium
CN114580398A (en) Text information extraction model generation method, text information extraction method and device
CN116450781A (en) Question and answer processing method and device
CN113535928A (en) Service discovery method and system using a long short-term memory network based on attention mechanism
CN111460206A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant