CN116860747A - Training sample generation method and device, electronic equipment and storage medium - Google Patents

Training sample generation method and device, electronic equipment and storage medium

Info

Publication number
CN116860747A
CN116860747A
Authority
CN
China
Prior art keywords
text
target
document file
information
text block
Prior art date
Legal status
Pending
Application number
CN202310738464.1A
Other languages
Chinese (zh)
Inventor
徐鹏飞 (Xu Pengfei)
Current Assignee
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd filed Critical Hangzhou Dt Dream Technology Co Ltd
Priority to CN202310738464.1A priority Critical patent/CN116860747A/en
Publication of CN116860747A publication Critical patent/CN116860747A/en
Pending legal-status Critical Current


Classifications

    • G06F16/22: Information retrieval of structured data, e.g. relational data; indexing; data structures and storage structures therefor
    • G06F16/2282: Tablespace storage structures; management thereof
    • G06F40/177: Handling natural language data; text processing; editing of tables, e.g. using ruled lines
    • G06N20/00: Machine learning

Abstract

The application provides a training sample generation method and apparatus, an electronic device, and a storage medium, relating to the field of artificial intelligence. The training sample generation method includes: acquiring a first document file containing a table, where the first document file includes at least one text block and attribute information of the text block, the text block corresponds to a cell of the table, and the attribute information includes position information; extracting at least one target text block in the first document file and target attribute information corresponding to the target text block; and labeling the target text block according to the target attribute information to obtain a training sample, where the training sample is used for training a table text extraction model. In this way, the first document file's ability to divide the table into cells and to output the text blocks corresponding to those cells can be used, so that cells are labeled automatically instead of being identified manually to obtain training samples, which improves the generation efficiency of training samples and reduces labor cost.

Description

Training sample generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular to a training sample generation method and apparatus, an electronic device, and a storage medium.
Background
OCR (Optical Character Recognition) refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper and translates their shapes into computer text using character recognition methods. That is, printed text is scanned to produce an image file, which is then analyzed and processed to obtain the text and its layout information.
Using OCR technology in combination with deep learning methods, the function of extracting text from PDF (Portable Document Format) files or image files and structuring it can be implemented. For example, a table text extraction model may be used to extract text from a PDF file or an image file to obtain structured information.
In order to improve the prediction effect and prediction precision of the table text extraction model, a large number of training samples are required to pre-train the model. However, generating a large number of training samples requires substantial manual data labeling: at present, training samples are labeled manually through crowdsourcing, so this way of generating samples has high labor cost and low generation efficiency.
Disclosure of Invention
The present application aims to solve, at least to some extent, one of the above technical problems.
Therefore, the application provides a training sample generation method and apparatus, an electronic device, and a storage medium. By using the first document file's ability to divide the cells of a document table and to output the text blocks corresponding to the divided cells, the labeling of cells is completed automatically instead of by manual identification, so as to obtain training samples. That is, the text blocks in the first document file can be labeled automatically according to the attribute information that the first document file provides for each text block, which solves the problems of low efficiency and high cost of manually labeling training samples, improves the generation efficiency of training samples, and reduces labor cost.
An embodiment of a first aspect of the present application provides a method for generating a training sample, including:
acquiring a first document file containing a table; wherein the first document file includes at least one text block, and attribute information of the text block, the text block corresponding to a cell of the table; the attribute information includes location information;
extracting at least one target text block in the first document file and target attribute information corresponding to the target text block;
labeling the target text block according to the target attribute information to obtain a training sample; the training samples are used for training the table text extraction model.
An embodiment of a second aspect of the present application provides a training sample generating apparatus, including:
an acquisition module for acquiring a first document file containing a table; wherein the first document file includes at least one text block, and attribute information of the text block, the text block corresponding to a cell of the table; the attribute information includes location information;
the extraction module is used for extracting at least one target text block in the first document file and target attribute information corresponding to the target text block;
the labeling module is used for labeling the target text block according to the target attribute information so as to obtain a training sample; the training samples are used for training the table text extraction model.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of generating training samples as described in the first aspect when the program is executed.
An embodiment of a fourth aspect of the present application proposes a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of generating training samples according to the first aspect.
An embodiment of a fifth aspect of the present application proposes a computer program product comprising a computer program which, when executed by a processor, implements a method of generating training samples according to the first aspect of the present application.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
By acquiring a first document file containing a table, wherein the first document file comprises at least one text block and attribute information of the text block, the text block corresponds to a cell of the table, and the attribute information comprises position information; extracting at least one target text block in the first document file and target attribute information corresponding to the target text block; and labeling the target text block according to the target attribute information to obtain a training sample, wherein the training sample is used for training the table text extraction model. In this way, the first document file's ability to divide the cells of the document table and to output the text blocks corresponding to the divided cells can be used, and manual identification of cells is replaced to obtain training samples; that is, the text blocks in the first document file are labeled automatically according to the attribute information that the first document file provides for each text block. This solves the problems of low efficiency and high cost of manually labeling training samples, improves the generation efficiency of training samples, and reduces labor cost.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a flow chart of a method for generating training samples according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating another method for generating training samples according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another method for generating training samples according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating another method for generating training samples according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating another method for generating training samples according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating another method for generating training samples according to an embodiment of the present application;
FIG. 7 is a flow chart of a method for training a table text extraction model according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating an application method of a table text extraction model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a training sample generation process according to an embodiment of the present application;
FIG. 10 is a diagram illustrating table contents in a Word file according to an embodiment of the present application;
FIG. 11 is a diagram of a modified Word file according to an embodiment of the present application;
FIG. 12 is a diagram of an annotated text block sample provided by an embodiment of the present application;
FIG. 13 is a schematic structural view of a training sample generating apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural view of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.
The embodiment of the application provides a training sample generation method and apparatus, and an electronic device. Before describing the embodiments of the present application in detail, for ease of understanding, some general technical terms are first introduced:
the PDF file is a document file whose file format is PDF format.
In some scenarios, the text in a PDF file is embedded in character (text) form; for example, when a document file in Word format (referred to simply as a Word file) is converted into a PDF file, the text in the Word file is embedded into the PDF file in this form.
In other scenarios, the text in a PDF file is embedded in the form of a picture; for example, when a paper document is scanned with a scanner to generate a PDF file, the text becomes part of the PDF file as an image.
In general, most of the characters in PDF files are embedded in the form of pictures, and extracting text from such PDF files is difficult. In addition, a table can be embedded in a PDF file, and extracting the text in the table is also difficult.
Table text structuring refers to extracting the text content of a table in a PDF file, especially a table embedded in picture form, and organizing and storing that text content in a relational data table.
The method for generating training samples provided by the present application will be described in detail with reference to fig. 1.
Fig. 1 is a flow chart of a method for generating training samples according to an embodiment of the present application.
The training sample generation method of the embodiment of the application can be executed by the training sample generation device provided by the embodiment of the application. The training sample generating device can be applied to electronic equipment to execute the training sample generating function. Alternatively, the training sample generating device may be configured in an application of the electronic device, so that the application may perform the training sample generating function.
The electronic device may be any device with computing capability, and the device or an application in the device may perform the training sample generation function. The device with computing capability may be, for example, a personal computer, a mobile terminal, a server, etc., and the mobile terminal may be, for example, a vehicle-mounted device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or other hardware devices with various operating systems, a touch screen, and/or a display screen.
As shown in fig. 1, the training sample generating method includes the following steps:
step S101, a first document file containing a table is acquired; wherein the first document file comprises at least one text block and attribute information of the text block, and the text block corresponds to a cell of the table; the attribute information includes location information.
In the embodiment of the present application, the attribute information may include, but is not limited to, location information of the text block (for example, a top left vertex coordinate and a bottom right vertex coordinate of the text block), and may further include other attributes besides the location information, for example, the attribute information may further include identification information (such as a number) of the text block, a top left vertex number, a bottom right vertex number, and the like.
In the embodiment of the present application, the file format of the first document file is not limited, for example, the file format of the first document file may be PDF format, OSD (Operational Support Document) format, or the like.
In an embodiment of the present application, at least one text block may be included in the first document file, where the text block corresponds to a cell of a table in the first document file, i.e., the text block is divided according to the cell in the table. Also, attribute information of each text block may be included in the first document file.
Step S102, at least one target text block in the first document file and target attribute information corresponding to the target text block are extracted.
The target text block may be any text block that needs to be identified in the first document file. For example, the target text block may be a text block with a relatively large area in each text block, or the target text block may be a text block with a relatively large confidence (which may be set manually or predicted by a model) in each text block, or the like, which is not limited by the present application.
In the embodiment of the present application, at least one target text block may be extracted from the first document file, and attribute information (referred to as target attribute information in the present application) of the target text block may be extracted.
Step S103, labeling the target text block according to the target attribute information to obtain a training sample; the training samples are used for training the table text extraction model.
In the embodiment of the application, the target text block in the first document file can be marked according to the target attribute information of the target text block to obtain a training sample, wherein the training sample is used for training the table text extraction model.
For example, a label frame may be added to the first document file according to the target attribute information, so as to obtain a training sample, where each label frame includes a target text block.
According to the training sample generation method, a first document file containing a table is acquired, wherein the first document file comprises at least one text block and attribute information of the text block, the text block corresponds to a cell of the table, and the attribute information comprises position information; at least one target text block in the first document file and target attribute information corresponding to the target text block are extracted; and the target text block is labeled according to the target attribute information to obtain a training sample, wherein the training sample is used for training the table text extraction model. In this way, the first document file's ability to divide the cells of the document table and to output the text blocks corresponding to the divided cells can be used, and manual identification of cells is replaced to obtain training samples; that is, the text blocks in the first document file are labeled automatically according to the attribute information that the first document file provides for each text block. This solves the problems of low efficiency and high cost of manually labeling training samples, improves the generation efficiency of training samples, and reduces labor cost.
In order to clearly explain how the step S101 in the above embodiment of the present application obtains the first document file containing the table, the present application also proposes a training sample generating method.
Fig. 2 is a flowchart of another method for generating training samples according to an embodiment of the present application.
As shown in fig. 2, step S101 may include the following steps, based on the embodiment shown in fig. 1:
step S201, an editable second document file containing a table is acquired.
In the embodiment of the present application, the file format of the second document file is not limited; for example, it may be any editable file format that supports drawing tables, such as Word format or Excel format.
In the embodiment of the present application, the obtaining manner of the second document file is not limited, for example, the second document file may be a document file manually provided by a related person, or the second document file may be a document file collected online, or the second document file may be a document file sent by another person, or the second document file may be a document file obtained or collected in an actual service scene, or the like, which is not limited in the present application.
Step S202, randomly modifying text content in the second document file to obtain at least one third document file.
In the embodiment of the application, a random modification algorithm can be adopted to randomly modify the text content in at least one cell of the table in the second document file, so as to obtain at least one third document file. For example, the text content in the second document file may be randomly modified by using functions such as RAND() and FLOOR() to obtain at least one third document file.
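As an illustrative sketch only (not part of the original disclosure), and assuming the second document file is a .docx file processed with the python-docx library, the random modification of table cell text may look roughly as follows; the rewrite probability and the numeric generator are arbitrary examples:

import random
from docx import Document  # python-docx, assumed available

def randomly_modify_table(src_path: str, dst_path: str) -> None:
    """Randomly rewrite some table cells of an editable Word file."""
    doc = Document(src_path)
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                # Rewrite roughly half of the non-empty cells with a random number.
                if cell.text.strip() and random.random() < 0.5:
                    cell.text = str(random.randint(0, 10 ** 8))
    doc.save(dst_path)

# Hypothetical usage: produce one third document file from the second document file.
# randomly_modify_table("template.docx", "modified_000.docx")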
In any embodiment of the application, not only the text content in the second document file but also the font style, the font size, and the like may be modified. For example, when the font in the second document file is a regular script, the font in at least some of the cells may be changed from the regular script to a Song typeface, a handwriting style, and so on, thereby enhancing the diversity of the training samples.
Step S203, at least one third document file is converted into at least one first document file.
In the embodiment of the present application, each of the third document files may be converted into one first document file.
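A minimal sketch of this conversion step, assuming LibreOffice is installed and its soffice binary is on the PATH (the converter choice and the paths are assumptions, not specified by the patent):

import subprocess
from pathlib import Path

def convert_to_pdf(docx_dir: str, pdf_dir: str) -> None:
    """Convert every third document file (.docx) into a first document file (.pdf)."""
    Path(pdf_dir).mkdir(parents=True, exist_ok=True)
    for docx_path in Path(docx_dir).glob("*.docx"):
        subprocess.run(
            ["soffice", "--headless", "--convert-to", "pdf",
             "--outdir", pdf_dir, str(docx_path)],
            check=True,
        )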
According to the training sample generation method described above, a plurality of first document files can be generated from one editable second document file, so that the plurality of first document files can be labeled automatically to obtain a plurality of training samples. That is, training samples can be generated in batches, which enriches the training samples and improves the training effect of the table text extraction model. Moreover, randomly modifying the document file can improve the generalization capability of the table text extraction model.
In order to clearly explain how step S202 in the above embodiment of the present application randomly modifies the text content in the second document file to obtain at least one third document file, the present application further provides a training sample generation method.
Fig. 3 is a flowchart of another method for generating training samples according to an embodiment of the present application.
As shown in fig. 3, step S202 may include the following steps, on the basis of the embodiment shown in fig. 2:
step S301, at least one group of first update information is acquired; each group of first updating information comprises target position information and target content format, wherein the target position information is used for indicating cells to be updated in the table, and the target content format is the content format of the cells to be updated.
In the embodiment of the present application, each set of first update information may include target location information and a target content format, where the target location information is used to indicate a cell to be updated in a table of the first document file, and the target content format is a content format of the cell to be updated.
The target location information may be randomly selected, or may be manually specified by a related person, for example, for a table with a fixed typesetting format, an address, a phone, etc. are all located at a fixed location, where the related person may specify the target location information of the cell to be updated, or may randomly select the target location information of the cell to be updated.
The target content format is a content type corresponding to the text content in the cell to be updated, and the target content format may include, but is not limited to: text, number, value, date, etc.
It should be noted that each set of first update information is used for updating the second document file to obtain one third document file. The number of pieces of target position information in each set of first update information may be one or more, which is not limited by the present application; when there are multiple pieces of target position information (i.e., multiple cells to be updated), there may also be multiple target content formats.
Step S302, generating target text content matched with the corresponding target content format according to the corresponding target content format aiming at any first updating information.
In the embodiment of the application, for any group of first update information, target text content matched with the target content format can be generated according to the target content format in the first update information. For example, when the target content format is a date, the generated target text content may be of the form "XXXX year YY month ZZ day" (or XXXX-YY-ZZ, XXXX/YY/ZZ); when the target content format is a number, the generated target text content is a combination of one or more digits.
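A minimal sketch of generating target text content that matches a target content format; the format names and generation rules below are illustrative assumptions:

import random
import string

def generate_content(target_format: str) -> str:
    """Generate target text content matching the given target content format."""
    if target_format == "date":
        return f"{random.randint(1990, 2023)}-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}"
    if target_format == "number":
        return "".join(random.choices(string.digits, k=random.randint(1, 8)))
    if target_format == "value":
        return f"{random.uniform(0, 10000):.2f}"
    # Default: plain text made of random letters.
    return "".join(random.choices(string.ascii_letters, k=random.randint(4, 12)))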
Step S303, determining a cell to be updated matched with the target position information in any first updating information from the cells in the table in the second document file.
In the embodiment of the application, according to the target position information in the first updating information, the cell to be updated, of which the position information is matched with the target position information, can be determined from the cells of the table in the second document file.
And step S304, according to the target text content, updating the text content in the to-be-updated cell in the second document file to obtain a third document file.
In the embodiment of the application, the text content in the cell to be updated in the second document file can be updated according to the target text content to obtain a third document file.
In any embodiment of the application, not only the text content in the cell to be updated in the second document file but also the font style, the font size, and the like in the cell to be updated may be modified. For example, when the font in the cell to be updated is a regular script, it may be changed from the regular script to a Song typeface, a handwriting style, and so on, thereby enhancing the diversity of the training samples.
According to the training sample generation method, the text content in the cells to be updated in the document file can be effectively updated based on the position information and the content format of the cells to be updated, so that the text content of each cell in the updated document file is matched with the corresponding content format, and the updating effect or the modifying effect of the document file is improved.
In order to clearly explain how step S202 in the above embodiment of the present application randomly modifies the text content in the second document file to obtain at least one third document file in another manner, the present application further provides a training sample generation method.
Fig. 4 is a flowchart of another method for generating training samples according to an embodiment of the present application.
As shown in fig. 4, on the basis of the embodiment shown in fig. 2, step S202 may include the following steps:
step S401, at least one group of second update information is acquired; wherein each set of second update information comprises at least one target attribute field to be updated.
The target attribute field (key) may include, but is not limited to: address (or contact address, home address, work address, etc.), phone (or contact number), work unit (or employer), gender, age, educational background, and so on.
In this embodiment of the present application, each set of second update information may include at least one target attribute field to be updated, where the target attribute field may be selected randomly, or may be manually specified by a relevant person, for example, for a table whose typesetting format is not fixed, an address, a phone, etc. are all located in a non-fixed location, so that the relevant person may specify the target attribute field to be updated, or may also randomly select the target attribute field to be updated.
Step S402, generating target attribute values corresponding to each target attribute field in any second updating information according to any second updating information; wherein the content format of the target attribute value matches the target attribute field.
In the embodiment of the application, for any group of second update information, a target attribute value corresponding to each target attribute field in the second update information can be generated, wherein the content format of the target attribute value is matched with the target attribute field.
For example, when the target attribute field is a phone, the target attribute value may be 183XXXX1234 (or 12345678); when the target attribute field is a date, the target attribute value may be of the form "XXXX year YY month ZZ day" (or XXXX-YY-ZZ, XXXX/YY/ZZ); and when the target attribute field is an age, the target attribute value may be 29, and so on, which are not listed here one by one.
Step S403, determining target cells from the tables in the second document file for any target attribute field in any second update information; wherein the content format of the target cell matches any target attribute field.
In the embodiment of the present application, for any one of the target attribute fields in the second update information, the target cell may be determined from each cell of the table in the second document file, where the content format of the target cell matches the target attribute field.
For example, when the target attribute field is a date and the text content in a certain cell is xx year YY month ZZ day, it can be determined that the content format of the cell matches the target attribute field, and thus the cell can be regarded as the target cell.
For another example, when the target attribute field is phone call and the text content in a cell is 183XXXX1234, it may be determined that the content format of the cell matches the target attribute field, and thus the cell may be regarded as the target cell.
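A minimal sketch of deciding whether a cell's existing text matches the content format implied by a target attribute field, using simple regular expressions; the patterns are illustrative assumptions rather than rules given in the patent:

import re

# Illustrative content-format patterns for a few target attribute fields.
FIELD_PATTERNS = {
    "phone": re.compile(r"^(1\d{10}|\d{7,8})$"),
    "date": re.compile(r"^\d{4}[-/]\d{1,2}[-/]\d{1,2}$"),
    "age": re.compile(r"^\d{1,3}$"),
}

def cell_matches_field(cell_text: str, field: str) -> bool:
    """Return True if the cell text matches the field's content format."""
    pattern = FIELD_PATTERNS.get(field)
    return bool(pattern and pattern.match(cell_text.strip()))

A cell whose text matches the pattern of the target attribute field is taken as a target cell, and its text is then replaced by a newly generated target attribute value.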
Step S404, according to the target attribute value corresponding to any target attribute field, updating the text content in the target cell to obtain a third document file.
In the embodiment of the application, the text content in the target cell in the second document file can be updated according to the target attribute value corresponding to the target attribute field so as to obtain a third document file.
In any one embodiment of the application, not only the text content in the target cell in the second document file can be modified, but also the font style, the font size and the like in the target cell can be modified, for example, when the font style in the target cell is regular script, the font style in the target cell can be modified from the regular script to Song style, handwriting and the like, so that the diversity of training samples is enhanced.
According to the training sample generation method, the text content in the content format and the target attribute field matched target cells in the document file can be effectively updated based on the target attribute field to be updated, so that the text content of each cell in the updated document file is matched with the corresponding content format, and the updating effect or the modifying effect of the document file is improved.
In order to clearly explain how the step S103 in the above embodiment of the present application marks the target text block according to the target attribute information to obtain the training sample, the present application also provides a method for generating the training sample.
Fig. 5 is a flowchart of another method for generating training samples according to an embodiment of the present application.
As shown in fig. 5, on the basis of any of the above embodiments, step S103 may include the steps of:
step S501, an abnormal text block is identified from each target text block.
In the embodiment of the application, the content information (or called text information, character information and text content) of each text block in the first document file can be extracted, and the abnormal text block with abnormality can be identified from each target text block based on the target attribute information and/or the content information of each target text block.
In a possible implementation manner of the embodiment of the present application, the identifying manner of the abnormal text block may include at least one of the following:
the first item, the form may include at least one line, and correspondingly, the first document file may further include position information of at least one line in the form.
The number of the target lines may be one or may be multiple, which is not limited in this embodiment of the present application.
That is, the text blocks are divided according to the cells in the table, each text block contains only the text content in one cell, and in the case that the text block contains the line, it is indicated that the text block may contain the text content in at least two cells, and therefore, the text block may be regarded as an abnormal text block having an abnormality.
Second, a candidate text block containing a plurality of text segments (also called text pieces or natural paragraphs) may be determined from the target text blocks, the semantics of the text segments in the candidate text block may be determined based on natural language processing, and whether a semantic association relationship exists between the text segments may then be judged according to their semantics. When it is determined that no semantic association relationship exists between any two adjacent text segments, the candidate text block may be taken as an abnormal text block.
That is, the semantic relevance between text pieces in the same cell or the same text block is strong, and in the case where it is determined that the semantic relevance between text pieces in a certain text block is weak, the text block may be regarded as an abnormal text block in which an abnormality exists.
Third, the semantics of each target text block may be determined according to its content information, and whether at least two target text blocks with a semantic association relationship (also called a semantic context association) exist among the target text blocks may be judged according to their semantics. When at least two target text blocks with a semantic association relationship exist, i.e., when at least two target text blocks are contextually associated in semantics, the at least two target text blocks may be regarded as abnormal text blocks.
That is, in general, the semantic relevance between text contents in different cells or different text blocks is weak, for example, the semantic relevance between a certificate number and a contact address is weak, and in the case that the semantic relevance between a plurality of text blocks is determined to be strong, it is indicated that the text contents in the same cell may be divided into a plurality of text blocks, and at this time, the plurality of text blocks may be regarded as abnormal text blocks having an abnormality.
Therefore, the abnormal text blocks with the abnormality can be identified from the target text blocks based on various modes, and the flexibility and applicability of the method are improved.
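A minimal sketch of the first identification manner above, assuming each text block and each table line is described by an axis-aligned bounding box (x0, y0, x1, y1); the data structures are illustrative assumptions:

def boxes_overlap(block_bbox, line_bbox) -> bool:
    """Two axis-aligned boxes overlap iff they overlap on both axes."""
    bx0, by0, bx1, by1 = block_bbox
    lx0, ly0, lx1, ly1 = line_bbox
    return not (lx1 < bx0 or lx0 > bx1 or ly1 < by0 or ly0 > by1)

def find_abnormal_blocks(text_blocks, table_lines):
    """Flag target text blocks that contain or intersect a table line."""
    return [
        block for block in text_blocks
        if any(boxes_overlap(block["bbox"], line) for line in table_lines)
    ]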
Step S502, cleaning and/or correcting the abnormal text block.
In the embodiment of the application, the abnormal text blocks can be automatically cleaned and/or corrected.
In one possible implementation manner of the embodiment of the present application, the automatic cleaning and/or correction manner of the abnormal text block may include at least one of the following:
first, the abnormal text block is cleaned or deleted.
Second, when the abnormal text block contains the target line, or the abnormal text block intersects the target line, the abnormal text block may be divided according to the position information of the target line, so that none of the resulting text blocks contains the target line and/or none of them intersects the target line.
Third, when two (or more) adjacent text segments in the abnormal text block have no semantic association relationship, the abnormal text block may be divided between those adjacent text segments, so that each resulting text block contains only one of the adjacent text segments.
Fourth item: in the case where there is a semantic association between at least two abnormal text blocks, the at least two abnormal text blocks having the semantic association may be combined.
Therefore, the method can realize automatic cleaning and/or correction of the abnormal text blocks in each target text block based on various modes, improves the flexibility and applicability of the method, and improves the updating effect of the text blocks, thereby improving the labeling effect of the training samples.
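A minimal sketch of the second correction manner above: splitting an abnormal text block at a vertical target line so that neither resulting block intersects the line. The coordinates, the gap value, and the dictionary layout are illustrative assumptions; the text content of each half would then be re-read from the first document file within the new bounding box:

def split_block_at_vertical_line(block: dict, line_x: float, gap: float = 1.0):
    """Split a block dict with key 'bbox' = (x0, y0, x1, y1) at x = line_x."""
    x0, y0, x1, y1 = block["bbox"]
    left = dict(block, bbox=(x0, y0, line_x - gap, y1))
    right = dict(block, bbox=(line_x + gap, y0, x1, y1))
    return left, right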
And step S503, marking according to the cleaned and/or corrected abnormal text blocks to obtain training samples.
In the embodiment of the application, the first document file can be marked according to the cleaned and/or corrected abnormal text blocks so as to obtain a training sample.
As an example, in the case of cleaning (deleting) abnormal text blocks, for the target text blocks other than the abnormal text blocks, labeling frames may be added to the first document file according to the target attribute information of those other text blocks, wherein each labeling frame contains only one text block.
As another example, in the case of correcting the abnormal text block, a labeling frame may be added to the first document file according to the attribute information of the corrected abnormal text block, and for other text blocks than the abnormal text block in each target text block, a labeling frame may be added to the first document file according to the target attribute information of the other text blocks, so as to obtain a training sample, wherein each labeling frame contains only one corrected abnormal text block or other text block.
As yet another example, in the case of cleaning a part of the abnormal text blocks and correcting another part of the abnormal text blocks, a labeling frame may be added to the first document file according to the attribute information of the other part of the corrected abnormal text blocks, and for other text blocks than the abnormal text blocks in each target text block, a labeling frame may be added to the first document file according to the target attribute information of the other text blocks, so as to obtain training samples, wherein each labeling frame contains only one corrected abnormal text block or other text blocks.
In a possible implementation manner of the embodiment of the present application, in order to improve accuracy and reliability of the labeling result of the training sample, the cleaned and/or corrected abnormal text block may be updated manually.
As an example, the cleaned and/or corrected abnormal text block may be output, and a correction request manually triggered by the relevant person may be obtained, and the cleaned and/or corrected abnormal text block may be updated in response to the correction request. For example, the position information of the cleaned and/or corrected abnormal text block may be adjusted, or the cleaned and/or corrected abnormal text block may be deleted.
Therefore, the method can update the cleaned and/or corrected abnormal text blocks to improve the accuracy of the training sample labeling, and further improve the training effect of the table text extraction model.
According to the training sample generation method, the abnormal text blocks with the abnormality in the first document file can be identified, and the abnormal text blocks are cleaned and/or corrected, so that the first document file is marked according to the cleaned and/or corrected abnormal text blocks, and the accuracy and reliability of marking results can be improved.
In order to clearly explain how the target text block is marked according to the target attribute information in step S103 in the above embodiment of the present application, the present application further provides a training sample generating method.
Fig. 6 is a flowchart of another method for generating training samples according to an embodiment of the present application.
As shown in fig. 6, on the basis of any of the above embodiments, step S103 may include the steps of:
step S601, adding a labeling frame into the first document file according to the target attribute information, wherein the labeling frame contains a target text block.
In the embodiment of the application, labeling frames can be added to the first document file according to the target attribute information of each target text block, wherein each labeling frame contains only one target text block.
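A minimal sketch of adding labeling frames according to the target attribute information, one frame per target text block; the JSON layout is an illustrative assumption, not a format prescribed by the patent:

import json

def build_labeling_frames(target_blocks, page_path: str) -> str:
    """Serialize one labeling frame per target text block."""
    frames = [
        {
            "id": i,
            "bbox": block["bbox"],   # position information of the text block
            "text": block["text"],   # content information of the text block
        }
        for i, block in enumerate(target_blocks)
    ]
    return json.dumps({"image": page_path, "frames": frames}, ensure_ascii=False, indent=2)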
Step S602, determining at least two labeling frames from the labeling frames according to the position information of the labeling frames.
In the embodiment of the application, at least two labeling frames can be determined from the labeling frames according to the position information of the labeling frames. For example, the at least two labeling frames may be adjacent or contiguous labeling frames, or may be labeling frames whose distance is less than a threshold, and so on.
Step S603, obtaining an association relationship between at least two labeling frames, where the association relationship is determined according to content information of the target text block in the at least two labeling frames.
In the embodiment of the application, the association relationship includes, but is not limited to: key value relationships, combination relationships, and the like.
For example, if the content information of the target text block in labeling frame 1 is a field name (such as a phone field) and the content information of the target text block in labeling frame 2 is 183XXXX1234, the association relationship between labeling frame 1 and labeling frame 2 can be determined to be a key-value relationship.
For another example, if the content information of the target text block in labeling frame 3 and the content information of the target text block in labeling frame 4 together form one complete piece of content (for example, the two parts of a birthday), the association relationship between labeling frame 3 and labeling frame 4 can be determined to be a combination relationship.
In the embodiment of the present application, the association relationship between the at least two labeling frames may be determined, based on natural language processing, according to the content information of the target text blocks in the at least two labeling frames, or the association relationship between the at least two labeling frames may be manually specified by a relevant person, which is not limited in the embodiment of the present application.
And step S604, carrying out relationship labeling on at least two labeling frames according to the association relationship so as to obtain a training sample.
In the embodiment of the application, the relationship marking can be carried out on at least two marking frames in the first document file according to the association relationship between at least two marking frames so as to obtain a training sample.
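A minimal sketch of recording the association relationships between labeling frames in the annotation; the relation names ("key_value", "combination") and the record layout are illustrative assumptions:

def add_relation(annotation: dict, src_id: int, dst_id: int, relation: str) -> None:
    """Append one relation record between two labeling frames."""
    annotation.setdefault("relations", []).append(
        {"from": src_id, "to": dst_id, "type": relation}
    )

# Hypothetical usage, matching the examples above:
# add_relation(annotation, 1, 2, "key_value")    # field name -> field value
# add_relation(annotation, 3, 4, "combination")  # two parts of one content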
According to the training sample generation method described above, not only the position information of the labeling frames (boxes) but also the association relationships between the labeling frames can be labeled on the training sample, so that the table text extraction model can distinguish whether the text contents of the cells in a document file form a key-value relationship, a combination relationship, and so on, and a relational data table can be generated based on the association relationships to meet the requirements of actual service scenarios.
The above embodiments relate to a generation scenario of a training sample, and the present application further provides an application scenario of the training sample, that is, the training sample may be used to train the table text extraction model.
Fig. 7 is a flowchart of a method for training a table text extraction model according to an embodiment of the present application.
As shown in fig. 7, the method for training the table text extraction model according to any of the above embodiments may include the following steps:
in step S701, the training sample is input into the table text extraction model to perform text extraction, so as to obtain the position information of at least one prediction frame and the content information in the prediction frame.
In the embodiment of the application, the training sample can be input into a table text extraction model for text extraction, so as to obtain the position information of at least one prediction frame and the content information (or called text information, character information and text content) in each prediction frame.
As an example, the first prediction branch in the tabular text extraction model may be used to predict the location information of each annotation frame in the training sample to obtain the location information of at least one prediction frame. And, the second prediction branch in the table text extraction model can be adopted to predict or character-identify the text information in each prediction frame in the training sample according to the position information of each prediction frame, so as to obtain the content information in each prediction frame.
Step S702, a first loss value is generated according to the difference between the position information of the labeling frame and the position information of the prediction frame on the training sample.
In the embodiment of the application, the first loss value can be generated according to the difference between the position information of the labeling frame and the position information of the prediction frame on the training sample. The first loss value and the difference are in positive correlation, namely the smaller the difference is, the smaller the first loss value is, and conversely, the larger the difference is, the larger the first loss value is.
Step S703, generating a second loss value according to the difference between the content information in the labeling frame and the content information in the prediction frame.
In the embodiment of the application, the second loss value can be generated according to the difference between the content information of the target text block in the labeling frame and the content information in the prediction frame in the training sample. The second loss value and the difference are in positive correlation, namely the smaller the difference is, the smaller the second loss value is, and conversely, the larger the difference is, the larger the second loss value is.
Step S704, training the table text extraction model according to the first loss value and the second loss value.
In the embodiment of the application, the table text extraction model can be trained according to the first loss value and the second loss value.
As one example, a target loss value may be generated from the first loss value and the second loss value, and a table text extraction model may be trained based on the target loss value to minimize the target loss value.
The target loss value may be a mean value, a sum value, a weighted sum value, or the like of the first loss value and the second loss value, which is not limited by the present application.
It should be noted that the foregoing example takes minimization of the target loss value as the termination condition of model training; other termination conditions may also be set in practical applications, for example, the training duration reaching a set duration or the number of training iterations reaching a set number, which is not limited in the embodiment of the application.
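A minimal sketch of one training step under the above scheme, assuming a PyTorch-style table text extraction model with a detection branch (frame positions) and a recognition branch (frame contents); the loss functions and weights are illustrative assumptions:

def train_step(model, optimizer, sample, box_loss_fn, text_loss_fn, w1=1.0, w2=1.0):
    """One optimization step combining the two loss values into a target loss."""
    pred_boxes, pred_texts = model(sample["image"])
    loss_box = box_loss_fn(pred_boxes, sample["gt_boxes"])    # first loss value
    loss_text = text_loss_fn(pred_texts, sample["gt_texts"])  # second loss value
    loss = w1 * loss_box + w2 * loss_text                     # target loss value (weighted sum)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()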
The method for training the table text extraction model can train the table text extraction model based on training samples so as to improve accuracy and reliability of model prediction results.
The embodiment is an embodiment corresponding to the training method of the table text extraction model, and the application further provides an application method of the table text extraction model.
Fig. 8 is a flowchart illustrating an application method of a table text extraction model according to an embodiment of the present application.
As shown in fig. 8, on the basis of any of the above embodiments, the method for generating a training sample may further include the following steps (or, the method for applying the table text extraction model may include the following steps):
step S801 acquires a non-editable fourth document file containing a table.
In the embodiment of the present application, the file format of the fourth document file is not limited; for example, the file format of the fourth document file may be PDF format, a picture format, or the like.
In the embodiment of the present application, the obtaining manner of the fourth document file is not limited, for example, the fourth document file may be a document file manually provided by a related person, or the fourth document file may be a document file collected online, or the fourth document file may be a document file sent by another person, or the fourth document file may be a document file obtained or collected in an actual service scene, or the like, which is not limited in the present application.
Step S802, inputting the fourth document file into the trained table text extraction model for text extraction, so as to obtain the position information of at least one output frame and the content information in the output frame output by the trained table text extraction model.
In the embodiment of the application, the fourth document file can be input into the trained table text extraction model for text extraction, so as to obtain the position information of at least one output frame and the content information (also called text information, character information, or text content) in each output frame output by the trained table text extraction model.
In any embodiment of the present application, in the case that the training samples are labeled with the association relationships between at least two labeling frames, the trained table text extraction model may also output the association relationships (key-value relationship, combination relationship, etc.) between the output frames.
Therefore, the output of the extraction model according to the form text can be realized, and the relational data table is generated so as to meet the use requirement of the actual service scene.
According to the application method of the table text extraction model, the trained table text extraction model is adopted to extract text content from the non-editable fourth document file, so that the accuracy of an extraction result can be improved.
In any embodiment of the application, the training sample generation method is mainly used to solve the problem that acquiring training samples for the table text extraction model is costly: a large number of basic samples are formed quickly in an automated manner, and a high-quality structured table text data set is built with only simple manual checking, which provides a good foundation for training the table text extraction model.
Taking the first document file as a PDF file and the second document file as a Word file for exemplary illustration, the generation flow of training samples may be as shown in FIG. 9 and mainly includes the following steps:
1. Determine a Word file of the table to be extracted. The PDF file containing the table to be identified can be analyzed to obtain the approximate format of the table to be structured, and a corresponding template file is constructed by reproducing the table format or table style in a Word file.
As an example, the table contents in the Word file may be as shown in fig. 10.
2. The table contents in the Word file are randomly modified using a program and saved as a new Word file.
2.1. According to the content format requirements of different cells in the table, the text content in the table is modified in a targeted manner to implement a random modification scheme, such as generating a piece of numeric text, address text, a telephone number, a certificate number, name text, company text, time or time-period text, regular paragraph text, and the like.
2.2, by randomly generating and modifying text contents in the table, the generated training sample has very high randomness, and the generalization capability of a final table text extraction model can be effectively improved.
As an example, the modified Word file may be as shown in fig. 11.
3. The modified Word file is converted into a PDF file using a program.
4. The position information and content information of each text block, and the position information of each line, are stored in the PDF file. The text block information and line information in the PDF file can be read through a PDF access interface using a programming language such as Python, so as to obtain the position information and content information of each text block and the position information of each line.
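A minimal sketch of step 4, assuming the PyMuPDF library (imported as fitz) is used as the PDF access interface; the exact fields kept are illustrative:

import fitz  # PyMuPDF, assumed available

def read_blocks_and_lines(pdf_path: str):
    """Read text block positions/contents and drawing (line) positions from page 1."""
    doc = fitz.open(pdf_path)
    page = doc[0]
    # Each text block is reported as (x0, y0, x1, y1, text, block_no, block_type).
    blocks = [
        {"bbox": b[:4], "text": b[4]}
        for b in page.get_text("blocks")
        if b[6] == 0  # keep text blocks, skip image blocks
    ]
    lines = []
    for drawing in page.get_drawings():  # vector graphics, including table lines
        for item in drawing["items"]:
            if item[0] in ("l", "re"):   # line segments and rectangles
                lines.append(item[1:])
    return blocks, lines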
5. After the position information and content information of the text blocks and the line position information are read, text blocks that may be abnormal can be identified from the text blocks according to the length and height of each text block, the content information of the text blocks, and the line position information, and the abnormal text blocks are cleaned and/or corrected.
6. Import the corrected text block information into a data labeling platform for manual verification and correction. Although the previous steps perform some cleaning and correction, they may not ensure that the text blocks and their bounding frames are completely accurate; therefore, correction is combined with manual verification.
As an example, the text blocks in fig. 11 are labeled, and a labeled text block sample may be as shown in fig. 12.
7. Supplement and complete the structured labeling information according to actual application requirements. That is, on the basis of labeling the labeling frames (or text boxes), the association relationships between the labeling frames (such as key-value relationships and combination relationships) are further labeled, so that the data can be stored correspondingly in a relational data table.
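A minimal sketch of storing the extracted key-value pairs into a relational data table, as mentioned in step 7; the SQLite schema is an illustrative assumption:

import sqlite3

def store_key_values(db_path: str, doc_id: str, pairs: dict) -> None:
    """Insert each extracted (field, value) pair as one row of a relational table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS table_text (doc_id TEXT, field TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO table_text (doc_id, field, value) VALUES (?, ?, ?)",
        [(doc_id, field, value) for field, value in pairs.items()],
    )
    conn.commit()
    conn.close()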
8. A table-structured training sample is obtained. The flow can be run in batches or iterated in a loop, so that a large number of high-quality training samples can be obtained rapidly.
In conclusion, automatic generation of structured table text training samples can be realized, the acquisition cost of training samples can be effectively reduced, and the development efficiency of table text extraction models for specific scenarios is greatly improved. The generation time of 1,000 training samples can be reduced from 10 person-days to less than 1 person-day.
Corresponding to the training sample generation methods provided in the above embodiments, an embodiment of the present application further provides a training sample generation device. Since the training sample generation device provided in this embodiment corresponds to the training sample generation method provided in the above embodiments, the implementation of the method is also applicable to the device and will not be described in detail here.
Fig. 13 is a schematic structural view of a training sample generating apparatus according to an embodiment of the present application.
As shown in fig. 13, the training sample generating apparatus 1300 may include: an acquisition module 1301, an extraction module 1302, and an annotation module 1303.
Wherein, the obtaining module 1301 is configured to obtain a first document file including a table; wherein the first document file comprises at least one text block and attribute information of the text block, and the text block corresponds to a cell of the table; the attribute information includes location information.
The extracting module 1302 is configured to extract at least one target text block in the first document file and target attribute information corresponding to the target text block.
The labeling module 1303 is configured to label the target text block according to the target attribute information to obtain a training sample; the training samples are used for training the table text extraction model.
As a possible implementation manner of the embodiment of the present application, the obtaining module 1301 is specifically configured to: acquiring an editable second document file containing a table; randomly modifying text content in the second document file to obtain at least one third document file; at least one third document file is converted into at least one first document file.
As a possible implementation manner of the embodiment of the present application, the obtaining module 1301 is specifically configured to: acquiring at least one set of first update information; each group of first updating information comprises target position information and target content formats, wherein the target position information is used for indicating cells to be updated in a table, and the target content formats are the content formats of the cells to be updated; generating target text content matched with the corresponding target content format according to the corresponding target content format aiming at any first updating information; determining a cell to be updated matched with the target position information in any first updating information from all cells in a table in the second document file; and updating the text content in the cell to be updated in the second document file according to the target text content to obtain a third document file.
As a possible implementation manner of the embodiment of the present application, the obtaining module 1301 is specifically configured to: acquiring at least one set of second update information; wherein each set of second updating information comprises at least one target attribute field to be updated; generating target attribute values corresponding to each target attribute field in any second updating information aiming at any second updating information; wherein the content format of the target attribute value is matched with the target attribute field; determining target cells from the tables in the second document file for any target attribute field in any second update information; wherein the content format of the target cell is matched with any target attribute field; and updating the text content in the target cell according to the target attribute value corresponding to any target attribute field to obtain a third document file.
As a possible implementation manner of the embodiment of the present application, the labeling module 1303 is specifically configured to: identifying abnormal text blocks from the target text blocks; cleaning and/or correcting the abnormal text block; labeling according to the cleaned and/or corrected abnormal text blocks to obtain training samples.
As a possible implementation manner of the embodiment of the present application, the labeling module 1303 is specifically configured to perform at least one of the following:
extracting the position information of at least one line in the first document file, and taking any target text block as an abnormal text block under the condition that the target line exists in at least one line according to the position information of any target text block and the position information of at least one line; the target line is positioned in any target text block or intersects with any target text block;
determining candidate text blocks comprising a plurality of text fragments from each target text block, and taking the candidate text blocks as abnormal text blocks under the condition that any two adjacent text fragments in the plurality of text fragments do not have semantic association relations according to the semantics of the plurality of text fragments;
and determining at least two target text blocks with semantic association relations from the target text blocks according to the content information of the target text blocks, and taking the at least two target text blocks as abnormal text blocks.
As a possible implementation manner of the embodiment of the present application, the labeling module 1303 is specifically configured to perform at least one of the following:
cleaning the abnormal text block;
dividing the abnormal text blocks according to the position information of the target lines; wherein, the target line does not exist in at least two divided text blocks, and/or the at least two divided text blocks are not intersected with the target line;
dividing any two adjacent text fragments without semantic association in the abnormal text block; the text blocks obtained through division comprise one text segment of any two adjacent text segments;
and merging at least two abnormal text blocks with semantic association relations.
As a possible implementation manner of the embodiment of the present application, the labeling module 1303 is further configured to: outputting the cleaned and/or corrected abnormal text blocks; acquiring a correction request; and updating the cleaned and/or corrected abnormal text blocks in response to the correction request.
As a possible implementation manner of the embodiment of the present application, the labeling module 1303 is specifically configured to: adding a labeling frame into the first document file according to the target attribute information, wherein the labeling frame comprises a target text block; determining at least two marking frames from the marking frames according to the position information of the marking frames; acquiring an association relationship between at least two annotation frames, wherein the association relationship is determined according to content information of target text blocks in the at least two annotation frames; and carrying out relationship labeling on at least two labeling frames according to the association relationship to obtain a training sample.
As a possible implementation manner of the embodiment of the present application, the generating device 1300 of the training sample may further include:
the training module is used for training the table text extraction model by adopting a training sample, wherein the training mode of the table text extraction model comprises the following steps: inputting the training sample into a table text extraction model for text extraction to obtain the position information of at least one prediction frame and the content information in the prediction frame; generating a first loss value according to the difference between the position information of the marking frame and the position information of the predicting frame on the training sample; generating a second loss value according to the difference between the content information in the labeling frame and the content information in the prediction frame; and training the table text extraction model according to the first loss value and the second loss value.
As a possible implementation manner of the embodiment of the present application, the obtaining module 1301 is further configured to: acquire a non-editable fourth document file containing a table.
The training sample generating apparatus 1300 may further include:
and the extraction module is used for inputting the fourth document file into the trained form text extraction model to carry out text extraction so as to obtain the position information of at least one output frame and the content information in the output frame which are output by the trained form text extraction model.
As a possible implementation manner of the embodiment of the present application, in the case that the training sample is marked with the association relationship between at least two marking frames, the trained form text extraction model also outputs the association relationship between at least one output frame; the training sample generating apparatus 1300 may further include:
and the integration module is used for integrating the content information in the at least one output frame according to the association relation between the at least one output frame and the position information of the at least one output frame so as to obtain a relational data table.
The training sample generation device in the embodiment of the application obtains a first document file containing a table, wherein the first document file comprises at least one text block and attribute information of the text block, the text block corresponds to a cell of the table, and the attribute information comprises position information; extracts at least one target text block in the first document file and target attribute information corresponding to the target text block; and labels the target text block according to the target attribute information to obtain a training sample, the training sample being used for training the table text extraction model. In this way, the ability of the first document file to divide the cells of the document table and to output the text blocks corresponding to the divided cells can be used in place of manual identification of the cells, so that the training sample is obtained; that is, the text blocks in the first document file can be automatically labeled according to the attribute information of each text block output from the first document file. This solves the problems of low efficiency and high cost of manually labeling training samples, improves the generation efficiency of training samples, and reduces labor cost.
In order to implement the above embodiment, the present application further provides an electronic device, and fig. 14 is a schematic structural diagram of an electronic device provided in the embodiment of the present application. The electronic device includes:
memory 1401, processor 1402, and a computer program stored on memory 1401 and executable on processor 1402.
The processor 1402, when executing the program, implements the training sample generation method provided in any of the embodiments described above.
Further, the electronic device includes:
a communication interface 1403 for communication between the memory 1401 and the processor 1402.
A memory 1401 for storing a computer program executable on a processor 1402.
The memory 1401 may include high-speed RAM memory and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
A processor 1402, configured to implement the training sample generating method according to any of the foregoing embodiments when executing the program.
If the memory 1401, the processor 1402, and the communication interface 1403 are implemented independently, the communication interface 1403, the memory 1401, and the processor 1402 may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses may be classified as address buses, data buses, control buses, and so on. For ease of illustration, only one thick line is shown in fig. 14, but this does not mean that there is only one bus or only one type of bus.
Alternatively, in a specific implementation, if the memory 1401, the processor 1402, and the communication interface 1403 are integrated on a chip, the memory 1401, the processor 1402, and the communication interface 1403 may perform communication with each other through internal interfaces.
The processor 1402 may be a central processing unit (Central Processing Unit, abbreviated as CPU) or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC) or one or more integrated circuits configured to implement embodiments of the present application.
In order to implement the above embodiments, the embodiments of the present application also provide a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of generating training samples as provided in any of the embodiments above.
In order to implement the above embodiments, the embodiments of the present application further provide a computer program product; when the instructions in the computer program product are executed by a processor, the method for generating training samples provided in any of the above embodiments is implemented.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method description in the flowcharts, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes additional implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functionality involved, as would be understood by those skilled in the art of the embodiments of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example, an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be electronically captured, for example by optical scanning of the paper or other medium, and then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gate circuits for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gate circuits, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. Although embodiments of the present application have been shown and described above, it should be understood that the above embodiments are illustrative and are not to be construed as limiting the application; changes, modifications, substitutions, and variations may be made to the above embodiments by those of ordinary skill in the art within the scope of the application.

Claims (15)

1. A method for generating training samples, the method comprising:
acquiring a first document file containing a table; wherein the first document file includes at least one text block, and attribute information of the text block, the text block corresponding to a cell of the table; the attribute information includes location information;
extracting at least one target text block in the first document file and target attribute information corresponding to the target text block;
labeling the target text block according to the target attribute information to obtain a training sample; the training samples are used for training the table text extraction model.
2. The method of claim 1, wherein the obtaining a first document file containing a table comprises:
acquiring an editable second document file containing a table;
randomly modifying text content in the second document file to obtain at least one third document file;
converting said at least one third document file into at least one said first document file.
3. The method of claim 2, wherein randomly modifying text content in the second document file to obtain at least one third document file comprises:
acquiring at least one set of first update information; each set of first updating information comprises target position information and target content formats, wherein the target position information is used for indicating cells to be updated in the table, and the target content formats are content formats of the cells to be updated;
generating target text content matched with a corresponding target content format according to the corresponding target content format aiming at any first updating information;
determining a cell to be updated matched with the target position information in the arbitrary first updating information from the cells in the table in the second document file;
and updating the text content in the cell to be updated in the second document file according to the target text content to obtain the third document file.
4. The method of claim 2, wherein randomly modifying text content in the second document file to obtain at least one third document file comprises:
acquiring at least one set of second update information; wherein, each group of the second updating information comprises at least one target attribute field to be updated;
generating a target attribute value corresponding to each target attribute field in any second updating information aiming at any second updating information; wherein the content format of the target attribute value is matched with the target attribute field;
determining target cells from the table in the second document file for any target attribute field in the any second update information; wherein the content format of the target cell is matched with the arbitrary target attribute field;
and updating the text content in the target cell according to the target attribute value corresponding to the arbitrary target attribute field to obtain the third document file.
5. The method according to claim 1, wherein labeling the target text block according to the target attribute information to obtain a training sample comprises:
identifying abnormal text blocks from the target text blocks;
cleaning and/or correcting the abnormal text block;
labeling according to the cleaned and/or corrected abnormal text blocks to obtain the training sample.
6. The method of claim 5, wherein identifying abnormal text blocks from the target text blocks comprises at least one of:
extracting the position information of at least one line in the first document file, and taking any target text block as the abnormal text block under the condition that the target line exists in the at least one line according to the position information of the any target text block and the position information of the at least one line; the target line is positioned in the arbitrary target text block or intersects with the arbitrary target text block;
determining candidate text blocks comprising a plurality of text fragments from the target text blocks, and taking the candidate text blocks as the abnormal text blocks under the condition that any two adjacent text fragments in the plurality of text fragments do not have semantic association relations according to the semantics of the plurality of text fragments;
and determining at least two target text blocks with semantic association relations from the target text blocks according to the content information of the target text blocks, and taking the at least two target text blocks as the abnormal text blocks.
7. The method of claim 6, wherein the cleaning and/or correcting the abnormal text block comprises at least one of:
cleaning the abnormal text block;
dividing the abnormal text blocks according to the position information of the target lines; the target line does not exist in at least two divided text blocks, and/or the at least two divided text blocks are not intersected with the target line;
dividing any two adjacent text fragments which do not have semantic association relations in the abnormal text block; wherein the text blocks obtained by dividing comprise one text segment in any two adjacent text segments;
and merging at least two abnormal text blocks with semantic association relations.
8. The method according to any one of claims 5-7, wherein before said labeling according to the cleaned and/or corrected abnormal text blocks, the method further comprises:
outputting the cleaned and/or corrected abnormal text blocks;
acquiring a correction request;
and responding to the correction request, and updating the cleaned and/or corrected abnormal text blocks.
9. The method according to any one of claims 1-7, wherein labeling the target text block according to the target attribute information to obtain a training sample includes:
adding a labeling frame into the first document file according to the target attribute information, wherein the labeling frame comprises a target text block;
determining at least two labeling frames from the labeling frames according to the position information of the labeling frames;
acquiring an association relationship between the at least two annotation frames, wherein the association relationship is determined according to content information of target text blocks in the at least two annotation frames;
and according to the association relation, carrying out relation labeling on the at least two labeling frames to obtain the training sample.
10. The method according to any one of claims 1-7, further comprising:
training the table text extraction model by adopting the training sample, wherein the method for training the table text extraction model comprises the following steps:
inputting the training sample into the table text extraction model to perform text extraction so as to obtain the position information of at least one prediction frame and the content information in the prediction frame;
generating a first loss value according to the difference between the position information of the marking frame on the training sample and the position information of the prediction frame;
generating a second loss value according to the difference between the content information in the labeling frame and the content information in the prediction frame;
and training the table text extraction model according to the first loss value and the second loss value.
11. The method according to any one of claims 1-7, further comprising:
acquiring a non-editable fourth document file containing a table;
and inputting the fourth document file into a trained form text extraction model for text extraction to obtain the position information of at least one output frame and the content information in the output frame output by the trained form text extraction model.
12. The method of claim 11, wherein in the case where the training sample is labeled with an association between at least two labeling frames, the trained form text extraction model further outputs an association between the at least one output frame, the method further comprising:
and integrating the content information in the at least one output frame according to the association relation between the at least one output frame and the position information of the at least one output frame to obtain a relational data table.
13. A training sample generation apparatus, the apparatus comprising:
an acquisition module for acquiring a first document file containing a table; wherein the first document file includes at least one text block, and attribute information of the text block, the text block corresponding to a cell of the table; the attribute information includes location information;
the extraction module is used for extracting at least one target text block in the first document file and target attribute information corresponding to the target text block;
the labeling module is used for labeling the target text block according to the target attribute information so as to obtain a training sample; the training samples are used for training the table text extraction model.
14. An electronic device, comprising:
memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the training sample generation method according to any of claims 1-12 when the program is executed.
15. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a method of generating training samples according to any of claims 1-12.
CN202310738464.1A 2023-06-21 2023-06-21 Training sample generation method and device, electronic equipment and storage medium Pending CN116860747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310738464.1A CN116860747A (en) 2023-06-21 2023-06-21 Training sample generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310738464.1A CN116860747A (en) 2023-06-21 2023-06-21 Training sample generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116860747A true CN116860747A (en) 2023-10-10

Family

ID=88220738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310738464.1A Pending CN116860747A (en) 2023-06-21 2023-06-21 Training sample generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116860747A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173725A (en) * 2023-11-03 2023-12-05 之江实验室 Table information processing method, apparatus, computer device and storage medium
CN117173725B (en) * 2023-11-03 2024-04-09 之江实验室 Table information processing method, apparatus, computer device and storage medium

Similar Documents

Publication Publication Date Title
CN109344831B (en) Data table identification method and device and terminal equipment
US10915788B2 (en) Optical character recognition using end-to-end deep learning
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
US9286526B1 (en) Cohort-based learning from user edits
CN112949476B (en) Text relation detection method, device and storage medium based on graph convolution neural network
CN112417899A (en) Character translation method, device, computer equipment and storage medium
CN111144210A (en) Image structuring processing method and device, storage medium and electronic equipment
CN116860747A (en) Training sample generation method and device, electronic equipment and storage medium
CN112149680A (en) Wrong word detection and identification method and device, electronic equipment and storage medium
CN112464927B (en) Information extraction method, device and system
CN114022891A (en) Method, device and equipment for extracting key information of scanned text and storage medium
CN113673294A (en) Method and device for extracting key information of document, computer equipment and storage medium
CN112990290A (en) Sample data generation method, device, equipment and storage medium
CN117332766A (en) Flow chart generation method, device, computer equipment and storage medium
CN111930976A (en) Presentation generation method, device, equipment and storage medium
CN115130437B (en) Intelligent document filling method and device and storage medium
CN116225956A (en) Automated testing method, apparatus, computer device and storage medium
CN113486171B (en) Image processing method and device and electronic equipment
CN115904482A (en) Interface document generation method, device, equipment and storage medium
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN113936187A (en) Text image synthesis method and device, storage medium and electronic equipment
CN113128496B (en) Method, device and equipment for extracting structured data from image
CN117095422B (en) Document information analysis method, device, computer equipment and storage medium
CN110457659B (en) Clause document generation method and terminal equipment
CN114398492B (en) Knowledge graph construction method, terminal and medium in digital field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination