WO2022062798A1 - RPA and AI-based table information extraction method and apparatus, device and medium - Google Patents

Publication number
WO2022062798A1
Authority
WO
WIPO (PCT)
Prior art keywords
key
content
information extraction
value
information
Application number
PCT/CN2021/114068
Other languages
French (fr)
Chinese (zh)
Inventor
汪冠春
胡一川
褚瑞
李玮
张海雷
白龙飞
Original Assignee
北京来也网络科技有限公司
来也科技(北京)有限公司
Priority claimed from CN202011024745.3A (CN112149399B)
Application filed by 北京来也网络科技有限公司, 来也科技(北京)有限公司
Publication of WO2022062798A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • The present invention relates to the technical field of table processing, and in particular to a table information extraction method, apparatus, device and medium based on RPA and AI.
  • RPA: Robotic Process Automation
  • AI: Artificial Intelligence
  • RPA has unique advantages: low-code, non-intrusive.
  • Low-code means that RPA does not require a high level of IT expertise to operate, so business personnel who do not understand programming can also develop processes; non-intrusive means that RPA can simulate human operations without requiring software systems to open up interfaces.
  • traditional RPA has certain limitations: it can only be based on fixed rules, and the application scenarios are limited.
  • In practice, RPA encounters a large amount of tabular data. Enterprises and institutions in particular may face large volumes of tabular data every day, so it is important to correctly extract the useful information from this data and enter it into a designated system. At present this is generally done in one of two ways. In the first, staff manually screen the table for useful information and then manually enter the selected information into the system. In the second, matching rules for the various tables are summarized by hand: a rule template is specified according to the structure of each table, the table information is extracted by a program or algorithm, the result is arranged into the structure the system requires, and the extracted information is then entered into the system either by a program or manually.
  • the present invention provides a table information extraction method, apparatus, device and medium based on RPA and AI, so as to overcome at least one technical problem existing in the prior art.
  • a method for extracting table information based on RPA and AI includes:
  • a table information extraction device based on RPA and AI includes:
  • an image conversion module configured to convert a file containing a table into a picture;
  • a template generation module configured to recognize the table in the picture and to generate, according to the recognition result, an information extraction template corresponding to the table type, where the information extraction template contains the key of each key-value pair in the table and its position information, and the position information of the value of each key-value pair to be extracted;
  • a content extraction module configured to extract table content from the recognition result according to the information extraction template.
  • an embodiment of the present invention further provides a computing device, including:
  • a processor coupled to the memory
  • the processor invokes the executable program code stored in the memory to execute part or all of the steps of the table information extraction method based on RPA and AI provided by any embodiment of the present invention.
  • an embodiment of the present invention further provides a computer-readable storage medium that stores a computer program, and the computer program includes instructions for executing some or all of the steps of the table information extraction method based on RPA and AI provided by any embodiment of the present invention.
  • when extracting table information, a file containing a table can be converted into a picture, so that the content of the cells remains associated with the table.
  • an information extraction template corresponding to the table type can be generated according to the identification result.
  • the information extraction template includes the key of each key-value pair in the table and its position information, as well as the position information of the value of each key-value pair to be extracted.
  • the table content can be extracted from the recognition result.
  • an information extraction template corresponding to the table type can be generated according to the recognition result.
  • the table content can be extracted from the recognition result, which avoids the problems of high labor cost and poor accuracy when manually extracting table information.
  • the method provided by this implementation does not require developers to summarize different rules, and is more versatile. This is one of the innovative points of the embodiments of the present invention.
  • converting the file containing the table into a picture first, and then recognizing the table in the picture, improves the reliability of the table data and helps to improve the generality of the information extraction template, which is one of the innovative points of the embodiments of the present invention.
  • the generated information extraction template contains some special syntax identifiers, such as square brackets, angle brackets, etc. These identifiers are determined based on table properties, which are related to the content and location of cells in the table.
  • the angle brackets indicate that the content needs to be fuzzy matched.
  • Square brackets indicate that the content in it needs to be strictly matched. This setting helps to improve the accuracy of table content extraction, which is one of the innovative points of the embodiments of the present invention.
  • the position information of each key-value pair in the information extraction template is represented in the form of a regular expression. This setting can avoid the problem that the table content cannot be accurately extracted due to disordered row and column information of the cells in the OCR identification result, which is one of the innovative points of the embodiments of the present invention.
  • the first information extraction template corresponding to a left-right one-to-one type table is generated from the content of each row in the table. That is, one first information extraction template is generated for each row, so the number of rows in the table equals the number of first information extraction templates.
  • this embodiment can reduce the number of information extraction templates and improve the speed of template generation, which is one of the innovative points of the embodiments of the present invention.
  • an auxiliary variable can be added before the cells in the first column of the table to distinguish the contents of different rows, so that the content of the next row is not extracted as the content of the current row during information extraction, which is one of the innovative points of the embodiments of the present invention.
  • the information extraction template corresponds to the table type and is highly general: if the picture contains multiple tables of the same type, the method provided by the embodiment of the present invention can generate a single information extraction template for all of them. The contents of multiple tables of the same type can be extracted with this template, which improves the speed of table content extraction and is one of the innovative points of the embodiments of the present invention.
  • FIG. 1a is a flowchart of table information extraction and input based on the combination of RPA+AI provided by Embodiment 1 of the present invention
  • FIG. 1b is a schematic diagram of a field establishment interface according to Embodiment 1 of the present invention.
  • FIG. 1c is a schematic interface diagram of a pre-release information extraction template provided by Embodiment 1 of the present invention.
  • FIG. 1d is a schematic interface diagram of a post-release information extraction template provided by Embodiment 1 of the present invention.
  • FIG. 1e is a schematic interface diagram of an information extraction template corresponding to a one-to-many type table after publishing provided by Embodiment 1 of the present invention
  • FIG. 2 is a schematic flowchart of a method for extracting table information based on RPA and AI according to Embodiment 2 of the present invention
  • FIG. 3 is a flowchart of a preferred method for extracting table information based on RPA and AI provided by Embodiment 3 of the present invention
  • FIG. 4 is a flowchart of a preferred method for extracting table information based on RPA and AI provided by Embodiment 4 of the present invention
  • FIG. 5 is a flowchart of a preferred method for extracting table information based on RPA and AI provided by Embodiment 5 of the present invention
  • FIG. 6 is a flowchart of a preferred method for extracting table information based on RPA and AI provided by Embodiment 6 of the present invention
  • FIG. 7 is a flowchart of a preferred method for extracting table information based on RPA and AI provided by Embodiment 7 of the present invention.
  • FIG. 8 is a schematic structural diagram of an apparatus for extracting table information based on RPA and AI according to Embodiment 8 of the present invention.
  • FIG. 9 is a schematic structural diagram of a computing device according to Embodiment 9 of the present invention.
  • A template is a text expression provided by the developer for the "information extraction" function. The expression is used to match fragments of text and extract information from them.
  • To use the template provided by the embodiment of the present invention, the following syntax needs to be understood:
  • the "square brackets" [] represent strict matching.
  • the matching content can be a "vocabulary" or "regular expression" defined in the resource in advance, or a phrase that needs to be matched.
  • can only appear in the template header, used to define that the template must be matched from scratch.
  • field refers to the key information extracted from the template, which is a name specific to the current information extraction task, and the name is generally designated by the user.
  • a "vocabulary" is an information structure composed of <vocabulary name, vocabulary value, various expressions of the vocabulary value>.
  • a vocabulary describes a relatively fixed class of "external knowledge" in lexical form that is strongly related to the developer's field.
  • the term "regular expression" refers to a logical formula for operating on strings. It describes a string-matching pattern that can be used to check whether a string contains a certain substring, to replace a matched substring, or to extract from a string a substring that meets a certain condition.
  • the embodiment of the present invention discloses a table information extraction method, device, device and medium based on RPA and AI. Each of them will be described in detail below.
  • RPA, short for Robotic Process Automation, is a specific kind of "robot software" that simulates human operations on a computer and automatically performs process tasks according to rules.
  • AI is the English abbreviation of artificial intelligence. It is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence.
  • Fig. 1a is a flow chart of table information extraction and input based on the combination of RPA+AI provided by Embodiment 1 of the present invention, and each step in Fig. 1a is introduced below:
  • the file can be parsed by writing a large number of codes and rules in a conventional manner.
  • the traditional method can easily lead to the inability of parsing programs and rules to be reused in many cases, which increases the development cost.
  • an automated service platform such as Uibot software, can automatically convert the table file into a picture by adopting a construction process.
  • OCR (Optical Character Recognition)
  • the position information includes information such as start row index, start column index, end row index and end column index.
  • Step 130 is the key point of the embodiment of the present invention, that is, a template for extracting table information is automatically generated according to the OCR recognition result of the table image.
  • the form information extraction template corresponds to the form type.
  • information extraction templates corresponding to the table types can be generated by calling different template interfaces.
  • the step of automatically generating the form information extraction template will be analyzed from the following two aspects:
  • the table format is in the form of left and right key-value, as shown in Table 1 below.
  • the table type in Table 1 below can be considered a one-to-one form, that is, the field to be extracted is on the left, such as "name", while the corresponding value is on the right, such as "Zhang San".
  • the row and column indices corresponding to each cell can be obtained (row and column indices are counted from 0). For example, the row and column indices of the cell "name" are 0 and 0 respectively, and the row and column indices of the cell "Hefei City, Anhui Province" are 1 and 5.
  • FIG. 1b is a schematic diagram of a field establishment interface according to Embodiment 1 of the present invention. As shown in Figure 1b, the interface shows the following partial fields created by the user: "place of birth”, “name”, “educational education”, “age” and "place of origin”.
  • the corresponding information extraction template is generated by row.
  • the contents of each cell in the table need to be spliced according to the position information of the row and column in the table.
  • the location information and content of each cell are included.
  • the information extraction template corresponding to each row of content can be generated as follows: the position information of each key in the spliced content is used as the position information of the key in the information extraction template; each key in the spliced content is used as a key in the information extraction template; and the position information of the value of each key-value pair in the spliced content is used as the position information of the value to be extracted in the information extraction template.
  • corresponding grammatical identifiers need to be added for the subsequent extraction of table content.
  • the corresponding information extraction template is:
  • FIG. 1c is a schematic interface diagram of an information extraction template before publishing according to Embodiment 1 of the present invention.
  • a column of matching text corresponds to each node in the table, that is, each key-value pair in the table.
  • the user can choose whether to output to the specified field for each generated node. For example, in the first row of the table, if the user wants to extract the name and age, then select the name and age in the output to field below. For the table content that the user does not want to extract, select "Do not output" in the output to field.
  • FIG. 1d is a schematic interface diagram of a post-release information extraction template provided by Embodiment 1 of the present invention.
  • after the information extraction template corresponding to each row of content has been generated, if a template publishing instruction triggered by the user is received, the final information extraction template corresponding to Table 1 above is displayed to the user.
  • the user can edit, copy and delete any one of the templates in Figure 1d.
  • the table format is in the form of upper and lower key-value, as shown in Table 2 below.
  • the first column of the value to be extracted in the upper and lower form tables is not enumerable or irregular, that is, it cannot be expressed by establishing a vocabulary or by using regular expressions.
  • a preset standard template needs to be specified first to match the keys in Table 2 above, that is, to match "item", "note", "2020 semi-annual-amount" and "2019 semi-annual-amount" in the table.
  • the default standard template is:
  • the content of each cell in the table is spliced according to the row position information, and the spliced content is matched against the content of the preset standard template. If the match succeeds, the keys in the table can be determined, along with the start position information and end position information of the values to be extracted in the table, and the number of matched columns, cols, can be recorded.
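The header-matching step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `match_header`, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical. For each preset key it picks the most similar header cell and reports failure if any key has no sufficiently similar cell.

```python
from difflib import SequenceMatcher


def match_header(header_cells, preset_keys, threshold=0.8):
    """Map each preset key to the column of its most similar header cell.

    Returns a dict {key: column index}, or None if any key fails to match.
    The threshold and similarity measure are illustrative assumptions.
    """
    if not header_cells:
        return None
    found = {}
    for key in preset_keys:
        # Score the key against every header cell and keep the best column.
        scores = [SequenceMatcher(None, key, text).ratio() for text in header_cells]
        best_col = max(range(len(scores)), key=scores.__getitem__)
        if scores[best_col] <= threshold:
            return None  # one preset key is missing, so the table does not match
        found[key] = best_col
    return found


# Keys named in the text for Table 2.
header = ["item", "note", "2020 semi-annual-amount", "2019 semi-annual-amount"]
columns = match_header(header, header)
cols = len(columns)  # number of matched columns
```

Choosing the best-scoring cell rather than the first cell above the threshold matters here, because "2020 semi-annual-amount" and "2019 semi-annual-amount" are themselves highly similar strings.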
  • an auxiliary variable @Frow_n- is introduced, where row_n represents the row number, and a template is established. The auxiliary variable is used to distinguish the content of different rows when the table content is extracted.
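As a rough Python sketch of why a per-row marker helps, the example below prefixes each spliced data row with a marker of the form @F0-, @F1-, and so on, then splits on those markers so that each row's cells stay together. Only the marker shape comes from the text; the function names, the "|" cell separator, and the cell values are hypothetical and purely illustrative.

```python
import re


def add_row_markers(rows):
    """Join rows into one string, prefixing row n with the marker @Fn-.

    rows: list of rows, each a list of cell texts.
    The "|" separator between cells is an assumption for this sketch.
    """
    return "".join(f"@F{n}-" + "|".join(cells) for n, cells in enumerate(rows))


def split_rows(tagged):
    """Recover the rows by splitting on the @Fn- markers."""
    parts = re.split(r"@F\d+-", tagged)
    return [part.split("|") for part in parts if part]


# Made-up cell values, used only to show the mechanics.
rows = [["Cash", "1", "100", "90"], ["Receivables", "2", "50", "40"]]
tagged = add_row_markers(rows)
recovered = split_rows(tagged)
```

Without the markers, the last cell of one row and the first cell of the next would be adjacent in the spliced string, which is the mix-up the auxiliary variable is described as preventing.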
  • FIG. 1e is a schematic diagram of an interface of an information extraction template corresponding to a one-to-many type table after publishing according to Embodiment 1 of the present invention.
  • For Table 2, if the user presets the fields to be extracted as item, 2019 semi-annual-amount and 2020 semi-annual-amount, the generated information extraction template is as shown in FIG. 1e.
  • F0 is an auxiliary variable.
  • the user can choose whether to output the extracted table content to the field on the display interface before publishing.
  • the user can also perform operations such as editing, copying and deleting on the generated information extraction template on the interface shown in FIG. 1e.
  • the first column of the value to be extracted in the upper and lower form tables is enumerable or can be represented by a regular expression.
  • the information extraction template corresponding to the above Table 2 is:
  • the method for generating a form information extraction template avoids the high labor cost and poor accuracy of manually extracting form information. Compared with summarizing matching rules for the various forms by manual intervention, the method provided by this implementation does not require developers to summarize different rules and is highly general.
  • an automated service platform such as Uibot software
  • Uibot software can be used to implement automatic input of information by building a process.
  • the input method provided by this embodiment has higher versatility, and reduces labor costs and maintenance costs to a great extent.
  • FIG. 2 is a schematic flowchart of a method for extracting table information based on RPA and AI according to Embodiment 2 of the present invention.
  • the method can be applied to application scenarios such as table data screening and entry systems, and can be performed by a table information extraction device based on RPA and AI, which can be implemented by software and/or hardware.
  • the method provided by this embodiment specifically includes:
  • the file containing the table may be a Word document, an Excel document, or a PDF document, or the like.
  • the RPA technology can be used to convert the file containing the table into a picture.
  • by converting the file into a picture, the table content and its position information in the table are fixed together. If the information extraction template were instead generated by directly recognizing the table in the file, the content of the table could easily be identified as ordinary text in the file, causing data in the table to be lost.
  • directly identifying the table content will also lead to the fact that the parsing programs and rules used to identify the table content cannot be reused in many cases, resulting in an increase in development costs.
  • the file containing the table is first converted into a picture, and then the table in the picture is identified, which improves the reliability of the table data and helps to improve the generality of the information extraction template.
  • OCR (Optical Character Recognition)
  • the recognition result includes the content of each cell in the table and the position information of each cell in the table.
  • the position information of each cell in the table includes a start row index, a start column index, an end row index, an end column index, and the like.
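One way to hold such a recognition result in code is a small record per cell. The Python sketch below is an assumption for illustration only: the class name `Cell`, its field names, and the `spans_multiple` helper are hypothetical and are not taken from any OCR library or from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One recognized table cell: its text plus its position in the table."""
    text: str
    start_row: int
    start_col: int
    end_row: int
    end_col: int

    @property
    def spans_multiple(self) -> bool:
        # True when the cell covers more than one row or column.
        return self.end_row > self.start_row or self.end_col > self.start_col


# Indices are counted from 0, as in the text.
name_cell = Cell("name", 0, 0, 0, 0)
value_cell = Cell("Zhang San", 0, 1, 0, 1)
```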
  • the table type can be determined by the positional relationship and the corresponding relationship between each key-value pair in the table.
  • information extraction templates corresponding to the table types can be generated by calling different template interfaces. Before calling different template interfaces, users can specify the fields they want to extract according to the key-value pair information in the table. After the information extraction template is generated, the user can also select whether to output the extracted table content by triggering the field output instruction.
  • an information extraction template corresponding to the table type can be generated according to the content of each cell in the table and the position information of each cell in the table.
  • the information extraction template includes the key and position information of each key-value pair in the table, and the position information of the value of each key-value pair to be extracted.
  • the constructed information extraction template is as follows:
  • [@R0@C0-]<name> means that "name" is located at row 0, column 0 of the table; [@R0@C1-]{name:<*>} means that the value corresponding to "name" is located at row 0, column 1.
  • Other fields to be extracted, such as age, gender, and ethnicity, are represented in the information extraction template in a manner similar to the representation of the above-mentioned names, and will not be repeated here.
  • some special syntax identifiers can be added to the template, and these identifiers are determined according to the table attributes. For example, the position information of a cell in the table is enclosed in square brackets [ ], such as [@R0@C0-] in the template above. The key of a key-value pair in the table is enclosed in angle brackets < >, such as <name> in the template above. The value of a key-value pair to be extracted is expressed as an asterisk in angle brackets, such as <*>, and the value to be extracted is separated from its corresponding key by a colon ":".
  • curly brackets are added around each key-value pair whose value is to be output to a field, such as {name:<*:0,>} in the above template. If the user has set that the value to be extracted does not need to be output to a field, the curly brackets are not added.
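The syntax rules above can be illustrated with a short Python sketch that assembles a template string for one row of a left-right table. It assumes square brackets around positions, angle brackets around keys and around the asterisk, and curly brackets around pairs that are output to a field. The function name `build_row_template`, the rule that a value sits in the column immediately right of its key, and the bare `<*>` form for values that are not output are all assumptions made for this example.

```python
def build_row_template(row, keys, output_fields):
    """Build a template string for one table row.

    row: row index of the table row.
    keys: list of (key_col, key_text); each value is assumed to sit at key_col + 1.
    output_fields: set of key names whose values should be output to a field.
    """
    parts = []
    for key_col, key in keys:
        # Position in square brackets, key in angle brackets.
        parts.append(f"[@R{row}@C{key_col}-]<{key}>")
        value_pos = f"[@R{row}@C{key_col + 1}-]"
        if key in output_fields:
            # Curly brackets mark a value that is output to the field named by the key.
            parts.append(f"{value_pos}{{{key}:<*>}}")
        else:
            # No curly brackets when the value is not output (assumed form).
            parts.append(f"{value_pos}<*>")
    return "".join(parts)


template = build_row_template(0, [(0, "name"), (2, "age")], {"name"})
```

Here `template` is `[@R0@C0-]<name>[@R0@C1-]{name:<*>}[@R0@C2-]<age>[@R0@C3-]<*>`: only the value of "name" is wrapped in curly brackets, because only "name" was selected for output.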
  • the grammatical identifiers in the information extraction template have preset meanings. For example, square brackets represent strict matching, which checks whether the strings being matched are identical; angle brackets represent fuzzy matching, which checks whether the similarity of the content being matched is greater than a set threshold.
  • the position information of each key-value pair in the information extraction template can be represented in the form of regular expressions. This setting can avoid the problem that the table content cannot be accurately extracted due to disordered row and column information of the cells in the OCR recognition result.
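To make the idea concrete, the Python sketch below expresses a cell position as a regular expression rather than as fixed indices, so a value can still be located if the OCR result reports shifted row or column numbers. It assumes the spliced recognition result is a string in which each cell's text is preceded by a marker of the form @R<row>@C<col>-; that layout, the `POSITION` pattern, and the function name `extract_value` are assumptions for illustration.

```python
import re

# Any position marker, whatever its row and column numbers are.
POSITION = r"@R\d+@C\d+-"


def extract_value(spliced, key):
    """Return the text of the cell that follows `key`, or None if not found."""
    pattern = re.compile(
        rf"{POSITION}{re.escape(key)}{POSITION}(?P<value>.*?)(?={POSITION}|$)"
    )
    match = pattern.search(spliced)
    return match.group("value") if match else None


expected_layout = "@R0@C0-name@R0@C1-Zhang San@R0@C2-age@R0@C3-30"
shifted_layout = "@R3@C4-name@R3@C5-Zhang San"  # indices differ, content is the same
```

Both layouts yield "Zhang San" for the key "name". The age value "30" is a made-up example value.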
  • the user can perform related debugging according to the automatically generated template, for example, the template can be edited, copied and deleted.
  • the user can call the information extraction engine interface to extract information.
  • all contents in the information extraction template may be matched with the OCR identification result until the matching is successful.
  • during the matching process, matching is performed according to the preset meaning of each grammatical identifier in the information extraction template. For example, it is determined whether the string in square brackets is the same as the string corresponding to the cell position information in the OCR recognition result, or whether the similarity between the content in angle brackets and the key of a key-value pair in the OCR recognition result is greater than the set threshold. If the strings are equal or the similarity is greater than the set threshold, the match succeeds. After a successful match, the table content to be extracted can be extracted from the recognition result.
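The two matching modes can be sketched in Python as below. The text does not say which similarity measure or threshold is used, so `difflib.SequenceMatcher` and the default threshold of 0.8 are assumptions chosen only to make the example runnable; the function names are likewise hypothetical.

```python
from difflib import SequenceMatcher


def strict_match(expected, actual):
    """Square-bracket semantics: the two strings must be identical."""
    return expected == actual


def fuzzy_match(expected, actual, threshold=0.8):
    """Angle-bracket semantics: similarity must exceed the set threshold."""
    return SequenceMatcher(None, expected, actual).ratio() > threshold


# Position markers are compared strictly; keys tolerate small OCR errors.
same_position = strict_match("@R0@C0-", "@R0@C0-")
ocr_typo_ok = fuzzy_match("place of birth", "p1ace of birth")
different_key = fuzzy_match("name", "age")
```

Fuzzy matching on keys lets a template survive a single misrecognized character, while strict matching on positions keeps a key from being paired with a value in the wrong cell.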
  • when extracting table information, a file containing a table can be converted into a picture, so that the content of the cells remains associated with the table.
  • an information extraction template corresponding to the table type can be generated according to the identification result.
  • the information extraction template includes the key of each key-value pair in the table and its position information, as well as the position information of the value of each key-value pair to be extracted.
  • the table content can be extracted from the recognition result.
  • FIG. 3 is a flowchart of a preferred method for extracting table information based on RPA and AI provided by Embodiment 3 of the present invention.
  • on the basis of the above embodiments, this embodiment describes in detail the generation process of the information extraction template for a table whose type is the left-right one-to-one format.
  • the key and value of each key-value pair in the table in the left-right one-to-one format have a left-right positional relationship, and the key and the value have a one-to-one relationship.
  • the method includes:
  • the row and column index of each cell and the correspondence between the cells have been determined.
  • the spliced content is embodied in the form of a character string.
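A minimal Python sketch of such splicing is shown below. It assumes each recognized cell is available as a (row, column, text) tuple and that each cell's text is prefixed with a position marker of the form @R<row>@C<col>-; the function name `splice_rows`, the tuple layout, and the placement of the cells in row 1 are assumptions for illustration.

```python
from collections import defaultdict


def splice_rows(cells):
    """Splice cell contents into one string per table row.

    cells: iterable of (row, col, text) tuples in any order.
    Returns a list of strings, one per row, ordered by row index.
    """
    rows = defaultdict(list)
    for row, col, text in cells:
        rows[row].append((col, text))
    spliced = []
    for row in sorted(rows):
        # Order the cells of a row by column before joining them.
        parts = [f"@R{row}@C{col}-{text}" for col, text in sorted(rows[row])]
        spliced.append("".join(parts))
    return spliced


# Cells may arrive out of order from recognition; positions are illustrative.
cells = [
    (0, 1, "Zhang San"),
    (0, 0, "name"),
    (1, 1, "Hefei City, Anhui Province"),
    (1, 0, "place of origin"),
]
spliced = splice_rows(cells)
```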
  • the field to be extracted is on the left, such as "name” in Table 1 above, and the corresponding value is on the right, such as "Zhang San”.
  • the first information extraction template corresponding to a left-right one-to-one type table is generated from the content of each row in the table. That is, one first information extraction template is generated for each row, so the number of rows in the table equals the number of first information extraction templates.
  • each row of content in the spliced content includes the position information and content of each cell.
  • the generation process of the first information extraction template corresponding to each line of content can be as follows: the position information of each key in the content after splicing can be used as the position information of the key in the first information extraction template; each key in the content after splicing, As the key in the first information extraction template; the position information of the value of each key-value pair in the spliced content is used as the position information of the value to be extracted in the first information extraction template.
  • a corresponding grammatical identifier needs to be added to it, so as to be used for subsequent table content extraction .
  • the corresponding first extraction template is:
  • the corresponding first extraction template is:
  • this embodiment refines the generation process of the first information extraction template for a table of the left-right one-to-one type. The content of each cell is spliced according to the row and column position information in the table, and the first information extraction template corresponding to the content of each row is generated from the spliced content, which avoids the high labor cost and poor accuracy of manually extracting table information.
  • the method provided by this implementation does not require developers to summarize different rules, and is more versatile.
  • FIG. 4 is a flowchart of a preferred method for extracting table information based on RPA and AI according to Embodiment 4 of the present invention.
  • this embodiment describes in detail the generation process of the information extraction template for a table whose type is the upper-lower one-to-many format.
  • the key and value of each key-value pair in the table in the upper-lower one-to-many format have an upper-lower positional relationship, and the key and the value have a one-to-many relationship.
  • the table type in the upper-lower one-to-many format includes the following two cases:
  • 1. the first column of the values to be extracted in the table is not enumerable or is irregular, that is, it cannot be expressed by establishing a vocabulary or by using regular expressions;
  • 2. the first column of the values to be extracted in the table is enumerable or can be represented by a regular expression.
  • the above-mentioned first case is first described in detail.
  • the table information extraction method based on RPA and AI provided by this embodiment includes:
  • the preset vocabulary contains all the table content that the user has preset for extraction. If the preset vocabulary is not detected, it means that the first column of the values to be extracted in the table is not enumerable or is irregular.
  • the information extraction template corresponds to the table type and is therefore highly general. That is, if the picture contains multiple tables of the same type, the method provided in this embodiment can generate a single information extraction template for all of them. With this template, the content of multiple tables of the same type can be extracted, which speeds up subsequent table content extraction.
  • the preset standard template includes the key of the extracted key-value pair preset by the user.
  • the grammatical identifiers of the key and its location information in the key-value pair in the preset standard template are the same as the grammatical identifiers of the key and its location information in the information extraction template in the embodiment of the present invention.
  • the corresponding preset standard templates are as follows:
  • the content of each cell in the table is spliced according to the position information of the row, and the spliced content is matched with the content of the preset standard template.
  • This setting is to determine the user preset from the recognition result.
  • the key corresponding to the extracted content can be determined, and the start position information and end position information of the value to be extracted in the table can be determined.
  • the number of second targets is 4.
  • the spliced content is matched with the content of the preset standard template. After the matching succeeds, the value corresponding to each key in the table, as well as the start position information and end position information of the rows where the content to be extracted is located, are also determined.
  • auxiliary variables are added before the cells in the first column of the table to distinguish the contents of different rows in the table, so as to avoid extracting the contents of the next row as the contents of the current row during information extraction.
  • the second information extraction template includes auxiliary variables, matching keys and their location information, and location information of the values of each key-value pair to be extracted in the table.
  • the generation process of the second information extraction template may specifically be as follows: add the auxiliary variables at the starting position of the second information extraction template; use the position information of each matched key as the position information of the keys in the second information extraction template; use the matched keys as the keys in the second information extraction template; and use the position information of the values to be extracted corresponding to each key as the position information of the values to be extracted in the second information extraction template.
  • a corresponding grammatical identifier is added for the key of each key-value pair and its position information, as well as for the position information of the value of each key-value pair to be extracted, for use in subsequent table content extraction.
  • the syntax identifier involved in the second information extraction template has the same meaning as the syntax identifier mentioned in the first information extraction template, which is not repeated in this embodiment.
  • the generated second information extraction template is:
  • this embodiment refines the generation process of the second information extraction template, which corresponds to tables in the top-bottom one-to-many format whose first column of values to be extracted is not enumerable or is irregular.
  • the start position information and end position information of the values to be extracted in the table can be obtained.
  • the second information extraction template corresponding to the table type can be generated, avoiding the introduction of excessive manual intervention.
  • FIG. 5 is a flowchart of a preferred method for extracting table information based on RPA and AI provided by Embodiment 5 of the present invention.
  • the case in which the first column of the values to be extracted in the table is enumerable or can be represented by a regular expression is described in detail.
  • the table information extraction method based on RPA and AI provided by this embodiment includes:
  • if the matching is successful, the content of each cell in the table is spliced according to the position information of the row, and the spliced content is matched with the content of the preset standard template.
  • the standard template includes the key of the extracted key-value pair preset by the user.
  • the specific matching method is the same as the matching method mentioned in the above-mentioned embodiment, and will not be repeated here.
  • the third information extraction template includes the keys that match the preset standard template, the position information of those keys in the table, and the position information of the value of each key-value pair to be extracted in the table. Unlike the second information extraction template, the third information extraction template does not need auxiliary variables. Otherwise, the third information extraction template is generated in a similar way to the second, which will not be repeated here.
  • the generated third information extraction template is:
  • this embodiment refines the generation process of the third information extraction template, which corresponds to tables in the upper-lower one-to-many format whose first column of values to be extracted is enumerable, that is, can be expressed in the form of a vocabulary. Unlike the second information extraction template, the third does not need auxiliary variables, but it must be determined whether the value to be extracted belongs to the preset vocabulary; if it does, a third information extraction template corresponding to the table type can be generated from the matched keys and their position information.
  • This embodiment is set in this way to avoid introducing too much manual intervention, and compared with the method of summarizing matching rules of various tables by manual intervention, the method provided by this implementation does not require developers to summarize different rules, and is more versatile.
  • FIG. 6 is a flowchart of a preferred method for extracting table information based on RPA and AI according to Embodiment 6 of the present invention.
  • This embodiment introduces in detail the generation of an information extraction template corresponding to a left-right one-to-many format.
  • the key and value of each key-value pair in the table in the left-right one-to-many format have a left-right positional relationship, and the key and the value have a one-to-many relationship.
  • the first column in the value to be extracted is not enumerable or irregular.
  • the generation method of the fourth information extraction template provided in this embodiment is similar to that of the second information extraction template, which corresponds to the upper-lower one-to-many format with a non-enumerable first column of values to be extracted. The difference is that, because of the positional relationship between keys and values in the table, in this embodiment the cell content is spliced by column and the table is traversed by column, so as to determine the number of rows of table content.
  • the table information extraction method based on RPA and AI provided by this embodiment includes:
  • the standard template includes pre-set keys of the extracted key-value pairs.
  • the fourth information extraction template includes auxiliary variables, the matched keys and their location information, and location information of the values of each key-value pair in the table.
  • the specific generation process of the fourth information extraction template is similar to the generation process of the second information extraction template. For details, please refer to the above-mentioned generation process of the second information extraction template, which will not be repeated here.
  • the content of each cell in the table is spliced according to the position information of the column, and the spliced content is matched with the content of the preset standard template. If the match is successful, the start position information and end position information of the values to be extracted in the table can be obtained. The table is then traversed by column. If the number of rows in the table matches the number of keys in the preset standard template, auxiliary variables are added before the cells in the first column of the table to distinguish the contents of each row in the table.
  • a fourth information extraction template corresponding to the table type may be generated based on the auxiliary variable, the key whose table content matches the preset standard template, and its position information.
  • FIG. 7 is a flowchart of a preferred method for extracting table information based on RPA and AI according to Embodiment 7 of the present invention.
  • This embodiment introduces in detail the generation of an information extraction template corresponding to a left-right one-to-many format.
  • the key and value of each key-value pair in the table in the left-right one-to-many format have a left-right positional relationship, and the key and the value have a one-to-many relationship.
  • the first column of the value to be extracted can be enumerated, that is, it can be expressed in the form of a vocabulary.
  • the generation method of the fifth information extraction template provided by this embodiment is similar to that of the third information extraction template, which corresponds to the upper-lower one-to-many format with an enumerable first column of values to be extracted. The difference is that, because of the positional relationship between keys and values in the table, in this embodiment the cell content is spliced by column and the table is traversed by column to determine the number of rows of table content. As shown in FIG. 7, the table information extraction method based on RPA and AI provided by this embodiment includes:
  • if the matching is successful, the content of each cell in the table is spliced according to the position information of the column, and the spliced content is matched with the content of the preset standard template.
  • the preset standard template includes preset keys of the extracted key-value pairs.
  • the fifth information extraction template includes the keys matching the preset standard template and their location information, and the location information of the values of each key-value pair to be extracted in the table.
  • the specific generation process of the fifth information extraction template is similar to the generation process of the third information extraction template. For details, please refer to the above-mentioned generation process of the third information extraction template, which will not be repeated here.
  • this embodiment refines the generation process of the fifth information extraction template, which corresponds to tables in the left-right one-to-many format whose first column of values to be extracted is enumerable, that is, can be expressed in the form of a vocabulary. Unlike the fourth information extraction template, the fifth does not need auxiliary variables, but it must be determined whether the value to be extracted belongs to the preset vocabulary; if it does, a fifth information extraction template corresponding to the table type can be generated from the matched keys and their position information.
  • This embodiment is set in this way to avoid introducing too much manual intervention, and compared with the method of summarizing matching rules of various tables by manual intervention, the method provided by this implementation does not require developers to summarize different rules, and is more versatile.
  • the apparatus includes: a picture conversion template 810, a template generation module 820, and a content extraction module 830; wherein,
  • the picture conversion template 810 is configured to convert a file containing a table into a picture;
  • the template generation module 820 is configured to identify the table in the picture, and generate an information extraction template corresponding to the table type according to the identification result, and the information extraction template includes the keys of each key-value pair in the table and their location information , and the location information of the value of each key-value pair to be extracted;
  • the content extraction module 830 is configured to extract table content from the identification result according to the information extraction template.
  • the template generation module 820 includes:
  • the picture recognition unit is configured to perform optical character OCR recognition on the picture to obtain a recognition result, where the recognition result includes the content of each cell in each table, and the position information of each cell in the table;
  • the template generating unit is configured to, for any type of table, generate an information extraction template corresponding to the table type according to the content of each cell in the table and the position information of each cell in the table.
  • the table type includes a left-right one-to-one format, the key and value of each key-value pair in the left-right one-to-one format table are in a left-right positional relationship, and the key and value are in a one-to-one relationship;
  • the template generation unit is specifically configured as:
  • the first information extraction template includes the key and position information of each key-value pair in each row in the table, and the position information of the value of each key-value pair to be extracted.
  • the table type includes a top-bottom one-to-many format, the key and value of each key-value pair in the top-bottom one-to-many format table are in a top-bottom position relationship, and the key and value are a one-to-many relationship;
  • the template generation unit is specifically configured as:
  • the content of each cell in the table is spliced according to the position information of the row, and the spliced content is matched with the content of a preset standard template, where the standard template includes the preset keys of the key-value pairs to be extracted;
  • if the match is successful, the number of columns corresponding to the matched keys is taken as the first target number;
  • an auxiliary variable is added before the cells in the first column in the table, and the auxiliary variable is used to distinguish the content of each row in the table when the table content is extracted;
  • the second information extraction template includes the auxiliary variable, the matching key and its location information, and the location information of the value of each key-value pair to be extracted in the table.
  • the table type includes a top-bottom one-to-many format, the key and value of each key-value pair in the top-bottom one-to-many format table are in a top-bottom position relationship, and the key and value are a one-to-many relationship;
  • the template generation unit is specifically configured as:
  • the value of each key-value pair in the table is matched with the content of the preset vocabulary
  • the content of each cell in the table is spliced according to the position information of the row, and the spliced content is matched with the content of a preset standard template, where the preset standard template includes the preset keys of the key-value pairs to be extracted;
  • the number of columns corresponding to the matched key is taken as the second target number
  • the third information extraction template includes the matched key and its location information, and the location information of the value of each key-value pair to be extracted in the table.
  • the table type includes a left-right one-to-many format, the key and value of each key-value pair in the left-right one-to-many format table have a left-right positional relationship, and the key and the value have a one-to-many relationship;
  • the template generation unit is specifically configured as:
  • the content of each cell in the table is spliced according to the position information of the column, and the spliced content is matched with the keys in the preset standard template, where the preset standard template includes the preset keys of the key-value pairs to be extracted;
  • the number of rows corresponding to the matched key is taken as the third target number
  • an auxiliary variable is added before each cell in the table, and the auxiliary variable is used to distinguish the content of each row in the table when the table content is extracted;
  • a fourth information extraction template corresponding to the form type is generated;
  • the fourth information extraction template includes the auxiliary variable, the matching key and its position information, and the position information of the value of each key-value pair in the table.
  • the table type includes a left-right one-to-many format, the key and value of each key-value pair in the left-right one-to-many format table have a left-right positional relationship, and the key and the value have a one-to-many relationship;
  • the template generation unit is specifically configured as:
  • the value of each key-value pair in the table is matched with the content of the preset vocabulary
  • the content of each cell in the table is spliced according to the position information of the column, and the spliced content is matched with the content of a preset standard template, where the preset standard template includes the preset keys of the key-value pairs to be extracted;
  • the number of columns corresponding to the matched key is taken as the fourth target number
  • the fifth information extraction template includes the matched key and its location information, and location information of the value of each key-value pair to be extracted in the table.
  • the apparatus for extracting table information based on RPA and AI provided by the embodiment of the present invention can execute the RPA- and AI-based table information extraction method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
  • FIG. 9 is a schematic structural diagram of a computing device according to Embodiment 9 of the present invention.
  • the computing device may include:
  • a memory 901 storing executable program code
  • a processor 902 coupled to the memory 901;
  • the processor 902 invokes the executable program code stored in the memory 901 to execute the table information extraction method based on RPA and AI provided by any embodiment of the present invention.
  • the embodiment of the present invention also discloses a computer-readable storage medium storing a computer program, wherein the computer program enables a computer to execute the RPA- and AI-based table information extraction method provided by any embodiment of the present invention.
  • the modules in the apparatus in the embodiment may be distributed in the apparatus in the embodiment according to the description of the embodiment, and may also be located in one or more apparatuses different from this embodiment with corresponding changes.
  • the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules.
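
The five template variants described in the embodiments above differ along two axes: the table type, and whether the first column of the values to be extracted is enumerable (covered by a preset vocabulary). A minimal Python sketch of that decision follows. The names `TableType` and `choose_template` are hypothetical and do not come from the patent, and the returned tuple is only an illustrative summary of what the text says about auxiliary variables and splicing direction.

```python
from enum import Enum


class TableType(Enum):
    # Hypothetical labels for the three table types named in the text.
    LEFT_RIGHT_ONE_TO_ONE = "left-right one-to-one"
    TOP_BOTTOM_ONE_TO_MANY = "top-bottom one-to-many"
    LEFT_RIGHT_ONE_TO_MANY = "left-right one-to-many"


def choose_template(table_type, first_value_column_enumerable=False):
    """Return (template name, needs auxiliary variables, splice direction)."""
    if table_type is TableType.LEFT_RIGHT_ONE_TO_ONE:
        # One template per row; no auxiliary variables are mentioned.
        return ("first", False, "row")
    if table_type is TableType.TOP_BOTTOM_ONE_TO_MANY:
        if first_value_column_enumerable:
            return ("third", False, "row")
        # Non-enumerable first column: rows are separated by auxiliary variables.
        return ("second", True, "row")
    if table_type is TableType.LEFT_RIGHT_ONE_TO_MANY:
        if first_value_column_enumerable:
            return ("fifth", False, "column")
        return ("fourth", True, "column")
    raise ValueError(f"unknown table type: {table_type!r}")
```

Under these assumptions, only the second and fourth templates carry auxiliary variables, and only the left-right one-to-many variants splice cell content by column.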

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Character Input (AREA)

Abstract

An RPA and AI-based table information extraction method and apparatus, a device and a medium. Said method comprises: S1, converting a file containing a table into a picture; S2, recognizing the table in the picture, and according to a recognition result, generating an information extraction template corresponding to a table type, the information extraction template containing keys of key-value pairs in the table and position information thereof, and position information of values of the key-value pairs to be extracted; and S3, extracting table content from the recognition result according to the information extraction template. Said method reduces labor costs, improves the universality of the information extraction template, and increases the accuracy of table content extraction.

Description

Table information extraction method, apparatus, device and medium based on RPA and AI
Technical Field
The present invention relates to the technical field of table processing, and in particular to an RPA- and AI-based table information extraction method, apparatus, device and medium.
Background
RPA (Robotic Process Automation) uses dedicated "robot software" to simulate human operations on a computer and automatically execute process tasks according to rules.
AI (Artificial Intelligence) is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence.
RPA has distinctive advantages: it is low-code and non-intrusive. Low-code means that RPA can be operated without a high level of IT skill, so business staff who cannot program can still develop processes; non-intrusive means that RPA simulates human operations and does not require the software system to expose interfaces. Traditional RPA nevertheless has limitations: it can only work from fixed rules, and its application scenarios are restricted. With the continuing development of AI technology, the deep integration of RPA and AI has overcome these limitations; RPA + AI = hand work + head work, and it is greatly changing the value of labor.
While processing tasks, RPA encounters large amounts of tabular data. Enterprises and public institutions in particular may face massive volumes of tabular data every day, so correctly extracting useful information from that data and entering it into a designated system is especially important. At present this is generally done in one of two ways. In the first, a person screens the information in the table to select what is useful and then manually enters the screened information into the system. In the second, matching rules for the various kinds of tables are summarized with manual intervention: a rule template is specified according to the structure of the table, the table information is extracted by a program or algorithm, and the extracted information is then entered into the system, either programmatically or manually, according to the structure of the system to be filled in.
However, with the first approach, when people screen table information manually, lapses or inertia in human thinking may cause errors during data entry, and the labor cost is high. The second approach has the following defects: (1) table structures are inconsistent, so different rules must be summarized manually, and generality is poor; (2) system architectures are inconsistent, which places high demands on the designers' programming ability when designing the program or algorithm, and the resulting programs are not general enough. For example, when the system architecture changes, designers must make substantial changes to the program, which is time-consuming and laborious and lowers work efficiency.
Summary of the Invention
The present invention provides an RPA- and AI-based table information extraction method, apparatus, device and medium, so as to overcome at least one technical problem existing in the prior art.
A first aspect of the embodiments of the present invention provides an RPA- and AI-based table information extraction method, the method comprising:
S1. converting a file containing a table into a picture;
S2. recognizing the table in the picture, and generating, according to the recognition result, an information extraction template corresponding to the table type, the information extraction template containing the keys of the key-value pairs in the table and their position information, and the position information of the values of the key-value pairs to be extracted;
S3. extracting table content from the recognition result according to the information extraction template.
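Steps S2 and S3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the recognition step is stubbed out (no real OCR is run), the table is a simple top-bottom layout with keys in a header row, and the names `Cell`, `TemplateEntry`, `recognize`, `generate_template` and `extract` are hypothetical rather than taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: int
    col: int
    text: str


@dataclass(frozen=True)
class TemplateEntry:
    key: str
    key_pos: tuple     # (row, col) of the key cell
    value_rows: tuple  # (start_row, end_row), inclusive, of the values
    value_col: int     # column holding the values for this key


def recognize(picture):
    """Stub for the recognition in S2; a real system would run OCR here."""
    return list(picture)


def generate_template(cells, standard_keys):
    """S2: record each expected key, its position, and where its values sit."""
    last_row = max(c.row for c in cells)
    return [
        TemplateEntry(c.text, (c.row, c.col), (c.row + 1, last_row), c.col)
        for c in sorted(cells, key=lambda c: (c.row, c.col))
        if c.text in standard_keys
    ]


def extract(cells, template):
    """S3: read the values out of the recognition result by templated position."""
    by_pos = {(c.row, c.col): c.text for c in cells}
    result = {}
    for entry in template:
        start, end = entry.value_rows
        result[entry.key] = [
            by_pos.get((r, entry.value_col)) for r in range(start, end + 1)
        ]
    return result


picture = [
    Cell(0, 0, "Item"), Cell(0, 1, "Qty"),
    Cell(1, 0, "Pen"), Cell(1, 1, "2"),
    Cell(2, 0, "Ink"), Cell(2, 1, "5"),
]
cells = recognize(picture)
template = generate_template(cells, {"Item", "Qty"})
print(extract(cells, template))  # {'Item': ['Pen', 'Ink'], 'Qty': ['2', '5']}
```

The template stores only keys and positions, so the same extraction routine can be reused on any recognition result with the same layout.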
A second aspect of the embodiments of the present invention provides an RPA- and AI-based table information extraction apparatus, the apparatus comprising:
a picture conversion template, configured to convert a file containing a table into a picture;
a template generation module, configured to recognize the table in the picture and generate, according to the recognition result, an information extraction template corresponding to the table type, the information extraction template containing the keys of the key-value pairs in the table and their position information, and the position information of the values of the key-value pairs to be extracted;
a content extraction module, configured to extract table content from the recognition result according to the information extraction template.
In a third aspect, an embodiment of the present invention further provides a computing device, comprising:
a memory storing executable program code;
a processor coupled to the memory;
wherein the processor invokes the executable program code stored in the memory to execute some or all of the steps of the RPA- and AI-based table information extraction method provided by any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program, the computer program comprising instructions for executing some or all of the steps of the RPA- and AI-based table information extraction method provided by any embodiment of the present invention.
In the technical solution provided by the embodiments of the present invention, when table information is extracted, the file containing the table can be converted into a picture so that the content of the cells is associated with the table. By recognizing the table in the picture, an information extraction template corresponding to the table type can be generated according to the recognition result; the template contains the keys of the key-value pairs in the table and their position information, and the position information of the values of the key-value pairs to be extracted. Table content can then be extracted from the recognition result according to the template. This technical solution avoids the high labor cost and poor accuracy of manual table information extraction. Moreover, compared with summarizing matching rules for various tables through manual intervention, the method provided by this embodiment does not require developers to summarize different rules and is more general.
The innovative points of the embodiments of the present invention include:
1. By converting the file containing the table into a picture and recognizing the table in the picture, an information extraction template corresponding to the table type can be generated according to the recognition result. Table content can be extracted from the recognition result according to this template, which avoids the high labor cost and poor accuracy of manual table information extraction. Moreover, compared with summarizing matching rules for various tables through manual intervention, the method provided by this embodiment does not require developers to summarize different rules and is more general. This is one of the innovative points of the embodiments of the present invention.
2. Converting the file containing the table into a picture first and then recognizing the table in the picture improves the reliability of the table data and helps make the information extraction template more general. This is one of the innovative points of the embodiments of the present invention.
3. The generated information extraction template contains some special syntax identifiers, such as square brackets and angle brackets. These identifiers are determined by the properties of the table, specifically the content and position of its cells. When the information extraction template is used to extract table content, the content of the template must be matched against the recognition result of the picture according to the preset meaning of each syntax identifier: for example, angle brackets indicate that the enclosed content requires fuzzy matching, and square brackets indicate that the enclosed content requires strict matching. This helps improve the accuracy of table content extraction and is one of the innovative points of the embodiments of the present invention.
4、将信息抽取模板中各个键值对的位置信息以正则表达式的形式进行表示。这样设置,可避免由于OCR识别结果中单元格的行列信息错乱,所导致的无法准确提取表格内容的问题,是本发明实施例的创新点之一。4. The position information of each key-value pair in the information extraction template is represented in the form of a regular expression. This setting can avoid the problem that the table content cannot be accurately extracted due to disordered row and column information of the cells in the OCR identification result, which is one of the innovative points of the embodiments of the present invention.
5、左右一对一类型的表格所对应的第一信息抽取模板,是基于表格中每一行的内容生成的,即对于表格中每一行的内容,都会对应生成一个第一信息抽取模板,也即表格的行数与第一信息抽取模板的个数相等。相对于将表格中每个键值对均生成一个模板的方式,本实施例这样设置,可减少信息抽取模板的个数,提高了模板生成的速度,是本发明实施例的创新点之一。5. The first information extraction template corresponding to the left and right one-to-one type table is generated based on the content of each row in the table, that is, for the content of each row in the table, a first information extraction template will be generated correspondingly, that is The number of rows of the table is equal to the number of the first information extraction templates. Compared with the method of generating a template for each key-value pair in the table, this embodiment can reduce the number of information extraction templates and improve the speed of template generation, which is one of the innovative points of the embodiments of the present invention.
6、对于上下一对多格式,或左右一对多格式的表格,如果需要抽取的value中第一列是不可枚举或者是无规律的,则在生成该类型的表格对应的信息抽取模板时,可在表格中第一列单元格之前添加辅助变量,以将表格中不同行的内容区分开,避免在信息抽取时将下一行的内容当做当前行的内容进行抽取,是本发明实施例的创新点之一。6. For the table in the upper-lower one-to-many format, or the left-right one-to-many format, if the first column of the value to be extracted is not enumerable or irregular, when generating the information extraction template corresponding to this type of table , an auxiliary variable can be added before the cells in the first column of the table to distinguish the contents of different rows in the table, so as to avoid extracting the contents of the next row as the contents of the current row during information extraction, which is an embodiment of the present invention. One of the innovations.
7、信息抽取模板与表格类型相对应,通用性较强,即如果图片中存在多个相同类型的表格,通过本发明实施例提供的方法,可为多个同类型的表格生成同一个信息抽取模板。按照该模板,可将多个同类型的表格中的内容都提取出来,提高了表格内容提取的速度,是本发明实施例的创新点之一。7. The information extraction template corresponds to the table type and has strong versatility, that is, if there are multiple tables of the same type in the picture, the method provided by the embodiment of the present invention can generate the same information extraction for multiple tables of the same type template. According to this template, the contents of multiple tables of the same type can be extracted, which improves the speed of extracting table contents, which is one of the innovative points of the embodiments of the present invention.
Description of drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1a is a flowchart of table information extraction and entry based on the combination of RPA and AI, provided by Embodiment 1 of the present invention;
FIG. 1b is a schematic diagram of a field creation interface provided by Embodiment 1 of the present invention;
FIG. 1c is a schematic diagram of the interface of an information extraction template before publishing, provided by Embodiment 1 of the present invention;
FIG. 1d is a schematic diagram of the interface of an information extraction template after publishing, provided by Embodiment 1 of the present invention;
FIG. 1e is a schematic diagram of the interface of a published information extraction template corresponding to a top-down one-to-many table, provided by Embodiment 1 of the present invention;
FIG. 2 is a schematic flowchart of a table information extraction method based on RPA and AI, provided by Embodiment 2 of the present invention;
FIG. 3 is a flowchart of a preferred table information extraction method based on RPA and AI, provided by Embodiment 3 of the present invention;
FIG. 4 is a flowchart of a preferred table information extraction method based on RPA and AI, provided by Embodiment 4 of the present invention;
FIG. 5 is a flowchart of a preferred table information extraction method based on RPA and AI, provided by Embodiment 5 of the present invention;
FIG. 6 is a flowchart of a preferred table information extraction method based on RPA and AI, provided by Embodiment 6 of the present invention;
FIG. 7 is a flowchart of a preferred table information extraction method based on RPA and AI, provided by Embodiment 7 of the present invention;
FIG. 8 is a schematic structural diagram of a table information extraction apparatus based on RPA and AI, provided by Embodiment 8 of the present invention;
FIG. 9 is a schematic structural diagram of a computing device provided by Embodiment 9 of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
It should be noted that the terms "comprising" and "having", and any variants of them, in the embodiments and drawings of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not limited to the listed steps or units; it optionally also includes steps or units that are not listed, or other steps or units inherent to the process, method, product or device.
In the description of the present invention, a "template" is a text expression that the developer provides for the "information extraction" function. The expression is used to match fragments of text and extract information from them. The templates provided by the embodiments of the present invention use the following syntax:
1. Square brackets [] denote strict matching. The matched content can be a "vocabulary" or "regular expression" defined in advance in the resources, or a phrase to be matched.
2. Angle brackets <> denote fuzzy matching, the counterpart of strict matching. Strict matching requires the text to be matched to be identical to the specified content. Fuzzy matching only requires the two to be semantically close, that is, their similarity must exceed a set threshold.
3. The symbol <*> matches a text fragment of any length.
4. To match any of the characters {, }, [, ], <, >, | or * literally in a template, escape it with "\".
5. ^ may appear only at the start of a template and specifies that the template must match from the beginning of the text.
6. $ may appear only at the end of a template and specifies that the template must match through to the end of the text.
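The difference between strict and fuzzy matching in items 1 and 2 can be sketched as follows. This is a minimal illustration, not the matcher used by the invention: the text calls for semantic closeness, whereas this sketch substitutes character-level similarity from Python's standard `difflib` module, and the function names and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def strict_match(expected: str, actual: str) -> bool:
    """Square-bracket semantics: the text must be identical to the pattern."""
    return expected == actual


def fuzzy_match(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Angle-bracket semantics, approximated with character similarity.

    The threshold value is hypothetical; the source only says that the
    similarity must exceed a set threshold.
    """
    similarity = SequenceMatcher(None, expected, actual).ratio()
    return similarity > threshold


# An OCR result with one extra character fails strict matching
# but still passes fuzzy matching.
print(strict_match("出生年月", "出生年月日"))
print(fuzzy_match("出生年月", "出生年月日"))
```

Using a similarity threshold lets a key survive small OCR errors, which is presumably why keys are written in angle brackets while position markers are written in square brackets.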
In the description of the present invention, a "field" is the name given to a piece of key information extracted by a template. The name is specific to the current information extraction task and is generally specified by the user.
In the description of the present invention, a "vocabulary" is an information structure consisting of <vocabulary name, vocabulary value, alternative expressions of the vocabulary value>. A vocabulary describes a relatively fixed body of "external knowledge" that exists in lexical form and is closely tied to the developer's domain.
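One possible in-memory shape for such a vocabulary is sketched below. The class name, the `normalize` method and the sample entries are all hypothetical; the source only specifies the three components (name, value, alternative expressions).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Vocabulary:
    """<vocabulary name, vocabulary value, alternative expressions>."""
    name: str
    # Maps each canonical value to its alternative expressions.
    entries: Dict[str, List[str]] = field(default_factory=dict)

    def normalize(self, text: str) -> Optional[str]:
        """Return the canonical value for text, or None if it is unknown."""
        for value, variants in self.entries.items():
            if text == value or text in variants:
                return value
        return None


# Illustrative entries only.
degrees = Vocabulary("degree", {"Master": ["MSc", "Master's degree"]})
print(degrees.normalize("MSc"))
```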
In the description of the present invention, a "regular expression" is a logical formula for operating on strings. It describes a string-matching pattern and can be used to check whether a string contains a given substring, to replace a matched substring, or to extract from a string a substring that meets a given condition.
The embodiments of the present invention disclose a table information extraction method, apparatus, device and medium based on RPA and AI. Each is described in detail below.
Embodiment 1
Robotic Process Automation (RPA) uses dedicated "robot software" to simulate a person's operations on a computer and to execute process tasks automatically according to rules.
AI is the abbreviation of Artificial Intelligence, a technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence.
As Internet technology develops, massive amounts of data accumulate, including both unstructured and structured data. Whether the data is unstructured (such as text, pictures and video) or structured (such as tabular data), extracting useful information from these massive data sets takes a great deal of manpower and material resources.
Enterprises and institutions may face massive amounts of tabular data every day. Correctly extracting useful information from this data and entering it into a designated system by manual work alone is very expensive, and is also highly error-prone, which can cause losses that are hard to estimate. For this reason, this embodiment proposes a table information extraction method based on RPA and AI that extracts table information and enters it automatically. FIG. 1a is a flowchart of table information extraction and entry based on the combination of RPA and AI, provided by Embodiment 1 of the present invention. The steps in FIG. 1a are described below:
110. Convert the table file into a picture by means of RPA technology.
In this embodiment, files could be parsed in the conventional way, by writing a large amount of code and rules. However, because table forms are so varied, parsers and rules written this way often cannot be reused, which raises development cost. To solve this problem, this embodiment converts the table file into a picture automatically by building a process on an automation service platform, for example the Uibot software.
120. Perform OCR on the generated picture.
In this embodiment, the picture can be recognized using OCR (Optical Character Recognition) technology. OCR returns the content of each cell and the position information of each cell in the table. The position information includes the start row index, start column index, end row index and end column index.
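The per-cell OCR result described above could be held in a small record and joined row by row for later matching. The sketch below is illustrative only: the class and function names are hypothetical, and the `@R<row>@C<col>-` marker format for the joined text is an assumption taken from the position markers that appear in the templates of this embodiment.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class OcrCell:
    """One recognized cell: its position in the table and its text."""
    start_row: int
    start_col: int
    end_row: int
    end_col: int
    text: str


def splice_rows(cells: Iterable[OcrCell]) -> List[str]:
    """Join the cells of each table row into one line of marked-up text."""
    rows: Dict[int, List[str]] = {}
    for cell in sorted(cells, key=lambda c: (c.start_row, c.start_col)):
        marker = f"@R{cell.start_row}@C{cell.start_col}-"
        rows.setdefault(cell.start_row, []).append(marker + cell.text)
    # One newline-terminated line per table row, in row order.
    return ["".join(parts) + "\n" for _, parts in sorted(rows.items())]
```

Keeping the row and column indices inside the text means a later matching step can rely on position as well as content.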
130. Automatically generate the table information extraction template.
Step 130 is the focus of the embodiments of the present invention: a template for table information extraction is generated automatically from the OCR recognition result of the table picture. In this embodiment, the template corresponds to the table type. For different table types, the template corresponding to the type can be generated by calling a different template interface. This step is analyzed below for two table formats:
a. The table is in left-right key-value form, as shown in Table 1 below. Table 1 can be regarded as a left-right one-to-one table: the field to be extracted is on the left (for example, 姓名, "name") and the corresponding value is on the right (for example, 张三, "Zhang San"). After OCR is performed on the picture containing Table 1, the row and column indices of each cell are obtained (indices start from 0). For example, the cell "姓名" has row and column indices 0, 0, and the cell "安徽省合肥市" ("Hefei City, Anhui Province") has row and column indices 1, 5.
Table 1: Personal information table
Figure PCTCN2021114068-appb-000001
In this embodiment, before the information extraction template is generated, the user can create the fields to be extracted according to the key-value pairs in the table. For Table 1 above, the fields the user needs to extract may include: name, age, gender, ethnicity, native place, date of birth, place of birth, education, master's degree and current city of residence. FIG. 1b is a schematic diagram of a field creation interface provided by Embodiment 1 of the present invention. As shown in FIG. 1b, the interface shows some of the fields created by the user: "place of birth", "name", "education", "age" and "native place".
In this embodiment, the information extraction templates for a left-right one-to-one table are generated row by row. When a template is generated, the content of the cells in the table is spliced together according to the row and column position of each cell. Every line of the spliced content includes the position information and content of each cell.
The template for each line of content can be generated as follows: the position information of each key in the spliced content is used as the position information of the key in the template; each key in the spliced content is used as a key in the template; and the position information of the value of each key-value pair in the spliced content is used as the position information of the value to be extracted in the template. In addition, the corresponding syntax markers must be added to the key and position information of each key-value pair in the template, and to the position information of the value of each key-value pair to be extracted, for use in the later extraction of table content.
Specifically, for the first row of Table 1 above (姓名 = name, 年龄 = age, 性别 = gender, 民族 = ethnicity), the corresponding information extraction template is:
[@R0@C0-]<姓名>[@R0@C1-]{姓名:<*:0,>}[@R0@C2-]<年龄>[@R0@C3-]{年龄:<*:0,>}[@R0@C4-]<性别>[@R0@C5-]<*:0,>[@R0@C6-]<民族>[@R0@C7-]<*:0,>[\n]
For the second row of Table 1 (籍贯 = native place, 出生年月 = date of birth, 出生地 = place of birth), the corresponding information extraction template is:
[@R1@C0-]<籍贯>[@R1@C1-]{籍贯:<*:0,>}[@R1@C2-]<出生年月>[@R1@C3-]<*:0,>[@R1@C4-]<出生地>[@R1@C5-]{出生地:<*:0,>}[\n]
For the third row of Table 1 (学历 = education, 现居住城市 = current city of residence), the corresponding information extraction template is:
[@R2@C0-]<学历>[@R2@C1-]{学历:<*:0,>}[@R2@C2-]<现居住城市>[@R2@C3-]<*:0,>[\n]
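The row-wise generation described above can be sketched as follows. This is a simplified illustration rather than the actual generator: it assumes the cells of a row strictly alternate key, value, key, value, and the names `Cell` and `build_row_template` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Cell:
    row: int
    col: int
    text: str


def build_row_template(cells: List[Cell], output_fields: Set[str]) -> str:
    """Build one template for one row of a left-right one-to-one table.

    Assumes the row alternates key cell, value cell from left to right.
    """
    ordered = sorted(cells, key=lambda c: c.col)
    parts = []
    for key, value in zip(ordered[0::2], ordered[1::2]):
        # Position marker in square brackets, key in angle brackets.
        parts.append(f"[@R{key.row}@C{key.col}-]<{key.text}>")
        parts.append(f"[@R{value.row}@C{value.col}-]")
        if key.text in output_fields:
            # Value is output to the field named after its key.
            parts.append("{" + key.text + ":<*:0,>}")
        else:
            # Value is matched but not output.
            parts.append("<*:0,>")
    parts.append(r"[\n]")  # literal backslash-n, as written in the templates
    return "".join(parts)
```

Called on the eight cells of the first row with `output_fields={"姓名", "年龄"}`, the sketch reproduces the first-row template shown above. Only the key cells' text affects the result; the value cells contribute just their positions.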
Specifically, FIG. 1c is a schematic diagram of the interface of an information extraction template before publishing, provided by Embodiment 1 of the present invention. As shown in FIG. 1c, for the third row of Table 1 above, the "matching text" column corresponds to each node in the table, that is, each key-value pair in the table. For each generated node, the user can choose whether to output it to a specified field. For example, if the user wants to extract the name and age from the first row of the table, the user selects name and age under "output to field". For table content that the user does not want to extract, the user selects "do not output" under "output to field".
FIG. 1d is a schematic diagram of the interface of an information extraction template after publishing, provided by Embodiment 1 of the present invention. In this embodiment, after the template for each row of content has been generated, if a template publishing instruction triggered by the user is received, the final information extraction templates corresponding to Table 1 are displayed to the user. The user can also edit, copy or delete any of the templates in FIG. 1d.
b. The table is in top-down key-value form, as shown in Table 2 below.
Table 2: Item information table
项目 (Item) | 附注 (Notes) | 2020年半年度-金额 (2020 half-year amount) | 2019年半年度-金额 (2019 half-year amount)
一、营业总收入 (I. Total operating income) | | 1545105967.98 | 1559860357.35
其中:营业收入 (Of which: operating income) | | 1545105967.98 | 1559860357.35
利息收入 (Interest income) | | |
已赚保费 (Premiums earned) | | |
手续费及佣金收入 (Fee and commission income) | | |
二、营业总成本 (II. Total operating costs) | | 1509316804.58 | 1509825477.67
其中:营业成本 (Of which: operating costs) | | 1476660727.54 | 1463505521.82
利息支出 (Interest expense) | | |
手续费及佣金支出 (Fee and commission expenses) | | |
退保金 (Surrender payments) | | |
赔付支出净额 (Net claims paid) | | |
提取保险责任准备金净额 (Net provision for insurance liability reserves) | | |
保单红利支出 (Policyholder dividend expenses) | | |
分保费用 (Reinsurance expenses) | | |
税金及附加 (Taxes and surcharges) | | 3239576.54 | 3747299.05
销售费用 (Selling expenses) | | 4707080.57 | 14884188.78
管理费用 (Administrative expenses) | | 8800099.47 | 7604838.02
研发费用 (R&D expenses) | | 13470896.35 | 18269596.88
Tables of this top-down form are handled in two cases:
(1) The first column of the values to be extracted from the top-down table is non-enumerable or irregular, that is, it cannot be expressed by building a vocabulary or by a regular expression.
In this case, a preset standard template is first specified to match the keys in Table 2 above, that is, the table's "项目" (item), "附注" (notes), "2020年半年度-金额" (2020 half-year amount) and "2019年半年度-金额" (2019 half-year amount). For Table 2, the preset standard template is:
[@R0@C0-]<项目>[@R0@C1-]<附注>[@R0@C2-]<2020年半年度>[@R0@C3-]<2019年半年度>
Next, the content of the cells in the table is spliced together according to the position of each row, and the spliced content is matched against the preset standard template. If the match succeeds, the keys in the table are determined, together with the start and end position information of the values to be extracted, and the number of matched columns, cols, is recorded. The OCR recognition result is then traversed row by row. If the number of columns in a row also equals cols, an auxiliary variable @Frow_n- is introduced, where row_n denotes the row number, and a template is built. The auxiliary variable distinguishes the content of different rows when the table content is extracted.
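The header match and the row tagging described above might look like the following sketch. Several things here are assumptions: substring containment stands in for fuzzy matching, `row_n` is taken to be the zero-based row index, and the layout of the tagged line is hypothetical, because the template actually generated appears only in FIG. 1e.

```python
from typing import List, Optional, Tuple


def find_header(rows: List[List[str]],
                standard_keys: List[str]) -> Optional[Tuple[int, int]]:
    """Return (row index, cols) of the first row matching every standard key."""
    for index, row in enumerate(rows):
        if len(row) == len(standard_keys) and all(
            key in cell for key, cell in zip(standard_keys, row)
        ):
            return index, len(row)
    return None


def tag_data_rows(rows: List[List[str]], standard_keys: List[str]) -> List[str]:
    """Prefix every data row that has cols columns with an @F<row_n>- marker."""
    found = find_header(rows, standard_keys)
    if found is None:
        return []
    header_index, cols = found
    tagged = []
    for n in range(header_index + 1, len(rows)):
        row = rows[n]
        if len(row) != cols:
            continue  # skip rows whose column count differs from the header
        cells = "".join(f"@R{n}@C{c}-{text}" for c, text in enumerate(row))
        tagged.append(f"@F{n}-{cells}")
    return tagged
```

Because each tagged line starts with its own marker, a wildcard that matches the first column cannot run past the start of the next row.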
Specifically, FIG. 1e is a schematic diagram of the interface of a published information extraction template corresponding to a top-down one-to-many table, provided by Embodiment 1 of the present invention. For Table 2 above, if the fields preset by the user for extraction are the item, the 2019 half-year amount and the 2020 half-year amount, the generated template is as shown in FIG. 1e, where F0 is the auxiliary variable. On the display interface before publishing, the user can choose whether the extracted table content is output to a field. The user can also edit, copy or delete the generated template on the interface shown in FIG. 1e.
(2) The first column of the values to be extracted from the top-down table is enumerable or can be expressed by a regular expression.
In this case, it must be determined whether the content to be extracted from the table belongs to the preset vocabulary. If it does not, generation of the information extraction template stops. If it does, the spliced content is then matched against the preset standard template. If the match succeeds, the number of matched columns is obtained, and the template can be generated from the keys that match the preset standard template and the position information of those keys in the table.
Specifically, the information extraction template corresponding to Table 2 above is:
[@R1@C0-]{项目:[@V_D]}[@R1@C1-]{附注:<*:0,>}[@R1@C2-]{2020年半年度:<*:0,>}[@R1@C3-]{2019年半年度:<*:0,>}[\n]
In the above template, because the first column can be expressed by a regular expression, the vocabulary can be replaced with the regular expression V_D.
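The choice between cases (1) and (2) can be pictured as a small decision function. This is only a sketch: the function name, the returned labels and the sample pattern in the usage lines are hypothetical, and the actual content of the regular expression V_D is not given in this section.

```python
import re
from typing import Iterable, Optional, Set


def first_column_strategy(values: Iterable[str],
                          vocabulary: Optional[Set[str]] = None,
                          pattern: Optional[str] = None) -> str:
    """Decide how the first column of a top-down table can be matched."""
    values = list(values)
    # Case (2): every value is listed in a preset vocabulary.
    if vocabulary is not None and all(v in vocabulary for v in values):
        return "vocabulary"
    # Case (2): every value fits one regular expression.
    if pattern is not None and all(re.fullmatch(pattern, v) for v in values):
        return "regex"
    # Case (1): non-enumerable or irregular, so rows need an auxiliary variable.
    return "auxiliary-variable"


print(first_column_strategy(["A-001", "A-002"], pattern=r"[A-Z]-\d{3}"))
print(first_column_strategy(["free text", "A-001"], pattern=r"[A-Z]-\d{3}"))
```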
The template generation method provided by this embodiment avoids the high labor cost and poor accuracy of manual table information extraction. Compared with approaches in which the matching rules for each kind of table are summarized by hand, it does not require developers to summarize different rules and is therefore more general.
140. Extract table information based on the generated template.
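One way to picture step 140 is to compile a template into a Python regular expression and run it over the spliced OCR text. The sketch below is an illustration under stated assumptions, not the extraction engine of the invention: fuzzy `<key>` tokens are treated as literal text, the `:0,` suffix of the wildcard (not explained in this section) is read as "zero or more characters", the spliced text is assumed to use the same `@R…@C…-` markers as the templates, and the names `compile_template` and `extract` are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# Template tokens, tried in this order at each position.
TOKEN = re.compile(
    r"\[(?P<strict>[^\]]*)\]"               # [text]          strict match
    r"|\{(?P<field>[^:{}]+):<\*[^>]*>\}"    # {field:<*:0,>}  value output to field
    r"|<\*[^>]*>"                           # <*:0,>          value not output
    r"|<(?P<fuzzy>[^>]*)>"                  # <text>          fuzzy match
)


def compile_template(template: str) -> Tuple["re.Pattern[str]", List[str]]:
    """Translate a template into a regex plus the ordered list of field names."""
    parts: List[str] = []
    fields: List[str] = []
    pos = 0
    for m in TOKEN.finditer(template):
        if m.start() != pos:
            raise ValueError(f"unrecognized template text at offset {pos}")
        pos = m.end()
        if m.group("strict") is not None:
            # The templates write a line break as the two characters \n.
            literal = m.group("strict").replace("\\n", "\n")
            parts.append(re.escape(literal))
        elif m.group("field") is not None:
            fields.append(m.group("field"))
            parts.append("(.*?)")
        elif m.group("fuzzy") is not None:
            parts.append(re.escape(m.group("fuzzy")))  # simplification: literal
        else:
            parts.append(".*?")
    if pos != len(template):
        raise ValueError(f"unrecognized template text at offset {pos}")
    return re.compile("".join(parts)), fields


def extract(template: str, spliced_text: str) -> Optional[Dict[str, str]]:
    """Return {field: value} for the first match, or None if nothing matches."""
    pattern, fields = compile_template(template)
    match = pattern.search(spliced_text)
    if match is None:
        return None
    return dict(zip(fields, match.groups()))
```

Field names are collected in a list instead of being used as named groups, because a field such as 2020年半年度 starts with a digit and would not be a valid Python group name.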
150. Use RPA technology to enter the extracted information into the system automatically.
In this embodiment, an automation service platform, for example the Uibot software, can be used to enter the information automatically by building a process. Compared with the conventional approach of entering information manually or through programming, the entry method provided by this embodiment is more general and considerably reduces labor and maintenance costs.
Embodiment 2
FIG. 2 is a schematic flowchart of a table information extraction method based on RPA and AI, provided by Embodiment 2 of the present invention. The method can be applied in scenarios such as screening tabular data and entering it into a system. It can be performed by a table information extraction apparatus based on RPA and AI, and the apparatus can be implemented in software and/or hardware. As shown in FIG. 2, the method provided by this embodiment specifically includes:
210. Convert the file containing the table into a picture.
The file containing the table can be a Word document, an Excel document, a PDF document, or the like. In this embodiment, RPA technology can be used to convert the file into a picture. This fixes the table content together with its position information in the table. If the template were instead generated by recognizing the table directly in the file, the table content could easily be recognized as body text of the file, causing data in the table to be lost. In addition, because table forms are so varied, recognizing table content directly means that the parsers and rules used often cannot be reused, which increases development cost. By first converting the file into a picture and then recognizing the table in the picture, this embodiment improves the reliability of the table data and helps make the information extraction template more general.
220. Recognize the table in the picture, and generate an information extraction template corresponding to the table type from the recognition result.
For example, OCR (Optical Character Recognition) technology can be used to recognize the picture. The recognition result includes the content of each cell in the table and the position information of each cell in the table. The position information of a cell includes its start row index, start column index, end row index and end column index.
In this embodiment, the table type is determined by the positional relationship and the correspondence between the key-value pairs in the table. For different table types, the information extraction template corresponding to the table type is generated by calling a different template interface. Before calling a template interface, the user can specify the fields to be extracted according to the key-value pair information in the table. After the information extraction template is generated, the user can also trigger a field output instruction to choose whether the extracted table content is output.
In this embodiment, for a table of any type, the information extraction template corresponding to the table type is generated according to the content of each cell in the table and the position information of each cell in the table. The information extraction template contains the key of each key-value pair in the table together with its position information, and the position information of the value of each key-value pair to be extracted.
Specifically, taking Table 1 above as an example, if the fields preset by the user for extraction are name, age, gender, and ethnicity, the constructed information extraction template is as follows:
[@R0@C0-]<name>[@R0@C1-]{name:<*:0,>}[@R0@C2-]<age>[@R0@C3-]{age:<*:0,>}[@R0@C4-]<gender>[@R0@C5-]<*:0,>[@R0@C6-]<ethnicity>[@R0@C7-]<*:0,>[\n]
In the above information extraction template, [@R0@C0-]<name> indicates that the key "name" is located in row 0, column 0 of the table, and [@R0@C1-]{name:<*>} indicates that the value corresponding to "name" is located in row 0, column 1. The other fields to be extracted, such as age, gender, and ethnicity, are represented in the same way and are not described again here.
It should be noted that special syntax marks, determined by the table attributes, can be added to the generated information extraction template. The position information of a cell in the table is enclosed in square brackets [], for example [@R0@C0-] in the above template. The key of a key-value pair is enclosed in angle brackets <>, for example <name> in the above template. The value of a key-value pair to be extracted is written as an asterisk inside angle brackets, for example <*>, and the value to be extracted is separated from its corresponding key by a colon ":". If the value to be extracted needs to be output to a field, the key-value pair is enclosed in curly brackets, for example {name:<*:0,>} in the above template. If the user has specified that the value to be extracted does not need to be output to a field, the curly brackets are omitted.
In addition, in this embodiment, each syntax mark in the information extraction template has a preset meaning. For example, square brackets denote strict matching, that is, determining whether the strings being matched are identical; angle brackets denote fuzzy matching, that is, determining whether the similarity of the content being matched is greater than a set threshold. When the information extraction template is used to extract table content, the content of the template must be matched against the recognition result of the picture according to the preset meaning of each mark.
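The two matching modes can be sketched as follows. This is only an illustration under stated assumptions: the source does not name a similarity measure or a threshold value, so the use of `difflib.SequenceMatcher` and the default threshold of 0.8 are assumptions, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def strict_match(expected: str, actual: str) -> bool:
    """Square-bracket semantics: the two strings must be identical."""
    return expected == actual


def fuzzy_match(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Angle-bracket semantics: similarity must exceed a set threshold.

    SequenceMatcher.ratio() and the 0.8 default are assumptions of this
    sketch; the source only requires "similarity greater than a set threshold".
    """
    return SequenceMatcher(None, expected, actual).ratio() > threshold
```

Fuzzy matching lets a key still match when OCR misreads a character (for example "gendar" for "gender"), while position markers are compared exactly.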
It should also be noted that, to ensure the accuracy of the information extraction template and therefore of the subsequent table content extraction, this embodiment can express the position information of each key-value pair in the information extraction template as a regular expression. This avoids the problem that table content cannot be extracted accurately when the row and column information of cells in the OCR recognition result is disordered.
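The source does not show what such a regular expression looks like. One possible reading, given purely as an assumption, is that a row or column index in a position marker is replaced by a digit wildcard so that a shifted index in the OCR output still matches. The helper `position_pattern` below is hypothetical.

```python
import re


def position_pattern(row=None, col=None):
    """Build a regex for a [@R<row>@C<col>-] marker.

    Passing None for row or col relaxes that index to "any number".
    This relaxation rule is an assumption of the sketch, not the source's method.
    """
    row_part = r"\d+" if row is None else str(row)
    col_part = r"\d+" if col is None else str(col)
    return re.compile(rf"\[@R{row_part}@C{col_part}-\]")


# Row fixed to 0, column left open: matches any cell marker in row 0.
row0_any_col = position_pattern(row=0)
```

A fully specified pattern such as `position_pattern(0, 1)` behaves like a strict match, while `row0_any_col` tolerates a wrong column index.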
Further, after the information extraction template is generated, the user can debug the automatically generated template, for example by editing, copying, or deleting it.
230. Extract the table content from the recognition result according to the information extraction template.
After the information extraction template is generated, the user can call the information extraction engine interface to perform information extraction.
Specifically, when table information is extracted according to the information extraction template, all of the content in the template is matched against the OCR recognition result until the match succeeds.
Specifically, during matching, each syntax mark in the information extraction template is applied according to its preset meaning. For example, it is determined whether the string in square brackets is identical to the string corresponding to the cell position information in the OCR recognition result, or whether the similarity between the content in angle brackets and the key of a key-value pair in the OCR recognition result is greater than the set threshold. If the strings are equal or the text similarity is greater than the set threshold, the match succeeds. After a successful match, the table content to be extracted can be taken from the recognition result.
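The matching and extraction step can be sketched as below. This is a simplified illustration, not the extraction engine of the source: the token grammar is inferred from the template examples in this description, the `:0,` part of a value slot is treated as opaque, the similarity measure and threshold are assumptions, and the cell values other than "Zhang San" are made up for the example.

```python
import re
from difflib import SequenceMatcher

# One template token: a position marker followed by a key, an output slot,
# or a value slot that is matched but not output.
TOKEN = re.compile(
    r"\[@R(?P<row>\d+)@C(?P<col>\d+)-\]"
    r"(?:<(?P<key>[^<>*]+)>"                 # <key>: fuzzy-matched key
    r"|\{(?P<field>[^:{}]+):<\*[^>]*>\}"     # {field:<*...>}: value to output
    r"|<\*[^>]*>)"                           # <*...>: value not output
)


def extract(template, cells, threshold=0.8):
    """Return {field: value} if every token matches, otherwise None.

    `cells` maps (row, col) to recognized text (a hypothetical OCR result shape).
    """
    fields = {}
    for m in TOKEN.finditer(template):
        text = cells.get((int(m["row"]), int(m["col"])))
        if text is None:                      # position missing: strict match fails
            return None
        if m["key"] is not None:
            if SequenceMatcher(None, m["key"], text).ratio() <= threshold:
                return None                   # key too dissimilar: fuzzy match fails
        elif m["field"] is not None:
            fields[m["field"]] = text         # value wrapped in {}: output it
    return fields


template = ("[@R0@C0-]<name>[@R0@C1-]{name:<*:0,>}"
            "[@R0@C2-]<age>[@R0@C3-]{age:<*:0,>}"
            "[@R0@C4-]<gender>[@R0@C5-]<*:0,>[\\n]")
cells = {(0, 0): "name", (0, 1): "Zhang San", (0, 2): "age",
         (0, 3): "30", (0, 4): "gender", (0, 5): "male"}
result = extract(template, cells)
```

Only the values wrapped in curly brackets appear in `result`; the gender value is matched but not output, mirroring the template rules described earlier.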
In the technical solution provided by this embodiment, when table information is extracted, the file containing the table is converted into a picture so that the content of the cells is associated with the table. By recognizing the table in the picture, an information extraction template corresponding to the table type is generated according to the recognition result. The template contains the key of each key-value pair in the table together with its position information, and the position information of the value of each key-value pair to be extracted. The table content is then extracted from the recognition result according to the template. This avoids the high labor cost and poor accuracy of manual table information extraction. Moreover, compared with manually summarizing matching rules for each kind of table, the method provided by this embodiment does not require developers to summarize different rules and is therefore more general.
Embodiment 3
FIG. 3 is a flowchart of a preferred RPA- and AI-based table information extraction method provided by Embodiment 3 of the present invention. On the basis of the above embodiments, this embodiment describes in detail the generation of the information extraction template for a table whose type is the left-right one-to-one format. In a table of the left-right one-to-one format, the key and value of each key-value pair are positioned left and right, and each key corresponds to exactly one value. As shown in FIG. 3, the method includes:
310. Convert the file containing the table into a picture.
320. Perform optical character recognition (OCR) on the picture to obtain a recognition result, where the recognition result includes the content of each cell in each table and the position information of each cell in the table.
330. Splice the content of the cells in the table according to their row and column position information in the table.
For the table in the picture, after OCR recognition, the row and column indexes of each cell and the correspondence between cells have been determined. In this embodiment, after the content of the cells is spliced according to their row and column position information in the table, the spliced content takes the form of a character string.
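The splicing in step 330 can be sketched as follows. The exact layout of the spliced string is not given in the source, so the format used here (a position marker followed by the cell text, one line per table row) is an assumption, as is the function name `splice_rows`.

```python
def splice_rows(cells):
    """Splice cells into one string, ordered by row and then by column.

    `cells` is an iterable of (row, col, text) tuples. Each cell becomes
    "[@R<row>@C<col>-]<text>" and rows are separated by a newline; this
    layout is an assumption of the sketch.
    """
    rows = {}
    for row, col, text in sorted(cells):
        rows.setdefault(row, []).append(f"[@R{row}@C{col}-]{text}")
    return "\n".join("".join(parts) for _, parts in sorted(rows.items()))


# Cells may arrive in any order from OCR; splicing restores table order.
spliced = splice_rows([(0, 1, "Zhang San"), (1, 0, "hometown"), (0, 0, "name")])
```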
340. For each row of content in the table, generate, based on the spliced content, a first information extraction template corresponding to the table type.
In this embodiment, for a table of the left-right one-to-one format, the field to be extracted is on the left, such as "name" in Table 1 above, and the corresponding value is on the right, such as "Zhang San".
It should be noted that, in this embodiment, the first information extraction template for a left-right one-to-one table is generated from the content of each row of the table. That is, one first information extraction template is generated for each row, so the number of rows in the table equals the number of first information extraction templates. Compared with generating one template for each key-value pair in the table, this reduces the number of information extraction templates and speeds up template generation.
Specifically, after the content of the cells is spliced according to their row and column position information, each row of the spliced content includes the position information and content of each cell. The first information extraction template for a row can be generated as follows: the position information of each key in the spliced content is used as the position information of the key in the template; each key in the spliced content is used as a key in the template; and the position information of the value of each key-value pair in the spliced content is used as the position information of the value to be extracted in the template. In addition, the corresponding syntax marks must be added to the key of each key-value pair and its position information, and to the position information of each value to be extracted, for use in the subsequent extraction of table content.
Specifically, for the first row of Table 1 above, the corresponding first information extraction template is:
[@R0@C0-]<name>[@R0@C1-]{name:<*:0,>}[@R0@C2-]<age>[@R0@C3-]{age:<*:0,>}[@R0@C4-]<gender>[@R0@C5-]<*:0,>[@R0@C6-]<ethnicity>[@R0@C7-]<*:0,>[\n].
For the second row of Table 1 above, the corresponding first information extraction template is:
[@R1@C0-]<hometown>[@R1@C1-]{hometown:<*:0,>}[@R1@C2-]<date of birth>[@R1@C3-]<*:0,>[@R1@C4-]<birthplace>[@R1@C5-]{birthplace:<*:0,>}[\n].
For the third row of Table 1 above, the corresponding first information extraction template is:
[@R2@C0-]<education>[@R2@C1-]{education:<*:0,>}[@R2@C2-]<city of residence>[@R2@C3-]<*:0,>[\n].
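The per-row generation described above can be sketched as follows, assuming that a row of a left-right one-to-one table alternates key cells and value cells. The function name, its parameters, and the sample cell values ("Bachelor", "Beijing") are hypothetical.

```python
def build_row_template(row, texts, output_fields):
    """Build a first information extraction template for one table row.

    `texts` lists the cell contents of the row from left to right, assumed to
    alternate key, value, key, value. Keys listed in `output_fields` have
    their value slot wrapped in curly brackets so the value is output.
    """
    parts = []
    for col in range(0, len(texts) - 1, 2):
        key = texts[col]
        parts.append(f"[@R{row}@C{col}-]<{key}>")          # key and its position
        slot = "<*:0,>"                                     # value to be extracted
        if key in output_fields:
            slot = "{" + key + ":" + slot + "}"             # output to a field
        parts.append(f"[@R{row}@C{col + 1}-]{slot}")        # value position
    return "".join(parts) + "[\\n]"


row_template = build_row_template(
    2, ["education", "Bachelor", "city of residence", "Beijing"], {"education"}
)
```

Note that the value texts themselves never enter the template; only keys, positions, and value slots do, which is why one template can serve every table of the same type.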
350. Extract the table content from the recognition result according to the first information extraction template.
On the basis of the above embodiments, this embodiment refines the generation of the first information extraction template for a table of the left-right one-to-one format. The content of the cells is spliced according to their row and column position information, and a first information extraction template is generated for each row of the table based on the spliced content. This avoids the high labor cost and poor accuracy of manual table information extraction. Moreover, compared with manually summarizing matching rules for each kind of table, the method provided by this embodiment does not require developers to summarize different rules and is therefore more general.
Embodiment 4
FIG. 4 is a flowchart of a preferred RPA- and AI-based table information extraction method provided by Embodiment 4 of the present invention. On the basis of the above embodiments, this embodiment describes in detail the generation of the information extraction template for a table whose type is the top-bottom one-to-many format. In a table of the top-bottom one-to-many format, the key and value of each key-value pair are positioned top and bottom, and each key corresponds to multiple values. It should be noted that the top-bottom one-to-many table type covers two cases: (1) the first column of the values to be extracted is non-enumerable or irregular, that is, it cannot be described by building a vocabulary or by a regular expression; (2) the first column of the values to be extracted is enumerable or can be described by a regular expression. This embodiment describes the first case in detail. As shown in FIG. 4, the RPA- and AI-based table information extraction method provided by this embodiment includes:
410. Convert the file containing the table into a picture.
420. Perform optical character recognition (OCR) on the picture to obtain a recognition result, where the recognition result includes the content of each cell in each table and the position information of each cell in the table.
430. If no preset vocabulary is detected, splice the content of the cells in the table according to the position information of their rows, and match the spliced content against the content of a preset standard template.
In this embodiment, the preset vocabulary contains all of the table content that the user has preset for extraction. If no preset vocabulary is detected, the first column of the values to be extracted is non-enumerable or irregular.
In this embodiment, the information extraction template corresponds to the table type and is therefore general. If the picture contains multiple tables of the same type, the method provided by this embodiment generates a single information extraction template for all of them. The content of all tables of that type can be extracted with this template, which speeds up the subsequent extraction of table content.
In this embodiment, the preset standard template includes the keys of the key-value pairs that the user has preset for extraction. The syntax marks of the keys and their position information in the preset standard template are the same as those of the keys and their position information in the information extraction templates of the embodiments of the present invention. Specifically, for Table 2 above, the corresponding preset standard template is as follows:
[@R0@C0-]<item>[@R0@C1-]<note>[@R0@C2-]<2020 half-year>[@R0@C3-]<2019 half-year>
In this embodiment, the content of the cells is spliced according to the position information of their rows and the spliced content is matched against the content of the preset standard template in order to determine, from the recognition result, the keys corresponding to the content that the user has preset for extraction, and to determine the start position information and end position information in the table of the values to be extracted.
440. If the match succeeds, use the number of columns corresponding to the keys in the table that match the preset standard template as the first target number.
Specifically, taking the preset standard template for Table 2 above as an example, if the match succeeds, the first target number is 4.
In addition, after the spliced content has been matched successfully against the content of the preset standard template, the value corresponding to each key in the table, and the start position information and end position information of the rows containing the content to be extracted, are also determined.
450. Traverse the table row by row, and use the number of columns in the table as the first standard number.
460. If the first standard number matches the first target number, add an auxiliary variable before the cells in the first column of the table.
In this embodiment, the auxiliary variable is added before the cells in the first column in order to distinguish the content of different rows, so that the content of the next row is not extracted as content of the current row during information extraction.
470. Generate a second information extraction template corresponding to the table type based on the auxiliary variable, the keys that match the preset standard template, and the position information of those keys in the table.
The second information extraction template contains the auxiliary variable, the matched keys and their position information, and the position information of the value of each key-value pair to be extracted from the table.
Specifically, the second information extraction template can be generated as follows: the auxiliary variable is added at the start of the template; the position information of each matched key is used as the position information of the key in the template; each matched key is used as a key in the template; and the position information of the value to be extracted for each key is used as the position information of the value to be extracted in the template. In addition, the corresponding syntax marks are added to the key of each key-value pair and its position information, and to the position information of each value to be extracted, for use in the subsequent extraction of table content. The syntax marks used in the second information extraction template have the same meaning as those described for the first information extraction template and are not described again in this embodiment.
Specifically, for Table 2 above, if the fields that the user wants to extract are item, note, 2019 half-year-amount, and 2020 half-year-amount, the generated second information extraction template is:
[Fi][@R1@C0-]{item:<*:0,>}[@R1@C1-]{note:<*:0,>}[@R1@C2-]{2020 half-year-amount:<*:0,>}[@R1@C3-]{2019 half-year:<*:0,>}[\n].
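Steps 440 to 470 can be sketched as follows. This is a simplified illustration: the function name, its parameters, and the sample data row are hypothetical, and the sketch assumes the column counts "match" when they are equal. The auxiliary variable marker `[Fi]` is copied from the template example above without interpreting it.

```python
def build_second_template(header_keys, data_rows, output_names,
                          aux_marker="[Fi]", value_row=1):
    """Build a second information extraction template, or return None.

    header_keys: keys matched against the preset standard template, left to right.
    data_rows: cell texts of the table rows that hold values.
    output_names: maps a key to the output field name; keys without an entry
    get a value slot that is matched but not output.
    """
    target_count = len(header_keys)            # the "first target number"
    for row in data_rows:                      # traverse the table row by row
        if len(row) != target_count:           # column count must match
            return None
    parts = [aux_marker]                       # auxiliary variable at the start
    for col, key in enumerate(header_keys):
        name = output_names.get(key)
        slot = "<*:0,>" if name is None else "{" + name + ":<*:0,>}"
        parts.append(f"[@R{value_row}@C{col}-]{slot}")
    return "".join(parts) + "[\\n]"


header = ["item", "note", "2020 half-year", "2019 half-year"]
names = {"item": "item", "note": "note",
         "2020 half-year": "2020 half-year-amount"}
second_template = build_second_template(header, [["Revenue", "5", "100", "90"]], names)
```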
480. Extract the table content from the recognition result according to the second information extraction template.
On the basis of the above embodiments, this embodiment refines the generation of the second information extraction template for a table of the top-bottom one-to-many format in which the first column of the values to be extracted is non-enumerable or irregular. The content of the cells is spliced according to the position information of their rows, and the spliced content is matched against the content of the preset standard template. If the match succeeds, the start position information and end position information in the table of the values to be extracted are obtained. The table is traversed row by row, and if the number of columns in the table matches the number of keys in the preset standard template, an auxiliary variable is added before the cells in the first column to distinguish the content of each row. Based on the auxiliary variable and the matched keys and their position information, the second information extraction template corresponding to the table type is generated. This avoids excessive manual intervention. Moreover, compared with manually summarizing matching rules for each kind of table, the method provided by this embodiment does not require developers to summarize different rules and is therefore more general.
Embodiment 5
FIG. 5 is a flowchart of a preferred RPA- and AI-based table information extraction method provided by Embodiment 5 of the present invention. On the basis of the above embodiments, this embodiment describes in detail the case in which the first column of the values to be extracted from the table is enumerable or can be described by a regular expression. As shown in FIG. 5, the RPA- and AI-based table information extraction method provided by this embodiment includes:
510. Convert the file containing the table into a picture.
520. Perform optical character recognition (OCR) on the picture to obtain a recognition result, where the recognition result includes the content of each cell in each table and the position information of each cell in the table.
530. If a preset vocabulary is detected, match the value of each key-value pair in the table against the content of the preset vocabulary.
In this embodiment, when the values are enumerable, it is determined whether the content to be extracted from the table belongs to the preset vocabulary. If it does, the cell content is then spliced; if it does not, the generation of the information extraction template stops.
540. If the match succeeds, splice the content of the cells in the table according to the position information of their rows, and match the spliced content against the content of the preset standard template.
The standard template includes the keys of the key-value pairs that the user has preset for extraction. The matching method is the same as that described in the above embodiment and is not described again here.
550. If the match succeeds, use the number of columns corresponding to the keys in the table content that match the preset standard template as the second target number.
560. Traverse the table row by row, and use the number of columns in the table as the second standard number.
570. If the second standard number matches the second target number, generate a third information extraction template corresponding to the table type based on the keys that match the preset standard template and the position information of those keys in the table.
The third information extraction template contains the keys that match the preset standard template and their position information in the table, and the position information of the value of each key-value pair to be extracted from the table. Unlike the second information extraction template, the third information extraction template does not require an auxiliary variable. Apart from this, the third information extraction template is generated in a way similar to the second information extraction template, which is not described again here.
Specifically, for Table 2 above, the generated third information extraction template is:
[@R0@C0-]{item:[@V_D]}[@R0@C1-]{note:<*:0,>}[@R0@C2-]{2020 half-year:<*:0,>}[@R0@C3-]{2019 half-year:<*:0,>}[\n].
In the above third information extraction template, because the first column can be described by a regular expression, the vocabulary can be replaced with the regular expression V_D.
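The check that decides whether the first-column values are enumerable can be sketched as follows. The function name, the sample vocabulary, and the sample regular expression are hypothetical; in particular, the source does not define the content of V_D, so the pattern below is only a stand-in.

```python
import re


def values_enumerable(values, vocabulary=None, pattern=None):
    """Return True if every value belongs to the vocabulary or matches the pattern.

    With a vocabulary, each value must be a member of it. Without one, each
    value must fully match the given regular expression. With neither, the
    values are treated as non-enumerable.
    """
    if vocabulary is not None:
        return all(value in vocabulary for value in values)
    if pattern is not None:
        return all(re.fullmatch(pattern, value) is not None for value in values)
    return False


# Hypothetical first-column values checked against a hypothetical vocabulary.
in_vocabulary = values_enumerable(["Revenue", "Cost"],
                                  vocabulary={"Revenue", "Cost", "Profit"})
```

When the check fails, template generation stops, as described for step 530; when it succeeds, no auxiliary variable is needed because the first column itself delimits each row.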
580. Extract the table content from the recognition result according to the third information extraction template.
On the basis of the above embodiments, this embodiment refines the generation of the third information extraction template for a table of the top-bottom one-to-many format in which the first column of the values to be extracted is enumerable, that is, can be expressed as a vocabulary. Unlike the second information extraction template, the third information extraction template is generated without adding an auxiliary variable, but it must be determined whether the values to be extracted belong to the preset vocabulary. If they do, the third information extraction template corresponding to the table type is generated based on the matched keys and their position information. This avoids excessive manual intervention. Moreover, compared with manually summarizing matching rules for each kind of table, the method provided by this embodiment does not require developers to summarize different rules and is therefore more general.
Embodiment 6
FIG. 6 is a flowchart of a preferred RPA- and AI-based table information extraction method provided by Embodiment 6 of the present invention. This embodiment describes in detail the generation of the information extraction template for a table whose type is the left-right one-to-many format. In a table of the left-right one-to-many format, the key and value of each key-value pair are positioned left and right, and each key corresponds to multiple values. In this embodiment, the first column of the values to be extracted is non-enumerable or irregular. The fourth information extraction template provided by this embodiment is generated in a way similar to the second information extraction template for the top-bottom one-to-many format with a non-enumerable first column. The difference is that, because of the positional relationship between the key-value pairs in the table, this embodiment splices the cell content by column and traverses the table by column, thereby determining the number of content rows in the table. As shown in FIG. 6, the RPA- and AI-based table information extraction method provided by this embodiment includes:
610. Convert the file containing the table into a picture.
620. Perform optical character recognition (OCR) on the picture to obtain a recognition result, where the recognition result includes the content of each cell in each table and the position information of each cell in the table.
630. If no preset vocabulary is detected, splice the content of the cells in the table according to the position information of their columns, and match the spliced content against the keys in the preset standard template.
The standard template includes the keys of the key-value pairs that have been preset for extraction.
640. If the match succeeds, use the number of rows corresponding to the keys in the table content that match the preset standard template as the third target number.
650. Traverse the table column by column to determine the third standard number of rows in the table.
660. If the third standard number matches the third target number, add an auxiliary variable before each cell in the table.
670. Generate a fourth information extraction template corresponding to the table type based on the auxiliary variable, the keys that match the preset standard template, and the position information of those keys in the table.
The fourth information extraction template contains the auxiliary variable, the matched keys and their position information, and the position information of the value of each key-value pair in the table. The fourth information extraction template is generated in a way similar to the second information extraction template; for details, refer to the generation of the second information extraction template described above, which is not repeated here.
680. Extract the table content from the recognition result according to the fourth information extraction template.
In this embodiment, for a table of the left-right one-to-many format whose first value column is non-enumerable, the content of the cells in the table is spliced according to the position information of the column in which each cell is located, and the spliced content is matched against the content of the preset standard template. If the match succeeds, the start position and end position, within the table, of the values to be extracted can be obtained. The table is then traversed by column. If the number of rows in the table matches the number of keys in the preset standard template, an auxiliary variable is added before the cells of the first column of the table so that the content of each row can be told apart. A fourth information extraction template corresponding to the table type can then be generated from the auxiliary variable, the keys in the table content that match the preset standard template, and the position information of those keys. This technical solution avoids introducing excessive manual intervention. Compared with approaches in which matching rules for each kind of table are summarized by hand, the method provided by this embodiment does not require developers to derive a separate rule set for each table and therefore generalizes better.
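Steps 630 to 670 can be illustrated with a minimal Python sketch. It is not the implementation described in the specification: the `Cell` record, the function names, the dictionary layout of the template, and the `<rowN>` form of the auxiliary variable are all hypothetical. It also assumes that "matching" a key means exact string equality, that the two counts "match" when they are equal, and, following the summary paragraph above, that one auxiliary variable per row is enough to tell rows apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: int   # row index reported by OCR (assumed 0-based)
    col: int   # column index reported by OCR (assumed 0-based)
    text: str


def splice_by_column(cells):
    """Group cells by column, top to bottom (the column-wise splice of step 630)."""
    columns = {}
    for cell in sorted(cells, key=lambda c: (c.col, c.row)):
        columns.setdefault(cell.col, []).append(cell)
    return columns


def build_fourth_template(cells, standard_keys):
    """Return a template dict, or None when a check in steps 630-660 fails."""
    matched, key_col = [], None
    for col, col_cells in splice_by_column(cells).items():
        hits = [c for c in col_cells if c.text in standard_keys]
        if hits:  # first column that contains keys of the standard template
            matched, key_col = hits, col
            break
    if not matched:
        return None
    target_count = len(matched)                    # step 640
    standard_count = len({c.row for c in cells})   # step 650
    if standard_count != target_count:             # step 660
        return None
    ordered = sorted(cells, key=lambda c: (c.row, c.col))
    return {
        # hypothetical auxiliary variables, one per row
        "aux": {c.row: f"<row{c.row}>" for c in matched},
        "keys": {c.text: (c.row, c.col) for c in matched},
        "values": {
            k.text: [(v.row, v.col) for v in ordered
                     if v.row == k.row and v.col != key_col]
            for k in matched
        },
    }
```

For a two-row table with keys `Name` and `Amount` in the left column, the sketch records each key's position and the positions of the cells to its right; if the standard template lists fewer keys than the table has rows, the counts differ and no template is produced.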
Embodiment 7
FIG. 7 is a flowchart of a preferred RPA- and AI-based table information extraction method provided by Embodiment 7 of the present invention. This embodiment describes in detail how the information extraction template is generated for a table of the left-right one-to-many format. In a table of this format, the key and the value of each key-value pair are positioned left and right of each other, and one key corresponds to multiple values. In this embodiment, the first column of the values to be extracted is enumerable, that is, it can be expressed in the form of a vocabulary. The fifth information extraction template provided by this embodiment is generated in a manner similar to the third information extraction template, which corresponds to the top-bottom one-to-many format in which the first column of the values to be extracted is enumerable. The difference is that, because of the positional relationship between keys and values in the table, this embodiment splices cell content by column and traverses the table by column in order to determine the number of content rows in the table. As shown in FIG. 7, the RPA- and AI-based table information extraction method provided by this embodiment includes:
710. Convert a file containing a table into a picture.
720. Perform optical character recognition (OCR) on the picture to obtain a recognition result, which includes the content of each cell in each table and the position information of each cell within its table.
730. If a preset vocabulary is detected, match the value of each key-value pair in the table against the content of the preset vocabulary.
740. If the match succeeds, splice the content of the cells in the table according to the position information of the column in which each cell is located, and match the spliced content against the content of a preset standard template.
The preset standard template includes the keys of the key-value pairs that are preset for extraction.
750. If the match succeeds, take the number of rows corresponding to the keys in the table content that match the preset standard template as the fourth target number.
760. Traverse the table by column and take the number of rows in the table as the fourth standard number.
770. If the fourth standard number matches the fourth target number, generate a fifth information extraction template corresponding to the table type, based on the keys that match the preset standard template and the position information of those keys in the table.
The fifth information extraction template includes the keys that match the preset standard template and their position information, and the position information of the value of each key-value pair to be extracted from the table. The fifth information extraction template is generated in a manner similar to the third information extraction template; for details, refer to the generation process of the third information extraction template described above, which is not repeated here.
780. Extract the table content from the recognition result according to the fifth information extraction template.
Building on the above embodiments, this embodiment details how the fifth information extraction template is generated for a table of the left-right one-to-many format in which the first column of the values to be extracted is enumerable, that is, can be expressed in the form of a vocabulary. Unlike the fourth information extraction template, generating the fifth information extraction template does not require adding an auxiliary variable, but it does require determining whether the values to be extracted belong to the preset vocabulary. If they do, the fifth information extraction template corresponding to the table type can be generated from the matched keys and their position information. This arrangement avoids introducing excessive manual intervention. Compared with approaches in which matching rules for each kind of table are summarized by hand, the method provided by this embodiment does not require developers to derive a separate rule set for each table and therefore generalizes better.
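The vocabulary check that distinguishes this embodiment from Embodiment 6 can be sketched as follows. This is a rough illustration under several assumptions, not the specification's implementation: the table is given as a list of rows of cell text with the key in column 0, the "first column of the values" is column 1, matching is exact string membership, and the function names and the returned dictionary layout are hypothetical.

```python
def first_value_column_is_enumerable(rows, vocabulary):
    """Step 730: every entry of the first value column must be in the vocabulary."""
    return all(len(row) > 1 and row[1] in vocabulary for row in rows)


def build_fifth_template(rows, standard_keys, vocabulary):
    """Return a template dict, or None when a check in steps 730-770 fails."""
    if not vocabulary:
        return None  # no preset vocabulary: the Embodiment 6 path applies instead
    if not first_value_column_is_enumerable(rows, vocabulary):
        return None
    # Step 740: splice the key column top to bottom and match it against the template.
    matched = [(r, row[0]) for r, row in enumerate(rows) if row[0] in standard_keys]
    if not matched:
        return None
    target_count = len(matched)     # step 750
    standard_count = len(rows)      # step 760
    if standard_count != target_count:
        return None                 # step 770 precondition not met
    # No auxiliary variable is needed in this variant.
    return {
        "keys": {key: (r, 0) for r, key in matched},
        "values": {key: [(r, c) for c in range(1, len(rows[r]))]
                   for r, key in matched},
    }
```

With rows `["Currency", "USD", "100"]` and `["Type", "Invoice", "A-1"]` and a vocabulary containing `USD` and `Invoice`, the sketch produces a template; with a vocabulary that lacks either entry it returns `None`.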
Embodiment 8
FIG. 8 is a schematic structural diagram of an RPA- and AI-based table information extraction apparatus provided by Embodiment 8 of the present invention. As shown in FIG. 8, the apparatus includes a picture conversion module 810, a template generation module 820, and a content extraction module 830, wherein:
the picture conversion module 810 is configured to convert a file containing a table into a picture;
the template generation module 820 is configured to recognize the table in the picture and, according to the recognition result, generate an information extraction template corresponding to the table type, the information extraction template including the key of each key-value pair in the table and its position information, as well as the position information of the value of each key-value pair to be extracted;
the content extraction module 830 is configured to extract the table content from the recognition result according to the information extraction template.
Optionally, the template generation module 820 includes:
a picture recognition unit, configured to perform optical character recognition (OCR) on the picture to obtain a recognition result, which includes the content of each cell in each table and the position information of each cell within its table;
a template generation unit, configured to, for a table of any type, generate an information extraction template corresponding to the table type according to the content of each cell in the table and the position information of each cell within the table.
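The division of labour among the three modules can be pictured with a small Python sketch. The class name, the callable-based wiring, and the stub functions below are hypothetical and exist only to show the data flow (file, then picture, then recognition result and template, then extracted content); real conversion and OCR back ends are deliberately not modelled.

```python
class TableExtractionApparatus:
    """Wires together stand-ins for modules 810, 820 and 830."""

    def __init__(self, convert_to_picture, generate_template, extract_content):
        self.convert_to_picture = convert_to_picture   # picture conversion (810)
        self.generate_template = generate_template     # template generation (820)
        self.extract_content = extract_content         # content extraction (830)

    def run(self, file_path):
        picture = self.convert_to_picture(file_path)
        recognition, template = self.generate_template(picture)
        return self.extract_content(recognition, template)


# Stub modules, for illustration only.
def fake_convert(path):
    return f"picture-of:{path}"


def fake_generate(picture):
    recognition = {(0, 0): "Name", (0, 1): "Alice"}   # cell position -> text
    template = {"Name": (0, 1)}                       # key -> value position
    return recognition, template


def fake_extract(recognition, template):
    return {key: recognition[pos] for key, pos in template.items()}
```

Calling `TableExtractionApparatus(fake_convert, fake_generate, fake_extract).run("invoice.pdf")` walks the three stages in order and returns the value found at the position the template names.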
Optionally, the table type includes a left-right one-to-one format, in which the key and the value of each key-value pair in the table are positioned left and right of each other, and each key corresponds to one value;
correspondingly, the template generation unit is specifically configured to:
splice the content of the cells in the table according to the position information of their rows and columns in the table;
for each row of content in the table, generate, based on the spliced content, a first information extraction template corresponding to the table type;
wherein the first information extraction template includes the key of each key-value pair in each row of the table and its position information, as well as the position information of the value of each key-value pair to be extracted.
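For the left-right one-to-one format, a first information extraction template could be built roughly as below. The sketch assumes, as a simplification not stated in the text, that each row alternates key, value, key, value from left to right; the function names and the template layout are hypothetical.

```python
def build_first_template(rows):
    """Map each key to its own position and the position of the value on its right."""
    template = {}
    for r, row in enumerate(rows):
        # Assumed layout: keys at even column indexes, values immediately after.
        for c in range(0, len(row) - 1, 2):
            template[row[c]] = {"key_pos": (r, c), "value_pos": (r, c + 1)}
    return template


def extract_with_template(rows, template):
    """Read each value from the position recorded in the template."""
    result = {}
    for key, pos in template.items():
        r, c = pos["value_pos"]
        result[key] = rows[r][c]
    return result
```

Because the template stores positions rather than text, it can be reused on another recognition result with the same layout to pull out new values for the same keys.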
Optionally, the table type includes a top-bottom one-to-many format, in which the key and the value of each key-value pair in the table are positioned above and below each other, and one key corresponds to multiple values;
correspondingly, the template generation unit is specifically configured to:
if no preset vocabulary is detected, splice the content of the cells in the table according to the position information of the row in which each cell is located, and match the spliced content against the content of a preset standard template, the standard template including the keys of the key-value pairs that are preset for extraction;
if the match succeeds, take the number of columns corresponding to the matched keys as the first target number;
traverse the table by row and take the number of columns in the table as the first standard number;
if the first standard number matches the first target number, add an auxiliary variable before the cells of the first column of the table, the auxiliary variable being used to distinguish the content of each row of the table when the table content is extracted;
generate a second information extraction template corresponding to the table type, based on the auxiliary variable and on the matched keys and their position information;
wherein the second information extraction template includes the auxiliary variable, the matched keys and their position information, and the position information of the value of each key-value pair to be extracted from the table.
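The top-bottom one-to-many case mirrors the left-right case with rows and columns swapped. The following sketch is illustrative only: it assumes the keys sit in the first row, that matching is exact string equality, that the two counts match when they are equal, and that the auxiliary variable is a per-row marker of the hypothetical form `<rowN>`.

```python
def build_second_template(rows, standard_keys):
    """Return a template dict, or None when a check fails."""
    if not rows:
        return None
    header = rows[0]  # assumed: keys occupy the first row
    matched = [(c, key) for c, key in enumerate(header) if key in standard_keys]
    if not matched:
        return None
    target_count = len(matched)                     # columns covered by matched keys
    standard_count = max(len(row) for row in rows)  # columns found by row traversal
    if standard_count != target_count:
        return None
    content_rows = range(1, len(rows))
    return {
        # one hypothetical marker per content row, placed before its first cell
        "aux": [f"<row{r}>" for r in content_rows],
        "keys": {key: (0, c) for c, key in matched},
        "values": {key: [(r, c) for r in content_rows] for c, key in matched},
    }
```

The markers make it possible to split a flat stream of recognized cell text back into rows at extraction time, which is the role the text assigns to the auxiliary variable.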
Optionally, the table type includes a top-bottom one-to-many format, in which the key and the value of each key-value pair in the table are positioned above and below each other, and one key corresponds to multiple values;
correspondingly, the template generation unit is specifically configured to:
if a preset vocabulary is detected, match the value of each key-value pair in the table against the content of the preset vocabulary;
if the match succeeds, splice the content of the cells in the table according to the position information of the row in which each cell is located, and match the spliced content against the content of a preset standard template, the preset standard template including the keys of the key-value pairs that are preset for extraction;
if the match succeeds, take the number of columns corresponding to the matched keys as the second target number;
traverse the table by row and take the number of columns in the table as the second standard number;
if the second standard number matches the second target number, generate a third information extraction template corresponding to the table type, based on the matched keys and their position information;
wherein the third information extraction template includes the matched keys and their position information, and the position information of the value of each key-value pair to be extracted from the table.
Optionally, the table type includes a left-right one-to-many format, in which the key and the value of each key-value pair in the table are positioned left and right of each other, and one key corresponds to multiple values;
correspondingly, the template generation unit is specifically configured to:
if no preset vocabulary is detected, splice the content of the cells in the table according to the position information of the column in which each cell is located, and match the spliced content against the keys in a preset standard template, the preset standard template including the keys of the key-value pairs that are preset for extraction;
if the match succeeds, take the number of rows corresponding to the matched keys as the third target number;
traverse the table by column to determine the third standard number, which is the number of rows in the table;
if the third standard number matches the third target number, add an auxiliary variable before each cell in the table, the auxiliary variable being used to distinguish the content of each row of the table when the table content is extracted;
generate a fourth information extraction template corresponding to the table type, based on the auxiliary variable and on the matched keys and their position information;
wherein the fourth information extraction template includes the auxiliary variable, the matched keys and their position information, and the position information of the value of each key-value pair in the table.
Optionally, the table type includes a left-right one-to-many format, in which the key and the value of each key-value pair in the table are positioned left and right of each other, and one key corresponds to multiple values;
correspondingly, the template generation unit is specifically configured to:
if a preset vocabulary is detected, match the value of each key-value pair in the table against the content of the preset vocabulary;
if the match succeeds, splice the content of the cells in the table according to the position information of the column in which each cell is located, and match the spliced content against the content of a preset standard template, the preset standard template including the keys of the key-value pairs that are preset for extraction;
if the match succeeds, take the number of columns corresponding to the matched keys as the fourth target number;
traverse the table by column and take the number of rows in the table as the fourth standard number;
if the fourth standard number matches the fourth target number, generate a fifth information extraction template corresponding to the table type, based on the matched keys and their position information;
wherein the fifth information extraction template includes the matched keys and their position information, and the position information of the value of each key-value pair to be extracted from the table.
The RPA- and AI-based table information extraction apparatus provided by this embodiment of the present invention can execute the RPA- and AI-based table information extraction method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method. For technical details not described exhaustively in the above embodiment, refer to the RPA- and AI-based table information extraction method provided by any embodiment of the present invention.
Embodiment 9
Refer to FIG. 9, which is a schematic structural diagram of a computing device provided by Embodiment 9 of the present invention. As shown in FIG. 9, the computing device may include:
a memory 901 storing executable program code;
a processor 902 coupled to the memory 901;
wherein the processor 902 calls the executable program code stored in the memory 901 to execute the RPA- and AI-based table information extraction method provided by any embodiment of the present invention.
An embodiment of the present invention further discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the RPA- and AI-based table information extraction method provided by any embodiment of the present invention.
Those of ordinary skill in the art will understand that each accompanying drawing is only a schematic diagram of one embodiment, and that the modules or processes shown in the drawings are not necessarily required to implement the present invention.
Those of ordinary skill in the art will understand that the modules of the apparatus in an embodiment may be distributed in the apparatus as described in that embodiment, or may be changed accordingly and located in one or more apparatuses different from that of the embodiment. The modules of the above embodiments may be combined into a single module or further split into multiple sub-modules.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. An RPA- and AI-based table information extraction method, characterized by comprising:
    S1. converting a file containing a table into a picture;
    S2. recognizing the table in the picture and, according to the recognition result, generating an information extraction template corresponding to the table type, the information extraction template including the key of each key-value pair in the table and its position information, as well as the position information of the value of each key-value pair to be extracted;
    S3. extracting the table content from the recognition result according to the information extraction template.
  2. The method according to claim 1, characterized in that step S2 specifically comprises:
    S21. performing optical character recognition (OCR) on the picture to obtain a recognition result, the recognition result including the content of each cell in each table and the position information of each cell within the table;
    S22. for a table of any type, generating an information extraction template corresponding to the table type according to the content of each cell in the table and the position information of each cell within the table.
  3. The method according to claim 1 or 2, characterized in that the table type includes a left-right one-to-one format, in which the key and the value of each key-value pair in the table are positioned left and right of each other, and each key corresponds to one value;
    correspondingly, step S22 specifically comprises:
    S221. splicing the content of the cells in the table according to the position information of their rows and columns in the table;
    S222. for each row of content in the table, generating, based on the spliced content, a first information extraction template corresponding to the table type;
    wherein the first information extraction template includes the key of each key-value pair in each row of the table and its position information, as well as the position information of the value of each key-value pair to be extracted.
  4. The method according to claim 2, characterized in that the table type includes a top-bottom one-to-many format, in which the key and the value of each key-value pair in the table are positioned above and below each other, and one key corresponds to multiple values;
    correspondingly, step S22 specifically comprises:
    S221. if no preset vocabulary is detected, splicing the content of the cells in the table according to the position information of the row in which each cell is located, and matching the spliced content against the content of a preset standard template, the preset standard template including the keys of the key-value pairs that are preset for extraction;
    S222. if the match succeeds, taking the number of columns corresponding to the matched keys as the first target number;
    S223. traversing the table by row and taking the number of columns in the table as the first standard number;
    S224. if the first standard number matches the first target number, adding an auxiliary variable before the cells of the first column of the table, the auxiliary variable being used to distinguish the content of each row of the table when the table content is extracted;
    S225. generating a second information extraction template corresponding to the table type, based on the auxiliary variable and on the matched keys and their position information;
    wherein the second information extraction template includes the auxiliary variable, the matched keys and their position information, and the position information of the value of each key-value pair to be extracted from the table.
  5. The method according to claim 2, characterized in that the table type includes a top-bottom one-to-many format, in which the key and the value of each key-value pair in the table are positioned above and below each other, and one key corresponds to multiple values;
    correspondingly, step S22 specifically comprises:
    S221. if a preset vocabulary is detected, matching the value of each key-value pair in the table against the content of the preset vocabulary;
    S222. if the match succeeds, splicing the content of the cells in the table according to the position information of the row in which each cell is located, and matching the spliced content against the content of a preset standard template, the preset standard template including the keys of the key-value pairs that are preset for extraction;
    S223. if the match succeeds, taking the number of columns corresponding to the matched keys as the second target number;
    S224. traversing the table by row and taking the number of columns in the table as the second standard number;
    S225. if the second standard number matches the second target number, generating a third information extraction template corresponding to the table type, based on the matched keys and their position information;
    wherein the third information extraction template includes the matched keys and their position information, and the position information of the value of each key-value pair to be extracted from the table.
  6. The method according to claim 2, characterized in that the table type includes a left-right one-to-many format, in which the key and the value of each key-value pair in the table are positioned left and right of each other, and one key corresponds to multiple values;
    correspondingly, step S22 specifically comprises:
    S221. if no preset vocabulary is detected, splicing the content of the cells in the table according to the position information of the column in which each cell is located, and matching the spliced content against the keys in a preset standard template, the preset standard template including the keys of the key-value pairs that are preset for extraction;
    S222. if the match succeeds, taking the number of rows corresponding to the matched keys as the third target number;
    S223. traversing the table by column to determine the third standard number, which is the number of rows in the table;
    S224. if the third standard number matches the third target number, adding an auxiliary variable before each cell in the table;
    S225. generating a fourth information extraction template corresponding to the table type, based on the auxiliary variable and on the matched keys and their position information, the auxiliary variable being used to distinguish the content of each row of the table when the table content is extracted;
    wherein the fourth information extraction template includes the auxiliary variable, the matched keys and their position information, and the position information of the value of each key-value pair in the table.
  7. The method according to claim 2, wherein the table type includes a left-right one-to-many format, in which the key and the value of each key-value pair in the table are in a left-right positional relationship, and the key and the value are in a one-to-many relationship;
    correspondingly, step S22 specifically includes:
    S221. If a preset vocabulary is detected, match the value of each key-value pair in the table against the content of the preset vocabulary;
    S222. If the matching succeeds, concatenate the contents of the cells in the table according to the position information of their columns, and match the concatenated content against the content of a preset standard template, wherein the preset standard template includes the keys of the key-value pairs preset for extraction;
    S223. If the matching succeeds, take the number of columns corresponding to the matched keys as the fourth target count;
    S224. Traverse the table by column, and take the number of rows in the table as the fourth standard count;
    S225. If the fourth standard count matches the fourth target count, generate, based on the matched keys and their position information, a fifth information extraction template corresponding to the table type;
    wherein the fifth information extraction template contains the matched keys and their position information, and the position information of the value of each key-value pair to be extracted from the table.
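As a rough illustration of what a template for the left-right one-to-many format might hold, the following Python sketch records each matched key with its position and the positions of the values to its right. It deliberately omits the preset-vocabulary check and the count comparison of S221 to S225; the dictionary layout, the function names, and the rule that every non-empty cell to the right of a key is one of its values are assumptions of this sketch, not details from the claims.

```python
def build_left_right_template(table, standard_keys):
    """Build a template for a left-right one-to-many table.

    `table` is a list of rows of cell strings; `standard_keys` is the set
    of keys listed in the (hypothetical) preset standard template.
    Returns {key: {"key_pos": (row, col), "value_pos": [(row, col), ...]}}.
    """
    template = {}
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            if cell in standard_keys and cell not in template:
                # Assumption: every non-empty cell to the right is a value.
                values = [
                    (r, j) for j in range(c + 1, len(row)) if row[j].strip()
                ]
                template[cell] = {"key_pos": (r, c), "value_pos": values}
                break  # simplified layout: one key per row
    return template


def extract_with_template(table, template):
    """Read the values of each key from the positions stored in the template."""
    return {
        key: [table[r][c] for r, c in entry["value_pos"]]
        for key, entry in template.items()
    }
```

For a table such as `[["Name", "Alice", "Bob"], ["City", "Paris", ""]]` with standard keys `{"Name", "City"}`, the template stores two value positions for `Name` and one for `City`, which reflects the one-to-many relationship between a key and its values.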
  8. An RPA and AI-based table information extraction apparatus, comprising:
    an image conversion module, configured to convert a file containing a table into an image;
    a template generation module, configured to recognize the table in the image and generate, according to the recognition result, an information extraction template corresponding to the table type, wherein the information extraction template contains the keys of the key-value pairs in the table and their position information, and the position information of the value of each key-value pair to be extracted;
    a content extraction module, configured to extract table content from the recognition result according to the information extraction template.
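The division of work among the three modules can be sketched as a small Python pipeline. All class, field, and parameter names here are hypothetical, and the converter, recognizer, and template generator are passed in as plain callables because the claim does not specify how they are implemented; the recognition and template-generation steps of the template generation module are shown as two callables only for readability.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column)


@dataclass
class Recognition:
    """Recognition result: table type plus recognized text per cell."""
    table_type: str
    cells: Dict[Cell, str]


@dataclass
class Template:
    """Information extraction template for one table type."""
    table_type: str
    key_positions: Dict[str, Cell]
    value_positions: Dict[str, List[Cell]]


@dataclass
class TableExtractionPipeline:
    to_image: Callable[[str], bytes]                   # image conversion module
    recognize: Callable[[bytes], Recognition]          # template generation module, step 1
    make_template: Callable[[Recognition], Template]   # template generation module, step 2

    def run(self, path: str) -> Dict[str, List[str]]:
        image = self.to_image(path)
        recognition = self.recognize(image)
        template = self.make_template(recognition)
        # Content extraction module: read values at the template's positions.
        return {
            key: [recognition.cells[pos] for pos in positions]
            for key, positions in template.value_positions.items()
        }
```

Keeping the template separate from the recognition result means the extraction step only needs positions, so the same template could in principle be reused for other tables of the same type.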
  9. A computing device, comprising:
    a memory storing executable program code;
    a processor coupled to the memory;
    wherein the processor invokes the executable program code stored in the memory to execute the RPA and AI-based table information extraction method according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the RPA and AI-based table information extraction method according to any one of claims 1-7.
PCT/CN2021/114068 2020-09-25 2021-08-23 RPA and AI-based table information extraction method and apparatus, device and medium WO2022062798A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011024745.3 2020-09-25
CN202011024745.3A CN112149399B (en) 2020-09-25 Table information extraction method, device, equipment and medium based on RPA and AI

Publications (1)

Publication Number Publication Date
WO2022062798A1 (en) 2022-03-31

Family

ID=73897215

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/114068 WO2022062798A1 (en) 2020-09-25 2021-08-23 RPA and AI-based table information extraction method and apparatus, device and medium

Country Status (1)

Country Link
WO (1) WO2022062798A1 (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160055376A1 (en) * 2014-06-21 2016-02-25 iQG DBA iQGATEWAY LLC Method and system for identification and extraction of data from structured documents
CN110008944A (en) * 2019-02-20 2019-07-12 平安科技(深圳)有限公司 OCR recognition methods and device, storage medium based on template matching
CN110334585A (en) * 2019-05-22 2019-10-15 平安科技(深圳)有限公司 Table recognition method, apparatus, computer equipment and storage medium
CN110377560A (en) * 2019-07-18 2019-10-25 中科鼎富(北京)科技发展有限公司 A kind of structural method and device of biographic information
CN112016424A (en) * 2020-03-31 2020-12-01 北京来也网络科技有限公司 Image data processing method and electronic equipment combining RPA and AI
CN112149399A (en) * 2020-09-25 2020-12-29 北京来也网络科技有限公司 Table information extraction method, device, equipment and medium based on RPA and AI
CN112232198A (en) * 2020-10-15 2021-01-15 北京来也网络科技有限公司 Table content extraction method, device, equipment and medium based on RPA and AI
CN113051011A (en) * 2021-02-02 2021-06-29 北京来也网络科技有限公司 RPA and AI combined image information extraction method and device
CN113191131A (en) * 2021-05-10 2021-07-30 重庆中科云从科技有限公司 Form template establishing method for text recognition, text recognition method and system

Also Published As

Publication number Publication date
CN112149399A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
US11010673B2 (en) Method and system for entity relationship model generation
JP7296419B2 (en) Method and device, electronic device, storage medium and computer program for building quality evaluation model
US10796084B2 (en) Methods, systems, and articles of manufacture for automatic fill or completion for application software and software services
US9058317B1 (en) System and method for machine learning management
EP3832488A2 (en) Method and apparatus for generating event theme, device and storage medium
US20140120513A1 (en) Question and Answer System Providing Indications of Information Gaps
US20150026559A1 (en) Information Extraction and Annotation Systems and Methods for Documents
WO2022218186A1 (en) Method and apparatus for generating personalized knowledge graph, and computer device
US11080563B2 (en) System and method for enrichment of OCR-extracted data
CN108595171B (en) Object model generation method, device, equipment and storage medium
EP3333731A1 (en) Method and system for creating an instance model
JP2022031625A (en) Method and device for pushing information, electronic device, storage medium, and computer program
US11461081B2 (en) Adapting existing source code snippets to new contexts
RU2544739C1 (en) Method to transform structured data array
KR20210090576A (en) A method, an apparatus, an electronic device, a storage medium and a program for controlling quality
CN113609838B (en) Document information extraction and mapping method and system
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
WO2023278052A1 (en) Automated troubleshooter
WO2022143608A1 (en) Language labeling method and apparatus, and computer device and storage medium
CN112582073B (en) Medical information acquisition method, device, electronic equipment and medium
CN112632223A (en) Case and event knowledge graph construction method and related equipment
WO2022062798A1 (en) RPA and AI-based table information extraction method and apparatus, device and medium
WO2023159778A1 (en) Bidding document acquisition method and apparatus combining rpa and ai
US20190129921A1 (en) Enhancing Crossing Copying and Pasting Operations
CN112149399B (en) Table information extraction method, device, equipment and medium based on RPA and AI

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21871171

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 22/05/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21871171

Country of ref document: EP

Kind code of ref document: A1