CN114492399A

CN114492399A - Contract information extraction system and method based on regular expression

Info

Publication number: CN114492399A
Application number: CN202111682272.0A
Authority: CN
Inventors: 孙常鹏; 戴斐斐; 高静; 赵猛; 贾晓亮; 李博; 刘德玉; 张耀心
Original assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Information and Telecommunication Branch of State Grid Tianjin Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Information and Telecommunication Branch of State Grid Tianjin Electric Power Co Ltd
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2022-05-13

Abstract

The invention relates to a contract information extraction system and method based on regular expressions. Key information is extracted from unstructured documents by regular-expression-based conversion, stored as structured data, and screened according to inherent rules.

Description

Contract information extraction system and method based on regular expression

Technical Field

The invention belongs to the technical field of intelligent auditing, relates to an auditing information extraction system, and particularly relates to a contract information extraction system and method based on a regular expression.

Background

The traditional way of extracting information from files is to leaf through them and record the information by hand. This is error-prone and inefficient, and it cannot efficiently produce structured data. It no longer meets current working requirements. With the continuing development of technology, big data, intelligent systems, and innovative audit modes and data analysis methods are being promoted, which has given rise to the need for an effective information extraction system and method for audit information.

Upon search, no prior art publications that are the same or similar to the present invention were found.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a contract information extraction system and method based on a regular expression.

The invention solves the practical problem by adopting the following technical scheme:

A contract information extraction system based on regular expressions comprises a task setting module, a data acquisition module, an information extraction module, a data storage module and a big data analysis module. The output end of the task setting module is connected to the data acquisition module; the task setting module is used for presetting tasks and parameters. The output end of the data acquisition module is connected to the information extraction module; the data acquisition module accurately collects the target data through the process automation operation terminal, according to the tasks and parameters preset by the task setting module, and provides a data source for the information extraction module. The output end of the information extraction module is connected to the data storage module; the information extraction module processes the data collected by the data acquisition module, applies a regular expression matching algorithm to the unstructured data to mine the key information required for auditing, and builds a corresponding automaton from each regular expression to match character strings. The data storage module stores the data of the data acquisition module and the information extraction module. The output end of the data storage module is connected to the big data analysis module, which performs further analysis on the data held by the data storage module.

Moreover, the automaton used to match character strings is built from the regular expression as follows: the regular expression is first converted into a nondeterministic finite automaton (NFA), and the NFA is then converted into a deterministic finite automaton (DFA).

A contract information extraction method based on regular expressions comprises the following steps:

step 1, setting tasks and constructing an audit task list;

step 2, collecting target data through a process automation operation terminal according to the audit task list in the step 1;

step 3, extracting information from the target data acquired in step 2.

Further, the specific steps of step 1 include:

(1) designing a contract information auditing intermediate table according to fields of required data given by auditors and the meaning of the fields;

(2) meanwhile, setting a data acquisition path, and simulating and pre-programming the working operations of the auditors;

(3) setting an audit task list according to the audit task.

Moreover, the specific method of the step 2 is as follows:

according to the acquisition path, simulated operations, intermediate data table and audit task list set in step 1, the process automation operation terminal collects contract information from the business system and downloads the unstructured contract files.

Further, the specific steps of step 3 include:

(1) reading the contract documents acquired in the data acquisition stage into text information using the robot's reading capability;

(2) converting the unstructured data on the basis of the read text, using regular-expression-based information extraction: expressions matching the key information are built by combining regular expression syntax elements, automata are constructed from them, and the key information in the text is mined;

furthermore, the syntax elements of the regular expression in sub-step (2) of step 3 include: common characters, character sets, matching times qualifiers, grouping expressions, selection expressions, and escape characters.

Furthermore, step 3 is followed by the following steps:

step 4, analyzing and processing the data extracted in step 3, and outputting audit doubtful points;

the specific method of step 4 is as follows:

first, the auditor analyzes the collected data to find the logical relationships among the data, converts the business logic into a fixed audit model, and communicates that business logic to the program developer; second, the program developer translates the business logic into a computer language, so that audit doubtful points are judged automatically by logical operations and output.

step 5, verifying the doubtful points output in step 4;

the specific method of step 5 is as follows:

the contract auditing robot automatically sends the audit doubtful points to the auditor's mailbox to assist verification; once the auditor has verified and confirmed them, the audit problems are locked directly and the process ends.
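The automatic judgment described in step 4 can be illustrated with a minimal sketch. The rule used here (a contract signed before its bid-winning date is doubtful) and the names `find_doubtful_points`, `signing_date` and `bid_winning_date` are assumptions chosen for illustration; the source does not specify any particular audit rule.

```python
from datetime import date


# Hypothetical audit rule, not taken from the source: a contract signed
# before its bid was won is treated as a doubtful point.
def find_doubtful_points(records):
    """Return (contract name, reason) pairs for records that break the rule."""
    doubts = []
    for record in records:
        signed, won = record["signing_date"], record["bid_winning_date"]
        if signed is not None and won is not None and signed < won:
            doubts.append((record["contract_name"], "signed before bid-winning date"))
    return doubts


# Illustrative records only; real data would come from the intermediate table.
records = [
    {"contract_name": "Contract A", "signing_date": date(2021, 3, 1),
     "bid_winning_date": date(2021, 3, 10)},
    {"contract_name": "Contract B", "signing_date": date(2021, 5, 20),
     "bid_winning_date": date(2021, 5, 1)},
]
print(find_doubtful_points(records))
```

In such a sketch each business rule is one logical comparison over structured fields, which is why the extraction step must first turn the contract text into structured data.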

The invention has the advantages and beneficial effects that:

1. The invention adopts Robotic Process Automation (RPA) and regular expression technology. With an RPA robot as the virtual workforce and regular expressions as the algorithm for converting unstructured data, audit tasks are preset and information extraction, data storage and data analysis are carried out automatically. This effectively optimizes traditional office processes, improves working efficiency, indirectly optimizes the allocation of an enterprise's human resources, and supports the enterprise's digital upgrade.

2. The invention uses RPA and regular expression technology to schedule the RPA robot's tasks for automatic, timed execution. It does not depend on manual triggering, can work around the clock, and closes the loop over the whole working process. The regular expression algorithm extracts the required information from files accurately and efficiently, which helps the RPA robot analyze the key information in the files. The traditional method, by contrast, relies on a large amount of manual labor to review files, extract key information, paste or transcribe it, and organize it into normalized information for use. The invention can replace highly repetitive, low-complexity manual operations: following presets such as unit, time and range, it automatically collects the required data from the system according to a preset automated process, downloads files in batches, extracts key information by regular-expression-based conversion of unstructured data, stores the information as structured data, and screens the data according to inherent rules. The invention as a whole can be called an information extraction robot; its automated process produces an intuitive structured result that lets staff review files quickly.

Drawings

FIG. 1 is a system configuration diagram of the present invention;

FIG. 2 is a process flow diagram of a data acquisition module of the present invention;

FIG. 3(a) is a schematic diagram of the nondeterministic automaton for A|B in the data extraction module of the present invention;

FIG. 3(b) is a schematic diagram of the nondeterministic automaton for A* in the data extraction module of the present invention;

FIG. 3(c) is a schematic diagram of the nondeterministic automaton for the regular expression (A|B)*ABB in the data extraction module of the present invention;

FIG. 3(d) is a schematic diagram of the deterministic automaton for the regular expression (A|B)*ABB in the data extraction module of the present invention;

FIG. 4 is a process flow diagram of the present invention.

Detailed Description

The embodiments of the invention will be described in further detail below with reference to the accompanying drawings:

A contract information extraction system based on regular expressions, as shown in FIG. 1, comprises a task setting module, a data acquisition module, an information extraction module, a data storage module and a big data analysis module. The output end of the task setting module is connected to the data acquisition module; the task setting module is used for presetting tasks and parameters. The output end of the data acquisition module is connected to the information extraction module; the data acquisition module accurately collects the target data through the process automation operation terminal, according to the tasks and parameters preset by the task setting module, and provides a data source for the information extraction module. The output end of the information extraction module is connected to the data storage module; the information extraction module processes the data collected by the data acquisition module, applies a regular expression matching algorithm to the unstructured data to mine the key information required for auditing, and builds a corresponding automaton from each regular expression to match character strings. The data storage module stores the data of the data acquisition module and the information extraction module. The output end of the data storage module is connected to the big data analysis module, which performs further analysis on the data held by the data storage module.

In this embodiment, the automaton used to match character strings is built from the regular expression as follows: the regular expression is first converted into a nondeterministic automaton, and the nondeterministic automaton is then converted into a deterministic automaton.

The composition and operation of the various modules within the system are further described below:

1. The task setting module presets the work plan entered by staff and is the key point of interaction between staff and the robot. The process automation operation terminal reads the work plan parameters and obtains the preset login website, login account, password, unit, time and range information, which are used to set the query conditions during acquisition in the data acquisition module.

2. As shown in FIG. 2, the data acquisition module uses RPA to access web pages and identify the interface's HTML code according to the tasks and parameters preset by the task setting module, so that the target data is collected accurately. The process automation operation terminal communicates with the target web page in full duplex over the WebSocket protocol to achieve synchronous data interaction. According to working requirements, the data acquisition module can collect business data from the business system as well as Internet data, download related files, store the data in a data warehouse through the data storage module, and provide a data source for the information extraction module.

3. The information extraction module processes the data collected by the data acquisition module: it applies a regular expression matching algorithm to the unstructured data to mine the key information required for auditing, and builds a corresponding automaton from each regular expression to match character strings.

The automaton is generally built in two steps: the regular expression is converted into a nondeterministic automaton, which is then converted into a deterministic automaton.

A nondeterministic automaton is defined as a quintuple M = (K, Σ, f, S, Z), wherein:

(1) K is a finite set, each element of which is called a state;

(2) Σ is a finite alphabet, each element of which is called an input symbol, so Σ is also called the input symbol table;

(3) f is a mapping from K × Σ* to the subsets of K, where Σ* denotes the set of strings over the alphabet Σ;

(4) S ⊆ K is a non-empty set of initial states;

(5) Z ⊆ K is the set of final states.

A deterministic automaton is defined as a quintuple M = (K, Σ, f, S, Z), wherein:

(1) K is a finite set, each element of which is called a state;

(2) Σ is a finite alphabet, each element of which is called an input symbol;

(3) f is the transition function, a mapping from K × Σ to K;

(4) S ∈ K is the unique initial state;

(5) Z ⊆ K is the set of final states.

Both deterministic and nondeterministic automata can be represented by graphs or matrices, as shown in FIG. 3(a) to 3(d). In the graph representation, nodes represent states and the character on each edge labels the transition from one state to another. By definition the two kinds of automaton differ mainly as follows: a deterministic automaton has a unique initial state and a set of final states, whereas a nondeterministic automaton has a set of initial states and a set of final states; and in a nondeterministic automaton a state may move to one or more states on a given character, whereas in a deterministic automaton a state can move to only one definite state on a given character.

Take the regular expression Q = (A|B)*ABB. A|B is represented by the nondeterministic automaton shown in FIG. 3(a), where $ denotes the empty string, node 1 is the start node and node 6 is the end node. A* is represented by the nondeterministic automaton shown in FIG. 3(b), where node 1 is the start node and node 4 is the end node. The nondeterminism shows mainly in that node 1 can reach node 2 and node 4 via $, or equivalently that nodes 1, 2 and 4 are all start nodes. The nondeterministic automaton for Q is shown in FIG. 3(c), where node 1 is the start node and node 7 is the end node. It can be converted into a deterministic automaton; the result is shown in FIG. 3(d), where node 1 is the start node and node 5 is the end node. The conversion mainly uses the subset construction algorithm, whose main idea is that each state of the deterministic automaton corresponds to a set of states of the nondeterministic automaton, i.e. a deterministic state records all the states the nondeterministic automaton may have reached after reading an input character.
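The subset construction idea can be sketched as follows. The nondeterministic automaton below is a compact, hand-written automaton for (A|B)*ABB and is an assumption for illustration; its state numbering does not reproduce the node numbering of FIG. 3(c) or FIG. 3(d), and `None` stands in for the empty-string symbol $.

```python
from collections import deque

EPS = None  # label used for empty-string ($) transitions

# Hypothetical NFA for (A|B)*ABB: state -> {symbol -> set of next states}
NFA = {
    0: {EPS: {1}},
    1: {"A": {1, 2}, "B": {1}},
    2: {"B": {3}},
    3: {"B": {4}},
    4: {},
}
NFA_START, NFA_ACCEPT = 0, {4}


def eps_closure(states):
    """All NFA states reachable from `states` using only empty-string moves."""
    stack, closure = list(states), set(states)
    while stack:
        for nxt in NFA[stack.pop()].get(EPS, ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def subset_construction(alphabet=("A", "B")):
    """Return (start, transitions, accepting) of an equivalent DFA.

    Each DFA state is a frozenset of NFA states.
    """
    start = eps_closure({NFA_START})
    transitions, accepting = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        current = queue.popleft()
        if current & NFA_ACCEPT:
            accepting.add(current)
        for symbol in alphabet:
            moved = set()
            for state in current:
                moved |= NFA[state].get(symbol, set())
            target = eps_closure(moved)
            transitions[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start, transitions, accepting


def dfa_accepts(text):
    """Run the constructed DFA over `text`; unknown characters reject."""
    start, transitions, accepting = subset_construction()
    state = start
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting


print(dfa_accepts("BABABB"), dfa_accepts("ABBA"))
```

For this particular NFA the construction yields five deterministic states, each recording the set of nondeterministic states that may be active after the input read so far.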

When the deterministic automaton is used to match a text string, the string matches the regular expression if, starting from the start node and following the edge that matches each successive character, it reaches an end node. A currently more efficient approach is the regular expression matching method based on passive factors: the passive factors divide the text string into short substrings, and each substring is checked for the presence of a prefix and a suffix; a substring lacking either is filtered out directly. When a substring contains both a prefix and a suffix, match verification is run in the deterministic automaton from each prefix position. In this way all start and end positions in the text that match a given regular expression can be found exactly.
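A minimal sketch of the verification stage is given below: a deterministic automaton is run from every candidate start position and each (start, end) pair that reaches a final state is recorded. The transition table is a hand-written automaton for (A|B)*ABB with assumed state numbers, and the passive-factor pre-filtering described above is deliberately not implemented, so every position is treated as a candidate.

```python
# Hypothetical DFA for (A|B)*ABB: state -> {character -> next state}
DFA = {
    0: {"A": 1, "B": 0},
    1: {"A": 1, "B": 2},
    2: {"A": 1, "B": 3},
    3: {"A": 1, "B": 0},
}
START, ACCEPTING = 0, {3}


def find_matches(text):
    """Return every (start, end) slice of `text` accepted by the DFA.

    `end` is exclusive, so text[start:end] is the matched substring.
    """
    matches = []
    for start in range(len(text)):
        state = START
        for end in range(start, len(text)):
            state = DFA[state].get(text[end])
            if state is None:  # no edge for this character: stop this attempt
                break
            if state in ACCEPTING:
                matches.append((start, end + 1))
    return matches


print(find_matches("xABBy"))
```

Running from every position costs time proportional to the square of the text length in the worst case, which is the cost a pre-filtering step of the kind described above is meant to reduce.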

4. The data storage module stores the data of the data acquisition and information extraction modules and is implemented as a data warehouse. The subject domains of the warehouse are first determined according to the enterprise's actual business fields, and the analysis subjects within each subject domain are then determined according to the model; examples are the compliance of company staff during their employment within a given period, the implementation of a major company policy within a given period, or the contracts and bid-winning documents signed in a given month of a given year. Once the subject is clear, the measures, data granularity and dimensions of the analysis are determined. For example, if the company wants to analyze the implementation of a major policy by time, unit and file type, then time, unit and file type are the corresponding dimensions. Determining the dimensions and the raw data establishes the basis for analyzing each subject and identifies the key objects of data maintenance work.

5. The big data analysis module performs further analysis on the data held by the data storage module. It uses an OLAP service, which supports complex analysis operations and provides intuitive, easy-to-understand query functions. For each subject, staff analyze the data from different business angles and obtain intuitive results through analysis operations such as rotation (pivoting), slicing and drilling on the data in the database.
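The slicing and aggregation operations mentioned for the analysis module can be illustrated in plain Python. This is only a sketch under assumed data: the fact rows, the dimension names `year`, `unit` and `file_type`, and the measure `amount` are hypothetical, and no actual OLAP service is involved.

```python
from collections import Counter

# Hypothetical fact rows: three dimensions (year, unit, file_type), one measure.
FACTS = [
    {"year": 2021, "unit": "Unit A", "file_type": "contract", "amount": 120.0},
    {"year": 2021, "unit": "Unit A", "file_type": "bid notice", "amount": 30.0},
    {"year": 2021, "unit": "Unit B", "file_type": "contract", "amount": 80.0},
    {"year": 2022, "unit": "Unit A", "file_type": "contract", "amount": 50.0},
]


def slice_facts(facts, **fixed):
    """Slice: keep only rows whose dimensions equal the fixed values."""
    return [row for row in facts if all(row[k] == v for k, v in fixed.items())]


def roll_up(facts, dimension, measure="amount"):
    """Aggregate the measure along one dimension."""
    totals = Counter()
    for row in facts:
        totals[row[dimension]] += row[measure]
    return dict(totals)


print(roll_up(slice_facts(FACTS, year=2021), "unit"))
```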

A contract information extraction method based on regular expressions, as shown in FIG. 4, includes the following steps:

step 1, setting tasks and constructing an audit task list;

the specific steps of the step 1 comprise:

(1) designing a contract information auditing intermediate table according to fields of required data given by auditors and the meaning of the fields;

In this embodiment, the contract information auditing intermediate table mainly includes the following fields: contract name, contract signing unit, contract undertaking unit, contract amount, contract signing date, bid-winning date, purchasing mode, contract text link, bid-winning notice link, and supplementary agreement link. The intermediate table covers both structured and unstructured data.

(3) setting an audit task list according to the audit task.
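The intermediate table fields listed for this embodiment can be sketched as a record type. The class name, the English field identifiers and the choice of types are assumptions for illustration; the source only names the fields.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ContractAuditRecord:
    """One row of a hypothetical contract information auditing intermediate table."""
    contract_name: str
    signing_unit: str
    undertaking_unit: str
    contract_amount: Optional[float] = None
    signing_date: Optional[str] = None
    bid_winning_date: Optional[str] = None
    purchasing_mode: Optional[str] = None
    # Links point at the unstructured source files to be read in step 3.
    contract_text_link: Optional[str] = None
    bid_notice_link: Optional[str] = None
    supplementary_agreement_link: Optional[str] = None


row = ContractAuditRecord("Example contract", "Unit A", "Unit B")
print(asdict(row))
```

Fields that can only be filled after text extraction default to `None`, so a row can be created at acquisition time and completed later.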

the specific method of the step 2 comprises the following steps:

Step 3, extracting information of the target data acquired in the step 2;

the specific steps of the step 3 comprise:

In this embodiment, the syntax elements of the regular expression comprise the following 6 kinds:

(1) common characters

Letters, digits, Chinese characters, underscores, and punctuation marks with no defined special meaning are all "common characters"; in a match, each one matches a single identical character.

(2) Character set

Multiple characters are enclosed in square brackets [ ], and any one of the enclosed characters can be matched, but only one character is matched at a time.

[m-n]: for example, [1-5] indicates that the character to be matched must be in the range 1 to 5;

[n1n2n3]: for example, [135] indicates that the character to be matched is 1, 3 or 5.

(3) Matching times qualifier

The number of repetitions is enclosed in curly brackets { }, so that the modified expression can be matched repeatedly.

{n}: the expression is repeated exactly n times; for example, A{2} requires a match of 2 consecutive letters A;

{m,n}: the expression is repeated at least m times and at most n times;

{m,}: the expression is repeated at least m times, with no upper limit on the number of repetitions.

(4) Grouping expressions

Other expressions are enclosed in parentheses ( ) so that they form a whole and can be modified as a whole by a matching times qualifier.

(5) Selection expression

The vertical bar "|" separates multiple expressions, and the expressions on either side are in an "or" relationship; for example, 010|021 matches only 010 or 021.

(6) Escape character

?: the modified expression is matched 0 or 1 times; for example, to match a month from 1 to 12, the regular expression can be set to 0?[1-9]|1[0-2];

+: the modified expression is matched at least 1 time;

*: the modified expression is matched 0 or any number of times;

the three symbols above have a defined special meaning, so each must be preceded by "\" to be escaped before the character itself can be matched.

By combining the 6 kinds of syntax elements above, matching criteria can be formulated for character strings in various formats, such as numbers, characters, dates and amounts, as well as more complicated descriptions such as Email addresses, telephone numbers and Internet URL strings.
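The six kinds of syntax elements can be tried out with Python's standard `re` module, as in the sketch below. The sample patterns for grouping and escaping, and the helper name `is_month`, are assumptions added for illustration; the month expression is the one given in the text.

```python
import re

# One (pattern, sample text) pair per syntax element; each sample fully matches.
EXAMPLES = {
    "common characters": (r"contract", "contract"),
    "character set": (r"[1-5]", "3"),
    "matching times qualifier": (r"A{2}", "AA"),
    "grouping expression": (r"(AB){2}", "ABAB"),
    "selection expression": (r"010|021", "021"),
    "escape character": (r"3\+4\?", "3+4?"),
}

# Month expression from the text: optional leading 0 plus 1-9, or 10-12.
MONTH = re.compile(r"0?[1-9]|1[0-2]")


def is_month(text):
    """True if the whole string is a month number from 1 to 12."""
    return MONTH.fullmatch(text) is not None


for name, (pattern, sample) in EXAMPLES.items():
    print(name, bool(re.fullmatch(pattern, sample)))
print(is_month("07"), is_month("12"), is_month("13"))
```

`fullmatch` is used so that the whole string must satisfy the expression; with a plain search, 0?[1-9]|1[0-2] would also find a month inside "13" by matching only the leading 1.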

File content is extracted according to the audit task list preset by the auditor: expressions matching the key information are built by combining regular expression syntax elements, the key information is extracted, and the robot stores it in the audit contract information data table.

For example, to extract the winning bidder from a bid-winning notice, the robot can locate the notice using the isamatch (material name, "# notice. # doc") method, read the file text with the robot plug-in Read Text, and apply a regular expression beginning "(?" to search for all matches. Each successful match, i.e. the winning bidder named in the notice, is assigned to a data field and stored. The operation is executed in a loop N times until all files have been read and the key information in each has been extracted and listed, forming structured audit material that lets auditors carry out full-coverage audit work.
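The loop described in this example can be sketched with Python's `re` module. Both patterns below are assumptions: the file-name rule and the label text "Winning bidder: " are hypothetical stand-ins, and the expression actually used by the embodiment is not reproduced here. File reading is also replaced by a dictionary of already-read text.

```python
import re

# Assumed file-name rule for bid-winning notices (hypothetical).
NOTICE_NAME = re.compile(r".*notice.*\.docx?$", re.IGNORECASE)
# Assumed label text; the lookbehind keeps only what follows the label.
WINNER = re.compile(r"(?<=Winning bidder: )[^\n]+")


def extract_winners(files):
    """files: mapping of file name -> text already read from the file."""
    rows = []
    for name, text in files.items():
        if not NOTICE_NAME.match(name):
            continue  # not a bid-winning notice, skip it
        for match in WINNER.finditer(text):
            rows.append({"file": name, "winning_bidder": match.group().strip()})
    return rows


files = {
    "Bid notice 01.doc": "Project: X\nWinning bidder: Acme Power Co.\nAmount: 100",
    "contract.doc": "Winning bidder: Other Co.",
}
print(extract_winners(files))
```

Each returned row corresponds to one assignment to a data field in the text above; collecting the rows over all files gives the structured listing.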

the specific method of the step 4 comprises the following steps:

Step 5, verifying the doubtful points output in the step 4;

the specific method of the step 5 comprises the following steps:

After the model has been built successfully, the audit robot can be started to acquire, process and analyze the data relevant to the model and to output the results.

The regular-expression-based information extraction method plays a key role in building the contract audit model. Contract management is an important part of enterprise business and an important area of internal control and compliance management. Contract information is also basic data frequently used by professional audit teams such as engineering and finance. Using digital audit methods to mine the value of contract data, to locate problems in contract management accurately and efficiently, and to provide high-quality information support to each professional audit team has therefore become an essential requirement of audit work, and the regular-expression-based contract information extraction method is key to meeting it.

With the regular-expression-based contract information extraction system and method, the computer gains the ability to read text and helps staff process massive text data automatically. Staff can quickly handle complex work such as review, search and proofreading, and risk clauses in contract files can be monitored effectively, saving labor and time. Enterprise bidding documents, internal documents and other long files can be analyzed effectively and valuable information extracted from large volumes of text, improving text processing efficiency and the depth of text mining.

It should be emphasized that the examples described herein are illustrative and not restrictive, and thus the present invention includes, but is not limited to, those examples described in this detailed description, as well as other embodiments that can be derived from the teachings of the present invention by those skilled in the art and that are within the scope of the present invention.

Claims

1. A contract information extraction system based on regular expressions, characterized in that: the system comprises a task setting module, a data acquisition module, an information extraction module, a data storage module and a big data analysis module; the output end of the task setting module is connected to the data acquisition module, and the task setting module is used for presetting tasks and parameters; the output end of the data acquisition module is connected to the information extraction module, and the data acquisition module is used for accurately collecting target data through the process automation operation terminal according to the tasks and parameters preset by the task setting module and for providing a data source for the information extraction module; the output end of the information extraction module is connected to the data storage module, and the information extraction module is used for processing the data collected by the data acquisition module, mining the key information required for auditing by applying a regular expression matching algorithm to the unstructured data, and building a corresponding automaton from the regular expression to match character strings; the data storage module is used for storing the data of the data acquisition module and the information extraction module; and the output end of the data storage module is connected to the big data analysis module, which is used for further analysis of the data in the data storage module.

2. The regular-expression-based contract information extraction system according to claim 1, characterized in that: the automaton used to match character strings is built from the regular expression by first converting the regular expression into a nondeterministic automaton and then converting the nondeterministic automaton into a deterministic automaton.

3. A contract information extraction method based on regular expressions is characterized in that: the method comprises the following steps:

step 1, setting tasks and constructing an audit task list;

step 3, extracting information from the target data acquired in step 2.

4. The regular expression-based contract information extraction method according to claim 3, characterized in that: the specific steps of the step 1 comprise:

(3) setting an audit task list according to the audit task.

5. The regular expression-based contract information extraction method according to claim 3, characterized in that: the specific method of the step 2 comprises the following steps:

6. The regular-expression-based contract information extraction method according to claim 3, characterized in that: the specific steps of the step 3 comprise:

(2) According to the read text information, performing unstructured data conversion by using an information extraction technology based on a regular expression, constructing an automaton according to a combination of syntactic elements of the regular expression and an expression matched with key information, and mining text key information;

7. The regular expression-based contract information extraction method according to claim 6, characterized in that: the syntax elements of the regular expression in sub-step (2) of step 3 comprise: common characters, character sets, matching times qualifiers, grouping expressions, selection expressions, and escape characters.

8. The regular expression-based contract information extraction method according to claim 3, characterized in that: the step 3 is followed by the following steps:

the specific method of the step 4 comprises the following steps:

Step 5, verifying the suspicious points output in the step 4;

the specific method of the step 5 comprises the following steps: