CN111815109A - Power grid engineering contract evaluation method based on image processing - Google Patents

Info

Publication number
CN111815109A
Authority
CN
China
Prior art keywords
contract
special
image
standard electronic
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010480415.9A
Other languages
Chinese (zh)
Inventor
顾闻
陈凯玲
史松峰
韩东
徐雪莲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Shanghai Electric Power Co Ltd
Original Assignee
State Grid Shanghai Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Shanghai Electric Power Co Ltd
Priority to CN202010480415.9A
Publication of CN111815109A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables

Abstract

The invention relates to a power grid engineering contract evaluation method based on image processing, comprising the following steps: S1: acquiring an image of a power grid engineering contract and preprocessing the image; S2: extracting the special clause part of the power grid engineering contract image and converting it into Word format; S3: acquiring the special clause part of the corresponding standard electronic contract; S4: according to the search positioning conditions, locating the same special clause in the paper contract image and in the standard electronic contract, and calculating the text similarity between the two; S5: repeating step S4 until the text similarity has been calculated for all special clauses; S6: calculating the contract similarity from the calculated text similarities, wherein the contract evaluation result is qualified if the contract similarity is higher than a set threshold and unqualified otherwise.

Description

Power grid engineering contract evaluation method based on image processing
Technical Field
The invention relates to the field of intelligent contract evaluation, and in particular to a power grid engineering contract evaluation method based on image processing.
Background
Power grid construction is in a stage of rapid development, and investment in power grid construction in China is increasing year by year. However, technology management work has a long cycle, involves much repetitive work, and carries a high risk when managed after the fact. To achieve lean control of project cost, it is therefore urgent to raise the level of informatization of technology management and to build a "big data" system for power grid engineering technology management.
Contract management is an important link in the process management of technology management. At present, contracts are usually evaluated by manual checking, which often takes a long time, delays the construction period, and increases project cost. Digitally modeling contracts and evaluating contract terms intelligently with artificial intelligence is therefore of great significance for improving the level of technology management. However, research on how to evaluate contract terms intelligently through digital and intelligent technologies is still insufficient and cannot meet the needs of intelligent evaluation of power grid engineering contracts.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide a power grid engineering contract evaluation method based on image processing that evaluates contracts quickly and accurately.
The purpose of the invention can be achieved by the following technical scheme:
A power grid engineering contract evaluation method based on image processing comprises the following steps:
S1: acquiring an image of a power grid engineering contract and preprocessing the image;
S2: extracting the special clause part of the power grid engineering contract image and converting it into Word format;
S3: acquiring the special clause part of the corresponding standard electronic contract;
S4: according to the search positioning conditions, locating the same special clause in the paper contract image and in the standard electronic contract, and calculating the text similarity between the two;
S5: repeating step S4 until the text similarity has been calculated for all special clauses;
S6: calculating the contract similarity from the calculated text similarities, wherein the contract evaluation result is qualified if the contract similarity is higher than a set threshold and unqualified otherwise.
Further, the special clauses are contract clauses that must be established through negotiation, and the general clauses are contract clauses of general application.
Further, the search positioning condition includes a first positioning keyword, a second positioning keyword, and a third positioning keyword.
Further, the power grid engineering contract comprises a construction contract, a supervision contract, and a survey design contract.
Further, step S4 specifically includes:
S41: searching for the first positioning keyword, the second positioning keyword, and the third positioning keyword simultaneously and in parallel to locate the same special clause in the paper contract image and in the standard electronic contract;
S42: extracting the verb sequence of the special clause of the paper contract image and the verb sequence of the special clause of the standard electronic contract;
S43: calculating, based on the verb sequences, the grammar similarity f1 between the special clause of the paper contract image and the special clause of the standard electronic contract;
S44: calculating the semantic similarity f2 between the special clause of the paper contract image and the special clause of the standard electronic contract;
S45: combining the grammar similarity f1 and the semantic similarity f2 to calculate the text similarity f between the special clause of the paper contract image and the special clause of the standard electronic contract.
Further, step S43 specifically includes:
S431: taking the verb sequence of the special clause of the paper contract image and the verb sequence of the special clause of the standard electronic contract as feature strings;
S432: obtaining the number of common substrings from the feature string of the paper contract image to the feature string of the standard electronic contract, recorded as the first common substring number;
S433: obtaining the number of common substrings from the feature string of the standard electronic contract to the feature string of the paper contract image, recorded as the second common substring number;
S434: taking the larger of the first common substring number and the second common substring number as the actual common substring number;
S435: calculating the grammar similarity f1 between the special clause of the paper contract image and the special clause of the standard electronic contract from the actual common substring number.
Further, the semantic similarity f2 is calculated by TF-IDF based on a semantic vector space model.
Still further preferably, the grammar similarity f1 is calculated by the following formula:
Figure BDA0002517137720000031
wherein c is the actual common substring number, a is the number of verbs in the verb sequence of the special clause of the paper contract image, and b is the number of verbs in the verb sequence of the special clause of the standard electronic contract;
the text similarity is calculated by the following formula:
f = α*f1 + β*f2
where α is the grammar weighting coefficient, preferably 0.4, and β is the semantic weighting coefficient, preferably 0.6; the values are determined according to the relative weight of grammatical structure and semantic structure in measuring text similarity.
Further, in step S2, the conversion into Word format specifically includes:
S21: cropping the image with the PIL library and the pylab library of Python to obtain a target image containing the special clauses;
S22: segmenting each character in the target image with the CFS connected domain segmentation method to generate single-character images;
S23: recognizing the segmented character images with the Tesseract character recognition engine or an OCR character recognition service;
S24: writing and saving a Word file with a third-party Python library to obtain the special clauses in Word format;
Step S3 specifically includes:
S31: acquiring the corresponding standard electronic contract according to the contract type, wherein the contract types comprise a construction contract, a supervision contract, and a survey design contract;
S32: segmenting the standard electronic contract with the top-down method of hierarchical layout segmentation, and cutting out the part containing the special clauses.
Preferably, the set threshold includes a text threshold and a number threshold; the text threshold is 90% and the number threshold is 100%.
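The two thresholds can be illustrated with a short sketch. It assumes that each special clause has already been assigned a text similarity f in the range 0 to 1 and a flag saying whether its content is numeric; the `ClauseResult` class, the `is_numeric` flag, and the use of "greater than or equal to" as the comparison are illustrative assumptions, not details given in the source.

```python
from dataclasses import dataclass

TEXT_THRESHOLD = 0.90    # threshold for clauses whose content is text (90%)
NUMBER_THRESHOLD = 1.00  # threshold for clauses whose content is numbers (100%)


@dataclass
class ClauseResult:
    name: str          # hypothetical clause label
    similarity: float  # text similarity f of the clause, in [0, 1]
    is_numeric: bool   # hypothetical flag: True if the clause content is numeric


def clause_passes(result: ClauseResult) -> bool:
    """Compare one clause against the threshold that matches its content type."""
    threshold = NUMBER_THRESHOLD if result.is_numeric else TEXT_THRESHOLD
    # Assumption: "reaching" the threshold means similarity >= threshold.
    return result.similarity >= threshold


def evaluate_contract(results: list[ClauseResult]) -> bool:
    """The contract is qualified only if every special clause passes."""
    return all(clause_passes(r) for r in results)


results = [
    ClauseResult("payment schedule", 0.95, is_numeric=False),
    ClauseResult("contract price", 1.0, is_numeric=True),
]
print("qualified" if evaluate_contract(results) else "unqualified")
```

A numeric clause therefore fails on any deviation from the standard contract, while a text clause tolerates up to 10% dissimilarity.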
Compared with the prior art, the invention has the following advantages:
1) the method divides the clauses of a power grid engineering contract into special clauses and general clauses and, when evaluating the contract, evaluates only the higher-risk clauses, namely the special clauses, which improves the efficiency of contract evaluation;
2) by setting three positioning keywords and searching for them simultaneously and in parallel, the invention can accurately locate the same special clause, which improves the reliability of contract evaluation;
3) the invention uses a verb-based text similarity calculation method: the verbs remaining in a special clause after stop words are removed form a verb sequence that serves as the text feature string, and the grammar similarity f1 is calculated with a string matching algorithm; feature vectors of the text are extracted by the TF-IDF method with semantic topics as the dimensions of the vector space, and the semantic similarity f2 is calculated; the algorithm is simple, and the speed and precision of contract evaluation are improved;
4) the invention sets different thresholds according to whether the content of a special clause is text or numbers, which matches actual practice and improves the reliability and practicability of contract evaluation.
Drawings
FIG. 1 is a schematic diagram of the process steps of the present invention;
FIG. 2 is an overall flow chart of the method of the present invention;
FIG. 3 is a schematic diagram of the grammar similarity calculation process;
FIG. 4 is a schematic diagram of a semantic similarity calculation process;
FIG. 5 is a diagram illustrating the number of common substrings from text A to text B in the embodiment;
FIG. 6 is a diagram illustrating the number of common substrings from text B to text A in the embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
As shown in fig. 1 and fig. 2, the invention provides a power grid engineering contract evaluation method based on image processing, which includes the following steps:
S1: acquiring an image of a power grid engineering contract and preprocessing the image;
S2: extracting the special clause part of the power grid engineering contract image and converting it into Word format; the conversion into Word format specifically comprises the following steps:
S21: cropping the image with the PIL library and the pylab library of Python to obtain a target image containing the special clauses;
S22: segmenting each character in the target image with the CFS connected domain segmentation method to generate single-character images;
S23: recognizing the segmented character images with the Tesseract character recognition engine or an OCR character recognition service;
S24: writing and saving a Word file with a third-party Python library to obtain the special clauses in Word format;
S3: acquiring the special clause part of the corresponding standard electronic contract, which specifically comprises the following steps:
S31: acquiring the corresponding standard electronic contract according to the contract type, wherein the contract types comprise a construction contract, a supervision contract, and a survey design contract;
S32: segmenting the standard electronic contract with the top-down method of hierarchical layout segmentation, and cutting out the part containing the special clauses;
S4: according to the search positioning conditions, locating the same special clause in the paper contract image and in the standard electronic contract, and calculating the text similarity between the two, which specifically comprises the following steps:
S41: searching for the first positioning keyword, the second positioning keyword, and the third positioning keyword simultaneously and in parallel to locate the same special clause in the paper contract image and in the standard electronic contract;
S42: extracting the verb sequence of the special clause of the paper contract image and the verb sequence of the special clause of the standard electronic contract;
S43: calculating, based on the verb sequences, the grammar similarity f1 between the special clause of the paper contract image and the special clause of the standard electronic contract;
S44: calculating the semantic similarity f2 between the special clause of the paper contract image and the special clause of the standard electronic contract;
S45: combining the grammar similarity f1 and the semantic similarity f2 to calculate the text similarity f between the special clause of the paper contract image and the special clause of the standard electronic contract;
S5: repeating step S4 until the text similarity has been calculated for all special clauses;
S6: comparing each calculated text similarity with the set threshold; if every text similarity reaches the threshold, the contract evaluation result is qualified, and otherwise it is unqualified.
The invention divides contract clauses into special clauses and general clauses. Clauses that are made without negotiation and agreement between the parties are general clauses; clauses that are made through negotiation and agreement between the parties are special clauses. That is, the special clauses are contract clauses that must be established through negotiation, and the general clauses are contract clauses of general application. Because general clauses stipulate content of general application, their risk is low. If the obligations in a special clause are unclear, however, the project may end in major disputes and losses, so special clauses carry significant legal risk. The special clauses are therefore the clauses that require focused evaluation.
After the special clauses of the paper contract image and those of the standard electronic contract have been extracted, the same special clause must be located in the two Word files so that their similarity can then be compared. The method adopted by the invention is to set the same positioning keyword segments for each special clause and search for them in both files. Because a single positioning keyword segment may appear in several places in a Word document, making the target special clause hard to locate, the invention further proposes searching with several positioning keyword segments simultaneously and in parallel. Specifically, the search positioning condition includes a first positioning keyword, a second positioning keyword, and a third positioning keyword.
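The positioning step can be sketched as a search for the paragraphs that contain all three keywords at once. This is only an illustration under stated assumptions: the documents are assumed to be already split into paragraph strings, and the function name, the sample paragraphs, and the sample keywords are hypothetical (the keyword tables of the source are not reproduced here).

```python
def locate_clause(paragraphs: list[str], keywords: list[str]) -> list[int]:
    """Return the indices of paragraphs that contain every positioning keyword.

    Requiring all keywords together narrows the match when a single
    keyword appears in several places in the document.
    """
    return [
        index
        for index, text in enumerate(paragraphs)
        if all(keyword in text for keyword in keywords)
    ]


# Hypothetical paragraphs standing in for the text of a contract file.
paragraphs = [
    "The contractor shall pay the deposit before work starts.",
    "The warranty period for the works is 24 months from acceptance.",
    "The warranty deposit shall be returned after the warranty period.",
]

# Hypothetical first, second, and third positioning keywords.
keywords = ["warranty", "period", "months"]
print(locate_clause(paragraphs, keywords))
```

Running the same search with the same keywords on the recognized paper contract and on the standard electronic contract yields the pair of clauses to compare.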
In this embodiment, construction contracts, supervision contracts, and survey design contracts, which are frequently used, prone to disputes, and representative, are selected as the objects of analysis. The keywords required to search for and locate each special clause in these contracts are listed in Tables 1, 2, and 3.
TABLE 1 Search positioning conditions for special clauses of the construction contract
Figure BDA0002517137720000061
Figure BDA0002517137720000071
TABLE 2 Search positioning conditions for special clauses of the supervision contract
Figure BDA0002517137720000072
Figure BDA0002517137720000081
TABLE 3 Search positioning conditions for special clauses of the survey design contract
Figure BDA0002517137720000082
Figure BDA0002517137720000091
The method can be extended to all nine major contract categories, including pre-project contracts, survey design contracts, supervision contracts, power transmission and transformation project construction contracts, technical consultation and service contracts, material purchasing and matching contracts, information system purchasing and maintenance contracts, and office contracts.
In the invention, the calculation of text similarity comprises three parts: first, the grammar similarity f1 of the two texts is calculated by extracting verbs; second, the semantic similarity f2 is calculated by extracting feature items and applying the TF-IDF weighting method; finally, the grammar similarity f1 and the semantic similarity f2 are combined to obtain the text similarity f.
(I) Calculating the grammar similarity f1 of the two texts by extracting verbs specifically comprises the following steps:
S431: taking the verb sequence of the special clause of the paper contract image and the verb sequence of the special clause of the standard electronic contract as feature strings;
S432: obtaining the number of common substrings from the feature string of the paper contract image to the feature string of the standard electronic contract, recorded as the first common substring number;
S433: obtaining the number of common substrings from the feature string of the standard electronic contract to the feature string of the paper contract image, recorded as the second common substring number;
S434: taking the larger of the first common substring number and the second common substring number as the actual common substring number;
S435: calculating the grammar similarity f1 between the special clause of the paper contract image and the special clause of the standard electronic contract from the actual common substring number.
As shown in fig. 3, let the special clause of the paper contract image be text A and the special clause of the standard electronic contract be text B. Once the verb sequences have been obtained, each can be treated as a string, giving a feature string for text A and a feature string for text B; the similarity between the two verb sequences is then obtained by counting the common substrings of the two feature strings. Suppose the verb sequence of text A is V1, V2, V3, V2, V4 and the verb sequence of text B is V1, V3, V2, V4. The number of common substrings from the feature string of text A to that of text B is shown in fig. 5, and the number from the feature string of text B to that of text A is shown in fig. 6. As can be seen from fig. 5 and fig. 6, the number of common substrings from A to B is 3 and the number from B to A is 4. The larger of the two is taken as the actual common substring number, which is therefore 4.
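The directional counts in this example can be reproduced with a small sketch. Figures 5 and 6, which define the exact counting procedure, are not reproduced in this text, so the procedure below is an assumption: a greedy in-order match, in which each verb of the source sequence is matched to its next occurrence in the target sequence after the previous match. This reading happens to give 3 for A to B and 4 for B to A, as stated above; the function names are hypothetical.

```python
def directed_common_count(source: list[str], target: list[str]) -> int:
    """Count verbs of `source` that can be matched, in order, in `target`.

    Assumed procedure (greedy): walk through `source`; for each verb, look
    for its next occurrence in `target` after the last matched position.
    """
    count = 0
    start = 0  # first position in `target` that is still available
    for verb in source:
        try:
            position = target.index(verb, start)
        except ValueError:
            continue  # no remaining occurrence; skip this verb
        count += 1
        start = position + 1
    return count


def actual_common_count(a: list[str], b: list[str]) -> int:
    """Take the larger of the two directional counts (step S434)."""
    return max(directed_common_count(a, b), directed_common_count(b, a))


text_a = ["V1", "V2", "V3", "V2", "V4"]
text_b = ["V1", "V3", "V2", "V4"]

print(directed_common_count(text_a, text_b))  # 3
print(directed_common_count(text_b, text_a))  # 4
print(actual_common_count(text_a, text_b))    # 4
```

The asymmetry arises because matching V2 early in text A consumes the position in text B that V3 would have needed.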
Finally, the grammar similarity f1 is calculated by the following formula:
Figure BDA0002517137720000101
wherein c is the actual common substring number, a is the number of verbs in the verb sequence of the special clause of the paper contract image, and b is the number of verbs in the verb sequence of the special clause of the standard electronic contract.
(II) Calculating the semantic similarity f2 by extracting feature items and applying the TF-IDF weighting method specifically comprises the following steps:
S441: constructing a feature item vector table in a semantic topic space P based on a semantic vector space model;
wherein S441 specifically includes:
S4411: determining the semantic topic set V_T = {τ_1, τ_2, …, τ_d} used in the semantic vector space model, thereby determining the semantic topic space P;
S4412: determining the text feature items that are not semantic topics in the semantic vector space model, recorded as the set V_N;
S4413: combining the semantic topics and the feature items into a set V, taking the elements of the set as nodes and the semantic relations between elements as edges, and organizing them into a semantic association graph G = <V, E>;
S4414: determining the vector corresponding to each semantic topic according to the semantic association graph G = <V, E>;
S4415: calculating the vector representation of each feature item, and constructing the feature item vector table in the semantic topic space P.
S442: extracting all feature items in the special clause of the paper contract image and in the special clause of the standard electronic contract to obtain a feature item set for each;
S443: counting the number of occurrences of each feature item in the feature item set of the paper contract image and in the feature item set of the standard electronic contract;
S444: using the feature item vector table to obtain the feature item vector corresponding to each feature item in the two feature item sets;
S445: calculating, from the feature item vectors, the feature vector corresponding to the special clause of the paper contract image and the feature vector corresponding to the special clause of the standard electronic contract, and normalizing each to obtain the special clause feature vector of the paper contract image and the special clause feature vector of the standard electronic contract;
The feature vector corresponding to the special clause of the paper contract image,
Figure BDA0002517137720000111
is calculated by the following formula:
Figure BDA0002517137720000112
wherein f_{i,k} is the number of occurrences of the k-th feature item in the feature item set of the special clause of the paper contract image, n is the number of all feature items in the special clause of the paper contract image, and
Figure BDA0002517137720000113
is the feature item vector in the semantic topic space P corresponding to the k-th feature item in the feature item set of the special clause of the paper contract image;
The feature vector corresponding to the special clause of the standard electronic contract,
Figure BDA0002517137720000114
is calculated by the following formula:
Figure BDA0002517137720000115
wherein f_{j,k} is the number of occurrences of the k-th feature item in the feature item set of the special clause of the standard electronic contract, m is the number of all feature items in the special clause of the standard electronic contract, and
Figure BDA0002517137720000116
is the feature item vector in the semantic topic space P corresponding to the k-th feature item in the feature item set of the special clause of the standard electronic contract.
S446: calculating the semantic similarity f2 between the special clause of the paper contract image and the special clause of the standard electronic contract from the special clause feature vector of the paper contract image and the special clause feature vector of the standard electronic contract.
The semantic similarity f2 is calculated by the following formulas:
Figure BDA0002517137720000117
Figure BDA0002517137720000118
Figure BDA0002517137720000119
wherein
Figure BDA00025171377200001110
is the special clause feature vector of the paper contract image,
Figure BDA00025171377200001111
is the special clause feature vector of the standard electronic contract, and w_{i,j} is the angle between the special clause feature vector of the paper contract image and the special clause feature vector of the standard electronic contract.
As shown in fig. 4, the measurement of semantic similarity can draw on the vector space model (VSM) used in information retrieval. The basic idea of the vector space model is to represent texts as vectors; characters, words, or phrases can be selected as feature items.
The VSM-based TF-IDF similarity calculation uses words as the feature items of a text and ignores the substitution of near-synonyms and of synonyms with different forms, which reduces the accuracy of the result. This problem can be solved effectively with a semantic dictionary. The commonly used semantic dictionaries are mainly the synonym forest and the knowledge network, and the conceptual information about related words that they provide serves as the measure of word similarity. Feature vectors are extracted with semantic topics as the dimensions of the vector space using a method based on corpus statistics: a set of feature words is selected first, each word is then compared with this set of feature words to obtain its feature vector, and the similarity is calculated as the cosine of the angle between the vectors.
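A minimal sketch of steps S442 to S446 follows. Because the formula images are not reproduced in this text, the details are assumptions: each clause vector is taken to be the occurrence-weighted sum of its feature item vectors, normalized to unit length, and f2 is taken to be the cosine of the angle between the two clause vectors. Any IDF weighting is omitted. The two-dimensional topic space and the `item_vectors` table are hypothetical stand-ins for the feature item vector table of step S441.

```python
import math
from collections import Counter


def clause_vector(feature_items, item_vectors, dim):
    """Sum feature item vectors weighted by occurrence count, then normalize."""
    counts = Counter(feature_items)  # occurrences of each feature item (S443)
    vector = [0.0] * dim
    for item, frequency in counts.items():
        topic_vector = item_vectors.get(item)  # table lookup (S444)
        if topic_vector is None:
            continue  # item missing from the table; skipped in this sketch
        for d in range(dim):
            vector[d] += frequency * topic_vector[d]
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical table: topic 0 = "payment", topic 1 = "inspection".
item_vectors = {
    "pay": [1.0, 0.0],
    "remit": [0.9, 0.1],   # near-synonym of "pay", so it lies close to it
    "inspect": [0.0, 1.0],
}

paper = clause_vector(["pay", "pay", "inspect"], item_vectors, dim=2)
standard = clause_vector(["remit", "remit", "inspect"], item_vectors, dim=2)
f2 = cosine_similarity(paper, standard)
print(round(f2, 3))
```

Because "pay" and "remit" map to nearby points in the topic space, the two clauses score as highly similar even though they share only one literal word, which is the effect the semantic dictionary is meant to provide.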
(III) The grammar similarity f1 and the semantic similarity f2 are combined to obtain the text similarity f, where the text similarity calculation formula is as follows:
f=α*f1+β*f2
where α is the grammar weighting coefficient, preferably 0.4, and β is the semantic weighting coefficient, preferably 0.6; the values are determined according to the relative weight of the grammatical structure and the semantic structure in measuring text similarity.
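A short Python sketch of this weighted combination follows. The function name `text_similarity` and the sample inputs are hypothetical; the default weights are the preferred values of 0.4 and 0.6 given in the text.

```python
def text_similarity(f1, f2, alpha=0.4, beta=0.6):
    """Combine grammar similarity f1 and semantic similarity f2 as f = alpha*f1 + beta*f2."""
    return alpha * f1 + beta * f2


# Hypothetical values: identical verb structure, moderately similar wording.
f = text_similarity(1.0, 0.5)
```

With the preferred weights, the semantic similarity contributes more to the result than the grammar similarity.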
In the invention, only the special terms need to be evaluated, so the general terms in photos taken by the camera are redundant information whose influence on subsequent character recognition needs to be eliminated. Images containing the special clause part are cut out using two Python libraries, PIL and pylab.
After the image containing the special clause part is cut out, the image needs to be converted into Word format through character segmentation and character recognition. The purpose of character segmentation is to separate each character in the cut target image and generate an image of a single character. If character segmentation is inaccurate, the system has difficulty acquiring accurate character features, and character recognition deviates greatly. In practical applications many factors interfere with one another and complicate segmentation; for example, different fonts and sizes, or the clarity of the image after binarization, all affect the overall recognition result. These factors, however, cause relatively little interference. The greatest interference comes from the clarity of the scanned copy and, when a picture is taken, the focusing sharpness of the camera as affected by the light source, both of which affect the binarized image to some degree. The invention uses the CFS connected-domain segmentation method. Its principle is to assume that each character consists of a single connected domain, i.e., that characters do not adhere to one another. Once a black pixel is found, the search proceeds from it until all connected black pixels have been traversed and marked, after which the segmentation position of the character can be determined.
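The connected-domain idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the invention's implementation: the binarized image is represented as a list of rows with 1 for black and 0 for white, pixels are treated as connected in all eight directions, and the function name `segment_characters` is hypothetical.

```python
from collections import deque


def segment_characters(binary):
    """Return bounding boxes (left, top, right, bottom) of connected black regions.

    `binary` is a list of rows; 1 marks a black pixel and 0 a white pixel.
    Each box is assumed to enclose one character (no touching characters).
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Start from an unvisited black pixel and traverse its whole region.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    # Sort left to right so boxes follow reading order within a line.
    return sorted(boxes)


# Two separate strokes standing in for two characters.
sample = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]
boxes = segment_characters(sample)
```

Each returned box gives the crop region for one single-character image. Characters made of several disconnected strokes would be split into several boxes, which is the limitation implied by the single-connected-domain assumption.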
Character recognition in the invention uses the Tesseract character recognition engine or the character recognition API provided by an OCR character recognition service. The recognized characters need to be stored using Python for use in the final evaluation of the contract, and the writing and storage of the Word file can be implemented with a third-party Python library.
The standard electronic contract requires layout analysis, which means dividing the standard electronic contract and cutting out the part containing special terms. The Hierarchical layout segmentation method includes a top-down segmentation method and a bottom-up segmentation method; this embodiment preferably adopts the top-down method, which takes the whole layout as its object, analyzes the information of the whole layout, and uses the result to segment the document step by step. The method is simple and direct and can split a document quickly. Because the images encountered in this work contain only clause information, the method's drawback of being unable to cope with complex layout designs does not make it ill-suited here; on the contrary, its simplicity improves working efficiency.
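As a rough illustration of top-down splitting, the Python sketch below performs a single horizontal cut pass of the kind used in top-down approaches such as the recursive X-Y cut: it looks at the whole page and splits it wherever enough consecutive blank rows occur. The page representation (rows of 0/1 values), the function name `split_on_blank_rows`, and the `min_gap` parameter are assumptions made for the example; the source does not specify how the layout is represented.

```python
def split_on_blank_rows(page, min_gap=2):
    """Split a page into blocks separated by at least `min_gap` blank rows.

    `page` is a list of rows of 0/1 values (1 = ink). Returns a list of
    (first_row, last_row) index pairs, inclusive, one per block.
    """
    blocks = []
    start = None      # first row of the block currently being built
    last_ink = None   # most recent row that contained ink
    for y, row in enumerate(page):
        if not any(row):
            continue
        if start is None:
            start = y
        elif y - last_ink > min_gap:
            # The blank run before this row is wide enough to cut on.
            blocks.append((start, last_ink))
            start = y
        last_ink = y
    if start is not None:
        blocks.append((start, last_ink))
    return blocks


# One-column toy page: a narrow gap (kept inside a block) and a wide gap (cut).
page = [[1], [1], [0], [1], [0], [0], [1]]
blocks = split_on_blank_rows(page)
```

Applying the same pass to the columns of each block, and repeating, would give a fuller recursive segmentation; for pages that contain only clause text, a single pass may already isolate the clause blocks.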
The overall process of intelligent contract clause evaluation is shown in fig. 2. After an image of the paper contract is acquired by camera shooting or similar means, it is preprocessed by graying, noise reduction, binarization and the like. Redundant information such as general terms is then cut away to obtain an image of the special terms to be evaluated, and each character in the cut target image is segmented to generate an image of a single character. The Tesseract text recognition engine or the text recognition API provided by the Baidu OCR text recognition service is called to perform text recognition, and the recognized text is stored as a Word file. In parallel, the standard electronic contract is subjected to layout analysis to cut out the part containing special clauses. After this processing, the special clause part of the paper contract has been converted into a Word file and the special clause part has been cut out of the standard electronic contract, which is originally a Word file, in preparation for the subsequent risk evaluation of contract clauses. Finally, the same special term is searched for in the two Word files, the similarity between the two is calculated, and the result is compared with a reasonable preset similarity threshold. For example, when numbers are compared, the threshold is set to 100%, i.e., the two must be identical; when text is compared, the threshold is set to 90%. If the similarity reaches the set threshold, the contract clause is judged to have no problem and no manual intervention is needed. Otherwise, the contract clause is judged to be high-risk, and an early warning is issued to prompt manual review, which can significantly improve working efficiency.
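The final threshold decision can be sketched in Python as below. The names `evaluate_clause`, `NUMERIC_THRESHOLD` and `TEXT_THRESHOLD` are hypothetical, and treating a similarity exactly equal to the threshold as passing is an assumption made so that the 100% numeric threshold is attainable.

```python
NUMERIC_THRESHOLD = 1.00  # numbers must match exactly
TEXT_THRESHOLD = 0.90     # text may differ slightly


def evaluate_clause(similarity, is_numeric):
    """Return 'pass' if the clause needs no review, else 'manual review'."""
    threshold = NUMERIC_THRESHOLD if is_numeric else TEXT_THRESHOLD
    return "pass" if similarity >= threshold else "manual review"


# Hypothetical similarities for one text clause and one numeric clause.
results = [
    evaluate_clause(0.95, is_numeric=False),
    evaluate_clause(0.95, is_numeric=True),
]
```

The same similarity value can therefore pass as text but be flagged as a number, reflecting the stricter requirement on amounts and dates.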
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and those skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A power grid engineering contract evaluation method based on image processing is characterized by comprising the following steps:
S1: acquiring an image of a power grid engineering contract, and preprocessing the image;
S2: extracting a special clause part in the power grid engineering contract image and converting the special clause part into a Word format;
S3: acquiring a special clause part corresponding to a standard electronic contract;
S4: respectively searching the same special clause from the paper contract image and the standard electronic contract according to the searching and positioning conditions, and calculating the text similarity of the paper contract image and the standard electronic contract;
S5: repeating the step S4 until the text similarity calculation of all the special terms is completed;
S6: comparing whether each calculated text similarity reaches a set threshold; if so, determining that the contract evaluation result is qualified, otherwise determining that the contract evaluation result is unqualified.
2. The power grid engineering contract evaluation method based on image processing as claimed in claim 1, wherein the special terms are contract terms that need to be established by negotiation, and the general terms are general contract terms.
3. The power grid engineering contract evaluation method based on image processing as claimed in claim 2, wherein the search positioning condition includes a first positioning keyword, a second positioning keyword and a third positioning keyword.
4. The image processing-based power grid engineering contract evaluation method according to claim 3, wherein the power grid engineering contract comprises a construction contract, a supervision contract and a survey design contract.
5. The power grid engineering contract evaluation method based on image processing as claimed in claim 4, wherein the step S4 specifically includes:
S41: searching the first positioning keyword, the second positioning keyword and the third positioning keyword in parallel at the same time to obtain the same special term in the paper contract image and the standard electronic contract;
S42: respectively extracting verb sequences in special terms of the paper contract image and special terms of the standard electronic contract;
S43: calculating the grammar similarity f1 of the special clauses of the paper contract image and the special clauses of the standard electronic contract based on the verb sequences;
S44: calculating the semantic similarity f2 of the special terms of the paper contract image and the special terms of the standard electronic contract;
S45: combining the grammar similarity f1 and the semantic similarity f2 to calculate the text similarity f of the special terms of the paper contract image and the special terms of the standard electronic contract.
6. The power grid engineering contract evaluation method based on image processing as claimed in claim 5, wherein the step S43 specifically includes:
S431: respectively taking verb sequences of the special terms of the paper contract image and the special terms of the standard electronic contract as characteristic character strings;
S432: acquiring the number of common substrings from the special clause characteristic character string of the paper contract image to the special clause characteristic character string of the standard electronic contract, and recording it as the first common substring number;
S433: acquiring the number of common substrings from the special clause characteristic character string of the standard electronic contract to the special clause characteristic character string of the paper contract image, and recording it as the second common substring number;
S434: selecting the larger of the first common substring number and the second common substring number as the actual common substring number;
S435: calculating the grammar similarity f1 of the special terms of the paper contract image and the special terms of the standard electronic contract by using the actual common substring number.
7. The power grid engineering contract evaluation method based on image processing as claimed in claim 6, wherein the semantic similarity f2 is calculated through TF-IDF based on a semantic vector space model.
8. The power grid engineering contract evaluation method based on image processing as claimed in claim 7, wherein the calculation formula of the grammar similarity f1 is as follows:
Figure FDA0002517137710000021
wherein c is the actual common substring number, a is the number of verbs in the verb sequence of the special clause of the paper contract image, and b is the number of verbs in the verb sequence of the special clause of the standard electronic contract;
the text similarity calculation formula is as follows:
f=α*f1+β*f2
where α is the grammar weighting coefficient, preferably 0.4, and β is the semantic weighting coefficient, preferably 0.6; the values are determined according to the relative weight of the grammatical structure and the semantic structure in measuring text similarity.
9. The power grid engineering contract evaluation method based on image processing as claimed in claim 1, wherein in step S2, the conversion into a Word format specifically includes:
S21: cutting the special clause partial image through a PIL library and a pylab library of Python to obtain a target image containing special clauses;
S22: performing character segmentation on each character in the target image by using a CFS connected domain segmentation method to generate an image of a single character;
S23: performing character recognition on the image subjected to character segmentation by using a Tesseract character recognition engine or an OCR character recognition service;
S24: writing and storing the Word file by using a third-party library of Python to obtain the special terms in Word format;
the step S3 specifically includes:
S31: acquiring a corresponding standard electronic contract according to contract types, wherein the contract types comprise a construction contract, a supervision contract and a survey design contract;
S32: segmenting the standard electronic contract by adopting the top-down method in Hierarchical layout segmentation, and cutting out a part containing special clauses.
10. The power grid engineering contract evaluation method based on image processing as claimed in claim 1, wherein the set threshold includes a text part threshold and a numeric part threshold, the value of the text part threshold is 90%, and the value of the numeric part threshold is 100%.
CN202010480415.9A 2020-05-30 2020-05-30 Power grid engineering contract evaluation method based on image processing Pending CN111815109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010480415.9A CN111815109A (en) 2020-05-30 2020-05-30 Power grid engineering contract evaluation method based on image processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010480415.9A CN111815109A (en) 2020-05-30 2020-05-30 Power grid engineering contract evaluation method based on image processing

Publications (1)

Publication Number Publication Date
CN111815109A true CN111815109A (en) 2020-10-23

Family

ID=72847880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010480415.9A Pending CN111815109A (en) 2020-05-30 2020-05-30 Power grid engineering contract evaluation method based on image processing

Country Status (1)

Country Link
CN (1) CN111815109A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668899A (en) * 2020-12-31 2021-04-16 无锡软美信息科技有限公司 Contract risk identification method and device based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
US8315997B1 (en) Automatic identification of document versions
US8341112B2 (en) Annotation by search
CN107391614A (en) A kind of Chinese question and answer matching process based on WMD
Ueki et al. Waseda_Meisei at TRECVID 2017: Ad-hoc Video Search.
CN109408578B (en) Monitoring data fusion method for heterogeneous environment
CN101814067A (en) System and methods for quantitative assessment of information in natural language contents
CN114911917B (en) Asset meta-information searching method and device, computer equipment and readable storage medium
CN107844493B (en) File association method and system
CN112182248A (en) Statistical method for key policy of electricity price
CN110956033A (en) Text similarity calculation method and device
CN113971210B (en) Data dictionary generation method and device, electronic equipment and storage medium
CN106933824A (en) The method and apparatus that the collection of document similar to destination document is determined in multiple documents
CN111815108A (en) Evaluation method for power grid engineering design change and on-site visa approval sheet
CN111815109A (en) Power grid engineering contract evaluation method based on image processing
Zhang et al. Semantic image retrieval using region based inverted file
CN111881695A (en) Audit knowledge retrieval method and device
CN117609583A (en) Customs import and export commodity classification method based on image text combination retrieval
Taghva et al. Address extraction using hidden markov models
CN111401047A (en) Method and device for generating dispute focus of legal document and computer equipment
CN116108181A (en) Client information processing method and device and electronic equipment
CN114756617A (en) Method, system, equipment and storage medium for extracting structured data of engineering archives
Hyun et al. Image recommendation for automatic report generation using semantic similarity
CN115858797A (en) Method and system for generating Chinese near-meaning words based on OCR technology
Das et al. Semantic segmentation of MOOC lecture videos by analyzing concept change in domain knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination