CN112417835A

CN112417835A - Intelligent inspection method and system for purchase file based on natural language processing technology

Info

Publication number: CN112417835A
Application number: CN202011299881.3A
Authority: CN
Inventors: 汤力; 姜劲; 杜洁; 李芹; 王菁
Original assignee: Information Center of Yunnan Power Grid Co Ltd
Current assignee: Information Center of Yunnan Power Grid Co Ltd
Priority date: 2020-11-18
Filing date: 2020-11-18
Publication date: 2021-02-26
Anticipated expiration: 2040-11-18
Also published as: CN112417835B

Abstract

The invention relates to a method and a system for intelligently reviewing purchase files based on natural language processing technology, and belongs to the technical field of intelligent text review of project purchase data. First, the online templates of the technical specification and the feasibility study estimate are solidified using web technology and a framework; next, the core field data of the work item parts of the solidified technical specification and feasibility study estimate are exported and preprocessed; finally, the preprocessed core field data of the two documents are analyzed with a similarity algorithm to obtain an examination report. The invention reduces the repetitive and tedious work of manual examination, avoids detail errors caused by high-load manual examination, and is easy to popularize and apply.

Description

Intelligent inspection method and system for purchase file based on natural language processing technology

Technical Field

The invention belongs to the technical field of intelligent text review of project purchase data, and particularly relates to a purchase file intelligent review method and system based on a natural language processing technology.

Background

With the digital transformation of the power grid, the information center serves as the main body of project construction, and the number of informatization projects increases year by year. In 2020 the company is expected to assign 275 informatization projects to the center, with a total investment of nearly 300 million. The whole process of an informatization project involves many templates and requirements. The planning and construction department, as the functional management department for project construction and bid procurement, examines project construction process templates and procurement files manually, which is inefficient and error-prone. With growing audit awareness and increasingly lean project management, project managers need to carry out a point-to-point examination of the work items in the technical specification and the feasibility study estimate, ensuring that the technical specification stays within the scope of the feasibility study with nothing missing, so as to avoid audit risks; at the same time, the purchasing element list and the technical specification need a key-point examination to ensure that the purchase file is complete and reasonable. However, because the number of projects has grown rapidly and bidding work is time-critical, project management has had to examine as many as 59 sub-packages within two days, and the conflict between the quality and the time of manual examination is increasingly prominent. Any quality problem in the examination affects project purchase and subsequent project construction. Therefore, how to overcome the defects of the prior art is a problem that needs to be solved in the technical field of intelligent text review of procurement data.

Disclosure of Invention

The invention aims to solve the defects of the prior art and provides a purchase file intelligent examination method and a purchase file intelligent examination system based on a natural language processing technology.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

the intelligent inspection method of the purchase file based on the natural language processing technology comprises the following steps:

step (1), solidifying the online templates of the technical specification and the feasibility study estimate by adopting web technology and a framework;

step (2), exporting the core field data of the work item parts of the solidified technical specification and feasibility study estimate, and carrying out data preprocessing;

the core fields in the technical specification comprise project early-stage preparation, project development, and project popularization and implementation; the core fields in the feasibility study estimate comprise construction cost and equipment purchase cost;

step (3), analyzing the core field data of the technical specification and the core field data of the feasibility study estimate processed in step (2) by adopting a similarity algorithm to obtain an examination report.

Further, preferably, both the technical specification and the feasibility study estimate are standard document templates; the documents are solidified by adopting web technology and controls until their contents can only be copied and identified and cannot be modified, so as to serve as the standard for document comparison.

Further, preferably, the specific method for solidifying the documents by adopting web technology and controls is as follows: for the project file templates of the technical specification and the feasibility study estimate, corresponding form pages are written with an element component library; and the data in the forms are exported into corresponding Word and Excel files by using an ActiveXObject control.

Further, preferably, the construction cost field includes project development, project implementation, integrated development, project test, and technical consultation; the equipment purchase cost field includes hardware equipment purchase and system software purchase.

Further, in the step (2), preferably, the data preprocessing manner includes text word segmentation, regular matching, stop word processing, character string processing and data reduction.

Further, preferably, the text word segmentation adopts a BiLSTM + CRF word segmentation method.

Further, preferably, after the text word segmentation is completed, the text character strings are cleaned in a regular matching mode, and the special symbols and stop words are filtered to obtain a dictionary database.
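The cleaning step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list, the function names `clean_tokens` and `build_dictionary`, and the choice to treat every non-word character as a special symbol are all hypothetical and are not taken from the patent.

```python
import re

# Hypothetical stop-word list; a real system would load a curated list.
STOP_WORDS = {"the", "of", "and", "a", "an", "for"}

# Anything that is not a word character (letter, digit, underscore, CJK
# character) is treated as a "special symbol" in this sketch.
_SPECIAL = re.compile(r"[^\w]+")


def clean_tokens(tokens):
    """Strip special symbols from each token, then drop stop words and empties."""
    cleaned = []
    for tok in tokens:
        tok = _SPECIAL.sub("", tok).lower()
        if tok and tok not in STOP_WORDS:
            cleaned.append(tok)
    return cleaned


def build_dictionary(token_lists):
    """Map every distinct cleaned token to a stable integer id."""
    vocab = {}
    for tokens in token_lists:
        for tok in clean_tokens(tokens):
            vocab.setdefault(tok, len(vocab))
    return vocab
```

Here the input is assumed to be the token lists produced by the word segmentation step; the resulting mapping plays the role of the dictionary database mentioned in the text.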

Further, preferably, the data reduction uses a principal component analysis algorithm, specifically as follows:

the original data $X = \{x_1, x_2, x_3, \ldots, x_n\}$ needs to be reduced to $k$ dimensions, where $x_1$ to $x_n$ represent the extracted word vector matrix;

1) de-centering: subtracting from each feature value the mean of the corresponding feature;

2) calculating the covariance matrix;

3) calculating the eigenvalues and eigenvectors of the covariance matrix by singular value decomposition;

4) sorting the eigenvalues and selecting the $k$ largest, then taking the $k$ corresponding eigenvectors as row vectors to form the eigenvector matrix $P$;

5) converting the data into the new space constructed by the $k$ eigenvectors, i.e. $Y = PX$, where $Y$ is the result after reduction from $n$ dimensions to $k$ dimensions.
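The five steps above can be sketched in plain Python as follows. Several things here are assumptions made for illustration: the data layout (rows are features, columns are samples, so that the projection reads as Y = PX), the 1/m normalisation of the covariance, and the function name `pca_reduce`. The decomposition step is replaced by power iteration with deflation so that the sketch needs no numerical library; a production system would more likely call a library eigenvalue or singular value routine.

```python
import math


def pca_reduce(X, k, iters=200):
    """Reduce X (n features x m samples) to k dimensions.

    Returns (P, Y): P is the k x n matrix whose rows are eigenvectors,
    and Y = P * X_centered is the k x m reduced data.
    """
    n, m = len(X), len(X[0])
    # 1) de-centering: subtract each feature's mean from its row
    means = [sum(row) / m for row in X]
    Xc = [[v - mu for v in row] for row, mu in zip(X, means)]
    # 2) covariance matrix (assumed normalisation: divide by m)
    C = [[sum(a * b for a, b in zip(Xc[i], Xc[j])) / m for j in range(n)]
         for i in range(n)]
    P = []
    for _ in range(k):
        # 3) dominant eigenvector by power iteration (stand-in for SVD)
        v = [float(i + 1) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * C[i][j] * v[j] for i in range(n) for j in range(n))
        # 4) eigenvectors are found largest eigenvalue first and stored as rows
        P.append(v)
        # deflation: remove the found component before the next round
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    # 5) project the centered data into the new space: Y = P X
    Y = [[sum(p[i] * Xc[i][s] for i in range(n)) for s in range(m)] for p in P]
    return P, Y
```

For example, with two perfectly correlated features `[[1, 2, 3, 4], [2, 4, 6, 8]]` and `k = 1`, the single retained direction is proportional to (1, 2).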

Further, preferably, in step (3), the similarity algorithm is a comprehensive similarity algorithm, that is, three different similarity algorithms are used to calculate the similarity of the core field data separately, and the comprehensive similarity is then obtained as a weighted average of the individual similarities. The specific processing is as follows:

edit distance similarity of characters:

adding operation:

$d_1 = ED(A_{i-1}, B_j) + 1$

deleting operation:

$d_2 = ED(A_i, B_{j-1}) + 1$

modifying operation:

$d_3 = ED(A_{i-1}, B_{j-1}) + c$, where $c = 0$ if $A_i = B_j$ and $c = 1$ otherwise

taking the minimum of the above three as the minimum edit distance gives the state transition equation:

$ED(A_i, B_j) = \min(d_1, d_2, d_3)$

in the above formulas, $d_1$, $d_2$, $d_3$ respectively represent the edit distances of the adding, deleting and modifying operations; $A$ and $B$ represent the two strings to be compared; $ED$ is the edit distance function; $ED(A_i, B_j)$ represents the minimum edit distance; $L_A$ and $L_B$ are the lengths of $A$ and $B$ respectively; $A_i$ denotes the $i$-th character in $A$; $B_j$ denotes the $j$-th character in $B$;

Jaccard coefficient similarity:

$J(A, B) = \dfrac{M_{11}}{M_{01} + M_{10} + M_{11}}$

in the above formula,

$M_{00}$ represents the number of attributes for which A and B are both 0;

$M_{01}$ represents the number of attributes for which A is 0 and B is 1;

$M_{10}$ represents the number of attributes for which A is 1 and B is 0;

$M_{11}$ represents the number of attributes for which A and B are both 1;

cosine similarity:

$\cos\alpha = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$

where $\cos\alpha$ is the cosine distance between the two strings, and $x_i$ and $y_i$ are the components of the word vectors of the two strings;

the comprehensive similarity is obtained as a weighted average of the three similarities:

$S = \lambda_1 S_1 + \lambda_2 S_2 + \lambda_3 S_3$

in the formula, $S_1$, $S_2$, $S_3$ are the three similarities above, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are the coefficients corresponding to the three similarity distances;

with sentences as the minimum detection unit, obtaining the similarity between the core field data of the technical specification and the core field data of the feasibility study estimate through the comprehensive similarity;

and outputting an examination report of the core field.
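The edit distance recurrence over the adding, deleting and modifying operations can be sketched as a small dynamic program. The function names are hypothetical, and the normalisation used in `edit_similarity` (one minus the distance divided by the longer length) is an assumption chosen for illustration; the text only states that the lengths of A and B enter the similarity.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of add/delete/modify operations turning a into b."""
    # ed[i][j] holds the minimum edit distance between a[:i] and b[:j].
    ed = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        ed[i][0] = i
    for j in range(len(b) + 1):
        ed[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d1 = ed[i - 1][j] + 1                      # ED(A_{i-1}, B_j) + 1
            d2 = ed[i][j - 1] + 1                      # ED(A_i, B_{j-1}) + 1
            d3 = ed[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            ed[i][j] = min(d1, d2, d3)                 # state transition
    return ed[len(a)][len(b)]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - ED / max(L_A, L_B), giving a value in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Keeping the full table makes the correspondence with the recurrence explicit; a two-row version would use less memory for long sentences.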

The invention also provides a purchase file intelligent examination system based on the natural language processing technology, which adopts the above purchase file intelligent examination method based on the natural language processing technology and comprises:

the data acquisition device is used for acquiring the technical specification and the feasibility study estimate;

the template curing module is connected with the data acquisition device and is used for curing the acquired technical specification and feasibility study estimate until their contents can only be copied and identified and cannot be modified;

the core field data export module is connected with the template curing module and is used for exporting the core field data of the work item parts of the cured technical specification and feasibility study estimate;

the data preprocessing module is connected with the core field data export module and is used for preprocessing the exported core field data;

the similarity analysis module is connected with the data preprocessing module and is used for analyzing, by adopting a similarity algorithm, the core field data of the technical specification and the core field data of the feasibility study estimate processed by the data preprocessing module;

and the report output module is used for outputting the unmatched items in a report form mode to obtain an examination report.

Further, it is preferable that the similarity of 90% or more is regarded as a match, and otherwise, it is regarded as a mismatch.
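The 90% matching rule and the report of unmatched items might look like the sketch below. The function names are hypothetical, and `ratio` uses the standard library's `difflib.SequenceMatcher` purely as a stand-in similarity so that the example runs on its own; it is not the comprehensive similarity described in this document.

```python
import difflib


def ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the patent's comprehensive similarity."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def unmatched_items(spec_items, estimate_items, similarity=ratio, threshold=0.9):
    """Return (item, best_candidate, score) for spec items scoring below threshold."""
    report = []
    for item in spec_items:
        best, best_score = None, -1.0
        for cand in estimate_items:
            score = similarity(item, cand)
            if score > best_score:
                best, best_score = cand, score
        # A similarity of 90% or more counts as a match; anything else is reported.
        if best_score < threshold:
            report.append((item, best, best_score))
    return report
```

Passing the similarity as a parameter keeps the matching rule separate from whichever measure is actually used.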

Compared with the prior art, the invention has the beneficial effects that:

the invention solidifies the technical specification and the feasibility study estimate online in order to meet the document requirements of project leaders throughout the whole process of an informatization project, reduce unnecessary communication costs caused by inconsistent document templates, reduce the loss of project construction efficiency caused by repeated rework in editing and reviewing, and improve purchasing efficiency.

Through research on intelligent examination technology for the technical specification and the feasibility study estimate, the first intelligent examination of the consistency of the two documents is realized; an examination report is formed by comparing the examination results, a suggestion on whether manual review is needed is given according to the results, and the content needing review is indicated. Applying artificial intelligence to informatization project management greatly reduces the workload of manual examination, improves the examination efficiency of technical specifications, addresses the low accuracy that manual examination may cause, reduces audit risk, and improves purchasing quality and project management quality.

The key parts of the technical specification and the feasibility study estimate are compared by natural language processing technology, realizing the application of natural language technology to these documents; web document solidification technology is applied to the technical specification and the feasibility study estimate, realizing its application in examining these documents in the power industry; and the data in the forms are exported into corresponding Word and Excel files by using an ActiveXObject control.

Drawings

FIG. 1 is a flow chart of the BiLSTM + CRF word segmentation method;

FIG. 2 is a schematic structural diagram of an intelligent examination system for procurement files based on natural language processing technology;

101. data acquisition device; 102. template curing module; 103. core field data export module; 104. data preprocessing module; 105. similarity analysis module; 106. report output module.

Detailed Description

The present invention will be described in further detail with reference to examples.

It will be appreciated by those skilled in the art that the following examples are illustrative of the invention only and should not be taken as limiting the scope of the invention. The examples do not specify particular techniques or conditions, and are performed according to the techniques or conditions described in the literature in the art or according to the product specifications. The materials or equipment used are not indicated by manufacturers, and all are conventional products available by purchase.

Example 1

The intelligent inspection method of the purchase file based on the natural language processing technology is characterized by comprising the following steps:

Example 2

The technical specification and the feasibility study estimate are both standard document templates; the documents are solidified by adopting web technology and controls until their contents can only be copied and identified and cannot be modified, so as to serve as the standard for document comparison.

The specific method for solidifying the documents by adopting web technology and controls is as follows: for the project file templates of the technical specification and the feasibility study estimate, corresponding form pages are written with an element component library; and the data in the forms are exported into corresponding Word and Excel files by using an ActiveXObject control.

The construction cost field comprises project development, project implementation, integrated development, project test and technical consultation; the equipment purchase cost field comprises hardware equipment purchase and system software purchase.

In the step (2), the data preprocessing mode comprises text word segmentation, regular matching, stop word processing, character string processing and data reduction.

The text word segmentation adopts a BiLSTM + CRF word segmentation method.
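A sequence-labelling segmenter of this kind typically emits one tag per character, and the tags are then grouped into words. The sketch below shows only that final grouping step. The B/M/E/S (begin, middle, end, single) tag scheme and the function name `tags_to_words` are assumptions; the document does not state which labelling scheme is used, and the neural model itself is not reproduced here.

```python
def tags_to_words(chars, tags):
    """Group characters into words from per-character B/M/E/S labels."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # start of a multi-character word
            if current:
                words.append(current)
            current = ch
        elif tag == "M":          # middle of a word
            current += ch
        elif tag == "E":          # end of a word
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if current:                   # flush a word left open by a truncated tag sequence
        words.append(current)
    return words
```

For instance, the characters of 项目开发与实施 with tags B, E, B, E, S, B, E group into 项目 / 开发 / 与 / 实施.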

After the text word segmentation is finished, the text character strings are cleaned in a regular matching mode, and the special symbols and stop words are filtered to obtain a dictionary base.

The data reduction uses a principal component analysis algorithm as follows:

1) de-centering: subtracting from each feature value the mean of the corresponding feature;

2) calculating the covariance matrix;

3) calculating the eigenvalues and eigenvectors of the covariance matrix by singular value decomposition;

4) sorting the eigenvalues and selecting the $k$ largest, then taking the $k$ corresponding eigenvectors as row vectors to form the eigenvector matrix $P$;

5) converting the data into the new space constructed by the $k$ eigenvectors, i.e. $Y = PX$, where $Y$ is the result after reduction from $n$ dimensions to $k$ dimensions.

In the step (3), the similarity algorithm adopts a comprehensive similarity algorithm, that is, three different similarity algorithms are respectively adopted to calculate the similarity of the core field data, and then the comprehensive similarity is obtained by utilizing a weighted average mode for each similarity, and the specific processing mode is as follows:

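edit distance similarity of characters:

adding operation:

$d_1 = ED(A_{i-1}, B_j) + 1$

deleting operation:

$d_2 = ED(A_i, B_{j-1}) + 1$

modifying operation:

$d_3 = ED(A_{i-1}, B_{j-1}) + c$, where $c = 0$ if $A_i = B_j$ and $c = 1$ otherwise

Jaccard coefficient similarity:

$J(A, B) = \dfrac{M_{11}}{M_{01} + M_{10} + M_{11}}$

in the above formula, $M_{11}$ indicates the number of attributes for which A and B are both 1, $M_{01}$ the number for which A is 0 and B is 1, and $M_{10}$ the number for which A is 1 and B is 0;

cosine similarity:

$\cos\alpha = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$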

and outputting an examination report of the core field.
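The Jaccard coefficient and cosine similarity named above can be sketched on token lists as follows. Two simplifications are assumed here: the Jaccard coefficient is computed in its set form, |A ∩ B| / |A ∪ B|, which equals counting attributes that are 1 in both strings over attributes that are 1 in at least one; and the cosine is taken over simple token-count vectors rather than learned word vectors. Function names are hypothetical.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient of two token collections, in [0, 1]."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # two empty inputs are treated as identical
    return len(a & b) / len(a | b)


def cosine(tokens_a, tokens_b):
    """Cosine of the angle between the token-count vectors of the two inputs."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)  # a Counter returns 0 for missing tokens
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Both functions accept any iterable of hashable items, so they work equally on word lists and on the characters of a string.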

As shown in FIG. 2, the purchase file intelligent examination system based on natural language processing technology, which adopts the above purchase file intelligent examination method based on natural language processing technology, comprises:

the data acquisition device 101 is used for acquiring the technical specification and the feasibility study estimate;

the template curing module 102 is connected with the data acquisition device 101 and is used for curing the acquired technical specification and feasibility study estimate until their contents can only be copied and identified and cannot be modified;

the core field data export module 103 is connected with the template curing module 102 and is used for exporting the core field data of the work item parts of the cured technical specification and feasibility study estimate;

the data preprocessing module 104 is connected with the core field data export module 103 and is used for preprocessing the exported core field data;

the similarity analysis module 105 is connected with the data preprocessing module 104 and is used for analyzing, by adopting a similarity algorithm, the core field data of the technical specification and the core field data of the feasibility study estimate processed by the data preprocessing module 104;

and the report output module 106 is configured to output the unmatched items in a report form to obtain an examination report.

Example 3

(1) solidifying the online templates of the technical specification and the feasibility study estimate by adopting web technology and a framework;

(2) exporting, through document solidification, the core field data of the work item parts of the technical specification and the feasibility study estimate, and carrying out data preprocessing;

(3) analyzing the document to be examined by adopting a similarity algorithm to obtain a preliminary examination report.

In step (1), the technical specification and the feasibility study estimate are standard document templates. The documents are solidified by adopting web technology and a framework to serve as the standard for document comparison.

In the step (2), the data preprocessing mode comprises regular matching, text word segmentation, stop word processing, character string processing and data reduction.

The text word segmentation adopts a recurrent neural network word segmentation method (the flow is shown in figure 1);

after word segmentation is finished, cleaning text character strings by using a regular expression to filter special symbols and stop words to obtain a dictionary library;

the principle of the data reduction is as follows:

1) de-centering: subtracting from each feature value the mean of the corresponding feature, i.e. de-centering each of the vectors $x_1$ to $x_n$ of the vector matrix;

2) calculating the covariance matrix;

3) solving for the eigenvalues and eigenvectors of the covariance matrix by eigenvalue decomposition;

5) converting the data into the new space constructed by the $k$ eigenvectors, i.e. $Y = PX$, which is the result after reduction from $n$ dimensions to $k$ dimensions.

In the step (3), the similarity algorithm adopts a comprehensive similarity algorithm, that is, three different similarity algorithms are respectively adopted to calculate the similarity of the core field data, and then the comprehensive similarity is obtained by using a weighted average method for each similarity, and the specific processing method is as follows:

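edit distance similarity of characters:

adding operation:

$d_1 = ED(A_{i-1}, B_j) + 1$

deleting operation:

$d_2 = ED(A_i, B_{j-1}) + 1$

modifying operation:

$d_3 = ED(A_{i-1}, B_{j-1}) + c$, where $c = 0$ if $A_i = B_j$ and $c = 1$ otherwise

Jaccard coefficient similarity:

$J(A, B) = \dfrac{M_{11}}{M_{01} + M_{10} + M_{11}}$

cosine similarity:

$\cos\alpha = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$

the comprehensive similarity is the weighted average of the three similarities $S_1$, $S_2$, $S_3$ above:

$S = \lambda_1 S_1 + \lambda_2 S_2 + \lambda_3 S_3$

in the formula, $\lambda_1$, $\lambda_2$, $\lambda_3$ are the coefficients corresponding to the three similarity distances; preferably 0.2, 0.4.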

With sentences as the minimum detection unit, obtaining the similarity between the core field data of the document to be checked and the core field data of the solidified template through comprehensive similarity;

and outputting the examination report of the core field of the document to be examined.
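Sentence-level detection with a weighted combination of several measures can be sketched as below. Everything named here is an assumption for illustration: the sentence terminators used for splitting, the function names, and the idea of passing the individual similarity functions in as a list of callables. The weights are divided by their sum so that the result is a weighted average even when they do not add up to one.

```python
import re


def split_sentences(text):
    """Split on common Chinese and English sentence terminators (assumed set)."""
    return [s.strip() for s in re.split(r"[。！？!?；;]+", text) if s.strip()]


def composite(a, b, measures, weights):
    """Weighted average of several similarity functions with values in [0, 1]."""
    if not weights or len(measures) != len(weights):
        raise ValueError("need exactly one weight per measure")
    total = sum(w * m(a, b) for m, w in zip(measures, weights))
    return total / sum(weights)


def best_matches(document, template, measures, weights):
    """For each sentence of document, find the most similar template sentence."""
    template_sentences = split_sentences(template)
    results = []
    for sentence in split_sentences(document):
        scored = [(composite(sentence, t, measures, weights), t)
                  for t in template_sentences]
        score, match = max(scored) if scored else (0.0, None)
        results.append((sentence, match, score))
    return results
```

In use, `measures` would hold the edit distance, Jaccard and cosine similarity functions and `weights` the corresponding coefficients; each resulting score can then be compared with the matching threshold.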

Preferably, the technical specification and the feasibility study estimate based on the specified templates are imported, the names of the function items (core field data) in the function item tables of the two documents are extracted, and similarity calculation is carried out; a similarity of 90% or more is regarded as a match, and the unmatched items are provided to the project manager in the form of a report for manual checking.

Application example

(1) Document solidification. The standard documents of the technical specification and the feasibility study estimate are first solidified.

(2) Document comparison. First, the core field data of the work item parts of the solidified technical specification and feasibility study estimate are exported, and the core fields are preprocessed using natural language processing technology. Second, the similarity between the core fields and the core fields in the solidified document is calculated; if the similarity is more than 90%, the document is regarded as qualified, otherwise as unqualified.

(3) Document output. The data in the form are exported into corresponding Word and Excel files through an ActiveXObject control, and the unmatched items are output in the form of a report.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. The intelligent inspection method of the purchase file based on the natural language processing technology is characterized by comprising the following steps:

2. The intelligent examination method for procurement files based on natural language processing technology as claimed in claim 1, characterized in that the technical specification and the feasibility study estimate are both standard document templates; the documents are solidified by adopting web technology and controls until their contents can only be copied and identified and cannot be modified, so as to serve as the standard for document comparison.

3. The intelligent examination method for procurement files based on natural language processing technology as claimed in claim 2, characterized in that the specific method for solidifying the documents by adopting web technology and controls is as follows: for the project file templates of the technical specification and the feasibility study estimate, corresponding form pages are written with an element component library; and the data in the forms are exported into corresponding Word and Excel files by using an ActiveXObject control.

4. The intelligent examination method for procurement files based on natural language processing technology as claimed in claim 1, characterized in that the construction cost field comprises project development, project implementation, integrated development, project test and technical consultation; the equipment purchase cost field comprises hardware equipment purchase and system software purchase.

5. The intelligent examination method for procurement files based on natural language processing technology as claimed in claim 1, characterized in that in step (2), the data preprocessing mode comprises text word segmentation, regular matching, stop word processing, character string processing and data reduction.

6. The intelligent examination method for procurement files based on natural language processing technology as claimed in claim 5, characterized in that the text word segmentation adopts a BiLSTM + CRF word segmentation method.

7. The intelligent examination method for procurement files based on natural language processing technology as claimed in claim 5, characterized in that after the text word segmentation is completed, the text character strings are cleaned in a regular matching mode, and the special symbols and stop words are filtered to obtain a dictionary database.

8. The intelligent examination method for procurement files based on natural language processing technology as claimed in claim 5, characterized in that the data reduction uses a principal component analysis algorithm, specifically as follows:

the original data $X = \{x_1, x_2, x_3, \ldots, x_n\}$ needs to be reduced to $k$ dimensions, where $x_1$ to $x_n$ represent the extracted word vector matrix;

2) calculating the covariance matrix;

3) calculating the eigenvalues and eigenvectors of the covariance matrix by singular value decomposition;

9. The intelligent examination method for procurement files based on natural language processing technology as claimed in claim 1, wherein in step (3), the similarity algorithm adopts a comprehensive similarity algorithm, that is, three different similarity algorithms are respectively adopted to calculate the similarity of core field data, and then the comprehensive similarity is obtained by using a weighted average method for each similarity, and the specific processing method is as follows:

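edit distance similarity of characters:

adding operation:

$d_1 = ED(A_{i-1}, B_j) + 1$

deleting operation:

$d_2 = ED(A_i, B_{j-1}) + 1$

modifying operation:

$d_3 = ED(A_{i-1}, B_{j-1}) + c$, where $c = 0$ if $A_i = B_j$ and $c = 1$ otherwise

$ED(A_i, B_j) = \min(d_1, d_2, d_3)$ represents the minimum edit distance; $L_A$ and $L_B$ are the lengths of $A$ and $B$ respectively; $A_i$ represents the $i$-th character in $A$; $B_j$ represents the $j$-th character in $B$;

Jaccard coefficient similarity:

$J(A, B) = \dfrac{M_{11}}{M_{01} + M_{10} + M_{11}}$

cosine similarity:

$\cos\alpha = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$

the comprehensive similarity is the weighted average of the three similarities $S_1$, $S_2$, $S_3$ above:

$S = \lambda_1 S_1 + \lambda_2 S_2 + \lambda_3 S_3$

in the formula, $\lambda_1$, $\lambda_2$, $\lambda_3$ are the coefficients corresponding to the three similarity distances;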

and outputting an examination report of the core field.

10. An intelligent examination system for procurement files based on natural language processing technology, which adopts the intelligent examination method for procurement files based on natural language processing technology, characterized by comprising:

the data acquisition device (101) is used for acquiring the technical specification and the feasibility study estimate;

the template curing module (102) is connected with the data acquisition device (101) and is used for curing the acquired technical specification and feasibility study estimate until their contents can only be copied and identified and cannot be modified;

the core field data export module (103) is connected with the template curing module (102) and is used for exporting the core field data of the work item parts of the cured technical specification and feasibility study estimate;

the data preprocessing module (104) is connected with the core field data export module (103) and is used for preprocessing the exported core field data;

the similarity analysis module (105) is connected with the data preprocessing module (104) and is used for analyzing, by adopting a similarity algorithm, the core field data of the technical specification and the core field data of the feasibility study estimate processed by the data preprocessing module (104);

and the report output module (106) is used for outputting the unmatched items in a report form mode to obtain an examination report.