CN115982324A

CN115982324A - Purchase file inspection method based on improved natural language processing

Info

Publication number: CN115982324A
Application number: CN202310265680.9A
Authority: CN
Inventors: 吴志刚; 龙佽飞; 梁燕君; 陈楚云; 黄康君; 李筱菁; 冯指明
Original assignee: Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Current assignee: Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date: 2023-03-20
Filing date: 2023-03-20
Publication date: 2023-04-18

Abstract

The invention discloses a purchase file inspection method based on improved natural language processing. OCR preprocessing is performed on all purchase files to generate the corresponding structured purchase texts; the purchase texts are grouped in pairs, a sentence-bert model generates word vectors for each pair, the word vectors are converted into sentence vectors, cosine similarity is calculated on the sentence vectors, and a correlation comparison is performed on the sentence vectors, yielding a similarity comparison based on improved natural language processing; an expert knowledge base and an inference engine are set up in the purchase file auditing system, the expert knowledge base is connected to the inference engine, and the inference engine is then started to judge whether the purchase file is compliant. Compared with traditional NLP text processing methods, the method improves the accuracy of text similarity analysis and speeds up the analysis.

Description

Purchase file inspection method based on improved natural language processing

Technical Field

The invention belongs to the technical field of electronic document auditing systems, and particularly relates to a purchasing file inspection method based on improved natural language processing.

Background

When electronic documents are processed automatically, it is common, after scanning them with OCR, to use natural language processing (NLP), supported by knowledge graphs of the relevant fields, to normalize and intelligently process their unstructured content. These techniques are widely used in file auditing systems, where judging the similarity among multiple files is a common auditing requirement.

When an existing document auditing system judges document similarity, it extracts keywords of the bidding documents based on their head words, extracts high-frequency words from the text information, and performs a difference analysis on the keywords and the high-frequency words, so that documents with high similarity trigger an early warning that is fed back to the auditors.

However, existing file auditing systems lack a compliance auditing method for electronic purchase files. No rule base or expert experience knowledge base has been formulated for purchase files from the perspective of possible bid collusion, and there is no step in which experts and auditors take part interactively in training and error correction, so the similarity judgment for purchase files can differ from the actual situation. The prior art uses a traditional multilayer Transformer structure, whose ability to extract text information features is not strong enough, whose training is slow, which cannot be parallelized, and which does not clearly distinguish the semantics of words and phrases in the same sentence pattern. In addition, the multilayer Transformer has a long training time, a large number of network parameters, a large memory footprint, and a low prediction speed, and it is not suitable for generation tasks, for processing ultra-long texts, or for tasks that only need shallow semantics.

Disclosure of Invention

In order to overcome one or more defects and shortcomings in the prior art, the invention aims to provide a purchase file inspection method based on improved natural language processing, which is used for performing semantic similarity analysis on a purchase file to judge whether a purchase non-compliance condition exists.

In order to achieve the above object, the present invention adopts the following technical means.

A purchasing file inspection method based on improved natural language processing comprises the following steps:

building a comprehensive database for storing purchase files in a purchase file auditing system, taking out from the comprehensive database the multiple purchase files to be audited for file similarity, then performing OCR (optical character recognition) preprocessing on all the purchase files, and generating the corresponding multiple structured purchase texts;

grouping all the purchase texts in pairs, generating word vectors for each pair with a sentence-bert model, converting the word vectors into sentence vectors, performing cosine similarity calculation on the sentence vectors, and performing correlation comparison on the sentence vectors to obtain a similarity comparison based on improved natural language processing; the sentence-bert model consists of two bert networks that have the same structure and the same parameters and are arranged in parallel;

an expert knowledge base and an inference machine are set up in a purchase file auditing system, the expert knowledge base is connected with the inference machine, and then the inference machine is started to judge whether the purchase file is in compliance; an audit rule file for checking whether the content of the purchase text is in compliance or not is stored in the expert knowledge base, wherein the audit rule file comprises a plurality of items of audit rule information; the inference machine is used for matching the contents and data information of each field of the purchase file with a plurality of items of audit rule information in the expert knowledge base audit rule file and judging whether the contents of the purchase file accord with the audit rules or not.

Preferably, the process of generating the word vectors by the sentence-bert model includes:

for the two purchase texts, splitting the paragraphs of each purchase text into sentences; then inputting the sentences of the first purchase text and the sentences of the second purchase text into the two bert networks of the sentence-bert model respectively, to obtain all the word vectors of the sentences of each purchase text.
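The splitting step can be illustrated with a minimal sketch. The source does not say how sentences are delimited, so the rule below (split at Chinese full stops, exclamation marks, question marks and semicolons, plus line breaks) is an assumption, and `split_sentences` is a hypothetical helper name.

```python
import re
from typing import List

# Assumed delimiter set: Chinese and ASCII sentence-ending marks and semicolons.
# The ASCII period is deliberately left out so decimals such as "3.5" stay intact.
_SENTENCE = re.compile(r"[^。！？!?；;\n]+[。！？!?；;]?")


def split_sentences(paragraph: str) -> List[str]:
    """Split one paragraph of a purchase text into sentences."""
    return [m.strip() for m in _SENTENCE.findall(paragraph) if m.strip()]


# Made-up example paragraph, not taken from any real purchase file.
sentences = split_sentences("投标人须具备资质。报价不得超过预算！")
print(sentences)
```

Each returned sentence would then be fed to one of the two bert networks; a production system would likely need a more careful segmenter for OCR output.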

Further, the process of converting the word vector into the sentence vector includes:

and carrying out an averaging operation on all the word vectors of the sentence, and taking an average result as a sentence vector u of the current sentence.
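The averaging operation is an element-wise mean over the word vectors of one sentence. A minimal sketch, assuming word vectors are plain lists of floats of equal length (`mean_pool` is a hypothetical name):

```python
from typing import List, Sequence


def mean_pool(word_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a sentence's word vectors element-wise into one sentence vector."""
    if not word_vectors:
        raise ValueError("a sentence needs at least one word vector")
    dim = len(word_vectors[0])
    if any(len(vec) != dim for vec in word_vectors):
        raise ValueError("all word vectors must share one dimension")
    count = len(word_vectors)
    return [sum(vec[i] for vec in word_vectors) / count for i in range(dim)]


# Toy 2-dimensional word vectors for a three-word sentence (made-up values).
u = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(u)  # [3.0, 4.0]
```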

Further, the cosine similarity calculation of the sentence vectors is as follows: the two purchase texts are passed through the sentence-bert model simultaneously to generate word vectors, the word vectors are converted into sentence vectors to obtain two sentence vectors, and the cosine similarity of the two sentence vectors is calculated.
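For two sentence vectors u and v, cosine similarity is the dot product divided by the product of the vector norms. A minimal sketch in plain Python (the function name and the error handling for zero vectors are assumptions):

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return cos(u, v) = (u . v) / (|u| * |v|)."""
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Orthogonal toy vectors have similarity 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```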

Further, after cosine similarity calculation is carried out on the sentence vectors, loss function construction is also carried out;

the process of constructing the loss function comprises the following steps:

and calculating the mean square error by taking the cosine similarities of all the sentence vectors as samples, and taking the mean square error as the loss function of the sentence-bert model for adjusting the parameters of the sentence-bert model.
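The source does not state what the cosine similarities are compared against when the mean square error is formed. The sketch below assumes, purely for illustration, that each sentence pair has a reference similarity label in [0, 1]; the names `mse_loss`, `cos_scores` and `labels` are hypothetical.

```python
from typing import Sequence


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean of squared differences between predicted and target similarities."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Made-up cosine similarities and assumed reference labels for three sentence pairs.
cos_scores = [0.9, 0.2, 0.6]
labels = [1.0, 0.0, 0.5]
loss = mse_loss(cos_scores, labels)
print(loss)
```

In an actual training loop this value would be back-propagated through both bert networks; that part is outside the scope of the sketch.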

Further, the process of performing correlation comparison on sentence vectors includes:

calculating the Spearman coefficients or Pearson coefficients between all sentence vectors u and sentence vectors v obtained from the two purchase texts, and then judging whether the Spearman coefficients or the Pearson coefficients reach a set numerical range; if not, the two purchase texts are not similar; if so, the two purchase texts are similar.
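Both coefficients can be computed without external libraries: the Pearson coefficient is the normalized covariance of two sequences, and the Spearman coefficient is the Pearson coefficient of their ranks. The sketch below is one possible reading of the comparison step; the threshold of 0.8 in `is_similar` is a placeholder, since the source only speaks of "a set numerical range".

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def is_similar(coefficient: float, threshold: float = 0.8) -> bool:
    """Assumed decision rule: similar when the coefficient reaches the threshold."""
    return coefficient >= threshold


# Toy vectors (made-up values): y grows monotonically with x.
rho = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0])
print(is_similar(rho))  # True
```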

Further, the process of starting the inference engine to output the result of whether the purchase file is in compliance includes:

and matching and judging the purchasing file to be detected and all the auditing rule information in the expert knowledge base one by one, and marking the judgment result that the contents and data information of each field of the purchasing file do not accord with a certain auditing rule.

Further, after the inference engine is started to judge whether the purchase file is compliant, an interpreter is set up in the purchase file auditing system; the interpreter is connected with the inference engine and is used for converting the inference engine's marks of whether the content of the purchase file conforms to the audit rules into natural language, and then outputting and displaying the natural language.

Compared with the prior art, the technical scheme of the invention has the following beneficial effects:

based on improved natural language processing, the method takes correlation calculation as its objective and determines the similarity of purchase texts by using a sentence-bert model combined with the Spearman coefficient or the Pearson coefficient between sentence vectors; compared with traditional NLP models, this improves the accuracy of text similarity analysis and speeds up the analysis; in addition, the sentence-bert model occupies fewer computer memory resources and has higher performance; the expert knowledge base and the inference engine set up by the invention match the audit rule information against the content and data information of each field of the purchase file, thereby realizing a further compliance audit of the purchase file, so that the text similarity judgment and the compliance audit of the purchase file are combined.

Drawings

FIG. 1 is a schematic flow diagram of a general method for verification of a procurement documentation based on improved natural language processing in accordance with the invention;

fig. 2 is a schematic structural framework diagram of the procurement file auditing system in fig. 1.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments thereof. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Examples

As shown in fig. 1 and fig. 2, the method for checking a purchase document based on improved natural language processing of the present embodiment includes the following steps:

S1, building a comprehensive database for storing purchase files in a purchase file auditing system, taking out from the comprehensive database the multiple purchase files to be audited for file similarity, performing OCR (optical character recognition) preprocessing on all the purchase files, generating the corresponding multiple structured purchase texts, and then executing the next step;

in this embodiment, the purchase file is preferably in PDF format and/or a picture format; when a purchase file uploaded by the user is in neither PDF format nor a picture format, the purchase file auditing system first converts the purchase file into PDF format and/or a picture format, and then stores it in the comprehensive database;

S2, a similarity analysis module for analyzing text similarity is built in the purchase file auditing system, and the similarity analysis module comprises a sentence-bert model, a sentence vector generation module, a cosine similarity calculation module and a correlation calculation module which are connected in sequence; all the purchase texts are grouped in pairs, and the similarity comparison based on improved natural language processing is carried out on each pair;
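Grouping the purchase texts in pairs amounts to enumerating every unordered pair once. A minimal sketch using the standard library; the dictionary keys are hypothetical file identifiers and `pair_up` is an illustrative name:

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pair_up(texts: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every unordered pair of purchase-text identifiers exactly once."""
    return list(combinations(sorted(texts), 2))


# Hypothetical identifiers mapped to placeholder text content.
purchase_texts = {"bid_a": "...", "bid_b": "...", "bid_c": "..."}
pairs = pair_up(purchase_texts)
print(pairs)  # [('bid_a', 'bid_b'), ('bid_a', 'bid_c'), ('bid_b', 'bid_c')]
```

For n purchase texts this yields n(n-1)/2 pairs, so the number of comparisons grows quadratically with the number of files in one audit.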

carrying out similarity comparison based on improved natural language processing, wherein the specific process comprises the following steps:

S21, for the two purchase texts, splitting the paragraphs of each purchase text into sentences; then inputting a sentence a of the first purchase text into network A and a sentence b of the second purchase text into network B of the sentence-bert model; network A then outputs all the word vectors of sentence a, and network B outputs all the word vectors of sentence b;

the sentence-bert model is an improved model derived from the bert model; the network structure of the sentence-bert model consists of two bert networks that have the same structure and the same parameters and are arranged in parallel, and the two bert networks are denoted network A and network B as above;
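The "same structure, same parameters" property of the two parallel networks can be illustrated with a toy stand-in. The sketch below does not implement a bert network; `ToyEncoder` is a hypothetical token-to-vector lookup used only to show that network A and network B can be two references to one set of parameters.

```python
from typing import Dict, List

Vector = List[float]


class ToyEncoder:
    """Hypothetical stand-in for one bert network: a token-to-vector lookup."""

    def __init__(self, table: Dict[str, Vector]) -> None:
        self.table = table

    def encode(self, tokens: List[str]) -> List[Vector]:
        # One word vector per token; unknown tokens map to a zero vector.
        dim = len(next(iter(self.table.values())))
        return [self.table.get(tok, [0.0] * dim) for tok in tokens]


# Made-up parameters shared by both branches.
shared = ToyEncoder({"bid": [1.0, 0.0], "price": [0.0, 1.0]})
network_a = shared  # branch for sentences of the first purchase text
network_b = shared  # branch for sentences of the second purchase text

words_a = network_a.encode(["bid", "price"])
words_b = network_b.encode(["price", "unknown"])
print(words_a, words_b)
```

Because both branches point to the same object, any parameter update made through one branch is automatically seen by the other, which is one simple way to keep the two networks identical.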

S22, in the sentence vector generation module, carrying out an averaging operation on all the word vectors of sentence a, and taking the average as the sentence vector u of the current sentence a; likewise, in the sentence vector generation module, carrying out an averaging operation on all the word vectors of sentence b, and taking the average as the sentence vector v of the current sentence b; then, in the cosine similarity calculation module, calculating the cosine similarity of the sentence vector u and the sentence vector v;

S23, repeating steps S21 to S22 for the other sentences of the first purchase text and the other sentences of the second purchase text;

a loss function module is built in the purchase file auditing system; each time a cosine similarity is calculated, the loss function module applies mean squared error (MSE) processing to all the cosine similarities accumulated so far, and the resulting mean square error serves as the loss function for correcting the parameters of the sentence-bert model;

S3, in the correlation calculation module, calculating the Spearman coefficients or Pearson coefficients between all sentence vectors u and sentence vectors v obtained from the two purchase texts, taking the Spearman coefficients or Pearson coefficients as the similarity index of the purchase texts, and checking whether the correlation between the two purchase texts, as expressed by the Spearman coefficients or the Pearson coefficients, reaches a set numerical range; if not, the two purchase texts are not similar; if so, the two purchase texts are highly similar and carry a risk of bid collusion or bid rigging; after the comparison is completed, the purchase texts are returned to and stored in the comprehensive database;

S4, establishing an expert knowledge base, an inference engine and a human-computer interface in the purchase file auditing system;

the expert knowledge base is respectively connected with a human-computer interface and an inference machine, and the human-computer interface provides an operation interface for carrying out iterative updating on the expert knowledge base in a manual modification mode;

an auditing rule file for checking whether the content of the purchase text is in compliance or not is stored in the expert knowledge base, and the auditing rule file comprises a plurality of items of auditing rule information; the step of carrying out iterative update on the expert knowledge base refers to modifying and updating audit rule information, and when the current audit rule information is inconsistent with the knowledge experience rule of the audit expert, the audit expert carries out editing operation on the audit rule information through a human-computer interface to realize iterative update;
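The iterative update of audit rule information can be pictured as an edit operation that keeps a trace of earlier versions. The source does not describe the storage format of the audit rule file, so the structure below (rule identifiers mapped to rule text, plus an edit history) and all names in it are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class KnowledgeBase:
    """Hypothetical in-memory expert knowledge base of audit rule information."""

    rules: Dict[str, str] = field(default_factory=dict)
    # Each entry: (rule id, previous text or None, new text, editor).
    history: List[Tuple[str, Optional[str], str, str]] = field(default_factory=list)

    def update_rule(self, rule_id: str, new_text: str, editor: str) -> None:
        """Add or modify one audit rule and record who changed it."""
        previous = self.rules.get(rule_id)
        self.rules[rule_id] = new_text
        self.history.append((rule_id, previous, new_text, editor))


# Made-up rule texts edited by a hypothetical audit expert.
kb = KnowledgeBase()
kb.update_rule("R1", "budget must not exceed 1,000,000", "expert_01")
kb.update_rule("R1", "budget must not exceed 1,200,000", "expert_01")
print(kb.rules["R1"])
```

Keeping the history makes it possible to review or roll back an expert's edits, although the source does not state that such a record is kept.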

the inference machine is connected with the comprehensive database and used for calling the purchase file processed based on the improved natural language and the audit rule file of the expert knowledge base from the comprehensive database, matching the content and the data information of each field of the purchase file with a plurality of items of audit rule information in the audit rule file of the expert knowledge base and judging whether the content of the purchase file conforms to the audit rule or not;

S5, performing a compliance audit on the purchase file that has undergone the improved natural language processing, specifically comprising the following steps:

starting the inference engine, matching the purchase file to be checked against all the audit rule information in the expert knowledge base one by one, and marking each judgment result in which the content or data information of a field of the purchase file does not conform to an audit rule; the inference engine can adopt either forward inference or backward inference for the judgment;
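The one-by-one matching can be sketched as a loop over rules, each rule checking one field of the purchase file. The rule format, the field names and the two example rules are all hypothetical; the source does not specify how audit rule information is encoded, and the sketch does not implement a full forward or backward chaining engine.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class AuditRule:
    """Hypothetical audit rule: a check applied to one field of a purchase file."""

    rule_id: str
    field_name: str
    check: Callable[[Any], bool]


def run_inference(document: Dict[str, Any], rules: List[AuditRule]) -> List[str]:
    """Match the document against every rule and return the violated rule ids."""
    violations = []
    for rule in rules:
        value = document.get(rule.field_name)
        # A missing field is treated as a violation in this sketch.
        if value is None or not rule.check(value):
            violations.append(rule.rule_id)
    return violations


# Made-up rules and field values.
rules = [
    AuditRule("R1", "budget", lambda v: v <= 1_000_000),
    AuditRule("R2", "bidder_count", lambda v: v >= 3),
]
marked = run_inference({"budget": 1_200_000, "bidder_count": 3}, rules)
print(marked)  # ['R1']
```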

S6, an interpreter is built in the purchase file auditing system and is connected with the human-computer interface and the inference engine respectively; the interpreter is used for converting the inference engine's marks of whether the content of the purchase file conforms to the audit rules into natural language, and then outputting the natural language to the human-computer interface for display.
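The interpreter's job of turning rule marks into readable text can be sketched as simple template filling. The mark format (rule id mapped to a pass/fail flag), the rule descriptions and the sentence template are assumptions; the source only says the marks are converted into natural language.

```python
from typing import Dict, List


def explain(marks: Dict[str, bool], descriptions: Dict[str, str]) -> List[str]:
    """Render each rule mark as one natural-language sentence."""
    lines = []
    for rule_id, passed in marks.items():
        desc = descriptions.get(rule_id, "no description available")
        status = "complies with" if passed else "does not comply with"
        lines.append(f"The purchase file {status} rule {rule_id}: {desc}.")
    return lines


# Made-up marks and descriptions for two hypothetical rules.
report = explain(
    {"R1": False, "R2": True},
    {"R1": "budget must not exceed the approved ceiling",
     "R2": "at least three bidders are required"},
)
for line in report:
    print(line)
```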

Compared with the prior art, the purchasing file inspection method based on the improved natural language processing has the beneficial effects that:

in this embodiment, based on improved natural language processing, after the unstructured purchase file is converted into a structured purchase text, the sentence-bert model is combined with the Spearman coefficient or the Pearson coefficient between sentence vectors to determine the similarity of the purchase texts; compared with traditional NLP models, this improves the accuracy of text similarity analysis and speeds up the analysis; in addition, the sentence-bert model occupies fewer computer memory resources and has higher performance; the expert knowledge base and the inference engine built in this embodiment match the audit rule information against the content and data information of each field of the purchase file, thereby realizing a further compliance audit of the purchase file; a way of iteratively updating the audit rule information in the expert knowledge base is provided, and the text similarity judgment and the compliance audit of the purchase file are combined.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A purchase document inspection method based on improved natural language processing is characterized by comprising the following steps:

building a comprehensive database for storing purchase files in a purchase file auditing system, taking out from the comprehensive database the multiple purchase files to be audited for file similarity, then performing OCR (optical character recognition) preprocessing on all the purchase files, and generating the corresponding multiple structured purchase texts;

grouping all the purchase texts in pairs, generating word vectors for each pair with a sentence-bert model, converting the word vectors into sentence vectors, performing cosine similarity calculation on the sentence vectors, and performing correlation comparison on the sentence vectors to obtain a similarity comparison based on improved natural language processing; the sentence-bert model consists of two bert networks that have the same structure and the same parameters and are arranged in parallel;

an expert knowledge base and an inference machine are set up in a purchase file auditing system, the expert knowledge base is connected with the inference machine, and then the inference machine is started to judge whether the purchase file is in compliance; an audit rule file for checking whether the content of the purchase text is in compliance or not is stored in the expert knowledge base, wherein the audit rule file comprises a plurality of items of audit rule information; the inference machine is used for matching the contents and data information of each field of the purchase file with a plurality of items of audit rule information in the expert knowledge base audit rule file and judging whether the contents of the purchase file accord with the audit rule or not.

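2. The purchase file inspection method based on improved natural language processing of claim 1, wherein the process of generating the word vectors by the sentence-bert model comprises:

for the two purchase texts, splitting the paragraphs of each purchase text into sentences; then inputting the sentences of the first purchase text and the sentences of the second purchase text into the two bert networks of the sentence-bert model respectively, to obtain all the word vectors of the sentences of each purchase text.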

3. The method of claim 2 wherein said converting the word vector into a sentence vector comprises:

and carrying out an averaging operation on all the word vectors of the sentence, and taking an averaging result as a sentence vector u of the current sentence.

4. The purchase file inspection method based on improved natural language processing of claim 3, wherein the cosine similarity of the sentence vectors is calculated as follows: the two purchase texts are passed through the sentence-bert model simultaneously to generate word vectors, the word vectors are converted into sentence vectors to obtain two sentence vectors, and the cosine similarity of the two sentence vectors is calculated.

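5. The purchase file inspection method based on improved natural language processing of claim 4, characterized in that loss function construction is further performed after the cosine similarity calculation is performed on the sentence vectors;

the process of constructing the loss function comprises the following steps:

calculating the mean square error by taking the cosine similarities of all the sentence vectors as samples, and taking the mean square error as the loss function of the sentence-bert model for adjusting the parameters of the sentence-bert model.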

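6. The purchase file inspection method based on improved natural language processing of claim 5, wherein the process of performing the correlation comparison on the sentence vectors comprises:

calculating the Spearman coefficients or Pearson coefficients between all sentence vectors u and sentence vectors v obtained from the two purchase texts, and then judging whether the Spearman coefficients or the Pearson coefficients reach a set numerical range; if not, the two purchase texts are not similar; if so, the two purchase texts are similar.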

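7. The purchase file inspection method based on improved natural language processing of claim 6, wherein the process of starting the inference engine to output the result of whether the purchase file is compliant comprises:

matching the purchase file to be checked against all the audit rule information in the expert knowledge base one by one, and marking each judgment result in which the content or data information of a field of the purchase file does not conform to an audit rule.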

8. The method for inspecting procurement files based on improved natural language processing as claimed in claim 7, characterized in that after the inference engine is started to judge whether the procurement files are in compliance, an interpreter is built in the procurement file auditing system; the interpreter is connected with the inference machine and used for converting the mark of the inference machine on whether the content of the purchase file conforms to the audit rule into natural language and then outputting and displaying the natural language.