CN115982324A - Purchase file inspection method based on improved natural language processing - Google Patents

Publication number
CN115982324A
Authority
CN
China
Prior art keywords
purchase
file
natural language
sentence
texts
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310265680.9A
Other languages
Chinese (zh)
Inventor
吴志刚
龙佽飞
梁燕君
陈楚云
黄康君
李筱菁
冯指明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-03-20
Filing date
2023-03-20
Publication date
2023-04-18
Application filed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202310265680.9A
Publication of CN115982324A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a purchase file inspection method based on improved natural language processing. OCR preprocessing is performed on all purchase files to generate a corresponding structured purchase text for each file. The purchase texts are then compared in pairs: a Sentence-BERT model generates word vectors, the word vectors are converted into sentence vectors, the cosine similarity of the sentence vectors is calculated, and a correlation comparison is performed on the sentence vectors, yielding a similarity comparison based on improved natural language processing. An expert knowledge base and an inference engine are set up in the purchase file auditing system and connected to each other, and the inference engine is then started to judge whether each purchase file is compliant. Compared with traditional NLP text processing methods, the method improves the accuracy of text similarity analysis and speeds up the analysis.

Description

Purchase file inspection method based on improved natural language processing
Technical Field
The invention belongs to the technical field of electronic document auditing systems, and in particular relates to a purchase file inspection method based on improved natural language processing.
Background
When electronic documents are processed automatically, it is common to scan them with OCR and then, drawing on knowledge graphs from different fields, to use natural language processing (NLP) to normalize and intelligently process their unstructured content. These techniques are widely used in file auditing systems, where judging the similarity between multiple files is a common auditing requirement.
When an existing document auditing system judges document similarity, it extracts keywords from the bidding documents based on head words and extracts high-frequency words from the text information, then performs a difference analysis on the keywords and the high-frequency words, so that a warning is raised when documents are highly similar and is fed back to the auditors.
However, existing file auditing systems lack a compliance auditing method for electronic purchase files. No rule base or expert knowledge base has been formulated for purchase files from the perspective of possible collusive bidding, and no step lets experts and auditors take part interactively in training and error correction, so the similarity judgment for purchase files can deviate from the actual situation. The prior art uses a traditional multi-layer Transformer structure, whose ability to extract text features is not strong enough, whose training is slow, which cannot be parallelized, and which does not clearly distinguish the semantics of words and phrases within the same sentence pattern. In addition, the multi-layer Transformer has a long training time, a large number of network parameters, a large storage footprint and a low prediction speed, and it is not suited to generation tasks, to processing very long texts, or to tasks that need only shallow semantics.
Disclosure of Invention
In order to overcome one or more defects and shortcomings in the prior art, the invention aims to provide a purchase file inspection method based on improved natural language processing, which performs semantic similarity analysis on purchase files to judge whether any non-compliant purchasing has occurred.
In order to achieve the above object, the present invention adopts the following technical means.
A purchase file inspection method based on improved natural language processing comprises the following steps:
building, in a purchase file auditing system, a comprehensive database for storing purchase files; retrieving from the comprehensive database the multiple purchase files whose similarity is to be audited; then performing OCR (optical character recognition) preprocessing on all of these purchase files to generate a corresponding structured purchase text for each;
comparing all purchase texts in pairs: generating word vectors with a Sentence-BERT model, converting the word vectors into sentence vectors, calculating the cosine similarity of the sentence vectors, and performing a correlation comparison on the sentence vectors, thereby obtaining a similarity comparison based on improved natural language processing; the Sentence-BERT model consists of two BERT networks that have identical structures, share the same parameters and are arranged in parallel;
setting up an expert knowledge base and an inference engine in the purchase file auditing system, connecting the expert knowledge base to the inference engine, and then starting the inference engine to judge whether the purchase file is compliant; the expert knowledge base stores an audit rule file for checking whether the content of the purchase text is compliant, and the audit rule file comprises multiple items of audit rule information; the inference engine matches the content and data information of each field of the purchase file against the items of audit rule information in the audit rule file of the expert knowledge base and judges whether the content of the purchase file conforms to the audit rules.
Preferably, the process of generating the word vectors with the Sentence-BERT model includes:
for the two purchase texts of a pair, splitting the paragraphs of each purchase text into sentences; then inputting the sentences of the first purchase text and the sentences of the second purchase text into the two BERT networks of the Sentence-BERT model respectively, to obtain all word vectors of the sentences of each purchase text.
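As an illustrative sketch only, the sentence-splitting part of this step could be written as follows in Python. The function name `split_sentences` and the chosen set of sentence-ending punctuation are assumptions made for illustration; the source does not specify how paragraphs are split, and the BERT networks themselves are not shown.

```python
import re

# One sentence = a run of non-terminator characters plus an optional terminator.
# The terminator set (Chinese and Western marks, without the ASCII full stop so
# that decimals such as "12.5" stay intact) is an assumption of this sketch.
_SENTENCE = re.compile(r"[^。！？；!?;]+[。！？；!?;]?")


def split_sentences(paragraph: str) -> list[str]:
    """Split one paragraph of a purchase text into trimmed, non-empty sentences."""
    return [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]
```

Each returned sentence would then be passed to one of the two BERT networks. A production system would likely need a more careful splitter for headings, tables and numbered clauses.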
Further, the process of converting the word vectors into a sentence vector includes:
averaging all word vectors of the sentence, and taking the averaged result as the sentence vector u of the current sentence.
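A minimal sketch of this averaging (mean pooling) step, assuming that each word vector is a plain list of floats of equal length; the function name `mean_pool` is hypothetical and not taken from the source.

```python
def mean_pool(word_vectors: list[list[float]]) -> list[float]:
    """Average the word vectors of one sentence component by component."""
    if not word_vectors:
        raise ValueError("a sentence needs at least one word vector")
    dim = len(word_vectors[0])
    if any(len(vec) != dim for vec in word_vectors):
        raise ValueError("all word vectors must have the same dimension")
    count = len(word_vectors)
    # Component i of the sentence vector is the mean of component i over all words.
    return [sum(vec[i] for vec in word_vectors) / count for i in range(dim)]
```

For example, two word vectors `[1.0, 2.0]` and `[3.0, 4.0]` pool to the sentence vector `[2.0, 3.0]`.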
Further, the cosine similarity of the sentence vectors is calculated as follows: word vectors are generated from the two purchase texts simultaneously by the Sentence-BERT model and converted into sentence vectors, giving two sentence vectors whose cosine similarity is then calculated.
Further, after the cosine similarity of the sentence vectors has been calculated, a loss function is also constructed;
the process of constructing the loss function comprises the following step:
calculating the mean square error with the cosine similarities of all sentence vectors as samples, and taking the mean square error as the loss function of the Sentence-BERT model for adjusting the parameters of the Sentence-BERT model.
Further, the process of performing the correlation comparison on the sentence vectors includes:
calculating the Spearman coefficient or the Pearson coefficient between all sentence vectors u and sentence vectors v obtained from the two purchase texts, and then judging whether the coefficient falls within a set numerical range; if not, the two purchase texts are not similar; if so, the two purchase texts are similar.
Further, the process of starting the inference engine to output the result of whether the purchase file is compliant includes:
matching the purchase file under test against every item of audit rule information in the expert knowledge base one by one, and marking each judgment result in which the content and data information of a field of the purchase file do not conform to an audit rule.
Further, after the inference engine has been started to judge whether the purchase file is compliant, an interpreter is set up in the purchase file auditing system; the interpreter is connected to the inference engine and converts the inference engine's marks, which indicate whether the content of the purchase file conforms to the audit rules, into natural language for output and display.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
based on improved natural language processing and aiming at correlation calculation, the method determines the similarity of purchase texts by using a Sentence-BERT model in combination with the Spearman coefficient or the Pearson coefficient between sentence vectors, which, compared with traditional NLP models, improves the accuracy of text similarity analysis and speeds up the analysis; in addition, the Sentence-BERT model occupies less computer memory and performs better; the expert knowledge base and the inference engine set up by the invention check the content and data information of each field of the purchase file by matching them against the audit rule information, which further realizes compliance auditing of the purchase file, so that text similarity judgment and compliance auditing of purchase files are combined.
Drawings
FIG. 1 is a schematic flow diagram of the purchase file inspection method based on improved natural language processing according to the invention;
FIG. 2 is a schematic structural diagram of the purchase file auditing system in FIG. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments thereof. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
As shown in FIG. 1 and FIG. 2, the purchase file inspection method based on improved natural language processing of the present embodiment includes the following steps:
S1, building, in a purchase file auditing system, a comprehensive database for storing purchase files; retrieving from the comprehensive database the multiple purchase files whose similarity is to be audited; performing OCR (optical character recognition) preprocessing on all of these purchase files to generate a corresponding structured purchase text for each; and then executing the next step;
in this embodiment, the purchase file is preferably in PDF format and/or an image format; when a purchase file uploaded by the user is in neither PDF nor an image format, the purchase file auditing system first converts it into PDF and/or an image format and then stores it in the comprehensive database;
S2, building, in the purchase file auditing system, a similarity analysis module for analyzing text similarity, the similarity analysis module comprising a Sentence-BERT model, a sentence vector generation module, a cosine similarity calculation module and a correlation calculation module connected in sequence; and performing the similarity comparison based on improved natural language processing on all purchase texts in groups of two;
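The grouping of purchase texts in twos can be sketched as below, assuming that every unordered pair of texts is compared exactly once; the source does not state the pairing strategy, and the name `pair_up` and the file identifiers are hypothetical.

```python
from itertools import combinations


def pair_up(texts: dict[str, str]) -> list[tuple[str, str]]:
    """Return every unordered pair of purchase-text identifiers, in sorted order."""
    # n texts yield n * (n - 1) / 2 pairs, so the cost grows quadratically.
    return list(combinations(sorted(texts), 2))


# Hypothetical identifiers mapped to their OCR-derived purchase texts.
example = {"bid_A": "text of A", "bid_B": "text of B", "bid_C": "text of C"}
pairs = pair_up(example)  # [("bid_A", "bid_B"), ("bid_A", "bid_C"), ("bid_B", "bid_C")]
```

Each pair would then go through the sentence-level comparison described in the following steps.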
the specific process of the similarity comparison based on improved natural language processing comprises the following steps:
S21, for the two purchase texts of a pair, splitting the paragraphs of each purchase text into sentences; then inputting a sentence a of the first purchase text and a sentence b of the second purchase text into network A and network B of the Sentence-BERT model respectively; network A then outputs all word vectors of sentence a, and network B outputs all word vectors of sentence b;
the Sentence-BERT model is derived from the BERT model through improvement; its network structure consists of two BERT networks that have identical structures, share the same parameters and are arranged in parallel, denoted network A and network B as above;
S22, in the sentence vector generation module, averaging all word vectors of sentence a and taking the averaged result as the sentence vector u of sentence a; likewise, averaging all word vectors of sentence b and taking the averaged result as the sentence vector v of sentence b; then, in the cosine similarity calculation module, calculating the cosine similarity of sentence vector u and sentence vector v;
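The cosine similarity of u and v is the dot product of the two vectors divided by the product of their Euclidean norms. A minimal sketch in plain Python follows; the function name and the choice to return 0.0 for a zero vector are assumptions of this sketch, not statements from the source.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between sentence vectors u and v, in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)
```

Values near 1 indicate sentences whose vectors point in almost the same direction, values near 0 indicate unrelated sentences.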
S23, repeating steps S21 to S22 for the different sentences of the first purchase text and the different sentences of the second purchase text;
a loss function module is built in the purchase file auditing system; each time a cosine similarity is calculated, the loss function module performs MSE (mean square error) processing on all cosine similarities accumulated so far, and the resulting mean square error is used as the loss function to correct the parameters of the Sentence-BERT model;
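The source does not state what the cosine similarities are compared against when the mean square error is computed. The sketch below assumes that a reference similarity label is available for each sentence pair, as in the regression objective commonly used to train Sentence-BERT models; the function name and the sample values are hypothetical.

```python
def mse_loss(predicted: list[float], target: list[float]) -> float:
    """Mean square error between predicted cosine similarities and reference labels."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("predicted and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Two sentence pairs: predicted cosine similarities vs. assumed reference labels.
loss = mse_loss([0.9, 0.2], [1.0, 0.0])  # (0.01 + 0.04) / 2
```

In an actual training loop this value would be back-propagated through both BERT networks, which share the same parameters; that part is outside the scope of this sketch.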
S3, in the correlation calculation module, calculating the Spearman coefficient or the Pearson coefficient between all sentence vectors u and sentence vectors v obtained from the two purchase texts, taking the coefficient as the similarity index of the purchase texts, and checking whether the correlation it shows between the two purchase texts falls within a set numerical range; if not, the two purchase texts are not similar; if so, the two purchase texts are highly similar and there is a risk of collusive bidding or bid rigging; after the comparison is completed, the purchase texts are returned to and stored in the comprehensive database;
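A minimal sketch of the two correlation coefficients and the range check, written in plain Python. The source does not say how the sentence vectors are arranged into the two sequences being correlated, so the functions simply take two equal-length numeric sequences; the threshold of 0.8 is a placeholder for the "set numerical range" and is not taken from the source.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman coefficient: the Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


def is_similar(coefficient: float, threshold: float = 0.8) -> bool:
    """Assumed decision rule: similar when the coefficient reaches the threshold."""
    return coefficient >= threshold
```

Spearman depends only on rank order, so it is less sensitive to outliers than Pearson, which measures linear association.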
S4, setting up an expert knowledge base, an inference engine and a human-machine interface in the purchase file auditing system;
the expert knowledge base is connected to the human-machine interface and to the inference engine, and the human-machine interface provides an operation interface through which the expert knowledge base is iteratively updated by manual modification;
the expert knowledge base stores an audit rule file for checking whether the content of the purchase text is compliant, and the audit rule file comprises multiple items of audit rule information; iteratively updating the expert knowledge base means modifying and updating the audit rule information: when the current audit rule information is inconsistent with the knowledge and experience of an audit expert, the audit expert edits the audit rule information through the human-machine interface, thereby realizing the iterative update;
the inference engine is connected to the comprehensive database and retrieves from it the purchase file that has undergone the improved natural language processing, together with the audit rule file of the expert knowledge base; it matches the content and data information of each field of the purchase file against the items of audit rule information in the audit rule file of the expert knowledge base and judges whether the content of the purchase file conforms to the audit rules;
S5, performing a compliance audit on the purchase file that has undergone the improved natural language processing, specifically comprising the following step:
starting the inference engine, matching the purchase file under test against every item of audit rule information in the expert knowledge base one by one, and marking each judgment result in which the content and data information of a field of the purchase file do not conform to an audit rule; the inference engine may use either forward inference or backward inference to make the judgment;
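The one-by-one rule matching can be sketched as a single forward pass over the rules. Everything concrete below is hypothetical: the `AuditRule` structure, the field names (`bid_price`, `budget`, `bidder_count`) and the two sample rules are invented for illustration, since the source does not disclose the format of the audit rule file.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AuditRule:
    rule_id: str
    description: str
    check: Callable[[dict], bool]  # returns True when the file satisfies the rule


def run_inference(fields: dict, rules: list[AuditRule]) -> list[str]:
    """Match the file's fields against every rule; return IDs of violated rules."""
    return [rule.rule_id for rule in rules if not rule.check(fields)]


# Hypothetical contents of the expert knowledge base.
RULES = [
    AuditRule("R1", "the bid price must not exceed the budget",
              lambda f: f["bid_price"] <= f["budget"]),
    AuditRule("R2", "at least three bidders are required",
              lambda f: f["bidder_count"] >= 3),
]

violations = run_inference(
    {"bid_price": 120, "budget": 100, "bidder_count": 3}, RULES
)  # only R1 is violated
```

This is a flat match rather than a full forward-chaining engine: no rule derives new facts for other rules. Keeping rules as data makes it straightforward for an audit expert to add or edit them through an interface.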
S6, building an interpreter in the purchase file auditing system, the interpreter being connected to the human-machine interface and to the inference engine; the interpreter converts the inference engine's marks, which indicate whether the content of the purchase file conforms to the audit rules, into natural language and outputs it to the human-machine interface for display.
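One simple way to turn the inference engine's marks into readable text is template filling, sketched below. The function name `explain`, the message wording and the input shape (a list of rule ID and description pairs) are assumptions; the source does not describe how the interpreter produces natural language.

```python
def explain(file_name: str, violated: list[tuple[str, str]]) -> str:
    """Render the marked rule violations of one purchase file as plain sentences."""
    if not violated:
        return f"{file_name}: no audit rule violations were found."
    lines = [f"{file_name}: {len(violated)} audit rule violation(s) were found."]
    lines += [
        f"- Rule {rule_id} is not satisfied: {description}."
        for rule_id, description in violated
    ]
    return "\n".join(lines)


# Hypothetical file name and rule, for illustration only.
report = explain("bid.pdf", [("R1", "the bid price must not exceed the budget")])
```

The resulting string would be passed to the human-machine interface for display.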
Compared with the prior art, the purchase file inspection method based on improved natural language processing of this embodiment has the following beneficial effects:
in this embodiment, based on improved natural language processing, after the unstructured purchase file has been converted into a structured purchase text, the Sentence-BERT model is used in combination with the Spearman coefficient or the Pearson coefficient between sentence vectors to determine the similarity of the purchase texts, which, compared with traditional NLP models, improves the accuracy of text similarity analysis and speeds up the analysis; in addition, the Sentence-BERT model occupies less computer memory and performs better; the expert knowledge base and the inference engine built in this embodiment check the content and data information of each field of the purchase file by matching them against the audit rule information, which further realizes compliance auditing of the purchase file; a way of iteratively updating the audit rule information in the expert knowledge base is provided, and text similarity judgment and compliance auditing of purchase files are combined.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. A purchase file inspection method based on improved natural language processing, characterized by comprising the following steps:
building, in a purchase file auditing system, a comprehensive database for storing purchase files; retrieving from the comprehensive database the multiple purchase files whose similarity is to be audited; then performing OCR (optical character recognition) preprocessing on all of these purchase files to generate a corresponding structured purchase text for each;
comparing all purchase texts in pairs: generating word vectors with a Sentence-BERT model, converting the word vectors into sentence vectors, calculating the cosine similarity of the sentence vectors, and performing a correlation comparison on the sentence vectors, thereby obtaining a similarity comparison based on improved natural language processing; the Sentence-BERT model consists of two BERT networks that have identical structures, share the same parameters and are arranged in parallel;
setting up an expert knowledge base and an inference engine in the purchase file auditing system, connecting the expert knowledge base to the inference engine, and then starting the inference engine to judge whether the purchase file is compliant; the expert knowledge base stores an audit rule file for checking whether the content of the purchase text is compliant, and the audit rule file comprises multiple items of audit rule information; the inference engine matches the content and data information of each field of the purchase file against the items of audit rule information in the audit rule file of the expert knowledge base and judges whether the content of the purchase file conforms to the audit rules.
2. The purchase file inspection method based on improved natural language processing of claim 1, wherein the process of generating the word vectors with the Sentence-BERT model comprises:
for the two purchase texts of a pair, splitting the paragraphs of each purchase text into sentences; then inputting the sentences of the first purchase text and the sentences of the second purchase text into the two BERT networks of the Sentence-BERT model respectively, to obtain all word vectors of the sentences of each purchase text.
3. The purchase file inspection method based on improved natural language processing of claim 2, wherein the process of converting the word vectors into a sentence vector comprises:
averaging all word vectors of the sentence, and taking the averaged result as the sentence vector u of the current sentence.
4. The purchase file inspection method based on improved natural language processing of claim 3, wherein the cosine similarity of the sentence vectors is calculated as follows: word vectors are generated from the two purchase texts simultaneously by the Sentence-BERT model and converted into sentence vectors, giving two sentence vectors whose cosine similarity is then calculated.
5. The purchase file inspection method based on improved natural language processing of claim 4, characterized in that a loss function is also constructed after the cosine similarity of the sentence vectors has been calculated;
the process of constructing the loss function comprises the following step:
calculating the mean square error with the cosine similarities of all sentence vectors as samples, and taking the mean square error as the loss function of the Sentence-BERT model for adjusting the parameters of the Sentence-BERT model.
6. The purchase file inspection method based on improved natural language processing of claim 5, wherein the process of performing the correlation comparison on the sentence vectors comprises:
calculating the Spearman coefficient or the Pearson coefficient between all sentence vectors u and sentence vectors v obtained from the two purchase texts, and then judging whether the coefficient falls within a set numerical range; if not, the two purchase texts are not similar; if so, the two purchase texts are similar.
7. The purchase file inspection method based on improved natural language processing of claim 6, wherein the process of starting the inference engine to output the result of whether the purchase file is compliant comprises:
matching the purchase file under test against every item of audit rule information in the expert knowledge base one by one, and marking each judgment result in which the content and data information of a field of the purchase file do not conform to an audit rule.
8. The purchase file inspection method based on improved natural language processing of claim 7, characterized in that after the inference engine has been started to judge whether the purchase file is compliant, an interpreter is built in the purchase file auditing system; the interpreter is connected to the inference engine and converts the inference engine's marks, which indicate whether the content of the purchase file conforms to the audit rules, into natural language for output and display.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310265680.9A CN115982324A (en) 2023-03-20 2023-03-20 Purchase file inspection method based on improved natural language processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310265680.9A CN115982324A (en) 2023-03-20 2023-03-20 Purchase file inspection method based on improved natural language processing

Publications (1)

Publication Number Publication Date
CN115982324A true CN115982324A (en) 2023-04-18

Family

ID=85970534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310265680.9A Pending CN115982324A (en) 2023-03-20 2023-03-20 Purchase file inspection method based on improved natural language processing

Country Status (1)

Country Link
CN (1) CN115982324A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191903A (en) * 2019-12-24 2020-05-22 中科金审(北京)科技有限公司 Early warning method and device for monitoring bid document, server and storage medium
CN113435182A (en) * 2021-07-21 2021-09-24 唯品会(广州)软件有限公司 Method, device and equipment for detecting conflict of classification labels in natural language processing
CN114332872A (en) * 2022-03-14 2022-04-12 四川国路安数据技术有限公司 Contract document fault-tolerant information extraction method based on graph attention network
CN115062148A (en) * 2022-06-23 2022-09-16 广东国义信息科技有限公司 Database-based risk control method
CN115689696A (en) * 2022-11-03 2023-02-03 安徽皖电招标有限公司 Intelligent bid evaluation method and system based on artificial intelligence technology

Similar Documents

Publication Publication Date Title
CN109670191B (en) Calibration optimization method and device for machine translation and electronic equipment
CN110298033B (en) Keyword corpus labeling training extraction system
CN112069811B (en) Electronic text event extraction method with multi-task interaction enhancement
WO2015043075A1 (en) Microblog-oriented emotional entity search system
CN112100401B (en) Knowledge graph construction method, device, equipment and storage medium for science and technology services
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN112163424A (en) Data labeling method, device, equipment and medium
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN114547072A (en) Method, system, equipment and storage medium for converting natural language query into SQL
CN115964273A (en) Spacecraft test script automatic generation method based on deep learning
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN111858842A (en) Judicial case screening method based on LDA topic model
CN116304748A (en) Text similarity calculation method, system, equipment and medium
CN111091009B (en) Document association auditing method based on semantic analysis
CN111753067A (en) Innovative assessment method, device and equipment for technical background text
CN112287119B (en) Knowledge graph generation method for extracting relevant information of online resources
CN112989803B (en) Entity link prediction method based on topic vector learning
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN114202038B (en) Crowdsourcing defect classification method based on DBM deep learning
CN116150010A (en) Test case classification method based on ship feature labels
CN116383414A (en) Intelligent file review system and method based on carbon check knowledge graph
CN115982324A (en) Purchase file inspection method based on improved natural language processing
CN114117069A (en) Semantic understanding method and system for intelligent knowledge graph question answering
CN111209375B (en) Universal clause and document matching method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230418