CN115982324A - Purchase file inspection method based on improved natural language processing
- Publication number
- CN115982324A (application number CN202310265680.9A)
- Authority
- CN
- China
- Prior art keywords
- purchase
- file
- natural language
- sentence
- texts
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a purchase file inspection method based on improved natural language processing. OCR preprocessing is performed on all purchase files to generate a corresponding structured purchase text for each file. The purchase texts are grouped in pairs; for each pair, a Sentence-BERT model generates word vectors, the word vectors are converted into sentence vectors, the cosine similarity of the sentence vectors is calculated, and a correlation comparison is performed on the sentence vectors, which completes the similarity comparison based on improved natural language processing. An expert knowledge base and an inference engine are set up in the purchase file auditing system and connected to each other, and the inference engine is then started to judge whether the purchase file is compliant. Compared with traditional NLP text processing methods, the method improves the accuracy of text similarity analysis and speeds up the analysis.
Description
Technical Field
The invention belongs to the technical field of electronic document auditing systems, and particularly relates to a purchasing file inspection method based on improved natural language processing.
Background
When electronic documents are processed automatically, it is common practice to scan them with OCR and then apply natural language processing (NLP), supported by knowledge graphs of the relevant fields, to normalize and intelligently process their unstructured content. These techniques are widely used in file auditing systems, where judging the similarity among multiple files is a common auditing requirement.
When an existing document auditing system judges document similarity, it extracts keywords of the bidding documents by central-word calculation, extracts high-frequency words from the text information, and performs a difference analysis on the keywords and high-frequency words, so that a warning is raised when documents are highly similar and fed back to the auditors.
However, existing file auditing systems lack a compliance auditing method for electronic purchase files. No rule base or expert experience knowledge base has been formulated for purchase files from the perspective of possible bid collusion, and there is no step in which experts and auditors participate interactively in training and error correction, so the similarity judgment for purchase files can differ from the actual situation. The prior art uses a traditional multi-layer Transformer structure, whose ability to extract text features is not strong enough, whose training is slow, which cannot be parallelized, and which does not clearly distinguish the semantics of words and phrases within the same sentence pattern. In addition, the multi-layer Transformer takes long to train, has many network parameters, occupies much space, predicts slowly, and is not suitable for generation tasks, for very long texts, or for tasks that need only shallow semantics.
Disclosure of Invention
In order to overcome one or more defects and shortcomings in the prior art, the invention aims to provide a purchase file inspection method based on improved natural language processing, which is used for performing semantic similarity analysis on a purchase file to judge whether a purchase non-compliance condition exists.
In order to achieve the above object, the present invention adopts the following technical means.
A purchasing file inspection method based on improved natural language processing comprises the following steps:
building a comprehensive database for storing purchase files in a purchase file auditing system, retrieving from the comprehensive database the multiple purchase files to be audited for similarity, then performing OCR (optical character recognition) preprocessing on all the purchase files to generate a corresponding structured purchase text for each;
grouping all the purchase texts in pairs and, for each pair, generating word vectors with a Sentence-BERT model, converting the word vectors into sentence vectors, calculating the cosine similarity of the sentence vectors, and performing a correlation comparison on the sentence vectors, thereby completing the similarity comparison based on improved natural language processing; the Sentence-BERT model consists of two BERT networks arranged in parallel that have the same structure and the same parameters;
setting up an expert knowledge base and an inference engine in the purchase file auditing system, connecting the expert knowledge base with the inference engine, and then starting the inference engine to judge whether the purchase file is compliant; the expert knowledge base stores an audit rule file for checking whether the content of the purchase text is compliant, and the audit rule file contains multiple items of audit rule information; the inference engine matches the content and data of each field of the purchase file against the items of audit rule information in the audit rule file of the expert knowledge base and judges whether the content of the purchase file conforms to the audit rules.
Preferably, the process of generating word vectors with the Sentence-BERT model includes:
for the two purchase texts of a pair, splitting the paragraphs of each text into sentences; then inputting the sentences of the first purchase text and the sentences of the second purchase text into the two BERT networks of the Sentence-BERT model respectively, to obtain all the word vectors of the sentences of each purchase text.
Further, the process of converting the word vectors into a sentence vector includes:
averaging all the word vectors of a sentence and taking the average as the sentence vector u of that sentence.
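The averaging step described above can be sketched as follows. This is a minimal illustration only: the function name `mean_pool` and the toy three-dimensional vectors are assumptions, and real word vectors would come from the BERT networks rather than being written by hand.

```python
from typing import List


def mean_pool(word_vectors: List[List[float]]) -> List[float]:
    """Average a sentence's word vectors element-wise into one sentence vector."""
    if not word_vectors:
        raise ValueError("a sentence needs at least one word vector")
    dim = len(word_vectors[0])
    if any(len(w) != dim for w in word_vectors):
        raise ValueError("all word vectors must share one dimension")
    n = len(word_vectors)
    # Component i of the sentence vector is the mean of component i over all words.
    return [sum(w[i] for w in word_vectors) / n for i in range(dim)]


# Toy word vectors standing in for model outputs (hypothetical values).
sentence_a_words = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
u = mean_pool(sentence_a_words)
```

Averaging keeps the sentence vector in the same space and dimension as the word vectors, so sentences of different lengths remain directly comparable.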
Further, the cosine similarity of the sentence vectors is calculated as follows: word vectors are generated from the two purchase texts simultaneously by the Sentence-BERT model and converted into sentence vectors, giving two sentence vectors whose cosine similarity is then calculated.
Further, after cosine similarity calculation is carried out on the sentence vectors, loss function construction is also carried out;
the process of constructing the loss function comprises the following steps:
calculating the mean square error with the cosine similarities of all the sentence vectors as samples, and using the mean square error as the loss function of the Sentence-BERT model to adjust its parameters.
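One way to write the loss out is shown below. It is a sketch under an assumption: the text does not say what the cosine similarities are compared against, so the reference similarity label $y_i$ for sentence pair $i$ and the pair count $N$ are introduced here purely for illustration.

```latex
\[
\cos(u_i, v_i) = \frac{u_i \cdot v_i}{\lVert u_i \rVert \, \lVert v_i \rVert},
\qquad
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \bigl( \cos(u_i, v_i) - y_i \bigr)^{2}
\]
```

Here $u_i$ and $v_i$ are the sentence vectors of the $i$-th sentence pair; the loss is zero only when every computed cosine similarity equals its assumed reference value.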
Further, the process of performing correlation comparison on sentence vectors includes:
calculating the Spearman coefficient or Pearson coefficient between all sentence vectors u and sentence vectors v obtained from the two purchase texts, and then checking whether the Spearman coefficient or Pearson coefficient falls within a set numerical range; if not, the two purchase texts are not similar; if so, the two purchase texts are similar.
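The correlation comparison can be sketched in plain Python as below. Several points are assumptions rather than details from the text: the coefficients are computed component-wise between one vector u and one vector v, the "set numerical range" is modelled as a single lower bound, and the value 0.8 and the names `pearson`, `spearman`, `average_ranks` and `is_similar` are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient: Pearson correlation of the rank-transformed vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def is_similar(u: Sequence[float], v: Sequence[float],
               threshold: float = 0.8, method: str = "spearman") -> bool:
    """Assumed decision rule: similar when the coefficient reaches the threshold."""
    coef = spearman(u, v) if method == "spearman" else pearson(u, v)
    return coef >= threshold
```

Spearman depends only on the ordering of the components, so it tolerates monotonic but non-linear relationships that would lower the Pearson coefficient.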
Further, the process of starting the inference engine to output the result of whether the purchase file is in compliance includes:
matching the purchase file under test against all the audit rule information in the expert knowledge base one by one, and marking each judgment result in which the content or data of a field of the purchase file does not conform to an audit rule.
Further, after the inference engine is started to judge whether the purchase file is compliant, an interpreter is set up in the purchase file auditing system; the interpreter is connected with the inference engine and converts the inference engine's marks of whether the content of the purchase file conforms to the audit rules into natural language for output and display.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
based on improved natural language processing, the method uses a Sentence-BERT model together with the Spearman coefficient or Pearson coefficient between sentence vectors, with the correlation calculation as the target, to determine the similarity of purchase texts; compared with traditional NLP models, this improves the accuracy of text similarity analysis and speeds up the analysis; in addition, the Sentence-BERT model occupies less computer memory and has higher performance; the expert knowledge base and inference engine set up by the invention match the audit rule information against the content and data of each field of the purchase file, which realizes a further compliance audit of the purchase file and combines the text similarity judgment of purchase files with their compliance audit.
Drawings
FIG. 1 is a schematic flow diagram of the overall purchase file inspection method based on improved natural language processing according to the invention;
FIG. 2 is a schematic structural diagram of the purchase file auditing system in FIG. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments thereof. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
As shown in fig. 1 and fig. 2, the method for checking a purchase document based on improved natural language processing of the present embodiment includes the following steps:
S1, building a comprehensive database for storing purchase files in a purchase file auditing system, retrieving from the comprehensive database the multiple purchase files to be audited for similarity, performing OCR (optical character recognition) preprocessing on all the purchase files to generate a corresponding structured purchase text for each, and then executing the next step;
in this embodiment, the purchase files are preferably in PDF format and/or picture format; when a purchase file uploaded by a user is in neither PDF nor picture format, the purchase file auditing system first converts it into PDF format and/or picture format and then stores it in the comprehensive database;
S2, building in the purchase file auditing system a similarity analysis module for analyzing text similarity, the module comprising a Sentence-BERT model, a sentence vector generation module, a cosine similarity calculation module and a correlation calculation module connected in sequence; grouping all the purchase texts in pairs and performing the similarity comparison based on improved natural language processing on each pair;
the similarity comparison based on improved natural language processing specifically comprises the following steps:
S21, for the two purchase texts of a pair, splitting the paragraphs of each purchase text into sentences; then inputting a sentence a from the first purchase text and a sentence b from the second purchase text into network A and network B of the Sentence-BERT model respectively; network A then outputs all the word vectors of sentence a, and network B outputs all the word vectors of sentence b;
the Sentence-BERT model is an improved model derived from the BERT model; its network structure consists of two BERT networks arranged in parallel that have the same structure and the same parameters, denoted network A and network B as above;
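The pairing of purchase texts in step S2 and the sentence splitting in step S21 can be sketched as follows. This is an illustration under assumptions: the punctuation-based splitting rule is deliberately naive (it would also split at decimal points), the function names and document identifiers are hypothetical, and comparing every sentence of one text with every sentence of the other is only one possible reading of how sentences a and b are chosen.

```python
import re
from itertools import combinations, product
from typing import Dict, List, Tuple

# Naive rule (an assumption): a sentence ends at Chinese or Western end punctuation.
_SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def split_sentences(text: str) -> List[str]:
    """Split a paragraph into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def pairwise_groups(texts: Dict[str, str]) -> List[Tuple[str, str]]:
    """Group all purchase texts two at a time, each unordered pair once."""
    return list(combinations(sorted(texts), 2))


def sentence_pairs(first: List[str], second: List[str]) -> List[Tuple[str, str]]:
    """Every (sentence a, sentence b) combination across the two texts."""
    return list(product(first, second))


# Hypothetical purchase texts keyed by document id.
texts = {
    "bid_a": "The bidder submits a quote. Delivery takes 30 days!",
    "bid_b": "The supplier submits a quotation. Delivery takes 45 days.",
    "bid_c": "Payment is due on acceptance.",
}
for left, right in pairwise_groups(texts):
    pairs = sentence_pairs(split_sentences(texts[left]), split_sentences(texts[right]))
```

With n purchase texts this yields n(n-1)/2 document pairs, which is worth keeping in mind when many files are audited at once.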
S22, in the sentence vector generation module, averaging all the word vectors of sentence a and taking the average as the sentence vector u of sentence a; likewise averaging all the word vectors of sentence b and taking the average as the sentence vector v of sentence b; then, in the cosine similarity calculation module, calculating the cosine similarity of sentence vector u and sentence vector v;
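The cosine similarity of two sentence vectors can be computed as in the sketch below. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the described system.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy sentence vectors (hypothetical values).
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]
similarity = cosine_similarity(u, v)  # parallel vectors give a value close to 1
```

The result lies between -1 and 1 and ignores vector length, so it compares only the direction of the two sentence vectors.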
S23, repeating steps S21 to S22 for the other sentences of the first purchase text and the other sentences of the second purchase text;
building a loss function module in the purchase file auditing system; each time a cosine similarity is calculated, the loss function module performs mean square error (MSE) processing on all the cosine similarities accumulated so far, and the resulting mean square error is used as the loss function to correct the parameters of the Sentence-BERT model;
S3, in the correlation calculation module, calculating the Spearman coefficient or Pearson coefficient between all sentence vectors u and sentence vectors v obtained from the two purchase texts, taking it as the similarity index of the purchase texts, and checking whether the correlation it indicates falls within a set numerical range; if not, the two purchase texts are not similar; if so, the two purchase texts are highly similar and there is a risk of collusive bidding or bid rigging; after the comparison is completed, the purchase texts are returned to and stored in the comprehensive database;
S4, setting up an expert knowledge base, an inference engine and a human-machine interface in the purchase file auditing system;
the expert knowledge base is connected with the human-machine interface and with the inference engine, and the human-machine interface provides an operation interface for iteratively updating the expert knowledge base by manual modification;
the expert knowledge base stores an audit rule file for checking whether the content of the purchase text is compliant, and the audit rule file contains multiple items of audit rule information; iteratively updating the expert knowledge base means modifying and updating the audit rule information: when the current audit rule information is inconsistent with the knowledge and experience of an audit expert, the audit expert edits the audit rule information through the human-machine interface;
the inference engine is connected with the comprehensive database and is used for retrieving the purchase file processed by the improved natural language processing from the comprehensive database together with the audit rule file of the expert knowledge base, matching the content and data of each field of the purchase file against the items of audit rule information in the audit rule file, and judging whether the content of the purchase file conforms to the audit rules;
S5, performing a compliance audit on the purchase file processed by the improved natural language processing, specifically comprising:
starting the inference engine, matching the purchase file under test against all the audit rule information in the expert knowledge base one by one, and marking each judgment result in which the content or data of a field of the purchase file does not conform to an audit rule; the inference engine may use forward inference or backward inference for the judgment;
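The one-by-one matching can be sketched as a simple data-driven loop in the spirit of forward inference. Everything concrete here is hypothetical: the `AuditRule` structure, the rule ids, the field names and the two example rules are invented for illustration, since the text does not specify what the audit rule information contains.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AuditRule:
    rule_id: str
    description: str
    # Returns True when the purchase file's field values comply with the rule.
    check: Callable[[Dict[str, object]], bool]


def run_inference(fields: Dict[str, object], rules: List[AuditRule]) -> List[AuditRule]:
    """Match the file's fields against every rule and collect the violated rules."""
    violations = []
    for rule in rules:
        if not rule.check(fields):
            violations.append(rule)  # mark the non-conforming judgment result
    return violations


# Hypothetical expert knowledge base content.
RULES = [
    AuditRule("R1", "budget must not exceed the approved ceiling",
              lambda f: f["budget"] <= f["approved_ceiling"]),
    AuditRule("R2", "bidding period must be at least 20 days",
              lambda f: f["bidding_days"] >= 20),
]

# Hypothetical field values extracted from one purchase file.
purchase_file = {"budget": 120000, "approved_ceiling": 100000, "bidding_days": 25}
flagged = run_inference(purchase_file, RULES)
```

Keeping each rule as data plus a predicate means an audit expert's edits to the rule list change the audit outcome without touching the matching loop.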
S6, setting up an interpreter in the purchase file auditing system, connected with the human-machine interface and with the inference engine; the interpreter converts the inference engine's marks of whether the content of the purchase file conforms to the audit rules into natural language and outputs it to the human-machine interface for display.
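The interpreter's role can be illustrated with a small template-based sketch. The function name `explain`, the sentence template, and the representation of marks as a mapping from rule id to a compliant/non-compliant flag are all assumptions; the text only states that marks are converted into natural language.

```python
from typing import Dict, List


def explain(file_name: str, marks: Dict[str, bool],
            descriptions: Dict[str, str]) -> List[str]:
    """Turn per-rule compliance marks into readable sentences for display."""
    lines = []
    for rule_id, complies in marks.items():
        status = "complies with" if complies else "violates"
        lines.append(f"{file_name} {status} rule {rule_id}: {descriptions[rule_id]}.")
    return lines


# Hypothetical marks produced by the inference step for one purchase file.
descriptions = {
    "R1": "budget must not exceed the approved ceiling",
    "R2": "bidding period must be at least 20 days",
}
report = explain("tender-001", {"R1": False, "R2": True}, descriptions)
```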
Compared with the prior art, the purchasing file inspection method based on the improved natural language processing has the beneficial effects that:
in this embodiment, based on improved natural language processing, after the unstructured purchase files are converted into structured purchase texts, the Sentence-BERT model is combined with the Spearman coefficient or Pearson coefficient between sentence vectors to determine the similarity of the purchase texts, which improves the accuracy of text similarity analysis and speeds up the analysis compared with traditional NLP models; in addition, the Sentence-BERT model occupies less computer memory and has higher performance; the expert knowledge base and inference engine built in this embodiment match the audit rule information against the content and data of each field of the purchase file, which realizes a further compliance audit of the purchase file, provides a way to iteratively update the audit rule information in the expert knowledge base, and combines the text similarity judgment of purchase files with their compliance audit.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (8)
1. A purchase file inspection method based on improved natural language processing, characterized by comprising the following steps:
building a comprehensive database for storing purchase files in a purchase file auditing system, retrieving from the comprehensive database the multiple purchase files to be audited for similarity, then performing OCR (optical character recognition) preprocessing on all the purchase files to generate a corresponding structured purchase text for each;
grouping all the purchase texts in pairs and, for each pair, generating word vectors with a Sentence-BERT model, converting the word vectors into sentence vectors, calculating the cosine similarity of the sentence vectors, and performing a correlation comparison on the sentence vectors, thereby completing the similarity comparison based on improved natural language processing; the Sentence-BERT model consists of two BERT networks arranged in parallel that have the same structure and the same parameters;
setting up an expert knowledge base and an inference engine in the purchase file auditing system, connecting the expert knowledge base with the inference engine, and then starting the inference engine to judge whether the purchase file is compliant; the expert knowledge base stores an audit rule file for checking whether the content of the purchase text is compliant, and the audit rule file contains multiple items of audit rule information; the inference engine matches the content and data of each field of the purchase file against the items of audit rule information in the audit rule file of the expert knowledge base and judges whether the content of the purchase file conforms to the audit rules.
2. The purchase file inspection method based on improved natural language processing according to claim 1, wherein the process of generating word vectors with the Sentence-BERT model comprises:
for the two purchase texts of a pair, splitting the paragraphs of each text into sentences; then inputting the sentences of the first purchase text and the sentences of the second purchase text into the two BERT networks of the Sentence-BERT model respectively, to obtain all the word vectors of the sentences of each purchase text.
3. The purchase file inspection method based on improved natural language processing according to claim 2, wherein the process of converting the word vectors into a sentence vector comprises:
averaging all the word vectors of a sentence and taking the average as the sentence vector u of that sentence.
4. The purchase file inspection method based on improved natural language processing according to claim 3, wherein the cosine similarity of the sentence vectors is calculated as follows: word vectors are generated from the two purchase texts simultaneously by the Sentence-BERT model and converted into sentence vectors, giving two sentence vectors whose cosine similarity is then calculated.
5. The purchase file inspection method based on improved natural language processing according to claim 4, wherein a loss function is further constructed after the cosine similarity of the sentence vectors is calculated;
the process of constructing the loss function comprises:
calculating the mean square error with the cosine similarities of all the sentence vectors as samples, and using the mean square error as the loss function of the Sentence-BERT model to adjust its parameters.
6. The purchase file inspection method based on improved natural language processing according to claim 5, wherein the process of performing the correlation comparison on the sentence vectors comprises:
calculating the Spearman coefficient or Pearson coefficient between all sentence vectors u and sentence vectors v obtained from the two purchase texts, and then checking whether the Spearman coefficient or Pearson coefficient falls within a set numerical range; if not, the two purchase texts are not similar; if so, the two purchase texts are similar.
7. The purchase file inspection method based on improved natural language processing according to claim 6, wherein the process of starting the inference engine to output the result of whether the purchase file is compliant comprises:
matching the purchase file under test against all the audit rule information in the expert knowledge base one by one, and marking each judgment result in which the content or data of a field of the purchase file does not conform to an audit rule.
8. The purchase file inspection method based on improved natural language processing according to claim 7, wherein after the inference engine is started to judge whether the purchase file is compliant, an interpreter is set up in the purchase file auditing system; the interpreter is connected with the inference engine and converts the inference engine's marks of whether the content of the purchase file conforms to the audit rules into natural language for output and display.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310265680.9A CN115982324A (en) | 2023-03-20 | 2023-03-20 | Purchase file inspection method based on improved natural language processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310265680.9A CN115982324A (en) | 2023-03-20 | 2023-03-20 | Purchase file inspection method based on improved natural language processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115982324A (en) | 2023-04-18 |
Family
ID=85970534
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310265680.9A Pending CN115982324A (en) | 2023-03-20 | 2023-03-20 | Purchase file inspection method based on improved natural language processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115982324A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111191903A (en) * | 2019-12-24 | 2020-05-22 | 中科金审(北京)科技有限公司 | Early warning method and device for monitoring bid document, server and storage medium |
CN113435182A (en) * | 2021-07-21 | 2021-09-24 | 唯品会(广州)软件有限公司 | Method, device and equipment for detecting conflict of classification labels in natural language processing |
CN114332872A (en) * | 2022-03-14 | 2022-04-12 | 四川国路安数据技术有限公司 | Contract document fault-tolerant information extraction method based on graph attention network |
CN115062148A (en) * | 2022-06-23 | 2022-09-16 | 广东国义信息科技有限公司 | Database-based risk control method |
CN115689696A (en) * | 2022-11-03 | 2023-02-03 | 安徽皖电招标有限公司 | Intelligent bid evaluation method and system based on artificial intelligence technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20230418 |