CN116129433A

CN116129433A - Risk element repeatability comparison method

Info

Publication number: CN116129433A
Application number: CN202310122409.XA
Authority: CN
Inventors: 黄维那
Original assignee: Sichuan Shudi Intelligent Zhongdeng Technology Co ltd
Current assignee: Sichuan Shudi Intelligent Zhongdeng Technology Co ltd
Priority date: 2023-02-15
Filing date: 2023-02-15
Publication date: 2023-05-16

Abstract

The invention relates to the technical field of electronic information, and in particular to a risk element repeatability comparison method, which comprises: acquiring a text to be identified from a registration file and converting it into text data; extracting key element information from the text data through an element extraction algorithm to obtain extracted data; comparing basic data with the extracted data, and judging through a fuzzy matching algorithm whether the basic data contains repeated content, to obtain a judgment result; and inputting the extracted data and the judgment result into a risk judgment model, which outputs the data repeatability and the data risk rating. The element extraction algorithm improves the accuracy of key element information extraction. The fuzziness calculation considers both the edit distance between the basic data and the data to be compared and the form of the edits, and builds the edit form into the edit distance algorithm logic, which improves the accuracy of the fuzziness calculation and the semantic calculation effect.

Description

Risk element repeatability comparison method

Technical Field

The invention relates to the technical field of electronic information, and in particular to a risk element repeatability comparison method.

Background

Risk element extraction aims to extract content that may carry risk from the registration files of the accounts receivable mortgage and transfer business and the financing lease business, so as to improve the efficiency with which auditors review those files; it is a basic natural language processing task. Common keyword extraction algorithms include TF-IDF, TextRank, YAKE, AutoPhrase, KeyBERT and the like.

TF-IDF ranks keywords by combining the term frequency (TF) of a word in a sentence with its inverse document frequency (IDF) in the corpus. TextRank constructs a word graph and then ranks keywords with the PageRank algorithm. YAKE is a keyword extraction algorithm that integrates several statistical indicators, and AutoPhrase uses a knowledge base for distantly supervised learning. Algorithms such as TF-IDF, TextRank and YAKE can quickly extract reasonably reliable keywords, but they often produce many noise words (non-keywords misidentified as keywords), because they ignore the semantic features of the text. Semantics-based keyword extraction algorithms such as KeyBERT generate candidate words by computing N-grams, but this is computationally very inefficient, and the anisotropy of BERT leads to a poor semantic calculation effect.

Disclosure of Invention

The invention aims to provide a risk element repeatability comparison method, so as to solve the problem that existing risk element extraction methods have a poor semantic calculation effect.

In order to achieve the above object, the present invention provides a risk element repeatability comparison method, comprising the following steps:

acquiring a text to be identified from a registration file;

converting the text to be identified into text data by OCR technology;

extracting key element information from the text data through an element extraction algorithm to obtain extracted data;

comparing basic data with the extracted data, and judging through a fuzzy matching algorithm whether the basic data contains repeated content, to obtain a judgment result;

and inputting the extracted data and the judgment result into a risk judgment model, and outputting the data repeatability and the data risk rating.

The fuzzy matching algorithm comprises a common substring algorithm, an edit distance algorithm and a threshold rule.

The step of comparing the basic data with the extracted data and judging through the fuzzy matching algorithm whether the basic data contains repeated content, to obtain the judgment result, comprises the following steps:

computing a user common substring of the basic data and the extracted data through the common substring algorithm;

computing an edit distance between the basic data and the extracted data through the edit distance algorithm;

checking whether the user common substring and the edit distance both satisfy the fuzziness requirement, to obtain a comparison result;

and calculating the fuzziness from the comparison result through the threshold rule, to obtain the judgment result.

The key element information includes a target invoice number, a contract name, and a project company name.

Extracting the target invoice number from the text data comprises the following steps:

determining that an invoice description form is present in the text data;

and extracting the text matching the invoice description form from the text data using a regular expression, to obtain the target invoice number.

The invoice description form comprises any one of an invoice number, an invoice number and invoice information.

Extracting the contract number from the text data comprises the following step:

extracting the contract number from the text data using a regular expression.

Extracting the contract name from the text data comprises the following steps:

determining that a contract description form is present in the text data;

and extracting the text matching the contract description form from the text data using a regular expression, to obtain the contract name.

The contract description form comprises any one of a contract, a construction project and an agreement.

Extracting the project company name from the text data comprises the following steps:

determining that a project company name is present in the registration file;

and extracting the project company name from the text data using a regular expression.

In the risk element repeatability comparison method of the invention, a text to be identified is acquired from a registration file; the text to be identified is converted into text data by OCR technology; key element information is extracted from the text data through an element extraction algorithm to obtain extracted data; basic data is compared with the extracted data, and a fuzzy matching algorithm judges whether the basic data contains repeated content, to obtain a judgment result; and the extracted data and the judgment result are input into a risk judgment model, which outputs the data repeatability and the data risk rating. The element extraction algorithm improves the accuracy of key element information extraction. The fuzziness calculation considers both the edit distance between the basic data and the data to be compared and the form of the edits, and builds the edit form into the edit distance algorithm logic. This improves the accuracy of the fuzziness calculation and the semantic calculation effect, and thereby solves the problem that existing risk element extraction methods have a poor semantic calculation effect.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention, and a person skilled in the art may obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart of the risk element repeatability comparison method provided by the invention.

Fig. 2 is a flowchart of the step of comparing the basic data with the extracted data and judging through a fuzzy matching algorithm whether the basic data contains repeated content, to obtain a judgment result.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.

Referring to Figs. 1 and 2, the present invention provides a risk element repeatability comparison method, which includes the following steps:

S1, acquiring a text to be identified from a registration file;

Specifically, the text to be identified comprises a description and attachments. In this embodiment, the description and the attachments are used as the text to be identified, and risk elements are extracted from them. The attachments are returned from the Zhongdeng registration database, and the description is the one recorded in the Zhongdeng registration certificate.

S2, converting the text to be identified into text data through OCR technology;

S3, extracting key element information from the text data through an element extraction algorithm to obtain extracted data;

Specifically, the key element information includes a target invoice number, a contract name, and a project company name.

Extracting the target invoice number from the text data comprises: determining that an invoice description form is present in the text data; and extracting the text matching the invoice description form from the text data using a regular expression, to obtain the target invoice number. The invoice description form comprises any one of an invoice number, an invoice number and invoice information.

Extracting the contract number from the text data comprises: extracting the contract number from the text data using a regular expression.

Extracting the contract name from the text data comprises: determining that a contract description form is present in the text data; and extracting the text matching the contract description form from the text data using a regular expression, to obtain the contract name. The contract description form comprises any one of a contract, a construction project and an agreement.

Extracting the project company name from the text data comprises the following steps:

determining that a project company name is present in the registration file; and extracting the project company name from the text data using a regular expression.
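As a non-limiting illustration, the regular-expression extraction of step S3 may be sketched in Python as follows. The label wording, the digit lengths and the function names are hypothetical English stand-ins chosen for the example; the actual description forms and patterns used in the embodiment are not disclosed here.

```python
import re

# Hypothetical English stand-ins for the "invoice description forms".
INVOICE_LABEL = re.compile(r"invoice (?:number|information)", re.IGNORECASE)
# Assumed shape: a label, an optional colon, then 8 to 20 digits.
INVOICE_NUMBER = re.compile(
    r"invoice (?:number|information)\s*[:：]?\s*(\d{8,20})", re.IGNORECASE
)
# Assumed shape: a contract-number label followed by letters, digits or hyphens.
CONTRACT_NUMBER = re.compile(
    r"contract (?:number|no\.?)\s*[:：]?\s*([A-Za-z0-9-]+)", re.IGNORECASE
)


def extract_invoice_numbers(text: str) -> list[str]:
    """Return invoice numbers, but only if a description form is present."""
    if INVOICE_LABEL.search(text) is None:
        return []
    return INVOICE_NUMBER.findall(text)


def extract_contract_numbers(text: str) -> list[str]:
    """Return contract numbers found directly by regular expression."""
    return CONTRACT_NUMBER.findall(text)
```

The contract name and the project company name could be handled in the same two-stage way (presence check, then pattern extraction), with patterns tuned to the real registration data.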

S4, comparing basic data with the extracted data, and judging through a fuzzy matching algorithm whether the basic data contains repeated content, to obtain a judgment result;

Specifically, the fuzzy matching algorithm comprises a common substring algorithm, an edit distance algorithm and a threshold rule.

This comparing and judging step comprises the following steps:

S41, computing a user common substring of the basic data and the extracted data through the common substring algorithm;

Specifically, the common substring algorithm works as follows:

A matrix m is constructed whose number of rows is the length of string 1 plus 1 and whose number of columns is the length of string 2 plus 1. The extra row and column simplify the calculation: without them, the first row and the first column would each need a separate loop, and with them that loop can be dropped. The padding mainly serves the formula m[i+1][j+1] = m[i][j] + 1.
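As a non-limiting illustration, the matrix construction described above corresponds to the standard dynamic-programming computation of the longest common substring, in which the formula is applied whenever the two current characters are equal. A minimal Python sketch follows; the function name and the choice to return the first longest match are assumptions of the example.

```python
def longest_common_substring(s1: str, s2: str) -> str:
    """Return the longest contiguous substring shared by s1 and s2."""
    # (len(s1) + 1) rows and (len(s2) + 1) columns; row 0 and column 0
    # stay at zero, so no separate loop is needed for the borders.
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    best_len = 0
    best_end = 0  # exclusive end index of the best match within s1
    for i, c1 in enumerate(s1):
        for j, c2 in enumerate(s2):
            if c1 == c2:
                # Extend the common run that ended at the previous characters.
                m[i + 1][j + 1] = m[i][j] + 1
                if m[i + 1][j + 1] > best_len:
                    best_len = m[i + 1][j + 1]
                    best_end = i + 1
    return s1[best_end - best_len:best_end]
```

For the strings 123 and 123456789, the sketch returns 123.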

S42, computing an edit distance between the basic data and the extracted data through the edit distance algorithm;

Specifically, the edit distance algorithm works as follows:

The edit distance between two strings is the minimum number of editing operations required to convert one into the other; the more edits are needed, the more the two strings differ. The permitted editing operations are replacing one character with another, inserting one character, and deleting one character.
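As a non-limiting illustration, an edit distance with exactly these three operations is the classical Levenshtein distance, which may be sketched in Python as follows. The sketch assigns a cost of 1 to every operation and does not attempt to reproduce any special handling of the edit form, which the description does not specify.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to turn s1 into s2 (Levenshtein distance)."""
    # prev[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning i characters of s1 into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete c1
                    curr[j - 1] + 1,     # insert c2
                    prev[j - 1] + cost,  # replace c1 with c2, or keep it
                )
            )
        prev = curr
    return prev[-1]
```

Keeping only two rows at a time limits memory to the length of the second string, which is convenient when many field values are compared.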

S43, checking whether the user common substring and the edit distance both satisfy the fuzziness requirement, to obtain a comparison result;

S44, calculating the fuzziness from the comparison result through the threshold rule, to obtain the judgment result.

Specifically, the threshold rule is:

The fuzziness is calculated from the combined judgment of the user common substring and the edit distance. For example, when one of two strings is a substring of the other, such as 123 and 123456789, the proportion of the length accounted for by the substring is determined.
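As a non-limiting illustration, one possible reading of the threshold rule is sketched below in Python. The threshold values `min_ratio` and `max_distance`, the function names, and the way the two criteria are combined are all assumptions made for the example, since the description leaves them to specific requirements. The common substring is found here with the standard library's `difflib.SequenceMatcher` for brevity.

```python
from difflib import SequenceMatcher


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def containment_ratio(shorter: str, longer: str) -> float:
    """Share of the longer string's length taken up by the contained one."""
    return len(shorter) / len(longer)


def is_duplicate(base: str, candidate: str,
                 min_ratio: float = 0.8, max_distance: int = 2) -> bool:
    """Hypothetical threshold rule; both thresholds are illustrative."""
    if not base or not candidate:
        return False
    shorter, longer = sorted((base, candidate), key=len)
    if shorter in longer:
        # One string contains the other: judge by the length proportion.
        return containment_ratio(shorter, longer) >= min_ratio
    match = SequenceMatcher(None, base, candidate, autojunk=False).find_longest_match(
        0, len(base), 0, len(candidate)
    )
    # Otherwise both criteria must hold at the same time.
    return (match.size / len(longer) >= min_ratio
            and edit_distance(base, candidate) <= max_distance)
```

Under these illustrative thresholds, 123 and 123456789 are not treated as duplicates because the substring covers only one third of the longer string, whereas 12345678 and 123456789 are.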

S5, inputting the extracted data and the judgment result into a risk judgment model, and outputting the data repeatability and the data risk rating.

Specifically, the rules of the risk judgment model are as follows:

1. The data to be compared is divided into elements and the field contents corresponding to those elements. For example, element A has corresponding content a; an element is generally a category of content, such as invoice numbers or other kinds of numbers.

2. The priority of the field contents is defined; for example, a and b are high, c and d are medium, and e and f are low.

3. Based on the result of the element extraction method, it is determined whether elements A and B are present in the identified data.

4. Based on the result of the risk element repeatability comparison method, it is judged whether the contents of the data submitted by the user, such as a, b, c, d, e and f, have similar character strings in the identified data.

5. The repetition risk level is judged from which of a, b, c, d, e and f hit similar character strings and how many hit. For example, if at least one of the high-priority fields a and b hits, the current data is judged to be high risk; if c, d, e and f all hit, it is also high risk; and if only the medium-priority fields c and d hit, it is medium risk.

6. The risk level from rule 5 is then recalculated according to whether the elements are present. For example, if the risk level determined from the field contents is medium and element A or B is present, the risk level becomes high.
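As a non-limiting illustration, rules 2, 5 and 6 may be sketched in Python as follows, using the example fields a to f and elements A and B. The "low" and "none" outcomes, and the treatment of any medium-priority hit as medium risk, are assumptions added to make the sketch total; the rules above state only the high and medium cases.

```python
HIGH, MEDIUM, LOW, NONE = "high", "medium", "low", "none"

# Rule 2: example priorities of the field contents.
HIGH_FIELDS = {"a", "b"}
MEDIUM_FIELDS = {"c", "d"}
LOW_FIELDS = {"e", "f"}


def rate_risk(hit_fields: set[str], elements_present: set[str]) -> str:
    """Return a risk level from the fields that hit similar strings
    (rule 5) and the elements found in the identified data (rule 6)."""
    if hit_fields & HIGH_FIELDS:
        level = HIGH  # at least one high-priority field hits
    elif (MEDIUM_FIELDS | LOW_FIELDS) <= hit_fields:
        level = HIGH  # c, d, e and f all hit
    elif hit_fields & MEDIUM_FIELDS:
        level = MEDIUM  # medium-priority hits without the cases above
    elif hit_fields:
        level = LOW  # assumption: only low-priority fields hit
    else:
        level = NONE  # assumption: nothing hit
    # Rule 6: a medium level is raised when element A or B is present.
    if level == MEDIUM and elements_present & {"A", "B"}:
        level = HIGH
    return level
```

For example, hits on c and d alone give a medium rating, which becomes high once element A is found in the identified data.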

After the extracted data and the judgment result are input into the risk judgment model and the data repeatability and the data risk rating are output, the method further comprises:

S6, classifying the data repeatability and the data risk rating by output time, storing them in the corresponding databases, and establishing a search keyword library for each database;

S7, extracting keywords from a data extraction request to obtain target keywords, and matching the target keywords in the search keyword library to obtain the target data.
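As a non-limiting illustration, steps S6 and S7 may be sketched in Python with in-memory dictionaries standing in for the databases and the search keyword libraries. The class name `ResultStore`, the per-day grouping, and the whitespace-based keyword extraction are all hypothetical simplifications; the description does not specify the storage layout or the keyword extraction technique.

```python
from collections import defaultdict
from datetime import date


class ResultStore:
    """Toy store: one 'database' and one keyword library per output date."""

    def __init__(self) -> None:
        self.records_by_day = defaultdict(list)
        self.keyword_index = defaultdict(lambda: defaultdict(list))

    def save(self, day: date, keywords: list[str],
             repeatability: bool, risk_rating: str) -> None:
        """S6: store the outputs by output time and index them by keyword."""
        record = {"repeatability": repeatability, "risk_rating": risk_rating}
        self.records_by_day[day].append(record)
        for keyword in keywords:
            self.keyword_index[day][keyword.lower()].append(record)

    def search(self, request: str) -> list[dict]:
        """S7: extract target keywords from the request and match them."""
        targets = request.lower().split()  # naive keyword extraction
        results, seen = [], set()
        for index in self.keyword_index.values():
            for keyword in targets:
                for record in index.get(keyword, []):
                    if id(record) not in seen:  # skip records already found
                        seen.add(id(record))
                        results.append(record)
        return results
```

A request such as "find invoice 12345678" would then return every stored record indexed under "invoice" or "12345678", each at most once.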

The beneficial effects are as follows:

1. The element extraction rules are constructed and verified against a large amount of real business data, so that the various forms in which keywords appear in real registration data are fully captured when the rules are designed, which improves the accuracy of keyword extraction.

2. The fuzziness calculation considers both the edit distance between the basic data and the data to be compared and the form of the edits, and builds the edit form into the edit distance algorithm logic, which improves the accuracy of the fuzziness calculation.

3. The fuzzy matching algorithm is not restricted to a particular industry and can be applied flexibly, with the threshold for deciding repetition set according to specific requirements.

The above discloses only a preferred embodiment of the risk element repeatability comparison method, and the scope of the invention should not be construed as limited thereto.

Claims

1. A risk element repeatability comparison method, characterized by comprising the following steps:

acquiring a text to be identified from a registration file;

converting the text to be identified into text data by OCR technology;

2. The risk element repeatability comparison method according to claim 1,

3. The risk element repeatability comparison method according to claim 2,

4. The risk element repeatability comparison method according to claim 3,

the key element information includes a target invoice number, a contract name, and a project company name.

5. The risk element repeatability comparison method according to claim 4,

determining that an invoice description form is present in the text data;

6. The risk element repeatability comparison method according to claim 5,

7. The risk element repeatability comparison method according to claim 6, wherein extracting the contract number from the text data comprises:

extracting the contract number from the text data using a regular expression.

8. The risk element repeatability comparison method according to claim 7,

determining that a contract description form is present in the text data;

9. The risk element repeatability comparison method according to claim 8,

the contract description form comprises any one of a contract, a construction project and an agreement.

10. The risk element repeatability comparison method according to claim 9,

determining that a project company name is present in the registration file;