CN116129433A - Risk element repeatability comparison method - Google Patents
- Publication number
- CN116129433A (application number CN202310122409.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- text
- judging
- text data
- risk
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/048—Fuzzy inferencing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services; Handling legal documents
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention relates to the technical field of electronic information, in particular to a risk factor repeatability comparison method, which comprises: acquiring a text to be identified from a registration file and converting the text into text data; extracting key element information from the text data through an element extraction algorithm to obtain extracted data; comparing the basic data with the extracted data, and judging whether repeated contents exist in the basic data through a fuzzy matching algorithm to obtain a judging result; and inputting the extracted data and the judging result into a risk judging model, and outputting the data repeatability and the data risk rating. The element extraction algorithm improves the accuracy of key element information extraction. When the ambiguity is calculated, the method considers both the edit distance between the basic data and the data to be compared and the form of the edits, and designs both into the edit distance algorithm logic, which improves the accuracy of the ambiguity calculation and the semantic calculation effect.
Description
Technical Field
The invention relates to the technical field of electronic information, in particular to a risk factor repeatability comparison method.
Background
Risk element extraction aims to extract a group of contents that may carry risk from the registration files of the accounts receivable mortgage and transfer service and the financing lease service, so as to improve the efficiency with which auditors review those files. It is a basic natural language processing task. Common keyword extraction algorithms include TF-IDF, TextRank, YAKE, AutoPhrase, KeyBERT and the like.
TF-IDF ranks keywords by combining the inverse document frequency (IDF) of words in the corpus with the term frequency (TF) of words in sentences. TextRank constructs a word graph and then ranks keywords with the PageRank algorithm. YAKE is a keyword extraction algorithm that integrates several statistical indicators, and AutoPhrase uses a knowledge base for distantly supervised learning. Algorithms such as TF-IDF, TextRank and YAKE can quickly extract relatively reliable keywords to some extent, but they often produce a large number of noisy words (non-keywords misidentified as keywords), because they ignore the semantic features of the text. Semantics-based keyword extraction algorithms such as KeyBERT generate candidate words by computing N-grams, but this approach is computationally very inefficient, and the anisotropy of BERT embeddings leads to a poor semantic calculation effect.
Disclosure of Invention
The invention aims to provide a risk element repeatability comparison method and aims to solve the problem that the existing risk element extraction method is poor in semantic calculation effect.
In order to achieve the above object, the present invention provides a risk factor repeatability comparison method, comprising the following steps:
acquiring a text to be identified from a registration file;
converting the text to be recognized into text data by an OCR technology;
extracting key element information from the text data through an element extraction algorithm to obtain extracted data;
comparing the basic data with the extracted data, and judging whether repeated contents exist in the basic data through a fuzzy matching algorithm to obtain a judging result;
and inputting the extracted data and the judging result into a risk judging model, and outputting the data repeatability and the data risk rating.
The fuzzy matching algorithm comprises a public substring algorithm, an edit distance algorithm and a threshold rule.
Comparing the basic data with the extracted data and judging whether repeated contents exist in the basic data through a fuzzy matching algorithm to obtain a judging result comprises the following steps:
calculating the basic data and the extracted data through the public substring algorithm to obtain a user public substring;
calculating the basic data and the extracted data through the edit distance algorithm to obtain an edit distance;
checking whether the user public substring and the edit distance both satisfy the ambiguity, to obtain a comparison result;
and calculating the ambiguity based on the comparison result through the threshold rule to obtain a judging result.
Wherein the key element information includes a target invoice number, a contract name, and a project company name.
Extracting the target invoice number from the text data comprises the following steps:
determining that an invoice description form is present in the text data;
and extracting the invoice description form from the text data by using a regular expression to obtain the target invoice number.
The invoice description form comprises any one of an invoice number, an invoice number and invoice information.
Extracting the contract number from the text data comprises the following steps:
extracting the contract number from the text data by using a regular expression.
Extracting the contract name from the text data comprises the following steps:
determining that a contract description form is present in the text data;
and extracting the contract description form from the text data by using a regular expression to obtain the contract name.
The contract description form comprises any one of a contract, a construction project and an agreement.
Extracting the project company name from the text data comprises the following steps:
determining that the project company name is present in the registration file;
and extracting the project company name from the text data by using a regular expression.
In the risk factor repeatability comparison method of the invention, a text to be identified is acquired from a registration file; the text to be identified is converted into text data by an OCR technology; key element information is extracted from the text data through an element extraction algorithm to obtain extracted data; the basic data is compared with the extracted data, and whether repeated contents exist in the basic data is judged through a fuzzy matching algorithm to obtain a judging result; and the extracted data and the judging result are input into a risk judging model, which outputs the data repeatability and the data risk rating. The element extraction algorithm improves the accuracy of key element information extraction. When the ambiguity is calculated, the method considers both the edit distance between the basic data and the data to be compared and the form of the edits, and designs both into the edit distance algorithm logic. This improves the accuracy of the ambiguity calculation and the semantic calculation effect, and thereby solves the problem that existing risk element extraction methods have a poor semantic calculation effect.
Drawings
To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a risk factor repeatability comparison method provided by the invention.
Fig. 2 is a flowchart of comparing the basic data with the extracted data, and determining whether the basic data has repeated content by a fuzzy matching algorithm to obtain a determination result.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to fig. 1 to 2, the present invention provides a risk factor repeatability comparison method, which includes the following steps:
S1, acquiring a text to be identified from a registration file;
Specifically, the text to be identified comprises the description and the attachments of the text data. In this embodiment, the description and the attachments are used as the text to be identified, and the risk elements of the text data are extracted from them. The attachments are returned from the medium login database, and the description is recorded in the medium login registration certificate.
S2, converting the text to be recognized into text data through an OCR technology;
S3, extracting key element information from the text data through an element extraction algorithm to obtain extracted data;
Specifically, the key element information includes a target invoice number, a contract name, and a project company name.
Extracting the target invoice number from the text data comprises the following steps:
determining that an invoice description form is present in the text data; and extracting the invoice description form from the text data by using a regular expression to obtain the target invoice number. The invoice description form comprises any one of an invoice number, an invoice number and invoice information.
Extracting the contract number from the text data comprises the following steps:
extracting the contract number from the text data by using a regular expression.
Extracting the contract name from the text data comprises the following steps:
determining that a contract description form is present in the text data; and extracting the contract description form from the text data by using a regular expression to obtain the contract name. The contract description form comprises any one of a contract, a construction project and an agreement.
Extracting the project company name from the text data comprises the following steps:
determining that the project company name is present in the registration file; and extracting the project company name from the text data by using a regular expression.
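The extraction steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the source does not disclose its regular expressions, so the trigger phrases, number formats, English labels and the names `extract_elements` and `first_match` are all hypothetical.

```python
import re

# Hypothetical patterns: the patent does not disclose its actual regular
# expressions, trigger phrases or number formats.
INVOICE_RE = re.compile(
    r"(?:invoice number|invoice information)\s*[:：]?\s*(\d{8,20})", re.IGNORECASE
)
CONTRACT_NO_RE = re.compile(
    r"contract (?:number|no\.?)\s*[:：]?\s*([A-Z0-9][A-Z0-9-]{3,})", re.IGNORECASE
)
# A quoted title that contains one of the contract description forms.
CONTRACT_NAME_RE = re.compile(
    r'"([^"]*(?:contract|agreement|construction project)[^"]*)"', re.IGNORECASE
)
COMPANY_RE = re.compile(r"project company\s*[:：]\s*([^;\n]+)", re.IGNORECASE)


def first_match(pattern, text):
    """Return the first captured group, or None if the description form is absent."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def extract_elements(text):
    """Extract the key elements named in the description from OCR text.

    Searching for the pattern both checks that the description form is
    present and extracts the value in one step.
    """
    return {
        "invoice_number": first_match(INVOICE_RE, text),
        "contract_number": first_match(CONTRACT_NO_RE, text),
        "contract_name": first_match(CONTRACT_NAME_RE, text),
        "project_company": first_match(COMPANY_RE, text),
    }


sample = (
    'Invoice number: 04400123; contract number: HT-2023-001; '
    'under the "Equipment Purchase Contract"; '
    'project company: Example Leasing Co., Ltd.\n'
)
print(extract_elements(sample))
```

In practice the patterns would be written against the real (Chinese) registration text and validated on business data, as the description states for its own rules.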
S4, comparing the basic data with the extracted data, and judging whether repeated contents exist in the basic data through a fuzzy matching algorithm to obtain a judging result;
Specifically, the fuzzy matching algorithm comprises a public substring algorithm, an edit distance algorithm and a threshold rule.
Comparing the basic data with the extracted data and judging whether repeated contents exist in the basic data through a fuzzy matching algorithm to obtain a judging result comprises the following steps:
S41, calculating the basic data and the extracted data through the public substring algorithm to obtain a user public substring;
Specifically, the public substring algorithm works as follows:
A matrix is constructed whose number of rows is the length of string 1 plus 1 and whose number of columns is the length of string 2 plus 1. The extra row and column simplify the calculation: without them, the first row and the first column would each need a separate loop; with them, that loop can be dropped. The padding mainly serves the recurrence m[i+1][j+1] = m[i][j] + 1.
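The matrix construction described above matches the classic dynamic program for the longest common substring, in which the recurrence is applied whenever the two current characters are equal. A minimal Python sketch follows; the function name `longest_common_substring` is an assumption, and the patent may add details that are not disclosed here.

```python
def longest_common_substring(s1: str, s2: str) -> str:
    """Return one longest substring shared by s1 and s2."""
    # (len(s1) + 1) x (len(s2) + 1) matrix; row 0 and column 0 stay zero,
    # so no separate loop is needed for the first row and column.
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    best_len = 0   # length of the longest match found so far
    best_end = 0   # end position (exclusive) of that match in s1
    for i, ch1 in enumerate(s1):
        for j, ch2 in enumerate(s2):
            if ch1 == ch2:
                # Extend the match that ended at the previous characters.
                m[i + 1][j + 1] = m[i][j] + 1
                if m[i + 1][j + 1] > best_len:
                    best_len = m[i + 1][j + 1]
                    best_end = i + 1
    return s1[best_end - best_len:best_end]


print(longest_common_substring("123", "123456789"))
```

The matrix uses memory proportional to the product of the two string lengths; for short fields such as invoice or contract numbers this is unlikely to matter.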
S42, calculating the basic data and the extracted data through the edit distance algorithm to obtain an edit distance;
Specifically, the edit distance algorithm works as follows:
The edit distance between two strings is the minimum number of editing operations required to transform one string into the other; the more edits are required, the more the two strings differ. The permitted editing operations are replacing one character with another, inserting one character, and deleting one character.
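With exactly these three operations, the definition above is the standard Levenshtein distance. The sketch below computes it row by row; the function name `edit_distance` is an assumption, and the sketch does not attempt to model the "edit form" weighting that the description mentions without specifying.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string: i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


print(edit_distance("123", "123456789"))
```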
S43, checking whether the user public substring and the edit distance both satisfy the ambiguity, to obtain a comparison result;
S44, calculating the ambiguity based on the comparison result through the threshold rule to obtain a judging result.
Specifically, the threshold rule is:
The ambiguity is calculated from the combined judgment of the user public substring and the edit distance. For two strings where one is a substring of the other, such as 123 and 123456789, the ratio of the substring length to the string length is determined.
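One way to read steps S43 and S44 is sketched below in Python. It rests on several assumptions that the source does not confirm: the ratio is taken against the longer string, both conditions must hold at once, and the threshold values `min_ratio` and `max_distance` are purely illustrative. The standard-library `difflib.SequenceMatcher` stands in for the public substring algorithm.

```python
from difflib import SequenceMatcher


def common_substring_ratio(a: str, b: str) -> float:
    """Length of the longest common substring divided by the longer string's length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(a), len(b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with substitution, insertion and deletion."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_repeated(base: str, extracted: str,
                min_ratio: float = 0.8, max_distance: int = 2) -> bool:
    """Assumed threshold rule: both measures must satisfy their thresholds."""
    ratio = common_substring_ratio(base, extracted)
    distance = edit_distance(base, extracted)
    return ratio >= min_ratio and distance <= max_distance


print(common_substring_ratio("123", "123456789"))
print(is_repeated("INV-2023-0001", "INV-2023-0007"))
```

With these illustrative thresholds, 123 and 123456789 give a ratio of 3/9 and an edit distance of 6, so they are not flagged as repeated. The description notes that the threshold for determining repetition is set according to specific requirements.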
S5, inputting the extracted data and the judging result into a risk judging model, and outputting data repeatability and data risk rating.
Specifically, the rules of the risk judgment model are as follows:
1. The data to be compared is divided into two types: elements and the field contents corresponding to the elements. For example, if element A is defined, its corresponding content is a. An element generally denotes a category of content, such as invoice numbers or other kinds of numbers.
2. The priority of the field contents is defined, e.g., a and b are high, c and d are medium, e and f are low.
3. Based on the result of the element extraction method, it is determined whether the elements A and B are present in the identified data.
4. According to the result of the risk factor repeatability comparison method, it is judged whether the contents of the data submitted by the user, such as a, b, c, d, e and f, have similar character strings in the identified data.
5. The repeated risk level is judged according to which of a, b, c, d, e and f hit similar character strings and how many of them hit. For example, when at least one of the high-priority fields a and b hits, the current data is judged to be high risk; hits on c, d, e and f together are also judged high risk; and hits on only the medium-priority fields c and d are judged medium risk.
6. The risk level from rule 5 is then recalculated according to whether the elements exist. For example, if the risk level determined from the field contents is medium and element A or B exists, the risk level is raised to high.
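The six rules above can be sketched as a small rule function in Python. The field names a to f and elements A and B come from the description; everything else is an assumption made for illustration, in particular the reading of "c, d, e, f hits" as all four fields hitting, and the "low" and "none" levels for cases the description does not cover.

```python
# Rule 2: priority of each field content, as given in the description's example.
PRIORITY = {"a": "high", "b": "high", "c": "medium", "d": "medium", "e": "low", "f": "low"}


def risk_level(hit_fields, elements_present):
    """Rate the repetition risk from fuzzy-match hits and detected elements.

    hit_fields: fields whose content has a similar string in the identified data.
    elements_present: elements (e.g. "A", "B") found by element extraction.
    """
    hits = set(hit_fields)
    non_high_fields = {f for f, p in PRIORITY.items() if p != "high"}

    # Rule 5: rate by which fields hit.
    if any(PRIORITY.get(f) == "high" for f in hits):
        level = "high"
    elif non_high_fields <= hits:            # c, d, e and f all hit (assumed reading)
        level = "high"
    elif any(PRIORITY.get(f) == "medium" for f in hits):
        level = "medium"
    elif hits:
        level = "low"                        # assumption: not stated in the source
    else:
        level = "none"                       # assumption: not stated in the source

    # Rule 6: raise a medium rating when element A or B is present.
    if level == "medium" and set(elements_present) & {"A", "B"}:
        level = "high"
    return level


print(risk_level({"c", "d"}, set()))
print(risk_level({"c", "d"}, {"A"}))
```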
After the extracted data and the judging result are input into the risk judging model and the data repeatability and the data risk rating are output, the method further comprises:
S6, classifying and storing the data repeatability and the data risk rating into corresponding databases according to output time, and establishing a search keyword library for each database;
S7, extracting keywords from a data extraction request to obtain target keywords, and matching the target keywords in the search keyword library to obtain target data.
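Steps S6 and S7 can be pictured with the in-memory Python sketch below. The source does not describe the storage layout or how keywords are extracted, so the class name `ResultStore`, the per-date buckets, the record fields and the whitespace-based keyword extraction are all assumptions.

```python
from collections import defaultdict


class ResultStore:
    """Hypothetical store: one bucket per output date, each with its own keyword library."""

    def __init__(self):
        self.records = defaultdict(list)                            # date -> records
        self.keyword_index = defaultdict(lambda: defaultdict(set))  # date -> keyword -> positions

    def save(self, output_date, record, keywords):
        """S6: store a result under its output date and index it by keyword."""
        bucket = self.records[output_date]
        bucket.append(record)
        position = len(bucket) - 1
        for keyword in keywords:
            self.keyword_index[output_date][keyword.lower()].add(position)

    def search(self, request):
        """S7: extract target keywords from the request and match them in each library."""
        target_keywords = {word.lower() for word in request.split()}  # assumed extraction
        found = []
        for output_date, index in self.keyword_index.items():
            positions = set()
            for keyword in target_keywords:
                if keyword in index:
                    positions |= index[keyword]
            found.extend(self.records[output_date][p] for p in sorted(positions))
        return found


store = ResultStore()
store.save("2023-05-16", {"id": 1, "repeated": True, "risk": "high"}, ["invoice", "high"])
store.save("2023-05-17", {"id": 2, "repeated": False, "risk": "medium"}, ["contract", "medium"])
print(store.search("high invoice"))
```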
The beneficial effects are as follows:
1. When the element extraction rules were constructed, the various forms that keywords take in real registration data were fully collected through verification against a large amount of real business data, and the rules were designed accordingly, which improves the accuracy of keyword extraction.
2. When the ambiguity is calculated, the method considers both the edit distance between the basic data and the data to be compared and the form of the edits, and designs both into the edit distance algorithm logic, which improves the accuracy of the ambiguity calculation.
3. The fuzzy matching algorithm is not limited to a particular industry and can be applied flexibly, with the threshold for determining repetition set according to specific requirements.
The above discloses only a preferred embodiment of the risk factor repeatability comparison method, and the scope of the invention should not be construed as limited thereto.
Claims (10)
1. A risk factor repeatability comparison method, characterized by comprising the following steps:
acquiring a text to be identified from a registration file;
converting the text to be recognized into text data by an OCR technology;
extracting key element information from the text data through an element extraction algorithm to obtain extracted data;
comparing the basic data with the extracted data, and judging whether repeated contents exist in the basic data through a fuzzy matching algorithm to obtain a judging result;
and inputting the extracted data and the judging result into a risk judging model, and outputting the data repeatability and the data risk rating.
2. The risk factor repeatability comparison method according to claim 1, wherein
the fuzzy matching algorithm comprises a public substring algorithm, an edit distance algorithm and a threshold rule.
3. The risk factor repeatability comparison method according to claim 2, wherein
comparing the basic data with the extracted data and judging whether repeated contents exist in the basic data through a fuzzy matching algorithm to obtain a judging result comprises the following steps:
calculating the basic data and the extracted data through the public substring algorithm to obtain a user public substring;
calculating the basic data and the extracted data through the edit distance algorithm to obtain an edit distance;
checking whether the user public substring and the edit distance both satisfy the ambiguity, to obtain a comparison result;
and calculating the ambiguity based on the comparison result through the threshold rule to obtain a judging result.
4. The risk factor repeatability comparison method according to claim 3, wherein
the key element information includes a target invoice number, a contract name, and a project company name.
5. The risk factor repeatability comparison method according to claim 4, wherein
extracting the target invoice number from the text data comprises the following steps:
determining that an invoice description form is present in the text data;
and extracting the invoice description form from the text data by using a regular expression to obtain the target invoice number.
6. The risk factor repeatability comparison method according to claim 5, wherein
the invoice description form comprises any one of an invoice number, an invoice number and invoice information.
7. The risk factor repeatability comparison method according to claim 6, wherein extracting the contract number from the text data comprises:
extracting the contract number from the text data by using a regular expression.
8. The risk factor repeatability comparison method according to claim 7, wherein
extracting the contract name from the text data comprises the following steps:
determining that a contract description form is present in the text data;
and extracting the contract description form from the text data by using a regular expression to obtain the contract name.
9. The risk factor repeatability comparison method according to claim 8, wherein
the contract description form comprises any one of a contract, a construction project and an agreement.
10. The risk factor repeatability comparison method according to claim 9, wherein
extracting the project company name from the text data comprises the following steps:
determining that the project company name is present in the registration file;
and extracting the project company name from the text data by using a regular expression.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310122409.XA CN116129433A (en) | 2023-02-15 | 2023-02-15 | Risk element repeatability comparison method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116129433A (en) | 2023-05-16 |
Family
ID=86304509
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310122409.XA Withdrawn CN116129433A (en) | 2023-02-15 | 2023-02-15 | Risk element repeatability comparison method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116129433A (en) |
- 2023-02-15: CN application CN202310122409.XA (CN116129433A), status Withdrawn
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20230516 |