CN116129433A - Risk element repeatability comparison method - Google Patents

Publication number
CN116129433A
Authority
CN
China
Prior art keywords
data
text
judging
text data
risk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310122409.XA
Other languages
Chinese (zh)
Inventor
黄维那
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Shudi Intelligent Zhongdeng Technology Co ltd
Original Assignee
Sichuan Shudi Intelligent Zhongdeng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Shudi Intelligent Zhongdeng Technology Co ltd
Priority to CN202310122409.XA
Publication of CN116129433A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/048 Fuzzy inferencing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention relates to the technical field of electronic information, and in particular to a risk element repeatability comparison method, which comprises: acquiring a text to be identified from a registration file and converting it into text data; extracting key element information from the text data through an element extraction algorithm to obtain extracted data; comparing basic data with the extracted data, and judging, through a fuzzy matching algorithm, whether repeated content exists in the basic data, to obtain a judgment result; and inputting the extracted data and the judgment result into a risk judgment model, which outputs the data repeatability and a data risk rating. The element extraction algorithm improves the accuracy of key element extraction. When calculating the fuzziness, the method considers both the edit distance between the basic data and the data to be compared and the form of the edits, and builds the edit form into the edit distance algorithm logic, which improves the accuracy of the fuzziness calculation and the semantic calculation effect.

Description

Risk element repeatability comparison method
Technical Field
The invention relates to the technical field of electronic information, and in particular to a risk element repeatability comparison method.
Background
Risk element extraction aims to extract a set of potentially risky content from the registration files of the accounts receivable pledge and transfer business and the financial leasing business, so as to improve the efficiency with which auditors review the registration files; it is a basic natural language processing task. Common keyword extraction algorithms include TF-IDF, TextRank, YAKE, AutoPhrase, KeyBERT and the like.
TF-IDF ranks keywords by counting the inverse document frequency (IDF) of words in the corpus and the term frequency (TF) of words in sentences. TextRank builds a word graph and then ranks keywords with the PageRank algorithm. YAKE is a keyword extraction algorithm that integrates several statistical indicators, and AutoPhrase uses a knowledge base for distantly supervised learning. Algorithms such as TF-IDF, TextRank and YAKE can quickly extract relatively reliable keywords to some extent, but they often produce a large number of noise words (non-keywords misidentified as keywords), because they ignore the semantic features of the text. Semantics-based keyword extraction algorithms such as KeyBERT generate candidate words by enumerating N-grams, but this is computationally very inefficient, and the anisotropy of BERT embeddings leads to a poor semantic calculation effect.
Disclosure of Invention
The invention aims to provide a risk element repeatability comparison method, so as to solve the problem that existing risk element extraction methods have a poor semantic calculation effect.
To achieve the above object, the invention provides a risk element repeatability comparison method comprising the following steps:
acquiring a text to be identified from a registration file;
converting the text to be identified into text data by OCR;
extracting key element information from the text data through an element extraction algorithm to obtain extracted data;
comparing basic data with the extracted data, and judging, through a fuzzy matching algorithm, whether repeated content exists in the basic data, to obtain a judgment result;
and inputting the extracted data and the judgment result into a risk judgment model, and outputting the data repeatability and a data risk rating.
The fuzzy matching algorithm comprises a common substring algorithm, an edit distance algorithm and a threshold rule.
The step of comparing the basic data with the extracted data and judging, through the fuzzy matching algorithm, whether repeated content exists in the basic data to obtain a judgment result comprises the following steps:
calculating over the basic data and the extracted data with the common substring algorithm to obtain a user common substring;
calculating over the basic data and the extracted data with the edit distance algorithm to obtain an edit distance;
checking whether the user common substring and the edit distance both satisfy the fuzziness requirement, to obtain a comparison result;
and calculating the fuzziness from the comparison result through the threshold rule to obtain the judgment result.
The key element information includes a target invoice number, a contract name and a project company name.
Extracting the target invoice number from the text data comprises the following steps:
determining that an invoice description form is present in the text data;
and extracting the invoice description form from the text data using a regular expression to obtain the target invoice number.
The invoice description form comprises any one of an invoice number, an invoice number and invoice information.
Extracting the contract number from the text data comprises the following steps:
extracting the contract number from the text data using a regular expression.
Extracting the contract name from the text data comprises the following steps:
determining that a contract description form is present in the text data;
and extracting the contract description form from the text data using a regular expression to obtain the contract name.
The contract description form comprises any one of a contract, a construction project and an agreement.
Extracting the project company name from the text data comprises the following steps:
determining that the project company name is present in the registration file;
and extracting the project company name from the text data using a regular expression.
In the risk element repeatability comparison method of the invention, a text to be identified is acquired from a registration file; the text to be identified is converted into text data by OCR; key element information is extracted from the text data through an element extraction algorithm to obtain extracted data; basic data is compared with the extracted data, and a fuzzy matching algorithm judges whether repeated content exists in the basic data, giving a judgment result; and the extracted data and the judgment result are input into a risk judgment model, which outputs the data repeatability and a data risk rating. The element extraction algorithm improves the accuracy of key element extraction. When calculating the fuzziness, the method considers both the edit distance between the basic data and the data to be compared and the form of the edits, and builds the edit form into the edit distance algorithm logic. This improves the accuracy of the fuzziness calculation and the semantic calculation effect, and thereby solves the problem that existing risk element extraction methods have a poor semantic calculation effect.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of the risk element repeatability comparison method provided by the invention.
Fig. 2 is a flowchart of the step of comparing the basic data with the extracted data and judging, through the fuzzy matching algorithm, whether the basic data has repeated content to obtain a judgment result.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to Figs. 1 and 2, the invention provides a risk element repeatability comparison method comprising the following steps:
S1, acquiring a text to be identified from a registration file;
Specifically, a text to be identified is acquired, which comprises the description of the text data and its attachments. In this embodiment, the description and the attachments are used as the text to be identified, from which the risk elements are extracted. The attachments are returned from the Zhongdeng registration database, and the description is the one recorded in the Zhongdeng registration certificate.
S2, converting the text to be identified into text data by OCR;
S3, extracting key element information from the text data through an element extraction algorithm to obtain extracted data;
Specifically, the key element information includes a target invoice number, a contract name and a project company name.
Extracting the target invoice number from the text data comprises the following steps:
determining that an invoice description form is present in the text data; and extracting the invoice description form from the text data using a regular expression to obtain the target invoice number. The invoice description form comprises any one of an invoice number, an invoice number and invoice information.
Extracting the contract number from the text data comprises the following steps:
extracting the contract number from the text data using a regular expression.
Extracting the contract name from the text data comprises the following steps:
determining that a contract description form is present in the text data; and extracting the contract description form from the text data using a regular expression to obtain the contract name. The contract description form comprises any one of a contract, a construction project and an agreement.
Extracting the project company name from the text data comprises the following steps:
determining that the project company name is present in the registration file; and extracting the project company name from the text data using a regular expression.
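The two-stage extraction described above (first check that a description form is present, then apply a regular expression) can be sketched as follows. This is a minimal illustration only: the English description forms, the regular expressions and the function names are all hypothetical, since the patent does not disclose its actual rules, which were written for Chinese registration text.

```python
import re

# Hypothetical description forms and patterns; the patent does not publish its rules.
INVOICE_FORMS = ("invoice number", "invoice no", "invoice information")
INVOICE_RE = re.compile(
    r"invoice\s*(?:number|no\.?|information)\s*[:：]?\s*([0-9]{8,20})",
    re.IGNORECASE,
)
CONTRACT_NO_RE = re.compile(
    r"contract\s*(?:number|no\.?)\s*[:：]?\s*([A-Z0-9][A-Z0-9\-]{3,})",
    re.IGNORECASE,
)


def extract_invoice_numbers(text):
    """Return invoice numbers, but only if an invoice description form is present."""
    lowered = text.lower()
    if not any(form in lowered for form in INVOICE_FORMS):
        return []  # no description form found, so nothing is extracted
    return INVOICE_RE.findall(text)


def extract_contract_numbers(text):
    """Return contract numbers matched directly by a regular expression."""
    return CONTRACT_NO_RE.findall(text)


sample = "Invoice No.: 12345678, Contract No: HT-2023-001; invoice number 87654321"
invoices = extract_invoice_numbers(sample)
contracts = extract_contract_numbers(sample)
```

In practice each description form would likely need its own pattern, tuned against real registration data, as the beneficial effects section of the description suggests.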
S4, comparing the basic data with the extracted data, and judging, through a fuzzy matching algorithm, whether repeated content exists in the basic data, to obtain a judgment result;
Specifically, the fuzzy matching algorithm comprises a common substring algorithm, an edit distance algorithm and a threshold rule.
Comparing the basic data with the extracted data and judging, through the fuzzy matching algorithm, whether repeated content exists in the basic data to obtain a judgment result comprises the following steps:
S41, calculating over the basic data and the extracted data with the common substring algorithm to obtain a user common substring;
Specifically, the common substring algorithm works as follows:
A matrix is constructed whose number of rows is the length of string 1 plus 1 and whose number of columns is the length of string 2 plus 1. The extra row and column simplify the calculation: without them, the first row and the first column would each need a separate loop; with them, those loops can be dropped. This mainly serves the recurrence m[i+1][j+1] = m[i][j] + 1.
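The matrix construction above corresponds to the standard dynamic-programming solution for the longest common substring. The sketch below uses the same (length + 1) by (length + 1) matrix and the recurrence m[i+1][j+1] = m[i][j] + 1; the function name and the sample strings are illustrative and not taken from the patent.

```python
def longest_common_substring(s1, s2):
    """Return the longest contiguous substring shared by s1 and s2."""
    # The extra zero row and column remove the need to special-case i == 0 or j == 0.
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    best_len = 0   # length of the longest match found so far
    best_end = 0   # end position (exclusive) of that match in s1
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] == s2[j]:
                # Extend the match that ended at the previous character of both strings.
                m[i + 1][j + 1] = m[i][j] + 1
                if m[i + 1][j + 1] > best_len:
                    best_len = m[i + 1][j + 1]
                    best_end = i + 1
    return s1[best_end - best_len:best_end]


common = longest_common_substring("INV-20230215", "NO.20230215A")
```

The matrix needs memory proportional to the product of the two lengths; for long inputs only the previous row has to be kept.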
S42, calculating over the basic data and the extracted data with the edit distance algorithm to obtain an edit distance;
Specifically, the edit distance algorithm is as follows:
The edit distance between two strings is the minimum number of editing operations required to transform one into the other; the more edits required, the more the strings differ. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character.
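With exactly these three operations, the definition above is the classic Levenshtein distance. A minimal sketch is given below; it implements plain Levenshtein only, and does not model the additional weighting by "edit form" that the patent mentions but does not specify.

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # delete c1
                curr[j - 1] + 1,     # insert c2
                prev[j - 1] + cost,  # replace c1 with c2 (free if equal)
            ))
        prev = curr
    return prev[-1]


distance = edit_distance("kitten", "sitting")
```

Keeping only two rows makes the memory use proportional to the length of the second string.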
S43, checking whether the user common substring and the edit distance both satisfy the fuzziness requirement, to obtain a comparison result;
S44, calculating the fuzziness from the comparison result through the threshold rule to obtain the judgment result.
Specifically, the threshold rule is:
The fuzziness is calculated from the combined judgment of the user common substring and the edit distance as follows: for two strings, one of which is a substring of the other, the proportion of the length accounted for by the substring is determined, for example 123 and 123456789.
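One possible reading of steps S43 and S44 is sketched below. Everything numeric here is an assumption: the patent gives no threshold values, does not say which length the substring proportion is measured against (the longer string is assumed), and leaves the threshold to "specific requirements". The names `is_duplicate`, `min_overlap` and `max_distance_ratio` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common (contiguous) substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b):
            if ch_a == ch_b:
                curr[j + 1] = prev[j] + 1
                best = max(best, curr[j + 1])
        prev = curr
    return best


def edit_distance(a, b):
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ch_a != ch_b)))
        prev = curr
    return prev[-1]


def is_duplicate(base, extracted, min_overlap=0.8, max_distance_ratio=0.2):
    """Assumed threshold rule; both threshold values are illustrative."""
    if not base or not extracted:
        return False
    shorter, longer = sorted((base, extracted), key=len)
    if shorter in longer:
        # Containment case from the text (e.g. 123 inside 123456789):
        # decide on the substring's share of the longer string's length.
        return len(shorter) / len(longer) >= min_overlap
    # Otherwise both measures must pass at the same time (step S43).
    overlap = lcs_length(base, extracted) / len(longer)
    distance_ratio = edit_distance(base, extracted) / len(longer)
    return overlap >= min_overlap and distance_ratio <= max_distance_ratio


contained = is_duplicate("123", "123456789")
near_match = is_duplicate("FP20230215", "FP20230216")
```

Under these assumed thresholds, 123 covers only a third of 123456789 and is not treated as a repeat, while two identifiers that differ in one character are.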
S5, inputting the extracted data and the judgment result into a risk judgment model, and outputting the data repeatability and a data risk rating.
Specifically, the rules of the risk judgment model are as follows:
1. The data to be compared is divided into two kinds: elements, and the field contents corresponding to the elements. For example, if element A is defined, its corresponding content is a; an element is generally a category of content, such as invoice numbers or other kinds of numbers.
2. The priority of the field contents is defined, e.g. a and b are high, c and d are medium, e and f are low.
3. Based on the result of the element extraction method, it is determined whether elements A and B are present in the identified data.
4. Based on the result of the risk element repeatability comparison method, it is determined whether the contents submitted by the user (such as a, b, c, d, e, f) have similar character strings in the identified data.
5. The repetition risk level is judged from which of a, b, c, d, e, f hit similar character strings and how many hit. For example, when at least one of the high-priority fields a and b hits, the current data is judged to be high risk; hits on c, d, e and f together are also high risk; when only the medium-priority fields c and d hit, it is medium risk.
6. The risk level from rule 5 is then recalculated according to whether the elements exist: for example, if the risk level determined from the field contents is medium and element A or B exists, the risk level is raised to high.
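Rules 2, 5 and 6 can be expressed as a small rule function. The sketch below follows the example priorities from the text (a and b high, c and d medium, e and f low). Two points are assumptions not stated in the source: "c, d, e, f hits" is read as all four hitting together, and any combination the rules do not cover is rated low. The function name `rate_risk` is hypothetical.

```python
HIGH_FIELDS = {"a", "b"}
MEDIUM_FIELDS = {"c", "d"}
LOW_FIELDS = {"e", "f"}
ELEMENTS = {"A", "B"}


def rate_risk(hits, elements_present):
    """Rate repetition risk from the fields that hit and the elements that exist.

    hits: set of field names whose content matched a similar string.
    elements_present: set of element names found in the identified data.
    """
    if hits & HIGH_FIELDS:
        level = "high"              # at least one high-priority field hit
    elif (MEDIUM_FIELDS | LOW_FIELDS) <= hits:
        level = "high"              # c, d, e and f all hit (assumed reading)
    elif hits and hits <= MEDIUM_FIELDS:
        level = "medium"            # only medium-priority fields hit
    else:
        level = "low"               # assumption: case not covered by the source
    # Rule 6: a medium rating is raised when element A or B exists.
    if level == "medium" and elements_present & ELEMENTS:
        level = "high"
    return level


rating = rate_risk({"c", "d"}, {"A"})
```

Keeping the priorities in module-level sets makes it straightforward to swap in a different field scheme for another business type.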
After the extracted data and the judgment result are input into the risk judgment model and the data repeatability and data risk rating are output, the method further comprises:
S6, classifying and storing the data repeatability and the data risk rating into corresponding databases according to output time, and establishing a search keyword library for each database;
S7, extracting keywords from a data extraction request to obtain target keywords, and matching the target keywords in the search keyword library to obtain target data.
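Steps S6 and S7 can be pictured with an in-memory stand-in for the databases. This is a loose sketch under several assumptions: one bucket per output date plays the role of a "database", a per-bucket keyword-to-record index plays the role of the "search keyword library", and keyword extraction from a request is reduced to splitting on whitespace. The class name `ResultStore` and its methods are hypothetical.

```python
from collections import defaultdict
from datetime import date


class ResultStore:
    """In-memory stand-in for date-partitioned result databases with keyword indexes."""

    def __init__(self):
        self.buckets = defaultdict(list)  # output date -> stored records
        # output date -> keyword -> positions of matching records in that bucket
        self.keyword_index = defaultdict(lambda: defaultdict(set))

    def save(self, output_date, record, keywords):
        """S6: store a result under its output date and index it by keyword."""
        bucket = self.buckets[output_date]
        bucket.append(record)
        position = len(bucket) - 1
        for keyword in keywords:
            self.keyword_index[output_date][keyword.lower()].add(position)

    def search(self, request):
        """S7: extract keywords from the request and return matching records."""
        targets = set(request.lower().split())  # assumed keyword extraction
        results = []
        for day, index in self.keyword_index.items():
            positions = set()
            for keyword in targets:
                positions |= index.get(keyword, set())
            results.extend(self.buckets[day][p] for p in sorted(positions))
        return results


store = ResultStore()
store.save(date(2023, 5, 16), {"id": 1, "risk": "high"}, ["invoice", "high"])
store.save(date(2023, 5, 17), {"id": 2, "risk": "medium"}, ["contract", "medium"])
found = store.search("high invoice")
```

A record that matches several target keywords is returned once, because positions are collected in a set before lookup.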
The beneficial effects are as follows:
1. When the element extraction rules were constructed, the various forms in which keywords appear in real registration data were fully captured through verification against a large amount of real business data, and the rules were designed accordingly, which improves the accuracy of keyword extraction.
2. When calculating the fuzziness, the method considers both the edit distance between the basic data and the data to be compared and the form of the edits, and builds the edit form into the edit distance algorithm logic, which improves the accuracy of the fuzziness calculation.
3. The fuzzy matching algorithm is not limited to a particular industry and can be applied flexibly; the threshold for deciding that data is repeated is set according to specific requirements.
The above discloses only a preferred embodiment of the risk element repeatability comparison method, and the scope of the invention should not be construed as limited thereto.

Claims (10)

1. A risk element repeatability comparison method, characterized by comprising the following steps:
acquiring a text to be identified from a registration file;
converting the text to be identified into text data by OCR;
extracting key element information from the text data through an element extraction algorithm to obtain extracted data;
comparing basic data with the extracted data, and judging, through a fuzzy matching algorithm, whether repeated content exists in the basic data, to obtain a judgment result;
and inputting the extracted data and the judgment result into a risk judgment model, and outputting the data repeatability and a data risk rating.
2. The risk element repeatability comparison method according to claim 1, wherein
the fuzzy matching algorithm comprises a common substring algorithm, an edit distance algorithm and a threshold rule.
3. The risk element repeatability comparison method according to claim 2, wherein
comparing the basic data with the extracted data and judging, through the fuzzy matching algorithm, whether repeated content exists in the basic data to obtain a judgment result comprises the following steps:
calculating over the basic data and the extracted data with the common substring algorithm to obtain a user common substring;
calculating over the basic data and the extracted data with the edit distance algorithm to obtain an edit distance;
checking whether the user common substring and the edit distance both satisfy the fuzziness requirement, to obtain a comparison result;
and calculating the fuzziness from the comparison result through the threshold rule to obtain the judgment result.
4. The risk element repeatability comparison method according to claim 3, wherein
the key element information includes a target invoice number, a contract name and a project company name.
5. The risk element repeatability comparison method according to claim 4, wherein
extracting the target invoice number from the text data comprises the following steps:
determining that an invoice description form is present in the text data;
and extracting the invoice description form from the text data using a regular expression to obtain the target invoice number.
6. The risk element repeatability comparison method according to claim 5, wherein
the invoice description form comprises any one of an invoice number, an invoice number and invoice information.
7. The risk element repeatability comparison method according to claim 6, wherein extracting the contract number from the text data comprises:
extracting the contract number from the text data using a regular expression.
8. The risk element repeatability comparison method according to claim 7, wherein
extracting the contract name from the text data comprises the following steps:
determining that a contract description form is present in the text data;
and extracting the contract description form from the text data using a regular expression to obtain the contract name.
9. The risk element repeatability comparison method according to claim 8, wherein
the contract description form comprises any one of a contract, a construction project and an agreement.
10. The risk element repeatability comparison method according to claim 9, wherein
extracting the project company name from the text data comprises the following steps:
determining that the project company name is present in the registration file;
and extracting the project company name from the text data using a regular expression.
CN202310122409.XA 2023-02-15 2023-02-15 Risk element repeatability comparison method Withdrawn CN116129433A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310122409.XA CN116129433A (en) 2023-02-15 2023-02-15 Risk element repeatability comparison method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310122409.XA CN116129433A (en) 2023-02-15 2023-02-15 Risk element repeatability comparison method

Publications (1)

Publication Number Publication Date
CN116129433A true CN116129433A (en) 2023-05-16

Family

ID=86304509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310122409.XA Withdrawn CN116129433A (en) 2023-02-15 2023-02-15 Risk element repeatability comparison method

Country Status (1)

Country Link
CN (1) CN116129433A (en)

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20230516