CN113988062A - Client unit information semi-automatic verification method based on short text matching - Google Patents
Client unit information semi-automatic verification method based on short text matching
- Publication number
- CN113988062A (application number CN202111233985.9A)
- Authority
- CN
- China
- Prior art keywords
- unit
- matching
- information
- name
- unit information
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a semi-automatic method for verifying customer unit information based on short text matching. The method comprises: constructing a unit information database of all candidate units, obtaining the unit name of each candidate unit, and constructing an inverted index with a search engine; performing a coarse recall of candidate unit names from the inverted index according to the unit name filled in by the customer; performing a second recall on the coarse recall results according to the unit address information filled in by the customer; scoring the unit names of the candidates that remain after the coarse recall and the second recall to determine a matching score, and sorting the records by matching score from high to low; and taking the unit information record with the highest matching score as the final matching result, setting a verification label, and visualizing the labeled result for subsequent manual judgment and review. Compared with the prior art, the method shortens overall audit time and improves overall audit efficiency.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a semi-automatic method for verifying customer unit information based on short text matching.
Background
At present, banks verify the unit information of credit card customers manually. An auditor searches Internet channels (such as Tianyancha, Baidu Maps, and the industry and commerce administration website) for information about the unit (that is, the employer) the customer reported, and uses that information to assess whether the customer actually works at the reported unit and whether the unit is operating normally. This approach has the following defects:
1) The hit rate of manually searching for unit information through Internet channels is low. Internal auditors query unit information manually, and because the bank's internal security management restricts external network access, few websites can be queried, so information about the customer's unit is often hard to find;
2) The overall efficiency of manual auditing is low. Internal auditors must review a certain number of customer unit records every day. Because the information is retrieved and evaluated manually, processing a single incoming application takes a long time, and audit efficiency is low.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art by providing a semi-automatic method for verifying customer unit information based on short text matching.
The purpose of the invention can be achieved by the following technical solution:
A semi-automatic method for verifying customer unit information based on short text matching comprises the following steps:
constructing a unit information database of all candidate units, obtaining the unit name of each candidate unit, and constructing an inverted index with a search engine;
performing a coarse recall of candidate unit names from the inverted index according to the unit name filled in by the customer;
performing a second recall on the coarse recall results according to the unit address information filled in by the customer;
scoring the unit names of the candidates that remain after the coarse recall and the second recall to determine a matching score, and sorting the records by matching score from high to low;
and taking the unit information record with the highest matching score as the final matching result, setting a verification label, and visualizing the labeled result for subsequent manual judgment and review.
Further, the unit information database of all candidate units contains unit-related information including the unit name, update flag, unified social credit code, registered address, registered capital, enterprise status, enterprise type, legal representative name, national-standard industry category, national-standard industry code, and element industry code.
Further, the search engine is Elasticsearch.
Further, obtaining the unit name of each candidate unit and constructing the inverted index with the search engine specifically comprises:
configuring Elasticsearch so that the unit name and unit address fields of each candidate unit are of the text type, and setting the types of the other fields as required; writing the typed records into Elasticsearch one by one, whereupon Elasticsearch automatically builds an inverted index for the text-type fields.
Further, the second recall on the coarse recall results according to the unit address information filled in by the customer specifically comprises:
using a geographic word segmentation library to split the customer-filled unit address into three administrative levels (province, city, and district), filtering the coarsely recalled unit records by these three levels, and deciding which records to retain according to business requirements.
Further, the matching score is calculated as:
$$\mathrm{score}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A is the query unit name, B is a recalled unit name, $|A \cap B|$ is the number of characters in the intersection of A and B, and $|A \cup B|$ is the number of characters in the union of A and B.
Further, the unit information record with the highest matching score is taken as the final matching result, a verification label is set, and the labeled result is displayed on a web page for subsequent manual judgment and review.
Compared with the prior art, the semi-automatic method for verifying customer unit information based on short text matching has at least the following beneficial effects:
1) by aggregating a unit information data set and applying short text matching, it raises the hit rate of unit information lookups and improves the accuracy of the information;
2) for incoming applications whose unit information matches with a high score, the audit can be assisted programmatically to a certain degree, which reduces manual audit workload, shortens overall audit time, and improves overall audit efficiency.
Drawings
FIG. 1 is a flow chart of the semi-automatic method for verifying customer unit information based on short text matching in an embodiment.
Detailed Description
The invention is described in detail below with reference to the figure and specific embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments that a person skilled in the art can obtain from these embodiments without inventive effort fall within the scope of protection of the invention.
Examples
The invention relates to a semi-automatic method for verifying customer unit information based on short text matching. It consists mainly of three stages applied to candidate unit names (coarse recall, second recall, and ranking) and specifically comprises the following steps:
Step 1, constructing a candidate unit information database
Third-party data purchased by the bank is used to build a relatively accurate unit information base of industrial and commercial registration data. The data includes, but is not limited to, unit-related information such as the unit name, update flag, unified social credit code, registered address, registered capital, enterprise status, enterprise type, legal representative name, national-standard industry category, national-standard industry code, and element industry code. Such a unit information database typically holds tens of millions to nearly a hundred million records and covers units nationwide.
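As an illustration of the record layout described in Step 1, the following is a minimal sketch of one database row. The class name `UnitRecord`, the field names, and the sample values are assumptions made for this example; the source lists only the kinds of information stored, not a schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class UnitRecord:
    """One candidate unit record (hypothetical field names)."""
    unit_name: str
    update_flag: str
    unified_social_credit_code: str
    registered_address: str
    registered_capital: Optional[str] = None
    enterprise_status: Optional[str] = None
    enterprise_type: Optional[str] = None
    legal_representative: Optional[str] = None
    industry_category: Optional[str] = None      # national-standard industry category
    industry_code: Optional[str] = None          # national-standard industry code
    element_industry_code: Optional[str] = None


# Placeholder values only; none of these describe a real company.
record = UnitRecord(
    unit_name="上海示例科技有限公司",
    update_flag="2021-10",
    unified_social_credit_code="PLACEHOLDER-0001",
    registered_address="上海市示例区示例路1号",
)

# A plain dict is convenient when the record is later written to a search engine.
row = asdict(record)
```

Fields that the purchased data does not supply stay `None`, so a partially populated record can still be indexed by name and address.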
Step 2, constructing an inverted index of unit names with a search engine
The search engine is preferably Elasticsearch (hereinafter ES). ES is configured so that the unit name and unit address fields are of the text type, and the types of the remaining fields are set as required. The records are written into ES one by one, and ES automatically builds an inverted index for the text-type fields.
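The configuration in Step 2 can be sketched as follows. The dictionary uses Elasticsearch mapping syntax with `text` for the name and address fields; the field names and the choice of `keyword` for the remaining fields are assumptions, since the source only says those types are set as required. The small in-memory function is a toy stand-in, not what ES does internally: it shows the idea of an inverted index by mapping each character to the records whose name contains it.

```python
from collections import defaultdict

# Index-creation body in Elasticsearch mapping syntax (field names assumed).
UNIT_INDEX_BODY = {
    "mappings": {
        "properties": {
            "unit_name": {"type": "text"},
            "registered_address": {"type": "text"},
            "unified_social_credit_code": {"type": "keyword"},
            "enterprise_status": {"type": "keyword"},
        }
    }
}


def build_inverted_index(names):
    """Map each non-space character to the ids of the names that contain it."""
    index = defaultdict(set)
    for doc_id, name in enumerate(names):
        for ch in set(name):
            if not ch.isspace():
                index[ch].add(doc_id)
    return index


# Made-up unit names used only for illustration.
names = ["上海示例科技有限公司", "上海示例贸易有限公司", "北京示例科技有限公司"]
index = build_inverted_index(names)
```

Looking up a character such as `"科"` returns the ids of the first and third names, which is the kind of term-to-document lookup that makes name search over tens of millions of records practical.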
Step 3, performing a coarse recall from the inverted index according to the customer-filled unit name
Based on the unit name filled in by the customer, an ES query is used to recall a small number of records with similar unit names from the millions of unit records. Considering that units may have branch offices, and given the number of provinces, cities, districts, and counties in China, the coarse recall size can be set to about 1000.
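A coarse recall of this kind could be expressed as a full-text `match` query with a result size of 1000. The sketch below only builds the request body; the field name `unit_name` and the helper name `build_coarse_recall_query` are assumptions, and the source does not state which ES query type is actually used.

```python
def build_coarse_recall_query(filled_unit_name, size=1000):
    """Return a search body asking for up to `size` records with similar names."""
    return {
        "size": size,
        "query": {"match": {"unit_name": filled_unit_name}},
    }


# Made-up customer input used only for illustration.
query = build_coarse_recall_query("上海示例科技公司")
```

The body would be sent to the search endpoint of the unit index. Keeping `size` as a parameter makes it easy to adjust the recall count to business requirements.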
Step 4, performing a second recall on the coarse recall results using the customer-filled unit address
A geographic word segmentation library splits the customer-filled unit address into three administrative levels (province, city, and district or county), and the unit addresses of the roughly 1000 coarsely recalled records are filtered against these levels. For example, only recalled records in the same province as the customer-filled address are retained, or only those matching both province and city; the specific rule can be determined according to business requirements.
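The address filter in Step 4 can be sketched as below. The source does not name the geographic word segmentation library, so a tiny hard-coded gazetteer (`GAZETTEER`) stands in for it here; the function names, the record layout, and the sample addresses are likewise assumptions. The `levels` argument selects which administrative levels must agree, mirroring the configurable business rule.

```python
from typing import NamedTuple, Optional

# Tiny stand-in for a geographic word segmentation library (assumption).
GAZETTEER = {
    "province": ["浙江省", "江苏省"],
    "city": ["杭州市", "宁波市", "南京市"],
    "district": ["西湖区", "余杭区", "海曙区", "玄武区"],
}


class Region(NamedTuple):
    province: Optional[str]
    city: Optional[str]
    district: Optional[str]


def parse_region(address):
    """Pick the first gazetteer entry found in the address at each level."""
    def first_match(level):
        for name in GAZETTEER[level]:
            if name in address:
                return name
        return None
    return Region(first_match("province"), first_match("city"), first_match("district"))


def filter_by_region(filled_address, candidates, levels=("province",)):
    """Keep candidates whose address agrees with the filled address on `levels`."""
    target = parse_region(filled_address)
    # Levels that could not be parsed from the filled address are not enforced.
    active = [lv for lv in levels if getattr(target, lv) is not None]
    kept = []
    for cand in candidates:
        region = parse_region(cand["registered_address"])
        if all(getattr(region, lv) == getattr(target, lv) for lv in active):
            kept.append(cand)
    return kept


# Made-up records used only for illustration.
candidates = [
    {"unit_name": "杭州示例科技有限公司", "registered_address": "浙江省杭州市西湖区示例路1号"},
    {"unit_name": "宁波示例科技有限公司", "registered_address": "浙江省宁波市海曙区示例路2号"},
    {"unit_name": "南京示例科技有限公司", "registered_address": "江苏省南京市玄武区示例路3号"},
]
same_province = filter_by_region("浙江省杭州市余杭区示例大道8号", candidates)
same_city = filter_by_region(
    "浙江省杭州市余杭区示例大道8号", candidates, levels=("province", "city")
)
```

With the default rule the two Zhejiang records survive; requiring province and city leaves only the Hangzhou record.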
Step 5, ranking according to business rules or historical sample data
The records that remain after the coarse recall and the second recall are ranked: each matched unit name is scored with the following formula to determine its matching score, and the records are sorted by that score.
$$\mathrm{score}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A is the query unit name, B is a recalled unit name, $|A \cap B|$ is the number of characters in the intersection of A and B, and $|A \cup B|$ is the number of characters in the union of A and B. This measure is the Jaccard similarity coefficient (one minus it gives the Jaccard distance), and it quantifies how similar the two strings A and B are.
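The scoring in Step 5 can be sketched as a character-level Jaccard computation. This example assumes that the intersection and union are taken over the sets of distinct characters in each name, which is one reading of the definition above; the function names and the sample names are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of the distinct-character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty names share nothing to compare
    return len(set_a & set_b) / len(union)


def rank_by_name(query_name, candidate_names):
    """Score every candidate against the query and sort from high to low."""
    scored = [(jaccard(query_name, name), name) for name in candidate_names]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rank_by_name(
    "上海示例科技公司",
    ["上海示例贸易有限公司", "上海示例科技有限公司", "广州样板物流有限公司"],
)
```

Here the query has 8 distinct characters and the best candidate has 10, with 8 shared, so its score is 8/10 = 0.8. The other two candidates score 6/12 = 0.5 and 2/16 = 0.125.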
Step 6, outputting the unit information with the highest matching score and giving an approval suggestion
The unit information record with the highest matching score is taken as the final matching result, a preliminary audit suggestion label is assigned according to a judgment rule, and the result is displayed on a web page for manual judgment and review.
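The selection and labeling in Step 6 might look like the sketch below. The source does not specify the judgment rule, so the two thresholds (0.9 and 0.6) and the label strings are purely hypothetical placeholders that a bank would replace with its own rule.

```python
def suggest_label(score, accept_at=0.9, review_at=0.6):
    """Map a matching score to an audit suggestion (hypothetical thresholds)."""
    if score >= accept_at:
        return "suggest_pass"
    if score >= review_at:
        return "needs_review"
    return "no_reliable_match"


def final_match(scored_records):
    """Pick the highest-scoring (score, record) pair and attach a label."""
    if not scored_records:
        return None
    score, record = max(scored_records, key=lambda pair: pair[0])
    return {"record": record, "score": score, "label": suggest_label(score)}


# Made-up scored records used only for illustration.
result = final_match([
    (0.5, {"unit_name": "上海示例贸易有限公司"}),
    (0.95, {"unit_name": "上海示例科技有限公司"}),
])
```

The returned dictionary holds what a review page would need to show: the matched record, its score, and the suggested label. The final decision stays with the human auditor.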
By aggregating a unit information data set and applying short text matching, the method raises the hit rate of unit information lookups and improves the accuracy of the information. For incoming applications whose unit information matches with a high score, the audit can be assisted programmatically to a certain degree, which reduces manual audit workload, shortens overall audit time, and improves overall audit efficiency.
Although the invention has been described with reference to specific embodiments, it is not limited to them; a person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope of the invention. The scope of protection of the invention is therefore defined by the claims.
Claims (7)
1. A semi-automatic method for verifying customer unit information based on short text matching, characterized by comprising the following steps:
constructing a unit information database of all candidate units, obtaining the unit name of each candidate unit, and constructing an inverted index with a search engine;
performing a coarse recall of candidate unit names from the inverted index according to the unit name filled in by the customer;
performing a second recall on the coarse recall results according to the unit address information filled in by the customer;
scoring the unit names of the candidates that remain after the coarse recall and the second recall to determine a matching score, and sorting the records by matching score from high to low;
and taking the unit information record with the highest matching score as the final matching result, setting a verification label, and visualizing the labeled result for subsequent manual judgment and review.
2. The semi-automatic method for verifying customer unit information based on short text matching as claimed in claim 1, wherein the unit information database of all candidate units contains unit-related information including the unit name, update flag, unified social credit code, registered address, registered capital, enterprise status, enterprise type, legal representative name, national-standard industry category, national-standard industry code, and element industry code.
3. The semi-automatic method for verifying customer unit information based on short text matching as claimed in claim 2, wherein the search engine is Elasticsearch.
4. The semi-automatic method for verifying customer unit information based on short text matching as claimed in claim 3, wherein obtaining the unit name of each candidate unit and constructing the inverted index with the search engine specifically comprises:
configuring Elasticsearch so that the unit name and unit address fields of each candidate unit are of the text type, and setting the types of the other fields as required; writing the typed records into Elasticsearch one by one, whereupon Elasticsearch automatically builds an inverted index for the text-type fields.
5. The semi-automatic method for verifying customer unit information based on short text matching as claimed in claim 3, wherein the second recall on the coarse recall results according to the unit address information filled in by the customer specifically comprises:
using a geographic word segmentation library to split the customer-filled unit address into three administrative levels (province, city, and district), filtering the coarsely recalled unit records by these three levels, and deciding which records to retain according to business requirements.
6. The method of claim 3, wherein the matching score is calculated by the following formula:
$$\mathrm{score}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A is the query unit name, B is a recalled unit name, $|A \cap B|$ is the number of characters in the intersection of A and B, and $|A \cup B|$ is the number of characters in the union of A and B.
7. The semi-automatic method for verifying customer unit information based on short text matching as claimed in claim 1, wherein the unit information record with the highest matching score is taken as the final matching result, a verification label is set, and the labeled result is displayed on a web page for subsequent manual judgment and review.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111233985.9A CN113988062A (en) | 2021-10-22 | 2021-10-22 | Client unit information semi-automatic verification method based on short text matching |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113988062A (en) | 2022-01-28 |
Family
ID=79740487
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111233985.9A Pending CN113988062A (en) | 2021-10-22 | 2021-10-22 | Client unit information semi-automatic verification method based on short text matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113988062A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674271A (en) * | 2019-08-27 | 2020-01-10 | 腾讯科技(深圳)有限公司 | Question and answer processing method and device |
CN110929125A (en) * | 2019-11-15 | 2020-03-27 | 腾讯科技(深圳)有限公司 | Search recall method, apparatus, device and storage medium thereof |
CN111191084A (en) * | 2020-04-09 | 2020-05-22 | 速度时空信息科技股份有限公司 | Map structure-based place name address resolution method |
CN111651670A (en) * | 2020-05-26 | 2020-09-11 | 中国平安财产保险股份有限公司 | Content retrieval method, device terminal and storage medium based on user behavior map |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||