CN113988062A - Client unit information semi-automatic verification method based on short text matching - Google Patents

Publication number
CN113988062A
Authority
CN
China
Prior art keywords
unit
matching
information
name
unit information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111233985.9A
Other languages
Chinese (zh)
Inventor
赵呈亮
冯耀
俞敏
赵权有
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Pudong Development Bank Co Ltd
Original Assignee
Shanghai Pudong Development Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Pudong Development Bank Co Ltd
Priority to CN202111233985.9A
Publication of CN113988062A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a semi-automatic method for verifying customer unit information based on short text matching. The method comprises: constructing a unit information database of all candidate units, and using a search engine to build an inverted index over the candidate unit names; performing a coarse recall of candidate unit names from the inverted index according to the unit name the client entered on the application form; re-recalling (filtering) the coarse recall result according to the unit address the client entered; scoring the unit names of the remaining candidate records to determine a matching degree score, and sorting the records by that score; and taking the unit information record with the highest matching degree score as the final matching result, attaching a verification label, and visualizing the labeled result for subsequent manual judgment and review. Compared with the prior art, the method shortens overall audit time and improves overall audit efficiency.

Description

Client unit information semi-automatic verification method based on short text matching
Technical Field
The invention relates to the technical field of natural language processing, in particular to a semi-automatic verification method for customer unit information based on short text matching.
Background
At present, the unit (employer) information of bank credit card customers is verified manually. An auditor searches internet channels (such as Tianyancha, Baidu Maps, and the website of the industry and commerce administration) for information about the customer's unit and, based on what is found, evaluates whether the customer actually works at the unit filled in and whether the unit is operating well. This approach has the following defects:
1) The hit rate of manual searches through internet channels is low. Internal auditors query unit information manually over the internet, but the bank's internal security management restricts external network access, so few websites can be queried and information about the customer's unit is hard to find;
2) The overall efficiency of manual auditing is low. Internal auditors must audit a certain amount of customer unit information every day. Because the information is both retrieved and evaluated manually, each incoming application takes a long time to process, and auditing efficiency is low.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a semi-automatic verification method of customer unit information based on short text matching.
The purpose of the invention can be realized by the following technical scheme:
A semi-automatic verification method for customer unit information based on short text matching comprises the following steps:
constructing a unit information database of all candidate units, and using a search engine to build an inverted index over the unit names of the candidate units;
performing a coarse recall of candidate unit names from the inverted index according to the unit name the client entered on the application form;
re-recalling the coarse recall result according to the unit address information the client entered on the application form;
scoring the unit names of the candidate records that remain after the coarse recall and the re-recall to determine a matching degree score, and sorting the records by matching degree score from high to low;
and taking the unit information record with the highest matching degree score as the final matching result, setting a verification label, and visualizing the labeled final matching result for subsequent manual judgment and review.
Further, the unit information database of all candidate units includes unit-related information comprising: unit name, update identifier, unified social credit code, registered address, registered capital, enterprise status, enterprise type, legal person name, national standard industry category, national standard industry code, and element industry code.
Further, the search engine adopts an ElasticSearch search engine.
Further, using the search engine to build the inverted index over the candidate unit names specifically comprises:
configuring the ElasticSearch search engine so that the unit name and unit address fields of each candidate unit are of the text type, and setting the types of the other fields as required; then writing the typed records into ElasticSearch one by one, whereupon ElasticSearch automatically builds an inverted index for the text-type fields.
Further, re-recalling the coarse recall result according to the unit address information the client entered specifically comprises:
splitting the unit address the client entered into the three administrative levels of province, city and district by means of a geographic-location word segmentation library, filtering the coarsely recalled unit records by this three-level administrative information, and determining which records to retain according to business requirements.
Further, the matching degree score is calculated as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A is the queried unit name, B is a recalled unit name, $|A \cap B|$ is the number of characters in the intersection of A and B, and $|A \cup B|$ is the number of characters in the union of A and B.
Further, the unit information record with the highest matching degree score is taken as the final matching result, a verification label is set, and the labeled final matching result is visualized as a web page for subsequent manual judgment and review.
Compared with the prior art, the semi-automatic verification method for customer unit information based on short text matching has at least the following beneficial effects:
1) by consolidating unit information data sets and applying short text matching, the hit rate of unit information lookups is improved, and so is the accuracy of the information;
2) for incoming applications whose unit information matches with a high score, program-assisted auditing can be achieved to a certain degree, which reduces the manual audit workload, shortens overall audit time, and improves overall audit efficiency.
Drawings
FIG. 1 is a flow chart of a method for semi-automatically verifying customer-unit information based on short text matching in an embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
This embodiment provides a semi-automatic verification method for customer unit information based on short text matching. It mainly comprises three stages applied to candidate unit names: coarse recall, re-recall, and ranking. The specific steps are as follows:
Step 1, constructing a candidate unit information database
Third-party data are purchased by the bank and used to construct a relatively accurate unit information base of industry and commerce registration data. The third-party data include, but are not limited to: unit name, update identifier, unified social credit code, registered address, registered capital, enterprise status, enterprise type, legal person name, national standard industry category, national standard industry code, element industry code, and other unit-related information. The unit information database typically holds tens of millions to nearly a hundred million records and covers units nationwide.
Step 2, constructing an inverted index of unit names with a search engine
The search engine is preferably ElasticSearch (hereinafter ES) or a similar engine. ES is configured so that the unit name and unit address fields are of the text type, and the types of the remaining fields are set as required. The records are written into ES one by one, and ES automatically builds an inverted index for the text-type fields.
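Step 2 can be illustrated with a minimal Python sketch. The mapping below follows the description only in declaring the unit name and unit address as `text`; the field names (`unit_name`, `registered_address`, `credit_code`, `enterprise_status`) and the `keyword` types are assumptions for illustration. The `build_inverted_index` function is a toy in-memory stand-in that shows the idea of an inverted index at character level; it is not how ES is implemented.

```python
from collections import defaultdict

# Hypothetical index mapping. Only the "text" type for the name and address
# fields comes from the description; other names and types are assumptions.
UNIT_INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "unit_name": {"type": "text"},
            "registered_address": {"type": "text"},
            "credit_code": {"type": "keyword"},
            "enterprise_status": {"type": "keyword"},
        }
    }
}


def build_inverted_index(records):
    """Toy inverted index: map each character of a unit name to the set of
    record ids whose name contains that character."""
    index = defaultdict(set)
    for record_id, record in enumerate(records):
        for ch in set(record["unit_name"]):
            index[ch].add(record_id)
    return index


if __name__ == "__main__":
    demo_records = [{"unit_name": "上海银行"}, {"unit_name": "上海大学"}]
    demo_index = build_inverted_index(demo_records)
    print(sorted(demo_index["上"]))
```

In a real deployment the mapping would be supplied when the index is created and the records written through an ES client; those calls are omitted here to keep the sketch self-contained.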
Step 3, performing a coarse recall according to the unit name entered by the client and the inverted index
Using the ES query language, a small number of records with similar unit names are preliminarily recalled from the unit information base according to the unit name the client entered. Because a unit may have branch companies, and given the number of provinces, cities, districts and counties in China, the number of coarsely recalled records can be set to about 1000.
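The coarse recall in Step 3 might be expressed as an ES query body like the one built below. This is a sketch under assumptions: the field name `unit_name` and the choice of a `match` query are illustrative, since the description only states that the ES query language is used and that about 1000 records are recalled.

```python
def build_coarse_recall_query(client_unit_name, size=1000):
    """Build a full-text query body that asks for up to `size` candidates
    whose unit name is similar to the name entered by the client.

    `unit_name` is a hypothetical field name; `match` is one plausible
    query type, not necessarily the one used in the patent."""
    return {
        "size": size,
        "query": {
            "match": {
                "unit_name": client_unit_name,
            }
        },
    }


if __name__ == "__main__":
    print(build_coarse_recall_query("上海浦东发展银行"))
```

The returned dictionary would be sent to the search endpoint of the unit index; the client call itself is left out so that the example runs without an ES cluster.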
Step 4, re-recalling the coarse recall result using the unit address entered by the client
The unit address the client entered is split into the three administrative levels of province, city and county by means of a geographic-location word segmentation library, and the unit addresses in the 1000 coarsely recalled records are filtered by this three-level information. For example, only recalled records in the same province as the client's address may be retained, or only those matching on both province and city; the specific rule can be determined according to business requirements.
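The address filter in Step 4 can be sketched as follows. The tiny gazetteer tuples stand in for the geographic-location word segmentation library, which the description does not name; the substring matching, the `registered_address` field, and the rule that unparseable levels are ignored are all assumptions made for illustration.

```python
# Hypothetical miniature gazetteer standing in for a real segmentation library.
PROVINCES = ("浙江省", "江苏省")
CITIES = ("杭州市", "南京市")
DISTRICTS = ("西湖区", "玄武区")


def split_address(address):
    """Split an address into province, city and district by substring lookup.
    A level that cannot be found is returned as None."""
    def first_match(names):
        for name in names:
            if name in address:
                return name
        return None

    return {
        "province": first_match(PROVINCES),
        "city": first_match(CITIES),
        "district": first_match(DISTRICTS),
    }


def filter_by_address(form_address, candidates, levels=("province",)):
    """Keep candidates whose registered address agrees with the client's
    address on every requested level. Levels that could not be parsed from
    the client's address are skipped (an assumed business rule)."""
    target = split_address(form_address)
    active = [level for level in levels if target[level] is not None]
    kept = []
    for candidate in candidates:
        parts = split_address(candidate["registered_address"])
        if all(parts[level] == target[level] for level in active):
            kept.append(candidate)
    return kept
```

Passing `levels=("province", "city")` corresponds to the stricter rule mentioned in the text, where both province and city must match.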
Step 5, ranking according to business rules or historical sample data
The records that remain after the coarse recall and the re-recall are ranked: each matched unit name is scored with the following formula to obtain a matching degree score, and the records are sorted by that score.
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A is the queried unit name, B is a recalled unit name, $|A \cap B|$ is the number of characters in the intersection of A and B, and $|A \cup B|$ is the number of characters in the union of A and B. This measure is the Jaccard similarity coefficient (the Jaccard distance is one minus this value), and it measures how similar the two character strings A and B are.
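The scoring and ranking in Step 5 can be sketched in a few lines of Python. The sketch assumes each name is treated as a set of distinct characters, so repeated characters count once; the description does not say how duplicates are handled. The `unit_name` field and the function names are illustrative.

```python
def jaccard_similarity(a, b):
    """Character-level Jaccard score: |A ∩ B| / |A ∪ B| over the distinct
    characters of the two names. Two empty names score 0.0 by convention."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def rank_candidates(query_name, candidates):
    """Return the candidates sorted by matching score, highest first."""
    return sorted(
        candidates,
        key=lambda c: jaccard_similarity(query_name, c["unit_name"]),
        reverse=True,
    )


if __name__ == "__main__":
    # "abcd" and "bcde" share 3 of 5 distinct characters.
    print(jaccard_similarity("abcd", "bcde"))
```

Because Python's `sorted` is stable, candidates with equal scores keep the order in which the earlier recall stages returned them.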
Step 6, outputting the unit information with the highest matching degree and giving an approval suggestion
The unit information record with the highest matching degree score is taken as the final matching result, preliminarily tagged with an audit suggestion label according to judgment rules, and displayed as a web page for manual judgment and review.
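The selection and labeling in Step 6 could look like the sketch below. The description does not disclose the judgment rules, so the two thresholds (0.8 and 0.5) and the label names are purely hypothetical placeholders that a bank would replace with its own business rules.

```python
def suggest_label(score, pass_threshold=0.8, review_threshold=0.5):
    """Map a matching score to an audit suggestion label.
    Thresholds and label names are illustrative assumptions."""
    if score >= pass_threshold:
        return "suggest_pass"
    if score >= review_threshold:
        return "needs_review"
    return "no_reliable_match"


def final_match(scored_candidates):
    """Pick the highest-scoring (record, score) pair and attach a label.
    Returns None when no candidate survived the recall stages."""
    if not scored_candidates:
        return None
    record, score = max(scored_candidates, key=lambda pair: pair[1])
    return {"record": record, "score": score, "label": suggest_label(score)}
```

The resulting dictionary is what a review page would render for the auditor, who still makes the final decision.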
By consolidating unit information data sets and applying short text matching, the method improves the hit rate of unit information lookups and the accuracy of the information. For incoming applications whose unit information matches with a high score, program-assisted auditing can be achieved to a certain degree, which reduces the manual audit workload, shortens overall audit time, and improves overall audit efficiency.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and those skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A semi-automatic verification method for customer unit information based on short text matching, characterized by comprising the following steps:
constructing a unit information database of all candidate units, and using a search engine to build an inverted index over the unit names of the candidate units;
performing a coarse recall of candidate unit names from the inverted index according to the unit name the client entered on the application form;
re-recalling the coarse recall result according to the unit address information the client entered on the application form;
scoring the unit names of the candidate records that remain after the coarse recall and the re-recall to determine a matching degree score, and sorting the records by matching degree score from high to low;
and taking the unit information record with the highest matching degree score as the final matching result, setting a verification label, and visualizing the labeled final matching result for subsequent manual judgment and review.
2. The semi-automatic verification method for customer unit information based on short text matching as claimed in claim 1, wherein the unit information database of all candidate units includes unit-related information comprising: unit name, update identifier, unified social credit code, registered address, registered capital, enterprise status, enterprise type, legal person name, national standard industry category, national standard industry code, and element industry code.
3. The semi-automatic verification method for customer unit information based on short text matching as claimed in claim 2, wherein the search engine is an ElasticSearch search engine.
4. The semi-automatic verification method for customer unit information based on short text matching as claimed in claim 3, wherein using the search engine to build the inverted index over the candidate unit names specifically comprises:
configuring the ElasticSearch search engine so that the unit name and unit address fields of each candidate unit are of the text type, and setting the types of the other fields as required; then writing the typed records into ElasticSearch one by one, whereupon ElasticSearch automatically builds an inverted index for the text-type fields.
5. The semi-automatic verification method for customer unit information based on short text matching as claimed in claim 3, wherein re-recalling the coarse recall result according to the unit address information the client entered specifically comprises:
splitting the unit address the client entered into the three administrative levels of province, city and district by means of a geographic-location word segmentation library, filtering the coarsely recalled unit records by this three-level administrative information, and determining which records to retain according to business requirements.
6. The method of claim 3, wherein the matching degree score is calculated by the following formula:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A is the queried unit name, B is a recalled unit name, $|A \cap B|$ is the number of characters in the intersection of A and B, and $|A \cup B|$ is the number of characters in the union of A and B.
7. The semi-automatic verification method for customer unit information based on short text matching as claimed in claim 1, wherein the unit information record with the highest matching degree score is taken as the final matching result, a verification label is set, and the labeled final matching result is visualized as a web page for subsequent manual judgment and review.
CN202111233985.9A 2021-10-22 2021-10-22 Client unit information semi-automatic verification method based on short text matching Pending CN113988062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111233985.9A CN113988062A (en) 2021-10-22 2021-10-22 Client unit information semi-automatic verification method based on short text matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111233985.9A CN113988062A (en) 2021-10-22 2021-10-22 Client unit information semi-automatic verification method based on short text matching

Publications (1)

Publication Number Publication Date
CN113988062A true CN113988062A (en) 2022-01-28

Family

ID=79740487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111233985.9A Pending CN113988062A (en) 2021-10-22 2021-10-22 Client unit information semi-automatic verification method based on short text matching

Country Status (1)

Country Link
CN (1) CN113988062A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination