CN116843432A - Anti-fraud method and device based on address text information - Google Patents

Publication number
CN116843432A
Authority
CN
China
Prior art keywords
address
text information
fraud
partner
address text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310522873.8A
Other languages
Chinese (zh)
Other versions
CN116843432B (en)
Inventor
赵佳悌
杨武力
林悦贤
姜辉
武广柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Weiju Future Technology Co ltd
Beijing Weijuzhihui Technology Co ltd
Original Assignee
Beijing Weiju Future Technology Co ltd
Beijing Weijuzhihui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Weiju Future Technology Co ltd, Beijing Weijuzhihui Technology Co ltd filed Critical Beijing Weiju Future Technology Co ltd
Priority to CN202310522873.8A priority Critical patent/CN116843432B/en
Publication of CN116843432A publication Critical patent/CN116843432A/en
Application granted granted Critical
Publication of CN116843432B publication Critical patent/CN116843432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Accounting & Taxation (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiment of the invention provides an anti-fraud method and device based on address text information, comprising the following steps: extracting features of the address text information to obtain a text feature vector for each character of the address text information; dividing the address text information into at least one address element with independent semantics; correcting and completing each address element with independent semantics obtained by the division to obtain the corresponding preprocessed address element; and performing anti-fraud recognition on the address text information according to the preprocessed address elements and/or the text feature vectors.

Description

Anti-fraud method and device based on address text information
Technical Field
The invention relates to the field of financial anti-fraud, in particular to an anti-fraud method and device based on address text information.
Background
In financial business, anti-fraud is one of the most important risk-control measures for preventing lawbreakers from stealing other people's property with false information. In recent years, banks, internet finance platforms and similar institutions have continually tried new technologies for fraud detection; at the same time, fraud techniques based on new technologies and new scenarios keep evolving and are becoming more specialized and intelligent. Fraud is variable and complex, so suitable data must be selected for modeling and analysis in each application scenario.
At present, anti-fraud technical schemes are generally built on blacklist screening, multi-lender borrowing checks, audit rules, social relations, logic rules and the like. Few schemes analyze the short text information that frequently appears in scenarios such as credit card applications and installment shopping. Schemes that do use address information mostly consider address inconsistency, such as a mismatch between the application address and the residential address, and a small number also use a pre-trained model to process the address text.
In carrying out the present invention, the applicant has found that at least the following problems exist in the prior art:
address text information is written freely, has many conventional aliases and is strongly regional, all of which make it challenging to use address text information for anti-fraud.
Disclosure of Invention
The embodiment of the invention provides an anti-fraud method and device based on address text information, and also relates to an anti-fraud strategy construction method and device based on address short text standardization. These address the problem that the free-form writing, many conventional aliases, strong regional character and similar properties of address text information make it challenging to use such information for anti-fraud.
In order to achieve the above object, in one aspect, an embodiment of the present invention provides an anti-fraud method based on address text information, including:
Extracting features of the address text information to obtain text feature vectors of each character of the address text information, dividing the address text information into at least one address element with independent semantics corresponding to the address text information, and correcting and complementing each address element with independent semantics obtained by dividing to obtain preprocessed address elements corresponding to the address elements with independent semantics;
and performing anti-fraud recognition on the address text information according to the preprocessed address elements and/or text feature vectors.
In another aspect, an embodiment of the present invention provides an anti-fraud device based on address text information, including:
an address element identification and feature vector acquisition unit, configured to perform feature extraction on the address text information to obtain a text feature vector of each character of the address text information, divide the address text information into at least one address element with independent semantics corresponding to the address text information, and perform error correction and complementation on each address element with independent semantics obtained by dividing to obtain a preprocessed address element corresponding to the address element with independent semantics;
And the anti-fraud recognition unit is used for carrying out anti-fraud recognition on the address text information according to the preprocessed address elements and/or the text feature vectors.
The technical scheme has the following beneficial effects: the address text information is split into address elements with independent semantics, the address elements are corrected and completed to obtain preprocessed address elements, and anti-fraud recognition is performed on that basis. This addresses the challenge that free-form writing, many conventional aliases, strong regional character and similar properties of address text information pose to its use for anti-fraud, and improves the accuracy of anti-fraud based on address text information.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an anti-fraud method based on address text information according to one embodiment of the present invention;
FIG. 2 is a block diagram of an anti-fraud device based on address text information according to one embodiment of the present invention;
FIG. 3 is another flow chart of an anti-fraud method based on address text information according to one embodiment of the present invention;
FIG. 4 is a flow chart of preprocessing of address text information in accordance with one embodiment of the present invention;
FIG. 5 is a flow chart of the construction of a corpus for training a nested entity recognition model according to one embodiment of the present invention;
FIG. 6 is a flow chart of false address identification in one embodiment of the present invention;
FIG. 7 is a nested entity identification flow diagram of one embodiment of the present invention;
FIG. 8 is a flow chart of partner fraud identification for one of the embodiments of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In a first aspect, as shown in fig. 1, an embodiment of the present invention provides an anti-fraud method based on address text information, including:
Step S100, extracting features of the address text information to obtain text feature vectors of each character of the address text information, dividing the address text information into at least one address element with independent semantics corresponding to the address text information, and correcting and complementing each address element with independent semantics obtained by dividing to obtain preprocessed address elements corresponding to the address elements with independent semantics;
and step S101, performing anti-fraud recognition on the address text information according to the at least one preprocessed address element and/or text feature vector.
In some embodiments, the address text information entered by a user is often abbreviated, incomplete, nonstandard or even erroneous, so anti-fraud recognition performed directly on it has low accuracy. Preferably, step S100 may split the address elements by means including, but not limited to, nested named entity recognition, and may perform error correction and completion by comparison with a preset address standard library.
Further, feature extraction is performed on the address text information to obtain a text feature vector of each character of the address text information, the address text information is divided into at least one address element with independent semantics corresponding to the address text information, and error correction and complementation are performed on each address element with independent semantics obtained by dividing to obtain a preprocessed address element corresponding to the address element with independent semantics, including:
vectorizing each character in the address text information to obtain vector characterization of the address text information;
in the embodiment of the invention, a BERT pre-trained language model is used to obtain character-level vectors for the Chinese characters as the vector representation of the address text information; because the BERT pre-trained vectors are trained on a large-scale corpus, they generalize well and carry rich features;
inputting the vector representation of the address text information into a bidirectional long short-term memory network to obtain a text feature vector of each character in the address text information;
in the embodiment of the invention, this is implemented with a bidirectional long short-term memory network (Bi-LSTM): the vectors obtained by the word vector module are input into the Bi-LSTM, which models the text as a sequence and, by combining context, produces the text feature vectors.
Analyzing text feature vectors of each character in the address text information through a preset boundary detection model, and dividing the address text information into address elements with independent semantics;
in the embodiment of the invention, the boundary detection model predicts whether each single character in the address text information is the first character or the last character of an entity; the boundary detection model consists of two character-level classifiers, and the training objective function is defined as the sum of the cross-entropy losses of the two classifiers.
Analyzing each address element with independent semantics through a preset entity prediction model, adding and averaging text feature vectors of characters in the address text information corresponding to the address elements with independent semantics to obtain span vectors corresponding to the address elements with independent semantics, and inputting the span vectors corresponding to the address elements with independent semantics into a preset full-connection layer to obtain address element types corresponding to the address elements with independent semantics;
in the embodiment of the invention, the entity prediction model aggregates the information inside a span to predict its entity class: the text feature vectors of the span's characters are summed and averaged to obtain the span vector representation, which is then input into a fully connected layer to predict the span's entity class label, giving the address element type corresponding to the entity.
Performing complementation and error correction on each address element with independent semantics according to a preset address standard library to obtain preprocessed address elements corresponding to the address elements with independent semantics;
the address standard library in the embodiment of the invention is obtained mainly from the administrative division standards website of the National Bureau of Statistics and from map services.
Wherein the address element types include: province level, city level, district level, town level, area, point location, and organization; each preprocessed address element has the same address element type as the address element with independent semantics from which it was obtained; the preprocessed address elements corresponding to the address elements with independent semantics obtained by dividing the address text information are taken as the at least one preprocessed address element corresponding to the address text information; each address element with independent semantics and its corresponding preprocessed address element correspond to the same characters in the address text information.
In some embodiments, address preprocessing mainly standardizes the address text information: key information is extracted from the unstructured address text information, and structured data in a unified form is output as the preprocessed key element information (i.e. the preprocessed address elements). The key elements of the address text information are obtained by element analysis, mainly through nested named entity recognition; the entity recognition result (i.e. the obtained key elements) is then verified against the standard address library to improve the accuracy of the address, and missing information in the address is filled in to obtain the preprocessed key element information. During address element extraction, address elements may overlap. For example, the "rich" of the "rich town dispatcher" is the starting position of the organization "rich town dispatcher" and also the starting position of the place name "rich Cheng Zhen". A plain entity recognition technique assigns a single label to each character, so it can mark only one of the organization entity or the place-name entity, which loses information contained in the address text information. If an address element is not recognized, its features are lost, which may affect downstream tasks. The embodiment of the invention therefore uses nested named entity recognition to solve this problem: for the above example, as shown in table 2, nested named entity recognition can recognize both the "rich town place" and the "rich Cheng Zhen", and compared with ordinary entity recognition the additionally extracted information can improve the results of subsequent tasks.
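The nested recognition flow described above (boundary detection, then span averaging, then a fully connected layer) can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the probabilities, feature vectors, weights and the 0.5 threshold are made-up values, the function names are hypothetical, and reading the training objective as a sum of two binary cross-entropy losses is an assumption.

```python
import math

def boundary_loss(start_probs, end_probs, start_gold, end_gold):
    """Sum of the two character-level classifiers' cross-entropy losses (assumed reading)."""
    def bce(probs, gold):
        return -sum(math.log(p if g else 1.0 - p) for p, g in zip(probs, gold))
    return bce(start_probs, start_gold) + bce(end_probs, end_gold)

def candidate_spans(start_probs, end_probs, threshold=0.5):
    """Pair every predicted first character with every predicted last character at or after it.

    All pairs are kept, so the resulting spans may nest or overlap.
    """
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [j for j, p in enumerate(end_probs) if p >= threshold]
    return [(i, j) for i in starts for j in ends if j >= i]

def span_vector(char_features, start, end):
    """Sum and average the character feature vectors inside the span (end inclusive)."""
    rows = char_features[start:end + 1]
    return [sum(row[d] for row in rows) / len(rows) for d in range(len(rows[0]))]

def classify_span(vector, weights, biases, labels):
    """One fully connected layer followed by argmax over the label scores."""
    scores = [sum(w * x for w, x in zip(row, vector)) + b
              for row, b in zip(weights, biases)]
    return labels[scores.index(max(scores))]

# Toy input: five characters; one start at index 0, two possible ends (2 and 4).
start_probs = [0.9, 0.1, 0.1, 0.1, 0.1]
end_probs = [0.1, 0.1, 0.8, 0.1, 0.9]
char_features = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 5.0], [0.0, 5.0]]
labels = ["town", "org"]          # hypothetical two-label subset
weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical trained parameters
biases = [0.0, 0.5]

spans = candidate_spans(start_probs, end_probs)
typed = {s: classify_span(span_vector(char_features, *s), weights, biases, labels)
         for s in spans}
```

Here the shorter span is labelled as a place name and the longer span that contains it as an organization, which is the nested case a single-label tagger cannot express.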
An address is usually filled in actively by the user; in a few cases it is obtained by the platform through technical means. Addresses filled in by users in particular are written inconsistently, described irregularly, and may lack geographic elements or even contain errors, and different people may write the same address in different text forms. For example, for the address "unit xx, No. xx, Niansanli Street, Yiwu City, Zhejiang Province", some people omit the common keyword "street" and write "unit xx, No. xx, Niansanli, Yiwu City, Jinhua City, Zhejiang Province"; the two descriptions actually denote the same address. Maintaining an address standard library makes it possible to unify addresses written in different ways, and so to audit and correct the result of address element recognition. The extracted address element information (i.e. the address elements with independent semantics) is matched against the standard address library (i.e. the address standard library), which mainly provides address completion and error correction; completion covers both expanding abbreviations and filling in missing address levels. For example, a recognition result keyed by element type ('pro', 'city', 'district', 'town', 'area', 'pos', 'org'), with a province value such as 'Guangxi province', is corrected and completed by matching against the address standard library. After this processing, the final address element extraction result is obtained and used as the at least one preprocessed address element.
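As a rough illustration of matching against an address standard library, the sketch below corrects each element by fuzzy string matching and fills in missing higher-level elements from parent links. The tiny library, the parent table, the 0.5 cutoff and the function names are all hypothetical; the text does not state which matching technique is used, so `difflib` is a stand-in.

```python
import difflib

# Hypothetical miniature standard library: canonical names per element type.
STANDARD_LIBRARY = {
    "pro": ["Zhejiang Province", "Guangxi Province"],
    "city": ["Jinhua City"],
    "district": ["Yiwu City"],
}

# Hypothetical parent links: canonical name -> (parent element type, parent name).
PARENT = {
    "Yiwu City": ("city", "Jinhua City"),
    "Jinhua City": ("pro", "Zhejiang Province"),
}

def correct(element_type, value, cutoff=0.5):
    """Replace an abbreviated or misspelled value with its closest canonical name."""
    candidates = STANDARD_LIBRARY.get(element_type, [])
    match = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return match[0] if match else value

def complete(elements):
    """Correct every element, then fill in missing higher levels via parent links."""
    result = {k: correct(k, v) for k, v in elements.items() if v}
    changed = True
    while changed:
        changed = False
        for value in list(result.values()):
            parent = PARENT.get(value)
            if parent and parent[0] not in result:
                result[parent[0]] = parent[1]
                changed = True
    return result

# The city level is missing and both given values are abbreviated.
fixed = complete({"pro": "Zhejiang", "district": "Yiwu"})
```

A production library would be keyed by administrative codes rather than display names, since fuzzy matching on names alone can confuse similarly named places in different provinces.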
Further, the performing anti-fraud recognition on the address text information according to the at least one preprocessed address element and/or text feature vector includes:
and analyzing at least one preprocessed address element corresponding to the address text information according to a preset false address identification method, and carrying out false address identification on the address text information.
In some embodiments, a platform or merchant, in order to boost its ranking, hires people to carry out fake transactions, and these fake transactions are often filled in with false addresses. In the embodiment of the invention, the address text information is split into address elements with independent semantics, the address elements are corrected and completed to obtain the preprocessed address elements, the address text information is vectorized to obtain its text feature vectors, and false addresses are identified on this basis, so that fake transactions are identified more accurately.
Further, the number of the address text messages is multiple, and the multiple address text messages are obtained in a preset time window; each address text message corresponds to user information;
performing anti-fraud recognition on the address text information according to the preprocessed address elements and/or text feature vectors, wherein the anti-fraud recognition comprises the following steps:
And analyzing at least one preprocessed address element corresponding to each address text message according to a preset fraud recognition method aiming at the plurality of address text messages, and performing fraud recognition on the plurality of address text messages.
In some embodiments, among fraud risks, organized fraud in which intermediaries participate deserves particular attention, such as parallel, number keeping, cash registering and the like. In the embodiment of the invention, the address text information is split into address elements with independent semantics, the address elements are corrected and completed to obtain the preprocessed address elements, and the address text information is vectorized to obtain its text feature vectors; partner fraud is identified on this basis, so that organized fraud involving intermediaries is identified more accurately.
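One way to picture group-level recognition over a time window is to link addresses that are close to each other or that share a user, and then keep only the groups spanning several users. The sketch below does this with a union-find structure. It assumes a pairwise distance function and a threshold are already available, treats closeness as transitive (an assumption), and uses plain numbers as stand-ins for addresses; all names are hypothetical.

```python
from itertools import combinations

def group_addresses(records, distance, threshold):
    """Group records (user_id, address) that are close together or share a user.

    Returns a list of sets of record indices. Linking is transitive (union-find).
    """
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for i, j in combinations(range(len(records)), 2):
        same_user = records[i][0] == records[j][0]
        close = distance(records[i][1], records[j][1]) < threshold
        if same_user or close:
            union(i, j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())

def multi_user_groups(records, groups, min_users=2):
    """Keep only groups whose records belong to at least `min_users` distinct users."""
    return [g for g in groups if len({records[i][0] for i in g}) >= min_users]

# Toy data: a number stands in for each address; distance is the absolute difference.
records = [("u1", 0.0), ("u2", 0.1), ("u3", 5.0), ("u1", 9.0)]
groups = group_addresses(records, lambda a, b: abs(a - b), threshold=0.5)
candidates = multi_user_groups(records, groups)
```

The multi-user groups would then be scored with a risk index or checked against a list of known risky addresses; that scoring is not shown here.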
Further, the performing anti-fraud recognition on the address text information according to the at least one preprocessed address element and/or text feature vector includes:
analyzing the text feature vector of each character of the address text information according to a preset attack type address identification method, and carrying out attack type address identification on the address text information.
In some embodiments, attacking a platform's risk-control rules by using wrongly written characters or traditional Chinese characters, or by inserting special characters into the middle of an address to split keywords, is also a common way of committing fraud with addresses.
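To make the special-character attack concrete, the following sketch flags characters that do not belong in an address and shows the address with them removed. It is a simple rule-based stand-in, not the feature-vector method of the embodiment, and it does not cover wrongly written or traditional characters. The allowed character set and the function name are assumptions.

```python
import re
import unicodedata

# Assumed whitelist: CJK ideographs, ASCII letters and digits, whitespace,
# and a few separators that commonly appear in addresses.
_ALLOWED = re.compile(r"[\u4e00-\u9fffA-Za-z0-9\s,\-#]")

def attack_signals(address):
    """Report characters outside the whitelist and the address with them stripped."""
    # NFKC folds full-width forms (e.g. full-width digits) into their ASCII equivalents.
    normalized = unicodedata.normalize("NFKC", address)
    special = [ch for ch in normalized if not _ALLOWED.match(ch)]
    stripped = "".join(ch for ch in normalized if _ALLOWED.match(ch))
    return {"special_chars": special, "stripped": stripped, "suspicious": bool(special)}

clean = attack_signals("浙江省义乌市")
attacked = attack_signals("浙江省*义乌@市")
```

Comparing the stripped form with the raw input shows whether a keyword was split by inserted characters; a real system would tune the whitelist to the punctuation its users legitimately type.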
Further, the analyzing at least one preprocessed address element corresponding to the address text information according to a preset false address recognition method to perform false address recognition on the address text information includes:
if neither an area address nor a point location address exists among the address element types corresponding to the at least one preprocessed address element corresponding to the address text information, judging the address text information to be a false address; otherwise,
analyzing the at least one preprocessed address element corresponding to the address text information through a preset false address identification model, and judging the address text information to be a false address if the analysis result is that the address does not exist; otherwise,
if the path distance between the at least one preprocessed address element corresponding to the address text information and the address elements corresponding to the device's common address is greater than or equal to a specified path distance threshold, judging the address text information to be a false address.
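The three checks above form a cascade in which each stage runs only if the previous one did not flag the address. A minimal sketch follows; the existence model and the path distance function are passed in as stubs, and the function name, the dictionary layout and the reason strings are hypothetical.

```python
def is_false_address(elements, device_elements, exists_model, path_distance, threshold):
    """Return (is_false, reason) for one address, following the three-stage cascade.

    elements / device_elements: dicts keyed by element type ('area', 'pos', ...).
    exists_model: callable returning True if the address is judged to exist.
    path_distance: callable returning the overall distance between two addresses.
    """
    # Stage 1: neither an area nor a point location, so no specific position.
    if not elements.get("area") and not elements.get("pos"):
        return True, "cannot be located"
    # Stage 2: the pre-trained model judges that the address does not exist.
    if not exists_model(elements):
        return True, "does not exist"
    # Stage 3: too far from the device's common address to be the user's own.
    if path_distance(elements, device_elements) >= threshold:
        return True, "not the user's own address"
    return False, "passed"

# Stub components for illustration only.
always_exists = lambda elements: True
fixed_distance = lambda a, b: 0.2

home = {"pro": "Zhejiang Province", "area": "Example Estate", "pos": "No. 1"}
vague = {"pro": "Zhejiang Province"}

r1 = is_false_address(vague, home, always_exists, fixed_distance, threshold=0.5)
r2 = is_false_address(home, home, always_exists, fixed_distance, threshold=0.5)
r3 = is_false_address(home, home, always_exists, fixed_distance, threshold=0.1)
```

Ordering the stages from cheapest to most expensive means the model and the distance computation are skipped for addresses that are already rejected as incomplete.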
In some embodiments, a false address may take three forms: an address that cannot be located to a specific position because the information is incompletely filled in, an address that does not exist, and an address that is real but is not the user's own. A different solution is designed for each of the three forms.
If the area address and the point address in the address element type corresponding to the preprocessed address element are empty, namely the address element corresponding to the area address and the point address does not exist in the address text information, the address text information is judged to be an address which cannot be positioned to a specific position, the address is judged to be a false address, and otherwise, the next judgment is carried out.
And under the condition that the false address is not determined in the previous step, inputting the preprocessed address elements into a preset false address identification model. The false address recognition model is a pre-trained model for recognizing unreal addresses, a corpus used by the model consists of addresses which exist truly and addresses which do not exist, and the model mainly consists of LSTM and a fully-connected classification layer. Analyzing the preprocessed address elements through a preset false address identification model, judging the address text information as a false address if the analysis result is an address which does not exist, and otherwise, entering the next judgment.
If the false address recognition model in the last step does not determine that the false address is not determined, the address text information is considered to be a real address, but it cannot be determined whether the address text information is the real address of the user. The embodiment of the invention provides a method for comparing the preprocessed address elements corresponding to the address text information with the common address of the equipment, wherein the comparison mode is to calculate the path distance between seven-level elements. Obtaining at least one preprocessed address element corresponding to the equipment common address and a text feature vector corresponding to the equipment common address by using the same processing method as the address text information, and specifically, performing common use on the equipmentThe address obtains vector representation of the equipment common address by vectorizing each character in the equipment common address; inputting the vector representation of the equipment common address into a two-way long-short-term memory network to obtain a text feature vector of each character in the equipment common address, analyzing the text feature vector of each character in the equipment common address through a preset boundary detection model, and splitting the equipment common address into at least one entity; wherein each entity corresponds to different position spans in the common equipment address; analyzing each entity through a preset entity prediction model, adding and averaging text feature vectors of characters of the common address of the equipment in a span corresponding to the entity to be used as a span vector corresponding to the entity, inputting the span vector corresponding to the entity into a preset full-connection layer to obtain an address element type corresponding to the entity, and taking the entity with the determined address element type as an address element with independent semantics 
corresponding to the common address of the equipment; performing completion and error correction on each address element with independent semantics corresponding to the device common address according to a preset address standard library, to obtain at least one preprocessed address element corresponding to the device common address. After this processing, the address text information corresponds to at least one preprocessed address element, the device common address also corresponds to at least one preprocessed address element, and each address element corresponds to one address element type. For the address text information, each preprocessed address element corresponds to a run of consecutive characters in the address text information and to the text feature vectors of those characters; the text feature vectors of each address element are summed and averaged, and the result is used as the address vector of the address element type corresponding to that address element. The at least one preprocessed address element corresponding to the device common address is processed in the same way as the address text information, yielding an address vector for each corresponding address element type. The address vectors of the address element types, for both the address text information and the device common address, are denoted $R_{pro}$, $R_{city}$, $R_{district}$, $R_{town}$, $R_{area}$, $R_{pos}$ and $R_{org}$ (the meaning of the subscripts is given in Table 1). For each address element type, the similarity between the address vector of the address text information and the address vector of the device common address is used as the element-level path distance, and the element-level path distances are weighted and summed to give the overall path distance between the address text information and the device common address. The weights may be determined by fitting a logistic regression model to the computed element-level results, using historical data labelled by whether the address is known to be the user's own real address; the fitted parameters $\omega_i$ are the weights used in calculating the overall path distance. The calculation formula is shown as formula (1):
$$D = \sum_i \omega_i \cdot \mathrm{sim}\left(R_i^{a}, R_i^{d}\right) \quad (1)$$
In formula (1), $i$ denotes an address element type, $R_i^{a}$ denotes the address vector of the i-th address element type of the address text information, $R_i^{d}$ denotes the address vector of the i-th address element type of the device common address, and $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity between $R_i^{a}$ and $R_i^{d}$. If the overall path distance is greater than the specified path distance threshold, the addresses are considered not to match; the address is then regarded as not being the user's own address and is judged to be a false address.
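The weighted combination described for formula (1) can be sketched as follows. This is a minimal illustration only: the text does not name the similarity function, so cosine similarity is assumed here, and the function names, the handling of element types missing on one side, and the example weights are all hypothetical.

```python
import numpy as np

# Address element types, using the subscripts given in the text.
ELEMENT_TYPES = ["pro", "city", "district", "town", "area", "pos", "org"]


def cosine_similarity(u, v):
    """Assumed similarity function; the source only says 'similarity'."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))


def address_vector(char_vectors):
    """Sum and average the per-character feature vectors of one address element."""
    return np.mean(np.asarray(char_vectors, dtype=float), axis=0)


def overall_path_distance(text_vecs, device_vecs, weights, sim=cosine_similarity):
    """Weighted sum of element-level similarities, as in formula (1).

    text_vecs / device_vecs map an element type to its address vector.
    Element types absent on either side are skipped (an assumption).
    """
    total = 0.0
    for etype in ELEMENT_TYPES:
        if etype in text_vecs and etype in device_vecs:
            total += weights.get(etype, 0.0) * sim(text_vecs[etype], device_vecs[etype])
    return total
```

In practice the weights would come from the fitted logistic regression parameters rather than being set by hand.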
Further, the analyzing, according to a preset method for identifying fraud, at least one preprocessed address element corresponding to each address text message for identifying fraud for the plurality of address text messages includes:
determining path distances between the preprocessed address elements corresponding to the same address element type among the address text information according to text feature vectors of characters in the address text information corresponding to at least one preprocessed address element corresponding to each of the address text information, determining path distances among the address text information according to the path distances corresponding to the obtained address element types, and dividing the address text information with the path distances among the address text information smaller than a designated group fraud path distance threshold into the same group;
adding address text information which corresponds to the user information and is not in the current group partner into the current group partner according to the user information which corresponds to each address text information in each group partner;
for a partner containing address text information and at least corresponding to two different user information, carrying out partner risk index analysis on the partner to obtain a partner risk index of the partner, judging the partner as a fraudulent suspicious partner if the partner risk index of the partner is greater than or equal to a designated partner fraud threshold, or for each partner, if a fraud seed address recorded in a preset dynamic address risk list library is contained in the partner, taking the partner as a fraudulent suspicious partner, and adding all text information addresses in the partner as fraud seed addresses to the dynamic address risk list library;
The dynamic address risk list library is used for recording fraudulent seed addresses.
In some embodiments, partner fraud is characterized by aggregation in time and space. In a service scenario, all addresses filled in by users (the same or different users) within a time window T are acquired, and whether partner fraud users exist is judged for the addresses within that window. The method mainly comprises two steps: the first is dividing the addresses into partners, and the second is judging whether each partner is a fraudulent partner. First, the user information and address text information within the time window T are obtained. Each piece of address text information is preprocessed: feature extraction is performed on the address text information to obtain its text feature vectors, the address text information is divided into at least one address element with independent semantics, and each address element obtained by the division is corrected and completed to obtain the corresponding preprocessed address element. Specifically: each character in the address text information is vectorized to obtain the vector representation of the address text information; the vector representation is input into a bidirectional long short-term memory network to obtain the text feature vector of each character in the address text information; the text feature vectors of the characters are analyzed by a preset boundary detection model, and the address text information is split into at least one entity (address element with independent semantics), where each entity corresponds to a different position span in the address text information; each entity is analyzed by a preset entity prediction model, in which the text feature vectors of the characters within the span corresponding to the entity are summed and averaged to obtain the span vector of the entity, the span vector is input into a preset fully connected layer to obtain the address element type of the entity, and the entities whose address element types have been determined are taken as the address elements with independent semantics; and each address element with independent semantics is completed and corrected according to a preset address standard library to obtain the corresponding preprocessed address element. Each piece of address text information thus corresponds to at least one preprocessed address element, each address element corresponds to one address element type, and each preprocessed address element corresponds to characters of the address text information and their text feature vectors. For each piece of address text information, the text feature vectors corresponding to each preprocessed address element are summed and averaged, and the result is used as the address vector of the address element type corresponding to that address element. The path distance between pieces of address text information is then calculated by the same method as formula (1), and pieces of address text information whose mutual path distance is smaller than the designated partner fraud path distance threshold are divided into one partner, so that the address text information received within the time window T is finally divided into at least one partner. If the user information corresponding to a piece of address text information in a partner also corresponds to other address text information, that other address text information is added into the partner.
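The two-stage division into partners described above can be sketched with a union-find structure. This is an illustrative simplification under stated assumptions: the `distance` callable stands in for the path distance between two addresses, the record layout `(user_id, address)` is hypothetical, and merging by user is implemented as a transitive union, which may merge slightly more than a single "add the other addresses" pass would.

```python
from itertools import combinations


def divide_into_partners(records, distance, threshold):
    """Group address records into partners.

    records:   list of (user_id, address) tuples collected in one time window
    distance:  callable giving the path distance between two addresses
    threshold: partner fraud path distance threshold
    Returns a list of sets of record indices.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Stage 1: link addresses whose path distance is below the threshold.
    for i, j in combinations(range(len(records)), 2):
        if distance(records[i][1], records[j][1]) < threshold:
            union(i, j)

    # Stage 2: pull in the other addresses of any user already in a partner.
    first_seen = {}
    for i, (user, _) in enumerate(records):
        if user in first_seen:
            union(first_seen[user], i)
        else:
            first_seen[user] = i

    partners = {}
    for i in range(len(records)):
        partners.setdefault(find(i), set()).add(i)
    return list(partners.values())
```

The pairwise comparison is quadratic in the number of addresses, which may be acceptable for a bounded time window but would need indexing or blocking for large volumes.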
In this way, a plurality of partners divided on the basis of address relationships is obtained, and the following first and second discrimination steps are carried out on the partners whose number of users is more than 2. In the first way of discriminating partner fraud, the embodiment of the invention calculates a partner risk index for the addresses of the same partner, and if the index is greater than the designated partner fraud threshold, the partner is judged to be a fraudulent suspicious partner. In the embodiment of the invention, the risk index is calculated from the feature information of the partner, including: the number of orders, the number of addresses, the number of users, the largest number of orders of any single user in the partner, the largest number of addresses of any single user in the partner, the average number of orders per user, the average number of addresses per user, the smallest number of orders of any single user, and the smallest number of addresses of any single user; other feature information may be adopted in different business scenarios. The risk index is obtained as a weighted combination of the features, where the weights are obtained by fitting a logistic regression model to the feature information of known partner fraud and non-partner fraud cases in the historical data; the fitted parameters $\theta_i$ are the weights of the features. The calculation formula of the partner risk index is shown as formula (2):
$$\sum_i \theta_i X_i \quad (2)$$
where $i$ denotes a feature class, $\theta_i$ denotes the weight of the i-th feature, and $X_i$ denotes the i-th feature.
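As a rough sketch of formula (2), the listed features can be computed from the orders of one partner and combined with the weights. The feature key names, the `(user_id, address)` order layout and the threshold comparison helper are hypothetical; the weights would in practice be the fitted logistic regression parameters.

```python
from collections import defaultdict


def partner_features(orders):
    """Compute the listed features for one partner.

    orders: non-empty list of (user_id, address) pairs, one per order.
    """
    orders_per_user = defaultdict(int)
    addresses_per_user = defaultdict(set)
    for user, address in orders:
        orders_per_user[user] += 1
        addresses_per_user[user].add(address)

    order_counts = list(orders_per_user.values())
    address_counts = [len(s) for s in addresses_per_user.values()]
    n_users = len(order_counts)
    return {
        "n_orders": len(orders),
        "n_addresses": len({address for _, address in orders}),
        "n_users": n_users,
        "max_user_orders": max(order_counts),
        "max_user_addresses": max(address_counts),
        "avg_user_orders": sum(order_counts) / n_users,
        "avg_user_addresses": sum(address_counts) / n_users,
        "min_user_orders": min(order_counts),
        "min_user_addresses": min(address_counts),
    }


def partner_risk_index(features, theta):
    """Weighted sum of the features, as in formula (2)."""
    return sum(theta[name] * value for name, value in features.items())


def is_suspicious_partner(features, theta, fraud_threshold):
    """Flag the partner when the risk index reaches the fraud threshold."""
    return partner_risk_index(features, theta) >= fraud_threshold
```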
In the second way of discriminating partner fraud, the embodiment of the invention maintains a dynamic address risk list library; if a partner contains a fraud seed address, the partner is judged to be a fraudulent suspicious partner. The addresses of that partner are then added to the dynamic address risk list library, so that the address risk list library is updated incrementally.
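The seed-based check and the incremental update can be sketched as below. This assumes, purely for illustration, that addresses are compared by exact match on normalized strings and that the risk list library is an in-memory set; the function name is hypothetical.

```python
def flag_by_seed_addresses(partners, seed_addresses):
    """Flag partners that contain a fraud seed address.

    partners:       iterable of sets of (normalized) address strings
    seed_addresses: set of fraud seed addresses, updated in place
    Returns the list of partners judged to be fraudulent suspicious partners.
    """
    suspicious = []
    for partner in partners:
        if partner & seed_addresses:
            suspicious.append(partner)
            # Incremental update: every address of the flagged partner
            # becomes a fraud seed address.
            seed_addresses |= partner
    return suspicious
```

Because the library grows while the partners are being scanned, the result of a single pass depends on the scan order; a later partner can be flagged through a seed added by an earlier one, but not the reverse.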
Further, the analyzing the partner risk index to obtain the partner risk index of the partner includes:
and acquiring a plurality of characteristic information corresponding to the group partner, and carrying out weighted summation on the plurality of characteristic information to obtain the group partner risk index of the group partner.
The embodiment of the invention calculates a partner risk index for the addresses of the same partner, and judges the partner to be a fraudulent suspicious partner if the index is greater than the designated partner fraud threshold. In the embodiment of the invention, the risk index is calculated from the feature information of the partner, including: the number of orders, the number of addresses, the number of users, the largest number of orders of any single user in the partner, the largest number of addresses of any single user in the partner, the average number of orders per user, the average number of addresses per user, the smallest number of orders of any single user, and the smallest number of addresses of any single user; other feature information may be adopted in different business scenarios. The risk index is obtained as a weighted combination of the features, where the weights are obtained by fitting a logistic regression model to the feature information of known partner fraud and non-partner fraud cases in the historical data; the fitted parameters $\theta_i$ are the weights of the features. The calculation formula of the partner risk index is shown as formula (2).
Further, the analyzing the text feature vector of each character of the address text information according to the preset attack class address recognition method to perform attack class address recognition on the address text information includes:
inputting the text feature vector of each character of the address text information into a preset classification model for classification recognition to determine whether the address text information is an attack type address.
In some embodiments, the present invention inputs the text feature vector of the address text information into a classification model, and constructs a new classification model to identify the address of the attack class.
The embodiment of the invention has the following technical effects: the address text information is split to obtain address elements with independent semantics, the address elements obtained by splitting are corrected and completed to obtain preprocessed address elements, and anti-fraud recognition is performed on that basis. This addresses the difficulty of using address text information for anti-fraud, which arises from characteristics such as free-form writing, frequent omissions and aliases, and strong regional variation, and improves the accuracy of anti-fraud based on address text information. Further, on the basis of splitting, correcting and completing the address text information, comprehensive fraud recognition is realized through false address recognition, partner fraud recognition and/or address attack user recognition, which more effectively improves the accuracy of fraud recognition.
In a second aspect, as shown in fig. 2, an embodiment of the present invention provides an anti-fraud device based on address text information, including:
an address element identification and feature vector obtaining unit 200, configured to perform feature extraction on the address text information to obtain a text feature vector of each character of the address text information, divide the address text information into at least one address element with independent semantics corresponding to the address text information, and perform error correction and complement on each address element with independent semantics obtained by dividing to obtain a preprocessed address element corresponding to the address element with independent semantics;
and the anti-fraud recognition unit 201 is configured to perform anti-fraud recognition on the address text information according to the at least one preprocessed address element and/or text feature vector.
Further, the address element identification and feature vector acquisition unit 200 includes:
the word vector module is used for vectorizing each character in the address text information to obtain vector representation of the address text information;
the feature extraction module is used for inputting the vector representation of the address text information into a two-way long-short-term memory network to obtain the text feature vector of each character in the address text information;
The boundary detection module is used for analyzing the text feature vector of each character in the address text information through a preset boundary detection model and dividing the address text information into at least one address element with independent semantics;
the entity prediction module is used for analyzing each address element with independent semantics through a preset entity prediction model, adding and averaging text feature vectors of characters in the address text information in spans corresponding to the address elements with independent semantics to be used as span vectors corresponding to the address elements with independent semantics, and inputting the span vectors corresponding to the address elements with independent semantics into a preset full-connection layer to obtain address element types corresponding to the address elements with independent semantics;
the completion error correction module is used for carrying out completion and error correction on each address element with independent semantics according to a preset address standard library to obtain preprocessed address elements corresponding to the address elements with independent semantics;
wherein the address element types include: provincial level, municipal level, district level, town level, district location, point location, physical organization; the type of the address element corresponding to each preprocessed address element is the same as the type of the address element corresponding to the preprocessed address element and having independent semantics; at least one preprocessed address element corresponding to the address element with independent semantics obtained by dividing the address text information is used as at least one preprocessed address element corresponding to the address text information; each address element with independent semantics and the preprocessed address element corresponding to the address element with independent semantics correspond to the same character in the address text information.
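The entity prediction step (averaging the character feature vectors within a span and passing the span vector through a fully connected layer) can be sketched as follows. The per-character feature vectors are assumed to have been produced already by the bidirectional long short-term memory network; the weight matrix `W` and bias `b` are hypothetical stand-ins for the trained layer, and the argmax decision is an assumption about how the layer output is turned into a type.

```python
import numpy as np

# Address element types in a fixed (assumed) output order.
ELEMENT_TYPES = ["pro", "city", "district", "town", "area", "pos", "org"]


def span_vector(char_features, start, end):
    """Average the feature vectors of the characters start..end (inclusive).

    char_features: array of shape (num_chars, feature_dim)
    """
    return char_features[start:end + 1].mean(axis=0)


def predict_element_type(span_vec, W, b):
    """Apply one fully connected layer and pick the highest-scoring type.

    W: array of shape (len(ELEMENT_TYPES), feature_dim)
    b: array of shape (len(ELEMENT_TYPES),)
    """
    logits = W @ span_vec + b
    return ELEMENT_TYPES[int(np.argmax(logits))]
```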
Further, the anti-fraud recognition unit 201 includes:
and the false address identification module is used for analyzing at least one preprocessed address element corresponding to the address text information according to a preset false address identification method and carrying out false address identification on the address text information.
Further, the number of the address text messages is multiple, and the multiple address text messages are obtained in a preset time window; each address text message corresponds to user information;
the anti-fraud recognition unit 201 includes:
and the partner fraud recognition module is used for analyzing at least one preprocessed address element corresponding to each address text message according to a preset partner fraud recognition method aiming at the plurality of address text messages to perform partner fraud recognition on the plurality of address text messages.
Further, the anti-fraud recognition unit 201 includes:
and the address attack user identification module is used for analyzing the text feature vector of each character of the address text information according to a preset attack type address identification method and carrying out attack type address identification on the address text information.
Further, the false address identification module includes:
The first false address identification sub-module is used for judging the address text information as a false address if the area address and the point address do not exist in the address element type corresponding to the at least one preprocessed address element corresponding to the address text information;
the second false address identification sub-module is used for analyzing at least one preprocessed address element corresponding to the address text information through a preset false address identification model under the condition that the first false address identification sub-module does not judge the address to be a false address, and judging the address text information as a false address if the analysis result is an address which does not exist;
and the third false address identification sub-module is used for judging the address text information as a false address if the path distance between at least one preprocessed address element corresponding to the address text information and the address element corresponding to the device common address is greater than or equal to a specified path distance threshold value under the condition that the second false address identification sub-module does not judge the address to be a false address.
Further, the group fraud identification module includes:
a first group classification module, configured to determine a path distance between preprocessed address elements corresponding to a same address element type between address text information according to text feature vectors of characters in address text information corresponding to at least one preprocessed address element corresponding to each of the plurality of address text information, determine a path distance between each address text information according to the path distance corresponding to each address element type, and classify address text information in which the path distance between each address text information is smaller than a designated group fraud path distance threshold value into a same group;
The second group partner dividing module is used for adding the address text information which corresponds to the user information and is not in the current group partner into the current group partner according to the user information which corresponds to each address text information in each group partner;
a first partner fraud identification sub-module, configured to perform a partner risk indicator analysis on the partner for at least two partners corresponding to different user information, to obtain a partner risk indicator of the partner, and if the partner risk indicator of the partner is greater than or equal to a specified partner fraud threshold, determine the partner as a fraudulent suspicious partner;
a second group fraud identification sub-module, configured to, for each group, if the group includes a fraud seed address recorded in a preset dynamic address risk list library, use the group as a fraudulent suspicious group, and add all text information addresses in the group as fraud seed addresses to the dynamic address risk list library;
the dynamic address risk list library is used for recording fraudulent seed addresses.
Further, the first partner fraud identification sub-module includes:
and the partner risk index determining module is used for acquiring various characteristic information corresponding to the partner, and carrying out weighted summation on the various characteristic information to obtain the partner risk index of the partner.
Further, the address attack user identification module is specifically configured to input a text feature vector of each character of the address text information into a preset classification model to perform classification identification to determine whether the address text information is an attack address.
The embodiment of the present invention is an embodiment of an address text information based anti-fraud device corresponding to the foregoing address text information based anti-fraud method embodiment, and may be understood according to the foregoing address text information based anti-fraud method embodiment, which is not described herein again.
The embodiment of the invention has the following technical effects: the address text information is split to obtain address elements with independent semantics, the address elements obtained by splitting are corrected and completed to obtain preprocessed address elements, and anti-fraud recognition is performed on that basis. This addresses the difficulty of using address text information for anti-fraud, which arises from characteristics such as free-form writing, frequent omissions and aliases, and strong regional variation, and improves the accuracy of anti-fraud based on address text information. Further, on the basis of splitting, correcting and completing the address text information, comprehensive fraud recognition is realized through false address recognition, partner fraud recognition and/or address attack user recognition, which more effectively improves the accuracy of fraud recognition.
The foregoing technical solutions of the embodiments of the present invention will be described in detail with reference to specific application examples, and reference may be made to the foregoing related description for details of the implementation process that are not described.
Explanation of terms:
Nested named entities: one or more named entities are nested within an entity, and the nesting may have multiple layers.
Named entity recognition belongs to the field of natural language processing and aims to identify specific types of entities in natural language texts, such as person names, place names, organization names and the like.
Data feature extraction is the basis of anti-fraud capability. The inventors found that, among the data features used for anti-fraud in the prior art, few schemes process address text, and the few methods that use address information are relatively simple: they either extract address feature data directly or use the address text information directly for modelling or calculation, and methods that preprocess the address information before using it are rare.
In the prior art, some methods extract address feature data by splitting addresses step by step, and then use the address feature data as features to enter an anti-fraud model, so that fraud scoring values of the applicant are obtained; some methods also directly input address text data as text information into a model to obtain a word vector matrix; still other methods calculate the relationship match directly from the address.
The inventors found that the prior-art methods do not consider omissions and writing errors in short address text.
To solve these problems, the embodiment of the invention provides an anti-fraud policy construction method based on address short text standardization. The embodiment of the invention preprocesses the address text information to obtain its key elements (i.e. the preprocessed address elements), and then builds an anti-fraud policy based on those key elements. In use, the key elements in the address text information are identified by a pre-trained nested named entity model, the key elements are then corrected and completed using the standard address library, and finally false address identification, partner fraud identification and attack address identification are performed on the preprocessed key element information. Compared with directly splitting the address level by level, the embodiment of the invention is more robust; compared with feeding the address information directly into a model, it has a certain interpretability. The key point is that the embodiment of the invention standardizes, completes and corrects the address information (i.e. the address text information), and thereby provides richer feature information for the downstream construction of anti-fraud policies.
The embodiment of the invention mainly relates to address preprocessing, false address identification, partner fraud identification and address attack identification.
Address preprocessing mainly consists of standardizing the address text information: key information is extracted from the unstructured address text information, and structured data in a unified form is output as the preprocessed key element information. The key elements of the address text information are obtained by element analysis of the address text information, mainly through nested named entity recognition; the entity recognition result (i.e. the key elements obtained) is then verified against a standard address library to improve the accuracy of the address, and missing information in the address is supplemented to obtain the preprocessed key element information. In the address element extraction process, address elements may overlap. For example, the first character of "Fucheng Town Police Station" is the starting position of the organization "Fucheng Town Police Station" and also the starting position of the place name "Fucheng Town". Simple entity recognition is single-label and can mark only one of the two, the organization entity or the place name entity, which causes information in the address text information to be lost. If an address element is not identified, the features of that address element are lost, which may affect downstream tasks. Therefore, the embodiment of the invention proposes to solve these problems by nested named entity recognition. In the above example, nested named entity recognition can recognize both "Fucheng Town Police Station" and "Fucheng Town"; compared with ordinary entity recognition, the additional extracted information can improve the effect of subsequent tasks.
In business scenarios that require an address to be filled in, addresses can be used to commit fraud. The embodiment of the invention mainly applies the address preprocessing result to false address identification, partner fraud identification and address attack user identification.
False address: for example, a platform or merchant may hire people who frequently fill in false addresses in order to boost the ranking of the platform or merchant.
Partner fraud: among fraud risks, organized fraud involving intermediaries has received increasing attention, for example bonus hunting ("pulling wool"), account farming and registration.
Address attack: attacking the risk control rules of the platform by means such as wrongly written characters, traditional Chinese characters, or special characters inserted in the middle of an address to split keywords is also a common way of committing fraud with addresses.
The following description is of one specific embodiment:
in order to solve the problem of fraud by using addresses, as shown in fig. 3, an embodiment of the present invention provides an anti-fraud method based on address text information, and also an anti-fraud policy construction method based on address short text standardization, including:
step S11, obtaining address text information;
step S12, address preprocessing is carried out on the address text information, and at least one preprocessed address element is obtained; specifically, as shown in fig. 4, the method comprises the following steps:
Step S121, address element extraction: extracting address elements corresponding to at least one address element type defined by a preset address element grading rule from the address text information; the address element types include: provincial level (pro), municipal level (city), district level (district), town level (town), area location (area), point location (pos), entity organization (org).
Preferably, the preset address element ranking rule is as shown in table 1.
Table 1 preset rule for grading address elements
Step S122, corpus construction. Specifically, as shown in fig. 5, in the application scenario: step S1221, address data extraction: data of a certain order of magnitude is extracted according to the distribution of the addresses; step S1222, data set division: the data is divided into a training set, a validation set and a test set in a certain proportion; step S1223, layered BIO labeling: labeling is performed with the address element types to be identified in step S121, specifically by labeling the character-level sequence in the layered BIO scheme (B-begin, I-inside, O-outside); step S1224, cross-validation: the labeling accuracy is improved through cross-validation. Labeling strategies for nested entities include: layered labels, cascaded labels, tandem token labels, and parse tree labels. Statistics on the collected addresses show that nested named entities in Chinese address data have no more than three layers, so the embodiment of the invention labels the character-level sequences in the layered BIO scheme (B-begin, I-inside, O-outside). In the layered labeling scheme, each layer of the nested named entities is labeled separately, so that all named entities contained in a token sequence are labeled completely.
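Layered BIO labeling of a nested entity can be illustrated with a short sketch. The input format (one list of spans per nesting layer, with exclusive end offsets), the function name, and the choice to put the outermost entity in the first layer are assumptions made for illustration; the six-character placeholder string stands in for an organization name whose first three characters are also a town name.

```python
def layered_bio(text, layers):
    """Produce one BIO tag per character for each nesting layer.

    text:   the address string
    layers: list of layers; each layer is a list of (start, end, type)
            spans with `end` exclusive, non-overlapping within a layer
    Returns a list with one tuple of tags per character.
    """
    rows = []
    for spans in layers:
        tags = ["O"] * len(text)
        for start, end, etype in spans:
            tags[start] = "B-" + etype
            for k in range(start + 1, end):
                tags[k] = "I-" + etype
        rows.append(tags)
    # Transpose so that each character carries its tag from every layer.
    return list(zip(*rows))


# Placeholder text: characters 0-5 form an organization (org),
# and characters 0-2 also form a town name (town).
labels = layered_bio("abcdef", [[(0, 6, "org")], [(0, 3, "town")]])
```

With a single-layer scheme only one of the two rows could be kept, which is the information loss the layered scheme is meant to avoid.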
Table 2 comparison of common named entity recognition with nested named entity recognition
Step S123, training and prediction of the nested entity recognition model: the nested entity recognition model is trained using the corpus data from step S122. The embodiment of the invention proposes capturing the dependency between entity boundaries and entity type labels based on multi-task learning. The model is mainly divided into two parts: the position of an entity is located first, and the entity type is then predicted within the corresponding position interval. This is implemented by four sub-modules: a word vector module, a feature extraction module, a boundary detection module and an entity prediction module.
Step S124, collecting an address standard library: addresses are usually obtained from users who fill them in actively, and a small number are obtained actively by the platform through technical means. Addresses filled in by users in particular are written inconsistently, described irregularly, and may have missing or even wrong geographic elements; different people may write the same address in different textual forms. For example, for the address "Unit xx, No. xx, Niansanli Street, Yiwu City, Zhejiang Province", some people omit the common keyword "Street" and directly write "Unit xx, No. xx, Niansanli, Yiwu City, Jinhua City, Zhejiang Province"; the two descriptions actually represent the same address. Maintaining an address standard library makes it possible to unify addresses written in different ways, so that the result of address element identification can be checked and corrected.
Step S125, completing and correcting the elements identified by the model through the address standard library: the address element information extracted in step S123 is matched against the standard address library collected in step S124, mainly to implement address completion and error correction, where completion includes completing abbreviations and completing missing address levels. For example, for the address "Bai Fu Sheng village in county of Lao city of He Fu Cheng, guangxi", the result obtained after the extraction in step S123 and the error correction and completion through address library matching is {'pro': 'Guangxi province', 'city': 'river basin city', 'district': 'Royal nationality', 'town': '', 'area': 'Reptile village', 'pos': '', 'org': ''}. After the above processing, the final address element extraction result is obtained as the at least one preprocessed address element.
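One possible way to match extracted elements against a standard library is fuzzy string matching per administrative level, sketched below with Python's `difflib`. The source does not specify the matching algorithm, so this technique, the two-row `STANDARD_LIBRARY`, the restriction to three levels, and the `cutoff` parameter are all illustrative assumptions.

```python
import difflib

# Hypothetical miniature standard library; a real one would hold far more
# rows and further levels (town, area, ...).
STANDARD_LIBRARY = [
    {"pro": "Zhejiang Province", "city": "Jinhua City", "district": "Yiwu City"},
    {"pro": "Zhejiang Province", "city": "Hangzhou City", "district": "Xihu District"},
]
LEVELS = ["pro", "city", "district"]


def correct_and_complete(elements, library=STANDARD_LIBRARY, cutoff=0.5):
    """Replace abbreviated or misspelled levels and fill in missing ones.

    elements: dict mapping element type to the extracted text ('' if missing)
    Returns a new dict; the input is returned unchanged (as a copy) when no
    library row matches well enough.
    """
    best_row, best_score = None, 0.0
    for row in library:
        score = 0.0
        for level in LEVELS:
            value = elements.get(level, "")
            if value:
                score += difflib.SequenceMatcher(None, value, row[level]).ratio()
        if score > best_score:
            best_row, best_score = row, score

    filled = sum(1 for level in LEVELS if elements.get(level))
    if best_row is None or filled == 0 or best_score / filled < cutoff:
        return dict(elements)

    result = dict(elements)
    result.update(best_row)  # corrected spellings plus completed levels
    return result
```

For instance, an extraction with an abbreviated province, no city and an abbreviated district would be mapped onto the first library row, which both expands the abbreviations and supplies the missing city level.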
Step S13, false address identification: the false address identification module identifies false addresses by using the address text information obtained in step S11 and the at least one preprocessed address element obtained in step S12. False addresses take the following forms: addresses that cannot be located to a specific position because the information is filled in incompletely, addresses that do not exist, and addresses that are real but do not belong to the user.
Different solutions are designed for the different types of false address that may be present. As shown in fig. 6, the specific steps are as follows:
step S131, if the area address and the point address obtained in step S12 are both empty, the address is judged to be one that cannot be located to a specific position and is determined to be a false address; otherwise, proceed to step S132;
step S132, an unreal address recognition model (i.e., a false address recognition model) is trained. The address not judged to be a false address in step S131 is input into the unreal address recognition model; if the model classifies it as a nonexistent address, the address is identified as a nonexistent false address; otherwise, step S133 is executed;
step S133, an address that passes S131 and S132 is considered a real address, but it cannot yet be determined whether it is the real address of the user. The embodiment of the invention calculates the path distance between the address result parsed in step S12 (i.e., the at least one preprocessed address element obtained in step S12) and the common address of the device; if the path distance is greater than a specified path distance threshold, the addresses are considered not to match, the address is considered not to be the user's own, and it is judged to be a false address.
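The three checks of steps S131 to S133 form a cascade in which each stage is reached only if the previous one did not flag the address. The following is a minimal sketch of that control flow, not the patented implementation: the function name `classify_address`, the two callables standing in for the trained model and the path distance computation, and the returned label strings are all assumptions made for illustration.

```python
from typing import Callable, Dict

Elements = Dict[str, str]  # address element type -> extracted text


def classify_address(
    elements: Elements,
    looks_nonexistent: Callable[[Elements], bool],
    path_distance_to_device: Callable[[Elements], float],
    distance_threshold: float,
) -> str:
    """Apply the S131 -> S132 -> S133 cascade and return a label (hypothetical labels)."""
    # S131: neither an area element nor a point element was extracted,
    # so the address cannot be located to a specific position.
    if not elements.get("area") and not elements.get("pos"):
        return "false: cannot be located"
    # S132: a trained classifier (stubbed by a callable here) flags
    # addresses that do not exist.
    if looks_nonexistent(elements):
        return "false: does not exist"
    # S133: the address is real, but it is too far from the device's common address.
    if path_distance_to_device(elements) > distance_threshold:
        return "false: not the user's own address"
    return "not flagged"
```

Passing the model and the distance function as callables keeps the cascade itself independent of how those two components are trained or computed.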
Step S14, gang fraud identification: the gang fraud identification module performs gang fraud identification according to the address result preprocessed in S12 (i.e., the at least one preprocessed address element obtained in step S12), mainly using features such as address concentration and similarity to identify intermediaries, and performs diffusion mining starting from intermediary seed users.
Step S15, attack address identification: the attack address identification module inputs the text token features output by the feature extraction module in step S123 into a newly constructed classification model to identify attack-class addresses.
The following is another example:
The embodiment of the invention provides an anti-fraud strategy construction method based on address short text standardization. The address short text (i.e., the address text information) is preprocessed, the key elements in the address are parsed based on nested named entity recognition, and they are completed and corrected through an address standard library, so that richer feature information is obtained for constructing anti-fraud strategies, realizing false address identification, gang fraud identification and address attack user identification.
The embodiment of the invention provides an anti-fraud strategy construction method based on address short text information, which comprises the following steps:
Step S21, address text information is acquired.
Step S22, preprocessing the address text information of step S21, mainly splitting the address text into key elements with independent semantics and correcting and completing them to obtain at least one preprocessed key element.
Step S23, judging whether the input address text information is a false address.
Step S24, judging whether the address text information involves gang fraud.
Step S25, judging whether the address text information is an attack-class address.
In step S21, address text information is acquired, mainly by receiving the address information actively filled in by the user on the platform, where the address type may be a home address, a company address, and the like.
Step S22 mainly comprises two parts: training a nested named entity recognition model and collecting an address standard library.
The specific steps are as follows:
step S221, address element design: address elements are classified at an appropriate granularity, mainly extracting several common types of key information. The embodiment of the invention mainly extracts seven address element types: province level (pro), city level (city), district level (district), town level (town), area location (area), point location (pos), and entity organization (org).
In step S222, a corpus is constructed: in the application scenario, a certain amount of data is sampled according to the distribution of addresses, divided into a training set, a validation set and a test set in a certain proportion, and labeled with the address element types to be identified from step S221. The embodiment of the invention labels character-level sequences in a layered BIO labeling scheme: each layer of the nested named entities is labeled separately, so that all named entities contained in a token sequence are labeled completely. Data annotation plays a vital role in establishing benchmarks and ensuring that the model learns from correct information, so the annotated data is cross-checked to obtain a high-quality address corpus.
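To make the layered BIO scheme concrete, the sketch below builds one tag sequence per nesting layer from a list of entity spans. It is illustrative only: the function name `layered_bio`, the span tuple layout, and the choice to treat a city name nested inside an organization name as two layers are assumptions, not the annotation guideline actually used.

```python
from typing import List, Tuple

# (start index, end index exclusive, label, layer) -- hypothetical span layout
Span = Tuple[int, int, str, int]


def layered_bio(n_chars: int, spans: List[Span], n_layers: int) -> List[List[str]]:
    """Return one character-level BIO tag sequence per nesting layer."""
    tags = [["O"] * n_chars for _ in range(n_layers)]
    for start, end, label, layer in spans:
        tags[layer][start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[layer][i] = f"I-{label}"          # remaining characters
    return tags


chars = list("义乌市人民医院")        # "Yiwu People's Hospital", 7 characters
spans = [(0, 3, "city", 0),           # inner entity: 义乌市
         (0, 7, "org", 1)]            # outer entity: the whole organization name
layers = layered_bio(len(chars), spans, n_layers=2)
```

Here `layers[0]` tags only the inner city entity and `layers[1]` tags the enclosing organization, so both entities are labeled even though they overlap.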
In step S223, a nested named entity recognition model is constructed and trained to parse address elements. The model captures the dependency between entity boundaries and entity labels mainly through multi-task learning, and mainly comprises four sub-modules: a word vector module, a feature extraction module, a boundary detection module and an entity prediction module. As shown in fig. 7, the four sub-modules are described below:
Step S2231, text vectorization: in the word vector module, the embodiment of the invention uses the BERT pre-trained language model to obtain character-level word vectors for Chinese characters as the vector representation of the address text information, because BERT pre-trained word vectors are trained on a large-scale corpus, generalize well, and contain rich features;
step S2232, feature extraction: in the feature extraction module, the word vectors obtained by the word vector module are input into a bidirectional long short-term memory network (Bi-LSTM), which models the text as a sequence and, combining the context, further obtains text features (i.e., text feature vectors);
step S2233, boundary detection: in the boundary detection module, each single character is predicted as being the first character or the last character of an entity. The boundary detection module consists of two character-level classifiers, and the training objective function is defined as the sum of the cross-entropy losses of the two classifiers.
Step S2234, entity prediction: in the entity prediction module, the information inside a span is aggregated to predict its entity class. The span vector representation is obtained by summing and averaging the character vectors output by S2232 (i.e., the text feature vector of each character), and the span representation is then input into a fully connected layer to predict its entity class label.
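The span representation described in step S2234 is a mean over per-character feature vectors. A minimal sketch in plain Python follows; the function name `span_representation` and the inclusive end index are assumptions, and a real model would perform this on tensors inside the network.

```python
from typing import List


def span_representation(char_vectors: List[List[float]], start: int, end: int) -> List[float]:
    """Average the feature vectors of characters start..end (inclusive)."""
    span = char_vectors[start:end + 1]
    if not span:
        raise ValueError("empty span")
    dim = len(span[0])
    # Sum each dimension over the span's characters, then divide by the span length.
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]


vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy 2-dimensional character features
rep = span_representation(vectors, 0, 1)          # mean of the first two characters
```

The resulting vector has the same dimension as a single character vector, which is what allows spans of any length to share one fully connected classification layer.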
During inference, the boundary probability and the label probability need to be considered jointly to make a decision. The loss function in the embodiment of the invention is obtained by adding the loss function of the boundary detection module and the loss function of the entity prediction module, with the relative importance of the two subtasks balanced through a hyperparameter.
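The text states that the two losses are added and balanced by a hyperparameter but does not give the exact form. One common way to write such a multi-task objective is shown below; the symbols and the convex-combination form with $\alpha \in [0, 1]$ are assumptions for illustration, not the formula used in the embodiment.

```latex
% Boundary detection loss: sum of the cross-entropy losses of the
% start-character and end-character classifiers (step S2233).
\mathcal{L}_{\mathrm{bd}} = \mathcal{L}_{\mathrm{start}} + \mathcal{L}_{\mathrm{end}}

% Joint loss: boundary detection plus entity prediction, balanced by an
% assumed hyperparameter \alpha.
\mathcal{L} = \alpha \, \mathcal{L}_{\mathrm{bd}} + (1 - \alpha) \, \mathcal{L}_{\mathrm{ep}}
```

Under this form, a larger $\alpha$ emphasizes finding correct entity boundaries, while a smaller $\alpha$ emphasizes assigning the correct class to each detected span.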
In step S224, an address standard library is collected. In the embodiment of the present invention, the address standard library is mainly obtained from the administrative division standards website of the National Bureau of Statistics and from map services.
Step S225, the address element information identified by the nested named entity recognition model is matched against the collected address standard library, and is completed and corrected. Completion and correction mainly involve four address elements: province level, city level, district level and town level. The province level is completed and verified only from the city level that is present; the city level is completed and verified based on a province-level-to-district-level mapping dictionary; the district level is completed and verified based on a city-level-to-town-level mapping dictionary; and the town level is completed and verified based on the district level and the area location.
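One possible reading of this hierarchical completion is a lookup from a lower-level element to its parent in the standard library. The sketch below fills in or corrects the city from the district and the province from the city. The two dictionaries are a hypothetical miniature library, the function name `complete_elements` is an assumption, and the sketch ignores the fact that district names may be ambiguous nationwide.

```python
from typing import Dict

# Hypothetical miniature address standard library: child name -> parent name.
DISTRICT_TO_CITY = {"Yiwu City": "Jinhua City"}
CITY_TO_PROVINCE = {"Jinhua City": "Zhejiang Province"}


def complete_elements(elements: Dict[str, str]) -> Dict[str, str]:
    """Return a copy with the city and province filled in or corrected from lower levels."""
    out = dict(elements)
    district = out.get("district", "")
    if district in DISTRICT_TO_CITY:
        # Completion when the city is missing, correction when it disagrees.
        out["city"] = DISTRICT_TO_CITY[district]
    city = out.get("city", "")
    if city in CITY_TO_PROVINCE:
        out["pro"] = CITY_TO_PROVINCE[city]
    return out


parsed = {"district": "Yiwu City", "town": "Niansanli Subdistrict"}
completed = complete_elements(parsed)
```

Because the library value overwrites the extracted one, the same lookup serves both purposes named in the step: it supplies missing elements and normalizes abbreviated or misspelled ones.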
Step S23: a false address may take three forms: an address that cannot be located to a specific position because the information is filled in incompletely, an address that does not exist, and an address that is real but does not belong to the user. Different solutions are designed for the three types of false address that may be present.
Step S231, if the area address and the point location address obtained in step S22 are both empty, the address is judged to be one that cannot be located to a specific position and is determined to be a false address; otherwise, proceed to the next step;
step S232, an unreal address recognition model is trained. The corpus consists of addresses that really exist and addresses that do not exist, and the model mainly consists of an LSTM and a fully connected classification layer. An address that was processed in step S231 and not identified as a false address is input into the unreal address recognition model; if the model classifies it as a nonexistent address, the address is determined to be a nonexistent false address; otherwise, proceed to the next step;
step S233, an address determined not to be a false address by S231 and S232 is considered a real address, but it cannot yet be determined whether it is the real address of the user. The embodiment of the invention provides a method for comparing the address result parsed in step S22 with the common address of the device, in which the path distance is calculated over the seven classes of address elements. The vector representation of each address element is obtained by summing and averaging the character vectors output in S2232 (i.e., the text feature vectors obtained in S2232), giving the address element representations $R_{pro}$, $R_{city}$, $R_{district}$, $R_{town}$, $R_{area}$, $R_{pos}$ and $R_{org}$. For each address element class, the similarity between the address vector representations is used as the element-level path distance, and the element-level path distances are weighted and combined into the overall path distance. Using historical data for which it is known whether the address is the real address of the user, the calculated address element results are fed into a logistic regression model to obtain the parameters $\omega_i$, i.e., the weights of the features, where i denotes the address element class. If the overall path distance is greater than the specified path distance threshold, the addresses are considered not to match, the address is considered not to be the user's own, and it is judged to be a false address.
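The weighted combination of element-level distances can be sketched as follows. This is an illustration under stated assumptions rather than the formula of the embodiment: the element-level distance is taken to be one minus the cosine similarity of the two element vectors, an element missing from either address is treated as maximally distant, and the function names and weights are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def overall_path_distance(
    a: Dict[str, Vector], b: Dict[str, Vector], weights: Dict[str, float]
) -> float:
    """Weighted sum over element classes of the element-level distances."""
    total = 0.0
    for cls, w in weights.items():
        if cls in a and cls in b:
            total += w * cosine_distance(a[cls], b[cls])
        else:
            total += w * 1.0  # assumption: a missing element counts as maximally distant
    return total


filled = {"pro": [1.0, 0.0], "city": [1.0, 0.0]}
device = {"pro": [1.0, 0.0], "city": [0.0, 1.0]}
d = overall_path_distance(filled, device, {"pro": 0.5, "city": 0.5})
```

In the embodiment the weights would come from the fitted logistic regression model; the fixed values above only stand in for them.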
Step S24, gang fraud identification module: gang fraud is clustered in time and space. In the business scenario, all addresses filled in by users within a certain time window T are obtained, and whether gang fraud users exist is judged for the addresses in that period. This mainly comprises two steps: the first is dividing the addresses into gangs, and the second is judging whether each gang is a fraudulent gang.
The specific steps are shown in fig. 8:
step S241, all user information and all address text information in a time window T are acquired;
Step S242, preprocessing all address text information: the address information acquired in step S241 is preprocessed as in S22;
step S243, grouping is performed according to the user information and the address text information. Grouping is done by calculating the path distance between addresses in the same way as in S233; addresses with a small path distance are placed in the same group, and if one user has several addresses, those addresses are also placed in the same group. This step yields a number of gangs divided on the basis of address relationships, and the first and second discrimination methods below are applied to gangs with more than 2 users.
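The grouping rule of step S243 (link two records if their addresses are close, or if they belong to the same user) amounts to finding connected components. The sketch below uses a union-find structure for this. The function name `group_addresses`, the record layout, the pairwise comparison of all records, and the `min_users` parameter are assumptions; the distance function is supplied by the caller.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str]  # (user id, address text) -- hypothetical layout


def group_addresses(
    records: List[Record],
    distance: Callable[[str, str], float],
    threshold: float,
    min_users: int,
) -> List[List[Record]]:
    """Group records linked by a close address or a shared user; keep groups with enough users."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            same_user = records[i][0] == records[j][0]
            close = distance(records[i][1], records[j][1]) < threshold
            if same_user or close:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups: Dict[int, List[Record]] = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    # Keep only groups that involve at least min_users distinct users.
    return [g for g in groups.values() if len({user for user, _ in g}) >= min_users]
```

Comparing every pair of records is quadratic in the number of addresses in the window, so a production system would likely narrow the candidate pairs first, for example by district.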
Step S244, first discrimination method for gang fraud: the embodiment of the invention calculates a gang risk index for the addresses of the same gang, and if the index is greater than a specified gang fraud threshold, the gang is judged to be a suspected fraudulent gang. In the embodiment of the invention, the risk index is calculated from feature information of the gang, where the features include: the number of orders, the number of addresses, the number of users, the maximum number of orders of a single user in the gang, the maximum number of addresses of a single user in the gang, the average number of orders per user in the gang, the average number of addresses per user in the gang, the minimum number of orders of a single user in the gang, and the minimum number of addresses of a single user in the gang; other feature information may be used in different business scenarios. The risk index is obtained by a weighted combination of the features, where the weights are obtained by feeding the feature information of known gang fraud and non-gang fraud in the historical data into a logistic regression model to obtain the parameters $\theta_i$, i.e., the weights of the features. The gang risk index is calculated as follows:
$$\sum_i \theta_i X_i$$
where i denotes the feature class and $X_i$ is the value of the corresponding feature.
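The risk index above is a plain weighted sum, which can be sketched in a few lines. The feature names, weights and threshold below are made-up values for illustration; in the embodiment the weights come from a logistic regression model fitted on historical data.

```python
from typing import Dict


def gang_risk_index(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values: sum over i of theta_i * X_i."""
    return sum(weights[name] * features[name] for name in weights)


def is_suspicious(features: Dict[str, float], weights: Dict[str, float], threshold: float) -> bool:
    """A gang is flagged when its risk index exceeds the specified threshold."""
    return gang_risk_index(features, weights) > threshold


features = {"orders": 10.0, "users": 3.0}   # hypothetical feature values
weights = {"orders": 0.1, "users": 0.5}     # hypothetical fitted weights
index = gang_risk_index(features, weights)
```

Features on very different scales (for example order counts versus averages) would normally be standardized before fitting, so that the learned weights are comparable.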
Step S245, second discrimination method for gang fraud: the embodiment of the invention maintains a dynamic address risk list library, and if a gang contains a fraud seed, the gang is judged to be a suspected fraudulent gang. The addresses of that gang are then added to the dynamic address risk list library, realizing incremental updating of the address risk list library.
In step S25, the address attack user identification module: in the embodiment of the present invention, the text token features output by the feature extraction module in step S22 are input into a newly constructed classification model to identify attack-class addresses.
Differentiated risk control strategies are applied to the suspicious fraudulent users identified by S23, S24 and S25, which can effectively reduce losses.
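The seed-based rule and the incremental update of the risk list can be sketched with a set of known seed addresses. The function name `flag_by_seed` and the representation of a gang as a list of address strings are assumptions for illustration.

```python
from typing import List, Set


def flag_by_seed(gangs: List[List[str]], risk_list: Set[str]) -> List[List[str]]:
    """Flag gangs that contain a known seed address and add their addresses to risk_list."""
    flagged = []
    for addresses in gangs:
        if any(addr in risk_list for addr in addresses):
            flagged.append(addresses)
            # Incremental update: every address of a flagged gang becomes a seed.
            risk_list.update(addresses)
    return flagged


risk_list = {"a"}
gangs = [["a", "b"], ["c"], ["b", "d"]]
flagged = flag_by_seed(gangs, risk_list)
```

Because the list grows while the gangs are being scanned, a gang processed later can be flagged through a seed added earlier in the same pass, so the result depends on processing order in this simple version.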
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in the specification or claims, it is intended to be inclusive in a manner similar to the term "comprising". Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments of the invention may be implemented by electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation is not to be understood as beyond the scope of the embodiments of the present invention.
The various illustrative logical blocks or units described in the embodiments of the invention may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described. A general purpose processor may be a microprocessor, but in the alternative, the general purpose processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. In an example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may reside in a user terminal. In the alternative, the processor and the storage medium may reside as distinct components in a user terminal.
In one or more exemplary designs, the above-described functions of embodiments of the present invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of computer programs from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. For example, such computer-readable media may include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store program code in the form of instructions or data structures and that may be read by a general or special purpose computer, or a general or special purpose processor. Further, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, these are also included in the definition of computer-readable medium. Disks and discs include compact discs, laser discs, optical discs, DVDs, floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included within the computer-readable media.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention and is not intended to limit the scope of protection of the invention to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of protection of the invention.

Claims (10)

1. An anti-fraud method based on address text information, comprising:
extracting features of the address text information to obtain text feature vectors of each character of the address text information, dividing the address text information into at least one address element with independent semantics corresponding to the address text information, and correcting and complementing each address element with independent semantics obtained by dividing to obtain preprocessed address elements corresponding to the address elements with independent semantics;
and performing anti-fraud recognition on the address text information according to the preprocessed address elements and/or text feature vectors.
2. The anti-fraud method based on address text information according to claim 1, wherein feature extraction is performed on the address text information to obtain a text feature vector of each character of the address text information, the address text information is divided into at least one address element with independent semantics corresponding to the address text information, and error correction and complementation are performed on each address element with independent semantics obtained by dividing to obtain a preprocessed address element corresponding to the address element with independent semantics, including:
Vectorizing each character in the address text information to obtain vector characterization of the address text information;
inputting the vector representation of the address text information into a two-way long-short-term memory network to obtain a text feature vector of each character in the address text information;
analyzing text feature vectors of each character in the address text information through a preset boundary detection model, and dividing the address text information into at least one address element with independent semantics;
analyzing each address element with independent semantics through a preset entity prediction model, adding and averaging text feature vectors of characters in the address text information corresponding to the address elements with independent semantics to obtain span vectors corresponding to the address elements with independent semantics, and inputting the span vectors corresponding to the address elements with independent semantics into a preset full-connection layer to obtain address element types corresponding to the address elements with independent semantics;
performing complementation and error correction on each address element with independent semantics according to a preset address standard library to obtain preprocessed address elements corresponding to the address elements with independent semantics;
Wherein the address element types include: provincial level, municipal level, district level, town level, district location, point location, physical organization; the type of the address element corresponding to each preprocessed address element is the same as the type of the address element corresponding to the preprocessed address element and having independent semantics; at least one preprocessed address element corresponding to the address element with independent semantics obtained by dividing the address text information is used as at least one preprocessed address element corresponding to the address text information; each address element with independent semantics and the preprocessed address element corresponding to the address element with independent semantics correspond to the same character in the address text information.
3. The address text information-based anti-fraud method according to claim 2, wherein said anti-fraud recognition of said address text information based on pre-processed address elements and/or text feature vectors comprises:
and analyzing at least one preprocessed address element corresponding to the address text information according to a preset false address identification method, and carrying out false address identification on the address text information.
4. The anti-fraud method based on address text information according to claim 2, wherein the address text information is a plurality of address text information obtained in a preset time window; each address text message corresponds to user information;
performing anti-fraud recognition on the address text information according to the preprocessed address elements and/or text feature vectors, wherein the anti-fraud recognition comprises the following steps:
and analyzing, for the plurality of address text information, at least one preprocessed address element corresponding to each address text information according to a preset gang fraud recognition method, and performing gang fraud recognition on the plurality of address text information.
5. The address text information-based anti-fraud method according to claim 2, wherein said anti-fraud recognition of said address text information based on pre-processed address elements and/or text feature vectors comprises:
analyzing the text feature vector of each character of the address text information according to a preset attack type address identification method, and carrying out attack type address identification on the address text information.
6. The method for anti-fraud based on address text information according to claim 3, wherein said analyzing at least one pre-processed address element corresponding to said address text information according to a preset false address identification method for false address identification of said address text information comprises:
if neither an area address nor a point location address exists among the address element types corresponding to the at least one preprocessed address element corresponding to the address text information, judging the address text information to be a false address; otherwise,
analyzing the at least one preprocessed address element corresponding to the address text information through a preset false address identification model, and judging the address text information to be a false address if the analysis result is an address that does not exist; otherwise,
if the path distance between the at least one preprocessed address element corresponding to the address text information and the address element corresponding to the common address of the device is greater than or equal to a specified path distance threshold, judging the address text information to be a false address.
7. The anti-fraud method based on address text information according to claim 4, wherein the analyzing, for the plurality of address text information, at least one preprocessed address element corresponding to each address text information according to a preset gang fraud recognition method, and performing gang fraud recognition on the plurality of address text information, comprises:
determining, according to the text feature vectors of the characters in the address text information corresponding to the at least one preprocessed address element of each of the plurality of address text information, the path distance between preprocessed address elements of the same address element type across the address text information; determining the path distance between the address text information according to the path distances obtained for the address element types; and dividing address text information whose mutual path distance is smaller than a specified gang fraud path distance threshold into the same gang;
according to the user information corresponding to each address text information in each gang, adding to the current gang any address text information that corresponds to that user information and is not yet in the current gang;
for a gang whose address text information corresponds to at least two different pieces of user information, performing gang risk index analysis on the gang to obtain a gang risk index of the gang, and judging the gang to be a suspected fraudulent gang if the gang risk index is greater than or equal to a specified gang fraud threshold; or, for each gang, if the gang contains a fraud seed address recorded in a preset dynamic address risk list library, taking the gang as a suspected fraudulent gang and adding all address text information in the gang to the dynamic address risk list library as fraud seed addresses;
the dynamic address risk list library is used for recording fraud seed addresses.
8. The anti-fraud method based on address text information according to claim 7, wherein performing gang risk index analysis on the gang to obtain the gang risk index of the gang comprises:
acquiring a plurality of pieces of feature information corresponding to the gang, and performing a weighted summation of the plurality of pieces of feature information to obtain the gang risk index of the gang.
9. The anti-fraud method based on address text information according to claim 5, wherein analyzing the text feature vector of each character of the address text information according to a preset attack-type address recognition method, so as to perform attack-type address recognition on the address text information, comprises:
inputting the text feature vector of each character of the address text information into a preset classification model for classification and recognition, to determine whether the address text information is an attack-type address.
10. An anti-fraud device based on address text information, comprising:
an address element identification and feature vector acquisition unit, configured to perform feature extraction on the address text information to obtain a text feature vector of each character of the address text information, divide the address text information into at least one address element with independent semantics, and perform error correction and completion on each address element with independent semantics obtained by the division, to obtain a preprocessed address element corresponding to that address element; and
an anti-fraud recognition unit, configured to perform anti-fraud recognition on the address text information according to the preprocessed address elements and/or the text feature vectors.
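The partner identification steps of claims 7 and 8 can be illustrated with a minimal Python sketch. The claims do not specify how a path distance is computed or how the per-element-type distances are combined, so the cosine distance, the averaging over shared element types, and the transitive (union-find) grouping used here are all assumptions. Every function and parameter name (`address_distance`, `group_addresses`, `expand_by_user`, `risk_index`, `flag_group`, `user_of`, `seed_library`) is hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Stand-in for the claimed 'path distance' between two element vectors (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def address_distance(a, b):
    """a, b map an address element type to a feature vector for that element.
    Averaging over the shared element types is an assumed aggregation rule."""
    shared = set(a) & set(b)
    if not shared:
        return float("inf")
    return sum(cosine_distance(a[t], b[t]) for t in shared) / len(shared)


def group_addresses(addresses, threshold):
    """Put addresses whose distance is below the threshold into the same group.
    Uses union-find, so grouping is transitive (an assumption)."""
    parent = {key: key for key in addresses}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for x, y in combinations(addresses, 2):
        if address_distance(addresses[x], addresses[y]) < threshold:
            parent[find(x)] = find(y)

    groups = {}
    for key in addresses:
        groups.setdefault(find(key), set()).add(key)
    return list(groups.values())


def expand_by_user(group, user_of):
    """Add every address that belongs to a user already represented in the group."""
    users = {user_of[a] for a in group}
    return set(group) | {a for a, u in user_of.items() if u in users}


def risk_index(features, weights):
    """Weighted sum of the group's characteristic information (claim 8)."""
    return sum(weights[name] * value for name, value in features.items())


def flag_group(group, user_of, features, weights, fraud_threshold, seed_library):
    """Return True if the group is suspicious. A seed-address hit also adds the
    whole group to seed_library, which is updated in place."""
    multi_user = len({user_of[a] for a in group}) >= 2
    by_index = multi_user and risk_index(features, weights) >= fraud_threshold
    by_seed = bool(group & seed_library)
    if by_seed:
        seed_library |= group
    return by_index or by_seed
```

In this sketch the two suspicion criteria are evaluated independently, mirroring the "or" in claim 7: a group can be flagged by its risk index, by containing a known seed address, or by both.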
CN202310522873.8A 2023-05-10 2023-05-10 Anti-fraud method and device based on address text information Active CN116843432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310522873.8A CN116843432B (en) 2023-05-10 2023-05-10 Anti-fraud method and device based on address text information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310522873.8A CN116843432B (en) 2023-05-10 2023-05-10 Anti-fraud method and device based on address text information

Publications (2)

Publication Number Publication Date
CN116843432A (en) 2023-10-03
CN116843432B (en) 2024-03-22

Family

ID=88167782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310522873.8A Active CN116843432B (en) 2023-05-10 2023-05-10 Anti-fraud method and device based on address text information

Country Status (1)

Country Link
CN (1) CN116843432B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805583A (en) * 2018-05-18 2018-11-13 连连银通电子支付有限公司 Electric business fraud detection method, device, equipment and medium based on address of cache
CN109816519A (en) * 2019-01-25 2019-05-28 宜人恒业科技发展(北京)有限公司 A kind of recognition methods of fraud clique, device and equipment
CN111372242A (en) * 2020-01-16 2020-07-03 深圳市随手商业保理有限公司 Fraud identification method, device, server and storage medium
CN111695355A (en) * 2020-05-26 2020-09-22 平安银行股份有限公司 Address text recognition method, device, medium and electronic equipment
CN111753545A (en) * 2020-06-19 2020-10-09 科大讯飞(苏州)科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN111861733A (en) * 2020-07-31 2020-10-30 重庆富民银行股份有限公司 Fraud prevention and control system and method based on address fuzzy matching
CN113449528A (en) * 2021-08-30 2021-09-28 企查查科技有限公司 Address element extraction method and device, computer equipment and storage medium
CN115481635A (en) * 2022-08-26 2022-12-16 东莞理工学院 Address element analysis method and system

Also Published As

Publication number Publication date
CN116843432B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN111783875A (en) Abnormal user detection method, device, equipment and medium based on cluster analysis
WO2022105525A1 (en) Method and apparatus for predicting user probability, and computer device
CN111260189B (en) Risk control method, risk control device, computer system and readable storage medium
CN111597340A (en) Text classification method and device and readable storage medium
CN109740642A (en) Invoice category recognition methods, device, electronic equipment and readable storage medium storing program for executing
CN110084609B (en) Transaction fraud behavior deep detection method based on characterization learning
CN114139676A (en) Training method of domain adaptive neural network
CN115408525A (en) Petition text classification method, device, equipment and medium based on multi-level label
Zhu et al. Irted-tl: An inter-region tax evasion detection method based on transfer learning
Wu et al. Application analysis of credit scoring of financial institutions based on machine learning model
CN113627151B (en) Cross-modal data matching method, device, equipment and medium
CN113378090B (en) Internet website similarity analysis method and device and readable storage medium
CN111046184A (en) Text risk identification method, device, server and storage medium
CN114119191A (en) Wind control method, overdue prediction method, model training method and related equipment
CN116821759A (en) Identification prediction method and device for category labels, processor and electronic equipment
CN116843432B (en) Anti-fraud method and device based on address text information
CN115713399B (en) User credit evaluation system combined with third-party data source
CN115310606A (en) Deep learning model depolarization method and device based on data set sensitive attribute reconstruction
CN113535888A (en) Emotion analysis device and method, computing equipment and readable storage medium
CN116522943B (en) Address element extraction method and device, storage medium and computer equipment
CN110909777A (en) Multi-dimensional feature map embedding method, device, equipment and medium
CN113743111B (en) Financial risk prediction method and device based on text pre-training and multi-task learning
Liao et al. Iicofdm: An Interpretable Initial Coin Offerings Fraud Detection Model Based on Random Forest and Shap
CN114548765A (en) Method and apparatus for risk identification
Wan et al. Research on the Combination Model Based on DPMM and IForest

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant