WO2021012913A1 - Data recognition method and system, electronic device and computer storage medium - Google Patents

Data recognition method and system, electronic device and computer storage medium

Info

Publication number
WO2021012913A1
Authority
WO
WIPO (PCT)
Prior art keywords
risk
records
word segmentation
words
fraud
Prior art date
Application number
PCT/CN2020/099572
Other languages
French (fr)
Chinese (zh)
Inventor
程旺
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Definitions

  • This application relates to the computer field, in particular to data recognition methods, systems, electronic equipment, and computer storage media, and specifically to financial technology fraud detection technology.
  • This application provides a method, system, electronic device and computer storage medium for data recognition to solve the problem of low recognition rate of group fraud.
  • In a first aspect, this application provides a data identification method that includes the following steps: obtain multiple risk records; convert each of the multiple risk records into multiple word segmentation vectors; cluster the multiple risk records according to the multiple word segmentation vectors to obtain risk records of multiple category groups; and use a social network analysis (SNA) algorithm to analyze each category group separately and identify fraud cases in the multiple category groups. Because the risk records are clustered with high precision, the records within each category are highly related. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established more reliably, which improves the accuracy of group fraud identification.
  • In a second aspect, a data identification system is provided. The system includes an acquisition unit, a conversion unit, a clustering unit, and an identification unit. The acquisition unit is used to acquire multiple risk records.
  • The conversion unit is used to convert each of the multiple risk records into multiple word segmentation vectors.
  • The clustering unit is used to cluster the multiple risk records according to the multiple word segmentation vectors to obtain risk records of multiple category groups.
  • The identification unit is configured to use a social network analysis algorithm to analyze each category group of the risk records separately and identify fraud cases in the multiple risk records. Because the system clusters the risk records with high precision, the records within each category are highly related. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established more reliably, which improves the accuracy of group fraud identification.
  • An electronic device is also provided, including a processor, an input device, an output device, and a memory that are connected to one another. The memory is used to store a computer program.
  • The computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method described in the first aspect above.
  • A computer-readable storage medium is also provided, which stores a computer program.
  • The computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect.
  • In this application, each of the multiple risk records is converted into multiple word segmentation vectors, the multiple risk records are clustered according to those vectors to obtain risk records of multiple category groups, and a social network analysis algorithm is used to analyze each category group separately and identify fraud cases in the multiple risk records. Because the risk records are clustered with high precision, the records within each category are highly related. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established more reliably, which improves the accuracy of group fraud identification.
  • Fig. 1 is a schematic flowchart of a data identification method provided by the present application.
  • Fig. 2 is a schematic diagram of a case relationship network established by using the SNA algorithm in a scenario provided by this application.
  • Fig. 3 is a schematic diagram of a case relationship network established using the SNA algorithm in another scenario provided by this application.
  • Fig. 4 is a schematic structural diagram of a data recognition system provided by the present application.
  • Fig. 5 is a schematic block diagram of the structure of an electronic device provided by the present application.
  • Fig. 1 is a schematic flowchart of a data identification method provided by the present application. It can be seen from Fig. 1 that the data identification method provided by this application includes the following steps.
  • S101: Obtain multiple risk records.
  • Here, a "risk" (an insured event) refers to the occurrence of the compensation or payment conditions stipulated or agreed in the insurance contract.
  • A risk record may be risk record data stored in a database.
  • The risk information may be an accident record, including the license plate of the vehicle in the accident, the accident location, the auto insurance policy number, the insurance policy agent, claim records, and insurance purchase records; the persons involved with the vehicle, including drivers, informants, beneficiaries, and injured persons; and other insurance-related data such as repair shops, reporting telephone numbers, inspection locations, GPS information, and disease diagnosis records. Compared with ordinary insurance data, using risk records as the original data for anti-fraud identification can greatly improve identification accuracy. The above examples are for illustration only and do not constitute a specific limitation.
  • S102: Convert each of the multiple risk records into multiple word segmentation vectors.
  • Converting each of the multiple risk records into multiple word segmentation vectors includes the following. Perform word segmentation on each risk record to divide it into multiple segmented words. From the segmented words, delete those that match a word in the stop word set and keep those that match a word in the reserved word set, to obtain the filtered segmented words. The stop word set is a collection of segmented words unrelated to the risk record information, and the reserved word set is a preset collection of words that must not be filtered out. Map the filtered segmented words into multiple word segmentation vectors.
  • An accident record contains many kinds of information, for example the specific accident, the specific casualties, disease diagnosis records, and the traffic police's accident identification records.
  • The data length of each accident record also differs, so preprocessing the raw records directly would involve a great deal of work. The obtained risk records are therefore segmented first to improve data processing efficiency.
  • Word segmentation can split each Chinese character sequence in a risk record into individual words. Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur therefore reflects how credible a candidate word is.
  • The frequency of each combination of adjacent co-occurring characters in the corpus can be counted and their mutual information calculated. Specifically, define the mutual information of two characters and calculate the adjacent co-occurrence probability of two Chinese characters X and Y. The mutual information reflects how tightly the characters are bound together.
  • When the binding is tight enough, the character group may constitute a word, and the word segmentation operation can then be carried out.
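  • The mutual information of adjacent characters can be sketched as follows. This is a minimal illustration that assumes the standard pointwise mutual information, PMI(X, Y) = log(P(XY) / (P(X) P(Y))), which the text does not spell out. The function name `adjacent_pmi` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Return PMI for every adjacent character pair in a list of strings.

    Assumption: PMI(x, y) = log(P(xy) / (P(x) * P(y))), with probabilities
    estimated from raw counts in the corpus.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for text in corpus:
        char_counts.update(text)
        pair_counts.update(zip(text, text[1:]))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi


# Toy corpus: "a" is always followed by "b"; "b" is followed by "c" only once.
scores = adjacent_pmi(["abcab", "abdab", "cdab"])
```

  • In this toy corpus the pair ("a", "b") scores higher than ("b", "c"), so "ab" is the stronger word candidate. A real segmenter would compare such scores against a tuned threshold.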
  • This method also has limitations. It often extracts character groups that co-occur frequently but are not meaningful words, such as "this", "one", "some", "my", and "many", which wastes storage space and lowers search efficiency. The stop word set and the reserved word set can therefore be combined to filter out words that appear frequently but are unrelated to the risk, for example modal particles, adverbs, prepositions, and conjunctions, which have no clear meaning on their own and only play a role when placed in a complete sentence.
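  • The filtering step can be sketched as below. The word sets are hypothetical examples, and the rule that the reserved word set takes precedence over the stop word set is an assumption about how the two sets are combined.

```python
def filter_tokens(tokens, stop_words, reserved_words):
    """Drop stop words, but never drop a word that is in the reserved set."""
    return [
        tok for tok in tokens
        if tok in reserved_words or tok not in stop_words
    ]


# Hypothetical example sets; a real system would load curated lists.
STOP_WORDS = {"this", "one", "some", "my", "many"}
RESERVED_WORDS = {"one"}  # assumed to matter here, e.g. as a count

tokens = ["this", "driver", "reported", "one", "accident", "my", "vehicle"]
filtered = filter_tokens(tokens, STOP_WORDS, RESERVED_WORDS)
```

  • Here "this" and "my" are removed, while "one" survives because it is in the reserved word set.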
  • Each segmented word is represented as a vector of floating-point numbers, where each dimension indicates the distance between the word and another word.
  • With this representation, semantically similar words are mapped to nearby regions of the vector space, so the computer can calculate the similarity between words. In other words, once the computer can work with the meaning of the language, it can carry out further anti-fraud recognition processing.
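  • One common way to compare word vectors is cosine similarity. The sketch below assumes that measure, which the text does not name, and uses made-up three-dimensional vectors rather than vectors from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical word vectors for illustration only.
VECTORS = {
    "collision": [0.9, 0.1, 0.0],
    "crash": [0.8, 0.2, 0.1],
    "hospital": [0.1, 0.9, 0.3],
}

sim_close = cosine_similarity(VECTORS["collision"], VECTORS["crash"])
sim_far = cosine_similarity(VECTORS["collision"], VECTORS["hospital"])
```

  • With these vectors, "collision" is far more similar to "crash" than to "hospital", which is the behavior the representation is meant to provide.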
  • S103: Cluster the multiple risk records according to the multiple word segmentation vectors to obtain risk records of multiple category groups.
  • Clustering the multiple risk records according to the multiple word segmentation vectors includes the following. From the multiple word segmentation vectors of each risk record, screen out the target word segmentation vectors, where a target word segmentation vector appears in the risk record to which it belongs more frequently than in other risk records, or appears in the risk record to which it belongs less frequently than in other risk records.
  • Using a clustering algorithm, cluster the risk records that contain the same or similar target word segmentation vectors to obtain risk records of multiple category groups. In other words, converting a risk record into word segmentation vectors produces many word vectors.
  • The target word segmentation vectors may be screened using the term frequency-inverse document frequency (TF-IDF) method.
  • TF-IDF is a statistical method for evaluating how important a word is to one document in a document set or corpus. The words most useful for distinguishing a document are those that appear frequently in that document but rarely in the other documents of the collection. If the feature space coordinate system uses the term frequency (TF) as its measure, it reflects the characteristics of similar texts. In addition, considering a word's ability to distinguish categories, TF-IDF holds that the lower the document frequency of a word, the better it distinguishes texts of different categories. The concept of inverse document frequency (IDF) is therefore introduced.
  • The product of TF and IDF is used as the measure in the feature space coordinate system, which adjusts the TF weight.
  • The purpose of adjusting the weight is to highlight important words and suppress minor ones. In other words, if a word or phrase appears frequently in one article and rarely in others, it is considered to have good discriminating power and to be suitable for classification. Compared with clustering the risk records directly, this yields better clustering results.
  • Clustering according to the importance matrix of the risk records groups cases with similar risk information together more accurately. Using the clustering results as the input data of the SNA network can greatly improve the accuracy of fraud case identification.
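  • The TF-IDF weighting can be sketched as follows. This is a minimal version assuming tf = count / document length and idf = log(N / document frequency); the text does not fix a particular variant, and the tokenized records are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: weight} dict per document (a list of tokens).

    Assumption: tf = count / len(doc) and idf = log(N / df), without smoothing.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document

    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights


# Hypothetical segmented risk records.
records = [
    ["rear", "collision", "minor", "damage"],
    ["rear", "collision", "injury", "hospital"],
    ["hospital", "drug", "purchase", "drug"],
]
weights = tf_idf(records)
```

  • In the third record, "drug" appears twice and nowhere else, so it gets a higher weight than "hospital", which also occurs in the second record.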
  • The clustering algorithm may be K-means clustering, mean shift clustering, density-based spatial clustering of applications with noise (DBSCAN), expectation-maximization (EM) clustering using a Gaussian mixture model, or agglomerative hierarchical clustering.
  • The clustering result may group together risk records from the same province or city, or risk records that involve the same type of insurance. For example, the risk records of minor damage cases are grouped together, or the cases from the same province are grouped together. No specific limitation is imposed here.
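  • As an illustration of the first algorithm in the list, the sketch below implements plain K-means on small feature vectors. The two-dimensional vectors are invented stand-ins for per-record TF-IDF features, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic.

```python
def k_means(points, k, iterations=20):
    """Plain K-means on lists of floats; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2-D features, e.g. weights for "collision" and "drug".
vectors = [[0.9, 0.1], [0.1, 0.8], [0.8, 0.2], [0.2, 0.9], [0.85, 0.0]]
labels, centroids = k_means(vectors, k=2)
```

  • The records with a high first coordinate end up in one category group and the rest in the other. Each group would then be passed to the SNA step separately.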
  • S104: Use a social network analysis (SNA) algorithm to analyze each category group of the risk records separately and identify fraud cases in the multiple risk records.
  • Using the social network analysis algorithm to analyze each category group separately and identify fraud cases in the multiple risk records includes the following.
  • Based on the risk records of the multiple category groups, a social network analysis algorithm is used to establish risk case relationship networks, where the risk records of one category group correspond to one or more risk case relationship networks. Each risk case relationship network includes multiple nodes.
  • A node represents an individual, an organization, or a virtual individual in the risk records, and a connection between nodes indicates that a social relationship exists between them.
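  • A relationship network of this kind can be sketched as an undirected graph. The example below assumes that entities appearing in the same risk record are connected and that each connected component is one risk case relationship network; both rules, and all entity names, are illustrative assumptions.

```python
from collections import defaultdict, deque


def build_network(records):
    """Link every pair of entities that appear in the same risk record."""
    graph = defaultdict(set)
    for entities in records:
        for a in entities:
            graph[a].update(b for b in entities if b != a)
    return dict(graph)


def connected_components(graph):
    """Split the graph into its connected components using breadth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Hypothetical records from one category group: each lists the entities involved.
records = [
    ["driver:Zhang", "shop:A", "phone:138"],
    ["driver:Li", "shop:A"],
    ["driver:Wang", "shop:B"],
]
graph = build_network(records)
networks = connected_components(graph)
```

  • Here one category group yields two networks: Zhang and Li are linked through the shared repair shop, while Wang forms a separate network.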
  • The social network analysis algorithm is used to analyze the relationship between each node in the risk case relationship network and the other nodes, and to extract the group risk characteristics corresponding to each node.
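  • The text does not list the exact features extracted from the network. As one hedged possibility, the sketch below computes each node's degree and degree centrality from a hypothetical adjacency mapping; real group risk characteristics would likely combine such structural measures with domain attributes.

```python
def node_features(graph):
    """Compute simple structural features for each node of an undirected graph."""
    n = len(graph)
    features = {}
    for node, neighbours in graph.items():
        degree = len(neighbours)
        features[node] = {
            "degree": degree,
            # Share of the other nodes that this node is directly linked to.
            "degree_centrality": degree / (n - 1) if n > 1 else 0.0,
        }
    return features


# Hypothetical adjacency mapping for one risk case relationship network.
graph = {
    "driver:Zhang": {"shop:A", "phone:138"},
    "driver:Li": {"shop:A"},
    "shop:A": {"driver:Zhang", "driver:Li"},
    "phone:138": {"driver:Zhang"},
}
features = node_features(graph)
```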
  • The group risk characteristics corresponding to each node are input into a classification model to obtain the fraud rate of each node.
  • The classification model is obtained by training a neural network on a sample set, and the sample set includes known group risk feature data of multiple dimensions and the corresponding known fraud rate data. According to the fraud rate of each node, the risk records of the nodes whose fraud rate is higher than a first threshold are identified as fraud cases, and those nodes are identified as a fraud gang.
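  • The thresholding step can be sketched as below. The fraud rates are hard-coded stand-ins for the output of the trained classification model, and the threshold value 0.8 is hypothetical, since the text gives no number.

```python
FIRST_THRESHOLD = 0.8  # hypothetical value; the text does not specify one


def identify_fraud(fraud_rates, node_records, threshold=FIRST_THRESHOLD):
    """Return the nodes above the threshold and the risk records they took part in."""
    gang = {node for node, rate in fraud_rates.items() if rate > threshold}
    cases = set()
    for node in gang:
        cases.update(node_records.get(node, ()))
    return gang, cases


# Stand-in model outputs and node-to-record links, for illustration only.
fraud_rates = {"driver:Zhang": 0.93, "driver:Li": 0.87, "driver:Wang": 0.12}
node_records = {
    "driver:Zhang": ["R1", "R2"],
    "driver:Li": ["R2", "R3"],
    "driver:Wang": ["R4"],
}
gang, cases = identify_fraud(fraud_rates, node_records)
```

  • Zhang and Li exceed the threshold and form the identified gang, so records R1, R2, and R3 are flagged as fraud cases while R4 is not.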
  • An SNA network is a collection of points (social actors) and the connections between them (relationships between actors).
  • Each node can be an entity or virtual individual such as an organization, a person, or a network ID, and the relationship between individuals can be kinship or friendship, an action, the sending and receiving of messages, or another kind of relationship.
  • Through SNA network analysis, the key information needed, namely the group risk behavior of each node, can be found in messy data and connection relationships. Taking medical insurance risk data as an example, the risk characteristics of each node can be the area where the patient is located, the hospital the patient visits, the quantity and time of the items the patient purchases, the disease the patient has, and the doctor the patient sees.
  • Analyzing a patient's group risk behavior amounts to a comprehensive analysis of the area where the patient is located, the quantity and time of the items purchased, and the diseases the patient has had. If a patient is found to have bought large quantities of drugs at different hospitals many times, and the types of drugs differ, the group risk characteristics can be determined as: the user purchases large quantities of medicine, many types of medicine, and so on. As another example, for vehicle insurance risk data, the risk characteristics of each node may be the city where the vehicle is located, the license plate number, the type of insurance purchased for the vehicle, the traffic police who handled the accident, the identity of the perpetrator, and the identity of the victim.
  • For example, the group risk characteristic may be that the user has been involved in minor damage cases multiple times. If a vehicle is found to have had accidents at different locations multiple times with the same person as the victim, the group risk characteristic can be determined as: the user has repeatedly cooperated with others to defraud the insurer. The group risk characteristics of other risk data, such as commercial insurance and accident insurance, can be obtained in the same way. The above examples are for illustration only and do not constitute a specific limitation.
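  • The vehicle insurance rule above (the same vehicle and the same victim at several different locations) can be sketched as a simple grouping. The field names `plate`, `victim`, and `location`, the sample claims, and the cut-off of two locations are all hypothetical.

```python
from collections import defaultdict


def repeated_victim_pairs(claims, min_locations=2):
    """Find (plate, victim) pairs that appear at several different locations."""
    locations = defaultdict(set)
    for claim in claims:
        locations[(claim["plate"], claim["victim"])].add(claim["location"])
    return {pair for pair, locs in locations.items() if len(locs) >= min_locations}


# Hypothetical vehicle insurance risk records.
claims = [
    {"plate": "P-001", "victim": "V1", "location": "Road A"},
    {"plate": "P-001", "victim": "V1", "location": "Road B"},
    {"plate": "P-002", "victim": "V2", "location": "Road A"},
    {"plate": "P-002", "victim": "V3", "location": "Road C"},
]
suspicious = repeated_victim_pairs(claims)
```

  • Only vehicle P-001 is flagged, because it had accidents at two locations with the same victim. Vehicle P-002 had two accidents as well, but with different victims.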
  • Fig. 2 is a schematic diagram of a case relationship network established using the SNA algorithm in one scenario provided by this application. Gray dots represent users with many risk records, black dots represent users with risk records but fewer of them, and white dots represent users with no risk records.
  • The high risk rate of the gang reaches 66.8%, indicating that the average risk rate of the gang is high.
  • Users with risk records account for 91.4% of all users, which further confirms the fraudulent nature of the gang. Once a fraud gang is identified, the risk records in which the gang participated can be confirmed as fraud cases.
  • Fig. 3 is a schematic diagram of a case relationship network established using the SNA algorithm in another scenario provided by this application.
  • The black dots represent users who have risk records, but only a small number of them.
  • In this case relationship network, neither the fraud rate nor the risk rate of each node is high.
  • Nevertheless, the case relationship network in this scenario carries a higher risk.
  • Such a network usually arises when a group of people cooperate to commit fraud. Their mutual connections mean that the members know one another pairwise, and the purpose behind this is mostly collusion to forge information that meets the claim conditions, which requires special attention. The above examples are for illustration only.
  • The case relationship network established by the SNA algorithm can also have network structures other than those in Fig. 2 and Fig. 3.
  • The analysis method may also differ from the methods used for Fig. 2 and Fig. 3. This application does not impose specific restrictions.
  • In the above method, each of the multiple risk records is converted into multiple word segmentation vectors, and the multiple risk records are clustered according to those vectors to obtain risk records of multiple category groups.
  • A social network analysis algorithm is then used to analyze each category group separately and identify fraud cases in the multiple risk records. Because the risk records are clustered with high precision, the records within each category are highly related. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established more reliably, which improves the accuracy of group fraud identification.
  • Fig. 4 is a schematic structural diagram of a data recognition system provided by the present application. As shown in Fig. 4, the data recognition system includes an acquisition unit 410, a conversion unit 420, a clustering unit 430, and an identification unit 440, wherein:
  • The acquisition unit 410 is configured to acquire multiple risk records.
  • The conversion unit 420 is configured to convert each of the multiple risk records into multiple word segmentation vectors.
  • The clustering unit 430 is configured to cluster the multiple risk records according to the multiple word segmentation vectors to obtain risk records of multiple category groups.
  • The identification unit 440 is configured to use a social network analysis algorithm to analyze each category group of the risk records separately and identify fraud cases in the multiple risk records.
  • The conversion unit 420 is specifically configured to: perform word segmentation on each of the multiple risk records to divide it into multiple segmented words; filter the segmented words using the stop word set and the reserved word set to obtain the filtered segmented words, where the stop word set is a collection of segmented words unrelated to the risk record information and the reserved word set is a preset collection of words that must not be filtered out; and vectorize the filtered segmented words to obtain the multiple word segmentation vectors of each risk record.
  • The clustering unit 430 is specifically configured to screen out the target word segmentation vectors from the multiple word segmentation vectors of each risk record, where a target word segmentation vector appears in the risk record to which it belongs more frequently than in other risk records, or appears in the risk record to which it belongs less frequently than in other risk records.
  • The clustering unit 430 is specifically configured to use a clustering algorithm to cluster the risk records that contain the same or similar target word segmentation vectors into risk records of multiple category groups. In other words, converting a risk record into word segmentation vectors produces many word vectors.
  • The target word segmentation vectors can be screened using the TF-IDF method, a statistical method for evaluating how important a word is to one document in a document set or corpus. The words most useful for distinguishing a document are those that appear frequently in that document but rarely in the other documents of the collection. If the feature space coordinate system uses the term frequency (TF) as its measure, it reflects the characteristics of similar texts. In addition, considering a word's ability to distinguish categories, TF-IDF holds that the lower the document frequency of a word, the better it distinguishes texts of different categories. The concept of inverse document frequency (IDF) is therefore introduced.
  • The product of TF and IDF is used as the measure in the feature space coordinate system, which adjusts the TF weight.
  • The purpose of adjusting the weight is to highlight important words and suppress minor ones. In other words, if a word or phrase appears frequently in one article and rarely in others, it is considered to have good discriminating power and to be suitable for classification. Compared with clustering the risk records directly, this yields better clustering results. After the word segmentation vectors suitable for classification are selected, clustering according to the importance matrix of the risk records groups cases with similar risk information together more accurately. Using the clustering results as the input data of the SNA network can greatly improve the accuracy of fraud case identification.
  • The identification unit 440 is specifically configured to obtain the risk records of the multiple category groups.
  • The identification unit 440 is specifically configured to use a social network analysis algorithm to establish risk case relationship networks based on the risk records of the multiple category groups, where the risk records of one category group correspond to one or more risk case relationship networks.
  • Each risk case relationship network includes multiple nodes. A node represents an individual, an organization, or a virtual individual in the risk records, and a connection between nodes indicates that a social relationship exists between them.
  • The identification unit 440 is specifically configured to analyze, through the social network analysis algorithm, the relationship between each node in the risk case relationship network and the other nodes, and to extract the group risk characteristics corresponding to each node.
  • The identification unit 440 is specifically configured to input the group risk characteristics corresponding to each node into a classification model to obtain the fraud rate of each node, where the classification model is obtained by training a neural network on a sample set, and the sample set includes known group risk feature data of multiple dimensions and the corresponding known fraud rate data.
  • The identification unit 440 is specifically configured to identify, according to the fraud rate of each node, the risk records of the nodes whose fraud rate is higher than a first threshold as fraud cases, and to identify those nodes as a fraud gang.
  • Figure 2 is a case relationship network established by the SNA algorithm, in which the gray dots represent users with more risk records, the black dots represent users with risk records but fewer traverses, and white dots represent users with no risk records. user.
  • the high risk rate of the gang reaches 66.8%, indicating that the average risk rate of the gang is high.
  • the risky users accounted for 91.4% of all users, further verifying the fraudulent nature of the gang. Understandably, once a fraud gang is identified, the risk records in which the gang participated can be confirmed as fraud cases.
  • Figure 3 is a schematic diagram of a case relationship network established using the SNA algorithm in another scenario provided by this application.
  • the black dots represent users who have risk records, but only a small number of them.
  • the rate is not high, and the risk rate is not high, but because any two related nodes in the case relationship network are connected to each other, the network as a whole carries a higher risk.
  • such a network is usually backed by a group of multiple people acting in collaboration.
  • each of the multiple risk records is converted into multiple word segmentation vectors, and the multiple risk records are clustered according to the multiple word segmentation vectors to obtain risk records of multiple category groups.
  • a social network analysis algorithm is used to analyze each category group in the risk records of the multiple category groups to identify fraud cases in the multiple risk records. Because the risk records are clustered with high precision, the risk records in each category are highly correlated data. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established better, thereby improving the accuracy of group fraud identification.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by the present application.
  • the electronic device in this embodiment as shown in the figure may include.
  • the processor 511, the memory 512, and the communication interface 513 may be connected through a bus 514.
  • the processor 511 includes one or more general-purpose processors, where a general-purpose processor can be any type of device capable of processing electronic instructions, including a central processing unit (Central Processing Unit, CPU), a graphics processor (Graphics Processing Unit, GPU), microprocessors, microcontrollers, main processors, controllers, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processor, DSP), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the processor 511 is configured to execute program instructions stored in the memory 512.
  • the memory 512 may include a volatile memory, such as random access memory (Random Access Memory, RAM).
  • the memory may also include non-volatile memory, such as read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), a hard disk (Hard Disk Drive, HDD) or a solid-state drive (Solid-State Drive, SSD). The memory may also include a combination of the above types of memory.
  • the memory 512 may adopt centralized storage or distributed storage, which is not specifically limited here. It is understandable that the memory 512 is used to store computer programs, such as computer program instructions. In the embodiment of the present application, the memory 512 may provide instructions and data to the processor 511.
  • the communication interface 513 may be a wired interface (such as an Ethernet interface) or a wireless interface (such as a cellular network interface or a wireless local area network interface) for communicating with other computer devices or users.
  • the communication interface 513 may adopt a protocol above the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol family, for example, the remote function call (Remote Function Call, RFC) protocol, the Simple Object Access Protocol (Simple Object Access Protocol, SOAP), the Simple Network Management Protocol (Simple Network Management Protocol, SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, etc.
  • the communication interface 513 is a wireless interface
  • GSM Global System for Mobile Communication
  • CDMA Code Division Multiple Access
  • the processor 511, the memory 512, the communication interface 513, and the bus 514 can execute the implementation described in any embodiment of the data identification method provided in the embodiment of the present application.
  • the processor 511 can be used to call the instructions in the memory and execute the following method: obtain multiple risk records; convert each of the multiple risk records into multiple word segmentation vectors; cluster the multiple risk records according to the multiple word segmentation vectors to obtain risk records of multiple category groups; and use a social network analysis algorithm to analyze each category group in the risk records of the multiple category groups to identify fraud cases in the multiple risk records.
  • the processor 511 is specifically configured to execute: perform word segmentation on each of the multiple risk records, dividing each risk record into multiple word segmentation words; among the multiple word segmentation words, delete those that match words in the stop word set and keep those that match words in the reserved word set, to obtain the filtered word segmentation words, where the stop word set is a set of multiple word segmentation words irrelevant to the risk record information.
  • the reserved word set is a preset set of words that must not be filtered out. Map the filtered word segmentation words into multiple word segmentation vectors.
  • the processor 511 is specifically configured to perform: filter out the target word segmentation vector from the multiple word segmentation vectors of each risk record, wherein the target word segmentation vector appears more frequently in the risk record to which it belongs than in other risk records, or the target word segmentation vector appears less frequently in the risk record to which it belongs than in other risk records.
  • the processor 511 is specifically configured to execute: obtain the risk records of multiple category groups; based on the risk records of the multiple category groups, use a social network analysis algorithm to establish a risk case relationship network, where the risk records of one category group correspond to one or more risk case relationship networks.
  • the risk case relationship network includes multiple nodes, which represent the individuals, organizations or virtual individuals in the risk records, and a connection between nodes indicates that a social relationship exists between those nodes.
  • a social network analysis algorithm is used to analyze the relationship between each node in the risk case relationship network and other nodes, and extract the group risk characteristics corresponding to each node.
  • the group risk characteristics corresponding to each node are input into the classification model to obtain the fraud rate of each node, wherein the classification model is a model obtained by training a neural network using a sample set, and the sample set includes known multi-dimensional group risk characteristic data and the corresponding known fraud rate data.
  • according to the fraud rate of each node, the risk records of the nodes whose fraud rate is higher than the first threshold are identified as fraud cases, and the nodes whose fraud rate is higher than the first threshold are identified as a fraud gang.
  • the clustering algorithm is one or more of the K-means clustering algorithm, the mean shift clustering algorithm, the density-based clustering algorithm with noise, the maximum expectation clustering algorithm using a Gaussian mixture model, and the agglomerative hierarchical clustering algorithm.
  • a computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, implement the implementation described in any of the embodiments of the data identification method provided in this application, which will not be repeated here.
  • the computer-readable storage medium may be the internal storage unit of the terminal described in any of the foregoing embodiments, such as the hard disk or memory of the terminal.
  • the computer-readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk equipped on the terminal, a smart memory card (Smart Media Card, SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and so on.
  • the computer-readable storage medium may also include both an internal storage unit of the terminal and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the terminal.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or will be output.
  • a computer readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following method.
  • multiple risk records are clustered to obtain multiple category groups of risk records.
  • a social network analysis algorithm is used to separately analyze each category group in the multiple category groups of risk exposure records, and identify fraud cases in the multiple category groups.
  • word segmentation is performed on each of the multiple risk records, dividing each risk record into multiple word segmentation words. Among the multiple word segmentation words, those that match words in the stop word set are deleted and those that match words in the reserved word set are kept, to obtain the filtered word segmentation words, where the stop word set is a set of multiple word segmentation words irrelevant to the risk record information.
  • the reserved word set is a preset set of words that must not be filtered out. The filtered word segmentation words are mapped into multiple word segmentation vectors.
  • the following method is specifically implemented: from the multiple word segmentation vectors of each risk record, filter out the target word segmentation vector, where the target word segmentation vector appears more frequently in the risk record to which it belongs than in other risk records, or the target word segmentation vector appears less frequently in the risk record to which it belongs than in other risk records.
  • according to the clustering algorithm, multiple risk records containing the same or similar target word segmentation vectors are clustered into risk records of multiple category groups.
  • when the computer program is executed by the processor, the following method is specifically implemented.
  • obtain the risk records of multiple category groups; based on the risk records of the multiple category groups, a social network analysis algorithm is used to establish a risk case relationship network.
  • the risk records of a category group correspond to one or more risk case relationship networks.
  • the risk case relationship network includes multiple nodes, which represent the individuals, organizations or virtual individuals in the risk records, and a connection between nodes indicates that a social relationship exists between those nodes.
  • a social network analysis algorithm is used to analyze the relationship between each node in the risk case relationship network and other nodes, and extract the group risk characteristics corresponding to each node.
  • the group risk characteristics corresponding to each node are input into the classification model to obtain the fraud rate of each node.
  • the classification model is a model obtained by training the neural network using a sample set.
  • the sample set includes known multi-dimensional group risk characteristic data and the corresponding known fraud rate data.
  • according to the fraud rate of each node, the risk records of the nodes whose fraud rate is higher than the first threshold are identified as fraud cases, and the nodes whose fraud rate is higher than the first threshold are identified as a fraud gang.
  • the clustering algorithm is one or more of the K-means clustering algorithm, the mean shift clustering algorithm, the density-based clustering algorithm with noise, the maximum expectation clustering algorithm using a Gaussian mixture model, and the agglomerative hierarchical clustering algorithm.
  • the disclosed method and device can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of this application in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: USB flash drives, removable hard disks, ROM, RAM, magnetic disks, optical disks and other media that can store program code.
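The identification steps described in this section (building a risk case relationship network, extracting a group risk characteristic per node, scoring each node, and comparing the score with the first threshold) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the party names, the three features, the threshold value, and in particular `placeholder_fraud_rate`, which merely stands in for the trained neural network classification model that the application describes.

```python
from itertools import combinations

def build_network(records):
    """Link every pair of parties that appear in the same risk record."""
    neighbors = {}      # node -> set of directly connected nodes
    record_count = {}   # node -> number of risk records it appears in
    for parties in records:
        unique = sorted(set(parties))
        for party in unique:
            neighbors.setdefault(party, set())
            record_count[party] = record_count.get(party, 0) + 1
        for a, b in combinations(unique, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors, record_count

def group_features(neighbors, record_count):
    """Hypothetical per-node features: own record count, degree,
    and the mean record count of the node's neighbors."""
    features = {}
    for node, linked in neighbors.items():
        if linked:
            mean_neighbor = sum(record_count[n] for n in linked) / len(linked)
        else:
            mean_neighbor = 0.0
        features[node] = (record_count[node], len(linked), mean_neighbor)
    return features

def placeholder_fraud_rate(feature):
    """Stand-in scoring rule, NOT the trained classification model."""
    own_records, degree, mean_neighbor = feature
    return min(1.0, 0.1 * own_records + 0.05 * degree + 0.1 * mean_neighbor)

# Hypothetical risk records: each lists the parties involved in one case.
records = [
    ["driver_A", "repair_shop_X", "victim_B"],
    ["driver_A", "repair_shop_X", "victim_B"],
    ["driver_C", "repair_shop_X", "victim_B"],
    ["driver_D", "repair_shop_Y"],
]
FIRST_THRESHOLD = 0.55  # assumed value for illustration

neighbors, record_count = build_network(records)
features = group_features(neighbors, record_count)
# Nodes whose score exceeds the first threshold form the suspected gang.
flagged = {node for node, f in features.items()
           if placeholder_fraud_rate(f) > FIRST_THRESHOLD}
```

In this toy data the three parties that repeatedly appear together are flagged, while the isolated pair is not. A real system would replace the scoring rule with a model trained on labeled samples.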

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Databases & Information Systems (AREA)
  • Finance (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a data recognition method and system, an electronic device and a computer storage medium. The method comprises: acquiring a plurality of accident records (S101); converting each of the plurality of accident records into a plurality of segmented word vectors (S102); clustering the plurality of accident records according to the plurality of segmented word vectors to obtain a plurality of category groups of accident records (S103); and respectively analyzing each category group of the plurality of category groups of accident records by using a social network analysis algorithm, to recognize fraudulent cases from the plurality of accident records (S104).

Description

Data recognition method, system, electronic device and computer storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on July 23, 2019, with application number 2019106648203 and the invention title "Data recognition method, system, electronic device and computer storage medium", the entire contents of which are incorporated into this application by reference.
Technical field
This application relates to the computer field, in particular to data recognition methods, systems, electronic devices, and computer storage media, and specifically to fraud detection technology in financial technology.
Background
At present, anti-fraud judgments in the freight vehicle insurance industry are made mainly through manual analysis of claim report descriptions and on-site surveys, which cannot identify group fraud. Even when a network is built with a social network analysis (Social Network Analysis, SNA) algorithm to identify group fraud, in complex real-world scenarios the model relies only on discrete features and a small number of continuous features, and sometimes performs poorly. The inventor realized that current anti-fraud identification lacks technical means and that the fraud identification rate is low.
Technical problem
This application provides a data recognition method, system, electronic device and computer storage medium to solve the problem of the low identification rate of group fraud.
Technical solution
In a first aspect, this application provides a data recognition method, which includes the following steps. Obtain multiple risk records. Convert each of the multiple risk records into multiple word segmentation vectors. According to the multiple word segmentation vectors, cluster the multiple risk records to obtain risk records of multiple category groups. Use a social network analysis algorithm to analyze each category group in the risk records of the multiple category groups separately, and identify fraud cases in the multiple risk records. Because the risk records are clustered with high precision, the risk records in each category are highly correlated data. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established better, thereby improving the accuracy of group fraud identification.
In a second aspect, a data recognition system is provided. The system includes an acquisition unit, a conversion unit, a clustering unit and an identification unit. The acquisition unit is used to obtain multiple risk records. The conversion unit is used to convert each of the multiple risk records into multiple word segmentation vectors. The clustering unit is used to cluster the multiple risk records according to the multiple word segmentation vectors to obtain risk records of multiple category groups. The identification unit is used to analyze each category group in the risk records of the multiple category groups separately using a social network analysis algorithm, and to identify fraud cases in the multiple risk records. Because the system clusters the risk records with high precision, the risk records in each category are highly correlated data, so when the clustered risk records are used for SNA analysis, the system can establish the risk case relationship network better, thereby improving the accuracy of group fraud identification.
In a third aspect, an electronic device is provided, including a processor, an input device, an output device and a memory, which are connected to each other, wherein the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method described in the first aspect above.
In a fourth aspect, a computer-readable storage medium is provided. The computer storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method of the first aspect above.
Beneficial effects
It can be seen that, based on the data recognition method, system, electronic device and computer storage medium provided by this application, multiple risk records are obtained, each of the multiple risk records is converted into multiple word segmentation vectors, the multiple risk records are clustered according to the multiple word segmentation vectors to obtain risk records of multiple category groups, and a social network analysis algorithm is used to analyze each category group in the risk records of the multiple category groups separately to identify fraud cases in the multiple risk records. Because the risk records are clustered with high precision, the risk records in each category are highly correlated data. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established better, thereby improving the accuracy of group fraud identification.
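The clustering step summarized above can be sketched in a few lines. This is only an illustration under stated assumptions: records are assumed to be already segmented into tokens, each record is represented by a simple word-count vector rather than a learned word vector, K-means is used as one possible choice of clustering algorithm, and the initial centroids are picked by hand so that the run is deterministic. All record contents and function names are hypothetical.

```python
from collections import Counter

def to_count_vector(tokens, vocabulary):
    """Represent one segmented risk record as word counts over the vocabulary."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, initial_centroids, iterations=10):
    """Plain K-means: alternate between assigning points and updating centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each record joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(vector, centroids[k]))
            for vector in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical, already-segmented risk records.
records = [
    ["truck", "rear-end", "highway"],
    ["truck", "rear-end", "tunnel"],
    ["cargo", "fire", "warehouse"],
    ["cargo", "fire", "depot"],
]
vocabulary = sorted({token for record in records for token in record})
vectors = [to_count_vector(record, vocabulary) for record in records]
# Hand-picked initial centroids (an assumption; real code would use a
# seeding strategy such as random restarts).
labels = k_means(vectors, [vectors[0], vectors[2]])
```

Records that share most of their tokens end up with the same label, and each labeled group would then be passed on to the SNA analysis separately.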
Description of the drawings
In order to explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a data recognition method provided by this application.
Fig. 2 is a schematic diagram of a case relationship network established using the SNA algorithm in one scenario provided by this application.
Fig. 3 is a schematic diagram of a case relationship network established using the SNA algorithm in another scenario provided by this application.
Fig. 4 is a schematic structural diagram of a data recognition system provided by this application.
Fig. 5 is a schematic block diagram of the structure of an electronic device provided by this application.
Embodiments of the invention
This application is described in further detail below through specific embodiments in conjunction with the drawings. In the following embodiments, many details are described so that this application can be better understood. However, those skilled in the art will readily recognize that some of the features can be omitted in different situations or replaced by other methods. In some cases, certain operations related to this application are not shown or described in the specification, in order to avoid the core of this application being obscured by excessive description. For those skilled in the art, a detailed description of these related operations is not necessary; they can fully understand the related operations from the description in the specification and general technical knowledge in the field.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements and components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should be noted that the terms used in the embodiments of this application are only for the purpose of describing specific embodiments and are not intended to limit this application. The singular forms "a", "said" and "the" used in the embodiments of this application and the appended claims are also intended to include plural forms, unless the context clearly indicates otherwise.
Fig. 1 is a schematic flowchart of a data recognition method provided by this application. As shown in Fig. 1, the data recognition method provided by this application includes the following steps.
S101: Obtain multiple risk records.
In the embodiments of this application, a risk event refers to the occurrence of the compensation or payment conditions stipulated or agreed in an insurance contract. For example, for a vehicle within its insurance period, the process of notifying or reporting to the insurance company after an accident occurs is a vehicle risk event. The risk record may be risk record data in a database. For example, for auto insurance, the risk information may be accident records, including the license plate of the vehicle in the accident, the accident location, the auto insurance policy number, the policy agent, claim records, insurance purchase records, the persons involved with the vehicle (including the driver, the person reporting the case, the beneficiary and the injured), as well as data such as the repair shop, the reporting telephone number, the inspection and repair location and GPS information, disease diagnosis records, and other insurance-related records. It can be understood that using risk records as the raw data for anti-fraud identification can greatly improve the accuracy of anti-fraud identification compared with ordinary insurance application data. It should be understood that the above examples are only for illustration and do not constitute a specific limitation.
S102: Convert each of the multiple risk records into multiple word segmentation vectors.
In the embodiments of this application, converting each of the multiple risk records into multiple word segmentation vectors includes: performing word segmentation on each of the multiple risk records, dividing each risk record into multiple word segmentation words; among the multiple word segmentation words, deleting those that are the same as words in the stop word set and keeping those that are the same as words in the reserved word set, to obtain the filtered word segmentation words, where the stop word set is a set of multiple word segmentation words unrelated to the risk record information, and the reserved word set is a preset set of words that must not be filtered out; and mapping the filtered word segmentation words into multiple word segmentation vectors. It should be understood that a risk record contains many pieces of information, for example, the specific course of the accident, the specific casualties, disease diagnosis records, the specific traffic police accident determination records, and so on, and the data length of each risk record also differs. Preprocessing the data directly would be a heavy workload, so word segmentation is first performed on the obtained risk records to improve data processing efficiency.
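The filtering rule in step S102 (delete words found in the stop word set, but always keep words found in the reserved word set) can be sketched as follows. The word lists are hypothetical English stand-ins chosen for readability, and giving the reserved set priority when a word appears in both sets is an assumption about how a conflict would be resolved.

```python
def filter_tokens(tokens, stop_words, reserved_words):
    """Drop stop words, but never drop a word that is in the reserved set."""
    kept = []
    for token in tokens:
        if token in reserved_words:
            kept.append(token)   # reserved words always survive
        elif token in stop_words:
            continue             # irrelevant to the risk record information
        else:
            kept.append(token)
    return kept

# Hypothetical word sets and an already-segmented risk record.
STOP_WORDS = {"the", "of", "a", "then", "one"}
RESERVED_WORDS = {"one"}  # e.g. the count matters in "one injured"
tokens = ["the", "truck", "hit", "a", "guardrail", "then", "one", "injured"]

filtered = filter_tokens(tokens, STOP_WORDS, RESERVED_WORDS)
```

The surviving tokens are what would subsequently be mapped to word segmentation vectors.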
Specifically, word segmentation may divide the Chinese character sequence of each risk record into individual words. Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects well how credible it is that they form a word. The frequency of each combination of adjacently co-occurring characters in the corpus can be counted and their mutual information calculated: the mutual information of two Chinese characters X and Y is defined from their adjacent co-occurrence probability and reflects how tightly the characters are bound together. When the tightness exceeds a threshold, the character group can be regarded as possibly forming a word, and segmentation can proceed accordingly. This method has limitations, however. It often extracts common character groups that co-occur frequently but are not words, such as “这一” (this), “之一” (one of), “有的” (some), “我的” (my), and “许多的” (many), which wastes storage space and lowers search efficiency. The stop word set and the reserved word set can therefore be combined to filter out words that occur frequently but carry no risk-related information, for example modal particles, adverbs, prepositions, and conjunctions, which have no definite meaning on their own and only play a role within a complete sentence. At the same time, statistical methods are used to identify new words; that is, string frequency statistics are combined with string matching, which retains the speed and efficiency of matching-based segmentation while gaining the ability of dictionary-free segmentation to recognize new words from context and resolve ambiguity automatically.
After segmentation, a vectorization operation is also needed to convert each risk record into word vectors. For example, “开心” (happy) and “幸福” (blissful) are very close in meaning to a person, but a computer has no way of knowing that. Each word therefore has to be expressed in a form the computer can process: the word is vectorized, that is, represented as a multi-dimensional vector of floating-point numbers, such that the distance between two word vectors reflects how close the words are in meaning. Semantically similar words are mapped to nearby positions in the vector space, so the computer can calculate the similarity between words. In other words, further anti-fraud identification is performed once the computer can capture the meaning the text is intended to express.
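The mutual information of adjacent characters can be illustrated with a short sketch. The text does not give a formula, so this example assumes the standard pointwise mutual information, log(P(X,Y) / (P(X)·P(Y))), estimated from raw counts; the function name `adjacent_pmi` is hypothetical.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Pointwise mutual information for every adjacent character pair.

    Assumption: PMI(X, Y) = log(P(X, Y) / (P(X) * P(Y))), with probabilities
    estimated from character and adjacent-pair counts over the corpus.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for text in corpus:
        char_counts.update(text)
        pair_counts.update(zip(text, text[1:]))
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())
    pmi = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = char_counts[x] / n_chars
        p_y = char_counts[y] / n_chars
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi
```

A pair whose score exceeds a chosen threshold would be treated as a candidate word. Because PMI also rewards rare pairs, it is usually combined with a minimum frequency, which is consistent with the limitation the text describes.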
S103: Cluster the multiple risk records according to the multiple word segmentation vectors to obtain risk records in multiple category groups.
In this embodiment, clustering the multiple risk records according to the multiple word segmentation vectors to obtain risk records in multiple category groups includes the following. Target word segmentation vectors are selected from the multiple word segmentation vectors of each risk record, where a target word segmentation vector appears in the risk record to which it belongs more frequently than in other risk records, or appears in the risk record to which it belongs less frequently than in other risk records. A clustering algorithm then groups risk records containing the same or similar target word segmentation vectors into multiple category groups. That is, converting the risk records into word segmentation vectors produces a large number of word vectors; even after filtering with the stop word set and the reserved word set, the volume of data is still large for the subsequent anti-fraud identification. If the data is filtered further and only the easily distinguishable data is clustered, clustering accuracy can be greatly improved, and feeding accurate clustering results into the SNA network yields more accurate identification results. Easily distinguishable data may be, for example, a word that appears many times in the risk record of one case but rarely in the risk information of other cases; such a word is considered to have good discriminating power. As another example, if only a small number of cases contain a certain word in their risk information fields, that word has good discriminating power. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
In this embodiment, the target word segmentation vectors may be selected using the term frequency-inverse document frequency (TF-IDF) method. TF-IDF is a statistical method for evaluating how important a word is to one document in a document set or corpus. The words most useful for distinguishing a document are those that appear frequently in that document but rarely in the other documents of the collection. Taking the term frequency (TF) as the measure along each axis of the feature space therefore reflects the characteristics of texts of the same kind. In addition, to account for a word's ability to distinguish between categories, TF-IDF holds that the fewer documents a word appears in, the greater its ability to distinguish texts of different categories. The inverse document frequency (IDF) is introduced for this purpose, and the product of TF and IDF is used as the measure along each axis of the feature space. This product adjusts the TF weight so that important words are emphasized and minor words are suppressed. In other words, if a word or phrase appears frequently in one document and rarely in others, it is considered to have good discriminating power between categories and to be suitable for classification. Compared with clustering the risk records directly, this gives better clustering results: after the word segmentation vectors suitable for classification are selected, clustering according to the importance matrix of the risk records groups cases with similar risk information more accurately, and using the clustering results as the input data of the SNA network can greatly improve the accuracy of fraud case identification.
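The TF-IDF weighting can be sketched in a few lines. The application does not specify which TF-IDF variant it uses, so this example assumes the plain form, with TF as the within-record relative frequency and IDF as log(N / document frequency) without smoothing. The names `tf_idf` and `top_terms` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(records):
    """Return one {token: weight} dict per record.

    Assumption: TF = count / record length, IDF = log(N / document frequency).
    """
    n_docs = len(records)
    doc_freq = Counter()
    for tokens in records:
        doc_freq.update(set(tokens))  # count each token once per record
    weights = []
    for tokens in records:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights


def top_terms(weight, k):
    """Pick the k highest-weighted tokens as candidate target words."""
    ranked = sorted(weight.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:k]]
```

A token that appears in every record receives a weight of zero under this variant, which is the "suppress minor words" effect the paragraph describes.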
In this embodiment, the clustering algorithm is one or more of the K-means clustering algorithm, the mean shift clustering algorithm, density-based spatial clustering of applications with noise (DBSCAN), expectation-maximization (EM) clustering using Gaussian mixture models, and agglomerative hierarchical clustering; this application does not impose a specific limitation. The clustering result may group together risk records from the same province or city, or risk records of customers who purchased the same type of insurance. For example, the risk records of minor damage cases may be grouped together, or cases from the same province may be grouped together; no specific limitation is imposed here.
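Of the algorithms listed, K-means is the simplest to illustrate. The following is a minimal sketch only: it assumes each risk record has already been turned into a numeric vector, and it seeds the centroids with the first k points so that the result is deterministic, which a production system would normally replace with a better initialization. The function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Plain K-means; returns (labels, centroids).

    Assumption: the first k points are used as initial centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids
```

Records that share a label form one category group, and each group is then analyzed separately in step S104.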
S104: For each category group of risk records, use a social network analysis (SNA) algorithm to identify the fraud cases in the multiple risk records.
In this embodiment, using the social network analysis algorithm to analyze each category group of risk records and identify the fraud cases in the multiple risk records includes the following steps. The risk records of the multiple category groups are obtained. Based on these records, a social network analysis algorithm is used to build risk case relationship networks, where the risk records of one category group correspond to one or more risk case relationship networks. A risk case relationship network includes multiple nodes; each node represents an individual, an organization, or a virtual individual in the risk records, and a line connecting nodes indicates that a social relationship exists between them. The social network analysis algorithm analyzes the relationship between each node and the other nodes in the network and extracts the group risk features corresponding to each node. These features are input into a classification model to obtain the fraud rate of each node, where the classification model is obtained by training a neural network on a sample set that includes known multi-dimensional group risk feature data and the corresponding known fraud rate data. According to the fraud rates, the risk records to which nodes with a fraud rate above a first threshold belong are identified as fraud cases, and those nodes are identified as a fraud gang.
An SNA network is a set consisting of points (social actors) and the lines between them (relationships between actors). Each node may be an entity or virtual individual such as an organization, a person, or a network ID, and the relationship between individuals may be kinship or friendship, actions and behaviors, the sending and receiving of messages, and so on. SNA network analysis makes it possible to find the key information needed, namely the group risk behavior of each node, in disordered data and connection relationships.
Taking medical insurance risk data as an example, the risk features of each node may be: the area where the patient is located, the hospital the patient visits, the quantity and time of the patient's drug purchases, the disease the patient has, and the doctor the patient sees. Analyzing patients' group risk behavior amounts to a comprehensive analysis of the area where the patients are located, the quantity and time of their drug purchases, the diseases they have, and so on. If a patient is found to have bought large quantities of drugs at different hospitals on many occasions, and the drugs are of different types, the group risk feature can be determined to be that the user buys drugs in large quantities and of many types. Taking vehicle insurance risk data as another example, the risk features of each node may be: the city where the vehicle is located, the license plate number, the type of insurance purchased for the vehicle, the traffic police officer who handled the accident, the identity of the party at fault, and the identity of the victim. If a vehicle is found to have had multiple risk occurrences at different locations, all of them minor damage cases, then because minor damage cases involve small amounts and can be reported and handled quickly, the group risk feature can be determined to be that the user has repeatedly taken part in minor damage cases. If a vehicle is found to have had multiple risk occurrences at different locations and the victim is the same person each time, the group risk feature can be determined to be that the user has repeatedly colluded with others to commit insurance fraud. Group risk features for other risk data, such as commercial insurance and accident insurance, can be obtained in the same way. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
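The network construction and thresholding steps can be sketched as follows. This is an illustration under stated assumptions rather than the application's method: entities are linked whenever they appear in the same risk record, the per-node features are limited to degree and record count, and the fraud rates are taken as given inputs because the trained neural network classifier is not reproduced here. All function names and the sample identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_network(records):
    """Link every pair of entities that appear in the same risk record.

    `records` maps a record id to the list of entities named in it.
    """
    graph = defaultdict(set)
    for entities in records.values():
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def node_features(graph, records):
    """Two simple group features per node: degree and number of records."""
    features = {}
    for node, neighbours in graph.items():
        n_records = sum(node in entities for entities in records.values())
        features[node] = {"degree": len(neighbours), "records": n_records}
    return features


def flag_fraud(fraud_rates, records, threshold):
    """Nodes above the threshold form the gang; their records are fraud cases.

    `fraud_rates` stands in for the output of the trained classifier.
    """
    gang = {node for node, rate in fraud_rates.items() if rate > threshold}
    cases = {rid for rid, entities in records.items() if gang & set(entities)}
    return gang, cases
```

In this sketch a repair shop that appears in many records with different drivers would show a high degree and record count, which is the kind of group feature the classifier would receive.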
In this embodiment, in addition to using a neural network to identify the fraud cases in the multiple risk records from the fraud rate of each node, high-risk cases can also be identified from the risk occurrence rate. For example, Figure 2 is a schematic diagram of a case relationship network built with the SNA algorithm in one scenario provided by this application, in which gray dots represent users with many risk records, black dots represent “crossing” users who have risk records but only a few, and white dots represent users with no risk records. Calculation and analysis of the data show that the gang's high risk occurrence rate reaches 66.8%, indicating that the gang's average risk occurrence rate is relatively high. Users with risk records account for 91.4% of all users, which further confirms the fraudulent nature of the gang. Once a fraud gang has been identified, the risk records in which the gang took part can be confirmed as fraud cases.
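The two percentages in the Figure 2 example can be read as simple shares over the users in a network. The sketch below is an interpretation, not something the text states: it assumes the high risk occurrence rate is the share of users whose number of risk records reaches some cutoff, and that the second figure is the share of users with at least one risk record. The function name `risk_rates`, the cutoff, and the sample counts are hypothetical.

```python
def risk_rates(record_counts, high_cutoff):
    """Return (share of high-frequency users, share of users with any record).

    `record_counts` maps a user to that user's number of risk records.
    Assumption: a user is "high-frequency" when the count reaches high_cutoff.
    """
    total = len(record_counts)
    high = sum(1 for count in record_counts.values() if count >= high_cutoff)
    with_any = sum(1 for count in record_counts.values() if count > 0)
    return high / total, with_any / total
```

For instance, with the toy counts `{"u1": 5, "u2": 4, "u3": 1, "u4": 0}` and a cutoff of 3, half of the users are high-frequency and three quarters have at least one risk record.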
In this embodiment, after the case relationship network is built with SNA, the network structure can also be matched against preset network models to identify high-risk cases. For example, Figure 3 is a schematic diagram of a case relationship network built with the SNA algorithm in another scenario provided by this application, in which black dots represent “crossing” users who have risk records but only a few. Although neither the fraud rate nor the risk occurrence rate of the nodes in this network is high, two related nodes are connected in a case relationship network, so the network in this scenario carries a higher risk. Behind such a network there is usually a gang of several people acting together: each pairwise connection indicates that the two parties know each other, and the purpose is mostly collusion to forge information so as to meet the conditions for a claim. Such networks require close attention. It should be understood that the above examples are for illustration only; the case relationship network built by the SNA algorithm may have a structure other than those in Figures 2 and 3, and the analysis method may differ from those used for Figures 2 and 3. This application does not impose a specific limitation.
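One way to match a network against a preset structure is to measure how close a group is to being fully pairwise connected. The application does not define its preset network models, so the following is only an assumed example of such a check, using edge density with a configurable minimum group size; the names `density` and `matches_collusion_pattern` are hypothetical.

```python
from itertools import combinations


def density(nodes, edges):
    """Share of possible node pairs that are actually connected.

    `edges` is a set of frozenset({a, b}) pairs (undirected links).
    """
    nodes = list(nodes)
    possible = len(nodes) * (len(nodes) - 1) / 2
    if possible == 0:
        return 0.0
    linked = sum(
        1 for a, b in combinations(nodes, 2) if frozenset((a, b)) in edges
    )
    return linked / possible


def matches_collusion_pattern(nodes, edges, min_size=3, min_density=1.0):
    """Flag a group whose members are (almost) all pairwise connected."""
    return len(nodes) >= min_size and density(nodes, edges) >= min_density
```

With the default `min_density=1.0` the check only fires on groups where every pair of members is connected, which corresponds to the "everyone knows everyone" structure described for Figure 3; lowering it tolerates a few missing links.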
In the above method, multiple risk records are acquired, each risk record is converted into multiple word segmentation vectors, the risk records are clustered according to these vectors to obtain risk records in multiple category groups, and a social network analysis algorithm analyzes each category group separately to identify the fraud cases in the multiple risk records. Because the risk records are clustered with high precision, the records in each category are highly correlated. When the clustered risk records are used for SNA analysis, the risk case relationship network can be built more reliably, which improves the accuracy of group fraud identification.
Figure 4 is a schematic structural diagram of a data recognition system provided by this application. As shown in Figure 4, the data recognition system includes an acquisition unit 410, a conversion unit 420, a clustering unit 430, and an identification unit 440, where:
The acquisition unit 410 is configured to acquire multiple risk records.
The conversion unit 420 is configured to convert each of the multiple risk records into multiple word segmentation vectors.
The clustering unit 430 is configured to cluster the multiple risk records according to the multiple word segmentation vectors to obtain risk records in multiple category groups.
The identification unit 440 is configured to use a social network analysis algorithm to analyze each category group of risk records separately and identify the fraud cases in the multiple risk records.
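The four-unit structure can be pictured as a pipeline in which each unit hands its output to the next. The sketch below is purely illustrative: the application describes the units functionally and does not prescribe any class layout, so the `DataRecognitionSystem` class, its field names, and the callable signatures are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataRecognitionSystem:
    """Hypothetical composition of units 410 to 440 as plain callables."""
    acquire: Callable[[], list]            # acquisition unit 410
    convert: Callable[[list], list]        # conversion unit 420
    cluster: Callable[[list, list], dict]  # clustering unit 430
    identify: Callable[[dict], set]        # identification unit 440

    def run(self) -> set:
        records = self.acquire()
        vectors = self.convert(records)
        groups = self.cluster(records, vectors)
        return self.identify(groups)
```

Keeping each unit behind a callable makes it straightforward to swap, for example, one clustering algorithm for another without touching the other units.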
In this embodiment, the conversion unit 420 is specifically configured to: perform word segmentation on each of the multiple risk records to divide each record into multiple segmented words; filter the segmented words using a stop word set together with a reserved word set to obtain the filtered segmented words, where the stop word set is a collection of segmented words unrelated to risk record information and the reserved word set is a preset collection of words that must not be filtered out; and vectorize the filtered segmented words to obtain the multiple word segmentation vectors of each of the multiple risk records. It should be understood that a risk record contains many pieces of information, such as the detailed course of the accident, the casualties, disease diagnosis records, and the traffic police's accident determination record, and that risk records differ in length. Preprocessing the raw records directly would involve a heavy workload, so the acquired risk records are first segmented into words to improve data processing efficiency.
In this embodiment, the clustering unit 430 is specifically configured to select target word segmentation vectors from the multiple word segmentation vectors of each risk record, where a target word segmentation vector appears in the risk record to which it belongs more frequently than in other risk records, or appears in the risk record to which it belongs less frequently than in other risk records. The clustering unit 430 is further configured to use a clustering algorithm to group risk records containing the same or similar target word segmentation vectors into multiple category groups. That is, converting the risk records into word segmentation vectors produces a large number of word vectors; even after filtering with the stop word set and the reserved word set, the volume of data is still large for the subsequent anti-fraud identification. If the data is filtered further and only the easily distinguishable data is clustered, clustering accuracy can be greatly improved, and feeding accurate clustering results into the SNA network yields more accurate identification results. Easily distinguishable data may be, for example, a word that appears many times in the risk record of one case but rarely in the risk information of other cases; such a word is considered to have good discriminating power. As another example, if only a small number of cases contain a certain word in their risk information fields, that word has good discriminating power. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
In this embodiment, the target word segmentation vectors may be selected using the TF-IDF method, a statistical method for evaluating how important a word is to one document in a document set or corpus. The words most useful for distinguishing a document are those that appear frequently in that document but rarely in the other documents of the collection. Taking the term frequency (TF) as the measure along each axis of the feature space therefore reflects the characteristics of texts of the same kind. In addition, to account for a word's ability to distinguish between categories, TF-IDF holds that the fewer documents a word appears in, the greater its ability to distinguish texts of different categories. The inverse document frequency (IDF) is introduced for this purpose, and the product of TF and IDF is used as the measure along each axis of the feature space. This product adjusts the TF weight so that important words are emphasized and minor words are suppressed. In other words, if a word or phrase appears frequently in one document and rarely in others, it is considered to have good discriminating power between categories and to be suitable for classification. Compared with clustering the risk records directly, this gives better clustering results: after the word segmentation vectors suitable for classification are selected, clustering according to the importance matrix of the risk records groups cases with similar risk information more accurately, and using the clustering results as the input data of the SNA network can greatly improve the accuracy of fraud case identification.
In the embodiments of the present application, the clustering algorithm is one or more of the K-means clustering algorithm, the mean shift clustering algorithm, density-based spatial clustering of applications with noise (DBSCAN), expectation-maximization (EM) clustering using a Gaussian mixture model, and the agglomerative hierarchical clustering algorithm, which is not specifically limited in this application. It can be understood that clustering may group together risk records from the same province or city, or risk records in which the same type of reimbursement was purchased, for example grouping the risk records of minor damage cases, or grouping cases from the same province, which is not specifically limited here.
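As an illustration of the first algorithm in the list, the following is a minimal K-means sketch in plain Python. It assumes each risk record has already been reduced to a fixed-length numeric vector (for example, TF-IDF weights); seeding the centroids with the first `k` points and running a fixed number of iterations are simplifications made here, not details taken from the application.

```python
import math

def k_means(points, k, iterations=20):
    """Cluster equal-length numeric vectors; returns one cluster index per point."""
    # Assumption: the first k points seed the centroids (deterministic, for illustration).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical importance vectors for four risk records.
vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.8, 0.2],
]
labels = k_means(vectors, k=2)
```

Records whose vectors are close to each other end up with the same label, and each label then plays the role of one category group.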
In the embodiments of the present application, the identification unit 440 is specifically configured to obtain the risk records of the multiple category groups. The identification unit 440 is specifically configured to build risk case relationship networks from the risk records of the multiple category groups using a social network analysis algorithm, where the risk records of one category group correspond to one or more risk case relationship networks, each risk case relationship network includes multiple nodes, each node represents an individual, organization, or virtual individual in the risk records, and a line between nodes indicates that a social relationship exists between them. The identification unit 440 is specifically configured to analyze, through the social network analysis algorithm, the relationship between each node and the other nodes in the risk case relationship network, and to extract the group risk characteristics corresponding to each node. The identification unit 440 is specifically configured to input the group risk characteristics corresponding to each node into a classification model to obtain the fraud rate of each node, where the classification model is obtained by training a neural network on a sample set, and the sample set includes known group risk characteristic data of multiple dimensions and the corresponding known fraud rate data. The identification unit 440 is specifically configured to, according to the fraud rate of each node, identify the risk records to which nodes with a fraud rate above a first threshold belong as fraud cases, and identify the nodes with a fraud rate above the first threshold as a fraud gang.
An SNA network is a set of points (social actors) and the lines between them (relationships between actors). Each node may be an entity or virtual individual such as an organization, a person, or a network ID, and the relationships between individuals may be of many kinds, such as kinship or friendship, actions and behaviors, or the sending and receiving of messages. SNA network analysis makes it possible to find the key information needed, namely the group risk behavior of each node, in messy data and connection relationships.
Taking medical insurance risk data as an example, the risk characteristics of each node may be: the area where the patient is located, the hospital the patient visits, the quantity and time of the patient's drug purchases, the disease the patient has, the doctor the patient consults, and so on. Analyzing the group risk behavior of patients amounts to a comprehensive analysis of the area where the patients are located, the quantity and time of their drug purchases, the diseases they have, and so on. If a patient is found to have purchased large quantities of drugs at different hospitals on multiple occasions, and the types of drugs differ, the group risk characteristic can be determined as: the user purchases drugs in large quantities, of many types, and so on.
Taking vehicle insurance risk data as another example, the risk characteristics of each node may be: the city where the vehicle is located, the license plate number, the type of insurance purchased for the vehicle, the traffic police officer who handled the accident, the identity of the party at fault, the identity of the victim, and so on. If a vehicle is found to have had multiple incidents at different locations, all of them minor damage cases, then, because minor damage cases involve small amounts and can be reported and handled quickly, the group risk characteristic can be determined as: the user has repeatedly been involved in minor damage cases. If a vehicle is found to have had multiple incidents at different locations with the same person as the victim each time, the group risk characteristic can be determined as: the user has repeatedly colluded with others to defraud the insurer. Group risk characteristics can be obtained in the same way for other risk data such as commercial insurance and accident insurance. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
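The network construction and the threshold step described above can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: two parties are linked when they appear in the same risk record, the fraud rates are hard-coded placeholders standing in for the output of the trained neural network, and the names `build_network`, `flag_fraud`, and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_network(records):
    """Map each party to the set of parties it shares at least one record with."""
    neighbors = defaultdict(set)
    for parties in records.values():
        for a, b in combinations(sorted(set(parties)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors

def flag_fraud(records, fraud_rate, threshold):
    """Return (suspect nodes, ids of records that involve a suspect node)."""
    suspects = {node for node, rate in fraud_rate.items() if rate > threshold}
    cases = {rid for rid, parties in records.items() if suspects & set(parties)}
    return suspects, cases

# Hypothetical records of one category group: record id -> parties involved.
records = {
    "r1": ["driver_a", "victim_x"],
    "r2": ["driver_b", "victim_x"],
    "r3": ["driver_c", "victim_y"],
}
network = build_network(records)

# Placeholder scores; in the described embodiment these come from the classification model.
fraud_rate = {"driver_a": 0.2, "driver_b": 0.3, "driver_c": 0.1,
              "victim_x": 0.9, "victim_y": 0.1}
suspects, cases = flag_fraud(records, fraud_rate, threshold=0.8)
```

Here the same victim appears in two records with different drivers, which mirrors the vehicle insurance example; only that node exceeds the first threshold, so both of its records are flagged.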
In the embodiments of the present application, in addition to using a neural network to identify fraud cases among the multiple risk records according to the fraud rate of each node, high-risk cases may also be identified according to the risk rate. For example, FIG. 2 is a case relationship network built by the SNA algorithm, in which gray dots represent users with many risk records, black dots represent crossing users who have risk records but fewer of them, and white dots represent users with no risk records. Calculation and analysis of the data show that the gang's high-risk rate reaches 66.8%, indicating that the gang's average risk rate is high. Users with risk records account for 91.4% of all users, which further confirms the gang's fraudulent nature. It can be understood that once a fraud gang has been confirmed, the risk records in which the gang participated can be confirmed as fraud cases.
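The application does not state how the two percentages are computed. One plausible reading, offered here only as an assumption, is that the high-risk rate is the share of gray nodes among all nodes and the share of users with risk records is the share of gray and black nodes together. The function name `group_rates` and the four-user sample are hypothetical and unrelated to the figures quoted above.

```python
def group_rates(node_colors):
    """node_colors maps each user to 'gray', 'black' or 'white', following the FIG. 2 legend."""
    total = len(node_colors)
    gray = sum(1 for c in node_colors.values() if c == "gray")
    black = sum(1 for c in node_colors.values() if c == "black")
    return {
        # Assumed definition: users with many risk records over all users.
        "high_risk_rate": gray / total,
        # Assumed definition: users with any risk record over all users.
        "risk_user_share": (gray + black) / total,
    }

colors = {"u1": "gray", "u2": "gray", "u3": "black", "u4": "white"}
rates = group_rates(colors)
```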
In the embodiments of the present application, after the case relationship network has been built using SNA, its structure may also be matched against preset network models to identify high-risk cases. For example, FIG. 3 is a schematic diagram of a case relationship network built using the SNA algorithm in another scenario provided by this application, in which black dots represent crossing users who have risk records but fewer of them. Although neither the fraud rate nor the risk rate of the nodes in this network is high, two nodes in a case relationship network are connected whenever they are related, and here the nodes are connected pairwise, so the network carries a higher risk. Behind such a network there is usually a gang of several people working together: the pairwise connections indicate that the members know one another, and the purpose is mostly collusion to forge information that meets the requirements for a claim, so such networks deserve close attention. It should be understood that the above examples are for illustration only. The case relationship network built by the SNA algorithm may have a structure other than those of FIG. 2 and FIG. 3, and the analysis method may also differ from those used for FIG. 2 and FIG. 3, which is not specifically limited in this application.
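One simple way to match a network against the pairwise-connected pattern described above is to test whether every node is linked to every other node, or to measure how close the network comes to that state. The sketch below assumes the network is stored as a symmetric adjacency mapping without self-loops; the names `is_fully_connected` and `density`, and both sample networks, are hypothetical.

```python
def is_fully_connected(neighbors):
    """True when every node is directly linked to every other node."""
    nodes = set(neighbors)
    return len(nodes) > 1 and all(
        neighbors[node] - {node} == nodes - {node} for node in nodes
    )

def density(neighbors):
    """Share of possible node pairs that are actually linked (0.0 to 1.0)."""
    n = len(neighbors)
    if n < 2:
        return 0.0
    # Each undirected edge is counted once from each endpoint.
    edges = sum(len(adj) for adj in neighbors.values()) / 2
    return edges / (n * (n - 1) / 2)

# Every member knows every other member: the pattern flagged as higher risk.
clique = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
}
# One hub linked to three otherwise unrelated users.
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
```

A density threshold below 1.0 could be used in place of the strict test when the preset model should tolerate a few missing links; the appropriate threshold is not specified in the application.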
In the above system, multiple risk records are obtained, each of the multiple risk records is converted into multiple word segmentation vectors, the multiple risk records are clustered according to the multiple word segmentation vectors to obtain risk records of multiple category groups, and a social network analysis algorithm is used to analyze each of the category groups separately to identify fraud cases among the multiple risk records. Because the risk records are clustered with high precision, the risk records within each category are highly correlated. When the clustered risk records are used for SNA analysis, the risk case relationship network can be built more effectively, which improves the accuracy of group fraud identification.
Refer to FIG. 5, which is a schematic structural diagram of an electronic device provided by this application. As shown in the figure, the electronic device in this embodiment may include one or more processors 511, a memory 512, and a communication interface 513. The processor 511, the memory 512, and the communication interface 513 may be connected through a bus 514.
The processor 511 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor, a microcontroller, a main processor, a controller, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The processor 511 is configured to execute the program instructions stored in the memory 512.
The memory 512 may include volatile memory, such as random access memory (RAM). The memory may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD), and may also include a combination of the above types of memory. The memory 512 may use centralized storage or distributed storage, which is not specifically limited here. It can be understood that the memory 512 is used to store computer programs, for example computer program instructions. In the embodiments of the present application, the memory 512 may provide instructions and data to the processor 511.
The communication interface 513 may be a wired interface (such as an Ethernet interface) or a wireless interface (such as a cellular network interface or a wireless local area network interface), and is used to communicate with other computer devices or users. When the communication interface 513 is a wired interface, it may use a protocol family on top of the Transmission Control Protocol/Internet Protocol (TCP/IP), for example the Remote Function Call (RFC) protocol, the Simple Object Access Protocol (SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, and so on. When the communication interface 513 is a wireless interface, it may use cellular communication according to the Global System for Mobile Communications (GSM) or Code Division Multiple Access (CDMA) standard, and therefore includes a wireless modem for data transmission, an electronic processing device, one or more digital memory devices, and dual antennas.
In the embodiments of the present application, the processor 511, the memory 512, the communication interface 513, and the bus 514 may carry out the implementation described in any embodiment of the data recognition method provided by the embodiments of this application. Specifically, the processor 511 may be used to call the instructions in the memory and perform the following method: obtain multiple risk records; convert each of the multiple risk records into multiple word segmentation vectors; cluster the multiple risk records according to the multiple word segmentation vectors to obtain risk records of multiple category groups; and use a social network analysis algorithm to analyze each category group of the risk records of the multiple category groups separately, to identify fraud cases among the multiple risk records.
In a specific embodiment, the processor 511 is specifically configured to: perform word segmentation on each of the multiple risk records, dividing each risk record into multiple segmented words; among the segmented words, delete those identical to words in a stop word set and retain those identical to words in a reserved word set, to obtain filtered segmented words, where the stop word set is a set of segmented words unrelated to the risk record information and the reserved word set is a preset set of words that must not be filtered out; and map the filtered segmented words to multiple word segmentation vectors.
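The filtering and mapping steps above can be sketched as follows. Several choices here are assumptions made for illustration: whitespace splitting stands in for a real word segmenter, a word found in both sets is kept because the reserved word set is described as words that must not be filtered out, and one-hot encoding is only one possible way to map words to vectors. The names `filter_tokens` and `to_one_hot` and the sample record are hypothetical.

```python
def filter_tokens(tokens, stop_words, reserved_words):
    """Drop stop words, but never drop a token that is in the reserved set."""
    return [t for t in tokens if t in reserved_words or t not in stop_words]

def to_one_hot(tokens, vocabulary):
    """Map each token to a vector with a single 1 at its vocabulary position."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for t in tokens:
        vec = [0] * len(vocabulary)
        vec[index[t]] = 1
        vectors.append(vec)
    return vectors

record = "the insured vehicle was hit from the rear at night"
tokens = record.split()  # stand-in for a real word segmenter
stop_words = {"the", "was", "from", "at", "night"}
reserved_words = {"night"}  # kept even though it is also listed as a stop word

kept = filter_tokens(tokens, stop_words, reserved_words)
vocabulary = sorted(set(kept))
vectors = to_one_hot(kept, vocabulary)
```

Checking the reserved set first means that a general-purpose stop word list can be reused without losing domain words that matter for risk records.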
In a specific embodiment, the processor 511 is specifically configured to: select target word segmentation vectors from the multiple word segmentation vectors of each risk record, where a target word segmentation vector appears in the risk record to which it belongs more frequently than in other risk records, or a target word segmentation vector appears in the risk record to which it belongs less frequently than in other risk records; and, through a clustering algorithm, cluster the risk records that contain the same or similar target word segmentation vectors into risk records of multiple category groups.
In a specific embodiment, the processor 511 is specifically configured to: obtain the risk records of the multiple category groups; build risk case relationship networks from the risk records of the multiple category groups using a social network analysis algorithm, where the risk records of one category group correspond to one or more risk case relationship networks, each risk case relationship network includes multiple nodes, each node represents an individual, organization, or virtual individual in the risk records, and a line between nodes indicates that a social relationship exists between them; analyze, through the social network analysis algorithm, the relationship between each node and the other nodes in the risk case relationship network, and extract the group risk characteristics corresponding to each node; input the group risk characteristics corresponding to each node into a classification model to obtain the fraud rate of each node, where the classification model is obtained by training a neural network on a sample set, and the sample set includes known group risk characteristic data of multiple dimensions and the corresponding known fraud rate data; and, according to the fraud rate of each node, identify the risk records to which nodes with a fraud rate above a first threshold belong as fraud cases, and identify the nodes with a fraud rate above the first threshold as a fraud gang.
In a specific embodiment, the clustering algorithm is one or more of the K-means clustering algorithm, the mean shift clustering algorithm, density-based spatial clustering of applications with noise (DBSCAN), expectation-maximization (EM) clustering using a Gaussian mixture model, and the agglomerative hierarchical clustering algorithm.
It should be noted that, for the specific content of the method executed by the processor 511, reference may be made to the relevant description of the foregoing method embodiments, which is not repeated here for brevity.
Another embodiment of this application provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, carry out the implementation described in any embodiment of the data recognition method provided by this application, which is not repeated here.
The computer-readable storage medium may be an internal storage unit of the terminal described in any of the foregoing embodiments, such as the terminal's hard disk or memory. It may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the terminal. Further, the computer-readable storage medium may include both an internal storage unit of the terminal and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the terminal. It may also be used to temporarily store data that has been output or is to be output.
In the embodiments of the present application, the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following method: obtain multiple risk records; convert each of the multiple risk records into multiple word segmentation vectors; cluster the multiple risk records according to the multiple word segmentation vectors to obtain risk records of multiple category groups; and use a social network analysis algorithm to analyze each category group of the risk records of the multiple category groups separately, to identify fraud cases among the multiple risk records.
In a specific embodiment, when executed by the processor, the computer program specifically implements the following method: perform word segmentation on each of the multiple risk records, dividing each risk record into multiple segmented words; among the segmented words, delete those identical to words in a stop word set and retain those identical to words in a reserved word set, to obtain filtered segmented words, where the stop word set is a set of segmented words unrelated to the risk record information and the reserved word set is a preset set of words that must not be filtered out; and map the filtered segmented words to multiple word segmentation vectors.
In a specific embodiment, when executed by the processor, the computer program specifically implements the following method: select target word segmentation vectors from the multiple word segmentation vectors of each risk record, where a target word segmentation vector appears in the risk record to which it belongs more frequently than in other risk records, or a target word segmentation vector appears in the risk record to which it belongs less frequently than in other risk records; and, through a clustering algorithm, cluster the risk records that contain the same or similar target word segmentation vectors into risk records of multiple category groups.
In a specific embodiment, when executed by the processor, the computer program specifically implements the following method: obtain the risk records of the multiple category groups; build risk case relationship networks from the risk records of the multiple category groups using a social network analysis algorithm, where the risk records of one category group correspond to one or more risk case relationship networks, each risk case relationship network includes multiple nodes, each node represents an individual, organization, or virtual individual in the risk records, and a line between nodes indicates that a social relationship exists between them; analyze, through the social network analysis algorithm, the relationship between each node and the other nodes in the risk case relationship network, and extract the group risk characteristics corresponding to each node; input the group risk characteristics corresponding to each node into a classification model to obtain the fraud rate of each node, where the classification model is obtained by training a neural network on a sample set, and the sample set includes known group risk characteristic data of multiple dimensions and the corresponding known fraud rate data; and, according to the fraud rate of each node, identify the risk records to which nodes with a fraud rate above a first threshold belong as fraud cases, and identify the nodes with a fraud rate above the first threshold as a fraud gang.
In a specific embodiment, the clustering algorithm is one or more of the K-means clustering algorithm, the mean shift clustering algorithm, density-based spatial clustering of applications with noise (DBSCAN), expectation-maximization (EM) clustering using a Gaussian mixture model, and the agglomerative hierarchical clustering algorithm.
It should be noted that, for the specific content of the method implemented when the computer program is executed by the processor, reference may be made to the relevant description of the foregoing method embodiments, which is not repeated here for brevity.
In the several embodiments provided in this application, it should be understood that the disclosed method and device may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of this application.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, ROM, RAM, a magnetic disk, an optical disc, and other media capable of storing program code.
The above are only specific implementations of this application, but the scope of protection of this application is not limited to them. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and such modifications or replacements shall fall within the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (20)

  1. A data recognition method, wherein the method comprises:
    obtaining multiple risk records;
    converting each of the multiple risk records into multiple word segmentation vectors;
    clustering the multiple risk records according to the multiple word segmentation vectors to obtain risk records of multiple category groups;
    using a social network analysis algorithm to analyze each category group of the risk records of the multiple category groups separately, and identifying fraud cases among the multiple risk records.
  2. The method according to claim 1, wherein converting each of the plurality of risk records into a plurality of word segmentation vectors comprises:
    performing word segmentation on each of the plurality of risk records to divide each risk record into a plurality of segmented words;
    among the plurality of segmented words, deleting the segmented words that are identical to words in a stop word set and retaining the segmented words that are identical to words in a reserved word set, to obtain filtered segmented words, wherein the stop word set is a set of segmented words irrelevant to the risk record information, and the reserved word set is a preset set of words that must not be filtered out;
    mapping the filtered segmented words to a plurality of word segmentation vectors.
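The conversion steps recited in claim 2 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `segment` is a whitespace splitter standing in for a real word segmenter, reserved words are assumed to take precedence when a word appears in both sets, and the one-hot encoding is only one possible way to map words to vectors. The claim itself does not prescribe any of these choices.

```python
from typing import Dict, List, Set


def segment(record: str) -> List[str]:
    """Stand-in segmenter (assumption): lowercases and splits on whitespace."""
    return record.lower().split()


def filter_words(words: List[str], stop_words: Set[str],
                 reserved_words: Set[str]) -> List[str]:
    """Drop stop words, but always keep reserved words (assumed precedence)."""
    return [w for w in words if w in reserved_words or w not in stop_words]


def build_vocabulary(records_words: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct word an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for words in records_words:
        for w in words:
            vocab.setdefault(w, len(vocab))
    return vocab


def to_vectors(words: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Map each word to a one-hot vector (illustrative encoding only)."""
    vectors = []
    for w in words:
        v = [0] * len(vocab)
        v[vocab[w]] = 1
        vectors.append(v)
    return vectors


# Hypothetical example data, not taken from the application.
records = ["the car hit the rear bumper", "the driver reported no injury"]
stop_words = {"the", "no"}
reserved_words = {"no"}  # negations can change the meaning of a record

filtered = [filter_words(segment(r), stop_words, reserved_words) for r in records]
vocab = build_vocabulary(filtered)
vectors = to_vectors(filtered[0], vocab)
```

In this sketch the word "no" survives filtering even though it is listed as a stop word, which is the role the claim assigns to the reserved word set.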
  3. The method according to claim 2, wherein clustering the plurality of risk records according to the plurality of word segmentation vectors to obtain risk records of a plurality of category groups comprises:
    selecting target word segmentation vectors from the plurality of word segmentation vectors of each risk record, wherein a target word segmentation vector appears in the risk record to which it belongs more frequently than in other risk records, or a target word segmentation vector appears in the risk record to which it belongs less frequently than in other risk records;
    clustering, by a clustering algorithm, the risk records that contain identical or similar target word segmentation vectors, to obtain risk records of a plurality of category groups.
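One possible reading of the selection step in claim 3 is a TF-IDF style score, which favors words that are frequent in their own record and rare in the others. The sketch below uses that reading and then groups records whose selected words overlap. It is an illustration under stated assumptions: words stand in for their vectors, only the "more frequent" branch of the claim is covered, and the greedy Jaccard grouping is a simple stand-in, not one of the clustering algorithms named in claim 5. The function names, `k`, and `threshold` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set


def top_terms(records: List[List[str]], k: int = 3) -> List[Set[str]]:
    """Select the k highest-scoring words per record using TF-IDF."""
    n = len(records)
    doc_freq: Counter = Counter()
    for words in records:
        doc_freq.update(set(words))
    selected = []
    for words in records:
        tf = Counter(words)
        scores = {w: (c / len(words)) * math.log(n / doc_freq[w])
                  for w, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        selected.append(set(ranked[:k]))
    return selected


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two word sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_records(term_sets: List[Set[str]], threshold: float = 0.4) -> List[List[int]]:
    """Greedy grouping: join the first group whose seed record is similar enough."""
    groups: List[List[int]] = []
    for i, terms in enumerate(term_sets):
        for g in groups:
            if jaccard(terms, term_sets[g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical segmented records, not taken from the application.
records = [
    ["insured", "vehicle", "rear", "collision"],
    ["insured", "driver", "rear", "collision"],
    ["insured", "house", "water", "leak"],
    ["insured", "tenant", "water", "leak"],
]
groups = group_records(top_terms(records))
```

The word "insured" occurs in every record, so its score is zero and it never drives the grouping; the two collision records and the two water-leak records end up in separate groups.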
  4. The method according to claim 1, wherein analyzing each category group of the risk records of the plurality of category groups separately using a social network analysis algorithm, and identifying fraud cases in the plurality of risk records comprises:
    obtaining the risk records of the plurality of category groups;
    building risk case relationship networks from the risk records of the plurality of category groups using a social network analysis algorithm, wherein the risk records of one category group correspond to one or more risk case relationship networks, each risk case relationship network comprises a plurality of nodes, each node represents an individual, an organization, or a virtual individual in the risk records, and a connection between nodes indicates that a social relationship exists between those nodes;
    analyzing, by the social network analysis algorithm, the relationships between each node and the other nodes in the risk case relationship network, and extracting the group risk features corresponding to each node;
    inputting the group risk features corresponding to each node into a classification model to obtain the fraud rate of each node, wherein the classification model is obtained by training a neural network with a sample set, and the sample set comprises known multi-dimensional group risk feature data and corresponding known fraud rate data;
    according to the fraud rate of each node, identifying the risk records to which the nodes with a fraud rate above a first threshold belong as fraud cases, and identifying the nodes with a fraud rate above the first threshold as a fraud gang.
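The flow recited in claim 4 (network, per-node features, fraud rate, threshold) can be sketched as follows. The sketch rests on several assumptions that are not in the source: two parties are treated as socially related when they appear in the same risk record, the group risk features are just a node's degree and the number of records it appears in, and `fraud_rate` is a hand-weighted logistic function standing in for the trained neural network. The weights, bias, and threshold are hypothetical values chosen for the example.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def build_network(records: Dict[str, List[str]]) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Link parties that share a record (assumed proxy for a social relationship)."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    membership: Dict[str, Set[str]] = defaultdict(set)
    for record_id, parties in records.items():
        for p in parties:
            membership[p].add(record_id)
            neighbors[p]  # touch the key so isolated parties still become nodes
        for a, b in combinations(sorted(set(parties)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors, membership


def group_features(node: str, neighbors, membership) -> List[float]:
    """Illustrative features: number of related parties and number of records."""
    return [float(len(neighbors[node])), float(len(membership[node]))]


def fraud_rate(features: List[float], weights: List[float], bias: float) -> float:
    """Stand-in for the trained classifier: a logistic score in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def identify(records, weights, bias, threshold):
    """Return per-node rates, the flagged nodes, and the records they belong to."""
    neighbors, membership = build_network(records)
    rates = {n: fraud_rate(group_features(n, neighbors, membership), weights, bias)
             for n in neighbors}
    gang = {n for n, r in rates.items() if r > threshold}
    cases = {rid for n in gang for rid in membership[n]}
    return rates, gang, cases


# Hypothetical category group: record id -> parties named in the record.
records = {
    "R1": ["alice", "bob"],
    "R2": ["alice", "carol"],
    "R3": ["alice", "bob", "dave"],
    "R4": ["erin"],
}
rates, gang, cases = identify(records, weights=[1.0, 1.0], bias=-3.5, threshold=0.5)
```

With these made-up weights, the two parties that recur together across records are flagged, and every record either of them appears in is marked as a suspected fraud case. A real system would replace `fraud_rate` with the trained model and use richer network features.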
  5. The method according to claim 3, wherein the clustering algorithm is one or more of a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm with noise, an expectation-maximization clustering algorithm using a Gaussian mixture model, and an agglomerative hierarchical clustering algorithm.
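As an illustration of the first algorithm listed in claim 5, the following is a minimal K-means sketch (Lloyd's iteration) in plain Python. It assumes each risk record has already been summarized as a fixed-length numeric vector, which the claim does not specify, and it seeds the centroids with the first `k` points purely to keep the example deterministic. The function names and sample points are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster label for each point using Lloyd's iteration."""
    # Deterministic seeding (assumption): the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical record vectors, not taken from the application.
points = [[0, 0], [0, 1], [10, 10], [10, 11], [1, 0]]
labels = k_means(points, k=2)
```

K-means requires the number of category groups to be chosen in advance; the density-based and hierarchical options in the same claim do not, which may matter when the number of claim patterns is unknown.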
  6. A data recognition system, wherein the system comprises an obtaining unit, a conversion unit, a clustering unit, and an identification unit, wherein:
    the obtaining unit is configured to obtain a plurality of risk records;
    the conversion unit is configured to convert each of the plurality of risk records into a plurality of word segmentation vectors;
    the clustering unit is configured to cluster the plurality of risk records according to the plurality of word segmentation vectors to obtain risk records of a plurality of category groups;
    the identification unit is configured to analyze each category group of the risk records of the plurality of category groups separately using a social network analysis algorithm, and to identify fraud cases in the plurality of risk records.
  7. The system according to claim 6, wherein the conversion unit is specifically configured to:
    perform word segmentation on each of the plurality of risk records to divide each risk record into a plurality of segmented words;
    among the plurality of segmented words, delete the segmented words that are identical to words in a stop word set and retain the segmented words that are identical to words in a reserved word set, to obtain filtered segmented words, wherein the stop word set is a set of segmented words irrelevant to the risk record information, and the reserved word set is a preset set of words that must not be filtered out;
    map the filtered segmented words to a plurality of word segmentation vectors.
  8. The system according to claim 7, wherein:
    the clustering unit is specifically configured to select target word segmentation vectors from the plurality of word segmentation vectors of each risk record, wherein a target word segmentation vector appears in the risk record to which it belongs more frequently than in other risk records, or a target word segmentation vector appears in the risk record to which it belongs less frequently than in other risk records;
    the clustering unit is specifically configured to cluster, by a clustering algorithm, the risk records that contain identical or similar target word segmentation vectors, to obtain risk records of a plurality of category groups.
  9. The system according to claim 6, wherein:
    the identification unit is specifically configured to obtain the risk records of the plurality of category groups;
    the identification unit is specifically configured to build risk case relationship networks from the risk records of the plurality of category groups using a social network analysis algorithm, wherein the risk records of one category group correspond to one or more risk case relationship networks, each risk case relationship network comprises a plurality of nodes, each node represents an individual, an organization, or a virtual individual in the risk records, and a connection between nodes indicates that a social relationship exists between those nodes;
    the identification unit is specifically configured to analyze, by the social network analysis algorithm, the relationships between each node and the other nodes in the risk case relationship network, so as to extract the group risk features corresponding to each node;
    the identification unit is specifically configured to input the group risk features corresponding to each node into a classification model to obtain the fraud rate of each node, wherein the classification model is obtained by training a neural network with a sample set, and the sample set comprises known multi-dimensional group risk feature data and corresponding known fraud rate data;
    the identification unit is specifically configured to identify the risk records to which the nodes with a fraud rate above a first threshold belong as fraud cases, and to identify the nodes with a fraud rate above the first threshold as a fraud gang.
  10. The system according to claim 8, wherein the clustering algorithm is one or more of a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm with noise, an expectation-maximization clustering algorithm using a Gaussian mixture model, and an agglomerative hierarchical clustering algorithm.
  11. An electronic device, wherein the electronic device comprises a processor and a memory; the memory is configured to store instructions; and the processor is configured to invoke the instructions in the memory to perform the following method:
    obtaining a plurality of risk records;
    converting each of the plurality of risk records into a plurality of word segmentation vectors;
    clustering the plurality of risk records according to the plurality of word segmentation vectors to obtain risk records of a plurality of category groups;
    analyzing each category group of the risk records of the plurality of category groups separately using a social network analysis algorithm, and identifying fraud cases in the plurality of risk records.
  12. The electronic device according to claim 11, wherein the processor is specifically configured to perform:
    performing word segmentation on each of the plurality of risk records to divide each risk record into a plurality of segmented words;
    among the plurality of segmented words, deleting the segmented words that are identical to words in a stop word set and retaining the segmented words that are identical to words in a reserved word set, to obtain filtered segmented words, wherein the stop word set is a set of segmented words irrelevant to the risk record information, and the reserved word set is a preset set of words that must not be filtered out;
    mapping the filtered segmented words to a plurality of word segmentation vectors.
  13. The electronic device according to claim 12, wherein the processor is specifically configured to perform:
    selecting target word segmentation vectors from the plurality of word segmentation vectors of each risk record, wherein a target word segmentation vector appears in the risk record to which it belongs more frequently than in other risk records, or a target word segmentation vector appears in the risk record to which it belongs less frequently than in other risk records;
    clustering, by a clustering algorithm, the risk records that contain identical or similar target word segmentation vectors, to obtain risk records of a plurality of category groups.
  14. The electronic device according to claim 11, wherein the processor is specifically configured to perform:
    obtaining the risk records of the plurality of category groups;
    building risk case relationship networks from the risk records of the plurality of category groups using a social network analysis algorithm, wherein the risk records of one category group correspond to one or more risk case relationship networks, each risk case relationship network comprises a plurality of nodes, each node represents an individual, an organization, or a virtual individual in the risk records, and a connection between nodes indicates that a social relationship exists between those nodes;
    analyzing, by the social network analysis algorithm, the relationships between each node and the other nodes in the risk case relationship network, and extracting the group risk features corresponding to each node;
    inputting the group risk features corresponding to each node into a classification model to obtain the fraud rate of each node, wherein the classification model is obtained by training a neural network with a sample set, and the sample set comprises known multi-dimensional group risk feature data and corresponding known fraud rate data;
    according to the fraud rate of each node, identifying the risk records to which the nodes with a fraud rate above a first threshold belong as fraud cases, and identifying the nodes with a fraud rate above the first threshold as a fraud gang.
  15. The electronic device according to claim 13, wherein the clustering algorithm is one or more of a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm with noise, an expectation-maximization clustering algorithm using a Gaussian mixture model, and an agglomerative hierarchical clustering algorithm.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following method:
    obtaining a plurality of risk records;
    converting each of the plurality of risk records into a plurality of word segmentation vectors;
    clustering the plurality of risk records according to the plurality of word segmentation vectors to obtain risk records of a plurality of category groups;
    analyzing each category group of the risk records of the plurality of category groups separately using a social network analysis algorithm, and identifying fraud cases in the plurality of risk records.
  17. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, specifically implements the following method:
    performing word segmentation on each of the plurality of risk records to divide each risk record into a plurality of segmented words;
    among the plurality of segmented words, deleting the segmented words that are identical to words in a stop word set and retaining the segmented words that are identical to words in a reserved word set, to obtain filtered segmented words, wherein the stop word set is a set of segmented words irrelevant to the risk record information, and the reserved word set is a preset set of words that must not be filtered out;
    mapping the filtered segmented words to a plurality of word segmentation vectors.
  18. The computer-readable storage medium according to claim 17, wherein the computer program, when executed by the processor, specifically implements the following method:
    selecting target word segmentation vectors from the plurality of word segmentation vectors of each risk record, wherein a target word segmentation vector appears in the risk record to which it belongs more frequently than in other risk records, or a target word segmentation vector appears in the risk record to which it belongs less frequently than in other risk records;
    clustering, by a clustering algorithm, the risk records that contain identical or similar target word segmentation vectors, to obtain risk records of a plurality of category groups.
  19. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, specifically implements the following method:
    obtaining the risk records of the plurality of category groups;
    building risk case relationship networks from the risk records of the plurality of category groups using a social network analysis algorithm, wherein the risk records of one category group correspond to one or more risk case relationship networks, each risk case relationship network comprises a plurality of nodes, each node represents an individual, an organization, or a virtual individual in the risk records, and a connection between nodes indicates that a social relationship exists between those nodes;
    analyzing, by the social network analysis algorithm, the relationships between each node and the other nodes in the risk case relationship network, and extracting the group risk features corresponding to each node;
    inputting the group risk features corresponding to each node into a classification model to obtain the fraud rate of each node, wherein the classification model is obtained by training a neural network with a sample set, and the sample set comprises known multi-dimensional group risk feature data and corresponding known fraud rate data;
    according to the fraud rate of each node, identifying the risk records to which the nodes with a fraud rate above a first threshold belong as fraud cases, and identifying the nodes with a fraud rate above the first threshold as a fraud gang.
  20. The computer-readable storage medium according to claim 18, wherein the clustering algorithm is one or more of a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm with noise, an expectation-maximization clustering algorithm using a Gaussian mixture model, and an agglomerative hierarchical clustering algorithm.
PCT/CN2020/099572 2019-07-23 2020-06-30 Data recognition method and system, electronic device and computer storage medium WO2021012913A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910664820.3 2019-07-23
CN201910664820.3A CN110490750B (en) 2019-07-23 2019-07-23 Data identification method, system, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
WO2021012913A1 (en) 2021-01-28

Family

ID=68548012

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/099572 WO2021012913A1 (en) 2019-07-23 2020-06-30 Data recognition method and system, electronic device and computer storage medium

Country Status (2)

Country Link
CN (1) CN110490750B (en)
WO (1) WO2021012913A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490750B (en) * 2019-07-23 2022-10-28 平安科技(深圳)有限公司 Data identification method, system, electronic equipment and computer storage medium
CN110888987B (en) * 2019-12-13 2023-07-04 铭迅(北京)信息技术有限公司 Loan agency identification method, system, equipment and storage medium
CN111552851A (en) * 2020-04-24 2020-08-18 浙江每日互动网络科技股份有限公司 Type determination method and device, equipment and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109242499A (en) * 2018-09-19 2019-01-18 中国银行股份有限公司 A kind of processing method of transaction risk prediction, apparatus and system
CN109446528A (en) * 2018-10-30 2019-03-08 南京中孚信息技术有限公司 The recognition methods of new fraudulent gimmick and device
CN109685647A (en) * 2018-12-27 2019-04-26 阳光财产保险股份有限公司 The training method of credit fraud detection method and its model, device and server
CN109919781A (en) * 2019-01-24 2019-06-21 平安科技(深圳)有限公司 Case recognition methods, electronic device and computer readable storage medium are cheated by clique
CN110490750A (en) * 2019-07-23 2019-11-22 平安科技(深圳)有限公司 Data know method for distinguishing, system, electronic equipment and computer storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080172257A1 (en) * 2007-01-12 2008-07-17 Bisker James H Health Insurance Fraud Detection Using Social Network Analytics
CN107657536B (en) * 2017-02-20 2018-07-31 平安科技(深圳)有限公司 The recognition methods of social security fraud and device
CN109447658A (en) * 2018-09-10 2019-03-08 平安科技(深圳)有限公司 The generation of anti-fraud model and application method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110490750A (en) 2019-11-22
CN110490750B (en) 2022-10-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20843935

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20843935

Country of ref document: EP

Kind code of ref document: A1