WO2021012913A1

WO2021012913A1 - Data recognition method and system, electronic device and computer storage medium

Info

Publication number: WO2021012913A1
Application number: PCT/CN2020/099572
Authority: WO
Inventors: 程旺
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-07-23
Filing date: 2020-06-30
Publication date: 2021-01-28
Also published as: CN110490750A; CN110490750B

Abstract

Disclosed are a data recognition method and system, an electronic device and a computer storage medium. The method comprises: acquiring a plurality of accident records (S101); converting each of the plurality of accident records into a plurality of segmented word vectors (S102); clustering the plurality of accident records according to the plurality of segmented word vectors to obtain a plurality of category groups of accident records (S103); and respectively analyzing each category group of the plurality of category groups of accident records by using a social network analysis algorithm, to recognize fraudulent cases from the plurality of accident records (S104).

Description

Data recognition method, system, electronic device and computer storage medium

This application claims priority to Chinese patent application No. 2019106648203, filed with the Chinese Patent Office on July 23, 2019 and entitled "Methods, systems, electronic equipment and computer storage media for data recognition", the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the computer field, in particular to data recognition methods, systems, electronic equipment, and computer storage media, and specifically to financial technology fraud detection technology.

Background technique

At present, anti-fraud judgments in the freight vehicle insurance industry are made mainly through manual analysis of the descriptions given in claim reports and through on-site surveys, which cannot identify group fraud. Even when a network is constructed with social network analysis (SNA) algorithms to identify group fraud, the model in real, complex scenarios relies only on discrete features and a small number of continuous features, and its performance is sometimes poor. The inventor realized that current anti-fraud identification lacks technical means and that the fraud identification rate is low.

Technical problem

This application provides a method, system, electronic device and computer storage medium for data recognition to solve the problem of low recognition rate of group fraud.

Technical solutions

In a first aspect, this application provides a data identification method that includes the following steps: obtaining multiple risk records; converting each of the multiple risk records into multiple word segmentation vectors; clustering the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records; and using a social network analysis algorithm to analyze each category group of risk records separately, so as to identify fraud cases in the multiple risk records. Because the risk records are clustered with high precision, the risk records within each category are highly related. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established more reliably, thereby improving the accuracy of group fraud identification.

In a second aspect, a data identification system is provided. The system includes an acquisition unit, a conversion unit, a clustering unit, and an identification unit. The acquisition unit is used to acquire multiple risk records. The conversion unit is used to convert each of the multiple risk records into multiple word segmentation vectors. The clustering unit is used to cluster the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records. The identification unit is configured to use a social network analysis algorithm to analyze each category group of risk records separately and identify fraud cases in the multiple risk records. Because the system clusters the risk records with high precision, the risk records within each category are highly related. When the clustered risk records are used for SNA analysis, the system can establish the risk case relationship network more reliably, thereby improving the accuracy of group fraud identification.

In a third aspect, an electronic device is provided, including a processor, an input device, an output device, and a memory that are connected to each other, wherein the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method described in the first aspect above.

In a fourth aspect, a computer-readable storage medium is provided. The computer storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method described in the first aspect above.

Beneficial effect

It can be seen that, with the data identification method, system, electronic device and computer storage medium provided in this application, multiple risk records are acquired, each of the multiple risk records is converted into multiple word segmentation vectors, the multiple risk records are clustered according to the multiple word segmentation vectors to obtain multiple category groups of risk records, and a social network analysis algorithm is used to analyze each category group of risk records separately and identify fraud cases in the multiple risk records. Because the risk records are clustered with high precision, the risk records within each category are highly related. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established more reliably, thereby improving the accuracy of group fraud identification.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative work.

Fig. 1 is a schematic flowchart of a data identification method provided by the present application.

Figure 2 is a schematic diagram of a case relationship network established by using the SNA algorithm in a scenario provided by this application.

Fig. 3 is a schematic diagram of a case relationship network established using the SNA algorithm in another scenario provided by this application.

Figure 4 is a schematic structural diagram of a data recognition system provided by the present application.

Fig. 5 is a schematic block diagram of the structure of an electronic device provided by the present application.

Embodiments of the invention

The application will be further described in detail below through specific implementations in conjunction with the drawings. In the following embodiments, many detailed descriptions are used to make the present application better understood. However, those skilled in the art can easily realize that some of the features can be omitted under different circumstances, or can be replaced by other methods. In some cases, some operations related to this application are not shown or described in the specification. This is to avoid the core part of this application being overwhelmed by excessive description. For those skilled in the art, it is not necessary to describe these related operations in detail. They can fully understand the related operations based on the description in the specification and general technical knowledge in the field.

It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the existence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.

It should be noted that the terms used in the embodiments of the present application are only for the purpose of describing specific embodiments, and are not intended to limit the present application. The singular forms of "a", "said" and "the" used in the embodiments of the present application and the appended claims are also intended to include plural forms, unless the context clearly indicates other meanings.

Fig. 1 is a schematic flowchart of a data identification method provided by the present application. It can be seen from Fig. 1 that the data identification method provided by this application includes the following steps.

S101: Acquire multiple risk records.

In the embodiments of this application, "risk" (an insured event) refers to the occurrence of the compensation or payment conditions stipulated or agreed in the insurance contract, for example, an accident that occurs during a vehicle's insurance period and is then notified or reported to the insurance company. The risk record may be risk record data in a database. For example, for auto insurance, the risk record may be an accident record that includes the license plate of the vehicle in the accident, the accident location, the auto insurance policy number, the insurance policy agent, claim records, insurance purchase records, the persons involved with the vehicle (including drivers, informants, beneficiaries, and injured persons), and other insurance-related data such as repair shops, reporting telephone numbers, inspection locations, GPS information, and disease diagnosis records. It can be understood that using risk records as the original data for anti-fraud identification can greatly improve the accuracy of anti-fraud identification compared with ordinary insurance data. It should be understood that the above examples are only for illustration and cannot constitute a specific limitation.

S102: Convert each of the multiple risk records into multiple word segmentation vectors.

In this embodiment of the present application, converting each of the multiple risk records into multiple word segmentation vectors includes the following. Word segmentation is performed on each of the multiple risk records to divide it into multiple segmented words. Among the segmented words, those that match a word in the stop word set are deleted and those that match a word in the reserved word set are kept, yielding the filtered segmented words, where the stop word set is a collection of segmented words that are unrelated to the risk record information, and the reserved word set is a preset set of words that must not be filtered out. The filtered segmented words are then mapped to multiple word segmentation vectors. It should be understood that a risk record contains many kinds of information, for example the specific accident, the specific casualties, disease diagnosis records, and the traffic police's accident identification records, and that the data length of each risk record differs. Preprocessing such data directly would require a great deal of work. Therefore, the obtained risk records are first segmented to improve the efficiency of data processing.
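The filtering rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the application's implementation: the function name `filter_tokens`, the sample tokens, and both word sets are hypothetical, and the word segmentation step itself is assumed to have already run.

```python
from typing import Iterable, List, Set


def filter_tokens(tokens: Iterable[str],
                  stop_words: Set[str],
                  reserved_words: Set[str]) -> List[str]:
    """Drop tokens found in the stop word set, but never drop a reserved word."""
    kept = []
    for tok in tokens:
        # A reserved word always survives, even if it is also a stop word.
        if tok in reserved_words or tok not in stop_words:
            kept.append(tok)
    return kept


# Hypothetical, already-segmented risk record.
tokens = ["the", "truck", "was", "hit", "at", "the", "repair", "shop", "no", "injury"]
stop_words = {"the", "was", "at", "no"}
reserved_words = {"no"}  # assumed to matter for the record ("no injury"), so it is protected

filtered = filter_tokens(tokens, stop_words, reserved_words)
print(filtered)
```

Giving the reserved word set priority over the stop word set is one way to read "words that must not be filtered out"; the text does not say how a word that appears in both sets is handled.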

Specifically, word segmentation may split each Chinese character sequence of a risk record into individual words. Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects how credible a candidate word is. The frequency of each combination of adjacent co-occurring characters in the corpus can be counted and their mutual information calculated: the mutual information of two Chinese characters X and Y is defined from their adjacent co-occurrence probability, and it reflects how tightly the two characters are bound. When this tightness exceeds a certain threshold, the character group can be considered a possible word and the segmentation can be carried out. However, this method has limitations. It often extracts character groups that co-occur frequently but are not useful words, such as "this", "one", "some", "my", and "many", which wastes storage space and lowers search efficiency. The stop word set and the reserved word set can therefore be combined to filter out words that appear frequently but are unrelated to the risk event, for example modal particles, adverbs, prepositions, and conjunctions, which have no clear meaning by themselves and only play a role once placed in a complete sentence.

At the same time, statistical methods are used to identify new words; that is, string frequency statistics and string matching are combined, which keeps word segmentation fast and efficient while using the ability of dictionary-free segmentation to recognize new words from context and resolve ambiguity automatically. After word segmentation, a vectorization operation is needed to convert each risk record into word vectors. For example, "happy" and "happiness" are very close in meaning for a human reader, but a computer cannot tell that the two words are similar. Each word therefore needs to be expressed in a form the computer can process; that is, the word is vectorized and represented as a multi-dimensional vector of floating-point numbers, in which the value of each dimension indicates the distance between the word and another word. As a result, semantically similar words are mapped to nearby regions of the vector space, so that the computer can calculate the similarity between words. Once the computer can capture the meaning of the text in this way, it can perform further anti-fraud recognition processing.
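The mutual-information criterion for adjacent characters can be illustrated as follows. The text does not give the formula, so this sketch assumes the standard pointwise mutual information, PMI(X, Y) = log( P(X, Y) / (P(X) P(Y)) ), estimated from raw counts. The function name `adjacent_pmi` and the toy corpus (Latin letters standing in for Chinese characters) are hypothetical.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Return the pointwise mutual information of every adjacent character pair.

    corpus: list of strings. Probabilities are estimated from raw counts.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for text in corpus:
        char_counts.update(text)
        pair_counts.update(zip(text, text[1:]))  # adjacent (X, Y) pairs
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())

    pmi = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = char_counts[x] / n_chars
        p_y = char_counts[y] / n_chars
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi


# Toy corpus: "a" is followed by "b" more often than by "c".
pmi = adjacent_pmi(["abab", "acca", "bccb"])
threshold = 0.5  # hypothetical cut-off for treating a pair as a candidate word
candidates = sorted(pair for pair, score in pmi.items() if score > threshold)
print(candidates)
```

A pair whose score exceeds the threshold is treated as a candidate word. Because raw PMI tends to favour rare pairs, practical systems usually also require a minimum pair frequency; that refinement is not described in the text and is omitted here.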

S103: Cluster the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records.

In the embodiment of the present application, clustering the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records includes the following. Target word segmentation vectors are screened out from the multiple word segmentation vectors of each risk record, where a target word segmentation vector appears more frequently in the risk record to which it belongs than in other risk records, or appears less frequently in the risk record to which it belongs than in other risk records. A clustering algorithm then clusters the risk records that contain the same or similar target word segmentation vectors into multiple category groups of risk records. In other words, converting the risk records into word segmentation vectors produces many word vectors. Although these have already been screened against the stop word set and the reserved word set, it should be understood that the remaining volume of data is still very large for the subsequent anti-fraud recognition. If the data is filtered further and only the easily distinguishable data is clustered, the accuracy of the clustering can be greatly improved, and feeding the accurate clustering results into the SNA network yields a more accurate recognition result. As an example of easily distinguishable data, if a word appears frequently in the risk record of one case but rarely in other cases, the word is considered to have good distinguishing ability. As another example, if only a small number of cases contain a given word in the risk information field, the word also has good distinguishing ability. It should be understood that the above examples are only for illustration and cannot constitute a specific limitation.

In the embodiment of this application, the target word segmentation vectors may be screened using the Term Frequency-Inverse Document Frequency (TF-IDF) method. TF-IDF is a statistical method used to evaluate how important a word is to one document in a document set or corpus. It can be understood that the words most useful for distinguishing a document are those that appear frequently in that document but rarely in the other documents of the collection, so a feature space whose coordinates are measured by term frequency (TF) can reflect the characteristics of similar texts. In addition, considering the ability of words to distinguish different categories, the TF-IDF method holds that the smaller the document frequency of a word, the greater its ability to distinguish texts of different categories. The inverse document frequency (IDF) is therefore introduced, and the product of TF and IDF is used as the coordinate value of the feature space, which adjusts the TF weight. The purpose of adjusting the weight is to highlight important words and suppress minor ones. In other words, if a word or phrase appears frequently in one article and rarely in other articles, it is considered to have good discriminating ability and to be suitable for classification. Compared with clustering the risk records directly, this gives better clustering results. After the word segmentation vectors suitable for classification are selected, clustering according to the importance matrix of the risk records gathers cases with similar risk information together more accurately, and using the clustering results as the input data of the SNA network can greatly improve the accuracy of fraud case identification.
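As a concrete illustration of the TF-IDF weighting described above, the following sketch computes weights for a few tokenized records and picks the highest-weighted terms of each. It assumes the common unsmoothed definition, TF = count / document length and IDF = log(N / document frequency); the text does not specify a variant. The function names and the sample records are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights


def top_terms(weight, k):
    """Return the k tokens with the largest TF-IDF weight (ties broken by name)."""
    ranked = sorted(weight.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]


# Hypothetical tokenized risk records.
docs = [
    ["rear", "end", "collision", "highway"],
    ["rear", "end", "collision", "parking"],
    ["glass", "damage", "parking"],
]
weights = tf_idf(docs)
print(top_terms(weights[0], 1))
```

In the first record, "highway" appears in no other record and so receives the largest weight, while terms shared with the second record are suppressed. This is the "highlight important words and suppress minor ones" effect the paragraph describes.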

In the embodiments of the present application, the clustering algorithm may be one or more of the K-means clustering algorithm, the mean shift clustering algorithm, the density-based spatial clustering of applications with noise (DBSCAN) algorithm, the expectation-maximization (EM) clustering algorithm using Gaussian mixture models, and the agglomerative hierarchical clustering algorithm, which is not specifically limited in this application. It can be understood that the result of clustering may be that risk records from the same province or city, or risk records involving the same type of reimbursement, are gathered together. For example, the risk records of minor damage cases are gathered together, or the cases from the same province are gathered together. There is no specific limitation here.
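Of the algorithms listed, K-means is the simplest to show. The sketch below is a plain-Python K-means under several assumptions that are not in the text: each risk record has already been reduced to a single numeric vector, the first k points seed the centroids (so the result is deterministic), and squared Euclidean distance is used. The function name and the sample vectors are hypothetical.

```python
def kmeans(points, k, iters=20):
    """Plain K-means on lists of floats; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical 2-dimensional record vectors (for example, weights of two target terms).
record_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels = kmeans(record_vectors, k=2)
print(labels)
```

Records 0 and 2 end up in one category group and records 1 and 3 in the other. A production system would more likely rely on a library implementation with better initialization; this version only shows the assignment and update steps.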

S104: Use a social network analysis (SNA) algorithm to analyze each category group of risk records separately and identify fraud cases in the multiple risk records.

In the embodiment of the present application, using the social network analysis algorithm to analyze each category group of risk records separately and identify fraud cases in the multiple risk records includes the following. The risk records of the multiple category groups are obtained. Based on the risk records of the multiple category groups, a social network analysis algorithm is used to establish risk case relationship networks, where the risk records of one category group correspond to one or more risk case relationship networks. A risk case relationship network includes multiple nodes, each node represents an individual, an organization or a virtual individual in the risk records, and a connection between nodes indicates that a social relationship exists between them. The social network analysis algorithm is used to analyze the relationship between each node in the risk case relationship network and the other nodes, and to extract the group risk features corresponding to each node. The group risk features corresponding to each node are input into a classification model to obtain the fraud rate of each node. The classification model is obtained by training a neural network on a sample set, and the sample set includes known multi-dimensional group risk feature data and the corresponding known fraud rate data. According to the fraud rate of each node, the risk records of the nodes whose fraud rate is higher than a first threshold are recognized as fraud cases, and the nodes whose fraud rate is higher than the first threshold are recognized as a fraud gang.

The SNA network is a collection of points (social actors) and connections between points (relationships between actors). Each node can be an entity or a virtual individual of a different kind, such as an organization, an individual, or a network ID, and the relationship between individuals can be a relationship between relatives or friends, an action, the sending and receiving of messages, or another kind of relationship. Through SNA network analysis, the key information that is needed, namely the group risk behavior of each node, can be found among the messy data and connection relationships. Taking medical insurance risk data as an example, the risk features of each node may be the area where the patient is located, the hospital the patient visits, the quantity and time of the items the patient purchases, the disease the patient has, and the doctor who treats the patient. Analyzing the group risk behavior of patients amounts to a comprehensive analysis of the area where the patients are located, the quantity and time of the items they purchase, and the diseases they have. If it is found that a patient has repeatedly purchased large quantities of drugs of different types at different hospitals, the group risk feature can be determined to be that the user purchases medicines in large quantities and of many types. As another example, taking vehicle insurance risk data, the risk features of each node may be the city where the vehicle is located, the license plate number, the type of insurance purchased for the vehicle, the traffic police officer who handled the accident, the identity information of the party at fault, and the identity information of the victim. If it is found that a vehicle has had multiple risk events at different locations and all of them are minor damage cases, then, since minor damage cases involve small amounts and can be reported and handled quickly, the group risk feature can be determined to be that the user has repeatedly taken part in minor damage cases. If it is found that a vehicle has had multiple risk events at different locations and the victim is the same person each time, the group risk feature can be determined to be that the user has repeatedly cooperated with others to defraud the insurer. In the same way, the group risk features of other risk data, such as commercial insurance and accident insurance, can be obtained. It should be understood that the above examples are only for illustration and cannot constitute a specific limitation.
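The flow described above (build a relationship network, extract per-node group features, score each node, and flag nodes above a threshold) can be sketched as follows. Everything specific here is an assumption made for illustration: parties are linked when they appear in the same risk record, the only features are node degree and record count, and `placeholder_fraud_rate` is a hand-written stand-in for the trained neural-network classifier, which the text does not specify.

```python
from collections import defaultdict
from itertools import combinations


def build_network(records):
    """records: list of sets of party names appearing in the same risk record."""
    neighbours = defaultdict(set)
    record_count = defaultdict(int)
    for parties in records:
        for person in parties:
            record_count[person] += 1
            neighbours[person]  # make sure every party becomes a node
        # Assumption: two parties are related if they share a risk record.
        for a, b in combinations(sorted(parties), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours, record_count


def group_features(neighbours, record_count):
    """Very small feature set: number of related parties and number of records."""
    return {n: {"degree": len(neighbours[n]), "records": record_count[n]}
            for n in neighbours}


def placeholder_fraud_rate(features):
    # Stand-in for the trained classification model; NOT a neural network.
    return min(1.0, 0.1 * features["degree"] + 0.2 * features["records"])


def flag_nodes(records, first_threshold):
    neighbours, record_count = build_network(records)
    feats = group_features(neighbours, record_count)
    rates = {n: placeholder_fraud_rate(f) for n, f in feats.items()}
    return sorted(n for n, rate in rates.items() if rate > first_threshold)


# Hypothetical category group: the same victim appears in three separate accidents.
records = [
    {"driver_A", "victim_X"},
    {"driver_B", "victim_X"},
    {"driver_C", "victim_X"},
    {"driver_D", "victim_Y"},
]
flagged = flag_nodes(records, first_threshold=0.5)
print(flagged)
```

Here `victim_X` is flagged because it is connected to three different drivers across three records, which mirrors the "same victim at different locations" pattern in the vehicle insurance example.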

In the embodiments of the present application, in addition to using a neural network to identify fraud cases in the multiple risk records based on the fraud rate of each node, high-risk cases can also be identified based on the risk rate. For example, Figure 2 is a schematic diagram of a case relationship network established by using the SNA algorithm in a scenario provided by this application, in which the gray dots represent users with many risk records, the black dots represent users who have risk records but fewer of them, and the white dots represent users who have no risk records. Data calculation and analysis show that the high-risk rate of the gang reaches 66.8%, indicating that the average risk rate of the gang is high. Users with risk records account for 91.4% of all users, which further verifies the fraudulent nature of the gang. Understandably, after a fraud gang is identified, the risk records in which the gang took part can be confirmed as fraud cases.
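The text does not define precisely how the two percentages are computed. One plausible reading, shown below purely as an assumption, is that they are the share of users in the network with many risk records and the share with at least one risk record. The function name, the sample counts, and the cut-off for "many" are all hypothetical, and the numbers are not the data behind Figure 2.

```python
def risk_shares(record_counts, high_threshold):
    """Return (share of users with any risk record, share with many risk records).

    record_counts: {user: number of risk records}.
    high_threshold: hypothetical minimum count for a user to count as "high risk".
    """
    total = len(record_counts)
    risky = sum(1 for c in record_counts.values() if c > 0)
    high = sum(1 for c in record_counts.values() if c >= high_threshold)
    return risky / total, high / total


# Hypothetical network of four users.
counts = {"u1": 5, "u2": 4, "u3": 1, "u4": 0}
risky_share, high_share = risk_shares(counts, high_threshold=3)
print(risky_share, high_share)
```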

In the embodiment of the present application, after the case relationship network is constructed using SNA, its network structure can also be matched against preset network models in order to identify high-risk cases. For example, Figure 3 is a schematic diagram of a case relationship network established using the SNA algorithm in another scenario provided by this application. The black dots represent users who have risk records but only a small number of them. Although neither the fraud rate nor the risk rate of the individual nodes in this case relationship network is high, the nodes in the network are connected to one another pairwise, so the case relationship network in this scenario carries a higher risk. Such a network usually arises when a group of people cooperate to commit fraud: the pairwise connections mean that the members all know one another, and the purpose behind this is mostly collusion to forge information that satisfies the claim conditions, which requires special attention. It should be understood that the above examples are only for illustration. The case relationship network established by the SNA algorithm can also have network structures other than those in Figure 2 and Figure 3, and the analysis method can differ from those used for Figure 2 and Figure 3. This application imposes no specific restriction.
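Matching a network against a preset structure can be illustrated with the pairwise-connected pattern discussed for Figure 3. The sketch assumes that the preset model is a fully connected group, in which every pair of nodes shares an edge; the text does not list the preset models, so this is only one possible example, and the names are hypothetical.

```python
from itertools import combinations


def is_fully_connected(nodes, edges):
    """True if every pair of distinct nodes is joined by an (undirected) edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(pair) in edge_set for pair in combinations(nodes, 2))


# Hypothetical case relationship network: three users who all know one another.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C"), ("A", "C")]
print(is_fully_connected(nodes, edges))

# Removing one relationship breaks the pattern.
print(is_fully_connected(nodes, edges[:2]))
```

A network that matches this pattern would be escalated for review even when each node's individual fraud rate is low, which is the situation the paragraph describes.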

In the above method, multiple risk records are acquired, each of the multiple risk records is converted into multiple word segmentation vectors, the multiple risk records are clustered according to the multiple word segmentation vectors to obtain multiple category groups of risk records, and a social network analysis algorithm is used to analyze each category group of risk records separately and identify fraud cases in the multiple risk records. Because the risk records are clustered with high precision, the risk records within each category are highly related. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established more reliably, thereby improving the accuracy of group fraud identification.

Figure 4 is a schematic structural diagram of a data recognition system provided by the present application. It can be seen from Figure 4 that the data recognition system provided by the present application includes an acquisition unit 410, a conversion unit 420, a clustering unit 430, and an identification unit 440, wherein:

The acquiring unit 410 is configured to acquire multiple risk records.

The conversion unit 420 is configured to convert each of the multiple risk records into multiple word segmentation vectors.

The clustering unit 430 is configured to cluster the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records.

The identification unit 440 is configured to use a social network analysis algorithm to analyze each category group of risk records separately and identify fraud cases in the multiple risk records.
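The division into four units can be pictured as a small pipeline object. The sketch below is only a structural illustration under stated assumptions: each unit is modelled as a plain callable, each record is reduced to a single vector (a simplification of the "multiple word segmentation vectors" in the text), and the lambdas in the usage example are trivial stand-ins rather than real segmentation, clustering, or SNA logic. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class DataRecognitionSystem:
    acquire: Callable[[], List[str]]                        # acquisition unit 410
    convert: Callable[[str], List[float]]                   # conversion unit 420
    cluster: Callable[[Sequence[List[float]]], List[int]]   # clustering unit 430
    identify: Callable[[List[str]], List[str]]              # identification unit 440

    def run(self) -> List[str]:
        records = self.acquire()
        vectors = [self.convert(r) for r in records]
        labels = self.cluster(vectors)
        # Group records by cluster label, then analyze each group separately.
        groups: Dict[int, List[str]] = {}
        for record, label in zip(records, labels):
            groups.setdefault(label, []).append(record)
        fraud: List[str] = []
        for label in sorted(groups):
            fraud.extend(self.identify(groups[label]))
        return fraud


# Trivial stand-in units, for illustration only.
system = DataRecognitionSystem(
    acquire=lambda: ["minor scratch", "minor dent", "total loss"],
    convert=lambda r: [1.0 if r.startswith("minor") else 0.0],
    cluster=lambda vecs: [int(v[0]) for v in vecs],
    identify=lambda group: group if len(group) > 1 else [],
)
result = system.run()
print(result)
```

Keeping the units as separate callables mirrors the unit-by-unit description and lets any one of them be swapped out, for example replacing the clustering unit with a different algorithm.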

In the embodiment of the present application, the conversion unit 420 is specifically configured to perform word segmentation on each of the multiple risk records to divide it into multiple segmented words; to filter the segmented words by combining the stop word set and the reserved word set to obtain the filtered segmented words, where the stop word set is a collection of segmented words that are unrelated to the risk record information and the reserved word set is a preset set of words that must not be filtered out; and to vectorize the filtered segmented words to obtain the multiple word segmentation vectors of each of the multiple risk records. It should be understood that a risk record contains many kinds of information, for example the specific accident, the specific casualties, disease diagnosis records, and the traffic police's accident identification records, and that the data length of each risk record differs. Preprocessing such data directly would require a great deal of work. Therefore, the obtained risk records are first segmented to improve the efficiency of data processing.

In the embodiment of the present application, the clustering unit 430 is specifically configured to screen out target word segmentation vectors from the multiple word segmentation vectors of each risk record, where a target word segmentation vector appears more frequently in the risk record to which it belongs than in other risk records, or appears less frequently in the risk record to which it belongs than in other risk records. The clustering unit 430 is further specifically configured to use a clustering algorithm to cluster the risk records that contain the same or similar target word segmentation vectors into multiple category groups of risk records. In other words, converting the risk records into word segmentation vectors produces many word vectors. Although these have already been screened against the stop word set and the reserved word set, it should be understood that the remaining volume of data is still very large for the subsequent anti-fraud recognition. If the data is filtered further and only the easily distinguishable data is clustered, the accuracy of the clustering can be greatly improved, and feeding the accurate clustering results into the SNA network yields a more accurate recognition result. As an example of easily distinguishable data, if a word appears frequently in the risk record of one case but rarely in other cases, the word is considered to have good distinguishing ability. As another example, if only a small number of cases contain a given word in the risk information field, the word also has good distinguishing ability. It should be understood that the above examples are only for illustration and cannot constitute a specific limitation.

In the embodiment of the present application, the target word segmentation vectors may be screened using the TF-IDF method, which is a statistical method used to evaluate how important a word is to one document in a document set or corpus. It can be understood that the words most useful for distinguishing a document are those that appear frequently in that document but rarely in the other documents of the collection, so a feature space whose coordinates are measured by term frequency (TF) can reflect the characteristics of similar texts. In addition, considering the ability of words to distinguish different categories, the TF-IDF method holds that the smaller the document frequency of a word, the greater its ability to distinguish texts of different categories. The inverse document frequency (IDF) is therefore introduced, and the product of TF and IDF is used as the coordinate value of the feature space, which adjusts the TF weight. The purpose of adjusting the weight is to highlight important words and suppress minor ones. In other words, if a word or phrase appears frequently in one article and rarely in other articles, it is considered to have good discriminating ability and to be suitable for classification. Compared with clustering the risk records directly, this gives better clustering results. After the word segmentation vectors suitable for classification are selected, clustering according to the importance matrix of the risk records gathers cases with similar risk information together more accurately, and using the clustering results as the input data of the SNA network can greatly improve the accuracy of fraud case identification.

In this embodiment of the present application, the identification unit 440 is specifically configured to obtain the risk records of the multiple category groups. The identification unit 440 is specifically configured to use a social network analysis algorithm to establish a risk case relationship network based on the risk records of the multiple category groups, wherein the risk records of one category group correspond to one or more risk case relationship networks. The risk case relationship network includes multiple nodes; the nodes represent individuals, organizations, or virtual individuals in the risk records, and a connection between nodes indicates that a social relationship exists between them. The identification unit 440 is specifically configured to analyze the relationship between each node in the risk case relationship network and the other nodes through a social network analysis algorithm, and to extract the group risk characteristics corresponding to each node. The identification unit 440 is specifically configured to input the group risk characteristics corresponding to each node into a classification model to obtain the fraud rate of each node, where the classification model is a model obtained by training a neural network using a sample set, and the sample set includes known group risk characteristic data of multiple dimensions and the corresponding known fraud rate data. The identification unit 440 is specifically configured to identify, according to the fraud rate of each node, the risk records of the multiple nodes whose fraud rate is higher than a first threshold as fraud cases, and to identify the multiple nodes whose fraud rate is higher than the first threshold as a fraud gang. Here, the SNA network is a collection of points (social actors) and connections between points (relationships between actors).
Each node can be an entity or a virtual individual with a different meaning, such as an organization, an individual, or a network ID, and the relationship between individuals can be kinship or friendship, an action, the sending and receiving of messages, or another relationship. Through SNA network analysis, the key information needed, namely the group risk behavior of each node, can be found in messy data and connection relationships. Taking medical insurance risk data as an example, the risk characteristics of each node may be: the area where the patient is located, the hospital the patient visits, the quantity and specific times of the patient's drug purchases, the disease the patient has, the doctor the patient consulted, and similar behavior. Analyzing the group risk behavior of a patient amounts to a comprehensive analysis of the area where the patient is located, the quantity and specific times of the patient's purchases, and the diseases the patient has had. If it is found that a patient has purchased large quantities of drugs at different hospitals many times, and that the types of drugs differ, it can be determined that the group risk characteristic is that the user purchases large quantities and many types of drugs. For another example, taking vehicle insurance risk data as an example, the risk characteristics of each node may be: the city where the vehicle is located, the license plate number, the type of insurance purchased for the vehicle, the traffic police officer who handled the accident, the identity information of the perpetrator, the identity information of the victim, and so on. If it is found that the vehicle has been involved in multiple risk records at different locations and all of them are minor damage cases, then, since minor damage cases involve small amounts and can be reported and handled quickly, it can be determined that the group risk characteristic is that the user has participated in minor damage cases multiple times.
If it is found that the vehicle has been involved in risk records at different locations multiple times and the victim is the same person each time, it can be determined that the group risk characteristic is that the user has repeatedly cooperated with others to defraud the insurer. In the same way, the group risk characteristics of other risk data, such as commercial insurance and accident insurance, can be obtained. It should be understood that the above examples are only for illustration and cannot constitute a specific limitation.
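As a rough illustration of the relationship network and per-node group features described above, the following sketch links every pair of parties that appear in the same risk record and counts, for each node, how many partners recur across records. The function names `build_network` and `group_features`, the three features, and the party identifiers are hypothetical choices made for this example; the application does not specify them.

```python
from collections import defaultdict
from itertools import combinations

def build_network(records):
    """Connect every pair of parties that appear in the same risk record."""
    graph = defaultdict(set)
    for parties in records:
        for a, b in combinations(sorted(set(parties)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def group_features(graph, records):
    """Per-node features: degree, record count, and recurring partners."""
    record_count = defaultdict(int)
    pair_count = defaultdict(int)
    for parties in records:
        unique = sorted(set(parties))
        for p in unique:
            record_count[p] += 1
        for a, b in combinations(unique, 2):
            pair_count[(a, b)] += 1
    features = {}
    for node, neighbours in graph.items():
        # A partner is "recurring" if the pair shares more than one record.
        repeated = sum(
            1 for other in neighbours
            if pair_count[tuple(sorted((node, other)))] > 1
        )
        features[node] = {
            "degree": len(neighbours),
            "records": record_count[node],
            "repeat_partners": repeated,
        }
    return features

# Hypothetical vehicle-insurance records: the parties named in each record.
records = [
    ["driver_A", "victim_X", "officer_1"],
    ["driver_A", "victim_X", "officer_2"],
    ["driver_B", "victim_Y", "officer_1"],
]
graph = build_network(records)
features = group_features(graph, records)
```

Here `driver_A` and `victim_X` share two records, so each has one recurring partner, which corresponds to the "same victim in multiple accidents" pattern mentioned in the text. Feature dictionaries of this kind could then be passed to whatever classification model is used.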

In the embodiments of the present application, in addition to using a neural network to identify fraud cases among the multiple risk records based on the fraud rate of each node, high-risk cases can also be identified based on the risk rate. For example, Figure 2 shows a case relationship network established by the SNA algorithm, in which the gray dots represent users with many risk records, the black dots represent users who have risk records but only a small number of them, and the white dots represent users with no risk records. Through data calculation and analysis, it can be concluded that the high risk rate of the gang reaches 66.8%, indicating that the average risk rate of the gang is high. Users with risk records account for 91.4% of all users, which further verifies the fraudulent nature of the gang. Understandably, after a fraud gang is identified, the risk records in which the gang participated can be confirmed as fraud cases.

In the embodiment of the present application, after the case relationship network is constructed using SNA, its network structure can also be matched against a preset network model so as to identify high-risk cases. For example, Figure 3 is a schematic diagram of a case relationship network established using the SNA algorithm in another scenario provided by this application. The black dots represent users who have risk records but only a small number of them. Although neither the fraud rate nor the risk rate of each node in this case relationship network is high, the nodes in the network are connected to one another in pairs, so the network as a whole carries a higher risk. Behind such a network there is usually a group of several people acting together; the pairwise connections mean that the members know one another, and the purpose is mostly to collude in forging information that satisfies the conditions for a claim, which requires special attention. It should be understood that the above examples are only for illustration. The case relationship network established by the SNA algorithm can also have network structures other than those of Figure 2 and Figure 3, and the analysis method may differ from those used for Figure 2 and Figure 3, which is not specifically limited in this application.
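One simple way to test for the pairwise-connected structure discussed above is to compute the density of the network and check whether it equals 1. The sketch below is an assumption about how such a structural match could be done; the function names, the density criterion, and the minimum size of three nodes are not taken from the application.

```python
from itertools import combinations

def density(nodes, edges):
    """Fraction of possible node pairs that are actually connected."""
    n = len(nodes)
    if n < 2:
        return 0.0
    # Ignore self-loops and treat (a, b) and (b, a) as the same edge.
    unique_edges = {frozenset(e) for e in edges if len(set(e)) == 2}
    return len(unique_edges) / (n * (n - 1) / 2)

def matches_pairwise_pattern(nodes, edges, min_size=3):
    """True when at least min_size nodes are all connected to one another."""
    return len(nodes) >= min_size and density(nodes, edges) == 1.0

# Hypothetical networks: one fully connected, one a simple chain.
nodes = ["u1", "u2", "u3", "u4"]
clique_edges = list(combinations(nodes, 2))
chain_edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u4")]

clique_flag = matches_pairwise_pattern(nodes, clique_edges)
chain_flag = matches_pairwise_pattern(nodes, chain_edges)
```

The fully connected network is flagged even though no per-node score is involved, which mirrors the point that the structure itself can indicate elevated risk.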

In the above system, multiple risk records are acquired, each of the multiple risk records is converted into multiple word segmentation vectors, the multiple risk records are clustered according to the multiple word segmentation vectors to obtain risk records of multiple category groups, and a social network analysis algorithm is used to analyze each category group separately so as to identify fraud cases among the multiple risk records. Because the risk records are clustered with high precision, the risk records within each category are highly related. When the clustered risk records are used for SNA analysis, the risk case relationship network can be established more effectively, thereby improving the accuracy of group fraud identification.

Refer to FIG. 5, which is a schematic structural diagram of an electronic device provided by the present application. As shown in the figure, the electronic device in this embodiment may include one or more processors 511, a memory 512, and a communication interface 513. The processor 511, the memory 512, and the communication interface 513 may be connected through a bus 514.

The processor 511 includes one or more general-purpose processors, where a general-purpose processor can be any type of device capable of processing electronic instructions, including a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a microprocessor, a microcontroller, a main processor, a controller, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The processor 511 is configured to execute program instructions stored in the memory 512.

The memory 512 may include a volatile memory, such as random access memory (Random Access Memory, RAM). The memory may also include non-volatile memory, such as read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), a hard disk (Hard Disk Drive, HDD) or a solid-state drive (Solid-State Drive, SSD). The memory may also include a combination of the above types of memory. The memory 512 may adopt centralized storage or distributed storage, which is not specifically limited here. It is understandable that the memory 512 is used to store computer programs, such as computer program instructions. In the embodiment of the present application, the memory 512 may provide instructions and data to the processor 511.

The communication interface 513 may be a wired interface (such as an Ethernet interface) or a wireless interface (such as a cellular network interface or a wireless local area network interface) for communicating with other computer devices or users. When the communication interface 513 is a wired interface, it may adopt a protocol family built on top of the Transmission Control Protocol/Internet Protocol (TCP/IP), for example, the Remote Function Call (RFC) protocol, the Simple Object Access Protocol (SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, and so on. When the communication interface 513 is a wireless interface, cellular communication may be used in accordance with the Global System for Mobile Communications (GSM) or Code Division Multiple Access (CDMA) standard, and the interface therefore includes a wireless modem for data transmission, an electronic processing device, one or more digital memory devices, and dual antennas.

In the embodiment of the present application, the processor 511, the memory 512, the communication interface 513, and the bus 514 can execute the implementation described in any embodiment of the data identification method provided in the embodiments of the present application. Specifically, the processor 511 can be used to call the instructions in the memory to execute the following method: obtaining multiple risk records; converting each of the multiple risk records into multiple word segmentation vectors; clustering the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records; and using a social network analysis algorithm to analyze each category group of the risk records of the multiple category groups separately, so as to identify fraud cases among the multiple risk records.

In a specific embodiment, the processor 511 is specifically configured to execute: performing word segmentation on each of the multiple risk records, dividing each of the multiple risk records into multiple segmented words; among the multiple segmented words, deleting those that are the same as words in a stop word set and retaining those that are the same as words in a reserved word set, to obtain the filtered segmented words, where the stop word set is a set of multiple segmented words irrelevant to the risk record information, and the reserved word set is a preset set of words that cannot be filtered out; and mapping the filtered segmented words into multiple word segmentation vectors.
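The filtering and mapping steps above might look like the following sketch. Several simplifications are assumptions made for illustration: whitespace splitting stands in for a real word segmenter, a word that appears in both sets is kept (the reserved set takes precedence), words in neither set are kept, and a plain count vector stands in for whatever vector mapping is actually used.

```python
def filter_tokens(tokens, stop_words, reserved_words):
    """Drop stop words, but never drop a word that is in the reserved set."""
    return [t for t in tokens if t in reserved_words or t not in stop_words]

def to_count_vector(tokens, vocabulary):
    """Map tokens to a count vector over a fixed, ordered vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for t in tokens:
        if t in index:
            vector[index[t]] += 1
    return vector

# Hypothetical risk record text and word sets.
record = "the truck was hit from behind on the highway at night"
tokens = record.split()  # stand-in for a real word segmentation step
stop_words = {"the", "was", "from", "on", "at", "behind"}
reserved_words = {"behind"}  # domain word that must survive filtering

filtered = filter_tokens(tokens, stop_words, reserved_words)
vocabulary = sorted(set(filtered))
vector = to_count_vector(filtered, vocabulary)
```

The word "behind" is in the hypothetical stop word set but survives because it is also reserved, which is the role the reserved word set plays in the description.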

In a specific embodiment, the processor 511 is specifically configured to execute: filtering out target word segmentation vectors from the multiple word segmentation vectors of each risk record, wherein the frequency with which a target word segmentation vector appears in the risk record to which it belongs is higher than its frequency of appearance in other risk records, or the frequency with which the target word segmentation vector appears in the risk record to which it belongs is lower than its frequency of appearance in other risk records; and clustering, through a clustering algorithm, multiple risk records containing the same or similar target word segmentation vectors into multiple category groups of risk records.

In a specific embodiment, the processor 511 is specifically configured to execute: obtaining the risk records of the multiple category groups; using a social network analysis algorithm to establish a risk case relationship network based on the risk records of the multiple category groups, where the risk records of one category group correspond to one or more risk case relationship networks, the risk case relationship network includes multiple nodes, the nodes represent individuals, organizations or virtual individuals in the risk records, and a connection between nodes indicates that a social relationship exists between them; analyzing, through a social network analysis algorithm, the relationship between each node in the risk case relationship network and the other nodes, and extracting the group risk characteristics corresponding to each node; inputting the group risk characteristics corresponding to each node into a classification model to obtain the fraud rate of each node, wherein the classification model is a model obtained by training a neural network using a sample set, and the sample set includes known group risk characteristic data of multiple dimensions and the corresponding known fraud rate data; and, according to the fraud rate of each node, identifying the risk records of the multiple nodes whose fraud rate is higher than a first threshold as fraud cases, and identifying the multiple nodes whose fraud rate is higher than the first threshold as a fraud gang.
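The final thresholding step can be sketched as below. The fraud rates are taken as given (in the description they come from a trained classification model, which is not reproduced here), and the function name `identify_fraud`, the node names, record identifiers, and threshold value are hypothetical.

```python
def identify_fraud(node_fraud_rates, node_records, first_threshold):
    """Return (gang, cases) for nodes whose fraud rate exceeds the threshold.

    node_fraud_rates: {node: fraud rate in [0, 1]} from some classifier.
    node_records: {node: iterable of risk record ids the node took part in}.
    """
    gang = sorted(
        node for node, rate in node_fraud_rates.items()
        if rate > first_threshold
    )
    # Every risk record involving a flagged node is treated as a fraud case.
    cases = sorted({rec for node in gang for rec in node_records.get(node, ())})
    return gang, cases

# Hypothetical classifier outputs and record memberships.
rates = {"a": 0.9, "b": 0.2, "c": 0.85}
memberships = {"a": ["R1", "R2"], "b": ["R3"], "c": ["R2"]}
gang, cases = identify_fraud(rates, memberships, first_threshold=0.8)
```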

In a specific embodiment, the clustering algorithm is one or more of the K-means clustering algorithm, the mean shift clustering algorithm, the density-based spatial clustering of applications with noise (DBSCAN) algorithm, the expectation-maximization (EM) clustering algorithm using a Gaussian mixture model, and the agglomerative hierarchical clustering algorithm.
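Of the algorithms listed, K-means is the simplest to illustrate. The sketch below is a plain Lloyd-style iteration over small numeric vectors; the caller-supplied initial centroids, the fixed iteration count, and the toy points are assumptions made so that the example is deterministic, and a production system would more likely rely on an existing library.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain K-means with caller-supplied initial centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D vectors standing in for per-record feature vectors.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels, centroids = k_means(points, centroids=[points[0], points[2]])
```

Records that receive the same label would form one category group and be passed together to the network analysis step.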

It should be noted that, for the specific content of the method executed by the processor 511, reference may be made to the relevant description of the foregoing method embodiment. For the sake of brevity of the description, the details are not repeated here.

In another embodiment of the present application, a computer-readable storage medium is provided. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, carry out the implementation described in any of the embodiments of the data identification method provided in this application, which will not be repeated here.

The computer-readable storage medium may be the internal storage unit of the terminal described in any of the foregoing embodiments, such as the hard disk or memory of the terminal. The computer-readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk equipped on the terminal, a smart memory card (Smart Media Card, SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and so on. Further, the computer-readable storage medium may also include both an internal storage unit of the terminal and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the terminal. The computer-readable storage medium can also be used to temporarily store data that has been output or will be output.

In the embodiments of the present application, a computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following method: obtaining multiple risk records; converting each of the multiple risk records into multiple word segmentation vectors; clustering the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records; and using a social network analysis algorithm to separately analyze each category group in the risk records of the multiple category groups, so as to identify fraud cases in the multiple category groups.

In a specific embodiment, when the computer program is executed by the processor, the following method is specifically implemented: performing word segmentation on each of the multiple risk records, dividing each of the multiple risk records into multiple segmented words; among the multiple segmented words, deleting those that are the same as words in a stop word set and retaining those that are the same as words in a reserved word set, to obtain the filtered segmented words, where the stop word set is a set of multiple segmented words irrelevant to the risk record information, and the reserved word set is a preset set of words that cannot be filtered out; and mapping the filtered segmented words into multiple word segmentation vectors.

In a specific embodiment, when the computer program is executed by the processor, the following method is specifically implemented: filtering out target word segmentation vectors from the multiple word segmentation vectors of each risk record, where the frequency with which a target word segmentation vector appears in the risk record to which it belongs is higher than its frequency of appearance in other risk records, or the frequency with which the target word segmentation vector appears in the risk record to which it belongs is lower than its frequency of appearance in other risk records; and clustering, through a clustering algorithm, multiple risk records containing the same or similar target word segmentation vectors into multiple category groups of risk records.

In a specific embodiment, when the computer program is executed by the processor, the following method is specifically implemented: obtaining the risk records of the multiple category groups; using a social network analysis algorithm to establish a risk case relationship network based on the risk records of the multiple category groups, where the risk records of one category group correspond to one or more risk case relationship networks, the risk case relationship network includes multiple nodes, the nodes represent individuals, organizations or virtual individuals in the risk records, and a connection between nodes indicates that a social relationship exists between them; analyzing, through a social network analysis algorithm, the relationship between each node in the risk case relationship network and the other nodes, and extracting the group risk characteristics corresponding to each node; inputting the group risk characteristics corresponding to each node into a classification model to obtain the fraud rate of each node, where the classification model is a model obtained by training a neural network using a sample set, and the sample set includes known group risk characteristic data of multiple dimensions and the corresponding known fraud rate data; and, according to the fraud rate of each node, identifying the risk records of the multiple nodes whose fraud rate is higher than a first threshold as fraud cases, and identifying the multiple nodes whose fraud rate is higher than the first threshold as a fraud gang.

It should be noted that, for the specific content of the method implemented when the computer program is executed by the processor, reference may be made to the relevant description of the foregoing method embodiment; for the sake of brevity, it will not be repeated here. In the several embodiments provided in this application, it should be understood that the disclosed method and device can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented. In addition, the displayed or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.

In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, a network device, etc.) execute all or part of the steps of the method described in each embodiment of the present application. The aforementioned storage media include: a USB flash drive, a removable hard disk, ROM, RAM, a magnetic disk, an optical disk, and other media that can store program code.

The above are only specific implementations of this application, but the protection scope of this application is not limited to them. Anyone familiar with the technical field can easily think of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims

A method of data recognition, wherein the method includes:

Obtain multiple risk records;

Converting each of the multiple risk records into multiple word segmentation vectors;

Clustering the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records;

Using a social network analysis algorithm to separately analyze each category group in the risk records of the multiple category groups, and identifying fraud cases in the multiple category groups.
The method according to claim 1, wherein converting each of the multiple risk records into multiple word segmentation vectors comprises:

Perform word segmentation processing on each of the multiple risk records, and divide each of the multiple risk records into multiple word segmentation words;

Among the multiple word segmentation words, the word segmentation words that are the same as words in the stop word set are deleted, and the word segmentation words that are the same as words in the reserved word set are retained, to obtain the filtered word segmentation words, wherein the stop word set is a collection of multiple word segmentation words irrelevant to the risk record information, and the reserved word set is a preset set of words that cannot be filtered out;

Map the filtered word segmentation words into multiple word segmentation vectors.
The method according to claim 2, wherein the clustering the plurality of risk records according to the plurality of word segmentation vectors to obtain the risk records of multiple category groups comprises:

From the multiple word segmentation vectors of each risk record, the target word segmentation vector is screened out, where the target word segmentation vector appears more frequently in the risk record to which it belongs than in other risk records, or the frequency of the target word segmentation vector in the risk record to which it belongs is lower than its frequency in other risk records;

Through the clustering algorithm, multiple risk records containing the same or similar target word segmentation vectors are clustered into multiple category groups of risk records.
The method according to claim 1, wherein using a social network analysis algorithm to separately analyze each category group in the risk records of the multiple category groups to identify fraud cases in the multiple risk records comprises:

Obtaining the risk records of the multiple category groups;

Based on the risk records of the multiple category groups, a social network analysis algorithm is used to establish a risk case relationship network, where the risk records of one category group correspond to one or more risk case relationship networks, the risk case relationship network includes multiple nodes, the nodes represent individuals, organizations, or virtual individuals in the risk records, and the connections between the multiple nodes indicate that there is a social relationship between the multiple nodes;

Analyze the relationship between each node in the risk case relationship network and other nodes through a social network analysis algorithm, and extract the group risk characteristics corresponding to each node;

The group risk characteristics corresponding to each node are input into a classification model to obtain the fraud rate of each node, where the classification model is a model obtained by training a neural network using a sample set, and the sample set includes known group risk characteristic data of multiple dimensions and the corresponding known fraud rate data;

According to the fraud rate of each node, the risk records of multiple nodes whose fraud rate is higher than the first threshold are recognized as fraud cases, and the multiple nodes whose fraud rates are higher than the first threshold are recognized as fraud gangs.
The method according to claim 3, wherein the clustering algorithm is one or more of a K-means clustering algorithm, a mean shift clustering algorithm, a density-based spatial clustering of applications with noise (DBSCAN) algorithm, an expectation-maximization (EM) clustering algorithm using a Gaussian mixture model, and an agglomerative hierarchical clustering algorithm.
A data recognition system, wherein the system includes an acquisition unit, a conversion unit, a clustering unit, and an identification unit, wherein,

The acquisition unit is used to obtain multiple risk records;

The conversion unit is configured to convert each of the multiple risk records into multiple word segmentation vectors;

The clustering unit is configured to cluster the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records;

The identification unit is configured to use a social network analysis algorithm to separately analyze each category group in the risk records of the multiple category groups, and identify fraud cases in the multiple risk records.
The system according to claim 6, wherein the conversion unit is specifically used for:

Perform word segmentation processing on each of the multiple risk records, and divide each of the multiple risk records into multiple word segmentation words;

Among the multiple word segmentation words, the word segmentation words that are the same as words in the stop word set are deleted, and the word segmentation words that are the same as words in the reserved word set are retained, to obtain the filtered word segmentation words, wherein the stop word set is a collection of multiple word segmentation words irrelevant to the risk record information, and the reserved word set is a preset set of words that cannot be filtered out;

Map the filtered word segmentation words into multiple word segmentation vectors.
The system according to claim 7, wherein:

The clustering unit is specifically configured to filter out the target word segmentation vector from the multiple word segmentation vectors of each risk record, wherein the target word segmentation vector appears more frequently in the risk record to which it belongs than in other risk records, or the frequency of the target word segmentation vector in the risk record to which it belongs is lower than its frequency in other risk records;

The clustering unit is specifically configured to use a clustering algorithm to cluster multiple risk records containing the same or similar target word segmentation vectors into multiple category groups of risk records.
The system according to claim 6, wherein:

The identification unit is specifically configured to obtain the risk records of the multiple category groups;

The identification unit is specifically configured to use a social network analysis algorithm to establish a risk case relationship network based on the risk records of the multiple category groups, wherein the risk records of one category group correspond to one or more risk case relationship networks, the risk case relationship network includes multiple nodes, the nodes represent individuals, organizations, or virtual individuals in the risk records, and the connections between the multiple nodes indicate that there is a social relationship between the multiple nodes;

The identification unit is specifically configured to analyze the relationship between each node in the risk case relationship network and other nodes through a social network analysis algorithm, so as to extract the group risk characteristics corresponding to each node;

The identification unit is specifically configured to input the group risk characteristics corresponding to each node into a classification model to obtain the fraud rate of each node, wherein the classification model is a model obtained by training a neural network using a sample set, and the sample set includes known group risk characteristic data of multiple dimensions and the corresponding known fraud rate data;

The identification unit is specifically configured to identify the risk records of multiple nodes with a fraud rate higher than a first threshold as fraud cases, and identify the multiple nodes with a fraud rate higher than the first threshold as fraud gangs.
The system according to claim 8, wherein the clustering algorithm is one or more of a K-means clustering algorithm, a mean shift clustering algorithm, a density-based spatial clustering of applications with noise (DBSCAN) algorithm, an expectation-maximization (EM) clustering algorithm using a Gaussian mixture model, and an agglomerative hierarchical clustering algorithm.
An electronic device, wherein the electronic device includes a processor and a memory; the memory is used to store instructions; the processor is used to call instructions in the memory to execute the following method:

Obtain multiple risk records;

Convert each of the multiple risk records into multiple word segmentation vectors;

Cluster the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records;

Use a social network analysis algorithm to separately analyze each category group of the multiple category groups of risk records, so as to identify fraud cases in the multiple category groups.
The electronic device according to claim 11, wherein the processor is specifically configured to execute:

Perform word segmentation processing on each of the multiple risk records, and divide each of the multiple risk records into multiple word segmentation words;

Among the multiple word segmentation words, delete the words that are the same as words in a stop word set and retain the words that are the same as words in a reserved word set, to obtain the filtered word segmentation words, wherein the stop word set is a collection of word segmentation words irrelevant to the risk record information, and the reserved word set is a preset set of words that must not be filtered out;

Map the filtered word segmentation words into multiple word segmentation vectors.
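The three steps recited above (segmentation, stop-word filtering with a reserved set, and mapping to vectors) can be illustrated with a minimal Python sketch. It is not the claimed implementation. The whitespace segmenter, the one-hot mapping, the rule that a reserved word overrides the stop word set, and all function names and sample data are assumptions made for illustration only; a real system would use a proper word segmenter and a learned embedding.

```python
from typing import Dict, List, Set


def segment(record: str) -> List[str]:
    """Stand-in segmenter (assumption): lowercase and split on whitespace."""
    return record.lower().split()


def filter_words(words: List[str], stop_words: Set[str],
                 reserved_words: Set[str]) -> List[str]:
    """Drop stop words, but always keep a word found in the reserved set."""
    return [w for w in words if w in reserved_words or w not in stop_words]


def build_vocabulary(filtered_records: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct word an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for words in filtered_records:
        for word in words:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vectors(words: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Map each word to a one-hot vector (assumed encoding, for illustration)."""
    vectors = []
    for word in words:
        vec = [0] * len(vocab)
        vec[vocab[word]] = 1
        vectors.append(vec)
    return vectors


# Hypothetical sample data, not taken from the application.
records = [
    "The truck hit the guardrail at night",
    "The truck was parked at the depot",
]
stop_words = {"the", "at", "was", "night"}
reserved_words = {"night"}  # kept even though it is also listed as a stop word

filtered = [filter_words(segment(r), stop_words, reserved_words) for r in records]
vocab = build_vocabulary(filtered)
vectors = [to_vectors(words, vocab) for words in filtered]
```

In this sketch the reserved word set takes precedence over the stop word set, which is one plausible reading of "cannot be filtered out".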
The electronic device according to claim 12, wherein the processor is specifically configured to execute:

From the multiple word segmentation vectors of each risk record, screen out a target word segmentation vector, wherein the target word segmentation vector either appears more frequently in the risk record to which it belongs than in other risk records, or appears less frequently in the risk record to which it belongs than in other risk records;

Through a clustering algorithm, cluster the multiple risk records containing the same or similar target word segmentation vectors into multiple category groups of risk records.
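The screening and clustering steps recited above can be sketched as follows. This is a sketch under stated assumptions, not the claimed method: TF-IDF is assumed as the measure of how distinctive a word is for its record, and single-linkage agglomerative clustering cut at a Jaccard similarity threshold (equivalent to taking connected components of the threshold graph) stands in for whichever clustering algorithm is actually chosen. The function names, `top_k`, `min_sim`, and the sample records are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tf_idf(records: List[List[str]]) -> List[Dict[str, float]]:
    """Score each word of each record by term frequency times inverse document frequency."""
    n = len(records)
    doc_freq = Counter(w for words in records for w in set(words))
    scores = []
    for words in records:
        tf = Counter(words)
        scores.append({w: (c / len(words)) * math.log(n / doc_freq[w])
                       for w, c in tf.items()})
    return scores


def target_words(score: Dict[str, float], top_k: int) -> Set[str]:
    """Keep the top_k highest-scoring words; ties are broken alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return {w for w, s in ranked[:top_k] if s > 0}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_link_clusters(targets: List[Set[str]], min_sim: float) -> List[List[int]]:
    """Single-linkage clustering cut at min_sim, implemented with union-find."""
    parent = list(range(len(targets)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(targets)):
        for j in range(i + 1, len(targets)):
            if jaccard(targets[i], targets[j]) >= min_sim:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(targets)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical, already-filtered records.
records = [
    ["truck", "rear", "end", "highway"],
    ["truck", "rear", "end", "tunnel"],
    ["cargo", "fire", "warehouse"],
    ["cargo", "fire", "depot"],
]
targets = [target_words(s, top_k=3) for s in tf_idf(records)]
clusters = single_link_clusters(targets, min_sim=0.5)
```

With this sample data the two collision records and the two cargo-fire records end up in separate category groups. K-means or a Gaussian mixture model would instead operate on numeric vectors rather than on word sets.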
The electronic device according to claim 11, wherein the processor is specifically configured to execute:

Obtain the risk records of the multiple category groups;

Based on the risk records of the multiple category groups, use a social network analysis algorithm to establish a risk case relationship network, wherein the risk records of one category group correspond to one or more risk case relationship networks, each risk case relationship network includes multiple nodes, each node represents an individual, an organization, or a virtual individual in a risk record, and a connection between nodes indicates that a social relationship exists between those nodes;

Analyze the relationship between each node in the risk case relationship network and other nodes through a social network analysis algorithm, and extract the group risk characteristics corresponding to each node;

Input the group risk characteristics corresponding to each node into a classification model to obtain the fraud rate of each node, wherein the classification model is a model obtained by training a neural network using a sample set, and the sample set includes known multi-dimensional group risk characteristic data and corresponding known fraud rate data;

According to the fraud rate of each node, the risk records of multiple nodes whose fraud rate is higher than the first threshold are recognized as fraud cases, and the multiple nodes whose fraud rates are higher than the first threshold are recognized as fraud gangs.
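The network construction, feature extraction, scoring, and thresholding steps recited above can be sketched as follows. Everything specific here is an assumption for illustration: edges link parties that appear in the same risk record, the group risk characteristics are reduced to two features (node degree and number of records), and a logistic function with made-up weights stands in for the trained neural network classifier. The names `build_network`, `group_features`, `fraud_rate`, and `identify_fraud` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def build_network(records: List[List[str]]) -> Dict[str, Set[str]]:
    """Link every pair of parties that appear together in one risk record."""
    graph: Dict[str, Set[str]] = {}
    for parties in records:
        for party in parties:
            graph.setdefault(party, set())
        for a, b in combinations(sorted(set(parties)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def group_features(graph: Dict[str, Set[str]],
                   records: List[List[str]]) -> Dict[str, List[float]]:
    """Per-node features (assumed): degree and number of records involving the node."""
    counts = Counter(p for parties in records for p in set(parties))
    return {node: [float(len(nbrs)), float(counts[node])]
            for node, nbrs in graph.items()}


def fraud_rate(features: List[float], weights: List[float], bias: float) -> float:
    """Logistic score standing in for the trained neural network (assumption)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def identify_fraud(records: List[List[str]], weights: List[float], bias: float,
                   threshold: float) -> Tuple[Set[str], List[int]]:
    """Return the nodes above the threshold and the indices of their records."""
    graph = build_network(records)
    feats = group_features(graph, records)
    gang = {n for n, f in feats.items() if fraud_rate(f, weights, bias) > threshold}
    cases = [i for i, parties in enumerate(records) if gang.intersection(parties)]
    return gang, cases


# Hypothetical records of one category group: the parties named in each record.
records = [
    ["driver_a", "garage_x"],
    ["driver_b", "garage_x"],
    ["driver_c", "garage_x"],
    ["driver_d", "garage_y"],
]
# Illustrative weights and bias only; a real model would learn them from the sample set.
gang, cases = identify_fraud(records, weights=[1.0, 1.0], bias=-4.0, threshold=0.5)
```

Here the hypothetical node `garage_x`, which is connected to three drivers across three records, is the only node scored above the threshold, so the first three records are flagged.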
The electronic device according to claim 13, wherein the clustering algorithm is one or more of a K-means clustering algorithm, a mean shift clustering algorithm, a density-based spatial clustering of applications with noise (DBSCAN) algorithm, an expectation-maximization (EM) clustering algorithm using a Gaussian mixture model, and an agglomerative hierarchical clustering algorithm.
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following method:

Obtain multiple risk records;

Convert each of the multiple risk records into multiple word segmentation vectors;

Cluster the multiple risk records according to the multiple word segmentation vectors to obtain multiple category groups of risk records;

Use a social network analysis algorithm to separately analyze each category group of the multiple category groups of risk records, so as to identify fraud cases in the multiple category groups.
The computer-readable storage medium according to claim 16, wherein the computer program specifically implements the following method when being executed by the processor:

Perform word segmentation processing on each of the multiple risk records, and divide each of the multiple risk records into multiple word segmentation words;

Among the multiple word segmentation words, delete the words that are the same as words in a stop word set and retain the words that are the same as words in a reserved word set, to obtain the filtered word segmentation words, wherein the stop word set is a collection of word segmentation words irrelevant to the risk record information, and the reserved word set is a preset set of words that must not be filtered out;

Map the filtered word segmentation words into multiple word segmentation vectors.
The computer-readable storage medium according to claim 17, wherein the computer program specifically implements the following method when being executed by the processor:

From the multiple word segmentation vectors of each risk record, screen out a target word segmentation vector, wherein the target word segmentation vector either appears more frequently in the risk record to which it belongs than in other risk records, or appears less frequently in the risk record to which it belongs than in other risk records;

Through a clustering algorithm, cluster the multiple risk records containing the same or similar target word segmentation vectors into multiple category groups of risk records.
The computer-readable storage medium according to claim 16, wherein the computer program specifically implements the following method when being executed by the processor:

Obtain the risk records of the multiple category groups;

Based on the risk records of the multiple category groups, use a social network analysis algorithm to establish a risk case relationship network, wherein the risk records of one category group correspond to one or more risk case relationship networks, each risk case relationship network includes multiple nodes, each node represents an individual, an organization, or a virtual individual in a risk record, and a connection between nodes indicates that a social relationship exists between those nodes;

Analyze the relationship between each node in the risk case relationship network and other nodes through a social network analysis algorithm, and extract the group risk characteristics corresponding to each node;

Input the group risk characteristics corresponding to each node into a classification model to obtain the fraud rate of each node, wherein the classification model is a model obtained by training a neural network using a sample set, and the sample set includes known multi-dimensional group risk characteristic data and corresponding known fraud rate data;

According to the fraud rate of each node, the risk records of multiple nodes whose fraud rate is higher than the first threshold are recognized as fraud cases, and the multiple nodes whose fraud rates are higher than the first threshold are recognized as fraud gangs.
The computer-readable storage medium according to claim 18, wherein the clustering algorithm is one or more of a K-means clustering algorithm, a mean shift clustering algorithm, a density-based spatial clustering of applications with noise (DBSCAN) algorithm, an expectation-maximization (EM) clustering algorithm using a Gaussian mixture model, and an agglomerative hierarchical clustering algorithm.