CN110659997B - Data cluster recognition method, device, computer system and readable storage medium - Google Patents


Info

Publication number
CN110659997B
Authority
CN
China
Prior art keywords: case, cases, risk, clustering, comparison
Legal status: Active
Application number
CN201910754337.4A
Other languages
Chinese (zh)
Other versions
CN110659997A (en)
Inventor
张密
唐文
Current Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Application filed by Ping An Property and Casualty Insurance Company of China Ltd filed Critical Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN201910754337.4A
Publication of CN110659997A
Application granted
Publication of CN110659997B
Legal status: Active

Classifications

    • G06Q40/08: Insurance (under G: Physics; G06: Computing, calculating or counting; G06Q: ICT specially adapted for administrative, commercial, financial, managerial or supervisory purposes; G06Q40/00: Finance, insurance, tax strategies, processing of corporate or income taxes)
    • G06F18/23: Clustering techniques (under G06F: Electric digital data processing; G06F18/00: Pattern recognition; G06F18/20: Analysing)


Abstract

The invention discloses an artificial-intelligence-based data cluster identification method, device, computer system and readable storage medium. The method comprises the following steps: setting one case in a case library as a reference case, and setting the other cases in the case library as comparison cases; judging in turn whether the reference case is associated with each comparison case, and producing a one-dimensional vector; obtaining the one-dimensional vectors of all cases in the case library in turn, and combining them to obtain an adjacency matrix; computing the adjacency matrix to obtain a dense vector; computing the dense vector to cluster all cases in the case library, and outputting a clustering result; and calculating a risk value for each case from the policy information, vehicle features and reporter features of the cases in the clustering result. By analysing the cases in clusters below the clustering threshold, the invention obtains high-risk cases, so that practitioners can focus their analysis on them and identify suspected-fraud cases among them that have not yet been paid out or reported.

Description

Data cluster recognition method, device, computer system and readable storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a data cluster recognition method, a device, a computer system, and a readable storage medium.
Background
Existing insurance anti-fraud schemes mainly screen cases through blacklist/whitelist rules. However, a conventional rule engine of this kind relies only on human experience, so case screening is prone to misjudgment; it also imposes a heavy workload on staff, thereby increasing enterprise labour costs.
If instead a mature neural network were built from the blacklist/whitelist rules by supervised learning, the network would need a large amount of labelled data, i.e. cases suspected of fraud, to learn from; since large amounts of labelled data are difficult to obtain, this approach is often hard to implement in practice, and thus rarely yields an efficient and reliable mature neural network.
Disclosure of Invention
The invention aims to provide a data cluster identification method, a data cluster identification device, a computer system and a readable storage medium, which are used for solving the problems existing in the prior art.
In order to achieve the above object, the present invention provides a data cluster recognition method, comprising the steps of:
S1: setting one case in a case library as a reference case, and setting the other cases in the case library as comparison cases; extracting the reference case and the comparison cases from the case library, and judging in turn whether the reference case is associated with each comparison case; if so, assigning the relation value between the reference case and that comparison case as 1, otherwise as 0; forming the one-dimensional vector of the reference case from the relation values between the reference case and each comparison case; wherein the data information of a case comprises scene pictures, report text, field structure information, policy information, vehicle features and reporter features;
S2: setting the cases in the case library as reference cases in turn according to the method of S1, obtaining the one-dimensional vector of each reference case in turn, and combining the one-dimensional vectors of all cases in the case library to obtain an adjacency matrix;
s3: calculating the adjacency matrix by using an SDNE algorithm to obtain a dense vector for expressing the association relation among all cases in the case library;
S4: calculating the dense vector by using an AP clustering algorithm to cluster all cases in the case library, and outputting a clustering result;
S5: calculating a risk value for each case from the policy information, vehicle features and reporter features of the cases in the clustering result, and judging from the risk value whether the cases in a cluster of the clustering result are high-risk or low-risk cases; generating a high-risk signal from a high-risk case and outputting it to the client, or generating a low-risk signal from a low-risk case and outputting it to the client;
wherein the high risk signal and the low risk signal are to be output to the client in the form of communication signals, respectively.
In the above scheme, the step S1 includes the following steps:
S11: extracting one case from the case library as the reference case, and setting the cases in the case library other than the reference case as comparison cases; judging the association relation between the reference case and a comparison case;
s12: sequentially judging the association relations between the reference cases and all the comparison cases in the case library according to the method of S11;
s13: if the reference case and the comparison case have an association relationship, the relationship value between the reference case and the comparison case is assigned as 1; if the reference case and the comparison case do not have an association relationship, the relationship value between the reference case and the comparison case is assigned to be 0;
S14: forming the relation values between the reference case and each comparison case into a one-dimensional vector; each element of the one-dimensional vector is the relation value between the reference case and one comparison case.
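Steps S11 to S14 amount to evaluating a pairwise association predicate and recording 0/1 relation values, and step S2 then stacks the resulting rows. A minimal Python sketch, in which `is_associated` is a hypothetical stand-in for the picture/text/field-information checks described later:

```python
# Sketch of steps S11-S14 and S2: build the 0/1 one-dimensional vector of
# each reference case against every comparison case, then stack the rows
# into an adjacency matrix. `is_associated` is a hypothetical stand-in for
# the real association checks; here it just looks for a shared plate.

def is_associated(reference, comparison):
    """Assumed association test: any shared licence plate links two cases."""
    return bool(set(reference["plates"]) & set(comparison["plates"]))

def relation_vector(cases, ref_index):
    """Return the 0/1 one-dimensional vector of one reference case."""
    reference = cases[ref_index]
    return [
        1 if i != ref_index and is_associated(reference, cases[i]) else 0
        for i in range(len(cases))
    ]

def adjacency_matrix(cases):
    """Step S2: take each case in turn as the reference case and stack
    the resulting one-dimensional vectors into an adjacency matrix."""
    return [relation_vector(cases, i) for i in range(len(cases))]

cases = [
    {"plates": {"A111"}},
    {"plates": {"A111", "B222"}},
    {"plates": {"C333"}},
]
adj = adjacency_matrix(cases)
```

With a real association test substituted in, each row of `adj` is the one-dimensional vector of one reference case.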
In the above scheme, in the step S2, all cases in the case library are used as reference cases in turn; according to the method of S1, the association relations between each reference case and its comparison cases are obtained in turn, and the one-dimensional vector of each reference case is obtained in turn; the one-dimensional vectors of the reference cases are merged to obtain the adjacency matrix.
In the above scheme, the step S3 includes the following steps:
S31: setting the dimension of the embedding layer of the SDNE algorithm, and inputting the adjacency matrix into the input layer of the SDNE algorithm;
S32: controlling the SDNE algorithm to learn the first-order neighbour relations in the adjacency matrix in a supervised manner, and the second-order neighbour relations in an unsupervised deep-learning manner; the loss functions of the two learning processes are combined to optimise the SDNE algorithm, and finally the embedding layer of the SDNE algorithm is extracted as the dense vectors of all cases in the case library.
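As a rough illustration of S31/S32, the numpy sketch below trains a single-layer, linear SDNE-style autoencoder: the reconstruction term (with missed edges penalised by `beta`) captures second-order proximity, and a Laplacian term on the embeddings captures first-order proximity. All hyper-parameters are illustrative assumptions, and a real SDNE uses deep non-linear encoder/decoder stacks:

```python
import numpy as np

def sdne_embed(A, dim=2, alpha=0.1, beta=5.0, lr=0.001, epochs=500, seed=0):
    """Toy SDNE-style embedding of a 0/1 adjacency matrix A (n x n)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(scale=0.1, size=(n, dim))   # encoder weights
    W_dec = rng.normal(scale=0.1, size=(dim, n))   # decoder weights
    B = np.where(A > 0, beta, 1.0)   # penalise missed edges more heavily
    L = np.diag(A.sum(axis=1)) - A   # graph Laplacian for first-order loss
    losses = []
    for _ in range(epochs):
        Y = A @ W_enc                # the "embedded layer": dense vectors
        X_hat = Y @ W_dec            # reconstructed adjacency rows
        E = (X_hat - A) * B
        # second-order (weighted reconstruction) + first-order (Laplacian)
        loss = np.sum(E ** 2) + alpha * np.trace(Y.T @ L @ Y)
        losses.append(float(loss))
        dX_hat = 2.0 * E * B                        # gradient w.r.t. X_hat
        dY = dX_hat @ W_dec.T + alpha * (L + L.T) @ Y
        W_dec -= lr * (Y.T @ dX_hat)
        W_enc -= lr * (A.T @ dY)
    return A @ W_enc, losses

# three mutually linked cases plus one isolated case
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 0]]
embedding, losses = sdne_embed(A)
```

Each row of `embedding` is the dense vector of one case; linked cases end up close together while the isolated case stays apart.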
In the above solution, the step S4 includes the following steps:
S41: inputting the dense vectors into the AP clustering algorithm, obtaining the Euclidean distance between any two cases in the case library, and obtaining a similarity matrix;
S42: clustering the cases represented by the dense vectors according to the similarity matrix. The attraction information matrix of the AP clustering algorithm gives, in turn, the degree to which each case is suited to serve as the cluster centre of the other cases; the attribution information matrix gives, in turn, the degree to which each case should select another case as its cluster centre. Clustering ends, and the clustering result is output, when the iteration count is reached or the samples in every cluster remain unchanged.
In the above solution, the step S42 includes the following steps:
S42-1: setting the iteration count of the AP clustering algorithm to T, and initialising r(i, j) and a(i, j) to 0;
wherein r(i, j) is the attraction information in the attraction information matrix, describing how well case j in the dense vectors is suited to be the cluster centre of case i;
a(i, j) is the attribution information in the attribution information matrix, describing how appropriate it is for case i in the dense vectors to select case j as its cluster centre;
S42-2: iterating the attraction information matrix r(i, j) in the AP clustering algorithm according to the following formula:
r(i, j) = s(i, j) - max_{j' != j} { a(i, j') + s(i, j') };
wherein s(i, j) is the Euclidean-distance-based similarity between case i and case j in the dense vectors;
the formula describes the suitability of j as the centre of i as: the similarity between i and j, minus the largest value, over all other candidates j', of the sum of the attribution a(i, j') and the similarity s(i, j').
S42-3: iterating the attribution information a(i, j) in the AP clustering algorithm according to the following formulas:
when i != j: a(i, j) = min( 0, r(j, j) + sum_{i' not in {i, j}} max(0, r(i', j)) );
when i = j: a(j, j) = sum_{i' != j} max(0, r(i', j));
wherein the formulas describe that if i = j, case i is a cluster centre, and otherwise case i belongs to the cluster whose centre is j;
S42-4: iterating the AP clustering algorithm according to steps S42-2 and S42-3; when the iteration count reaches T or the samples in every cluster remain unchanged, the AP clustering algorithm ends and the clustering result is output.
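The update rules of S42-1 to S42-4 can be implemented directly. The sketch below follows the two formulas above, using the negative squared Euclidean distance as the similarity s(i, j) (so that larger means more similar), the median similarity as the diagonal "preference", and a damping factor for numerical stability; these three choices are standard for AP clustering but are assumptions not spelled out in the text:

```python
import numpy as np

def affinity_propagation(points, max_iter=200, damping=0.7):
    """AP clustering following the update rules of S42-1 to S42-4."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    # S41: similarity matrix (negative squared Euclidean distance).
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S[np.diag_indices(n)] = np.median(S)     # "preference" on the diagonal
    R = np.zeros((n, n))                     # attraction matrix r(i, j)
    A = np.zeros((n, n))                     # attribution matrix a(i, j)
    for _ in range(max_iter):
        # S42-2: r(i,j) = s(i,j) - max_{j' != j} [ a(i,j') + s(i,j') ]
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        max_excl = np.tile(first[:, None], (1, n))
        max_excl[np.arange(n), idx] = second
        R = damping * R + (1 - damping) * (S - max_excl)
        # S42-3: a(i,j) = min(0, r(j,j) + sum_{i' not in {i,j}} max(0, r(i',j)))
        #         a(j,j) = sum_{i' != j} max(0, r(i',j))
        Rp = np.maximum(R, 0)
        Rp[np.diag_indices(n)] = R[np.diag_indices(n)]
        col = Rp.sum(axis=0)
        A_new = np.minimum(0.0, col[None, :] - Rp)
        A_new[np.diag_indices(n)] = col - R[np.diag_indices(n)]
        A = damping * A + (1 - damping) * A_new
    return (A + R).argmax(axis=1)            # each case's cluster centre

points = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
labels = affinity_propagation(points)
```

On the toy points, the two well-separated groups each converge onto one cluster centre.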
In the above solution, the step S5 includes the following steps:
S51: acquiring the clusters in the clustering result whose number of cases is below the clustering threshold; extracting the policy information, vehicle features and reporter features of the cases in such a cluster;
S52: acquiring the number A of lapsed (invalid) policies in the cluster from the policy information;
S53: acquiring the number B of second-hand vehicles and the number C of staged-accident ("porcelain-bumping") vehicles among the cases in the cluster from the vehicle features;
S54: acquiring the number D of blacklisted reporters in the cluster from the reporter features;
S55: calculating the risk value Y from the number A of lapsed policies, the number B of second-hand vehicles, the number C of staged-accident vehicles and the number D of blacklisted reporters according to a weighted-summation formula;
S56: if the risk value Y does not exceed the risk threshold, judging the cases in the cluster to be low-risk cases;
if the risk value Y exceeds the risk threshold, judging the cases in the cluster to be high-risk cases;
S57: generating a high-risk signal from the high-risk cases and outputting it to the client in the form of a communication signal; or
generating a low-risk signal from the low-risk cases and outputting it to the client in the form of a communication signal.
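A small sketch of S51 to S57 for one cluster; the weights and the risk threshold are invented for illustration, since the text only specifies that Y is a weighted sum of the four counts:

```python
# Sketch of steps S51-S57: weighted-sum risk score for one small cluster.
# RISK_WEIGHTS and RISK_THRESHOLD are illustrative assumptions.
RISK_WEIGHTS = {"lapsed_policies": 0.3, "second_hand": 0.2,
                "staged_accident": 0.3, "blacklisted_reporter": 0.2}
RISK_THRESHOLD = 1.0  # assumed value

def risk_value(a, b, c, d):
    """S55: Y = w1*A + w2*B + w3*C + w4*D."""
    w = RISK_WEIGHTS
    return (w["lapsed_policies"] * a + w["second_hand"] * b
            + w["staged_accident"] * c + w["blacklisted_reporter"] * d)

def classify(y, threshold=RISK_THRESHOLD):
    """S56: above the threshold means high-risk, otherwise low-risk."""
    return "high-risk" if y > threshold else "low-risk"

# counts for one cluster: A=2 lapsed policies, B=1 second-hand vehicle,
# C=1 staged-accident vehicle, D=1 blacklisted reporter
y = risk_value(a=2, b=1, c=1, d=1)
```

The corresponding high-risk or low-risk signal (S57) would then be sent to the client based on `classify(y)`.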
In order to achieve the above object, the present invention further provides a data cluster recognition device, including:
the one-dimensional vector formulation module, used for setting one case in the case library as a reference case and setting the other cases in the case library as comparison cases; extracting the reference case and the comparison cases from the case library, and judging in turn whether the reference case is associated with each comparison case; if so, assigning the relation value between the reference case and that comparison case as 1, otherwise as 0; forming the one-dimensional vector of the reference case from the relation values between the reference case and each comparison case; wherein the data information of a case comprises scene pictures, report text, field structure information, policy information, vehicle features and reporter features;
the adjacency matrix formulation module, used for calling the one-dimensional vector formulation module to set the cases in the case library as reference cases in turn, obtain the one-dimensional vector of each reference case in turn, and combine the one-dimensional vectors of all cases in the case library to obtain the adjacency matrix;
the vector operation module is used for calculating the adjacency matrix by utilizing an SDNE algorithm so as to obtain dense vectors of all cases in the case library;
the clustering operation module is used for calculating the dense vector by using an AP clustering algorithm so as to cluster all cases in the case library and outputting a clustering result;
the risk evaluation module, used for calculating the risk value of a case from the policy information, vehicle features and reporter features of the cases in the clustering result, and judging from the risk value whether the cases in a cluster of the clustering result are high-risk or low-risk cases; generating a high-risk signal from a high-risk case and outputting it to the client, or generating a low-risk signal from a low-risk case and outputting it to the client; wherein the high-risk signal and the low-risk signal are output to the client in the form of communication signals, respectively.
The present invention also provides a computer system comprising a plurality of computer devices, each computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processors of the plurality of computer devices together implementing the steps of the data cluster recognition method described above when executing the computer program.
In order to achieve the above object, the present invention further provides a computer readable storage medium, which includes a plurality of storage media, each storage medium storing a computer program, and the steps of the data cluster recognition method are jointly implemented when the computer programs stored in the plurality of storage media are executed by a processor.
The invention provides a data cluster recognition method, device, computer system and readable storage medium. The one-dimensional vector formulation module and the adjacency matrix formulation module are used to obtain an adjacency matrix expressing the association relations among the cases in the case library; the vector operation module reduces the dimension of the adjacency matrix to obtain dense vectors representing the association relations among all cases in the case library; the clustering operation module computes the dense vectors with the AP clustering algorithm to obtain a clustering result over all cases in the case library; and the risk assessment module yields the high-risk and low-risk cases;
in the insurance industry, suspected fraud cases are generally low-probability events, and when fraud does occur it is very likely to be gang-committed; therefore the clustering threshold is adjusted according to business experience and regional conditions, and the cases in clusters below the clustering threshold are analysed to obtain the high-risk cases, so that practitioners can focus their analysis on them, identify the suspected-fraud cases among them that have not yet been paid out or reported, and follow up or raise alarms on those cases to recover the insurer's losses caused by fraud;
meanwhile, the accuracy of case fraud-risk evaluation is greatly improved, as are the speed and efficiency of case evaluation; compared with submitting every case directly to manual handling, labour costs are greatly reduced;
moreover, because the cases in the case library are clustered and only the cases in clusters below the clustering threshold are analysed in depth, there is no need to supply a large amount of labelled data, i.e. a large number of suspected-fraud cases, for a neural network to learn from; the approach is easy to implement in practice and thus enables fast and effective risk assessment of cases.
Drawings
FIG. 1 is a flowchart of a data cluster recognition method according to an embodiment of the present invention;
FIG. 2 is a workflow diagram of a data cluster recognition device and a service system according to a first embodiment of the data cluster recognition method of the present invention;
FIG. 3 is a schematic diagram of a program module of a second embodiment of the data cluster recognition device of the present invention;
fig. 4 is a schematic hardware structure of a computer device in a third embodiment of the computer system according to the present invention.
Reference numerals:
1: data cluster recognition device; 2: case library; 3: computer device; 4: client
11: one-dimensional vector formulation module; 12: adjacency matrix formulation module; 13: vector operation module
14: clustering operation module; 15: risk assessment module; 31: memory; 32: processor
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a data cluster identification method based on a one-dimensional vector formulation module, an adjacency matrix formulation module, a vector operation module, a clustering operation module and a risk assessment module, applicable to the field of artificial intelligence. First, the one-dimensional vector formulation module and the adjacency matrix formulation module are used to obtain an adjacency matrix expressing the association relations among the cases in the case library; the vector operation module reduces the dimension of the adjacency matrix to obtain dense vectors representing the association relations among all cases in the case library; the clustering operation module computes the dense vectors with the AP clustering algorithm to obtain a clustering result over all cases in the case library; and the risk assessment module yields the high-risk and low-risk cases;
The high-risk cases are obtained by analysing the cases in clusters below the clustering threshold, so that practitioners can focus their analysis on the high-risk cases, identify the suspected-fraud cases among them that have not yet been paid out or reported, and follow up or report those cases to recover the insurer's losses caused by fraud.
Example 1
Referring to fig. 1 and 2, the data cluster recognition method of the present embodiment, using the data cluster recognition device 1, includes the following steps:
S1: setting one case in the case library 2 as a reference case, and setting the other cases in the case library 2 as comparison cases; extracting the reference case and the comparison cases from the case library 2, and judging in turn whether the reference case is associated with each comparison case; if so, assigning the relation value between the reference case and that comparison case as 1, otherwise as 0; forming the one-dimensional vector of the reference case from the relation values between the reference case and each comparison case; wherein the data information of a case comprises scene pictures, report text, field structure information, policy information, vehicle features and reporter features;
S2: setting the cases in the case library 2 as reference cases in turn according to the method of S1, obtaining the one-dimensional vector of each reference case in turn, and combining the one-dimensional vectors of all cases in the case library 2 to obtain an adjacency matrix;
s3: calculating the adjacency matrix by using an SDNE algorithm to obtain a dense vector for expressing the association relation among all cases in the case library 2;
s4: calculating the dense vector by using an AP clustering algorithm to cluster all cases in the case library 2 and outputting a clustering result;
S5: calculating a risk value for each case from the policy information, vehicle features and reporter features of the cases in the clustering result, and judging from the risk value whether the cases in a cluster of the clustering result are high-risk or low-risk cases; generating a high-risk signal from a high-risk case and outputting it to the client 4, or generating a low-risk signal from a low-risk case and outputting it to the client 4;
wherein the high risk signal and the low risk signal are to be output to the client 4 in the form of communication signals, respectively.
Specifically, the step S1 includes the following steps:
S11: extracting one case from the case library as the reference case, and setting the cases in the case library other than the reference case as comparison cases; judging the association relation between the reference case and a comparison case;
S12: sequentially judging the association relations between the reference cases and all the comparison cases in the case library according to the method of S11;
Further, the basic information in the step S11 includes scene pictures, report text and field structure information;
the historical report data likewise comprises scene pictures, report text and field structure information.
S13: if the reference case and the comparison case have an association relationship, the relationship value between the reference case and the comparison case is assigned as 1; if the reference case and the comparison case do not have an association relationship, the relationship value between the reference case and the comparison case is assigned to be 0;
s14: the relation value between the reference case and each comparison case is formed into a one-dimensional vector; the element values of each column in the one-dimensional vector are the relation values between the reference case and each comparison case.
Further, the step S11 includes the following steps:
S11-01: extracting the SIFT features of all scene pictures in the case library 2;
preferably, the extracted SIFT features are 128-dimensional;
S11-02: aggregating the SIFT features of all scene pictures to obtain a local-descriptor set; clustering the local-descriptor set with a clustering algorithm to merge similar SIFT features and construct a visual dictionary;
in this step, the clustering algorithm is the K-Means algorithm;
further, the K value in the K-Means algorithm is 80, so that the clustering algorithm divides the SIFT features in the local-descriptor set into 80 clusters;
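For illustration, S11-02 can be sketched with a plain Lloyd's-iteration K-Means; toy 2-D descriptors and K=2 stand in for the 128-dimensional SIFT descriptors and K=80 of the text:

```python
# Minimal K-Means (Lloyd's iterations), as used in S11-02 to merge similar
# SIFT descriptors into a visual dictionary. Toy 2-D points and k=2 stand
# in for 128-dimensional SIFT features and K=80.
def kmeans(points, k, iters=20):
    centers = [list(p) for p in points[:k]]          # simple initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                             # assignment step
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        for j, cl in enumerate(clusters):            # update step
            if cl:
                centers[j] = [sum(x) / len(cl) for x in zip(*cl)]
    return centers

# two obvious groups of toy "descriptors"
descriptors = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers = kmeans(descriptors, k=2)
```

Each resulting centre plays the role of one visual word in the dictionary.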
S11-03: selecting a case from the case library 2 as the reference case, and taking the cases in the case library 2 other than the reference case as comparison cases; taking the scene picture of the reference case as the fixed picture and the scene picture of a comparison case as the comparison picture; selecting the comparison picture of a comparison case, and selecting, for the fixed picture and the comparison picture respectively, the several SIFT features with the highest return values in the visual dictionary;
preferably, the 80 non-repeating SIFT features with the highest return values are obtained from the fixed picture and from the comparison picture according to the visual dictionary;
S11-04: forming the selected SIFT features of the fixed picture and of the comparison picture into a fixed vector and a comparison vector, respectively;
S11-05: calculating the cosine similarity between the fixed vector and the comparison vector to obtain a picture similarity value;
S11-06: if the picture similarity value exceeds the picture similarity threshold, judging that the fixed picture and the comparison picture are associated, and hence that the reference case and the comparison case are associated;
if the picture similarity value does not exceed the picture similarity threshold, judging that the fixed picture and the comparison picture are not associated, and hence that the reference case and the comparison case are not associated.
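Steps S11-04 to S11-06 reduce to a cosine similarity between two bag-of-visual-words vectors. A sketch with invented count vectors and an assumed threshold of 0.8:

```python
# Sketch of S11-04..S11-06: compare two pictures' visual-word count
# vectors by cosine similarity. The count vectors and the threshold are
# invented for illustration.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

PICTURE_SIM_THRESHOLD = 0.8  # assumed value

fixed_vec = [3, 0, 2, 5]     # visual-word counts of the fixed picture
compare_vec = [2, 1, 2, 4]   # visual-word counts of the comparison picture
associated = cosine(fixed_vec, compare_vec) > PICTURE_SIM_THRESHOLD
```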
The step S12 includes the steps of:
S12-01: judging, according to steps S11-03 to S11-06, whether each comparison case in the case library 2 is associated with the reference case;
preferably, the step S11 further includes the steps of:
S11-11: extracting all report texts in the case library 2 and training a word2vec model on them to obtain word vectors;
S11-12: selecting the reference case of S11-03 from the case library 2 and extracting its report text as the reference text; selecting the comparison case of S11-03 and extracting its report text as the comparison text; calculating the word-frequency vectors of the reference text and the comparison text according to the word vectors, to obtain a reference word-frequency vector and a comparison word-frequency vector;
S11-13: calculating the cosine similarity between the reference word-frequency vector and the comparison word-frequency vector to obtain a text similarity value;
S11-14: if the text similarity value is larger than the text similarity threshold, judging that the reference text and the comparison text are textually associated, and hence that the reference case and the comparison case are associated;
if the text similarity value is not larger than the text similarity threshold, judging that the reference text and the comparison text are not textually associated, and hence that the reference case and the comparison case are not associated;
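Steps S11-12 to S11-14 compare word-frequency vectors of the two report texts by cosine similarity. A sketch over a shared vocabulary (the word2vec training itself is omitted; the tokens and the 0.7 threshold are invented for illustration):

```python
# Sketch of S11-12..S11-14: word-frequency vectors over a shared
# vocabulary, compared by cosine similarity against an assumed threshold.
import math
from collections import Counter

def freq_vector(tokens, vocab):
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

TEXT_SIM_THRESHOLD = 0.7  # assumed value

ref_tokens = ["rear", "end", "collision", "at", "night"]
cmp_tokens = ["rear", "end", "collision", "at", "daytime"]
vocab = sorted(set(ref_tokens) | set(cmp_tokens))
sim = cosine(freq_vector(ref_tokens, vocab), freq_vector(cmp_tokens, vocab))
text_associated = sim > TEXT_SIM_THRESHOLD
```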
the step S12 further includes the steps of:
S12-11: judging, according to steps S11-12 to S11-14, whether each comparison case in the case library 2 is associated with the reference case;
preferably, the step S11 further includes the steps of:
S11-21: acquiring the reference case of S11-03 from the case library 2 and extracting its field structure information as the reference field information; selecting the comparison case of S11-03 and extracting its field structure information as the comparison field information;
S11-22: comparing the factor fields of the reference field information and the comparison field information;
in this step, the factor fields of the field structure information comprise the reporter information, the involved license plate number and the involved frame (VIN) number; an example of field structure information follows:
reporter information: Zhang San, 13900000000
involved license plate number: Hu X00000
involved frame number: WAUR1111111111111
comparing the contents of the corresponding reporter-information, license-plate-number and frame-number fields of the reference field information and the comparison field information;
optionally, the reporter information may be the reporter's phone number, the reporter's name, or the combination of the two.
S11-23: if the reference field information and the comparison field information have a corresponding factor field with consistent content, it is judged that the reference case and the comparison case have an association relation;
if the reference field information and the comparison field information have no corresponding factor field with consistent content, it is judged that the reference case and the comparison case have no association relation.
The step S12 further includes the steps of:
S12-21: judging, according to steps S11-21 to S11-23, whether each comparison case in the case library 2 has an association relation with the reference case.
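The factor-field comparison of steps S11-21 to S11-23 can be sketched as follows. The dictionary key names are hypothetical stand-ins for the reporter information, the case-related license plate number and the case-related frame number:

```python
# Hypothetical factor fields of a reference case's field structure information.
reference_fields = {
    "reporter_name": "Zhang San",
    "reporter_phone": "13900000000",
    "license_plate": "Hux00000",
    "frame_number": "WAUR1111111111111",
}

def fields_related(ref, cmp_fields):
    """Two cases are associated if any corresponding factor field matches exactly."""
    return any(ref.get(key) and ref.get(key) == cmp_fields.get(key)
               for key in ref)

# A comparison case sharing only the license plate is still associated.
same_plate_case = {"reporter_name": "Li Si", "reporter_phone": "13811111111",
                   "license_plate": "Hux00000", "frame_number": "WAUR2222222222222"}
# A comparison case sharing no factor field is not associated.
unrelated_case = {"reporter_name": "Wang Wu", "reporter_phone": "13822222222",
                  "license_plate": "JingA12345", "frame_number": "WAUR3333333333333"}
```

A single matching factor field suffices, matching the "if any corresponding factor field is consistent" rule of step S11-23.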
Specifically, in the step S2, all cases in the case library 2 are used in turn as the reference case, and according to the step S11, the association relation between each reference case and each comparison case is obtained in turn, yielding the one-dimensional vector of each reference case; the one-dimensional vectors of the reference cases are then combined to obtain an adjacency matrix.
In this step, the cases in the case library 2 have case numbers; the cases are set as the reference case in order of case number, and the one-dimensional vector of each case is obtained in turn through step S1.
Optionally, a tag stack is provided: the cases in the case library 2 have case labels, and all the case labels are stored in the tag stack; a case label is extracted at random from the tag stack, the case corresponding to that label is taken from the case library 2 as the reference case, the other cases are set as comparison cases, the one-dimensional vector of the reference case is obtained by the method of S1, and the case label of the reference case is removed from the tag stack; this operation is repeated until the tag stack is empty, yielding the one-dimensional vectors of all cases.
In mathematics, a graph in graph theory is defined as a structure composed of a number of given points and the connecting lines between them, and is generally used to describe an inherent or specific relationship between things. In this embodiment, the cases serve as the nodes of the graph, the connecting lines between nodes are the association relations between pairs of cases, and the association relations among the cases in the case library 2 are expressed with the adjacency-matrix concept of graph theory.
In graph theory and computer science, the adjacency matrix is a structure for representing a graph. Its elements take only the two values 1 and 0, indicating whether the corresponding nodes of the graph are associated; the main diagonal is all 0, and for a simple undirected graph the adjacency matrix is also symmetric. If two cases have an association relation, the corresponding element of the adjacency matrix is assigned 1; if they have no association relation, it is assigned 0.
Accordingly, the row of a one-dimensional vector expresses the reference case, its columns express the comparison cases, and its element values express the association relation between the reference case and each comparison case; by arranging the one-dimensional vectors of all cases in sequence along the row direction, the adjacency matrix is obtained.
The rows of the adjacency matrix express the reference cases, the columns express the comparison cases, and the element values express the association relation between the reference case of the row and the comparison case of the column.
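The construction of the adjacency matrix from pairwise association judgments can be sketched as follows. The `related` test here is a placeholder for any of the association rules described above (text similarity, matching factor fields):

```python
import numpy as np

def build_adjacency(cases, related):
    """Build the 0/1 adjacency matrix of a case library.

    `related(a, b)` is any pairwise association test. Row i is the
    one-dimensional vector of reference case i over all comparison
    cases; the main diagonal stays 0, and for a symmetric relation
    the matrix is symmetric, as for a simple undirected graph.
    """
    n = len(cases)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j and related(cases[i], cases[j]):
                adj[i, j] = 1
    return adj

# Toy rule (assumed): cases sharing a license plate are associated.
cases = [{"plate": "A"}, {"plate": "A"}, {"plate": "B"}]
adj = build_adjacency(cases, lambda a, b: a["plate"] == b["plate"])
```

For the three toy cases above, only the first two share a plate, so only the (0, 1) and (1, 0) entries are 1, the diagonal is all 0, and the matrix is symmetric.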
Specifically, the step S3 includes the following steps:
S31: the SDNE (Structural Deep Network Embedding) algorithm has an embedding-layer dimension; the adjacency matrix is input into the input layer of the SDNE algorithm;
the embedding-layer dimension of the SDNE algorithm can be set as required, for example to 100;
S32: controlling the SDNE algorithm to learn the first-order neighbor relations in the adjacency matrix in a supervised manner, and to learn the second-order neighbor relations in the adjacency matrix with the unsupervised deep-learning technique AutoEncoder; the SDNE algorithm is optimized by combining the loss functions of the two learning processes, and finally the embedding layer (Embedding) of the SDNE algorithm is extracted as the dense vector of all cases in the case library 2;
in this step, the first-order and second-order neighbor relations are fused in one learning process; together, the first-order and second-order neighbor relations capture both the local and the global characteristics of the network well.
The first-order neighbor relation refers to the proximity of local point pairs, i.e. of two directly connected vertices, while the second-order neighbor relation refers to the similarity between the neighborhood network structures of a pair of vertices. The first-order neighbor relation serves as supervision information to preserve the local structure of the network through supervised learning; the supervised part of the SDNE framework consists of multiple nonlinear mapping functions that map the adjacency matrix into a highly nonlinear hidden space to capture the network structure. The SDNE algorithm thus learns a first-order loss function L_1st from the first-order neighbor relations in the adjacency matrix in a supervised manner. The second-order neighbor relation refers to the neighborhood similarity of nodes, so the second-order similarity requires the model to preserve the properties of each node's neighborhood. The part of the SDNE model learned with the unsupervised deep-learning technique comprises an automatic encoder and a decoder: the encoder consists of multiple nonlinear functions that map the adjacency matrix to the representation space, and the decoder correspondingly consists of multiple nonlinear functions that map the representation space back to the reconstruction space of the adjacency matrix. The SDNE algorithm thus learns a second-order loss function L_2nd from the second-order neighbor relations in the adjacency matrix using the unsupervised autoencoder.
The loss functions of the two learning processes are combined into a joint optimization loss function, and minimizing this joint loss function achieves the goal of optimizing the SDNE algorithm; the joint optimization loss function is:
L_mix = L_2nd + α·L_1st + ν·L_reg
where L_reg is a regularization term, α is a parameter controlling the first-order loss, and ν is a parameter controlling the regularization term;
and extracting an optimized embedded layer (Embedding) in the SDNE algorithm to be used as a dense vector of all cases in the case library 2.
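The assembly of the joint loss can be sketched as follows. The values of α, ν and the β reweighting of nonzero adjacency entries (the latter follows the original SDNE formulation and is an assumption here, as the text does not name it) are illustrative; a real implementation would minimize this quantity by backpropagation through the autoencoder:

```python
import numpy as np

def sdne_joint_loss(adj, emb, recon, weights, alpha=0.1, nu=1e-4, beta=5.0):
    """Joint loss L_mix = L_2nd + alpha * L_1st + nu * L_reg.

    adj     : (n, n) adjacency matrix, the autoencoder's input.
    emb     : (n, d) embedding-layer activations (one row per case).
    recon   : (n, n) autoencoder reconstruction of adj.
    weights : list of the network's weight matrices, for the L2 term.
    beta    : assumed penalty (> 1) on mis-reconstructing nonzero entries.
    """
    # First-order loss: associated cases should have nearby embeddings.
    diff = emb[:, None, :] - emb[None, :, :]
    l_1st = (adj * (diff ** 2).sum(axis=2)).sum()
    # Second-order loss: reconstruction error over neighborhoods,
    # penalizing missed edges more heavily than missed zeros.
    b = np.where(adj != 0, beta, 1.0)
    l_2nd = (((recon - adj) * b) ** 2).sum()
    # Regularization over the model weights.
    l_reg = sum((w ** 2).sum() for w in weights)
    return l_2nd + alpha * l_1st + nu * l_reg

adj = np.array([[0.0, 1.0], [1.0, 0.0]])
w = [np.ones((2, 2))]
# Perfect reconstruction with identical embeddings: only the L_reg term remains.
perfect = sdne_joint_loss(adj, np.zeros((2, 2)), adj.copy(), w)
# Separated embeddings and a zero reconstruction incur both L_1st and L_2nd.
worse = sdne_joint_loss(adj, np.array([[0.0, 0.0], [1.0, 0.0]]), np.zeros((2, 2)), w)
```

The two sample evaluations show how the first-order and second-order terms each contribute: the "worse" configuration is penalized both for pulling an associated pair apart and for failing to reconstruct its edges.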
For example, if the case library 2 contains 100,000 cases, a 100,000 × 100,000 adjacency matrix is generated; the adjacency matrix is input into the SDNE algorithm for learning, the SDNE algorithm is optimized by combining the supervised and unsupervised learning modes, and finally the embedding layer of the SDNE algorithm is extracted as the dense vector, which is a 100,000 × 100 matrix.
Specifically, the step S4 includes the following steps:
S41: inputting the dense vector into an AP (Affinity Propagation) clustering algorithm to obtain the Euclidean distance between any two cases in the case library 2 and form a similarity matrix;
the AP clustering algorithm is a common clustering algorithm; unlike K-means and similar algorithms, it does not require the number of clusters to be determined in advance, and the cluster centers found by the AP algorithm are points that actually exist in the data;
in this step, the dense vector is input into the AP clustering algorithm, which computes the Euclidean distance between every pair of cases from the dense vector and uses it as the similarity s(i, j) between any two cases, i.e. the similarity between the i-th case and the j-th case is expressed through the distance between them;
in this way, the similarities among all cases in the case library 2 are calculated from the dense vector and collected into the similarity matrix.
S42: clustering the cases in the dense vector according to the similarity matrix, sequentially obtaining the degree that each case in the dense vector is used as the clustering center of other cases through an attractive information matrix of an AP clustering algorithm, and sequentially obtaining the degree that each case in the dense vector selects other cases as the clustering center through an attribution information matrix of the AP clustering algorithm; ending and outputting a clustering result until the iteration times are reached or samples in each clustering area in the dense vector are kept unchanged;
In this step, when s (i, j) > s (i, k), it means that the similarity between sample i and sample j is greater than the similarity between sample i and sample k;
the AP clustering algorithm has an attraction (responsibility) information matrix R, where the attraction information r(i, j) in the matrix R describes how well case j is suited to serve as the clustering center of case i and represents a message sent from i to j;
the AP clustering algorithm has an attribution (availability) information matrix A, where the attribution information a(i, j) in the matrix A describes how appropriate it is for case i to select case j as its clustering center and represents a message sent from j to i.
Further, the step S42 includes the following steps:
S42-1: setting the number of iterations of the AP clustering algorithm to T, and initializing r(i, j) and a(i, j) to 0;
S42-2: iterating the attraction information r(i, j) in the AP clustering algorithm according to the following formula:
r(i, j) = s(i, j) - max(j' ≠ j){a(i, j') + s(i, j')};
that is, the fitness of j as the clustering center of i is the similarity s(i, j) minus the maximum, over all other candidates j', of the sum a(i, j') + s(i, j');
in this step, the similarity matrix records s(i, j), the suitability of j as the clustering center of i; to show that j is more suitable than any other case, every other candidate j' must be considered, where s(i, j') represents the suitability of case j' as the clustering center of case i;
a(i, j') is then defined to represent the degree to which i is attributed to case j';
adding the two values, a(i, j') + s(i, j') measures the overall suitability of case j' as the clustering center of case i;
the maximum of a(i, j') + s(i, j') is found among all other candidates j', namely max(j' ≠ j){a(i, j') + s(i, j')}, and the attraction of j to i is obtained as r(i, j) = s(i, j) - max(j' ≠ j){a(i, j') + s(i, j')};
S42-3: iterating the attribution information a(i, j) in the AP clustering algorithm according to the following formulas and condition;
when i ≠ j: a(i, j) = min(0, r(j, j) + Σ(i' ∉ {i, j}) max{0, r(i', j)});
when i = j: a(i, i) = Σ(i' ≠ i) max{0, r(i', i)};
the cluster center of each case i is then determined by j = argmax{a(i, j) + r(i, j)}: if j = i, case i is a cluster center; otherwise, case i belongs to the cluster whose center is j;
in this step, the attraction r(i', j) of case j to the other cases is calculated and accumulated as Σ max{0, r(i', j)} to represent the overall attraction of case j, to which r(j, j) is added;
as the attraction formula shows, r(j, j) reflects the degree to which case j is unsuited to being assigned to another clustering center, while a(j, j) mainly reflects the ability of j to act as a cluster center.
S42-4: iterating the AP clustering algorithm according to the steps S42-2 and S42-3 until the iteration times reach T or samples in each clustering area are unchanged, ending the AP clustering algorithm and outputting a clustering result;
in this step, the AP clustering algorithm finally outputs a clustering result having a plurality of clusters, and cases corresponding to nodes in each cluster are regarded as the same class.
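Steps S42-1 to S42-4 can be sketched as a minimal affinity-propagation loop. The damping factor, the iteration count and the median preference are assumed defaults (production implementations, such as scikit-learn's, additionally add tiny noise to break ties and check convergence explicitly):

```python
import numpy as np

def affinity_propagation(X, max_iter=200, damping=0.5):
    """Minimal AP clustering sketch over dense case vectors X of shape (n, d).

    Similarity s(i, j) is the negative squared Euclidean distance; the
    diagonal (the "preference") is set to the median off-diagonal
    similarity, a common default.  Returns the exemplar (cluster center)
    index chosen by each case.
    """
    n = len(X)
    rows = np.arange(n)
    s = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    s[rows, rows] = np.median(s[~np.eye(n, dtype=bool)])

    r = np.zeros((n, n))  # attraction (responsibility): message i -> j
    a = np.zeros((n, n))  # attribution (availability):  message j -> i
    for _ in range(max_iter):
        # r(i, j) = s(i, j) - max over j' != j of {a(i, j') + s(i, j')}
        as_sum = a + s
        idx = np.argmax(as_sum, axis=1)
        first = as_sum[rows, idx].copy()
        as_sum[rows, idx] = -np.inf
        second = as_sum.max(axis=1)
        max_other = np.repeat(first[:, None], n, axis=1)
        max_other[rows, idx] = second
        r = damping * r + (1 - damping) * (s - max_other)
        # a(i, j) = min(0, r(j, j) + sum over i' not in {i, j} of max(0, r(i', j)))
        # a(j, j) = sum over i' != j of max(0, r(i', j))
        rp = np.maximum(r, 0.0)
        rp[rows, rows] = r[rows, rows]
        col = rp.sum(axis=0)
        a_new = np.minimum(0.0, col[None, :] - rp)
        a_new[rows, rows] = col - r[rows, rows]
        a = damping * a + (1 - damping) * a_new
    # Each case's cluster center: j = argmax{a(i, j) + r(i, j)}
    return np.argmax(a + r, axis=1)

# Two well-separated groups of toy "dense vectors".
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.2], [5.3, 5.1]])
labels = affinity_propagation(X)
```

On the toy data, the messages settle into two clusters with one exemplar each; the number of clusters is never specified in advance, as the description notes.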
Specifically, the step S5 includes the following steps:
S51: acquiring, from the clustering result, the clusters whose number of cases is below a clustering threshold, and extracting the policy information, automobile features and case reporter features of the cases in those clusters;
preferably, the cluster threshold is adjustable as desired.
Alternatively, the clustering threshold may be 100.
In this step, the policy information includes an effective policy and a failure policy;
the automobile features include first-hand vehicles, second-hand vehicles and porcelain-touching (staged-accident) vehicles;
the characteristics of the case report person comprise normal case report persons and blacklist case report persons;
further, the blacklist is used for storing suspected fraudulent case report information, and if a certain case is determined to be a fraudulent case through investigation, the case report information of the case is recorded into the blacklist;
the blacklist report is report information in the blacklist, and the normal report is report information which is not recorded in the blacklist.
S52: acquiring the number A of lapsed (invalid) policies in the cluster according to the policy information;
S53: acquiring the number B of second-hand vehicles and the number C of porcelain-touching vehicles among the cases in the cluster according to the automobile features;
S54: acquiring the number D of blacklisted reporters in the cluster according to the reporter features;
S55: calculating a risk value Y from the number A of lapsed policies, the number B of second-hand vehicles, the number C of porcelain-touching vehicles and the number D of blacklisted reporters according to a weighted summation formula;
in this step, the weighted summation formula is:
Y = mA + nB + pC + qD, where m, n, p and q are each natural numbers;
and m, n, p, q can be adjusted as required;
S56: if the risk value Y does not exceed the risk threshold, the cases in the cluster are judged to be low-risk cases;
if the risk value Y exceeds the risk threshold, the cases in the cluster are judged to be high-risk cases;
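The weighted scoring of steps S55 and S56 can be sketched as follows. The weight values and the risk threshold are assumptions for illustration; the method only requires adjustable natural numbers:

```python
def risk_value(a, b, c, d, m=1, n=2, p=3, q=4):
    """Weighted risk score Y = mA + nB + pC + qD.

    a: lapsed policies, b: second-hand vehicles, c: porcelain-touching
    vehicles, d: blacklisted reporters.  The weights m, n, p, q are
    assumed values, adjustable as required.
    """
    return m * a + n * b + p * c + q * d

RISK_THRESHOLD = 20  # assumed threshold

def classify_cluster(a, b, c, d):
    """Label a cluster high-risk when Y exceeds the risk threshold."""
    return "high-risk" if risk_value(a, b, c, d) > RISK_THRESHOLD else "low-risk"
```

For example, a cluster with 1 lapsed policy, 2 second-hand vehicles, 3 porcelain-touching vehicles and 4 blacklisted reporters scores Y = 1 + 4 + 9 + 16 = 30 under the assumed weights and would be flagged high-risk.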
s57: generating a high-risk signal according to the high-risk case, and outputting the high-risk signal to a client in a communication signal form; or (b)
Generating a low-risk signal according to a low-risk case, and outputting the low-risk signal to a client in the form of a communication signal;
in the step, the high-risk case is converted into an information source and is input to a transmitting device, the transmitting device converts the information source into an analog signal or a digital signal as a communication signal, and the communication signal is output to a client through a channel;
Converting the low-risk case into an information source and inputting the information source into a sending device, converting the information source into an analog signal or a digital signal by the sending device as a communication signal, and outputting the communication signal to a client through a channel;
the information source is an electric signal, so that the data information of the high-risk case or the data information of the low-risk case is converted into the electric signal and is input to the transmitting equipment;
the transmitting device can be an analog communication system or a digital communication system; the analog communication system converts the electric signal into an analog signal through a modulator and outputs it to the client through a channel; the digital communication system performs compression coding, encryption coding, channel coding and digital modulation on the electric signal, converting it into a digital signal that is output to the client through a channel;
the channel is the physical medium that carries the communication signal from the transmitting device to the client, and is divided into wired channels and wireless channels; for example, a mobile communication channel (such as 2G, 3G, 4G or WiMAX) or a wired communication channel (such as a Digital Subscriber Line (DSL) or Power Line Communication (PLC)) may be used.
Example two
Referring to fig. 3, a data cluster recognition device 1 of the present embodiment includes:
the one-dimensional vector formulation module 11 is configured to set one case in the case library 2 as a reference case, and set other cases in the case library 2 except the reference case as comparison cases; extracting a reference case and a comparison case from the case library 2, and sequentially judging whether the reference case and each comparison case have an association relationship or not; if yes, a relation value between the reference case and the comparison case is assigned 1, and if not, a relation value between the reference case and the comparison case is assigned 0; making a one-dimensional vector of the reference case according to the relation value between the reference case and each comparison case; the data information of the case comprises a field picture, case reporting text information, field structure information, policy information, automobile characteristics and case reporting person characteristics;
the adjacency matrix formulation module 12 is used for calling the one-dimensional vector formulation module 11 to sequentially set the cases in the case library 2 as reference cases, sequentially obtain one-dimensional vectors of all the reference cases, and combine the one-dimensional vectors of all the cases in the case library 2 to obtain an adjacency matrix;
The vector operation module 13 is configured to calculate the adjacency matrix by using an SDNE algorithm to obtain dense vectors of all cases in the case library 2;
a clustering operation module 14, configured to calculate the dense vector by using an AP clustering algorithm, so as to cluster all cases in the case library 2, and output a clustering result;
the risk assessment module 15 is configured to calculate a risk value of a case according to policy information, automobile characteristics and case reporting person characteristics of the case in the clustering result, and determine whether the case in the cluster of the clustering result is a high-risk case or a low-risk case according to the risk value; generating a high-risk signal according to the high-risk case and outputting the high-risk signal to the client 4, or generating a low-risk signal according to the low-risk case and outputting the low-risk signal to the client 4; wherein the high risk signal and the low risk signal are to be output to the client 4 in the form of communication signals, respectively.
Based on artificial intelligence, this technical scheme uses intelligent decision technology to establish a classification model through the one-dimensional vector formulation module, the adjacency matrix formulation module, the vector operation module and the clustering operation module, clustering the dense vectors with a clustering algorithm to obtain a clustering result that groups all cases in the case library; the risk assessment module then distinguishes high-risk cases from low-risk cases, achieving the identification of high-risk cases.
Embodiment III:
in order to achieve the above objective, the present invention further provides a computer system comprising a plurality of computer devices 3, and the components of the data cluster recognition device 1 of the second embodiment may be distributed over different computer devices. A computer device may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers) that executes the program. The computer device of this embodiment includes at least, but is not limited to, a memory 31 and a processor 32, which may be communicatively coupled to each other via a system bus, as shown in fig. 4. It should be noted that fig. 4 shows only a computer device with these components; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
In the present embodiment, the memory 31 (i.e., readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 31 may be an internal storage unit of a computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory 31 may also be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like. Of course, the memory 31 may also include both internal storage units of the computer device and external storage devices. In this embodiment, the memory 31 is generally used to store an operating system installed in a computer device and various application software, such as program codes of the data cluster recognition device of the first embodiment. Further, the memory 31 may be used to temporarily store various types of data that have been output or are to be output.
Processor 32 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 32 is typically used to control the overall operation of the computer device. In this embodiment, the processor 32 is configured to execute the program code stored in the memory 31 or process data, for example, execute the data cluster recognition device, so as to implement the data cluster recognition method of the first embodiment.
Embodiment four:
to achieve the above object, the present invention also provides a computer-readable storage system including a plurality of storage media such as flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random Access Memory (RAM), static Random Access Memory (SRAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), programmable Read Only Memory (PROM), magnetic memory, magnetic disk, optical disk, server, app application store, etc., on which a computer program is stored that when executed by the processor 32 performs the corresponding functions. The computer readable storage medium of the present embodiment is used for storing the data cluster recognition device, and when executed by the processor 32, implements the data cluster recognition method of the first embodiment.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. The data clustering recognition method is characterized by comprising the following steps of:
s1, setting one case in a case library as a reference case, and setting other cases except the reference case in the case library as comparison cases; extracting a reference case and a comparison case from the case library, and sequentially judging whether the reference case and each comparison case have an association relation or not; if yes, a relation value between the reference case and the comparison case is assigned 1, and if not, a relation value between the reference case and the comparison case is assigned 0; making a one-dimensional vector of the reference case according to the relation value between the reference case and each comparison case; the data information of the case comprises a field picture, case reporting text information, field structure information, policy information, automobile characteristics and case reporting person characteristics;
S2: sequentially setting the cases in the case library as reference cases according to the method of S1, sequentially obtaining one-dimensional vectors of all the reference cases, and combining the one-dimensional vectors of all the cases in the case library to obtain an adjacent matrix;
s3: calculating the adjacency matrix by using an SDNE algorithm to obtain a dense vector for expressing the association relation among all cases in the case library;
s4: calculating the dense vector by using an AP clustering algorithm to cluster all cases in the case library, and outputting a clustering result;
s5: calculating a risk value of the case according to the policy information, the automobile characteristics and the case reporting person characteristics of the case in the clustering result, and judging whether the case in the cluster of the clustering result is a high-risk case or a low-risk case according to the risk value; generating a high-risk signal according to the high-risk case and outputting the high-risk signal to the client, or generating a low-risk signal according to the low-risk case and outputting the low-risk signal to the client;
wherein the high risk signal and the low risk signal are to be output to the client in the form of communication signals, respectively.
2. The data cluster recognition method according to claim 1, wherein the step S1 includes the steps of:
S11: extracting one case in a case library as a reference case, and setting cases except the reference case in the case library as comparison cases; judging the association relation between the reference case and the comparison case;
s12: sequentially judging the association relations between the reference cases and all the comparison cases in the case library according to the method of S11;
s13: if the reference case and the comparison case have an association relationship, the relationship value between the reference case and the comparison case is assigned as 1; if the reference case and the comparison case do not have an association relationship, the relationship value between the reference case and the comparison case is assigned to be 0;
s14: the relation value between the reference case and each comparison case is formed into a one-dimensional vector; the element values of each column in the one-dimensional vector are the relation values between the reference case and each comparison case.
3. The data clustering recognition method according to claim 1, wherein in the step S2, all cases in the case library are used in turn as the reference case, and according to the method of S1, the association relation between each reference case and each comparison case is obtained in turn, yielding the one-dimensional vector of each reference case; and the one-dimensional vectors of the reference cases are merged to obtain an adjacency matrix.
4. The data cluster recognition method according to claim 1, wherein the step S3 includes the steps of:
s31: the SDNE algorithm has an embedded layer dimension, and an adjacency matrix is input into an input layer of the SDNE algorithm;
s32: controlling the SDNE algorithm to learn a first-order neighbor relation in the adjacent matrix in a supervised mode, and learning a second-order neighbor relation in the adjacent matrix in an unsupervised deep learning mode; and optimizing the SDNE algorithm by combining the loss functions of the two learning processes, and finally extracting an embedded layer in the SDNE algorithm as a dense vector of all cases in the case library.
5. The data cluster recognition method according to claim 1, wherein the step S4 includes the steps of:
s41: inputting the dense vector into an AP clustering algorithm to obtain Euclidean distance between any cases in the case library and obtain a similarity matrix;
s42: clustering the cases in the dense vector according to the similarity matrix, sequentially obtaining the degree that each case in the dense vector is used as the clustering center of other cases through an attractive information matrix of an AP clustering algorithm, and sequentially obtaining the degree that each case in the dense vector selects other cases as the clustering center through an attribution information matrix of the AP clustering algorithm; ending and outputting a clustering result until the iteration times are reached or samples in each clustering area in the dense vector are kept unchanged.
6. The data cluster recognition method according to claim 5, wherein the step S42 includes the steps of:
s42-1: setting the iteration times of the AP clustering algorithm as T, and initializing r (i, j) and a (i, j) to be 0;
wherein r (i, j) is the attraction information in the attraction information matrix, and is used for describing the degree of the case j in the dense vector, which is suitable as the clustering center of the case i;
a (i, j) is the attribution information in the attribution information matrix, and is used for selecting a case j as the suitability of the clustering center according to the case i in the dense vector;
s42-2, iterating an attraction information matrix r (i, j) in the AP clustering algorithm according to the following formula;
r(i,j)=s(i,j)-max(j'≠j){a(i,j')+s(i,j')};
wherein s(i, j) is the Euclidean distance between case i and case j in the dense vector, i.e. the similarity between case i and case j;
the formula describes the fitness of j as the clustering center of i as: the similarity s(i, j) minus the maximum, over all other candidates j', of the sum a(i, j') + s(i, j');
s42-3: iterating the attribution information a (i, j) in the AP clustering algorithm according to the following formula and condition;
when i ≠ j: a(i, j) = min(0, r(j, j) + Σ(i' ∉ {i, j}) max{0, r(i', j)});
when i = j: a(i, i) = Σ(i' ≠ i) max{0, r(i', i)};
wherein the formulas describe that if j = argmax{a(i, j) + r(i, j)} equals i, then case i is a cluster center; otherwise, case i belongs to the cluster whose center is j;
s42-4: and (3) iterating the AP clustering algorithm according to the steps S42-2 and S42-3 until the iteration times reach T or samples in each clustering area are unchanged, ending the AP clustering algorithm and outputting a clustering result.
7. The data cluster recognition method according to claim 1, wherein the step S5 includes the steps of:
S51: acquiring the clusters in the clustering result whose number of cases is below a clustering threshold; and extracting the policy information, automobile characteristics and case reporter characteristics of the cases in those clusters;
S52: acquiring the number A of invalid policies in the cluster according to the policy information;
S53: acquiring, according to the automobile characteristics, the number B of second-hand vehicles and the number C of staged-accident ("porcelain-touching") vehicles among the cases in the cluster;
S54: acquiring the number D of blacklisted case reporters in the cluster according to the case reporter characteristics;
S55: calculating a risk value Y from the number A of invalid policies, the number B of second-hand vehicles, the number C of staged-accident vehicles and the number D of blacklisted case reporters according to a weighted summation formula;
S56: if the risk value Y does not exceed a risk threshold, judging that the cases in the cluster are low-risk cases;
if the risk value Y exceeds the risk threshold, judging that the cases in the cluster are high-risk cases;
S57: generating a high-risk signal according to the high-risk cases and outputting the high-risk signal to a client in the form of a communication signal; or
generating a low-risk signal according to the low-risk cases and outputting the low-risk signal to the client in the form of a communication signal.
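The risk evaluation of steps S55 through S57 can be sketched as below. The weights w1..w4 and the risk threshold are illustrative assumptions, since the claim requires only a weighted summation of A, B, C and D and a threshold comparison, without fixing concrete values.

```python
# Illustrative weights and threshold: assumptions of this sketch, not values
# specified by the claim.
RISK_WEIGHTS = (0.3, 0.2, 0.3, 0.2)
RISK_THRESHOLD = 2.0

def cluster_risk_value(a_invalid, b_second_hand, c_staged, d_blacklisted,
                       weights=RISK_WEIGHTS):
    """S55: risk value Y = w1*A + w2*B + w3*C + w4*D."""
    w1, w2, w3, w4 = weights
    return (w1 * a_invalid + w2 * b_second_hand
            + w3 * c_staged + w4 * d_blacklisted)

def classify_cluster(y, threshold=RISK_THRESHOLD):
    """S56/S57: high-risk if Y exceeds the threshold, low-risk otherwise."""
    return "high-risk" if y > threshold else "low-risk"
```

A cluster with, say, ten invalid policies scores Y = 3.0 under these weights and would be flagged high-risk, triggering the high-risk signal of step S57.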
8. A data cluster recognition device, characterized by comprising:
a one-dimensional vector preparation module, used for setting one case in a case library as a reference case and setting the other cases in the case library as comparison cases; extracting the reference case and the comparison cases from the case library, and sequentially judging whether the reference case has an association relationship with each comparison case; if so, assigning the relation value between the reference case and the comparison case as 1, and if not, assigning it as 0; and preparing a one-dimensional vector of the reference case according to the relation values between the reference case and each comparison case; wherein the data information of a case comprises a scene picture, case-report text information, scene structure information, policy information, automobile characteristics and case reporter characteristics;
an adjacency matrix preparation module, used for calling the one-dimensional vector preparation module to sequentially set each case in the case library as the reference case, sequentially obtain the one-dimensional vectors of all the reference cases, and combine the one-dimensional vectors of all the cases in the case library into an adjacency matrix;
a vector operation module, used for calculating the adjacency matrix with the SDNE algorithm to obtain dense vectors of all the cases in the case library;
a clustering operation module, used for calculating the dense vectors with the AP clustering algorithm to cluster all the cases in the case library and output a clustering result;
a risk evaluation module, used for calculating the risk values of the cases according to the policy information, automobile characteristics and case reporter characteristics of the cases in the clustering result, and judging, according to the risk values, whether the cases in a cluster of the clustering result are high-risk cases or low-risk cases; and generating a high-risk signal according to the high-risk cases and outputting it to a client, or generating a low-risk signal according to the low-risk cases and outputting it to the client; wherein the high-risk signal and the low-risk signal are each output to the client in the form of a communication signal.
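The one-dimensional vector and adjacency matrix preparation modules of the device claim can be sketched as follows. The `related` predicate is a hypothetical stand-in for the association-relationship check between two cases (a shared policy, vehicle or case reporter, for example), which the claim leaves unspecified.

```python
import numpy as np

def one_dim_vector(ref_idx, cases, related):
    """Relation values (1/0) of one reference case against each comparison case.

    `related` is a hypothetical predicate deciding whether two cases have an
    association relationship; a case is not compared with itself.
    """
    return np.array([1 if j != ref_idx and related(cases[ref_idx], cases[j]) else 0
                     for j in range(len(cases))])

def adjacency_matrix(cases, related):
    """Set each case as the reference case in turn and stack its one-dimensional
    vector, giving the adjacency matrix that the SDNE step then embeds."""
    return np.vstack([one_dim_vector(i, cases, related) for i in range(len(cases))])
```

For instance, with `related` testing for a shared case reporter, two cases reported by the same person receive a relation value of 1 in both directions, so the resulting adjacency matrix is symmetric with a zero diagonal.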
9. A computer system comprising a plurality of computer devices, each computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processors of the plurality of computer devices collectively implement the steps of the data cluster recognition method of any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium comprising a plurality of storage media, each storage medium having stored thereon a computer program, characterized in that the computer programs stored on the plurality of storage media when executed by a processor collectively implement the steps of the data cluster recognition method of any one of claims 1 to 7.
CN201910754337.4A 2019-08-15 2019-08-15 Data cluster recognition method, device, computer system and readable storage medium Active CN110659997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910754337.4A CN110659997B (en) 2019-08-15 2019-08-15 Data cluster recognition method, device, computer system and readable storage medium


Publications (2)

Publication Number Publication Date
CN110659997A CN110659997A (en) 2020-01-07
CN110659997B true CN110659997B (en) 2023-06-27

Family

ID=69037613


Country Status (1)

Country Link
CN (1) CN110659997B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709022B (en) * 2020-06-16 2022-08-19 桂林电子科技大学 Hybrid alarm association method based on AP clustering and causal relationship
CN113420561B (en) * 2021-07-14 2022-12-13 上海浦东发展银行股份有限公司 Named entity identification method, device, equipment and storage medium
CN117216478B (en) * 2023-09-12 2024-04-30 杭州融易算智能科技有限公司 Financial data batch processing method

Citations (4)

Publication number Priority date Publication date Assignee Title
WO2019061992A1 (en) * 2017-09-30 2019-04-04 平安科技(深圳)有限公司 Method for optimizing investigation grid, electronic device, and computer readable storage medium
CN109784636A (en) * 2018-12-13 2019-05-21 中国平安财产保险股份有限公司 Fraudulent user recognition methods, device, computer equipment and storage medium
CN109919781A (en) * 2019-01-24 2019-06-21 平安科技(深圳)有限公司 Case recognition methods, electronic device and computer readable storage medium are cheated by clique
CN109992578A (en) * 2019-01-07 2019-07-09 平安科技(深圳)有限公司 Anti- fraud method, apparatus, computer equipment and storage medium based on unsupervised learning




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant