CN111985837A - Risk analysis method, device and equipment based on hierarchical clustering and storage medium

Info

Publication number: CN111985837A
Application number: CN202010895439.0A
Authority: CN
Inventors: 郭建福; 张旭
Original assignee: Ping An Medical and Healthcare Management Co Ltd
Current assignee: Ping An Medical and Healthcare Management Co Ltd
Priority date: 2020-08-31
Filing date: 2020-08-31
Publication date: 2020-11-24

Abstract

The invention relates to the field of artificial intelligence, and discloses a risk analysis method, a device, equipment and a storage medium based on hierarchical clustering, which are applied to the field of intelligent medical treatment. The method comprises the following steps: acquiring initial data, wherein the initial data is used for indicating drug sales data of a plurality of hospitals, and the initial data is time sequence data; calculating correlation coefficients between any two different hospitals according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients; generating a distance matrix among a plurality of hospitals according to the plurality of target correlation coefficients; pruning and hierarchical clustering operations are carried out on the distance matrix to generate a clustering tree, wherein the clustering tree comprises a plurality of clusters; and performing risk analysis according to the clustering tree to obtain a risk analysis result.

Description

Risk analysis method, device and equipment based on hierarchical clustering and storage medium

Technical Field

The invention relates to the field of medical data, in particular to a risk analysis method, device, equipment and storage medium based on hierarchical clustering.

Background

Risk control refers to the measures and methods a risk manager takes to eliminate or reduce the likelihood that a risk event occurs, or to reduce the losses incurred when a risk event does occur. In fields such as e-commerce, credit card fraud prevention and medical insurance fund fraud prevention, risk control is a very important direction.

In existing schemes, candidate anomalies are generally found through anomaly recognition models such as correlation analysis and statistical analysis, but the data is often noisy and the results obtained are often unsatisfactory. Moreover, for high-dimensional data, such methods easily fall into the curse of dimensionality, which distorts the analysis result.

Disclosure of Invention

The invention provides a risk analysis method, device, equipment and storage medium based on hierarchical clustering, which are used for avoiding the curse of dimensionality when time series data are processed.

A first aspect of an embodiment of the present invention provides a risk analysis method based on hierarchical clustering, including: acquiring initial data, wherein the initial data is used for indicating drug sales data of a plurality of hospitals, and the initial data is time sequence data; calculating correlation coefficients between any two different hospitals according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients; generating a distance matrix among a plurality of hospitals according to the plurality of target correlation coefficients; pruning and hierarchical clustering operations are carried out on the distance matrix to generate a clustering tree, wherein the clustering tree comprises a plurality of clusters; and performing risk analysis according to the clustering tree to obtain a risk analysis result.

Optionally, in a first implementation manner of the first aspect of the embodiment of the present invention, the calculating a correlation coefficient between any two different hospitals according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients includes: respectively determining the drug sales amount Y_i of hospital i and the drug sales amount Y_j of hospital j; inputting the drug sales amount Y_i and the drug sales amount Y_j into a preset similarity formula to generate the correlation coefficient of hospital i and hospital j, wherein the preset similarity formula is

Wherein Y_i indicates the drug sales amount of hospital i, Y_j represents the drug sales amount of hospital j, i and j are positive integers, <> denotes the mean value, and p_ij is the correlation coefficient of hospital i and hospital j; calculating correlation coefficients between any other two hospitals to obtain a plurality of other correlation coefficients, wherein the other two hospitals are not hospital i and hospital j at the same time; generating a plurality of target correlation coefficients including the correlation coefficient of hospital i and hospital j and the plurality of other correlation coefficients.

Optionally, in a second implementation manner of the first aspect of the embodiment of the present invention, the generating a distance matrix between multiple hospitals according to the multiple target correlation coefficients includes: calculating initial distances between any two different hospitals according to the target correlation coefficients to obtain a plurality of initial distances; generating a distance matrix based on the plurality of initial distances, the distance matrix indicating a distance between any two hospitals.

Optionally, in a third implementation manner of the first aspect of the embodiment of the present invention, the calculating an initial distance between any two different hospitals according to the multiple target correlation coefficients to obtain multiple initial distances includes: calling a preset distance formula to calculate the distance corresponding to each target correlation coefficient to obtain a plurality of initial distances, wherein d (i, j) represents the distance between hospital i and hospital j, and the preset distance formula is as follows:

Optionally, in a fourth implementation manner of the first aspect of the embodiment of the present invention, the performing pruning and hierarchical clustering operations on the distance matrix to generate a clustering tree, where the clustering tree includes a plurality of clusters, includes: pruning the distance matrix to obtain a pruned distance matrix; and performing hierarchical clustering on the pruned distance matrix to generate a clustering tree.

Optionally, in a fifth implementation manner of the first aspect of the embodiment of the present invention, the pruning the distance matrix to obtain a pruned distance matrix includes: converting the distance matrix into an undirected graph; generating a minimum spanning tree by using a preset algorithm and the undirected graph; and pruning the distance matrix based on the minimum spanning tree to obtain the pruned distance matrix.

Optionally, in a sixth implementation manner of the first aspect of the embodiment of the present invention, the performing hierarchical clustering on the pruned distance matrix to generate a clustering tree includes: calling a preset matrix distance formula to calculate the distance of each data point in the pruned distance matrix to obtain a plurality of distances, wherein the preset matrix distance formula is

D represents the distance between any two data points; and performing hierarchical clustering on two nearest data points in the plurality of distances to obtain a plurality of data categories, wherein the data categories comprise data points and data combinations, and performing the hierarchical clustering process in an iterative manner until the distance matrix is converted into a plurality of clusters to generate a clustering tree.

A second aspect of the embodiments of the present invention provides a risk analysis device based on hierarchical clustering, including: an acquisition module, configured to acquire initial data, wherein the initial data is used for indicating the drug sales data of a plurality of hospitals, and the initial data is time series data; a calculation module, configured to calculate correlation coefficients between any two different hospitals according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients; a generating module, configured to generate a distance matrix among a plurality of hospitals according to the plurality of target correlation coefficients; a clustering module, configured to perform pruning and hierarchical clustering operations on the distance matrix to generate a clustering tree, wherein the clustering tree comprises a plurality of clusters; and an analysis module, configured to perform risk analysis according to the clustering tree to obtain a risk analysis result.

Optionally, in a first implementation manner of the second aspect of the embodiment of the present invention, the calculation module includes: a determination unit, configured to respectively determine the drug sales amount Y_i of hospital i and the drug sales amount Y_j of hospital j; an input unit, configured to input the drug sales amount Y_i and the drug sales amount Y_j into a preset similarity formula to generate the correlation coefficient of hospital i and hospital j, wherein the preset similarity formula is

Wherein Y_i indicates the drug sales amount of hospital i, Y_j represents the drug sales amount of hospital j, i and j are positive integers, <> denotes the mean value, and p_ij is the correlation coefficient of hospital i and hospital j; a first calculating unit, configured to calculate correlation coefficients between any other two hospitals to obtain a plurality of other correlation coefficients, wherein the other two hospitals are not hospital i and hospital j at the same time; and a first generating unit, configured to generate a plurality of target correlation coefficients including the correlation coefficient of hospital i and hospital j and the plurality of other correlation coefficients.

Optionally, in a second implementation manner of the second aspect of the embodiment of the present invention, the generating module includes: the second calculation unit is used for calculating the initial distance between any two different hospitals according to the target correlation coefficients to obtain a plurality of initial distances; a second generating unit configured to generate a distance matrix based on the plurality of initial distances, the distance matrix indicating a distance between any two hospitals.

Optionally, in a third implementation manner of the second aspect of the embodiment of the present invention, the second calculating unit is specifically configured to: calling a preset distance formula to calculate the distance corresponding to each target correlation coefficient to obtain a plurality of initial distances, wherein d (i, j) represents the distance between hospital i and hospital j, and the preset distance formula is as follows:

Optionally, in a fourth implementation manner of the second aspect of the embodiment of the present invention, the clustering module includes: a pruning unit, configured to perform a pruning operation on the distance matrix to obtain a pruned distance matrix; and a clustering unit, configured to perform hierarchical clustering on the pruned distance matrix to generate a clustering tree.

Optionally, in a fifth implementation manner of the second aspect of the embodiment of the present invention, the pruning unit is specifically configured to: converting the distance matrix into an undirected graph; generating a minimum spanning tree by using a preset algorithm and the undirected graph; and pruning the distance matrix based on the minimum spanning tree to obtain the pruned distance matrix.

Optionally, in a sixth implementation manner of the second aspect of the embodiment of the present invention, the clustering unit is specifically configured to: calling a preset matrix distance formula to calculate the distance of each data point in the pruned distance matrix to obtain a plurality of distances, wherein the preset matrix distance formula is

A third aspect of an embodiment of the present invention provides a risk analysis device based on hierarchical clustering, including a memory and at least one processor, where the memory stores instructions, and the memory and the at least one processor are interconnected by a line; the at least one processor invokes the instructions in the memory to cause the hierarchical cluster-based risk analysis device to perform the hierarchical cluster-based risk analysis method described above.

A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, which stores instructions that, when executed by a processor, implement the steps of the risk analysis method based on hierarchical clustering according to any of the above embodiments.

According to the technical scheme provided by the embodiment of the invention, initial data are obtained, wherein the initial data are used for indicating the drug sales data of a plurality of hospitals, and the initial data are time series data; correlation coefficients between any two different hospitals are calculated according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients; a distance matrix among a plurality of hospitals is generated according to the plurality of target correlation coefficients; pruning and hierarchical clustering operations are carried out on the distance matrix to generate a clustering tree, wherein the clustering tree comprises a plurality of clusters; and risk analysis is performed according to the clustering tree to obtain a risk analysis result. In the embodiment of the invention, noise reduction and pruning are applied to the time series data, which avoids the curse of dimensionality and enhances the reliability of the risk analysis result.

Drawings

FIG. 1 is a schematic diagram of an embodiment of a risk analysis method based on hierarchical clustering according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of another embodiment of a risk analysis method based on hierarchical clustering according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an embodiment of a risk analysis apparatus based on hierarchical clustering according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of another embodiment of a risk analysis device based on hierarchical clustering according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an embodiment of a risk analysis device based on hierarchical clustering in the embodiment of the present invention.

Detailed Description

The invention provides a risk analysis method, device, equipment and storage medium based on hierarchical clustering, which are used for carrying out noise reduction and pruning processing on time series data, avoiding the curse of dimensionality and enhancing the reliability of risk analysis results.

In order that those skilled in the art may better understand the scheme of the invention, the embodiments of the invention are described below with reference to the accompanying drawings.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Referring to FIG. 1, a flow of the risk analysis method based on hierarchical clustering according to an embodiment of the present invention specifically includes:

101. Acquire initial data, wherein the initial data indicates drug sales data of a plurality of hospitals, and the initial data is time-series data.

The server acquires initial data indicating drug sales data of a plurality of hospitals, the initial data being time-series data. Taking medical insurance risk control as an example, the basic data is medical insurance settlement data; by grouping and aggregating it, the server can obtain the daily sales data of each medicine in each hospital, which is time-series data.
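The grouping step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record layout `(hospital, day, amount)`, the function name `daily_sales_series` and the choice to fill days without sales with zero are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def daily_sales_series(records, days):
    """Aggregate settlement records into one daily sales series per hospital.

    records: iterable of (hospital, day, amount) tuples (hypothetical layout).
    days:    ordered list of days that defines the time axis.
    Days with no record for a hospital are filled with 0.0 (an assumption).
    """
    totals = defaultdict(float)
    for hospital, day, amount in records:
        totals[(hospital, day)] += amount
    hospitals = sorted({hospital for hospital, _, _ in records})
    return {h: [totals.get((h, d), 0.0) for d in days] for h in hospitals}


if __name__ == "__main__":
    records = [
        ("A", date(2020, 1, 1), 10.0),
        ("A", date(2020, 1, 1), 5.0),
        ("B", date(2020, 1, 2), 7.0),
    ]
    days = [date(2020, 1, 1), date(2020, 1, 2)]
    print(daily_sales_series(records, days))
```

In practice the records would first be filtered to a single drug, since the correlation in the following step is computed on the sales of a specific drug.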

It should be noted that the server analyzes the time-series data of each hospital, finds out a possibly abnormal hospital from the time-series data by using the risk analysis method based on hierarchical clustering proposed in this embodiment, and then performs visualization processing on the data of the abnormal hospital to discover and verify the reason of the abnormality and prompt the reason.

It is to be understood that the execution subject of the present invention may be a risk analysis device based on hierarchical clustering, and may also be a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.

102. Calculate the correlation coefficient between any two different hospitals according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients.

The server calculates the correlation coefficient between any two different hospitals according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients. The target correlation coefficient between any two hospitals is the correlation coefficient between their drug sales amounts, where the sales amount is that of a specific drug. The analysis can focus on drugs with a higher unit price and higher medical insurance fund expenditure, such as drugs for treating cancer and drugs for treating cardiovascular diseases, which is not limited herein.

103. Generate a distance matrix among a plurality of hospitals according to the plurality of target correlation coefficients.

The server generates a distance matrix between a plurality of hospitals according to the plurality of target correlation coefficients.

Specifically, the server calculates an initial distance between any two different hospitals according to the multiple target correlation coefficients to obtain multiple initial distances; the server generates a distance matrix based on the plurality of initial distances, the distance matrix indicating a distance between any two hospitals.

The server calculating the initial distance between any two different hospitals according to the multiple target correlation coefficients to obtain the multiple initial distances includes: the server calls a preset distance formula to calculate the distance corresponding to each target correlation coefficient to obtain a plurality of initial distances, where d(i, j) represents the distance between hospital i and hospital j, and the preset distance formula is as follows:
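The distance formula itself is not reproduced in this text. As an illustration only, the sketch below assumes a mapping that is commonly used to turn a correlation coefficient into a distance, d(i, j) = sqrt(2 * (1 - p_ij)); the patent's actual formula may differ. The function names and the dictionary layout keyed by hospital-name pairs are hypothetical.

```python
import math


def correlation_to_distance(rho):
    """Map a correlation coefficient in [-1, 1] to a distance in [0, 2].

    ASSUMPTION: uses d = sqrt(2 * (1 - rho)); the source does not show its formula.
    max() guards against tiny negative values caused by rounding.
    """
    return math.sqrt(max(0.0, 2.0 * (1.0 - rho)))


def distance_matrix(names, correlations):
    """Build a symmetric distance matrix with a zero diagonal.

    names:        ordered list of hospital names.
    correlations: dict keyed by (names[a], names[b]) with a < b (hypothetical layout).
    """
    n = len(names)
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            d = correlation_to_distance(correlations[(names[a], names[b])])
            dist[a][b] = dist[b][a] = d
    return dist


if __name__ == "__main__":
    print(distance_matrix(["A", "B"], {("A", "B"): -1.0}))
```

Under this assumed mapping, perfectly correlated hospitals are at distance 0 and perfectly anti-correlated hospitals are at distance 2, so a smaller distance always means more similar sales behaviour.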

104. Perform pruning and hierarchical clustering operations on the distance matrix to generate a clustering tree, wherein the clustering tree comprises a plurality of clusters.

The server performs pruning and hierarchical clustering operations on the distance matrix to generate a clustering tree, wherein the clustering tree comprises a plurality of clusters. Specifically, the server prunes the distance matrix to obtain a pruned distance matrix; and the server carries out hierarchical clustering on the pruned distance matrix to generate a clustering tree.

In this embodiment, clustering uses a bottom-up merging method: the merging algorithm of hierarchical clustering computes the similarity between data points or categories, combines the two most similar ones, and iterates this process. Similarity is determined by the distance between data points: the smaller the distance, the higher the similarity. At each step the two data points or categories that are closest are combined, which generates the clustering tree. Because the server has already obtained the pruned distance matrix, it can combine different hospitals according to the distances between them to generate the clustering tree.

105. Perform risk analysis according to the clustering tree to obtain a risk analysis result.

The server performs risk analysis according to the clustering tree to obtain a risk analysis result.

The clustering tree is obtained through hierarchical clustering, and each cluster can represent the hierarchical structure of one class of hospitals. The clusters are compared with a preset rule: if third-class, second-class and first-class hospitals are each clustered with hospitals of their own class, no problem is indicated, whereas if hospital A falls into a cluster of a different class, hospital A may be abnormal. In addition, after hierarchical clustering, the distance between the two closest hospitals is determined and denoted a, and the distance between the two clusters that are merged last is determined and denoted b. Then a and b are compared: if b > 3a, the hospitals in the last merged cluster are determined to be abnormal, and the fewer hospitals the abnormal cluster contains, the higher the degree of suspicion.

The criterion for determining the abnormality may be set according to actual conditions; for example, the condition "b > 3a" may be replaced with "b > 4a" or "b > 2a", which is not limited herein.
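The threshold rule can be sketched as a small check over a merge history. The history format (a list of `(cluster, cluster, distance)` tuples in merge order), the function name `flag_last_merge` and the decision to report the smaller of the two last-merged clusters are assumptions for illustration; the text only states that a smaller abnormal cluster is more suspicious.

```python
def flag_last_merge(merges, factor=3.0):
    """Return the hospitals flagged by the b > factor * a rule, or an empty list.

    merges: list of (cluster_x, cluster_y, distance) tuples in merge order,
            where each cluster is a tuple of hospital identifiers (hypothetical format).
    a is taken as the distance of the first merge (the two closest hospitals),
    b as the distance of the last merge.
    ASSUMPTION: the smaller of the two last-merged clusters is the one reported.
    """
    if not merges:
        return []
    a = merges[0][2]
    left, right, b = merges[-1]
    if b > factor * a:
        return list(min(left, right, key=len))
    return []


if __name__ == "__main__":
    history = [(("A",), ("B",), 1.0), (("C",), ("A", "B"), 4.0)]
    print(flag_last_merge(history))              # b = 4.0 > 3 * 1.0, flags C
    print(flag_last_merge(history, factor=5.0))  # b = 4.0 <= 5.0, flags nothing
```

Exposing `factor` as a parameter mirrors the statement that 3a may be replaced by 4a or 2a according to actual conditions.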

In the embodiment of the invention, noise reduction and pruning are applied to the time series data, which avoids the curse of dimensionality and enhances the reliability of the risk analysis result. The scheme can also be applied in the field of smart healthcare, thereby promoting the construction of smart cities.

Referring to FIG. 2, another flow of the risk analysis method based on hierarchical clustering according to the embodiment of the present invention specifically includes:

201. Acquire initial data, wherein the initial data indicates drug sales data of a plurality of hospitals, and the initial data is time-series data.

202. Calculate the correlation coefficient between any two different hospitals according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients.

Specifically, the server respectively determines the drug sales amount Y_i of hospital i and the drug sales amount Y_j of hospital j; the server inputs the drug sales amount Y_i and the drug sales amount Y_j into a preset similarity formula to generate the correlation coefficient of hospital i and hospital j, wherein the preset similarity formula is

Wherein Y_i indicates the drug sales amount of hospital i, Y_j represents the drug sales amount of hospital j, i and j are positive integers, <> denotes the mean value, and p_ij is the correlation coefficient of hospital i and hospital j; the server calculates the correlation coefficient between any two other hospitals to obtain a plurality of other correlation coefficients, wherein the two other hospitals are not hospital i and hospital j at the same time; the server generates a plurality of target correlation coefficients including the correlation coefficient of hospital i and hospital j and the plurality of other correlation coefficients.
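The similarity formula is not reproduced in this text. Given that it is built from mean values <> of the sales series and yields a correlation coefficient, the sketch below assumes the standard Pearson form p_ij = (<Y_i Y_j> - <Y_i><Y_j>) / sqrt((<Y_i^2> - <Y_i>^2)(<Y_j^2> - <Y_j>^2)); this is an assumption, and the function names are hypothetical.

```python
import math


def mean(values):
    """Arithmetic mean, playing the role of the <> operator."""
    return sum(values) / len(values)


def correlation(y_i, y_j):
    """Pearson correlation of two equally long sales series, written with means.

    ASSUMPTION: the patent's preset similarity formula is taken to be Pearson.
    Raises ZeroDivisionError if either series is constant (zero variance).
    """
    m_i, m_j = mean(y_i), mean(y_j)
    cov = mean([a * b for a, b in zip(y_i, y_j)]) - m_i * m_j
    var_i = mean([a * a for a in y_i]) - m_i ** 2
    var_j = mean([b * b for b in y_j]) - m_j ** 2
    return cov / math.sqrt(var_i * var_j)


def correlation_matrix(series):
    """Correlation for every unordered pair of hospitals.

    series: dict mapping hospital name to its sales series.
    Returns a dict keyed by (name_a, name_b) with name_a < name_b.
    """
    names = sorted(series)
    return {
        (p, q): correlation(series[p], series[q])
        for p in names
        for q in names
        if p < q
    }


if __name__ == "__main__":
    print(correlation_matrix({"A": [1, 2, 3], "B": [2, 4, 6], "C": [3, 2, 1]}))
```

Computing each unordered pair once matches the description of first handling hospitals i and j and then every other pair of hospitals.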

203. Generate a distance matrix among a plurality of hospitals according to the plurality of target correlation coefficients.

204. Perform a pruning operation on the distance matrix to obtain the pruned distance matrix.

The server prunes the distance matrix to obtain the pruned distance matrix. Specifically, the server converts the distance matrix into an undirected graph; the server generates a minimum spanning tree by using a preset algorithm and the undirected graph; and the server prunes the distance matrix based on the minimum spanning tree to obtain the pruned distance matrix.

According to the distance matrix between hospitals, the undirected graph can be obtained by taking the hospitals as nodes and the distances between the hospitals as the weights of edges. The purpose of pruning and denoising is achieved by converting the undirected graph into the minimum spanning tree. In the pruned distance matrix, if two hospitals are connected by an edge of the minimum spanning tree, the distance between the two hospitals is preserved; otherwise, the distance between the two hospitals is set to a large value, for example a value greater than 500 such as 1000 or 2000, which is not limited herein.

It should be noted that, in a given undirected graph G = (V, E), the vertices in V are hospitals, (u, v) represents the edge connecting vertex u and vertex v, and w(u, v) represents the weight of this edge, that is, the distance between the two hospitals. If T is a subset of E that is acyclic, connects all vertices and minimizes the total weight w(T), then T is the minimum spanning tree of G. A spanning tree of a connected graph with n nodes is a minimal connected subgraph of the original graph: it contains all n nodes of the original graph and the fewest edges (n - 1) that keep the graph connected. The minimum spanning tree can be determined using Kruskal's algorithm or Prim's algorithm.

It should be noted that, because actual data is usually noisy, the calculated original distance matrix is also noisy, which may interfere with the final analysis result. The distance matrix is essentially a graph, pruning is carried out by using the idea of the minimum spanning tree, only the distance between the nodes of the obtained minimum spanning tree is reserved, other distances are set to be a large distance, and the noise of the pruned distance matrix is greatly reduced, thereby being beneficial to subsequent analysis.
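The pruning step can be sketched with Kruskal's algorithm on the complete graph defined by the distance matrix. This is an illustrative sketch rather than the patent's code: the function names, the union-find helper and the default large value of 1000 are assumptions (the text only requires a large value, for example one greater than 500).

```python
def minimum_spanning_tree(dist):
    """Kruskal's algorithm on a symmetric distance matrix.

    Returns the tree as a list of (u, v) index pairs with u < v.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((dist[u][v], u, v) for u in range(n) for v in range(u + 1, n))
    tree = []
    for _, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so it creates no cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


def prune(dist, large=1000.0):
    """Keep only distances on minimum-spanning-tree edges; set the rest to `large`."""
    n = len(dist)
    pruned = [[0.0 if u == v else large for v in range(n)] for u in range(n)]
    for u, v in minimum_spanning_tree(dist):
        pruned[u][v] = pruned[v][u] = dist[u][v]
    return pruned


if __name__ == "__main__":
    dist = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
    print(prune(dist))
```

In the three-hospital example, the edge of weight 4 is not part of the tree, so that distance is replaced by the large value while the distances 1 and 2 are preserved.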

205. Perform hierarchical clustering on the pruned distance matrix to generate a clustering tree.

The server performs hierarchical clustering on the pruned distance matrix to generate a clustering tree. Specifically, the server calls a preset matrix distance formula to calculate the distance of each data point in the pruned distance matrix to obtain a plurality of distances, wherein the preset matrix distance formula is

D represents the distance between any two data points; the server hierarchically clusters the two closest data points among the plurality of distances to obtain a plurality of data categories, wherein the data categories comprise data points and data combinations, and iteratively executes the hierarchical clustering process until the distance matrix is converted into a plurality of clusters to generate a clustering tree.

In this embodiment, clustering uses a bottom-up merging method: the merging algorithm of hierarchical clustering computes the similarity between data points or categories, combines the two most similar ones, and iterates this process. Similarity is determined by the distance between data points: the smaller the distance, the higher the similarity. At each step the two data points or categories that are closest are combined, which generates the clustering tree. Because the server has already obtained the pruned distance matrix, it can combine different hospitals according to the distances between them to generate the clustering tree. Here, the distance computed between data points is the Euclidean distance D; the smaller the Euclidean distance, the higher the similarity.

In the present application, the data points are hospitals: one hospital represents one data point, and one data combination represents two combined data points. Assume that the pruned distance matrix in the embodiment of the present invention includes six hospitals A, B, C, D, E and F, that is, data points A, B, C, D, E and F. After data point B (hospital B) and data point C (hospital C) are combined, the category (B, C) is obtained, which yields data category A, data category (B, C), data category D, data category E and data category F. The distance matrix between the data categories is then recalculated.

It is understood that the distance between two data points is calculated with the preset matrix distance formula. The distance between a data combination and another data point is calculated as follows: for example, when calculating the distance from the data combination (B, C) to data point A, it is necessary to calculate the mean of the distances from B to A and from C to A, i.e.

$$D((B, C), A) = \frac{D(B, A) + D(C, A)}{2}$$

The distance between two data combinations is calculated as follows: the distance from each data point in one combination to each data point in the other combination is calculated, and the mean of all these distances is taken as the distance between the two data combinations. This method is more computationally intensive, but its results are more reasonable than those of the first two methods. For example, the distance from the data combination (A, E) to the data combination (B, C) is

$$D((A, E), (B, C)) = \frac{D(A, B) + D(A, C) + D(E, B) + D(E, C)}{4}$$

206. Perform risk analysis according to the clustering tree to obtain a risk analysis result.
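The bottom-up merging with mean distances between combinations can be sketched as follows. This is a minimal average-linkage illustration working directly on a precomputed distance matrix; the function name `average_linkage` and the merge-history output format are assumptions, and the quadratic search for the closest pair is chosen for readability rather than speed.

```python
def average_linkage(dist):
    """Agglomerative clustering using the mean pairwise distance between clusters.

    dist: symmetric matrix of distances between data points (hospitals).
    Returns the merge history as a list of (cluster_x, cluster_y, distance),
    where each cluster is a tuple of point indices (hypothetical format).
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def between(c1, c2):
        # Mean of all distances from a point in c1 to a point in c2.
        return sum(dist[p][q] for p in c1 for q in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        d, a, b = min(
            (between(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


if __name__ == "__main__":
    dist = [[0, 1, 5], [1, 0, 3], [5, 3, 0]]
    for step in average_linkage(dist):
        print(step)
```

In the example, points 0 and 1 merge first at distance 1, and the remaining point joins them at the mean of 5 and 3, which corresponds to the combination-to-point rule described above.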

The risk analysis method based on hierarchical clustering in the embodiment of the present invention has been described above. Referring to FIG. 3, the risk analysis device based on hierarchical clustering in the embodiment of the present invention is described below. An embodiment of the risk analysis device based on hierarchical clustering in the embodiment of the present invention includes:

an obtaining module 301, configured to obtain initial data, where the initial data is used to indicate drug sales data of multiple hospitals, and the initial data is time-series data;

a calculating module 302, configured to calculate correlation coefficients between any two different hospitals according to a preset similarity formula and the initial data to obtain multiple target correlation coefficients;

a generating module 303, configured to generate a distance matrix between multiple hospitals according to the multiple target correlation coefficients;

a clustering module 304, configured to perform pruning and hierarchical clustering operations on the distance matrix to generate a clustering tree, where the clustering tree includes multiple clusters;

an analysis module 305, configured to perform risk analysis according to the clustering tree to obtain a risk analysis result.

Referring to fig. 4, another embodiment of the risk analysis device based on hierarchical clustering according to the embodiment of the present invention includes:

Optionally, the calculating module 302 includes:

a determining unit 3021, configured to determine the drug sales amount Y_i of hospital i and the drug sales amount Y_j of hospital j, respectively;

an input unit 3022, configured to input the drug sales amount Y_i and the drug sales amount Y_j into a preset similarity formula to generate the correlation coefficient of hospital i and hospital j, wherein the preset similarity formula is

wherein Y_i denotes the drug sales amount of hospital i, Y_j denotes the drug sales amount of hospital j, i and j are positive integers, <> denotes the mean value, and p_ij is the correlation coefficient of hospital i and hospital j;

a first calculating unit 3023, configured to calculate correlation coefficients between any other two hospitals to obtain a plurality of other correlation coefficients, where the any other two hospitals do not include hospital i and hospital j at the same time;

a first generating unit 3024 configured to generate a plurality of target correlation coefficients including the correlation coefficients of hospital i and hospital j and the plurality of other correlation coefficients.
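As a hedged illustration of the calculating module, the sketch below assumes that the preset similarity formula is the Pearson correlation coefficient written in terms of mean values, which is consistent with the description of <> as the mean. The function name `correlation`, the hospital labels and the sales figures are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean

def correlation(y_i, y_j):
    """Pearson correlation of two equal-length sales series (assumed formula)."""
    mean_i, mean_j = fmean(y_i), fmean(y_j)
    mean_ij = fmean([a * b for a, b in zip(y_i, y_j)])   # <Y_i Y_j>
    var_i = fmean([a * a for a in y_i]) - mean_i ** 2    # <Y_i^2> - <Y_i>^2
    var_j = fmean([b * b for b in y_j]) - mean_j ** 2    # <Y_j^2> - <Y_j>^2
    return (mean_ij - mean_i * mean_j) / sqrt(var_i * var_j)

# Hypothetical drug sales time series, one list per hospital.
sales = {
    "H1": [10, 12, 15, 11],
    "H2": [20, 24, 30, 22],
    "H3": [15, 11, 10, 14],
}

# One coefficient for every pair of different hospitals.
coefficients = {
    (a, b): correlation(sales[a], sales[b]) for a, b in combinations(sales, 2)
}
```

In this example H2 sells exactly twice as much as H1 in every period, so the two series are perfectly correlated, while H3 moves against H1 and yields a negative coefficient.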

Optionally, the generating module 303 includes:

a second calculating unit 3031, configured to calculate initial distances between any two different hospitals according to the multiple target correlation coefficients, so as to obtain multiple initial distances;

a second generating unit 3032, configured to generate a distance matrix based on the plurality of initial distances, where the distance matrix is used to indicate a distance between any two hospitals.
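The generating module can be sketched as follows. The mapping from a correlation coefficient to a distance used here, d = sqrt(2 * (1 - p)), is an assumption: it is a common choice for correlation-based distances, and it may differ from the preset distance formula of the embodiment. The function names and the coefficient values are hypothetical.

```python
from math import sqrt

def correlation_to_distance(p):
    """Assumed mapping d = sqrt(2 * (1 - p)); guards against rounding below zero."""
    return sqrt(max(0.0, 2.0 * (1.0 - p)))

def distance_matrix(names, coefficients):
    """Symmetric matrix of distances between every two hospitals in `names`."""
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for (a, b), p in coefficients.items():
        i, j = names.index(a), names.index(b)
        matrix[i][j] = matrix[j][i] = correlation_to_distance(p)
    return matrix

names = ["H1", "H2", "H3"]
# Hypothetical target correlation coefficients, one per pair of hospitals.
coefficients = {("H1", "H2"): 1.0, ("H1", "H3"): 0.0, ("H2", "H3"): -1.0}
matrix = distance_matrix(names, coefficients)
```

Under this assumed mapping, perfectly correlated hospitals are at distance 0 and perfectly anti-correlated hospitals are at distance 2.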

Optionally, the second calculating unit 3031 is specifically configured to:

calling a preset distance formula to calculate the distance corresponding to each target correlation coefficient to obtain a plurality of initial distances, wherein d(i, j) represents the distance between hospital i and hospital j, and the preset distance formula is as follows:

Optionally, the clustering module 304 includes:

a pruning unit 3041, configured to perform a pruning operation on the distance matrix to obtain a pruned distance matrix;

a clustering unit 3042, configured to perform hierarchical clustering on the pruned distance matrix, and generate a clustering tree.
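The iterative merging performed by the clustering unit can be sketched as below, assuming that the distance between two data combinations is the mean of all pairwise distances, as in the method embodiment. The function names `mean_distance` and `cluster_tree` and the distance values are hypothetical, and the pruning step is omitted from this sketch.

```python
from itertools import combinations

def mean_distance(dist, members_a, members_b):
    """Mean of all pairwise distances between two groups of point indices."""
    total = sum(dist[p][q] for p in members_a for q in members_b)
    return total / (len(members_a) * len(members_b))

def cluster_tree(dist, labels):
    """Repeatedly merge the two nearest groups; return the merge history."""
    # Each key is a label or a nested tuple of labels; each value lists member indices.
    clusters = {label: (index,) for index, label in enumerate(labels)}
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: mean_distance(dist, clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((a, b, mean_distance(dist, clusters[a], clusters[b])))
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return merges

labels = ["A", "B", "C", "E"]
# Hypothetical symmetric distance matrix for the four data points.
dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
merges = cluster_tree(dist, labels)
```

The merge history records which groups were joined and at what distance, which is the information a clustering tree (dendrogram) encodes.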

Optionally, the pruning unit 3041 is specifically configured to:

converting the distance matrix into an undirected graph; generating a minimum spanning tree by using a preset algorithm and the undirected graph; and pruning the distance matrix based on the minimum spanning tree to obtain the pruned distance matrix.
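A possible reading of the pruning steps is sketched below. Two assumptions are made: the preset algorithm is taken to be Kruskal's algorithm, and pruning is taken to mean keeping only the distances that lie on minimum spanning tree edges. Neither is confirmed by the text above, and the function names and distance values are hypothetical.

```python
def minimum_spanning_tree(dist):
    """Kruskal's algorithm on the complete undirected graph of a distance matrix."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the component root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two separate components, so keep it
            parent[root_i] = root_j
            tree.append((i, j, weight))
    return tree

def prune(dist):
    """Keep only distances on tree edges; every other off-diagonal entry becomes None."""
    n = len(dist)
    pruned = [[0 if i == j else None for j in range(n)] for i in range(n)]
    for i, j, weight in minimum_spanning_tree(dist):
        pruned[i][j] = pruned[j][i] = weight
    return pruned

# Hypothetical symmetric distance matrix between four hospitals.
dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
tree = minimum_spanning_tree(dist)
pruned = prune(dist)
```

For n hospitals the tree keeps n - 1 edges, so the pruned matrix retains only the strongest links needed to connect all hospitals.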

Optionally, the clustering unit 3042 is specifically configured to:

calling a preset matrix distance formula to calculate the distance of each data point in the pruned distance matrix to obtain a plurality of distances, wherein the preset matrix distance formula is

Figs. 3 and 4 describe the risk analysis device based on hierarchical clustering in the embodiment of the present invention in detail from the perspective of modular functional entities. The risk analysis device based on hierarchical clustering in the embodiment of the present invention is described in detail below from the perspective of hardware processing.

Fig. 5 is a schematic structural diagram of a risk analysis device based on hierarchical clustering according to an embodiment of the present invention. The risk analysis device 500 based on hierarchical clustering may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage medium 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the risk analysis device 500 based on hierarchical clustering. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the risk analysis device 500 based on hierarchical clustering, the series of instruction operations in the storage medium 530.

The risk analysis device 500 based on hierarchical clustering may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the device structure shown in fig. 5 does not limit the risk analysis device based on hierarchical clustering, which may include more or fewer components than shown, combine certain components, or arrange the components differently. The processor 510 may perform the functions of the obtaining module 301, the calculating module 302, the generating module 303, the clustering module 304 and the analysis module 305 in the above embodiments.

The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the hierarchical clustering based risk analysis method.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A risk analysis method based on hierarchical clustering is characterized by comprising the following steps:

acquiring initial data, wherein the initial data is used for indicating drug sales data of a plurality of hospitals, and the initial data is time sequence data;

calculating correlation coefficients between any two different hospitals according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients;

generating a distance matrix among a plurality of hospitals according to the plurality of target correlation coefficients;

performing pruning and hierarchical clustering operations on the distance matrix to generate a clustering tree, wherein the clustering tree comprises a plurality of clusters;

and performing risk analysis according to the clustering tree to obtain a risk analysis result.

2. The risk analysis method based on hierarchical clustering according to claim 1, wherein the calculating a correlation coefficient between any two different hospitals according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients comprises:

determining the drug sales amount Y_i of hospital i and the drug sales amount Y_j of hospital j, respectively;

inputting the drug sales amount Y_i and the drug sales amount Y_j into a preset similarity formula to generate the correlation coefficient of hospital i and hospital j, wherein the preset similarity formula is

calculating correlation coefficients between any other two hospitals to obtain a plurality of other correlation coefficients, wherein the any other two hospitals do not include hospital i and hospital j at the same time;

generating a plurality of target correlation coefficients including the correlation coefficients for Hospital i and Hospital j and the plurality of other correlation coefficients.

3. The hierarchical clustering-based risk analysis method according to claim 1, wherein the generating a distance matrix between hospitals according to the plurality of target correlation coefficients comprises:

calculating initial distances between any two different hospitals according to the target correlation coefficients to obtain a plurality of initial distances;

generating a distance matrix based on the plurality of initial distances, the distance matrix indicating a distance between any two hospitals.

4. The risk analysis method based on hierarchical clustering according to claim 3, wherein the calculating an initial distance between any two different hospitals according to the plurality of target correlation coefficients to obtain a plurality of initial distances comprises:

calling a preset distance formula to calculate the distance corresponding to each target correlation coefficient to obtain a plurality of initial distances, wherein d (i, j) represents the distance between hospital i and hospital j, and the preset distance formula is as follows:

5. The risk analysis method based on hierarchical clustering according to any one of claims 1-4, wherein the performing pruning and hierarchical clustering operations on the distance matrix to generate a clustering tree, the clustering tree comprising a plurality of clusters, comprises:

pruning the distance matrix to obtain a pruned distance matrix;

and performing hierarchical clustering on the pruned distance matrix to generate a clustering tree.

6. The risk analysis method based on hierarchical clustering according to claim 5, wherein the pruning operation on the distance matrix to obtain the pruned distance matrix comprises:

converting the distance matrix into an undirected graph;

generating a minimum spanning tree by using a preset algorithm and the undirected graph;

and pruning the distance matrix based on the minimum spanning tree to obtain the pruned distance matrix.

7. The risk analysis method based on hierarchical clustering according to claim 5, wherein the hierarchical clustering of the pruned distance matrix to generate a clustering tree comprises:

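calling a preset matrix distance formula to calculate the distances between data points in the pruned distance matrix to obtain a plurality of distances, wherein D in the preset matrix distance formula represents the distance between any two data points;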

and performing hierarchical clustering on two nearest data points in the plurality of distances to obtain a plurality of data categories, wherein the data categories comprise data points and data combinations, and performing the hierarchical clustering process in an iterative manner until the distance matrix is converted into a plurality of clusters to generate a clustering tree.

8. A risk analysis device based on hierarchical clustering is characterized by comprising:

the acquisition module is used for acquiring initial data, wherein the initial data is used for indicating the drug sales data of a plurality of hospitals, and the initial data is time sequence data;

the calculation module is used for calculating correlation coefficients between any two different hospitals according to a preset similarity formula and the initial data to obtain a plurality of target correlation coefficients;

the generating module is used for generating a distance matrix among a plurality of hospitals according to the plurality of target correlation coefficients;

the clustering module is used for pruning and hierarchical clustering operations on the distance matrix to generate a clustering tree, and the clustering tree comprises a plurality of clusters;

and the analysis module is used for carrying out risk analysis according to the clustering tree to obtain a risk analysis result.

9. A hierarchical clustering-based risk analysis device, characterized in that the hierarchical clustering-based risk analysis device comprises: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;

the at least one processor invokes the instructions in the memory to cause the hierarchical clustering-based risk analysis device to perform the hierarchical clustering-based risk analysis method of any one of claims 1-7.

10. A computer-readable storage medium storing instructions that, when executed by a processor, implement a hierarchical clustering based risk analysis method according to any one of claims 1 to 7.